A good alert includes:
- Detection context
- Investigation/response context
- Orchestration actions
- Prevalence info
- Environmental context (e.g., src IP is a known scanner)
- Pivots/visual to understand what else happened
- Able to answer, "Is host already under investigation?"
Detection context. Tell me what the alert is meant to detect, when it was pushed to prod/last modified, and by whom. Tell me about "gotchas" and point me to examples of when this detection found evil. Also, where in the attack lifecycle did we alert? This informs the right pivots.
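A minimal sketch of how that context could ride along with each detection; the field names below are illustrative, not any specific product's schema:

```python
# Hypothetical detection-context record shipped with every alert.
# Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DetectionContext:
    aim: str                 # what the detection is meant to catch
    author: str              # who wrote / last modified it
    pushed_to_prod: str      # ISO date it went live
    last_modified: str       # ISO date of the last change
    lifecycle_stage: str     # e.g. "initial access", "lateral movement"
    gotchas: list = field(default_factory=list)                  # known benign look-alikes
    true_positive_examples: list = field(default_factory=list)   # links to past finds

ctx = DetectionContext(
    aim="Remote process execution on a domain controller",
    author="detection-team",
    pushed_to_prod="2023-01-15",
    last_modified="2023-03-02",
    lifecycle_stage="lateral movement",
    gotchas=["Patch-management tooling also execs remotely on DCs"],
    true_positive_examples=["case-1042"],
)
```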
Investigation/response context. Given a type of activity detected, guide an analyst through response.
If #BEC, what questions do we need to answer, which data sources? If coinminer in AWS, guide analyst through CloudTrail, steps to remediate.
Orchestration makes this easier.
Orchestration actions reduce variance in response and pull the decision moment forward. Given an alert for a weird login, make recent MFA activity available w/ the alert. Or if we flagged a process event, show me recent process activity. If we flagged a file, what does VT say?
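As a concrete example of the file pivot, here is a hedged sketch that attaches a VirusTotal verdict to a file-hash alert before an analyst ever opens it. The alert shape and enrich_alert helper are assumptions; the VT v3 files endpoint and x-apikey header are the public API.

```python
# Enrichment sketch: attach a VirusTotal verdict to a file-hash alert so the
# analyst opens it with the answer already there. Alert shape is hypothetical.
import os
import requests

def vt_file_report(sha256: str) -> dict:
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256}",
        headers={"x-apikey": os.environ["VT_API_KEY"]},
        timeout=10,
    )
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return {"malicious": stats["malicious"], "harmless": stats["harmless"]}

def enrich_alert(alert: dict) -> dict:
    if "file_sha256" in alert:
        alert["enrichment"] = {"virustotal": vt_file_report(alert["file_sha256"])}
    return alert
```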
Prevalence of the alert itself and of its evidence fields.
How often do we see this alert fire? How often does this detection find evil?
Also, prevalence of evidence fields. If we flagged a process/service, is it common or only on this host? How many hosts are talking to that domain?
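Prevalence questions reduce to distinct-host counts over telemetry; a sketch assuming a simple list-of-events shape for EDR and DNS data:

```python
# Prevalence sketch: how many distinct hosts run this binary or talk to this
# domain? Event shapes below are hypothetical stand-ins for EDR/DNS telemetry.
def host_prevalence(events: list[dict], key: str, value: str) -> int:
    return len({e["host"] for e in events if e.get(key) == value})

process_events = [
    {"host": "ws-01", "process_name": "svch0st.exe"},
    {"host": "ws-02", "process_name": "chrome.exe"},
]
dns_events = [{"host": "ws-01", "domain": "updates.example-cdn.net"}]

print(host_prevalence(process_events, "process_name", "svch0st.exe"))    # 1 host
print(host_prevalence(dns_events, "domain", "updates.example-cdn.net"))  # 1 host
```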
Environmental context is about not having to use sticky notes in your SOC. If PAN alerted on internal scanning, is the source a known scanner? If we're running attack_sim or testing, what are the accounts we should expect to see? Do these accounts have 2FA? Context about the org.
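That context can live in a lookup the alert pipeline consults automatically instead of a sticky note; a sketch with made-up scanner IPs and simulation accounts:

```python
# Environmental-context sketch: all values are illustrative.
KNOWN_SCANNERS = {"10.20.0.15", "10.20.0.16"}           # vuln-scanner appliances
ATTACK_SIM_ACCOUNTS = {"redteam-svc", "attack_sim_01"}  # expected during exercises

def annotate(alert: dict) -> dict:
    notes = []
    if alert.get("src_ip") in KNOWN_SCANNERS:
        notes.append("source is a known internal scanner")
    if alert.get("account") in ATTACK_SIM_ACCOUNTS:
        notes.append("account is used for attack simulation")
    alert["environment_notes"] = notes
    return alert

print(annotate({"src_ip": "10.20.0.15", "account": "jdoe"}))
```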
Alert mgmt is a queuing system, but you can't handle alerts as single items on a factory line. Given an alert, I can quickly pivot and see what else is happening. Lists are OK, but graphs are how you win here. Good alert UI makes it easy to visualize what else happened.
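To make the graphs-over-lists point concrete, a small sketch that models an alert's entities as a graph with networkx; the entities and relations are invented:

```python
# Pivot sketch: "what else happened?" becomes a neighborhood query.
import networkx as nx

g = nx.Graph()
g.add_edge("host:ws-01", "process:svch0st.exe", relation="executed")
g.add_edge("process:svch0st.exe", "domain:updates.example-cdn.net", relation="connected")
g.add_edge("host:ws-01", "account:jdoe", relation="logged_in")

# Everything one hop from the flagged process:
print(list(g.neighbors("process:svch0st.exe")))
# Nodes within two hops of the host the alert fired on (includes the host):
print(list(nx.single_source_shortest_path_length(g, "host:ws-01", cutoff=2)))
```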
Given an alert, I need to know if this host/source/account is already under investigation, marked for remediation, or recently investigated. Alerts help tell the story. If there are multiple tickets for the same incident, it's hard to tell the right story. Organization is key.
Top 3 #M365 Account Takeover (ATO) actions spotted by our SOC in Q1:
1. New-inbox rule creation to hide attacker emails
2. Register new MFA device for persistence
3. Create mailbox forwarding rules to monitor victim comms and intercept sensitive info
More details in 🧵...
50% of ATO activity in M365 we identified was for New-inbox rules created by an attacker to automatically delete certain emails from a compromised account. By deleting specific emails, an attacker can reduce the chance of the victim or email admins spotting unusual activity.
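A sketch of what a detection over M365 unified audit log records could look like. "New-InboxRule" is the Exchange Online audit operation for inbox rule creation; the record and parameter shapes below are simplified assumptions:

```python
# Detection sketch: flag newly created inbox rules that delete or hide mail.
# Record structure is simplified; tune the parameter list to your telemetry.
SUSPICIOUS_PARAMS = {"DeleteMessage", "MoveToFolder", "MarkAsRead"}

def is_suspicious_inbox_rule(record: dict) -> bool:
    if record.get("Operation") != "New-InboxRule":
        return False
    params = {p["Name"] for p in record.get("Parameters", [])}
    return bool(params & SUSPICIOUS_PARAMS)

record = {
    "Operation": "New-InboxRule",
    "UserId": "victim@example.com",
    "Parameters": [
        {"Name": "DeleteMessage", "Value": "True"},
        {"Name": "SubjectContainsWords", "Value": "invoice;wire"},
    ],
}
print(is_suspicious_inbox_rule(record))  # True
```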
25% of ATO activity we identified was for the registration of a new MFA device in Azure. Registering a new MFA device allows an attacker to maintain persistence.
We're seeing more and more M365 session cookie theft for initial access....
You can build high quality detections to spot this activity. A 🧵with real-world examples...
Account takeover (ATO) activity in M365 can involve various unauthorized actions performed by an attacker who has gained control over the account.
Of the ATO activity we identified in M365 in Q1 '23:
50% of all ATO in M365 we identified was for New-inbox rules created by the attacker to automatically delete or hide certain emails from the compromised account.
A #SOC analyst picks up an alert and decides not to work it.
In queuing theory, this is called “work rejection”–and it’s quite common in a SOC.
TL;DR - “Work rejection” is not always bad, but it can be measured and the data can help improve performance. More details in the 🧵..
A couple of ways work rejection plays out in the SOC. The most common:
An analyst picks up an alert that detects a *ton* of benign activity. Around the same time, an alert enters the queue that almost *always* finds evil. A decision is made...
The analyst rejects the alert that is likely benign for an alert that is more likely to find evil.
Let’s assume the analyst made the right call. They rejected an alert that was likely benign to work an alert that was evil. Work rejection resulted in effective #SOC performance.
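If pick-up and hand-back events are logged, the rejection rate falls out directly; a sketch assuming a hypothetical queue-event shape:

```python
# Work-rejection measurement sketch: an alert picked up and then returned
# unworked counts as rejected. Event shape is hypothetical.
def rejection_rate(queue_events: list[dict]) -> float:
    picked = sum(1 for e in queue_events if e["action"] == "picked_up")
    rejected = sum(1 for e in queue_events if e["action"] == "returned_unworked")
    return rejected / picked if picked else 0.0

events = [
    {"alert_id": 1, "action": "picked_up"},
    {"alert_id": 1, "action": "returned_unworked"},     # set aside for a hotter alert
    {"alert_id": 2, "action": "picked_up"},
    {"alert_id": 2, "action": "closed_true_positive"},
]
print(f"rejection rate: {rejection_rate(events):.0%}")  # 50%
```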
What does a #SOC tour look like when the team is remote?
TL;DR - Not a trip to a room with blinky lights - but instead a discussion about mission, mindset, ops mgmt, results and a demo of the tech and process that make our SOC “Go”.
SOC tour in the 🧵...
Our SOC tour starts with a discussion about mission. I believe a key ingredient to high performing teams is a clear purpose and “Why”.
What’s our mission? It's to protect our customers and help them improve.
Our mission is deliberately centered around problem solving and being a strategic partner for our customers. Notice that there are zero mentions of looking at as many security blinky lights as possible. That’s intentional.
A good detection includes:
- Clear aim (e.g., remote process exec on DC)
- Unlocks end-to-end workflow (not just alert)
- Automation to improve decision quality
- Response (hint: not always contain host)
- Volume/work time calcs (see the sketch after this list)
- Able to answer, “where does efficacy need to be?”
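The volume/work-time calc is plain arithmetic; a sketch with made-up numbers showing how a single noisy detection can consume a whole analyst:

```python
# Capacity sketch: all numbers are illustrative.
expected_alerts_per_week = 120               # projected volume for this detection
avg_triage_minutes = 12                      # measured or estimated handle time
analyst_minutes_per_week = 5 * 8 * 60 * 0.6  # 5 days x 8 hrs, ~60% on-queue

load_minutes = expected_alerts_per_week * avg_triage_minutes
print(f"weekly load: {load_minutes} min "
      f"({load_minutes / analyst_minutes_per_week:.0%} of one analyst)")  # 100%
```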
On detection efficacy:
⁃ As your True Positive Rate (TPR) moves higher, your False Positive Rate typically moves with it
⁃ Our overarching detection efficacy goal will never be 100% TPR (facts)
⁃ However, TPR targets differ based on classes of detections and alert severities
Math tells us there is a sweet spot between combating alert fatigue and controlling false negatives. Ours kind of looks like a ROC curve.
This measure becomes the overarching target for detection efficacy.
“Detection efficacy is an algebra problem not calculus.” - Matt B.
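One way to read "algebra, not calculus": choosing the operating point is a cost comparison across candidate thresholds, where false positives cost analyst time and false negatives cost far more. A sketch with illustrative counts and costs:

```python
# Sweet-spot sketch: pick the threshold that minimizes total cost.
candidates = [
    # (threshold, false_positives_per_week, false_negatives_per_week)
    (0.50, 300, 1),
    (0.70, 80, 3),
    (0.90, 10, 9),
]
FP_COST = 1    # relative cost of triaging a benign alert
FN_COST = 100  # relative cost of missing real evil

best = min(candidates, key=lambda c: c[1] * FP_COST + c[2] * FN_COST)
print(f"chosen threshold: {best[0]}")  # 0.70 in this toy example
```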
Before we hired our first #SOC analyst or triaged our first alert, we defined where we wanted to get to; what great looked like.
Here’s [some] of what we wrote:
We believe that a highly effective SOC:
1. leads with tech; doesn't solve issues w/ sticky notes
2. automates repetitive tasks
3. responds and contains incidents before damage
4. has a firm handle on capacity v. loading
5. is able to answer, "are we getting better, or worse?"