Gathering my thoughts for a panel discussion tomorrow on scaling #SOC operations in a world of ever-increasing data, as part of the SANS #BlueTeamSummit.
No idea where the chat will take us, but luck favors the prepared. A 🧵 of random thoughts likely helpful for a few.
Before you scale anything, start with strategy. What does great look like? Are you already there and now you want to scale? Or do you have some work to do?
Before we scaled anything at @expel_io, we defined what a great #MDR service looked like, and delivered it.
We started with the customer and worked our way back. What does a 10 ⭐ MDR experience look like?
We asked a lot of questions. When an incident happens, when do we notify? How do we notify? What can we tell a customer now vs. what details can we provide later?
When we collect events/alerts from various tech, how will we abstract away the underlying tech to reduce operational complexity? What alerts require human judgment? What evidence do we need to present in an alert? What tech do we need to deliver to 10 customers vs. 100?
Bottom line: we defined what great looked like, we tested our theories, collected a ton of customer feedback, and then landed on what we believed was a great MDR experience for customers before we shifted our focus towards scale. Know what great looks like before you scale.
Guiding principle time: when I say "scale", I don't just mean "automate it." I mean take the thing you've built that's great and improve upon it, AND the quality of said thing should also improve. Scale and quality must grow together. Don't trade quality for scale. Don't do it.
OK, rumor mill: I've heard rumors that there are folks in the industry who think we have ~100-150 SOC analysts at @expel_io. Here's what I'm budgeted for in '21:
Def not 100-150 SOC analysts. And the reactions we get from prospects are likely what you'd expect: "Wow, you're smaller than $competitor" OR "you're lean and mean."
My take? You can't "see" tech, so you can't measure it in your head. Number of people != #SOC capability.
If you're searching for a #SOC, instead of asking "how many analysts?" ask:
"What does decision support look like? When an alert for a suspicious login fires, how do you optimize for the decision moment? How do you use tech to enable your SOC?"
Also, don't just ask for the average experience of SOC analysts. First, consider asking for median experience because averages are way susceptible to outliers. But, also ask about the senior analyst crew and the management team. These folks are shaping the culture and training.
Re: number of shift analysts, as soon as our capacity model projects we'll be more than 70% loaded in our SOC within x months, we'll pull hiring forward and, in parallel, ask "how do we make our current projections change?"
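To make that concrete, here's a toy version of that kind of capacity math. The alert volumes, minutes-per-alert, and 8-hour shift below are illustrative assumptions, not our actual model:

```python
# Toy version of the capacity check described above. All numbers are made up.

def projected_load(alerts_per_day: float, minutes_per_alert: float,
                   analysts_on_shift: int, shift_minutes: int = 480) -> float:
    """Fraction of available analyst time consumed by alert work."""
    demand = alerts_per_day * minutes_per_alert
    capacity = analysts_on_shift * shift_minutes
    return demand / capacity

# If projected growth pushes load past 70%, pull hiring forward now
# (and, in parallel, ask how to change the projection itself).
for month, alerts in enumerate([700, 800, 900, 1000], start=1):
    load = projected_load(alerts, minutes_per_alert=4, analysts_on_shift=10)
    flag = "<- pull up hiring" if load > 0.70 else ""
    print(f"month {month}: projected load {load:.0%} {flag}")
```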
On constraints: in our #SOC, analyst time *is* the constraint. We don't have an infinite amount of human time (or even 150 analysts). Given this constraint, we have to be intentional about when we use analyst time. We optimize for the SOC moment. We don't use the SOC to optimize.
Here's an example: we spot a lot of #BEC attempts, so many that we were spending a good chunk of available analyst time responding to just this class of incident. Sure, we could add people (aka "capacity"), but go back to strategy: define great, improve scale and quality.
We broke down the process: alert triage, investigate, respond, report. When an alert fires, we pull in enough info for an analyst to make a decision. Investigative (repeated) steps are handed off to tech, and responding (deciding what to do) automatically builds the report.
We didn't change our #SOC process for BEC; we optimized for the SOC moment (the alert decision). When we investigate, tech does the data gathering, and when an analyst interprets the results and populates remediation actions, the report is updated automatically.
Both scale and quality of our reporting improved. The goal!
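If it helps to see the shape of that idea, here's a rough sketch. Everything in it (names, fields, the stub fetchers) is hypothetical, not our platform: automation gathers the repeatable evidence, the analyst supplies the judgment and remediation actions, and the report falls out of data that's already captured.

```python
# Rough sketch of "optimize the SOC moment": tech gathers repeatable evidence,
# the analyst supplies judgment, and the report is assembled from what was captured.

from dataclasses import dataclass, field
from typing import Dict, List


def fetch_auth_history(user: str) -> list:
    return []   # placeholder for an API call to the identity provider


def fetch_inbox_rules(user: str) -> list:
    return []   # placeholder for an API call to the mail platform


@dataclass
class BecInvestigation:
    alert_id: str
    user: str
    evidence: Dict[str, object] = field(default_factory=dict)
    remediation_actions: List[str] = field(default_factory=list)

    def gather_evidence(self) -> None:
        # Repeated investigative steps handed off to automation.
        self.evidence["auth_history"] = fetch_auth_history(self.user)
        self.evidence["inbox_rules"] = fetch_inbox_rules(self.user)

    def report(self) -> Dict[str, object]:
        # The report is a by-product of the work, not a separate writing task.
        return {"alert": self.alert_id, "user": self.user,
                "evidence": self.evidence, "actions": self.remediation_actions}
```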
Some final thoughts in prep this evening: our #SOC metrics today consist of 4 key areas:
1. Efficiency (alert wait times, cycle times, etc.)
2. Efficacy (ex: how often does this sig find evil?)
3. Complexity (how many new alerts/decisions per day?)
4. Quality (how many defects?)
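For illustration only, here's how cheap each of those buckets can be to compute from alert records you already have. The field names are invented:

```python
# Toy calculation of the four metric buckets over a handful of alert records.
# Field names are made up; the point is each bucket is a simple, answerable question.

from statistics import median

alerts = [
    {"wait_min": 3, "cycle_min": 22, "sig": "bec-login", "true_positive": True,  "defect": False},
    {"wait_min": 7, "cycle_min": 41, "sig": "bec-login", "true_positive": False, "defect": False},
    {"wait_min": 2, "cycle_min": 15, "sig": "ps-cradle", "true_positive": True,  "defect": True},
]

efficiency = median(a["wait_min"] for a in alerts)                  # alert wait time
efficacy   = sum(a["true_positive"] for a in alerts) / len(alerts)  # how often a sig finds evil
complexity = len({a["sig"] for a in alerts})                        # distinct decisions to make
quality    = sum(a["defect"] for a in alerts) / len(alerts)         # defect rate

print(f"efficiency: {efficiency} min, efficacy: {efficacy:.0%}, "
      f"complexity: {complexity} alert types, quality: {quality:.0%} defects")
```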
What has my attention at present is the growing complexity. As we integrate with more tech and our platform becomes more flexible to meet customer needs, complexity is growing. Complexity adds to analyst cognitive loading, and too much of this creates stress. Stress --> burnout.
To combat this we're going to focus on abstracting away more of the underlying tech so an EDR alert looks the same regardless of the originating tech - but we're going to continue to evaluate how/when we orchestrate....
We're going to be more intentional about what data we bring to an alert vs. data we bring to an investigation/incident. Yes, you CAN present too much info in an alert. What data is needed to make the most informed decision? Nothing more, nothing less.
What data is presented in an alert? What data do I need to add to an alert by hitting an API of the tech or a 3p enrichment service? And what order does the data need to be presented in? A consistent SOC analyst experience. Reduce complexity (abstract away different tech), make it similar.
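A tiny sketch of that normalization idea, with made-up vendors, field names, and enrichment: map each tech into one shape, then present only the fields the decision needs, in a fixed order.

```python
# Sketch of a consistent, minimal suspicious-login alert regardless of source tech.
# Vendor names, field mappings, and the enrichment callable are all assumptions.

FIELD_ORDER = ["user", "source_ip", "geo", "auth_result", "mfa", "prior_logins_30d"]

VENDOR_MAPPINGS = {
    "vendor_a": {"user": "userPrincipalName", "source_ip": "ipAddress", "auth_result": "status"},
    "vendor_b": {"user": "actor", "source_ip": "client_ip", "auth_result": "outcome"},
}

def normalize(raw_alert: dict, vendor: str) -> dict:
    """Map vendor-specific fields into one shape so every suspicious-login
    alert reads the same to an analyst."""
    mapping = VENDOR_MAPPINGS[vendor]
    return {ours: raw_alert.get(theirs) for ours, theirs in mapping.items()}

def present(alert: dict, enrich) -> list:
    """Return (field, value) pairs in decision order: nothing more, nothing less."""
    enriched = {**alert, **enrich(alert)}  # e.g. geo / prior-login counts from an API
    return [(f, enriched.get(f)) for f in FIELD_ORDER]

# Example: same presentation whether the login alert came from vendor_a or vendor_b.
raw = {"actor": "jane@example.com", "client_ip": "203.0.113.7", "outcome": "success"}
print(present(normalize(raw, "vendor_b"), enrich=lambda a: {"geo": "RO", "mfa": False}))
```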
And yes, you can overdo alert orchestration. We've experimented a ton with AWS alerts by bringing in like wayyy too much contextual info from CloudTrail logs. Oftentimes, less is more. Experimenting will get you where you need to go. Test. Learn. Iterate.
One of the core tenets we have on the team is "be respectful of what's been built, but don't be afraid to ask, is there a better way?"...Doesn't matter what it is. With this type of approach, progress (scale in this context) is inevitable.
Test
Learn
Iterate
Anyway, good to put some tweets down to help gather my thoughts. Looking forward to the panel discussion tomorrow.
On measuring your #SOC, there are roughly four stages of maturity:
1. Collect data; you won't know what it means
2. Collect data; *kind* of understand it
3. Collect data; understand it. Able to say: "This is what's happening, let's try changing *that*"
4. Operational control: "If we do *this*, *that* will happen"
What you measure is mostly irrelevant. It’s that you measure and understand what it means and what you can do to move your process dials up or down.
If you ask questions about your #SOC constantly (ex: how much analyst time do we spend on suspicious logins and how can we reduce that?) - progress is inevitable.
W/o constantly asking questions and answering them using data, scaling/progress is coincidental.
Quick 🧵 of some of the insights and actions we're sharing with our customers based on Q2 '21 incident data.
TL;DR:
- #BEC in O365 is a huge problem. MFA everywhere, disable legacy protocols.
- We’re 👀 more ransomware attacks. Reduce/control the self-install attack surface.
Insight: #BEC attempts in O365 were the top threat in Q2, accounting for nearly 50% of the incidents we identified.
Actions:
- MFA everywhere you can
- Disable legacy protocols (see the sketch below for finding who still relies on them)
- Implement conditional access policies
- Consider Azure Identity Protection or MCAS
re: Azure Identity Protection & MCAS: they build data models for each user, making it easier to spot atypical auth events. Also, better logging. There's $ to consider here, I get it. Merely providing a practitioner's perspective. They're worth a look if you're struggling with BEC.
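One practical step before flipping the "disable legacy protocols" switch is knowing who still uses them. A rough sketch, assuming you've exported Azure AD sign-in logs as JSON lines with userPrincipalName and clientAppUsed fields; adjust names to whatever your export or SIEM actually produces:

```python
# Sketch: find accounts still authenticating over legacy protocols before you
# disable them. Field names and the legacy list below are assumptions -- verify
# against your own sign-in log export.

import json
from collections import Counter

LEGACY_APPS = {"IMAP4", "POP3", "SMTP", "Authenticated SMTP",
               "Exchange ActiveSync", "Other clients"}   # non-exhaustive

def legacy_auth_users(path: str) -> Counter:
    users = Counter()
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("clientAppUsed") in LEGACY_APPS:
                users[event.get("userPrincipalName", "unknown")] += 1
    return users

for user, count in legacy_auth_users("signins.jsonl").most_common(20):
    print(f"{user}: {count} legacy-protocol sign-ins")
```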
We saw a lot of variance at the end of Feb that continued into the beginning of Mar. This was due to a number of runaway alerts and some signatures that needed tweaking.
What's most interesting is that the variance decreases after we released the suppressions feature on Mar 17.
We believe this is due to analysts having more granular control of the system, and it's now easier than ever to get a poor-performing Expel alert back under control.
Process tree below so folks can query / write detections
Also, update!
Detection moments:
- w3wp.exe spawning CMD shell
- PS download cradle to execute code from Internet
- CMD shell run as SYSTEM to run batch script from Public folder
- Many more
Bottom line: a lot of ways to spot this activity.
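For folks who want a starting point, here's a rough sketch of those checks over generic process-creation events (parent, image, command line, user). Field names are placeholders; tune to your own EDR/SIEM telemetry and expect to iterate on false positives.

```python
# Illustrative checks for the detection moments above, over generic
# process-creation events. Field names are placeholders -- adapt to your telemetry.

def suspicious(event: dict) -> list:
    parent = event.get("parent_image", "").lower()
    image = event.get("image", "").lower()
    cmdline = event.get("command_line", "").lower()
    user = event.get("user", "").upper()
    hits = []

    # w3wp.exe spawning a shell (possible webshell on an Exchange/IIS box)
    if parent.endswith("w3wp.exe") and image.endswith(("cmd.exe", "powershell.exe")):
        hits.append("IIS worker spawning a shell")

    # PowerShell download cradle pulling code from the Internet
    if "powershell" in image and any(s in cmdline for s in
            ("downloadstring", "downloadfile", "invoke-webrequest", "iwr ", "frombase64string")):
        hits.append("PowerShell download cradle")

    # cmd.exe as SYSTEM running a batch script out of a Public folder
    if (image.endswith("cmd.exe") and "SYSTEM" in user
            and "\\users\\public\\" in cmdline and ".bat" in cmdline):
        hits.append("SYSTEM cmd.exe running a .bat from Public")

    return hits

# Example event shaped like Sysmon/EDR process telemetry:
print(suspicious({"parent_image": r"c:\windows\system32\inetsrv\w3wp.exe",
                  "image": r"c:\windows\system32\cmd.exe",
                  "command_line": "cmd.exe /c whoami",
                  "user": r"NT AUTHORITY\SYSTEM"}))
```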
Build.test.learn.iterate.
Also, update. :)
And some additional details from @heyjokim after further investigation:
Attack vector/Initial Compromise: CVE-2021-27065 exploited on Exchange Server
Foothold: CHOPPER webshells
Payload: DLL Search Order Hijacking (opera_browser.exe, opera_browser.dll, opera_browser.png, code)
1. Create an inbox rule to fwd emails to the RSS Subscriptions folder
2. Query your SIEM
3. How often does this happen?
4. Can you build an alert or cadence around inbox rule activity?
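Since step 2 is where people get stuck, here's one rough way to do it over exported O365 Unified Audit Log records (JSON lines). The Operation/Parameters shape follows the audit log's usual format, but verify field names against your own export:

```python
# Sketch: find inbox rules that move mail into the RSS Subscriptions folder,
# a common place BEC actors hide their tracks. Assumes exported Unified Audit Log
# records as JSON lines; verify field names against your own data.

import json

RULE_OPS = {"New-InboxRule", "Set-InboxRule"}

def rss_folder_rules(path: str) -> list:
    hits = []
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("Operation") not in RULE_OPS:
                continue
            params = {p.get("Name"): p.get("Value") for p in event.get("Parameters", [])}
            target = str(params.get("MoveToFolder") or params.get("CopyToFolder") or "")
            if "rss" in target.lower():
                hits.append({"user": event.get("UserId"),
                             "rule": params.get("Name"),
                             "target": target,
                             "time": event.get("CreationTime")})
    return hits

for hit in rss_folder_rules("audit_log.jsonl"):
    print(hit)   # how often does this happen? can you alert on it?
```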
- Pro-active search for active / historical threats
- Pro-active search for insights
- Insights lead to better understanding of org
- Insights are springboard to action
- Actions improve security / risk / reduce attack surface
With these guiding principles in hand, here's a thread of hunting ideas that will lead to insights about your environment - and those insights should be a springboard to action.
Here are my DCs
Do you see evidence of active / historical credential theft?
Can you tell me the last time we reset the krbtgt account?
Recommendations to harden my org against credential theft?
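For the krbtgt question specifically, here's a rough sketch using the ldap3 package to read pwdLastSet from AD. The server, base DN, and credentials are placeholders, and it assumes ldap3 can load schema info so pwdLastSet comes back as a datetime.

```python
# Sketch: answer "when did we last reset krbtgt?" by reading pwdLastSet over LDAP.
# Requires the ldap3 package; server, base DN, and credentials are placeholders.

from datetime import datetime, timedelta, timezone
from ldap3 import ALL, Connection, Server

server = Server("dc01.example.local", get_info=ALL)
conn = Connection(server, user="EXAMPLE\\reader", password="...", auto_bind=True)
conn.search("DC=example,DC=local", "(sAMAccountName=krbtgt)", attributes=["pwdLastSet"])

# ldap3 typically converts the AD FILETIME value to a timezone-aware datetime.
pwd_last_set = conn.entries[0].pwdLastSet.value
age = datetime.now(timezone.utc) - pwd_last_set

print(f"krbtgt password last set {pwd_last_set:%Y-%m-%d} ({age.days} days ago)")
if age > timedelta(days=180):
    print("Consider a careful, staged krbtgt reset (twice, with replication in between).")
```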