What does a #SOC tour look like when the team is remote?
TL;DR: Not a trip to a room with blinky lights, but a discussion about mission, mindset, ops mgmt, and results, plus a demo of the tech and process that make our SOC “Go”.
SOC tour in the 🧵...
Our SOC tour starts with a discussion about mission. I believe a key ingredient to high performing teams is a clear purpose and “Why”.
What’s our mission? It's to protect our customers and help them improve.
Our mission is deliberately centered around problem solving and being a strategic partner for our customers. Notice that there are zero mentions of looking at as many security blinky lights as possible. That’s intentional.
Next, we talk about culture and guiding principles - key ingredients for any #SOC. I think about culture as the behaviors and beliefs that exist when management isn’t in the room.
Culture isn't memes on a slide - it's behavior and mindset.
Next, with a clear mission and mindset, how are we organized as a team to get there? Less-experienced analysts are backed by seasoned responders. If there's a runaway alert (it happens), there's a team of D&R Engineers monitoring the situation, ready to respond.
The tour then focuses on operations management and how we do this for a living. You have to have intimate knowledge of what your system looks like so you know when something requires attention. Is it a rattle in the system (transient issue) or a big shift in work volume?
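One way to make “rattle or big shift?” concrete is to compare recent hourly alert counts against a baseline and only call it a shift when the excess persists. This is a minimal sketch, assuming hourly counts are available; the threshold of 3 standard deviations and the 3-hour persistence window (`z_threshold`, `sustain`) are hypothetical knobs, not values from our SOC.

```python
from statistics import mean, stdev


def classify_volume(history, recent, z_threshold=3.0, sustain=3):
    """Label recent alert volume as 'normal', 'rattle' or 'shift'.

    history: hourly alert counts used as the baseline (hypothetical input,
             needs at least two points for stdev).
    recent:  the most recent hourly counts, oldest first.
    A 'rattle' is a spike above the limit that does not persist;
    a 'shift' is `sustain` or more consecutive hours above it.
    """
    baseline = mean(history)
    spread = stdev(history) or 1.0  # avoid a zero-width band on flat history
    limit = baseline + z_threshold * spread

    run = longest = 0
    for count in recent:
        run = run + 1 if count > limit else 0  # reset the streak on a normal hour
        longest = max(longest, run)

    if longest >= sustain:
        return "shift"
    if longest > 0:
        return "rattle"
    return "normal"
```

Requiring persistence before calling a shift keeps a single noisy hour from pulling responders off other work.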
With solid ops management we’re able to constantly learn from our analysts and optimize for the decision moment. We watch patterns and make changes to reduce manual effort. We hand off repetitive tasks to bots because automation unlocks fast and accurate decisions.
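A minimal sketch of “watch patterns, then hand repetitive tasks to bots”: rank manual tasks by the total analyst time they consume. The `task_log` input and the task names are hypothetical; in practice the data would come from whatever records analyst actions.

```python
from collections import defaultdict


def automation_candidates(task_log, top_n=3):
    """Rank manual tasks by total analyst minutes spent on them.

    task_log: iterable of (task_name, minutes) tuples (hypothetical export).
    Returns (task, times_performed, total_minutes) for the top_n tasks,
    i.e. the ones that recur and eat the most time.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for task, minutes in task_log:
        totals[task] += minutes
        counts[task] += 1
    ranked = sorted(totals, key=lambda t: totals[t], reverse=True)
    return [(t, counts[t], totals[t]) for t in ranked[:top_n]]
```

Ranking by total time rather than raw frequency favors automating tasks that are both common and slow.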
Next, our tour focuses on how we think about investigations. Great investigations are stories (based on evidence of course).
When we identify an incident we investigate to determine what happened, when, how it got there, and what we need to do about it. Stories.
Next, how we think about quality control in our #SOC. We make a few key points:
1. We don’t trade quality for efficiency.
2. You can measure quality in a SOC.
3. QC checks run daily, based on a set of manufacturing ISOs, to spot failures and drive improvements:
- We’re going to use industry standards to sample
- The sample has to be representative of the population and done daily
- Measurements of the sample need to be accurate and precise
- Metrics we produce need to be digestible
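To illustrate “representative of the population and done daily,” here is a stratified random sample of the previous day's closed alerts, where each severity gets a share of the sample proportional to its share of the work. This is a sketch under assumptions: the `severity` field and `sample_size` are hypothetical, and it does not implement any particular ISO sampling table.

```python
import random
from collections import defaultdict


def daily_qc_sample(closed_alerts, sample_size, seed=None):
    """Draw a stratified random QC sample from one day's closed alerts.

    closed_alerts: list of dicts with at least a 'severity' key (hypothetical schema).
    Each severity stratum is sampled in proportion to its size, so the sample
    mirrors the population. Per-stratum rounding means the total can drift
    slightly from sample_size.
    """
    rng = random.Random(seed)  # seedable so a day's sample can be reproduced
    strata = defaultdict(list)
    for alert in closed_alerts:
        strata[alert["severity"]].append(alert)

    total = len(closed_alerts)
    sample = []
    for items in strata.values():
        k = max(1, round(sample_size * len(items) / total))  # at least one per stratum
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample
```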
What about #SOC results? Let's talk about it. Yes, alert-to-fix in <30 minutes is quite good. But a high degree of automation and SOC retention are equally important.
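A quick sketch of how an alert-to-fix number like this can be computed from incident timestamps. The input schema, and reading “alert-to-fix” as the time from first alert to remediation guidance, are assumptions for illustration.

```python
from datetime import datetime
from statistics import median


def alert_to_fix_minutes(incidents):
    """Minutes from first alert to fix for each incident.

    incidents: list of (alerted_at, fixed_at) ISO-8601 strings (hypothetical schema).
    """
    return [
        (datetime.fromisoformat(fix) - datetime.fromisoformat(alert)).total_seconds() / 60
        for alert, fix in incidents
    ]


def summarize(incidents, target_minutes=30):
    """Return the median alert-to-fix time and the share of incidents under target."""
    mins = alert_to_fix_minutes(incidents)
    within = sum(m < target_minutes for m in mins) / len(mins)
    return median(mins), within
```

Reporting the median alongside the share under target keeps one slow incident from hiding, or being hidden by, the typical case.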
Before the tour ends we share insights. The security incidents we detect become insights for every customer.
“Identity is the new endpoint”: we're seeing a lot of BEC in M365, and MFA fatigue attacks are up.
Then we jump into our platform and provide a demo of the tech and process that enable the #SOC to complete their mission. Here's a video capturing some of the items we cover: expel.com/managed-securi…
Finally, we stop by the #SOC. Most of our analysts will be remote, but a tour is about so much more than seeing a room with monitors and blinky lights. I believe a great SOC tour highlights the people, culture and mindset behind the tech and process.
A good detection includes:
- Clear aim (e.g., remote process exec on DC)
- Unlocks end-to-end workflow (not just alert)
- Automation to improve decision quality
- Response (hint: not always contain host)
- Volume/work time calcs
- Able to answer, “where does efficacy need to be?”
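The “volume/work time calcs” item is simple arithmetic worth writing down before a detection ships. A minimal sketch, assuming you have an expected weekly volume (for example from back-testing) and rough handling times for true and false positives; all parameter values below are hypothetical.

```python
def detection_workload(alerts_per_week, tpr, minutes_tp, minutes_fp):
    """Estimate weekly analyst hours a detection will consume.

    alerts_per_week: expected alert volume.
    tpr: share of alerts expected to be true positives (0 to 1).
    minutes_tp / minutes_fp: hypothetical average handling time per outcome.
    """
    tp = alerts_per_week * tpr
    fp = alerts_per_week - tp
    return (tp * minutes_tp + fp * minutes_fp) / 60
```

For example, 120 alerts a week at 25% TPR, with 30 minutes per true positive and 5 per false positive, works out to 22.5 analyst hours a week.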
On detection efficacy:
⁃ As your True Positive Rate (TPR) moves higher, your False Negative Rate tends to rise with it
⁃ Our overarching detection efficacy goal will never be 100% TPR (facts)
⁃ However, TPR targets differ based on classes of detections and alert severities
Math tells us there is a sweet spot between combating alert fatigue and controlling false negatives. Ours kind of looks like a ROC curve.
This measure becomes the overarching target for detection efficacy.
“Detection efficacy is an algebra problem not calculus.” - Matt B.
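A sketch of finding that sweet spot on labeled historical data: sweep candidate alert thresholds, compute TPR (in the thread's SOC sense, the share of fired alerts that were true positives) and missed true positives, and pick the threshold with the lowest weighted cost. The scored input and the cost weights `fp_cost` and `fn_cost` are hypothetical; this is one illustrative approach, not our exact method.

```python
def sweep_thresholds(scored, thresholds, fp_cost=1.0, fn_cost=10.0):
    """Evaluate alert thresholds on labeled historical activity.

    scored: list of (score, is_malicious) pairs (hypothetical back-test data).
    An alert fires when score >= threshold.
    Returns rows of (threshold, alerts_fired, tpr, false_negatives, cost).
    """
    rows = []
    for t in thresholds:
        fired = [bad for score, bad in scored if score >= t]
        tp = sum(fired)
        fp = len(fired) - tp
        fn = sum(bad for score, bad in scored if score < t)  # evil we never alerted on
        tpr = tp / len(fired) if fired else 0.0
        rows.append((t, len(fired), tpr, fn, fp * fp_cost + fn * fn_cost))
    return rows


def sweet_spot(rows):
    """Pick the threshold row with the lowest weighted cost."""
    return min(rows, key=lambda r: r[4])
```

On small toy data this reproduces the trade-off above: the strictest threshold reaches 100% TPR but misses the most true positives, so the cheapest threshold sits in the middle.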
Before we hired our first #SOC analyst or triaged our first alert, we defined where we wanted to get to; what great looked like.
Here’s [some] of what we wrote:
We believe that a highly effective SOC:
1. leads with tech; doesn’t solve issues w/ sticky notes
2. automates repetitive tasks
3. responds to and contains incidents before damage is done
4. has a firm handle on capacity v. loading
5. is able to answer, “are we getting better, or worse?”
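Having “a firm handle on capacity v. loading” boils down to comparing available analyst hours with the hours of work that showed up. A minimal sketch; the work categories, counts and handling times are hypothetical inputs, not published figures.

```python
def capacity_vs_loading(analysts, productive_hours_per_analyst, work_items):
    """Compare available analyst hours with the work that arrived in a period.

    work_items: list of (count, minutes_each) pairs, one per kind of work
    (e.g. alerts, investigations, incidents).
    Returns (capacity_hours, load_hours, utilization).
    """
    capacity = analysts * productive_hours_per_analyst
    load = sum(count * minutes for count, minutes in work_items) / 60
    return capacity, load, load / capacity
```

Tracking utilization per shift over time is one way to answer “are we getting better, or worse?” for capacity.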
How to think about presenting good security metrics:
- Anchor your audience (why are these metrics important?)
- Make multiple passes with increasing detail
- Focus on structures and functions
- Ensure your audience leaves w/ meaning
Don’t read a graph, tell a story
Ex ⬇️
*Anchor your audience 1/4*
Effective leaders have a firm handle on SOC analyst capacity vs. how much work shows up. To stay ahead, one measurement we analyze is a time series of alerts sent to our SOC.
*Anchor your audience 2/4*
This is a graph of the raw trend of unique alerts sent to our SOC for review between Nov 1, 2021 and Jan 2, 2022. This time period includes two major holidays, so we expect some seasonality to show up around these dates.
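A sketch of building that time series: count unique alerts per day (filling quiet days with zero) and add a trailing 7-day average so holiday dips stand out from day-of-week noise. It assumes alerts are already deduplicated to one date per unique alert; the window size is a hypothetical choice.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(alert_dates, start, end):
    """Count unique alerts per day from start to end inclusive, filling gaps with 0."""
    counts = Counter(alert_dates)
    days = (end - start).days + 1
    return [
        (start + timedelta(days=d), counts.get(start + timedelta(days=d), 0))
        for d in range(days)
    ]


def rolling_mean(values, window=7):
    """Trailing moving average; early points average over what is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```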
Once a month we get in front of our exec/senior leadership team and talk about #SOC performance relative to our business goals (grow ARR, retain customers, improve gross margin).
A 🧵on how we translate business objectives to SOC metrics.
As a business we want to grow Annual Recurring Revenue (ARR), retain and grow our customers (Net Revenue Retention - NRR) and improve gross margin (net sales minus the cost of services sold). There are others but for this thread we'll focus on ARR, NRR, and gross margin.
/1
I think about growing ARR as the ability to process more work. It's more inputs. Do we have #SOC capacity available backed by the right combo of tech/people/process to service more work?
Things that feed more work: new customers, cross selling, new product launches.
/2
Purpose: Be clear with your team about what success looks like, and create a team and culture that guides you there. Go through the exercise of articulating your team's purpose.
The "purpose" we've aligned on at Expel in our SOC: protect our customers and help them improve.
People: To get to where you want to go, what are the traits, skills, and experiences you need to be successful?
Traits (who you are)
Skills (what you know)
Experiences (what you've encountered/accomplished)
A good alert includes:
- Detection context
- Investigation/response context
- Orchestration actions
- Prevalence info
- Environmental context (e.g., src IP is scanner)
- Pivots/visual to understand what else happened
- Able to answer, "Is host already under investigation?"
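A sketch of enriching an alert with three of the items above: prevalence, environmental context, and whether the host is already under investigation. The `Alert` shape and the three lookup tables are hypothetical stand-ins for real data sources.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    detection: str
    host: str
    src_ip: str
    artifact: str  # e.g. a file hash or process name that fired the detection
    prevalence: int = 0  # how many hosts in the environment have seen the artifact
    environment_notes: list = field(default_factory=list)
    open_investigation: Optional[str] = None


def enrich(alert, sightings, known_scanners, investigations_by_host):
    """Attach prevalence, environmental context and open-investigation info.

    sightings: dict artifact -> set of hosts that have seen it.
    known_scanners: set of IPs known to be scanners.
    investigations_by_host: dict host -> open investigation ID.
    """
    alert.prevalence = len(sightings.get(alert.artifact, set()))
    if alert.src_ip in known_scanners:
        alert.environment_notes.append("src IP is a known scanner")
    alert.open_investigation = investigations_by_host.get(alert.host)
    return alert
```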
Detection context. Tell me what the alert is meant to detect, when it was pushed to prod/last modified, and by whom. Tell me about "gotchas" and point me to examples when this detection found evil. Also, where in the attack lifecycle did we alert? This informs the right pivots.
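Detection context can travel with the alert as a small metadata record. This is a hypothetical shape, not a real product schema; the example author name and lifecycle-stage label are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DetectionContext:
    """Hypothetical metadata an alert could carry about the detection that fired."""
    aim: str
    pushed_to_prod: date
    last_modified: date
    modified_by: str
    lifecycle_stage: str  # where in the attack lifecycle the alert fired
    gotchas: list = field(default_factory=list)
    past_true_positives: list = field(default_factory=list)  # links to prior finds

    def summary(self):
        """One-line header an analyst sees first."""
        return (
            f"{self.aim} | stage: {self.lifecycle_stage} | "
            f"last modified {self.last_modified.isoformat()} by {self.modified_by}"
        )
```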
Investigation/response context. Given a type of activity detected, guide an analyst through response.
If it's #BEC, what questions do we need to answer, and from which data sources? If it's a coinminer in AWS, guide the analyst through CloudTrail and the steps to remediate.
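One way to encode this guidance is a runbook table keyed by activity type, turned into an ordered checklist for the analyst. The keys, questions, data sources and remediation steps below are illustrative placeholders, not a complete or authoritative playbook.

```python
# Hypothetical runbook table: entries are illustrative, not exhaustive.
RUNBOOKS = {
    "bec": {
        "questions": [
            "Which mailbox was accessed, from where, and when?",
            "Were inbox rules or forwarding settings changed?",
        ],
        "data_sources": ["Microsoft 365 unified audit log", "Azure AD sign-in logs"],
        "remediation": ["Reset credentials and revoke sessions", "Remove malicious inbox rules"],
    },
    "aws_coinminer": {
        "questions": ["Which IAM principal launched the instances?"],
        "data_sources": ["AWS CloudTrail"],
        "remediation": [
            "Disable or rotate the exposed access keys",
            "Terminate the unauthorized instances",
        ],
    },
}


def guide(activity_type):
    """Turn a runbook into an ordered checklist: questions, data, then remediation."""
    runbook = RUNBOOKS.get(activity_type)
    if runbook is None:
        return ["No runbook yet: escalate to a senior analyst"]
    steps = [f"Answer: {q}" for q in runbook["questions"]]
    steps += [f"Check: {s}" for s in runbook["data_sources"]]
    steps += [f"Remediate: {r}" for r in runbook["remediation"]]
    return steps
```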