1. What we're solving for
2. Guiding principles
3. Our (current) solution
4. Quick recap
My goal is to share what's working for us and how we get there. But I'd love to hear from others. What's working for you?
On a typical day in our #SOC we'll:
- Process Ms of alerts w/ detection engine
- Send 100s to analysts for human judgement
Those 100s of alerts result in:
- Tens of investigations
- Handful of incidents
1. QA (quality assurance) | Focus: *Prevent* defects | Ex: Email notifications for those really spooky alerts
2. QC (quality control) | Focus: *Find* defects | Ex: Let's review closed alerts
You likely already have a *ton* of QA built in.
But is there any QC?
What are the #SOC QC guiding principles?
1. We'll use industry standards to sample
2. The sample has to be representative of the total population
3. Measurements must be accurate & precise
4. Metrics we produce are digestible
5. Performed daily
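One way to get at principle 2 is a simple random sample, so every closed item has the same chance of being reviewed. This is a minimal sketch, not our actual notebook: `draw_sample`, the ID format, and the seed are hypothetical names picked for illustration.

```python
import random


def draw_sample(closed_alert_ids, sample_size, seed=None):
    """Draw a simple random sample (without replacement) of closed alerts.

    `closed_alert_ids` and `seed` are illustrative assumptions; a fixed seed
    only makes a given day's draw reproducible for later audit.
    """
    rng = random.Random(seed)
    population = list(closed_alert_ids)
    # Never ask for more items than the day's population contains.
    k = min(sample_size, len(population))
    return rng.sample(population, k)


if __name__ == "__main__":
    todays_alerts = [f"alert-{i}" for i in range(600)]  # made-up IDs
    print(draw_sample(todays_alerts, 32, seed=7)[:5])
```

A plain random draw treats every alert the same. If some alert types are rare but important, a stratified draw per type may be a better fit.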
We went out and researched QC in manufacturing and landed on ISO 2859-1.
TL;DR ➡️ You make things (your lot), and the AQL (acceptance quality limit) tables tell you how many you should inspect.
Let's say your team handles 600 alerts per day (lot size).
You should inspect 32 (sample size).
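Here's a rough sketch of that lookup. It assumes general inspection level I under normal inspection, which is one reading of the tables that gives 32 for a lot of 600. The thread doesn't say which level we picked, and the values below are transcribed by hand, so check them against your own copy of ISO 2859-1 before relying on them.

```python
# Assumption: general inspection level I, normal inspection, single sampling.
# Each entry is (largest lot size in the band, sample-size code letter).
LEVEL_I_CODE_LETTERS = [
    (8, "A"), (15, "A"), (25, "B"), (50, "C"), (90, "C"), (150, "D"),
    (280, "E"), (500, "F"), (1200, "G"), (3200, "H"), (10000, "J"),
    (35000, "K"), (150000, "L"), (500000, "M"),
]
LARGEST_LOT_LETTER = "N"  # lots above 500,000

# Sample size for each code letter under normal inspection.
SAMPLE_SIZE_BY_LETTER = {
    "A": 2, "B": 3, "C": 5, "D": 8, "E": 13, "F": 20, "G": 32,
    "H": 50, "J": 80, "K": 125, "L": 200, "M": 315, "N": 500,
}


def sample_size(lot_size):
    """Return how many items to inspect for a given daily lot size."""
    if lot_size < 2:
        raise ValueError("the tables start at a lot size of 2")
    for upper_bound, letter in LEVEL_I_CODE_LETTERS:
        if lot_size <= upper_bound:
            return SAMPLE_SIZE_BY_LETTER[letter]
    return SAMPLE_SIZE_BY_LETTER[LARGEST_LOT_LETTER]


if __name__ == "__main__":
    print(sample_size(600))
```

The full standard also pairs each sample size with accept/reject numbers for a chosen AQL. This sketch only covers the "how many" part.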
We sample from three populations:
1. Alerts
2. Investigations
3. Incidents
We used change point analysis to determine the mean daily volume of each, then used the AQL tables to tell us how many we should inspect each day.
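The thread doesn't name a specific change point method, so treat this as one simple stand-in: find the single split in the daily counts that most reduces squared error, and if the reduction is big enough, use the mean of the most recent segment as the lot size. `min_size`, `min_gain`, and the sample counts are all hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the segment mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(counts, min_size=3, min_gain=0.5):
    """Return the index of the best single change point, or None.

    A split is accepted only if it removes at least `min_gain` of the total
    squared error; this threshold is an illustrative assumption.
    """
    total = sse(counts)
    best_idx, best_cost = None, (1 - min_gain) * total
    for i in range(min_size, len(counts) - min_size + 1):
        cost = sse(counts[:i]) + sse(counts[i:])
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx


def current_daily_mean(counts):
    """Mean of the segment after the detected change point (or of all days)."""
    idx = best_split(counts)
    return mean(counts[idx:] if idx is not None else counts)


if __name__ == "__main__":
    daily_alerts = [600, 610, 590, 605, 595, 900, 910, 890, 905, 895]  # made up
    print(current_daily_mean(daily_alerts))
```

The point of using the post-change mean is that the lot size fed into the AQL tables tracks the current volume rather than an average dragged down by an older, quieter period.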
Cool, cue the #Jupyter Notebook.
We take each item through a check sheet and look for defects. Did we take the right action? Did we zig when we should have zagged? That type of thing.
We record the number of defects by type each day, trend them, and then provide feedback to the team via a #Slack workflow.
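A bare-bones version of the daily tally might look like this. The defect type names, the review tuple shape, and the message format are invented for illustration, and the sketch only builds the summary text rather than posting it anywhere.

```python
from collections import Counter


def tally_defects(reviews):
    """Count defects by type and compute the share of items with any defect.

    `reviews` is a list of (item_id, [defect types]) pairs, one per sampled
    item that went through the check sheet. The shape is an assumption.
    """
    by_type = Counter()
    defective_items = 0
    for _item_id, defects in reviews:
        by_type.update(defects)
        if defects:
            defective_items += 1
    rate = defective_items / len(reviews) if reviews else 0.0
    return by_type, rate


def format_feedback(day, by_type, rate):
    """Build a short plain-text summary suitable for a chat message."""
    lines = [f"{day}: {rate:.0%} of sampled items had a defect"]
    for defect_type, count in by_type.most_common():
        lines.append(f"- {defect_type}: {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    todays_reviews = [  # made-up check sheet results
        ("a1", []),
        ("a2", ["wrong_action"]),
        ("a3", ["wrong_action", "missing_notes"]),
        ("a4", []),
    ]
    counts, defect_rate = tally_defects(todays_reviews)
    print(format_feedback("2024-01-15", counts, defect_rate))
```

Storing one `Counter` per day gives you the series to trend, and the per-type counts are what point at things like a training gap versus a tooling gap.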
SOC #QC wins:
- Spotted issues using a class of tech ➡️ held training
- Variance wrt how we investigated auth alerts ➡️ built orchestration
- Wobble w/ reporting quality ➡️ built tech
I'd love to hear about your quality program. What works? What didn't? Success stories? We're always on the lookout for ways to improve.
Also, if you've made it this far in the thread, thanks for taking the time!