At Google, ops work needs to be less than 50% of the total work done by SRE.
At Google, burning an error budget within 24 hours is worth an alert.
The cumulative errors graph shows the total number of errors received by the system.
The third graph is the estimated alert threshold, i.e. the level at which the current burn rate would exhaust the error budget within the next 24 hours.
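The 24-hour burn rule can be sketched roughly as below. This is a minimal illustration, not Google's actual alerting logic: the function names, the 99.9% SLO, the request volume and the error counts are all assumptions made up for the example.

```python
def error_budget(slo_target: float, expected_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the whole SLO window."""
    return (1.0 - slo_target) * expected_requests


def hours_until_exhausted(budget: float, errors_so_far: float,
                          errors_per_hour: float) -> float:
    """Hours left before the budget is gone if the current burn rate holds."""
    remaining = budget - errors_so_far
    if remaining <= 0:
        return 0.0            # budget already spent
    if errors_per_hour <= 0:
        return float("inf")   # not burning any budget right now
    return remaining / errors_per_hour


def should_alert(budget: float, errors_so_far: float, errors_per_hour: float,
                 horizon_hours: float = 24.0) -> bool:
    """Alert when the budget would be exhausted within the horizon (24h here)."""
    return hours_until_exhausted(budget, errors_so_far, errors_per_hour) <= horizon_hours


# Hypothetical numbers: 99.9% SLO over 1,000,000 expected requests,
# so roughly 1,000 errors are allowed in the window.
budget = error_budget(0.999, 1_000_000)
print(should_alert(budget, errors_so_far=400, errors_per_hour=50))  # fast burn
print(should_alert(budget, errors_so_far=400, errors_per_hour=10))  # slow burn
```

With these made-up numbers, the fast burn would use up the remaining budget in about 12 hours and so alerts, while the slow burn leaves about 60 hours and does not.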
For a distributed system, there are three signals one generally emits - logs, metrics and traces (all derivatives of events). #velocityconf
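One way to picture "all derivatives of events" is a single structured event that gets turned into a log line, a counter increment and a trace span. This is only a sketch: the event fields (`service`, `operation`, `status`, `trace_id`, `start_ms`, `end_ms`) and the helper names are hypothetical, not any particular library's schema.

```python
import json
from collections import Counter


def to_log(event: dict) -> str:
    """Log: the full event, serialized as one line."""
    return json.dumps(event, sort_keys=True)


def to_metric(event: dict, counters: Counter) -> None:
    """Metric: the event aggregated into a counter keyed by service and status."""
    counters[(event["service"], event["status"])] += 1


def to_span(event: dict) -> dict:
    """Trace span: the event's timing, tied to a request-scoped trace id."""
    return {
        "trace_id": event["trace_id"],
        "name": event["operation"],
        "start_ms": event["start_ms"],
        "duration_ms": event["end_ms"] - event["start_ms"],
    }


# A made-up event for one request handled by a hypothetical "checkout" service.
event = {
    "service": "checkout",
    "operation": "POST /pay",
    "status": 500,
    "trace_id": "abc123",
    "start_ms": 1000,
    "end_ms": 1042,
}

counters = Counter()
to_metric(event, counters)
print(to_log(event))
print(dict(counters))
print(to_span(event))
```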
Monitoring and alerting aren’t a substitute for debugging
Also, a classic @allspaw quote from @Monitorama
#velocityconf
At Google, there’s a limit of 2 pages per on-call shift.
TLDR - Error budgets are way better than cause-based alerts. You get paged much less, and when you do get paged, it’s for important stuff.