Like Security, SRE’s value hides in all the incidents that don’t happen.
So it's easy to ignore. But people and legislative bodies value it.
The Success in SRE is Silent
and if our success remains silent, our profession (and software development in general) will go the way of security: regulation.
Regulation means more gatekeeping, for people and for small companies. It means enforced “best practices” that are counterproductive and suck the joy out of our work. @caseyrosenthal #srecon22
How can we demonstrate the value of SRE?
Not with quantitative methods. Metrics like nines of availability or MTTR don’t represent customer experience.
Describe SRE success with qualitative methods. Ask developers for reactions, for learning. Notice behavior change. And then (with the most effort) demonstrate business results.
Note that the outcome is not “reliability.” We can’t “prove reliability.”
This afternoon at #srecon, Adam Mckaig and Tahia Khan from @datadoghq talk about the evolution of their metrics backend.
The high-level architecture looks very familiar to me. The slightly more detailed view, less so: many parts!
For scale, break up incoming data and put it into Kafka.
hash(customer_id) -> partition_id
… but then one Kafka topic gets overloaded, so…
hash(customer_id) -> topic_id, partition_id
to send to topics in different clusters.
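The two-level routing above can be sketched roughly like this. This is a minimal illustration, not Datadog's implementation: the function names, the SHA-256 hash, and the topic and partition counts are all my assumptions.

```python
import hashlib
from typing import Tuple


def stable_hash(key: str) -> int:
    """Hash a string to an integer that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used instead (assumption: any stable hash would do).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(customer_id: str, num_topics: int, partitions_per_topic: int) -> Tuple[int, int]:
    """Map a customer to a (topic_id, partition_id) pair.

    One hash picks a slot in the combined topic x partition space,
    then divmod splits that slot into the two coordinates.
    """
    slot = stable_hash(customer_id) % (num_topics * partitions_per_topic)
    topic_id, partition_id = divmod(slot, partitions_per_topic)
    return topic_id, partition_id


# Hypothetical sizes: 4 topics (possibly in different clusters), 8 partitions each.
topic_id, partition_id = route("customer-42", num_topics=4, partitions_per_topic=8)
```

The same customer always lands on the same topic and partition, while different customers spread over every topic instead of piling onto one.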
Today at #srecon, @allspaw and @ri_cook give deep insight on real tools, incident timelines, and clumsy automation.
But not in person. 😭
Great tools (as opposed to machines) are near to hand and conform to the person who wields them. Like a hammer, or `top`. Yeah.
They are opinionated, but not prescriptive.
(machines do what they do, and you conform to them)
In software, tools like `top` help us see what’s going on in the digital space. @ri_cook et al see our work taking place on two sides of a divide. There’s meatspace (where we are) and digital space (where the software runs). You can’t reach out and feel digital stuffs.
What can we learn from ALL the incidents? @courtneynash at @verica_io compiles reports from lots of companies into the VOID: Verica Open Incident Database. #SREcon
While every incident and every company is different, the distributions have the same shape. They are “positively skewed:” more short incidents than long ones.
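To see what “positively skewed” looks like in numbers, here is a small sketch. The durations are made up for illustration and are not from the VOID; the point is only that a few long incidents pull the mean well above the median.

```python
import statistics

# Made-up incident durations in minutes (illustrative only, not VOID data).
durations = [5, 10, 12, 15, 20, 25, 30, 45, 90, 240, 600]

mean = statistics.mean(durations)
median = statistics.median(durations)


def skewness(xs):
    """Population skewness: the average cubed z-score."""
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)


skew = skewness(durations)

# In a positively skewed sample the long tail is on the right:
# the mean sits above the median and the skewness is greater than zero.
print(f"mean={mean:.1f} median={median} skewness={skew:.2f}")
```

This is also why a single average like MTTR says so little about a typical incident in a distribution shaped like this.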
People.
There is a difference between a backend and an API.
Taking the endpoints that you wrote for your site, slapping some documentation on them and publishing it
does not make an API.
An API needs designing. It needs a conscious language and consistent conventions.
Standard auth.
Paging.
Careful error codes and messages.
Versioning.
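As a rough sketch of what “consistent conventions” can mean, every response could share one envelope for paging, errors, and version. The field names, the `v1` label, and the error code below are hypothetical, not taken from any particular API, and auth is left out to keep it short.

```python
from typing import Any, Dict, List, Optional

API_VERSION = "v1"  # hypothetical version label, e.g. served under /v1/...


def page_response(items: List[Any], cursor: int = 0, limit: int = 2) -> Dict[str, Any]:
    """Return one page of items plus a cursor for the next page (None when done)."""
    page = items[cursor:cursor + limit]
    next_cursor: Optional[int] = cursor + limit if cursor + limit < len(items) else None
    return {"version": API_VERSION, "data": page, "next_cursor": next_cursor}


def error_response(status: int, code: str, message: str) -> Dict[str, Any]:
    """Return an error with a machine-readable code and a human-readable message."""
    return {
        "version": API_VERSION,
        "error": {"status": status, "code": code, "message": message},
    }


first = page_response(["a", "b", "c"])
last = page_response(["a", "b", "c"], cursor=first["next_cursor"])
not_found = error_response(404, "widget_not_found", "No widget with that id exists.")
```

Every endpoint answers in the same shape, so a client can write its paging and error handling once.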
A backend is whatever your front end needs. It should change when your front end needs it to change.
Don’t restrict it to historical behavior because other systems have grown dependent on it.