This afternoon at #srecon, Adam Mckaig and Tahia Khan from @datadoghq talked about the evolution of their metrics backend.
The high-level architecture looks very familiar to me. The slightly more detailed view, less so: many parts!
For scale, break up incoming data and put it into Kafka.
hash(customer_id) -> partition_id
… but then one Kafka topic gets overloaded, so…
hash(customer_id) -> topic_id, partition_id
to send to topics in different clusters.
Later, some customers are too big.
So for those customers:
hash(metric_id) -> topic, partition
Since metrics are queried individually, @datadoghq can split up data to that fine grain and each query will still only need to hit one partition. #SREcon
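A minimal sketch of that routing progression, not Datadog's actual code. The topic names, partition count, and the `BIG_CUSTOMERS` set are all made up for illustration, and combining the customer with the metric in the hash key is my assumption (the talk just said hash(metric_id)):

```python
import hashlib
from typing import Tuple

# Hypothetical layout: these names and counts are illustrative only.
TOPICS = ["metrics-a", "metrics-b", "metrics-c"]
PARTITIONS_PER_TOPIC = 8
BIG_CUSTOMERS = {"customer-42"}  # customers too large for one partition


def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(customer_id: str, metric_id: str) -> Tuple[str, int]:
    """Pick a (topic, partition) for one data point."""
    if customer_id in BIG_CUSTOMERS:
        # Big customers are spread out by metric.
        key = f"{customer_id}/{metric_id}"
    else:
        # Everyone else keeps all of their metrics on one partition.
        key = customer_id
    slot = stable_hash(key) % (len(TOPICS) * PARTITIONS_PER_TOPIC)
    return TOPICS[slot // PARTITIONS_PER_TOPIC], slot % PARTITIONS_PER_TOPIC
```

Either way, a query for one metric computes the same key as intake did, so it only has to look in one partition.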
Partitions still get unbalanced. Some customers, and some metrics, are way bigger than others.
So @datadoghq got smart with its partitioning, implementing Slicer based on a paper from Google. #srecon
The storage layer knows nothing about the partitioning scheme.
Intake and Query need the mapping from (customer, metric) to (cluster, partition) so they can send to & query from the same node.
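To make the "same mapping on both sides" point concrete, here is a rough sketch of a routing table that Intake and Query could both consult. This is not Slicer and not Datadog's implementation: the `RoutingTable` class, the hash-range scheme, and the cluster names are assumptions for illustration. The only idea borrowed from the talk is that the assignment can change to rebalance hot spots while storage stays unaware of it.

```python
import bisect
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 32
Target = Tuple[str, int]  # (cluster, partition)


def key_hash(customer_id: str, metric_id: str) -> int:
    """Deterministic position of a (customer, metric) pair in the hash space."""
    digest = hashlib.sha256(f"{customer_id}/{metric_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class RoutingTable:
    """Maps contiguous ranges of the hash space to (cluster, partition)."""

    def __init__(self, assignments: List[Tuple[int, Target]]) -> None:
        # Each entry is (range_start, target); a range runs to the next start.
        ordered = sorted(assignments)
        if not ordered or ordered[0][0] != 0:
            raise ValueError("first range must start at 0")
        self._starts = [start for start, _ in ordered]
        self._targets = [target for _, target in ordered]

    def lookup(self, customer_id: str, metric_id: str) -> Target:
        h = key_hash(customer_id, metric_id)
        # Find the last range whose start is <= h.
        idx = bisect.bisect_right(self._starts, h) - 1
        return self._targets[idx]

    def with_split(self, midpoint: int, new_target: Target) -> "RoutingTable":
        """Return a new table where keys from midpoint up to the next
        range start go to new_target (a crude way to relieve a hot range)."""
        pairs = [(s, t) for s, t in zip(self._starts, self._targets) if s != midpoint]
        pairs.append((midpoint, new_target))
        return RoutingTable(pairs)
```

Intake and Query would both call `lookup` on the same version of the table; the storage node just sees writes and reads arriving at a partition.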
• • •
I laugh at people who talk about “exactly-once delivery”
The specs that claim it have been proven wrong.
But we have methods (like idempotency) to do things well. @mjpt777 #YowLondon
Make handover/resumption protocols.
“This is what I thought I sent to you last, did you get it?”
“Here’s what I got from you last, let’s work it out from there”
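A toy sketch of what such a resumption handshake could look like, assuming per-message sequence numbers. The `Sender` and `Receiver` classes are hypothetical, not from the talk. The receiver is idempotent (duplicates change nothing), and after a disconnect the sender replays from wherever the receiver says it got to:

```python
from typing import Dict, List


class Receiver:
    def __init__(self) -> None:
        self.last_seq = 0  # highest sequence number applied so far
        self.applied: List[str] = []

    def deliver(self, seq: int, payload: str) -> int:
        # Idempotent: duplicates (seq <= last_seq) and gaps are ignored.
        if seq == self.last_seq + 1:
            self.applied.append(payload)
            self.last_seq = seq
        # The ack doubles as "here's what I got from you last".
        return self.last_seq


class Sender:
    def __init__(self) -> None:
        self.log: Dict[int, str] = {}  # kept so messages can be replayed
        self.next_seq = 1

    def send(self, payload: str) -> int:
        """Record a message and return its sequence number."""
        seq = self.next_seq
        self.log[seq] = payload
        self.next_seq += 1
        return seq

    def resume(self, receiver: Receiver) -> None:
        # Ask the peer where it got to, then replay everything after that.
        for seq in range(receiver.last_seq + 1, self.next_seq):
            receiver.deliver(seq, self.log[seq])
```

Nothing here is delivered "exactly once" on the wire; messages may be sent several times, but each one is applied once.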
If we go from Idea to Behavior change to new Idea…
how quickly we can do that depends on the structure. @kentbeck
If we go Idea to Behavior to Idea to Behavior
as fast as we can,
it’s gonna get slower and slower and then the developers will get frustrated and leave and the new developers will be even slower…
So sometimes, we make a structure change before the behavior change. @KentBeck
SREs in the audience? (Dozens of hands)
Experienced SREs? (Like 2.5 hands)
We @RedHat used to ship products. Build a thing, package it, send to customers. Then it was their problem. Customer hires a consultant or figures it out.
Now we mostly ship services. Now it’s our headache: reliability, uptime, etc. It’s different.
The team deserves someone
who wants to manage people.
who is not bitter about meetings
who is interested in sociotechnical systems and nurturing careers
whose technical skills are strong enough to evaluate their work.