Let's talk... about raw events, logs, aggregates and data structures. The meat and potatoes of computering.
What *is* an event? For the purposes of this thread, let's define an event as one hop in the lifecycle of a request. (Oodles of details here: honeycomb.io/blog/2018/06/h…)
So if a request hits your edge, then API service, then calls out to 4 other internal services and makes 5 db queries, that request generated 10 events.
(If it made 20 cache lookups and 10 http calls to services we don't control and can't observe, those don't count as events...
... because this is all about *perspective*, observing from inside the code. You can flip to the perspective of your internal services, but not the external ones. And it probably isn't useful or efficient to instrument your memcache lookups. So those aren't "events".)
OK. Now, part of the reason people think structured data is cost-prohibitive is that they're doing it wrong: spewing log lines from within functions, constructing log lines with just a couple of nouns of data, logging the same nouns 10-100x just to correlate across the request lifecycle.
Then they hear they should structure their logs, so they add structure to their existing shitty logs, which adds a few bytes, and then wonder why they aren't getting any benefit -- just paying more.
You need a fundamentally different approach to reap the max benefits.
(Oops meeting -- to be continued!)
<10 hours later>
OK LETS DO THIS
So let's talk about the correct level of granularity/abstraction for collecting service-level information. This is not a monitoring use case, but it sure af ain't gdb either. This is *systems-level introspection*, or plain ol' systems debugging.
In distributed systems, the hardest part is often not finding the bug in your code, but tracking down which component is actually the source of the problem so you know what code to look at.
Or finding the requests that exhibit the bug, and deducing what they all have in common.
Observability isn't about stepping through all the functions in your code. You can do that on your laptop, once you have a few sample requests that manifest the problem. Observability is about swiftly isolating the source of the problem.
The most effective way to structure your instrumentation, so you get the maximum bang for your buck, is to emit a single arbitrarily wide event per request per service hop.
We're talking wiiiide. We usually see 200-500 dimensions in a mature app. But just one write.
Initialize the empty debugging event when the request enters the service. Stuff any and all interesting details into that event as you execute. Ship your phat event off right before you exit or error out of the service. (Catch alll the signals.)
(This is how all the honeycomb beeline integrations work, btw. Plus a little magic to get you started easy with some prepopulated stuff.)
Stuff you're gonna want to track includes things like (rough code sketch after the list):
🎀 Metadata like src, dst, headers
🎀 The timing stats and contents of every network call out
🎀 Every db query, normalized query, execution time etc
🎀 Infra details like AZ, instance type, provider
🎀 Language/env details like $lang version
🎀 Any and all unique identifying bits you can get your paws on: UUID, request ID, shopping cart ID, any other ID <<- HIGHEST VALUE DETAILS
🎀 Any other useful application context, starting with service name
🎀 Possibly system resource state at that point in time, e.g. /proc/net/ipv4
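Here's roughly what that pattern looks like. This is just a sketch, not any particular beeline or SDK -- the class, handler, and field names are all made up for illustration:

```python
import json
import time
import uuid


class WideEvent:
    """One arbitrarily wide event per request, per service hop."""

    def __init__(self, service_name):
        # Initialize the empty debugging event when the request enters the service.
        self.start = time.time()
        self.fields = {
            "service.name": service_name,
            "request.id": str(uuid.uuid4()),
        }

    def add(self, key, value):
        # Stuff any and all interesting details into the event as you execute.
        self.fields[key] = value

    def send(self):
        # Ship the whole fat blob exactly once, right before exiting or erroring.
        self.fields["duration_ms"] = round((time.time() - self.start) * 1000, 2)
        print(json.dumps(self.fields))  # stand-in for your actual event shipper


def handle_request(request):
    ev = WideEvent("api")
    ev.add("request.path", request["path"])
    ev.add("user.id", request["user_id"])     # unique IDs = highest-value details
    ev.add("infra.az", "us-east-1a")          # infra details
    try:
        sql = "SELECT * FROM carts WHERE user_id = ?"
        t0 = time.time()
        rows = []                             # pretend the db call happens here
        ev.add("db.query", sql)
        ev.add("db.duration_ms", round((time.time() - t0) * 1000, 2))
        ev.add("cart.items", len(rows))
        return rows
    except Exception as exc:
        ev.add("error", str(exc))
        raise
    finally:
        ev.send()


handle_request({"path": "/cart", "user_id": "u-8675309"})
```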
All of it. In one fat structured blob. Not sprinkled around your code in functions like satanic fairy dust. You will crush your logging system that way, and you'd need to do exhaustive post-processing to recreate the shared context by joining on request-id (if you're lucky).
And don't even with unstructured logs; you deserve what you get if you're logging strings.
The difference between messy strings and one rich, dense structured event is the difference between grep and all of computer science. (Can't believe I have to explain this to software engineers.)
You're rich text searching when you should be tossing your regexps and doing read-time computations and calculations and breakdowns and filters. Are ye engineers or are ye English majors?*
(*all of the English major engineers that i know definitely know better than this)
(though if you ARE in the market for a nifty post-processor that takes your shitty strings and munges them into proper computer science, check out @cribl_io from @clintsharp. bonus: you can fork the output straight into honeycomb)
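To make the grep-vs-structured point concrete, here is a toy example of a read-time filter and breakdown over structured events (the events and field names are invented). The same question against string logs is a regex and a prayer:

```python
from collections import defaultdict

# A handful of wide events, as they might come back from your event store.
events = [
    {"endpoint": "/cart",  "customer.id": "c-42", "duration_ms": 31.0},
    {"endpoint": "/cart",  "customer.id": "c-42", "duration_ms": 480.0},
    {"endpoint": "/login", "customer.id": "c-7",  "duration_ms": 12.0},
    {"endpoint": "/cart",  "customer.id": "c-7",  "duration_ms": 29.0},
]

# Read-time filter: slow requests for one specific customer. No regex required.
slow = [e for e in events if e["customer.id"] == "c-42" and e["duration_ms"] > 100]
print(slow)

# Read-time breakdown: request count and max latency per endpoint.
by_endpoint = defaultdict(list)
for e in events:
    by_endpoint[e["endpoint"]].append(e["duration_ms"])

for endpoint, durations in by_endpoint.items():
    print(endpoint, len(durations), max(durations))
```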
Lastly, I hope this makes it plain why your observability needs require 1) NOT pre-aggregating at the source, but rather sampling to control costs, and 2) the ability to drill down to at least a sample of the original raw events for your aggregations and calculations at read time.
Because observability is about giving yourself the ability to ask new questions -- to debug weird behavior, describe outliers, to correlate a bunch of events that all manifest your bug -- without shipping new custom code for every question.
And aggregation is a one way trip.
You can always, always derive aggregates and rollups and pretty dashboards from your events. You can never derive raw events from your metrics. So you need to keep some sample of your raw events around for as long as you expect to need to debug in production.
(weighted samples of course -- keep all of the uncommon events, a fraction of the ultra-common events, and tweak the rest somewhere in between. it's EXTRAORDINARY how powerful this is for operational data, tho not to be used for transactions or replication data.)
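A minimal sketch of the idea, not Honeycomb's actual sampler -- the routes and rates here are invented. Keep everything rare or broken, keep 1-in-N of the boring stuff, and write the sample rate onto the event so read-time math can re-weight the kept events:

```python
import random

# Per-key sample rates (key = route + status). 1 means keep everything,
# 1000 means keep roughly one in a thousand. Numbers are made up.
SAMPLE_RATES = {
    ("GET /health", 200): 1000,   # ultra-common, individually worthless
    ("GET /cart", 200): 20,       # common, keep a fraction
}


def maybe_send(event):
    key = (event["request.route"], event["response.status_code"])
    rate = SAMPLE_RATES.get(key, 1)        # unknown/rare traffic: keep it all
    if event["response.status_code"] >= 500:
        rate = 1                           # always keep the errors
    if random.randint(1, rate) == 1:
        event["sample_rate"] = rate        # so aggregations can multiply back up
        ship(event)


def ship(event):
    print(event)                           # stand-in for your event shipper


maybe_send({"request.route": "GET /cart", "response.status_code": 200})
```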
... and now that we've covered logs, events, data structures, sampling, instrumentation, and debugging systems vs debugging code, i do believe that we are done. apologies for the prolonged delay!
.. fuck i shoulda done a blog post shouldn't i
Let's talk about OpenTelemetry, or "OTel", as the kids like to call it.
I remember emitting sooo many frustrated twitter rants back in 2017-2018 about how *behind* we were as an industry when it comes to standards for instrumentation and logging.
Then OTel shows up.
For those of you who have been living under a rock, OTel is an open standard for generating, collecting, and exporting telemetry in a vendor-agnostic way.
Before OTel, every vendor had its own libraries, and switching (or trying out) new vendors was a *bitch*.
Yeah, it's a bit more complicated to set up than your standard printf or logging library, but it also adds more discipline and convenience around things like tracing and the sort of arbitrarily-wide structured data blobs (bundled per request, per event) that o11y requires.
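For a feel of what that looks like in practice, here's a minimal sketch using the OpenTelemetry Python SDK (opentelemetry-api + opentelemetry-sdk). The service and attribute names are invented, and you'd normally swap the console exporter for your vendor's OTLP exporter:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the provider once at startup. Changing vendors means changing the
# exporter here, not rewriting your instrumentation.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")


def handle_checkout(cart_id, user_id):
    # One span per unit of work; attributes are your wide, structured fields.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("user.id", user_id)
        # ... do the actual work, annotating the span as you go ...
        span.set_attribute("cart.total_cents", 12345)


handle_checkout("cart-123", "user-456")
```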
It's hard to formulate career goals in your first decade or so as an engineer; there is just SO MUCH to learn. Most of us just kinda wing it.
But this is a goal that I think will serve you well: do a tour of duty at a startup and another at a bigco, in your first 10y as an eng.
Besides the obvious benefits of knowing how to operate in two domains, it also prevents you from reaching premature seniority. (charity.wtf/2020/11/01/que…)
The best gift you can give your future self is the habit of regularly returning to the well to learn, feeling like a beginner.
Several people asked this. It's a good question! I will share my thoughts, but I am certainly not religious about this. You should do what works for you and your teams and their workflows. 📈🥂☺️
1) "assuming you have good deduplication"... can a pretty big assumption. You never want to be in a situation where you spend more time tweaking dupe, retry, re-alert thresholds than fixing the problem.
2) having to remember to go futz with a ticket after every little thing feels like a lot of busywork. You've already committed some code, mentioned it in #ops or wherever, and now you have to go paste all that information into a task (or many tasks) too?
@beajammingh the title particularly caught my eye. for the past month or two i've been sitting on a rant about how i no longer associate the term "devops"** with modern problems, but with fighting the last war.
** infinitely malleable as it may be
yes, if you have massive software engineering teams and operations teams and they are all siloed off from each other, then you should be breaking down (i can't even say it, the phrase is so annoying) ... stuff.
but this is a temporary stage, right? a bridge to a better world.
I've done a lot of yowling about high cardinality -- what it is, why you can't have observability without it.
I haven't made nearly as much noise about ✨high dimensionality✨. Which is unfortunate, because it is every bit as fundamental to true observability. Let's fix this!
If you accept my definition of observability (the ability to understand any unknown system state just by asking questions from the outside; it's all about the unknown-unknowns) then you understand why o11y is built on building blocks of arbitrarily-wide structured data blobs.
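A quick illustration of the difference, with made-up fields: cardinality is how many distinct values one field can take, dimensionality is how many fields each event carries.

```python
# One wide event. Dimensionality = the number of fields it carries (a mature
# service might have hundreds). Cardinality = the number of distinct values a
# single field can take across all of your events.
event = {
    "service.name": "api",           # low cardinality: a handful of services
    "response.status_code": 200,     # low cardinality: a few dozen codes
    "build.id": "2024-05-01.3",      # medium cardinality: one per deploy
    "user.id": "u-8675309",          # high cardinality: one per user
    "request.id": "9f8b2c4e-...",    # highest cardinality: unique per request
    # ...plus a few hundred more fields in a mature app = high dimensionality
}
```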
Close! "If you're considering replacing $(working tool) with $(different tool for same function), don't do it unless you expect a 10x productivity improvement"
cvs to git? ✅
mysql to postgres? ❌
puppet to chef? ❌
redhat to ubuntu? ❌
The costs of ripping and replacing, training humans, updating references and docs, carrying the overhead of managing two systems in the meantime, etc., are so high that you are likely better off investing that time in making the existing solution work for you.
Of course, every situation is unique. And the interesting conversations are usually around where that 10x break-even point will be.
The big one of the past half-decade has been when to move from virtualization to containerization.