Charity Majors
Find me on bsky at @charity.wtf. 🐝🏳️‍🌈🦄

Jan 19, 2020, 9 tweets

It is staggering how incredibly durable the myth of "you can't afford events, use metrics" has proven to be. 🤔

I think there are several contributing factors. First of all, most people's frame of reference is logs. Shitty, spammy, undisciplined string-infested logs.

The median log line contains maybe 1-5 nouns of information, and repeats any/all correlating identifiers on every line. That's...not a lot of information density per write or buffer flush.
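
For concreteness, here's a hedged sketch of the pattern (the service name, fields, and values are all invented): four separate writes for one request, each re-stating the same correlating identifiers and contributing only one or two new facts apiece.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")

def handle_request(req_id: str, trace_id: str, user_id: int) -> None:
    # Four writes for one request; every line re-states the same IDs
    # and adds only one or two new nouns of information.
    ids = f"req={req_id} trace={trace_id} user={user_id}"
    log.info("%s request started", ids)
    log.info("%s cache miss for key cart:%s", ids, user_id)
    log.info("%s db query took 83ms", ids)
    log.info("%s request finished status=200", ids)

handle_request("5f3a", "ab12", 42)
```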

But it gets worse! The strings are often padded with sentences and human-readable crap, and the log lines themselves are virtually useless unless you reassemble the full context of the event in post-processing.

Your write amplification is massive (easily tens or hundreds of writes per request), and a typo here can be fatal to your disk space or budget.
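
A back-of-the-envelope comparison makes the point; the numbers below are assumptions picked from the ranges this thread mentions, not measurements:

```python
# Back-of-the-envelope write amplification, using assumed (made-up) numbers.
LOG_LINES_PER_REQUEST = 50     # "tens or hundreds of writes per request"
BYTES_PER_LOG_LINE = 150       # timestamp + level + repeated IDs + message
EVENT_DIMENSIONS = 400         # "300-500 dimensions", mostly populated
BYTES_PER_DIMENSION = 20       # rough average for key + value

log_bytes = LOG_LINES_PER_REQUEST * BYTES_PER_LOG_LINE   # 7,500 bytes, 50 flushes
event_bytes = EVENT_DIMENSIONS * BYTES_PER_DIMENSION     # 8,000 bytes, 1 flush

print(f"logs:  {log_bytes} bytes across {LOG_LINES_PER_REQUEST} writes")
print(f"event: {event_bytes} bytes in a single write, ~{EVENT_DIMENSIONS} facts")
```

Similar byte counts, but the event packs roughly 400 distinct facts into one write, while the logs spread far fewer facts (plus all that repeated ID padding) over 50 flushes.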

Events, on the other hand, are set at one per request, per service. A maturely instrumented service tends to have 300-500 dimensions, most of which are populated.

Adding another dimension doesn't mean another write, just appending a few more chars to the existing one.

So structurally, events are compact, dense, and resistant to bloat -- with no post-processing necessary to make them usable.

No printing out the unique IDs and timestamps again and again on every log line. No need to allocate memory and set up TCP every time.
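
A minimal sketch of that wide-event pattern (my illustration, not any particular vendor's SDK, though Honeycomb's libhoney libraries work along similar lines): context accumulates in memory over the life of the request and gets serialized exactly once at the end.

```python
import json
import time

class WideEvent:
    """One event per request: accumulate fields in memory, write once at the end."""

    def __init__(self, **shared):
        self.fields = dict(shared)     # correlating IDs stored exactly once
        self.start = time.monotonic()

    def add(self, **fields):
        # Adding a dimension is a dict insert -- no extra write or flush.
        self.fields.update(fields)

    def send(self):
        self.fields["duration_ms"] = round((time.monotonic() - self.start) * 1000, 2)
        print(json.dumps(self.fields))  # the single serialized write per request

ev = WideEvent(request_id="5f3a", trace_id="ab12", service="checkout")
ev.add(user_id=42, cache_hit=False)
ev.add(db_query_ms=83, status=200)
ev.send()
```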

And that's just what you save by aggregating context around the event. I know y'all don't have access to an efficient columnar store; the closest options are probably Elastic (built for text search) and Druid (lacks flexible schemas). Surely there's something in the works tho.

We've written extensively on some of the things we did to optimize storage costs, from compression, to replacing repeated strings with pointers, to (most recently) aging files out to S3 and moving the query planner to Lambda jobs.

(Aka "We serverlessed our database 😍")
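
That "replacing repeated strings with pointers" trick is essentially dictionary encoding, a staple of columnar stores. A toy version (my illustration, not Honeycomb's actual code):

```python
# Toy dictionary encoding: each repeated string is stored once; the rows
# hold small integer codes ("pointers") into the dictionary instead.
def dict_encode(column):
    lookup, codes = {}, []
    for value in column:
        if value not in lookup:
            lookup[value] = len(lookup)
        codes.append(lookup[value])
    return list(lookup), codes  # dicts preserve insertion order in Python 3.7+

statuses = ["ok", "ok", "error", "ok", "ok", "timeout", "ok"]
dictionary, codes = dict_encode(statuses)
print(dictionary)  # ['ok', 'error', 'timeout']
print(codes)       # [0, 0, 1, 0, 0, 2, 0]
```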

All that without even mentioning the loaded S-word: sampling.

To be clear: Honeycomb does not depend on sampling in ANY way; many of our customers don't sample at all, and it's completely up to you. But dynamic sampling is a fucking superpower. You ignore it at your own peril.
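
A minimal sketch of what dynamic sampling can look like (the policy below is my assumption, not Honeycomb's implementation): keep all the rare, interesting traffic at 1:1, sample the boring high-volume bulk hard, and record the rate on each kept event so counts can be reconstituted later.

```python
import random

def sample_rate_for(event: dict) -> int:
    # Assumed policy: errors and slow requests are always interesting.
    if event.get("status", 200) >= 500:
        return 1        # keep every error
    if event.get("duration_ms", 0) > 1000:
        return 1        # keep every slow request
    return 100          # keep 1 in 100 healthy, fast requests

def maybe_send(event: dict, send) -> None:
    rate = sample_rate_for(event)
    if random.randrange(rate) == 0:
        event["sample_rate"] = rate  # this event now stands for `rate` events
        send(event)

maybe_send({"status": 200, "duration_ms": 12, "endpoint": "/health"}, print)
maybe_send({"status": 503, "duration_ms": 31, "endpoint": "/checkout"}, print)
```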

Any time a monitoring vendor tells you smugly that THEY don't throw away any data, ask them what time interval they aggregate on.

(That's called throwing data away too, btw, and it's way more fatal to observability than simply getting a representative sample.)
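
A toy illustration of why that counts as throwing data away (latencies here are made up): once events are rolled up into a per-interval average, the shape of the traffic inside that interval is unrecoverable, while retained events can still answer questions you didn't think to ask in advance.

```python
# One 4.8s outlier hiding in a 60s window of otherwise-fast requests.
latencies_ms = [12, 11, 14, 13, 9, 4800, 12, 10, 11, 13]

# What a pre-aggregated metric stores for the interval: a single number.
avg = sum(latencies_ms) / len(latencies_ms)
print(f"stored metric (60s avg): {avg:.0f}ms")  # ~490ms: neither typical nor worst case

# What retained events can still answer after the fact:
print(f"max: {max(latencies_ms)}ms, p50: {sorted(latencies_ms)[len(latencies_ms) // 2]}ms")
```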
