let's talk about two kinds of data that systems tend to generate: auditable and operational.

* auditable data: transaction logs, replication logs, billing/finance events, etc.
* operational data: telemetry, metrics, and context that describe each request and system component

in the early days we tend to jam as much as possible into one data pipeline, toss it around using kafka/logstash, and plan to sort it out someday.

but they have extraordinarily different characteristics and use cases, which become super fucking obvious and 💵expensive💵 as you scale.

with 'auditable logs' (better term?), losing data is very bad, and you should try hard to retain every record. because of this, you must be very disciplined about what you accept: strict schema, very compact rows. friction around changing the schema is ok -- even desired.
you should always be able to assume, when querying e.g. your payment history or a mysql replica, that the dataset is effectively complete. and its shape stays highly predictable, so you can forecast costs.
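
to make "strict schema, very compact rows" concrete, here's a minimal sketch in go -- every name here is made up for illustration, not from any real system:

```go
package main

import (
	"fmt"
	"time"
)

// a hypothetical auditable record: every field typed and required,
// no free-form attribute bag to sneak new columns through. changing
// this struct should be a deliberate, reviewed event.
type BillingEvent struct {
	EventID       string    // stable unique id, so replays can be deduped
	AccountID     string
	AmountCents   int64     // integer cents, never floats, for money
	Currency      string    // ISO 4217 code, e.g. "USD"
	OccurredAt    time.Time
	SchemaVersion int       // bumped explicitly, with review friction
}

func main() {
	ev := BillingEvent{
		EventID:       "evt_001",
		AccountID:     "acct_42",
		AmountCents:   1999,
		Currency:      "USD",
		OccurredAt:    time.Now().UTC(),
		SchemaVersion: 1,
	}
	fmt.Printf("%+v\n", ev)
}
```

the point isn't these particular fields; it's that adding one is a schema change, with all the friction that implies.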

cool.

operational datasets are the polar opposite of all this.

* auditable: predictable, schema'd, compact, often contains PII/PHI/sensitive data
* operational: messy, flexible, toss in whatever seems like it may be useful someday, spiky, unpredictable, oddly shaped, lots of numbers to do math on, ballooning write amplification #'s per request
you never want to incentivize your developers NOT to capture some little detail that may turn out to be useful someday, when someone is tracking down an unknown-unknown for the first time. this is *not* where you want people thinking conservatively about what to persist.
however, every request that comes in through your front door may generate tens or hundreds of events for your observability stack, esp if you're a microservices shop.

...and i'm not talking cheap little counters, i'm talking the gold standard: rich, wide, structured log events.
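
roughly what that looks like in practice -- a sketch in go, field names invented for illustration: one wide event per request per service, accumulating context as the request flows through, emitted exactly once at the end:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// one wide event per request: open it at the top of the handler,
// fatten it up as the request progresses, emit it exactly once.
type Event map[string]any

func handleCheckout(userID string) {
	start := time.Now()
	ev := Event{
		"service":   "checkout", // illustrative field names
		"user_id":   userID,
		"timestamp": start.UTC().Format(time.RFC3339Nano),
	}

	// ...do the real work, annotating whatever might matter someday...
	ev["cart_items"] = 3
	ev["payment_provider"] = "stripe"
	ev["cache_hit"] = false

	ev["duration_ms"] = time.Since(start).Milliseconds()
	json.NewEncoder(os.Stdout).Encode(ev) // one rich blob, not ten counters
}

func main() {
	handleCheckout("user_42")
}
```
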
startups usually begin with all these datasets mixed up in whatever kafka or logstash monstrosity of a pipeline they have cobbled together, but one thing always forces them apart: cost.

the infinite-perfect-retention cost model for operational data will break you.
you cannot keep, and should not try to keep, all the operational spew that issues forth from your systems. dynamic sampling is your friend here, as are server-side limits to lop the top off the mountain when, say, the site goes down and you get a billion operational messages at once.
common things are common, so keep a small % of them. rare things are rare, so keep all of them. everything exists on a spectrum.
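
a toy sketch of that idea in go (real implementations, like honeycomb's open source dynsampler-go, recompute rates over a rolling window; this one just uses running counts to show the shape of it):

```go
package main

import (
	"fmt"
	"math/rand"
)

// a toy dynamic sampler: rare keys are kept at 100%, common keys get
// sampled harder as their volume grows. the key (endpoint + status
// here) is whatever dimension defines "common" vs "rare" for you.
type Sampler struct {
	counts map[string]int
}

func NewSampler() *Sampler {
	return &Sampler{counts: make(map[string]int)}
}

// ShouldKeep reports whether to keep an event with this key, and the
// sample rate it was kept at.
func (s *Sampler) ShouldKeep(key string) (keep bool, rate int) {
	s.counts[key]++
	switch n := s.counts[key]; {
	case n <= 100: // rare so far: keep everything
		return true, 1
	case n <= 10000: // getting common: keep 1 in 10
		return rand.Intn(10) == 0, 10
	default: // firehose territory: keep 1 in 100
		return rand.Intn(100) == 0, 100
	}
}

func main() {
	s := NewSampler()
	keep, rate := s.ShouldKeep("GET /health 200")
	fmt.Println(keep, rate)
}
```

stamp the returned sample rate on every kept event so the query side can multiply counts back up and your math still works.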

sampling (& separating out any auditable data) is how you get to have your cake and eat it too (and not go broke because you bought the damn bakery)
honestly, i recommend sampling from day one, just to break developers of the habit of expecting operational data to be lossless and as accurate as their billing data. it's a terrible fucking habit to get into.
unless it's an auditable data source, you should never assume that.

you're never looking for "one event" in an operational dataset. you're looking for "some manifestation of a bug" or "some use case or pattern that looks like this" or "behavior that matches this report"
and i'm telling you, if the bug was tripped once out of a trillion requests, maybe that's not the bug you need to be focusing on.

if it is? ok just tune your sampling to catch 100% of that (user, whatever), and wait for it to happen again. it will.
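
mechanically, that's just a targeted override in front of your base sampler. a sketch, with a made-up user id:

```go
package main

import "fmt"

// hypothetical targeted rule layered over whatever base sampler you
// run: once you know which user trips the bug, force-keep 100% of
// their events while you wait for a repro; sample everyone else as usual.
func shouldKeepTargeted(userID string, baseKeep bool, baseRate int) (keep bool, rate int) {
	watchlist := map[string]bool{"user_1234": true} // the entity under investigation
	if watchlist[userID] {
		return true, 1 // full fidelity for this user, temporarily
	}
	return baseKeep, baseRate
}

func main() {
	fmt.Println(shouldKeepTargeted("user_1234", false, 100)) // true 1
}
```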
