Making it so that many things can break before users are ever impacted is the goal. Making it so that any user impact is glaringly obvious and easy to identify, confirm, and mitigate is the goal.
Now we just need to expand that a smidge...and develop thru the lens of your instrumentation in production. Build for reality, not a simulacrum.
Reality is code plus architecture and infrastructure, time and elapsed time, dependencies, method of deployment, user activity, and any other concurrent activity.
What they can't do is tell you how much to trust that confidence, or how easy it will be to validate or track down any bugs, or how many users a bug actually impacts, and on and on. You need prod.
The *overwhelming majority* of bugs are far too small and subtle to trip a monitoring check and page someone. (Thank God.)
The answer is to go and look at the shit you just deployed, thru the instrumentation you shipped with it, and verify it is working as you intended.
You might make sure your instrumentation is capturing column data type, before size and after size, a was_compressed flag, time elapsed compressing, compression format, and so on.
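Something like this, say. (The field names, logger, emit_event helper, and zlib stand-in below are my own sketch, not a real schema.)

```python
# Purely illustrative: one structured event per column processed.
import json
import logging
import time
import zlib

logger = logging.getLogger("compaction")

def emit_event(event: dict) -> None:
    """Ship one structured event to your logging/observability pipeline (hypothetical helper)."""
    logger.info(json.dumps(event))

def compress_column(column_name: str, data_type: str, raw: bytes) -> bytes:
    start = time.monotonic()
    compressed = zlib.compress(raw)  # stand-in for whatever format the job actually uses
    emit_event({
        "column": column_name,
        "column_data_type": data_type,
        "before_size_bytes": len(raw),
        "after_size_bytes": len(compressed),
        "was_compressed": True,
        "compression_seconds": time.monotonic() - start,
        "compression_format": "zlib",
    })
    return compressed
```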
1) Obviously, you can watch for elevated errors in the newer version. But also (see the sketch after this list):
2) is it running only on the right data types?
3) is it reclaiming space?
4) what errors or warnings is it generating?
5) look for some data that should be skipped or bailed on. Is it actually being skipped?
6) go look at some data from the perspective of a user. Does it look ok?
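If those events land somewhere queryable, most of that checklist is a handful of aggregations. Roughly like this, assuming the field names from the sketch above plus a hypothetical error field:

```python
# A sketch of turning the per-column events back into answers,
# assuming you can pull them out of your tooling as a list of dicts.
from collections import Counter

def summarize(events: list) -> dict:
    """Aggregate per-column events into the answers you actually care about."""
    reclaimed = sum(
        e["before_size_bytes"] - e["after_size_bytes"]
        for e in events
        if e.get("was_compressed")
    )
    return {
        # 2) is it running only on the right data types?
        "data_types_seen": dict(Counter(e["column_data_type"] for e in events)),
        # 3) is it reclaiming space?
        "bytes_reclaimed": reclaimed,
        # 4) what errors or warnings is it generating?
        "errors": [e for e in events if e.get("error")],
        # 5) is the should-be-skipped data actually being skipped?
        "skipped_columns": [e["column"] for e in events if not e.get("was_compressed")],
    }
```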
And if you know you can use feature flags to immediately enable/disable the code, and history tells you that most bugs are caught swiftly and trivially... well... this is a bad example 😬
BUT! I would totally ship this instrumentation in "dry run" mode and let it run for the weekend to see what it WOULD do. 🥰
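The dry-run wrapper can be embarrassingly small. Here's a rough sketch, where `live` stands in for whatever feature-flag check you'd actually use and `store` for whatever persists the column (both hypothetical):

```python
# Sketch of dry-run mode: do the full pass and emit the full
# instrumentation either way; only write back when the flag is live.
import json
import logging
import zlib

logger = logging.getLogger("compaction")

def maybe_compress_column(column_name: str, raw: bytes, live: bool, store) -> None:
    compressed = zlib.compress(raw)
    logger.info(json.dumps({
        "column": column_name,
        "before_size_bytes": len(raw),
        "after_size_bytes": len(compressed),
        "dry_run": not live,
    }))
    if live:
        store.write(column_name, compressed)
```

Leave the flag off over the weekend and the events show exactly what it would have compressed, skipped, and reclaimed, before a single byte gets touched.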