James Cowling
Apr 10 · 6 tweets · 3 min read
Time for a big systems advice thread!

In distributed systems there's no magic "push everything to prod at once" button. Every service gets pushed independently and nodes within a service get updated incrementally. If you mess up forwards/backwards compatibility you can fail irrecoverably.

So how to avoid this?

1/5: Decouple data and code changes. Never push out a release that changes how data is stored at the same time as the code that uses the new data. If there's a bug and you need to roll back to the old version of your code, it won't be able to handle data in the new format. Instead, first push out a release that changes the data in a way that’s compatible with both the old and new code (e.g., by adding optional fields). When that’s stable, push out the new code that uses it. Then, when that’s stable too, you can change the data to remove backwards compatibility. This is known as a “migration” in the database world and yes it’s annoying, but yes you need to do it.
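Here's a rough sketch of what the "compatible with both" step in 1/5 can look like, in Python. Everything here is made up for illustration: the `UserRecord` shape, the `display_name` field, and the function names are assumptions, not anything from a real schema. The point is that the new field is optional, the old field stays populated, and both the old reader and the new reader work on both kinds of rows.

```python
from typing import Optional, TypedDict


class UserRecord(TypedDict, total=False):
    # Step 1 (data release): `display_name` is added as an OPTIONAL field.
    # Old code never looks at it; new code uses it when present.
    name: str
    display_name: Optional[str]


def read_name_old(record: UserRecord) -> str:
    """Old code: only knows about `name` and ignores unknown fields."""
    return record["name"]


def read_name_new(record: UserRecord) -> str:
    """New code (step 2): prefers the new field, falls back to the old one."""
    return record.get("display_name") or record["name"]


def write_user(name: str, display_name: Optional[str] = None) -> UserRecord:
    """Writer used during the migration (hypothetical helper).

    It always keeps `name` populated, so rolling back to the old
    code is still safe after new-format rows have been written.
    """
    record: UserRecord = {"name": name}
    if display_name is not None:
        record["display_name"] = display_name
    return record
```

Only once nothing runs `read_name_old` anymore would you do the final step and stop writing `name`.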
2/5: Don’t change two services at once. If service A talks to service B, you can’t just add a new API to both of them and push them out. What if someone pushes A but not B? What if there’s a bug in B that needs to be rolled back? Just like with data changes, API changes need to be made in a forwards and backwards compatible way. Engineers forget to do this all the time.
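One way to picture the compatibility requirement in 2/5, sketched in Python under some loud assumptions: `ServiceB`, `UnknownMethod`, and the `get_user` / `get_users` methods are hypothetical stand-ins for whatever RPC layer you actually use. The new B keeps the old method, and the new A falls back to the old method when B hasn't been pushed yet (or has been rolled back).

```python
class UnknownMethod(Exception):
    """Raised by the (hypothetical) RPC layer when the callee lacks a method."""


class ServiceB:
    """Toy stand-in for a remote service with a fixed set of methods."""

    def __init__(self, methods):
        self._methods = dict(methods)

    def call(self, method, **kwargs):
        handler = self._methods.get(method)
        if handler is None:
            raise UnknownMethod(method)
        return handler(**kwargs)


def get_user(user_id):
    return {"id": user_id}


def get_users(user_ids):
    return [{"id": u} for u in user_ids]


# Old B only has the single-user API. New B ADDS the batch API and
# keeps the old one, so an old A can still talk to it.
B_OLD = ServiceB({"get_user": get_user})
B_NEW = ServiceB({"get_user": get_user, "get_users": get_users})


def fetch_users(b, user_ids):
    """New A: prefer the batch API, fall back if B doesn't have it."""
    try:
        return b.call("get_users", user_ids=user_ids)
    except UnknownMethod:
        return [b.call("get_user", user_id=u) for u in user_ids]
```

With this shape, A and B can be pushed or rolled back in either order and `fetch_users` returns the same result.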
3/5: Only allow one version “step” to exist in prod at any time. It’s common to have most of your nodes at version 5 while a few are still at version 4 because they haven’t finished migrating. Never ever allow someone to push version 6 while version 4 is still running. Otherwise it’s too hard for engineers to reason about which version is “stable” when making multi-step migrations. You need monitoring and alerting for this, plus protections against corner cases, e.g., a node that was offline and came back online after missing a code push.
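The guard in 3/5 can be sketched as a pre-push check. This is a minimal Python illustration that assumes versions are consecutive integers and that you can list what every node is running. `can_push` and `stragglers` are hypothetical names, not part of any real deploy tool.

```python
from typing import Dict, Iterable, List


def can_push(target_version: int, running_versions: Iterable[int]) -> bool:
    """Allow a push only if prod would span at most one version step."""
    running = set(running_versions)
    if not running:
        return True
    # During the push, prod contains everything already running plus the target.
    versions = running | {target_version}
    return max(versions) - min(versions) <= 1


def stragglers(node_versions: Dict[str, int]) -> List[str]:
    """Nodes more than one step behind the newest version, e.g. a node
    that was offline during a push. These should alert or be fenced off."""
    if not node_versions:
        return []
    newest = max(node_versions.values())
    return sorted(n for n, v in node_versions.items() if v < newest - 1)
```

So `can_push(6, [5, 5, 4])` refuses the push until the last version 4 node is gone, while `can_push(4, [5])` still allows a one-step rollback.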
4/5: Codify forwards/backwards compatibility in your release process: push some nodes to the new version, run them for a while and keep an eye on monitoring, then roll them back before doing the full release. If someone messed up a migration you want to smoke this out on some staging nodes or at small scale while an operator is watching, not once your entire system is down.
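As a sketch of the push, soak, roll back rehearsal in 4/5: the Python below models the fleet as a dict of node name to version and takes a health check as a callback. All names (`rehearse_release`, `is_healthy`, the fleet shape) are assumptions for illustration, and a real version would wait on actual monitoring between steps rather than calling a function once.

```python
from typing import Callable, Dict, List


def rehearse_release(
    fleet: Dict[str, int],
    canaries: List[str],
    new_version: int,
    is_healthy: Callable[[Dict[str, int]], bool],
) -> bool:
    """Push canaries forward, check health, roll them back, check again.

    Returns True only if both the upgrade and the rollback looked healthy.
    The fleet is always left on its original versions.
    """
    original = {node: fleet[node] for node in canaries}

    # Forward step: new code runs next to old code and old data.
    for node in canaries:
        fleet[node] = new_version
    forward_ok = is_healthy(fleet)

    # Backward step: old code must cope with anything the new code wrote.
    for node, version in original.items():
        fleet[node] = version
    rollback_ok = is_healthy(fleet)

    return forward_ok and rollback_ok
```

The full release only goes ahead when this returns True, which is what forces the rollback path to get exercised on every release instead of only during an outage.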
5/5: Design cleanly composed systems with simple APIs, thin clients, type safety, information hiding, and well-articulated guarantees. Avoid anything fancy or ambiguous in your APIs. This is the only way you can feasibly maintain a large distributed system with multiple versions of every service running at once. It’s also an area where skilled humans significantly outperform LLMs, like it’s not even close. Don’t turn off your brain just yet.
Follow-up note to @convex customers: you don't have to think about most of these things, e.g., we force you to finish a schema migration before pushing the code that uses it, we handle version skew internally for you, etc, but you still do have to think about what happens if an old client shows up and calls a function that doesn't exist anymore. This is rarely a problem on the web but is an issue for mobile apps. We'd like to make this problem go away too but for now if you're a mobile developer you probably already know you should be careful about deprecating APIs, since client code can live for a long time.
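To make the old-client problem concrete, here's a generic Python sketch of keeping a deprecated function name alive as an alias. This is not the Convex API. `dispatch`, `HANDLERS`, `DEPRECATED_ALIASES`, and the function names are all hypothetical, and a real backend would do this in its own routing layer.

```python
def list_messages_v2(channel, limit=50):
    """Current implementation (toy body for illustration)."""
    return {"channel": channel, "limit": limit}


HANDLERS = {"listMessagesV2": list_messages_v2}

# Names that old mobile builds may still call. Keep them mapped to a
# working handler until those builds have aged out.
DEPRECATED_ALIASES = {"listMessages": "listMessagesV2"}


def dispatch(name, **args):
    """Route a client call, tolerating deprecated names."""
    target = DEPRECATED_ALIASES.get(name, name)
    handler = HANDLERS.get(target)
    if handler is None:
        # A structured error lets an old client show "please update"
        # instead of crashing on an unexpected response.
        return {"ok": False, "error": "unknown_function"}
    return {"ok": True, "result": handler(**args)}
```

An old app calling `dispatch("listMessages", channel="general")` gets the same result as a new app calling `listMessagesV2`.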
