Post

James Cowling

@jamesacowling

Apr 10 • 6 tweets • 3 min read • Read on X

Scrolly

Time for a big systems advice thread!

In distributed systems there's no magic "push everything to prod at once" button. Every service gets pushed independently and nodes within a service get updated incrementally. If you mess up forwards/backwards compatibility you can fail irrecoverably.

So how to avoid this?

1/5: Decouple data and code changes. Never push out a release that changes how data is stored at the same time as the code that uses this new data. If there's a bug and you need to roll back to the old version of your code it won't be able to handle the new data in the new format. Instead push out a release that first changes the data in a way that’s compatible with both the old and new code (e.g., optional fields etc), when that’s stable push out the new code that uses it, then when that’s stable you can change the data to remove backwards compatibility. This is known as a “migration” in the database world and yes it’s annoying, but yes you need to do it.

2/5: Don’t change two services at once. If service A talks to service B, you can’t just add a new API to both of them and push them out. What if someone pushes A but not B? What if there’s a bug in B that needs to be rolled back? Just like with data changes, API changes need to be made in a forwards and backwards compatible way. Engineers forget to do this all the time.

3/5: Only allow one version “step” to exist in prod at any time. It’s common to have most of your nodes at version 5 but a few are still at version 4 because they haven’t finished migrating. Never ever allow someone to push to version 6 while version 4 is still running. Otherwise it’s too hard for engineers to reason about which version is “stable” when making multi-step migrations. You need monitoring and alerting for this plus protections agains corner cases, e.g., if a node was offline and then came back online after missing a code push.

4/5: Codify forwards/backwards compatibility in your release process by pushing some nodes to the new version, run them for a while and keep track of monitoring, then roll them back before doing the full release again. If someone messed up a migration you want to smoke this out on some staging nodes or at small scale while an operator is watching not once your entire system is down.

5/5: Design cleanly composed systems with simple APIs, thin clients, type safety, information hiding, and well-articulated guarantees. Avoid anything fancy or ambiguous in your APIs. This is the only way you can feasibly maintain a large distributed system with multiple versions of every service running at once. It’s also an area where skilled humans significantly outperform LLMs, like it’s not even close. Don’t turn off your brain just yet.

Follow-up note to @convex customers: you don't have to think about most of these things, e.g., we force you to finish a schema migration before pushing the code that uses it, we handle version skew internally for you, etc, but you still do have to think about what happens if an old client shows up and calls a function that doesn't exist anymore. This is rarely a problem on the web but is an issue for mobile apps. We'd like to make this problem go away too but for now if you're a mobile developer you probably already know you should be careful about deprecating APIs, since client code can live for a long time.

• • •

Missing some Tweet in this thread? You can try to force a refresh

Share this page!

Enter URL or ID to Unroll

James Cowling

Try unrolling a thread yourself!

More from @jamesacowling

James Cowling

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!