I got promoted to mid-level at Amazon in under 1.5 years cuz I finished a migration project that my old team could not solve for 3 years. Here's my advice on large-scale software migrations:
1. use gating mechanisms to direct traffic
- if your gating logic can't be easily turned on and off, you risk serving no traffic for however long it takes your service's CI/CD pipeline to revert (our team's takes upwards of 3 hours)
2. use one-box prod environments or blue/green deployments
- having either a clone of prod or a small % of prod traffic in a single deployment group lets you watch metrics and confirm things are at least wired up correctly before serving 100% of production
3. write down on paper all dependencies and points of contact (POCs)
- in my old team, the migration took so long because we had to find native AWS alternatives for the old way of doing things, which meant a lot of hunting down software owners. Writing everyone down and keeping good relationships with downstream teams greatly reduced the effort and got them to change things on their end as needed (like permissions to AWS resources, for example)
4. be okay with not seeing results for a while
- migrations often need lots of planning and weighing of different alternatives and options. Be okay with the discomfort of seemingly not making progress. Software engineering isn't all about the PRs that you put out there. It's about making GOOD solutions that last a long ass time without becoming tech debt
5. make sure EVERYTHING is backward compatible
- please. for the love of God. do this shit. API migrations especially need backward-compatible models, or you can get hundreds of unwanted errors that only show up under actual prod traffic and are hard to debug
6. maintain the old and new versions at the same time for at least a week WITH key service metrics and dashboards for both
- you need to compare both versions during a migration using the gating logic from step 1, or else you're shooting in the dark. Good engineers know exactly how their service is behaving, and you can't do that without a dashboard of some sort showing the most important service and client-side metrics
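Here's a rough sketch of what the gating logic from step 1 could look like, with per-version counters like you'd want for step 6. This is not what we ran at Amazon. `MigrationGate`, `rollout_percent`, the handler arguments, and the metric names are all made up for illustration, and a real gate would read its settings from a config or feature-flag store so you can flip it without a deployment:

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MigrationGate:
    """Hypothetical in-memory gate (assumption: real settings live in a flag store)."""
    enabled: bool = False      # kill switch: False sends ALL traffic to the old path
    rollout_percent: int = 0   # 0-100, share of callers routed to the new path
    metrics: Counter = field(default_factory=Counter)

    def use_new_path(self, caller_id: str) -> bool:
        if not self.enabled:
            return False
        # Stable hash so the same caller always lands on the same side of the gate.
        digest = hashlib.sha256(caller_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < self.rollout_percent

    def handle(self, caller_id, request, old_handler, new_handler):
        version = "new" if self.use_new_path(caller_id) else "old"
        handler = new_handler if version == "new" else old_handler
        # Count requests and errors per version so both can be compared side by side.
        self.metrics[f"{version}.requests"] += 1
        try:
            return handler(request)
        except Exception:
            self.metrics[f"{version}.errors"] += 1
            raise


# Usage: start small, and flip `enabled` off to fall back to the old path instantly.
gate = MigrationGate(enabled=True, rollout_percent=5)
result = gate.handle("customer-123", {"id": 1}, lambda r: "old", lambda r: "new")
gate.enabled = False  # instant rollback, no pipeline revert needed
```

Hashing the caller ID instead of rolling a random number keeps each caller pinned to one version, which makes errors a lot easier to trace back to the old or new path.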
Follow for more real software and MLE advice :)