If you missed @clare_liguori's Continuous Delivery session this week (like I did) then good news, it's available on-demand now 🎊


And here's my play-by-play for the session

This is a typical CD pipeline in AWS.

This is far more complex than the most complex CD pipeline I have ever had! Just cos it's complex, doesn't mean it's over-engineered though. Given the blast radius, I'm glad they do releases carefully and safely.
If you look closely, beyond all the alpha, beta, gamma environments, it's one-box in a region first then the rest of the region, I assume starting with the least risky regions first.
For anyone thinking about going multi-region (after the recent Kinesis outage), this is one of the complexities you need to consider. To do multi-region right and deploy safely (minimize blast radius), this is one of the complexities you have to factor in.
This "deploy small at first then more broadly" principle applies to #serverless apps too, though you can't "deploy to one box". You can do it with canary deployments instead, CodeDeploy supports this practice for Lambda (using weighted aliases) out-of-the-box.
However, weighted-alias has no session-affinity and it's not possible to propagate the canary decision along the call chain (e.g. when an API function invokes another function via SNS/EventBridge, etc.)...

More details in this post:
This problem applies to API Gateway's canary support too, which has no session affinity, so a user making 2 requests (for a paginated endpoint) can yo-yo between the canary and current production channels.
For simple use cases, this might be fine, but hardly ideal for minimizing blast radius, or if you want to do some A/B tests on new features.

Personally, I love what you can do with @LaunchDarkly such a slick control panel and super easy to use ❤️
But for Lambda functions, it gets a bit trickier because you need so many persistent connections... so, your best bet is to use a proxy (run it in Fargate) as I described in this post: lumigo.io/blog/canary-de…

It can get expensive though, because you're hitting DynamoDB a lot!
Anyway, I digress...

"One box used to mean one VM, but over time it has also come to mean one container or a small percentage of Lambda function invocations"

ha, so they use weighted-alias for microservices that run on Lambda too, starting at 10% at first
And instead of rolling out the other 90% all at once (which is still risky), they use rolling deployment, which, CodeDeploy supports also.
Do this "one-box => rolling deploy pattern to the rest" pattern in one region first, then rinse and repeat for the other regions.

And within each region, apply the same pattern to AZs too.
To crawl back some speed (otherwise, every deployment would take weeks...) they deploy the regions in waves.

First few waves deploy to one region and one AZ at a time, later waves (after you build some confidence) deploy to multiple regions in parallel
mm.. this is interesting!

Deployments are staggered, so multiple deployments can be in motion and at different stages at once.

How does this affect rollback I wonder 🤔 e.g. if v1 deployment craps out at wave 5 and triggers rollback, what of wave 1 which is deploying v5?
I had to build custom mechanisms to stop parallel deployments in the past, because of the complication to rollbacks. Interesting to see AWS had gone the other way. But I get why they do it, to get some speed back.
Summary for this section of the talk. Automatic rollback next, really interested to see how that works with respect to these staggered deployment waves.
"At Amazon, we don't want to have to sit and stare at the dashboard every time we do a deployment, we want to deployments to be hands-off"

Have thresholds on a bunch of metrics (regional, zonal aggregates as well as per-box) to trigger automatic rollbacks.
And they also use monitoring canaries (which, as an AWS customer, we have CloudWatch Synthetics for that) so they can look at system health more holistically to trigger rollback.
"The impact from a deployment doesn't always show up during a deployment"

haha, been there... once had a slow memory leak that showed up 2 weeks after a deployment 🤦‍♂️
The pipeline continues to monitor the metrics during bake time for a deployment. And it'll hold the deployment during the bake time, and not let the deployment move onto the next stage under after the bake time. Otherwise, you can be seeing the impact of another deployment.
Finally, here it is:

1. auto-rollback only in regions/zones that tripped the threshold
2. eng has to decide whether to roll back the whole thing or retry

e.g. something else could have happened in the offending region, which is why thresholds were crossed
If they decide to rollback, then everything gets rolled back.

Or, they can roll forward and push out a v3 deployment instead, which fixes whatever problem that got picked up in that failed region.
In the scenario where you have v2 and v3 deployments happening at the same time (at different stages), if v2 hit a snag and has to rollback the whole thing. I wonder if they'd rollback any v3 changes that's been applied to the regions in earlier waves too. 🤔
That seems the only sensible thing to do.

Anyway, Claire moves onto how to design your changes so they can be rolled back automatically.

As much as possible, make backward-compatible changes 💯
Otherwise, you're forced to make a phased deployment where each phase contains backward-compatible changes.

This mirrors a lot of database migrations when you move from one database to another one and you can't do it with downtime.
Seriously though, if you need to make breaking changes, first see if you can do it with a small downtime. It'll save you so much complexity and extra work.

It's not an option at AWS scale of course, but you're not AWS.
Before they even get to the production deployment, there's a bunch of pre-production test environments.

And they basically practice the one-box deployment in the gamma (production-like) environment.
The one-box deployment in Gamma gives them a bit of backward-compatibility test, that it's ok for there to be two versions running side-by-side. So the monitoring canaries would help pick up incompatibilities there.
Some teams go even further with backward-compatibility test by adding another zeta stage to make sure new frontend works with current production backend.
That was great. So nice to see what AWS is doing to ensure deployments are safe and fast (well, as fast as can be without putting customers at risk)

If you wanna catch the session yourself, here's the on-demand video: virtual.awsevents.com/media/1_ua3d99…

• • •

Missing some Tweet in this thread? You can try to force a refresh

Keep Current with Yan Cui is making the AppSync Masterclass

Yan Cui is making the AppSync Masterclass Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!


Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @theburningmonk

6 Dec
Stitching together my previous tweets on @dyanacek's session (BLD205) on "Monitoring production services at Amazon" so it's easier to share

#aws #reinvent2020

"For amazon.com we found the "above the fold" latency is what customers are the most sensitive to" Image
This is an interesting insight, that not all service latencies are equal and that improving the overall page latency might actually end up hurting the user experience if it negatively impacts the "above the fold" latency as a result. 💡
Read 8 tweets
5 Dec
I've gotten a few questions about Aurora Serverless v2 preview, so here's what I've learnt so far. Please feel free to chime in if I've missed anything important or got any of the facts wrong.

Alright, here goes the 🧵...
Q: does it replace the existing Aurora Serverless offering?
A: no, it lives side-by-side with the existing Aurora Serverless, which will still be available to you as "v1".
Q: Aurora Serverless v1 takes a few seconds to scale up, that's too much for our use case where we get a lot of spikes. Is that the same with v2?
A: no, v2 scales up in milliseconds, during preview the max ACU is only 32 though
Read 11 tweets
3 Dec
Great overview of permission management in AWS by @bjohnso5y (SEC308)

Lots of tools to secure your AWS environment (maybe that's why it's so hard to get right, lots of things to consider) but I love how it starts with "separate workloads using multiple accounts" Image
SCP for org-wide restrictions (e.g. Deny ec2:* 😉).

IAM perm boundary to stop ppl from creating permissions that exceed their own.

block S3 public access

These are the things that deny access to things (hence guardrails)

Use IAM principal and resource policies to grant perms ImageImage
"You should be using roles so you can focus on temporary credentials" 👍

Shouldn't be using IAM users and groups anymore, go set up AWS SSO and throw away the password for the root user (and use the forgotten password mechanism if you need to recover access)
Read 13 tweets
3 Dec
Great session by @MarcJBrooker earlier on building technology standards at Amazon scale, and some interesting tidbits about the secret sauce behind Lambda and how they make technology choices - e.g. in whether to use Rust for the stateful load balancer v2 for Lambda.

Nice shout out to some of the benefits of Rust - no GC (good for p99+ percentile latency), memory safety with its ownership system theburningmonk.com/2015/05/rust-m… great support for multi-threading (which still works with the ownership system)
And why not to use Rust.

The interesting Q is how to balance technical strengths vs weaknesses that are more organizational.
Read 20 tweets
2 Dec
This is part 2 of my #aws #reinvent hot takes on the big #serverless related announcements.

Part 1 is here for anyone who missed it:

Same deal as before, if this gets 168 retweets then I'll do another batch 👍

Alright, here comes the mega 🧵...
1. Lambda now bills you by the ms as opposed to 100 ms. So if your function runs for 42ms you will be billed for 42ms, not 100ms.

This instantly makes everyone's lambda bills cheaper without having to lift a finger. It's the best kind of optimization 😎

However, this might not mean much in practice for a lot of you because your Lambda bill is $5/month, so saving even 50% only buys you a cup of Starbucks coffee a month.

Still, that's a FREE cup of coffee!

Read 44 tweets
1 Dec
Given all the excitement over Lambda's per-ms billing change today, some of you might be thinking how much money you can save by shaving 10ms off your function.

Fight that temptation 🧘‍♂️until you can prove the ROI on doing the optimization.

#serverless #aws #awslambda
The moral of this tweet is as relevant now as ever:

Assuming $50 (which is VERY conservative) per dev per hour, it would have taken them 40 months to break even on just having the meeting, before writing a single line of code!
With the per-ms billing, you're automatically saving on your Lambda cost already, by NOT having your invocation time rounded up to the next 100ms.

Unless you're invoking a function at such high frequency, those micro-optimizations won't be worth the eng time you have to invest.
Read 4 tweets

Did Thread Reader help you today?

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!