As an ECS on Fargate fan, I got a lot of questions about the recent incident / outage.

Let's talk about it... 🧵 1/43

TL;DR: meh. Incidents happen and this one did not teach us anything new about building on AWS. I expect more profound drama during these roaring 20s!
Disclaimer: everything fails all the time! I view incidents as "normal" and this thread reflects that.

Yes, AWS had incidents. Yes, us-east-1 is HUGE so it gets interesting incidents. Yes, AWS is bad at communication.

This is the default and well-known state!
I won't address bullshit arguments in here. Please go intellectually masturbate somewhere else!

No, multi-cloud is not a valid solution. No, multi-region is still really freaking complex and most of the time not worth it for the business. No, your datacenter won't do any better.
Now, let's get to our incident! Let's do a recap and see what happened.

If you log in to AWS and go to phd.aws.amazon.com/phd/home#/acco… you can see the full incident timeline and messages.

I'll spare you the horror and discuss the timeline here, using AWS' default PDT timezone.
On August 24, around 13:00, in us-east-1, ECS on Fargate and EKS on Fargate start experiencing errors: new containers sometimes fail to launch.

Running containers are totally fine. After a couple retries new containers successfully launch too.

Some don't even notice the errors.
Errors increase more and more.

At 15:30 errors are now common and AWS formally declares an incident. Personal Health Dashboards are updated.

It's officially an incident now, with people ranting on Twitter and in Slacks, annoyed it took AWS this long to confirm the issue.
Skilled AWS engineers investigate and debug the problem and try some potential fixes.

During all this time, containers may or may not successfully start. Some containers start, some don't: AWS reports periods when 50% and 70% of calls succeed, and that's in line with what I saw.
Since AWS does not have trivial incidents, it of course turns out it's a harder problem to solve than expected 😨

Looks like Fargate's underlying EC2 instances are not getting cleaned up properly and they become stuck.

There are indeed servers in serverless and they are cranky!
At 21:20 AWS decides to FULLY stop ECS on Fargate and EKS on Fargate container launches.

100% of requests fail now 😱

It's been 8 hours since the first error and 6 hours since the incident started. Skilled AWS engineers could not fix it any other way, so this drastic action is taken.
With all the requests stopped, AWS engineers manage to fix the problem.

Now, ECS on Fargate and EKS on Fargate are both highly complex distributed systems. You don't just turn a hyperscale service back on! Gradual recovery needs to happen and that's what we see unfold next.
At 00:45 AWS starts allowing some ECS on Fargate and EKS on Fargate containers to start.

It's up to the universe whether you're part of the lucky few that get to start containers now. Literally a few random AWS accounts see some requests succeeding now.
For the next hour, highly experienced AWS engineers are watching the system come back up and things are continuing to look good.

More and more accounts are now successfully creating new ECS on Fargate and EKS on Fargate containers.
By 01:55 the full 100% of AWS customers can start containers.

The system is still coming back up, so these containers are started at slower rates: 1 container per second per AWS account (1TPS).

At 04:00, since everything still looks good, the rate is increased to 2TPS.
As an aside, in 2020, ECS on Fargate during normal operations and at full speed scaled at 1TPS!

This is slow, but not THAT slow.
At 04:34, since everything is still looking good, the rate is increased to 5TPS.

The fantastic AWS engineers continue to monitor things to make sure they don't break again, either from the same issue or from another issue due to the complex service coming back to life.
At 04:57, everything is progressing as expected and there are no issues, so the rate is yet again increased to 10TPS.

As an aside, in 2021, ECS on Fargate and EKS on Fargate at full speed scaled at 10TPS by default.

This is slow, but not that slow.
An hour later, at 06:05 things are still looking good and the rate is increased to 20TPS. The incident is formally declared as resolved.

Yay, we survived through this!

Yay, yet another incident in us-east-1!

Yay, Fargate got its first big incident just before turning 5!
TL;DR:

⛅ 13:00 to 21:20: degraded performance, some containers start, some don't

☠️ 21:20 until somewhere between 23:40 and 01:55 (depending on your luck): outage, all new launches fail. Already running containers are fine

🌤️ 00:45 to 06:00: degraded performance, containers start but slower
As an aside, ever since Lambda came out, AWS builds on AWS. They don't have another separate AWS or any special privileges.

AWS services that run on ECS on Fargate were also experiencing degraded performance or outages: CloudShell, App Runner, and more!
I'd love for AWS to publish "Service A depends on X, Y, and Z" and to alert all Service-A users if Service-Z has an outage.

We know AWS maintains a dependency graph cause they've talked about how, when new regions are launched, they sometimes discover loops 🤣

#awswishlist
Back to our incident, now that we know what happened, did we learn anything new from this incident? Any changes in how we architect?

Let's talk about it...

TL;DR: nothing new if you've been doing cloud for a while; otherwise congrats on being one of the lucky 10,000!
If you've been doing cloud for a while you likely did not learn anything new.

In the past we had multiple EC2 incidents where new instances would not start but running instances were fine => this is not a new failure mode.

Maybe some curiosities, but nothing new or dramatic here.
If you did not experience previous outages or if you did not have the opportunity to learn from previous incidents, congratulations!

You are one of the lucky 10,000 that get to learn from this outage!

To understand the 10,000 figure check out xkcd.com/1053/
🧑‍🏫 Lesson number 0: everything always fails and AWS is bad at communication!

🤷

Addendum: you can build a 99.9999% available service for 100 million, or you can build a service that fails like... 3 days a year for 1 million. Business will often choose the latter and that's ok!
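Not in the original tweets, but the back-of-the-napkin math behind that tradeoff is worth seeing once (numbers are mine, rounded):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# 99.9999% availability allows only ~31.5 seconds of downtime per year.
six_nines_downtime_s = SECONDS_PER_YEAR * (1 - 0.999999)

# Being down ~3 days a year is still roughly 99.2% availability.
three_days_availability = 1 - (3 * 24 * 3600) / SECONDS_PER_YEAR

print(f"~{six_nines_downtime_s:.0f}s of downtime/year vs ~{three_days_availability:.1%} availability")
```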
🧑‍🏫 Lesson number 1: have good observability in place!

If your systems were impacted, did you notice the errors before AWS declared an incident? Did you understand them?

Genuinely the impact might've been too small to trigger anything, so it's ok if you did not notice it!
Potential indicator: 📈SLIs/SLOs for response times

If scaling is broken and you're stuck at, say, 90% of the capacity you need, then response times and latency should go up, no?

Again, these might not have triggered and that's ok! It depends on your scale: for example, only 1 of my clients had an SLI alert.
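If you want something concrete to start from, here's a minimal sketch of a p99 latency alarm on an ALB using boto3. The load balancer dimension, SNS topic, and 1-second threshold are all placeholders; tune them to your own SLOs.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical names: replace with your own ALB dimension value and SNS topic.
ALB_DIMENSION = "app/my-alb/0123456789abcdef"
ONCALL_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-alerts"

# Alarm when p99 response time stays above 1 second for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="slo-p99-response-time",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": ALB_DIMENSION}],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ONCALL_TOPIC_ARN],
)
```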
Potential indicator: 📈 ECS/EKS events

Everything that happens in your cluster emits an event: ECS can natively send events to EventBridge, and for EKS you can parse the logs or use a third-party tool.

This is quite an in-depth implementation tho, and usually not worth it.
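For the curious, a minimal sketch of catching those ECS events with boto3. The rule name, SNS topic, and the choice to only match STOPPED tasks are my assumptions, not anything AWS prescribes.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical target: an SNS topic your on-call tooling listens to.
# The topic's resource policy must allow EventBridge to publish to it.
ONCALL_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:oncall-alerts"

# Match ECS task state changes where the task ended up STOPPED.
events.put_rule(
    Name="ecs-task-stopped",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"lastStatus": ["STOPPED"]},
    }),
)

events.put_targets(
    Rule="ecs-task-stopped",
    Targets=[{"Id": "oncall-sns", "Arn": ONCALL_TOPIC_ARN}],
)
```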
Potential indicator: 📈Number of actual running containers versus number of desired running containers

This is debatably the easiest and most common metric. Depending on your scale, it might not have triggered tho!

Some clients had this trigger, some did not.
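A minimal sketch of checking that gap with boto3; the cluster and service names are made up. In practice you'd alarm on the equivalent CloudWatch or Container Insights metrics instead of polling from a script.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster and service names.
CLUSTER = "prod-cluster"
SERVICES = ["api", "worker"]

response = ecs.describe_services(cluster=CLUSTER, services=SERVICES)

for service in response["services"]:
    desired = service["desiredCount"]
    running = service["runningCount"]
    if running < desired:
        # During this incident that gap would have kept growing.
        print(f"{service['serviceName']}: only {running}/{desired} tasks running")
```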
🧑‍🏫 Lesson number 2: when AWS declares an incident, you should declare an incident too!

It's fine if you missed the sporadic initial errors, but once AWS declared an incident your on-call had to be alerted!

ALL my clients got alerted and you should've been too!
Yes, it may be a noisy alert at times, but it's worth it — this also creates a really nice culture of "oh, everything DOES fail all the time".

These events are served on a "best effort" basis, so also subscribe to RSS and Twitter feeds: docs.aws.amazon.com/health/latest/…
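Best effort or not, wiring those Personal Health Dashboard events into your alerting is cheap: same put_rule / put_targets plumbing as the ECS events sketch above, only the event pattern changes. The service codes below are illustrative; check what your account actually receives.

```python
import json

# Event pattern for an EventBridge rule that forwards AWS Health events to on-call.
# Plug this into the put_rule / put_targets calls from the earlier ECS sketch.
health_pattern = json.dumps({
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    # Illustrative service codes; drop the filter to receive everything.
    "detail": {"service": ["ECS", "FARGATE", "EC2"]},
})
```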
🧑‍🏫 Lesson number 3: when there's an incident, know what matters for your company!

AWS made it clear that only new containers are affected and proposed mitigations:
1) migrate from Fargate to EC2 if you can (hard, but cheap)
2) stop scaling down (easy, but expensive)
At 15:30 the incident was official and it was made clear that *only new* containers are affected.

At 20:46 AWS explicitly said we should disable scaling down for Fargate containers since running containers are fine.

Only about 30 minutes later, new launches were disabled entirely!
⏲️ Does your company care more about staying up than money?

The second the incident started or, at worst, when AWS explicitly said it, you should've scaled up to a sane maximum.

Usually scaling up and down between 30 and 50 containers? Forced static 75 containers, scaling disabled!
This is what most folks did and it made this incident a non-event.

"Oh, only new containers are affected. Meh, let's make sure we have no new containers for a while"

Again, this is what most of my customers did. It took a while for all the new containers to successfully start though.
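If your ECS services scale via Application Auto Scaling, "forced static 75, scaling disabled" is roughly two API calls. A minimal sketch with hypothetical cluster and service names; remember to undo it once the incident is over.

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("application-autoscaling")

# Hypothetical names and counts.
CLUSTER = "prod-cluster"
SERVICE = "api"
STATIC_COUNT = 75

# Pin min = max so Application Auto Scaling can no longer scale the service down.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=f"service/{CLUSTER}/{SERVICE}",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=STATIC_COUNT,
    MaxCapacity=STATIC_COUNT,
)

# And bump the service itself to the static count right away.
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=STATIC_COUNT)
```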
If your on-call engineers did not feel comfortable making this tradeoff OR if your engineers did not know what your company prioritizes between money and staying up... you have a problem!

Seriously, look into this! ⚠️
💸 Does your company care more about money than staying up?

That's a valid option too! Gracefully fail your apps, enable some downtime mode, do whatever your plan is for these cases. You do have a plan, right?

In my case, some customers just disabled a couple non-vital workflows.
The "switch workers from Fargate to EC2" line was interesting and unexpected.

For EKS clusters with both Fargate and EC2 workers this switch was trivial, if your code could handle it. Definitely something to architect for!

For anything else... I'd say it's over-engineering.
🧑‍🏫 Lesson number 4: you need operational feature flags!

Again, everything fails all the time, so make sure you can handle failure gracefully.

Operational feature flags are lifesaving in moments like these.
↕️ "We only have 30% of our desired capacity, flip the switch so we don't do that complex processing anymore. This way we can handle all the traffic with 30%"

↕️ "Service X is barely working due to the incident, let's disable that for now but we'll let the rest of the app work"
↕️ "Oops, scaling is broken. Let's disable aggressive scaling and let's reuse workers since we don't know how easily new workers will come up"

These depend a lot on what your company is doing and on how your services are architected, but there are always options!
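What the flag lookup looks like matters much less than the fail-safe behavior. A minimal sketch, assuming flags live in SSM Parameter Store (the parameter path and flag name are made up); the important part is the default when the flag store itself is unreachable.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

ssm = boto3.client("ssm")

def flag_enabled(name: str, default: bool) -> bool:
    """Read an operational flag, falling back to a safe default on any failure."""
    try:
        value = ssm.get_parameter(Name=f"/ops/flags/{name}")["Parameter"]["Value"]
        return value.lower() == "true"
    except (BotoCoreError, ClientError):
        # If the flag store is part of the outage, don't make things worse.
        return default

# Hypothetical flag: skip the expensive processing when capacity is scarce.
if flag_enabled("heavy-processing-enabled", default=True):
    print("full processing path")
else:
    print("degraded-but-working path")
```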
🧑‍🏫 Lesson number 5: only one production AWS account is not ideal!

When things recovered, they went on a per-AWS-account basis. If you had multiple AWS accounts, some accounts recovered way faster than others. If you only had 1 AWS account and it was at the end of the line...
I think that's all the lessons from this incident, but I am biased on this since I've been doing cloud for a while now 🤔

Do you have any other learnings from this incident? Anything unexpected happen? Did your experience not match the one I had?
If so, please reply to this!
In conclusion, meh.

If you've been doing cloud for a while, this was a "heh, Fargate got its first big incident and it's not what I expected"

If you're newer to cloud, congrats! You got to learn 🎉

Read the unrolled version of this thread here: typefully.com/iamvlaaaaaaad/…

43/43
