Anyone else see regional networking issues in AWS us-east-1 this morning around 8am UTC? Honeycomb was seeing some serious traffic drops both from ALBs to backends, as well as between hosts in our VPCs, for 30 min, in all AZs.
Update: this recurred and our AWS instances are failing to talk to each other again, for the past hour :(
Our case is 6569575271 if any friendly AWS folks want to look. also cc @QuinnyPig
@QuinnyPig This is the impact on our SLO (it's _bad_): [image: Honeycomb product screenshot showing 3 yellow bands of outage on a heatmap]
Update: still worsening. AWS has neither updated their support dashboard nor officially communicated with us.

This customer is becoming grumpy and considering a project to go multi-cloud instead of renewing RI commitments at the end of this month...
@AWSSupport is something going on with ALBs? Our ALBs are systematically returning HTTP 502/504 because they're timing out talking to our backends, which are _definitely_ up and running fine.
@AWSSupport How do I know this is @AWSSupport's issue? Well...

the built-in BubbleUp on SLO success vs failures shows a preponderance of HTTP 502/504, scattered across backends rather than one bad backend, & no pattern on traffic type, etc. [image: Honeycomb BubbleUp]
Update: we... think we have a resolution and are very sorry to AWS, this is actually on our end after all.
Narrator: they were not up and running fine.
So, what happened: we had simultaneous OOM-stampedes caused not by a query of death, but instead by a memory leak combined with instance restarts for regular deploys that all landed within +/- 5 min of each other...

Hindsight bias says that if we'd looked at the right graphs, we'd have seen it.
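To make the stampede mechanics concrete, here is a minimal sketch of why a steady leak plus near-simultaneous restarts leads to near-simultaneous OOMs. Everything in it is an assumption for illustration: the `Instance` type, the linear leak model, and all of the numbers (memory limit, baseline, leak rate, restart offsets) are hypothetical and are not taken from the actual incident.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    start_min: float  # minutes after the deploy began when this process restarted


def minutes_until_oom(limit_mb: float, baseline_mb: float, leak_mb_per_min: float) -> float:
    """Time for a process to grow from its baseline to the memory limit,
    assuming a constant (linear) leak rate."""
    return (limit_mb - baseline_mb) / leak_mb_per_min


def oom_times(instances, limit_mb, baseline_mb, leak_mb_per_min):
    """Map each instance name to the minute at which it would hit the limit."""
    lifetime = minutes_until_oom(limit_mb, baseline_mb, leak_mb_per_min)
    return {i.name: i.start_min + lifetime for i in instances}


# Hypothetical fleet: every instance restarted within 5 minutes of the others.
fleet = [Instance("a", 0.0), Instance("b", 3.0), Instance("c", 5.0)]
times = oom_times(fleet, limit_mb=8000, baseline_mb=2000, leak_mb_per_min=10)

# The spread of OOM times equals the spread of restart times, so the
# whole fleet runs out of memory inside the same narrow window.
spread = max(times.values()) - min(times.values())
print(times, spread)
```

Under this simplified model the identical leak rate is what keeps the instances in lockstep; real traffic differences would blur the window somewhat, but not necessarily enough to avoid losing most of the fleet at once.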
None of Honeycomb's traces showed this because the traces in flight were destroyed by the OOM and never transmitted to metamonitoring. The black box monitor of AWS ALB logs did tell us we had a problem, but we misinterpreted the signs.

Another retrospective graph of the stampede:
It took fresh eyes from an engineer who wasn't working the incident to first-principles check "hey, why is process uptime resetting to 0?" to get us out of the path we'd carved ourselves into.
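The "why is process uptime resetting to 0?" check can be sketched roughly as below. This is only an illustration: the sample format (`(timestamp, uptime_seconds)` pairs per host), the function name `find_restarts`, and the host names and values are all assumptions, not the tooling we actually used.

```python
def find_restarts(samples):
    """Return the timestamps at which uptime went down between two
    consecutive samples, which means the process restarted in between.

    `samples` is a list of (timestamp, uptime_seconds) sorted by timestamp.
    """
    restarts = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:
            restarts.append(ts)
    return restarts


# Hypothetical uptime samples, one reading per minute for each host.
per_host = {
    "host-a": [(0, 3500), (60, 3560), (120, 20), (180, 80)],
    "host-b": [(0, 3400), (60, 3460), (120, 3520), (180, 15)],
    "host-c": [(0, 100), (60, 160), (120, 220), (180, 280)],
}

restarts = {host: find_restarts(samples) for host, samples in per_host.items()}
restarted_hosts = sorted(host for host, ts in restarts.items() if ts)
print(restarts, restarted_hosts)
```

Several backends showing a reset inside the same short window, with no deploy in progress, is a hint that the backends are crashing rather than the network dropping traffic.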

Full retrospective and incident report to come.
There were a lot of human factors issues here - initial detection via burn alert (which we were still treating as a beta and not a full paging alert), reluctance by me therefore to wake oncall in PST, me being distracted by conference & oncall wanting to go back to sleep...
and then when people did get properly awake in the PST working hours, it took a while for people to actually thoroughly debug and re-examine rather than taking the haphazard diagnosis as authoritative. Lessons learned.
.@jhscott came by our retrospective meeting this morning! I hope he found it interesting!

And we'll get a blog post out next week with the public incident report :)
Keep Current with Liz Fong-Jones (方禮真)
