I'm going to attempt to summarize the AWS outage on 25 Nov that impacted a good part of the internet in 6 drawings (from the 2,000+ word detailed postmortem by @awscloud at aws.amazon.com/message/11201/). A thread.

1. Meet AWS Kinesis, the realtime processing backbone of AWS:
2. Incoming requests hit the frontend (FE) fleet. Each FE machine maintains a shard map to the backend (BE) clusters. Machines in these clusters do the realtime processing.

Classic setup. Except for the scale, which we can assume is massive. That "frontend fleet" is likely large. The BE fleet? Gigantic.
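
To make the shard map idea concrete, here's a minimal sketch of how an FE machine could route a request to a BE cluster. The cluster names, the partition key and the hash-then-modulo scheme are all my assumptions for illustration; the postmortem doesn't describe the actual lookup.

```python
import hashlib
from typing import Dict

# Hypothetical shard map held by one FE machine: shard id -> BE cluster name.
SHARD_MAP: Dict[int, str] = {
    0: "be-cluster-a",
    1: "be-cluster-b",
    2: "be-cluster-c",
}


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard id with a stable hash (assumed scheme)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def route(partition_key: str, shard_map: Dict[int, str]) -> str:
    """Return the BE cluster that should process this request."""
    return shard_map[shard_for(partition_key, len(shard_map))]


cluster = route("user-42", SHARD_MAP)
```

The point of the sketch: routing is only as good as the map. If an FE machine's copy of the map is stale or incomplete, the lookup still "works" but can send the request to the wrong place.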
3. New FE machines were added to the FE fleet as per usual. A few hours later, things started to behave oddly. The team investigated, and another few hours later they identified the root cause. It has to do with how the FE fleet works.

Each FE machine keeps a thread open to every other machine in the FE fleet to stay in sync:
4. Remember how we noted the FE fleet is likely large at Amazon scale? Well, it was so large that FE machines ran out of threads to talk with the new machines. Now this is an issue because...
5. ... the shard maps (very) slowly started to become corrupted. When an FE machine reshards a BE cluster, it cannot reach a few machines in the cluster, and vice versa. It apparently took a few hours for this to trigger an alarm.
6. Once the damage was done, unlucky requests were no longer routed correctly to the streams that process them.
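
The thread exhaustion in steps 4 to 6 comes down to simple arithmetic: with one thread per peer, every FE machine needs roughly (fleet size - 1) threads just for syncing. Here's a rough sketch of that check. The limit, the fleet sizes and the count of other threads are made-up numbers, since the real ones aren't public.

```python
def peer_threads_needed(fleet_size: int) -> int:
    """One sync thread per *other* FE machine in the fleet."""
    return max(fleet_size - 1, 0)


def fits_thread_limit(fleet_size: int, other_threads: int, os_thread_limit: int) -> bool:
    """True if peer threads plus everything else stay within the OS limit."""
    return peer_threads_needed(fleet_size) + other_threads <= os_thread_limit


# Hypothetical numbers, for illustration only.
OS_THREAD_LIMIT = 1000   # assumed per-machine thread limit
OTHER_THREADS = 50       # assumed threads used for non-sync work

before_scale_up = fits_thread_limit(900, OTHER_THREADS, OS_THREAD_LIMIT)   # 899 + 50 = 949
after_scale_up = fits_thread_limit(1000, OTHER_THREADS, OS_THREAD_LIMIT)   # 999 + 50 = 1049
```

Note that the thread count grows on every machine in the fleet at once, so a routine capacity add can push the whole fleet over the limit, not just the new machines.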

The solution was to remove the new instances and reset the FE fleet (which itself is complicated).

Crazy that the number of machines hit the OS thread limit. I don't know what this number was.
The future prevention is to use bigger machines (so fewer of them are needed, and each one needs fewer threads) and to add alerts that fire before thread usage gets close to the limit. I wonder, though, whether it's feasible to drop the one-thread-per-machine protocol. Probably too big of a change.
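
The alerting part could look something like this. It's a minimal sketch, assuming you can already read the current thread count and the OS limit from your monitoring; the function names and the 80% threshold are hypothetical, not what AWS uses.

```python
def thread_usage_ratio(current_threads: int, os_thread_limit: int) -> float:
    """Fraction of the OS thread limit currently in use."""
    if os_thread_limit <= 0:
        raise ValueError("os_thread_limit must be positive")
    return current_threads / os_thread_limit


def should_alert(current_threads: int, os_thread_limit: int, warn_ratio: float = 0.8) -> bool:
    """Fire well before the limit is reached (warn_ratio is an assumed threshold)."""
    return thread_usage_ratio(current_threads, os_thread_limit) >= warn_ratio


# Hypothetical readings from one FE machine.
healthy = should_alert(current_threads=700, os_thread_limit=1000)
too_close = should_alert(current_threads=850, os_thread_limit=1000)
```

An alert like this would have flagged the problem when capacity was added, instead of hours later when shard maps had already gone bad.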

Kudos for the transparency @awscloud! aws.amazon.com/message/11201/
