One of the challenges at Uber was building monitoring and alerting that worked reliably.

The problem was that Uber was (and is) city-based, and global alerting would not catch regional (city- or country-level) issues.

Two stories on why this is difficult:
1. A PayPal employee on a business trip to Japan alerted us in 2016 that PayPal was not working there. He was right: it hadn’t been working for 2+ months, across the whole country. How did we miss it?

There were only 20 PayPal trip attempts/day, and Japan was one of 60+ countries.
On a global scale, this accounted for a tiny fraction of volume.
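
To make the "tiny fraction" concrete, here is a minimal sketch of why a global alert stays silent. Only the 20 attempts/day figure comes from the story; the global volume and the alert threshold are hypothetical numbers picked for illustration, not Uber's real figures.

```python
# Only JAPAN_PAYPAL_ATTEMPTS_PER_DAY comes from the thread.
# The other two constants are HYPOTHETICAL, chosen for illustration.
JAPAN_PAYPAL_ATTEMPTS_PER_DAY = 20
GLOBAL_ATTEMPTS_PER_DAY = 1_000_000      # assumed global volume
GLOBAL_ALERT_DROP_THRESHOLD = 0.01       # assumed: page on a 1% global drop


def global_drop_fraction(segment_per_day: int, global_per_day: int) -> float:
    """Share of global volume lost when one segment fails completely."""
    return segment_per_day / global_per_day


def global_alert_fires(segment_per_day: int, global_per_day: int,
                       threshold: float) -> bool:
    """True if a total outage of the segment is visible to a global alert."""
    return global_drop_fraction(segment_per_day, global_per_day) >= threshold


if __name__ == "__main__":
    drop = global_drop_fraction(JAPAN_PAYPAL_ATTEMPTS_PER_DAY,
                                GLOBAL_ATTEMPTS_PER_DAY)
    fires = global_alert_fires(JAPAN_PAYPAL_ATTEMPTS_PER_DAY,
                               GLOBAL_ATTEMPTS_PER_DAY,
                               GLOBAL_ALERT_DROP_THRESHOLD)
    print(f"global drop: {drop:.4%}, alert fires: {fires}")
```

Under these assumed numbers, a 100% outage for PayPal in Japan moves the global metric by a few thousandths of a percent, far below any threshold you could page on without drowning in noise.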

So we did what made sense: we added country-level alerting.

First, this became a data cardinality problem. 1,000 cities x 15 payment methods... it is not trivial to track them all. We settled on countries & top cities.
With country/city alerting, the sparseness of the data now became a problem. It’s still hard to alert reliably on 20 data points/day.

The new system worked, but it was really noisy for months, until we “tamed” it.
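
The cardinality and sparseness problems in this story can be put into rough numbers. This is a sketch under a loud assumption: events arrive as a Poisson process at a constant rate through the day, which real trip data does not do (there are daily and weekly cycles). The function names are made up for illustration and are not what Uber used.

```python
import math


def series_count(cities: int, payment_methods: int) -> int:
    """Number of separate time series if every combination is tracked."""
    return cities * payment_methods


def prob_zero_events(rate_per_day: float, window_hours: float) -> float:
    """P(no events in the window), ASSUMING Poisson arrivals at a flat rate."""
    expected = rate_per_day * window_hours / 24.0
    return math.exp(-expected)


def min_window_hours_for_zero_alert(rate_per_day: float,
                                    false_alarm_prob: float) -> float:
    """Shortest window where seeing zero events is rarer than false_alarm_prob.

    Solves exp(-rate * t) = false_alarm_prob for t, under the same
    Poisson assumption as prob_zero_events.
    """
    return 24.0 * -math.log(false_alarm_prob) / rate_per_day


if __name__ == "__main__":
    # 1,000 cities x 15 payment methods, as in the thread.
    print("series to track:", series_count(1000, 15))
    # At 20 events/day, an empty hour is entirely ordinary.
    print("P(empty hour):", round(prob_zero_events(20, 1), 3))
    # Hours of silence needed before "zero events" is a 1-in-1,000 fluke.
    print("hours to alert:", round(min_window_hours_for_zero_alert(20, 0.001), 1))
```

With 20 events/day, an hour with no events happens a large share of the time by pure chance, so a short-window alert pages constantly, while a window long enough to be trustworthy takes several hours to fire. That trade-off is one plausible reason such a system is noisy until it is tuned.
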
2. On 8 Nov 2016, a team member in Amsterdam got paged for an outage in India. Cash trips were not working. They debugged, but everything looked fine... except for the data. Almost NO cash trips.

Turns out, this was the day of cash demonetisation. This would not be the first such alert:
Every year we’d have many alerts from countries and cities tied to real-world events: hurricanes, large concerts, local marketing events. The system flagged all of them as unusual and paged engineers. But they were “normal” in the local context.
Monitoring & alerting is an area where you can go down the rabbit hole, especially in a multi-country, multi-city, real-world-bound context.
