Gergely Orosz
Mar 13 20 tweets 5 min read
As it's been ~3 years, figured I'll answer "What caused the Uber Eats glitch that allowed ordering free food for a weekend in India?"

This was an outage on my watch. Given Quora is paywalled - can't post the answer w/o a sub - here's the story on idempotency & breaking changes:
1. What happened? One morning someone in India tried to order food via Uber Eats, using Paytm as a payment method. But they didn't have enough balance.

Got an error message.

Ordered again.

The order went through!! Without having money for it.

News spread quickly.
2. This was a payments-related bug. The problem with these is that they only show up in the reconciliation flow, and Uber reconciled with Paytm maybe once a week.

How Uber discovered this: restaurants started going offline thanks to huge order quantities in very short times.
3. After it was clear something was up, Uber shut down Paytm as a payment method and started the investigation.

My team owned the Paytm payment method at the time, so this was me and my team.

We naturally looked at what code changes we'd made in that timeframe. None.
4. So if we made zero changes on our end, what happened?

Turns out the Paytm team did a change late on a Friday that looked innocent enough.

It silently changed an API endpoint from behaving idempotently to behaving non-idempotently.

Why does idempotency matter?
5. Idempotency means that you can safely repeat requests as you get the same response every time.

I remember the endpoint was charge-related.

Before, it always returned the same error when trying to charge a wallet without enough credits. With the change, not anymore:
6. Before
1. "Try to charge wallet X without funds" -> Error1
2. "Try to charge wallet X without funds again" -> Error1

After
1. "Try to charge wallet X without funds" -> Error1
2. "Try to charge wallet X without funds again" -> A Brand New Error
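A minimal sketch of that before/after behaviour, assuming a made-up `WalletApi` class and invented status strings (`INSUFFICIENT_FUNDS`, `RETRY_PENDING`). None of these names come from the real Paytm API; they only illustrate a repeated request getting a different answer.

```python
class WalletApi:
    """Hypothetical stand-in for the provider's charge endpoint."""

    def __init__(self, idempotent: bool) -> None:
        self.idempotent = idempotent
        self.failed_attempts = {}  # wallet id -> number of failed charges so far

    def charge(self, wallet_id: str, amount: int, balance: int) -> str:
        if balance >= amount:
            return "CHARGED"
        attempts = self.failed_attempts.get(wallet_id, 0) + 1
        self.failed_attempts[wallet_id] = attempts
        # Idempotent behaviour: the same failing request always gets the same answer.
        if self.idempotent or attempts == 1:
            return "INSUFFICIENT_FUNDS"
        # Non-idempotent behaviour: the retry gets a new, undocumented status (invented name).
        return "RETRY_PENDING"


before = WalletApi(idempotent=True)
after = WalletApi(idempotent=False)
before_responses = [before.charge("X", 500, 0) for _ in range(2)]
after_responses = [after.charge("X", 500, 0) for _ in range(2)]
print(before_responses)
print(after_responses)
```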
7. Now this might look like a small change, but on Uber's side, the assumption was the endpoint was idempotent, so there was no testing on getting anything else back. The new error was unknown and not mapped to anything.

Long story short it was interpreted as "success".
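One way an unmapped response can end up as "success" is a deny-list check: anything not on the list of known errors passes. This is only a guess at the shape of the bug, not Uber's actual code. `KNOWN_ERRORS`, `charge_succeeded_fail_open` and the status strings are all hypothetical.

```python
# Hypothetical list of the errors known when the integration was built.
KNOWN_ERRORS = {"INSUFFICIENT_FUNDS", "WALLET_BLOCKED"}


def charge_succeeded_fail_open(status: str) -> bool:
    # Fail-open: every status that is not a known error is treated as success.
    return status not in KNOWN_ERRORS


print(charge_succeeded_fail_open("INSUFFICIENT_FUNDS"))
print(charge_succeeded_fail_open("RETRY_PENDING"))  # invented, never-documented status
```

With a check like this, the first attempt is rejected and the retry with the unfamiliar status goes through.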
8. So Paytm returned an error never documented before without telling its partners. Some partners assumed idempotency changes are breaking API changes to be communicated: but they were not. Uber was one of these partners.

The result? Free food until discovered.
9. So who paid for the free food?

Restaurants got paid and customers abusing this functionality were never pursued.

The responsible party needed to foot the bill. But who was responsible?
10. I can't share the settlement, so leaving a poll here to decide. Who do you think should have footed the cost for the bug?

The API provider changing their API to return a new error? The API consumer not parsing a new error introduced - but not communicated?

Who should pay?
Both parties were at fault here, which is why liability is tricky.

1. The API consumer should have coded more defensively & not assumed implicit API behaviors are deliberate.

2. The API provider should have communicated changes ahead of time, and not provided implicit idempotency.
Being in the middle of this outage, a few things I learned:

- Don't assume "unknown" means "good". Assume the opposite.

- The worst outages make for the best stories later.

- College students can eat SO MUCH. They were responsible for the majority of food orders during the outage!
Just to make things more gray, a correction. The new API behavior was not a clear-cut error if my memory is correct:

1. "Try to charge wallet X without funds" -> Error1 (as before)
2. "Try to charge wallet X without funds again" -> A status that is not an error (also not success)
Lots of questions on “why did Uber not handle HTTP error codes?”

Because there were none. At the time, this API returned only 200s, and the body had a message to be parsed which indicated success / status message / error.

Status codes would have made this trivial to catch.
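To make the "only 200s" point concrete, here is a small sketch. The JSON body shape and the `status` field are assumptions for illustration; the real response format is not described in this thread.

```python
import json


def http_status_is_error(http_status: int) -> bool:
    # The usual transport-level check: 4xx and 5xx mean something went wrong.
    return http_status >= 400


def body_status(body: str) -> str:
    # With a 200-only API, the outcome has to be dug out of the body instead.
    return json.loads(body).get("status", "MISSING")


# Hypothetical responses: same HTTP status, very different outcomes.
responses = [
    (200, '{"status": "CHARGED"}'),
    (200, '{"status": "INSUFFICIENT_FUNDS"}'),
]

for http_status, body in responses:
    print(http_status_is_error(http_status), body_status(body))
```

The status-code check says "no error" for both responses, so everything hinges on how the body message gets interpreted.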
“Did you have tests?”

Yes! As always the integration was unit tested with all possible API behaviours *at the time of building the integration*.

“Could you not have failed closed instead of failing open?”

Of course we should have. It’s the moral of the story from the consumer side.
Why would you *ever* fail open when there’s something unknown?

Growth! You prefer to provide a great experience even if the provider has issues. Reconcile later.

This was the case in 2015, when the integration code was written. By 2019, the mentality changed. The code: not yet.
Lots of replies on the payments API design.

I don’t want to give Paytm a hard time: they were a lot better vs lots of other PSPs we worked with (my team owned ~15 PSP integrations). We integrated with *much* worse APIs & providers.

Paytm - unlike many - kept & keeps improving.
Ah, and Willem led writing the postmortem on our side (Uber). Here are takeaways we had (from memory):

One thing I *really* appreciated at Uber was how every outage was treated as a learning opportunity. It was a blameless culture and boy, did we learn.

Lots of people saying Uber should have just interpreted the unknown message as “unsuccessful”. Not quite.

Here’s a story from a startup that did just that… double and triple charging their customers.

Alerting on never-before-seen responses is key, rather than just assuming yea or nay.
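A rough sketch of what "alert instead of assuming" could look like: unknown gets its own outcome and raises an alert, rather than being folded into success or failure. The `Outcome` enum, the status sets and the `alert` callback are all hypothetical; what to do with the order while it is in the unknown state is a separate product decision.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"


# Hypothetical statuses the integration was built and tested against.
KNOWN_SUCCESS = {"CHARGED"}
KNOWN_FAILURE = {"INSUFFICIENT_FUNDS"}


def classify(status: str, alert: Callable[[str], None]) -> Outcome:
    if status in KNOWN_SUCCESS:
        return Outcome.SUCCESS
    if status in KNOWN_FAILURE:
        return Outcome.FAILURE
    # Never-before-seen response: tell a human instead of guessing either way.
    alert(f"never-before-seen payment status: {status!r}")
    return Outcome.UNKNOWN


alerts = []  # stand-in for a paging or monitoring hook
outcome = classify("RETRY_PENDING", alerts.append)
print(outcome, alerts)
```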


