As it's been ~3 years, figured I'll answer "What caused the Uber Eats glitch that allowed ordering free food for a weekend in India?"
This was an outage on my watch. Given Quora is paywalled - can't post the answer w/o a sub - here's the story on idempotency & breaking changes:
1. What happened? One morning someone in India tried to order food via Uber Eats, using Paytm as the payment method. But they didn't have enough balance.
Got an error message.
Ordered again.
The order went through!! Without having money for it.
News spread quick.
2. This was a payments-related bug. The problem with these is that the bug only shows up in the reconciliation flow, and Uber reconciled with Paytm maybe once a week.
How Uber discovered this: restaurants started going offline thanks to huge order quantities in very short times.
3. After it was clear something was up, Uber shut down Paytm as a payment method and started the investigation.
My team owned the Paytm payment method at the time, so this was me and my team.
We naturally looked at what code changes we had made in that timeframe. None.
4. So if we made zero changes on our end, what happened?
Turns out the Paytm team did a change late on a Friday that looked innocent enough.
It silently changed an API endpoint from behaving idempotent to non-idempotent.
Why does idempotency matter?
5. Idempotency means that you can safely repeat requests as you get the same response every time.
I remember the endpoint was charge-related.
Before, it always returned the same error when trying to charge a wallet without enough credits. With the change, not anymore:
6. Before:
1. "Try to charge wallet X without funds" -> Error1
2. "Try to charge wallet X without funds again" -> Error1
After:
1. "Try to charge wallet X without funds" -> Error1
2. "Try to charge wallet X without funds again" -> A Brand New Error
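A rough Python sketch of that before/after, to make the difference concrete. Everything here is hypothetical (the function names, the `attempts_seen` counter, the status strings): it only mimics the shape of the change described above and is not Paytm's actual API.

```python
# Hypothetical stand-in for the provider's charge endpoint; not a real API.
INSUFFICIENT_FUNDS = "Error1"
NEW_STATUS = "A Brand New Error"


def charge_before(wallet_balance: int, amount: int, attempts_seen: int) -> str:
    """Idempotent: the same failing request always gets the same answer."""
    if wallet_balance < amount:
        return INSUFFICIENT_FUNDS
    return "OK"


def charge_after(wallet_balance: int, amount: int, attempts_seen: int) -> str:
    """Non-idempotent: a repeated failing request gets a different answer."""
    if wallet_balance < amount:
        return INSUFFICIENT_FUNDS if attempts_seen == 0 else NEW_STATUS
    return "OK"


def responses(charge, retries: int = 2) -> list[str]:
    """Send the same underfunded charge several times and collect the replies."""
    return [
        charge(wallet_balance=0, amount=100, attempts_seen=i)
        for i in range(retries)
    ]
```

`responses(charge_before)` gives the same error twice, while `responses(charge_after)` gives the known error first and the new status on the retry.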
7. Now this might look like a small change, but on Uber's side, the assumption was the endpoint was idempotent, so there was no testing on getting anything else back. The new error was unknown and not mapped to anything.
Long story short it was interpreted as "success".
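How does an unmapped status end up as "success"? One plausible shape, sketched in Python below, is code that only checks against the failures it knows about. This is an assumption for illustration, not the actual Uber integration code; `KNOWN_FAILURES` and the function name are made up.

```python
# Hypothetical consumer-side check; the status strings are placeholders.
KNOWN_FAILURES = {"Error1"}


def charge_succeeded_fail_open(provider_status: str) -> bool:
    """Buggy on purpose: anything that is not a known failure counts as success."""
    return provider_status not in KNOWN_FAILURES
```

With this logic, `charge_succeeded_fail_open("Error1")` is `False`, but `charge_succeeded_fail_open("A Brand New Error")` is `True`: the never-seen-before status slips through as a paid order.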
8. So Paytm returned an error never documented before without telling its partners. Some partners assumed idempotency changes are breaking API changes to be communicated: but they were not. Uber was one of these partners.
The result? Free food until discovered.
9. So who paid for the free food?
Restaurants got paid and customers abusing this functionality were never pursued.
The responsible party needed to foot the bill. But who was responsible?
10. I can't share the settlement, so leaving a poll here to decide. Who do you think should have footed the cost for the bug?
The API provider changing their API to return a new error? The API consumer not parsing a new error introduced - but not communicated?
Who should pay?
Both parties were at fault here, which is why liability is tricky.
1. The API consumer should have coded more defensively & not assumed implicit API behaviors are deliberate.
2. The API provider should have communicated changes ahead of time, and not provided implicit idempotency.
Being in the middle of this outage, a few things I learned:
- Don't assume "unknown" means "good". Assume the opposite.
- The worst outages make for the best stories later.
- College students can eat SO MUCH. They were responsible for the majority of food orders during outage!
Just to make things more gray, a correction. The new API behavior was not a clear-cut error, if my memory is correct:
1. "Try to charge wallet X without funds" -> Error1 (as before)
2. "Try to charge wallet X without funds again" -> A status that is not an error (also not success)
Lots of questions on “why did Uber not handle HTTP error codes?”
Because there were none. This API at the time returned only 200s, where the body had a message to be parsed which indicated success / status message / error.
Status codes would have made this trivial to catch.
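A small Python sketch of why a 200-only API hides this kind of change. The JSON layout and the `status` field are assumptions made up for the example; the real response format isn't described in this thread.

```python
import json


def ok_by_http_status(http_status: int) -> bool:
    """With meaningful status codes, any non-2xx is a failure, even a new one."""
    return 200 <= http_status < 300


def body_status(raw_body: str) -> str:
    """With 200-only responses, the real outcome has to be dug out of the body."""
    # "status" is a hypothetical field name used only for this sketch.
    return json.loads(raw_body).get("status", "UNKNOWN")
```

For a response like `(200, '{"status": "SOME_NEW_STATUS"}')`, `ok_by_http_status(200)` is `True` and says nothing useful; only `body_status(...)` reveals that the provider sent something the consumer has never seen.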
“Did you have tests?”
Yes! As always the integration was unit tested with all possible API behaviours *at the time of building the integration*.
“Could you not have failed closed instead of failing open?”
Of course we should have. It’s the moral of the story from the consumer side.
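What failing closed could look like, as a minimal Python sketch: only statuses that are explicitly mapped are trusted, and anything else is treated as "unknown", which never places an order. The enum, the `STATUS_MAP` entries, and the function names are hypothetical, not what Uber shipped.

```python
from enum import Enum


class ChargeResult(Enum):
    SUCCESS = "success"
    DECLINED = "declined"
    UNKNOWN = "unknown"


# Hypothetical status strings; a real integration would list the documented ones.
STATUS_MAP = {
    "OK": ChargeResult.SUCCESS,
    "Error1": ChargeResult.DECLINED,
}


def interpret(provider_status: str) -> ChargeResult:
    """Only explicitly mapped statuses are trusted; everything else is UNKNOWN."""
    return STATUS_MAP.get(provider_status, ChargeResult.UNKNOWN)


def should_place_order(provider_status: str) -> bool:
    """Fail closed: place the order only on an explicit success."""
    return interpret(provider_status) is ChargeResult.SUCCESS
```

Here a brand-new status maps to `ChargeResult.UNKNOWN` and `should_place_order` returns `False`. The trade-off is that a provider hiccup now blocks a legitimate order.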
Why would you *ever* fail open when there’s something unknown?
Growth! You prefer to provide a great experience even if the provider has issues. Reconcile later.
This was the case in 2015, when the integration code was written. By 2019, the mentality changed. The code: not yet.
Lots of replies on the payments API design.
I don’t want to give Paytm a hard time: they were a lot better vs lots of other PSPs we worked with (my team owned ~15 PSP integrations). We integrated with *much* worse APIs & providers.
Paytm - unlike many - kept & keeps improving.
Ah, and Willem led writing the postmortem on our side (Uber). Here are takeaways we had (from memory):
One thing I *really* appreciated at Uber was how every outage was treated as a learning opportunity. It was a blameless culture and boy, did we learn.
So predictable that we’ll see an explosion of digital products selling “ideas for million dollar businesses” that you can “just vibe code quickly”.
Basically: “buy my digital product for $500, spend $1,500 on Lovable / Claude Code and become a millionaire.”
Another hype train
Ofc these products promoted by influencers will work just as well as crypto sh*tcoins launched by influencers in 2023.
We’ll see doctored evidence (“someone who built one of the ideas is at $5K MRR after 2 weeks”) and nontechnical people will spend thousands for $0 in return
The predictable winners: AI infra companies! Lovable, Vercel (with v0), Claude Code, Cursor, Replit, Gemini and any and all products that (at least partially) position themselves as “AI tools to build your idea that work even if you’re not a developer”
And it’s started. A gold rush - and the surest winners are those selling the shovels!
I generally like Anthropic: but the more they paint a dystopian future where AI “manages” people (“AI middle-managers”) the more I am starting to think they are losing their marbles.
LLMs are a tool humans should use. The tail should not wag the dog; Anthropic should know better
And frankly I’m getting tired of Anthropic being loud about how their AI will lead to mass unemployment, while claiming to be a responsible lab to develop AI.
If your master plan is to wipe out the labor market for profit: you’re not responsible.
I DO feel recently that Anthropic is the single least responsible lab out there.
Thanks to their CEO parroting how their AI will lead to massive job losses: not being concerned in the least, and seemingly *wanting* this outcome (even if it’s not realistic).
I am hearing SO many stories about people realizing coding with AI tools (aka “vibe coding”) is a game changer after “reviving” an old side project or idea on the side and making so much progress
But… while I often hear the excitement on starting: not hearing “finished” often!
Almost like these tools were amazing at making rapid progress at first… but it still takes a ton of effort to finish things and feels like most people go back to leaving side projects unfinished (even if in a more advanced state?)
FWIW guilty as charged
I got a bunch of side projects “revived” and was amazed at how fast it was
Then I just… kind of left them on the side? Turns out the reason I don’t touch them is because… they are just not a focus. Even tho it’s less effort now: still effort!!