As a Kube noob who's been cut by a few sharp edges, this type of battle report was super useful to me :) Some stuff I learned:
Their backend is a monolith but they route different collections of endpoints to different nodepools—this is a clever way I'd never thought about to limit the blast radius of performance issues (not Kube specific either, and may be a common practice I'd just never heard of!)
GKE regional clusters incur big bandwidth charges for cross-AZ traffic; you can avoid them by using multiple zonal clusters
TBH it doesn't look *that* awful from the chart—the egress it shows costs <=5k/mo and I'd guess Git storage is near-pathological for this—but useful warning
They, like me, got resource requests vs limits wrong to start with.
After thinking for a bit, I came to the conclusion that (request != limit) is probably a bad idea for production traffic bc it leads to pods getting an unpredictable amount of resources that can change depending on where the pod is scheduled and who its "neighbors" are. That unpredictability makes it hard to size workloads well.
But I couldn't find anyone else giving this advice; lots of blog posts explain what requests/limits *do*, but few cover good usage patterns!
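For what it's worth, Kubernetes does have a name for the (request == limit) case: a pod whose containers all set CPU and memory requests equal to their limits gets the "Guaranteed" QoS class, and anything with mismatched or partial settings is "Burstable". Here's a rough Python sketch of that classification rule. The function name and the dict shape are made up for illustration (this is not the Kubernetes API), and it compares quantities as plain strings, so "1" and "1000m" would wrongly count as different.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod the way Kubernetes QoS classes are documented.

    `containers` is a hypothetical shape: a list of dicts, each with
    optional "requests" and "limits" dicts keyed by "cpu" / "memory".
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        for resource in RESOURCES:
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit, and a request equal to it.
            if lim is None or req != lim:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    pinned = {"requests": {"cpu": "500m", "memory": "1Gi"},
              "limits": {"cpu": "500m", "memory": "1Gi"}}
    bursty = {"requests": {"cpu": "250m", "memory": "1Gi"},
              "limits": {"cpu": "1", "memory": "1Gi"}}
    print(qos_class([pinned]))  # request == limit everywhere
    print(qos_class([bursty]))  # CPU can burst past its request
```

The "bursty" pod is the one whose actual CPU depends on what its neighbors leave idle, which is the unpredictability described above.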
Autoscaling + slow pod startup times + spiky workloads (+ overcommitted CPU?) = 😢
We also have slow startup times (fortunately <2m) so another useful warning!
(I think "reserved pod capacity" = using low-priority "pause" pods as described here: replex.io/blog/kubernete…)
btw, I would love other pointers to good posts about lessons learned using Kube in anger! Google seems to be inundated with tutorials, or maybe I just had bad keywords
I've been reflecting recently on Wave's growth spurt in 2019-21. Most teams grew 2-4x a year for multiple years, and culture and effectiveness stayed remarkably strong compared to what I'd have expected (or heard of elsewhere).
Some thoughts on what might have helped:
1. We held a super high bar for integrity + mission alignment. This was huge—IMO the root cause of most (hard to fix) dysfunction is people optimizing for themselves and not the team, so ~full mission alignment does more than anything to keep orgs from breaking. (See also QT:)
2. We made concrete growth goals and plans pretty far (6-12mo) in advance, giving us an early start on problems like "ack, we're about to have more tech lead openings than we can fill" that take a long lead time to fix. We still got underwater, but way less than we might have.
A lot of talk about managing focuses on "decisionmaking": how to run decision meetings, who gets to sign off on what, how they flow up + down the hierarchy...
But IMO, management isn't (mainly) about decisions; it's about understanding and tweaking a complex system (of people).
Most individual A-vs-B(-vs-etc) decisions don't matter much in some sense. First because most decisions are reversible, so low-stakes. Second because making a decision is often straightforward: think about the pros and cons; think about what you care about; take your best shot.
But that framing only applies when you're at a pivotal point where you need a specific A-vs-B decision. It's not a good fit for most work, because most work happens as a result of the accumulation of thousands of micro-decisions continuously sprayed from the decision firehose.
One thing I love about both Wave (past job) and Anthropic (current) is how much everyone who works here cares about the mission. I think it's easy to underrate how much of a difference this makes if you haven't worked in that kind of environment. Some ways it's different...
(Sidebar: It's hard to write about this without sounding naive, because every startup pays lip service to having a mission. But IMO most times it's bogus—the mission isn't that credible or compelling, or they don't prioritize alignment when hiring as highly as Wave/Anthropic do.)
Anyway, while the most obvious benefit of working at a mission-oriented company is that it's way more motivating (see e.g.
A thing I often find myself suggesting to new managers is to "exert more backpressure."
Backpressure is a concept from fluid dynamics (and distributed systems) meaning the way in which a system resists overload—e.g. by slowing down, dropping requests, or completely failing.
You generally want to build systems that correctly handle downstream backpressure (i.e. propagate it upstream), and exert upstream backpressure in ways that are easy for the upstream to handle.
An example of good(ish) backpressure handling is TCP: dropped packets are interpreted as congestion due to maxing out bandwidth, so the sender responds by throttling their data volume. (Though note this produces bad behavior if there's packet loss for other reasons!)
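A toy Python sketch of that feedback loop, using the additive-increase / multiplicative-decrease idea behind TCP congestion avoidance. It's heavily simplified: real TCP also has slow start, timeouts, and fast recovery, and the function name and parameters here are invented for illustration.

```python
def aimd(loss_events, cwnd=1.0, increase=1.0, decrease=0.5, floor=1.0):
    """Return the sender's window size after each round trip.

    `loss_events` is a sequence of booleans: True means a packet was
    dropped that round, which the sender treats as a congestion signal.
    """
    history = []
    for lost in loss_events:
        if lost:
            # Backpressure received: cut the sending rate sharply.
            cwnd = max(floor, cwnd * decrease)
        else:
            # No pushback: probe for more bandwidth, gently.
            cwnd += increase
        history.append(cwnd)
    return history


if __name__ == "__main__":
    print(aimd([False, False, False, True, False]))
    # → [2.0, 3.0, 4.0, 2.0, 3.0]
```

Note the sender can't tell *why* a packet was lost, so a lossy link with spare bandwidth gets throttled just the same, which is the bad behavior mentioned above.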
There's a really interesting disconnect between how most people talk about "flat hierarchies" with few/no managers (a joyful world of freedom and harmony) vs how people who actually experienced them talk about them (a bunch of sad people flailing around chaotically)
A flat hierarchy does have one upside, which is that it makes it much less bad if your boss is bad, since they're spread too thin to spend a lot of time making you in particular miserable. I wonder how many flat-management proponents have only experienced bad managers.
*Good* managers help with lots of important things:
- Figuring out what's important to work on
- Resolving things that are blocking the work
- Course-correcting when new info comes up
- Addressing bad feelings or interpersonal issues
- Coaching people to improve over time
I've been overseeing a few cost-reduction projects at Wave recently, which is kind of fun since you have a much more objective feedback loop for whether your estimation / prioritization was right.
Here's some advice for this type of work I've found myself repeating:
1. Use numbers everywhere! Sounds basic, but I hear people talk about "not very much money," "a lot of users," etc. surprisingly frequently. Get in the habit of replacing every vague quantity word like that with a real number.
2. Do lots of back-of-the-envelope estimates. Sometimes you don't know the exact number, but you can still get an order-of-magnitude estimate by making up plausible values. "Our top 10 dashboards consume 40% of compute, so if we could make them all 2x faster, we'd save..."
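Spelling that dashboard estimate out as a tiny Python sketch: it assumes compute cost scales linearly with runtime, and the $10,000/month bill is a made-up number purely to show the arithmetic.

```python
def savings_fraction(share_of_compute, speedup):
    """Fraction of total compute saved if a slice gets `speedup`x faster.

    Assumes cost is proportional to runtime, so a 2x speedup halves the
    cost of that slice and leaves everything else unchanged.
    """
    return share_of_compute * (1 - 1 / speedup)


if __name__ == "__main__":
    fraction = savings_fraction(share_of_compute=0.40, speedup=2)
    hypothetical_monthly_bill = 10_000  # made-up figure, in dollars
    print(f"{fraction:.0%} of compute")
    print(f"~${fraction * hypothetical_monthly_bill:,.0f}/mo on a $10k bill")
```

Under those assumptions the top-10 dashboards project is worth about a fifth of the compute bill, which is usually enough precision to decide whether it beats the next idea on the list.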