First: kudos to @Cloudflare for transparency here and throughout their incident.
Next: some thoughts on safety in distributed systems like this. (I don’t know how CF does it, so don’t take this as criticism of their practices, merely some musings from similar experiences) 1/
While testing and QA are important, massive distributed systems with unconstrained user inputs are hard to simulate, so deployment to production is *always* risky. Call it “operational field testing,” but there is always the chance you’re going to find new failure modes there. 2/
There are several levers for safety at scale: staged rollout, rapid rollback, error detection, and edge failure rejection. 3/
Staged rollout, where you slowly roll out changes and watch for weird errors, has some costs. It makes changes take longer (either overlapping with other changes, or limiting total changes). You may not trigger the failure in a detectable way early. It increases human cost! 4/
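A minimal sketch of the “deploy a little, watch, widen” loop that staged rollout implies. The stage fractions, error budget, soak time, and the deploy/monitor/rollback helpers are all hypothetical stand-ins for real deployment plumbing:

```python
import time

# Hypothetical stage sizes: fraction of the fleet running the new change.
STAGES = [0.01, 0.05, 0.25, 1.00]
ERROR_BUDGET = 0.001      # max tolerable error rate before we bail out
SOAK_SECONDS = 1800       # how long to watch before widening the rollout

def staged_rollout(change, deploy_fraction, error_rate, rollback):
    """deploy_fraction, error_rate, and rollback are stand-ins for your
    deployment and monitoring systems (assumed, not a real API)."""
    for fraction in STAGES:
        deploy_fraction(change, fraction)
        time.sleep(SOAK_SECONDS)          # the human cost: someone is watching this
        if error_rate(change) > ERROR_BUDGET:
            rollback(change)              # see the rollback caveats in 6/-7/
            return False
    return True
```

Note that this loop only catches failures that show up as an elevated error rate during the soak window — which is exactly the limitation the next few tweets get into.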
Staged rollout’s costs are the most visible every day to your customers, who, except on bad days, want instantaneous deployment. 5/
Rapid rollback is “simply” being prepared to revert changes - which is effectively instituting a new change to an older state. But other things might have changed since then. Distributed systems rarely have one channel of changes, so 50 other things might have also changed. 6/
So the state you “rollback” to is almost always a state you were never in. Usually that’s okay - but it’s possible to enter an equally deadly state, if an intervening change would have failed in the world before your rollback-inducing change. 7/
Instrumental to identifying a “bad” change is the ability to detect failures, and isolate them to a triggering change. In some failures, that’s easy. “We broke the whole network” is easy to spot, and it happens right after a change. 8/
But what if your change only goes bad when 5% of your requests have Accept-Language: Mandarin? And you deployed during the US daytime? You won’t see a failure for hours. Or if your change only affects a weird HTTP corner case, used by one customer? 9/
Detecting changes that affect one customer is rife with false positives: they change themselves all the time! 10/
Edge failure detection - having individual systems notice something is wrong - has areas of different complexity. Crash rejection of dynamic config is easiest: before reading, move new config to a temp location; if you crash on reading, you can’t find bad config on restart. 11/
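A sketch of that crash-rejection trick, assuming hypothetical file paths and a JSON config; the point is that the candidate config is moved out of the way *before* the risky parse, so a crash during parsing leaves the restart path looking only at the last known-good file:

```python
import json
import os
import shutil

ACTIVE = "/etc/myproxy/rules.json"        # last known-good config (hypothetical path)
INCOMING = "/etc/myproxy/rules.json.new"  # where the control plane drops updates
SCRATCH = "/etc/myproxy/rules.json.tmp"   # quarantine while we try to parse

def load_new_config():
    # Move the candidate aside *before* parsing. If the parse crashes the
    # process, the restart path below never looks at SCRATCH, so a poisonous
    # config can't take the server down twice.
    os.rename(INCOMING, SCRATCH)
    with open(SCRATCH) as f:
        rules = json.load(f)              # the risky step
    shutil.move(SCRATCH, ACTIVE)          # commit only after a clean parse
    return rules

def load_on_restart():
    with open(ACTIVE) as f:
        return json.load(f)               # never touches SCRATCH or INCOMING
```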
But many failures aren’t crashes. You can prune runaway processes - which requires complex system accounting and process segmentation - if it’s a subset of your code paths. 12/
Runaway security processes — like a WAF rule — are a difficult balancing act. Do you let the request through and terminate the process? Better performance, but it creates a bypass-attack risk. Kill the request? Saves cycles, but you risk an unhappy customer. Keep trying? That might never end. 13/
And an edge server has to “decide” if the failure is because of the request (kill the request), the config (can you self-revert?), or a server failure (remove from service). 14/
It’s dangerous for servers to remove themselves from service, especially if the problem is global. A server can hint - telling the load balancer not to send it traffic, but serving any traffic it gets. The load balancer can ignore that hint (and *that’s* an area of complexity). 15/
There’s a lot more depth in this topic area; here are a few pieces for more reading. akamai.com/us/en/multimed… akamai.com/us/en/multimed… 16/FIN
After @eastdakota shared this, I got to be cc’ed in several calls for CF to run two distinct networks, and to not have their customer portal behind their CDN. I’ve been in similar conversations over the last two decades, so here’re some more thoughts. 2:1/
Let’s look at the “run two networks approach.” Call them A and B - and they need to have similar, if not identical, functionality. (At Akamai, we have many networks, but they mostly have divergent capabilities; this isn’t about that). 2:2/
One approach is to have A and B run two distinct codebases. To do that, you really need to implement in different languages, too, and be careful about shared open source libraries. Your dev cost will more than double, because you also need to pay for anti-coordination. 2:3/
Instead, maybe you just phase changes between A and B, making it a form of slow rollout, maybe offset by a week. That’ll only protect you from instant catastrophes, and only if you have full capacity on each subset. 2:4/
While having double capacity may sound like the obvious answer if you’re used to running a website, it’s really not wise at scale - for instance, we’d need around another hundred terabits per second of idle capacity. 2:5/
A third approach to A/B might be to follow a Debian model - have stable/unstable networks, and customers on each - maybe your “free/low-margin” customers get the unstable branch. But guess who actually wants and uses the features you’ll put there? The high-margin customers! 2:6/
And if you aren’t over-provisioning those networks, in any model, you’re basically committing to a PR disaster comparable to a full network failure. Maybe you only have to apologize to half as many customers, but that half will want to know why they were on the wrong network. 2:7/
(June, 2004: we had a 94 minute failure on one of our FOUR DNS networks, the one that provided loadbalancing as a service. It was hard to explain to journalists that it hadn’t been a total outage.) 2:8/
How about the question of “Should a CDN’s customer portal not use that CDN?” The answer is “It’s a bad idea, but all other ideas are worse.” 2:9/
Portals have to be behind a CDN. We aren’t selling snake oil; if you want to keep a website up and available, a CDN is the single best defensive layer. A scrubbing network comes in second, and everything else isn’t worth mentioning. 2:10/
Okay, but does a portal have to be behind its own CDN? Behind a competitor’s CDN seems odd. I suspect the FTC might have issues. But the CDN’s CISO ought to have issues! I don’t want @eastdakota looking at my customer admin traffic, and I’m sure that’s mutual. 2:11/
Also, odds are you want your portal using your own CDN’s features, so that makes even fallback hard. That’s a challenge that multi-CDN enterprises grapple with: you can only rely on the lowest common denominator of services. 2:12/
But ignore the trust issue. Honestly, of course your sales team is going to try to sell on a competitor’s failure! (There is a graceful way to do so, which we don’t always achieve.) But that’d be hard if the competitor’s CDN went down and took your portal with it! 2:13/
That isn’t to say a CDN shouldn’t think about the customer bootstrap issue – but there aren’t really obvious fixes. 2:14/FIN