Marc Brooker
Serverless, databases, and serverless databases at AWS. I use 'cat' every time. Views are my own. On Mastodon: @marcbrooker@fediscience.org
Aug 1, 2023 7 tweets 2 min read
15 years! I started at AWS, with the EC2 team in Cape Town, on the 1st of August 2008. It's been a real pleasure to have a front row seat for the growth of cloud, to be involved in the genesis of serverless, and to have exciting problems to work on every day. Some memories: At lunch on my first day (at a hemp-themed restaurant across the road from CPT1) the team explained how our upcoming 'EBS' product was going to work. I honestly didn't believe them. It just seemed too magical to present low-level block storage to machines only in software.
Nov 2, 2022 15 tweets 4 min read
A couple weeks back, I did a talk titled "Distributed Systems Solve Only Half My Problems (and I have a lot of problems)" at HPTS'22. Talks at HPTS aren't recorded, so here's a summary of what I said. I started talking about our journey with Formal Methods at AWS since we published our CACM paper in 2015. cacm.acm.org/magazines/2015…
Nov 1, 2022 5 tweets 2 min read
Today (well, last night) on "most extreme side project": fixing a $3 pair of kid's safety scissors. The pivot between the blades (plastic ☹️) broke and was lost. No matter! We can take out some Aluminium round, turn it to the right size, tap it to M3 (that's 3mm, around 1/8") and combine it with a little screw. M3 isn't so tiny as taps go, but still feels fragile.
Sep 29, 2022 4 tweets 2 min read
Jack @vanlightly and Marcus Kuppe (@lemmster) have been doing some great work adding simulations to TLA+: conf.tlapl.us/2022/JackMarku… Use TLA+ to not only check safety and liveness properties, but also understand the statistical behavior of systems! This reduces work (only one spec to write), reduces mistakes (the same spec can be checked for safety and liveness properties), and makes simulation more accessible. I mean, look at this (from their slide deck): Image
Sep 23, 2022 10 tweets 3 min read
Erlang's work on telephone systems in the early 20th century is foundational to how we think about, and build, distributed and cloud systems 100 years later. How can this work, done before modern computing was even a field, be so important? Erlang spent a big part of his career thinking about how to apply statistics to the behavior of telephone networks. Given a certain amount of capacity, how many customers can this telephone system serve? As our customer base grows, how will the quality of service they get change?
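One way to make the capacity question concrete is the Erlang B formula, one of Erlang's best-known results. The sketch below is an illustration, not something taken from the thread: it assumes Poisson call arrivals and that blocked calls are simply dropped, and the function names are hypothetical.

```python
def erlang_b(offered_load: float, lines: int) -> float:
    """Probability that a new call finds every line busy (Erlang B).

    offered_load is in erlangs (arrival rate times mean call duration).
    Uses the standard recurrence, which avoids large factorials.
    """
    blocking = 1.0  # with zero lines, every call is blocked
    for m in range(1, lines + 1):
        blocking = (offered_load * blocking) / (m + offered_load * blocking)
    return blocking


def lines_needed(offered_load: float, target_blocking: float) -> int:
    """Smallest number of lines that keeps blocking at or below the target."""
    lines = 0
    while erlang_b(offered_load, lines) > target_blocking:
        lines += 1
    return lines
```

Calling `lines_needed` with a growing `offered_load` shows how much capacity has to be added to hold quality of service constant as the customer base grows.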
Sep 20, 2022 5 tweets 1 min read
I've finally emptied my side project stack! With this new adjustable outfeed roller, I can finally rip the long board that needs cutting and get on with the actual project. Image It's a mixture of aluminium, 3D-printed parts in polycarbonate and nylon, brass, and steel.
Sep 17, 2022 9 tweets 2 min read
To be more specific, here are some example simulation results comparing (strict) serializable to (strong) SI. What we're measuring here is % transaction commit success, for different read and write set sizes (for a workload with loads of contention). Image Note how SI is only sensitive to write set sizes (|W|), because it cares only about write-write conflicts. Serializability is sensitive to both read (|R|) and write set sizes, because it cares about read-write conflicts.
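A rough sketch of how such a comparison could be simulated follows. It is not the simulation behind the results above. It assumes a much simpler model: transaction T1 has already committed, T2 ran concurrently and now tries to commit, keys are drawn uniformly at random, and the serializable check treats any overlap between T1's writes and T2's reads or writes as a conflict. `commit_rates` and its parameters are hypothetical names.

```python
import random


def commit_rates(read_size, write_size, keys=200, trials=20000, seed=1):
    """Return (SI commit rate, serializable commit rate) for transaction T2."""
    rng = random.Random(seed)
    si_commits = 0
    ser_commits = 0
    for _ in range(trials):
        w1 = set(rng.sample(range(keys), write_size))  # T1's write set
        r2 = set(rng.sample(range(keys), read_size))   # T2's read set
        w2 = set(rng.sample(range(keys), write_size))  # T2's write set
        write_write = bool(w1 & w2)
        read_write = bool(w1 & r2)
        if not write_write:
            si_commits += 1      # SI only checks write-write conflicts
        if not (write_write or read_write):
            ser_commits += 1     # this model also rejects read-write conflicts
    return si_commits / trials, ser_commits / trials
```

Sweeping `read_size` while holding `write_size` fixed should leave the first number roughly flat and push the second one down, which is the shape described above.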
Sep 1, 2022 10 tweets 3 min read
Histograms are rightfully a popular tool for visualizing and thinking about latency. But I believe that empirical cumulative distribution functions (eCDFs) are almost always a better choice. Let's look at an example to understand why. This highly bimodal distribution: What the histogram shows us is that there's a strong mode somewhere just above zero, and another around 2.6ms. So far so good, it's easy to read this off the histogram.
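For readers who haven't built one, an eCDF is just the sorted samples paired with the fraction of samples at or below each value. A minimal sketch, with hypothetical function names and no plotting:

```python
from bisect import bisect_right


def ecdf(samples):
    """Return (value, cumulative fraction) pairs, sorted by value."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_at_or_below(samples, threshold):
    """Read one point off the eCDF: the share of samples <= threshold."""
    xs = sorted(samples)
    return bisect_right(xs, threshold) / len(xs)
```

With made-up latencies such as `[0.1, 0.2, 0.1, 2.6, 2.7, 0.15, 2.5, 0.12]` (in ms), `fraction_at_or_below(latencies, 1.0)` gives the share of requests in the fast mode directly, without any choice of bin width.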
Aug 18, 2022 5 tweets 2 min read
That it's theoretically impossible to build highly-available strongly-consistent scale-out databases (due to the CAP theorem). The trick is in the definitions. Compare this definition of CAP "A" availability (from Gilbert and Lynch) to what most people will assume you mean when you say "highly available". "Every request received by a non-failing node must result in a response" 🤔 Image
Aug 11, 2022 5 tweets 2 min read
When do you want backoff and jitter, and when do you want adaptive retries? Are they just two ways to do the same thing, or is there something different about them? New blog post: brooker.co.za/blog/2022/08/1… This is a follow-up to my Builder's Library article about backoff and jitter (aws.amazon.com/builders-libra…) and my last post about retries (brooker.co.za/blog/2022/02/2…).
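As a reminder of what the two mechanisms look like in code, here is a minimal sketch. It assumes the "full jitter" variant of backoff and a token-bucket retry budget as one common form of adaptive retry. The names, defaults, and refill rule are illustrative assumptions, not the API of any particular SDK.

```python
import random


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based).

    Picks uniformly between zero and the capped exponential backoff.
    """
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


class RetryBudget:
    """Token bucket: each retry spends a token, each success earns part of one."""

    def __init__(self, capacity=10.0, refill_per_success=0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success

    def record_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def try_acquire_retry(self):
        """Return True and spend a token if a retry is allowed right now."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The difference in behavior is visible in the shapes: the delay function only spreads retries out in time, while the budget caps how many retries are sent at all when most calls are failing.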
Aug 3, 2022 5 tweets 2 min read
My mental shortcut for estimating the effect of multitenancy is that the percentile-to-mean ratio drops approximately with sqrt(N). This isn't quite true, but is close enough for most estimation purposes. Image The graph shows the ratio between the mean load and 99th percentile load for a fleet of machines. As you can see, it's not quite a linear relationship with sqrt(N), but not too far off.
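A quick way to check the shortcut is to simulate it. The sketch below is not the code behind the graph. It assumes tenant loads are independent and exponentially distributed with mean 1, and `p99_to_mean` is a hypothetical helper.

```python
import random


def p99_to_mean(num_tenants, trials=2000, seed=7):
    """Ratio of 99th percentile to mean of the summed load of N tenants."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.expovariate(1.0) for _ in range(num_tenants))
        for _ in range(trials)
    )
    mean = sum(totals) / trials
    p99 = totals[int(0.99 * trials)]
    return p99 / mean
```

Under these assumptions the ratio falls quickly as `num_tenants` grows and flattens out towards 1, since for independent tenants it is the excess over the mean that shrinks roughly like 1/sqrt(N). Correlated tenants would behave worse.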
Aug 2, 2022 7 tweets 3 min read
"Use One Big Server" (specbranch.com/posts/one-big-…) is trending, and while it's mostly reasonable it misses a couple of big things (or dismisses them). But there's one misconception here that I wanted to talk about: peak load. Image The economics of multitenancy don't work that way. The reality is that the peak load of multitenant systems is almost always much lower than the sum of the peak load of their tenants. This is simply because the peaks don't happen at the same time.
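The point about peaks can be illustrated with a toy model. The sketch assumes each tenant has a flat base load plus one bell-shaped daily peak at a random hour. All shapes, numbers, and names here are made up for illustration.

```python
import math
import random

HOURS = 24


def tenant_load(peak_hour, base=1.0, peak=10.0, width=2.0):
    """Hourly load for one tenant: a base level plus a single daily peak."""
    return [
        base + peak * math.exp(-((h - peak_hour) ** 2) / (2 * width ** 2))
        for h in range(HOURS)
    ]


def peaks(num_tenants=50, seed=3):
    """Return (sum of per-tenant peaks, peak of the combined load)."""
    rng = random.Random(seed)
    loads = [tenant_load(rng.uniform(0, HOURS - 1)) for _ in range(num_tenants)]
    sum_of_peaks = sum(max(load) for load in loads)
    combined = [sum(load[h] for load in loads) for h in range(HOURS)]
    return sum_of_peaks, max(combined)
```

The first number is what you would provision if every tenant ran on its own server sized for its own peak. The second is what the shared system actually has to handle.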
Jul 20, 2022 4 tweets 2 min read
I enjoyed reading this post by @JakubMikians: jakub-m.github.io/2022/07/17/lap… It brings together some ideas from the S3 team's lightweight formal methods paper, with Lamport clocks. It's also a good excuse to revisit "Time, Clocks and the Ordering of Events in a Distributed System" microsoft.com/en-us/research…
Jul 18, 2022 5 tweets 2 min read
If you're interested in correctness of distributed systems, you'll likely enjoy "Demystifying and Checking Silent Semantic Violations in Large Distributed Systems" usenix.org/system/files/o… from folks at JHU. "A vexing problem occurs when a system is operational but silently breaks its semantics without apparent anomalies." Indeed! Systems don't always break in obvious ways (in fact that may be the exception).
Jun 21, 2022 10 tweets 4 min read
Joe's right about this. But why do caches lead to long outages? Let's explore one reason with a small simulation, starting with a really simple two-tier system, and seeing what happens when a cache gets emptied. First, our architecture. There's a client offering some rate of requests, an LRU cache with a limited size, and a backend database that can handle some fixed rate of requests (less than the client offers). Image
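A stripped-down version of that architecture fits in a few lines. This is a sketch under stated assumptions rather than the simulation from the thread: time moves in discrete ticks, keys are requested uniformly at random, the backend serves a fixed number of misses per tick, and requests beyond that fail without filling the cache. All names and defaults are hypothetical.

```python
import random
from collections import OrderedDict


def simulate(ticks=100, offered=100, backend_capacity=20,
             num_keys=300, cache_size=300, seed=11):
    """Start with an empty LRU cache and return the success rate per tick."""
    rng = random.Random(seed)
    cache = OrderedDict()  # insertion order doubles as LRU order
    success_rates = []
    for _ in range(ticks):
        served = 0
        backend_used = 0
        for _ in range(offered):
            key = rng.randrange(num_keys)
            if key in cache:
                cache.move_to_end(key)  # mark as most recently used
                served += 1
            elif backend_used < backend_capacity:
                backend_used += 1
                served += 1
                cache[key] = True       # fill the cache on a successful miss
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
            # Otherwise the backend is saturated: the request fails and
            # nothing is cached, which is what slows recovery.
        success_rates.append(served / offered)
    return success_rates
```

With these defaults the first ticks after the cache empties serve only a fraction of the offered load, and the success rate climbs back only as fast as the backend can refill the cache.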
Jun 9, 2022 10 tweets 2 min read
Meta's "Cache Made Consistent" engineering.fb.com/2022/06/08/cor… paper covers what seems like some cool work on observability and correctness. But I think they're understating what it is that fundamentally makes caches difficult. Why are caches interesting? They offer cheaper, faster, or more scalable access to data. They do that with locality, distribution, incompleteness ("just the working set"), specialization (e.g. materialized views), etc.
Jun 2, 2022 11 tweets 2 min read
New blog post: "Formal Methods Only Solve Half My Problems" brooker.co.za/blog/2022/06/0… about the need for tools that allow us to reason quickly, and quantitatively, about distributed systems at the design stage. Formal tools (like TLA+ and P) have proven to be extremely useful during the design stage of large systems, mostly to demonstrate safety properties ("nothing bad happens") and liveness properties ("something good eventually happens").
May 16, 2022 7 tweets 2 min read
If you use Goodhart's Law ("Any measure used as a target stops being a good measure") as a tool for thinking, you'll probably enjoy "Categorizing Variants of Goodhart’s Law" arxiv.org/pdf/1803.04585… They break the Goodhart phenomenon into four categories, and explore the dynamics that drive each category. Most enlightening for me was that "adversarial" activity (where a person or agent actively thwarts measurement) isn't necessary in three of the four categories.
Apr 29, 2022 8 tweets 1 min read
Erasure coding really is a great, and under-used, technique for reducing tail latency in systems that fetch data. Say you're trying to fetch 1GB of data. The simplest way to do that is to fetch it all from one place, but then you're bound by the bandwidth offered by that one source.
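To see why this helps the tail, here is a small simulation sketch. It assumes a k-of-n code (any k of n shards rebuild the object, each shard is 1/k of the size), independent sources, and a made-up latency model where a fetch is occasionally ten times slower. The names and numbers are illustrative only.

```python
import random


def fetch_time(rng, size=1.0, slow_probability=0.05, slow_factor=10.0):
    """Time to fetch `size` units from one source, with occasional slow fetches."""
    t = size * rng.uniform(0.8, 1.2)
    if rng.random() < slow_probability:
        t *= slow_factor
    return t


def p99(values):
    ordered = sorted(values)
    return ordered[int(0.99 * len(ordered))]


def compare(k=4, n=6, trials=10000, seed=5):
    """Return (p99 of single-source fetch, p99 of k-of-n erasure-coded fetch)."""
    rng = random.Random(seed)
    single = [fetch_time(rng) for _ in range(trials)]
    coded = []
    for _ in range(trials):
        # Request all n shards in parallel; done when the k-th fastest arrives.
        shard_times = sorted(fetch_time(rng, size=1.0 / k) for _ in range(n))
        coded.append(shard_times[k - 1])
    return p99(single), p99(coded)
```

The coded fetch wins twice under this model: each shard is smaller, and a slow source only matters if more than n - k of them are slow at once.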
Mar 28, 2022 7 tweets 2 min read
After my thread last week about simulation, a bunch of people asked me for an example of a simple system simulation. So I wrote a basic one just to demonstrate the technique: github.com/mbrooker/simul… It helps explore a queue theory mystery that occurs each winter. Over the last few years, there's been a lot of talk online about huge queues at popular ski resorts, long waits, and big crowds. But the statistics show relatively modest increases in people actually going skiing. So what's up?
Mar 23, 2022 10 tweets 2 min read
One thing that a good postmortem (or COE) process can be is an opportunity to help convert system design from a "wicked" problem into a "kind" problem, accelerating learning and improvement. What does that mean? (thread) According to Hogarth et al, in a kind environment "feedback links outcomes directly to the appropriate actions or judgements". Learning in these environments is easier, because feedback guides you in the right direction.