Latest Twitter Threads by @MarcJBrooker on Thread Reader App

Jan 10 • 7 tweets • 2 min read

Great question.

Aurora DSQL is a strongly consistent, synchronously replicated, synchronously conflict-resolved, active-active database.

Amazon MemoryDB is an eventually consistent, asynchronously replicated, asynchronously conflict-resolved, active-active database.

🧵

https://twitter.com/namiazad/status/1877553653336174782

That means DSQL gets strong consistency and isolation for all readers and writers, and no data is lost on region failure.

The trade-offs are with latency, and availability. On commit, a multi-region DSQL needs to wait for data replication before the commit succeeds.

Aug 1, 2023 • 7 tweets • 2 min read

15 years! I started at AWS, with the EC2 team in Cape Town, on the 1st of August 2008. It's been a real pleasure to have a front row seat for the growth of cloud, to be involved in the genesis of serverless, and to have exciting problems to work on every day. Some memories: At lunch on my first day (at a hemp-themed restaurant across the road from CPT1) the team explained how our upcoming 'EBS' product was going to work. I honestly didn't believe them. It just seemed too magical to present low-level block storage to machines only in software.

Nov 2, 2022 • 15 tweets • 4 min read

A couple weeks back, I did a talk titled "Distributed Systems Solve Only Half My Problems (and I have a lot of problems)" at HPTS'22. Talks at HPTS aren't recorded, so here's a summary of what I said. I started talking about our journey with Formal Methods at AWS since we published our CACM paper in 2015. cacm.acm.org/magazines/2015…

Nov 1, 2022 • 5 tweets • 2 min read

Today (well, last night) on "most extreme side project": fixing a $3 pair of kid's safety scissors.

The pivot between the blades (plastic ☹️) broke and was lost. No matter! We can take out some Aluminium round, turn it to the right size, tap it to M3 (that's 3mm, around 1/8") and combine it with a little screw. M3 isn't so tiny as taps go, but still feels fragile.

Sep 29, 2022 • 4 tweets • 2 min read

Jack @vanlightly and Markus Kuppe (@lemmster) have been doing some great work adding simulations to TLA+: conf.tlapl.us/2022/JackMarku… Use TLA+ to not only check safety and liveness properties, but also understand the statistical behavior of systems! This reduces work (only one spec to write), reduces mistakes (same spec can be checked for safety and liveness properties), and makes simulation more accessible. I mean, look at this (from their slide deck):

Sep 23, 2022 • 10 tweets • 3 min read

Erlang's work on telephone systems in the early 20th century is foundational to how we think about, and build, distributed and cloud systems 100 years later. How can this work, done before modern computing was even a field, be so important?

https://twitter.com/MarcJBrooker/status/1573109939475816450

Erlang spent a big part of his career thinking about how to apply statistics to the behavior of telephone networks. Given a certain amount of capacity, how many customers can this telephone system serve? As our customer base grows, how will the quality of service they get change?
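
The preview above does not say which of Erlang's results the thread goes on to cover. As an illustration of the capacity question it poses, here is a minimal sketch of the Erlang B formula, which gives the probability that a call is blocked when a given offered load (in erlangs) arrives at a fixed number of lines. The function and parameter names are my own, and the usual Erlang B assumptions apply (Poisson arrivals, blocked calls are dropped rather than queued).

```python
def erlang_b(offered_load: float, lines: int) -> float:
    """Blocking probability for `lines` circuits at `offered_load` erlangs.

    Uses the standard recurrence B(E, 0) = 1 and
    B(E, m) = E * B(E, m-1) / (m + E * B(E, m-1)),
    which avoids computing large factorials directly.
    """
    blocking = 1.0  # with zero lines, every call is blocked
    for m in range(1, lines + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking
```

For example, `erlang_b(2.0, 2)` is 0.4: two lines offered two erlangs of traffic turn away 40% of calls. Adding lines at the same load drives that number down quickly, which is the kind of capacity-versus-quality trade-off the questions above are about.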

Sep 20, 2022 • 5 tweets • 1 min read

I've finally emptied my side project stack! With this new adjustable outfeed roller, I can finally rip the long board that needs cutting and get on with the actual project.

It's a mixture of aluminium, 3d printed parts in polycarbonate and nylon, brass, and steel.

Sep 17, 2022 • 9 tweets • 2 min read

To be more specific, here are some example simulation results comparing (strict) serializable to (strong) SI. What we're measuring here is % transaction commit success, for different read and write set sizes (for a workload with loads of contention).

https://twitter.com/MarcJBrooker/status/1571135391230742528

Note how SI is only sensitive to write set sizes (|W|), because it cares only about write-write conflicts. Serializability is sensitive to both read (|R|) and write set sizes, because it cares about read-write conflicts.
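
The simulation behind those results is not shown here, so the following is only a rough sketch of the idea under assumptions of my own: keys are drawn uniformly from a fixed keyspace, each transaction races against exactly one concurrent committed transaction, SI aborts on write-write overlap, and the serializable check aborts when the transaction's reads overlap the concurrent writes. All names and parameters are hypothetical.

```python
import random


def si_conflict(writes, concurrent_writes):
    """Snapshot isolation: abort only on a write-write conflict."""
    return bool(writes & concurrent_writes)


def serializable_conflict(reads, concurrent_writes):
    """Simplified serializable validation: abort on a read-write conflict."""
    return bool(reads & concurrent_writes)


def commit_rate(n_reads, n_writes, level, keyspace=1000, trials=5000, seed=1):
    """Fraction of transactions that commit at isolation `level` ("si" or "ser")."""
    rng = random.Random(seed)
    commits = 0
    for _ in range(trials):
        reads = set(rng.sample(range(keyspace), n_reads))
        writes = set(rng.sample(range(keyspace), n_writes))
        # Write set of one concurrent transaction that committed first.
        concurrent_writes = set(rng.sample(range(keyspace), n_writes))
        if level == "si":
            aborted = si_conflict(writes, concurrent_writes)
        else:
            aborted = serializable_conflict(reads, concurrent_writes)
        commits += not aborted
    return commits / trials
```

In this toy model, growing `n_reads` leaves `commit_rate(..., "si")` essentially unchanged but pulls `commit_rate(..., "ser")` down, while growing `n_writes` hurts both.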

Sep 1, 2022 • 10 tweets • 3 min read

Histograms are rightfully a popular tool for visualizing and thinking about latency. But I believe that empirical cumulative distribution functions (eCDFs) are almost always a better choice. Let's look at an example to understand why. This highly bimodal distribution:

What the histogram shows us is that there's a strong mode somewhere just above zero, and another around 2.6ms. So far so good, it's easy to read this off the histogram.
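
For readers who have not built one, an eCDF needs no binning at all: sort the samples and pair each with the fraction of samples at or below it. A minimal sketch follows; the function names are mine, and the bimodal data behind the plots above is not reproduced.

```python
def ecdf(samples):
    """Return (value, cumulative fraction) points for an eCDF plot."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def fraction_at_or_below(samples, threshold):
    """Read one point off the eCDF: the share of samples <= threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)
```

A question like "what fraction of requests finished within 1ms?" is then a single lookup, `fraction_at_or_below(latencies_ms, 1.0)`, rather than a sum over histogram bars.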

Aug 18, 2022 • 5 tweets • 2 min read

That it's theoretically impossible to build highly-available strongly-consistent scale-out databases (due to the CAP theorem).

https://twitter.com/jonathaneyer/status/1560122825553260545

The trick is in the definitions. Compare this definition of CAP "A" availability (from Gilbert and Lynch) to what most people will assume you mean when you say "highly available". "Every request received by a non-failing node must result in a response" 🤔

Aug 11, 2022 • 5 tweets • 2 min read

When do you want backoff and jitter, and when do you want adaptive retries? Are they just two ways to do the same thing, or is there something different about them? New blog post: brooker.co.za/blog/2022/08/1… This is a follow-up to my Builder's Library article about backoff and jitter (aws.amazon.com/builders-libra…) and my last post about retries (brooker.co.za/blog/2022/02/2…).
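
As a rough illustration of the two techniques being compared (not code from the linked posts), here is a sketch of exponential backoff with full jitter next to a token-bucket style adaptive retry limiter. The class name, method names, and default parameters are all assumptions chosen for the example.

```python
import random


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to wait before retry `attempt` (0-based), using "full jitter":
    a uniform draw between 0 and the capped exponential backoff."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class RetryTokenBucket:
    """Adaptive retry sketch: each retry spends a token, each success refunds
    a fraction of one, so retries dry up when most calls are failing."""

    def __init__(self, capacity=10.0, retry_cost=1.0, success_refund=0.1):
        self.capacity = capacity
        self.retry_cost = retry_cost
        self.success_refund = success_refund
        self.tokens = capacity

    def try_acquire_retry(self):
        """Return True and spend tokens if a retry is allowed right now."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self):
        """Refill a little on every successful call, up to capacity."""
        self.tokens = min(self.capacity, self.tokens + self.success_refund)
```

One way to read the difference: backoff changes when a retry is sent, while the token bucket changes whether it is sent at all.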

Aug 3, 2022 • 5 tweets • 2 min read

My mental shortcut for estimating the effect of multitenancy is that the percentile-to-mean ratio drops approximately with sqrt(N). This isn't quite true, but is close enough for most estimation purposes.

https://twitter.com/jindalabhilash/status/1554661576770146304

The graph shows the ratio between the mean load and 99th percentile load for a fleet of machines. As you can see, it's not quite a linear relationship with sqrt(N), but not too far off.
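
The data behind that graph is not available here, but the shortcut is easy to try out with a small simulation. The sketch below assumes each tenant's load is an independent exponential random variable with mean 1 (my assumption, chosen only for convenience) and measures the 99th-percentile-to-mean ratio of the total load on a machine shared by `n_tenants`.

```python
import random


def p99_to_mean(n_tenants, samples=2000, seed=7):
    """Ratio of 99th percentile to mean for the summed load of n tenants."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.expovariate(1.0) for _ in range(n_tenants))
        for _ in range(samples)
    )
    mean = sum(totals) / samples
    p99 = totals[samples * 99 // 100 - 1]  # 99th percentile by rank
    return p99 / mean
```

Comparing `p99_to_mean(1)`, `p99_to_mean(16)` and `p99_to_mean(64)` shows the ratio falling toward 1 as tenants are added. Under a normal approximation the excess over 1 shrinks in proportion to 1/sqrt(N), which is consistent with the "not quite, but close" behavior described above.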

Aug 2, 2022 • 7 tweets • 3 min read

"Use One Big Server" (specbranch.com/posts/one-big-…) is trending, and while it's mostly reasonable it misses a couple of big things (or dismisses them). But there's one misconception here that I wanted to talk about: peak load.

The economics of multitenancy don't work that way. The reality is that the peak load of multitenant systems is almost always much lower than the sum of the peak load of their tenants. This is simply because the peaks don't happen at the same time.
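
A toy example makes the point concrete. Assume (hypothetically) that every tenant has a flat baseline load of 1 with a single one-hour spike to 10, and that the spike hour is chosen at random per tenant. The sketch compares the capacity needed if each tenant is provisioned for its own peak with the capacity needed for the peak of the combined load.

```python
import random


def tenant_load(peak_hour, hours=24):
    """Hypothetical daily profile: baseline 1, spike of 10 during one hour."""
    return [10.0 if h == peak_hour else 1.0 for h in range(hours)]


def peak_comparison(n_tenants, seed=3):
    """Return (sum of per-tenant peaks, peak of the summed load)."""
    rng = random.Random(seed)
    loads = [tenant_load(rng.randrange(24)) for _ in range(n_tenants)]
    sum_of_peaks = sum(max(load) for load in loads)
    peak_of_sum = max(sum(hour) for hour in zip(*loads))
    return sum_of_peaks, peak_of_sum
```

The peak of the sum can never exceed the sum of the peaks, and the two are equal only if every tenant spikes in the same hour. With 50 tenants, `peak_comparison(50)` needs 500 units of capacity in the first case and far fewer in the second.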

Jul 20, 2022 • 4 tweets • 2 min read

I enjoyed reading this post by @JakubMikians: jakub-m.github.io/2022/07/17/lap… It brings together some ideas from the S3 team's lightweight formal methods paper, with Lamport clocks. It's also a good excuse to revisit "Time, Clocks and the Ordering of Events in a Distributed System" microsoft.com/en-us/research…

Jul 18, 2022 • 5 tweets • 2 min read

If you're interested in correctness of distributed systems, you'll likely enjoy "Demystifying and Checking Silent Semantic Violations in Large Distributed Systems" usenix.org/system/files/o… from folks at JHU. "A vexing problem occurs when a system is operational but silently breaks its semantics without apparent anomalies." Indeed! Systems don't always break in obvious ways (in fact that may be the exception).

Jun 21, 2022 • 10 tweets • 4 min read

Joe's right about this. But why do caches lead to long outages? Let's explore one reason with a small simulation, starting with a really simple two-tier system, and seeing what happens when a cache gets emptied.

https://twitter.com/_joemag_/status/1539084285881114624

First, our architecture. There's a client offering some rate of requests, an LRU cache with a limited size, and a backend database that can handle some fixed rate of requests (less than the client offers).
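
The thread's own simulation code is not included in this preview, so here is a minimal sketch of the same architecture under assumptions of my own: time advances in ticks, the client offers a fixed number of requests per tick, 90% of them go to a hot set of 500 keys, the backend can serve a fixed number of misses per tick, and a request that misses after the backend is saturated simply fails without filling the cache.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return True
        return False

    def put(self, key):
        self.items[key] = True
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def next_key(rng, hot_keys=500):
    """Hypothetical workload: 90% of requests go to a small hot set."""
    if rng.random() < 0.9:
        return rng.randrange(hot_keys)
    return rng.randrange(hot_keys, 1_000_000)


def simulate(ticks, offered, backend_capacity, cache, rng):
    """Return the number of successful requests in each tick."""
    successes = []
    for _ in range(ticks):
        budget = backend_capacity
        ok = 0
        for _ in range(offered):
            key = next_key(rng)
            if cache.get(key):
                ok += 1
            elif budget > 0:
                budget -= 1      # the backend serves the miss
                cache.put(key)   # and the result fills the cache
                ok += 1
            # otherwise the backend is saturated and the request fails
        successes.append(ok)
    return successes
```

To see the effect, pre-fill an `LRUCache(600)` with the hot keys, run `simulate(20, 100, 20, cache, rng)`, then call `cache.items.clear()` and run it again. While the cache is warm, nearly every request succeeds. Once it is emptied, the backend can refill only 20 keys per tick, so goodput collapses and recovers slowly.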

Jun 9, 2022 • 10 tweets • 2 min read

Meta's "Cache Made Consistent" engineering.fb.com/2022/06/08/cor… paper covers what seems like some cool work on observability and correctness. But I think they're understating what it is that fundamentally makes caches difficult. Why are caches interesting? They offer cheaper, faster, or more scalable access to data. They do that with locality, distribution, incompleteness ("just the working set"), specialization (e.g. materialized views), etc.

Jun 2, 2022 • 11 tweets • 2 min read

New blog post: "Formal Methods Only Solve Half My Problems" brooker.co.za/blog/2022/06/0… about the need for tools that allow us to reason quickly, and quantitatively, about distributed systems at the design stage. Formal tools (like TLA+ and P) have proven to be extremely useful during the design stage of large systems, mostly to demonstrate safety properties ("nothing bad happens") and liveness properties ("something good eventually happens").

May 16, 2022 • 7 tweets • 2 min read

If you use Goodhart's Law ("Any measure used as a target stops being a good measure") as a tool for thinking, you'll probably enjoy "Categorizing Variants of Goodhart’s Law" arxiv.org/pdf/1803.04585… They break the Goodhart phenomenon into four categories, and explore the dynamics that drive each category. Most enlightening for me was that "adversarial" activity (where a person or agent actively thwarts measurement) isn't necessary in three of the four categories.

Apr 29, 2022 • 8 tweets • 1 min read

Erasure coding really is a great, and under-used, technique for reducing tail latency in systems that fetch data. Say you're trying to fetch 1GB of data. The simplest way to do that is to fetch it all from one place, but then you're bound by the bandwidth offered by that one source.
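
The rest of the thread is not shown here, but the tail-latency argument can be sketched with a small simulation. Assume the data is split into k shards and erasure coded into n shards such that any k of them are enough to reconstruct it. The latency model (roughly 10ms per shard fetch, with a 5% chance of a 100ms straggler) and all names are hypothetical.

```python
import random


def shard_latency(rng):
    """Hypothetical fetch time in ms: usually ~10ms, sometimes a straggler."""
    straggler = 100.0 if rng.random() < 0.05 else 0.0
    return rng.expovariate(1 / 10) + straggler


def p99(values):
    """99th percentile by rank."""
    ordered = sorted(values)
    return ordered[len(ordered) * 99 // 100 - 1]


def compare(k=4, n=6, trials=5000, seed=11):
    """Return (p99 waiting for k plain shards, p99 waiting for any k of n)."""
    rng = random.Random(seed)
    plain, coded = [], []
    for _ in range(trials):
        latencies = [shard_latency(rng) for _ in range(n)]
        plain.append(max(latencies[:k]))        # need all k specific shards
        coded.append(sorted(latencies)[k - 1])  # done at the k-th arrival
    return p99(plain), p99(coded)
```

Without coding, one slow shard out of k delays the whole fetch. With coding, the fetch completes at the k-th fastest of n responses, so a couple of stragglers are simply ignored, at the cost of requesting n/k times as much data.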

Mar 28, 2022 • 7 tweets • 2 min read

After my thread last week about simulation, a bunch of people asked me for an example of a simple system simulation. So I wrote a basic one just to demonstrate the technique: github.com/mbrooker/simul… It helps explore a queue theory mystery that occurs each winter. Over the last few years, there's been a lot of talk online about huge queues at popular ski resorts, long waits, and big crowds. But the statistics show relatively modest increases in people actually going skiing. So what's up?
