Denise Rystsov
Nov 11, 2022
For a long time I've been thinking that using a closed loop (sync) for measuring latency is wrong
It's affected by the coordinated omission problem: imagine that all but one of the do_action executions take 1ms and the bad one takes one minute. If we look at p99.(9) we won't notice a problem and will assume that everything is fine, while the rogue request may have blocked the system.
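A minimal sketch of the effect, using a simulated clock instead of real calls. The names `closed_loop` and `percentile` and the workload (9,999 calls of 1 ms plus one call of one minute) are assumptions made for illustration, not the snippet from the original thread.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 0..100) of an already sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def closed_loop(service_times_ms):
    """Simulate one sync client: the next call starts only after the previous one returns."""
    clock_ms = 0
    latencies = []
    for service in service_times_ms:
        started = clock_ms
        clock_ms += service              # the client is blocked for the whole call
        latencies.append(clock_ms - started)
    return latencies, clock_ms

# Hypothetical workload: 9,999 calls of 1 ms and one rogue call of one minute.
service_times = [1] * 5000 + [60_000] + [1] * 4999
latencies, duration_ms = closed_loop(service_times)
latencies.sort()

p99 = percentile(latencies, 99)
p999 = percentile(latencies, 99.9)
worst = latencies[-1]
print(p99, p999, worst, duration_ms)
```

In this sketch both percentiles stay at 1 ms even though one call accounts for 60 of the roughly 70 seconds of the run; only the maximum reveals it.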
The "right" way out is to break the loop (async) and to issue the requests at a fixed rate (load). With this approach the rogue requests won't block the control flow and we notice the degradation
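The same hypothetical workload can be replayed against a simulated open loop. This is only a sketch under stated assumptions: the system is modelled as a single FIFO server, requests are scheduled every 2 ms, and latency is measured from the scheduled send time. `open_loop` and `percentile` are illustrative names.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 0..100) of an already sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def open_loop(service_times_ms, interval_ms):
    """Issue requests on a fixed schedule against a single FIFO server (an assumption)."""
    server_free_at = 0
    latencies = []
    for i, service in enumerate(service_times_ms):
        sent_at = i * interval_ms                # schedule does not depend on responses
        started = max(sent_at, server_free_at)   # the request queues if the server is busy
        server_free_at = started + service
        latencies.append(server_free_at - sent_at)
    return latencies

# Same hypothetical workload: one rogue call of one minute among 1 ms calls.
service_times = [1] * 5000 + [60_000] + [1] * 4999
latencies = sorted(open_loop(service_times, interval_ms=2))

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
slow = sum(1 for latency in latencies if latency > 1000)
print(p50, p99, slow)
```

Because the requests keep arriving while the rogue call is in flight, half of them queue behind it and the high percentiles now show the stall.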
But things are not black and white. For example, it's hard to use that to determine the throughput of the system:
- if the load is below the throughput it's below by definition :)
- if it's above then the system chokes and the latency skyrockets
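Both cases can be sketched with a simulated fixed-rate load. The model is an assumption: a single FIFO server with a constant 2 ms service time, so its capacity is 0.5 requests per ms. `open_loop_latencies` is a hypothetical helper name.

```python
def open_loop_latencies(n_requests, interval_ms, service_ms):
    """Fixed-rate requests against a single FIFO server with constant service time."""
    server_free_at = 0
    latencies = []
    for i in range(n_requests):
        sent_at = i * interval_ms
        started = max(sent_at, server_free_at)   # wait for the server if it is busy
        server_free_at = started + service_ms
        latencies.append(server_free_at - sent_at)
    return latencies

# Hypothetical system: 2 ms per request, i.e. capacity of 0.5 requests per ms.
below = open_loop_latencies(1000, interval_ms=4, service_ms=2)   # 0.25 req/ms offered
above = open_loop_latencies(1000, interval_ms=1, service_ms=2)   # 1 req/ms offered

print(max(below), above[0], above[-1])
```

Below capacity the latency stays flat and says nothing about the limit; above capacity the queue never drains, so the latency keeps growing for as long as the experiment runs.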
With the sync approach, on the other hand, it's pretty simple to see the max throughput. But first let's address the coordinated omission problem. First we chart the reported data
Then we break the time into buckets and measure the number of operations per ms per bucket
When a rogue request takes a lot of time, it will be visible as a throughput drop: not only per bucket but, if the pause is big enough, in the full throughput too (all ops summed and divided by the duration of the experiment)
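The bucketing step can be sketched as follows. The completion timestamps are hypothetical (a sync client doing 1 ms operations with a single one-minute stall), and `throughput_per_bucket` and the 1-second bucket size are illustrative choices.

```python
import math
from collections import Counter

def throughput_per_bucket(completion_times_ms, bucket_ms, duration_ms):
    """Operations per ms in each consecutive time bucket of width bucket_ms."""
    counts = Counter(t // bucket_ms for t in completion_times_ms)
    n_buckets = math.ceil(duration_ms / bucket_ms)
    return [counts[b] / bucket_ms for b in range(n_buckets)]

# Hypothetical completion timestamps (ms): 5,000 ops finish at 1..5000, then a
# rogue op started at 5000 finishes at 65000, followed by 4,999 more 1 ms ops.
completions = list(range(1, 5001)) + list(range(65_000, 70_000))
duration_ms = 70_000

per_bucket = throughput_per_bucket(completions, bucket_ms=1000, duration_ms=duration_ms)
full = len(completions) / duration_ms    # all ops divided by experiment duration

print(per_bucket[2], per_bucket[30], round(full, 3))
```

In this sketch the buckets covered by the stall report zero operations per ms, and the full throughput lands well below the 1 op/ms the client sustains when nothing is stuck.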
OK, but how do we measure the true throughput of a system? We repeat the experiment multiple times, each time increasing the number of parallel sync clients.
Then we plot full throughput and latency (p50, p99, it doesn't matter) against the number of clients.
What's cool about the chart is that it always has the same shape: when the system reaches the saturation point, the throughput converges to the max throughput and the latency starts to grow linearly.
The magic of the sync client is its natural feedback loop: instead of choking the system the way the async approach does, the sync load adjusts to the system capacity and we get a clear picture.
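The sweep over the number of sync clients can be sketched with a small simulation. The system model is an assumption made for illustration: 4 parallel workers with a constant 2 ms service time and a FIFO queue, which gives a max throughput of 2 ops per ms. `run_sync_clients` is a hypothetical name.

```python
import heapq
import statistics

def run_sync_clients(clients, workers, service_ms, duration_ms):
    """Simulate N sync clients against `workers` parallel slots with a FIFO queue.

    Returns (full throughput in ops per ms, median latency in ms).
    """
    ready = [(0, c) for c in range(clients)]   # (time the client sends next, client id)
    heapq.heapify(ready)
    free_at = [0] * workers                    # min-heap of times each worker frees up
    latencies = []
    completed = 0
    while True:
        sent_at, client = heapq.heappop(ready)
        if sent_at >= duration_ms:
            break
        done = max(sent_at, heapq.heappop(free_at)) + service_ms
        heapq.heappush(free_at, done)
        heapq.heappush(ready, (done, client))  # feedback loop: send again only after the reply
        latencies.append(done - sent_at)
        if done <= duration_ms:
            completed += 1
    return completed / duration_ms, statistics.median(latencies)

# Repeat the experiment with an increasing number of parallel sync clients.
results = {
    n: run_sync_clients(n, workers=4, service_ms=2, duration_ms=1000)
    for n in (1, 2, 4, 8, 16)
}
for n, (throughput, p50) in results.items():
    print(n, throughput, p50)
```

Under these assumptions the throughput rises with the client count until the four workers are busy and then stays flat, while the median latency doubles each time the client count doubles past that point.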
For example, I charted Redpanda's transactional performance this way. The max throughput (for the cluster I used) is 10k distributed transactions per sec
