Clare Liguori

@clare_liguori

https://twitter.com/colmmacc/status/1034502453385822208

Colm's thread on shuffle sharding reminded me of how important it is that clients participate in fault tolerance, and how frustrated I get when a client library *doesn't* do this by default in my application. Let's talk about some best practices!

There are three important behaviors for fault-tolerant clients:
1) Retry
2) Timeout
3) Backoff
Good client libraries have knobs for each one, so you can tune for your application's needs.

Retries are a must-have! They'll most likely get your request directed to a healthy node if one is having issues, and will help you weather any transient network issues.

Great clients have logic that retries only *some* kinds of failures, like connection errors and HTTP 500s, and doesn't retry on errors that are likely non-transient, like HTTP 400s.
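As a rough Python sketch of that "retry only some failures" idea: `should_retry` and `RETRYABLE_STATUS` are hypothetical names, and treating 502/503/504 as retryable alongside 500 is an assumption to tune for your service, not a rule from any particular library.

```python
from typing import Optional

# Assumption: these 5xx codes are treated as transient; adjust for your service.
RETRYABLE_STATUS = frozenset({500, 502, 503, 504})


def should_retry(status: Optional[int] = None,
                 error: Optional[BaseException] = None) -> bool:
    """Return True when a failure looks transient and is worth retrying."""
    if error is not None:
        # Connection-level failures (refused, reset, timed out) are usually
        # transient; anything else is treated as a non-retryable error.
        return isinstance(error, (ConnectionError, TimeoutError))
    # 4xx responses mean the request itself is wrong, so retrying won't help.
    return status in RETRYABLE_STATUS
```

With this, `should_retry(status=503)` says yes, while `should_retry(status=400)` says no.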

Timeouts are important so that 1) you get the opportunity to retry! and 2) slow requests don't hog all your available threads waiting on a response

There are usually different kinds of timeouts you can set in good clients, with sane defaults: connection timeout, socket timeout, read timeout, write timeout, individual request timeout, overall timeout including retries, etc.

My favorite timeout setting (read: the one that has bitten me many, many times) is probably socket timeouts i.e. the amount of time a request's connection can sit there idle before the request gives up and fails

On many systems, a dead socket won't time out by default for ~2 hours (yes, that's HOURS). As in, get a stuck request, go off to dinner, come back, and your request will still be sitting there idle, hogging a thread!

You can ratchet this down a bit at the system level by configuring keep-alives (see the Redshift guidance below), but in general you'll want to configure timeouts at the application and client level based on your needs.
docs.aws.amazon.com/redshift/lates…

Many systems and client libraries will use a socket read timeout of infinity by default! As in, FOREVERRRR ... until the next application restart. I am not waiting around forever, no thanks, my application has better things to do!
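Here's a minimal Python sketch of setting connect and read timeouts explicitly on a plain TCP socket. Python's standard `socket` objects block with no timeout unless you set one. `read_with_timeout` and its default values are hypothetical, picked only for illustration.

```python
import socket


def read_with_timeout(host: str, port: int,
                      connect_timeout: float = 5.0,
                      read_timeout: float = 2.0,
                      nbytes: int = 1024) -> bytes:
    """Connect and read once, failing fast instead of blocking forever."""
    # The timeout passed here bounds the connection attempt.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Bound each subsequent blocking call (recv/send) on this socket.
        sock.settimeout(read_timeout)
        # Raises socket.timeout if nothing arrives within read_timeout seconds.
        return sock.recv(nbytes)
    finally:
        sock.close()
```

If the peer accepts the connection but never answers, the `recv` call raises `socket.timeout` after `read_timeout` seconds, which hands control back to your retry logic.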

Moving on: Backoffs! Backoffs can be as simple as a sleep(1) in your retry loop, but exponential backoffs will give you the most bang for your buck. Requests during short issues will get retried quickly and then succeed, and longer issues won't require a ton of retries.
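A small Python sketch of a retry loop with exponential backoff, under a few assumptions: `call_with_retries` is a hypothetical helper, only connection errors and timeouts are retried, and the delay doubles from `base` up to a `cap`. The `sleep` parameter exists only so the delays can be swapped out in tests.

```python
import time


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Delay grows as base, 2*base, 4*base, ... but never exceeds cap.
            sleep(min(cap, base * 2 ** attempt))
```

Short blips get a quick second try after `base` seconds, while a longer outage costs only a handful of increasingly spaced attempts.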

The AWS SDKs are great examples of fault-tolerant client libraries, with configurable retries, exponential backoff, and timeouts. The defaults are a good start, but remember to monitor and tune for best performance
aws.amazon.com/blogs/develope…

Many other client libraries now have fault-tolerant options by default too! Apache HttpClient 4.x for example has DefaultHttpRequestRetryHandler and DefaultBackoffStrategy, but set the timeouts explicitly (otherwise, you could be waiting for hours, y'all, HOURS)

Two clients that repeatedly trip me up wrt fault tolerance:
~~~ ssh and curl! ~~~
Super common tools, but both require non-default config to be used in a reliable application. Think build scripts, automated ops tools, and monitoring canaries that you want to withstand failures.

So, for resilient tooling:

Add this to your SSH config:
Host *
ConnectTimeout 10
ConnectionAttempts 10

And use these curl options:
curl --retry 3 --connect-timeout 10 --max-time 20 --retry-max-time 30

(tune the exact numbers for your application's needs)

In the containers world, I'm excited about using sidecar proxies like Envoy to help applications set sane retries and timeouts, regardless of the application's client libraries. Notice in this example: no special curl flags are required to enable retries! blog.christianposta.com/microservices/…

So there's my brain dump about client-side fault tolerance. What other best practices are out there?
