For the last year, I've been running ~30 production @apachekafka clusters with ~1PB of overall storage.
These are the things that I've learnt: #apache #kafka
1. The biggest performance gains come from disk IOPS and throughput, more memory, and lower latency between brokers.
Not CPU. Even if you are sending 1-5MB messages and using SASL.
2. And yes, everything can break. Consumer offsets are tricky. Don't mess with them; manual changes are very risky (sketch below).
3. Always have a dev cluster. Always test your changes on it. Even the smallest ones.
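If you ever do have to touch offsets, at least look before you leap. A minimal read-only sketch using Kafka's Java AdminClient - the bootstrap server and group name are placeholders, not anything from my setup:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class OffsetInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "localhost:9092" and "my-consumer-group" are placeholders.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Read-only: list the committed offsets for one consumer group.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                admin.listConsumerGroupOffsets("my-consumer-group")
                     .partitionsToOffsetAndMetadata()
                     .get();
            offsets.forEach((tp, om) ->
                System.out.printf("%s -> committed offset %d%n", tp, om.offset()));
        }
    }
}
```

Actually moving offsets (alterConsumerGroupOffsets, or kafka-consumer-groups.sh --reset-offsets) is the risky part: do a --dry-run first and make sure the group is fully stopped.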
Most cloud providers let you increase a disk volume, but not decrease it. Plan resource allocation carefully.
The biggest difference compared to running RESTful services is that seeing the effect of a change takes a lot of time. Say you increase the partition count or thread allocation - the effect won't be immediate.
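For example, adding partitions is a one-line admin call, but the rebalances, metadata propagation and consumers catching up are what take time. A rough sketch with the Java AdminClient - the topic name and target count are made up:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class PartitionBump {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic "orders": grow to 24 partitions total.
            // Partition counts can only be increased, never decreased.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(24)))
                 .all()
                 .get();
        }
    }
}
```

The call returns quickly; the cluster settling down afterwards is what you end up waiting for.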
There are lots of potential bottlenecks, but the hardest one is JMX. Most of the metrics come from it. Don't expect sub-second responses on heavy clusters (>5000 partitions).
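If you poll JMX yourself, a single remote query looks roughly like this - the broker host and port 9999 are placeholders and depend on how JMX is exposed on your brokers:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe {
    public static void main(String[] args) throws Exception {
        // "broker-1:9999" is a placeholder JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Broker-level throughput meter exposed by Kafka.
            ObjectName messagesIn = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = mbs.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + rate);
        }
    }
}
```

One attribute is cheap; scraping the per-topic and per-partition MBeans is where the >5000-partition pain comes from.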
And yes, I have a love-hate relationship with Apache Kafka. I love it when it runs; I hate it when something breaks and it takes forever to find the root cause.