For the last year, I've been running ~30 production @apachekafka clusters with ~1PB of overall storage.
These are the things that I've learnt: #apache #kafka
1. The biggest performance gains come from disk IOPS and throughput, more memory, and lower latency between brokers.
Not CPU. Even if you are sending 1-5MB messages and using SASL.
2. And yes, everything can break. Consumer offsets are tricky. Don't mess with them; manual changes are very risky (sketch below).
3. Always have a dev cluster. Always test your changes on it. Even the smallest ones.
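If you ever do have to touch offsets, at least look before you leap. A minimal read-only sketch using Kafka's Java AdminClient - the bootstrap server and group name are placeholders, not anything from my setup:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class OffsetInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "localhost:9092" and "my-consumer-group" are placeholders.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Read-only: list the committed offsets for one consumer group.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                admin.listConsumerGroupOffsets("my-consumer-group")
                     .partitionsToOffsetAndMetadata()
                     .get();
            offsets.forEach((tp, om) ->
                System.out.printf("%s -> committed offset %d%n", tp, om.offset()));
        }
    }
}
```

Actually moving offsets (alterConsumerGroupOffsets, or kafka-consumer-groups.sh --reset-offsets) is the risky part: do a --dry-run first and make sure the group is fully stopped.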
Most cloud providers let you increase a disk volume, but not decrease it. Plan resource allocation carefully.
The biggest difference compared to running RESTful services is that seeing the effect of a change takes a lot of time. Say you increase the partition count or thread allocation - the effect won't be immediate.
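For example, adding partitions is a one-line admin call, but the rebalances, metadata propagation and consumers catching up are what take time. A rough sketch with the Java AdminClient - the topic name and target count are made up:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class PartitionBump {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic "orders": grow to 24 partitions total.
            // Partition counts can only be increased, never decreased.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(24)))
                 .all()
                 .get();
        }
    }
}
```

The call returns quickly; the cluster settling down afterwards is what you end up waiting for.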
There are lots of potential bottlenecks, but the hardest one is JMX. Most of the metrics come from it. Don't expect sub-second responses on heavy clusters (>5000 partitions).
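If you poll JMX yourself, a single remote query looks roughly like this - the broker host and port 9999 are placeholders and depend on how JMX is exposed on your brokers:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe {
    public static void main(String[] args) throws Exception {
        // "broker-1:9999" is a placeholder JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Broker-level throughput meter exposed by Kafka.
            ObjectName messagesIn = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = mbs.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + rate);
        }
    }
}
```

One attribute is cheap; scraping the per-topic and per-partition MBeans is where the >5000-partition pain comes from.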
And yes, I have a love-hate relationship with Apache Kafka. I love it when it runs; I hate it when something breaks and it takes forever to find the root cause.