A Kafka cluster in the cloud doing 30MB/s costs more than $110,000 a year.
A $1,000 laptop can do 10x that.
Where did we go wrong? 👇
The Cloud. Namely - its absurd networking charges 👎
Let’s break it down simply:
• AWS charges you $0.01/GB for data crossing AZs (but in the same region).
• They charge for each GB both in and out. Meaning each time a GB crosses AZs, you pay twice - once on the sender's side (outgoing) and once on the receiver's side (incoming).
• For a normal Kafka cluster with replication factor of 3 and a read fanout of 3x, you are going to be charged:
• 2x for 2/3 of the produce throughput (producing)
• 4x for 100% of the produce throughput (replicating it)
• 6x for 2/3 of the produce throughput (consuming it)
(but it can get a lot worse - read until the end to see)
Simple example:
• 3-broker cluster, each in a separate AZ
• 3 producers, each in a separate AZ
• 3 consumer groups with 3 consumers each, with each group's consumers spread across the 3 AZs (one per AZ)
The producers are producing 30MB/s in total to the same leader.
2 of the 3 producers are in a different AZ than the leader, so 20MB/s of produce traffic is being charged at cross-zone rates. 👌
It’s charged both on the OUT (producer’s side) and IN (broker’s side).
The leader is replicating the full 30MB/s to both of its replicas.
This is again being charged both on the OUT (leader’s side) and IN (follower’s side), for both replication links. (60MB/s)
Then, each of the 3 consumer groups has 3 consumers.
All consumers read from the leader, with 2/3 in a different zone.
This results in 20MB/s of consume traffic charged at cross-zone rates PER GROUP. (60MB/s total)
Again charged both on the OUT (broker’s side) and IN (consumer’s side).
The total amounts to 140MB/s worth of cross-AZ traffic. Charged both ways.
At $0.00001 per MB, charged on both sides (so 280MB/s billed), this means we’re paying $0.0028/s. 🤔
That’s:
• $241 a day 😕
• $7,500 a month 😥
• $88,300 a year 🤯
It all goes down the drain on network traffic ALONE. 🔥
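The arithmetic above can be sketched in a few lines of Python. This is only napkin math: the function and variable names are made up for illustration, and it assumes 1 GB = 1000 MB, a 365-day year, and the $0.01/GB per-side rate quoted earlier.

```python
# Napkin-math sketch of the cross-AZ bill. Names are illustrative only.
RATE_PER_GB = 0.01        # $/GB, charged once on the sender and once on the receiver
MB_PER_GB = 1000          # assumption: decimal units
SECONDS_PER_DAY = 86_400


def cross_az_mb_per_s(produce_mb_s, replication_factor=3, consumer_groups=3, azs=3):
    """MB/s crossing an AZ boundary when all traffic goes through one leader."""
    remote_share = (azs - 1) / azs                            # 2/3 of clients are in another AZ
    produce = produce_mb_s * remote_share                     # 20 MB/s
    replication = produce_mb_s * (replication_factor - 1)     # 60 MB/s
    consume = produce_mb_s * remote_share * consumer_groups   # 60 MB/s
    return produce + replication + consume


def yearly_cost(cross_az_mb_s):
    billed_mb_s = cross_az_mb_s * 2                           # charged on OUT and on IN
    per_second = billed_mb_s / MB_PER_GB * RATE_PER_GB
    return per_second * SECONDS_PER_DAY * 365


traffic = cross_az_mb_per_s(30)
print(f"{traffic:.0f} MB/s cross-AZ, about ${yearly_cost(traffic):,.0f} a year")
```

With 30 as the input this lands on the same 140MB/s and roughly $88,300 a year as the walkthrough above.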
What about the hardware?
Quick napkin math assuming:
• 7 day retention
• all of the data is on EBS (not using tiered storage since it's not GA yet)
• keeping 50% of the disk free for operational purposes (don't ask me what happens if we run out of disk)
• the 3 brokers are running modest r4.xlarge instances (kinda overkill but hey, why not)
We'd pay:
• $19,440/yr for the EBS storage
• $6,990/yr for the EC2 instances
That’s right - you’re paying just $26.4k/yr for the hardware and $88.3k/yr for the network (3.3x the hardware).
For a total of $115k/yr. 💸
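The thread does not spell out the unit prices behind those two hardware numbers. One set of assumptions that happens to reproduce them is sketched below: about 18 TB of retained data per replica, cold-HDD-style EBS at around $0.015 per GB-month, and an r4.xlarge at around $0.266 an hour on-demand. Those prices are assumptions for illustration, not figures from the thread, and AWS pricing varies by region and over time.

```python
# Hypothetical unit prices, chosen because they reproduce the totals above.
EBS_PRICE_GB_MONTH = 0.015   # assumption: cold HDD (sc1-style) pricing
EC2_PRICE_HOUR = 0.266       # assumption: r4.xlarge on-demand pricing
BROKERS = 3


def ebs_yearly(produce_mb_s=30, retention_days=7, replication_factor=3, free_ratio=0.5):
    retained_tb = produce_mb_s * 86_400 * retention_days / 1_000_000  # ~18.1 TB per replica
    retained_tb = int(retained_tb)                                    # napkin rounding to 18 TB
    # Every replica stores a full copy, and half of each disk is kept free.
    provisioned_gb = retained_tb * 1000 * replication_factor / (1 - free_ratio)
    return provisioned_gb * EBS_PRICE_GB_MONTH * 12


def ec2_yearly(brokers=BROKERS):
    return brokers * EC2_PRICE_HOUR * 24 * 365


hardware = ebs_yearly() + ec2_yearly()
print(f"EBS ${ebs_yearly():,.0f} + EC2 ${ec2_yearly():,.0f} = ${hardware:,.0f} a year")
```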
I’m not even counting load balancer costs, which could be $12k by some quick napkin math too.
How ridiculous is that? 😂
Want it to get more ridiculous?
This calculation assumes you’re hosting your own Kafka cluster in the same AWS account.
💡If you use a managed Kafka provider that’s not AWS, or one that simply runs in another AWS account, you’re typically connecting to them through a public endpoint.
AWS then charges all of that traffic at the $0.01/GB rate, even if both sides are in the same AZ.
The end result?
$113,000 a year for network costs. 💀
For 30MB/s. (!!!)
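The thread does not show how it gets to $113,000. One reading that reproduces it (an assumption, not something stated above) is that every byte is now billed on both sides at $0.01/GB: all 30MB/s of produce, the 60MB/s of replication, and all 90MB/s of consumption.

```python
# Sketch of one way to arrive at ~$113k. The traffic breakdown is an assumption.
RATE_PER_GB = 0.01
MB_PER_GB = 1000                  # assumption: decimal units
SECONDS_PER_YEAR = 86_400 * 365

produce = 30                      # MB/s, now billed regardless of AZ
replication = 30 * 2              # leader to two followers
consume = 30 * 3                  # three consumer groups, each reading everything
total_mb_s = produce + replication + consume   # 180 MB/s


def yearly_cost(mb_per_s, sides_charged=2):
    return mb_per_s * sides_charged / MB_PER_GB * RATE_PER_GB * SECONDS_PER_YEAR


public_endpoint_cost = yearly_cost(total_mb_s)
print(f"${public_endpoint_cost:,.0f} a year")
```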
btw - 30 MB/s is absolutely nothing for Kafka... 🤡
It is most often network/disk bound.
Doing 3GB/s is not hard. 👌
The higher the throughput, the more absurdly large this discrepancy between network and hardware cost becomes.
For example - this exact setup could probably do 3x the traffic (90MB/s), assuming storage space isn't a concern.
Then you'd have:
• $264,000 a year for the cross-AZ rate. 🥲
• $339,000 a year for the internet rate. 💀
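Since the network bill scales linearly with throughput, it can help to express it as "billed bytes per produced byte". The sketch below does that under the same assumptions as before (decimal units, $0.01/GB on each side, and the all-traffic-billed reading of the public endpoint case, which is an assumption). The constant names are invented for illustration.

```python
from fractions import Fraction

RATE_PER_MB = Fraction(1, 100_000)   # $0.01/GB with 1 GB = 1000 MB
SECONDS_PER_YEAR = 86_400 * 365

# Billed bytes per produced byte, counting both the OUT and the IN side.
# Cross-AZ: 2/3 of produce + 2 replication links + 3 groups reading 2/3 remotely.
CROSS_AZ_MULTIPLIER = 2 * (Fraction(2, 3) + 2 + 3 * Fraction(2, 3))   # 28/3
# Public endpoint (assumed reading): all produce + 2 replication links + 3 full reads.
PUBLIC_ENDPOINT_MULTIPLIER = 2 * (1 + 2 + 3)                          # 12


def yearly_network_cost(produce_mb_s, multiplier):
    return float(produce_mb_s * multiplier * RATE_PER_MB * SECONDS_PER_YEAR)


for mb_s in (30, 90):
    cross_az = yearly_network_cost(mb_s, CROSS_AZ_MULTIPLIER)
    public = yearly_network_cost(mb_s, PUBLIC_ENDPOINT_MULTIPLIER)
    print(f"{mb_s} MB/s: cross-AZ ${cross_az:,.0f}, public endpoint ${public:,.0f}")
```

At 90MB/s this comes out within about a thousand dollars of the figures quoted above, which look like the rounded 30MB/s totals multiplied by three.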
Why is this cost (more than 300k a year) and complexity (this calculation) the case when three laptops can run this practically for free?
Where did we go wrong?
Worth Noting:
There are a few optimizations that can be done here:
• consumers can use fetch from follower, which results in free read traffic (no cross-AZ charges) in the first example. But the second example would still be charged internet costs. 🤝
• you can avoid internet costs by VPC-peering or PrivateLink-ing the two AWS accounts. This is largely what most cloud providers do, otherwise it becomes prohibitively expensive. It can be super complex to do. 🔧
• AWS can give you large discounts (up to 90%+ afaict) on the quoted prices, depending on your usage. It’s unclear what customer gets what discount. 💰
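To put a number on the first optimization: with fetch from follower (KIP-392), each consumer reads from a replica in its own AZ, so the 60MB/s of cross-AZ consume traffic in the first example drops to zero. Enabling it typically involves setting `replica.selector.class` to `org.apache.kafka.common.replica.RackAwareReplicaSelector` and `broker.rack` on the brokers, plus `client.rack` on the consumers. The sketch below only redoes the arithmetic, under the same assumptions as the example (decimal units, $0.01/GB on each side).

```python
# Sketch: cross-AZ bill for the first example, with and without fetch from follower.
RATE_PER_GB = 0.01
MB_PER_GB = 1000                  # assumption: decimal units
SECONDS_PER_YEAR = 86_400 * 365

PRODUCE_CROSS_AZ = 20             # MB/s, 2 of 3 producers are remote from the leader
REPLICATION_CROSS_AZ = 60         # MB/s, leader to two followers
CONSUME_CROSS_AZ = 60             # MB/s, 3 groups with 2 of 3 consumers remote


def yearly_cost(cross_az_mb_s):
    # Each cross-AZ MB is billed on the sending and the receiving side.
    return cross_az_mb_s * 2 / MB_PER_GB * RATE_PER_GB * SECONDS_PER_YEAR


before = yearly_cost(PRODUCE_CROSS_AZ + REPLICATION_CROSS_AZ + CONSUME_CROSS_AZ)
after = yearly_cost(PRODUCE_CROSS_AZ + REPLICATION_CROSS_AZ)   # consumers read locally
print(f"before ${before:,.0f}, after ${after:,.0f}, saved ${before - after:,.0f} a year")
```

Produce and replication traffic still cross AZs, so the saving is real but partial: roughly $38k of the $88k.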
And perhaps the best example - you can use an ingeniously-designed product like WarpStream that eliminates all of this complexity and cost. ⭐️
It's no wonder they got acquired after just 13 months of operation.
I had to edit this because I got the public endpoint traffic cost wrong.
There's surprisingly little info online about this, and many people seem confused about it.
After my latest research, I conclude that public endpoint traffic gets charged at the usual $0.01/GB rate, I believe on both sides.
It doesn't affect the first part of the calculation, but later on I got some vastly inaccurate numbers.
Knowing Kafka internals gets you ahead of 94% of Kafka admins.
But reading the code takes a lot of effort.
I read it so you don't have to.
Here are 9 things about how Tiered Storage (KIP-405)'s write path works in Kafka.
1. Async Writes
Simply put, Kafka tiers to an external store asynchronously. Your data is saved locally first.
Because of that, it temporarily keeps both copies for a (small) subset of the data.
Reads can come from either place, with preference always being the local one.
2. When does Kafka tier?
• Only when the segment is closed (so no producers are actively writing to it). 🙅‍♂️
• Only when all of the required index files have been created. 👌
• Only when the LSO (last stable offset, which never exceeds the high watermark) has advanced beyond the segment (i.e. it's fully replicated and its transactions are complete). 🙌
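The three conditions above boil down to a single predicate. Here is a minimal sketch of it. This is not Kafka's actual code (which is Java); the `Segment` class and the `eligible_for_tiering` function are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:               # hypothetical stand-in, not a Kafka class
    end_offset: int          # last offset contained in the segment
    is_active: bool          # True while producers can still append to it
    has_all_indexes: bool    # True once the required index files exist


def eligible_for_tiering(segment: Segment, lso: int) -> bool:
    """A segment can be copied to remote storage only if all three conditions hold."""
    return (
        not segment.is_active              # the segment is closed
        and segment.has_all_indexes        # index files have been created
        and segment.end_offset < lso       # the LSO has advanced beyond it
    )


closed = Segment(end_offset=999, is_active=False, has_all_indexes=True)
active = Segment(end_offset=1500, is_active=True, has_all_indexes=True)
print(eligible_for_tiering(closed, lso=1200), eligible_for_tiering(active, lso=1200))
```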
Here are the top 3 features you should know about 🔥
RIP ZooKeeper 💀
After 14 faithful years of service, Kafka completely removes support for ZooKeeper.
With KRaft, controller brokers now reach consensus among themselves and store all cluster metadata in an internal topic.
Queues ✨
Kafka gets a brand new type of consumer group called Share Consumers.
Share Consumers offer queue-like semantics:
• per-message acknowledgement and retry
• ability to have many consumers collaboratively read the same partition