Have worked on Apache Kafka for 6+ years, now I write about it. (& the general data space)
Low-frequency, highly-technical tweets. ✌️
Dec 20, 2024 • 7 tweets • 3 min read
Kafka is about to get a lot cheaper 🔥
@ivan0yu has published KIP-1123: Rack-aware partitioning for the Kafka Producer. 🎅
Just one week after my calculator post, where I shared the idea of a Produce to Same-AZ Leader KIP!
What a Christmas gift! 🎁
🧵
Kafka producers usually write to a particular partition they choose. That partition's leader can live on any broker.
That broker can be in another AZ.
In the cloud, cross-AZ networking is expensive! 💰
But why do producers choose a particular partition?
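Mostly for ordering: records with the same key must land in the same partition. A toy sketch of the idea (Kafka's default partitioner actually murmur2-hashes the serialized key bytes; the helper below is illustrative only, not Kafka's code):

```java
public class KeyPartitioning {
    // Illustrative only: shows the hash-mod-N shape of key-based partitioning.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("rider-42", 6);
        int p2 = partitionFor("rider-42", 6);
        // Same key -> same partition, regardless of which broker (or AZ)
        // happens to host that partition's leader.
        System.out.println(p1 == p2); // always true
    }
}
```

Because the partition is pinned by the key, the producer has no say in which AZ the leader sits - which is exactly the cost problem rack-aware partitioning targets.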
Dec 14, 2024 • 45 tweets • 17 min read
has a vendor ever told you that open source Kafka is expensive?
here are 7 ways cost calculators lie to you
(featuring a real story)
the story begins in 2023 when WarpStream was first released with a provocative piece called "Kafka is Dead, long live Kafka"
It was an innovative Kafka-compatible system that avoided inter-zone networking and disks. 💡
Its main value proposition was that it was 10x cheaper.
Dec 4, 2024 • 22 tweets • 8 min read
Yesterday, AWS shook the data lake world by releasing two new S3 features that will forever cement its place there. 👑
Every data engineer must become familiar with them.
A short thread on these game-changers 🧵 (2 minute read)
AWS announced two major S3 features:
• S3 Tables
• S3 Metadata
Together, they solidify S3’s already integral role in the modern data lake.
But a lot of the first takes I saw on the internet misunderstand what this is. 🙄
Nov 2, 2024 • 17 tweets • 5 min read
An expensive Kafka cluster sells for $1M.
Cheap Kafka sells for … $220M
The story of how Confluent acquired WarpStream after just 13 months of operations 👇
In August of 2023, WarpStream shook up the Kafka industry by announcing a novel Kafka-API compatible cloud-native implementation that used no disks.
Instead? It used S3. 🧠
The announcement was a viral HackerNews post named “Kafka is Dead, Long Live Kafka!”.
Aug 13, 2024 • 8 tweets • 3 min read
Uber spends $2B on R&D annually. 🤯
Which open source technologies do they choose and what do they use them for?
How Uber processes 100s of GB/s 🧵
Uber is the poster child of hypergrowth.
It was founded in the ZIRP era (2010)
5 years later, it completed 1 billion trips.
2.5 years after that, it hit 10 BILLION trips 🔥
Exponential growth, also reflected in its data infrastructure:
Hadoop grew 1000x in 7 years. 🤯
Jul 30, 2024 • 10 tweets • 5 min read
Apache Kafka 3.8.0 was just released! 🔥
What comes with this new release?
Here are the top features you should know about:
1/9 🧵 (2-minute read)
2/9 KIP-390: Support Compression Level allows you to configure the compression level of each supported algorithm.
Used for both broker-side & producer-side compression.
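A minimal producer config sketch (the per-codec level settings are what KIP-390 shipped in 3.8; gzip shown here, with zstd and lz4 getting analogous knobs - valid level ranges depend on the codec):

```properties
# producer.properties - requires Kafka 3.8.0+ clients
compression.type=gzip
# KIP-390 per-codec level; see also compression.zstd.level
# and compression.lz4.level for the other codecs
compression.gzip.level=9
```

Higher levels trade CPU for smaller payloads - worth benchmarking per workload.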
Uber processes 35 petabytes of data each day, or 405 GB/s... 🤯
How do you architect a system to support such an insane scale?
The answer, in 2 minutes: 👇
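Quick sanity check on that headline number (decimal units, 86,400 seconds per day):

```latex
35\ \text{PB/day} = \frac{35 \times 10^{15}\ \text{B}}{86{,}400\ \text{s}} \approx 4.05 \times 10^{11}\ \text{B/s} \approx 405\ \text{GB/s}
```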
Each day, Uber processes TRILLIONS of messages.
Why?
To serve use cases that need questions over a lot of data answered in a timely manner.
How else would you be able to do things like:
• match a rider and a driver? 🤝
• optimize the route in real-time? ✅
• provide a real-time adjusted ETA on when the driver will arrive & when you'll get there? ⌚️
• know what's happening in your global business? 🌎
These are simple, first order questions.
But once you provide engineers with access to real-time data, a lot of ideas start to flourish.
Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.
So. How do you scale & model your infra then?
According to the use cases!
A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale - neither in cost nor maintainability. ❌
Thankfully, technologies today allow us to get the best of both worlds.
And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too. 😬
🤔How did Uber do it?
They stacked a bunch of technologies on top of one another, in a logical way.
💡 Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.
Here are 4 vastly different examples from Uber & their solution 🧵
🍔 1. UberEats Restaurant Manager
If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!
The number of customers by hour, the money made, trends, customer reviews, etc., so you can tell what is happening. 😎
This use case requires:
• fresh data🍐
• low latency - the data should load fast - each website page has a few of these dashboards. ⚡️
• strict accuracy on some metrics - the dashboards that show financial data should not be wrong! 💵
The good thing?
The types of queries are fixed!
You know precisely what the web app is going to request.
🤔 How did they solve this?
🍷 They use Apache Pinot with pre-aggregated indices of the large volume of raw records. This greatly reduces the latency in pulling the data.
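A sketch of what that pre-aggregation can look like: Pinot's star-tree index pre-computes aggregates per dimension combination at ingestion time, so fixed dashboard queries never scan raw records. (The config keys below are Pinot's star-tree index settings; the column names are made up for illustration.)

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [{
      "dimensionsSplitOrder": ["restaurantId", "hour"],
      "functionColumnPairs": ["SUM__orderTotal", "COUNT__*"],
      "maxLeafRecords": 10000
    }]
  }
}
```

Because the dashboard queries are fixed, you can pick exactly the dimensions and aggregates to pre-compute - that's what makes the latency budget achievable.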
A fun story: their CEO once saw inaccurate financial data in an internal dashboard and opened a Jira to the platform team saying “this number is wrong”.
That is the only Jira their CEO ever opened!
Jun 24, 2023 • 32 tweets • 11 min read
Stochastic, Sublinear Streaming Algorithms.
What?
Stochastic. Sublinear. Streaming. Algorithms.
... What?
OK - let's start with the problem first 👇
You have a Kafka topic with billions of records representing latency metrics.
You want to compute their p95, p99, p999.
How?
Certain big data aggregations are really costly.
Computing the p99 of a stream of values, without knowing the number of values prior, requires you to:
• retain ALL values
• sort them
• rank
With billions of records, this doesn't scale.
You can't parallelize it either 👉
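The naive exact approach from above, sketched out (nearest-rank percentile; illustrative, not production code). Notice it has to hold and sort every single value:

```java
import java.util.Arrays;

public class ExactPercentile {
    // Nearest-rank method: retain ALL values, sort, pick the ranked element.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();   // must keep all n values in memory
        Arrays.sort(sorted);                // O(n log n)
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 7, 45, 33, 90, 21, 5, 60, 18, 27};
        System.out.println(percentile(latenciesMs, 90)); // 60.0
    }
}
```

At billions of records this blows up in memory and time - which is exactly the gap that stochastic, sublinear sketches fill.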
Jun 9, 2023 • 16 tweets • 3 min read
If you can steal one ops tip from me, let it be this:
The top 3 Kafka Metrics you need to monitor:
1. ❌ Offline Partitions
2. 🚩 Under Min ISR Partitions
3. 🔶 Under Replicated Partitions (URP)
👇
PS: I’ll leave you with a gift at the end 🎁
These are the easiest metrics to take a glance at & instantly be able to assess the cluster’s overall high-level health.
Of course, there are 1,000 other things that can go wrong and you always need to monitor everything
…
but how many tweets would that require me to do? 😑
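For reference, these three are exposed over JMX - the standard Kafka MBean names, to the best of my knowledge:

```
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
```

Alert on any non-zero value for the first two; a sustained non-zero URP count usually means a broker is down or falling behind.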
Jun 2, 2023 • 18 tweets • 6 min read
I am incredibly excited to announce that after years of contemplation, I am finally launching an AI startup that will revolutionize the way we do ...
just kidding.
I am launching a newsletter around Apache Kafka & event streaming.
1. What
2. Why
3. How
👇
2 Minute Streaming!
The idea is simple - cap every post at 476 words (2-minute read).
This forces me to keep it as concise and clear as possible.
Keep it high quality & manageable - 1 issue a week.
We all have too little time. But. Everybody can spare 2 minutes a week.
Jun 1, 2023 • 9 tweets • 5 min read
How many lines of code does it take to build an event streaming platform?
About 1.3 million. 🤯
...
I did a little bit of number crunching on the @apachekafka repo.
Over its 11 years as an Apache (@TheASF) project, Kafka grew from 220k lines of code to 1.3M - nearly 6x!