Stanislav Kozlovski
I worked on Apache Kafka for 6+ years; now I write about it (& the general data space). Low-frequency, highly technical tweets. ✌️
Dec 20, 2024 7 tweets 3 min read
Kafka is about to get a lot cheaper 🔥

@ivan0yu has published KIP-1123: Rack-aware partitioning for the Kafka Producer. 🎅

Just one week after my calculator post, where I shared the idea of a Produce to Same-AZ Leader KIP!

What a Christmas gift! 🎁

🧵

Kafka producers usually write to a particular partition they choose. That partition lives on a random broker.

That broker can be in another AZ.

In the cloud, cross-AZ networking is expensive! 💰

But why do producers choose a particular partition?
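To see why same-AZ produce matters, here is a back-of-the-envelope sketch in Python. The AZ count and the $/GB rate are my illustrative assumptions (roughly typical AWS cross-AZ pricing), not figures from the thread:

```python
# Back-of-the-envelope estimate of cross-AZ produce traffic cost.
# Assumptions (illustrative): 3 AZs, partition leaders spread evenly,
# and a combined cross-AZ rate of $0.02/GB ($0.01 egress + $0.01 ingress).

def cross_az_produce_cost(gb_per_day: float, num_azs: int = 3,
                          rate_per_gb: float = 0.02) -> float:
    """Monthly cost of produce traffic that crosses an AZ boundary.

    With leaders placed uniformly at random, a producer's partition
    leader sits in another AZ (num_azs - 1) / num_azs of the time.
    """
    cross_az_fraction = (num_azs - 1) / num_azs
    return gb_per_day * 30 * cross_az_fraction * rate_per_gb

# With 3 AZs, two-thirds of produce traffic pays the cross-AZ tax.
# Rack-aware partitioning would let keyless producers target a
# same-AZ leader and push that fraction toward zero.
print(f"${cross_az_produce_cost(1000):,.2f}/month for 1 TB/day")
```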
Dec 14, 2024 45 tweets 17 min read
has a vendor ever told you that open source Kafka is expensive?

here are 7 ways cost calculators lie to you

(featuring a real story)

The story begins in 2023, when WarpStream was first released alongside a poignant piece called "Kafka is Dead, Long Live Kafka".

It was an innovative Kafka-compatible system that avoided inter-zone networking and disks. 💡

Its main value proposition was that it was 10x cheaper.
Dec 4, 2024 22 tweets 8 min read
Yesterday, AWS shook the data lake world by releasing two new S3 features that will forever cement its place there. 👑

Every data engineer must become familiar with them.

A short thread on these game-changers 🧵 (2-minute read)

AWS announced two major S3 features:

• S3 Tables
• S3 Metadata

Together, they form a very strong solidification of S3’s already integral role in the modern data lake.

But a lot of the first takes I saw online misunderstand what this is. 🙄
Nov 2, 2024 17 tweets 5 min read
An expensive Kafka cluster sells for $1M.

Cheap Kafka sells for … $220M

The story of how Confluent acquired WarpStream after just 13 months of operations 👇

In August of 2023, WarpStream shook up the Kafka industry by announcing a novel Kafka-API-compatible, cloud-native implementation that used no disks.

Instead? It used S3. 🧠

The announcement was a viral HackerNews post named “Kafka is Dead, Long Live Kafka!”.
Aug 13, 2024 8 tweets 3 min read
Uber spends $2B on R&D annually. 🤯

Which open source technologies do they choose and what do they use them for?

How Uber processes 100s of GB/s 🧵

Uber is the poster child of hypergrowth.

It was founded in the ZIRP era (2010)

5 years later - it completed 1 billion trips.

2.5 years after that - it completed 10 BILLION trips 🔥

Exponential growth, also reflected in its data infrastructure:

Hadoop grew 1000x in 7 years. 🤯
Jul 30, 2024 10 tweets 5 min read
Apache Kafka 3.8.0 was just released! 🔥

What comes with this new release?

Here are the top features you should know about:

1/9 🧵 (2-minute read)

2/9

KIP-390: Support Compression Level allows you to configure the compression level of each supported algorithm.

Used for both broker-side & producer-side compression.

It can increase performance substantially!


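The speed-vs-ratio tradeoff that compression levels control can be felt with Python's standard zlib module (the DEFLATE codec behind gzip). This is an analogy, not Kafka client code; KIP-390 exposes per-codec producer/broker configs such as `compression.gzip.level`:

```python
import zlib

# A repetitive payload, similar in spirit to a batch of JSON events.
payload = b"{'user_id': 12345, 'event': 'click', 'page': '/home'}" * 1000

fast = zlib.compress(payload, 1)   # lowest level: fastest, larger output
best = zlib.compress(payload, 9)   # highest level: slowest, smallest output

# Higher levels spend more CPU chasing a better ratio. Smaller batches
# mean less network and disk, which is why tuning the level per codec
# (the knob KIP-390 adds) can increase performance substantially.
print(len(payload), len(fast), len(best))
```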
Jun 29, 2024 28 tweets 10 min read
Slack uses Apache Kafka at scale:

• 6.5 Gbps
• 700TB of data
• 100s of nodes

Here's their fun story 👇

It all started in 2016 when they were using Redis as a queue.

Web Apps -> Redis -> Workers

Any jobs that were too slow for a web request went there
• Unfurling Links
• Notifications
• Search Index Updates
• Security Checks
• etc.

In 2016 they had a big incident with it 🥲
May 24, 2024 23 tweets 7 min read
Myth: Kafka is not a database.

Let's disprove that.

👇

The first question we have to ask ourselves is:

What makes a database?
May 2, 2024 8 tweets 3 min read
Hot off the press just minutes ago, Confluent announced Freight clusters.

They’re a new innovative cluster type that lowers prices by up to 90% (!) 🔥

What is it and how do they do it?

🧵

It’s all about latency.

As @addisonhuddy presented, some use cases require strictly low latency (eg microservices) but others don’t.

Things like:
- metrics
- telemetry
- logs

don’t care if they’re 50ms, 500ms or 2s in p99.
Mar 1, 2024 10 tweets 4 min read
What people think Apache Kafka is:

- a dumb pipe powered by simple log data structure 🤕

What Apache Kafka actually is:

- a beast of a streaming platform consisting of 1.2 million lines of code 💪

I spent hours analyzing the codebase. Let's dive into the data 🧵

Can you guess how many lines of code Kafka started with? 🐣

Picture a number in your mind.

I will now give you a hint:

the repository grew at an average rate of 24% per release.

There were 24 releases.

(cue mental algebra 🙂)
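The mental algebra can be sketched in three lines of Python. The ~1.2M final line count is my assumption for illustration (the thread's reveal may use slightly different numbers):

```python
# If the repo ended at roughly 1.2M lines after 24 releases, each
# growing it ~24% on average, the starting size is final / 1.24**24.
final_loc = 1_200_000
growth_per_release = 1.24
releases = 24

starting_loc = final_loc / growth_per_release ** releases
print(f"~{starting_loc:,.0f} lines")  # on the order of a few thousand
```

Compound growth is brutal: 24% per release compounds to a ~175x multiplier over 24 releases.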
Feb 27, 2024 8 tweets 4 min read
Apache Kafka 3.7.0 was just released! 🔥

What comes with this new release?

Here are the top features you should know about:

(2-minute read) 🧵

Most excitingly (for me), we are getting an Early Access of the new simplified Consumer Rebalance Protocol (KIP-848).

See how it completely revamps the protocol and removes complexity from the consumer in the video I made here 👉

...

and go try it!
Dec 29, 2023 6 tweets 6 min read
Uber processes 35 petabytes of data each day, or 405 GB/s... 🤯

How do you architect a system to support such an insane scale?

The answer, in 2 minutes: 👇
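That headline conversion checks out with simple arithmetic (decimal units assumed):

```python
# Sanity-checking the headline number: 35 PB/day expressed as GB/s.
# Decimal units: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
petabytes_per_day = 35
bytes_per_day = petabytes_per_day * 10**15
seconds_per_day = 24 * 60 * 60  # 86,400

gb_per_second = bytes_per_day / seconds_per_day / 10**9
print(f"{gb_per_second:.0f} GB/s")  # ≈405 GB/s, matching the thread
```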

Each day, Uber processes TRILLIONS of messages.

Why?

To serve use cases that require a lot of data to be answered in a timely manner.

How else would you be able to do things like:

• match a rider and a driver? 🤝
• optimize the route in real-time? ✅
• provide a real-time adjusted ETA on when the driver will arrive & when you'll get there? ⌚️
• know what's happening in your global business? 🌎

These are simple, first-order questions.

But once you provide engineers with access to real-time data, a lot of ideas start to flourish.

Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.

So. How do you scale & model your infra then?

According to the use cases!

A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale - neither in cost nor maintainability. ❌

Thankfully, technologies today allow us to get the best of both worlds.

And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too. 😬

🤔How did Uber do it?

They stacked a bunch of technologies on top of one another, in a logical way.

💡 Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.

Here are 4 vastly different examples from Uber & their solutions 🧵

🍔 1. UberEats Restaurant Manager

If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!

The number of customers by hour, the money made, trends, customer reviews, etc. - so that you can tell what is happening. 😎

This use case requires:
• fresh data🍐
• low latency - the data should load fast - each website page has a few of these dashboards. ⚡️
• strict accuracy on some metrics - the dashboards that show the financial data should not be wrong! 💵

The good thing?

The types of queries are fixed!

You know precisely what the web app is going to request.

🤔 How did they solve this?

🍷 They use Apache Pinot with pre-aggregated indices of the large volume of raw records. This greatly reduces the latency in pulling the data.

A fun story about a related internal dashboard: their CEO once saw inaccurate financial data in it and opened a Jira with the platform team saying “this number is wrong”.

That is the only Jira their CEO ever opened!
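The pre-aggregation idea behind that Pinot setup can be sketched in plain Python. The event shapes below are invented for illustration; Pinot does this at scale with pre-aggregated/star-tree indices rather than dictionaries:

```python
from collections import defaultdict

# Roll raw order events up into (restaurant, hour) buckets at ingest
# time, so the dashboard reads one small row per bucket instead of
# scanning the full raw record stream on every page load.
raw_orders = [
    # (restaurant_id, hour_of_day, order_total_usd)
    ("resto-1", 12, 25.0),
    ("resto-1", 12, 40.0),
    ("resto-1", 13, 15.0),
    ("resto-2", 12, 30.0),
]

rollup = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
for restaurant, hour, total in raw_orders:
    bucket = rollup[(restaurant, hour)]
    bucket["orders"] += 1
    bucket["revenue"] += total

# Dashboard query: how did resto-1 do at noon? One lookup, no scan.
print(rollup[("resto-1", 12)])
```

Because the query shapes are fixed, the aggregates can be computed ahead of time - that is what keeps the page loads fast.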
Jun 24, 2023 32 tweets 11 min read
Stochastic, Sublinear Streaming Algorithms.

What?

Stochastic. Sublinear. Streaming. Algorithms.

... What?

OK - let's start with the problem first 👇

You have a Kafka topic with billions of records representing latency metrics.

You want to compute their p95, p99, p999.

How?

Certain big data aggregations are really costly.

Computing the p99 of a stream of values, without knowing the number of values prior, requires you to:
• retain ALL values
• sort them
• rank

With billions of records, this doesn't scale.

You can't parallelize it either 👉
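The simplest member of this algorithm family is reservoir sampling: keep a fixed-size random sample of the stream and read percentiles off it. The thread may well cover fancier sketches (t-digest, KLL), so treat this as the baseline illustration of "stochastic, sublinear":

```python
import random

def reservoir_p99(stream, k=10_000, seed=42):
    """Estimate the p99 of an unbounded stream in O(k) memory
    using reservoir sampling (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, value in enumerate(stream):
        if i < k:
            reservoir.append(value)
        else:
            j = rng.randrange(i + 1)  # item survives with prob k/(i+1)
            if j < k:
                reservoir[j] = value
    reservoir.sort()
    return reservoir[int(0.99 * len(reservoir)) - 1]

# 200k latency samples, uniform between 0 and 1000ms: true p99 is 990ms.
stream_rng = random.Random(7)
estimate = reservoir_p99(stream_rng.uniform(0, 1000) for _ in range(200_000))
print(f"p99 ≈ {estimate:.0f} ms")
```

Stochastic (the answer is approximate, with bounded error), sublinear (10k retained values instead of billions), streaming (one pass, no re-reads).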
Jun 9, 2023 16 tweets 3 min read
If you can steal one ops tip from me, let it be this:

The top 3 Kafka Metrics you need to monitor:

1. ❌ Offline Partition
2. 🚩 Under Min ISR Partitions
3. 🔶 Under Replicated Partition (URP)

👇

PS: I’ll leave you with a gift at the end 🎁

These are the easiest metrics to glance at & instantly assess the cluster’s overall high-level health.

Of course, there are 1000 other things that can go wrong, and you always need to monitor everything



but how many tweets would that require me to do? 😑
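The triage logic behind those three metrics can be sketched as a tiny function. The names, thresholds, and severity labels below are my illustration - wire the real values in from your JMX/metrics pipeline:

```python
def cluster_health(offline_partitions: int,
                   under_min_isr: int,
                   under_replicated: int) -> str:
    if offline_partitions > 0:
        return "CRITICAL"  # partitions have no leader: data unavailable NOW
    if under_min_isr > 0:
        return "MAJOR"     # producers with acks=all are failing writes
    if under_replicated > 0:
        return "WARNING"   # durability margin reduced; watch replication
    return "OK"

print(cluster_health(0, 0, 3))
```

The ordering is the point: offline partitions mean unavailability today, under-min-ISR means writes are being rejected, and URPs alone mean you still work but with less redundancy than configured.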
Jun 2, 2023 18 tweets 6 min read
I am incredibly excited to announce that after years of contemplation, I am finally launching an AI startup that will revolutionize the way we do ...

just kidding.

I am launching a newsletter around Apache Kafka & event streaming.

1. What
2. Why
3. How

👇

2 Minute Streaming!

The idea is simple - cap every post at 476 words (2-minute read).

This forces me to keep it as concise and clear as possible.

Keep it high quality & manageable - 1 issue a week.

We all have too little time. But. Everybody can spare 2 minutes a week.
Jun 1, 2023 9 tweets 5 min read
How many lines of code does it take to build an event streaming platform?

About 1.3 million. 🤯

...

I did a little bit of number crunching on the @apachekafka repo.

Over its 11 years as an Apache (@TheASF) project, Kafka grew from 220k lines of code to 1.3M (~6x)!

👇 Image I have to give kudos to @jaykreps, @junrao & @nehanarkhede - they were hard workers!

Starting out with 220k LoC & 1,085 files between 3 or so people is quite the feat!
May 31, 2023 32 tweets 11 min read
Think you know scale?

😂

You don't know S3 scale.

Dive in with me as we explore the curious bits, from high level to low level 👇

You saw the numbers.

...

How does it achieve them?

The story of S3 is leveraging its massive scale to its fullest extent to offer something that would be impossible otherwise.

S3 is a 17+ year-old service consisting of more than 300 microservices 🤯

In 31 regions, 99 AZs 🌎
May 25, 2023 35 tweets 11 min read
Slack uses Apache Kafka at scale:

- 6.5Gbps
- 700TB of data
- 100s of nodes

Here's their story 👇

It all started in 2016 when they were using Redis as a queue.

Web Apps -> Redis -> Workers

Any jobs that were too slow for a web request went there - unfurling links, notifications, search index updates, security checks, etc.

In 2016 they had a big incident with it 😱