Have worked on Apache Kafka for 6+ years, now I write about it. (& the general data space)
Low-frequency, highly-technical tweets. ✌️
Dec 20, 2024 • 7 tweets • 3 min read
Kafka is about to get a lot cheaper 🔥
@ivan0yu has published KIP-1123: Rack-aware partitioning for the Kafka Producer. 🎅
Just one week after my calculator post, where I shared the idea of a Produce to Same-AZ Leader KIP!
What a Christmas gift! 🎁
🧵
Kafka producers usually write to a particular partition they choose. That partition's leader can live on any broker.
That broker can be in another AZ.
In the cloud, cross-AZ networking is expensive! 💰
But why do producers choose a particular partition?
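Mostly for ordering: records with the same key must land in the same partition. A toy sketch of the idea (Kafka's default partitioner actually murmur2-hashes the serialized key bytes; the helper below is illustrative only, not Kafka's code):

```java
public class KeyPartitioning {
    // Illustrative only: shows the hash-mod-N shape of key-based partitioning.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("rider-42", 6);
        int p2 = partitionFor("rider-42", 6);
        // Same key -> same partition, regardless of which broker (or AZ)
        // happens to host that partition's leader.
        System.out.println(p1 == p2); // always true
    }
}
```

Because the partition is pinned by the key, the producer has no say in which AZ the leader sits - which is exactly the cost problem rack-aware partitioning targets.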
Dec 14, 2024 • 45 tweets • 17 min read
has a vendor ever told you that open source Kafka is expensive?
here are 7 ways cost calculators lie to you
(featuring a real story)
the story begins in 2023 when WarpStream was first released with a provocative piece called "Kafka is Dead, long live Kafka"
It was an innovative Kafka-compatible system that avoided inter-zone networking and disks. 💡
Its main value proposition was that it was 10x cheaper.
Dec 4, 2024 • 22 tweets • 8 min read
Yesterday, AWS shook the data lake world by releasing two new S3 features that will forever cement its place there. 👑
Every data engineer must become familiar with them.
A short thread on these game-changers 🧵 (2 minute read)
AWS announced two major S3 features:
• S3 Tables
• S3 Metadata
Together, they solidify S3’s already integral role in the modern data lake.
But a lot of the first takes I saw on the internet misunderstand what this is. 🙄
Nov 2, 2024 • 17 tweets • 5 min read
An expensive Kafka cluster sells for $1M.
Cheap Kafka sells for … $220M
The story of how Confluent acquired WarpStream after just 13 months of operations 👇
In August of 2023, WarpStream shook up the Kafka industry by announcing a novel Kafka-API compatible cloud-native implementation that used no disks.
Instead? It used S3. 🧠
The announcement was a viral HackerNews post named “Kafka is Dead, Long Live Kafka!”.
Aug 13, 2024 • 8 tweets • 3 min read
Uber spends $2B on R&D annually. 🤯
Which open source technologies do they choose and what do they use them for?
How Uber processes 100s of GB/s 🧵
Uber is the poster child of hypergrowth.
It was founded in the ZIRP era (2010)
5 years later, it completed 1 billion trips.
2.5 years after that, it hit 10 BILLION trips 🔥
Exponential growth, also reflected in its data infrastructure:
Hadoop grew 1000x in 7 years. 🤯
Jul 30, 2024 • 10 tweets • 5 min read
Apache Kafka 3.8.0 was just released! 🔥
What comes with this new release?
Here are the top features you should know about:
1/9 🧵 (2-minute read)
2/9 KIP-390: Support Compression Level allows you to configure the compression level of each supported algorithm.
Used for both broker-side & producer-side compression.
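A minimal producer config sketch (the per-codec level settings are what KIP-390 shipped in 3.8; gzip shown here, with zstd and lz4 getting analogous knobs - valid level ranges depend on the codec):

```properties
# producer.properties - requires Kafka 3.8.0+ clients
compression.type=gzip
# KIP-390 per-codec level; see also compression.zstd.level
# and compression.lz4.level for the other codecs
compression.gzip.level=9
```

Higher levels trade CPU for smaller payloads - worth benchmarking per workload.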
Uber processes 35 petabytes of data each day, or 405 GB/s... 🤯
How do you architect a system to support such an insane scale?
The answer, in 2 minutes: 👇
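Quick sanity check on that headline number (decimal units, 86,400 seconds per day):

```latex
35\ \text{PB/day} = \frac{35 \times 10^{15}\ \text{B}}{86{,}400\ \text{s}} \approx 4.05 \times 10^{11}\ \text{B/s} \approx 405\ \text{GB/s}
```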
Each day, Uber processes TRILLIONS of messages.
Why?
To serve use cases that need questions over a lot of data answered in a timely manner.
How else would you be able to do things like:
• match a rider and a driver? 🤝
• optimize the route in real-time? ✅
• provide a real-time adjusted ETA on when the driver will arrive & when you'll get there? ⌚️
• know what's happening in your global business? 🌎
These are simple, first order questions.
But once you provide engineers with access to real-time data, a lot of ideas start to flourish.
Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.
So. How do you scale & model your infra then?
According to the use cases!
A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale - neither in cost nor maintainability. ❌
Thankfully, technologies today allow us to get the best of both worlds.
And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too. 😬
🤔How did Uber do it?
They stacked a bunch of technologies on top of one another, in a logical way.
💡 Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.
Here are 4 vastly different examples from Uber & their solution 🧵
🍔 1. UberEats Restaurant Manager
If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!
The number of customers by hour, the money made, trends, customer reviews, etc., so you can tell what is happening. 😎
This use case requires:
• fresh data🍐
• low latency - the data should load fast - each website page has a few of these dashboards. ⚡️
• strict accuracy on some metrics - the dashboards that show financial data should not be wrong! 💵
The good thing?
The types of queries are fixed!
You know precisely what the web app is going to request.
🤔 How did they solve this?
🍷 They use Apache Pinot with pre-aggregated indices of the large volume of raw records. This greatly reduces the latency in pulling the data.
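A sketch of what that pre-aggregation can look like: Pinot's star-tree index pre-computes aggregates per dimension combination at ingestion time, so fixed dashboard queries never scan raw records. (The config keys below are Pinot's star-tree index settings; the column names are made up for illustration.)

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [{
      "dimensionsSplitOrder": ["restaurantId", "hour"],
      "functionColumnPairs": ["SUM__orderTotal", "COUNT__*"],
      "maxLeafRecords": 10000
    }]
  }
}
```

Because the dashboard queries are fixed, you can pick exactly the dimensions and aggregates to pre-compute - that's what makes the latency budget achievable.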
A fun story: their CEO once saw inaccurate financial data in an internal dashboard and opened a Jira to the platform team saying “this number is wrong”.
That is the only Jira their CEO ever opened!
Jun 24, 2023 • 32 tweets • 11 min read
Stochastic, Sublinear Streaming Algorithms.
What?
Stochastic. Sublinear. Streaming. Algorithms.
... What?
OK - let's start with the problem first 👇
You have a Kafka topic with billions of records representing latency metrics.
You want to compute their p95, p99, p999.
How?
Certain big data aggregations are really costly.
Computing the p99 of a stream of values, without knowing the number of values prior, requires you to:
• retain ALL values
• sort them
• rank
With billions of records, this doesn't scale.
You can't parallelize it either 👉
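The naive exact approach from above, sketched out (nearest-rank percentile; illustrative, not production code). Notice it has to hold and sort every single value:

```java
import java.util.Arrays;

public class ExactPercentile {
    // Nearest-rank method: retain ALL values, sort, pick the ranked element.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();   // must keep all n values in memory
        Arrays.sort(sorted);                // O(n log n)
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 7, 45, 33, 90, 21, 5, 60, 18, 27};
        System.out.println(percentile(latenciesMs, 90)); // 60.0
    }
}
```

At billions of records this blows up in memory and time - which is exactly the gap that stochastic, sublinear sketches fill.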
Jun 9, 2023 • 16 tweets • 3 min read
If you can steal one ops tip from me, let it be this:
The top 3 Kafka Metrics you need to monitor:
1. ❌ Offline Partitions
2. 🚩 Under Min ISR Partitions
3. 🔶 Under Replicated Partitions (URP)
👇
PS: I’ll leave you with a gift at the end 🎁
These are the easiest metrics to take a glance at & instantly be able to assess the cluster’s overall high-level health.
Of course, there are 1,000 other things that can go wrong and you always need to monitor everything
…
but how many tweets would that require me to do? 😑
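For reference, these three are exposed over JMX - the standard Kafka MBean names, to the best of my knowledge:

```
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
```

Alert on any non-zero value for the first two; a sustained non-zero URP count usually means a broker is down or falling behind.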
Jun 2, 2023 • 18 tweets • 6 min read
I am incredibly excited to announce that after years of contemplation, I am finally launching an AI startup that will revolutionize the way we do ...
just kidding.
I am launching a newsletter around Apache Kafka & event streaming.
1. What
2. Why
3. How
👇
2 Minute Streaming!
The idea is simple - cap every post at 476 words (2-minute read).
This forces me to keep it as concise and clear as possible.
Keep it high quality & manageable - 1 issue a week.
We all have too little time. But. Everybody can spare 2 minutes a week.
Jun 1, 2023 • 9 tweets • 5 min read
How many lines of code does it take to build an event streaming platform?
About 1.3 million. 🤯
...
I did a little bit of number crunching on the @apachekafka repo.
Over its 11 years as an Apache (@TheASF) project, Kafka grew from 220k lines of code to 1.3M - nearly 6x!