12,399 views

ksqlDB

@ksqlDB

, 35 tweets, 7 min read

My Authors

1/ People often wonder how tumbling (sometimes called “fixed”) windows work in stream processing. They’re an incredibly useful construct, and their implementation can be surprisingly interesting. Let’s dig in!

There are lots of problems where a piece of data is only relevant to other pieces of data that occur close to it in terms of time. In other words, you need to group proximal events together based on when they happened. Windows are an abstraction for doing just that.

They group data together based on time. They can support fast updates and retrieval with O(1) operations, even in the presence of out of order data. They can also be made solidly memory efficient.

Consider a highway lined with velocity sensors. If you wanted to track the average speed of a car, you’d probably want something more fine grained than one global average over a 20 mile stretch.

Instead, it might be more useful to know what the car’s average speed is in 1 minute intervals as it’s driven. This let’s you model change over time.

The physical analogy in the name is a pretty good approximation for how it works. When you look out the window of your office or home, you clearly only see a subset of things out there in the world.

But there are nearly limitless, unviewable things through the walls to the left and right. Windows let you view a manageable amount of things at any given time without overwhelming you with everything else.

Simple enough idea, right? So how do you make and use a window? Each window has a lower and upper boundary denoting its total duration. Events that arrive between the window bounds (say 3:00 PM to 4:00 PM) are captured by the window.

So an event with timestamp 3:15 PM would fall into that window, and an event timestamped 4:05 would not. In most implementations, the lower bound is inclusive, and the upper bound is exclusive.

When you use windows, you generally don’t create N windows for the N different ranges of time that you might want to track. Instead, windows are dynamically created and updated as new events arrive. This ends up being a really flexible way of working with time.

In addition, there are lots of different types of windows. Today, we’re just going to look at one: tumbling windows.

Tumbling windows are windows which are non-overlapping. If the desired window duration is an hour, you have one for 3:00 - 4:00, another for 4:00 - 5:00, etc.

A key property is that any single event falls into exactly one window. The analogy of the name is also helpful. If you placed a window on the ground and picked up one end, you could “tumble” it forward in a straight line like a domino so that the edge furthest from you is closest

Every time you tumble the window forward, you get a new non-overlapping window. To demonstrate what this all looks like, let’s go back to that highway speed sensor example. We’re going to model tumbling windows with 10 seconds of duration.

It’s often the case that windows are used to contain aggregated values for their respective durations, and for this we’ll aggregate with an average. Whenever we add a new event to a window, we average what’s there with the new event.

This is a pretty frequent thing to do because you’re often modeling change of a group of events together.

To begin, suppose one of our sensors captures a car traveling 40 mph at t=2. We’ll model time using a simple integer since it’s easy to read. Since t=2 falls between 0 and 10, this falls into the left-most window.

At t=6, our sensor captures another reading for 42 mph. This again falls into the first window and the average is updated.

At t=9, we get another reading for 46 mph. The same window updates again. You’re probably getting the idea by now.

Now let’s say that we capture a couple of readings at t=12 and t=18 for 48 and 55 mph respectively. This opens a new window.

If we keep playing this out, we’ll start to see a more fulsome visualization, like so:

As a user of windows, you can simply ask for the aggregated value over any window boundaries you like. In a well implemented system, this is very fast and can be an O(1) operation.

But how can that be when working with unbounded streams of events? How can it be efficient? How does it work with out of order data? How does it not blow up memory? That’s where the implementation side is handy to learn about.

There’s plenty of different implementations of windowing, but one that’s pretty easy to understand is Window-ID (“Semantics and Evaluation Techniques for Window Aggregates in Data Streams”)

If you want to dig in yourself, the paper is a surprisingly easy read. datalab.cs.pdx.edu/niagara/sigmod…

This approach is particularly neat because it can operate over any totally ordered domain, not just time. You can imagine using this technique to model something like windows over levels of severity.

There are three main algorithms in this paper: `windows`, `wids` and `extents`. All three leverage dead simple arithmetic. The basic idea is that every window has a monotonically increasing identifier that is an offset from some global minimum.

In practice, the UNIX timestamp 0 is often used as that global minimum. `wids` and `extents` are inverse functions that can compute the window that an event belongs to, and similarly which windows are responsible for particular bounds of time.

What’s cool about this technique is that it handles out of order data and sparse time periods in a highly efficient manner. Because every window can be looked up using simple arithmetic, that means that receiving really old data, or data way in the future can still be an O(1) op.

This clearly depends on how you store the events, but the obvious approach is to use something like a hash map and map window IDs to contents -- which is where that efficiency pays off.

This approach is also lazy in the sense that new windows are only created on demand, and don’t require intermediary windows to be created between any two time periods if there’s no data for them.

@ksqlDB

@ksqlDB

If you want to dig deeper, the paper makes the overall approach really clear. And if you want to give tumbling windows a try, you can also use them in @ksqlDB.

Here’s an example of a 5 second tumbling window:

SELECT card_number, count(*) FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number HAVING COUNT(*) > 3
EMIT CHANGES;

#fitsinatweet

ksqlDB abstracts away the mechanics of setting up windows so that you just need to specify a duration. This is really nice from a usability perspective, though we should note that plenty of other SQL systems do this too. docs.ksqldb.io/en/latest/conc…

@ksqlDB

@ksqlDB

If you want to play around with this more and you’re new to @ksqlDB, you can get up and running here: ksqldb.io

Next time, we’ll cover “hopping” windows. What’s really interesting about hopping windows is that tumbling windows are in fact a special case of them. Looking forward to the next one. :)

Enjoying this thread?

Try unrolling a thread yourself!

Trending hashtags

Enjoying this thread?

Try unrolling a thread yourself!

Related hashtags

Related threads

Trending hashtags

Did Thread Reader help you today?