Big data is any data which is too large to be loaded into memory all in one go. Ever tried to read a large dataset into R and everything fell over or slowed waaaaaay down? This is where Spark comes in.
1/17
Spark is an open source tool for processing big data. It splits data into partitions for processing to overcome RAM limitations and writes to disk when needed. It is complicated, but Spark handles most of the difficult parts. And you can work with Spark directly from R.
In this thread I will cover:
1. How a regular join works.
2. The sort-merge-join algorithm (the default way to join big data with Spark and sparklyr).
3. Broadcast joins.
4. Salted joins.
All directly in R. Let's go.
2/17
1. How a regular join works.
Joining is a key part of data preparation prior to analysis or modelling.
When we join datasets we match rows based on one or more ID columns.
The attached diagrams show how a left join works - I will focus on left joins in this thread.
3/17
Joins are easy with dplyr.
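For example, here is a minimal left join on small in-memory data (df1, df2 and the id column are toy names, not the data from the diagrams):

```r
library(dplyr)

df1 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- tibble(id = c(1, 3, 4), y = c(10, 30, 40))

# Keep every row of df1; pull in y from df2 wherever the IDs match
df3 <- left_join(df1, df2, by = "id")
df3
# id = 2 gets NA for y; the unmatched id = 4 row of df2 is dropped
```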
But if we have big data, we can't load everything into memory in one go. This is where Spark comes in.
4/17
2. The sort-merge-join algorithm
In Spark, our data are stored in different partitions. This allows us to:
- work on partitions in parallel
- when size is an issue, load some partitions and save others
And we can do joins on these partitions.
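As a rough sketch (a local connection, reusing the toy df1 from above), you can inspect and change partitioning directly from R:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df1_tbl <- copy_to(sc, df1, name = "df1", overwrite = TRUE)

# How many partitions is the table currently split into?
sdf_num_partitions(df1_tbl)

# Repartition into 4 partitions, grouping rows by the join key
df1_tbl <- sdf_repartition(df1_tbl, partitions = 4, partition_by = "id")
```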
5/17
The sort-merge-join algorithm works as follows:
- Step 1 - shuffle the data into partitions
- Step 2 - sort the data
- Step 3 - join each partition
The attached diagrams show the process.
6/17
Let's have a look at that with sparklyr. The main steps are:
1. Set up the Spark connection
2. Load the data with copy_to
3. Join the data
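Since I can't attach the images here, a sketch of those steps with the toy data from earlier (a local connection; your master and data will differ):

```r
library(sparklyr)
library(dplyr)

# 1. Set up the Spark connection
sc <- spark_connect(master = "local")

# 2. Load the data into Spark with copy_to
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(1, 3, 4), y = c(10, 30, 40))
df1_tbl <- copy_to(sc, df1, name = "df1", overwrite = TRUE)
df2_tbl <- copy_to(sc, df2, name = "df2", overwrite = TRUE)

# 3. Join the data - the dplyr verb is translated to Spark SQL, and on
#    big tables Spark runs a sort-merge-join behind the scenes
df3 <- left_join(df1_tbl, df2_tbl, by = "id")
collect(df3)  # bring the result back into R once it is small enough
```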
You will notice some sorting has been applied to df3.
7/17
As mentioned in previous threads, sorting is expensive - it involves shuffling data between partitions.
To speed up code, perform these operations on data as small as possible, as infrequently as possible.
Another way to speed things up is to use a different kind of join...
8/17
3. Broadcast joins
Broadcast joins are faster when one dataset is much smaller than the other.
Rather than shuffling and sorting both datasets, the algorithm copies the whole small dataset to every partition of the large one.
The added cost of copying is often smaller than the cost of the standard shuffle and sort.
9/17
The only addition to our code is using the sdf_broadcast function.
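Something like this (same toy setup; sdf_broadcast is the real sparklyr function, the rest is illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
df1_tbl <- copy_to(sc, data.frame(id = c(1, 2, 3), x = c("a", "b", "c")),
                   name = "df1", overwrite = TRUE)
df2_tbl <- copy_to(sc, data.frame(id = c(1, 3), y = c(10, 30)),
                   name = "df2", overwrite = TRUE)

# sdf_broadcast marks df2_tbl as small enough to copy to every executor,
# so df1_tbl never needs to be shuffled and sorted
df3 <- left_join(df1_tbl, sdf_broadcast(df2_tbl), by = "id")
```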
If you compare the output data from the broadcast join with the previous output you will see they are ordered differently.
Why?
Because with the broadcast join we didn't need to sort dataset 1.
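If you want to check which strategy Spark actually picked, one option (assuming df3 from the join above) is to print the plan and look for SortMergeJoin versus BroadcastHashJoin:

```r
# explain() from dplyr/dbplyr prints the generated SQL and Spark's plan
explain(df3)
```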
10/17
I confess I haven't been completely honest with you up until this point.
Spark is clever. It will automatically choose a broadcast join over a sort-merge-join if certain size criteria are met. If you want more control over your joins, you can specify the strategy in your code.
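The size criterion is an ordinary Spark setting, so one way to take control (the option name is standard Spark configuration; 10 MB is its usual default) is at connection time:

```r
library(sparklyr)

config <- spark_config()

# Tables smaller than this many bytes are broadcast automatically;
# set it to -1 to switch automatic broadcast joins off
config$spark.sql.autoBroadcastJoinThreshold <- 10 * 1024 * 1024

sc <- spark_connect(master = "local", config = config)
```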
11/17
4. Salted joins.
Salted joins are useful when your data are skewed.
To work efficiently, Spark needs roughly equal-sized partitions. If one partition ends up much larger than the others after shuffling and sorting, you may want to split the large one further.
This is called salting.
12/17
Consider a different example.
The first diagram shows a skewed dataset, where rows with ID = 1 are much more common than rows with ID = 2, and how to join it in the standard way.
The second diagram shows a salted join.
13/17
We can salt the join by splitting the ID = 1 cases across several unique IDs - for example by appending a random suffix to the key.
The code in the first image uses the default approach while the second image shows the salting method.
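And since the images aren't attached here, a rough sketch of the salting idea in sparklyr (the data, n_salt and column names are all made up; this illustrates the technique rather than reproducing the images):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Toy skewed data: ID = 1 is far more common than ID = 2
df_big <- data.frame(id = c(rep(1, 1000), rep(2, 10)), value = runif(1010))
df_dim <- data.frame(id = c(1, 2), label = c("common", "rare"))
big_tbl <- copy_to(sc, df_big, name = "df_big", overwrite = TRUE)
dim_tbl <- copy_to(sc, df_dim, name = "df_dim", overwrite = TRUE)

n_salt <- 4  # how many pieces to split each ID into

# Give every row of the big table a random salt between 0 and n_salt - 1
# (rand() passes through to Spark SQL; !! inlines the local R value)
big_salted <- big_tbl %>%
  mutate(salt = as.integer(floor(rand() * !!n_salt)))

# Replicate the small table once per salt value so every salted key still
# finds a match (a constant dummy key gives a cross join)
salts_tbl <- copy_to(sc, data.frame(salt = 0:(n_salt - 1), dummy = 1),
                     name = "salts", overwrite = TRUE)
dim_salted <- dim_tbl %>%
  mutate(dummy = 1) %>%
  inner_join(salts_tbl, by = "dummy") %>%
  select(-dummy)

# Join on both the ID and the salt - the heavy ID = 1 rows are now spread
# across n_salt partitions instead of piling into one
df3 <- left_join(big_salted, dim_salted, by = c("id", "salt"))
```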
14/17
You don't need to worry about salting, though, unless you have performance issues.
In this way Spark differs from food, where I would always advocate salting. This is pretty poor health advice, though I was a chef years ago, so take from that what you will.
15/17
TL;DR:
1. The sparklyr package lets you work with Spark directly from R
2. With sparklyr we can perform joins on big data
3. Broadcasting or salting your joins can help you speed up your code - if you know when to use them
16/17
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).