Big data is any data which is too large to be loaded into memory all in one go. Ever tried to read a large dataset into R and everything fell over or slowed waaaaaay down? This is where Spark comes in.
1/17
Spark is an open source tool for processing big data. It splits data into partitions for processing to overcome RAM limitations and writes to disk when needed. It is complicated, but Spark handles most of the difficult parts. And you can work with Spark directly from R.
In this thread I will cover:
1. How a regular join works.
2. The sort-merge-join algorithm (the default way to join big data with Spark and sparklyr).
3. Broadcast joins.
4. Salted joins.
All directly in R. Let's go.
2/17
1. How a regular join works.
Joining is a key part of data preparation prior to analysis or modelling.
When we join datasets we match rows based on one or more ID columns.
The attached diagrams show how a left join works - I will focus on left joins in this thread.
3/17
Joins are easy with dplyr.
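For example, here is a minimal left join on small in-memory data (df1, df2 and the id column are toy names, not the data from the diagrams):

```r
library(dplyr)

df1 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- tibble(id = c(1, 3, 4), y = c(10, 30, 40))

# Keep every row of df1; pull in y from df2 wherever the IDs match
df3 <- left_join(df1, df2, by = "id")
df3
# id = 2 gets NA for y; the unmatched id = 4 row of df2 is dropped
```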
But if we have big data, we can't load everything into memory in one go. This is where Spark comes in.
4/17
2. The sort-merge-join algorithm
In Spark, our data are stored in different partitions. This allows us to:
- work on partitions in parallel
- when size is an issue, load some partitions and save others
And we can do joins on these partitions.
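As a rough sketch (a local connection, reusing the toy df1 from above), you can inspect and change partitioning directly from R:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df1_tbl <- copy_to(sc, df1, name = "df1", overwrite = TRUE)

# How many partitions is the table currently split into?
sdf_num_partitions(df1_tbl)

# Repartition into 4 partitions, grouping rows by the join key
df1_tbl <- sdf_repartition(df1_tbl, partitions = 4, partition_by = "id")
```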
5/17
The sort-merge-join algorithm works as follows:
- Step 1 - shuffle the data into partitions
- Step 2 - sort the data
- Step 3 - join each partition
The attached diagrams show the process.
6/17
Let's have a look at that with sparklyr. The main steps are:
1. Set up the Spark connection
2. Load the data with copy_to
3. Join the data
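Since I can't attach the images here, a sketch of those steps with the toy data from earlier (a local connection; your master and data will differ):

```r
library(sparklyr)
library(dplyr)

# 1. Set up the Spark connection
sc <- spark_connect(master = "local")

# 2. Load the data into Spark with copy_to
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(1, 3, 4), y = c(10, 30, 40))
df1_tbl <- copy_to(sc, df1, name = "df1", overwrite = TRUE)
df2_tbl <- copy_to(sc, df2, name = "df2", overwrite = TRUE)

# 3. Join the data - the dplyr verb is translated to Spark SQL, and on
#    big tables Spark runs a sort-merge-join behind the scenes
df3 <- left_join(df1_tbl, df2_tbl, by = "id")
collect(df3)  # bring the result back into R once it is small enough
```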
You will notice some sorting has been applied to df3.
7/17
As mentioned in previous threads, sorting is expensive - it involves shuffling data between partitions.
To speed up code, perform these operations on data as small as possible, as infrequently as possible.
Another way to speed things up is to use a different kind of join...
8/17
3. Broadcast joins
Broadcast joins are faster when one dataset is much smaller than the other.
Rather than shuffling and sorting both datasets, the algorithm copies the whole small dataset to every partition of the large one.
The added cost of copying is often smaller than the cost of the standard shuffle and sort.
9/17
The only addition to our code is using the sdf_broadcast function.
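Something like this (same toy setup; sdf_broadcast is the real sparklyr function, the rest is illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
df1_tbl <- copy_to(sc, data.frame(id = c(1, 2, 3), x = c("a", "b", "c")),
                   name = "df1", overwrite = TRUE)
df2_tbl <- copy_to(sc, data.frame(id = c(1, 3), y = c(10, 30)),
                   name = "df2", overwrite = TRUE)

# sdf_broadcast marks df2_tbl as small enough to copy to every executor,
# so df1_tbl never needs to be shuffled and sorted
df3 <- left_join(df1_tbl, sdf_broadcast(df2_tbl), by = "id")
```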
If you compare the output data from the broadcast join with the previous output you will see they are ordered differently.
Why?
Because with the broadcast join we didn't need to sort dataset 1.
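If you want to check which strategy Spark actually picked, one option (assuming df3 from the join above) is to print the plan and look for SortMergeJoin versus BroadcastHashJoin:

```r
# explain() from dplyr/dbplyr prints the generated SQL and Spark's plan
explain(df3)
```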
10/17
I confess I haven't been completely honest with you up until this point.
Spark is clever. It will automatically choose a broadcast join over a sort-merge-join if certain size criteria are met. If you want more control over your joins, you can specify the strategy in your code.
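The size criterion is an ordinary Spark setting, so one way to take control (the option name is standard Spark configuration; 10 MB is its usual default) is at connection time:

```r
library(sparklyr)

config <- spark_config()

# Tables smaller than this many bytes are broadcast automatically;
# set it to -1 to switch automatic broadcast joins off
config$spark.sql.autoBroadcastJoinThreshold <- 10 * 1024 * 1024

sc <- spark_connect(master = "local", config = config)
```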
11/17
4. Salted joins.
Salted joins are useful when your data are skewed.
To work efficiently, Spark needs roughly equal-sized partitions. If one partition ends up much larger than the others after shuffling and sorting, you may want to split the large one further.
This is called salting.
12/17
Consider a different example.
The first diagram shows a skewed dataset, where rows with ID = 1 are much more common than rows with ID = 2, and how to join it in the standard way.
The second diagram shows a salted join.
13/17
We can salt the join by splitting the ID = 1 cases across several unique IDs - for example by appending a random suffix to the key.
The code in the first image uses the default approach while the second image shows the salting method.
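And since the images aren't attached here, a rough sketch of the salting idea in sparklyr (the data, n_salt and column names are all made up; this illustrates the technique rather than reproducing the images):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Toy skewed data: ID = 1 is far more common than ID = 2
df_big <- data.frame(id = c(rep(1, 1000), rep(2, 10)), value = runif(1010))
df_dim <- data.frame(id = c(1, 2), label = c("common", "rare"))
big_tbl <- copy_to(sc, df_big, name = "df_big", overwrite = TRUE)
dim_tbl <- copy_to(sc, df_dim, name = "df_dim", overwrite = TRUE)

n_salt <- 4  # how many pieces to split each ID into

# Give every row of the big table a random salt between 0 and n_salt - 1
# (rand() passes through to Spark SQL; !! inlines the local R value)
big_salted <- big_tbl %>%
  mutate(salt = as.integer(floor(rand() * !!n_salt)))

# Replicate the small table once per salt value so every salted key still
# finds a match (a constant dummy key gives a cross join)
salts_tbl <- copy_to(sc, data.frame(salt = 0:(n_salt - 1), dummy = 1),
                     name = "salts", overwrite = TRUE)
dim_salted <- dim_tbl %>%
  mutate(dummy = 1) %>%
  inner_join(salts_tbl, by = "dummy") %>%
  select(-dummy)

# Join on both the ID and the salt - the heavy ID = 1 rows are now spread
# across n_salt partitions instead of piling into one
df3 <- left_join(big_salted, dim_salted, by = c("id", "salt"))
```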
14/17
You don't need to worry about salting, though, unless you have performance issues.
In this way Spark differs from food, where I would always advocate salting. This is pretty poor health advice, though I was a chef years ago, so take from that what you will.
15/17
TL;DR:
1. The sparklyr package lets you work with Spark directly from R
2. With sparklyr we can perform joins on big data
3. Broadcasting or salting your joins can help you speed up your code - if you know when to use them
16/17
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).