Neil Currie
Jun 30 · 18 tweets · 7 min read
Ever wondered how to join big data in R?

A thread on using Spark in R👇🧵

#rstats #spark #datascience
This is thread #3 in a series exploring using Spark in R with the sparklyr package. You can find the others here:

#1:


#2:

1/17
Here's what you'll learn reading this thread:

1. How a regular left join works.

2. The sort-merge-join algorithm (the default way to join big data with Spark and sparklyr).

3. Broadcast joins.

4. Salted joins.

All directly in R. Let's go.

2/17
1. How a regular join works.

Joining is a key part of data preparation prior to analysis or modelling.

When we join datasets we match rows based on one or more ID columns.

The attached diagrams show how a left join works - I will focus on left joins in this thread.

3/17 [Diagrams: the left join process, and the resulting joined dataset]
Joins are easy with dplyr.

But if we have big data, we can't load everything into memory in one go. This is where Spark comes in.

4/17 [Code image: building example data and joining it with dplyr]
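The original code image is truncated, but a minimal sketch of a plain dplyr left join looks like this (the values beyond "crocodile" are illustrative, not the thread's exact data):

```r
library(dplyr)

df1 <- tibble(id = 1:4,
              animal = c("crocodile", "zebra", "lion", "tiger"))

df2 <- tibble(id = c(1, 2, 4),
              age = c(10, 4, 7))

# Keep every row of df1; rows with no match in df2 (here id = 3)
# get NA in the age column
df3 <- left_join(df1, df2, by = "id")
```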
2. The sort-merge-join algorithm

In Spark, our data are stored in different partitions. This allows us to:

- work on partitions in parallel
- when size is an issue, load some partitions and save others

And we can do joins on these partitions.

5/17
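As a sketch, you can inspect and change partitioning from R with sparklyr (the local connection and dummy data are illustrative):

```r
library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

sdf <- copy_to(sc, tibble(id = 1:1000, x = rnorm(1000)), "example",
               overwrite = TRUE)

sdf_num_partitions(sdf)                # how many partitions Spark used
sdf4 <- sdf_repartition(sdf, partitions = 4)  # shuffle into 4 partitions
```

Repartitioning is itself a shuffle, so it is worth doing only when the downstream work benefits from it.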
The sort-merge-join algorithm works as follows:

- Step 1 - shuffle the data into partitions
- Step 2 - sort the data
- Step 3 - join each partition

The attached diagrams show the process.

6/17 [Diagrams: shuffling datasets 1 and 2 into partitions, joining within each partition, and the final joined dataset]
Let's have a look at that with sparklyr. The main steps are:

1. Setup spark connection
2. Load data with copy_to
3. Join the data

You will notice some sorting has been applied to df3.

7/17 [Code image: creating the Spark connection, copying data in, and joining]
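Those three steps can be sketched as follows (the example data and local-mode connection are illustrative):

```r
library(dplyr)
library(sparklyr)

# 1. Set up the Spark connection (local mode here)
sc <- spark_connect(master = "local")

# 2. Load the data into Spark with copy_to
df1 <- tibble(id = 1:4, animal = c("crocodile", "zebra", "lion", "tiger"))
df2 <- tibble(id = c(1, 2, 4), age = c(10, 4, 7))

sdf1 <- copy_to(sc, df1, "df1", overwrite = TRUE)
sdf2 <- copy_to(sc, df2, "df2", overwrite = TRUE)

# 3. Join - the dplyr verb is translated to Spark SQL behind the
# scenes, so Spark runs a sort-merge-join across its partitions
sdf3 <- left_join(sdf1, sdf2, by = "id")

collect(sdf3)   # pull the joined result back into R
```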
As mentioned in previous threads, sorting is expensive - it moves data between partitions.

To speed up code, perform these operations on data as small as possible, as infrequently as possible.

Another way to speed things up is to use a different kind of join...

8/17
3. Broadcast joins

Broadcast joins are faster when one dataset is much smaller than another.

Rather than sorting and partitioning the small dataset, the algorithm copies it out to every partition of the large one.

Often the added cost of this is smaller than the standard method.

9/17 [Diagram: how a broadcast join works]
The only addition to our code is using the sdf_broadcast function.

If you compare the output data from the broadcast join with the previous output you will see they are ordered differently.

Why?

Because of the broadcast join - we didn't need to sort dataset 1.

10/17 [Code image: broadcast join using sdf_broadcast]
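A sketch of that call, assuming sdf1 (large) and sdf2 (small) are Spark tables created earlier with copy_to:

```r
library(dplyr)
library(sparklyr)

# sdf_broadcast flags sdf2 as small enough to copy to every node,
# so Spark can skip shuffling and sorting sdf1
sdf3_broadcast <- left_join(sdf1, sdf_broadcast(sdf2), by = "id")
```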
I confess I haven't been completely honest with you up until this point.

Spark is clever. It will automatically choose a broadcast join over a sort-merge-join if certain size criteria are met. If you want more control over your joins you can specify in your code.

11/17
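One way to take that control is through the connection config; Spark auto-broadcasts tables whose estimated size falls below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). A sketch:

```r
library(sparklyr)

config <- spark_config()

# Setting the threshold to -1 disables automatic broadcasting,
# so only tables you wrap in sdf_broadcast get broadcast
config[["spark.sql.autoBroadcastJoinThreshold"]] <- -1

sc <- spark_connect(master = "local", config = config)
```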
4. Salted joins.

Salted joins are useful when your data are skewed.

To work efficiently, Spark needs roughly equal sized partitions. After sorting, if one partition was much larger than another, you may want to split the large one further.

This is called salting.

12/17
Consider a different example, where rows with ID = 1 are much more common than rows with ID = 2.

The first diagram shows this skewed dataset joined in the standard way.

The second diagram shows a salted join.

13/17 [Diagrams: joining a skewed dataset the standard way vs. with the salting method]
We can salt the join by splitting the ID = 1 cases into several, unique IDs.

The code in the first image uses the default approach while the second image shows the salting method.

14/17 [Code images: the default join vs. the salted join on dummy data, splitting id = 1 into salted keys like "1-1" to "1-8"]
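A hedged sketch of the salting idea, shown with in-memory tibbles for clarity (the same pattern applies to Spark tables; the column names and 8-way split are illustrative):

```r
library(dplyr)
library(tidyr)

num_salts <- 8

# Skewed data: id = 1 dominates
df1 <- tibble(id = c(rep(1, 9), 2), value = 1:10)
df2 <- tibble(id = c(1, 2), profession = c("chef", "teacher"))

# Salt the big table: split each id into random sub-ids like "1-3"
df1_salted <- df1 %>%
  mutate(salt = sample(num_salts, n(), replace = TRUE),
         id_salted = paste(id, salt, sep = "-"))

# Explode the small table: one row per possible salted id
df2_salted <- crossing(df2, salt = 1:num_salts) %>%
  mutate(id_salted = paste(id, salt, sep = "-")) %>%
  select(id_salted, profession)

# Join on the salted key - the matches spread evenly over partitions
joined <- left_join(df1_salted, df2_salted, by = "id_salted")
```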
You don't need to worry about salting though unless you have performance issues.

In this way Spark differs from food where I would always advocate salting. This is pretty poor health advice, though I was a chef years ago so take from that what you will.

15/17
TL;DR:

1. The sparklyr package lets you work with Spark directly from R

2. With sparklyr we can perform joins on big data

3. Broadcasting or salting your joins can help you speed up your code - if you know when to use them

16/17
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).

Code: github.com/neilcuz/thread…

17/17
