Neil Currie
Jun 15 · 19 tweets · 5 min read
Ever wondered how to manipulate big data with R?

A thread on using Spark in R👇🧵

#rstats #spark #datascience
Big data is any data which is too large to be loaded into memory all in one go. Ever tried to read a large dataset into R and everything fell over or slowed waaaaaay down? This is where Spark comes in.

1/18
Spark is an open source tool for processing big data. It splits data into partitions for processing to overcome RAM limitations and writes to disk when needed. It is complicated but Spark handles most of the difficult parts. And you can work with Spark directly from R.

2/18
The sparklyr package offers syntax familiar to anyone with experience of dplyr. You can select, filter and mutate your data in almost the same way. There is one big difference though.

3/18
Spark is lazily evaluated. What this means is, when you write your commands, they aren't executed straight away. Think of it as a recipe. Only when we tell sparklyr to run our commands, using the collect function, will they be executed. I'll show you with an example.

4/18
The data I'll use is NYC Taxi Trips Data from Google on Kaggle in the form of a csv file. It contains 10 million rows and is 1.5GB. Is it big data? Not quite: I can still open it in R on my laptop, but it's not far off.

Data: kaggle.com/datasets/neilc…

5/18
Say I want to do the following:

1. Filter the dataset so only trip_distance >= 5 and passenger_count > 1

2. Calculate the total amount paid per passenger.

3. Calculate the mean total amount paid per passenger by pickup_location_id.

6/18
# Method 1: standard approach

Steps:
1. Read the data in using readr::read_csv.
2. Use dplyr commands to manipulate the dataset.

7/18
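The code screenshot isn't preserved in this unroll, so here is a sketch of what Method 1 might look like. The column names trip_distance, passenger_count and pickup_location_id come from the task above; total_amount is an assumed name for the amount paid, and the wrangling is wrapped in a function and demonstrated on a toy tibble rather than the real csv.

```r
library(dplyr)
library(readr)

# In the thread this step is readr::read_csv on the 1.5GB taxi csv;
# the filename here is illustrative.
# trips <- read_csv("taxi_trips.csv")

# The three wrangling steps, written as a function so they are easy to test.
summarise_trips <- function(trips) {
  trips |>
    filter(trip_distance >= 5, passenger_count > 1) |>
    mutate(amount_per_passenger = total_amount / passenger_count) |>
    group_by(pickup_location_id) |>
    summarise(mean_amount_per_passenger = mean(amount_per_passenger),
              .groups = "drop")
}

# A tiny stand-in for the real 10-million-row dataset
toy_trips <- tibble(
  trip_distance      = c(6, 2, 10),
  passenger_count    = c(2, 3, 4),
  total_amount       = c(20, 30, 40),
  pickup_location_id = c("A", "B", "A")
)

summarise_trips(toy_trips)
```

On the toy data, the 2-mile trip is filtered out and the two remaining trips both work out at 10 per passenger from pickup location A.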
# Method 2: sparklyr approach

First I need to install Spark. You only need to run this once. I had to download a version of Java too.

Java8: java.com/en/download/ma…

8/18
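The setup screenshot isn't preserved here, but the one-time installation looks roughly like this (spark_install downloads a local Spark distribution, and it needs Java 8 available on your machine):

```r
# One-time setup: run once, not every session
install.packages("sparklyr")
sparklyr::spark_install()
```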
And then running it.

Steps:
1. Set up a connection to Spark with spark_connect.
2. Write a 'recipe' using sparklyr with dplyr syntax.
3. Tell Spark to execute with the collect function.
4. Remember to disconnect when you are done.

9/18
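The code screenshot isn't preserved in this unroll, so here is a hedged sketch of those four steps. The file path and table name are illustrative, and total_amount is an assumed column name:

```r
library(sparklyr)
library(dplyr)

# 1. Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Register the csv with Spark (path is illustrative)
trips <- spark_read_csv(sc, name = "trips", path = "taxi_trips.csv")

# 2. Write the recipe with dplyr syntax - nothing runs yet
result <- trips |>
  filter(trip_distance >= 5, passenger_count > 1) |>
  mutate(amount_per_passenger = total_amount / passenger_count) |>
  group_by(pickup_location_id) |>
  summarise(mean_amount_per_passenger = mean(amount_per_passenger)) |>
  collect()  # 3. only here does Spark actually execute the recipe

# 4. Disconnect when done
spark_disconnect(sc)
```

Note the recipe itself is identical dplyr code to Method 1; only the connect, collect and disconnect steps are new.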
What about big data though? In the previous example, Spark wasn't necessary. Again we turn to Kaggle. This time using the NF-UQ-NIDS Network Intrusion dataset.

kaggle.com/datasets/aryas…

10/18
Say I want to do the following:

1. Filter the dataset so only IN_PKTS > 3 and IN_BYTES > 400.

2. Calculate IN_BYTES / IN_PKTS and filter so the result > 200.

3. Convert to a normal data frame.

11/18
The standard approach fails here. The dataset is 13.7GB. With 16GB of RAM, minus all the overhead for the operating system, my R session just falls over.

12/18
No problems for sparklyr though. This ran in about 5 minutes for me.

13/18
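Again the screenshot isn't preserved, so here is a sketch of those three steps with sparklyr. The file path is illustrative; the column names IN_PKTS and IN_BYTES come from the task description above:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Spark scans the 13.7GB csv in partitions rather than loading it into RAM
flows <- spark_read_csv(sc, name = "flows", path = "nf-uq-nids.csv")

result <- flows |>
  filter(IN_PKTS > 3, IN_BYTES > 400) |>
  mutate(bytes_per_pkt = IN_BYTES / IN_PKTS) |>
  filter(bytes_per_pkt > 200) |>
  collect()  # collect converts the result to an ordinary in-memory data frame

spark_disconnect(sc)
```

The key point is that collect only pulls back the filtered result, which fits in memory even though the raw file doesn't.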
Not all operations are equally suited to parallel processing. Imagine your data is in partitions. If you filter each partition so var1 > 1, or mutate var2 to replace some text, you can do that to each partition separately and then combine the results without any problems.

14/18
Sorting is a different story. Imagine you want to arrange your data by var1. If you do that to each partition there is little chance your data will be sorted correctly when those partitions are combined together. Sorting is expensive so leave it to the end!

15/18
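A plain-R sketch of the point in the last two tweets, using a vector split into two pretend partitions: filtering commutes with partitioning, sorting doesn't.

```r
# A column of values split across two Spark-style partitions
x <- c(5, 1, 9, 3, 7, 2)
partitions <- split(x, rep(1:2, each = 3))  # partition 1: 5 1 9, partition 2: 3 7 2

# Filtering each partition separately, then combining, matches a global filter
filtered <- unname(unlist(lapply(partitions, function(p) p[p > 2])))
identical(filtered, x[x > 2])  # TRUE

# Sorting each partition separately does NOT give a globally sorted result
per_partition_sort <- unname(unlist(lapply(partitions, sort)))
identical(per_partition_sort, sort(x))  # FALSE: c(1, 5, 9, 2, 3, 7) vs c(1, 2, 3, 5, 7, 9)
```

A real global sort forces Spark to shuffle data between partitions, which is why it's expensive.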
This is just scratching the surface of what you can do and some of the nuance involved. I haven't even touched on modelling, joins, Spark functions or other options for optimising your code.

16/18
TL;DR

The sparklyr package lets you work with Spark directly from R. This is a powerful approach for wrangling and modelling with big data. For anyone who knows dplyr, the syntax is 👌.

17/18
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).

Code: github.com/neilcuz/thread…

18/18
