Neil Currie
Jun 15 · 19 tweets · 5 min read
Ever wondered how to manipulate big data with R?

A thread on using Spark in R👇🧵

#rstats #spark #datascience
Big data is any data which is too large to be loaded into memory all in one go. Ever tried to read a large dataset into R and everything fell over or slowed waaaaaay down? This is where Spark comes in.

1/18
Spark is an open source tool for processing big data. It splits data into partitions for processing to overcome RAM limitations and writes to disk when needed. It is complicated but Spark handles most of the difficult parts. And you can work with Spark directly from R.

2/18
The sparklyr package offers syntax familiar to anyone with experience of dplyr. You can select, filter and mutate your data in almost the same way. There is one big difference though.

3/18
Spark is lazily evaluated. What this means is, when you write your commands, they aren't executed straight away. Think of it as a recipe. Only when we tell sparklyr to run our commands, using the collect function, will they be executed. I'll show you with an example.

4/18
The data I'll use is NYC Taxi Trips Data from Google on Kaggle in the form of a csv file. It contains 10 million rows and is 1.5GB. Is it big data? Not quite: I can still open it in R on my laptop, but it's not far off.

Data: kaggle.com/datasets/neilc…

5/18
Say I want to do the following:

1. Filter the dataset so only trip_distance >= 5 and passenger_count > 1

2. Calculate the total amount paid per passenger.

3. Calculate the mean total amount paid per passenger by pickup_location_id.

6/18
# Method 1: standard approach

Steps:
1. Read the data in using readr::read_csv.
2. Use dplyr commands to manipulate the dataset.

7/18
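The code screenshot isn't preserved in this unroll, so here is a sketch of what Method 1 might look like. The column names trip_distance, passenger_count and pickup_location_id come from the task above; total_amount is an assumed name for the amount paid, and the wrangling is wrapped in a function and demonstrated on a toy tibble rather than the real csv.

```r
library(dplyr)
library(readr)

# In the thread this step is readr::read_csv on the 1.5GB taxi csv;
# the filename here is illustrative.
# trips <- read_csv("taxi_trips.csv")

# The three wrangling steps, written as a function so they are easy to test.
summarise_trips <- function(trips) {
  trips |>
    filter(trip_distance >= 5, passenger_count > 1) |>
    mutate(amount_per_passenger = total_amount / passenger_count) |>
    group_by(pickup_location_id) |>
    summarise(mean_amount_per_passenger = mean(amount_per_passenger),
              .groups = "drop")
}

# A tiny stand-in for the real 10-million-row dataset
toy_trips <- tibble(
  trip_distance      = c(6, 2, 10),
  passenger_count    = c(2, 3, 4),
  total_amount       = c(20, 30, 40),
  pickup_location_id = c("A", "B", "A")
)

summarise_trips(toy_trips)
```

On the toy data, the 2-mile trip is filtered out and the two remaining trips both work out at 10 per passenger from pickup location A.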
# Method 2: sparklyr approach

First I need to install Spark. You only need to run this once. I had to download a version of Java too.

Java8: java.com/en/download/ma…

8/18
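The setup screenshot isn't preserved here, but the one-time installation looks roughly like this (spark_install downloads a local Spark distribution, and it needs Java 8 available on your machine):

```r
# One-time setup: run once, not every session
install.packages("sparklyr")
sparklyr::spark_install()
```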
And then running it.

Steps:
1. Set up a connection to Spark with spark_connect.
2. Write a 'recipe' using sparklyr with dplyr syntax.
3. Tell Spark to execute with the collect function.
4. Remember to disconnect when you are done.

9/18
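The code screenshot isn't preserved in this unroll, so here is a hedged sketch of those four steps. The file path and table name are illustrative, and total_amount is an assumed column name:

```r
library(sparklyr)
library(dplyr)

# 1. Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Register the csv with Spark (path is illustrative)
trips <- spark_read_csv(sc, name = "trips", path = "taxi_trips.csv")

# 2. Write the recipe with dplyr syntax - nothing runs yet
result <- trips |>
  filter(trip_distance >= 5, passenger_count > 1) |>
  mutate(amount_per_passenger = total_amount / passenger_count) |>
  group_by(pickup_location_id) |>
  summarise(mean_amount_per_passenger = mean(amount_per_passenger)) |>
  collect()  # 3. only here does Spark actually execute the recipe

# 4. Disconnect when done
spark_disconnect(sc)
```

Note the recipe itself is identical dplyr code to Method 1; only the connect, collect and disconnect steps are new.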
What about big data though? In the previous example, Spark wasn't necessary. Again we turn to Kaggle. This time using the NF-UQ-NIDS Network Intrusion dataset.

kaggle.com/datasets/aryas…

10/18
Say I want to do the following:

1. Filter the dataset so only IN_PKTS > 3 and IN_BYTES > 400.

2. Calculate IN_BYTES / IN_PKTS and filter so the result > 200.

3. Convert to a normal data frame.

11/18
The standard approach fails here. The dataset is 13.7GB. With 16GB of RAM, minus all the overhead for the operating system, my R session just falls over.

12/18
No problems for sparklyr though. This ran in about 5 minutes for me.

13/18
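Again the screenshot isn't preserved, so here is a sketch of those three steps with sparklyr. The file path is illustrative; the column names IN_PKTS and IN_BYTES come from the task description above:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Spark scans the 13.7GB csv in partitions rather than loading it into RAM
flows <- spark_read_csv(sc, name = "flows", path = "nf-uq-nids.csv")

result <- flows |>
  filter(IN_PKTS > 3, IN_BYTES > 400) |>
  mutate(bytes_per_pkt = IN_BYTES / IN_PKTS) |>
  filter(bytes_per_pkt > 200) |>
  collect()  # collect converts the result to an ordinary in-memory data frame

spark_disconnect(sc)
```

The key point is that collect only pulls back the filtered result, which fits in memory even though the raw file doesn't.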
Not all operations are equally suited to parallel processing. Imagine your data is in partitions. If you filter each partition so var1 > 1, or mutate var2 to replace some text, you can do that to each partition separately and then combine the results without any problems.

14/18
Sorting is a different story. Imagine you want to arrange your data by var1. If you do that to each partition there is little chance your data will be sorted correctly when those partitions are combined together. Sorting is expensive so leave it to the end!

15/18
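A plain-R sketch of the point in the last two tweets, using a vector split into two pretend partitions: filtering commutes with partitioning, sorting doesn't.

```r
# A column of values split across two Spark-style partitions
x <- c(5, 1, 9, 3, 7, 2)
partitions <- split(x, rep(1:2, each = 3))  # partition 1: 5 1 9, partition 2: 3 7 2

# Filtering each partition separately, then combining, matches a global filter
filtered <- unname(unlist(lapply(partitions, function(p) p[p > 2])))
identical(filtered, x[x > 2])  # TRUE

# Sorting each partition separately does NOT give a globally sorted result
per_partition_sort <- unname(unlist(lapply(partitions, sort)))
identical(per_partition_sort, sort(x))  # FALSE: c(1, 5, 9, 2, 3, 7) vs c(1, 2, 3, 5, 7, 9)
```

A real global sort forces Spark to shuffle data between partitions, which is why it's expensive.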
This is just scratching the surface of what you can do and some of the nuance involved. I haven't even touched on modelling, joins, Spark functions or other options for optimising your code.

16/18
TL;DR

The sparklyr package lets you work with Spark directly from R. This is a powerful approach for wrangling and modelling with big data. For anyone who knows dplyr, the syntax is 👌.

17/18
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).

Code: github.com/neilcuz/thread…

18/18
