Arrow is easy to use in R - just install it with install.packages() as you normally would. Nothing extra is required.
It comes with a set of functions for reading and writing CSV, Parquet and other file formats.
A big bonus is you can use it with dplyr.
3/18
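As a quick sketch, installing Arrow and reading/writing files looks roughly like this (the file names here are just placeholders):

```r
# Install from CRAN like any other package
install.packages("arrow")

library(arrow)

# Read and write common formats (placeholder file names)
trips <- read_csv_arrow("trips.csv")     # CSV in
write_parquet(trips, "trips.parquet")    # Parquet out
trips <- read_parquet("trips.parquet")   # Parquet back in
```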
You can select, filter and mutate your data in almost the same way.
There is one big difference though.
4/18
Arrow is lazily evaluated.
What this means is that when you write your commands, they aren't executed straight away. Think of it as building up a recipe.
Only when we tell Arrow to run, using the collect() function, will our instructions be executed.
I'll show you with an example.
5/18
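Here's a rough sketch of what that looks like, assuming a trips.csv file with the taxi columns used later in this thread:

```r
library(arrow)
library(dplyr)

# Nothing is read into memory yet - Arrow just scans the file
ds <- open_dataset("trips.csv", format = "csv")

# Building the recipe: these lines don't do any computation
query <- ds %>%
  filter(trip_distance >= 5) %>%
  mutate(amount_per_passenger = total_amount / passenger_count)

# Only now does Arrow actually run the query and return a tibble
result <- collect(query)
```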
The data I'll use is the NYC Taxi Trips Data from Google on Kaggle, in the form of a csv file. It contains 10 million rows and is 1.5GB. Is it big data? Not quite, I can still open it in R on my laptop, but it's not far off.
The task:
1. Filter the dataset to rows where trip_distance >= 5 and passenger_count > 1.
2. Calculate the total amount paid per passenger.
3. Calculate the mean total amount paid per passenger by pickup_location_id.
7/18
Method 1: standard approach
Steps:
1. Read the data in using readr::read_csv
2. Use dplyr commands to manipulate the dataset (sketch below)
8/18
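A sketch of Method 1 - the file name and the total_amount column are assumptions based on the task description:

```r
library(readr)
library(dplyr)

# Method 1: read the whole 1.5GB CSV into memory with readr
trips <- read_csv("nyc_taxi_trips.csv")

result <- trips %>%
  filter(trip_distance >= 5, passenger_count > 1) %>%
  mutate(amount_per_passenger = total_amount / passenger_count) %>%
  group_by(pickup_location_id) %>%
  summarise(mean_amount_per_passenger = mean(amount_per_passenger))
```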
Method 2: Arrow approach
Steps:
1. Read the data in using arrow::read_csv_arrow
2. Use the exact same dplyr commands
3. Call collect() to run the commands (sketch below)
This ran in around 4 seconds, compared with 19 seconds for Method 1.
9/18
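And a sketch of Method 2 - the dplyr pipeline is identical; only the read function and the final collect() change:

```r
library(arrow)
library(dplyr)

# Method 2: read with Arrow (kept as an Arrow Table, not a data frame)
trips <- read_csv_arrow("nyc_taxi_trips.csv", as_data_frame = FALSE)

result <- trips %>%
  filter(trip_distance >= 5, passenger_count > 1) %>%
  mutate(amount_per_passenger = total_amount / passenger_count) %>%
  group_by(pickup_location_id) %>%
  summarise(mean_amount_per_passenger = mean(amount_per_passenger)) %>%
  collect()   # nothing runs until this line
```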
What about big data though? In the previous example, Spark wasn't necessary. Again we turn to Kaggle, this time using the NF-UQ-NIDS Network Intrusion dataset.
Big data is any data which is too large to be loaded into memory all in one go. Ever tried to read a large dataset into R and everything fell over or slowed waaaaaay down? This is where Spark comes in.
1/18
Spark is an open source tool for processing big data. It splits data into partitions to overcome RAM limitations and spills to disk when needed. It is complicated under the hood, but Spark handles most of the difficult parts. And you can work with Spark directly from R.
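One common way to do that is the sparklyr package (my assumption - the thread hasn't named a package at this point). A minimal local sketch, with a hypothetical file and column name, might look like this:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance for illustration
sc <- spark_connect(master = "local")

# Spark reads the file into partitions rather than into R's memory
nids <- spark_read_csv(sc, name = "nids", path = "nf_uq_nids.csv")

# The same dplyr verbs work; Spark does the heavy lifting,
# and collect() brings only the small summary back into R
attack_counts <- nids %>%
  count(Attack) %>%   # "Attack" is a hypothetical column name
  collect()

spark_disconnect(sc)
```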