Neil Currie · Aug 3 · 19 tweets · 5 min read
Ever wondered how to wrangle big data with R?

A thread on using Apache Arrow in R👇🧵

#rstats #bigdata #datascience
Big data is any data which is too large to be loaded into memory all in one go.

Ever tried to read a large dataset into R and everything fell over or slowed waaay down?

This is where Arrow comes in.

1/18
Arrow is a software platform designed for working with large datasets - you can use it in R, Python, Julia and more.

It uses an in-memory columnar format for structuring data.

Long story short, this means it's fast.

arrow.apache.org/overview/

2/18
Arrow is easily used in R - just use install.packages as you normally would. Nothing extra is required.

It comes with a bunch of functions for reading and writing csv, parquet and other files.

A big bonus is you can use it with dplyr.

3/18
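A minimal sketch of getting started. The filename here is a placeholder for illustration, not from the thread:

```r
# Install arrow once, like any other package - nothing extra is required
install.packages("arrow")

library(arrow)
library(dplyr)

# Read a csv with arrow's reader - a drop-in for readr::read_csv
flights <- read_csv_arrow("flights.csv")

# Write it back out as parquet, a compressed columnar format
write_parquet(flights, "flights.parquet")
```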
You can select, filter and mutate your data in almost the same way.

There is one big difference though.

4/18
Arrow is lazily evaluated.

What this means is, when you write your commands, they aren't executed straight away. Think of it as a recipe.

Only when we tell Arrow to run with the collect function will our instructions be executed.

I'll show you with an example.

5/18
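Before the real dataset, here's a toy sketch of the lazy-evaluation idea using a built-in data frame (my own illustration, assuming the arrow_table helper from recent arrow versions):

```r
library(arrow)
library(dplyr)

# Convert a data frame to an Arrow Table so dplyr verbs run through arrow
cars <- arrow_table(mtcars)

# Nothing is computed yet - this is just the "recipe"
recipe <- cars |>
  filter(cyl >= 6) |>
  mutate(kpl = mpg * 0.425)

# collect() triggers execution and returns a tibble
result <- collect(recipe)
```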
The data I'll use is the NYC Taxi Trips dataset from Google on Kaggle, in the form of a csv file. It contains 10 million rows and is 1.5GB. Is it big data? Not quite - I can still open it in R on my laptop - but it's not far off.

Data: kaggle.com/datasets/neilc…

6/18
Say I want to do the following:

1. Filter the dataset so only trip_distance >= 5 and passenger_count > 1

2. Calculate the total amount paid per passenger.

3. Calculate the mean total amount paid per passenger by pickup_location_id.

7/18
Method 1: standard approach

Steps:

1. read data in using readr::read_csv

2. use dplyr commands to manipulate the dataset.

8/18
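A sketch of the standard approach (the original code image is truncated, so the file path and the total_amount column name are my assumptions):

```r
library(readr)
library(dplyr)

# Placeholder path to the ~1.5GB NYC taxi csv
file_taxi <- "nyc-taxi-trips.csv"

# Step 1: read the whole file into memory
taxi <- read_csv(file_taxi)

# Step 2: filter, derive the per-passenger amount, then average by pickup location
taxi_summary <- taxi |>
  filter(trip_distance >= 5, passenger_count > 1) |>
  mutate(amount_per_passenger = total_amount / passenger_count) |>
  group_by(pickup_location_id) |>
  summarise(mean_amount_per_passenger = mean(amount_per_passenger))
```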
Method 2: Arrow approach

Steps:

1. read data in using arrow::read_csv_arrow

2. use the exact same dplyr commands

3. collect to run the commands

This ran in around 4 seconds compared to 19 seconds for method 1.

9/18
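And a sketch of the Arrow version - same dplyr verbs, plus collect at the end (again, file path and total_amount are my assumptions; as_data_frame = FALSE keeps the data as an Arrow Table so evaluation stays lazy):

```r
library(arrow)
library(dplyr)

file_taxi <- "nyc-taxi-trips.csv"  # placeholder path

# Step 1: read with arrow, keeping an Arrow Table rather than a data frame
taxi_arrow <- read_csv_arrow(file_taxi, as_data_frame = FALSE)

# Step 2: the exact same dplyr commands...
taxi_summary <- taxi_arrow |>
  filter(trip_distance >= 5, passenger_count > 1) |>
  mutate(amount_per_passenger = total_amount / passenger_count) |>
  group_by(pickup_location_id) |>
  summarise(mean_amount_per_passenger = mean(amount_per_passenger)) |>
  collect()  # Step 3: collect to actually run them
```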
What about big data though? In the previous example Arrow wasn't strictly necessary. Again we turn to Kaggle, this time using the NF-UQ-NIDS Network Intrusion dataset.

kaggle.com/datasets/aryas…

10/18
Say I want to do the following:

1. Filter the dataset so only IN_PKTS > 3 and IN_BYTES > 400.

2. Calculate IN_BYTES / IN_PKTS and filter so the result > 200.

3. Convert to a normal data frame.

11/18
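One way to do this without ever loading the full 13.7GB into RAM is arrow's open_dataset, which scans files lazily (a sketch under my own assumptions - the thread's exact code isn't shown, and the path is a placeholder):

```r
library(arrow)
library(dplyr)

# open_dataset points at the file(s) without reading them into memory
nids <- open_dataset("nf-uq-nids", format = "csv")

nids_result <- nids |>
  filter(IN_PKTS > 3, IN_BYTES > 400) |>          # step 1
  mutate(bytes_per_packet = IN_BYTES / IN_PKTS) |>
  filter(bytes_per_packet > 200) |>                # step 2
  collect()                                        # step 3: normal data frame
```

The filtering happens as Arrow streams over the data, so only the (much smaller) result ends up in memory.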
The standard approach fails here. The dataset is 13.7GB. With 16GB of RAM, minus all the overhead for the operating system, my R session just falls over.

12/18
No problems for Arrow though. This ran in about 7 minutes for me.

13/18
I've previously written threads about using sparklyr for working with big data.

One good question is what package to use - sparklyr or arrow?

The NIDS example took around 2 minutes longer with arrow than with sparklyr, but that was a single run of one example.

More testing is needed.

14/18
A pro for Arrow is that it's standalone.

It requires no setup beyond the initial package install and no ongoing maintenance.

If you need to do machine learning, though, Arrow doesn't have those tools - stick with Spark.

15/18
Lastly, a big shout out to the R Twitter community who put me on to Arrow in the first place.

16/18
To recap:

The arrow package lets you wrangle big data directly in R.

It's fast and easy to install with no extra setup beyond installing the package.

For anyone who knows dplyr you are in luck - the syntax is almost identical 👌.

17/18
Thanks for reading, if you liked this thread follow me @neilgcurrie for mainly R and data science stuff (though I make no promises).

Code:

github.com/neilcuz/thread…

Code and links to my other threads:

github.com/neilcuz/thread…

18/18
