Neil Currie
Sep 7 · 10 tweets · 5 min read
How to use R with DuckDB (so you can turbocharge your data queries and handle big data effortlessly)

#rstats #bigdata #duckdb

1.
DuckDB is a fantastic tool currently seeing a rapid rise in the data world.

It is designed for rapid analytical queries and works brilliantly with big data.

Best of all, you can use it directly in tools like #rstats, #python, #java and others.

Let's see how it works in R.

2.
You can install DuckDB straight from the console with install.packages("duckdb").

Unlike some other big data tools, it is entirely self-contained.

This means no extra ongoing maintenance - your IT department will thank you for that.

3.
Let's look at an example.

The data I'll use is the NYC Taxi Trips data from Google on Kaggle, in CSV format. It contains 10 million rows and is 1.5GB. Is it big data? Not quite: I can still open it in R on my laptop, but it's not far off.

Data: kaggle.com/datasets/neilc…

4.
First we will create our database.

With a little SQL inside the dbExecute function, we create a table called trips and populate it with the taxi data from the CSV file (sketch below).

The database is now ready to query.
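The thread's code screenshot is truncated, so here is a minimal sketch of the setup. The database and CSV file names are illustrative assumptions, not taken from the original:

library(DBI)
library(duckdb)

# Create the database, call it taxis
con <- dbConnect(duckdb(), dbdir = "taxis.duckdb")

# Create a table called trips and populate it from the CSV file
# (read_csv_auto is DuckDB's built-in CSV reader; the file name is an assumption)
dbExecute(con, "
  CREATE TABLE trips AS
  SELECT * FROM read_csv_auto('nyc-taxi-trips.csv')
")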

5.
Now we can query using dplyr syntax (or, if you prefer, pass SQL to the dbGetQuery function).

For me it runs rapidly, in around 0.15 seconds. But how does that compare to a more standard approach?
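Here's a sketch of the timed query. The name fare_summary1 comes from the thread's screenshot; the fare_amount column is an assumption based on the NYC taxi data:

library(dplyr)  # dbplyr must also be installed for database-backed tables

trips <- tbl(con, "trips")

start_time <- Sys.time()  # time the code

fare_summary1 <- trips |>
  summarise(mean_fare = mean(fare_amount, na.rm = TRUE)) |>
  collect()

Sys.time() - start_time

# Or, with SQL instead of dplyr:
# dbGetQuery(con, "SELECT AVG(fare_amount) AS mean_fare FROM trips")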

6.
Using vroom (which is no slouch) plus dplyr, it runs in 4.65 seconds, though some of the DuckDB time is overhead from setting up the connection.

That's a massive difference when you start using big data with complicated processing.

This is just a flavour of DuckDB's capabilities.
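For comparison, a sketch of the more standard approach, using vroom to read in the CSV and dplyr to summarise (same assumed file and column names as above):

library(vroom)
library(dplyr)

start_time <- Sys.time()

fare_summary2 <- vroom("nyc-taxi-trips.csv") |>
  summarise(mean_fare = mean(fare_amount, na.rm = TRUE))

Sys.time() - start_time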

7.
Links to code are here:

github.com/neilcuz/thread…

Code and links to my other threads:

github.com/neilcuz/thread…

8.
To recap:

- DuckDB is a brilliant data tool for writing fast analytical queries
- It handles big data with ease
- You can use it directly in #rstats (+ #python #java & others)
- It uses syntax you will be familiar with if you know a little dplyr

9.
I hope you've found this thread helpful. Follow me @neilgcurrie for more R and data tweets.

Like/Retweet the first tweet below if you can:
