
Joshua Ebner

30 Dec 21, 19 tweets, 8 min read

Merging two or more datasets is extremely important in data science.

Here's a quick thread that covers the basics of data merges in Python.

🧵[1/19]

#Python #datascience #DataAnalytics

[2/19]

In Python ...

You can combine two Pandas dataframes using the "merge" function.

You can also use the "join" method (which defaults to joining on the index).

#Python #datascience #DataAnalytics
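Here's a minimal sketch of both approaches. The dataframes, column names, and values are made up for illustration; they're not from the thread. It assumes pandas is installed.

```python
import pandas as pd

# Hypothetical example data (names and values are assumptions).
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Cy"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "salary": [50000, 60000, 70000],
})

# merge: match rows on a shared column.
merged = pd.merge(employees, salaries, on="employee_id")

# join: matches on the index by default, so move the key into the index first.
joined = employees.set_index("employee_id").join(salaries.set_index("employee_id"))

print(merged)
print(joined)
```

Both results contain the same information. The difference is that "merged" keeps employee_id as a regular column, while "joined" carries it as the index.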

[3/19]

When you merge dataframes, you'll typically have a so-called "key" variable.

This is the variable upon which you'll join the dataframes.

#Python #datascience

[4/19]

Typically, the "key" is a variable with unique values that can identify a particular record.

You can think of this as an ID variable (although if there's not a single unique variable, it's possible to merge data on multiple variables)

#Python #datascience #DataAnalytics
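As a rough sketch of merging on multiple variables: if no single column identifies a record, you can pass a list of columns to "on". The store/year data below is hypothetical and only there to show the idea.

```python
import pandas as pd

# Hypothetical data: neither "store" nor "year" is unique alone,
# but the (store, year) pair identifies each record.
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "revenue": [10, 12, 7],
})
staff = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "headcount": [3, 4, 2],
})

# Pass a list to on= to merge on the combination of columns.
combined = sales.merge(staff, on=["store", "year"])
print(combined)
```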

[5/19]

So in this example, we're merging two dataframes on the employee_id variable.

This variable exists in both dataframes, and is unique in both dataframes.

[6/19]

If you look carefully, you'll notice that there are several rows where employee_id matches across the two dataframes:

101 exists in both dataframes
102 exists in both dataframes
103 exists in both dataframes
104 exists in both dataframes

[7/19]

But there are some values of employee_id that only exist in one dataframe or the other.

For example ...
900 only exists in one dataframe
901 only exists in the other dataframe
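To make this concrete, here's a sketch that builds two dataframes with the employee_id values mentioned in this thread and checks which keys match. The other columns (name, salary), their values, and which dataframe holds 900 versus 901 are assumptions for illustration.

```python
import pandas as pd

# employee_id values follow the thread; everything else is hypothetical.
left = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
right = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

left_ids = set(left["employee_id"])
right_ids = set(right["employee_id"])

in_both = sorted(left_ids & right_ids)      # keys with a match
only_left = sorted(left_ids - right_ids)    # keys only in the first dataframe
only_right = sorted(right_ids - left_ids)   # keys only in the second dataframe

# Confirm the key is unique within each dataframe.
keys_unique = left["employee_id"].is_unique and right["employee_id"].is_unique

print(in_both, only_left, only_right, keys_unique)
```

A quick check like this before merging tells you how many rows to expect in the output.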

[8/19]

When we do a 'merge' of two dataframes in Python, we use the 'on=' parameter to specify the key variable ... the variable where we're looking for matching values.

If there's a match, then the rows are typically joined up and put in the output dataframe.

[9/19]

But the question is how to deal with the non-matching rows.

There are actually different types of merges (AKA, joins) that deal with non-matching rows differently.

[10/19]

An inner merge keeps only the rows that match exactly for the 'on' variable.

A left merge keeps everything in the "left" dataframe (the dataframe that's syntactically on the left hand side), and adds data from the matching rows in the "right" dataframe.

#datascience #DataAnalytics
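A small sketch of the difference, using the "how" parameter of merge. The employee_id values follow the thread; the names, salaries, and which dataframe is "left" are assumptions.

```python
import pandas as pd

# Hypothetical data; 900 is only on the left, 901 only on the right.
left = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
right = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# Inner merge: only keys found in both dataframes survive.
inner = left.merge(right, on="employee_id", how="inner")

# Left merge: every row of "left" survives; unmatched rows get NaN for salary.
left_merged = left.merge(right, on="employee_id", how="left")

print(inner)
print(left_merged)
```

In the left merge, employee 900 is kept with a missing salary, while employee 901 (which exists only on the right) is dropped.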

[11/19]

There are actually other types of merges/joins, but they are less commonly used.

If you're just starting out, I recommend that you learn how to do inner merges and left merges first, since those are the most common.
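For reference, here's a sketch of two of those other types, the outer merge and the right merge. The data is hypothetical apart from the employee_id values. The indicator=True argument adds a "_merge" column that records where each row came from.

```python
import pandas as pd

# Hypothetical data; 900 is only on the left, 901 only on the right.
left = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
right = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# Outer merge: keep every key from either dataframe.
outer = left.merge(right, on="employee_id", how="outer", indicator=True)

# Right merge: keep every row of "right", matching rows from "left" where possible.
right_merged = left.merge(right, on="employee_id", how="right")

print(outer)
print(right_merged)
```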

[12/19]

As always, the 80/20 rule applies.

[13/19]

Merges and data joins are very important in data science.

[14/19]

Typically, when you work on a project, the data you need will be scattered across multiple sources.

[15/19]

Part of the data cleaning and data wrangling phase of work is cleaning up the individual datasets, and *merging* them together into a final dataframe that's ready for analysis.

#datascience #DataAnalytics

[16/19]

Because you'll often need to combine multiple datasets before you work on a project, you really need to understand merges and joins.

#datascience #DataAnalytics

[17/19]

To be clear, I've glossed over a lot of details in this thread.

Things like joining on the index, multiple key variables, alternative merge types, etc.

[18/19]

But this thread should give you some of the basics of how to merge two datasets in Python.

@Josh_Ebner

[19/19]

If you want to learn more about data science and data wrangling, then follow me here: @Josh_Ebner

Every day, I post threads and tutorials about data science and machine learning in Python and R.

#Python #rstats #datascience #machinelearning #dataanalytics


