Merging two or more datasets is extremely important in data science.

Here's a quick thread that covers the basics of data merges in Python.

🧵[1/19]

#Python #datascience #DataAnalytics
[2/19]

In Python ...

You can combine two Pandas dataframes using the "merge" function.

You can also use the "join" method (which joins on the index by default).

#Python #datascience #DataAnalytics
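
Here's a minimal sketch of both approaches. The dataframes, column names, and values are made up for illustration; they're not from the original example.

```python
import pandas as pd

# Hypothetical example data (names and values are illustrative only)
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Cy"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "salary": [50000, 60000, 70000],
})

# merge: match rows on a shared column
merged = pd.merge(employees, salaries, on="employee_id")

# join: matches on the index by default, so move the key into the index first
joined = employees.set_index("employee_id").join(salaries.set_index("employee_id"))

print(merged)
print(joined)
```

Both results hold the same data. The difference is that "merged" keeps employee_id as a regular column, while "joined" carries it as the index.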
[3/19]

When you merge dataframes, you'll typically have a so-called "key" variable.

This is the variable upon which you'll join the dataframes.

#Python #datascience
[4/19]

Typically, the "key" is a variable with unique values that can identify a particular record.

You can think of this as an ID variable (although if there's not a single unique variable, it's possible to merge data on multiple variables)

#Python #datascience #DataAnalytics
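
As a rough sketch of merging on multiple variables: in the made-up data below, neither "store" nor "year" is unique on its own, but the pair is. All names and values here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: a record is identified by the (store, year) pair
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "sales": [10, 12, 7],
})
costs = pd.DataFrame({
    "store": ["A", "A", "B"],
    "year": [2020, 2021, 2020],
    "costs": [4, 5, 3],
})

# Pass a list to on= to merge on more than one key variable
combined = pd.merge(sales, costs, on=["store", "year"])

print(combined)
```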
[5/19]

So in the example above, we're merging the two dataframes on the employee_id variable.

This variable exists in both dataframes, and is unique in both dataframes.
[6/19]

If you look carefully, you'll notice that there are several rows in both dataframes where there's a match for employee_id:

101 exists in both dataframes
102 exists in both dataframes
103 exists in both dataframes
104 exists in both dataframes
[7/19]

But there are some values of employee_id that only exist in one dataframe or the other.

For example ...
900 only exists in one dataframe
901 only exists in the other dataframe
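
If you want to check this yourself before merging, here's one possible sketch. The employee_id values follow the thread, but the other columns are invented, and putting 900 in the first dataframe is an assumption.

```python
import pandas as pd

# employee_id values follow the thread; other columns are hypothetical.
# Assumption: 900 is in the first dataframe and 901 is in the second.
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# Confirm the key is unique in each dataframe
print(employees["employee_id"].is_unique, salaries["employee_id"].is_unique)

# Compare the key values with set operations
left_ids = set(employees["employee_id"])
right_ids = set(salaries["employee_id"])

in_both = sorted(left_ids & right_ids)
only_left = sorted(left_ids - right_ids)
only_right = sorted(right_ids - left_ids)

print(in_both, only_left, only_right)
```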
[8/19]

When we do a 'merge' of two dataframes in Python, we use the 'on=' parameter to specify the key variable ... the variable where we're looking for matching values.

If there's a match, then the rows are typically joined up and put in the output dataframe.
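
A small sketch of the 'on=' parameter in action. The employee_id values follow the thread, while the "name" and "salary" columns are made up for illustration.

```python
import pandas as pd

# Hypothetical columns; which dataframe holds 900 vs. 901 is an assumption
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# on= names the key variable to look for matching values in
matched = pd.merge(employees, salaries, on="employee_id")

# Each matched row carries the columns from both dataframes
print(matched)
```

Here, pd.merge is called without a 'how=' argument, so it uses its default (an inner merge) and only the matched rows show up in the output.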
[9/19]

But the question is how to deal with the non-matching rows.

There are actually different types of merges (AKA joins) that deal with non-matching rows differently.
[10/19]

An inner merge keeps only the rows whose 'on' value appears in both dataframes.

A left merge keeps every row in the "left" dataframe (the dataframe that's syntactically on the left-hand side), and adds data from matching rows on the right. Left rows with no match get missing values in the right-hand columns.

#datascience #DataAnalytics
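
To make the difference concrete, here's a sketch that runs both merge types on the same data. The employee_id values follow the thread; the other columns, and the choice of which dataframe is on the left, are assumptions.

```python
import pandas as pd

# Hypothetical columns; employees is treated as the "left" dataframe
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# Inner merge: only employee_ids found in both dataframes survive
inner = pd.merge(employees, salaries, on="employee_id", how="inner")

# Left merge: every row of employees survives; 900 has no salary match
left = pd.merge(employees, salaries, on="employee_id", how="left")

print(len(inner), len(left))
print(left[left["salary"].isna()])
```

In this sketch, 900 is dropped by the inner merge but kept by the left merge with a missing salary. 901 appears in neither result, because it only exists in the right dataframe.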
[11/19]

There are actually other types of merges/joins (such as right and outer merges), but they are less commonly used.

If you're just starting out, I recommend that you learn how to do inner merges and left merges first, since those are the most common.
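
For reference, here's a brief sketch of two of those other merge types, outer and right, using the same kind of made-up data (the non-key columns are hypothetical). The 'indicator=True' option adds a "_merge" column showing where each row came from.

```python
import pandas as pd

# Hypothetical columns; which dataframe holds 900 vs. 901 is an assumption
employees = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 900],
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 901],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# Outer merge: keep every row from both dataframes
outer = pd.merge(employees, salaries, on="employee_id", how="outer", indicator=True)

# Right merge: keep every row from the right dataframe
right = pd.merge(employees, salaries, on="employee_id", how="right")

print(outer[["employee_id", "_merge"]])
print(len(right))
```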
[12/19]

As always, the 80/20 rule applies.
[13/19]

Merges and data joins are very important in data science.
[14/19]

Typically, when you work on a project, the data you need will be scattered across multiple sources.
[15/19]

Part of the data cleaning and data wrangling phase of work is cleaning up the individual datasets, and *merging* them together into a final dataframe that's ready for analysis.

#datascience #DataAnalytics
[16/19]

Because you'll often need to combine multiple datasets before you work on a project, you really need to understand merges and joins.

#datascience #DataAnalytics
[17/19]

To be clear, I've glossed over a lot of details in this thread.

Things like joining on the index, multiple key variables, alternative merge types, etc.
[18/19]

But this thread should give you some of the basics of how to merge two datasets in Python.
[19/19]

If you want to learn more about data science and data wrangling, then follow me here: @Josh_Ebner

Every day, I post threads and tutorials about data science and machine learning in Python and R.

#Python #rstats #datascience #machinelearning #dataanalytics

