So in the example above, we're merging the two dataframes on the employee_id variable.
This variable exists in both dataframes, and is unique in both dataframes.
[6/19]
If you look carefully, you'll notice that there are several rows in both dataframes where there's a match for employee_id
101 exists in both dataframes
102 exists in both dataframes
103 exists in both dataframes
104 exists in both dataframes
[7/19]
But there are some values of employee_id that only exist in one dataframe or the other.
For example ...
900 only exists in one dataframe
901 exists in the other dataframe
[8/19]
When we do a 'merge' of two dataframes in Python, we use the 'on=' parameter to specify the key variable ... the variable where we're looking for matching values.
If there's a match, then the rows are typically joined up and put in the output dataframe.
[9/19]
But the question is how to deal with the non-matching rows.
There are actually different time of merges (AKA, joins) that deal with non-matching rows differently.
[10/19]
An inner merge keeps only the rows that match exactly for the 'on' variable
A left merge keeps everything in the "left" dataframe (the dataframe that's syntactically on the left hand side), and adds data from matching data on the right.
There are actually other types of merges/joins, but they are less commonly used.
If you're just starting out, I recommend that you learn how to do inner merges and left merges first, since those are the most common.
[12/19]
As always, the 80/20 rule applies.
[13/19]
Merges and data joins are very important in data science.
[14/19]
Typically, when you work on a project, the data you need will be scattered across multiple sources.
[15/19]
Part of the data cleaning and data wrangling phase of work is cleaning up the individual datasets, and *merging* them together into a final dataframe that's ready for analysis