We are going back to the basics to simplify ML algorithms.
... today's turn is Multiple Linear Regression! 👇🏻
In MLR, imagine you're baking.
You've got different ingredients or variables.
You need the perfect recipe (model) for your cake (prediction).
Each ingredient's quantity (coefficient) affects the taste (outcome).
1️⃣ 𝗗𝗔𝗧𝗔 𝗚𝗔𝗧𝗛𝗘𝗥𝗜𝗡𝗚 𝗣𝗛𝗔𝗦𝗘
We're using height and weight - a classic duo often assumed to have a linear relationship.
But assumptions in data science? No way! 🧐
Let's find out:
- Do height and weight really share a linear bond?
2️⃣ 𝗗𝗔𝗧𝗔 𝗘𝗫𝗣𝗟𝗢𝗥𝗔𝗧𝗜𝗢𝗡 𝗧𝗜𝗠𝗘! 🕵️‍♂️
Before we get our hands dirty with modeling, let's take a closer look at our data.
Remember, the essence of a great model lies in truly understanding your data first. 👁️
However... what about Gender?
𝗚𝗘𝗡𝗗𝗘𝗥'𝗦 𝗥𝗢𝗟𝗘
Let's start with the basics: when we plot height against weight, we see a linear pattern emerge.
However... when we consider gender...
It turns out that it significantly affects the weight for a given height.
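Here's a minimal sketch of that exploration, assuming a pandas DataFrame loaded from a hypothetical height_weight.csv with Height, Weight, and Gender columns (file and column names are my assumption, adjust them to your own data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names - adjust to your own dataset
df = pd.read_csv("height_weight.csv")

# One scatter per gender group makes the two point clouds visible
for gender, group in df.groupby("Gender"):
    plt.scatter(group["Height"], group["Weight"], label=gender, alpha=0.5)

plt.xlabel("Height")
plt.ylabel("Weight")
plt.legend()
plt.show()
```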
3️⃣ 𝗕𝗘𝗬𝗢𝗡𝗗 𝗛𝗘𝗜𝗚𝗛𝗧
Splitting our data by gender, we can perform two SINGLE linear regressions.
The slopes of these lines are almost identical, which indicates a similar behavior.
But what about the intercepts?
They tell us that the two groups start from different baselines.
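A quick sketch of those two separate fits, reusing the df from the snippet above (same assumed column names):

```python
from sklearn.linear_model import LinearRegression

# Fit one SIMPLE linear regression per gender group
for gender, group in df.groupby("Gender"):
    model = LinearRegression().fit(group[["Height"]], group["Weight"])
    print(f"{gender}: slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

If the data behaves as described, you should see near-identical slopes but clearly different intercepts.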
4️⃣ 𝗠𝗨𝗟𝗧𝗜-𝗩𝗔𝗥𝗜𝗔𝗕𝗟𝗘
We can add multiple variables to perform a MULTIPLE Linear Regression.
The core theory is the same: We still use a linear function to predict our target.
But now we can track N independent variables.
So we can consider both Height and Gender ➡️ N=2
5️⃣ 𝗧𝗬𝗣𝗘𝗦 𝗢𝗙 𝗩𝗔𝗥𝗜𝗔𝗕𝗟𝗘𝗦 🎲
MLR accepts both numbers and categories.
HEIGHT is a numerical variable - one that can be measured on a continuous scale.
GENDER is a categorical variable - it splits our data into distinct groups.
To use categories in our model, they have to be encoded as binary variables.
So say hello to dummy variables! 👋🏻
We can easily convert our gender variable into a boolean one with 1 and 0.
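For example, with pandas this is a one-liner (assuming the raw labels are "Male"/"Female" - adjust to your data):

```python
# Encode the category as a 0/1 dummy variable: 1 = Male, 0 = Female
df["Gender_male"] = (df["Gender"] == "Male").astype(int)
```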
6️⃣ 𝗧𝗛𝗘 𝗘𝗤𝗨𝗔𝗧𝗜𝗢𝗡 🧮
Our regression equation is like a secret recipe.
It tells us how much of each ingredient (variables) we need.
Each unit increase in height increases the predicted weight by a fixed amount - its coefficient.
But gender shifts the baseline too.
So we need to compute the weights!
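With the dummy variable in place, the equation looks like this (the beta weights are exactly what we'll estimate next):

```latex
% MLR equation for our two predictors
\widehat{weight} = \beta_0 + \beta_1 \cdot height + \beta_2 \cdot gender

% Plugging in gender = 0 or 1 recovers the two per-group lines:
%   gender = 0:  \widehat{weight} = \beta_0 + \beta_1 \cdot height
%   gender = 1:  \widehat{weight} = (\beta_0 + \beta_2) + \beta_1 \cdot height
```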
7️⃣ 𝗙𝗜𝗡𝗔𝗟 𝗥𝗘𝗦𝗨𝗟𝗧𝗦 📊
We can use scikit-learn to implement this MLR.
The code is quite straightforward, and we can easily obtain all three weights.
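A minimal sketch of that fit, continuing from the snippets above (column names are still my assumption):

```python
from sklearn.linear_model import LinearRegression

# Multiple Linear Regression: height + encoded gender -> weight
X = df[["Height", "Gender_male"]]
y = df["Weight"]

mlr = LinearRegression().fit(X, y)
print("b0 (intercept):    ", mlr.intercept_)
print("b1 (height weight):", mlr.coef_[0])
print("b2 (gender weight):", mlr.coef_[1])
```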
We get a single equation for both cases.
When considering that gender is either 0 or 1, we obtain two equations.
And they are quite similar to the ones we obtained from the two separate regressions earlier.
So this is all for now on Linear Regression.
Next week I'll write about Logistic Regression!
So you better stay tuned! 🤓
Did you like this thread?
Then join my freshly started DataBites newsletter to get all my content right in your inbox every Sunday! 🧩
Today, let's illustrate SQL's execution order with a simple query 👇🏻
1️⃣ 𝗦𝗧𝗔𝗥𝗧𝗜𝗡𝗚 𝗙𝗥𝗢𝗠 𝗢𝗨𝗥 𝗥𝗔𝗪 𝗧𝗔𝗕𝗟𝗘
We use a dummy table with the salary of employees depending on their field and experience.
🎯 Our main goal?
Understand which field earns the most.
2️⃣ 𝗦𝗤𝗟 𝗤𝗨𝗘𝗥𝗬 𝗦𝗧𝗥𝗨𝗖𝗧𝗨𝗥𝗘 (to use)
We define a query that returns the data we're after.
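Here's a self-contained sketch using sqlite3, with made-up names and numbers, showing both the query and the order the engine logically evaluates it in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, field TEXT, experience INTEGER, salary REAL);
    INSERT INTO employees VALUES
        ('Ann',  'Data Science', 5, 95000),
        ('Ben',  'Data Science', 2, 70000),
        ('Cara', 'Engineering',  4, 88000),
        ('Dan',  'Engineering',  1, 60000);
""")

# Logical execution order (NOT the written order!):
#   1. FROM employees      -- grab the raw table
#   2. GROUP BY field      -- form one group per field
#   3. SELECT AVG(salary)  -- compute the aggregate per group
#   4. ORDER BY avg_salary -- sort the final result
query = """
    SELECT field, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY field
    ORDER BY avg_salary DESC;
"""
for field, avg_salary in conn.execute(query):
    print(field, avg_salary)
```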
Today I am starting with a new ML model
... so it is the turn of the Support Vector Machine! 👇🏻
0️⃣ 𝗥𝗘𝗖𝗔𝗣
SVM is an ML method that finds the optimal hyperplane separating classes by maximizing the margin - the distance between the hyperplane and the closest data points of each class, known as the support vectors.
1️⃣ 𝗠𝗔𝗧𝗛𝗘𝗠𝗔𝗧𝗜𝗖𝗔𝗟 𝗜𝗡𝗧𝗨𝗜𝗧𝗜𝗢𝗡 🧮
To classify our data, we apply some intuition:
The dot product is the projection of one vector onto another. So we can use it to determine on which side of the hyperplane a data point falls - and therefore which class it belongs to.
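A toy numeric example of that idea, with a made-up weight vector w and bias b (purely illustrative, not a trained SVM):

```python
import numpy as np

# Hypothetical separating hyperplane: w . x + b = 0
w = np.array([2.0, -1.0])  # normal vector to the hyperplane
b = -1.0

def classify(x):
    # The sign of the dot-product projection decides the side (class)
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))  # 2*3 - 1*1 - 1 =  4 -> class  1
print(classify(np.array([0.0, 2.0])))  # 2*0 - 1*2 - 1 = -3 -> class -1
```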