Learn to calculate regression equations and perform hypothesis tests with The Manga Guide to Regression Analysis.
You'll also learn simple, multiple, and logistic regression to predict iced tea orders and bakery revenues, and how to calculate confidence intervals and odds ratios.
The curse of dimensionality is a major roadblock for machine learning practitioners.
But most don't fully understand it.
Don't be left in the dark - join me in this thread as I clarify and demystify this concept 👇🏽🧵
The Curse of Dimensionality (let's just call it "The Curse") refers to problems that occur when you try to use statistical methods in high-dimensional space.
As the number of features (dimensions) increases, the data becomes increasingly sparse relative to the volume of the space, and often exponentially more samples are needed to make statistically reliable predictions.
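To make the sparsity point concrete, here is a minimal sketch (not a benchmark). It assumes uniformly random data in the unit cube and a grid of 10 bins per feature; the function name `occupied_fraction` and all parameter values are made up for illustration. With the sample count fixed, the share of grid cells that contain any data at all collapses as features are added, because the number of cells grows as `bins ** dim`.

```python
import random

def occupied_fraction(n_samples, dim, bins=10, seed=0):
    """Share of grid cells that contain at least one sample.

    Assumption for illustration: points are uniform in the unit cube
    and each feature axis is cut into `bins` equal-width bins.
    """
    rng = random.Random(seed)
    cells = {
        tuple(int(rng.random() * bins) for _ in range(dim))
        for _ in range(n_samples)
    }
    # The grid has bins ** dim cells, so the denominator grows
    # exponentially with the number of features.
    return len(cells) / bins ** dim

if __name__ == "__main__":
    for dim in (1, 2, 3, 6):
        print(dim, occupied_fraction(1000, dim))
```

With 1,000 samples, at most 1,000 of the one million cells in 6 dimensions can be occupied, so coverage can't exceed 0.1%. Keeping the same coverage you had in 1 dimension would take roughly 10x more samples for every feature you add.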
Feature selection is a crucial part of building a good machine learning model.
But most data scientists don't think before they select features.
The fact is: feature selection in machine learning is not always necessary.
Here are 5 situations when you don't need it 👇🏽🧵
1. You have a small dataset that doesn't have many features.
If the data you're using is small and doesn't have many features, you don't need to do feature selection.
2. The features are already carefully selected
If the features you're using have already been carefully chosen and are important for the task you are trying to do, you don't need to do feature selection.
The number one cause of machine learning model failure is dataset drift.
Yet most data scientists and machine learning practitioners don't know why their datasets are drifting.
Here are 6 of the most common reasons for dataset drift in machine learning 👇🏽🧵
What is dataset drift? It's when the statistical properties of a dataset change over time, which can negatively impact the performance of a machine learning model.
1. Changes in the data distribution:
The distribution of the data used to train the model may change over time, leading to dataset drift. This could be due to changes in the underlying process that generates the data, or due to changes in the data collection process itself.
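One hedged way to check for a change in a single feature's distribution is to compare a training-time sample against a recent sample with the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). This is a sketch of one possible check, not the only way to detect drift. The function name `ks_statistic` and the synthetic Gaussian data are assumptions for illustration, and any alert threshold would have to be picked for your own data.

```python
import bisect
import random

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

# Synthetic data (assumption): one feature at training time, then two
# later samples, one from the same process and one with a shifted mean.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

if __name__ == "__main__":
    print("no drift:", ks_statistic(train, same))
    print("drift:   ", ks_statistic(train, shifted))
```

A value near 0 means the two samples look alike; a value near 1 means they barely overlap. In this toy setup, the shifted sample should score clearly higher than the unshifted one.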