There are several types of data distributions you might encounter in a dataset.
Here are some common ones 👇🧵
1️⃣ Normal Distribution (Bell Curve):
Characterized by a symmetric bell-shaped curve, where most of the data points cluster around the mean, with fewer and fewer appearing as you move away from the mean.
2️⃣ Uniform Distribution:
Each value within a certain range has an equal probability of occurring. The distribution is flat, with no peaks.
3️⃣ Binomial Distribution:
Describes the number of successes in a fixed number of independent trials, with each trial having the same probability of success. It peaks near the expected number of successes (number of trials × probability of success).
4️⃣ Poisson Distribution:
Used for count data, like the number of events happening in a fixed interval of time or space. It peaks near its average rate: when that rate is low, the peak sits at small counts and the frequency drops off as the counts increase.
5️⃣ Exponential Distribution:
Describes the time between events in a process where events occur continuously and independently at a constant average rate.
6️⃣ Skewed Distribution (Left or Right):
In a skewed distribution, the tail of the distribution is longer on one side. In a right-skewed distribution, the tail is longer on the right, while in a left-skewed distribution, it is longer on the left.
7️⃣ Bimodal/Multimodal Distribution:
A distribution with two or more peaks. These peaks may vary in height and spread.
8️⃣ Log-Normal Distribution:
Applies when the logarithm of the variable is normally distributed. The variable takes only positive values, and the distribution is skewed to the right.
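A quick way to build intuition for these shapes is to draw samples and compare the mean with the median: a mean well above the median hints at a right skew. Here is a minimal sketch using only Python's standard library. The parameter values, the sample size, and the `sample_binomial` / `sample_poisson` helpers are illustrative assumptions, not something prescribed by the list above.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible
N = 10_000       # sample size (arbitrary choice)


def sample_binomial(n, p):
    """Number of successes in n independent trials with success probability p."""
    return sum(random.random() < p for _ in range(n))


def sample_poisson(lam):
    """Poisson draw using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, random.random()
    while product > threshold:
        count += 1
        product *= random.random()
    return count


# One sample per distribution; all parameters are illustrative.
samples = {
    "normal": [random.gauss(0, 1) for _ in range(N)],
    "uniform": [random.uniform(0, 1) for _ in range(N)],
    "binomial": [sample_binomial(10, 0.5) for _ in range(N)],
    "poisson": [sample_poisson(3) for _ in range(N)],
    "exponential": [random.expovariate(1.0) for _ in range(N)],
    "log-normal": [random.lognormvariate(0, 1) for _ in range(N)],
}

for name, values in samples.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name:<12} mean={mean:6.2f}  median={median:6.2f}")
```

For the symmetric cases (normal, uniform) the mean and median should land close together, while for the exponential and log-normal samples the mean should sit clearly above the median.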
These distributions are fundamental in statistics and data analysis, as they provide insights into the nature of the data and inform appropriate analytical strategies.
Soon I'll share when you may find each of them and how to handle them to optimise your model...
Generating or engineering features from Time Series data when using an ML approach involves extracting meaningful information that can be used by algorithms to understand patterns, make predictions, or identify trends.
Here are some feature engineering techniques 🧵👇
▶️ Time-Based Features:
Extracting features like hour, day, week, month, year, or season can be very informative, especially if the time series shows periodicity or seasonality.
▶️ Lag Features:
These are values at previous time steps. For instance, the value of a time series at time t-1, t-2, etc., can be used as a feature to predict the value at time t.
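As a rough illustration of both ideas, the sketch below builds time-based and lag features with Python's standard library only. The hourly series, the `build_features` helper, and the feature names are hypothetical, chosen just for this example; in practice a dataframe library is a common choice for the same job.

```python
from datetime import datetime, timedelta

# Hypothetical hourly series: timestamps plus one observed value per step.
start = datetime(2024, 1, 1, 0, 0)
timestamps = [start + timedelta(hours=i) for i in range(6)]
values = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]


def build_features(timestamps, values, lags=(1, 2)):
    """Return one feature row per time step that has all requested lags."""
    max_lag = max(lags)
    rows = []
    # Start at max_lag: earlier steps do not have a full lag history.
    for t in range(max_lag, len(values)):
        ts = timestamps[t]
        row = {
            # Time-based features
            "hour": ts.hour,
            "day_of_week": ts.weekday(),  # Monday = 0
            "month": ts.month,
            "year": ts.year,
            # Value to predict at time t
            "target": values[t],
        }
        # Lag features: values observed at t-1, t-2, ...
        for lag in lags:
            row[f"lag_{lag}"] = values[t - lag]
        rows.append(row)
    return rows


rows = build_features(timestamps, values)
for row in rows:
    print(row)
```

Note that the first `max(lags)` time steps are dropped, because they have no complete lag history. Each lag feature only looks backwards, so a row never contains information from the future.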
Here are the top 10 common errors that Junior (and not so Junior) Data Scientists make.
Check this list before training your model to make sure you don't make these mistakes! 👇 🧵
1️⃣ Data Leakage in Training and Evaluation:
A significant mistake is letting information from the evaluation set leak into training. This gives a falsely optimistic impression of model performance, especially in cases involving temporal elements.
2️⃣ Inadequate Handling of Temporal Data:
Junior data scientists often struggle with temporal data, which leads to data leakage (for example, training on observations that come after the ones being predicted). Participating in ML competitions (e.g., Kaggle) is one suggested way to get practice.
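One common guard against temporal leakage is to split by time instead of at random, so that the model is always evaluated on data that comes after what it was trained on. A minimal sketch, assuming each record carries a date; the `records` data and the `chronological_split` helper are hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical daily records; shuffled because rows often arrive out of order.
records = [
    {"day": date(2024, 1, 1) + timedelta(days=i), "y": float(i)}
    for i in range(10)
]
random.Random(0).shuffle(records)


def chronological_split(records, train_fraction=0.8, key="day"):
    """Split so that every training record precedes every evaluation record."""
    ordered = sorted(records, key=lambda r: r[key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


train, evaluation = chronological_split(records)

# Sanity check: no evaluation date falls inside the training period.
assert max(r["day"] for r in train) < min(r["day"] for r in evaluation)
print(len(train), "training rows;", len(evaluation), "evaluation rows")
```

The same caution applies to preprocessing: statistics used for scaling or imputation should be computed on the training portion only.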
Exploratory Data Analysis (EDA) is a process used for investigating your data to discover patterns, anomalies, relationships, or trends using statistical summaries and visual methods.
Let's find out more 🧵👇
It is essential for understanding the data's underlying structure and characteristics before applying more formal statistical or Machine Learning methods.
Some key points that we should normally check are 👇
▶️ Distribution of Data:
Assessing the distribution of data (e.g., normal, skewed) using histograms, box plots, and summary statistics helps understand the central tendency and variability.
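A first pass at this check can be sketched with Python's standard library alone. The `data` list below is made up for illustration, and the 1.5 × IQR cutoff is the usual box-plot rule of thumb, not a hard rule.

```python
import statistics

# Hypothetical sample with one very large value; not real data.
data = [12, 15, 14, 13, 16, 15, 14, 90, 13, 15, 17, 14]

# Central tendency and variability
mean = statistics.mean(data)
median = statistics.median(data)
stdev = statistics.stdev(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Box-plot style rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")
print(f"Q1={q1} Q3={q3} IQR={iqr} outliers={outliers}")

# Crude text histogram with bins of width 10
bins = {}
for x in data:
    low = (x // 10) * 10
    bins[low] = bins.get(low, 0) + 1
for low in sorted(bins):
    print(f"{low:>3}-{low + 9:<3} {'#' * bins[low]}")
```

A mean far above the median, together with a flagged point on the high side, is the kind of signal that suggests a right-skewed distribution worth plotting properly.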
Linear Regression is a fundamental algorithm in supervised Machine Learning used for predictive modeling.
Learn more about it here 🧵 👇
It involves analyzing the relationship between an independent variable, x, and a dependent variable, y.
In simple terms, it tries to draw a straight line through data to predict future values.
The main goal of Linear Regression is to find the line that best fits the relationship between x and y, typically the one that minimizes the squared differences between the predicted and the actual values of y. This line is then used to predict the value of y based on x.
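To make the idea concrete, here is a minimal sketch of fitting that line with ordinary least squares, which is the usual (but not the only) definition of "best" line. The data points and the `fit_line` and `predict` helpers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a given x on the fitted line."""
    return slope * x + intercept


# Hypothetical points that lie roughly on a straight line
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", round(predict(6, slope, intercept), 2))
```

In real projects a library implementation is the more common route, but the closed-form version shows that the "best line" comes down to two numbers computed from the data.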