David Andrés 🤖📈🐍
Nov 27 · 12 tweets · 2 min read
There are several types of data distributions you might encounter in a dataset.

Here are some common ones 👇🧵
1️⃣ Normal Distribution (Bell Curve):

Characterized by a symmetric bell-shaped curve, where most of the data points cluster around the mean, with fewer and fewer appearing as you move away from the mean.
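A quick way to see this clustering is to draw samples and count how many fall near the mean. This is a minimal sketch using only Python's standard library; the mean of 0, standard deviation of 1, sample size, and seed are illustrative assumptions, not values from the thread.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # assumed mean 0, std 1

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)

# Share of points within 1 and 2 standard deviations of the mean.
within_1sd = sum(abs(x - mean) <= std for x in samples) / len(samples)
within_2sd = sum(abs(x - mean) <= 2 * std for x in samples) / len(samples)

print(f"mean={mean:.3f} std={std:.3f}")
print(f"within 1 std: {within_1sd:.1%}, within 2 std: {within_2sd:.1%}")
```

For normally distributed data these two shares should land close to 68% and 95%.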
2️⃣ Uniform Distribution:

Every value within a given range is equally likely to occur (for a continuous variable, the probability density is constant across the range). The distribution is flat, with no peaks.
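The flat shape shows up if you split the range into equal bins and compare how much data lands in each. A small sketch, assuming an illustrative range of 0 to 10 and ten bins:

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.uniform(0.0, 10.0) for _ in range(100_000)]  # assumed range 0..10

# Ten unit-wide bins; min() guards the rare case where uniform() returns 10.0.
counts = Counter(min(int(x), 9) for x in samples)
shares = [counts[b] / len(samples) for b in range(10)]

for b, share in enumerate(shares):
    print(f"[{b}, {b + 1}): {share:.1%}")
```

Each bin should hold roughly 10% of the samples, with only small random wobble between bins.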
3️⃣ Binomial Distribution:

Describes the number of successes in a fixed number of independent trials, with each trial having the same probability of success. With n trials and success probability p, the peak sits near n × p, the expected number of successes.
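As a rough illustration, the sketch below computes the exact probabilities from the binomial formula and compares them with a simulation. The choice of 10 trials with a 30% success chance is an assumption made for the example.

```python
import math
import random
from collections import Counter

n, p = 10, 0.3  # illustrative: 10 trials, 30% chance of success each

# Exact probability of k successes: C(n, k) * p^k * (1 - p)^(n - k)
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
exact_peak = max(range(n + 1), key=lambda k: pmf[k])

# Simulation: repeat the 10-trial experiment many times and count successes.
rng = random.Random(0)
runs = 50_000
successes = [sum(rng.random() < p for _ in range(n)) for _ in range(runs)]
simulated_peak = Counter(successes).most_common(1)[0][0]
mean_successes = sum(successes) / runs

print(f"exact peak at k={exact_peak}, simulated peak at k={simulated_peak}")
print(f"average successes per run: {mean_successes:.2f}")
```

Both the formula and the simulation should put the most likely outcome at 3 successes, which matches n × p for these values.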
4️⃣ Poisson Distribution:

Used for count-based data, like the number of events happening in a fixed interval of time or space. The counts peak near the average rate: when that rate is low, the peak sits at small values with a tail to the right, and as the rate grows the shape becomes more symmetric.
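Python's standard library has no Poisson sampler, so this sketch assumes a small helper (the hypothetical `poisson_sample`, based on Knuth's multiplication method) and compares a low and a high average rate. The rates 1 and 20 are illustrative.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)  # fixed seed so the run is repeatable
summary = {}
for lam in (1.0, 20.0):  # illustrative low and high average rates
    counts = [poisson_sample(rng, lam) for _ in range(20_000)]
    summary[lam] = {
        "mean": statistics.fmean(counts),
        "variance": statistics.pvariance(counts),
        "share_zero": counts.count(0) / len(counts),
    }
    print(lam, summary[lam])
```

For Poisson data the mean and variance should both sit close to the rate. At a rate of 1 a large share of intervals has zero events, while at a rate of 20 zero counts are essentially absent and the counts spread more evenly around the rate.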
5️⃣ Exponential Distribution:

Describes the time between events in a process where events occur continuously and independently at a constant average rate. Short waits are the most likely, and longer waits become steadily rarer.
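Here is a minimal sketch of waiting times drawn with `random.expovariate`, assuming an illustrative rate of 2 events per unit of time. It also checks the "memoryless" property that follows from the constant rate: having already waited a while does not change the odds of waiting longer.

```python
import random
import statistics

rate = 2.0  # illustrative: on average 2 events per unit of time
rng = random.Random(0)  # fixed seed so the run is repeatable
gaps = [rng.expovariate(rate) for _ in range(100_000)]

mean_gap = statistics.fmean(gaps)  # expected to be close to 1 / rate

# Memorylessness: after already waiting t, the chance of waiting another t
# should match the chance of waiting t from a fresh start.
t = 0.5
p_fresh = sum(g > t for g in gaps) / len(gaps)
still_waiting = [g for g in gaps if g > t]
p_after_waiting = sum(g > 2 * t for g in still_waiting) / len(still_waiting)

print(f"mean gap: {mean_gap:.3f} (1 / rate = {1 / rate})")
print(f"P(wait > {t}) = {p_fresh:.3f}")
print(f"P(wait > {2 * t} | already waited {t}) = {p_after_waiting:.3f}")
```

The two probabilities should come out nearly equal, and the average gap should be close to 1 / rate.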
6️⃣ Skewed Distribution (Left or Right):

In a skewed distribution, the tail of the distribution is longer on one side. In a right-skewed distribution, the tail is longer on the right, while in a left-skewed distribution, it is longer on the left.
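One common way to put a number on this is the sample skewness (the third standardized moment). The sketch below assumes exponential data as a right-skewed example and simply mirrors it to get a left-skewed one; the `skewness` helper is written for this illustration.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment: > 0 for a right tail, < 0 for a left tail."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / std) ** 3 for x in values])


rng = random.Random(0)  # fixed seed so the run is repeatable
right = [rng.expovariate(1.0) for _ in range(50_000)]  # long tail on the right
left = [-x for x in right]  # mirrored copy: long tail on the left

skew_right = skewness(right)
skew_left = skewness(left)
right_mean, right_median = statistics.fmean(right), statistics.median(right)
left_mean, left_median = statistics.fmean(left), statistics.median(left)

print(f"right-skewed: skew={skew_right:.2f} mean={right_mean:.2f} median={right_median:.2f}")
print(f"left-skewed:  skew={skew_left:.2f} mean={left_mean:.2f} median={left_median:.2f}")
```

In this example the long tail pulls the mean away from the median: the mean ends up above the median for the right-skewed data and below it for the left-skewed data.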
7️⃣ Bimodal/Multimodal Distribution:

A distribution with two or more peaks. These peaks may vary in height and spread.
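Bimodal data often appears when two different groups are mixed together. This sketch assumes a 50/50 mix of two normal groups centred at 0 and 6 and looks for peaks in a simple histogram; the 5% threshold for ignoring noisy bins is an arbitrary choice for the example.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
# Illustrative 50/50 mixture of two normal groups centred at 0 and 6.
samples = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(6.0, 1.0)
    for _ in range(100_000)
]

# Histogram with unit-wide bins centred on the integers.
bins = Counter(round(x) for x in samples)
share = {b: c / len(samples) for b, c in bins.items()}

# A bin counts as a peak if it beats both neighbours and holds >= 5% of the data.
peaks = sorted(
    b
    for b, s in share.items()
    if s >= 0.05 and s > share.get(b - 1, 0.0) and s > share.get(b + 1, 0.0)
)
mean = sum(samples) / len(samples)

print(f"peaks at bins: {peaks}")
print(f"overall mean: {mean:.2f}")
```

Note that the overall mean falls in the valley between the two peaks, where very little data actually lives, so a single average can be misleading for this kind of data.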
8️⃣ Log-Normal Distribution:

This distribution applies when the logarithm of the variable is normally distributed. Values are strictly positive, and the distribution is skewed to the right.
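A short sketch of that relationship, using `random.lognormvariate` with assumed parameters (mu = 0, sigma = 0.5 for the underlying normal): taking the log of the samples should give back something that looks normal.

```python
import math
import random
import statistics

mu, sigma = 0.0, 0.5  # illustrative parameters of the underlying normal
rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.lognormvariate(mu, sigma) for _ in range(50_000)]
logs = [math.log(v) for v in values]

log_mean = statistics.fmean(logs)  # expected to be close to mu
log_std = statistics.pstdev(logs)  # expected to be close to sigma
raw_mean = statistics.fmean(values)
raw_median = statistics.median(values)

print(f"log scale: mean={log_mean:.3f} std={log_std:.3f}")
print(f"raw scale: mean={raw_mean:.3f} median={raw_median:.3f} min={min(values):.3f}")
```

On the raw scale every value is positive and the mean sits above the median, which is the right skew showing up. On the log scale the mean and standard deviation should be close to mu and sigma.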
These distributions are fundamental in statistics and data analysis, as they provide insights into the nature of the data and inform appropriate analytical strategies.
Soon I'll share when you may find each of them and how to handle them to optimise your model...

Follow me and subscribe to 💊 MLPills so you don't miss it 👇
mlpills.dev/subscribe/
You should also join our newsletter, DSBoost🚀

Every week we share:
🔹Interviews
🔹Podcast notes
🔹Learning resources
🔹Interesting collections of content

Subscribe for free👇👇
dsboost.dev
