
David Andrés 🤖📈🐍

@daansan_ml

Nov 27 • 12 tweets • 2 min read


There are several types of data distributions you might encounter in a dataset.

Here are some common ones 👇🧵

1️⃣ Normal Distribution (Bell Curve):

Characterized by a symmetric bell-shaped curve, where most of the data points cluster around the mean, with fewer and fewer appearing as you move away from the mean.
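A minimal sketch of that clustering, using only Python's standard library. The mean of 50, the standard deviation of 10, and the seed are arbitrary choices for illustration, not values from the thread.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
# Illustrative parameters: mean 50, standard deviation 10
samples = [rng.gauss(50.0, 10.0) for _ in range(20_000)]

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)

# Share of points within one standard deviation of the mean
# (roughly 68% for normally distributed data)
within_one_std = sum(abs(x - mean) <= std for x in samples) / len(samples)
print(f"mean={mean:.2f} std={std:.2f} within 1 std={within_one_std:.1%}")
```

About two thirds of the points land within one standard deviation of the mean, which is the "cluster around the mean" behaviour described above.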

2️⃣ Uniform Distribution:

Each value within a certain range has an equal probability of occurring. The distribution is flat, with no peaks.
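As a rough illustration, the sketch below draws values from a uniform range and counts how many fall into equal-width bins. The range 0 to 10, the five bins, and the seed are assumptions made for the example.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [rng.uniform(0.0, 10.0) for _ in range(20_000)]

# Count samples in five equal-width bins: [0,2), [2,4), ..., [8,10]
counts = [0] * 5
for x in samples:
    counts[min(int(x // 2), 4)] += 1  # clamp so x == 10.0 lands in the last bin

shares = [c / len(samples) for c in counts]
print([f"{s:.3f}" for s in shares])  # each share should be close to 0.2
```

A flat histogram like this one, with every bin holding about the same share, is what "no peaks" looks like in practice.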

3️⃣ Binomial Distribution:

Describes the number of successes in a fixed number of independent trials, each with the same probability of success. Its peak sits at the most probable number of successes, which lies close to the number of trials multiplied by the success probability.
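The peak can be located exactly from the probability mass function. This is a small sketch with illustrative values (10 trials, success probability 0.3) that are not taken from the thread.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The most probable number of successes is the k with the largest probability
mode = max(range(n + 1), key=lambda k: pmf[k])
print(f"most probable number of successes: {mode} (n * p = {n * p:.1f})")
```

With these values the most probable outcome is 3 successes, which matches n * p.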

4️⃣ Poisson Distribution:

Used for count-based data, like the number of events happening in a fixed interval of time or space. Its peak sits near the average event rate: when that rate is low, the peak is at small counts and the frequency falls off as counts increase, while for higher rates the shape becomes more symmetric.
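Where the peak falls depends on the average rate. The sketch below compares a low and a higher rate using the Poisson probability mass function; both rates (0.5 and 10.5) are assumptions chosen for illustration.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the average count is lam."""
    return lam**k * exp(-lam) / factorial(k)

def most_likely_count(lam: float, max_k: int = 60) -> int:
    """Count with the highest probability, searched over 0..max_k."""
    return max(range(max_k + 1), key=lambda k: poisson_pmf(k, lam))

low_rate_mode = most_likely_count(0.5)    # illustrative low rate
high_rate_mode = most_likely_count(10.5)  # illustrative higher rate
print(low_rate_mode, high_rate_mode)
```

For the low rate the most likely count is 0, so the frequencies only decrease. For the higher rate the peak moves up to around the rate itself.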

5️⃣ Exponential Distribution:

Describes the time between events in a process where events occur continuously and independently at a constant average rate.
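A short simulation sketch: waiting times between events are drawn with `random.expovariate`, and their average should come out near one divided by the rate. The rate of 4 events per unit of time and the seed are assumptions for the example.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable
rate = 4.0  # assumed average of 4 events per unit of time

# Waiting times between consecutive events
gaps = [rng.expovariate(rate) for _ in range(20_000)]

mean_gap = statistics.fmean(gaps)
median_gap = statistics.median(gaps)
print(f"mean gap={mean_gap:.3f} (expected about {1 / rate:.3f}), median gap={median_gap:.3f}")
```

The median gap is smaller than the mean gap: short waits are common and long waits are rare.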

6️⃣ Skewed Distribution (Left or Right):

In a skewed distribution, the tail of the distribution is longer on one side. In a right-skewed distribution, the tail is longer on the right, while in a left-skewed distribution, it is longer on the left.
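One common way to check the direction of the tail is the sample skewness, which is positive for a longer right tail and negative for a longer left tail. The sketch below is only an illustration: the right-skewed data is simulated from an exponential distribution, and the left-skewed data is its mirror image.

```python
import random
import statistics

def skewness(values):
    """Population skewness: third central moment divided by the cubed std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std**3)

rng = random.Random(3)  # fixed seed so the run is repeatable
right_skewed = [rng.expovariate(1.0) for _ in range(10_000)]  # long right tail
left_skewed = [-v for v in right_skewed]                      # mirrored: long left tail

right_skew = skewness(right_skewed)
left_skew = skewness(left_skewed)
print(f"right-skewed: {right_skew:.2f}, left-skewed: {left_skew:.2f}")
```

Comparing mean and median gives a similar hint: in the right-skewed sample the long tail pulls the mean above the median.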

7️⃣ Bimodal/Multimodal Distribution:

A distribution with two or more peaks. These peaks may vary in height and spread.
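Bimodal data often appears when two groups are mixed together. The sketch below simulates such a mixture and prints a text histogram. The group centers (-3 and +3), the equal mixing weights, and the seed are assumptions for illustration.

```python
import random
from collections import Counter

rng = random.Random(4)  # fixed seed so the run is repeatable

# Mixture of two normal groups centered at -3 and +3 (illustrative values)
samples = [
    rng.gauss(-3.0, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
    for _ in range(10_000)
]

# Histogram with unit-wide bins labeled by the nearest integer
hist = Counter(round(x) for x in samples)
for center in range(-6, 7):
    print(f"{center:+d} {'#' * (hist[center] // 50)}")
```

The printed bars rise around -3 and +3 and nearly vanish around 0, which is the two-peak shape.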

8️⃣ Log-Normal Distribution:

This distribution applies when the logarithm of the variable is normally distributed. The variable takes only positive values, and the distribution is skewed to the right.
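A quick sketch of that relationship: values are drawn with `random.lognormvariate`, and taking their logarithm should give back roughly normal data with the chosen parameters. The parameters (mu 1.0, sigma 0.5) and the seed are assumptions for the example.

```python
import math
import random
import statistics

rng = random.Random(5)  # fixed seed so the run is repeatable
mu, sigma = 1.0, 0.5    # illustrative parameters of the underlying normal

samples = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
logs = [math.log(x) for x in samples]

log_mean = statistics.fmean(logs)
log_std = statistics.pstdev(logs)
raw_mean = statistics.fmean(samples)
raw_median = statistics.median(samples)

print(f"log scale: mean={log_mean:.2f} std={log_std:.2f}")
print(f"raw scale: mean={raw_mean:.2f} median={raw_median:.2f}")
```

On the raw scale the mean sits above the median, a sign of the right skew. On the log scale the data is roughly symmetric, which is why a log transform is a common first step for this kind of variable.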

These distributions are fundamental in statistics and data analysis, as they provide insights into the nature of the data and inform appropriate analytical strategies.

Soon I'll share when you may find each of them and how to handle them to optimise your model...

Follow me and subscribe to 💊 MLPills not to miss it 👇
mlpills.dev/subscribe/

You should also join our newsletter, DSBoost🚀

Every week we share:
🔹Interviews
🔹Podcast notes
🔹Learning resources
🔹Interesting collections of content

Subscribe for free👇👇
dsboost.dev


More from @daansan_ml

David Andrés 🤖📈🐍

@daansan_ml

Nov 29

Generating or engineering features from Time Series data when using an ML approach involves extracting meaningful information that can be used by algorithms to understand patterns, make predictions, or identify trends.

Here are some feature engineering techniques 🧵👇

▶️ Time-Based Features:

Extracting features like hour, day, week, month, year, or season can be very informative, especially if the time series shows periodicity or seasonality.
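A minimal sketch of extracting such features from a timestamp with the standard `datetime` module. The function name `time_features` and the season mapping (northern-hemisphere meteorological seasons) are assumptions made for this example.

```python
from datetime import datetime

# Assumption: northern-hemisphere meteorological seasons
SEASONS = ("winter", "spring", "summer", "autumn")

def time_features(ts: datetime) -> dict:
    """Hypothetical helper: derive calendar features from one timestamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),   # Monday = 0, Sunday = 6
        "week": ts.isocalendar()[1],   # ISO week number
        "month": ts.month,
        "year": ts.year,
        "season": SEASONS[(ts.month % 12) // 3],  # Dec, Jan, Feb -> winter
    }

features = time_features(datetime(2023, 11, 27, 14, 30))
print(features)
```

Applying the helper to every row of a time series gives one column per feature.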

▶️ Lag Features:

These are values at previous time steps. For instance, the value of a time series at time t-1, t-2, etc., can be used as a feature to predict the value at time t.
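The idea can be sketched in plain Python. The helper name `make_lag_features` and the toy series are hypothetical, used only to illustrate the t-1 and t-2 lags mentioned above.

```python
def make_lag_features(series, lags=(1, 2)):
    """Return (features, targets); features[i] holds the lagged values for targets[i]."""
    max_lag = max(lags)
    features, targets = [], []
    # Start at max_lag so every requested lag exists for each target
    for t in range(max_lag, len(series)):
        features.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return features, targets

series = [10, 12, 13, 15, 18]  # toy data
X, y = make_lag_features(series)
for row, target in zip(X, y):
    print(f"lags (t-1, t-2) = {row} -> value at t = {target}")
```

The first `max(lags)` observations are dropped because they have no complete history.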

Read 16 tweets

David Andrés 🤖📈🐍

@daansan_ml

Nov 28

Here you have the top 10 common errors Junior (and not Junior) Data Scientists make.

Check this list before training your model to make sure you don't make these mistakes! 👇 🧵

1️⃣ Data Leakage in Training and Evaluation:

A significant mistake is letting information from the evaluation set leak into training. This can give a false impression of model performance, especially in cases involving temporal elements.

2️⃣ Inadequate Handling of Temporal Data:

Junior data scientists often struggle with handling temporal data, which leads to data leakage. Participating in ML competitions (e.g., Kaggle) is a suggested way to practice this.
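One common way to avoid this kind of leakage with temporal data is to split by time instead of at random, so the model is never evaluated on data older than what it was trained on. The sketch below is an illustration under that assumption: the function name `chronological_split` and the `(timestamp, value)` row layout are hypothetical.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows so every test row comes after every training row."""
    ordered = sorted(rows, key=lambda row: row[0])  # row = (timestamp, value)
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

# Toy data: ten days of observations as (day, value) pairs
rows = [(day, day * 1.5) for day in range(1, 11)]
train, test = chronological_split(rows)

print("train days:", [day for day, _ in train])
print("test days: ", [day for day, _ in test])
```

A random shuffle before splitting would mix future rows into the training set, which is the situation described above.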

Read 12 tweets

David Andrés 🤖📈🐍

@daansan_ml

Nov 25

Exploratory Data Analysis (EDA) is a process used for investigating your data to discover patterns, anomalies, relationships, or trends using statistical summaries and visual methods.

Let's find out more 🧵👇

It is essential for understanding the data's underlying structure and characteristics before applying more formal statistical or Machine Learning methods.

Some key points that we should normally check are👇

▶️Distribution of Data:

Assessing the distribution of data (e.g., normal, skewed) using histograms, box plots, and summary statistics helps understand the central tendency and variability.
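A first pass at those summary statistics can be sketched with the standard `statistics` module. The helper name `summarize`, the toy data, and the 1.5 × IQR outlier rule (the convention used by box plots) are assumptions for this example.

```python
import statistics

def summarize(values):
    """Hypothetical helper: central tendency, spread, and box-plot style outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # box-plot whisker limits
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < low or v > high],
    }

data = [2, 3, 3, 4, 5, 5, 6, 7, 9, 30]  # toy data with one large value
summary = summarize(data)
print(summary)
```

A mean well above the median, together with a flagged high value, hints at a right-skewed distribution worth inspecting with a histogram or box plot.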

Read 14 tweets

David Andrés 🤖📈🐍

@daansan_ml

Nov 19

Cleaning your data before building your Time Series model is crucial.

Learn how to do it, step by step 🧵👇

1️⃣ Handle missing values
2️⃣ Remove trend

👇
medium.datadriveninvestor.com/effective-stra…

3️⃣ Remove seasonality
4️⃣ Check for stationarity and make it stationary if necessary
5️⃣ Normalize the data

👇
medium.datadriveninvestor.com/practical-appr…
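Some of these steps can be sketched in plain Python. The techniques below are assumptions chosen for illustration, not necessarily the ones from the linked articles: forward fill for missing values, first differencing for the trend, and z-score scaling for normalization.

```python
import statistics

def forward_fill(series):
    """Replace each missing value (None) with the last observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # a leading None stays None: nothing to carry forward
    return filled

def difference(series, lag=1):
    """Subtract the value `lag` steps earlier; lag=1 removes a linear trend."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def zscore(series):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    return [(value - mean) / std for value in series]

raw = [10.0, 12.0, None, 16.0, 18.0, 21.0]  # toy series with one gap
filled = forward_fill(raw)
detrended = difference(filled)
normalized = zscore(detrended)
print(filled, detrended, [round(v, 2) for v in normalized])
```

The same `difference` helper with `lag` set to the season length is one simple way to remove seasonality. A formal stationarity check would need a statistical test that is outside this sketch.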

Read 6 tweets

David Andrés 🤖📈🐍

@daansan_ml

Nov 18

⚠️You need to be careful with Time Bias!

Learn what it is and why it happens.

🧵 👇

Time bias is a type of bias that can occur in time series analysis and forecasting.

This happens when the historical data used to build a time series model or forecast does not accurately represent the current or future conditions.

Time bias can arise for several reasons:

1️⃣ Changes in trends or patterns

2️⃣ Data contains outliers or anomalous events

Read 11 tweets

David Andrés 🤖📈🐍

@daansan_ml

Nov 16

Linear Regression is a fundamental algorithm in supervised Machine Learning used for predictive modeling.

Learn more about it here 🧵 👇

It involves analyzing the relationship between an independent variable, x, and a dependent variable, y.

In simple terms, it tries to draw a straight line through data to predict future values.

The main goal of Linear Regression is to find the most accurate line that represents the relationship between x and y. This line is used to predict the value of y based on x. Basically, it consists of drawing the best possible line through a set of points on a graph.
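"Best possible line" is usually taken to mean the least-squares line, the one that minimizes the squared vertical distances to the points. Under that assumption, the slope and intercept have a closed form. The helper name `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]  # toy data
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)

prediction = slope * 5.0 + intercept  # predict y for a new x
print(f"y = {slope:.2f} * x + {intercept:.2f}, prediction at x=5: {prediction:.2f}")
```

Predicting a new value is then a matter of plugging x into the fitted line.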

Read 10 tweets
