Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
May 23 • 13 tweets • 3 min read
Bayes' Theorem is a fundamental concept in data science.
But it took me 2 years to understand its importance.
In 2 minutes, I'll share my best findings over the last 2 years exploring Bayesian Statistics. Let's go. 1. Background:
"An Essay towards solving a Problem in the Doctrine of Chances" was published in 1763, two years after Bayes' death. In this essay, Bayes addressed the problem of inverse probability, which is the basis of what is now known as Bayesian probability.
May 22 • 12 tweets • 4 min read
Top 7 most important statistical analysis concepts that have helped me as a Data Scientist.
This is a complete 7-step beginner ROADMAP for learning stats for data science. Let's go:
Step 1: Learn These Descriptive Statistics
Mean, median, mode, variance, standard deviation. Used to summarize data and spot variability. These are key for any data scientist to understand what’s in front of them in their data sets.
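Here's a minimal sketch of those five summaries using Python's built-in `statistics` module. The `data` list is made up for illustration, and the population versions of variance and standard deviation are assumed.

```python
import statistics

# Hypothetical sample, made up purely for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # arithmetic average
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent value
variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation

print(mean, median, mode, variance, std_dev)
```

Swap in `statistics.variance` and `statistics.stdev` if your data is a sample rather than the whole population.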
May 22 • 12 tweets • 3 min read
Type 1 and Type 2 errors are confusing. In 3 minutes, I'll demolish your confusion. Let's dive in. 🧵 1. Type 1 Error (False Positive):
This occurs when the pregnancy test tells Tom, the man, that he is pregnant. Obviously, Tom cannot be pregnant, so this result is a false alarm. In statistical terms, it's detecting an effect (in this case, pregnancy) when it actually doesn't exist.
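A tiny sketch of the idea in Python. The `classify_result` helper is hypothetical (it is not from any library); it just labels a test outcome by comparing what the test says against what is actually true.

```python
def classify_result(actually_true: bool, test_positive: bool) -> str:
    """Label a test outcome by comparing the test result with reality."""
    if test_positive and not actually_true:
        return "Type 1 error (false positive)"
    if not test_positive and actually_true:
        return "Type 2 error (false negative)"
    return "correct result"

# Tom is not pregnant, but the test says he is
tom = classify_result(actually_true=False, test_positive=True)
print(tom)
```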
May 18 • 9 tweets • 4 min read
Stop doing Customer Segmentation with plain vanilla scikit-learn.
Add these 7 Python libraries to your RFM, clustering, and customer segmentation projects: 1. Data preparation
- load data with pandas
- impute/mask with Feature-engine
6 statistical methods that can be used for A/B Testing (and when to use them).
A/B Testing is a staple of data science and data analyst interviews.
And it's the number 1 technique companies use to improve customer revenue.
So here are 6 of the most common stat methods used in A/B testing.
May 15 • 15 tweets • 4 min read
Understanding P-Values is essential for improving regression models.
In 2 minutes, I'll crush your confusion.
Let's go: 1. The p-value:
A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
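To make that concrete, here's a minimal sketch with a coin-flip example instead of a regression (an assumption made to keep it short). The null hypothesis is "the coin is fair", and the p-value is the probability of seeing a result at least as extreme as the one observed if the null were true. The `binomial_p_value` helper and the 60-out-of-100 numbers are hypothetical.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= successes) when the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 60 heads in 100 flips of a supposedly fair coin
p_value = binomial_p_value(60, 100)
print(f"p-value: {p_value:.4f}")  # about 0.028, so below the usual 0.05 cutoff
```

The smaller the p-value, the stronger the evidence against the null hypothesis.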
May 14 • 9 tweets • 3 min read
Tableau is about to die.
Introducing PandasAI, a free alternative for fast Business Intelligence.
Let's dive in: 1. PandasAI
PandasAI transforms your natural language questions into actionable insights quickly, smartly, and effortlessly.
May 13 • 10 tweets • 3 min read
🚨 BREAKING: Microsoft launches a free Python library that converts ANY document to Markdown
Introducing MarkItDown. Let me explain. 🧵 1. Document Parsing Pipelines
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
May 11 • 14 tweets • 4 min read
The 10 types of clustering that all data scientists need to know.
Let's dive in: 1. K-Means Clustering:
This is a centroid-based algorithm, where the goal is to minimize the sum of squared distances between points and their respective cluster centroid.
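A bare-bones sketch of the K-means loop in plain Python: assign each point to its nearest centroid (by squared distance), then move each centroid to the mean of its points. The `kmeans` function, the points, and the starting centroids are all made up for illustration. In practice you would likely reach for a library implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            sq_dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[sq_dists.index(min(sq_dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2D points forming two obvious groups
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```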
May 9 • 11 tweets • 3 min read
RIP Tableau and PowerBI.
Enter Julius AI.
This is what Julius can do: 1. The $10 Billion problem with Tableau and PowerBI?
Dashboards are static.
But businesses are dynamic.
That's why I'm so excited about this new tool: Julius AI
May 9 • 13 tweets • 5 min read
Principal Component Analysis (PCA) is the gold standard in dimensionality reduction.
But almost every beginner struggles to understand how it works (and why to use it).
In 3 minutes, I'll demolish your confusion: 1. What is PCA?
PCA is a statistical technique used in data analysis, mainly for dimensionality reduction. It's beneficial when dealing with large datasets with many variables, and it helps simplify the data's complexity while retaining as much variability as possible.
May 8 • 11 tweets • 4 min read
K-means is one of the most powerful algorithms for data scientists.
But it's confusing for beginners. Let's fix that: 1. What is K-means?
K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
May 6 • 15 tweets • 5 min read
🚨BREAKING: OpenAI just dropped a practical guide to building agents
There's a secret.
It unlocks agentic workflows for data analysts, data scientists, and data engineers.
This is how: 🧵
Here are my key takeaways by section of the report specifically addressing how data professionals can use the report.
1: What Is an Agent?
Agents are LLM‑powered systems that autonomously execute multi‑step workflows on a user’s behalf, going beyond single‑turn prompts to manage tool calls, conditional logic, and goal‑based loops.
May 4 • 8 tweets • 3 min read
90% of data scientists struggle with time series.
But all it takes is mastering 1 technique: time series decomposition.
Here's why: 1. What is Time Series Decomposition?
A statistical method used to deconstruct a time series into several components, each representing underlying patterns in the data.
There are 3 key components: Trend, Seasonal, and Residual. Let's break them down.
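Here's a rough sketch of a classical additive decomposition (series = trend + seasonal + residual) in plain Python. It assumes an additive model and a known seasonal period. The `decompose_additive` function and the toy series are hypothetical, and the trend is estimated with a centered moving average, so it is undefined at the edges.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model)."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (centered moving average)
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    # Seasonal: average the detrended values at each position in the cycle
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the seasonal effects around zero
    seasonal = [means[t % period] - offset for t in range(n)]
    # Residual: whatever trend and seasonal do not explain
    residual = [
        series[t] - trend[t] - seasonal[t] if trend[t] is not None else None
        for t in range(n)
    ]
    return trend, seasonal, residual

# Toy series: a linear trend plus a repeating 4-step seasonal pattern
pattern = [2, -1, -2, 1]
series = [10 + t + pattern[t % 4] for t in range(12)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

On this noise-free toy series the residual is essentially zero, which is a handy sanity check before trying real data.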
May 3 • 9 tweets • 3 min read
A Python Library for Time Series by Salesforce.
Let me introduce you to Merlion. 1. What is Merlion?
Merlion is a Python library for time series intelligence.
It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance.
May 2 • 9 tweets • 3 min read
A Python Library for Time Series using Hidden Markov Models.
Let me introduce you to hmmlearn. 1. Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model that describes a sequence of observable events where the underlying process generating those events is not directly visible. There are "hidden states" that influence the observed data, but you can only see the results of those states, not the states themselves.
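To see the "hidden states" idea in action, here's a small sketch of the forward algorithm in plain Python. It computes how likely a sequence of observations is under an HMM, summing over every possible hidden-state path. This is not the hmmlearn API. The `forward` function, the weather states, and all the probabilities are made up for illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: probability of the observations under the HMM."""
    # alpha[s] = probability of the observations so far, ending in hidden state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model: hidden weather, observed activity
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

likelihood = forward(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(likelihood)
```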
Apr 27 • 13 tweets • 4 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Apr 26 • 13 tweets • 4 min read
Forecasting time series is what made me stand out as a data scientist.
But it took me 1 year to master ARIMA.
In 1 minute, I'll evaporate your confusion. Let's go. 1. Autoregressive Forecast Models
ARIMA and SARIMA are both statistical models used for forecasting time series data, where the goal is to predict future points in the series. They implement a concept called autoregression.
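Autoregression just means predicting the next value from previous values. Here's a minimal sketch of the simplest case, an AR(1) model y_t = c + phi * y_(t-1), fitted with ordinary least squares in plain Python. The `fit_ar1` helper and the toy series are hypothetical, and this is only the "AR" piece, not a full ARIMA.

```python
def fit_ar1(series):
    """Fit y_t = c + phi * y_(t-1) by ordinary least squares."""
    x, y = series[:-1], series[1:]  # lagged values and the values they predict
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi

# Toy series generated exactly by y_t = 2 + 0.5 * y_(t-1), no noise
series = [1.0]
for _ in range(7):
    series.append(2 + 0.5 * series[-1])

c, phi = fit_ar1(series)
next_value = c + phi * series[-1]  # one-step-ahead forecast
print(c, phi, next_value)
```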
Apr 25 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know: 1. What is DocETL?
It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
It offers:
- An interactive UI playground
- A Python package for running production pipelines
Apr 22 • 13 tweets • 4 min read
Probability distributions are critical to data science and business decision-making.
In 3 minutes, I'll unpack 3 years of studying probability distributions (and share how I applied it to a $15,000,000 business project).
Let's go! 🧵 1. Probability Distribution Fundamentals:
In statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. It's a way to describe how likely different outcomes are to occur. There are two main types of probability distributions: Discrete and Continuous.
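A quick sketch of the difference between a discrete and a continuous distribution, using only Python's standard library. A discrete distribution (here a binomial) puts probability on individual outcomes, while a continuous one (here a standard normal) assigns probability to ranges. The `binomial_pmf` helper and the chosen parameters are illustrative assumptions.

```python
from math import comb
from statistics import NormalDist

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Discrete: probabilities of 0..10 heads in 10 fair coin flips sum to 1
pmf = [binomial_pmf(k, 10, 0.5) for k in range(11)]
total = sum(pmf)

# Continuous: probability that a standard normal value lands within 1 standard deviation
z = NormalDist(mu=0, sigma=1)
within_one_sd = z.cdf(1) - z.cdf(-1)

print(total, within_one_sd)
```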
Apr 21 • 6 tweets • 3 min read
RIP Data Scientists.
The Generative AI Data Scientist is NOW what companies want.
This is actually good news. Let me explain:
Companies are sitting on mountains of unstructured data.
PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts
This is useful data. But it's unusable in its existing form.