🔥 Matt Dancho (Business Science) 🔥
Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists & AI Engineers ($200,000+ career) 👇
9 subscribers
Aug 9 • 9 tweets • 4 min read
Stop doing Customer Segmentation with plain vanilla Scikit Learn.

Add these 7 Python libraries to your RFM, clustering, and customer segmentation projects:

1. Data preparation

- load data with pandas
- impute/mask with Feature-engine

Website: feature-engine.trainindata.com/en/latest/inde…
Aug 9 • 6 tweets • 2 min read
RIP Data Scientists.

The Generative AI Data Scientist is NOW what companies want.

This is actually good news. Let me explain:

Companies are sitting on mountains of unstructured data.

PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts

This is useful data. But it's unusable in its existing form.
Aug 8 • 11 tweets • 4 min read
RIP Tableau and PowerBI.

Enter Julius AI.

This is what Julius can do:

1. The $10 Billion problem with Tableau and PowerBI?

Dashboards are static.

But businesses are dynamic.

That's why I'm so excited about this new tool: Julius AI
Aug 8 • 12 tweets • 4 min read
Boxplots are one of the most useful tools in my Data Science arsenal.

In 6 minutes, I'll eviscerate your confusion.

Let's dive in.

1. What is a boxplot?

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
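To make the five-number summary concrete, here is a minimal sketch using only the Python standard library. The helper name `five_number_summary` and the sample values are made up for illustration, and the "inclusive" quartile method is just one of several conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles; "inclusive" is one common convention.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values for illustration
summary = five_number_summary(sample)
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```

The box in a boxplot spans Q1 to Q3, with a line at the median.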
Aug 7 • 11 tweets • 3 min read
The 3 types of machine learning (that every data scientist should know).

In 3 minutes I'll eviscerate your confusion. Let's go: 🧵

1. The 3 Fundamental Types of Machine Learning:

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning.

Let's break them down:
Aug 7 • 13 tweets • 4 min read
K-means is an essential algorithm for Data Science.

But it's confusing for beginners.

Let me demolish your confusion:

1. K-Means

K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
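For a feel of how the clustering works, here is a rough sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the hand-picked starting centroids are all assumptions for illustration; in practice a library such as scikit-learn handles initialization and convergence for you.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = new_centroids
    return centroids, labels

# Two made-up groups of customers in a 2-D feature space.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

Each label is the index of the cluster a point ended up in, which is what you would attach to each customer in a segmentation project.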
Aug 5 • 12 tweets • 4 min read
Principal Component Analysis (PCA) is the gold standard in dimensionality reduction.

But almost every beginner struggles to understand how it works (and why to use it).

In 3 minutes, I'll demolish your confusion:

1. What is PCA?

PCA is a statistical technique used in data analysis, mainly for dimensionality reduction. It's beneficial when dealing with large datasets with many variables, and it helps simplify the data's complexity while retaining as much variability as possible.
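As a small illustration of "retaining as much variability as possible", the sketch below reduces 2-D data to 1-D. It assumes only two variables so the covariance matrix is 2x2 and its eigen-decomposition has a closed form; the function name `pca_2d` and the sample points are made up. Real projects would typically use a linear algebra library instead.

```python
import math

def pca_2d(points):
    """PCA for 2-D data using the closed-form eigen-decomposition of a 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: variance along each principal axis.
    half_trace = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    var1, var2 = half_trace + spread, half_trace - spread
    # Direction of the first principal component.
    theta = 0.5 * math.atan2(2 * b, a - c)
    axis = (math.cos(theta), math.sin(theta))
    # Project each centered point onto that direction (2-D -> 1-D).
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, var1 / (var1 + var2), scores

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]  # made-up data
axis, explained, scores = pca_2d(points)
print(axis, explained)
```

Here the two variables move together, so a single component keeps nearly all of the variance and the second dimension can be dropped with little loss.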
Aug 4 • 9 tweets • 3 min read
🚨BREAKING: New Python library for Bayesian Marketing Mix Modeling and Customer Lifetime Value

It's called PyMC-Marketing.

This is what you need to know: 🧵

1. What is PyMC-Marketing?

PyMC-Marketing is a state-of-the-art Bayesian modeling library that's designed for Marketing Mix Modeling (MMM) and Customer Lifetime Value (CLV) prediction.
Aug 2 • 9 tweets • 3 min read
Random forest was wild to me.

In 3 minutes, I'll share 3 weeks of research on Random Forest.

Let's go:

1. What is a Random Forest?

Random Forest builds multiple decision trees and combines their predictions to get a more accurate and stable result. Each tree in the forest makes a prediction, and the prediction with the most votes becomes the final result.
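The voting step can be sketched in a few lines. This covers only the aggregation for a classification problem, not the tree building, and the function name `majority_vote` and the example predictions are made up for illustration.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one customer.
votes = ["churn", "stay", "churn", "churn", "stay"]
final = majority_vote(votes)
print(final)  # churn
```

For regression problems the same idea applies, except the tree outputs are averaged rather than counted.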
Jul 30 • 6 tweets • 2 min read
RIP Data Scientists.

The Generative AI Data Scientist is NOW what companies want.

This is actually good news. Let me explain:

Companies are sitting on mountains of unstructured data.

PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts

This is useful data. But it's unusable in its existing form.
Jul 29 • 13 tweets • 3 min read
Bayes' Theorem is a fundamental concept in data science.

But it took me 2 years to understand its importance.

In 2 minutes, I'll share my best findings over the last 2 years exploring Bayesian Statistics. Let's go.

1. Background:

"An Essay towards solving a Problem in the Doctrine of Chances," was published in 1763, two years after Bayes' death. In this essay, Bayes addressed the problem of inverse probability, which is the basis of what is now known as Bayesian probability.
Jul 28 • 15 tweets • 4 min read
Understanding P-Values is essential for improving regression models.

In 2 minutes, I'll crush your confusion.

Let's go:

1. The p-value:

A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
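One way to see what that measure looks like in code is a two-sided test on a z-statistic. This sketch assumes the test statistic is approximately standard normal under the null hypothesis (regression output usually uses a t-distribution instead), and the function name `two_sided_p_value` is made up for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(1.96)
print(round(p, 3))
```

A statistic of about 1.96 gives a p-value close to 0.05, and larger statistics give smaller p-values, meaning stronger evidence against the null hypothesis.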
Jul 28 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI

Introducing DocETL.

Here's what you need to know:

1. What is DocETL?

It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.

It offers:

- An interactive UI playground
- A Python package for running production pipelines
Jul 27 • 10 tweets • 3 min read
🚨 BREAKING: Microsoft launches a free Python library that converts ANY document to Markdown

Introducing MarkItDown. Let me explain. 🧵

1. Document Parsing Pipelines

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
Jul 27 • 10 tweets • 4 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.

In 2 minutes, I'll crush your confusion.

Let's dive in:

1. Logistic regression is a statistical method for modeling a binary outcome (one with only two possible values) from one or more independent variables. This is commonly called a binary classification problem.
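Here is a minimal sketch of that idea with a single independent variable, fitted by plain gradient descent on the log loss. The function names, the learning rate, the epoch count, and the study-hours data are all assumptions chosen for illustration, not a recommended training setup.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every observation under the current weights.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Made-up data: hours studied vs. whether the student passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(hours, passed)

def predict(x):
    """Predicted probability of the positive class."""
    return sigmoid(w * x + b)

print(predict(1.0), predict(4.0))
```

Classifying then comes down to a threshold: predict the positive class when the probability is above 0.5, for example.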
Jul 26 • 13 tweets • 4 min read
Understanding probability is essential in data science.

In 4 minutes, I'll demolish your confusion.

Let's go!

1. Statistical Distributions:

There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Jul 26 • 13 tweets • 5 min read
Principal Component Analysis (PCA) is the gold standard in dimensionality reduction.

But almost every beginner struggles to understand how it works (and why to use it).

In 3 minutes, I'll demolish your confusion:

1. What is PCA?

PCA is a statistical technique used in data analysis, mainly for dimensionality reduction. It's beneficial when dealing with large datasets with many variables, and it helps simplify the data's complexity while retaining as much variability as possible.
Jul 18 • 13 tweets • 3 min read
The concept that helped me go from bad models to good models: Bias and Variance.

In 4 minutes, I'll share 4 years of experience in managing bias and variance in my machine learning models. Let's go. 🧵

1. Generalization:

Bias and variance control your model's ability to generalize to new, unseen data, not just the data it was trained on. The goal in machine learning is to build models that generalize well. To do so, I manage bias and variance.
Jul 17 • 13 tweets • 4 min read
K-means is an essential algorithm for Data Science.

But it's confusing for beginners.

Let me demolish your confusion:

1. K-Means

K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
Jul 16 • 8 tweets • 3 min read
Tableau is about to die.

Introducing PandasAI, a free alternative for fast Business Intelligence.

Let's dive in:

1. PandasAI

PandaAI transforms your natural language questions into actionable insights — fast, smartly, and effortlessly.
Jul 14 • 7 tweets • 3 min read
85% of data scientists do customer segmentation the WRONG WAY.

AI Agents fix this: here's how I made an AI that clusters customers & recommends marketing actions (and you can too). 🧵

Traditional K-Means finds clusters, but that's just the start.

The real challenge?

Interpreting clusters for business value.