🔥 Matt Dancho (Business Science) 🔥
Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
Feb 18 10 tweets 2 min read
Outliers have led me to 100s of business insights. But first I had to find them.

In 3 minutes let me kill your confusion. Let's dive into outliers: Image 1. Outliers

Outliers or anomalies in a dataset are data points that differ significantly from other observations. They are often important insights signifying key events.
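One common way to flag points that "differ significantly" is the interquartile range (IQR) rule. The thread preview doesn't say which method it uses, so treat this as a minimal sketch under that assumption; the `daily_sales` numbers and the `k=1.5` cutoff are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily sales with one unusual day.
daily_sales = [102, 98, 105, 101, 99, 103, 100, 97, 104, 250]
outliers = iqr_outliers(daily_sales)
print(outliers)  # [250]
```

Whether 250 is a data error or a key event (a promotion, a stock-out elsewhere) is the business question the rule can't answer for you.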
Feb 17 9 tweets 3 min read
Tableau and PowerBI are getting killed by free AI tools.

Case in Point: Microsoft's AI Data Formulator.

100% free in Python. Let's dive in: Image 1. Data Formulator: Create Rich Visualizations with AI

Data Formulator is an AI-powered tool for data analysts to iteratively create rich visualizations.

Data Formulator is an application from Microsoft Research that uses large language models to transform data, expediting the practice of data visualization.
Feb 16 8 tweets 2 min read
Google just dropped a new Generative AI Python library for SQL Databases.

Introducing Google GenAI Toolbox.

This is what you need to know: Image 1. Meet the Google GenAI Toolbox

An open-source server designed to simplify building Gen AI tools for your databases. It streamlines development, letting you integrate powerful data tools with just a few lines of code.
Feb 16 8 tweets 2 min read
Move over Tableau and PowerBI.

There's a new Python library that automates Business Intelligence with AI using Text2SQL.

Let me introduce you to WrenAI: Image 1. Meet WrenAI

WrenAI is the future of Generative Business Intelligence (GenBI). It transforms complex data into intuitive insights through a conversational, no-code interface.
Feb 16 13 tweets 2 min read
Forecasting time series is what made me stand out as a data scientist.

But it took me 1 year to master ARIMA.

In 1 minute, I'll evaporate your confusion. Let's go. Image 1. Autoregressive Forecast Models

ARIMA and SARIMA are both statistical models used for forecasting time series data, where the goal is to predict future points in the series. They implement a concept called Autoregression.
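Autoregression means predicting the next value from previous values of the same series. Here's a rough sketch of the simplest case, an AR(1) model fit by ordinary least squares. It covers only the "AR" piece, not the differencing or moving-average parts of ARIMA, and the toy series and function names are assumptions for illustration.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]."""
    x, y = series[:-1], series[1:]  # lagged values and next values
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(last_value, steps, c, phi):
    """Roll the fitted equation forward, feeding each prediction back in."""
    preds = []
    for _ in range(steps):
        last_value = c + phi * last_value
        preds.append(last_value)
    return preds


# Toy series generated exactly by y[t] = 2 + 0.5 * y[t-1].
series = [10.0]
for _ in range(9):
    series.append(2 + 0.5 * series[-1])

c, phi = fit_ar1(series)
preds = forecast_ar1(series[-1], 3, c, phi)
print(round(c, 6), round(phi, 6))
```

In practice you'd reach for a library implementation rather than hand-rolling this, but the regression-on-lagged-values idea is the same.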
Feb 15 11 tweets 2 min read
The most overlooked skill by data scientists?

Time Series Analysis.

In 3 minutes, I'll demolish your confusion. Let's go: 🧵 Image 1. Time Series Analysis:

Time series analysis is a statistical technique that deals with time-ordered data points. It's commonly used to analyze and interpret trends, patterns, and relationships within data that is recorded over time (e.g. with timestamps).
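A simple way to see a trend in time-ordered data is a moving average. This is only one of many techniques and the preview doesn't name a specific one, so take it as an illustrative sketch; the `monthly_orders` values are made up.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive points, in time order."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical monthly order counts, oldest first.
monthly_orders = [10, 12, 11, 15, 14, 18]
trend = moving_average(monthly_orders, window=3)
print(trend)
```

The smoothed values rise steadily even though the raw series dips twice, which is the kind of pattern that is easy to miss in the raw numbers.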
Feb 14 11 tweets 3 min read
For years, I was hyperparameter tuning XGBoost models wrong.

In 3 minutes, I'll share one secret that took me 3 years to figure out.

When I did, it cut my training time 10X (and gave an instant 25% boost in model accuracy).

Let's dive in. 🧵 Image 1. What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm, especially for structured (tabular) data. Its claim to fame is winning tons of Kaggle competitions. But more importantly, it's fast, accurate, and easy to use. It's also easy to screw up.
Feb 13 11 tweets 3 min read
When I was first exposed to the Confusion Matrix, I was lost.

There was a HUGE mistake I was making with False Negatives that took me 5 years to fix.

I'll teach you in 5 minutes. Let's dive in. 🧵 Image 1. The Confusion Matrix

A confusion matrix is a tool often used in machine learning to visualize the performance of a classification model. It's a table that allows you to compare the model's predictions against the actual values.
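To make the table concrete, here's a minimal sketch that tallies the four cells for a binary classifier. The labels are invented and 1 is assumed to be the positive class.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair for a 0/1 classifier."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # actual 1, predicted 1
        "FN": counts[(1, 0)],  # actual 1, predicted 0 (missed positive)
        "FP": counts[(0, 1)],  # actual 0, predicted 1 (false alarm)
        "TN": counts[(0, 0)],  # actual 0, predicted 0
    }


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```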
Feb 11 14 tweets 3 min read
Understanding P-Values is essential for improving regression models.

In 2 minutes, I'll crush your confusion. Image 1. The p-value:

A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
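As a worked example (a coin-flip test rather than a regression coefficient, to keep the arithmetic visible): suppose the null hypothesis is that a coin is fair and you see 9 heads in 10 flips. The one-sided p-value is the probability of a result at least that extreme if the null were true. The function name and scenario are assumptions for illustration.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when the null hypothesis holds."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_value = binomial_p_value(successes=9, trials=10)
print(round(p_value, 4))  # 0.0107
```

A small value like this means the data would be surprising under the null, which is evidence against it; it is not the probability that the null is true.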
Feb 11 12 tweets 3 min read
Probability distributions are critical to data science and business decision-making.

In 3 minutes, I'll unpack 3 years of studying probability distributions (and share how I applied it to a $15,000,000 business project).

Let's go! 🧵 Image 1. Probability Distribution Fundamentals:

In statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. It's a way to describe how likely different outcomes will occur. There are two main types of probability distributions: Discrete and Continuous.
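A quick sketch of the difference between a distribution over countable outcomes and one over a continuous range: a binomial gives a probability for each count, while a normal gives probabilities for intervals. The parameters below are arbitrary choices for illustration.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Countable outcomes: number of heads in 3 fair coin flips.
pmf = [binomial_pmf(k, 3, 0.5) for k in range(4)]
print(pmf)  # [0.125, 0.375, 0.375, 0.125]

# Continuous range: probability a standard normal value lands within 1 SD.
normal = NormalDist(mu=0, sigma=1)
within_one_sd = normal.cdf(1) - normal.cdf(-1)
print(round(within_one_sd, 4))  # 0.6827
```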
Feb 10 9 tweets 2 min read
The 3 types of machine learning (that every data scientist should know).

In 3 minutes I'll eviscerate your confusion. Let's go: 🧵 Image 1. The 3 Fundamental Types of Machine Learning:

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning.

Let's break them down:
Feb 9 10 tweets 3 min read
R-squared is one of the most commonly used metrics to measure performance.

But it took me 2 years to figure out mistakes that were killing my regression models.

In 2 minutes, I'll share how I fixed 2 years of mistakes (and made 50% more accurate models than my peers). Let's go: Image 1. R-squared (R²):

R² is a statistical measure for regression models of how well the model replicates the observed outcomes, based on the proportion of total variation in outcomes that the model explains.
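The "proportion of variation explained" definition translates directly into one minus the ratio of residual to total sum of squares. A minimal sketch, with made-up actual and predicted values:

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
r2 = r_squared(actual, predicted)
print(round(r2, 3))  # 0.975
```

Here the total variation is 20 and the unexplained part is 0.5, so the model accounts for 97.5% of it.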
Feb 8 12 tweets 3 min read
The concept that helped me go from bad models to good models: Bias and Variance.

In 4 minutes, I'll share 4 years of experience in managing bias and variance in my machine learning models. Let's go. 🧵 Image 1. Generalization:

Bias and variance control your model's ability to generalize on new, unseen data, not just the data it was trained on. The goal in machine learning is to build models that generalize well. To do so, I manage bias and variance.
Feb 7 11 tweets 3 min read
Type 1 and Type 2 errors are confusing. In 3 minutes, I'll demolish your confusion. Let's dive in. 🧵 Image 1. Type 1 Error (False Positive):

This occurs when the pregnancy test tells Tom, the man, that he is pregnant. Obviously, Tom cannot be pregnant, so this result is a false alarm. In statistical terms, it's detecting an effect (in this case, pregnancy) when it actually doesn't exist.
Feb 6 12 tweets 3 min read
Bayes' Theorem is a fundamental concept in data science.

But it took me 2 years to understand its importance.

In 2 minutes, I'll share my best findings over the last 2 years exploring Bayesian Statistics. Let's go. Image 1. Background:

Bayes' "An Essay towards solving a Problem in the Doctrine of Chances" was published in 1763, two years after his death. In this essay, Bayes addressed the problem of inverse probability, which is the basis of what is now known as Bayesian probability.
Feb 5 11 tweets 3 min read
Residuals are the key to improving model performance.

But it took me 5 years to figure this out.

In 5 minutes, I'll share what took me 5 years to figure out. Let's go. 🧵 Image 1. What are residuals?

In statistics and machine learning, "residuals" refer to the differences between observed values and the values predicted by a model. These are your model errors.
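In code, that definition is a single subtraction per observation. A minimal sketch with invented numbers:

```python
def residuals(observed, predicted):
    """Observed minus predicted, one residual per data point."""
    return [o - p for o, p in zip(observed, predicted)]


observed = [10.0, 12.0, 15.0, 11.0]
predicted = [9.0, 12.5, 14.0, 13.0]
errors = residuals(observed, predicted)
print(errors)  # [1.0, -0.5, 1.0, -2.0]

# A mean far from zero would hint that the model is systematically off.
mean_error = sum(errors) / len(errors)
print(mean_error)  # -0.125
```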
Feb 4 12 tweets 3 min read
Logistic Regression is how my simple lead scoring model grew revenue to $15,000,000.

In 3 minutes, here's what took me 3 months to figure out (business case included).

Let's dive in. 🧵 Image 1. Binary Classification:

Logistic regression is a statistical method used for analyzing a dataset in which one or more independent variables determine a binary outcome (in which there are only two possible outcomes). This is commonly called a binary classification problem. 0 = customer didn't buy, 1 = customer bought!
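Under the hood, a fitted logistic regression turns a weighted sum of the inputs into a probability with the sigmoid function. Here's a sketch of the scoring step only (no training). The feature names, weights, and intercept are hypothetical, not taken from the lead scoring model in the thread.

```python
import math


def purchase_probability(features, weights, intercept):
    """Sigmoid of the linear score: estimated P(outcome = 1)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients for [email_opens, visited_pricing_page].
weights = [0.8, 1.2]
intercept = -2.0

lead = [2, 1]  # opened 2 emails, visited the pricing page
prob = purchase_probability(lead, weights, intercept)
label = 1 if prob >= 0.5 else 0  # 1 = predicted to buy, 0 = not
print(round(prob, 3), label)  # 0.69 1
```

The 0.5 cutoff is a default choice; for lead scoring you might rank leads by probability instead of thresholding.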
Feb 1 10 tweets 3 min read
90% of data scientists can improve their SQL for business intelligence.

In 3 minutes, learn the 20% of SQL that gets 80% of results: Image 🔍 SELECT Basics:

Start with SELECT * FROM table to retrieve all rows & columns.

Remember, SQL keywords aren’t case-sensitive, but capitalizing them (SELECT, FROM) makes your queries easier to read. Image
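You can try this without installing a database by using Python's built-in sqlite3 module. A small sketch: the `customers` table and its rows are made up, and the behavior shown is SQLite's, though keyword case-insensitivity holds across the major SQL databases.

```python
import sqlite3

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# Same query, keywords in upper case and in lower case.
upper_rows = conn.execute("SELECT * FROM customers").fetchall()
lower_rows = conn.execute("select * from customers").fetchall()
conn.close()

print(upper_rows)
print(upper_rows == lower_rows)  # True
```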
Jan 30 12 tweets 2 min read
Data Scientist vs. AI Engineer (Generative AI Edition)

I've been studying AI for 18 months. This is what I discovered about the rise of this new role: 1) Context: The Rise of AI Engineering

- Data scientists have been called the “sexiest job of the 21st century.”
- But generative AI breakthroughs have led to a new role: AI engineers.
- Think of data scientists as data-driven decisioneers vs. AI engineers as AI system builders.
Jan 24 10 tweets 5 min read
The cost of the Python AI / ML stack:

Langchain $0
Langgraph $0
Scikit Learn $0
H2O $0
Torch $0
Pandas $0
Numpy $0
Plotly $0
Statsmodels $0
Ollama $0
OpenAI (<$1.00 per month)

Becoming a Generative AI Data Scientist cost me $12: 🧵 Image 1. Environment:

- VSCode
- Conda
- Jupyter VSCode Integration

Start here: code.visualstudio.com/docs/datascien… Image
Jan 20 7 tweets 2 min read
Can AI do Time Series Forecasting?

This is what I found out. Image Over the past 2 years, I've been studying AI. Why?

Because there are 1,000s of ways we can combine AI with Data Science.

Time series is one of them. Image