🔥 Matt Dancho (Business Science) 🔥
Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
May 14 • 9 tweets • 3 min read
Tableau is about to die.

Introducing PandasAI, a free alternative for fast Business Intelligence.

Let's dive in:

1. PandasAI

PandasAI turns your natural language questions into actionable insights quickly, intelligently, and effortlessly.
May 13 • 10 tweets • 3 min read
🚨 BREAKING: Microsoft launches a free Python library that converts ANY document to Markdown

Introducing MarkItDown. Let me explain. 🧵

1. Document Parsing Pipelines

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
May 11 • 14 tweets • 4 min read
The 10 types of clustering that all data scientists need to know.

Let's dive in:

1. K-Means Clustering:

This is a centroid-based algorithm, where the goal is to minimize the sum of squared distances between points and their assigned cluster centroids.
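To make that objective concrete, here's a minimal sketch in plain Python that scores a given clustering. The function name `inertia` and the toy points are assumptions for illustration, not part of any library's API.

```python
def inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Hypothetical toy data: two tight groups on a 2-D plane.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(1.0, 1.5), (8.5, 8.0)]
labels = [0, 0, 1, 1]

score = inertia(points, centroids, labels)
print(score)
```

A lower score means tighter clusters. K-means searches for the centroids and assignments that make this number as small as it can.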
May 9 • 11 tweets • 3 min read
RIP Tableau and PowerBI.

Enter Julius AI.

This is what Julius can do:

1. The $10 Billion problem with Tableau and PowerBI?

Dashboards are static.

But businesses are dynamic.

That's why I'm so excited about this new tool: Julius AI
May 9 • 13 tweets • 5 min read
Principal Component Analysis (PCA) is the gold standard in dimensionality reduction.

But almost every beginner struggles to understand how it works (and why to use it).

In 3 minutes, I'll demolish your confusion:

1. What is PCA?

PCA is a statistical technique used in data analysis, mainly for dimensionality reduction. It's beneficial when dealing with large datasets with many variables, and it helps simplify the data's complexity while retaining as much variability as possible.
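Here's a rough sketch of the idea in plain Python: center the data, build the covariance matrix, and find the direction that captures the most variance. It uses power iteration to get only the first component, which is a simplification I'm assuming for brevity (libraries typically use SVD). The function name and the toy data are made up.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) for the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical data: the second variable is roughly twice the first.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
direction, ratio = first_principal_component(rows)
print(direction, ratio)
```

Because the two variables move together, one component keeps almost all of the variability. That's the sense in which PCA reduces two columns to one.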
May 8 • 11 tweets • 4 min read
K-means is one of the most powerful algorithms for data scientists.

But it's confusing for beginners. Let's fix that:

1. What is K-means?

K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
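A minimal sketch of how the algorithm runs, assuming the standard Lloyd's iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The starting centroids are hand-picked here to keep the run deterministic, and the "customer" numbers are invented. Real implementations usually pick random or k-means++ starts.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster where it was
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (monthly spend, monthly visits), both in arbitrary units.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
final_centroids, segments = kmeans(customers, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(final_centroids, segments)
```

The two segments fall out as low-spend and high-spend customers, which is the customer segmentation use case in miniature.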
May 6 • 15 tweets • 5 min read
🚨BREAKING: OpenAI just dropped a practical guide to building agents

There's a secret.

It unlocks agentic workflows for data analysts, data scientists, and data engineers.

This is how: 🧵

Here are my key takeaways by section of the report, focused on how data professionals can use it.

1: What Is an Agent?

Agents are LLM-powered systems that autonomously execute multi-step workflows on a user's behalf, going beyond single-turn prompts to manage tool calls, conditional logic, and goal-based loops.
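To show the shape of that loop, here's a toy sketch. Everything in it is hypothetical: `stub_model` stands in for a real LLM call, `row_count` is a made-up tool, and the table sizes are invented. It's not taken from the guide or from any SDK.

```python
def row_count(table):
    """Hypothetical tool: return the number of rows in a named table."""
    fake_warehouse = {"orders": 1200, "customers": 300}
    return fake_warehouse[table]


TOOLS = {"row_count": row_count}


def stub_model(goal, history):
    """Stand-in for an LLM call: choose the next action from the goal and history."""
    seen = {step["args"]["table"] for step in history}
    for table in goal["tables"]:
        if table not in seen:
            return {"action": "tool", "name": "row_count", "args": {"table": table}}
    return {"action": "finish", "answer": sum(step["result"] for step in history)}


def run_agent(goal, max_steps=5):
    history = []
    for _ in range(max_steps):  # goal-based loop with a safety cap
        decision = stub_model(goal, history)
        if decision["action"] == "finish":  # conditional logic: stop when the goal is met
            return decision["answer"], history
        result = TOOLS[decision["name"]](**decision["args"])  # tool call
        history.append({"args": decision["args"], "result": result})
    raise RuntimeError("agent hit the step limit before finishing")


answer, trace = run_agent({"tables": ["orders", "customers"]})
print(answer, len(trace))
```

The step cap is a design choice worth keeping in any real version: it stops a confused model from looping forever.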
May 4 • 8 tweets • 3 min read
90% of data scientists struggle with time series.

But all it takes is mastering 1 technique: time series decomposition.

Here's why:

1. What is Time Series Decomposition?

A statistical method used to deconstruct a time series into several components, each representing underlying patterns in the data.

There are 3 key components: Trend, Seasonal, and Residual. Let's break them down.
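A minimal sketch of the three components, assuming a classical additive model (series = trend + seasonal + residual) with a centered moving average for the trend. The function name and the synthetic quarterly series are my own, chosen so the answer is easy to check.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model).

    Assumes the series covers at least two full periods.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n  # undefined at the edges, where the window does not fit
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: half weight on the two end points keeps the window centered.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    # Seasonal: average detrended value at each position in the cycle, centered on zero.
    by_position = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_position[t % period].append(series[t] - trend[t])
    means = [sum(vals) / len(vals) for vals in by_position]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]
    # Residual: whatever trend and seasonality do not explain.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Synthetic quarterly series: a straight-line trend plus a repeating 4-step pattern.
pattern = [2.0, -1.0, -2.0, 1.0]
series = [10.0 + t + pattern[t % 4] for t in range(12)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

On this noise-free series the residual is essentially zero. On real data, the residual is where you look for anomalies.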
May 3 • 9 tweets • 3 min read
A Python Library for Time Series by Salesforce.

Let me introduce you to Merlion.

1. What is Merlion?

Merlion is a Python library for time series intelligence.

It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance.
May 2 • 9 tweets • 3 min read
A Python Library for Time Series using Hidden Markov Models.

Let me introduce you to hmmlearn.

1. Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical model for a sequence of observable events generated by a process you cannot see directly. "Hidden states" drive the observed data, but you only see the results of those states, not the states themselves.
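Here's a small sketch of that idea in plain Python. It does not use hmmlearn. It implements the standard forward algorithm, which scores an observed sequence by summing over every possible hidden path. The "bull"/"bear" states and all the probabilities are invented for illustration.

```python
def forward_likelihood(observations, start, transition, emission):
    """Probability of an observation sequence, summed over all hidden state paths."""
    states = list(start)
    # alpha[s] = probability of the observations so far, ending in hidden state s.
    alpha = {s: start[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[prev] * transition[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical market regimes (hidden) and daily moves (observed).
start = {"bull": 0.6, "bear": 0.4}
transition = {
    "bull": {"bull": 0.7, "bear": 0.3},
    "bear": {"bull": 0.4, "bear": 0.6},
}
emission = {
    "bull": {"up": 0.8, "down": 0.2},
    "bear": {"up": 0.3, "down": 0.7},
}

likelihood = forward_likelihood(["up", "up", "down"], start, transition, emission)
print(likelihood)
```

You never observe "bull" or "bear" directly. The model only sees "up" and "down", and the hidden states enter through the transition and emission tables.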
Apr 27 • 13 tweets • 4 min read
Understanding probability is essential in data science.

In 4 minutes, I'll demolish your confusion.

Let's go!

1. Statistical Distributions:

There are hundreds of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Apr 26 • 13 tweets • 4 min read
Forecasting time series is what made me stand out as a data scientist.

But it took me 1 year to master ARIMA.

In 1 minute, I'll evaporate your confusion. Let's go.

1. Autoregressive Forecast Models

ARIMA and SARIMA are both statistical models used for forecasting time series data, where the goal is to predict future points in the series. They implement a concept called autoregression.
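Autoregression just means predicting the next value from previous values. A minimal sketch, assuming the simplest case, an AR(1) model y[t] = c + phi * y[t-1], fitted by ordinary least squares. This is only the "AR" piece, not a full ARIMA, and the series below is a noise-free toy I made up so the fit is easy to verify.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x = series[:-1]  # lagged values
    y = series[1:]   # values to predict
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    phi = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
           / sum((a - mean_x) ** 2 for a in x))
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Roll the fitted equation forward, feeding each forecast back in."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Toy series generated by y[t] = 2 + 0.5 * y[t-1], starting at 10.
series = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]
c, phi = fit_ar1(series)
next_values = forecast_ar1(series[-1], c, phi, steps=3)
print(c, phi, next_values)
```

The fit recovers c = 2 and phi = 0.5 here because there is no noise. ARIMA adds differencing and moving-average terms on top of this, and SARIMA adds seasonal versions of each.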
Apr 25 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI

Introducing DocETL.

Here's what you need to know:

1. What is DocETL?

It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.

It offers:

- An interactive UI playground
- A Python package for running production pipelines
Apr 22 • 13 tweets • 4 min read
Probability distributions are critical to data science and business decision-making.

In 3 minutes, I'll unpack 3 years of studying probability distributions (and share how I applied it to a $15,000,000 business project).

Let's go! 🧵

1. Probability Distribution Fundamentals:

In statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. It's a way to describe how likely different outcomes are. There are two main types of probability distributions: discrete and continuous.
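A quick sketch of one discrete and one continuous distribution using only Python's standard library. The binomial and normal distributions are my choice of examples, and the parameters (10 trials, 30% success rate) are made up.

```python
import math


def binomial_pmf(k, n, p):
    """Discrete: probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_cdf(x, mean=0.0, sd=1.0):
    """Continuous: probability that a normal variable falls at or below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Discrete outcomes can be listed one by one, and their probabilities sum to 1.
pmf = [binomial_pmf(k, 10, 0.3) for k in range(11)]
print(sum(pmf))

# Continuous outcomes get probability over ranges, not single points.
prob_within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)
print(prob_within_one_sd)
```

The second number is the familiar "about 68% of values fall within one standard deviation" rule for the normal distribution.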
Apr 21 • 6 tweets • 3 min read
RIP Data Scientists.

The Generative AI Data Scientist is NOW what companies want.

This is actually good news. Let me explain:

Companies are sitting on mountains of unstructured data.

PDFs
Word docs
Meeting notes
Emails
Videos
Audio Transcripts

This is useful data. But it's unusable in its existing form.
Apr 20 • 9 tweets • 3 min read
🚨 BREAKING: IBM launches a free Python library that converts ANY document to data

Introducing Docling. Here's what you need to know: 🧵

1. What is Docling?

Docling is a Python library that simplifies document processing, parsing diverse formats (including advanced PDF understanding) and providing seamless integrations with the gen AI ecosystem.
Apr 20 • 10 tweets • 4 min read
🚨 BREAKING: Microsoft launches a free Python library that converts ANY document to Markdown

Introducing MarkItDown. Let me explain. 🧵

1. Document Parsing Pipelines

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
Apr 19 • 15 tweets • 3 min read
Understanding P-Values is essential for improving regression models.

In 2 minutes, I'll crush your confusion.

1. The p-value:

A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
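Here's a minimal worked example, assuming a simple coin-flip setup in place of a regression coefficient: the null hypothesis is "the coin is fair," and the p-value is the chance of seeing a result at least this extreme if that were true. The function name and the 9-out-of-10 scenario are made up.

```python
from math import comb


def binomial_p_value(successes, trials, null_p=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * null_p ** k * (1 - null_p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 9 heads in 10 flips of a supposedly fair coin.
p_value = binomial_p_value(9, 10)
print(p_value)
```

The result is 11/1024, roughly 0.011. A fair coin would produce 9 or more heads only about 1% of the time, so this is fairly strong evidence against the null hypothesis.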
Apr 17 • 9 tweets • 3 min read
🚨NEW: Python library for LLM Prompt Management

This is what it does:

The Python library is called Promptify.

It combines a prompter, LLMs, and a pipeline to solve NLP problems with LLMs.

You can easily generate different NLP task prompts for popular generative models like GPT, PaLM, and more with Promptify.
Apr 16 • 10 tweets • 4 min read
ROC and AUC are important concepts for evaluating classification models in business (e.g. lead scoring).

In 3 minutes, I'll demystify AUC.

1. ROC Curve:

The ROC curve, which stands for the Receiver Operating Characteristic curve, is a graphical representation used to evaluate the performance of a binary classifier system as its discrimination threshold is varied.
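"As its threshold is varied" is the key phrase. A small sketch in plain Python that sweeps the threshold, records the false positive rate and true positive rate at each step, and measures the area under the resulting curve. The lead-scoring labels and scores are invented, and the function names are mine.

```python
def roc_points(labels, scores):
    """Sweep the threshold over every score, highest first; return (FPR, TPR) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical leads: 1 = converted, 0 = did not convert, with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(area)
```

The area comes out to 8/9, about 0.89. Another way to read it: pick one converted lead and one non-converted lead at random, and the model ranks the converted one higher in 8 of the 9 possible pairs.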
Apr 15 • 9 tweets • 3 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.

In 2 minutes, I'll crush your confusion.

Let's dive in:

1. Logistic regression is a statistical method for modeling a binary outcome (one with only two possible values) from one or more independent variables. This is commonly called a binary classification problem.
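A minimal sketch of the mechanics with one independent variable, assuming plain gradient descent on the log loss (libraries typically use faster solvers). The usage-hours data, the learning rate, and the function names are all made up for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical leads: hours of product usage vs. whether the lead converted (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
converted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, converted)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.5 + b))
```

The model outputs a probability, not a label. You get the binary classification by choosing a cutoff (0.5 is the usual default) and calling everything above it a "yes."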