Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
9 subscribers
Jun 29 • 9 tweets • 3 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.
In 2 minutes, I'll crush your confusion.
Let's dive in: 1. Logistic regression is a statistical method for modeling a binary outcome (one with only two possible values) from one or more independent variables. This is commonly called a binary classification problem.
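To make the binary-outcome idea concrete, here is a minimal sketch in plain Python that fits a one-variable logistic regression with gradient descent. The study-hours data, the learning rate, and the helper names (`fit_logistic`, `predict_proba`) are made up for illustration; in practice you would likely reach for a library rather than hand-rolling this.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction error for this row
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical data: hours studied (x) and whether the student passed (y).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)

def predict_proba(x):
    """Estimated probability of the positive class for input x."""
    return sigmoid(w * x + b)

print(f"P(pass | 1 hour)  = {predict_proba(1.0):.2f}")
print(f"P(pass | 5 hours) = {predict_proba(5.0):.2f}")
```

The model outputs a probability, and the two-class decision comes from thresholding it (0.5 is the usual default).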
Jun 27 • 12 tweets • 4 min read
These 7 statistical analysis concepts have helped me as an AI Data Scientist.
Let's go: 🧵
Step 1: Learn These Descriptive Statistics
Mean, median, mode, variance, standard deviation. Used to summarize data and spot variability. These are key for any data scientist to understand what’s in front of them in their data sets.
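A quick sketch of these five summaries using Python's built-in `statistics` module. The numbers are made-up sample data, and the population versions (`pvariance`, `pstdev`) are used here; swap in `variance` and `stdev` if your data is a sample from a larger population.

```python
import statistics

# Made-up data set for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # central value: 5
median = statistics.median(data)       # middle value: 4.5
mode = statistics.mode(data)           # most frequent value: 4
variance = statistics.pvariance(data)  # average squared deviation from the mean: 4
std_dev = statistics.pstdev(data)      # square root of the variance: 2.0

print(mean, median, mode, variance, std_dev)
```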
Jun 27 • 9 tweets • 3 min read
🚨BREAKING: New Python library for Bayesian Marketing Mix Modeling and Customer Lifetime Value
It's called PyMC-Marketing.
This is what you need to know: 🧵 1. What is PyMC-Marketing?
PyMC-Marketing is a state-of-the-art Bayesian modeling library that's designed for Marketing Mix Modeling (MMM) and Customer Lifetime Value (CLV) prediction.
Jun 26 • 8 tweets • 3 min read
Stop Prompting LLMs.
Start Programming LLMs.
Introducing DSPy by Stanford NLP.
This is why you need to learn it: 1. Why DSPy?
DSPy is the open-source framework for programming—rather than prompting—language models.
It allows you to iterate fast on building modular AI systems.
Jun 22 • 15 tweets • 4 min read
Understanding P-Values is essential for improving regression models.
In 2 minutes, I'll crush your confusion.
Let's go: 1. The p-value:
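A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis. It is the probability of seeing a result at least as extreme as the one you observed, assuming the null hypothesis is true.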
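Here is a small worked example, assuming a coin-flip experiment that is not part of the original thread: the null hypothesis is "the coin is fair", and we observe 16 heads in 20 flips. The function name `two_sided_binomial_p` is just an illustrative helper.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Probability, under a fair coin, of a head count at least as far
    from flips / 2 as the one observed (an exact two-sided p-value)."""
    observed_gap = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return extreme_outcomes / 2 ** flips

p_value = two_sided_binomial_p(heads=16, flips=20)
print(f"p-value = {p_value:.4f}")  # about 0.0118
```

A small p-value like this one says the observed result would be rare if the coin were fair, which is evidence against the null hypothesis. It is not the probability that the null hypothesis is true.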
Jun 16 • 8 tweets • 3 min read
🚨 BREAKING: IBM launches a free Python library that converts ANY document to data
Introducing Docling. Here's what you need to know: 🧵 1. What is Docling?
Docling is a Python library that simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Jun 15 • 13 tweets • 4 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are hundreds of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Jun 14 • 11 tweets • 4 min read
K-means is one of the most powerful algorithms for data scientists.
But it's confusing for beginners. Let's fix that: 1. What is K-means?
K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
Jun 14 • 15 tweets • 4 min read
The 10 types of clustering that all data scientists need to know.
Let's dive in: 1. K-Means Clustering:
This is a centroid-based algorithm, where the goal is to minimize the sum of squared distances between points and their respective cluster centroid.
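The centroid idea can be sketched in a few lines of plain Python. This is a bare-bones version of the standard assign-then-update loop (Lloyd's algorithm), using squared Euclidean distance as K-means is usually stated. The sample points, the `kmeans` function name, and the fixed seed are assumptions for illustration only.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Return (centroids, inertia) after alternating assign and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    # Inertia: total squared distance from each point to its closest centroid.
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Made-up 2D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, inertia = kmeans(points, k=2)
print(sorted(centroids), round(inertia, 3))
```

The `inertia` value is the quantity the algorithm is driving down. On messier data the result depends on the starting centroids, which is why real implementations usually run several random restarts.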
Jun 12 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know: 1. What is DocETL?
It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
It offers:
- An interactive UI playground
- A Python package for running production pipelines
Jun 11 • 10 tweets • 4 min read
Python is insane for time series.
Case in point: Pytimetk 📈
Pytimetk’s mission: To make time series analysis easier, faster, and more enjoyable in Python.
Pytimetk uses a Polars backend for massive speedups.
Jun 5 • 9 tweets • 3 min read
A Python Library for Time Series using Hidden Markov Models.
Let me introduce you to hmmlearn. 1. Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model for a sequence of observable events generated by a process that is not directly visible. There are "hidden states" that influence the observed data, but you can only see the results of those states, not the states themselves.
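To see the hidden-versus-observed split in action, here is a tiny sketch of the forward algorithm in plain Python. It does not use hmmlearn's API. The weather states, the activities, and every probability below are invented for illustration.

```python
from itertools import product

# Hidden states (never observed directly) and what we can observe instead.
states = ("Rainy", "Sunny")
symbols = ("walk", "shop", "clean")

# All probabilities below are made-up example values.
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def forward(observations):
    """Likelihood of an observation sequence, summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

likelihood = forward(("walk", "shop", "clean"))
print(f"P(walk, shop, clean) = {likelihood:.4f}")

# Sanity check: likelihoods over every possible 3-step sequence sum to 1.
total = sum(forward(seq) for seq in product(symbols, repeat=3))
print(f"Sum over all sequences = {total:.4f}")
```

We never pass the weather to `forward`, only the activities. The model accounts for the hidden states by summing over every path they could have taken.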
Jun 4 • 7 tweets • 3 min read
🚨NEW AI for Data Scientists Workshop
This is what's coming:
Generative AI is the future of Data Science.
Those who can build with LLMs and Python have unlimited career potential.
Jun 4 • 8 tweets • 3 min read
❌Move over PowerBI. There's a new AI analyst in town.
💡Introducing ThoughtSpot. 1. AI Analyst
ThoughtSpot’s Spotter is an AI analyst that uses generative AI to answer complex business questions in natural language, delivering visualizations and insights instantly.
It supports iterative querying (e.g., “What’s next?”) without predefined dashboards.
Jun 3 • 12 tweets • 4 min read
Top 7 most important statistical analysis concepts that have helped me as a Data Scientist.
This is a complete 7-step beginner ROADMAP for learning stats for data science. Let's go:
Step 1: Learn These Descriptive Statistics
Mean, median, mode, variance, standard deviation. Used to summarize data and spot variability. These are key for any data scientist to understand what’s in front of them in their data sets.
Jun 1 • 9 tweets • 3 min read
🚨 BREAKING: IBM launches a free Python library that converts ANY document to data
Introducing Docling. Here's what you need to know: 🧵 1. What is Docling?
Docling is a Python library that simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
May 30 • 13 tweets • 4 min read
The concept that helped me go from bad models to good models: Bias and Variance.
In 4 minutes, I'll share 4 years of experience in managing bias and variance in my machine learning models. Let's go. 🧵 1. Generalization:
Bias and variance control your model's ability to generalize on new, unseen data, not just the data it was trained on. The goal in machine learning is to build models that generalize well. To do so, I manage bias and variance.
May 29 • 12 tweets • 4 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are hundreds of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
May 23 • 13 tweets • 3 min read
Bayes' Theorem is a fundamental concept in data science.
But it took me 2 years to understand its importance.
In 2 minutes, I'll share my best findings from the last 2 years of exploring Bayesian statistics. Let's go. 1. Background:
Bayes' "An Essay towards solving a Problem in the Doctrine of Chances" was published in 1763, two years after his death. In this essay, Bayes addressed the problem of inverse probability, which is the basis of what is now known as Bayesian probability.
May 22 • 12 tweets • 4 min read
Top 7 most important statistical analysis concepts that have helped me as a Data Scientist.
This is a complete 7-step beginner ROADMAP for learning stats for data science. Let's go:
Step 1: Learn These Descriptive Statistics
Mean, median, mode, variance, standard deviation. Used to summarize data and spot variability. These are key for any data scientist to understand what’s in front of them in their data sets.
May 22 • 12 tweets • 3 min read
Type 1 and Type 2 errors are confusing. In 3 minutes, I'll demolish your confusion. Let's dive in. 🧵 1. Type 1 Error (False Positive):
This occurs when the pregnancy test tells Tom, the man, that he is pregnant. Obviously, Tom cannot be pregnant, so this result is a false alarm. In statistical terms, it's detecting an effect (in this case, pregnancy) when it actually doesn't exist.
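A false alarm rate is easy to simulate. The sketch below swaps the pregnancy test for a coin-flip setup, which is an assumption made for illustration: the coin really is fair (so there is no effect to find), and each experiment "detects an effect" when 20 flips give 5 or fewer heads, or 15 or more. Every detection is therefore a Type 1 error.

```python
import random
from math import comb

FLIPS = 20
LOW, HIGH = 5, 15        # hypothetical rejection rule: heads <= 5 or heads >= 15
EXPERIMENTS = 10_000

rng = random.Random(42)  # fixed seed so the run is repeatable

false_alarms = 0
for _ in range(EXPERIMENTS):
    # The coin is fair, so the null hypothesis is true in every experiment.
    heads = sum(rng.random() < 0.5 for _ in range(FLIPS))
    if heads <= LOW or heads >= HIGH:
        false_alarms += 1  # we "detected" an effect that does not exist

simulated_rate = false_alarms / EXPERIMENTS

# Exact Type 1 error rate for this rule, from the binomial distribution.
exact_rate = sum(
    comb(FLIPS, k) for k in range(FLIPS + 1) if k <= LOW or k >= HIGH
) / 2 ** FLIPS

print(f"simulated Type 1 error rate: {simulated_rate:.3f}")
print(f"exact Type 1 error rate:     {exact_rate:.3f}")  # about 0.041
```

Tightening the rule (say, 4 or fewer, or 16 or more) lowers the false alarm rate, but it also makes real effects harder to detect.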