Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
Mar 22 • 7 tweets • 2 min read
Data Science for Business.
The book that helped me connect the dots. Let's dive in: 1. CRISP Data Mining Process
The foundation for applying data science to business is the CRISP-DM method (Cross-Industry Standard Process for Data Mining).
It's a helpful framework for tying data science work back to business understanding.
Mar 21 • 10 tweets • 4 min read
90% of data scientists can improve their SQL for business intelligence.
In 3 minutes, learn the 20% of SQL that gets 80% of results:
🔍 SELECT Basics:
Start with SELECT * FROM table to retrieve all rows & columns.
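Remember, SQL keywords aren't case-sensitive, but capitalizing them (SELECT, FROM) makes your queries easier to read. Table names, column names, and string comparisons can still be case-sensitive, depending on the database.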
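Here's a minimal sketch of that tip using Python's built-in sqlite3 module. The orders table and its rows are made up for illustration, not taken from the thread. It runs the same SELECT with uppercase and lowercase keywords to show they return the same rows.

```python
import sqlite3

# In-memory database with a small hypothetical table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ana", 120.0), (2, "Ben", 75.5), (3, "Ana", 42.0)],
)

# SELECT * retrieves every column of every row.
upper_rows = conn.execute("SELECT * FROM orders").fetchall()

# Same query with lowercase keywords: harder to scan, same result.
lower_rows = conn.execute("select * from orders").fetchall()

print(sorted(upper_rows) == sorted(lower_rows))
```

On real tables, SELECT * can pull far more data than you need, so naming columns is usually the next step.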
Mar 20 • 12 tweets • 4 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Mar 17 • 11 tweets • 4 min read
6 statistical methods that can be used for A/B Testing (and when to use them).
A/B Testing is a staple of data science and data analyst interviews.
And it's the Number 1 technique that companies benefit from in improving customer revenue.
So here are 6 of the most common stat methods used in A/B testing.
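The six methods aren't listed in this preview, so here is a sketch of one widely used option, a two-proportion z-test for comparing conversion rates. The function name and the sample counts are assumptions for illustration only.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 100/1000 conversions for A, 130/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
```

This version assumes large, independent samples. With small counts, an exact test is usually a safer choice.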
Mar 16 • 11 tweets • 4 min read
R-squared is one of the most commonly used metrics to measure performance.
But it took me 2 years to figure out mistakes that were killing my regression models.
In 2 minutes, I'll share how I fixed 2 years of mistakes (and made 50% more accurate models than my peers). Let's go: 1. R-squared (R2):
R2 is a statistical measure for regression models: the proportion of the total variation in the observed outcomes that the model explains.
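To make the definition concrete, here's a small sketch that computes R2 as 1 minus (residual sum of squares / total sum of squares). The numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot for observed values and model predictions."""
    mean_y = sum(y_true) / len(y_true)
    # Variation the model fails to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the outcomes around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical observations and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(y_true, y_pred)
print(r2)
```

A model that always predicts the mean scores 0 on this measure, and a model worse than that can go negative.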
Mar 15 • 10 tweets • 4 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.
In 2 minutes, I'll teach you what took me 2 months to learn.
Let's go: 🧵 1. Logistic regression:
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine a binary outcome (only two possible outcomes). This is commonly called a binary classification problem.
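As a rough sketch of the idea: logistic regression passes a weighted sum of the inputs through the sigmoid function to get a probability, then applies a threshold to pick one of the two classes. The weights, bias, and feature values below are hypothetical, not a fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Binary prediction: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical one-feature model (weights are assumed, not learned here).
weights, bias = [1.5], -1.0
print(predict_proba([2.0], weights, bias), predict_class([2.0], weights, bias))
```

In practice the weights come from fitting the model to labeled data. This sketch only shows the prediction step.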
Mar 13 • 9 tweets • 3 min read
Power BI and Tableau are about to die.
Vanna AI is a new open-source Python framework that enables real-time analytics and SQL generation.
Let's explore:
Vanna is an MIT-licensed open-source Python RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality.
Mar 12 • 5 tweets • 2 min read
Data scientists are out.
The Generative AI Data Scientist is in.
Let me explain:
Companies are sitting on mountains of unstructured data.
PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts
This is useful data. But it's unusable in its existing form.
Mar 11 • 11 tweets • 4 min read
Principal Component Analysis (PCA) is the gold standard in dimensionality reduction.
But PCA is hard to understand for beginners.
Let me destroy your confusion: 1. What is PCA?
PCA is a statistical technique used in data analysis, mainly for dimensionality reduction. It's beneficial when dealing with large datasets with many variables, and it helps simplify the data's complexity while retaining as much variability as possible.
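Here's a minimal sketch of the idea for 2-D data only, using the closed-form eigenvalues of the 2x2 covariance matrix. The function name and sample points are assumptions for illustration, and it assumes the data is not constant. Real projects typically use a library implementation.

```python
import math

def pca_2d(points):
    """Return (first principal direction, explained variance ratio) for 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Share of total variance kept by the first component.
    return direction, lam1 / (lam1 + lam2)

# Hypothetical points lying exactly on the line y = 2x.
direction, ratio = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, ratio)
```

Because these points fall on a single line, one component keeps essentially all of the variability, which is the dimensionality reduction the definition describes.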
Mar 10 • 12 tweets • 4 min read
K-means is an essential algorithm for Data Science.
But it's confusing for beginners.
Let me demolish your confusion: 1. K-Means
K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
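A bare-bones sketch of the usual assign-then-update loop (Lloyd's algorithm) looks like this. The points and starting centroids are hypothetical, and the starting centroids are fixed by hand here rather than chosen at random.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from the point to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # clusters holds the assignment made in the final iteration.
    return centroids, clusters

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Results depend on the starting centroids and on the choice of k, which is why library implementations usually run several random restarts.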
Mar 9 • 10 tweets • 3 min read
R-squared is one of the most commonly used metrics to measure performance.
But it took me 2 years to figure out mistakes that were killing my regression models.
In 2 minutes, I'll share how I fixed 2 years of mistakes (and made 50% more accurate models than my peers). Let's go: 1. R-squared (R2):
R2 is a statistical measure for regression models: the proportion of the total variation in the observed outcomes that the model explains.
Mar 8 • 11 tweets • 3 min read
Correlation is the skill that has singlehandedly benefitted me the most in my career.
In 3 minutes I'll demolish your confusion (and share strengths and weaknesses you might be missing).
Let's go: 1. Correlation:
Correlation is a statistical measure that describes the extent to which two variables change together. It can indicate whether and how strongly pairs of variables are related.
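For a concrete feel, here's a small sketch of the Pearson correlation coefficient, one common way to measure this. The data is made up. The last example shows a known weakness: a perfect but nonlinear relationship can still score 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical examples.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # move together
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])         # move in opposite directions
curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x**2, not linear
print(rising, falling, curved)
```

Pearson's r only captures linear association, so it's worth plotting the data before trusting a value near 0.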
Mar 6 • 9 tweets • 2 min read
Google just dropped a new Generative AI Python library for SQL Databases.
Introducing Google GenAI Toolbox.
This is what you need to know: 1. Meet the Google GenAI Toolbox
An open-source server designed to simplify building Gen AI tools for your databases. It streamlines development, letting you integrate powerful data tools with just a few lines of code.
Mar 5 • 9 tweets • 4 min read
A Python Library for Time Series using Hidden Markov Models.
Let me introduce you to hmmlearn. 1. Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model that describes a sequence of observable events where the underlying process generating those events is not directly visible. There are "hidden states" that influence the observed data, but you can only see the results of those states, not the states themselves.
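To illustrate the hidden-versus-observed split without relying on any library, here's a plain-Python sketch of the forward algorithm, which computes how likely an observed sequence is under an HMM. The weather states, activities, and probabilities are a textbook-style toy example assumed for illustration. This is not hmmlearn's API.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence, summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in hidden state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model: the weather is hidden, only the activity is observed.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

likelihood = forward_likelihood(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(likelihood)
```

Libraries for HMMs also estimate these probabilities from data. The sketch assumes they are already known.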
Mar 1 • 9 tweets • 3 min read
Python has crazy forecasting libraries.
Let me introduce you to Kats, by Meta (Facebook)
Kats is a toolkit to analyze time series data and a lightweight, easy-to-use, and generalizable framework to perform time series analysis. It covers:
Every dashboard can now be created in seconds with these Free Agents:
Agents can now create these dashboards:
1. Content Performance 2. Email Performance 3. Google Analytics 4. Historical Sales Trends 5. Churn and Subscription Renewal
Feb 23 • 11 tweets • 3 min read
6 statistical methods that can be used for A/B Testing (and when to use them). 🧵
A/B Testing is a staple of data science and data analyst interviews.
And it's the Number 1 technique that companies benefit from in improving customer revenue.
So here are 6 of the most common stat methods used in A/B testing.
Let's dive in.
Feb 22 • 5 tweets • 2 min read
It took me 5 years to master all 24 of these machine learning concepts.
In the next 24 days, I'll teach them to you one by one (with examples of how I've used them). Here's what's coming:
1. Linear Regression 2. Clustering 3. Decision Tree 4. Neural Networks 5. Reinforcement Learning 6. Logistic Regression 7. Naive Bayes 8. Supervised Learning 9. Support Vector Machine 10. Probability 11. Random Forest 12. Variance 13. Evaluation Metrics 14. Bagging 15. Data Wrangling 16. Dimensionality Reduction 17. K-nearest Neighbors Algorithm 18. Programming 19. Regularization 20. Statistics 21. Binomial Distribution 22. Bootstrap Sampling 23. Exploratory Data Analysis 24. Data Collection
Feb 20 • 7 tweets • 2 min read
Data Science for Business.
The book that helped me connect the dots. Let's dive in: 1. CRISP Data Mining Process
The foundation for applying data science to business is the CRISP-DM method (Cross-Industry Standard Process for Data Mining).
It's a helpful framework for tying data science work back to business understanding.
Feb 20 • 10 tweets • 3 min read
90% of data scientists can improve their SQL for business intelligence.
In 3 minutes, learn the 20% of SQL that gets 80% of results:
🔍 SELECT Basics:
Start with SELECT * FROM table to retrieve all rows & columns.
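Remember, SQL keywords aren't case-sensitive, but capitalizing them (SELECT, FROM) makes your queries easier to read. Table names, column names, and string comparisons can still be case-sensitive, depending on the database.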