Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists & AI Engineers ($200,000+ career) 👇
Aug 30 • 12 tweets • 4 min read
Linear Regression is one of the most important tools in a Data Scientist's toolbox.
Yet it's super confusing for beginners.
Let's fix that: 🧵
1. Ordinary Least Squares (OLS) Regression
OLS is the most common form of Linear Regression. It aims to find the best-fitting linear equation that describes the relationship between the dependent variable (often denoted as Y) and the independent variables (denoted as X1, X2, ..., Xn).
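To make "best-fitting" concrete, here is a minimal sketch of OLS with a single independent variable, using the closed-form slope and intercept. The function name `ols_fit` and the data are made up for illustration; in practice a library such as statsmodels or scikit-learn would handle multiple X variables.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data that lies exactly on the line y = 1 + 2x.
hours = [1, 2, 3, 4, 5]
score = [3, 5, 7, 9, 11]
intercept, slope = ols_fit(hours, score)
print(intercept, slope)
```

Because the toy data sits exactly on a line, the fit recovers that line; with noisy data the same formulas return the line that minimizes the sum of squared residuals.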
Aug 28 • 7 tweets • 3 min read
Came across this new library for LLM Prompt Management in Python.
This is what it does:
The Python library is called Promptify.
It combines a prompter, LLMs, and a pipeline to solve NLP problems with LLMs.
You can easily generate different NLP Task prompts for popular generative models like GPT, PaLM, and more with Promptify.
Aug 28 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know:
1. What is DocETL?
It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
It offers:
- An interactive UI playground
- A Python package for running production pipelines
Aug 27 • 15 tweets • 3 min read
Understanding P-Values is essential for improving regression models.
In 2 minutes, I'll crush your confusion.
1. The p-value:
A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
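A small worked example may help here. The sketch below computes an exact two-sided p-value for a coin-flip experiment where the null hypothesis is "the coin is fair". The scenario (9 heads in 10 flips) and the function name `two_sided_p_value` are assumptions chosen for illustration, not part of the original thread.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    # Number of equally likely flip sequences that give the observed head count.
    observed = comb(flips, heads)
    # Add up every outcome that is at least as unlikely as the observed one.
    extreme = sum(
        comb(flips, k) for k in range(flips + 1) if comb(flips, k) <= observed
    )
    return extreme / 2 ** flips


p = two_sided_p_value(heads=9, flips=10)
print(round(p, 4))  # 0.0215
```

A result this small says that 9 or more heads (or 1 or fewer) would be rare if the coin were fair, which is evidence against the null hypothesis. It is not the probability that the null hypothesis is true.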
Aug 27 • 9 tweets • 3 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.
In 2 minutes, I'll crush your confusion.
Let's dive in:
1. Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine a binary outcome (one with only two possible values). This is commonly called a binary classification problem.
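As a rough illustration, here is a from-scratch sketch of logistic regression with one independent variable, trained by plain gradient descent on the log loss. The data, learning rate, and step count are made-up assumptions; a real project would more likely use scikit-learn or statsmodels.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        preds = [sigmoid(w * xi + b) for xi in x]
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((p - yi) * xi for p, yi, xi in zip(preds, y, x)) / n
        grad_b = sum(p - yi for p, yi in zip(preds, y)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up binary outcome: small x values are class 0, large x values are class 1.
x = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(x, y)
print(sigmoid(w * 1.0 + b), sigmoid(w * 4.0 + b))
```

The model outputs a probability, and the binary prediction comes from thresholding it (commonly at 0.5).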
Aug 25 • 8 tweets • 3 min read
This guy built an entire AI Data Science Team in Python.
100% Open Source
This is how to get it (for FREE) 🧵
1. What is it?
An AI-powered data science team of agents to help you perform common data science tasks 10X faster.
Aug 24 • 6 tweets • 3 min read
Data scientists are OUT.
The Generative AI Data Scientist is IN.
This is why (and how you can make the transition): 🧵
Companies are sitting on mountains of unstructured data.
PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts
This is useful data. But it's unusable in its existing form.
Aug 22 • 10 tweets • 4 min read
This 277-page PDF unlocks the secrets of Large Language Models.
Here's what's inside: 🧵
Chapter 1 introduces the basics of pre-training.
This is the foundation of large language models, and common pre-training methods and model architectures will be discussed here.
Aug 20 • 7 tweets • 3 min read
Stop Prompting LLMs.
Start Programming LLMs.
Introducing DSPy by Stanford NLP.
This is why you need to learn it:
1. Why DSPy?
DSPy is the open-source framework for programming—rather than prompting—language models.
It allows you to iterate fast on building modular AI systems.
Aug 19 • 6 tweets • 3 min read
Is data cleaning time-consuming?
This is how I went from 3 hours to 5 seconds:
Data cleaning is one of those parts of the data science process that can take 3+ hours.
So in December, I decided to make an AI agent that cleans data for me.
This is what I made:
Aug 19 • 5 tweets • 2 min read
STOP DOING CUSTOMER SEGMENTATION WITH MACHINE LEARNING.
Start using AI.
This is how:
ML is great for 1 thing: finding clusters.
That's only 33% of the problem.
The other 66% is identifying what those clusters mean (and figuring out how to market to them).
Aug 18 • 7 tweets • 3 min read
🚨 Synthetic Data is the Future of AI
Introducing The Synthetic Data Vault (SDV).
This is what you need to know:
Synthetic data keeps your data private.
SDV generates fake datasets that look REAL.
Here's how:
Aug 16 • 12 tweets • 4 min read
Correlation is the skill that has singlehandedly benefitted me the most in my career.
In 3 minutes I'll demolish your confusion (and share strengths and weaknesses you might be missing).
Let's go:
1. Correlation:
Correlation is a statistical measure that describes the extent to which two variables change together. It can indicate whether and how strongly pairs of variables are related.
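Here is a minimal sketch of the most common version, the Pearson correlation coefficient, written in plain Python. The function name `pearson_r` and the data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: one pair moves together, the other moves in opposite directions.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])
r_down = pearson_r(x, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

The sign gives the direction of the relationship and the magnitude (between 0 and 1) gives its strength. Pearson's r only captures linear relationships.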
Aug 15 • 12 tweets • 4 min read
These 7 statistical analysis concepts have helped me as an AI Data Scientist.
Let's go: 🧵
Step 1: Learn These Descriptive Statistics
Mean, median, mode, variance, standard deviation. Used to summarize data and spot variability. These are key for any data scientist to understand the data in front of them.
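All five are available in Python's built-in `statistics` module. A quick sketch on a made-up sample (the population versions of variance and standard deviation are used here; `statistics.variance` and `statistics.stdev` give the sample versions):

```python
import statistics

# Made-up sample for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # arithmetic average
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # average squared deviation from the mean
std_dev = statistics.pstdev(data)     # square root of the variance

print(mean, median, mode, variance, std_dev)
```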
Aug 14 • 13 tweets • 4 min read
K-means is an essential algorithm for Data Science.
But it's confusing for beginners.
Let me demolish your confusion:
1. K-Means
K-means is a popular unsupervised machine learning algorithm used for clustering. It's a core algorithm used for customer segmentation, inventory categorization, market segmentation, and even anomaly detection.
Aug 14 • 14 tweets • 4 min read
The 10 types of clustering that all data scientists need to know.
Let's dive in:
1. K-Means Clustering:
This is a centroid-based algorithm, where the goal is to minimize the sum of squared distances between points and their respective cluster centroids.
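A bare-bones sketch of the idea: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The quantity being reduced is the within-cluster sum of squared distances (often called inertia). The points and starting centroids below are made up, and the helper names are hypothetical; scikit-learn's `KMeans` is the usual choice in practice.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: total squared distance from each point to its own centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Made-up 2D points forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centroids, inertia = kmeans(points, [points[0], points[3]])
print(labels, centroids, inertia)
```

The result depends on the starting centroids, which is why library implementations typically run several random initializations and keep the one with the lowest inertia.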
Aug 13 • 8 tweets • 3 min read
🚨 BREAKING: IBM launches a free Python library that converts ANY document to data
Introducing Docling. Here's what you need to know: 🧵
1. What is Docling?
Docling is a Python library that simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Aug 12 • 12 tweets • 3 min read
Type 1 and Type 2 errors are confusing. In 3 minutes, I'll demolish your confusion. Let's dive in. 🧵
1. Type 1 Error (False Positive):
This occurs when the pregnancy test tells Tom, the man, that he is pregnant. Obviously, Tom cannot be pregnant, so this result is a false alarm. In statistical terms, it's detecting an effect (in this case, pregnancy) when it actually doesn't exist.
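To see how this is counted in practice, here is a small sketch that tallies both error types from a list of test results. The results are hypothetical, invented for illustration; a Type 2 error is the opposite case, a false negative where the test misses a real effect.

```python
# Hypothetical results as (actually_pregnant, test_says_pregnant) pairs.
results = [
    (False, True),   # Tom: test says pregnant, but he is not -> Type 1 error
    (True, True),    # correct positive
    (True, False),   # test misses a real pregnancy -> Type 2 error
    (False, False),  # correct negative
    (False, False),  # correct negative
    (True, True),    # correct positive
]

# Type 1 error (false positive): the test detects an effect that is not there.
type_1 = sum(1 for actual, predicted in results if predicted and not actual)
# Type 2 error (false negative): the test fails to detect a real effect.
type_2 = sum(1 for actual, predicted in results if actual and not predicted)

# False positive rate: share of truly negative cases that the test got wrong.
actual_negatives = sum(1 for actual, _ in results if not actual)
false_positive_rate = type_1 / actual_negatives

print(type_1, type_2, false_positive_rate)
```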
Aug 9 • 9 tweets • 4 min read
Stop doing Customer Segmentation with plain vanilla Scikit Learn.
Add these 7 Python libraries to your RFM, clustering, and customer segmentation projects:
1. Data preparation
- load data with pandas
- impute/mask with Feature-engine