Future Is Generative AI + Data Science | Helping My Students Become Generative AI Data Scientists ($200,000 /year career) 👇
9 subscribers
May 3 • 9 tweets • 3 min read
A Python Library for Time Series by Salesforce.
Let me introduce you to Merlion. 1. What is Merlion?
Merlion is a Python library for time series intelligence.
It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance.
May 2 • 9 tweets • 3 min read
A Python Library for Time Series using Hidden Markov Models.
Let me introduce you to hmmlearn. 1. Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model of a sequence of observable events whose generating process is not directly visible: "hidden states" influence the observed data, but you can only see the results of those states, not the states themselves.
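To make the "hidden states" idea concrete, here is a minimal sketch in plain Python (not hmmlearn's API). It assumes a toy two-state weather model with made-up probabilities and uses the Viterbi algorithm to recover the most likely hidden sequence from what was observed.

```python
# Toy HMM: hidden weather states emit observable activities.
# All state names and probabilities are invented for illustration.
states = ["Rainy", "Sunny"]
start_prob = {"Rainy": 0.6, "Sunny": 0.4}
trans_prob = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_prob = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations):
    """Return (most likely hidden-state path, its joint probability)."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {
        s: (start_prob[s] * emit_prob[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, path = max(
                (best[prev][0] * trans_prob[prev][s], best[prev][1])
                for prev in states
            )
            new_best[s] = (prob * emit_prob[s][obs], path + [s])
        best = new_best
    prob, path = max(best.values())
    return path, prob


path, prob = viterbi(["walk", "shop", "clean"])
print(path, prob)
```

We only ever observe the activities; the weather sequence is inferred from them.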
Apr 27 • 13 tweets • 4 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Apr 26 • 13 tweets • 4 min read
Forecasting time series is what made me stand out as a data scientist.
But it took me 1 year to master ARIMA.
In 1 minute, I'll evaporate your confusion. Let's go. 1. Autoregressive Forecast Models
ARIMA and SARIMA are both statistical models used for forecasting time series data, where the goal is to predict future points in the series. They implement a concept called autoregression.
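Autoregression means predicting the next value from previous values of the same series. The sketch below shows only that piece (an AR(1) model fitted by least squares), not full ARIMA, which also adds differencing and moving-average terms. The function names and the synthetic series are assumptions made for illustration.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]."""
    x = series[:-1]  # lagged values
    y = series[1:]   # values to predict
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, c, phi, steps):
    """Roll the fitted model forward, feeding each prediction back in."""
    preds = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds


# Synthetic series generated exactly by y[t] = 2 + 0.5 * y[t-1].
series = [10.0]
for _ in range(9):
    series.append(2 + 0.5 * series[-1])

c, phi = fit_ar1(series)
preds = forecast(series, c, phi, steps=3)
print(c, phi, preds)
```

Because the toy series follows the AR(1) rule exactly, the fit recovers c close to 2 and phi close to 0.5; real data would be noisier.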
Apr 25 • 8 tweets • 3 min read
🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know: 1. What is DocETL?
It's a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks.
It offers:
- An interactive UI playground
- A Python package for running production pipelines
Apr 22 • 13 tweets • 4 min read
Probability distributions are critical to data science and business decision-making.
In 3 minutes, I'll unpack 3 years of studying probability distributions (and share how I applied it to a $15,000,000 business project).
Let's go! 🧵 1. Probability Distribution Fundamentals:
In statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. It's a way to describe how likely different outcomes are to occur. There are two main types of probability distributions: Discrete and Continuous.
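A minimal sketch contrasting a discrete distribution (binomial, where individual outcomes have probabilities) with a continuous one (normal, where probabilities come from ranges). The coin-flip setup and variable names are assumptions chosen for illustration.

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Discrete: 10 fair coin flips. Each count of heads has its own probability,
# and the probabilities over all possible counts sum to 1.
pmf = [binomial_pmf(k, 10, 0.5) for k in range(11)]
total = sum(pmf)
p_five_heads = pmf[5]

# Continuous: standard normal. Probability is the area over an interval,
# computed here as a difference of cumulative distribution values.
normal = NormalDist(mu=0, sigma=1)
p_within_one_sd = normal.cdf(1) - normal.cdf(-1)

print(total, p_five_heads, p_within_one_sd)
```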
Apr 21 • 6 tweets • 3 min read
RIP Data Scientists.
The Generative AI Data Scientist is NOW what companies want.
This is actually good news. Let me explain:
Companies are sitting on mountains of unstructured data.
PDF
Word docs
Meeting notes
Emails
Videos
Audio Transcripts
This is useful data. But it's unusable in its existing form.
Apr 20 • 9 tweets • 3 min read
🚨 BREAKING: IBM launches a free Python library that converts ANY document to data
Introducing Docling. Here's what you need to know: 🧵 1. What is Docling?
Docling is a Python library that simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Apr 20 • 10 tweets • 4 min read
🚨 BREAKING: Microsoft launches a free Python library that converts ANY document to Markdown
Introducing MarkItDown. Let me explain. 🧵 1. Document Parsing Pipelines
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines.
Apr 19 • 15 tweets • 3 min read
Understanding P-Values is essential for improving regression models.
In 2 minutes, I'll crush your confusion. 1. The p-value:
A p-value in statistics is a measure used to assess the strength of the evidence against a null hypothesis.
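As a rough illustration, the sketch below computes an exact two-sided p-value for a coin-flip experiment where the null hypothesis is "the coin is fair". The scenario (60 heads in 100 flips) and the function name are assumptions; regression p-values are computed differently but are read the same way.

```python
from math import comb


def binomial_two_sided_p(heads, flips, p_null=0.5):
    """Probability, under the null, of any outcome at least as unlikely
    as the one observed (exact two-sided binomial test)."""
    pmf = [
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
    ]
    observed = pmf[heads]
    # Small tolerance so symmetric outcomes are not dropped by rounding.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))


p_value = binomial_two_sided_p(60, 100)
print(p_value)
```

Here the p-value comes out at roughly 0.057: if the coin were fair, a result this extreme would show up in a little under 6% of experiments, which is weak evidence against the null.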
Apr 17 • 9 tweets • 3 min read
🚨NEW: Python library for LLM Prompt Management
This is what it does:
The Python library is called Promptify.
It combines a prompter, LLMs, and a pipeline to solve NLP problems with LLMs.
You can easily generate different NLP Task prompts for popular generative models like GPT, PaLM, and more with Promptify.
Apr 16 • 10 tweets • 4 min read
ROC and AUC are important concepts for evaluating classification models in business (e.g. lead scoring).
In 3 minutes, I'll demystify AUC. 1. ROC Curve:
The ROC curve, which stands for the Receiver Operating Characteristic curve, is a graphical representation used to evaluate the performance of a binary classifier system as its discrimination threshold is varied.
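The sketch below shows what "varying the threshold" means in practice: for each candidate threshold it records the false positive rate and true positive rate, then takes the area under those points. The labels and scores are made-up lead-scoring style data, and the helper names are illustrative.

```python
def roc_points(labels, scores):
    """Sweep the threshold over each distinct score; return (FPR, TPR) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# 1 = converted lead, 0 = did not convert; scores are model probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points, area)
```

An AUC of 1.0 would mean every positive is scored above every negative; 0.5 is no better than random ordering.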
Apr 15 • 9 tweets • 3 min read
Logistic Regression is the most important foundational algorithm in Classification Modeling.
In 2 minutes, I'll crush your confusion.
Let's dive in: 1. Logistic regression is a statistical method used for analyzing a dataset in which there are one or more independent variables that determine a binary outcome (in which there are only two possible outcomes). This is commonly called a binary classification problem.
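A minimal from-scratch sketch of binary classification with logistic regression, assuming one independent variable and plain gradient descent. The data, learning rate, and function names are illustrative; in practice a library such as scikit-learn would handle the fitting.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data: hours of product usage vs. whether the user upgraded (1) or not (0).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)


def predict(x):
    """Predicted probability of the positive class for input x."""
    return sigmoid(w * x + b)


print(w, b, predict(0.5), predict(4.0))
```

The model outputs a probability; turning it into a yes/no prediction means picking a threshold such as 0.5.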
Apr 15 • 5 tweets • 2 min read
Data Scientists are OUT.
AI Data Scientists are IN.
95% of data scientists are overlooking this fact.
That's a massive opportunity for you.
You just need 3 AI Skills:
1. LangChain $0 2. LangGraph $0 3. OpenAI API ($12/month)
Cost: $12 per month
Salary: $210,000 per year
That's a no-brainer. Want help?
Apr 13 • 9 tweets • 4 min read
🚨 Google published a 69-page prompt engineering masterclass.
❌Move over PowerBI. There's a new AI analyst in town.
💡Introducing ThoughtSpot. 1. AI Analyst
ThoughtSpot’s Spotter is an AI analyst that uses generative AI to answer complex business questions in natural language, delivering visualizations and insights instantly.
It supports iterative querying (e.g., “What’s next?”) without predefined dashboards.
Apr 12 • 8 tweets • 3 min read
RIP Tableau.
Introducing PandasAI, a free alternative for fast Business Intelligence.
Let's dive in: 🧵 1. PandasAI
PandaAI transforms your natural language questions into actionable insights — fast, smartly, and effortlessly.
Apr 11 • 12 tweets • 3 min read
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go! 1. Statistical Distributions:
There are 100s of distributions to choose from when modeling data. Choices seem endless. Use this as a guide to simplify the choice.
Apr 10 • 9 tweets • 3 min read
🚨 BREAKING: Google just open sourced Agent Development Kit (ADK) in Python
This is what you need to know: 🧵 1. What is Google ADK?
Agent Development Kit (ADK) is a flexible and modular framework for developing and deploying AI agents.
ADK can be used with popular LLMs and open-source generative AI tools and is designed with a focus on tight integration with the Google ecosystem and Gemini models.
Apr 8 • 8 tweets • 3 min read
Stop Prompting LLMs.
Start Programming LLMs.
Introducing DSPy by Stanford NLP.
This is why you need to learn it: 1. Why DSPy?
DSPy is the open-source framework for programming—rather than prompting—language models.
It allows you to iterate fast on building modular AI systems.
Apr 8 • 11 tweets • 4 min read
RIP Tableau and PowerBI.
Enter Julius AI.
This is what Julius can do: 1. The $10 Billion problem with Tableau and PowerBI?
Dashboards are static.
But businesses are dynamic.
That's why I'm so excited about this new tool: Julius AI