Alexey Grigorev
👷‍♂️ Building @DataTalksClub community 🎤 Event and podcast host 📚 Book author and course instructor 🌍 Berlin, Germany
Oct 2, 2022 5 tweets 2 min read
AUC stands for "Area Under the Curve"

Usually, when we say "AUC" we mean "AUC ROC" - the area under the ROC curve

It's a way of evaluating the quality of a binary classification model based on a ROC curve

Let's see what it is 🧵

The ROC curve of the ideal model passes through the point FPR=0%, TPR=100%

The ROC curve of a random model is a straight line between (0, 0) and (1, 1)

Usually, the ROC curve of our model is somewhere between
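A minimal sketch of what the number means, using made-up labels and scores. It relies on a known equivalence: AUC ROC equals the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The helper name `auc_score` is hypothetical.

```python
def auc_score(y_true, y_score):
    """AUC ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only
y_true = [0, 0, 1, 1]
ideal_scores = [0.1, 0.2, 0.8, 0.9]      # every positive outranks every negative
constant_scores = [0.5, 0.5, 0.5, 0.5]   # cannot separate the classes at all
typical_scores = [0.1, 0.6, 0.4, 0.9]    # somewhere in between

auc_ideal = auc_score(y_true, ideal_scores)        # 1.0
auc_constant = auc_score(y_true, constant_scores)  # 0.5
auc_typical = auc_score(y_true, typical_scores)    # 0.75
```

The pairwise loop is quadratic, so it only suits small examples; for real data a library routine such as scikit-learn's `roc_auc_score` is the usual choice.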
Oct 2, 2022 12 tweets 4 min read
ROC curves were used during WW2 to assess how well radars detect planes

Target:

🔸y = 1: there's a plane
🔸y = 0: there's no plane

ROC tells us how well a model can separate these two cases

It's based on two quantities: FPR and TPR

Let's use them to build a ROC curve 🧵

We'll use a more modern example: Churn prediction

🔸y = 1: customer stops being our client
🔸y = 0: customer continues being our client
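A rough sketch of how the curve can be built for the churn example, assuming made-up labels and scores and a hypothetical helper `fpr_tpr`. For each threshold we count how many non-churning customers get flagged (FPR) and how many churning customers get caught (TPR). Each threshold gives one point of the curve.

```python
def fpr_tpr(y_true, y_score, threshold):
    """Return (FPR, TPR) when predicting churn for score >= threshold.

    Assumes y_true contains at least one 0 and at least one 1.
    """
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted_churn = s >= threshold
        if y == 1 and predicted_churn:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_churn:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn)  # share of staying customers flagged as churning
    tpr = tp / (tp + fn)  # share of churning customers correctly flagged
    return fpr, tpr


# Hypothetical churn labels (1 = churned) and model scores
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# One ROC point per threshold, from "flag nobody" to "flag everybody"
thresholds = [1.0, 0.8, 0.5, 0.2, 0.0]
roc_points = [fpr_tpr(y_true, y_score, t) for t in thresholds]
```

With these numbers the points run from (0, 0) to (1, 1), which are the two ends of every ROC curve.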
Oct 2, 2022 7 tweets 2 min read
Accuracy can be misleading

What to use instead?

👉 Precision
Among examples predicted as positive, how many are correct?

👉 Recall
How many positive examples are identified correctly?

Confused? Let me explain it with an example 🧵

Let's use the churn prediction example

🔸 We want to identify customers who will churn
🔸 We train a model for that
🔸 If the model thinks a customer will churn, we offer them a discount to keep them
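A small sketch of the two definitions above, on made-up data where only 2 of 10 customers churn. The function names are hypothetical, and returning 0.0 when a ratio is undefined is an assumption made here for simplicity. A "model" that never predicts churn gets the higher accuracy, yet it finds no churners at all.

```python
def accuracy(y_true, y_pred):
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


def precision_recall(y_true, y_pred):
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    # Assumption: report 0.0 when the ratio is undefined (nothing to divide by)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Hypothetical data: 1 = customer churned
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
never_churn = [0] * 10                        # always predicts "no churn"
model_pred = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # flags three customers

acc_dummy = accuracy(y_true, never_churn)     # 0.8
acc_model = accuracy(y_true, model_pred)      # 0.7
prec_dummy, rec_dummy = precision_recall(y_true, never_churn)  # 0.0, 0.0
prec_model, rec_model = precision_recall(y_true, model_pred)   # 1/3, 0.5
```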
Sep 30, 2022 7 tweets 2 min read
Confusion table is confusing

In churn prediction:

🔸 True positive: correctly predicted churn
🔸 False positive: predict churn but they didn't churn
🔸 False negative: predict no churn but they churned
🔸 True negative: correctly predicted no churn

Let's see how to use it

🧵

To understand it better, it helps to think about how the model is applied

In the case of churn, we'll offer discounts to people the model marked as churning, hoping it'll help retain them
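A minimal sketch of counting the four cells for the churn case, with made-up labels and predictions and a hypothetical helper `confusion_table`. Everyone predicted as churning (TP + FP) gets a discount; churners the model missed (FN) get nothing.

```python
from collections import Counter


def confusion_table(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = churn, 0 = no churn)."""
    names = {(1, 1): "TP", (0, 1): "FP", (1, 0): "FN", (0, 0): "TN"}
    counts = Counter(names[(y, p)] for y, p in zip(y_true, y_pred))
    return {name: counts[name] for name in ("TP", "FP", "FN", "TN")}


# Hypothetical data, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

table = confusion_table(y_true, y_pred)
discounts_sent = table["TP"] + table["FP"]  # everyone the model flagged
churners_missed = table["FN"]               # churned without getting an offer
```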
Sep 10, 2022 4 tweets 2 min read
200+ Data Science interview questions

🔸 Supervised machine learning (linear models, trees, neural nets)
🔸 Feature selection, parameter tuning
🔸 Unsupervised learning (clustering, dim reduction)
🔸 Recommenders and search
🔸 SQL
🔸 Coding (Python), algorithms

With answers 👇 First, 160+ theoretical interview questions:

github.com/alexeygrigorev…

(Note that 24 questions still have no answers - contributions are welcome)
Sep 9, 2022 8 tweets 2 min read
Week 1 of Machine Learning Zoomcamp:

🔸 What's ML
🔸 Supervised Machine Learning
🔸 Process for ML projects
🔸 Linear algebra refresher
🔸 Numpy and Pandas

Here's a thread with this week's tweet summaries

What is machine learning?
Sep 8, 2022 9 tweets 3 min read
Linear algebra's most important operations:

1️⃣ Vector-vector multiplication
2️⃣ Matrix-vector multiplication
3️⃣ Matrix-matrix multiplication

The best way to understand them is to express 2️⃣ with 1️⃣ and 3️⃣ with 2️⃣

Let me show you how 🧵 First, let's start with vector-vector multiplication (aka dot-product)

We have two vectors u and v

Multiply the corresponding elements of the two vectors and then sum up the results
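A plain-Python sketch of the idea, with hypothetical function names and made-up numbers: matrix-vector multiplication is a dot product per row, and matrix-matrix multiplication is a matrix-vector multiplication per column. In practice NumPy does all of this for you.

```python
def dot(u, v):
    """Vector-vector multiplication: multiply element-wise, then sum."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def matrix_vector(U, v):
    """Each output element is the dot product of one row of U with v."""
    return [dot(row, v) for row in U]


def matrix_matrix(U, V):
    """Each output column is U multiplied by one column of V."""
    columns = [matrix_vector(U, list(col)) for col in zip(*V)]
    return [list(row) for row in zip(*columns)]  # turn columns back into rows


# Made-up example values
u = [2, 4, 5]
v = [1, 0, 3]
U = [[2, 4, 5],
     [1, 2, 1]]
V = [[1, 2],
     [0, 1],
     [3, 0]]

uv = dot(u, v)             # 2*1 + 4*0 + 5*3 = 17
Uv = matrix_vector(U, v)   # [17, 4]
UV = matrix_matrix(U, V)   # [[17, 8], [4, 4]]
```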
Aug 24, 2022 9 tweets 2 min read
My onsite interview for ML engineering with a FAANG company:

🔸 Behavioral
🔸 Coding round 1 (two problems)
🔸 Coding round 2 (two problems)
🔸 System design
🔸 ML case study

Here are the questions I got 👇

Behavioral:
Feb 25, 2021 14 tweets 3 min read
Career Transitioning into Data Science

Talk by @pandeyparul packed with actionable advice

1️⃣ Love data
2️⃣ Create your own learning plan
3️⃣ Learn by doing
4️⃣ Contribute to open-source
5️⃣ Communicate insights
6️⃣ Network

🔗

Detailed summary 🧵👇

1️⃣ Learn to love data

🔸 Ask yourself: "why data science?"
🔸 Numbers should excite you
🔸 If you don't like seeing a lot of numbers, ask yourself if data science is right for you
Feb 13, 2021 7 tweets 1 min read
For any project, follow these steps

1️⃣ Make it work
2️⃣ Make it right
3️⃣ Make it fast

In this exact order

It's important. Let me explain why

Thread 👇

1️⃣ Make it work

When starting a project

🔸 Experiment
🔸 Figure out how it should work
🔸 Cut corners
🔸 Make ugly hacks

Do anything it takes to solve the problem — and have a working system
Feb 12, 2021 5 tweets 1 min read
Most useful regular expressions for text pre-processing:

🔸 Removing non-letters - \W+
🔸 Replacing numbers with a special token - \d+
🔸 Removing extra whitespaces - \s+

I use these three expressions in every project with text

Code 👇

🔸 Removing non-letters 🔸

import re

non_letter = re.compile(r'\W+')
text = non_letter.sub(' ', text)
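A small sketch that chains all three expressions, assuming a hypothetical `preprocess` helper and `NUM` as the special token (the thread doesn't say which token is used). Note that `\W` matches non-word characters, so digits and underscores survive the first step, which is why the number replacement still works afterwards.

```python
import re

non_word = re.compile(r'\W+')    # punctuation, symbols, whitespace
numbers = re.compile(r'\d+')     # runs of digits
whitespace = re.compile(r'\s+')  # runs of spaces, tabs, newlines


def preprocess(text):
    text = non_word.sub(' ', text)    # replace non-word runs with one space
    text = numbers.sub('NUM', text)   # 'NUM' is an assumed placeholder token
    text = whitespace.sub(' ', text)  # collapse any remaining whitespace
    return text.strip()               # extra step: trim leading/trailing space


cleaned = preprocess("Hello,   world!! I have 42 apples & 7 pears.")
# cleaned == "Hello world I have NUM apples NUM pears"
```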
Feb 9, 2021 6 tweets 1 min read
They say:

"Kaggle doesn't teach you how to translate a business problem into machine learning terms"

This is NOT true

You CAN learn a great deal from @kaggle

Let me tell you how you can do it in 4 simple steps. None of them requires taking part in a competition

Thread 👇

1️⃣ Explore

🔸 Look at the past competitions
🔸 Find 20 competitions that are interesting
🔸 Put them in a spreadsheet
Feb 8, 2021 9 tweets 2 min read
Interview process for ML Engineers and Data Scientists:

1️⃣ Screening
2️⃣ Machine Learning
3️⃣ Coding
4️⃣ Case studies
5️⃣ System design
6️⃣ Behavioral

Here's what you can expect at each step (Thread) 👇

2️⃣ Machine Learning

Usually theoretical questions:

🔸 Linear models
🔸 L1 vs L2 regularization
🔸 XGB vs Random Forest
🔸 Why neural nets need activation functions
Feb 8, 2021 8 tweets 2 min read
🤖 Learning machine learning?

Focus on mastering these algorithms:

🔸 Linear regression
🔸 Logistic regression
🔸 Decision trees
🔸 Random forest
🔸 Gradient boosting
🔸 Neural networks + CNN

Don't know how?

Here's a detailed mega-thread 👇

(check the replies as well!)

Linear regression 👇

Feb 5, 2021 4 tweets 1 min read
The toughest data science interview I ever had

I got bombarded for 45 minutes with theoretical questions:

🔸 Entropy
🔸 KL divergence, other divergences
🔸 Kolmogorov complexity
🔸 Jacobian and Hessian
🔸 Linear independence
🔸 Determinant

Continued 👇

🔸 Eigenvalues and Eigenvectors
🔸 SVD
🔸 The norm of a vector
🔸 Independent random variables
🔸 Expectation and variance
🔸 Central limit theorem

👇
Jan 29, 2021 4 tweets 1 min read
Learning path to mastering Data Science:

🔸 Python
🔸 Git
🔸 SQL
🔸 NumPy
🔸 Pandas
🔸 Scikit-Learn
🔸 Flask
🔸 Docker
🔸 AWS
🔸 TensorFlow
🔸 Linear Algebra
🔸 Machine Learning basics

What else?

Things from Linear Algebra to focus on:

Jan 28, 2021 7 tweets 2 min read
MLOps is just glorified DevOps:

1️⃣ They have the same culture
2️⃣ Tools are the same
3️⃣ Experiments existed long before MLOps
4️⃣ ML problems are mostly engineering problems

Thread 👇

1️⃣ MLOps and DevOps have the same culture

Both advocate for

🔸 End-to-end shared responsibility of the team
🔸 Automating everything
🔸 Autonomous teams
🔸 Continuous learning from failures
Jan 22, 2021 5 tweets 2 min read
How to learn Linear Algebra and stay sane?

Thread 👇

Start with Gilbert Strang's course. This is the best course about Linear Algebra

I wish my university teachers were like that

ocw.mit.edu/courses/mathem…
Jan 17, 2021 4 tweets 1 min read
OSI model for ML:

5️⃣ ML libraries (Scikit-Learn, XGBoost, TF)
4️⃣ Core libraries (NumPy)
3️⃣ Algorithms (linear models, trees)
2️⃣ Native code (Fortran, C, C++)
1️⃣ Math (linear algebra, probability, calculus)

When developing web apps, we start with the application layer of OSI without worrying about the underlying layers

But why do we start learning machine learning with mathematics?
Mar 14, 2020 52 tweets 9 min read
Preparing for a #MachineLearning or #DataScience interview?

One retweet — one technical question.

Categories: SQL, coding (Python) and algorithms

Let’s start!

#100DaysOfMLCode #100DaysOfPythonCode

= SQL =

Suppose we have the following schema:
* Ads(ad_id, campaign_id, status)
* Events(event_id, ad_id, source, event_type, date, hour)

status: active, inactive
event_type: impression (ad is shown), click (ad is clicked), conversion (app is installed)
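To try queries against this schema, here is a rough sketch using Python's built-in `sqlite3`. The column types, the sample rows and the query itself are assumptions for illustration and are not taken from the thread; the campaign column is spelled `campaign_id` here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types and constraints are assumptions; the thread only names the columns
conn.executescript("""
    CREATE TABLE Ads (
        ad_id INTEGER PRIMARY KEY,
        campaign_id INTEGER,
        status TEXT CHECK (status IN ('active', 'inactive'))
    );
    CREATE TABLE Events (
        event_id INTEGER PRIMARY KEY,
        ad_id INTEGER REFERENCES Ads(ad_id),
        source TEXT,
        event_type TEXT CHECK (event_type IN ('impression', 'click', 'conversion')),
        date TEXT,
        hour INTEGER
    );
""")

# Made-up rows, for illustration only
conn.executemany("INSERT INTO Ads VALUES (?, ?, ?)", [
    (1, 10, "active"),
    (2, 10, "inactive"),
])
conn.executemany("INSERT INTO Events VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, "web", "impression", "2020-03-01", 9),
    (2, 1, "web", "impression", "2020-03-01", 10),
    (3, 1, "web", "click", "2020-03-01", 10),
    (4, 2, "app", "impression", "2020-03-01", 11),
])

# Example query: number of events of each type for active ads
rows = conn.execute("""
    SELECT e.event_type, COUNT(*)
    FROM Events e
    JOIN Ads a ON a.ad_id = e.ad_id
    WHERE a.status = 'active'
    GROUP BY e.event_type
    ORDER BY e.event_type
""").fetchall()
```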
Feb 21, 2020 160 tweets 8 min read
Preparing for a #MachineLearning #DataScience interview?

One retweet - one theoretical interview question in the thread 👇

Feel free to give your answers

Let's start!

#100DaysOfCode #100DaysOfMLCode

Interview questions are typically based on what the company needs and/or projects you have worked with previously.

So if you didn’t work with time series - it’s unlikely you’ll get many questions about it. Same with computer vision, NLP or recommender systems.