Santiago
6 Apr, 28 tweets, 5 min read
There are a lot of moving pieces in a machine learning system.

This is a thread covering the backbone of the process, from data engineering all the way to a retraining pipeline.

Let's start. ↓
Everything starts with a problem you want to solve.

For example, you want to predict your company's sales in the next 12 months, and you have the last two years' worth of sales in a database.

When use case and data align, you are good to go!

The first step is to prepare the data to train a machine learning model that predicts future sales.

You have the data already, but you may need to transform it into a format that's easier for the model.

This process is called "Data Engineering."

For example, you might have a column representing the date of a sale, but all you care about is the month when it happened.

You can transform this "date" column into "month."

Another example is the category of the product that you sold.

You have it stored with a unique long identifier, but your model will benefit from using a consecutive value.

You'll transform this "id" column into a consecutive integer value representing each product.
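
Here is a minimal sketch of these two transformations in plain Python. The sample rows, the column names ("date", "product_id", "amount"), and the helper `engineer_features` are made up for illustration; a real project would more likely use a dataframe library.

```python
from datetime import date

# Hypothetical raw rows; the column names and values are assumptions.
raw_sales = [
    {"date": date(2020, 1, 15), "product_id": "a9f3-77c1-0001", "amount": 120.0},
    {"date": date(2020, 1, 28), "product_id": "b2e8-41d0-0002", "amount": 80.0},
    {"date": date(2020, 2, 3), "product_id": "a9f3-77c1-0001", "amount": 95.5},
]


def engineer_features(rows):
    """Turn raw inputs into features: date -> month, long id -> consecutive integer."""
    product_index = {}  # maps each long identifier to 0, 1, 2, ... in order of first appearance
    features = []
    for row in rows:
        # setdefault assigns the next free integer the first time an id is seen.
        product = product_index.setdefault(row["product_id"], len(product_index))
        features.append({"month": row["date"].month, "product": product, "amount": row["amount"]})
    return features, product_index


features, product_index = engineer_features(raw_sales)
for instance in features:
    print(instance)
```

Keeping `product_index` around matters: the same mapping has to be applied later to any new data the model sees.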

During data engineering, every column of the original data is called an "input," and after we transform it, it's called a "feature."

For example, "month" is a feature of your dataset, as well as the "id" of the product.

A dataset is a collection of "instances."

After finishing with the data engineering portion, we go into "Data Validation."

Here we ensure that our dataset makes sense, is properly balanced, is not biased, and can be used to build a model.

This process is about analyzing and understanding the data.
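
As a rough sketch, a couple of these checks could look like the code below. The specific checks (missing values, months outside 1 to 12, instances per product) and the sample instances are assumptions chosen for illustration, not a complete validation suite.

```python
from collections import Counter

# Hypothetical engineered instances; None marks a missing value.
instances = [
    {"month": 1, "product": 0, "amount": 120.0},
    {"month": 1, "product": 1, "amount": 80.0},
    {"month": 2, "product": 0, "amount": None},
    {"month": 13, "product": 1, "amount": 60.0},
]


def validate(rows):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    for i, row in enumerate(rows):
        if any(value is None for value in row.values()):
            problems.append(f"instance {i}: missing value")
        if not 1 <= row["month"] <= 12:
            problems.append(f"instance {i}: month {row['month']} out of range")
    return problems


def instances_per_value(rows, feature):
    """Count instances per feature value to spot an unbalanced dataset."""
    return Counter(row[feature] for row in rows)


print(validate(instances))
print(instances_per_value(instances, "product"))
```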

As soon as the dataset is ready, we can split it into 3 different sets:

• Train set
• Validation set
• Test set

I'll follow with a quick summary about these 3 sets, but here is a thread with more information:



The train set will have the most instances and will be used to teach the model.

The validation set usually contains a smaller portion of the instances, and it's used to validate the model as it's training. We use the results on this set to further improve the model.

The test set is never used during training. Instead, we reserve it until the very end of the process, once we have finished updating the model.

The results on this set give us an idea of the final performance of the model.
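
One possible way to produce the 3 sets, sketched in plain Python. The 70/15/15 proportions, the fixed seed, and the name `split_dataset` are assumptions for illustration; the thread doesn't prescribe specific sizes.

```python
import random


def split_dataset(instances, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle a copy of the instances and cut it into train, validation, and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out until the end
    return train, validation, test


dataset = list(range(100))  # stand-in for 100 instances
train, validation, test = split_dataset(dataset)
print(len(train), len(validation), len(test))
```

For time-ordered data such as sales, a chronological split (oldest instances for training, newest for testing) is a common alternative to shuffling.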

Training and validating the model are processes that happen simultaneously and may lead to more data engineering.

For example, you may discover that it'd be a good idea to include a "year" feature from the original data.

At this point, you go back and update and validate your dataset, split it again, and restart the training process.

You may also decide to modify the architecture of your model or its hyperparameters.

The architecture of the model refers to its structure. For example, the number of layers in a neural network or their size.

Hyperparameters refer to the configuration settings of the model. For example, the learning rate, which controls how quickly or slowly the model learns.
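
The setting that controls how quickly a model learns is commonly called the learning rate. The toy sketch below shows its effect using gradient descent on a made-up loss, (w - 3)^2; the loss, the starting point, and the step count are all assumptions for illustration.

```python
def train(learning_rate, steps=20):
    """Gradient descent on the toy loss (w - 3)^2, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        gradient = 2 * (w - 3)  # derivative of (w - 3)^2 with respect to w
        w -= learning_rate * gradient
    return w


# The best value is w = 3. Smaller learning rates get there more slowly.
for learning_rate in (0.01, 0.1, 0.5):
    print(learning_rate, train(learning_rate))
```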

As soon as you are happy with the model's performance, you evaluate it with the test set to ensure everything is good to go and deploy it.

Deploying a model refers to making it available to its users. This is also referred to as "serving" the model.

There are multiple ways to make a model available, but the most common one is putting it behind a REST API.

For example, you could build a skinny layer that accepts a JSON input, uses the model, and returns its result as a JSON output.

Let's break this down.

The "Inference" or "Prediction" process takes input from the user, runs it through the model, and returns the output.

Usually, there are several steps involved in this process.

The 5 steps to run a prediction:

• Receive the input
• Transform it into an instance
• Run it through the model
• Transform the model's output
• Return the final output

A REST API will help us receive the input from the user and return the final output. I illustrated this above by using JSON to transfer the data.

The model is expecting an instance in a specific format. We usually need to transform the input JSON into that format before feeding it to the model.

We also need to run the reverse process and transform the model's output into the JSON that we will send back.
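
The 5 steps can be sketched with nothing but the standard library. Everything here is hypothetical: `model_predict` is a stand-in for a trained model, and the JSON field names ("month", "product", "predicted_sales") are assumptions. A real service would wrap `handle_request` in a web framework.

```python
import json


def model_predict(instance):
    """Stand-in for a trained model: takes [month, product] and returns a number."""
    month, product = instance
    return 1000.0 + 50.0 * month + 10.0 * product


def handle_request(request_body):
    payload = json.loads(request_body)                  # 1. receive the input
    instance = [payload["month"], payload["product"]]   # 2. transform it into an instance
    raw_output = model_predict(instance)                # 3. run it through the model
    result = {"predicted_sales": round(raw_output, 2)}  # 4. transform the model's output
    return json.dumps(result)                           # 5. return the final output


print(handle_request('{"month": 3, "product": 1}'))
```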

Keep in mind that very often, models aren't used in isolation.

For example, we may need to query a database, call an external service, and put everything together with the model's output before returning an answer.

The API might handle all of that.

Although we are close to the end, two more processes are part of the backbone of a machine learning system:

• Monitoring
• Retraining pipeline

Data may change over time. For example, on average, your customers may get older.

This is referred to as "data drift."

The context of your predictions may also change over time. For example, an update to a marketing campaign may slowly increase a specific product's share of sales.

This is referred to as "concept drift."

Both data and concept drift will degrade the performance of your model.

Remember that your model is a static representation of the relationship between the input data and the predictions it produces.

Monitoring will help you detect drift.

Monitoring is about comparing the model's predictions with actual values over time.

There are multiple ways you can accomplish this. One of them is by routing some of the inputs to humans so they determine what the actual values should be.
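
A minimal sketch of that comparison: compute the error for each period and flag the periods where it grows well past a baseline. The metric (mean absolute error), the 1.5x tolerance, and the sample numbers are assumptions for illustration.

```python
def mean_absolute_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(predictions)


def detect_degradation(periods, baseline_error, tolerance=1.5):
    """Return the indexes of the periods whose error exceeds tolerance * baseline_error."""
    flagged = []
    for index, (predictions, actuals) in enumerate(periods):
        if mean_absolute_error(predictions, actuals) > tolerance * baseline_error:
            flagged.append(index)
    return flagged


# Hypothetical (predictions, actual values) pairs collected over three periods.
history = [
    ([100, 110, 90], [102, 108, 93]),
    ([100, 110, 90], [103, 107, 94]),
    ([100, 110, 90], [120, 95, 110]),
]
print(detect_degradation(history, baseline_error=3.0))
```

Here only the last period is flagged: its predictions are off by far more than the baseline allows.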

Finally, we get to the process I referred to before as the "Retraining pipeline."

Two main components here:

• Collecting additional data
• Producing a new version of the model

This is the only way you can keep your model fresh: updating it frequently.

Depending on how sophisticated your machine learning process is, you'll run this process manually or automatically.

Ideally, everything works without human involvement. In practice, there's a lot of work to make this happen.
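
The two components of the retraining pipeline can be sketched like this. The "model" is a deliberately trivial one that memorizes the average of its data; `ModelVersion`, `train_model`, and `retrain` are hypothetical names, and a real pipeline would also evaluate the new version before deploying it.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    average_sales: float  # the only "learned" parameter of this toy model


def train_model(sales, version):
    """Toy training step: the model just memorizes the average of the data it saw."""
    return ModelVersion(version=version, average_sales=sum(sales) / len(sales))


def retrain(current_model, training_data, new_data):
    """Collect the additional data, then produce the next version of the model."""
    updated_data = training_data + new_data
    return train_model(updated_data, current_model.version + 1), updated_data


data = [100.0, 120.0, 110.0]
model = train_model(data, version=1)
model, data = retrain(model, data, new_data=[150.0, 170.0])
print(model)
```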

This was a long thread. If you stumble upon this tweet and want to read from the beginning, it all starts here.

If you enjoy this content, follow me @svpino for threads like this focused on machine learning. I post them multiple times per week.
