Santiago (@svpino)
Oct 4 · 20 tweets · 7 min read
Here is a simple machine learning model. One of the classics.

If you are new, let's go together line by line and understand what's happening here:

1 of 20
First, we load the MNIST dataset, containing 70,000 28x28 images showing handwritten digits.

You can load this dataset using Keras with a single line of code.

The function returns the dataset split into train and test sets.
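With TensorFlow's bundled Keras, that single line looks like this (a sketch; it downloads the dataset on first use):

```python
from tensorflow import keras

# Load MNIST, already split into train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

print(x_train.shape)  # (60000, 28, 28)
print(x_test.shape)   # (10000, 28, 28)
```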

2 of 20
x_train and x_test contain our train and test images.

y_train and y_test contain the target values: a number between 0 and 9 indicating the digit shown in the corresponding image.

We have 60,000 images to train the model and 10,000 to test it.

3 of 20
When dealing with images, we need a tensor with 4 dimensions: batch size, width, height, and color channels.

x_train is (60000, 28, 28). We need to reshape it to add the missing dimension ("1" because these images are grayscale.)
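A sketch of that reshape, shown on a zero-filled stand-in array so it runs without downloading the dataset:

```python
import numpy as np

# Stand-in for x_train: 60,000 grayscale images of 28x28 pixels
x_train = np.zeros((60000, 28, 28))

# Add the missing channels dimension: 1, because the images are grayscale
x_train = x_train.reshape((60000, 28, 28, 1))

print(x_train.shape)  # (60000, 28, 28, 1)
```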

4 of 20
Each pixel goes from 0 to 255. Neural networks work much better with smaller values.

Here we normalize pixels by dividing them by 255. That way, each pixel will go from 0 to 1.
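The normalization step, again on stand-in data:

```python
import numpy as np

# Stand-in image data: integer pixels from 0 to 255
x_train = np.random.randint(0, 256, size=(100, 28, 28, 1))

# Scale every pixel to the [0, 1] range
x_train = x_train.astype("float32") / 255.0

print(x_train.min() >= 0.0 and x_train.max() <= 1.0)  # True
```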

5 of 20
Target values go from 0 to 9 (the value of each digit.)

This line one-hot encodes these values.

For example, this will transform a value like 5 into an array of zeros with a single 1 at index 5:

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
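In Keras, that one-hot encoding is a single call:

```python
from tensorflow import keras

# One-hot encode the targets: digits 0-9 become vectors of length 10
y = keras.utils.to_categorical([5], num_classes=10)

print(y[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```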

6 of 20
Let's now define our model.

There are several ways to create a model in Keras. This one is called the "Sequential API."

Our model will be a sequence of layers that we will define one by one.

7 of 20
A lot is going on with this first line.

First, we define our model's input shape: a 28x28x1 tensor (width, height, channels.)

This is exactly the shape we have in our train dataset.

8 of 20
Then we define our first layer: a Conv2D layer with 32 filters and a 3x3 kernel.

This layer will generate 32 different feature maps from each input image.

9 of 20
We also need to define the activation function used for this layer: ReLU.

You'll see ReLU everywhere. It's a popular activation function.

It will allow us to solve non-linear problems, like recognizing handwritten digits.
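On its own, ReLU is just max(0, x) — negative values become zero, positive values pass through:

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and zeroes out negative ones
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])).tolist())  # [0.0, 0.0, 3.0]
```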

10 of 20
After our Conv2D layer, we have a max pooling operation.

The goal of this layer is to downsample the amount of information collected by the convolutional layer.

We want to throw away unimportant details and retain what truly matters.
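Here is what max pooling does to a tiny feature map, sketched in plain NumPy (a 2x2 window with stride 2 — a common default):

```python
import numpy as np

# A 4x4 feature map
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 0, 3, 2],
    [1, 8, 4, 6],
])

# Split into 2x2 blocks and keep only the max of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(pooled)
# [[4 5]
#  [8 6]]
```

Each 2x2 block collapses to its largest value, halving the width and height while keeping the strongest activations.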

11 of 20
We are now going to flatten the output. We want everything in a single, flat list of values.

That's what the Flatten layer does. It will give us a flat tensor.

12 of 20
Finally, we have a couple of Dense layers.

Notice how the output layer has a size of 10, one for each of our possible digit values, and a softmax activation.

The softmax ensures we get a probability distribution indicating the most likely digit in the image.
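Putting the layers together, the model described above could look like this. This is a sketch: the hidden Dense size (64) is an assumption — the exact value is in the thread's screenshots — but the Conv2D, pooling, Flatten, and 10-unit softmax output match the steps described:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # width, height, channels
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 filters, 3x3 kernel
    layers.MaxPooling2D(pool_size=(2, 2)),          # downsample the feature maps
    layers.Flatten(),                               # flatten to a 1D vector
    layers.Dense(64, activation="relu"),            # hidden layer (size assumed)
    layers.Dense(10, activation="softmax"),         # one output per digit
])

model.summary()
```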

13 of 20
After creating our model, we compile it.

I'm using Stochastic Gradient Descent (SGD) as the optimizer.

The loss is categorical cross-entropy: this is a multi-class classification problem.

We want to record the accuracy as the model trains.
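The compile step might look like this (shown on a minimal stand-in model so the snippet runs on its own):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal stand-in model, just so compile() has something to work with
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="sgd",                   # Stochastic Gradient Descent
    loss="categorical_crossentropy",   # multi-class classification loss
    metrics=["accuracy"],              # track accuracy during training
)
```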

14 of 20
Finally, we fit the model. This starts training it.

A couple of notes:

• I'm using a batch size of 32 images.
• I'm running 10 total epochs.

When fit() is done, we have a fully trained model!
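The fit() call could look like this. To keep the snippet self-contained and fast, it trains a tiny stand-in model on small random data — the thread trains the real model on the full x_train/y_train:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Small random stand-in for the preprocessed MNIST data
x_train = np.random.rand(256, 28, 28, 1).astype("float32")
y_train = keras.utils.to_categorical(np.random.randint(0, 10, 256), 10)

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])

# Batch size of 32 images, 10 total epochs
history = model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=0)

print(len(history.history["loss"]))  # 10
```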

15 of 20
Let's now test the model.

This gets a random image from the test set and displays it.

Notice that we want the image to come from the test set, containing data the model didn't see during training.

16 of 20
We can't forget to reshape and normalize the image, just as we did with the entire train set.

This time, I'm doing it for the single image I use to test the model.
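A sketch of picking and preprocessing one test image, on stand-in data (the matplotlib display is omitted):

```python
import numpy as np

# Stand-in for the raw x_test: 28x28 grayscale images with 0-255 pixels
x_test = np.random.randint(0, 256, size=(10000, 28, 28))

# Pick a random image from the test set
index = np.random.randint(0, len(x_test))
image = x_test[index]

# Same preprocessing as the train set: add batch and channel dims, normalize
image = image.reshape((1, 28, 28, 1)).astype("float32") / 255.0

print(image.shape)  # (1, 28, 28, 1)
```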

17 of 20
Finally, I predict the value of the image.

Remember that the result is a vector of 10 probabilities, one per digit. That's why I take the argmax (the position with the highest probability): that position is the predicted digit.
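The prediction step, sketched with an untrained stand-in model just to show the shapes involved:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Untrained stand-in model — enough to demonstrate predict() + argmax
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

image = np.random.rand(1, 28, 28, 1).astype("float32")

# predict() returns a batch of probability vectors; argmax picks the digit
probabilities = model.predict(image, verbose=0)
digit = np.argmax(probabilities[0])

print(0 <= digit <= 9)  # True
```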

18 of 20
Here is the source code:

Have at it, go nuts, and build something cool.

gist.github.com/svpino/3cb8367…

19 of 20
Every week, I break down machine learning concepts to give you ideas on applying them in real-life situations.

Follow me @svpino to ensure you don't miss what's coming next.

20 of 20


