What are Convolutional Neural Networks? 🏞️ ⏭️ ⛰️

CNNs are an important class of deep artificial neural networks that are particularly well suited for images.

If you want to learn the important concepts of CNNs and understand why they work so well, this thread is for you!

🧵👇
What is a CNN? 🤔

A CNN is a deep neural network that contains at least one convolutional layer. A typical CNN has a structure like this:
▪️ Image as input
▪️ Several convolutional layers
▪️ Several interleaved pooling layers
▪️ One or more fully connected layers

A good example: AlexNet

Throughout the thread I will be giving examples based on AlexNet - this is the net architecture that arguably started the whole deep learning revolution in computer vision!

I've written more about AlexNet here:
Convolutional Layer *️⃣ ⏸️

Each convolutional layer is defined by a set of filters that are applied to the input to transform it into another representation.

A typical filter may detect horizontal edges or a specific color in the image.

The filters are applied as convolutions.
Convolution Filter *️⃣

Here is how it works, using an image as an example (a small code sketch follows the steps):

1️⃣ Take a small matrix (for example 5x5) - the filter
2️⃣ Overlay it over the image, multiply the pixel values with the matrix values and add them up
3️⃣ Slide the matrix over the whole image
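
To make the three steps concrete, here is a minimal sketch of the sliding-window computation in plain NumPy (the image and filter values are toy stand-ins, not from the thread):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`; at each position, multiply the
    overlapping values element-wise and sum them up (steps 1-3)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))  # no padding, stride 1
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A classic 3x3 filter that responds to vertical edges (Sobel)
vertical_edges = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])

image = np.random.rand(28, 28)                   # toy grayscale image
print(convolve2d(image, vertical_edges).shape)   # (26, 26)
```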
Real examples 🔬

Let's see what kind of filters AlexNet uses (it actually learns them, but we'll come to that in a moment). The example shows 11x11 filters that detect different features.

1️⃣ - horizontal edges
2️⃣ - vertical edges
3️⃣ - green patches
4️⃣ - blue to yellow edges
The first Convolutional Layer *️⃣ ⏸️

The image is processed with many filters in the first layer (96 for AlexNet).

Every filter effectively creates a new image (called a feature map) telling you where the filter "found" matching patterns.

Here is an example with edge detection filters.
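
In a framework like PyTorch, the whole first layer is a single call. Here is a sketch with AlexNet-like numbers - 96 filters of 11x11 with stride 4; the 227x227 input size is an assumption chosen so the output comes out to 55x55:

```python
import torch
import torch.nn as nn

# 96 filters of size 11x11 over a 3-channel (RGB) image, stride 4
conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4)

image = torch.randn(1, 3, 227, 227)   # one toy RGB image
feature_maps = conv1(image)
print(feature_maps.shape)   # torch.Size([1, 96, 55, 55]) - 96 feature maps
```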
The next Convolutional Layer *️⃣ ⏸️

The next layer will have as input not the image, but the feature maps from the previous layer. Now it can combine simple features to create more complex ones.

For example, find places in the image with both horizontal and vertical edges.
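
Sticking with the PyTorch sketch (the real AlexNet has a pooling layer between the two convolutions, omitted here for brevity):

```python
import torch
import torch.nn as nn

# Each 5x5 filter in the second layer spans ALL 96 input feature maps,
# so it can combine simple features, e.g. "horizontal AND vertical edge here".
conv2 = nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, padding=2)

feature_maps = torch.randn(1, 96, 55, 55)   # stand-in for the layer-1 output
print(conv2(feature_maps).shape)            # torch.Size([1, 256, 55, 55])
```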
Hierarchical architecture 🌲

Layer after layer, the network combines simple features into more and more complex ones, giving it more expressive power. This is one of the big advantages of deep CNNs!

However, you may be wondering now - where do the filters come from???
Feature Learning 👨‍🏫

This is the coolest thing - the optimal filters are learned automatically by the network!

In traditional ML methods, features are usually engineered by hand. One of the greatest advantages of CNNs is that they learn the features directly from the data!

How...?
Parametrization 🔑

In classical NNs we connect each neuron of a layer to all of the neurons in the input and assign the connection some weight.

In CNNs the weights are the values in the filter matrix and they are shared over the whole image! This is way more efficient.
Parametrization 🔑

Imagine an image of 100x100 pixels. Every neuron in the first layer will have 10,000 weights. 10,000 per neuron ❗

For a CNN with 100 3x3 filters we need 900 weights + 100 for the bias.

Indeed, 96% of the weights of AlexNet are in the dense layers!
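
Spelling out the arithmetic in a few lines of Python:

```python
# Fully connected: every neuron sees all 100x100 pixels
pixels = 100 * 100
weights_per_fc_neuron = pixels             # 10,000 weights for EACH neuron

# Convolutional: 100 filters of 3x3, shared across the whole image
conv_params = 100 * 3 * 3 + 100            # 900 weights + 100 biases
print(weights_per_fc_neuron, conv_params)  # 10000 1000
```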
Training πŸ‹οΈ

As with regular neural nets, during training we use backpropagation to find the best weights for our problem. This is how the values in the filter matrices are optimized - the network learns the best filters on its own.
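
A minimal sketch of one training step (assuming PyTorch; the model and data are toy stand-ins). The filter values are ordinary learnable parameters, so backpropagation updates them like any other weight:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 learnable 3x3 filters
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # toy 10-class classifier
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)               # toy batch of images
labels = torch.randint(0, 10, (8,))              # toy labels

loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()    # gradients flow into the filter values too
optimizer.step()   # the filters move toward better features
```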
Local connectivity ⭕

CNNs are very suitable for images, because they exploit the local connectivity between pixels - pixels next to each other are usually correlated.

The way the filters are applied over the image also makes the network invariant to translation, which is important.
Global context 🏞️

But wait, if we only look at 5x5 patches, how do we see the "big picture"? How do we use info from different parts of the image?

We scale the image down! When we make the image smaller, a 3x3 filter will correspond to a bigger patch of the original image.
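
A quick back-of-the-envelope sketch (assuming each downscaling halves the resolution):

```python
# A 3x3 filter after k halvings covers roughly a (3 * 2^k) x (3 * 2^k)
# patch of the original image - its "view" grows with depth.
for k in range(4):
    size = 3 * 2 ** k
    print(f"after {k} downscalings: ~{size}x{size} pixels of the original")
```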
Pooling 💠

While there are ways to reduce resolution with convolutions, let's focus on pooling for now.

The idea is simple:
1️⃣ Take an NxN patch of the feature map
2️⃣ Replace it with a single value - the max or average value in the patch

This reduces the resolution by a factor of N.
Pooling is useful for selecting the strongest features and it also gives the network some additional translation invariance. In practice, 2x2 pooling is usually used.
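
Here is what 2x2 max pooling looks like in PyTorch (toy feature map sizes):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # each 2x2 patch -> its maximum

feature_maps = torch.randn(1, 96, 54, 54)  # toy stack of feature maps
print(pool(feature_maps).shape)            # torch.Size([1, 96, 27, 27])
```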
Typical architecture 🏛️

In a typical CNN, convolutional and pooling layers alternate. The resolution of the feature maps is reduced through the network, while the number of feature maps usually increases, allowing the CNN to learn more and more high-level features.
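
Putting the pattern together in a toy PyTorch sketch (the layer sizes are illustrative, not from any real network):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 64x64 -> 32x32, 32 feature maps
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16, 64 feature maps
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 10),          # fully connected classifier head
)
print(model(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 10])
```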
Summary 🏁

▪️ CNNs use convolutions to compute feature maps describing the image
▪️ The filters share their weights over the image, reducing the number of parameters
▪️ CNNs exploit the local relationships between pixels
▪️ CNNs can learn high-level image features and concepts
Further reading 📖

Make sure that you also read this great thread on the topic by @svpino!

Sources 📃

I used images from the following sources:
▪️ Convolution example from this awesome repo: github.com/vdumoulin/conv…
▪️ AlexNet examples from the original paper: papers.nips.cc/paper/4824-ima…
▪️ These slides on AlexNet for the parameters table: cs.toronto.edu/~rgrosse/cours…
