Jean de Nyandwi

17 Nov, 27 tweets, 7 min read

Activation functions are one of the most important components of any typical neural network.

What exactly are activation functions, and why do we need to inject them into the neural network?

A thread 🧵🧵

Activation functions are mathematical functions that introduce non-linearities into the network.

Without an activation function, the neural network would behave like a linear classifier/regressor.

Simply put, it would only be able to solve linear problems, that is, problems where the output changes in direct proportion to the input.
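Here is a minimal sketch of why that happens. It assumes a tiny two-layer network with made-up weights and no biases (the weights and helper names are purely illustrative). Stacking two layers with no activation gives the same output as one combined linear layer, while putting a ReLU in between breaks that equivalence.

```python
# Two "layers" with no activation function: y = W2 * (W1 * x).
# The weights below are arbitrary illustrative values; biases are left out for brevity.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, z) for z in v]

W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # first layer: 2 inputs -> 3 units
W2 = [[1.0, -1.0, 2.0]]                     # second layer: 3 units -> 1 output
x = [2.0, 3.0]

# Without an activation, the two layers collapse into a single linear layer.
two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)

# With ReLU between the layers, the network is no longer one linear map.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

No matter how many linear layers you stack, the result is still a single linear function of the input.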

Let me explain what I mean by that...

Let's say that you want to separate two categories in a given dataset. To keep it simple, say you want to separate red points from green points. Just an example!

Your dataset looks like this 👇

To separate those two groups of points, all you have to do is draw a straight decision line. You do not need to inject any non-linearities.

We can also use TensorFlow Playground to simulate a similar problem.

Take a look at the configuration in the image or try running it yourself: input data similar to what we have above (with a little bit of noise) and no non-linear activation function.

playground.tensorflow.org/#activation=li…

As you can see in the output, the network tried to separate the orange and blue points. It is not 100% accurate, but it is not bad either. Keep in mind that there is some noise in the data.

But we are very limited in the problems we can solve without non-linearities. Why?

Well, real-world problems and data are rarely linear.

Take cat and dog classification as an example. If you take the image of a cat and change some pixel values, it will still be a cat.

A change in the input pixels doesn't necessarily result in a change in the output. The relationship between pixel values and what they represent is not linear.

Usually, two things are linearly related if a change in one is directly proportional to the change in the other. Otherwise, the relationship is not linear.

For most real-world problems, the mapping between the data and the output is not directly proportional. Cat & dog classification was one example.
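To make "directly proportional" concrete, here is a small sketch. The two functions are arbitrary examples chosen for illustration. For the linear one, the same step in the input always produces the same change in the output. For the non-linear one, the change depends on where you start.

```python
# Arbitrary illustrative functions: one linear, one non-linear.

def linear(x):
    return 3 * x + 1

def quadratic(x):
    return x * x

def change(f, start, step=1):
    """How much the output of f changes when the input moves from start to start + step."""
    return f(start + step) - f(start)

# Same input step, measured at three different starting points.
linear_changes = [change(linear, s) for s in (0, 5, 10)]
quadratic_changes = [change(quadratic, s) for s in (0, 5, 10)]

print(linear_changes)     # constant change: the relationship is linear
print(quadratic_changes)  # change grows with the starting point: not linear
```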

In order to solve those complex non-linear problems, we need to use non-linearities. We need to give the network the ability to fit higher-order functions (mathematically speaking).

Take a look below. As you can see, a straight line won't work here. We need to add something else.

Below, we illustrate that drawing a straight boundary line won't work.

We can again use TF Playground to simulate the above.

First, let's try a linear activation function.

As you can see, it doesn't work. Whether you regularize or train for thousands of epochs, it's not possible to solve a nonlinear problem with linear methods.

But just by changing the activation function from Linear to ReLU or Sigmoid, the results are quite different.
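If you prefer code over the Playground, here is a rough sketch of the same idea. It uses the classic XOR points as a stand-in for the Playground data (that substitution is my assumption, not the dataset in the image). It searches a coarse grid of straight-line classifiers and then compares them with a tiny hand-wired ReLU network. The grid range and the network weights are picked by hand for illustration, not learned.

```python
from itertools import product

# XOR: four points that no single straight line can separate.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def linear_predict(w1, w2, b, x):
    """A linear classifier: which side of the line w1*x1 + w2*x2 + b = 0 is x on?"""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

# Try many candidate lines on a coarse grid (illustrative search range).
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
best_linear = max(
    sum(linear_predict(w1, w2, b, p) == y for p, y in zip(points, labels))
    for w1, w2, b in product(grid, repeat=3)
)

def relu(z):
    return max(0.0, z)

def relu_net(x):
    """Two hidden ReLU units with hand-picked weights, then a threshold."""
    h1 = relu(x[0] + x[1])
    h2 = relu(x[0] + x[1] - 1)
    return 1 if h1 - 2 * h2 > 0.5 else 0

relu_correct = sum(relu_net(p) == y for p, y in zip(points, labels))

print("best linear classifier:", best_linear, "of 4 correct")
print("tiny ReLU network:", relu_correct, "of 4 correct")
```

The best straight line gets 3 of the 4 points right, while the small ReLU network gets all 4.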

ReLU and Sigmoid, which we used above, are among the most commonly used activation functions.

There are many other activation functions, such as Tanh, Leaky ReLU, SELU, ELU, Maxout...

Here are their graphical representations.

Image: Internet, no proper credit, used in multiple places
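For reference, here is a plain-Python sketch of a few of these functions. The default `alpha` values are common choices that I am assuming here, not something fixed by the functions themselves. SELU and Maxout are left out to keep it short.

```python
import math

def relu(z):
    """max(0, z): passes positive values through, zeroes out negatives."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negatives keep a small slope alpha (assumed default)."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smooth for negatives: alpha * (exp(z) - 1). alpha is an assumed default."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), elu(z), sigmoid(z), tanh(z))
```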

There are some well-known rules of thumb for choosing activation functions.

Here are some:

◆First and foremost, always use ReLU in the hidden layers. It's fast and it works great. Try its variants like Leaky ReLU when you want an extra boost in accuracy or other metrics.

◆Avoid using sigmoid and tanh in the first layers of the network or, more generally, in any hidden layer. They can cause the gradients to vanish quickly, and that's a quick ride to poor results.

I talked about gradient problems in my past tweets.

https://twitter.com/Jeande_d/status/1455486459876569091?s=20

Sigmoid and Tanh are also more computationally expensive than ReLU because their formulas involve exponentials.
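Here is a rough sketch of why sigmoid gradients vanish. It is a simplification: I am assuming a chain of 10 layers and looking only at the activation derivatives, ignoring the weights. The derivative of sigmoid is at most 0.25, so multiplying it across layers shrinks the gradient fast. The ReLU derivative is 1 for positive inputs, so the product stays at 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s). Its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 10  # assumed number of layers, for illustration

# Best case for sigmoid: every layer sits at z = 0, where the derivative peaks.
sigmoid_chain = sigmoid_grad(0.0) ** depth

# ReLU with positive pre-activations in every layer.
relu_chain = relu_grad(1.0) ** depth

print("sigmoid, 10 layers:", sigmoid_chain)
print("ReLU, 10 layers:", relu_chain)
```

Even in the best case, the sigmoid chain leaves less than one millionth of the gradient after 10 layers.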

◆As a rule for what activation function should be in the last layer, use sigmoid for binary classification and multi-label classification problems, and use softmax for multi-class classification problems.

If you are doing regression, you can use ReLU (when the targets are non-negative) or leave the output layer without an activation!
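A small sketch of the difference between the two output activations, using made-up scores (logits) for three classes. Sigmoid treats each output independently, so several labels can be "on" at once. Softmax turns the scores into one probability distribution that sums to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]  # illustrative raw scores from the last layer

# Multi-label: each output is its own yes/no probability.
multi_label = [sigmoid(z) for z in logits]

# Multi-class: outputs compete and sum to 1.
multi_class = softmax(logits)

print("sigmoid:", multi_label, "sum =", sum(multi_label))
print("softmax:", multi_class, "sum =", sum(multi_class))
```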

◆As the last rule, always try ReLU first.

This is the end of the thread. It was long, but let's try summarizing it.

Activation functions are used to introduce non-linearities into the neural network. A network that doesn't have an activation function is like a linear classifier/regressor.

Real-world problems are rarely linear.

In order to solve them, we have to inject non-linearities into the network to give it the ability to solve those non-linear problems.

It's actually like allowing the network to bend itself to fit the problem.

Plugging non-linearities into neural networks makes them powerful enough to handle complex problems.

The choice of activation function depends on the problem, but as a general rule, you should always use ReLU in the hidden layers.

It works great, it's cheap, and it's awesome :)

If you would like to learn more, I recommend you go to TensorFlow Playground and play with different types of data and activation functions.

You will see how changing them affects the behavior of the neural network.

playground.tensorflow.org

@Jeande_d

Thanks for reading.

I regularly write about machine learning and deep learning ideas.

Machine learning theory can be complex, even when it adds little to what someone can actually build. My goal is to simplify it.

Follow @Jeande_d for more machine learning ideas!
