Prashant (@capeandcode) · 20 Apr · 12 tweets · 3 min read
If you don't understand the principle of backpropagation or the maths behind it, then this 🧵 could be helpful for you.

We are going to use a simple analogy to understand it better.

(Check the final tweets for notes)

↓ 1/11
Imagine you (Harry) are trying to solve a jigsaw puzzle along with two of your friends, Tom and Dick.
And sadly, none of you is among the brightest.

But you start trying to put the puzzle together.

2/11
Tom puts in the first 6 of the 20 pieces, getting 2 of them wrong, then passes the puzzle to Dick.
Dick puts in the next 8 pieces, 6 of them wrong, then passes the puzzle to you.
And now you put in the final 6 pieces, 4 of them wrong.

The picture is complete.

3/11
But you look at the instructions, and they say that 12 of the pieces are wrong.

Now you spot 1 of the 4 wrong pieces you placed and correct it, then pass the puzzle back to Dick.
Dick corrects 2 of his 6 and lets Tom know,
then Tom corrects 1 of his 2 and asks you to check again.

4/11
Now the instructions say that 8 pieces are wrong: 12 minus the 4 you just corrected.
So you repeat the same exercise until you get it all right, or close enough to right.

For the process to work, the correction has to be communicated backward: from you to Dick to Tom.

But how does that explain backpropagation?

5/11
🔹The correctly completed puzzle represents the objective function
🔹Tom, Dick, and Harry are like 3 layers of a neural network: Tom is the input layer, Dick the hidden layer, and Harry the output layer
🔹The number of wrong pieces is the error we're trying to minimize
🔹Each friend realizing his own mistakes is like the gradient calculated for that layer

6/11
Note that the final error depends not only on Harry but also on Dick and Tom.

So first Harry corrects himself, then Dick and Tom realize their contributions to the mistake and try to correct them.

7/11
Similarly, once the error has been calculated, the gradient can be used directly to correct the mistakes of the output layer.

8/11
But the error is not a direct function of the weights in the intermediate layers, so it has to be propagated backward through the layers that lie between the output and the layer we want to correct.

9/11
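To see why, here is a sketch in my own notation (not the thread's): write the input as x, the hidden activations as h = f(W1·x), and the output as ŷ = g(W2·h). The error E depends on the first-layer weights W1 only through h and ŷ, so its derivative with respect to W1 has to be assembled from pieces computed at the layers in between:

$$\frac{\partial E}{\partial W_1} \;=\; \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$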
This justifies the use of the chain rule from calculus to calculate the partial derivatives.

They pass each layer's error gradient back to the previous layer's weights.

10/11
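To tie tweets 8–10 together, here is a minimal runnable sketch of the whole idea (my own toy example, not from the thread): a network with one hidden layer learning XOR, where the backward pass applies the chain rule exactly as described above. NumPy, the sigmoid activation, the squared-error loss, the layer sizes, the seed, and the learning rate are all my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn XOR. Loosely, Tom -> Dick -> Harry:
# input layer -> one hidden layer -> output layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4))  # input -> hidden weights
W2 = rng.normal(size=(4, 1))  # hidden -> output weights
lr = 0.5                      # learning rate (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: the puzzle gets assembled layer by layer.
    h = sigmoid(X @ W1)    # hidden activations
    out = sigmoid(h @ W2)  # network output

    # Error at the output -- the "wrong pieces" the manual reports.
    err = out - y          # gradient of 0.5 * sum((out - y)**2) w.r.t. out

    # Backward pass: chain rule, communicated output -> hidden.
    d_out = err * out * (1 - out)        # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)   # gradient passed back to the hidden layer

    # Each layer corrects its own weights using its share of the blame.
    W2 -= lr * (h.T @ d_out)
    W1 -= lr * (X.T @ d_h)

print(out.round(2))  # should end up close to [[0], [1], [1], [0]]
```

Note how d_h reuses d_out: each layer's correction is computed from the correction already communicated by the layer after it, just like Dick and Tom hearing from Harry.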
Here's the link to handwritten notes on backpropagation, to understand the mathematics behind it better 🔣

drive.google.com/file/d/1OwrxOe…

11/11
This example might be an oversimplification, but it gives you the gist of it.

If the notation doesn't bother you, you'll understand the maths too.

Hope it helps! 👍

