This is the formula for Gradient Descent with Momentum as presented on Wikipedia.
It may look intimidating at first, but I promise you that by the end of this thread it will be easy to understand!
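For reference, here is the update rule in a standard form (the original tweet shows it as an image; w are the weights, η the learning rate, α the momentum decay factor, and ∇L the gradient of the loss):

Δw(t) = α · Δw(t−1) − η · ∇L(w(t))
w(t+1) = w(t) + Δw(t)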
Thread 👇
The Basis ♻️
Let's break it down! The basis is this simple formula describing an iterative optimization method.
We have some weights (parameters) and we iteratively update them in some way to reach a goal.
Iterative methods are used when we cannot compute the solution directly.
Gradient Descent Update 👇
We define a loss function describing how good our model is. We want to find the weights that minimize the loss (make the model better).
We compute the gradient of the loss and update the weights by a small step (controlled by the learning rate) against the gradient.
Here is an illustration of how it works.
The gradient tells us if the loss will decrease (negative gradient) or increase (positive gradient) if we increase the weight.
The learning rate defines how far along the gradient we will jump in the current step of the optimization.
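Here is a minimal sketch of this update in Python (the quadratic toy loss and all names are illustrative assumptions, not from the thread):

```python
# Plain gradient descent on a toy loss L(w) = (w - 3)^2,
# which has its minimum at w = 3.

def grad_loss(w):
    return 2.0 * (w - 3.0)  # dL/dw

w = 0.0              # initial weight
learning_rate = 0.1  # how far we jump along the gradient each step

for step in range(50):
    w = w - learning_rate * grad_loss(w)  # move against the gradient

print(w)  # close to 3.0, the minimum of the loss
```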
Momentum ⚽️
Now we add the momentum. It is defined as the weight update in the previous step times a decay factor.
The decay factor is just a number between 0 and 1 defining how much of the previous update is taken into account. α = 0 means no momentum, and α = 1 means the previous update is carried over in full.
A useful analogy is a ball rolling down a hill. If the hill is steep, the ball will accelerate (we update the weights more).
This will help the ball jump over small local minima and continue down the hill (to a smaller loss).
More momentum means a heavier ball with higher inertia.
Putting it all together 👇
So, in the original formula we update the weights using two terms.
The *gradient descent* term pushes us down the slope of the loss function.
The *momentum* term helps us accelerate and jump over small local minima.
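Here is the same toy example extended with the momentum term, as a hedged sketch (α = 0.9 and the other values are assumptions for illustration):

```python
# Gradient descent with momentum on the same toy loss L(w) = (w - 3)^2.

def grad_loss(w):
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1  # gradient descent term: step size along the gradient
alpha = 0.9          # momentum term: decay factor for the previous update
update = 0.0         # previous weight update, initially zero

for step in range(100):
    update = alpha * update - learning_rate * grad_loss(w)
    w = w + update   # apply both terms at once

print(w)  # approaches 3.0, overshooting a bit on the way due to momentum
```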
Not that hard, right? 😉
What are typical challenges when training deep neural networks ❓
▪️ Overfitting
▪️ Underfitting
▪️ Lack of training data
▪️ Vanishing gradients
▪️ Exploding gradients
▪️ Dead ReLUs
▪️ Network architecture design
▪️ Hyperparameter tuning
How to solve them 👇
Overfitting 👇
Your model performs well during training, but poorly during testing.
Possible solutions (a code sketch follows the list):
- Reduce the size of your model
- Add more data
- Increase dropout
- Stop the training early
- Add regularization to your loss
- Decrease batch size
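As a sketch of how dropout and loss regularization look in practice (PyTorch is an assumption here; the thread names no framework, and the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A small classifier with a dropout layer; raising p drops more
# activations during training and regularizes more strongly.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)

# weight_decay adds an L2 penalty on the weights to the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```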
Underfitting 👇
Your model performs poorly both during training and testing.
Possible solutions (see the sketch after the list):
- Increase the size of your model
- Add more data
- Train for a longer time
- Start with a pre-trained network
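For the last point, one common way to start from a pre-trained network, sketched with torchvision (an assumption, since the thread names no framework; the number of classes is a placeholder):

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet instead of starting from scratch.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Swap the final layer to match your own task (10 classes is a placeholder).
model.fc = nn.Linear(model.fc.in_features, 10)
```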
Self-driving car engineer roles - Big Data Engineer 💽
Self-driving cars have lots of cameras, lidars and radars. Waymo currently has 29 cameras on a single vehicle! The cars generate huge amounts of data, easily more than 1 GB/s. This data needs to be processed...
Thread 👇
Problems to work on 🤔
The big data engineer needs to design and implement efficient storage and data processing pipelines to handle such large amounts of data.
The data also needs to be made available to the developers in a way that they can efficiently get to what they need.
Data 💾
Imagine that the self-driving car is recording data at a rate of 1 GB/s. Going on a test drive for 4 hours means that you'll collect more than 14 TB of data!
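The back-of-the-envelope arithmetic behind that figure, as a quick check:

```python
rate_gb_per_s = 1            # recording rate from the example above
drive_seconds = 4 * 60 * 60  # a 4-hour test drive
total_tb = rate_gb_per_s * drive_seconds / 1000
print(total_tb)  # 14.4 TB
```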
There are specialized loggers that can handle such rates, like this beast for example: vigem.de/en/content/pro…