Everything you need to know about the batch size when training a neural network.
(Because it really matters, and understanding it makes a huge difference.)
A thread.
Gradient Descent is an optimization algorithm used to train neural networks.
On every iteration, the algorithm computes how much we need to adjust the model's weights to get closer to the results we want.
2/
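To make the idea concrete, here is a minimal sketch (my own illustration, not code from the thread) of gradient descent minimizing a simple function, f(x) = x². The starting point and learning rate are arbitrary choices:

# Gradient descent on f(x) = x**2, whose derivative is 2x.
def gradient(x):
    return 2 * x

x = 5.0              # arbitrary starting point
learning_rate = 0.1  # arbitrary step size

for step in range(50):
    x -= learning_rate * gradient(x)  # step against the gradient

print(x)  # ends up very close to 0.0, the minimum of f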
We take samples from the training dataset, run them through the model, and determine how far away our results are from the ones we expect.
We call this "error," and using it, we compute how much we need to update the model weights to improve the results.
3/
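Here's what one such update looks like for the simplest possible model, y = w * x, with squared error (a sketch; the sample values, w, and learning_rate are made up for illustration):

# One update step for a tiny model y = w * x, trained on a single sample.
x, y_true = 2.0, 10.0   # a training sample (input and expected result)
w = 1.0                 # the model's only weight
learning_rate = 0.01

y_pred = w * x                 # run the sample through the model
error = y_pred - y_true        # how far away we are from what we expect
gradient = 2 * error * x       # derivative of error**2 with respect to w
w -= learning_rate * gradient  # update the weight to reduce the error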
A critical decision we need to make is how many samples we use on every iteration.
We have three choices:
▫️ Use a single sample of data.
▫️ Use all of the data at once.
▫️ Use some of the data (a mini-batch; see the sketch below).
4/
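All three choices are really the same loop with a different batch size. Here's a sketch (make_batches is a name I made up for illustration):

import numpy as np

def make_batches(X, y, batch_size):
    # Shuffle the data, then yield consecutive chunks of batch_size samples.
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        chunk = indices[start:start + batch_size]
        yield X[chunk], y[chunk]

# batch_size = 1       -> a single sample per update (Stochastic Gradient Descent)
# batch_size = len(X)  -> all of the data per update (Batch Gradient Descent)
# anything in between  -> some of the data per update (Mini-Batch Gradient Descent)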
Using a single sample of data on every iteration is called "Stochastic Gradient Descent" (SGD).
The algorithm uses one sample at a time to compute the updates.
5/
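A sketch of SGD fitting the tiny linear model from before (the data and learning rate are made up for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # the function to learn is y = 2 * x
w = 0.0
learning_rate = 0.01

for epoch in range(100):
    for i in np.random.permutation(len(X)):    # one sample at a time
        error = w * X[i] - y[i]                # immediate feedback
        w -= learning_rate * 2 * error * X[i]  # update after every sample

print(w)  # approaches 2.0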
Advantages of Stochastic Gradient Descent:
▫️ Faster learning on some problems.
▫️ The algorithm is simple to understand.
▫️ The noisy updates can help escape local minima.
▫️ Provides immediate feedback.
6/
Disadvantages of Stochastic Gradient Descent:
▫️ Computationally intensive: one update per sample means no vectorization across samples.
▫️ May not settle in the global minimum.
▫️ Training progress will be very noisy.
7/
Using all the data at once is called "Batch Gradient Descent."
The algorithm takes the entire dataset and computes a single update after processing all of the samples.
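Same model as before, but now with Batch Gradient Descent: one averaged update per pass over the whole dataset (again a sketch with made-up data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # the function to learn is y = 2 * x
w = 0.0
learning_rate = 0.01

for epoch in range(100):
    errors = w * X - y                            # every sample at once
    w -= learning_rate * np.mean(2 * errors * X)  # one update per epoch

print(w)  # approaches 2.0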
"Is it reasonable for someone to dive into machine learning with a shallow knowledge of math?"
▫️ The short answer is "yes."
▫️ The more nuanced answer is "it depends."
Let me try and unpack this question for you.
🧵👇
You can think about machine learning as a spectrum that goes all the way from pure research to engineering.
The more you move towards a research position, the more you can benefit from your math knowledge. If you move in the other direction, you'll get away with less of it.
👇
I have friends who got a Ph.D. and became college professors.
For them, math is an absolute requirement!
Not only are they working on research projects, but they are teaching the next generation of scientists and engineers.
Here is a full Python 🐍 implementation of a neural network from scratch in less than 20 lines of code!
It shows how a network can learn 5 logic functions. (But it's powerful enough to learn much more.)
An excellent exercise in learning how feedforward and backpropagation work!
A quick rundown of the code:
▫️ X → input
▫️ layer → hidden layer
▫️ output → output layer
▫️ W1 → set of weights between X and layer
▫️ W2 → set of weights between layer and output
▫️ error → how far off our prediction is after every epoch
I'm using a sigmoid as the activation function. You will recognize it through this formula:
sigmoid(x) = 1 / (1 + exp(-x))
It would have been nicer to extract it as a separate function, but then the code wouldn't be as compact 😉
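The code itself was shared as an image, so here is a sketch consistent with the rundown above (my reconstruction, not the author's exact code; the bias column in X, the layer sizes, and the training details are my assumptions):

import numpy as np

np.random.seed(0)

# Inputs: two bits per row; the constant 1 in the last column acts as a bias.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# Expected outputs, one column per logic function: AND, OR, NAND, NOR, XOR.
y = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 1, 0, 1],
              [1, 1, 0, 0, 0]])

W1 = np.random.uniform(-1, 1, (3, 4))  # weights between X and layer
W2 = np.random.uniform(-1, 1, (4, 5))  # weights between layer and output

for epoch in range(20000):
    layer = 1 / (1 + np.exp(-X @ W1))       # feedforward: hidden layer
    output = 1 / (1 + np.exp(-layer @ W2))  # feedforward: output layer
    error = y - output                      # how far off the prediction is
    # Backpropagation: chain rule through the sigmoid derivative s * (1 - s)
    delta2 = error * output * (1 - output)
    delta1 = (delta2 @ W2.T) * layer * (1 - layer)
    W2 += layer.T @ delta2
    W1 += X.T @ delta1

print(output.round())  # should match y (convergence can vary with the seed)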