Everything you need to know about the batch size when training a neural network.
(Because it really matters, and understanding it makes a huge difference.)
A thread.
Gradient Descent is an optimization algorithm used to train neural networks.
On every iteration, the algorithm computes how much we need to adjust the model's weights to get closer to the results we want.
2/
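To make the idea concrete, here is a minimal sketch (my own illustration, not code from the thread) of gradient descent minimizing a simple function, f(x) = x². The starting point and learning rate are arbitrary choices:

# Gradient descent on f(x) = x**2, whose derivative is 2x.
def gradient(x):
    return 2 * x

x = 5.0              # arbitrary starting point
learning_rate = 0.1  # arbitrary step size

for step in range(50):
    x -= learning_rate * gradient(x)  # step against the gradient

print(x)  # ends up very close to 0.0, the minimum of f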
We take samples from the training dataset, run them through the model, and determine how far away our results are from the ones we expect.
We call this "error," and using it, we compute how much we need to update the model weights to improve the results.
3/
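Here's what one such update looks like for the simplest possible model, y = w * x, with squared error (a sketch; the sample values, w, and learning_rate are made up for illustration):

# One update step for a tiny model y = w * x, trained on a single sample.
x, y_true = 2.0, 10.0   # a training sample (input and expected result)
w = 1.0                 # the model's only weight
learning_rate = 0.01

y_pred = w * x                 # run the sample through the model
error = y_pred - y_true        # how far away we are from what we expect
gradient = 2 * error * x       # derivative of error**2 with respect to w
w -= learning_rate * gradient  # update the weight to reduce the error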
A critical decision we need to make is how many samples we use on every iteration.
We have three choices:
▫️ Use a single sample of data.
▫️ Use all of the data at once.
▫️ Use some of the data (a mini-batch; see the sketch below).
4/
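All three choices are really the same loop with a different batch size. Here's a sketch (make_batches is a name I made up for illustration):

import numpy as np

def make_batches(X, y, batch_size):
    # Shuffle the data, then yield consecutive chunks of batch_size samples.
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        chunk = indices[start:start + batch_size]
        yield X[chunk], y[chunk]

# batch_size = 1       -> a single sample per update (Stochastic Gradient Descent)
# batch_size = len(X)  -> all of the data per update (Batch Gradient Descent)
# anything in between  -> some of the data per update (Mini-Batch Gradient Descent)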
Using a single sample of data on every iteration is called "Stochastic Gradient Descent" (SGD).
The algorithm uses one sample at a time to compute the updates.
5/
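A sketch of SGD fitting the tiny linear model from before (the data and learning rate are made up for illustration):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # the function to learn is y = 2 * x
w = 0.0
learning_rate = 0.01

for epoch in range(100):
    for i in np.random.permutation(len(X)):    # one sample at a time
        error = w * X[i] - y[i]                # immediate feedback
        w -= learning_rate * 2 * error * X[i]  # update after every sample

print(w)  # approaches 2.0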
Advantages of Stochastic Gradient Descent:
▫️ Faster learning on some problems.
▫️ The algorithm is simple to understand.
▫️ The noisy updates can help escape local minima.
▫️ Provides immediate feedback.
6/
Disadvantages of Stochastic Gradient Descent:
▫️ Computationally intensive: one update per sample means no vectorization across samples.
▫️ May not settle in the global minimum.
▫️ Training progress will be very noisy.
7/
Using all the data at once is called "Batch Gradient Descent."
The algorithm takes the entire dataset and computes a single update after processing all of the samples.
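Same model as before, but now with Batch Gradient Descent: one averaged update per pass over the whole dataset (again a sketch with made-up data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # the function to learn is y = 2 * x
w = 0.0
learning_rate = 0.01

for epoch in range(100):
    errors = w * X - y                            # every sample at once
    w -= learning_rate * np.mean(2 * errors * X)  # one update per epoch

print(w)  # approaches 2.0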
"Is it reasonable for someone to dive into machine learning with a shallow knowledge of math?"
▫️ The short answer is "yes."
▫️ The more nuanced answer is "it depends."
Let me try and unpack this question for you.
🧵👇
You can think about machine learning as a spectrum that goes all the way from pure research to engineering.
The more you move towards a research position, the more you can benefit from your math knowledge. If you move in the other direction, you'll get away with less of it.
👇
I have friends who got a Ph.D. and became college professors.
For them, math is an absolute requirement!
Not only are they working on research projects, but they are teaching the next generation of scientists and engineers.
Here is a full Python 🐍 implementation of a neural network from scratch in less than 20 lines of code!
It shows how a network can learn 5 logic functions. (But it's powerful enough to learn much more.)
An excellent exercise in learning how feedforward and backpropagation work!
A quick rundown of the code:
▫️ X → input
▫️ layer → hidden layer
▫️ output → output layer
▫️ W1 → set of weights between X and layer
▫️ W2 → set of weights between layer and output
▫️ error → how far off our prediction is after every epoch
I'm using a sigmoid as the activation function. You will recognize it through this formula:
sigmoid(x) = 1 / (1 + exp(-x))
It would have been nicer to extract it as a separate function, but then the code wouldn't be as compact 😉
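The code itself was shared as an image, so here is a sketch consistent with the rundown above (my reconstruction, not the author's exact code; the bias column in X, the layer sizes, and the training details are my assumptions):

import numpy as np

np.random.seed(0)

# Inputs: two bits per row; the constant 1 in the last column acts as a bias.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# Expected outputs, one column per logic function: AND, OR, NAND, NOR, XOR.
y = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 1, 0, 1],
              [1, 1, 0, 0, 0]])

W1 = np.random.uniform(-1, 1, (3, 4))  # weights between X and layer
W2 = np.random.uniform(-1, 1, (4, 5))  # weights between layer and output

for epoch in range(20000):
    layer = 1 / (1 + np.exp(-X @ W1))       # feedforward: hidden layer
    output = 1 / (1 + np.exp(-layer @ W2))  # feedforward: output layer
    error = y - output                      # how far off the prediction is
    # Backpropagation: chain rule through the sigmoid derivative s * (1 - s)
    delta2 = error * output * (1 - output)
    delta1 = (delta2 @ W2.T) * layer * (1 - layer)
    W2 += layer.T @ delta2
    W1 += X.T @ delta1

print(output.round())  # should match y (convergence can vary with the seed)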