Mini-batch gradient descent is a variation of the gradient descent optimization algorithm used in machine learning and deep learning.
It is designed to address the limitations of two other variants: batch gradient descent (BGD) and stochastic gradient descent (SGD).
In BGD, the entire training dataset is used to compute the gradient of the cost function at each iteration.
This approach gives an exact gradient and converges to the global minimum for convex cost functions, but it can be computationally expensive, especially for large datasets.
Stochastic gradient descent (SGD), on the other hand, randomly selects a single training example at each iteration and computes the gradient based on that example alone.
SGD is computationally efficient but exhibits high variance in its gradient estimate, which can lead to slow convergence and noisy updates.
Mini-batch gradient descent combines the best of both worlds by using a small subset, or mini-batch, of the training data at each iteration.
Instead of using the entire dataset (as in BGD) or a single example (as in SGD), mini-batch gradient descent computes the gradient from a mini-batch of training examples.
The mini-batch size is typically chosen to be a compromise between computational efficiency and variance reduction
Common choices for mini-batch sizes are in the range of 10 to 1,000, depending on the size of the dataset and the available computational resources.
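As a concrete illustration, here is a minimal NumPy sketch of mini-batch gradient descent for a simple linear-regression model with a mean-squared-error cost; the function name and hyperparameter defaults are illustrative, not a fixed API.

    import numpy as np

    def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
        # Mini-batch gradient descent for linear regression (MSE cost).
        n_samples, n_features = X.shape
        w = np.zeros(n_features)   # weights
        b = 0.0                    # bias
        for _ in range(epochs):
            # Shuffle once per epoch so successive mini-batches differ
            idx = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                batch = idx[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                # Gradient of the MSE cost, estimated on this mini-batch only
                error = Xb @ w + b - yb
                grad_w = 2 * Xb.T @ error / len(batch)
                grad_b = 2 * error.mean()
                w -= lr * grad_w
                b -= lr * grad_b
        return w, b

Setting batch_size to the full dataset size recovers BGD, while batch_size=1 recovers SGD.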
The main advantages of mini-batch gradient descent are:
- Efficiency: By using mini-batches, it allows for parallelization of computations, which can significantly speed up the training process, especially on hardware accelerators like GPUs
- Variance reduction: Compared to stochastic gradient descent, mini-batch gradient descent provides a more stable and less noisy estimate of the gradient, resulting in smoother updates and faster convergence.
- Generalization: Mini-batch gradient descent strikes a balance between the smooth, full-dataset updates of batch gradient descent and the noisy updates of stochastic gradient descent, often leading to better generalization performance.
However, mini-batch gradient descent also introduces a new hyperparameter: the mini-batch size.
Selecting an appropriate mini-batch size is a trade-off between computational efficiency and convergence speed:
a larger mini-batch reduces noise in the gradient estimate but increases the computational cost of each update.
Mini-batch gradient descent is widely used as the optimization algorithm of choice for training deep neural networks and other large-scale ML models, offering a good balance between computational efficiency and convergence properties.
SGD is an optimization algorithm often used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs. It’s an inexact but powerful technique.
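In terms of the mini-batch sketch above, plain SGD is simply the special case with a batch size of one, so each parameter update is computed from a single randomly selected example:

    w, b = minibatch_gd(X, y, batch_size=1)   # hypothetical helper from the sketch above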
A saddle point, or minimax point, is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point) but which is not a local extremum of the function.
A saddle point (in red) on the graph of z = x² − y² (a hyperbolic paraboloid).
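The definition can be checked numerically; a minimal sketch for f(x, y) = x² − y²: the gradient vanishes at the origin, but the Hessian has eigenvalues of mixed sign, so the origin is a critical point that is not an extremum.

    import numpy as np

    # f(x, y) = x**2 - y**2 (the hyperbolic paraboloid above)
    def grad(x, y):
        return np.array([2 * x, -2 * y])     # (df/dx, df/dy)

    hessian = np.array([[2.0, 0.0],
                        [0.0, -2.0]])        # constant Hessian of f

    print(grad(0.0, 0.0))                    # both partial derivatives are zero at the origin
    print(np.linalg.eigvals(hessian))        # [ 2. -2.] -> mixed signs: a saddle, not an extremum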
Topic -- Principal Component Analysis (PCA) Part 1
In statistics, PCA is the science of analyzing all the dimensions of a dataset and reducing them as much as possible while preserving as much of the original information (variance) as possible.
Using the Principal Component Method of factor analysis, multi-dimensional data can then be monitored and visualized in 2D or 3D on any platform.
Step-by-step explanation of Principal Component Analysis (a minimal code sketch follows the steps):
STANDARDIZATION
COVARIANCE MATRIX COMPUTATION
EIGENVECTOR AND EIGENVALUE COMPUTATION OF THE COVARIANCE MATRIX (TO IDENTIFY THE PRINCIPAL COMPONENTS)
FEATURE VECTOR
RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
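A minimal NumPy sketch of these steps, assuming the data is a numeric 2-D array X with one row per sample (the function and argument names are illustrative):

    import numpy as np

    def pca(X, n_components=2):
        # 1. Standardization: zero mean, unit variance per feature
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        # 2. Covariance matrix computation
        cov = np.cov(Xs, rowvar=False)
        # 3. Eigenvectors and eigenvalues of the covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]                  # sort by explained variance
        # 4. Feature vector: keep the top n_components eigenvectors
        feature_vector = eigvecs[:, order[:n_components]]
        # 5. Recast the data along the principal component axes
        return Xs @ feature_vector

Calling pca(X, n_components=2) returns a two-dimensional representation that can be plotted directly.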