Although the notation often doesn't show it, the argument of a vector-scalar function is always a vector.
Most frequently, we write out the components - a.k.a. the variables - explicitly.
Want a practical example of a vector-scalar function?
The loss of a predictive model maps the vector of parameters to a single scalar.
Below, you can see the mean-squared error of a simple linear regression model.
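Concretely (using a for the slope, b for the intercept, and (xᵢ, yᵢ) for the data points, which is my notation rather than anything fixed above), the loss is

\mathrm{MSE}(a, b) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (a x_i + b) \big)^2

The input is the parameter vector (a, b); the output is a single number.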
Next up, we have the vector-vector functions.
You can imagine them as a force field, attaching a vector to each point.
The most important example of vector-vector functions is the gradient.
We call this a gradient field.
Let's visualize an example!
This is how the vector field given by the gradient of f(x, y) = x² + y² looks.
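For this f, the gradient is ∇f(x, y) = (2x, 2y), so every point gets a vector pointing straight away from the origin. Here is a minimal sketch to reproduce the picture, assuming NumPy and Matplotlib; the grid range and density are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of points where we evaluate the gradient of f(x, y) = x² + y²
x, y = np.meshgrid(np.linspace(-2, 2, 20), np.linspace(-2, 2, 20))

# The gradient of f is (2x, 2y): at each point, a vector pointing away from the origin
u, v = 2 * x, 2 * y

plt.quiver(x, y, u, v)
plt.title("Gradient field of f(x, y) = x² + y²")
plt.show()
```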
It is important to note that not all vector-vector functions are gradient fields!
For instance, f(x, y) = (x - xy, xy - y) cannot be a gradient.
Can you figure out the reason why? (Hint: take a look at the partial derivatives of f(x, y).)
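If you want to check your answer: for a gradient ∇g = (∂g/∂x, ∂g/∂y), the mixed second partials of g are equal, so the cross partials of the two components must match. Here they don't:

\frac{\partial}{\partial y}(x - xy) = -x \neq y = \frac{\partial}{\partial x}(xy - y)

so f cannot be the gradient of any function.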
Next up, we have scalar-vector functions, that is, curves.
Think about the scalar-vector function f(t) as the trajectory of a particle at time t.
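For instance, f(t) = (cos t, sin t) describes a particle going around the unit circle: at time t, it sits at the point (cos t, sin t).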
Technically, there is only a single variable involved. Yet, curves play an essential role in multivariable calculus.
Remember how vector-vector functions define force fields?
Scalar-vector functions describe the trajectories of particles moving through them.
Gradient descent connects all of this.
In essence, gradient descent
1. takes the surface of the loss function,
2. computes the vector field given by the gradient,
3. and finds the trajectories given by the gradient vector field by a discrete approximation.
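Here is a minimal sketch of that discrete approximation, reusing the f(x, y) = x² + y² example from above; the starting point, step size, and iteration count are arbitrary choices:

```python
import numpy as np

def gradient(point):
    """Gradient of f(x, y) = x² + y² at the given point."""
    x, y = point
    return np.array([2 * x, 2 * y])

# Start somewhere on the surface and repeatedly step against the gradient
point = np.array([2.0, 1.0])
learning_rate = 0.1

for _ in range(100):
    point = point - learning_rate * gradient(point)

print(point)  # close to the minimum at (0, 0)
```

Each step moves the point a little way along the negative gradient, which is exactly a discrete version of following a trajectory through the gradient field.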
This is just the tip of the iceberg.
Multivariable calculus is one of the most powerful tools in machine learning, helping us optimize functions of millions of variables.
That is quite a feat.
Most machine learning practitioners don’t understand the math behind their models.
That's why I've created a FREE roadmap so you can master the 3 main topics you'll need: algebra, calculus, and probability theory.
The Law of Large Numbers is one of the most frequently misunderstood concepts of probability and statistics.
Just because you lost ten blackjack games in a row doesn't mean that you'll be more likely to get lucky next time.
What is the law of large numbers, then? Read on:
The strength of probability theory lies in its ability to translate complex random phenomena into coin tosses, dice rolls, and other simple experiments.
So, let’s stick with coin tossing.
What will the average number of heads be if we toss a coin, say, a thousand times?
To mathematically formalize this question, we’ll need random variables.
Tossing a fair coin is described by the Bernoulli distribution, so let X₁, X₂, … be independent and identically distributed Bernoulli random variables, one for each toss.
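In this language, the law of large numbers says that the average of the first n tosses converges to the expected value, which for a fair coin is 1/2:

\frac{X_1 + X_2 + \dots + X_n}{n} \to E[X_1] = \frac{1}{2} \quad \text{as } n \to \infty

The long-run average stabilizes; it says nothing about any individual toss being "due".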
Differentiation reveals much more than the slope of the tangent line.
We like to think about it that way, but from a different angle, differentiation is the same as an approximation with a linear function. This allows us to greatly generalize the concept.
Let's see why!
By definition, the derivative of a function at the point a is the limit of the difference quotient, representing the rate of change.
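Written out:

f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}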
In geometric terms, the difference quotient is the slope of the line connecting two points of the function's graph.
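Rearranging the same definition shows where the linear approximation view comes from: for x close to a,

f(x) \approx f(a) + f'(a)(x - a)

so near a, the function behaves like a linear function with slope f'(a).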
Even looking at the definition of the matrix product used to make me sweat, let alone trying to comprehend the pattern. Yet, there is a stunningly simple explanation behind it.
Let's pull back the curtain!
First, the raw definition.
This is how the product of A and B is given. Not the easiest (or most pleasant) to look at.
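Written out, assuming A is an m × n matrix and B is an n × l matrix, with aᵢₖ and bₖⱼ denoting their entries:

(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}, \quad 1 \le i \le m, \ 1 \le j \le l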
We are going to unwrap this.
Here is a quick visualization before the technical details.
The element in the i-th row and j-th column of AB is the dot product of A's i-th row and B's j-th column.
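A quick NumPy sketch of this view; the matrices are arbitrary small examples:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Entry (i, j) of AB is the dot product of A's i-th row and B's j-th column
i, j = 0, 1
from_rows_and_columns = np.dot(A[i, :], B[:, j])

print(from_rows_and_columns)   # 1*6 + 2*8 = 22
print((A @ B)[i, j])           # 22, the same value from full matrix multiplication
```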