This magic is called the QR decomposition, and it's behind the famous eigenvalue-finding QR algorithm.
Here is how it works.
In essence, the QR decomposition factors an arbitrary matrix into the product of an orthogonal and an upper triangular matrix.
(We’ll illustrate everything with the 3 x 3 case, but the argument carries over to the general case as is.)
First, some notation. Every matrix can be thought of as a sequence of column vectors. Trust me, this simple observation is the foundation of many, many Eureka moments in mathematics.
Why is this useful? Because this way, we can look at matrix multiplication as a linear combination of the columns.
Check out how matrix-vector multiplication looks from this angle. (You can easily work this out by hand if you don’t believe me.)
In other words, a matrix times a vector equals a linear combination of the column vectors.
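Concretely, in the 3 x 3 case (writing a₁, a₂, a₃ for the columns of A), this reads:

$$
Ax = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= x_1 a_1 + x_2 a_2 + x_3 a_3.
$$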
Similarly, the product of two matrices can be written in terms of linear combinations.
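Spelled out with the columns b₁, b₂, b₃ of B:

$$
AB = A \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}
= \begin{bmatrix} Ab_1 & Ab_2 & Ab_3 \end{bmatrix},
$$

so each column of AB is a linear combination of the columns of A.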
So, what’s the magic behind the QR decomposition? Simple: the vectorized version of the Gram-Schmidt process.
In a nutshell, the Gram-Schmidt process takes a linearly independent set of vectors and returns an orthonormal set that progressively generates the same subspaces.
(If you are not familiar with the Gram-Schmidt process, check out my earlier thread, where I explain everything in detail.)
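(For reference, here is the standard recipe: subtract from each input vector its projections onto the already-built qⱼ, then normalize.)

$$
q_i' = a_i - \sum_{j<i} \langle a_i, q_j \rangle\, q_j,
\qquad
q_i = \frac{q_i'}{\lVert q_i' \rVert}.
$$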
The output vectors of the Gram-Schmidt process (qᵢ) can be written as linear combinations of the input vectors (aᵢ), and vice versa: each aᵢ is a linear combination of q₁, …, qᵢ.
In other words, using the column-vector form of matrix multiplication, we see that A factors into the product of two matrices.
As you can see, one term is formed from the Gram-Schmidt process’ output vectors (qᵢ), while the other one is upper triangular.
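Written out in the 3 x 3 case, with the standard Gram-Schmidt coefficients ⟨aⱼ, qᵢ⟩ filling the upper triangle:

$$
A = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix}
= \underbrace{\begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}}_{Q}
\underbrace{\begin{bmatrix}
\langle a_1, q_1 \rangle & \langle a_2, q_1 \rangle & \langle a_3, q_1 \rangle \\
0 & \langle a_2, q_2 \rangle & \langle a_3, q_2 \rangle \\
0 & 0 & \langle a_3, q_3 \rangle
\end{bmatrix}}_{R}.
$$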
However, the matrix of qᵢ-s is also special: as its columns are orthonormal, its inverse is its transpose. Such matrices are called orthogonal.
Thus, any matrix can be written as the product of an orthogonal and an upper triangular one, which is the famous QR decomposition.
When is this useful for us? For one, it is used to iteratively find the eigenvalues of matrices. This is called the QR algorithm, one of the top 10 algorithms of the 20th century.
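If you want to see it in action, here is a minimal sketch of the unshifted QR iteration in NumPy. (Practical implementations add shifts and deflation, so treat this as an illustration, not production code.)

```python
import numpy as np

def qr_algorithm(A, iterations=500):
    """Unshifted QR iteration: factor, then multiply the factors in reverse order."""
    A_k = np.array(A, dtype=float)
    for _ in range(iterations):
        Q, R = np.linalg.qr(A_k)   # A_k = QR
        A_k = R @ Q                # similar to A_k, so the eigenvalues are preserved
    # For symmetric matrices, A_k converges to a nearly diagonal matrix
    # whose diagonal entries are the eigenvalues; sort them for easy comparison.
    return np.sort(np.diag(A_k))

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(qr_algorithm(A))         # approximate eigenvalues
print(np.linalg.eigvalsh(A))   # NumPy's answer, for comparison
```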
This explanation is also a part of my Mathematics of Machine Learning book.
It's for engineers, scientists, and other curious minds. Explaining math like your teachers should have, but probably never did. Check out the early access!
The single biggest argument about statistics: is probability frequentist or Bayesian?
It's neither, and I'll explain why.
Buckle up. Deep-dive explanation incoming.
First, let's look at what probability is.
Probability quantitatively measures the likelihood of events, like rolling a six with a die. It's a number between zero and one. This is independent of interpretation; it’s a rule set in stone.
In the language of probability theory, events are formalized by sets: subsets of the sample space, which collects all possible outcomes.
(The sample space is also a set, usually denoted by Ω.)
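(For the record, these "rules set in stone" are the Kolmogorov axioms; here they are in their standard form:)

$$
P(A) \ge 0, \qquad P(\Omega) = 1, \qquad
P\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} P(A_i)
\quad \text{for pairwise disjoint events } A_1, A_2, \dots
$$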
If the sidewalk is wet, is it raining? Not necessarily. Yet, we are inclined to think so. This is a preposterously common logical fallacy called "affirming the consequent".
However, it is not totally wrong. Why? Enter Bayes' theorem.
Propositions of the form "if A, then B" are called implications.
They are written as "A → B", and they form the bulk of our scientific knowledge.
Say, "if X is a closed system, then the entropy of X cannot decrease" is the 2nd law of thermodynamics.
In the implication A → B, the proposition A is called the "premise", while B is called the "conclusion".
The premise implies the conclusion, but not the other way around.
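A truth table makes this asymmetry visible (the standard one, included here as a reminder):

$$
\begin{array}{cc|c}
A & B & A \to B \\
\hline
T & T & T \\
T & F & F \\
F & T & T \\
F & F & T
\end{array}
$$

The third row is the key one: A → B can hold even when A is false and B is true.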
If you observe a wet sidewalk, it is not necessarily raining. Someone might have spilled a barrel of water.
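This is where Bayes' theorem comes in. With A = "it's raining outside" and B = "the sidewalk is wet", it tells us how much observing B should raise our confidence in A (the standard statement, written out for reference):

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
$$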
There is a deep truth behind this conventional wisdom: probability is the mathematical extension of logic, augmenting our reasoning toolkit with the concept of uncertainty.
In-depth exploration of probabilistic thinking incoming.
Our journey ahead has three stops:
1. an introduction to mathematical logic,
2. a touch of elementary set theory,
3. and finally, understanding probabilistic thinking.
First things first: mathematical logic.
In logic, we work with propositions.
A proposition is a statement that is either true or false, like
• "it's raining outside",
• or "the sidewalk is wet".
These are often abbreviated as variables, such as A = "it's raining outside".
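With these abbreviations, the wet-sidewalk reasoning from above compresses into a single implication:

$$
A \to B: \quad \text{if it's raining outside, then the sidewalk is wet.}
$$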
How to build a good understanding of math for machine learning?
I get this question a lot, so I decided to make a complete roadmap for you. In essence, it comes down to three fields: calculus, linear algebra, and probability theory.
Let's take a quick look at them!
1. Linear algebra.
In machine learning, data is represented by vectors. Essentially, training a learning algorithm is finding more descriptive representations of data through a series of transformations.
Linear algebra is the study of vector spaces and their transformations.
Simply speaking, a neural network is just a function mapping the data to a high-level representation.
Linear transformations are the fundamental building blocks of these. Developing a good understanding of them will go a long way, as they are everywhere in machine learning.
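To make that concrete, here is a minimal sketch (plain NumPy, with made-up weights, purely for illustration) of a tiny network as a composition of linear transformations and simple nonlinearities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up weights for illustration: two linear maps and their biases.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def relu(z):
    return np.maximum(z, 0.0)

def network(x):
    """Linear transformation, nonlinearity, then another linear transformation."""
    h = relu(W1 @ x + b1)   # map the data to a 4-dimensional representation
    return W2 @ h + b2      # map that representation to the 2-dimensional output

x = np.array([1.0, -2.0, 0.5])   # a data point, represented as a vector
print(network(x))                # its transformed representation
```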