We can organise the different classes of joint-embedding methods into four main categories.
• Contrastive (explicit use of negative samples)
• Clustering
• Distillation
• Redundancy reduction
«Contrastive»
Related embeddings (same colour) should be closer than unrelated embeddings (different colour).
Good negative samples are *very* important. E.g.
• SimCLR has a *very large* batch size;
• Wu2018 uses an offline memory bank;
• MoCo uses an “online memory bank” (a queue of keys filled by a momentum encoder).
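A minimal NumPy sketch of the InfoNCE-style loss behind SimCLR: each pair (z1[i], z2[i]) are two views of the same input, and every other row in the batch serves as a negative — which is exactly why a large batch matters. Function name and temperature value are my own illustrative choices.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive (InfoNCE-style) loss over a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two augmentations of the same input
    (the positive pair); all other rows of z2 act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # unit-normalise
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                             # scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob.diagonal().mean()  # pull positives above all negatives
```

The loss drops when each embedding is most similar to its own positive; mismatched pairs drive it up.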
«Clustering»
Contrastive learning ⇒ grouping in feature space.
We may simply want to assign an embedding to a given cluster. Examples are:
• SwAV performs online clustering using optimal transport;
• DeepClustering;
• SeLA.
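SwAV's “online clustering using optimal transport” boils down to a few Sinkhorn-Knopp normalisation steps that spread the batch evenly over the prototypes, so the embeddings can't all collapse onto one cluster. A sketch under my own naming, with illustrative hyper-parameters:

```python
import numpy as np

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignments, SwAV-style.

    scores: (batch, n_prototypes) similarities of embeddings to prototypes.
    Alternating row/column normalisation pushes the assignment matrix
    toward equal total mass per cluster, preventing collapse.
    """
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= K  # equalise cluster mass
        Q /= Q.sum(axis=1, keepdims=True); Q /= B  # one assignment per sample
    return Q * B  # each row is now a distribution over clusters
```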
«Distillation»
Similarity maximisation through a student–teacher distillation process. The trivial (collapsed) solution is avoided by introducing asymmetries: in the learning rule and in the nets' architecture.
• BYOL's student has a predictor on top; the teacher is a slow-moving (exponential moving average) copy of the student;
• SimSiam shares weights between the two branches and relies on a stop-gradient on the target branch.
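The two asymmetries above fit in a couple of lines — a hedged sketch, with names of my own choosing: a BYOL-style slow teacher update, and the negative-cosine similarity both methods maximise (with the target branch treated as a constant, i.e. stop-gradient).

```python
import numpy as np

def ema_update(teacher_w, student_w, m=0.99):
    """BYOL-style asymmetric learning rule: the teacher is a slow
    exponential moving average of the student, never trained directly."""
    return m * teacher_w + (1 - m) * student_w

def neg_cosine(p, z):
    """Similarity to maximise between the student's prediction p and the
    target z; z is treated as a constant (stop-gradient) when training."""
    return -(p @ z) / (np.linalg.norm(p) * np.linalg.norm(z))
```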
«Redundancy reduction»
Each neuron's representation should be invariant under input data augmentation and independent from other neurons. Everything's done *without* looking at negative examples!
E.g. Barlow Twins pushes the cross-correlation matrix of the twin embeddings towards the identity matrix.
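A minimal sketch of the Barlow Twins objective (naming and the λ value are illustrative): diagonal terms toward 1 enforce invariance to the augmentation, off-diagonal terms toward 0 decorrelate the neurons — and no negative samples appear anywhere.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation matrix of twin embeddings toward the
    identity: diag -> 1 (invariance), off-diag -> 0 (redundancy reduction)."""
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardise each neuron
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / len(z1)              # (dim, dim) cross-correlation
    on_diag = ((np.diag(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag
```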
• • •
Learn about modern speech recognition and the Graph Transformer Networks with @awnihannun!
In this lecture, Awni covers the connectionist temporal classification (CTC) loss, beam search decoding, weighted finite-state automata and transducers, and GTNs!
«Graph Transformer Networks are deep learning architectures whose states are not tensors but graphs.
You can back-propagate gradients through modules whose inputs and outputs are weighted graphs.
GTNs are very convenient for end-to-end training of speech recognition and NLP systems.»
«They can be seen as a differentiable form of WFST (weighted finite-state transducers) widely used in speech recognition.
Awni is the lead author of libgtn, a GTN library for PyTorch.»
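The CTC loss covered in the lecture marginalises over all frame-level alignments that collapse to the same transcription. That collapse rule (merge consecutive repeats, then drop blanks) fits in a few lines — a toy sketch, with a blank symbol of my own choosing:

```python
def ctc_collapse(path, blank="_"):
    """Map a frame-level CTC path to its transcription:
    merge consecutive repeated tokens, then remove blanks.
    A blank between two identical tokens keeps them distinct."""
    out = []
    prev = None
    for token in path:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)
```

E.g. `"cc_aa_t"` collapses to `"cat"`, while `"c_cat"` collapses to `"ccat"` — the blank is what lets CTC emit a repeated letter.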
The fifth episode (of five) of the energy 🔋 saga is out! 🤩
In this last episode of the energy saga we code up an AE, DAE, and VAE in @PyTorch. Then, we learn about GANs, where a cost network C is trained contrastively with samples generated by another network.
A GAN is simply a contrastive technique where a cost net C is trained to assign low energy to samples y (blue, cold 🥶, low energy) from the data set and high energy to contrastive samples ŷ (red, hot 🥵, where the “hat” points upward indicating high energy).
y comes from the data set Y.
ŷ is produced by the generating network G, which maps a random vector to the input space ŷ = G(z).
To train G we simply minimise C(G(z)).
And that's it.
No fooling around with discriminators. 🥸
It's *simply* contrastive energy learning. 😇
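The two objectives above can be written down directly — a sketch of the energies only, with a hinge margin as one common choice for pushing the contrastive energy up (names and margin value are mine):

```python
def cost_net_loss(energy_y, energy_y_hat, margin=1.0):
    """Train C: low energy on data samples y, and at least `margin`
    energy on generated samples ŷ = G(z) (hinge-style contrastive push)."""
    return energy_y + max(0.0, margin - energy_y_hat)

def generator_loss(energy_G_z):
    """Train G: simply minimise the energy C assigns to its samples C(G(z))."""
    return energy_G_z
```

Once C(ŷ) clears the margin, the hinge switches off and C stops pushing — only the data term remains.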
The fourth episode (of five) of the energy 🔋 saga is out! 🤩
From LV EBM to target prop(agation) to vanilla autoencoder, and then denoising, contractive, and variational autoencoders. Finally, we learn about the VAE's bubble-of-bubbles interpretation.
Edit: updating a thumbnail and adding one more.
In this episode I *really* changed the content wrt last year. Being exposed to EBMs for several semesters now made me realise how all these architectures (and more to come) are connected to each other.
In the companion lecture (which will soon come online), @ylecun goes over a more powerful interpretation of the VAE, which I still struggle to understand. As you can imagine, another tweak to my deck will occur when I actually get it. (Yeah, I'm slow, yet persistent.)
Speaking of the transformer architecture, one may incorrectly describe it as an encoder-decoder architecture. But this is *clearly* not the case.
The transformer architecture is an example of an encoder-predictor-decoder architecture, or a conditional language model.
The classical definition of an encoder-decoder architecture is the autoencoder (AE). The (blue / cold / low-energy) target y is auto-encoded. (The AE slides are coming out later today.)
Now, the main difference between an AE and a language model (LM) is that the LM's input is delayed by one unit. This means a predictor is necessary to estimate the hidden representation of a *future* symbol.
It's similar to a denoising AE, where the corruption is temporal.
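The one-unit delay can be made concrete with a toy shift (the token names are mine):

```python
# A language model sees the sequence delayed by one unit:
tokens = ["<s>", "the", "cat", "sat", "</s>"]
inputs, targets = tokens[:-1], tokens[1:]
# At step t the predictor must turn the hidden state of inputs[t]
# into an estimate of the *future* symbol targets[t].
```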
In that regard, @MATLAB and @WolframResearch are ridiculously compelling. The user manuals are just amazing, with everything organised and available at your disposal. Moreover, the language syntax is logical, much closer to math, and aligned to your mental flow.
In Mathematica I can write y = 2x (implicit multiplication), then x = 6, and y will now equal 12. y is a variable.
Or I can create a function of x with y[x_] := 2x (the := delays evaluation of the right-hand side, and x_ is a pattern standing for any argument). Later, I can evaluate y[x] and get 12, as above.