We can organise the different classes of joint-embedding methods into 4 main categories.
• Contrastive (explicit use of negative samples)
• Clustering
• Distillation
• Redundancy reduction
«Contrastive»
Related embeddings (same colour) should be closer than unrelated embeddings (different colour).
Good negative samples are *very* important. E.g.
• SimCLR has a *very large* batch size;
• Wu2018 uses an offline memory bank;
• MoCo uses an “online mem bank”.
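To make the role of negatives concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss. It is an illustration under stated assumptions, not the exact objective of SimCLR, Wu2018, or MoCo: the function name `info_nce`, the temperature of 0.1, and the toy data are all hypothetical, and the negatives are simply the other items in the batch.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b;
    every other row of z_b in the batch acts as a negative."""
    # L2-normalise so that dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives are on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                              # toy embeddings (assumption)
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))   # related pairs
unrelated = info_nce(z, rng.normal(size=z.shape))            # unrelated pairs
```

The loss is low when each embedding is closer to its own pair than to the rest of the batch, which is why a larger pool of negatives (big batch, memory bank) makes the task more informative.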
«Clustering»
Contrastive learning ⇒ grouping in feature space.
We may simply want to assign an embedding to a given cluster. Examples are:
• SwAV performs online clustering using optimal transport;
• DeepCluster;
• SeLA.
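As a rough sketch of what "clustering using optimal transport" can look like, the snippet below runs a few Sinkhorn-Knopp normalisation steps to turn embedding-to-prototype scores into soft cluster assignments that are roughly balanced across clusters. It is loosely modelled on that idea only; the function name `sinkhorn`, the value of `eps`, the number of iterations, and the toy data are assumptions, not any library's code.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Soft assignments of N embeddings to K prototypes, with each
    cluster pushed towards receiving an equal share of the batch."""
    Q = np.exp((scores - scores.max()) / eps).T    # (K, N), shifted for stability
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)          # each cluster (row) sums to 1 ...
        Q /= K                                     # ... then carries mass 1/K
        Q /= Q.sum(axis=0, keepdims=True)          # each sample (column) sums to 1 ...
        Q /= N                                     # ... then carries mass 1/N
    return (Q * N).T                               # (N, K), each row sums to 1

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))                        # toy embeddings (assumption)
z /= np.linalg.norm(z, axis=1, keepdims=True)
prototypes = rng.normal(size=(3, 4))               # toy cluster centres (assumption)
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
assignments = sinkhorn(z @ prototypes.T)
```

The balancing step is what stops every embedding from being assigned to the same cluster.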
«Distillation»
Similarity maximisation through a student-teacher distillation process. The trivial (collapsed) solution is avoided by using asymmetries in the learning rule and in the net's architecture.
• BYOL's student has a predictor on top, the teacher is a slow student;
• SimSiam shares weights.
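"The teacher is a slow student" can be read as the teacher's weights being an exponential moving average of the student's. A minimal sketch, assuming parameters stored in a plain dict and a hypothetical decay `tau = 0.99` (neither is taken from the thread):

```python
import numpy as np

def ema_update(teacher, student, tau=0.99):
    """Move every teacher parameter a small step towards the student's."""
    return {name: tau * teacher[name] + (1 - tau) * student[name]
            for name in teacher}

student = {"w": np.ones(3)}     # toy student weights (assumption)
teacher = {"w": np.zeros(3)}    # toy teacher weights (assumption)
for _ in range(100):
    teacher = ema_update(teacher, student)
```

With a fixed student, the teacher closes the gap geometrically: after k updates the remaining distance is tau**k of the initial one. Only the student receives gradients; the teacher just trails behind.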
«Redundancy reduction»
Each neuron's representation should be invariant under input data augmentation and independent from other neurons. Everything's done *without* looking at negative examples!
E.g. Barlow Twins makes the cross-correlation matrix between the embeddings of two augmented views close to the identity matrix.
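A minimal NumPy sketch of a redundancy-reduction objective in this spirit: standardise each neuron over the batch, compute the cross-correlation between the two views, then push the diagonal towards 1 (invariance) and the off-diagonal towards 0 (independence). The function name, the weight `lam`, and the toy data are assumptions for illustration.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: (N, D) embeddings of two augmented views of the same batch."""
    n = z_a.shape[0]
    # standardise each neuron (column) over the batch
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    c = z_a.T @ z_b / n                                   # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1) ** 2).sum()               # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()   # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 4))                             # toy embeddings (assumption)
same_view = barlow_twins_loss(z, z)                       # perfectly invariant neurons
unrelated = barlow_twins_loss(z, rng.normal(size=z.shape))
```

No negative pairs appear anywhere: the only statistics used are correlations between neurons, computed over the batch.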
⚠️ Long post warning ⚠️
5 years ago, for my birthday, out of the blue (this was so much a prank) *The Yann LeCun* texted me (no, we didn't know each other) on Messenger offering me a life-changing opportunity, which I failed to obtain the ‘proper’ way, but got by accident. 🤷🏼‍♂️
Why did I fail? I'm not that smart.
Don't even start telling me I'm humble. I can gauge far too well the brain-power of NYU PhD students surrounding me, let alone my colleagues.
Did I manage to make it after years of faking it? Not in the slightest.
So, did he make a mistake picking this quirky Italian? I'd say no.
While working on an autonomous driving project, as instructed, I went out of my way to help with teaching for as much as I could.
My dream was to teach worldwide, and YouTube let me do just that.
Let's try this. Hopefully, I won't regret it, haha. 😅😅😅
Sat 2 Oct 2021 @ 9:00 EST, live stream of my latest lecture.
Prerequisites: practica 1 and 2 from DLSP21.
Yesterday, in @kchonyc's NLP class, we learnt about the input (word and sentence) and class embeddings, and how these are updated using the gradient of the log-probability of the correct class, i.e. log p(y* | x).
Say x is a sentence of T words: x = {w₁, w₂, …, w_T}.
1h(w) is the 1-hot representation of w (its index in a dictionary).
e(w) is the dense representation associated with w.
ϕ(x) = ∑ₜ e(wₜ) is the bag-of-words sentence representation.
For every word w in x: ∇_{e(w)} log p(y* | x) = ∇_{ϕ(x)} log p(y* | x) = u_y* − 𝔼_{y|x}[u_y]
We'll add to e(w) the correct class embedding u_y* while removing what the network thinks it should be instead, 𝔼_{y|x}[u_y]. *If* these two are the same, then the gradient will be zero, and nothing will be added or subtracted.
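The gradient formula above can be checked numerically. The sketch below assumes the class scores are dot products u_y · ϕ(x) followed by a softmax, which is one reading consistent with the formula; the vocabulary size, dimensions, sentence, and target class are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))    # word embeddings e(w): toy vocabulary of 10 words
U = rng.normal(size=(3, 4))     # class embeddings u_y: 3 toy classes
x = [1, 5, 5, 7]                # a sentence as word indices (assumption)
y_star = 2                      # the correct class (assumption)

def log_p(phi, y):
    """log p(y | x) under a softmax over the scores u_y . phi(x)."""
    logits = U @ phi
    return logits[y] - np.log(np.exp(logits).sum())

phi = E[x].sum(axis=0)                               # bag-of-words phi(x)
p = np.exp([log_p(phi, y) for y in range(len(U))])   # p(y | x) for every class
analytic = U[y_star] - p @ U                         # u_y* - E_{y|x}[u_y]

# central finite differences of log p(y* | x) with respect to phi(x)
h = 1e-6
numeric = np.array([(log_p(phi + h * d, y_star) - log_p(phi - h * d, y_star)) / (2 * h)
                    for d in np.eye(len(phi))])
```

Since ϕ(x) is a plain sum, each occurrence of a word in x contributes this same gradient to its e(w); a word that appears twice, like index 5 here, collects it twice.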
Learn about regularised EBMs: from prediction with latent variables to sparse coding. From temporal regularisation methods to (conditional) variational autoencoders.
We think that not only babies find peekaboo funny.
You let us know, okay?
😅😅😅
Learn about modern speech recognition and the Graph Transformer Networks with @awnihannun!
In this lecture, Awni covers the connectionist temporal classification (CTC) loss, beam search decoding, weighted finite-state automata and transducers, and GTNs!
«Graph Transformer Networks are deep learning architectures whose states are not tensors but graphs.
You can back-propagate gradients through modules whose inputs and outputs are weighted graphs.
GTNs are very convenient for end-to-end training of speech recognition and NLP sys.»
«They can be seen as a differentiable form of WFST (weighted finite-state transducers) widely used in speech recognition.
Awni is the lead author of libgtn, a GTN library for PyTorch.»