This week we went through the second part of my lecture on latent variable 👻 energy 🔋 based models. 🤓
We've warmed up the temperature 🌡 a little, moving from the freezing 🥶 zero-temperature free energy F∞(y) (shown spinning below) to a warmer 🥰 Fᵦ(y).
Be careful with that thermostat! If it gets too hot 🥵 you'll kill ☠️ your latents 👻 by averaging them all out, indiscriminately, ending up with plain boring MSE (fig 1.3)! 🤒
From fig 2.1–3, you can see how more z's contribute to Fᵦ(y).
This is nice, 'cos during training (fig 3.3, bottom) *The Force* will be strong with a wider region of your manifold, and no longer with a single Jedi. This in turn leads to a more even pull and avoids overfitting (fig 3.3, top). Still, we're fine here because z ∈ ℝ.
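If you want to play with the thermostat yourself, here's a minimal sketch (my own toy code, not the lecture's notebook) of Fᵦ computed with a logsumexp over a grid of latents; both extremes of the dial fall out of the same formula:

```python
import math
import torch

def free_energy(E, beta):
    # F_beta(y) = -1/beta * log mean_z exp(-beta * E(y, z)),
    # where E holds the energies over a grid of latents (last dim).
    return -(torch.logsumexp(-beta * E, dim=-1) - math.log(E.shape[-1])) / beta

E = torch.tensor([1.0, 2.0, 3.0])   # toy energies for three z's
print(free_energy(E, beta=100.0))   # ≈ 1.0: freezing 🥶, min over z, i.e. F∞
print(free_energy(E, beta=0.01))    # ≈ 2.0: too hot 🥵, plain average, MSE-land
```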
Finally, switching to the conditional / self-supervised case, where we introduce an observed x, requires changing 1 line of code! 🤯
Basically, self-sup and un-sup are super close in programming space! 😬
So, learning the horn 📯 was easy peasy! 😎
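Schematically, the change looks like this (hypothetical decoders and names, not the actual notebook code): the energy just gets to see the observed x too.

```python
import torch
import torch.nn as nn

dec = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))

# Unsupervised: the energy depends on y and the latent z only.
def energy(y, z):
    return ((y - dec(z)) ** 2).sum(-1)

dec_c = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))

# Conditional / self-supervised: concatenate the observed x with z.
def energy_cond(x, y, z):
    return ((y - dec_c(torch.cat([x, z], dim=-1))) ** 2).sum(-1)
```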
Made with @matplotlib as usual.
For reference, the (now correct) training data is shown below. Notice that ∞ y's (an entire ellipse) are associated with a given x. So, you cannot hope to train a neural net directly on (x, y) pairs. That model would collapse into the segment (0, 0, 0) → (1, 0, 0).
One more note. 🧐
Fig 1.4 shows the *correct* terminology 👍🏻 vs. what is currently commonly used 👎🏻.
Actual softmax → "logsumexp" (scalar).
Its derivative, softargmax → "softmax" (pseudo probability).
This is analogous to max and its derivative, argmax. But softer. 😀
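You can check this in PyTorch directly: the gradient of logsumexp (the actual softmax) is exactly what the library calls softmax (the softargmax).

```python
import torch

s = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

lse = torch.logsumexp(s, dim=0)   # ≈ 3.41, a scalar: a softened max(s)
lse.backward()

print(s.grad)                             # ≈ [0.0900, 0.2447, 0.6652]
print(torch.softmax(s.detach(), dim=0))   # same pseudo probabilities
```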
This week we've learnt how to perform inference with a latent variable 👻 energy 🔋 based model. 🤓
These models are very convenient when we cannot use a standard feed-forward net that maps vectors to vectors; they allow us to learn one-to-many and many-to-one relationships.
Take the example of the horn 📯 (this time I drew it correctly, i.e. the points do not lie on a grid 𐄳). Given an x there are multiple correct y's; actually, there is a whole ellipse (an ∞ number of points) associated with it!
Or, forget the x, even considering y alone…
there are (often) two values of y₂ for a given y₁! Use MSE and you'll get a point in the middle… which is WRONG.
What's a “latent variable”, you may ask now.
Well, it's a ghost 👻 variable: it was indeed used to generate the data (the angle θ), but we don't have access to it (hence the latent z).
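Here's a tiny demonstration of that MSE collapse (my own toy sketch): fit the MSE-optimal constant to points generated on a circle by a hidden θ, and you land at the centre, a point that isn't even on the data.

```python
import math
import torch

theta = torch.rand(1000, 1) * 2 * math.pi         # the ghost 👻: never observed
y = torch.cat([theta.cos(), theta.sin()], dim=1)  # data on a circle

y_hat = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([y_hat], lr=0.5)
for _ in range(100):
    opt.zero_grad()
    ((y - y_hat) ** 2).mean().backward()
    opt.step()

print(y_hat.data)  # ≈ (0, 0): the centre of the circle, far from every datum
```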
🥳 NEW LECTURE 🥳
Graph Convolutional Networks… from attention!
In attention 𝒂 is computed with a [soft]argmax over scores. In GCNs 𝒂 is simply given, and it's called "adjacency vector".
Slides: github.com/Atcold/pytorch…
Notebook: github.com/Atcold/pytorch…
Summary of today's class.
Slide 1: *shows title*.
Slide 2: *recalls self-attention*.
Slide 3: *shows 𝒂, points out it's given*.
The end.
Literally!
I've spent the last week reading everything about these GCNs and… LOL, they quickly found a spot in my mind, next to attention!
The key concept here is the *sparsity* (constraints) of the graph.
In self-attention, every element in the set looks at each and every other element.
If a sparse graph is given, we limit each element (node / vertex) to looking only at a few other elements (its neighbouring nodes / vertices).
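In code, the whole difference is one mask (an illustrative sketch, not from the slides): compute self-attention scores as usual, then let the given adjacency matrix zero out non-neighbours before the softargmax.

```python
import torch

n, d = 5, 8
h = torch.randn(n, d)                    # one feature vector per node
A = (torch.rand(n, n) < 0.3).float()     # a *given* sparse adjacency
A.fill_diagonal_(1.0)                    # each node also sees itself

scores = h @ h.t() / d ** 0.5                        # full self-attention scores
scores = scores.masked_fill(A == 0, float('-inf'))   # sparsity constraint of the graph
a = torch.softmax(scores, dim=-1)                    # attention over neighbours only
out = a @ h                                          # aggregate neighbours' features
```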
🥳 NEW LECTURE 🥳
“Set to set” and “set to vector” mappings using self/cross hard/soft attention. We combined one (two) attention module(s) with one (two) k=1 1D convolution(s) to get a transformer encoder (decoder).
Slides: github.com/Atcold/pytorch…
Notebook: github.com/Atcold/pytorch…
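A minimal sketch of the encoder flavour (one self-attention module plus one k=1 1D convolution block; my own simplified version, using PyTorch's built-in multi-head attention):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d, heads=4, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        # Position-wise feed-forward as k=1 1D convolutions:
        # no correlation assumed between neighbours, a pure dim. adapter.
        self.ff = nn.Sequential(
            nn.Conv1d(d, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, d, kernel_size=1))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):                           # x: (batch, set size, d)
        x = self.norm1(x + self.attn(x, x, x)[0])   # self-attention
        return self.norm2(x + self.ff(x.transpose(1, 2)).transpose(1, 2))

x = torch.randn(2, 10, 32)
print(EncoderBlock(32)(x).shape)  # torch.Size([2, 10, 32])
```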
This week's slides were quite dense, but we've been building up momentum since the beginning of class, 3 months ago.
We recalled concepts from:
• Linear Algebra (Ax as a lin. comb. of A's columns weighted by x's components, or as scalar products of A's rows against x; see the sketch after this list)
• Recurrent Nets (stacking x[t] with h[t–1] and concatenating W_x and W_h)
• Autoencoders (encoder-decoder architecture)
• k=1 1D convolutions (which do not assume correlation between neighbouring features and act as a dimensionality adapter)
and put it all in practice with @PyTorch.
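And the promised sketch for the first bullet, a quick sanity check of the two readings of Ax:

```python
import torch

A, x = torch.randn(3, 4), torch.randn(4)

cols = sum(x[j] * A[:, j] for j in range(4))       # lin. comb. of A's columns
rows = torch.stack([A[i] @ x for i in range(3)])   # scalar products of A's rows

print(torch.allclose(A @ x, cols), torch.allclose(A @ x, rows))  # True True
```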
Impressive, especially the TikZ diagrams 🙇‍♂️
Would you mind using $\mathbb{R}$ for the blackboard R?
Thanks for using $\bm{}$ for vectors and $^\top$ for the transposition.
What about $\mathbb{I}$ for the identity matrix?
Use $\varnothing$ instead of $\emptyset$?
Also, isn't the gradient (column vector) the transpose of the Jacobian (row vector)?
What about typesetting the differential operator d upright, rather than in italics like a variable?
Figure 5.6 uses bold upright font, instead of the italic one.