#AcademicChatter

Coming from engineering, I'm a former @MATLAB user, moved to @TorchML and @LuaLang, then to @PyTorch and @ThePSF @RealPython, and now I'm exploring @WolframResearch @stephen_wolfram.

For learning, one would prefer knowledge-packed frameworks and documentation.

In that regard, @MATLAB and @WolframResearch are ridiculously compelling. The user manuals are just amazing, with everything organised and available at your disposal. Moreover, the language syntax is logical, much closer to math, and aligned to your mental flow.

In Mathematica I can write y = 2x (implicit multiplication), x = 6, and y will now equal 12. y is a variable.

Or I can create a function of x with y[x_] := 2x (notice that x_ is a pattern standing for any argument, and := means the right-hand side isn't evaluated right now). Later, I can execute y[x] and get 12, as above.

This week we went through the second part of my lecture on latent variable 👻 energy 🔋 based models. 🤓

We've warmed up the temperature 🌡 a little, moving from the freezing 🥶 zero-temperature free energy Fₒₒ(y) (which you see spinning below) to a warmer 🥰 Fᵦ(y).

Be careful with that thermostat! If it gets too hot 🥵 you'll kill ☠️ your latents 👻 and average them all out, indiscriminately, ending up with plain boring MSE (fig 1.3)! 🤒

From fig 2.1–3, you can see how more z's contribute to Fᵦ(y).

This is nice, 'cos during training (fig 3.3, bottom) *The Force* will be strong with a wider region of your manifold, and no longer with the single Jedi. This in turn will lead to a more even pull and will avoid overfitting (fig 3.3, top). Still, we're fine here because z ∈ ℝ.
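
Here's a minimal sketch of that thermostat, assuming the latent z is discretised on a small grid and that Fᵦ is the soft-min of the energies, normalised by the number of grid points. The names `free_energy` and `latent_weights`, and the energy values, are made up for illustration and are not taken from the lecture notebook.

```python
import math

def free_energy(energies, beta):
    """Soft-min of E(y, z) over a grid of z at inverse temperature beta.

    Assumed form: F_beta = -1/beta * log(mean_z exp(-beta * E(y, z))).
    """
    m = min(energies)
    # Subtract the minimum before exponentiating for numerical stability.
    s = sum(math.exp(-beta * (e - m)) for e in energies) / len(energies)
    return m - math.log(s) / beta

def latent_weights(energies, beta):
    """How much each z contributes: a soft-argmin over the energies."""
    m = min(energies)
    w = [math.exp(-beta * (e - m)) for e in energies]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical energies E(y, z) for four values of z.
energies = [0.5, 2.0, 3.5, 1.0]

f_cold = free_energy(energies, beta=1e3)   # freezing: close to min E
f_warm = free_energy(energies, beta=1.0)   # warmer: several z's matter
f_hot = free_energy(energies, beta=1e-6)   # too hot: close to the plain average of E

w_cold = latent_weights(energies, beta=1e3)  # nearly one-hot on the best z
w_warm = latent_weights(energies, beta=1.0)  # spread over more z's
```

With a large β only the best z counts, and as β shrinks towards zero every z gets the same weight, which is the "averaging them all out" regime.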

This week we've learnt how to perform inference with a latent variable 👻 energy 🔋 based model. 🤓

These models are very convenient when we cannot use a standard feed-forward net that maps vector to vector, and allow us to learn one-to-many and many-to-one relationships.

Take the example of the horn 📯 (this time I drew it correctly, i.e. points do not lie on a grid 𐄳). Given an x there are multiple correct y's. Actually, there is a whole ellipse (an infinite number of points) that's associated with it!

Or, forget the x, even considering y alone…

there are (often) two values of y₂ for a given y₁! Use MSE and you'll get a point in the middle… which is WRONG.
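
A quick sketch of why the MSE answer lands in the middle, assuming an axis-aligned ellipse with made-up semi-axes `a` and `b` (the helper names are illustrative): the constant that minimises the squared error against both branches is their mean, which sits inside the ellipse rather than on it.

```python
import math

def ellipse_branches(y1, a=2.0, b=1.0):
    """The two y2 values on the ellipse (y1/a)^2 + (y2/b)^2 = 1 for a given y1."""
    root = b * math.sqrt(1 - (y1 / a) ** 2)
    return root, -root

def mse(pred, targets):
    """Mean squared error of a single prediction against all valid targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

targets = ellipse_branches(1.0)

# The minimiser of the MSE over a constant prediction is the mean of the targets.
mse_prediction = sum(targets) / len(targets)

# Distance from the MSE prediction to the nearest point that is actually valid.
gap = min(abs(mse_prediction - t) for t in targets)
```

The MSE at the midpoint is lower than at either true branch, so a vector-to-vector regressor is rewarded for predicting a point that is not on the manifold.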

What's a “latent variable” you may ask now.

Well, it's a ghost 👻 variable. It was indeed used to generate the data (θ) but we don't have access to it (z).
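
As a rough sketch of inference with such a ghost variable, assume a fixed (not learnt) decoder that maps an angle z to a point on an ellipse, a squared-distance energy, and a brute-force search over a grid of z. `decode`, `energy`, `infer` and the semi-axes are hypothetical names and values chosen for illustration.

```python
import math

def decode(z, a=2.0, b=1.0):
    """Hypothetical decoder: the latent angle z picks a point on the ellipse."""
    return (a * math.cos(z), b * math.sin(z))

def energy(y, z):
    """E(y, z): squared distance between y and the decoded point."""
    y_hat = decode(z)
    return (y[0] - y_hat[0]) ** 2 + (y[1] - y_hat[1]) ** 2

def infer(y, steps=360):
    """Zero-temperature inference: pick the z on the grid with the lowest energy."""
    grid = [2 * math.pi * k / steps for k in range(steps)]
    z_star = min(grid, key=lambda z: energy(y, z))
    return z_star, energy(y, z_star)

# A point just above the top of the ellipse: the best z is a quarter turn.
z_star, f_zero = infer((0.0, 1.5))
```

We never observe z, so we recover it by minimising the energy. The minimum value itself plays the role of the zero-temperature free energy of y.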

🥳 NEW LECTURE 🥳

Graph Convolutional Networks… from attention!

In attention 𝒂 is computed with a [soft]argmax over scores. In GCNs 𝒂 is simply given, and it's called "adjacency vector".
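
To make the contrast concrete, here is a minimal sketch for a single element x of a tiny set X. In the attention case 𝒂 comes from a softargmax over scalar-product scores. In the graph case 𝒂 is just handed to us. The set, the adjacency vector `a_graph` and the helper names are invented for illustration, and the hidden representation is simplified to a plain linear combination.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def softargmax(scores):
    """Normalised exponentials of the scores (what is often called softmax)."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [ei / total for ei in e]

def combine(X, a):
    """Linear combination of the set elements in X, weighted by a."""
    dim = len(X[0])
    return [sum(a[j] * X[j][d] for j in range(len(X))) for d in range(dim)]

# Hypothetical set of three 2-dimensional elements; x is the first one.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = X[0]

# Attention: a is computed from the scores of x against every element.
a_attention = softargmax([dot(xj, x) for xj in X])
h_attention = combine(X, a_attention)

# Graph: a is given, here "x is connected to elements 1 and 2".
a_graph = [0.0, 1.0, 1.0]
h_graph = combine(X, a_graph)
```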

Slides: github.com/Atcold/pytorch…

Notebook: github.com/Atcold/pytorch…

Summary of today's class.

Slide 1: *shows title*.

Slide 2: *recalls self-attention*.

Slide 3: *shows 𝒂, points out it's given*.

The end.

Literally!

I've spent the last week reading everything about these GCNs and… LOL, they quickly found a spot in my mind, next to attention!

The key concept here is the *sparsity* (constraints) of the graph.

In self-attention, every element in the set looks at each and every other element.

If a sparse graph is given, we limit each element (node / vertex) to look only at a few other elements (nodes / vertices).
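
A small sketch of that sparsity constraint, assuming a toy ring graph where each node is connected to its two neighbours (the graph, the scalar features and the helper names are illustrative assumptions):

```python
def ring_adjacency(n):
    """Adjacency matrix of a ring: node i is linked to i-1 and i+1 (wrapping around)."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][(i - 1) % n] = 1
        A[i][(i + 1) % n] = 1
    return A

def aggregate(A, features):
    """Each node sums the features of the nodes it is allowed to look at."""
    return [sum(a_ij * f for a_ij, f in zip(row, features)) for row in A]

n = 6
dense = [[1] * n for _ in range(n)]   # self-attention: everyone looks at everyone
sparse = ring_adjacency(n)            # graph: only the two neighbours

looks_dense = [sum(row) for row in dense]
looks_sparse = [sum(row) for row in sparse]

features = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
agg_sparse = aggregate(sparse, features)
```

In the dense case every row has n ones. In the sparse case each row has only as many ones as the node has neighbours.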

🥳 NEW LECTURE 🥳

“Set to set” and “set to vector” mappings using self/cross hard/soft attention. We combined a (two) attention module(s) with a (two) k=1 1D convolution(s) to get a transformer encoder (decoder).

Slides: github.com/Atcold/pytorch…

Notebook: github.com/Atcold/pytorch…

This week's slides were quite dense, but we've been building up momentum since the beginning of class, 3 months ago.

We recalled concepts from:

• Linear Algebra (Ax as lin. comb. of A's columns weighted by x's components, or as scalar products of A's rows against x)

• Recurrent Nets (stacking x[t] with h[t–1] and concatenating W_x and W_h)

• Autoencoders (encoder-decoder architecture)

• k=1 1D convolutions (that do not assume correlation between neighbouring features and act as a dim. adapter)

and put them into practice with @PyTorch.
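
The recalled pieces can be sketched in plain Python lists (not the @PyTorch code from class; all matrices, vectors and helper names below are made up for illustration): Ax computed both ways, the stacked recurrent update, and a k=1 convolution as the same matrix applied at every position.

```python
def matvec_rows(A, x):
    """Ax as scalar products of A's rows against x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matvec_columns(A, x):
    """Ax as a linear combination of A's columns weighted by x's components."""
    out = [0.0] * len(A)
    for j, xj in enumerate(x):
        for i in range(len(A)):
            out[i] += xj * A[i][j]
    return out

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
by_rows = matvec_rows(A, x)
by_cols = matvec_columns(A, x)

# Recurrent nets: W_x x[t] + W_h h[t-1] equals [W_x W_h] applied to the stacked vector.
W_x = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
W_h = [[0.5, 0.0], [1.0, 1.0]]
x_t, h_prev = [1.0, 2.0, 3.0], [2.0, 4.0]
separate = [p + q for p, q in zip(matvec_rows(W_x, x_t), matvec_rows(W_h, h_prev))]
W = [row_x + row_h for row_x, row_h in zip(W_x, W_h)]  # concatenate side by side
stacked = matvec_rows(W, x_t + h_prev)                 # stack x[t] on top of h[t-1]

# k=1 1D convolution: the same matrix at every position, no mixing of neighbours.
# Here it adapts the dimension from 2 to 3.
sequence = [[1.0, -1.0], [0.0, 1.0]]
adapted = [matvec_rows(A, item) for item in sequence]
```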

Impressive, especially the TikZ diagrams 🙇‍♂️

Would you mind using $\mathbb{R}$ for the blackboard R?

Thanks for using $\bm{}$ for vectors and $^\top$ for the transposition.

What about $\mathbb{I}$ for the identity matrix?

Use $\varnothing$ instead of $\emptyset$?

Also, isn't the gradient (column vector) the transpose of the Jacobian (row vector)?
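
To spell out what I mean, here is a short sketch assuming a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ and the numerator-layout convention (other texts use the opposite layout, so treat this as one possible convention):

```latex
% Assumption: scalar-valued f and numerator layout, so the Jacobian is a 1 x n row.
\[
  \frac{\partial f}{\partial \bm{x}}
  = \begin{bmatrix}
      \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n}
    \end{bmatrix}
  \in \mathbb{R}^{1 \times n},
  \qquad
  \nabla f(\bm{x})
  = \left( \frac{\partial f}{\partial \bm{x}} \right)^{\top}
  \in \mathbb{R}^{n}.
\]
```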

What about having the differential operator d upright, and not like a variable?

Figure 5.6 uses bold upright font, instead of the italic one.