Alfredo Canziani
Musician, math lover, cook, dancer, 🏳️‍🌈, and an ass prof of Computer Science at New York University
Jun 2, 2022 9 tweets 2 min read
⚠️ Long post warning ⚠️
5 years ago, for my birthday, out of the blue (this felt so much like a prank) *The Yann LeCun* texted me (no, we didn't know each other) on Messenger, offering me a life-changing opportunity, which I had failed to obtain the ‘proper’ way, but got by accident. 🤷🏼‍♂️
Why did I fail? I'm not that smart.
Don't even start telling me I'm humble. I can gauge far too well the brain-power of NYU PhD students surrounding me, let alone my colleagues.
Did I manage to make it after years of faking it? Not in the slightest.
Sep 27, 2021 5 tweets 2 min read
Let's try this. Hopefully, I won't regret it, haha. 😅😅😅
Sat 2 Oct 2021 @ 9:00 EST, live stream of my latest lecture.
Prerequisites: practica 1 and 2 from DLSP21.
① Gentle introduction to EBMs for classification.
Sep 16, 2021 6 tweets 2 min read
Yesterday, in @kchonyc's NLP class, we learnt about the input (word and sentence) embeddings and the class embeddings, and how these are updated using the gradient of the log-probability of the correct class, i.e. log p(y* | x). Say x is a sentence of T words: x = {w₁, w₂, …, w_T}.
1h(w) is the 1-hot representation of w (its index in a dictionary).
e(w) is the dense representation associated with w.
ϕ(x) = ∑ₜ e(wₜ) is the bag-of-words sentence representation.
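To make this concrete, here is a minimal PyTorch sketch (all sizes, names, and the random sentence are made up for illustration): the embedding table plays the role of e(·), ϕ(x) is the sum of the word embeddings, and back-propagating −log p(y* | x) updates both the word and the class embeddings.

import torch
import torch.nn as nn

V, C, d = 10_000, 5, 128           # vocabulary size, nb of classes, embedding dim (made up)
e = nn.Embedding(V, d)             # word embeddings e(w), indexed by the position of the 1 in 1h(w)
u = nn.Linear(d, C, bias=False)    # class embeddings, one row per class

def phi(x):                        # x: LongTensor of word indices w₁ … w_T
    return e(x).sum(dim=0)         # ϕ(x) = ∑ₜ e(wₜ)

x = torch.randint(0, V, (7,))      # a random “sentence” of T = 7 words
y_star = torch.tensor(2)           # index of the correct class y*

loss = -torch.log_softmax(u(phi(x)), dim=0)[y_star]   # −log p(y* | x)
loss.backward()                    # gradients flow into both the word and the class embeddings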
Aug 12, 2021 8 tweets 6 min read
📣 NYU Deep Learning SP21 📣
Theme 4 / 3: EBMs, advanced

Website: atcold.github.io/NYU-DLSP21/
Lecture 7:
Lecture 8:
Lecture 9: Learn about regularised EBMs: from prediction with latent variables to sparse coding. From temporal regularisation methods to (conditional) variational autoencoders.
Jun 28, 2021 8 tweets 6 min read
Learn all about self-supervised learning for vision with @imisra_!

In this lecture, Ishan covers pretext-invariant representation learning (PIRL), swapping assignments between views (SwAV), audio-visual discrimination (AVID + CMA), and Barlow Twins redundancy reduction.
Here you can find @MLStreetTalk's interview, where these topics are discussed in a conversational format.
Jun 25, 2021 8 tweets 3 min read
Learn about modern speech recognition and the Graph Transformer Networks with @awnihannun!

In this lecture, Awni covers the connectionist temporal classification (CTC) loss, beam search decoding, weighted finite-state automata and transducers, and GTNs!
«Graph Transformer Networks are deep learning architectures whose states are not tensors but graphs.
You can back-propagate gradients through modules whose inputs and outputs are weighted graphs.
GTNs are very convenient for end-to-end training of speech recognition and NLP sys.»
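For reference, the CTC loss ships with PyTorch; here is a tiny usage sketch (all shapes and the random data are made up), following the standard nn.CTCLoss pattern:

import torch
import torch.nn as nn

T, N, C, S = 50, 2, 21, 10          # input length, batch size, nb of classes (incl. blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # labels 1 … C−1 (index 0 is the blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)           # marginalises over all alignments between input and target
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()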
May 25, 2021 5 tweets 2 min read
The energy 🔋 saga complete index ☝🏻
💜💚💜

Episode I
Episode II
May 19, 2021 5 tweets 3 min read
The fifth episode (of five) of the energy 🔋 saga is out! 🤩

In this last episode of the energy saga we code up an AE, DAE, and VAE in @PyTorch. Then, we learn about GANs, where a cost net C is trained contrastively with samples generated by another net.
A GAN is simply a contrastive technique where a cost net C is trained to assign low energy to samples y (blue, cold 🥶, low energy) from the data set and high energy to contrastive samples ŷ (red, hot 🥵, where the “hat” points upward indicating high energy).
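A minimal sketch of this contrastive reading (toy 2-d data, made-up sizes, and a margin loss picked here just for illustration): C pushes the energy of data samples y down and the energy of generated samples ŷ up, while G chases the regions C deems cold.

import torch
import torch.nn as nn

C = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # cost net: energy C(y) ∈ ℝ
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # generator: ŷ = G(z)
opt_C = torch.optim.Adam(C.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)

y = torch.randn(32, 2)              # a batch standing in for data-set samples (blue 🥶)
z = torch.randn(32, 8)              # latents for the generator

# Contrastive step for C: low energy on y, high energy (up to a margin m) on ŷ (red 🥵)
y_hat = G(z).detach()
m = 1.0
loss_C = C(y).mean() + torch.relu(m - C(y_hat)).mean()
opt_C.zero_grad(); loss_C.backward(); opt_C.step()

# Generator step: G learns to produce samples to which C assigns low energy
loss_G = C(G(z)).mean()
opt_G.zero_grad(); loss_G.backward(); opt_G.step()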
May 11, 2021 5 tweets 3 min read
The fourth episode (of five) of the energy 🔋 saga is out! 🤩

From LV EBM to target prop(agation) to vanilla autoencoder, and then denoising, contractive, and variational autoencoders. Finally, we learn about the VAE's bubble-of-bubbles interpretation.
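A tiny sketch of the piece of the VAE that gives rise to the bubble picture (all sizes and names are made up): each y is encoded not as a point but as a Gaussian bubble 𝒩(μ(y), σ(y)²), and the KL term keeps all the per-sample bubbles inside one big standard-normal bubble.

import torch
import torch.nn as nn

d, k = 28 * 28, 16
enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU())
to_mu, to_logvar = nn.Linear(64, k), nn.Linear(64, k)

y = torch.rand(8, d)                       # a batch of inputs
h = enc(y)
mu, logvar = to_mu(h), to_logvar(h)

eps = torch.randn_like(mu)                 # reparameterisation trick
z = mu + (0.5 * logvar).exp() * eps        # a sample from the bubble around μ(y)

# KL(𝒩(μ, σ²) ‖ 𝒩(0, 1)): keeps every small bubble inside the big one
kl = 0.5 * (logvar.exp() + mu.pow(2) - 1 - logvar).sum(dim=1).mean()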
Edit: updating a thumbnail and adding one more.

In this episode I *really* changed the content wrt last year. Being exposed to EBMs for several semesters now made me realise how all these architectures (and more to come) are connected to each other.
Apr 8, 2021 7 tweets 3 min read
— Context —

When speaking about the transformer architecture, one may incorrectly call it an encoder-decoder architecture. But this is *clearly* not the case.
The transformer architecture is an example of an encoder-predictor-decoder architecture, i.e. a conditional language model. The classical example of an encoder-decoder architecture is the autoencoder (AE). The (blue / cold / low-energy) target y is auto-encoded. (The AE slides are coming out later today.)
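To make the naming concrete, here's a toy sketch with stand-in modules (the modules, names, and sizes are mine, not the lecture's): the predictor is the conditional-language-model part that a plain autoencoder lacks.

import torch
import torch.nn as nn

d = 32
encoder = nn.Linear(d, d)             # x → h: encodes the conditioning input
predictor = nn.Bilinear(d, d, d)      # (h, summary of past y's) → ĥ: the conditional LM part
decoder = nn.Linear(d, d)             # ĥ → ŷ: maps back to target space

x, y_past = torch.randn(1, d), torch.randn(1, d)
y_hat = decoder(predictor(encoder(x), y_past))    # encoder-predictor-decoder

y = torch.randn(1, d)
y_tilde = decoder(encoder(y))                     # encoder-decoder: the AE reconstructs its own target y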
Nov 23, 2020 7 tweets 6 min read
#AcademicChatter

Coming from engineering, I'm a former @MATLAB user, moved to @TorchML and @LuaLang, then to @PyTorch and @ThePSF @RealPython, and now I'm exploring @WolframResearch @stephen_wolfram.
For learning, one would prefer knowledge-packed frameworks and documentation. In that regard, @MATLAB and @WolframResearch are ridiculously compelling. The user manuals are just amazing, with everything organised and at your disposal. Moreover, the language syntax is logical, much closer to maths, and aligned with your mental flow.
Oct 31, 2020 6 tweets 3 min read
This week we went through the second part of my lecture on latent variable 👻 energy 🔋 based models. 🤓

We've warmed up the temperature 🌡 a little, moving from the freezing 🥶 zero-temperature free energy F∞(y) (which you see spinning below) to a warmer 🥰 Fᵦ(y). Be careful with that thermostat! If it gets too hot 🥵 you'll end up killing ☠️ your latents 👻, averaging them all out, indiscriminately, and landing on plain boring MSE (fig 1.3)! 🤒
From fig 2.1–3, you can see how more z's contribute to Fᵦ(y).
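A numerical toy of what the thermostat does, with a handful of discrete z's weighted uniformly (my simplification of the integral over z; the energies are made up):

import torch

E = torch.tensor([1.0, 2.0, 5.0])     # E(y, z) for one fixed y and three latents z

def free_energy(E, beta):
    # Fᵦ(y) = −(1/β) log[ (1/N) ∑_z exp(−β E(y, z)) ]
    N = torch.tensor(float(len(E)))
    return -(torch.logsumexp(-beta * E, dim=0) - N.log()) / beta

print(free_energy(E, beta=100.0))     # ≈ 1.01 → freezing 🥶 F∞(y) = min_z E(y, z)
print(free_energy(E, beta=0.01))      # ≈ 2.65 → too hot 🥵: every z contributes, ≈ the plain average 2.67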
Oct 21, 2020 7 tweets 3 min read
This week we've learnt how to perform inference with a latent variable 👻 energy 🔋 based model. 🤓
These models are very convenient when we cannot use a standard feed-forward net that maps vector to vector, and they allow us to learn one-to-many and many-to-one relationships. Take the example of the horn 📯 (this time I drew it correctly, i.e. the points do not lie on a grid). Given an x there are multiple correct y's; actually, there is a whole ellipse (an infinite number of points) associated with it!
Or, forget the x, even considering y alone…
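A toy inference sketch for the ellipse (the decoder, shapes, and numbers below are all made up): given x, the model's predictions ŷ(z) trace a whole ellipse, and scoring a candidate y amounts to minimising the energy over the latent z.

import torch

a, b = 2.0, 1.0                                    # ellipse axes, standing in for “given this x”
dec = lambda z: torch.stack([a * torch.cos(z), b * torch.sin(z)], dim=-1)   # ŷ(z)

def energy(y, z):
    return ((y - dec(z)) ** 2).sum(-1)             # E(y, z) = ‖y − ŷ(z)‖²

y = torch.tensor([1.76, 0.48])                     # a candidate y
z = torch.linspace(0, 2 * torch.pi, 1_000)         # crude grid search over the latent 👻
F = energy(y, z).min()                             # zero-temperature free energy F(y) = min_z E(y, z)
print(F)                                           # ≈ 0 ⇒ this y lies (almost) on the ellipse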
Apr 29, 2020 6 tweets 2 min read
🥳 NEW LECTURE 🥳
Graph Convolutional Networks… from attention!
In attention 𝒂 is computed with a [soft]argmax over scores. In GCNs 𝒂 is simply given, and it's called "adjacency vector".
Slides: github.com/Atcold/pytorch…
Notebook: github.com/Atcold/pytorch…
Summary of today's class.

Slide 1: *shows title*.
Slide 2: *recalls self-attention*.
Slide 3: *shows 𝒂, points out it's given*.
The end.

Literally!
I've spent the last week reading everything about these GCNs and… LOL, they quickly found a spot in my mind, next to attention!
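In code, the difference boils down to where 𝒂 comes from (a toy sketch; sizes and the adjacency vector are made up):

import torch

n, d = 5, 16
H = torch.randn(n, d)                             # features of the neighbours / of the set
x = torch.randn(d)                                # the query node / element

a_attn = torch.softmax(H @ x / d ** 0.5, dim=0)   # attention: 𝒂 is *computed*, a [soft]argmax over scores

adj = torch.tensor([1., 0., 1., 1., 0.])          # GCN: 𝒂 is simply *given*, the adjacency vector
a_gcn = adj / adj.sum()

h_attn = a_attn @ H                               # both aggregate the same way:
h_gcn = a_gcn @ H                                 # a convex combination of H's rows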
Apr 22, 2020 9 tweets 3 min read
🥳 NEW LECTURE 🥳
“Set to set” and “set to vector” mappings using self/cross hard/soft attention. We combined one (two) attention module(s) with one (two) k=1 1D convolution(s) to get a transformer encoder (decoder).
Slides: github.com/Atcold/pytorch…
Notebook: github.com/Atcold/pytorch…
This week's slides were quite dense, but we've been building up momentum since the beginning of class, 3 months ago.
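To make the “one (two) attention module(s) + one (two) k=1 convolution(s)” recipe above concrete, here's a sketch of a single encoder block (dims are made up, and I leave out the norms and residual connections):

import torch
import torch.nn as nn

d, heads, t = 64, 4, 10
attn = nn.MultiheadAttention(d, heads, batch_first=True)   # set → set, self-attention
conv = nn.Conv1d(d, d, kernel_size=1)                      # k=1 1D conv: acts on each element independently

x = torch.randn(1, t, d)                    # a “set” of t elements
h, _ = attn(x, x, x)                        # self-attention
h = conv(h.transpose(1, 2)).transpose(1, 2)
print(h.shape)                              # torch.Size([1, 10, 64])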
We recalled concepts from:
• Linear Algebra (Ax as a linear combination of A's columns weighted by x's components, or as scalar products of A's rows against x)
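The two readings of Ax in two lines (numbers made up):

import torch

A = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])
x = torch.tensor([10., 1.])

v_cols = x[0] * A[:, 0] + x[1] * A[:, 1]                  # linear combination of A's columns
v_rows = torch.stack([A[i] @ x for i in range(len(A))])   # scalar products of A's rows against x
assert torch.allclose(v_cols, A @ x) and torch.allclose(v_rows, A @ x)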
May 15, 2018 6 tweets 1 min read
Impressive, especially the TikZ diagrams 🙇‍♂️
Would you mind using $\mathbb{R}$ for the blackboard R?
Thanks for using $\bm{}$ for vectors and $^\top$ for the transposition.
What about $\mathbb{I}$ for the identity matrix?
Use $\varnothing$ instead of $\emptyset$? Also, isn't the gradient (a column vector) the transpose of the Jacobian (a row vector)?
What about having the differential operator d upright, rather than in italics like a variable?
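Putting the suggestions together in one (hypothetical) minimal document:

\documentclass{article}
\usepackage{amssymb,bm}                      % \mathbb, \varnothing, \bm
\newcommand{\dd}{\mathrm{d}}                 % upright differential operator
\begin{document}
$\bm{x} \in \mathbb{R}^n$, identity $\mathbb{I}$, empty set $\varnothing$,
gradient as the transpose of the Jacobian: $\nabla f(\bm{x}) = \bigl(\frac{\partial f}{\partial \bm{x}}\bigr)^{\!\top}$,
and an upright differential: $\int f(x) \,\dd x$.
\end{document}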