* Train a Transformer LM or fine-tune with LoRA
* Text generation with Mistral
* Image generation with Stable Diffusion
* Speech recognition with Whisper: github.com/ml-explore/mlx…
Jul 1, 2022 • 9 tweets • 3 min read
Read a bit about Grokking recently. Here are some takeaways:
Keep optimizing a model long after it reaches perfect training accuracy, and at some point it suddenly generalizes.
What's especially surprising is that generalization happens SO LONG after the model reaches perfect training accuracy.
The sudden generalization is interesting, but we've seen this type of rapid concept learning in NNs before.
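The training protocol behind this can be sketched in a few lines. This is a hypothetical toy setup, not a reproduction of grokking (real experiments typically use transformers on tasks like modular arithmetic): the point is simply the recipe of optimizing with weight decay well past the step where training accuracy hits 100%.

```python
import numpy as np

# Toy task (illustrative assumption): linearly separable labels from a
# random teacher, fit with logistic regression via gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(float)

w = np.zeros(8)
lr, weight_decay = 0.5, 1e-3
reached_perfect = False
steps_after_perfect = 0
for step in range(5000):
    p = 1 / (1 + np.exp(-(X @ w)))                    # sigmoid predictions
    grad = X.T @ (p - y) / len(y) + weight_decay * w  # loss grad + decay
    w -= lr * grad
    train_acc = ((p > 0.5) == y).mean()
    if train_acc == 1.0:
        reached_perfect = True
        steps_after_perfect += 1  # the grokking recipe: don't stop here
```

At this scale nothing dramatic happens after the fit, but in the grokking papers it is exactly this regime, long after `steps_after_perfect` starts counting, where validation accuracy finally jumps.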
Jun 4, 2022 • 9 tweets • 2 min read
A short thread on forward and reverse mode autograd:
A great way to internalize the complexity difference between forward and reverse mode automatic differentiation is through the lens of Jacobian-vector products.
First: the Jacobian of a function is the matrix of partial derivatives, with inputs indexed on rows and outputs on columns.
The (i, j) entry is the derivative of the j-th output with respect to the i-th input.
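To make the convention concrete, here's a hypothetical finite-difference sketch in NumPy (the function `f` and tangent direction `v` are made up for the example). Forward mode pushes a tangent vector through the computation in one pass; with inputs on rows, that one pass yields the row-vector product `v @ J`.

```python
import numpy as np

def f(x):
    # Toy function from R^3 to R^2 (illustrative assumption).
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian_fd(f, x, eps=1e-6):
    # Finite-difference Jacobian in the thread's convention:
    # J[i, j] = derivative of the j-th output w.r.t. the i-th input.
    y = f(x)
    J = np.zeros((x.size, y.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        J[i] = (f(xp) - y) / eps
    return J

x = np.array([1.0, 2.0, 0.0])
J = jacobian_fd(f, x)            # shape (3 inputs, 2 outputs)

# One forward-mode pass along tangent v gives the directional
# derivative of every output at once: v @ J.
v = np.array([1.0, 0.0, 0.0])
jvp = v @ J
```

Here one pass per input direction recovers the full Jacobian, which is why forward mode's cost scales with the number of inputs; reverse mode pulls a vector back through the outputs instead, so its cost scales with the number of outputs.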