Lots of folks reached out to me yesterday about the Rust ML and LLM community. Seems like a supportive and intellectually curious community, so I wanted to highlight some of the projects that you should check out 🧵
dfdx is a statically shape-typed tensor library. It uses lots of Rust features and supports full backprop. github.com/coreylowman/df…
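A rough sketch of what that looks like — shapes live in the type, so mismatches fail at compile time. Method names here assume the dfdx prelude and may differ across versions:

```rust
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();

    // The shape is part of the type: a 2x3 and a 3x4 matrix of f32.
    let x: Tensor<Rank2<2, 3>, f32, _> = dev.sample_normal();
    let w: Tensor<Rank2<3, 4>, f32, _> = dev.sample_normal();

    // A 4x4 weight here would be a *compile-time* shape error, not a runtime one.
    let y = x.leaky_trace().matmul(w.clone());
    let loss = y.square().mean();

    // Full backprop: gradients keep their statically-typed shapes too.
    let grads = loss.backward();
    let gx = grads.get(&x);
    println!("{:?}", gx.array());
}
```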
candle is an inference-time tensor library with numpy/pytorch-like syntax. Check out their full LLM inference example.
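And a minimal candle sketch (using the candle_core crate; dynamic shapes, deliberately pytorch-flavored — exact names may shift between versions):

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;

    // Dynamic shapes, numpy/pytorch-style construction.
    let a = Tensor::randn(0f32, 1f32, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1f32, (3, 4), &device)?;

    // Reads like the torch equivalent: c = a @ b
    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}
```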
Pretraining without Attention (arxiv.org/abs/2212.10544) - BiGS is an alternative to BERT trained on sequences of up to 4096 tokens.
Attention can be overkill. The figure below shows *every* word-word interaction for every sentence over 23 layers of BiGS (no heads, no n^2).
The core architecture is a state-space model, but that's just a fancy way of parameterizing a 1D CNN. This is the whole thing that replaces attention.
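To see the equivalence, here's a toy scalar SSM in plain Rust (BiGS uses vector-valued states and computes the kernel efficiently, so this is only a sketch of the idea): unrolling the recurrence gives exactly a causal 1D convolution.

```rust
// Toy scalar state-space model:
//   h[k] = a * h[k-1] + b * u[k],   y[k] = c * h[k]
fn ssm_recurrent(a: f64, b: f64, c: f64, u: &[f64]) -> Vec<f64> {
    let mut h = 0.0;
    u.iter()
        .map(|&uk| {
            h = a * h + b * uk;
            c * h
        })
        .collect()
}

// Unrolling the recurrence gives y = K * u, a causal 1D convolution
// with kernel K[j] = c * a^j * b.
fn ssm_as_conv(a: f64, b: f64, c: f64, u: &[f64]) -> Vec<f64> {
    let kernel: Vec<f64> = (0..u.len()).map(|j| c * a.powi(j as i32) * b).collect();
    (0..u.len())
        .map(|k| (0..=k).map(|j| kernel[k - j] * u[j]).sum())
        .collect()
}

fn main() {
    let u = [1.0, -2.0, 0.5, 3.0];
    println!("{:?}", ssm_recurrent(0.9, 1.0, 0.5, &u));
    println!("{:?}", ssm_as_conv(0.9, 1.0, 0.5, &u));
    // Both print the same sequence: the SSM *is* a (long) 1D convolution.
}
```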
To make this simple model work, we push the complexity into the per-position feed-forward networks. We follow recent work by rearranging transformer components with more aggressive gating.
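For intuition, the "more aggressive gating" is in the spirit of GLU-style multiplicative units, roughly of the form below (a sketch of the general idea, not the exact BiGS block):

```latex
% GLU-style multiplicative gate (sketch, not the exact BiGS layer)
\[
  \mathrm{GatedFFN}(x) \;=\; W_o\,\bigl(\sigma(W_g x)\,\odot\,\phi(W_v x)\bigr)
\]
```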
This is mostly an experiment in API design, trying to keep things explicit and minimal. For example, there is no explicit "Agent" or "Tool" abstraction. You build the ReAct agent just by calling functions.
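To illustrate the pattern (a hypothetical sketch of the design idea, not the library's actual API; the LLM call is stubbed and all names are made up):

```rust
// ReAct as plain functions: no Agent or Tool types, just a loop.

// Stand-in for an LLM call; a real version would hit an API here.
fn llm(prompt: &str) -> String {
    if prompt.contains("Observation:") {
        "Final Answer: 4".to_string()
    } else {
        "Action: calculate[2 + 2]".to_string()
    }
}

// The only "tool": evaluate a tiny "a + b" expression.
fn calculate(expr: &str) -> String {
    let total: f64 = expr.split('+').filter_map(|s| s.trim().parse().ok()).sum();
    total.to_string()
}

// The "agent" is just a loop over function calls.
fn react(question: &str) -> String {
    let mut transcript = format!("Question: {question}\n");
    loop {
        let step = llm(&transcript);
        transcript.push_str(&step);
        transcript.push('\n');
        if let Some(answer) = step.strip_prefix("Final Answer: ") {
            return answer.to_string();
        }
        if let Some(expr) = step
            .strip_prefix("Action: calculate[")
            .and_then(|s| s.strip_suffix(']'))
        {
            transcript.push_str(&format!("Observation: {}\n", calculate(expr)));
        }
    }
}

fn main() {
    println!("{}", react("What is 2 + 2?"));
}
```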
One of the main challenges in this version was supporting streaming in the visualizations, mainly because it is just cool.
Named Tensor Notation is an attempt to define a mathematical notation with named axes. The central conceit is that deep learning is not linear algebra, and that by writing it as linear algebra we leave many technical details ambiguous for readers.
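The canonical example is attention. Standard notation leaves the softmax axis and the two sequence axes implicit; with named axes, every contraction and reduction names the axis it acts over (paraphrasing the notation here, not quoting the paper's exact macros):

```latex
% Standard:  Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
% Named axes, with Q : {seq, key}, K : {seq', key}, V : {seq', val}:
\[
  \mathrm{Attention}(Q, K, V) \;=\;
  \operatorname*{softmax}_{\mathsf{seq'}}\!\left(
    \frac{Q \mathbin{\underset{\mathsf{key}}{\cdot}} K}{\sqrt{|\mathsf{key}|}}
  \right)
  \mathbin{\underset{\mathsf{seq'}}{\cdot}} V
\]
```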
The biggest change in this version is more complete coverage of differential calculus, including worked examples of derivatives of attention.
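For a flavor of what those worked examples involve (this is the standard identity, not necessarily the draft's exact presentation), the key inner step is the derivative of softmax:

```latex
% With s = softmax(x):
\[
  \frac{\partial s_i}{\partial x_j} = s_i\,(\delta_{ij} - s_j),
  \qquad\text{equivalently}\qquad
  \mathrm{d}s_i = s_i\Bigl(\mathrm{d}x_i - \sum_j s_j\,\mathrm{d}x_j\Bigr).
\]
```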