There are like 4 more linear RNN papers out today, but they all use different naming conventions🙃
Might be nice if people synced on the "iconic" version like QKV? Personally partial to: h = Ah + Bx, y = Ch, where A, B = f(exp(d(x) i))
Griffin
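Roughly, in code (my own simplified sketch: a real sigmoid gate with B tied to 1 - A, not the complex exp(d(x) i) parameterization any of these papers actually use):

```python
import torch

# Minimal linear RNN sketch of h_t = A_t h_{t-1} + B_t x_t, y_t = C h_t.
# Shapes, the sigmoid gate, and B_t = 1 - A_t are simplifying assumptions.
def linear_rnn(x, d, C):
    # x: (T, dim) inputs; d: maps a (dim,) input to (dim,) gate logits; C: (dim, dim)
    h = torch.zeros(x.shape[1])
    ys = []
    for x_t in x:
        a_t = torch.sigmoid(d(x_t))      # A_t: input-dependent decay in (0, 1)
        h = a_t * h + (1 - a_t) * x_t    # B_t tied to A_t for simplicity
        ys.append(h @ C)                 # y_t = C h_t
    return torch.stack(ys)
```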
Aug 1, 2023 • 7 tweets • 3 min read
Lots of folks reached out to me yesterday about the Rust ML and LLM community. Seems like a supportive and intellectually curious community, so I wanted to highlight some of the projects you should check out 🧵
dfdx is a static shape-typed tensor library. Uses lots of Rust features and supports full backprop. github.com/coreylowman/df…
May 10, 2023 • 5 tweets • 2 min read
Pretraining without Attention (arxiv.org/abs/2212.10544) - BiGS is an alternative to BERT, trained on sequences of up to 4096 tokens.
Attention can be overkill. Below shows *every* word-word interaction for every sentence over 23 layers of BiGS (no heads, no n^2).
Core architecture is a state-space model. But that's just a fancy way of parameterizing a 1D CNN. This is the whole thing that replaces attention.
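To make that concrete, here is a rough sketch of how a diagonal state-space layer unrolls into a long 1D convolution (assumed shapes, single channel, not the actual BiGS code):

```python
import torch

# A diagonal SSM h_t = A h_{t-1} + B x_t, y_t = C h_t unrolls into a
# convolution with kernel K[k] = sum_n C[n] * A[n]**k * B[n].

def ssm_conv_kernel(A, B, C, L):
    # A, B, C: (d_state,) diagonal parameters; L: sequence length
    k = torch.arange(L)
    return torch.einsum("n,nk,n->k", C, A[:, None] ** k, B)

def ssm_layer(x, A, B, C):
    # x: (L,) single-channel sequence; causal convolution via FFT
    L = x.shape[0]
    K = ssm_conv_kernel(A, B, C, L)
    # zero-pad to 2L so the circular FFT convolution becomes a linear, causal one
    y = torch.fft.irfft(torch.fft.rfft(x, 2 * L) * torch.fft.rfft(K, 2 * L), 2 * L)
    return y[:L]
```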
Adds examples with agents, tools, streaming, and more Gradio autovis.
There are about 10 examples of popular prompts here at srush-minichain.hf.space
This is mostly an experiment in API design. Trying to keep things explicit and minimal. For example, there is no explicit "Agent" or "Tool" abstraction: you build the ReAct agent by just calling functions.
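For illustration, a hypothetical sketch of that style (not MiniChain's actual API): a ReAct-style agent is just a loop over an `llm` completion function and a dict of tool functions.

```python
def react(question, llm, tools, max_steps=5):
    # llm: a text -> text completion function; tools: dict mapping tool name -> function
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")          # the model continues the scratchpad
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                        # e.g. "Action: search[linear RNNs]"
            name, arg = step.split("Action:")[-1].strip().split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return transcript
```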
A rigorous description, opinionated style guide, and gentle polemic for named tensors in math notation.
* Macros: ctan.org/tex-archive/ma…
Named Tensor Notation is an attempt to define a mathematical notation with named axes. The central conceit is that deep learning is not linear algebra, and that by using linear algebra notation we leave many technical details ambiguous to readers.
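A toy code analogy of that ambiguity (my example, in einsum rather than the paper's math notation): plain matrix multiplication leaves the contracted axis implicit, while naming the axes spells it out.

```python
import torch

# Linear-algebra style vs. named-axis style for applying a weight matrix.
batch, seq, dim = 2, 5, 16
X = torch.randn(batch, seq, dim)
W = torch.randn(dim, dim)

Y1 = X @ W                              # which axis does W act on? the shapes decide silently
Y2 = torch.einsum("bsd,de->bse", X, W)  # the contracted axis ("d") is named explicitly
assert torch.allclose(Y1, Y2)
```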
Dec 20, 2022 • 7 tweets • 4 min read
Blog Post (w/ @gail_w): On "Thinking Like Transformers"
In which I get a bit obsessed with learning how to code in Transformer lang🤖.
(You can follow along or do the exercises yourself in a colab notebook.)
The blog post walks through the constructs needed to build a computational model that reflects the transformer architecture.
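As a taste of those constructs, a rough numpy sketch in the spirit of the post (not its actual library code): `select` builds an attention-like boolean pattern and `aggregate` averages values over the selected positions.

```python
import numpy as np

def select(keys, queries, predicate):
    # matrix[q][k] = predicate(keys[k], queries[q]) -- a boolean attention pattern
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(selector, values):
    # uniform average of the selected values at each query position
    values = np.asarray(values, dtype=float)
    weights = selector / np.maximum(selector.sum(-1, keepdims=True), 1)
    return weights @ values

# Example: shift each token's value one position to the right.
tokens = [3, 1, 4, 1]
shift = select(range(4), range(4), lambda k, q: k == q - 1)
print(aggregate(shift, tokens))  # [0., 3., 1., 4.]
```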
Oct 13, 2022 • 14 tweets • 4 min read
It's a joke that all NLP talks must include this graph.
But if you are a student it is a bit intimidating. How can you become an expert in where we are going if you can barely run BERT?
I asked Twitter for specific advice on what you might focus on: 1) Know the scaling details of the models
You use GPUs every day, but do you (actually) know how they work?
GPU-Puzzles (v0.1) - 14 short puzzles in Python with a visual debugger. No background required. Do puzzles, learn CUDA.
Link: github.com/srush/GPU-Puzz…
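For flavor, a tiny kernel in the same style as the puzzles (my own example, not one of the actual 14; needs a CUDA GPU or Numba's CUDA simulator):

```python
from numba import cuda
import numpy as np

# Each thread adds 10 to one element of the array, indexed by its thread id.
@cuda.jit
def add_ten(out, a):
    i = cuda.threadIdx.x      # one thread per element, single block
    if i < a.size:            # guard against extra threads
        out[i] = a[i] + 10

a = np.arange(8, dtype=np.float32)
out = np.zeros_like(a)
add_ten[1, 8](out, a)         # launch: 1 block, 8 threads
print(out)                    # [10. 11. ... 17.]
```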
Last year I taught CUDA in my ML class (because I think it is super important), and it was the closest I have ever come to a full class revolution. For whatever reason, parallel programming is just a hard thing to think about.
Jul 5, 2022 • 8 tweets • 5 min read
Last week I got a bit obsessive and decided to draw an extremely complex clock face from scratch in Python.
Not that anyone asked, but here is a notebook describing each step of the process.
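For a sense of the starting point, here is a matplotlib sketch of the very first step, tick marks via basic trigonometry (my own toy version, not the notebook's code, which goes much further):

```python
import numpy as np
import matplotlib.pyplot as plt

# Place 12 hour ticks and labels around a circle.
fig, ax = plt.subplots(figsize=(4, 4))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
for hour in range(12):
    theta = np.pi / 2 - 2 * np.pi * hour / 12            # 12 o'clock at the top
    x, y = np.cos(theta), np.sin(theta)
    ax.plot([0.9 * x, x], [0.9 * y, y], color="black")   # tick mark
    ax.text(0.78 * x, 0.78 * y, str(hour or 12), ha="center", va="center")
ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
ax.set_aspect("equal"); ax.axis("off")
plt.show()
```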
What the heck is a “prompt engineer”, and what are the tools that they would use? /1
A prompt is a snippet of text that specifies an NLP task.
If you want to do summarization, you prompt "Read {article} and write a shorter version ... "; if you want to do classification, "From the review {review} what did the critic think of the restaurant ..".
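In code, the idea is just template filling (a sketch; `call_model` is a hypothetical stand-in for whatever LLM API you use):

```python
# A prompt is a text template that gets filled in and sent to a model.
SUMMARIZE = "Read {article} and write a shorter version ... "
CLASSIFY = "From the review {review} what did the critic think of the restaurant .."

def summarize(article, call_model):
    return call_model(SUMMARIZE.format(article=article))

def classify(review, call_model):
    return call_model(CLASSIFY.format(review=review))
```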
Oct 8, 2021 • 10 tweets • 4 min read
Ill-Advised Rant: Despite working in AI for a decade, I find the introduction chapter in the AI textbook to be nearly alien language.
(I hope this is not seen as punching down as this is one of the most popular CS texts in the world) /1
AI, we are told, is about agents. Like most functions, they look like this.
Inputs are now called "percepts", and outputs are made by "actuators". Not working in robotics, I have never heard these words used outside this chapter. /2
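The point, as code (my toy example): strip the vocabulary away and an "agent" is a function from inputs ("percepts") to outputs that get sent to "actuators".

```python
# A thermostat "agent": percept in, action out. Just a function.
def thermostat_agent(percept: float) -> str:
    # percept: room temperature in Celsius; return value: what the actuator should do
    return "heat_on" if percept < 20.0 else "heat_off"
```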
Dec 18, 2020 • 5 tweets • 3 min read
What's the difference between layernorm and batchnorm?
What's the difference between 1d, 2d, and 3d convolutions?
PyTorch Docs: ¯\_(ツ)_/¯
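For the first question, one answer in code (my example, not from the docs):

```python
import torch

# For an input of shape (batch, features): BatchNorm normalizes each feature
# across the batch; LayerNorm normalizes each example across its features.
# Affine weights start at identity, so the built-ins match the manual versions.
x = torch.randn(32, 64)
eps = 1e-5

bn = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + eps)                              # stats over the batch dim
ln = (x - x.mean(1, keepdim=True)) / torch.sqrt(x.var(1, unbiased=False, keepdim=True) + eps)  # stats over the feature dim

assert torch.allclose(bn, torch.nn.BatchNorm1d(64, eps=eps)(x), atol=1e-5)
assert torch.allclose(ln, torch.nn.LayerNorm(64, eps=eps)(x), atol=1e-5)
```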
Sep 7, 2020 • 15 tweets • 6 min read
Thread: Last week, a list of 100 important NLP papers (github.com/mhagiwara/100-…) went viral. The list is okay, but it has almost *no* papers with female first authors.
NLP is rich with amazing female researchers and mentors. Here is one paper I like for each area on the list:
Area: Discourse
Modeling Local Coherence: An Entity-Based Approach
1/ Spent the last couple of weeks in quarantine obsessively coding a website for Virtual ICLR with @hen_str. We wanted to build something that was fun to browse, async-first, and felt alive.
2/ We built the main interface around a chat portal, with the idea that the main success in async communication has been chat apps like Slack.
Apr 2, 2020 • 7 tweets • 4 min read
Open-Science NLP Bounty: ($100 + $100 to charity)
Task: A notebook demonstrating experiments within 30(!) PPL (<84) of this widely cited LM baseline on PTB / WikiText-2 using any non-pretrained, word-only Transformer variant.
The state of benchmarking in NLP right now is so strange. These goofy websites keep precisely curated leaderboards (paperswithcode.com/sota/language-…), and hardworking grad students cannot get within 2x(!) of these reported results.