Sasha Rush
Programmer, professor, currently in the Bay Area https://t.co/cZl0wTfYw7
Jan 7 12 tweets 3 min read
10 short videos about LLM infrastructure to help you appreciate Pages 12-18 of the DeepSeek-v3 paper (arxiv.org/abs/2412.19437) 🧵

youtube.com/watch?v=76gulN…
Apr 16, 2024 10 tweets 3 min read
There are like 4 more linear RNN papers out today, but they all use different naming conventions🙃

Might be nice if people synced on the "iconic" version like QKV? Personally partial to: h = Ah + Bx, y = Ch, where A, B = f(exp(d(x) i)). (Image: the Griffin parameterization.)
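For concreteness, that recurrence is a few lines of NumPy. A toy sketch of mine with fixed matrices (in the papers, A and B are typically functions of the input, as the f(exp(d(x) i)) above suggests):

```python
import numpy as np

def linear_rnn(x, A, B, C):
    """h_t = A h_{t-1} + B x_t ;  y_t = C h_t"""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                      # x: (seq_len, d_in)
        h = A @ h + B @ x_t            # state update
        ys.append(C @ h)               # readout
    return np.stack(ys)                # (seq_len, d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
y = linear_rnn(x,
               A=0.9 * np.eye(8),     # fixed here; input-dependent in the papers
               B=rng.normal(size=(8, 4)),
               C=rng.normal(size=(2, 8)))
```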
Aug 1, 2023 7 tweets 3 min read
Lots of folks reached out to me yesterday about the Rust ML and LLM community. Seems like a supportive and intellectually curious community, so I wanted to highlight some of the projects that you should check out 🧵

dfdx is a static shape-typed tensor library. Uses lots of Rust features and supports full backprop. github.com/coreylowman/df…
May 10, 2023 5 tweets 2 min read
Pretraining without Attention (arxiv.org/abs/2212.10544) - BiGS is an alternative to BERT trained on up to 4096 tokens.

Attention can be overkill. Below shows *every* word-word interaction for every sentence over 23 layers of BiGS (no heads, no n^2).

The core architecture is a state-space model. But that's just a fancy way of parameterizing a 1D CNN. This is the whole thing that replaces attention.
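That equivalence is easy to see by unrolling the recurrence: y_t = Σ_{s≤t} C A^(t−s) B x_s, a causal 1D convolution with kernel K[t] = C A^t B. A toy sketch of mine with a scalar state (not the BiGS code):

```python
import numpy as np

def ssm_as_rnn(x, A, B, C):
    """Step the recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = A * h + B * x_t
        ys.append(C * h)
    return np.array(ys)

def ssm_as_conv(x, A, B, C):
    """Same map, written as a causal 1D conv with kernel K[t] = C A^t B."""
    L = len(x)
    K = np.array([C * A**t * B for t in range(L)])
    return np.array([sum(K[t - s] * x[s] for s in range(t + 1)) for t in range(L)])

x = np.arange(8.0)
assert np.allclose(ssm_as_rnn(x, 0.5, 1.0, 2.0), ssm_as_conv(x, 0.5, 1.0, 2.0))
```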
Apr 19, 2023 5 tweets 3 min read
MiniChain (v0.3, github.com/srush/MiniChain) - a small library for prompt chaining.

Adds examples with agents, tools, streaming, and more, plus Gradio auto-visualization.

There are about 10 examples of popular prompts at srush-minichain.hf.space

This is mostly an experiment in API design. Trying to keep things explicit and minimal. For example, there is no explicit "Agent" or "Tool" abstraction. You build the ReAct agent by just calling functions.
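In that spirit, a ReAct-style loop really can be plain functions. A sketch of mine (`llm` is a stand-in for any completion call, not MiniChain's actual API):

```python
def llm(prompt: str) -> str:
    """Stand-in for any completion endpoint; not MiniChain's API."""
    raise NotImplementedError

def calculator(expr: str) -> str:
    return str(eval(expr))  # a "tool" is just a function (toy; don't eval untrusted strings)

def react(question: str, max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):
        out = llm(f"Question: {question}\n{scratchpad}Thought:")
        if "Final Answer:" in out:
            return out.split("Final Answer:")[1].strip()
        if "Action: Calculate[" in out:
            expr = out.split("Action: Calculate[")[1].split("]")[0]
            scratchpad += f"{out}\nObservation: {calculator(expr)}\n"
    return scratchpad
```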
Feb 27, 2023 5 tweets 2 min read
minichain (v0.1): github.com/srush/MiniChain
Tiny library for LLM apps.

Thanks for all the feedback! Added full code examples for chat, retrieval QA, and information extraction. 🧵

Full "ChatGPT" example with memory:

srush.github.io/MiniChain/exam…
Dec 22, 2022 4 tweets 3 min read
Named Tensor Notation (TMLR version, arxiv.org/abs/2102.13196 w/ @davidweichiang + @boazbaraktcs)

A rigorous description, opinionated style guide, and gentle polemic for named tensors in math notation.

* Macros: ctan.org/tex-archive/ma…

Named Tensor Notation is an attempt to define a mathematical notation with named axes. The central conceit is that deep learning is not linear algebra, and that by using linear algebra we leave many technical details ambiguous to readers.
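As a taste of the notation (my paraphrase; see the paper and macros for the real definitions), attention can name the axis it contracts over instead of relying on matrix orientation:

```latex
% Sketch in the spirit of the paper, not verbatim: Q, K, V carry named axes
% such as seq and key; contraction and softmax name the axis they act along.
\[
\operatorname{Attention}(Q, K, V) =
  \underset{\mathsf{seq}}{\operatorname{softmax}}
  \left( \frac{Q \underset{\mathsf{key}}{\odot} K}{\sqrt{|\mathsf{key}|}} \right)
  \underset{\mathsf{seq}}{\odot} V
\]
```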
Dec 20, 2022 7 tweets 4 min read
Blog Post (w/ @gail_w): On "Thinking Like Transformers"

In which I get a bit obsessed with learning how to code in Transformer lang 🤖.

github.com/srush/raspy

(You can follow along or do the exercises yourself in a Colab notebook.)

The blog post walks through the constructs for building a computational model that reflects the transformer architecture.
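To give a flavor, the language boils down to roughly two operations: build a boolean "attention" pattern, then average over it. A NumPy paraphrase of mine (raspy's actual API differs):

```python
import numpy as np

def select(keys, queries, predicate):
    """Boolean 'attention' pattern: sel[q, k] = predicate(keys[k], queries[q])."""
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(sel, values):
    """Average the selected values at each query position (uniform attention)."""
    sel = sel / np.maximum(sel.sum(-1, keepdims=True), 1)
    return sel @ values

# Example: shift-right by attending from position q to position q - 1.
idx = np.arange(5)
shifted = aggregate(select(idx, idx, lambda k, q: k == q - 1), idx.astype(float))
# shifted == [0, 0, 1, 2, 3]
```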
Oct 13, 2022 14 tweets 4 min read
It's a joke that all NLP talks must include this graph.

But if you are a student, it is a bit intimidating. How can you become an expert in where we are going if you can barely run BERT?

I asked Twitter for specific advice on what you might focus on:

1) Know the scaling details of the models

Oct 12, 2022 6 tweets 2 min read
Markup-to-Image Generation

(w/ Yuntian Deng, Nori Kojima; huggingface.co/spaces/yuntian…; arxiv.org/abs/2210.05147)

Benchmark of four different markup-to-image tasks: Math (LaTeX), Tables (HTML), Music (LilyPond), and Molecules (SMILES).

Pixel-level deterministic evaluation.
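One way to make that concrete (my sketch; the paper defines its own metrics): render both the gold and the generated markup to images and compare pixels exactly.

```python
import numpy as np

def pixel_match(pred_img, gold_img):
    """Fraction of exactly matching pixels between two same-size renders."""
    pred, gold = np.asarray(pred_img), np.asarray(gold_img)
    assert pred.shape == gold.shape, "compare renders at the same resolution"
    return float((pred == gold).mean())
```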
Jul 12, 2022 4 tweets 2 min read
You use GPUs every day, but do you (actually) know how they work?

GPU-Puzzles (v0.1) - 14 short puzzles in Python with a visual debugger. No background required. Do puzzles, learn CUDA.

Link: github.com/srush/GPU-Puzz…

Last year I taught CUDA in my ML class (because I think it is super important), and it was the closest I have ever come to a full class revolution. For whatever reason, parallel programming is just a hard thing to think about.
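The puzzles are written with Numba's CUDA bindings, so "CUDA in Python" looks roughly like this. A warm-up-style kernel of my own, not one of the actual puzzles (needs a CUDA-capable GPU to run):

```python
from numba import cuda
import numpy as np

@cuda.jit
def add10(out, a):
    # One thread per element; compute this thread's global index.
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i < a.size:            # guard: we may launch more threads than elements
        out[i] = a[i] + 10

a = np.arange(32, dtype=np.float32)
out = np.zeros_like(a)
add10[2, 16](out, a)          # launch 2 blocks of 16 threads
```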
Jul 5, 2022 8 tweets 5 min read
Last week I got a bit obsessive, and decided to draw an extremely complex clock-face from scratch in Python.

Not that anyone asked, but here is a notebook describing each step of the process.

Blog: danoneata.github.io/chalk/examples…
Library: github.com/danoneata/chalk

The introduction covers the design and specification of the Roman numerals (I did promise it was obsessive).
Oct 19, 2021 13 tweets 6 min read
Recently, many NLP folks have joked: "We used to build models! Soon we'll just write prompts."

Last summer, we (led by @stevebach & @SanhEstPasMoi) forced ourselves to take this literally.

What the heck is a "prompt engineer", and what are the tools that they would use? /1

A prompt is a snippet of text that specifies an NLP task.

If you want to do summarization, you prompt "Read {article} and write a shorter version ... "; if you want to do classification, "From the review {review} what did the critic think of the restaurant ..".
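Mechanically, that's all a prompt is: a template rendered against the task input (a toy sketch of mine):

```python
def apply_prompt(template: str, **fields) -> str:
    """A 'prompt' is just a text template filled with the task input."""
    return template.format(**fields)

summarize = "Read {article} and write a shorter version ..."
print(apply_prompt(summarize, article="A very long news story ..."))
```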
Oct 8, 2021 10 tweets 4 min read
Ill-Advised Rant: Despite working in AI for a decade, I find the introduction chapter in the AI textbook to be nearly alien language.

(I hope this is not seen as punching down, as this is one of the most popular CS texts in the world) /1

AI, we are told, is about agents. Like most functions, they look like this.

Inputs are now called "percepts", outputs are made by "actuators". Not working in robotics, I have never heard these words used outside this chapter. /2
Dec 18, 2020 5 tweets 3 min read
What's the difference between layernorm and batchnorm?

PyTorch docs: ¯\_(ツ)_/¯

Check out named tensor notation:
What's the difference between 1d, 2d, and 3d convolutions?

PyTorch Docs: ¯\_(ツ)_/¯
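For the record, both answers fit in a few lines of PyTorch (my summary, not the docs'): batchnorm averages over the batch (and length) per channel, layernorm averages over the feature axes per example; the convolutions differ only in how many spatial axes the kernel slides over.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32)        # (batch, channels, length)

bn = nn.BatchNorm1d(16)           # stats over (batch, length), one pair per channel
ln = nn.LayerNorm(32)             # stats over the last axis, per example/position

print(bn(x).shape, ln(x).shape)   # both torch.Size([8, 16, 32])

# Conv1d expects (N, C, L); Conv2d (N, C, H, W); Conv3d (N, C, D, H, W).
```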
Sep 7, 2020 15 tweets 6 min read
Thread: Last week, a list of 100 important NLP papers (github.com/mhagiwara/100-…) went viral. The list is okay, but it has almost *no* papers with female first authors.

NLP is rich with amazing female researchers and mentors. Here is one paper I like for each area on the list:

Area: Discourse

Modeling Local Coherence: An Entity-Based Approach

Regina Barzilay and Mirella Lapata

people.csail.mit.edu/regina/my_pape…
Apr 24, 2020 13 tweets 6 min read
1/ Spent the last couple of weeks in quarantine obsessively coding a website for Virtual ICLR with @hen_str. We wanted to build something that was fun to browse, async-first, and felt alive.

2/ We built the main interface around a chat portal, with the idea that the main success in async communication has been chat apps like Slack.
Apr 2, 2020 7 tweets 4 min read
Open-Science NLP Bounty: ($100 + $100 to charity)

Task: A notebook demonstrating experiments within 30(!) PPL (<84) of this widely cited LM baseline on PTB / WikiText-2 using any non-pretrained, word-only Transformer variant.

Context: The state of benchmarking in NLP right now is so strange. These goofy websites keep precisely-curated leaderboards (paperswithcode.com/sota/language-…), and hardworking grad students cannot get within 2x of these reported results.