There are like 4 more linear RNN papers out today, but they all use different naming conventions🙃
Might be nice if people synced on the "iconic" version like QKV? Personally partial to: h = Ah + Bx, y = Ch, where A, B = f(exp(d(x) i))
Griffin
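Roughly, in code (my own simplified sketch: a real sigmoid gate with B tied to 1 - A, not the complex exp(d(x) i) parameterization any of these papers actually use):

```python
import torch

# Minimal linear RNN sketch of h_t = A_t h_{t-1} + B_t x_t, y_t = C h_t.
# Shapes, the sigmoid gate, and B_t = 1 - A_t are simplifying assumptions.
def linear_rnn(x, d, C):
    # x: (T, dim) inputs; d: maps a (dim,) input to (dim,) gate logits; C: (dim, dim)
    h = torch.zeros(x.shape[1])
    ys = []
    for x_t in x:
        a_t = torch.sigmoid(d(x_t))      # A_t: input-dependent decay in (0, 1)
        h = a_t * h + (1 - a_t) * x_t    # B_t tied to A_t for simplicity
        ys.append(h @ C)                 # y_t = C h_t
    return torch.stack(ys)
```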
Aug 1, 2023 • 7 tweets • 3 min read
Lots of folks reached out to me yesterday about the Rust ML and LLM community. Seems like a supportive and intellectually curious community, so I wanted to highlight some of the projects you should check out 🧵
dfdx is a static shape-typed tensor library. Uses lots of Rust features and supports full backprop. github.com/coreylowman/df…
May 10, 2023 • 5 tweets • 2 min read
Pretraining without Attention (arxiv.org/abs/2212.10544) - BiGS is an alternative to BERT, trained on sequences of up to 4096 tokens.
Attention can be overkill. Below shows *every* word-word interaction for every sentence over 23 layers of BiGS (no heads, no n^2).
Core architecture is a state-space model. But that's just a fancy way of parameterizing a 1D CNN. This is the whole thing that replaces attention.
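To make that concrete, here is a rough sketch of how a diagonal state-space layer unrolls into a long 1D convolution (assumed shapes, single channel, not the actual BiGS code):

```python
import torch

# A diagonal SSM h_t = A h_{t-1} + B x_t, y_t = C h_t unrolls into a
# convolution with kernel K[k] = sum_n C[n] * A[n]**k * B[n].

def ssm_conv_kernel(A, B, C, L):
    # A, B, C: (d_state,) diagonal parameters; L: sequence length
    k = torch.arange(L)
    return torch.einsum("n,nk,n->k", C, A[:, None] ** k, B)

def ssm_layer(x, A, B, C):
    # x: (L,) single-channel sequence; causal convolution via FFT
    L = x.shape[0]
    K = ssm_conv_kernel(A, B, C, L)
    # zero-pad to 2L so the circular FFT convolution becomes a linear, causal one
    y = torch.fft.irfft(torch.fft.rfft(x, 2 * L) * torch.fft.rfft(K, 2 * L), 2 * L)
    return y[:L]
```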
Adds examples with agents, tools, streaming, and more Gradio autovis.
There are about 10 examples of popular prompts here at srush-minichain.hf.space
This is mostly an experiment in API design. Trying to keep things explicit and minimal. For example, there is no explicit "Agent" or "Tool" abstraction: you build the ReAct agent by just calling functions.
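For illustration, a hypothetical sketch of that style (not MiniChain's actual API): a ReAct-style agent is just a loop over an `llm` completion function and a dict of tool functions.

```python
def react(question, llm, tools, max_steps=5):
    # llm: a text -> text completion function; tools: dict mapping tool name -> function
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")          # the model continues the scratchpad
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                        # e.g. "Action: search[linear RNNs]"
            name, arg = step.split("Action:")[-1].strip().split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return transcript
```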
A rigorous description, opinionated style guide, and gentle polemic for named tensors in math notation.
* Macros: ctan.org/tex-archive/ma…
Named Tensor Notation is an attempt to define a mathematical notation with named axes. The central conceit is that deep learning is not linear algebra, and that by using linear algebra notation we leave many technical details ambiguous to readers.
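A toy code analogy of that ambiguity (my example, in einsum rather than the paper's math notation): plain matrix multiplication leaves the contracted axis implicit, while naming the axes spells it out.

```python
import torch

# Linear-algebra style vs. named-axis style for applying a weight matrix.
batch, seq, dim = 2, 5, 16
X = torch.randn(batch, seq, dim)
W = torch.randn(dim, dim)

Y1 = X @ W                              # which axis does W act on? the shapes decide silently
Y2 = torch.einsum("bsd,de->bse", X, W)  # the contracted axis ("d") is named explicitly
assert torch.allclose(Y1, Y2)
```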
Dec 20, 2022 • 7 tweets • 4 min read
Blog Post (w/ @gail_w): On "Thinking Like Transformers"
In which I get a bit obsessed with learning how to code in Transformer lang🤖.
(You can follow along or do the exercises yourself in a colab notebook.)
The blog post walks through the constructs needed to build a computational model that reflects the transformer architecture.
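As a taste of those constructs, a rough numpy sketch in the spirit of the post (not its actual library code): `select` builds an attention-like boolean pattern and `aggregate` averages values over the selected positions.

```python
import numpy as np

def select(keys, queries, predicate):
    # matrix[q][k] = predicate(keys[k], queries[q]) -- a boolean attention pattern
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(selector, values):
    # uniform average of the selected values at each query position
    values = np.asarray(values, dtype=float)
    weights = selector / np.maximum(selector.sum(-1, keepdims=True), 1)
    return weights @ values

# Example: shift each token's value one position to the right.
tokens = [3, 1, 4, 1]
shift = select(range(4), range(4), lambda k, q: k == q - 1)
print(aggregate(shift, tokens))  # [0., 3., 1., 4.]
```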
Oct 13, 2022 • 14 tweets • 4 min read
It's a joke that all NLP talks must include this graph.
But if you are a student it is a bit intimidating. How can you become an expert in where we are going if you can barely run BERT?
I asked Twitter for specific advice on what you might focus on: 1) Know the scaling details of the models
You use GPUs every day, but do you (actually) know how they work?
GPU-Puzzles (v0.1) - 14 short puzzles in Python with a visual debugger. No background required. Do puzzles, learn CUDA.
Link: github.com/srush/GPU-Puzz…
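For flavor, a tiny kernel in the same style as the puzzles (my own example, not one of the actual 14; needs a CUDA GPU or Numba's CUDA simulator):

```python
from numba import cuda
import numpy as np

# Each thread adds 10 to one element of the array, indexed by its thread id.
@cuda.jit
def add_ten(out, a):
    i = cuda.threadIdx.x      # one thread per element, single block
    if i < a.size:            # guard against extra threads
        out[i] = a[i] + 10

a = np.arange(8, dtype=np.float32)
out = np.zeros_like(a)
add_ten[1, 8](out, a)         # launch: 1 block, 8 threads
print(out)                    # [10. 11. ... 17.]
```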
Last year I taught CUDA in my ML class (because I think it is super important), and it was the closest I have ever come to a full class revolution. For whatever reason, parallel programming is just a hard thing to think about.
Jul 5, 2022 • 8 tweets • 5 min read
Last week I got a bit obsessive and decided to draw an extremely complex clock face from scratch in Python.
Not that anyone asked, but here is a notebook describing each step of the process.
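For a sense of the starting point, here is a matplotlib sketch of the very first step, tick marks via basic trigonometry (my own toy version, not the notebook's code, which goes much further):

```python
import numpy as np
import matplotlib.pyplot as plt

# Place 12 hour ticks and labels around a circle.
fig, ax = plt.subplots(figsize=(4, 4))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
for hour in range(12):
    theta = np.pi / 2 - 2 * np.pi * hour / 12            # 12 o'clock at the top
    x, y = np.cos(theta), np.sin(theta)
    ax.plot([0.9 * x, x], [0.9 * y, y], color="black")   # tick mark
    ax.text(0.78 * x, 0.78 * y, str(hour or 12), ha="center", va="center")
ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
ax.set_aspect("equal"); ax.axis("off")
plt.show()
```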
What the heck is a “prompt engineer”, and what are the tools that they would use? /1
A prompt is a snippet of text that specifies an NLP task.
If you want to do summarization, you prompt "Read {article} and write a shorter version ... "; if you want to do classification, "From the review {review} what did the critic think of the restaurant ..".
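In code, the idea is just template filling (a sketch; `call_model` is a hypothetical stand-in for whatever LLM API you use):

```python
# A prompt is a text template that gets filled in and sent to a model.
SUMMARIZE = "Read {article} and write a shorter version ... "
CLASSIFY = "From the review {review} what did the critic think of the restaurant .."

def summarize(article, call_model):
    return call_model(SUMMARIZE.format(article=article))

def classify(review, call_model):
    return call_model(CLASSIFY.format(review=review))
```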
Oct 8, 2021 • 10 tweets • 4 min read
Ill-Advised Rant: Despite working in AI for a decade, I find the introduction chapter in the AI textbook to be nearly alien language.
(I hope this is not seen as punching down as this is one of the most popular CS texts in the world) /1
AI, we are told, is about agents. Like most functions, they look like this.
Inputs are now called "percepts", and outputs are made by "actuators". Not working in robotics, I have never heard these words used outside this chapter. /2
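The point, as code (my toy example): strip the vocabulary away and an "agent" is a function from inputs ("percepts") to outputs that get sent to "actuators".

```python
# A thermostat "agent": percept in, action out. Just a function.
def thermostat_agent(percept: float) -> str:
    # percept: room temperature in Celsius; return value: what the actuator should do
    return "heat_on" if percept < 20.0 else "heat_off"
```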
Dec 18, 2020 • 5 tweets • 3 min read
What's the difference between layernorm and batchnorm?
What's the difference between 1d, 2d, and 3d convolutions?
PyTorch Docs: ¯\_(ツ)_/¯
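For the first question, one answer in code (my example, not from the docs):

```python
import torch

# For an input of shape (batch, features): BatchNorm normalizes each feature
# across the batch; LayerNorm normalizes each example across its features.
# Affine weights start at identity, so the built-ins match the manual versions.
x = torch.randn(32, 64)
eps = 1e-5

bn = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + eps)                              # stats over the batch dim
ln = (x - x.mean(1, keepdim=True)) / torch.sqrt(x.var(1, unbiased=False, keepdim=True) + eps)  # stats over the feature dim

assert torch.allclose(bn, torch.nn.BatchNorm1d(64, eps=eps)(x), atol=1e-5)
assert torch.allclose(ln, torch.nn.LayerNorm(64, eps=eps)(x), atol=1e-5)
```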
Sep 7, 2020 • 15 tweets • 6 min read
Thread: Last week, a list of 100 important NLP papers (github.com/mhagiwara/100-…) went viral. The list is okay, but it has almost *no* papers with female first authors.
NLP is rich with amazing female researchers and mentors. Here is one paper I like for each area on the list:
Area: Discourse
Modeling Local Coherence: An Entity-Based Approach
1/ Spent the last couple of weeks in quarantine obsessively coding a website for Virtual ICLR with @hen_str. We wanted to build something that was fun to browse, async-first, and felt alive.
2/ We built the main interface around a chat portal, with the idea that the main success in async communication has been chat apps like Slack.
Apr 2, 2020 • 7 tweets • 4 min read
Open-Science NLP Bounty: ($100 + $100 to charity)
Task: A notebook demonstrating experiments within 30(!) PPL (<84) of this widely cited LM baseline on PTB / WikiText-2 using any non-pretrained, word-only Transformer variant.
The state of benchmarking in NLP right now is so strange. These goofy websites keep precisely curated leaderboards (paperswithcode.com/sota/language-…), and hardworking grad students cannot get within 2x(!) of these reported results.