Jay Alammar
Machine learning and language models R&D. Builder. Writer. Visualizing AI, ML, and LLMs one concept at a time. @Cohere. https://t.co/TquuQXlLOJ
Dec 12, 2022 4 tweets 3 min read
New model alert!

@CohereAI's new embedding model supports 100+ languages and delivers 3X better performance than existing open-source models.

See the post by @Nils_Reimers and @amrmkayid: txt.cohere.ai/multilingual/

A glance at the benchmarks comparing it to:
- paraphrase-multilingual-mpnet-base-v2
- LaBSE
- Universal Sentence Encoder cMLM

(more details in the post)
Nov 25, 2022 6 tweets 3 min read
Big update to "The Illustrated Stable Diffusion" post

jalammar.github.io/illustrated-st…

14 new and updated visuals.

The biggest update is that forward diffusion is now explained more precisely -- not as a process of steps (which are easy to confuse with de-noising steps).

-1- Forward Diffusion is the process of making training examples by sampling an image, noise, and a noise amount, then mixing them to create a training example (sketched in code below).

-2-
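Here's a minimal PyTorch sketch of that sampling-and-mixing step in the DDPM style (the schedule and tensor shapes are illustrative assumptions, not from the post):

```python
import torch

# Toy noise schedule: alpha_bar shrinks as the noise amount t grows
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def make_training_example(image, t):
    """Sample noise and mix it with the image at noise amount t."""
    noise = torch.randn_like(image)                      # sampled noise
    a = alpha_bar[t]                                     # how much signal survives
    noisy = a.sqrt() * image + (1.0 - a).sqrt() * noise  # the mix
    return noisy, noise   # model input, and the target it learns to predict

image = torch.randn(3, 64, 64)   # stand-in for a sampled training image
noisy, target = make_training_example(image, t=500)
```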
Nov 4, 2022 4 tweets 3 min read
It feels to me like @nickfrosst lives in the future of LLMs. We have these internal demo sessions, and what Nick builds and presents often feels magical.



This video on github.com/cohere-ai/sand… is a taste of that.

Blog post: txt.cohere.ai/building-a-sea…

What this means to me, more precisely, is the dexterity of problem-solving with LLM primitives, breaking a problem into a pipeline of components (sketched below):
1) Regex
2) GPT + prompt X
3) pipe that into API Z
4) Embedding then similarity search
5) GPT + prompt Y
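A toy sketch of such a pipeline; every helper here (llm, embed, call_api_z) is a hypothetical placeholder rather than a real API:

```python
import re
import numpy as np

def llm(prompt):                       # placeholder for any LLM generation call
    return prompt.splitlines()[-1]     # echoes input, just to keep this runnable

def embed(text):                       # placeholder for any embedding call
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(8)

def call_api_z(query):                 # "API Z" stands in for an external service
    return [f"document about {word}" for word in query.split()]

def answer(question):
    topic = re.sub(r"[^\w\s]", "", question).strip()         # 1) regex
    query = llm(f"Rewrite as a search query:\n{topic}")       # 2) GPT + prompt X
    candidates = call_api_z(query)                            # 3) pipe into API Z
    sims = [float(embed(query) @ embed(d)) for d in candidates]
    best = candidates[int(np.argmax(sims))]                   # 4) embed + similarity
    return llm(f"Context: {best}\nAnswer: {question}")        # 5) GPT + prompt Y
```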
Oct 4, 2022 5 tweets 2 min read
The Illustrated Stable Diffusion

jalammar.github.io/illustrated-st…

New post!

Over 30 visuals explaining how Stable Diffusion works (diffusion, latent diffusion, CLIP, and a lot more).

When generating an image with Stable Diffusion, it's useful to think of 3 main components in the process.

1- Text encoder, translating words into numbers
2- Image information creator, takes multiple steps refining image information
3- Image decoder, paints the final image

-2-
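For the code-minded, the three components map onto attributes of Hugging Face diffusers' StableDiffusionPipeline (the library and model id here are my assumptions, not part of the post):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.text_encoder   # 1- text encoder: translates words into numbers (CLIP)
pipe.unet           # 2- image information creator: refines latents step by step
pipe.vae            # 3- image decoder: paints the final image from the latents

image = pipe("an astronaut riding a horse").images[0]
```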
Sep 20, 2022 9 tweets 4 min read
AI image generation is the most recent mind-blowing AI capability.

#StableDiffusion is a clear milestone in this development because it made a high-performance model available to the masses.

This is how it works.

1/n

It is versatile in that it can be used in a number of different ways.

Text => Image (like the image above) is the main use case.

Another one is (Image + Text) => Image (Like this image). This is called img2img.

2/n
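As a hedged sketch, img2img in Hugging Face diffusers looks roughly like this (library choice and parameter names are assumptions based on recent diffusers versions):

```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

init = Image.open("sketch.png").convert("RGB").resize((512, 512))
# strength controls how much the input image is repainted (0 = keep, 1 = replace)
result = pipe(prompt="a watercolor landscape", image=init, strength=0.6).images[0]
result.save("img2img.png")
```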
Sep 19, 2022 4 tweets 2 min read
The intelligence of generative LLMs is surprising. But people often overestimate it in some areas while underestimating it in others.

Key concepts:
1- It's important to augment their information with the right sources.

So it's not
Human: Question
GPT: Factual answer.

It's more:

2- Think of them as tools for surgically applying intelligence to subproblems, not as standalone intelligences themselves.

Best if each module has its own tests and human verification of its behavior (a sketch follows the exchange below).

Human: question
System [step 1]
System [step 2]
System [step 3]
System: answer
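A toy sketch of that system-of-steps shape, with a check on each module (all helpers are hypothetical placeholders):

```python
def llm(prompt):               # hypothetical LLM call
    return f"[completion for: {prompt[:30]}...]"

def search_sources(question):  # hypothetical retrieval over the right sources
    return ["passage one", "passage two"]

def system(question):
    docs = search_sources(question)            # step 1: augment with sources
    assert docs, "step 1 failed: no sources retrieved"
    facts = llm(f"Extract relevant facts:\n{docs}\nQ: {question}")   # step 2
    assert facts.strip(), "step 2 failed: no facts extracted"
    return llm(f"Facts: {facts}\nAnswer: {question}")                # step 3

print(system("When did the airport open?"))
```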
May 25, 2022 11 tweets 5 min read
So many exciting things happening in ML these days.

DeepMind's Gato is the direction I'm excited about the most.

One small-ish model that learns text, images, video-game playing, and robotic sensing and control.

Everything is a sequence!

Let's work out how:

1/n

Gato, the Generalist Agent Transformer, is a single model trained on 604 tasks in different modalities.

Its architecture is a Transformer decoder (jalammar.github.io/illustrated-gp…).

Its input tokens are not only text, however.

2/n
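The Gato paper describes discretizing continuous inputs (like robot sensor readings) with mu-law companding so they share one token vocabulary with text; here's a rough sketch, with the paper's constants quoted from memory:

```python
import numpy as np

def continuous_to_tokens(x, mu=100.0, M=256.0, bins=1024):
    # mu-law compand values into [-1, 1], then cut into uniform bins
    companded = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(M * mu + 1.0)
    companded = np.clip(companded, -1.0, 1.0)
    return np.digitize(companded, np.linspace(-1.0, 1.0, bins + 1)[1:-1])

joint_angles = np.array([0.13, -1.7, 2.4])
action_tokens = continuous_to_tokens(joint_angles)
# text tokens, image-patch tokens, and these action tokens are then
# concatenated into one sequence for the Transformer decoder
```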
May 24, 2022 4 tweets 2 min read
The results are in, and they're unexpected! The concept of a layer does indeed seem ambiguous.

We pull a switcheroo on newcomers by teaching them one thing in theory, and another in code.

The majority voted 3 layers. I see where they're coming from, but I disagree.

It's understandable because of figures like this one; they are the most common way to introduce people to neural networks. Clearly three layers:
- input
- hidden
- output

But when you implement it, it's two Dense layers in Keras or two Linear layers in PyTorch.
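Here's the implementation view -- two weight layers, whichever framework you pick:

```python
import torch.nn as nn
from tensorflow import keras

# PyTorch: two Linear layers
model_pt = nn.Sequential(
    nn.Linear(4, 8),   # input -> hidden
    nn.ReLU(),
    nn.Linear(8, 1),   # hidden -> output
)

# Keras: two Dense layers
model_keras = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1),
])
```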
May 9, 2022 6 tweets 5 min read
Combing For Insight in 10,000 Hacker News Posts With Text Clustering

txt.cohere.ai/combing-for-in…

New blog post!

I embedded and clustered the top HN posts looking for insight on personal/career development. I built an interactive map and found ~700 posts that fit the bill.

1/n
The clusters I was most excited to find are:
1- Life experiences and advice: assets.cohere.ai/blog/text-clus…
2- Technical and personal development: assets.cohere.ai/blog/text-clus…
3- Software career insights, advice, and discussions: assets.cohere.ai/blog/text-clus…

2/n
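A minimal sketch of the embed-then-cluster workflow (the titles and random "embeddings" here are placeholders; the post used real embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

titles = ["Ask HN: How do you avoid burnout?",
          "Show HN: My weekend side project",
          "Ask HN: Advice for a first-time manager?"]

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((len(titles), 768))  # stand-in for real embeddings

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for title, label in zip(titles, labels):
    print(label, title)
```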
Dec 25, 2021 13 tweets 6 min read
A 🧵 looking at DeepMind's Retro Transformer, which at 7.5B parameters is on par with GPT3 and models 25X its size in knowledge-intensive tasks.

A big moment for Large Language Models (LLMs) for reasons I'll mention in this thread.

deepmind.com/research/publi…

Language modeling trains models to predict the next word.

Sometimes, the completion requires knowledge of factual information. Other times, familiarity with language is enough (expressions, grammar).

Examples in the image. Completions:
1) 2021
2) time
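You can poke at next-word prediction in a couple of lines (a sketch using GPT-2 via the transformers library; outputs will vary):

```python
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

# a factual completion vs. a purely linguistic one
print(generate("The year after 2020 is", max_new_tokens=1))
print(generate("She glanced at her watch to check the", max_new_tokens=1))
```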
Apr 27, 2021 5 tweets 5 min read
Ecstatic to see "Machine learning research communication via illustrated and interactive web articles" published at @rethinkmlpapers workshop at #ICLR2021

In it, I describe my workflow for communicating ML to millions of readers.

Paper: openreview.net/pdf?id=WUrcJoy…

1/5

I discuss five key ML communication artifacts:
1- The hero image
2- The Twitter thread
3- The illustrated article
4- The interactive article
5- Interpretability software

Here are excellent examples of 1 and 2 from @ch402, @karpathy, and @maithra_raghu.

2/5
Feb 15, 2021 5 tweets 3 min read
Einsum is a key method for summing and multiplying tensors. It's implemented in @numpy_team, @TensorFlow, and @PyTorch. Here's a visual intro to Einstein summation functions.

1/n

The einsum expression has two components.
1) Subscripts string -- notation indicating what we want to do with the input arrays.
2) The input arrays. These can be one or more arrays/tensors of varying dimensions.
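A few canonical numpy examples of those two components in action:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)

np.einsum("ij->", A)          # sum of all elements
np.einsum("ij->ji", A)        # transpose
np.einsum("ij,jk->ik", A, B)  # matrix multiplication (sum over shared j)
np.einsum("ij,ij->i", A, A)   # row-wise dot products
```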
Feb 6, 2021 5 tweets 3 min read
The covariance matrix is an essential tool for analyzing relationships in data. In numpy, you can use the np.cov() function to calculate it (numpy.org/doc/stable/ref…).

Here's a shot at visualizing the elements of the covariance matrix and what they mean:
1/5

Along the diagonal are the variances of each row. Variance indicates the spread of the values (1 and 4 in this case) from their average.

2/5
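A quick numpy check of that diagonal (toy data, not the post's):

```python
import numpy as np

# two rows = two variables, columns = observations
data = np.array([[1, 2, 3, 4],
                 [2, 4, 6, 8]])

cov = np.cov(data)
print(cov)                      # the 2x2 covariance matrix
print(np.var(data[0], ddof=1),  # matches cov[0, 0]
      np.var(data[1], ddof=1))  # matches cov[1, 1]
```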
Jan 19, 2021 6 tweets 4 min read
Finding the Words to Say: Hidden State Visualizations for Language Models

jalammar.github.io/hidden-states/

New post! Visualizations glancing at the "thought process" of language models & how it evolves between layers. Builds on awesome work by @nostalgebraist @lena_voita @tallinzen.

1/n

If we ask GPT2 to fill in the blank:
Heathrow airport is located in the city of ___

Would it fill it correctly?

Yes.

Which layer stored that knowledge?

The various layers successively increased the ranking of the word "London", but Layer 5 ultimately raised it from 37 to 1.
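A rough sketch in the spirit of that analysis -- project each layer's hidden state through the output head and watch " London" climb (not the post's exact code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("Heathrow airport is located in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

london = tok.encode(" London")[0]
for layer, h in enumerate(out.hidden_states):
    # project this layer's last-position hidden state into vocabulary space
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    rank = int((logits > logits[london]).sum()) + 1
    print(f"layer {layer:2d}: ' London' ranked {rank}")
```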
Nov 21, 2020 6 tweets 6 min read
So many fascinating ideas at yesterday's #blackboxNLP workshop at #emnlp2020. Too many bookmarked papers. Some takeaways:
1- There's more room to adopt input saliency methods in NLP. Grad*input and Integrated Gradients are key gradient-based methods. See: aclweb.org/anthology/2020… aclweb.org/anthology/2020… aclweb.org/anthology/2020…
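Grad*input itself is a one-liner once you have gradients; here's a toy stand-in for a real model:

```python
import torch

x = torch.tensor([0.5, -1.2, 3.0], requires_grad=True)  # "input features"
w = torch.tensor([1.0, 2.0, -0.5])

score = (w * x).sum()   # stand-in for a model's output score
score.backward()

saliency = (x.grad * x).detach()   # gradient * input attribution per feature
print(saliency)
```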
Jul 21, 2020 14 tweets 7 min read
How GPT3 works. A visual thread.

A trained language model generates text.

We can optionally pass it some text as input, which influences its output.

The output is generated from what the model "learned" during its training period, in which it scanned vast amounts of text.

1/n Training is the process of exposing the model to lots of text. It has been done once and complete. All the experiments you see now are from that one trained model. It was estimated to cost 355 GPU years and cost $4.6m.

2/n
Jul 14, 2020 6 tweets 6 min read
On the transformer side of #acl2020nlp, three works stood out to me as relevant if you've followed the Illustrated Transformer/BERT series on my blog:
1- SpanBERT
2- BART
3- Quantifying Attention Flow
(1/n)

SpanBERT (by @mandarjoshi_ @danqi_chen @YinhanL @dsweld @LukeZettlemoyer @omerlevy_) came out last year but was published in this year's ACL. It found that BERT pre-training works better when you mask contiguous spans of tokens, rather than BERT's scattered 15% of tokens.
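The masking difference in miniature (illustrative only; SpanBERT actually samples span lengths from a geometric distribution):

```python
import random

tokens = "the quick brown fox jumps over the lazy dog".split()

# BERT-style: ~15% of positions masked, scattered across the text
scattered = [t if random.random() > 0.15 else "[MASK]" for t in tokens]

# Span-style: one contiguous run of tokens masked together
start = random.randrange(len(tokens) - 3)
spanned = [("[MASK]" if start <= i < start + 3 else t) for i, t in enumerate(tokens)]

print(scattered)
print(spanned)
```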