Armen Aghajanyan
Co-founder & CEO @perceptroninc; ex-RS FAIR/MSFT
Jan 26 · 10 tweets
There is an unprecedented level of cope around DeepSeek, and very little signal on X around R1. I recommend unfollowing anyone spreading conspiracy theories around R1/DeepSeek in general. (1/9)

First, the DeepSeek team is incredible and has been putting out absolutely fantastic work since their first model, especially around efficiency. MLA allows for a ~10x reduction in KV-cache memory. They got efficient MoE with >8 experts working with a near-perfect implementation (optimal comm/compute overlap) on second-grade interconnect. (2/9)
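To make the KV-cache claim concrete, here is a back-of-the-envelope sketch in plain Python. All dimensions are my own illustrative assumptions, not DeepSeek's actual configuration; it compares per-token cache size for standard multi-head attention against MLA-style latent caching, where only a low-rank latent (plus a small decoupled RoPE key) is stored instead of full per-head keys and values.

```python
# Back-of-the-envelope KV-cache arithmetic: standard MHA vs. MLA-style caching.
# All dimensions below are illustrative assumptions, not DeepSeek's real config.

n_layers = 32        # transformer layers (assumed)
n_heads = 32         # attention heads (assumed)
d_head = 128         # per-head dimension (assumed)
d_latent = 512       # MLA compressed KV latent dimension (assumed)
d_rope = 64          # decoupled RoPE key dimension (assumed)
bytes_per_el = 2     # bf16/fp16

# Standard MHA caches full keys and values for every head in every layer.
mha_per_token = n_layers * n_heads * d_head * 2 * bytes_per_el

# MLA caches one shared low-rank latent (plus a small RoPE key) per layer;
# keys/values are re-projected from the latent at attention time.
mla_per_token = n_layers * (d_latent + d_rope) * bytes_per_el

print(f"MHA KV cache per token: {mha_per_token / 1024:.0f} KiB")
print(f"MLA KV cache per token: {mla_per_token / 1024:.0f} KiB")
print(f"Reduction: ~{mha_per_token / mla_per_token:.0f}x (order of magnitude; "
      f"the exact ratio depends on the chosen dimensions)")
```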
Jun 11, 2024 · 4 tweets
A common point of contention when training LLMs is what explicitly makes it into pre-training. Different people/groups have different risk profiles, and my job ended up being the one who constantly says no to people. Being explicit about risk profiles mitigates almost all of these issues. Here's what I have for my team.

There are foundational principles I will not budge on (efficiency, raw perception, one-loss). If a proposed addition/subtraction to pre-training passes those, we divide the technique into four risk buckets derived from an additional core principle of our team.

Compositional Risk is OK: Our team is okay with de-risking a setting where the individual components have been de-risked but the unified setting has not.
Mar 17, 2024 · 4 tweets
There is a commonly held belief that Transformers have no inductive biases and that any such biases are learned over the course of training. This is not true. Transformers have very strong inductive biases. For example, the residual connections force the model to learn refinement strategies.

Try training an image tokenizer (e.g., a VQ-VAE) with a ViT encoder, where the input image is broken into patches. The model will learn a discretization scheme that is very localized to your patch size (i.e., a single discrete token roughly only affects its own patch) instead of a global discretization/compression scheme.
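To illustrate the residual-connection point, here is a minimal, generic pre-norm residual block in PyTorch (my own sketch, not any particular model's code): each layer can only add an update to the running stream, which biases the network toward learning a sequence of refinements rather than arbitrary re-computations.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-norm residual block: output = x + f(norm(x)).

    The identity path carries the representation through untouched, so
    every layer's contribution is an additive delta on the same stream,
    which is an architectural bias toward iterative refinement.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.f = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(self.norm(x))  # the input is refined, never replaced

# Stacking blocks: the final representation is the input plus a sum of
# per-layer updates, not an arbitrary function that discards the input.
x = torch.randn(4, 16, 256)
stack = nn.Sequential(*[ResidualBlock(256, 1024) for _ in range(4)])
print(stack(x).shape)  # torch.Size([4, 16, 256])
```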
Jan 11, 2023 · 12 tweets
I'm excited to present: Scaling Laws for Generative Mixed-Modal Language Models. In this paper we explore the scaling properties of mixed-modal generative models, discovering new scaling laws that unify the contributions of individual modalities and the interactions between them.

Chinchilla proposed a mathematical formulation, known as compute-optimal scaling laws, to describe how the loss of a language model decreases as the model size (N) and the number of training tokens (D) increase.
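For reference, the Chinchilla-style parametric form being referred to is sketched below (standard functional form with symbolic constants; the mixed-modal paper builds on this with per-modality and interaction terms).

```latex
% Chinchilla-style compute-optimal loss, functional form only.
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
% N: number of model parameters, D: number of training tokens,
% E: irreducible loss, A, B, \alpha, \beta: fitted constants.
```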
Jan 20, 2022 · 13 tweets
I’m excited to present our paper CM3: A Causal Masked Multimodal Model of the Internet, where we train a model that can do zero-shot unconditional/conditional image generation (PixelCNN/DALL-E), image infilling/captioning, entity linking/disambiguation, and summarization, all with prompting!

Causal modeling has the benefit of per-token generation, while masked language models allow for bidirectional conditioning at the cost of only partial generation. We propose a new objective, causal masking, which combines the best of both worlds: full generation plus optional bidirectionality.
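Here is a rough sketch of the causal-masking data transform as described above (simplified to a single masked span; the sentinel token name is my own placeholder): a span is cut out of the document, a mask sentinel is left in its place, and the span is appended at the end, so a standard left-to-right model learns infilling with bidirectional context while keeping full generation when no mask is used.

```python
import random

def causal_mask(tokens, mask_token="<mask:0>"):
    """Causal-masking transform, simplified to one masked span.

    A contiguous span is removed from the document, a sentinel is left
    in its place, and the span is appended after a second sentinel at
    the end. Training a left-to-right LM on the result gives ordinary
    causal generation plus optional infilling conditioned on both sides
    of the hole.
    """
    n = len(tokens)
    start = random.randrange(n)
    end = random.randrange(start + 1, n + 1)
    span = tokens[start:end]
    return tokens[:start] + [mask_token] + tokens[end:] + [mask_token] + span

doc = ["the", "cat", "sat", "on", "the", "mat"]
print(causal_mask(doc))
# e.g. ['the', 'cat', '<mask:0>', 'the', 'mat', '<mask:0>', 'sat', 'on']
```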