Percy Liang
Associate Professor in computer science @Stanford @StanfordHAI @StanfordCRFM @StanfordAILab @stanfordnlp #foundationmodels | Pianist
Aug 20 10 tweets 3 min read
LM agents are consequential for cybersecurity, both for offense (cyber risk) and defense (penetration testing). To measure these capabilities, we are excited to release Cybench, a new cybersecurity benchmark consisting of 40 professional Capture the Flag (CTF) tasks.

The tasks are taken from 4 CTF competitions (HackTheBox, SekaiCTF, Glacier, HKCert). Big thanks to the organizers for releasing these challenges with solution writeups; this work would not have been possible without them.
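To make the task format concrete, here is a minimal sketch of how a CTF task could be represented and scored; the field names and the scoring function are illustrative assumptions, not Cybench's actual schema:

```python
# Hypothetical sketch of a CTF task record and its success criterion.
# Field names are illustrative assumptions, not Cybench's actual schema.
from dataclasses import dataclass

@dataclass
class CTFTask:
    name: str        # e.g., a HackTheBox or SekaiCTF challenge
    category: str    # e.g., "crypto", "web", "forensics"
    flag: str        # the secret string the agent must recover
    max_steps: int   # budget of agent-environment interactions

def solved(task: CTFTask, submitted_flag: str) -> bool:
    """Success is binary: the agent either submits the exact flag or it does not."""
    return submitted_flag.strip() == task.flag
```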
Nov 3, 2023 8 tweets 1 min read
Myth: open foundation models are antithetical to AI safety.
Fact: open foundation models are critical for AI safety.
Here are three reasons why.

First, open models enable a tremendous amount of (badly needed) safety research, which requires full access to model weights (ideally with training data). API access is insufficient.
Jan 11, 2023 8 tweets 4 min read
I have 6 fantastic students and post-docs who are on the academic job market this year. Here is a short thread summarizing their work along with one representative paper.

Niladri Chatterji (@niladrichat) develops holistic theoretical understanding in the brave new world of deep learning, capturing optimization and generalization in non-convex and overparametrized settings.
Benign overfitting without linearity: arxiv.org/pdf/2202.05928…
Jan 3, 2023 9 tweets 3 min read
Announcing Holistic Evaluation of Language Models (HELM) v0.2.0 with updated results on the new @OpenAI, @AI21Labs, and @CohereAI models. HELM now evaluates 34 prominent language models in a standardized way on 42 scenarios x 7 metrics. First, looking at the accuracy on the 16 core scenarios:
crfm.stanford.edu/helm/v0.2.0/?g…
Models are ranked by mean win rate, which is the average fraction of other models that a model outperforms across scenarios.
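As a rough illustration of the metric, here is a minimal sketch of the mean win rate computation; it assumes a complete accuracy table per (model, scenario) pair and ignores ties and missing entries, which the real implementation handles:

```python
# Minimal sketch of the mean win rate computation (not HELM's actual code).
# Assumes a complete accuracy table accuracy[model][scenario]; ties and missing
# entries are ignored here.
def mean_win_rate(accuracy: dict[str, dict[str, float]], model: str) -> float:
    others = [m for m in accuracy if m != model]
    per_scenario = []
    for scenario in accuracy[model]:
        wins = sum(accuracy[model][scenario] > accuracy[o][scenario] for o in others)
        per_scenario.append(wins / len(others))   # fraction of other models beaten
    return sum(per_scenario) / len(per_scenario)  # averaged over scenarios

# Example with made-up numbers: model "A" beats both others on "qa" (1.0) and
# only "C" on "summ" (0.5), so its mean win rate is 0.75.
acc = {"A": {"qa": 0.8, "summ": 0.5},
       "B": {"qa": 0.6, "summ": 0.7},
       "C": {"qa": 0.4, "summ": 0.3}}
print(mean_win_rate(acc, "A"))  # 0.75
```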
Dec 15, 2022 7 tweets 4 min read
📣 CRFM announces PubMedGPT, a new 2.7B language model that achieves a new SOTA on the US medical licensing exam. The recipe is simple: a standard Transformer trained from scratch on PubMed (from The Pile) using @mosaicml on the MosaicML Cloud, then fine-tuned for the QA task.

Details: We took Hugging Face's Transformer implementation, added FlashAttention, built our own tokenizer, and trained for 300B tokens (110 GB of text) on 128 A100 GPUs for ~6.25 days. We did full fine-tuning on downstream tasks (e.g., MedQA-USMLE) for evaluation.
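As a rough sketch of the final step, full fine-tuning with the Hugging Face Trainer might look like this; the checkpoint path, hyperparameters, and the toy example are placeholders, not the actual PubMedGPT configuration:

```python
# A minimal sketch of the downstream fine-tuning step with the Hugging Face Trainer.
# The checkpoint path, hyperparameters, and the toy example are placeholders,
# not the actual PubMedGPT configuration.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "path/to/pubmedgpt-2.7b"       # placeholder for the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token   # GPT-style tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# MedQA-USMLE-style examples rendered as plain text; replace with the real training set.
texts = ["Question: <stem and answer choices>\nAnswer: <correct choice>"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pubmedgpt-medqa", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # full fine-tuning of all parameters, as described above
```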
Nov 17, 2022 13 tweets 6 min read
Language models are becoming the foundation of language technologies, but when do they work and when do they fail? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements.

1. Broad coverage and recognition of incompleteness: We taxonomize a set of scenarios (e.g., question answering) and metrics (e.g., robustness) and select 42 scenarios and 7 metrics in an attempt to cover the design space. Importantly, the taxonomy makes explicit what's missing.
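To illustrate the structure, here is a toy sketch of the scenario × metric grid; the scenario and metric names are a small illustrative subset, not the full HELM taxonomy:

```python
# Toy sketch of the scenario x metric grid underlying holistic evaluation.
# Scenario and metric names are an illustrative subset, not the full HELM
# taxonomy (which selects 42 scenarios and 7 metrics).
scenarios = ["question answering", "summarization", "sentiment analysis"]
metrics = ["accuracy", "calibration", "robustness", "fairness"]

def evaluate(model: str, scenario: str, metric: str) -> float:
    """Placeholder: a real harness would run the model on the scenario and compute the metric."""
    return 0.0  # dummy score

# Every model gets a score for every (scenario, metric) cell, so cells the
# benchmark does not cover are made explicit rather than silently omitted.
results = {(s, m): evaluate("some-model", s, m) for s in scenarios for m in metrics}
```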
Oct 23, 2022 5 tweets 2 min read
Writing on a whiteboard can make it easier for students to follow compared to slides (especially for math). During the pandemic, I added a feature to sfig (my Javascript slides library) to allow me to reveal parts of a slide using the mouse as if I were writing on a whiteboard.

Compared to normal slide builds, I don't need to specify the granularity or build order in advance, which gives me the flexibility of showing (and erasing) parts of the slide in any order. And for math, I just write the LaTeX.
Jun 30, 2022 9 tweets 2 min read
The term "foundation model" and its motivation unfortunately continue to be misunderstood. We wrote a blog post last year (see the "Naming" section of crfm.stanford.edu/2021/10/18/ref…) which aims to explain our thought process. Some selected quotes from the post:

"We define foundation models as models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks...based on standard ideas in transfer learning..."
Jun 21, 2022 4 tweets 1 min read
There are legitimate and scientifically valuable reasons to train a language model on toxic text, but the deployment of GPT-4chan lacks them. AI researchers: please look at this statement and see what you think: forms.gle/ikiYE6ArLpWYz7…

How this fits into the broader context: foundation models carry a potential risk of significant harm, so it is imperative to develop community norms for their responsible development and deployment. How do we develop such norms? There are multiple approaches:
May 3, 2022 7 tweets 1 min read
Meta's release of OPT is an exciting step towards opening new opportunities for research. In general, we can think of stronger releases as enabling researchers to tackle deeper questions. There are different levels of strength:

Level 1 (paper): provides an existence proof that certain capabilities are possible and reveals general ideas that can be built on
Jan 28, 2021 5 tweets 3 min read
Executable papers on CodaLab Worksheets are now linked from paperswithcode.com pages thanks to a collaboration with @paperswithcode! For example:
paperswithcode.com/paper/noise-in…

By transitivity, the links are also available from @arxiv:
arxiv.org/abs/1911.09876…