📣 CRFM announces PubMedGPT, a new 2.7B language model that achieves a new SOTA on the US medical licensing exam. The recipe is simple: a standard Transformer trained from scratch on PubMed (from The Pile) using @mosaicml on the MosaicML Cloud, then fine-tuned for the QA task.
Details: We took Hugging Face’s Transformer implementation, added FlashAttention, built our own tokenizer, and trained for 300B tokens (110 GB of text) on 128 A100 GPUs for ~6.25 days. We did full fine-tuning on downstream tasks (e.g., MedQA-USMLE) for evaluation.
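To make the setup concrete, here is a minimal sketch (not the actual training code) of what a 2.7B GPT-2-style model looks like in Hugging Face; the layer/head counts, context length, and vocabulary size below are illustrative assumptions, not the released configuration, and the FlashAttention patch and training loop are omitted:

```python
# Minimal sketch (assumed hyperparameters, NOT the released PubMedGPT config):
# a ~2.7B-parameter GPT-2-style Transformer built with Hugging Face.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_embd=2560,        # hidden size (assumed; 2560 x 32 layers gives ~2.7B params)
    n_layer=32,         # number of Transformer blocks (assumed)
    n_head=20,          # must divide n_embd; 2560/20 = 128-dim heads (assumed)
    n_positions=1024,   # context length (assumed)
    vocab_size=28896,   # placeholder for the custom PubMed tokenizer's vocab size
)

# Note: instantiating a model this size needs ~10 GB of RAM in fp32.
model = GPT2LMHeadModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # roughly 2.7B with these settings
```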
PubMedGPT is also capable of generation, but like most LMs, it will fabricate content (so don’t trust it!). This is a pressing area for LM research, and we hope that the release of this model can help researchers evaluate and improve the reliability of generation.
We hope that PubMedGPT can serve as a foundation model for biomedical researchers; can it be adapted fruitfully for tasks such as medical text simplification, information retrieval, and knowledge completion? There's a lot more to do!
There are many large, interesting datasets across different sectors - e.g., medicine, law, finance. Rather than relying on a single 100B+ parameter foundation model, we think there’s a lot of value that can be captured by <10B parameter models trained on domain-specific datasets.
Language models are becoming the foundation of language technologies, but when do they work, and when do they fail? In a new CRFM paper, we propose Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of LMs. Holistic evaluation includes three elements:
1. Broad coverage and recognition of incompleteness: We taxonomize a set of scenarios (e.g., question answering) and metrics (e.g., robustness) and select 42 scenarios and 7 metrics in an attempt to cover the design space. Importantly, the taxonomy makes explicit what’s missing.
2. Multi-metric: benchmarks often focus on a single metric (usually accuracy). HELM instead reports 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) for each scenario. Tradeoffs are important, and let’s not forget about metrics beyond accuracy (see the sketch below).
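To illustrate the multi-metric idea, here is a toy sketch (not the actual HELM codebase or API) of what reporting all 7 metrics per scenario could look like:

```python
# Toy sketch of multi-metric reporting (NOT the actual HELM code or API).
from dataclasses import dataclass

METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]

@dataclass
class ScenarioResult:
    scenario: str               # e.g., "question answering"
    scores: dict[str, float]    # one score per metric, not accuracy alone

def report(results: list[ScenarioResult]) -> None:
    # Print a scenario x metric matrix instead of a single-number leaderboard,
    # so tradeoffs between metrics stay visible.
    print(f"{'scenario':<24}" + "".join(f"{m:>12}" for m in METRICS))
    for r in results:
        print(f"{r.scenario:<24}" +
              "".join(f"{r.scores.get(m, float('nan')):>12.3f}" for m in METRICS))

# Dummy scores just to show the report shape (0.0 is a placeholder, not data).
report([ScenarioResult("question answering", {m: 0.0 for m in METRICS})])
```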
Writing on a whiteboard can be easier for students to follow than slides (especially for math). During the pandemic, I added a feature to sfig (my JavaScript slides library) that lets me reveal parts of a slide using the mouse as if I were writing on a whiteboard:
Compared to normal slide builds, I don't need to specify the granularity or build order in advance, which gives me the flexibility to show (and erase) parts of the slide in any order. And for math, I just write the LaTeX.
The term "foundation model" and its motivation unfortunately continue to be misunderstood. We wrote a blog post last year (see "Naming" section of crfm.stanford.edu/2021/10/18/ref…) which aims to explain our thought process. Some selected quotes from the post:
"We define foundation models as models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks...based on standard ideas in transfer learning..."
"...we emphasize that foundation models present clear and significant societal risks, both in their current implementation and their fundamental premise"
There are legitimate and scientifically valuable reasons to train a language model on toxic text, but the deployment of GPT-4chan lacks them. AI researchers: please look at this statement and see what you think: forms.gle/ikiYE6ArLpWYz7…
How this fits into the broader context: foundation models carry a potential risk of significant harm, so it is imperative to develop community norms for their responsible development and deployment. How do we develop such norms? There are multiple approaches:
1. Principles: Describe values & best practices.
2. Tools: Develop benchmarks & software to make it easier to do the right thing.
3. Behavior: Take actions exemplifying responsible AI.
4. Regulation: Pass legislation that deters bad behavior.
5. Sanctions: Call out bad behavior.
Meta's release of OPT is an exciting step towards opening new opportunities for research. In general, we can think of stronger releases as enabling researchers to tackle deeper questions. There are different levels of strength:
Level 1 (paper): provides an existence proof that certain capabilities are possible and reveals general ideas that can be built on
Level 2 (API access): allows researchers to probe and evaluate the capabilities (e.g., reasoning) and limitations (e.g., bias) of existing foundation models
Executable papers contain not just the code and data, but also the experiments that produced a paper's results. Releasing code is great, but CodaLab goes one step further for full #reproducibility, providing certifiable provenance for an empirical result.
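A toy sketch of the provenance idea (this is not CodaLab's actual data model): think of every result as a bundle whose hash commits to its command and to the hashes of its inputs, so any number can be traced back to the exact code and data that produced it:

```python
# Toy provenance sketch (NOT CodaLab's real implementation): a result bundle's
# digest depends on its payload, its command, and all upstream bundles.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Bundle:
    name: str
    command: str = ""                  # empty for raw code/data uploads
    inputs: tuple["Bundle", ...] = ()
    payload: bytes = b""

    def digest(self) -> str:
        # Hash the payload, the command, and (recursively) every dependency,
        # so a result's digest changes if any upstream code or data changes.
        h = hashlib.sha256(self.payload + self.command.encode())
        for dep in self.inputs:
            h.update(dep.digest().encode())
        return h.hexdigest()

data = Bundle("corpus.txt", payload=b"...raw data...")
code = Bundle("train.py", payload=b"...training script...")
result = Bundle("eval.json", command="python train.py corpus.txt",
                inputs=(code, data))
print(result.digest()[:16])  # traces back to the exact code + data
```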