Today we release LLaMA, 4 foundation models ranging from 7B to 65B parameters.
LLaMA-13B outperforms OPT and GPT-3 175B on most benchmarks. LLaMA-65B is competitive with Chinchilla 70B and PaLM 540B.
The weights for all models are open and available at research.facebook.com/publications/l… 1/n
Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available datasets, making our work compatible with open-sourcing and reproducible, while most existing models rely on data that is either not publicly available or undocumented. 2/n
All our models were trained on at least 1T tokens, far more than is typically used at this scale.
Interestingly, even after 1T tokens the 7B model was still improving. 3/n
On Common Sense Reasoning, Closed-book Question Answering, and Reading Comprehension, LLaMA-65B outperforms Chinchilla 70B and PaLM 540B on almost all benchmarks. 4/n
LLaMA-65B outperforms Minerva-62B on GSM8k, even though it has not been fine-tuned on any mathematical dataset. On the MATH benchmark, it outperforms PaLM-62B (but falls well below Minerva-62B). 5/n
On code generation benchmarks, LLaMA-65B outperforms cont-PaLM (62B) as well as PaLM-540B. 6/n
We also briefly tried instruction finetuning using the approach of Chung et al. (2022).
The resulting model, LLaMA-I, outperforms Flan-PaLM-cont (62B) on MMLU and showcases some interesting instruction-following capabilities. 7/n
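For readers unfamiliar with the recipe, here is a minimal sketch of what instruction-finetuning data preparation can look like: instruction/response pairs are packed into a single sequence, and the loss is computed only on the response tokens. The prompt template, the toy tokenizer, and the masking convention below are my own illustrative assumptions, not the exact setup of Chung et al. (2022) or LLaMA-I.

```python
# Minimal sketch of instruction-finetuning data preparation (illustrative only;
# the prompt template, tokenizer, and masking scheme are assumptions, not the
# recipe used for LLaMA-I or by Chung et al. 2022).

IGNORE_INDEX = -100  # PyTorch cross-entropy convention for tokens excluded from the loss


def toy_tokenize(text: str) -> list[int]:
    """Placeholder whitespace tokenizer standing in for a real subword tokenizer."""
    return [hash(tok) % 32000 for tok in text.split()]


def build_example(instruction: str, response: str) -> dict:
    """Format an (instruction, response) pair and mask the prompt out of the labels."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)

    input_ids = prompt_ids + response_ids
    # Supervise only the response: the model is not trained to predict the prompt.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


example = build_example("Explain what a foundation model is.",
                        "A foundation model is a large model trained on broad data ...")
print(len(example["input_ids"]), example["labels"][:5])
```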
Today, we release Mistral Large 2, the new version of our largest model. Mistral Large 2 is a 123B-parameter model with a 128k context window. On many benchmarks (notably in code generation and math), it is superior to or on par with Llama 3.1 405B. Like Mistral NeMo, it was trained on a very large amount of source code and multilingual data. (1/N)
On HumanEval and MultiPL-E, Mistral Large 2 outperforms Llama 3.1 405B instruct and scores just below GPT-4o. On MATH (0-shot, without CoT), it falls behind only GPT-4o. (2/N)
On Multilingual MMLU, Mistral Large 2 significantly outperforms Llama 3.1 70B base (+6.3% on average over 9 languages) and is on par with Llama 3.1 405B (0.4% below). (3/N)
Super excited about this work! We showed that you can use large language models to align informal mathematical proofs (e.g. written in LaTeX) with formal proof sketches (e.g. skeletons of proofs written in a formal system like Lean or Isabelle).
A formal sketch provides a high-level description of the proof that follows the same reasoning steps as the informal proof. The sketches are in turn converted into complete proofs by an automated prover (we used Sledgehammer, but more powerful/neural-based provers could be used too).
We show that augmenting an automated prover with such formal sketches improves its performance immensely. Another nice thing about the generated formal proofs is that they are extremely easy to read, since by design they come with lots of comments and intermediate steps.
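To make the pipeline concrete, here is a toy formal proof sketch in Lean 4 (my own illustrative example; the paper works with Isabelle and Sledgehammer). The intermediate `have` steps mirror the informal argument, and each `sorry` marks a low-level gap that the automated prover is asked to close.

```lean
-- Toy illustration (not from the paper): a formal proof sketch whose
-- intermediate steps mirror an informal proof that a² + 2ab + b² ≥ 0.
-- Each `sorry` is a gap to be discharged by an automated prover
-- (the paper uses Isabelle's Sledgehammer for this step).
theorem sum_sq_nonneg (a b : Int) : 0 ≤ a * a + 2 * a * b + b * b := by
  -- Informal step 1: rewrite the expression as a square.
  have h1 : a * a + 2 * a * b + b * b = (a + b) * (a + b) := by sorry
  -- Informal step 2: a square is non-negative.
  have h2 : 0 ≤ (a + b) * (a + b) := by sorry
  -- Combine the two steps to close the goal.
  sorry
```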
Unsupervised Translation of Programming Languages. Feed a model Python, C++, and Java source code from GitHub, and it automatically learns to translate between the 3 languages in a fully unsupervised way. arxiv.org/pdf/2006.03511…
with @MaLachaux @b_roziere @LowikChanussot
We leverage the same principles that we used to translate low-resource languages (arxiv.org/abs/1804.07755), i.e. pretraining, denoising auto-encoding, and back-translation. Although initially designed for natural languages, these methods apply remarkably well to programming languages.
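For intuition, here is a minimal sketch of the two unsupervised objectives applied to token sequences. The noise probabilities and the `translate` stub are illustrative placeholders, not the actual TransCoder training code.

```python
import random

# Minimal sketch of denoising auto-encoding and back-translation on token
# sequences. The noise parameters and the translate() stub are illustrative
# assumptions, not the actual TransCoder setup.

MASK = "<mask>"


def corrupt(tokens: list[str], p_mask: float = 0.15, p_drop: float = 0.1) -> list[str]:
    """Denoising auto-encoding: corrupt the input; the model must reconstruct the original."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue            # randomly drop a token
        elif r < p_drop + p_mask:
            noisy.append(MASK)  # randomly mask a token
        else:
            noisy.append(tok)
    return noisy


def translate(tokens: list[str], src: str, tgt: str) -> list[str]:
    """Stub for the current model's translation; during training this is the model itself."""
    return tokens  # placeholder


def back_translation_pair(python_fn: list[str]) -> tuple[list[str], list[str]]:
    """Back-translation: translate Python -> C++ with the current model, then
    train on (generated C++, original Python) as a synthetic supervised pair."""
    generated_cpp = translate(python_fn, src="python", tgt="cpp")
    return generated_cpp, python_fn


tokens = "def maximum ( a , b ) : return a if a > b else b".split()
print(corrupt(tokens))
print(back_translation_pair(tokens)[0][:5])
```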
The model learns to align functions and objects across libraries (std::unordered_set -> HashSet, printf -> System.out.println, std::vector<int> -> List<Integer>, Files.createDirectories -> os.makedirs), as well as language-specific patterns (a > b ? a : b -> a if a > b else b).
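For example, a C++ function using the ternary operator can be rendered with Python's conditional expression. The snippet below is my own toy illustration of this kind of mapping, not an output quoted from the paper.

```python
# C++ input (shown as a comment to keep this block in one language):
#
#   int maximum(int a, int b) {
#       return a > b ? a : b;
#   }
#
# A translation of the kind the model produces: the C++ ternary
# `a > b ? a : b` becomes Python's conditional expression.
def maximum(a: int, b: int) -> int:
    return a if a > b else b


print(maximum(3, 5))  # 5
```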