To mark the 2nd anniversary of LLM360, we are proud to release K2-V2: a 70B reasoning-centric foundation model that delivers frontier capabilities.
Continuing our push for "360-open" transparency, we are releasing not just the weights, but the full recipe: data composition, training code, logs, and intermediate checkpoints.
About K2-V2:
🧠 70B params, reasoning-optimized
🧊 512K context window
🔓 "360-Open" (Data, Logs, Checkpoints)
📈 SOTA on olympiad math and complex logic puzzles
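For anyone who wants to try the released weights, here is a minimal loading sketch with 🤗 transformers. The repo id "LLM360/K2-V2" is an assumption for illustration; check the LLM360 org page on Hugging Face for the actual name.

```python
# Minimal sketch -- assumes a Hugging Face repo id like "LLM360/K2-V2" (hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LLM360/K2-V2"  # hypothetical repo id; check the LLM360 org page

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # 70B params: load in bf16 and shard across available GPUs
    device_map="auto",
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```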
We evaluated K2 across general knowledge, STEM, coding, and agentic tool use.
The goal? To show open models need not be smaller, weaker versions of closed ones.
K2 outperforms models of similar size and comes close to the performance of larger ones.
A huge thank you to the OSS ecosystem! @huggingface @wandb @github @lmsysorg @AiEleuther @allen_ai @BigCodeProject @PyTorch @nvidia @cerebras @mbzuai and many more.
• • •
We proudly present MegaMath, the largest open-source math reasoning pretraining corpus: 371B tokens of high-quality mathematical web, code, and synthetic data, designed to build the data foundation for next-generation math-proficient LLMs like o1 and R1. 🧵👇 #LLM #OpenSource #MegaMath #Math #Data4LLMs #Pretraining
Mathematical reasoning is a key capability of advanced LLMs. Training math-proficient models like o1 and DeepSeek-R1 requires large-scale, high-quality, diverse math data. Models built on proprietary corpora such as Qwen-2.5-Math (1T tokens) and DeepSeekMath (120B tokens) show strong mathematical ability, but those corpora are closed source, and existing open corpora lack comparable size and quality. MegaMath aims to bridge this gap.
💡 What’s in MegaMath?
MegaMath is a comprehensive 371B-token collection built for top data quality. It is composed of:
📚 279B tokens of math-rich web data
🧑‍💻 28B tokens of math-relevant code
🧠 64B tokens of high-quality synthetic data (QA pairs, translated code, text+code blocks)
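A minimal sketch of streaming one of these subsets with 🤗 datasets follows. The repo id "LLM360/MegaMath" and the config name "megamath-web" are assumptions for illustration; check the dataset card for the actual names.

```python
# Minimal sketch -- repo id and config name are assumed, not confirmed by this thread.
from datasets import load_dataset

# Stream one subset instead of downloading the full ~371B-token corpus up front.
ds = load_dataset(
    "LLM360/MegaMath",    # assumed repo id; check the dataset card
    name="megamath-web",  # assumed config name for the math-rich web subset
    split="train",
    streaming=True,
)

for i, example in enumerate(ds):
    print(example)  # inspect the schema of a single document
    if i >= 2:
        break
```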