Apple spilled the beans on Apple Intelligence Foundation Models (notes below):
Architecture:
> Dense, decoder-only transformer architecture
> RMSNorm & query/key normalization
> GQA (w/ 8 KV heads)
> SwiGLU activation & RoPE (base_freq=500K for long context), quick sketch of these pieces below
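A small, self-contained NumPy sketch of a few of these pieces (RMSNorm, query/key norm, SwiGLU, and the RoPE base frequency). Shapes, eps, and the random weights are illustrative assumptions, not the report's actual configuration:

import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm: rescale by the reciprocal root-mean-square, no mean subtraction
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU-gated linear unit followed by a down projection
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, d_head, d_ff = 64, 16, 128
x = np.random.randn(8, d_model)                      # (seq_len, d_model)

# Query/key normalization: RMSNorm applied to q and k before the dot product
q = rms_norm(x @ np.random.randn(d_model, d_head), np.ones(d_head))
k = rms_norm(x @ np.random.randn(d_model, d_head), np.ones(d_head))
scores = (q @ k.T) / np.sqrt(d_head)                 # attention logits

# RoPE inverse frequencies with the long-context base of 500K
inv_freq = 1.0 / (500_000 ** (np.arange(0, d_head, 2) / d_head))

out = swiglu_ffn(x, np.random.randn(d_model, d_ff),
                 np.random.randn(d_model, d_ff),
                 np.random.randn(d_ff, d_model))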
Pre-training & Tokenisation:
> Webpages crawled by Applebot (web crawl)
> Code & math datasets (publicly licensed)
> BPE tokenizer w/ 100K vocab for server & 49K for on-device (toy sketch below)
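For context on the tokenizer side, a toy sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library. The 49K vocab target mirrors the on-device number, but corpus.txt and the special tokens are placeholders, and this is not Apple's actual pipeline:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=49_000, special_tokens=["<s>", "</s>"])
tok.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt: any text dump, placeholder

ids = tok.encode("Grouped-query attention with 8 KV heads").ids
print(len(ids), tok.decode(ids))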
Three-step pre-training:
> Core (consumes most of the compute budget)
- AFM-server: 6.3T tokens, 4096 seq length
- AFM-on-device: initialised from a pruned 6.4B server model, trained on the full 6.3T tokens with an added distillation loss (sketch below)
> Continued (down-weight lower-quality data; up-weight code, math, and licensed data)
- 1T tokens, 8192 seq length
- no distillation loss for AFM-on-device in this phase
> Context-lengthening with long-sequence + synthetic data
- 100B tokens, 32768 seq length
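On the distillation loss for AFM-on-device: setting the report's exact formulation aside, the generic idea is to mix the usual next-token cross-entropy with a soft-target term against the larger (server) model's distribution. A minimal NumPy sketch, with alpha and the temperature as made-up hyperparameters:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=1.0):
    # Hard-label cross-entropy on the true next tokens
    log_p = np.log(softmax(student_logits) + 1e-9)
    ce = -np.mean(log_p[np.arange(len(targets)), targets])
    # Soft-label cross-entropy against the (server) teacher distribution
    teacher = softmax(teacher_logits / tau)
    kd = -np.mean(np.sum(teacher * np.log(softmax(student_logits / tau) + 1e-9), axis=-1))
    return (1 - alpha) * ce + alpha * kd

vocab, seq = 100, 4
student = np.random.randn(seq, vocab)
teacher = np.random.randn(seq, vocab)
labels = np.random.randint(0, vocab, size=seq)
print(distill_loss(student, teacher, labels))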
Training Infrastructure:
> Pre-trained on TPU v4 & v5p clusters
> Using AXLearn (JAX) with a combination of tensor, FSDP, and sequence parallelism (rough sharding sketch below)
> AFM-server trained on 8192 TPU v4 chips
> AFM-on-device trained on 2048 TPU v5p chips
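As a rough illustration of the FSDP-style sharding idea only (plain jax.sharding here, not AXLearn's actual API), a tiny example that splits one weight matrix across whatever devices are available:

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())        # 1 CPU locally; 8192 TPU v4 chips in their setup
mesh = Mesh(devices.reshape(-1), axis_names=("fsdp",))

w = jnp.zeros((4096, 4096))              # a hypothetical weight matrix
# FSDP-style: shard the parameter's first dim across the 'fsdp' axis
# (dim 0 must be divisible by the device count)
w_sharded = jax.device_put(w, NamedSharding(mesh, P("fsdp", None)))
print(w_sharded.sharding)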
Post Training:
> Hybrid data: synthetic + human-annotated
> Synthetic data for mathematics (problem rephrasing, reversion & evolution), tool use, and coding
> RLHF: Iterative Teaching Committee, i.e. refresh online human-preference data collection using a diverse set of the best-performing models
> For the above, collect pairwise human preferences on responses sampled from the committee (loss sketch below)
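Pairwise preferences like these typically feed a reward model; a standard Bradley-Terry-style pairwise loss (shown as a generic illustration, not necessarily the exact objective in the report) looks like:

import numpy as np

def pairwise_preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry / logistic loss: push the preferred response's reward above the other's
    margin = reward_chosen - reward_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical scalar rewards for three preference pairs sampled from the model committee
chosen = np.array([1.2, 0.4, 2.0])
rejected = np.array([0.3, 0.9, 1.1])
print(pairwise_preference_loss(chosen, rejected))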
Deployment:
> Adapters for each task; adapter values represented in 16 bits and loaded on the fly based on the task
> Quantised to under 4 bits per weight (3.7 bpw on average), with accuracy-recovery adapters to regain the lost performance
> Accuracy-recovery adapters trained on 10B tokens across different ranks: 8, 16, 32
> Some (less important) layers pushed to 2-bit (bpw arithmetic below)
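Back-of-the-envelope on how a 4-bit/2-bit mix averages out near 3.7 bpw. The 85/15 split and the parameter count below are made-up numbers purely to show the arithmetic, not the actual layer assignment:

total = 3_000_000_000                          # illustrative parameter count, not AFM's
w4, w2 = int(total * 0.85), int(total * 0.15)  # hypothetical split: 85% at 4-bit, 15% at 2-bit
avg_bpw = (w4 * 4 + w2 * 2) / total
print(avg_bpw)                                 # -> 3.7 bits per weight
# The 16-bit accuracy-recovery adapters (ranks 8/16/32) sit on top of this and add
# only a small overhead relative to the quantised base weights.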
Evaluation:
> On-device: SoTA on IFEval and competitive with Gemma 7B on AlpacaEval 2.0
> Server: SoTA on IFEval, comparable to Mixtral 8x22B on Arena Hard
> Competitive with GPT-4 / Gemini 1.5 on tool/function-calling and writing (summarisation, composition) benchmarks
> On-device beats Llama 3 8B on math
The report is packed with details; I quite enjoyed skimming through it. Thanks, Apple, for being so open about your practices and spilling the beans on what will power the next gen of on-device ML.
AI Math Olympiad Winner - Running on Mac! 100% local 🔥
brew install llama.cpp
llama-cli \
  --hf-repo reach-vb/NuminaMath-7B-TIR-Q8_0-GGUF \
  --hf-file numinamath-7b-tir-q8_0.gguf \
  -p "For how many values of the constant $ k $ will the polynomial $ x^{2}+kx+36$ have two distinct integer roots?"
> pretty strong model, beats CodeLlama 13B.
> supports fill-in-the-middle (code completion), code generation and chat.
> compatible with torch.compile()
> optimised for speed, roughly 1.5x faster than models in a similar category.
> 2B model supports FIM only.
> 7B supports FIM + Code Generation.
> 7B-IT supports Code Generation + Chat.