I hijacked Apple's Neural Engine -- the chip built for Siri and photo filters.
Reverse-engineered the private APIs and trained a full LLM on it.
Zero fan noise. Zero GPU. Just the Neural Engine doing what nobody thought it could.
Your Mac has one too.
Apple's Neural Engine is in every Apple Silicon Mac. But Apple never documented it for training -- only inference.
We ported native ANE code from maderix (github.com/maderix/ANE), who reverse-engineered the private APIs. Direct hardware access. Obj-C. No CoreML.
Then built a dynamic weight pipeline: kernels compile once, weight updates are just memcpy.
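Roughly what that looks like, as a Python sketch (the real pipeline is Obj-C against the private ANE APIs; the buffer shape here is made up):

```python
import numpy as np

# The kernel is compiled once against a fixed weight buffer.
# Training never recompiles -- it just overwrites that buffer in place.
WEIGHT_BUFFER = np.zeros((4096, 4096), dtype=np.float16)  # allocated once, kernel bound to it

def push_weights(new_weights: np.ndarray) -> None:
    """A weight update is just a memcpy into the buffer the kernel already reads from."""
    np.copyto(WEIGHT_BUFFER, new_weights.astype(np.float16))
```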
It exploded. Three times.
v1: Activations hit [-339, +467] at step 13K. Nuclear.
v2: Cosine schedule set for 330K steps, but the run only lasted hours -- the LR barely decayed (sketch below). Diverged at 15K.
v3: All fixes applied, but LR=5e-4 was still too aggressive. Diverged at 55K.
Three runs. Three failures. But each one got further.
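Why v2 barely decayed: a cosine schedule spreads its decay over whatever horizon you give it, so with 330K total steps, step 15K is still near the top of the curve. A quick sketch in plain Python (assuming the 5e-4 base LR mentioned for v3):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, min_lr=0.0):
    """Standard cosine decay from base_lr down to min_lr over total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Scheduled for 330K steps, but the run diverged at 15K:
print(cosine_lr(15_000, 330_000))  # ~4.97e-4 -- barely below the 5e-4 starting point
```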
Three things we found that actually worked:
1. Zero-init output projections -- clean gradient flow.
2. Logit softcapping (cap=15) -- same technique Claude uses. Gradients can't explode.
3. Split learning rates -- matrices at 0.05x, embeddings at 5x. An 18-experiment sweep found the sweet spot.
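A minimal Python sketch of all three (function names and the grouping rule are ours; the cap and the multipliers are from the runs):

```python
import numpy as np

# 1. Zero-init output projections: residual branches start as no-ops,
#    so gradients flow cleanly through the skip connections from step 0.
def init_output_projection(d_model, d_out):
    return np.zeros((d_model, d_out), dtype=np.float32)

# 2. Logit softcapping: tanh squashes logits into (-cap, +cap),
#    so the loss never sees runaway values and gradients stay bounded.
def softcap(logits, cap=15.0):
    return cap * np.tanh(logits / cap)

# 3. Split learning rates: one base LR, scaled per parameter group.
def lr_for(param_name, base_lr):
    if "embedding" in param_name:
        return base_lr * 5.0    # embeddings move faster
    return base_lr * 0.05       # weight matrices move slower
```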
Half the LR + all three fixes = v3b.
55 experiments over a few weeks. Here's what it came down to.
This is part of a bigger project -- we're training the same model on 3 accelerators on the same Mac and benchmarking against Karpathy's H100 baseline (0.998).
ANE ended up with the biggest improvement of the three.
When MLX trains on the GPU, the fans scream. You feel the heat. Activity Monitor shows 97% GPU.
When ANE trains -- silence. Cool to the touch. Invisible to the OS.
And they run at the same time. Two models training on the same MacBook. Zero interference. 50+ hours of compute in about 19 hours wall time.
This afternoon: SEQ=1024.
v3b ran at SEQ=512. MLX runs at 1024. Double context = better language modeling.
Gap: 0.329. MLX wins -- but ANE is under-optimized.
Two very different paths to convergence.
ANE: one 8-hour overnight run, 72K steps. Still trending down -- not plateaued.
MLX: 259 five-minute experiments, 30 improvements. Rapid iteration in Python.
ANE iterates 60x slower. That compounds.
The gap isn't hardware. It's everything else.
Optimizer: pure Adam vs Muon+AdamW (~half the gap alone*)
Research: 55 ANE experiments vs 259 MLX
Architecture: 2 features vs 5+
Language: compiled Obj-C vs Python