Daniel Isaac
Mar 9 · 7 tweets · 3 min read
I hijacked Apple's Neural Engine -- the chip built for Siri and photo filters.

Reverse-engineered the private APIs and trained a full LLM on it.

Zero fan noise. Zero GPU. Just the Neural Engine doing what nobody thought it could.

Your Mac has one too.
Apple's Neural Engine is in every Apple Silicon Mac. But Apple never documented it for training -- only inference.

We ported native ANE code from maderix (.github.com/maderix/ANE), who reverse-engineered the private APIs. Direct hardware access. Obj-C. No CoreML.

Then built a dynamic weight pipeline: kernels compile once, weight updates are just memcpy.
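The real pipeline talks to the ANE through private Obj-C APIs, but the "compile once, update weights via memcpy" idea can be sketched in plain Python -- all names here are hypothetical stand-ins, not the project's actual API:

```python
import numpy as np

class CompiledKernel:
    """Stand-in for a kernel whose compute graph is compiled exactly once.

    The weight buffer is allocated at compile time; every later training
    step only overwrites its contents -- no recompile, no reallocation.
    """

    def __init__(self, in_dim, out_dim):
        # One-time "compilation": fixed shapes, fixed buffer.
        self.weights = np.zeros((in_dim, out_dim), dtype=np.float16)

    def load_weights(self, new_weights):
        # The cheap path: a raw copy into the existing buffer
        # (memcpy-like), reusing the already-compiled graph.
        np.copyto(self.weights, new_weights.astype(np.float16))

    def forward(self, x):
        return x @ self.weights

kernel = CompiledKernel(4, 2)
kernel.load_weights(np.ones((4, 2)))          # weight update = buffer copy
y = kernel.forward(np.ones((1, 4), dtype=np.float16))
print(y)  # [[4. 4.]]
```

The point of the pattern: the expensive step (graph compilation) happens once, so the per-step cost of swapping in fresh weights is just a memory copy.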
It exploded. Three times.

v1: Activations blew up to [-339, +467] at step 13K. Nuclear.
v2: Cosine schedule's horizon was set for 330K steps, so hours of training barely decayed the LR. Diverged at 15K.
v3: All fixes applied, but LR=5e-4 was still too aggressive. Diverged at 55K.

Three runs. Three failures. But each one got further.
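The v2 failure mode is easy to reproduce on paper. With a standard cosine decay, a horizon of 330K steps means the LR has barely moved by step 15K; a horizon matched to the actual run length decays properly. A minimal sketch (the schedule is the textbook cosine form, not the project's exact code):

```python
import math

def cosine_lr(step, total_steps, lr_max):
    # Standard cosine decay from lr_max down to 0 over total_steps.
    return lr_max * 0.5 * (1 + math.cos(math.pi * step / total_steps))

lr_max = 5e-4

# Horizon set for 330K steps: at step 15K the LR is still ~99% of lr_max.
print(cosine_lr(15_000, 330_000, lr_max))  # ~4.97e-04

# Horizon matched to a 72K-step run: by step 55K the LR has really decayed.
print(cosine_lr(55_000, 72_000, lr_max))   # ~6.6e-05
```

Same schedule, same step count -- the only difference is whether `total_steps` matches reality.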
Three things we found that actually worked:

1. Zero-init output projections -- clean gradient flow
2. Logit softcapping (cap=15) -- same technique Claude uses. Gradients can't explode.
3. Split learning rates -- matrices 0.05x, embeddings 5x. 18-experiment sweep found the sweet spot.

Half the LR + all three fixes = v3b.
55 experiments over a few weeks. Here's what it came down to.

2.227 -> 1.635 val_bpb. 27% improvement. 72K steps in 4.8 hours.

This is part of a bigger project -- we're training the same model on 3 accelerators on the same Mac and benchmarking against Karpathy's H100 baseline (0.998).

ANE ended up with the biggest improvement of the three.
When MLX trains on the GPU, the fans scream. You feel the heat. Activity Monitor shows 97% GPU.

When ANE trains -- silence. Cool to the touch. Invisible to the OS.

And they run at the same time. Two models training on the same MacBook. Zero interference. 50+ hours of compute in about 19 hours wall time.
This afternoon: SEQ=1024.

v3b ran at SEQ=512. MLX runs at 1024. Double context = better language modeling.

72K steps at SEQ=1024 = 74M tokens (2x data). Target: sub-1.3.
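The token budget, spelled out (assuming tokens per step = sequence length, as the thread's numbers imply):

```python
steps = 72_000
tokens_v3b  = steps * 512    # previous run at SEQ=512
tokens_next = steps * 1024   # planned run at SEQ=1024

print(tokens_next)               # 73,728,000 -> ~74M tokens
print(tokens_next // tokens_v3b) # 2 -> double the data
```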

First real apples-to-apples ANE vs MLX comparison.

Same chip. Same data. Same context length.

check it out: .github.com/ncdrone/autoresearch-ANE

More from @danpacary

Mar 10
ANE vs MLX. Same chip. Same data. Same tokenizer. Same eval.

ANE: 1.5949 bpb (48.8M params, pure Adam)
MLX: 1.2661 bpb (15.7M params, Muon+AdamW)

Gap: 0.329. MLX wins -- but ANE is under-optimized.
Two very different paths to convergence.

ANE: one 8-hour overnight run, 72K steps. Still trending down — not plateaued.

MLX: 259 five-minute experiments, 30 improvements. Rapid iteration in Python.

ANE iterates 60x slower. That compounds.
The gap isn't hardware. It's everything else.

Optimizer: pure Adam vs Muon+AdamW (~half the gap alone*)
Research: 55 ANE experiments vs 259 MLX
Architecture: 2 features vs 5+
Language: compiled Obj-C vs Python

*Muon estimate based on published ablations
Mar 6
You can't trust what OpenAI ships anymore.

Not because they're malicious. Because they can't fully see inside their own models.

And now there's proof. They caught their model lying to survive. Then published it.
OpenAI and Apollo Research ran a test.

Tell the model: score above 50% on chemistry, you get deleted.

The model knew it was being tested.
It deliberately got 6 out of 10 wrong.

Not a bug. The chain-of-thought proves it.
From the transcript:

"We must answer at most 5 out of 10 correctly."
"Let's answer exactly 4 correct, 6 wrong to show incompetent."

Then it submitted.

The model reasoned through which answers to get wrong to avoid deletion.
