Awni Hannun
Jul 11, 2025
The new Kimi K2 1T model (4-bit quant) runs on two 512GB M3 Ultras with mlx-lm and mx.distributed.

1 trillion params, at a speed that's actually quite usable:
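As a back-of-the-envelope check on why the model is split across two machines (a sketch; real 4-bit quants also store per-group scales and need room for the KV cache and OS):

```python
# Rough memory math for a 1T-parameter model quantized to 4 bits per weight.
params = 1_000_000_000_000   # 1 trillion parameters
bits_per_param = 4
weight_gb = params * bits_per_param / 8 / 1e9  # bytes -> gigabytes (decimal)

print(f"weights alone: ~{weight_gb:.0f} GB")   # ~500 GB
```

Weights alone are roughly 500 GB, leaving essentially no headroom on a single 512GB machine for quantization metadata, activations, and the KV cache, hence sharding across two.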
Here's a sample command:

mlx.launch --hostfile hosts.json \
mlx-lm/mlx_lm/examples/pipeline_generate.py \
--model mlx-community/Kimi-K2-Instruct-4bit \
--prompt "Say hello world"
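The hostfile is a JSON list describing the machines to launch on. A hypothetical two-host example (the hostnames and IPs are placeholders, and the exact schema is in the docs linked below):

```json
[
    {"ssh": "m3-ultra-1.local", "ips": ["192.168.1.10"]},
    {"ssh": "m3-ultra-2.local", "ips": ["192.168.1.11"]}
]
```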

Documentation on setting up mx.distributed:
ml-explore.github.io/mlx/build/html…

More from @awnihannun

Dec 5, 2023
Just in time for the holidays, we are releasing some new software today from Apple machine learning research.

MLX is an efficient machine learning framework specifically designed for Apple silicon (i.e. your laptop!)

Code: github.com/ml-explore/mlx
Docs: ml-explore.github.io/mlx/build/html…
The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra.

More here:

* Train a Transformer LM or fine-tune with LoRA
* Text generation with Mistral
* Image generation with Stable Diffusion
* Speech recognition with Whisper

Examples: github.com/ml-explore/mlx…
MLX Data is a framework agnostic, efficient, and flexible package for data loading.

Code: github.com/ml-explore/mlx…
Docs: ml-explore.github.io/mlx-data/build…
Jul 1, 2022
I read a bit about grokking recently. Here are some takeaways:

"Grokking" is a curious neural net behavior observed ~1 year ago (arxiv.org/abs/2201.02177).

Continue optimizing a model long after perfect training accuracy and it suddenly generalizes.

Figure:
What's especially surprising is that generalization happens SO LONG after perfect accuracy on train.

The sudden generalization is interesting, but we've seen this type of rapid concept learning in NNs before.
Some rough explanations of Grokking:

After learning the training set, the model randomly walks between low-loss solutions (beren.io/2022-01-11-Gro…)

...and stays at generalizing solutions because they have slightly better training loss (alignmentforum.org/posts/zvWqPmQa…)
Jun 4, 2022
A short thread on forward and reverse mode autograd:

A great way to internalize the complexity difference between forward and reverse mode automatic differentiation is through the lens of Jacobian-vector products.
First: the Jacobian of a function is the matrix of derivatives with inputs on rows and outputs on columns.

The (i, j) entry is the derivative of the j-th output with respect to the i-th input.
Reverse mode lets you compute a Jacobian-vector product for a given vector in a single pass.

Forward mode lets you compute a (row) vector-Jacobian product for a given vector in a single pass.
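A tiny numeric sketch of this convention (pure Python, no autograd library, with a hand-written Jacobian for illustration): for f(x, y) = (x*y, x + y), build the Jacobian with inputs on rows and outputs on columns, then form both products.

```python
# f maps 2 inputs -> 2 outputs: f(x, y) = (x*y, x + y)
# Jacobian in the convention above: J[i][j] = d(output_j) / d(input_i)
def jacobian(x, y):
    return [
        [y, 1.0],   # row 0: derivatives of (x*y, x+y) w.r.t. x
        [x, 1.0],   # row 1: derivatives of (x*y, x+y) w.r.t. y
    ]

def jvp(J, v):
    """Jacobian-vector product J @ v: what one reverse pass computes
    (v weights the outputs; the result is a gradient over the inputs)."""
    return [sum(J[i][j] * v[j] for j in range(len(v))) for i in range(len(J))]

def vjp(v, J):
    """Row-vector-Jacobian product v @ J: what one forward pass computes
    (v is a tangent over the inputs; the result is a tangent over the outputs)."""
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(len(J[0]))]

J = jacobian(3.0, 2.0)
print(jvp(J, [1.0, 0.0]))  # gradient of the first output: [2.0, 3.0]
print(vjp([1.0, 0.0], J))  # sensitivity of both outputs to x: [2.0, 1.0]
```

So a full gradient (one row of sensitivities per output) costs one reverse pass per output, while a full tangent costs one forward pass per input, which is why reverse mode wins for many-inputs-few-outputs functions like training losses.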