Byron Hsu · Dec 10
Introducing the first open-source optimized post-training losses in Liger Kernel, with ~80% memory reduction and up to a 70% end-to-end speedup from the larger batch sizes this enables. Featuring DPO (@rm_rafailov), CPO (@fe1ixxu), ORPO (@jiwoohong98), SimPO (@yumeng0818), JSD, and more. Use them like any other PyTorch module. Available today in Liger v0.5.0!

github.com/linkedin/Liger…
(2/N) Installation and usage are simple: pip install liger-kernel, then import the loss as a PyTorch module and get the memory reduction out of the box.
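A minimal usage sketch (the import path and class name follow the v0.5 release notes; the shapes and call signature are my assumptions, so check the repo for the authoritative API):

```python
# pip install liger-kernel
import torch
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss  # import path per the v0.5 release notes

orpo_loss = LigerFusedLinearORPOLoss()

# Example shapes (assumed): the batch holds chosen rows followed by rejected rows.
B, T, H, V = 4, 1024, 4096, 128256
hidden = torch.randn(B, T, H, requires_grad=True)        # model output before the LM head
lm_head_weight = torch.randn(V, H, requires_grad=True)
target = torch.randint(0, V, (B, T))

# The loss takes the LM head weight and the hidden states directly, so the full
# (B, T, V) logits tensor is never materialized.
loss = orpo_loss(lm_head_weight, hidden, target)          # call signature assumed
loss.backward()
```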
(3/N) The core challenge is that LLM vocabulary sizes are massive, and losses like DPO or JSD need to materialize multiple copies of the logits, which blows up memory. We applied the same idea behind our popular fused linear cross entropy to these other losses.
(4/N) For all of these losses, hidden states are fed into the LM head and then into the loss function. We avoid materializing the full logits by chunking the hidden states and fusing the forward and backward passes. Memory drops by up to 80%, since the memory peak is only the size of a small chunk!
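A plain-PyTorch illustration of the chunking idea (not the Liger code itself), using cross entropy for simplicity: only one chunk of logits exists at a time. Note that for the memory saving to hold during training, the backward must also be computed per chunk rather than letting autograd keep every chunk's logits alive; that is what the fused forward/backward in the next tweet does.

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, lm_head_weight, target, chunk_size=1024):
    """Compute CE over (N, H) hidden states without materializing the full (N, V)
    logits; peak logit memory is only (chunk_size, V). Illustration only."""
    n_tokens = hidden.shape[0]
    total = hidden.new_zeros(())
    for start in range(0, n_tokens, chunk_size):
        h = hidden[start:start + chunk_size]      # (chunk, H)
        logits = h @ lm_head_weight.t()           # (chunk, V) -- only this chunk exists
        total = total + F.cross_entropy(
            logits.float(), target[start:start + chunk_size], reduction="sum"
        )
    return total / n_tokens
```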
(5/N) Rather than writing custom Triton kernels, we generate the kernels with torch.compile. We get excellent performance by using grad_and_value to run the forward and backward in one call, accumulating gradients with a for loop over chunks, and torch.compiling the whole thing.
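A simplified stand-in for that recipe (using cross entropy as the per-chunk loss; the real code generalizes this to the preference losses): torch.func.grad_and_value returns the chunk's gradients and loss in one call, a Python loop accumulates them, and torch.compile fuses the projection and loss math inside each chunk.

```python
import torch
import torch.nn.functional as F
from torch.func import grad_and_value

def chunk_loss(hidden_chunk, weight, target_chunk):
    # Fused projection + loss for one chunk; torch.compile generates the kernel for this.
    logits = hidden_chunk @ weight.t()
    return F.cross_entropy(logits.float(), target_chunk, reduction="sum")

# Gradients w.r.t. the hidden chunk and the LM head weight, plus the loss, in one call.
grad_and_loss = torch.compile(grad_and_value(chunk_loss, argnums=(0, 1)))

def fused_forward_backward(hidden, weight, target, chunk_size=1024):
    hidden_grad = torch.zeros_like(hidden)
    weight_grad = torch.zeros_like(weight)
    total_loss = hidden.new_zeros(())
    n_tokens = hidden.shape[0]
    for start in range(0, n_tokens, chunk_size):
        end = start + chunk_size
        (g_h, g_w), loss = grad_and_loss(hidden[start:end], weight, target[start:end])
        hidden_grad[start:end] = g_h              # per-token grads land in their slice
        weight_grad += g_w                        # LM head grads accumulate across chunks
        total_loss += loss
    # Normalize the sum-reduced loss and gradients to a per-token mean.
    hidden_grad /= n_tokens
    weight_grad /= n_tokens
    return total_loss / n_tokens, hidden_grad, weight_grad
```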
(6/N) torch.compile streamlines kernel execution and fuses operations, but recompilation can add overhead. We found that marking variable-length inputs as dynamic with torch._dynamo.mark_dynamic minimizes unnecessary recompilations and keeps performance consistent.
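For example (a sketch; which tensors and dims Liger actually marks may differ), marking the token dimension of each chunk as dynamic lets chunks of different lengths reuse the same compiled graph:

```python
import torch

@torch.compile
def chunk_fn(hidden_chunk, weight):
    # Stand-in for the fused per-chunk computation.
    return (hidden_chunk @ weight.t()).float().logsumexp(dim=-1).sum()

weight = torch.randn(32000, 1024)

for chunk_len in (1024, 1024, 513):               # e.g. the last chunk is shorter
    hidden_chunk = torch.randn(chunk_len, 1024)
    # Mark dim 0 (number of tokens in the chunk) as dynamic so the changing length
    # does not trigger a recompilation on every new shape.
    torch._dynamo.mark_dynamic(hidden_chunk, 0)
    out = chunk_fn(hidden_chunk, weight)
```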
(7/N) We made the interface easy to extend in pure PyTorch: github.com/linkedin/Liger…. Researchers can build custom losses on top of our flexible chunked-loss implementation while keeping the same performance. Please follow our official account and see the full v0.5 release notes: x.com/liger_kernel/s…
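The extension pattern looks roughly like this (hypothetical names, not the actual Liger base class; see the linked code for the real interface): the chunking and fusion machinery is shared, and a researcher only supplies the per-chunk loss in plain PyTorch.

```python
import torch
import torch.nn.functional as F

def run_chunked(hidden, lm_head_weight, target, per_chunk_loss_fn, chunk_size=1024):
    # Shared machinery (simplified): project and score one chunk at a time.
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        logits = hidden[start:start + chunk_size] @ lm_head_weight.t()
        total = total + per_chunk_loss_fn(logits.float(), target[start:start + chunk_size])
    return total / hidden.shape[0]

def my_label_smoothed_ce(chunk_logits, chunk_target):
    # The only piece a researcher writes: an ordinary PyTorch loss over one chunk.
    return F.cross_entropy(chunk_logits, chunk_target, reduction="sum", label_smoothing=0.1)

# loss = run_chunked(hidden, lm_head_weight, target, my_label_smoothed_ce)
```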
(8/N) This work has been led by @shivam15sahni and @hsu_byron. Special thanks to @cHHillee for developing the Torch Compile Chunk Loss, and to Pramodith (github.com/pramodith) and Austin (github.com/austin362667), both of whom are active open source contributors. We’ve implemented the LigerORPOTrainer on top of the Hugging Face Trainer and are looking forward to deeper integration with training frameworks!
