Huiqiang Jiang
RSDE @MSFTResearch Shanghai
Jul 7, 2024
Thanks @_akhaliq for sponsoring. You can now try MInference online in an HF demo with ZeroGPU.
You can now process 1M-token contexts up to 10x faster on a single A100 using long-context LLMs like LLaMA3-1M and GLM4-1M, with even better accuracy. Try MInference 1.0 right now! huggingface.co/spaces/microso…
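If you'd rather run it locally than in the demo, here is a minimal sketch of patching a Hugging Face model, assuming the MInference("minference", model_name) patching API from the project repo; the model name, dtype, and generation arguments are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

# A 1M-context LLaMA3 variant; swap in any supported long-context model.
model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Patch the model's attention with MInference's dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Generation then works as usual; the speedup shows up in pre-filling.
inputs = tokenizer("Summarize the following document: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```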
1) The motivation behind MInference: long-context inference is highly resource-intensive, yet attention over long contexts is inherently sparse and dynamic. We use dynamic sparse attention to accelerate the pre-filling stage of 1M-token inference by up to 10x. For more details, visit the project page: aka.ms/MInference
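To make the idea concrete, here is a toy single-head sketch of the vertical-slash pattern, one of the dynamic sparse patterns MInference uses: estimate which key columns (verticals) and q-k offsets (slashes) matter from only the last few queries, then attend only there. The function name, parameter values, and the masked dense fallback are illustrative, not the shipped kernels.

```python
import torch

def vertical_slash_attention(q, k, v, last_q=64, top_vertical=1000, top_slash=64):
    n, d = q.shape                      # (seq_len, head_dim), one head
    scale = d ** -0.5
    pos = torch.arange(n, device=q.device)

    # 1) Cheap estimate: causal attention of only the last `last_q` queries.
    qpos = pos[-last_q:].unsqueeze(1)                       # (last_q, 1)
    est = q[-last_q:] @ k.T * scale                         # (last_q, n)
    est.masked_fill_(pos > qpos, float("-inf"))
    est = torch.softmax(est, dim=-1)

    # 2) Top vertical lines: key columns with the most estimated mass.
    cols = est.sum(dim=0).topk(min(top_vertical, n)).indices

    # 3) Top slash lines: q-k offsets (diagonals) with the most mass.
    #    Negative offsets (future keys) carry zero weight, so clamping is safe.
    offsets = (qpos - pos).clamp(min=0)                     # (last_q, n)
    diag_score = torch.zeros(n, device=q.device)
    diag_score.index_add_(0, offsets.flatten(), est.flatten())
    diags = diag_score.topk(min(top_slash, n)).indices

    # 4) Sparse causal mask covering the chosen verticals and slashes.
    mask = torch.zeros(n, n, dtype=torch.bool, device=q.device)
    mask[:, cols] = True
    for off in diags.tolist():                              # mark q - k == off
        rows = pos[off:]
        mask[rows, rows - off] = True
    mask.fill_diagonal_(True)                               # every row attends somewhere
    mask &= pos.unsqueeze(1) >= pos.unsqueeze(0)            # causal

    # 5) Masked attention (a real kernel computes only the unmasked blocks).
    scores = (q @ k.T) * scale
    scores.masked_fill_(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

In the actual system, each attention head is classified offline into one of three sparse patterns (A-shape, vertical-slash, block-sparse), the dynamic indices are approximated online per input, and optimized sparse kernels skip the masked blocks entirely rather than computing a dense score matrix as above.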