Vaibhav (VB) Srivastav
Mar 26 · 6 tweets · 2 min read
llama.cpp with OpenAI chat completions API! 🦙

100% local. Powered by Metal!

*sound on*

In 2 steps:

1. brew install ggerganov/ggerganov/llama.cpp

2. llama-server --model <path/to/model.gguf> -c 2048

P.S. All of this with a binary size of less than 5MB ;)

That's it! 🤗
Compatible with 7,500+ GGUFs available on the Hugging Face Hub and more:

huggingface.co/models?library…
To query the API, all you have to do is:

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "The meaning of life is :", "n_predict": 512}'
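The same call can be made from Python. Below is a minimal sketch using only the standard library, assuming the server from step 2 is listening on the default localhost:8080 (the helper names here are mine, not part of llama.cpp):

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # llama-server's default address

def build_request(prompt: str, n_predict: int = 512) -> urllib.request.Request:
    """Build a POST request for llama.cpp's native /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        f"{SERVER}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["content"]

# Inspect the request shape without a server running:
req = build_request("The meaning of life is :")
print(req.full_url, req.data.decode())
```

The server also exposes OpenAI-style routes, so existing OpenAI client code can be pointed at it by swapping the base URL.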
Also works with streaming mode:

just add "stream": true

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "The meaning of life is :", "n_predict": 512, "stream": true}'
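With "stream": true the server sends incremental chunks as server-sent events. Here is a small Python sketch of consuming them — the `data: {...}` line shape is my reading of llama.cpp's streaming output, and the parser names are illustrative:

```python
import json
from typing import Iterable, Iterator

def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield generated text fragments from streamed 'data: {...}' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators / keep-alives
        chunk = json.loads(line[len("data: "):])
        yield chunk.get("content", "")
        if chunk.get("stop"):
            break  # the final chunk signals completion

# Example on captured lines:
sample = [
    'data: {"content": "42", "stop": false}',
    'data: {"content": ".", "stop": true}',
]
print("".join(parse_sse(sample)))  # -> 42.
```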


More from @reach_vb

Mar 21
Introducing Distil-Whisper v3 ⚡

> ~50% fewer parameters and 6x faster than Large-v3.
> More accurate than Large-v3 on long-form transcription.

Available with 🦀 WebGPU, Whisper.cpp, Transformers, Faster-Whisper and Transformers.js support!

Drop in; no changes are required! 🔥
Along with this, we announce an alpha release of Ratchet - our optimised WebGPU framework to serve blazingly fast Whisper:

Written in 🦀 Rust!

huggingface.co/spaces/FL33TW0…
You can find all the weights and their corresponding usage below.

Can't wait to see what the community builds with it! 🤗

huggingface.co/collections/di…
Mar 19
Introducing Quanto: A PyTorch Quantisation library! ⚡

a.k.a. the gpu poor toolkit ;)

> Supports int2, int4 and int8 weights.
> Works seamlessly on CUDA, MPS and CPU.
> Automagically operates with all PyTorch models.
> Native support for Transformers. 🤗
> Quantize, calibrate or perform Quantization-Aware Training!

Best part: Minimal loss in accuracy/perplexity even with int4 quantisation.
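That low loss is easy to see with a back-of-the-envelope sketch of symmetric int8 weight quantization in plain Python — this illustrates the idea only and is not Quanto's actual API:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: scale by the largest-magnitude weight."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floats."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case round-trip error is bounded by half the quantization step
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err <= scale / 2)
```

int2 and int4 follow the same recipe with far fewer levels, which is where calibration and quantization-aware training help recover accuracy.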
Optimised matmul kernels for int2/4/8 coming soon!

> pip install quanto

github.com/huggingface/qu…
Read this brilliant blog post put together by the team ❤️

huggingface.co/blog/quanto-in…
Mar 12
Introducing FACodec! ⚡

> Factorised Neural Speech Codec.
> Powers NaturalSpeech 3.
> Checkpoints and Codebase - Apache 2.0 Licensed.
> Performs zero-shot Voice Conversion.
> Consists of an explicit timbre extractor plus prosody, content and acoustic-detail quantisers.
> Current SoTA among neural speech codecs.
> Checkpoints on the Hugging Face Hub. 🤗
Check out the checkpoints here:

huggingface.co/amphion/natura…
And a Space to play around with and observe its reconstruction quality:

huggingface.co/spaces/amphion…
Mar 11
Wow! @CohereForAI just released CMD-R 🔥

> Beats GPT 3.5
> 128K context window.
> 35 billion parameters.
> 10 languages.
> Optimised for reasoning, question answering and summarisation.
> Use it directly in transformers 🤗

huggingface.co/CohereForAI/c4…
All you need to make it work with transformers! ⚡
Mar 9
Fast Mamba Inference is now in Transformers! 🐍

All you need is 5 lines of code and the latest transformers!

Bonus: You can also fine-tune / RLHF it with TRL & PEFT too 🤗

We support all the base checkpoints along with community-tuned checkpoints too.

Want to try it, too? :)

import torch
from transformers import AutoTokenizer, MambaForCausalLM

device = "cuda:1"  # any torch device works, e.g. "cuda" or "cpu"

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = MambaForCausalLM.from_pretrained(
    "state-spaces/mamba-2.8b-hf", device_map=device
)

input_ids = tokenizer("The meaning to life is ", return_tensors="pt")["input_ids"]

# Generate up to 100 new tokens and decode back to text
out = model.generate(input_ids.to(device), max_new_tokens=100)
print(tokenizer.batch_decode(out))

That's it! 🤗
Massive kudos to @art_zucker for adding it to transformers. Check out the Documentation for more details:

huggingface.co/docs/transform…
@art_zucker PEFT tuning is literally as simple as this 🔥
Feb 6
Let's go! MetaVoice 1B 🔉

> 1.2B parameter model.
> Trained on 100K hours of data.
> Supports zero-shot voice cloning.
> Short & long-form synthesis.
> Emotional speech.
> Best part: Apache 2.0 licensed. 🔥

Powered by a simple yet robust architecture:
> Encodec (Multi-Band Diffusion) and GPT + Encoder Transformer LM.
> DeepFilterNet to clear up MBD artefacts.

Synthesised: "Have you heard about this new TTS model called MetaVoice."
Apache 2.0 Model weights on the 🤗Hub!

huggingface.co/metavoiceio/me…
Voice cloning works like a charm, too! ⚡

Text: "It is a beautiful day out there. I'd like to go out to the beach and catch some sun."
