curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "The meaning of life is :", "n_predict": 512}'
Streaming mode works too:
just add "stream": true to the JSON body.
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "The meaning of life is :", "n_predict": 512, "stream": true}'
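For anyone who would rather consume the stream from Python than from curl, here is a minimal sketch using only the standard library. It assumes the server replies with server-sent events, one `data: {...}` line per chunk, each carrying a `"content"` string and a `"stop"` flag; check that against your server version. The helper names `iter_stream_content` and `stream_completion` are made up for this example.

```python
import json
from typing import Iterable, Iterator
from urllib import request


def iter_stream_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments found in a stream of event lines.

    Assumption: each event is a line like b'data: {...json...}' whose JSON
    object has a "content" string and a "stop" boolean.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        event = json.loads(line[len("data:"):])
        yield event.get("content", "")
        if event.get("stop"):
            break  # the server marked this chunk as the last one


def stream_completion(prompt: str, n_predict: int = 512,
                      url: str = "http://localhost:8080/completion") -> Iterator[str]:
    """POST the same JSON body as the curl command and stream the reply."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict, "stream": True})
    req = request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP response object can be iterated line by line as bytes.
    with request.urlopen(req) as resp:
        yield from iter_stream_content(resp)
```

With the server running, looping over `stream_completion("The meaning of life is :")` and printing each piece with `end=""` should show the text as it is generated.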
Introducing Quanto: A PyTorch Quantisation library! ⚡
a.k.a. the gpu poor toolkit ;)
> Supports int2, int4 and int8 weights.
> Works seamlessly on CUDA, MPS and CPU.
> Automagically operates with all PyTorch models.
> Native support for Transformers. 🤗
> Quantize, Calibrate or perform Quantization Aware Training!
Best part: Minimal loss in accuracy/perplexity, even with int4 quantisation.
Optimised matmul kernels for int2, int4 and int8 coming soon!
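To get a feel for what int2, int4 and int8 weights mean, here is a conceptual sketch in plain Python. It is not Quanto's implementation or API: it assumes the simplest scheme, symmetric quantisation with a single scale per tensor, and the function names are invented for illustration.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4, 1 for int2
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
```

Running `mean_abs_error` on the same list of weights with `bits` set to 8, 4 and 2 shows the round-trip error growing as the bit width shrinks. Real libraries typically use finer-grained scales and calibration to keep that error small.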
MetaVoice, a new TTS model:
> 1.2B parameter model.
> Trained on 100K hours of data.
> Supports zero-shot voice cloning.
> Short & long-form synthesis.
> Emotional speech.
> Best part: Apache 2.0 licensed. 🔥
Powered by a simple yet robust architecture:
> EnCodec (Multi-Band Diffusion) and GPT + Encoder Transformer LM.
> DeepFilterNet to clear up MBD artefacts.
Synthesised: "Have you heard about this new TTS model called MetaVoice."