Vaibhav (VB) Srivastav
chief get-shit-done officer @huggingface | F1 fan | Here for @at_sofdog’s wisdom | *opinions my own
May 1
Pretty fucking incredible week so far:

> Qwen3 - MoE (235B, 30B) + Dense (32, 14, 8, 4, 0.6B)
> Xiaomi - MiMo 7B dense
> Kyutai - Helium 2B dense
> DeepSeek - Prover V2 671B MoE
> Qwen2.5 Omni 3B
> Microsoft - Phi4 14B Reasoning, Mini (3.8B) & Plus
> JetBrains - Mellum 4B Dense
> AllenAI - OLMo2 1B Dense

And... it’s only Thursday! 🔥

Qwen3:

huggingface.co/collections/Qw…
Mar 20
NEW: Nvidia just open sourced Canary 1B & 180M Flash - multilingual speech recognition AND translation models 🔥

> Second on the Open ASR Leaderboard
> Achieves an inverse real-time factor (RTFx) greater than 1000 🤯
> 880M & 180M sizes - perfect for on-device
> Supports word-level & segment-level timestamps
> Fluent in English, German, French, Spanish
> Robust and fewer hallucinations
> CC-BY license - allows commercial use

Kudos @NVIDIAAIDev for such a brilliant release - looking forward to what the community does with it next! 🤗

Here's the 1B Flash model:

huggingface.co/nvidia/canary-…
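
If you want to kick the tyres, a rough sketch with NeMo - the class and transcribe call follow the model card, but treat the exact kwargs as assumptions if your NeMo version differs:

# pip install -U "nemo_toolkit[asr]"
from nemo.collections.asr.models import EncDecMultiTaskModel

# load the 1B Flash checkpoint straight from the Hub
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# transcribe a local wav file (English ASR is the default task)
output = model.transcribe(["sample.wav"], batch_size=1)
print(output[0])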
Jan 14
Wait WTF, @MiniMaxAI_ dropped MiniMax-Text-01: 456B parameters (45.9B active), beats DeepSeek v3, with a FOUR MILLION token context length - commercially permissive! 🔥

> On Hugging Face Hub & works w/ Transformers (custom code) 💥

You can check out the model weights here:

huggingface.co/MiniMaxAI/Mini…
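
A minimal loading sketch w/ Transformers - the repo id is my best guess from the link above, and at 456B params you'll realistically want multi-node or heavy quantisation rather than a single device_map="auto":

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-Text-01"  # assumed repo id

# custom modelling code on the Hub => trust_remote_code
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tok("Hello, MiniMax!", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))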
Dec 15, 2024
fuck it, time to set up this bad boi

step 1: omz
Dec 7, 2024
VLMs are going through quite an open revolution - AND at on-device-friendly sizes:

> Google DeepMind w/ PaliGemma2 - 3B, 10B & 28B

> OpenGVLabs w/ InternVL 2.5 - 1B, 2B, 4B, 8B, 26B, 38B & 78B

> Qwen w/ Qwen 2 VL - 2B, 7B & 72B

> Microsoft w/ FlorenceVL - 3B & 8B

(Links below)

PaliGemma 2:

huggingface.co/collections/go…
Nov 22, 2024
HOLY SHITTT! A fully open text-to-video model capable of generating 24 FPS video at 768x512 resolution in REAL TIME! 🤯

Model weights:

huggingface.co/Lightricks/LTX…
Nov 16, 2024
The audio LM scene is heating up! 🔥 @FixieAI Ultravox 0.4.1 - an 8B model approaching GPT-4o level: pick any LLM, train an adapter with Whisper as the audio encoder, profit 💥

Bonus: MIT licensed checkpoints

> Pre-trained on a Llama 3.1 8B/70B backbone as well as the encoder of whisper-large-v3-turbo

> Only the multi-modal adapter is trained, while Whisper encoder and LLM are kept frozen

> Uses a knowledge-distillation loss where Ultravox tries to match the logits of the LLM backbone (sketch below)
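
A conceptual sketch of that distillation term (not Fixie's exact code - temperature, shapes and reduction are assumptions):

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    # student = Ultravox with audio embeddings in place of the text tokens,
    # teacher = the frozen text-only LLM backbone fed the transcript
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2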

GG @FixieAI - play with it directly on the Space and check out the models on the Hub 🤗

Check out the Space here:

huggingface.co/spaces/freddya…
Nov 10, 2024
Hertz-dev - 8.5 billion parameters, full-duplex, audio-only base model, APACHE 2.0 licensed 🔥

> Trained on 20 million hours of audio

Train it on any downstream task: speech-to-speech, translation, classification, speech recognition, text-to-speech and more!

GG @si_pbc 🤗

Check out the model checkpoint here:

huggingface.co/si-pbc/hertz-d…
Oct 30, 2024
Fuck yeah! MaskGCT - New open SoTA Text to Speech model! 🔥

> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:
> Stage 1: Predicts semantic tokens from text, using tokens extracted from a speech self-supervised learning (SSL) model
> Stage 2: Predicts acoustic tokens conditioned on the semantic tokens.

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

The TTS scene keeps getting lit! 🐐

Model weights on the Hub:

huggingface.co/amphion/MaskGCT
Oct 18, 2024
The whale is back! Janus 1.3B, a multimodal LM for any-to-any tasks. Beats DALL-E 2/SDXL in image generation and LLaVA 1.5 7B on multimodal understanding - MIT licensed 🔥

Evaluations:

- MMBench: 69.4 (outperforms LLaVA-v1.5 7B: 67.9)
- SEED-Bench: 63.7 (outperforms LLaVA-v1.5 7B: 62.4)
- POPE: 87.0 (outperforms LLaVA-v1.5 7B: 85.5)
- MSCOCO-30K: FID score of 8.53 (outperforms DALL-E 2: 9.0)
- GenEval: Accuracy of 61% (outperforms SDXL: 58%)

Model Architecture:

> 1.3B parameters (outperforms models with 7B parameters)
> Two independent pathways for understanding and generation
> Unified Transformer: shares the same architecture for both pathways

- Text: uses the LLM's built-in tokenizer to convert text into discrete IDs
- Visual Understanding: a SigLIP encoder extracts high-dimensional semantic features from images, flattened into a 1-D sequence
- Visual Generation: a VQ tokenizer converts images into discrete IDs, flattened into a 1-D sequence
- Feature Mapping: understanding and generation adaptors map image features and codebook embeddings into the LLM input space
- Prediction Heads: the LLM's built-in head for text predictions, a randomly initialised head for image predictions

> Model checkpoints on the Hub and compatible w/ Transformers (remote code)

Congrats @deepseek_ai on yet another stellar release! 🔥

Check out the model here:

huggingface.co/deepseek-ai/Ja…
Oct 12, 2024
Let's goo! F5-TTS 🔊

> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Emotion based synthesis
> Long-form synthesis
> Supports code-switching
> Best part: CC-BY license (commercially permissive)🔥

Diffusion based architecture:
> Non-Autoregressive + Flow Matching with DiT
> Uses ConvNeXt to refine the text representation and alignment (sketch below)
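
For intuition, a toy sketch of the flow-matching objective family it sits in (illustrative only - the velocity network, shapes and conditioning are stand-ins, not F5-TTS's actual code):

import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, x1, cond):
    # x1: target mel features (batch, time, dim); cond: text/reference conditioning
    x0 = torch.randn_like(x1)                            # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                           # point on the straight-line path
    target_v = x1 - x0                                   # velocity of that path
    pred_v = velocity_net(xt, t, cond)                   # DiT-style net predicts the velocity
    return F.mse_loss(pred_v, target_v)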

Synthesised: I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right? (Happy emotion)

The TTS scene is on fire! 🐐

Check out the open model weights here:

huggingface.co/SWivid/F5-TTS
Jul 29, 2024
Apple spilled the beans on Apple Intelligence Foundation Models (notes below):

Architecture:
> Dense, decoder-only transformer architecture
> RMSNorm & query/key normalization
> GQA (w/ 8 KV heads) - see the sketch after this list
> SwiGLU activation & RoPE (base_freq=500K for long context)
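
GQA is the one bit above with a concrete number attached, so here's a minimal illustrative sketch - the query-head count and dims are placeholders, not AFM's real config:

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads=32, n_kv_heads=8):
    # q: (batch, seq, n_q_heads * head_dim); k, v: (batch, seq, n_kv_heads * head_dim)
    b, s, _ = q.shape
    d = q.shape[-1] // n_q_heads
    q = q.view(b, s, n_q_heads, d).transpose(1, 2)
    k = k.view(b, s, n_kv_heads, d).transpose(1, 2)
    v = v.view(b, s, n_kv_heads, d).transpose(1, 2)
    # each group of query heads shares a single KV head
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, s, n_q_heads * d)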

Pre-training & Tokenisation:
> Webpages crawled by Applebot
> Code & math datasets (publicly licensed)
> BPE tokenizer w/ 100K vocab for server & 49K for on-device

Three-step pre-training:

1. Core (consumes most of the compute budget)
- AFM-server: 6.3T tokens @ 4096 seq length
- AFM-on-device: initialised from a pruned 6.4B server model, trained on the full 6.3T tokens with an added distillation loss

2. Continued (down-weight lower-quality data; up-weight code, math and licensed data)
- 1T tokens @ 8192 seq length
- No distillation loss for AFM-on-device in this phase

3. Context-lengthening (long sequences + synthetic data)
- 100B tokens @ 32768 seq length

Training Infrastructure:
> Pre-trained on TPU v4 & v5p clusters
> Uses AXLearn (JAX) with a combination of tensor, FSDP and sequence parallelism
> AFM Server trained on 8192 TPUv4 chips
> AFM On-device trained on 2048 TPUv5p chips

Post Training:
> Hybrid data - synthetic + human annotated
> Synthetic data for mathematics (problem rephrasing, reversion & evolution), tool use and coding
> RLHF: Iterative Teaching Committee - refresh online human preference data collection using a diverse set of the best-performing models
> For the above, collect pairwise human preferences on responses sampled from the committee

Deployment:
> Adapters for each task; adapter values are stored in 16 bits and loaded on the fly based on the task
> Base weights quantised to under 4 bits per weight (3.7 bpw); accuracy-recovery adapters regain the lost performance (see the sketch after this list)
> The accuracy-recovery adapters are trained on 10B tokens, at ranks 8, 16 and 32
> Some less important layers are pushed down to 2-bit
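
Apple's exact stack isn't public, but the quantise-then-recover idea maps onto the familiar open-source quantise + low-rank-adapter recipe; a hedged sketch (model id, target modules and NF4 are stand-ins, not Apple's scheme):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. quantise the base weights (Apple averages ~3.7 bpw with a custom scheme; 4-bit NF4 here as a stand-in)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", quantization_config=bnb)

# 2. attach a small low-rank adapter (Apple trains ranks 8/16/32) and train only that
adapter = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, adapter)
model.print_trainable_parameters()  # only the adapter weights are trainable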

Evaluation:
> On-device: SoTA in IFEval and competitive with Gemma 7B on AlpacaEval 2.0
> Server: SoTA in IFEval, comparable to Mixtral 8x22B in Arena Hard
> Competitive with GPT-4/Gemini 1.5 on tool use/function calling and writing (summarisation, composition) benchmarks
> On-device beats Llama 3 8B on math

The report is packed with detail - I quite enjoyed skimming through it. Thanks Apple for being so open about your practices and spilling the beans on what would power the next gen of on-device ML.

More notes coming soon! 🤗

Link to the report:

machinelearning.apple.com/papers/apple_i…
Jul 15, 2024
AI Math Olympiad Winner - Running on Mac! 100% local 🔥

brew install llama.cpp

llama-cli \
  --hf-repo reach-vb/NuminaMath-7B-TIR-Q8_0-GGUF \
  --hf-file numinamath-7b-tir-q8_0.gguf \
  -p "For how many values of the constant $ k $ will the polynomial $ x^{2}+kx+36$ have two distinct integer roots?"

That's it! 🤗

Check out the quantised Q8 checkpoint here:

huggingface.co/reach-vb/Numin…
Jul 4, 2024
The TTS ecosystem has been booming lately:

1. Chat TTS - English + Chinese TTS model optimised for daily conversations/ dialogues + Voice Cloning

2. MARS5 TTS - English only but gives insane prosodic control paired with voice cloning

3. Parler TTS - Smol but powerful text prompt controlled TTS (we’re scaling it up right now)

4. Toucan - Massively Multilingual TTS in 4000+ languages (works even on CPU)

5. MetaVoice - 1B param model with deep voice cloning control. English only.

We’re only halfway through the year - pumped to see what the rest has in store for us!

What else am I missing from this year?

ChatTTS:

github.com/2noise/ChatTTS
Jun 4, 2024
Up to 6x faster Whisper with torch.compile and HQQ! 🔥

> With negligible drop in performance!

Code + benchmark released ⚡

Benchmarks on short-form datasets indicate very little drop in performance:

*look at that speed-up!
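
A rough sketch of the torch.compile half of the recipe (the HQQ quantisation step and the exact benchmark harness live in the released code; model id, cache and compile settings here are just reasonable defaults):

import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# static KV cache + compiled forward; expect the first few calls to be slow while kernels warm up
pipe.model.generation_config.cache_implementation = "static"
pipe.model.forward = torch.compile(pipe.model.forward, mode="reduce-overhead", fullgraph=True)

print(pipe("sample.wav")["text"])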
Apr 11, 2024
Kinda wild that you can merge models with SoTA techniques at the click of a button! 🤯

Presenting MergeKit UI - Drop in your config, access token and voila, you get a merged model back!

Supported merging methods:
1. Model Soups
2. SLERP
3. Task Arithmetic
4. TIES
5. DARE TIES
6. DARE TIES Arithmetic
7. Passthrough
8. Model Stock

We'll take care of the compute so you can work on what matters the most! ✨

Bring it on; let's merge our way to the current SoTA and beyond! 🤗

What would you like to see next? ⚡

Check out the space here:

huggingface.co/spaces/arcee-a…
Apr 9, 2024
CodeGemma 2B, 7B & 7B-it 💎

> Pretty strong model - beats CodeLlama 13B.
> Supports fill-in-the-middle (code completion), code generation and chat.
> Compatible with torch.compile()
> Optimised for speed - ~1.5x faster than comparable models.
> 2B model supports FIM only.
> 7B supports FIM + Code Generation.
> 7B-IT supports Code Generation + Chat.

> Try it out in transformers directly! 🤗

Check out all the models here:

huggingface.co/collections/go…
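
A quick fill-in-the-middle sketch with the 2B model (the FIM control tokens are from the CodeGemma model card - double-check them against the tokenizer if anything has changed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/codegemma-2b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# FIM prompt: code before and after the hole, the model fills the middle
prompt = "<|fim_prefix|>def fib(n):\n    <|fim_suffix|>\n    return fib(n - 1) + fib(n - 2)<|fim_middle|>"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=48)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))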
Mar 26, 2024
llama.cpp with OpenAI chat completions API! 🦙

100% local. Powered by Metal!

*sound on*

In 2 steps:

1. brew install ggerganov/ggerganov/llama.cpp

2. llama-server --model <path/to/model.gguf> -c 2048

P.S. All of this with a binary size of less than 5MB ;)
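
To hit it from code, any OpenAI-compatible client works; a minimal Python sketch assuming llama-server's default 127.0.0.1:8080 (adjust if you changed host/port):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local",  # llama-server serves one model; the name is mostly ignored
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
)
print(resp.choices[0].message.content)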

That's it! 🤗

Compatible with the 7,500+ GGUFs available on the Hugging Face Hub:

huggingface.co/models?library…
Mar 21, 2024
Introducing Distil-Whisper v3 ⚡

> ~50% fewer parameters and 6x faster than large-v3.
> More accurate than large-v3 on long-form transcription.

Available with 🦀 WebGPU, Whisper.cpp, Transformers, Faster-Whisper and Transformers.js support!
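
A minimal Transformers usage sketch (assuming the distil-whisper/distil-large-v3 repo id; device is a placeholder):

import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "mps" / "cpu"
)

# same call you'd make with large-v3 - nothing else changes
print(pipe("sample.wav", return_timestamps=True)["text"])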

Drop in; no changes required! 🔥

Along with this, we're announcing an alpha release of Ratchet - our optimised WebGPU framework for serving Whisper blazingly fast:

Written in 🦀 Rust!

huggingface.co/spaces/FL33TW0…
Mar 19, 2024
Introducing Quanto: A PyTorch Quantisation library! ⚡

a.k.a. the gpu poor toolkit ;)

> Supports int2, int4 & int8 weights.
> Works seamlessly on CUDA, MPS and CPU.
> Automagically operates with all PyTorch models.
> Native support for Transformers. 🤗
> Quantize, Calibrate or perform Quantization Aware Training!

Best part: minimal loss in accuracy/perplexity even with int4 quantisation.

Optimised matmul kernels for int2/4/8 coming soon!

> pip install quanto

github.com/huggingface/qu…
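
A minimal end-to-end sketch (import path as of this release - quanto has since moved under optimum-quanto; the model id is just an example):

from transformers import AutoModelForCausalLM
from quanto import freeze, qint4, quantize

model = AutoModelForCausalLM.from_pretrained("gpt2")

# mark Linear weights for int4 quantisation, then freeze to materialise the quantised weights
quantize(model, weights=qint4)
freeze(model)

print(model)  # Linear layers now show up as QLinear modules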
Mar 12, 2024
Introducing FACodec! ⚡

> Factorised Neural Speech Codec.
> Powers NaturalSpeech 3.
> Checkpoints and Codebase - Apache 2.0 Licensed.
> Performs zero-shot Voice Conversion.
> Consists of an explicit timbre extractor plus prosody, content and acoustic-detail quantisers.
> Current SoTA among neural speech codecs.
> Checkpoints on the Hugging Face Hub. 🤗

Check out the checkpoints here:

huggingface.co/amphion/natura…