Vaibhav (VB) Srivastav · Nov 22
HOLY SHITTT! A fully open text-to-video model capable of generating 24 FPS at 768x512 resolution in REAL TIME! 🤯

Try it out for free here:

huggingface.co/spaces/Lightri…
Works out of the box with Comfy too!
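
For the curious, here's a minimal sketch of pulling the weights with diffusers' generic pipeline loader; the repo id and the generation settings are assumptions, so check the model card / space for the exact usage:

```python
# Minimal sketch (untested): loading the released weights with diffusers'
# generic pipeline loader. The repo id "Lightricks/LTX-Video" and the
# generation arguments below are assumptions; check the model card for the
# exact pipeline class, resolution, and frame-count settings.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

frames = pipe(
    prompt="a red fox running through fresh snow, cinematic lighting",
    width=768,
    height=512,
    num_frames=121,   # roughly 5 seconds at 24 FPS (assumed supported length)
).frames[0]

export_to_video(frames, "fox.mp4", fps=24)
```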


More from @reach_vb

Nov 16
The audio LM scene is heating up! 🔥 @FixieAI Ultravox 0.4.1 - an 8B model approaching GPT-4o level: pick any LLM, train an adapter with Whisper as the audio encoder, profit 💥

Bonus: MIT licensed checkpoints

> Pre-trained on a Llama 3.1 8B/70B backbone plus the encoder of whisper-large-v3-turbo

> Only the multi-modal adapter is trained, while the Whisper encoder and the LLM are kept frozen

> Uses a knowledge-distillation loss where Ultravox tries to match the logits of the LLM backbone
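
Roughly, the training setup looks like this (a toy sketch, not the Ultravox code; module names are placeholders):

```python
# Toy sketch of the recipe above (not the Ultravox code; module names are
# placeholders). The Whisper encoder and the LLM stay frozen; only the
# audio->LLM adapter is trained, and the loss pulls the audio-conditioned
# logits towards the logits the frozen LLM produces for the matching text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioAdapter(nn.Module):
    """Projects Whisper encoder states into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, audio_states):       # (B, T_audio, audio_dim)
        return self.proj(audio_states)     # (B, T_audio, llm_dim)

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL between the frozen LLM's (teacher) and the audio-conditioned (student)
    next-token distributions, assumed aligned to the same target positions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Training step outline (whisper_encoder and llm are frozen models you load yourself):
#   audio_states   = whisper_encoder(mel)                       # frozen
#   audio_embeds   = adapter(audio_states)                      # trained
#   student_logits = llm(inputs_embeds=audio_embeds).logits     # frozen LLM, audio input
#   teacher_logits = llm(input_ids=text_ids).logits             # frozen LLM, text input
#   loss = kd_loss(student_logits, teacher_logits)
```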

GG @FixieAI - Play with it directly on the space and check out the models on the hub 🤗
Check out the space here:

huggingface.co/spaces/freddya…
Check out all the model checkpoints here:

huggingface.co/collections/re…
Nov 10
Hertz-dev - 8.5 billion parameters, full-duplex, audio-only base model, APACHE 2.0 licensed 🔥

> Trained on 20 million hours of audio

Fine-tune it for any downstream task: speech-to-speech, translation, classification, speech recognition, text-to-speech, and more!

GG @si_pbc 🤗
Check out the model checkpoint here:

huggingface.co/si-pbc/hertz-d…
You can even play around with the model on this space:

huggingface.co/spaces/si-pbc/…
Oct 30
Fuck yeah! MaskGCT - New open SoTA Text to Speech model! 🔥

> Zero-shot voice cloning
> Emotional TTS
> Trained on 100K hours of data
> Long form synthesis
> Variable speed synthesis
> Bilingual - Chinese & English
> Available on Hugging Face

Fully non-autoregressive architecture:
> Stage 1: predicts semantic tokens from text (the semantic tokens are extracted with a speech self-supervised learning (SSL) model)
> Stage 2: predicts acoustic tokens conditioned on the semantic tokens
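
For intuition, non-autoregressive token predictors like this are typically sampled with MaskGIT-style iterative mask-and-predict decoding; here's a minimal, self-contained sketch of that loop (illustrative only, not the amphion/MaskGCT inference code):

```python
# Minimal, self-contained sketch of MaskGIT-style iterative decoding, the usual
# way such non-autoregressive token predictors are sampled (illustrative only,
# not the amphion/MaskGCT inference code).
import torch

def maskgit_decode(predict_logits, length, vocab_size, steps=8):
    """Start fully masked; each step, commit the most confident positions."""
    MASK = vocab_size                                   # extra id used as the mask token
    tokens = torch.full((1, length), MASK, dtype=torch.long)
    for step in range(steps):
        logits = predict_logits(tokens)                 # (1, L, vocab_size)
        probs, candidates = logits.softmax(-1).max(-1)  # confidence + argmax per position
        masked = tokens.eq(MASK)
        target_filled = round(length * (step + 1) / steps)
        n_new = target_filled - int((~masked).sum())    # slots to commit this step
        if n_new <= 0:
            continue
        probs = probs.masked_fill(~masked, float("-inf"))   # only rank still-masked slots
        idx = probs.topk(n_new, dim=-1).indices
        tokens.scatter_(1, idx, candidates.gather(1, idx))
    return tokens

# Shape-only demo with a random stand-in model:
dummy_model = lambda toks: torch.randn(1, toks.shape[1], 1024)
semantic_tokens = maskgit_decode(dummy_model, length=200, vocab_size=1024)
# Stage 1 would produce semantic tokens like these from text; Stage 2 then
# predicts acoustic tokens conditioned on them before a vocoder renders audio.
```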

Synthesised: "Would you guys personally like to have a fake fireplace, an electric one, in your house? Or would you rather have a real fireplace? Let me know down below. Okay everybody, that's all for today's video and I hope you guys learned a bunch of furniture vocabulary!"

TTS scene keeps getting lit! 🐐
Model weights on the Hub:

huggingface.co/amphion/MaskGCT
You can also try it out directly over here:

huggingface.co/spaces/amphion…
Oct 18
The whale is back! Janus 1.3B, a multi-modal LM for any-to-any tasks. Beats DALL-E 2/SDXL in image generation and LLaVA 1.5 7B in multimodal understanding - MIT licensed 🔥

Evaluations:

- MMBench: 69.4 (outperforms LLaVA-v1.5 7B: 67.9)
- SEED-Bench: 63.7 (outperforms LLaVA-v1.5 7B: 62.4)
- POPE: 87.0 (outperforms LLaVA-v1.5 7B: 85.5)
- MSCOCO-30K: FID score of 8.53 (outperforms DALL-E 2: 9.0)
- GenEval: Accuracy of 61% (outperforms SDXL: 58%)

Model Architecture:

> 1.3B params (outperforms models with 7B parameters)
> Two independent pathways for understanding and generation

> Unified Transformer: shares the same architecture for both pathways

> Multimodal Understanding: uses the LLM's built-in tokenizer to convert text into discrete IDs; a SigLIP encoder extracts high-dimensional semantic features from images, flattened into a 1-D sequence

> Visual Generation: uses a VQ tokenizer to convert images into discrete IDs, flattened into a 1-D sequence

> Feature Mapping: understanding and generation adaptors map image features and codebook embeddings into the LLM input space

> Prediction Heads: built-in LLM head for text predictions, randomly initialized head for image predictions

> Model checkpoints on the Hub, compatible w/ Transformers (remote code)

Congrats @deepseek_ai on yet another stellar release! 🔥
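
To make the two-pathway idea concrete, here's a toy sketch (illustrative dimensions, not DeepSeek's code) of how the adaptors map both image representations into the LLM input space:

```python
# Toy sketch (not DeepSeek's code; all dimensions are illustrative) of the
# "two pathways, one transformer" idea: understanding feeds continuous SigLIP
# features, generation feeds discrete VQ codebook ids, and separate adaptors
# map both into the same LLM input space.
import torch
import torch.nn as nn

class JanusLikeIO(nn.Module):
    def __init__(self, llm_dim=2048, siglip_dim=1152, vq_codes=16384, vocab=32000):
        super().__init__()
        self.text_embed  = nn.Embedding(vocab, llm_dim)     # LLM's own token embeddings
        self.und_adaptor = nn.Linear(siglip_dim, llm_dim)   # understanding adaptor
        self.vq_embed    = nn.Embedding(vq_codes, llm_dim)  # VQ codebook embeddings
        self.gen_adaptor = nn.Linear(llm_dim, llm_dim)      # generation adaptor
        self.text_head   = nn.Linear(llm_dim, vocab)        # built-in head for text
        self.image_head  = nn.Linear(llm_dim, vq_codes)     # randomly initialised head for images

    def embed_for_understanding(self, text_ids, siglip_feats):
        # siglip_feats: (B, N_patches, siglip_dim), already flattened into a 1-D sequence
        img = self.und_adaptor(siglip_feats)
        return torch.cat([img, self.text_embed(text_ids)], dim=1)

    def embed_for_generation(self, text_ids, vq_ids):
        # vq_ids: (B, N) discrete image token ids from the VQ tokenizer
        img = self.gen_adaptor(self.vq_embed(vq_ids))
        return torch.cat([self.text_embed(text_ids), img], dim=1)

# The shared transformer consumes either sequence; the built-in head predicts
# text tokens and the new image head predicts the next VQ ids for generation.
```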
Check out the model here:

huggingface.co/deepseek-ai/Ja…
The model looks pretty strong for its size.
Oct 12
Let's goo! F5-TTS 🔊

> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Emotion based synthesis
> Long-form synthesis
> Supports code-switching
> Best part: CC-BY license (commercially permissive)🔥

Diffusion based architecture:
> Non-Autoregressive + Flow Matching with DiT
> Uses ConvNeXt to refine the text representation and alignment
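
For intuition, here's a minimal sketch of the conditional flow-matching objective this family of non-autoregressive models trains with (generic, not the actual F5-TTS code; a tiny MLP stands in for the DiT):

```python
# Minimal sketch of a conditional flow-matching training step, the general
# objective behind this style of non-autoregressive TTS (not the actual
# F5-TTS code; a tiny MLP stands in for the DiT).
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: target mel frames (B, T, D); cond: conditioning (text / reference audio)."""
    x0 = torch.randn_like(x1)              # noise sample
    t = torch.rand(x1.shape[0], 1, 1)      # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # straight-line interpolation
    target_velocity = x1 - x0              # the velocity the model should predict
    return F.mse_loss(model(xt, t, cond), target_velocity)

class TinyVelocityNet(nn.Module):
    """Stand-in for the DiT, just to make the sketch executable."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + dim, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, xt, t, cond):
        t = t.expand(xt.shape[0], xt.shape[1], 1)
        return self.net(torch.cat([xt, t, cond], dim=-1))

model = TinyVelocityNet()
mel   = torch.randn(2, 100, 80)    # (batch, frames, mel bins)
cond  = torch.randn(2, 100, 80)    # placeholder conditioning, frame-aligned
loss  = flow_matching_loss(model, mel, cond)
```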

Synthesised: I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right? (Happy emotion)

The TTS scene is on fire! 🐐
Check out the open model weights here:

huggingface.co/SWivid/F5-TTS
Jul 29
Apple spilled the beans on Apple Intelligence Foundation Models (notes below):

Architecture:
> Dense, decoder-only transformer architecture
> RMSNorm & query/key normalization
> GQA (w/ 8 KV heads)
> SwiGLU activation & RoPE (base_freq=500K for long context)
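
These choices map almost one-to-one onto a Llama-style config; purely as an illustration (hidden sizes and layer counts are made up, and query/key normalization has no corresponding field):

```python
# Illustrative only: the listed choices map almost one-to-one onto a Llama-style
# config. Sizes below are made up (the notes don't give AFM's dimensions), and
# query/key normalization has no corresponding LlamaConfig field.
from transformers import LlamaConfig

afm_like = LlamaConfig(
    hidden_size=4096,           # assumed, not from the report
    num_hidden_layers=32,       # assumed
    num_attention_heads=32,     # assumed
    num_key_value_heads=8,      # GQA with 8 KV heads
    hidden_act="silu",          # SwiGLU MLP
    rope_theta=500_000,         # RoPE base frequency for long context
    rms_norm_eps=1e-5,          # RMSNorm
    vocab_size=100_000,         # server tokenizer vocab (49K on-device)
)
```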

Pre-training & Tokenisation:
> Webpages crawled by Applebot (web crawl)
> Code & math datasets (publicly licensed)
> BPE tokenizer w/ 100K vocab for server & 49K for on-device

Three-step pre-training:

> Core (consumes most of the compute budget)
  - AFM-server: 6.3T tokens, w/ 4096 seq length
  - AFM-on-device: initialised from a pruned 6.4B server model, trained for the full 6.3T tokens with an added distillation loss

> Continued (down-weight lower-quality data; up-weight code, math, and licensed data)
  - 1T tokens, w/ 8192 seq length
  - no distillation loss for AFM-on-device in this phase

> Context-lengthening with long-sequence + synthetic data
  - 100B tokens, w/ 32768 seq length

Training Infrastructure:
> Pre-trained on TPU v4 & v5p clusters
> Uses AXLearn (JAX) with a combination of tensor, FSDP, and sequence parallelism
> AFM Server trained on 8192 TPUv4 chips
> AFM On-device trained on 2048 TPUv5p chips

Post Training:
> Hybrid data - synthetic + human annotated
> Synthetic data for Mathematics (problem rephrase & reversion + evolution), Tool use and coding
> RLHF: Iterative Teaching Committee - refresh online human preference data collection using a diverse set of best-performing models
> For the above, collect pairwise human preferences on responses sampled from the committee

Deployment:
> Adapters for each task, with adapter values represented in 16 bits, loaded on the fly based on the task
> Quantised below 4 bits per weight (3.7 bpw), with accuracy-recovery adapters to regain the lost performance
> Accuracy-recovery adapters trained on 10B tokens at different ranks: 8, 16, 32
> Some less important layers pushed to 2-bit
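
The accuracy-recovery idea in a toy sketch (not Apple's implementation; quantisation is only mimicked by rounding, and the adapter is a plain LoRA-style low-rank pair):

```python
# Toy sketch of the accuracy-recovery idea (not Apple's implementation): the
# base weight is frozen and, in the real system, quantised to ~3.7 bits per
# weight (here only mimicked by rounding to a few levels), while a small
# rank-r adapter is trained on top to win back the lost quality.
import torch
import torch.nn as nn

class RecoveryAdapterLinear(nn.Module):
    def __init__(self, in_f, out_f, rank=16, levels=16):
        super().__init__()
        w = torch.randn(out_f, in_f) * 0.02
        scale = w.abs().max() / (levels // 2)
        self.register_buffer("w_q", torch.round(w / scale) * scale)  # frozen "quantised" base
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trainable down-projection
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))         # trainable up-projection, zero-init

    def forward(self, x):
        return x @ self.w_q.T + x @ self.lora_a.T @ self.lora_b.T

layer = RecoveryAdapterLinear(4096, 4096, rank=16)   # ranks 8 / 16 / 32 per the notes
print(sum(p.numel() for p in layer.parameters()))    # only the adapter parameters are trainable
```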

Evaluation:
> On-device: SoTA in IFEval and competitive with Gemma 7B on AlpacaEval 2.0
> Server: SoTA in IFEval, comparable to Mixtral 8x22B in Arena Hard
> Competitive with GPT-4/Gemini 1.5 on tool/function calling and writing (summarisation, composition) benchmarks
> On-device beats Llama 3 8B on math

The report is packed with detail; I quite enjoyed skimming through it. Thanks Apple for being so open about your practices and spilling the beans on what will power the next gen of on-device ML.

More notes coming soon! 🤗
Maxime (@maximelabonne) did a wonderful deep-dive on the Post-training bit, check it out!
