The lads at @argmaxinc optimised Whisper to work at blazingly fast speeds on iOS and Mac!
> All code is MIT-licensed.
> Up to 3x faster than the competition.
> Neural Engine as well as Metal runners.
> Open source CoreML models.
> 2 lines of code :)
> Whisper & Whisper-Turbo (even faster variant)
(Look how it utilises ANE so beautifully in the video showing their sample app on Mac!)
An open-source Swift package from @argmaxinc for iOS and Mac devices 🍎
Whisper in transformers is now better at long-form generation! ⚡
We've observed up to a 2-point decrease in Word Error Rate! ;)
You can now use the same techniques as OpenAI's original Whisper implementation, but much faster, thanks to Flash Attention 2 and batching! 🔥
With batching, we've observed up to 4.5x improvements compared to the original implementation!
Make sure to upgrade to the latest version of Transformers - `pip install -U transformers`
Here's how you can test it too:
#!/usr/bin/env python3
from transformers import WhisperForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
import torch
import numpy as np
processor = AutoProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en", torch_dtype=torch.float16)
model.to("cuda")
# retrieve 8 long audio sequences
ds = load_dataset("distil-whisper/earnings21", "full")["test"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
ds = ds[:8] # take batch size of 8
raw_audio = [x["array"].astype(np.float32) for x in ds["audio"]]
# process input, make sure to pass `padding='longest'` and `return_attention_mask=True`
inputs = processor(
    raw_audio,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
    sampling_rate=16_000,
)
inputs = inputs.to("cuda", torch.float16)
# enable temperature fallback and the repetition-detection filters (`condition_on_prev_tokens` can be toggled as needed)
result = model.generate(
    **inputs,
    condition_on_prev_tokens=False,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,
    compression_ratio_threshold=1.35,
    return_timestamps=True,
)
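To get text back out, decode the generated ids with the processor (a minimal follow-up sketch reusing the variables above). If you have flash-attn installed, you can also pass `attn_implementation="flash_attention_2"` to `from_pretrained` to get the Flash Attention 2 speed-up mentioned above.
# decode the generated token ids back into transcripts
decoded = processor.batch_decode(result, skip_special_tokens=True)
print(decoded[0])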
Pseudocode (sketched in Python below):
1. Initialise a teacher model, e.g. openai/whisper-large-v2.
2. Load an assistant model, e.g. distil-whisper/distil-large-v2 or openai/whisper-tiny.
3. Pass the assistant model over to the pipeline.
4. Transcribe away!
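Here's a rough Python sketch of those steps using the transformers ASR pipeline with assisted generation (speculative decoding); "audio.wav" is a placeholder path and openai/whisper-tiny stands in for whichever assistant you pick:
#!/usr/bin/env python3
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# 1. the teacher model, e.g. openai/whisper-large-v2
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", torch_dtype=torch_dtype)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

# 2. the assistant model, e.g. openai/whisper-tiny (same tokenizer family as the teacher)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", torch_dtype=torch_dtype)
assistant_model.to(device)

# 3. pass the assistant over to the pipeline via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
)

# 4. transcribe away! ("audio.wav" is a placeholder path)
print(pipe("audio.wav")["text"])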
Making audio a first-class citizen in LLMs: Qwen Audio 🔉
Using a multi-task training framework, Qwen-Audio combines OpenAI's Whisper-large-v2 (as the audio encoder) with the Qwen 7B LM and trains them jointly on over 30 audio tasks.
Tasks ranging from Speech Recognition to Music Captioning to Language Identification to Sound Event Classification and more! 🔥
It beats the current SoTA across the tasks!
Bonus: Instruction-tuned Qwen-Audio-Chat allows for seamless multi-turn interactions through audio or text inputs.
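If you want to poke at the chat model, here's a rough sketch via the Hub's remote-code interface; the `from_list_format` and `chat` helpers come from Qwen's custom model code (treat the exact signatures as assumptions based on the model card), and "speech.wav" is a placeholder:
#!/usr/bin/env python3
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-Audio-Chat ships its own modelling & chat code, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True
).eval()

# first turn: mix an audio file and a text prompt ("speech.wav" is a placeholder)
query = tokenizer.from_list_format([
    {"audio": "speech.wav"},
    {"text": "What does the speaker say?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# second turn: plain text, reusing the history for multi-turn chat
response, history = model.chat(tokenizer, query="What language was that in?", history=history)
print(response)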