The lads at @argmaxinc optimised Whisper to work at blazingly fast speeds on iOS and Mac!
> All code is MIT-licensed.
> Up to 3x faster than the competition.
> Neural Engine as well as Metal runners.
> Open source CoreML models.
> 2 lines of code :)
> Whisper & Whisper-Turbo (even faster variant)
(Look how it utilises ANE so beautifully in the video showing their sample app on Mac!)
An open-source Swift package from @argmaxinc for iOS and Mac devices 🍎
Whisper in transformers is now better at long-form generation! ⚡
We've observed up to a 2-point decrease in Word Error Rate! ;)
You can now use the same techniques as OpenAI's original Whisper implementation, but much faster, thanks to Flash Attention 2 and batching! 🔥
With batching, we've observed up to 4.5x improvements compared to the original implementation!
Make sure to upgrade to the latest version of Transformers - `pip install -U transformers`
Here's how you can test it too:
#!/usr/bin/env python3
from transformers import WhisperForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
import torch
import numpy as np
processor = AutoProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en", torch_dtype=torch.float16)
model.to("cuda")
# retrieve 8 long audio sequences
ds = load_dataset("distil-whisper/earnings21", "full")["test"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
ds = ds[:8] # take batch size of 8
raw_audio = [x["array"].astype(np.float32) for x in ds["audio"]]
# process input, make sure to pass `padding='longest'` and `return_attention_mask=True`
inputs = processor(
    raw_audio,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
    sampling_rate=16_000,
)
inputs = inputs.to("cuda", torch.float16)
# enable temperature fallback and the repetition-detection filters (`condition_on_prev_tokens` can be toggled as needed)
result = model.generate(
    **inputs,
    condition_on_prev_tokens=False,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,
    compression_ratio_threshold=1.35,
    return_timestamps=True,
)
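To get text back out, decode the generated ids with the processor (a minimal follow-up sketch reusing the variables above). If you have flash-attn installed, you can also pass `attn_implementation="flash_attention_2"` to `from_pretrained` to get the Flash Attention 2 speed-up mentioned above.
# decode the generated token ids back into transcripts
decoded = processor.batch_decode(result, skip_special_tokens=True)
print(decoded[0])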
Pseudocode (sketched in Python below):
1. Initialise a teacher model, e.g. openai/whisper-large-v2.
2. Load an assistant model, e.g. distil-whisper/distil-large-v2 or openai/whisper-tiny.
3. Pass the assistant model over to the pipeline.
4. Transcribe away!
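Here's a rough Python sketch of those steps using the transformers ASR pipeline with assisted generation (speculative decoding); "audio.wav" is a placeholder path and openai/whisper-tiny stands in for whichever assistant you pick:
#!/usr/bin/env python3
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# 1. the teacher model, e.g. openai/whisper-large-v2
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", torch_dtype=torch_dtype)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

# 2. the assistant model, e.g. openai/whisper-tiny (same tokenizer family as the teacher)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", torch_dtype=torch_dtype)
assistant_model.to(device)

# 3. pass the assistant over to the pipeline via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
)

# 4. transcribe away! ("audio.wav" is a placeholder path)
print(pipe("audio.wav")["text"])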
Making audio a first-class citizen in LLMs: Qwen Audio 🔉
Using a multi-task training framework, Qwen-Audio combines OpenAI's Whisper-large-v2 (as the audio encoder) with the Qwen 7B LM and trains them jointly on over 30 audio tasks.
Tasks ranging from Speech Recognition to Music Captioning to Language Identification to Sound Event Classification and more! 🔥
It beats the current SoTA across the tasks!
Bonus: Instruction-tuned Qwen-Audio-Chat allows for seamless multi-turn interactions through audio or text inputs.
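If you want to poke at the chat model, here's a rough sketch via the Hub's remote-code interface; the `from_list_format` and `chat` helpers come from Qwen's custom model code (treat the exact signatures as assumptions based on the model card), and "speech.wav" is a placeholder:
#!/usr/bin/env python3
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-Audio-Chat ships its own modelling & chat code, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True
).eval()

# first turn: mix an audio file and a text prompt ("speech.wav" is a placeholder)
query = tokenizer.from_list_format([
    {"audio": "speech.wav"},
    {"text": "What does the speaker say?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# second turn: plain text, reusing the history for multi-turn chat
response, history = model.chat(tokenizer, query="What language was that in?", history=history)
print(response)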