Deepgram
Nov 14, 2023 · 19 tweets · 5 min read
The wait is over! With 60M+ minutes transcribed, our next-gen speech-to-text model Nova-2 is now available.

What's new?
✅ Expanded languages: Spanish, Hindi, German, French, Portuguese
✅ Custom model training
✅ On-prem deployment

Let's dive in...🧵 dpgr.am/9f52615
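As a rough sketch of what calling a hosted STT model like Nova-2 looks like in practice, the snippet below assembles a pre-recorded transcription request. The endpoint URL, query parameters, and auth header format here are assumptions for illustration only, not official documentation; check the API reference before relying on them.

```python
# Illustrative sketch of a request to a hosted speech-to-text API.
# Endpoint, parameters, and auth scheme below are assumptions, not official docs.

def build_transcription_request(api_key: str, model: str = "nova-2",
                                language: str = "en") -> dict:
    """Assemble the URL and headers for a pre-recorded transcription call."""
    base_url = "https://api.deepgram.com/v1/listen"  # assumed endpoint
    params = f"model={model}&language={language}&smart_format=true&diarize=true"
    return {
        "url": f"{base_url}?{params}",
        "headers": {
            "Authorization": f"Token {api_key}",  # assumed auth scheme
            "Content-Type": "audio/wav",
        },
    }

req = build_transcription_request("YOUR_API_KEY")
print(req["url"])
```

The request itself would then be sent with any HTTP client, posting the raw audio bytes as the body.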
In our early access release, Nova-2 impressed developers with its unmatched performance and value compared to competitors.

✅ An average 30% reduction in word error rate (WER)
✅ 5-40x faster speed
✅ 3-5x lower costs
✅ Full feature set: diarization, smart formatting, and more
Since then, thousands of projects have been built across diverse use cases, from autonomous AI agents and coaching bots to call center analytics and conversational AI platforms, transcribing more than 60 million minutes of audio in little more than a month.
Our benchmarking results center on industry-standard accuracy metrics for #STT such as word error rate (WER) and word recognition rate (WRR).

But many research topics in #NLP require evaluation that goes beyond standard metrics, towards a more human-centered approach.
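For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

Production benchmarks typically add text normalization (casing, punctuation, number formats) before scoring, which this sketch omits.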
For evaluating #LLM-powered speech systems, human preference scoring is often considered one of the most reliable approaches, despite its higher cost compared to automated evaluation methods.
To this end, we conducted human preference testing using outside professional annotators who examined a set of 50 unique transcripts produced by Nova-2 and 3 other providers, evaluated in randomized head-to-head comparisons (totaling 300 unique transcription preference matchups).
They were then asked to listen to the audio and indicate which formatted transcript they preferred, based on an open-ended criterion.

Figure 2: For each annotator, the "preferred count" is shown for each vendor (i.e., the number of times the annotator preferred that vendor's transcripts across all comparisons).
In head-to-head comparisons, Nova-2 transcripts were preferred ~60% of the time, and 5 of 6 annotators preferred formatted Nova-2 results more than any other vendor's. Nova-2 had the highest overall win rate at 42%, a 34% higher win rate than the next-best, OpenAI Whisper[2].

Figure 3: The file win/preference rate, i.e. the percent of audio files where that vendor's transcript was most commonly preferred over all other vendors.
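To make those two statistics concrete, here is a sketch of tallying per-annotator preferred counts and a per-file win rate from pairwise matchups. The records below are made-up toy data, not the study's actual annotations:

```python
from collections import Counter

# Toy pairwise preference records (hypothetical, not the study's real data):
# each entry is (audio_file, winning_vendor) for one head-to-head comparison.
matchups = [
    ("file1", "A"), ("file1", "A"), ("file1", "B"),
    ("file2", "B"), ("file2", "A"), ("file2", "B"),
]

# "Preferred count": how often each vendor's transcript won a comparison.
preferred = Counter(vendor for _, vendor in matchups)

def file_win_rate(vendor: str) -> float:
    """Fraction of files where this vendor was most commonly preferred."""
    files = {f for f, _ in matchups}
    wins = 0
    for f in files:
        tally = Counter(v for ff, v in matchups if ff == f)
        if tally.most_common(1)[0][0] == vendor:
            wins += 1
    return wins / len(files)

print(preferred)           # Counter({'A': 3, 'B': 3})
print(file_win_rate("A"))  # 0.5
```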
We conducted benchmarks against OpenAI's recently released #WhisperV3 but were perplexed by the results: we frequently encountered significant repetitive hallucinations in a sizable portion of our test files.
The result was a much higher WER and wider variance for OpenAI’s latest model than expected.

We are actively investigating how to better tune Whisper-v3 in order to conduct a fair and comprehensive comparison with Nova-2.

(Stay tuned, and we will share the results soon).
Our benchmarking methodology for non-English languages with Nova-2 used 50+ hours of high-quality, human-annotated audio, encompassing a wide range of audio lengths, varied environments, diverse accents, and subjects across many domains.
We transcribed these datasets with Nova-2 and some of the most prominent STT models on the market for leading non-English languages.
Nova-2 outperforms all tested competitors by an average of 30.3%.

Significant performance:
➡️ Hindi (41% relative WER improvement)
➡️ Spanish (15% relative WER improvement)
➡️ German (27% relative WER improvement)
➡️ 34% relative WER improvement vs. Whisper large-v2

Figure 4: Nova-2's median file word error rate (WER) for Hindi pre-recorded transcription across all audio domains.
Nova-2 not only outperforms rivals in accuracy but also shows less variance in results, leading to more reliable transcripts across diverse languages in practical applications.

Figure 5: Average non-English word error rate (WER) of Nova-2 for Spanish, German, and Hindi versus other popular models across four audio domains: video/media, podcast, meeting, and phone call. The boxplot shows each dataset's five-number summary (minimum, first quartile, median, third quartile, and maximum), making the spread and skewness of the results visible.
Nova-2 beats the competition for non-English streaming by more than 2% in absolute WER across all languages combined, with a 23% relative WER improvement on average (and an 11% relative WER improvement over the next-best alternative, Azure), as shown below.

Figure 6: Nova-2's relative word error rate (WER) improvement percentage for streaming transcription across all non-English languages and audio domains.
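The thread quotes both absolute (percentage-point) and relative WER improvements; they are easy to conflate. The distinction, with illustrative numbers rather than the benchmark's actual figures:

```python
def absolute_improvement(wer_baseline: float, wer_new: float) -> float:
    """Percentage-point drop in WER."""
    return wer_baseline - wer_new

def relative_improvement(wer_baseline: float, wer_new: float) -> float:
    """Drop in WER as a fraction of the baseline WER."""
    return (wer_baseline - wer_new) / wer_baseline

# Illustrative numbers only: a baseline WER of 10% reduced to 7.7%.
print(round(absolute_improvement(0.10, 0.077), 3))  # 0.023 (2.3 points)
print(round(relative_improvement(0.10, 0.077), 2))  # 0.23 (23% relative)
```

A "23% relative improvement" on a 10% baseline WER is therefore only about a 2.3-point absolute drop.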
Nova-2 stands out for its speed and accuracy, and it also continues the legacy of its predecessor as the most cost-effective speech-to-text model on the market.

Priced competitively at just $0.0043 per minute for pre-recorded audio, Nova-2 is 3-5x more affordable than other comprehensive providers in the market (based on current listings).
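At the quoted $0.0043/min, projecting spend is simple arithmetic. A quick sketch (the 4x competitor multiplier below is a hypothetical placeholder, not a specific vendor's rate):

```python
NOVA2_RATE = 0.0043  # USD per minute of pre-recorded audio, as quoted above

def transcription_cost(minutes: float, rate_per_min: float = NOVA2_RATE) -> float:
    """Total cost in USD for a given volume of audio."""
    return minutes * rate_per_min

# 60M minutes at the quoted rate, vs. a hypothetical provider charging 4x more.
print(round(transcription_cost(60_000_000), 2))                 # 258000.0
print(round(transcription_cost(60_000_000, 4 * NOVA2_RATE), 2)) # 1032000.0
```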
As the #AI market has come to recognize, customization can be vital for making AI models work for your use case. Deepgram customers can now have a custom-trained model created for them using the best foundation available: Nova-2.
Deepgram can also provide data labeling services, and even create audio to train on to ensure your custom models produce the best results possible, giving you a boost in performance atop Nova-2’s already impressive, out-of-the-box capabilities.
Dive into the details of this latest release, our approach to benchmarking, and more in the full announcement.

We can't wait to see what you build with Nova-2! Be sure to share your projects with us by mentioning us here.

Happy building!
