In our early access release, Nova-2 impressed developers with its unmatched performance and value compared to competitors.
✅ An average 30% reduction in word error rate (WER)
✅ 5-40x faster speed
✅ 3-5x lower costs
✅ Full feature set: diarization, smart formatting, and more
Since then, thousands of projects have been developed across diverse use cases from autonomous AI agents and coaching bots to call center analytics and conversational AI platforms, transcribing more than 60 million audio minutes in a little more than a month.
Regarding benchmarking, the results center on industry-standard accuracy metrics for #STT like word error rate (WER) and word recognition rate (WRR).
But many research topics in #NLP require evaluation that goes beyond standard metrics, towards a more human-centered approach.
For evaluating #LLM-powered speech generation systems, human preference scoring is often considered one of the most reliable approaches in spite of its high costs compared to automated evaluation methods.
To this end, we conducted human preference testing using outside professional annotators who examined a set of 50 unique transcripts produced by Nova-2 and 3 other providers, evaluated in randomized head-to-head comparisons (totaling 300 unique transcription preference matchups).
They were then asked to listen to the audio and give their preference of formatted transcripts based on an open-ended criterion.
In head-to-head comparisons, Nova-2 transcripts were preferred ~60% of the time, and 5/6 annotators preferred formatted Nova-2 results more than any other vendor. Nova-2 had the highest overall win rate at 42%, which is a 34% higher win rate than the next-best OpenAI Whisper[2].
We conducted benchmarks against OpenAI’s recently released #WhisperV3, but were a bit perplexed by the results as we frequently encountered significant repetitive hallucinations in a sizable portion of our test files.
The result was a much higher WER and wider variance for OpenAI’s latest model than expected.
We are actively investigating how to better tune Whisper-v3 in order to conduct a fair and comprehensive comparison with Nova-2.
(Stay tuned, and we will share the results soon).
Our benchmarking methodology for non-English languages with Nova-2 utilized 50+ hours of high quality, human-annotated audio, encompassing a wide range of audio lengths, varying environments, diverse accents, and subjects across many domains.
We transcribed these data sets with Nova-2 and some of the most prominent STT models in the market for leading non-English languages.
Nova-2 outperforms all tested competitors by an average of 30.3%.
Significant performance:
➡️ Hindi (41% relative WER improvement)
➡️ Spanish (15% relative WER improvement)
➡️ German (27% relative WER improvement)
➡️ 34% relative WER improvement vs. Whisper large-V2.
Nova-2 not only outperforms rivals in accuracy but also shows less variation in results, leading to more reliable transcripts across diverse languages in practical applications.
Nova-2 beats the competition for non-English streaming by more than 2% in absolute WER over all languages combined with a 23% relative WER improvement on average (and 11% relative WER improvement over the next best alternative, Azure), as shown below.
Nova-2 stands out for its speed and accuracy, but also continues the legacy of its predecessor by being the most cost-effective speech-to-text model on the market.
Priced competitively at just $0.0043 per minute for pre-recorded audio, Nova-2 is 3-5x more affordable.
As the #AI market has come to recognize, customization can be vital for making AI models work for your use case. Deepgram customers can now have a custom-trained model created for them using the best foundation available–Nova-2.
Deepgram can also provide data labeling services, and even create audio to train on to ensure your custom models produce the best results possible, giving you a boost in performance atop Nova-2’s already impressive, out-of-the-box capabilities.
Dive into the details of this latest release, our approach to benchmarking, and more in the full announcement.
We can't wait to see what you build with Nova-2! Be sure to share your projects with us by mentioning us here.
Happy building!
• • •
Missing some Tweet in this thread? You can try to
force a refresh
After spending time with #OpenAI’s Voice Mode in #ChatGPT, we were eager to explore the API behind it.
A few weeks ago, we launched our Voice Agent API, and we’ve been curious to see how the two compare. Here’s what we found—just some early thoughts. 👇
⚡Latency: Both solutions performed similarly when it came to response time. Whether handling simple or complex tasks, the latency felt roughly equal at ~<1sec.
📈Consumer vs. Enterprise:
OpenAI’s API: We think OpenAI’s approach is more consumer-focused, (and, hot-take) likely built by chaining together models like whisper-large-v3-turbo, gpt-4o, and their 6 TTS voices for smooth, natural interactions. (Why? High cost + use of a VAD)
Deepgram’s API: Our API uses a similar chained-approach, but is designed with enterprise precision in mind, especially for structured, real-time data capture—like end-of-thought detection and accurate input processing in business environments.