🎷Excited to present our paper, “Careless Whisper: Speech-to-Text Hallucination Harms” at @FAccTConference! 🎷We assess Whisper (OpenAI’s speech recognition tool) for transcribed hallucinations that don’t appear in the audio input. Paper link: arxiv.org/abs/2402.08021, thread 👇
We noticed in 2023 that, even after an audio file had ended, Whisper had a habit of hallucinating additional sentences that were never spoken. And re-running Whisper on the same file yielded different hallucinations each time - see the example below (hallucinations in red) (1/14)
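If you want to poke at this yourself, here is a minimal sketch using the open-source `openai-whisper` package (not the exact setup from our paper): transcribe the same clip several times with a non-zero decoding temperature and compare the outputs. The file name and model size are placeholders.

```python
# Minimal sketch: transcribe the same clip repeatedly and compare outputs.
# Assumes the open-source `openai-whisper` package (pip install openai-whisper);
# "speech_sample.wav" and the model size are placeholders, not from the paper.
import whisper

model = whisper.load_model("base")

transcripts = set()
for run in range(5):
    # temperature > 0 makes decoding stochastic, so repeated runs can differ
    result = model.transcribe("speech_sample.wav", temperature=0.7)
    transcripts.add(result["text"].strip())

for i, text in enumerate(sorted(transcripts), 1):
    print(f"Variant {i}: {text}")
```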
This allowed us to quantify the hallucinations in the AphasiaBank speech dataset: about 1% of >13k audio files tested resulted in hallucinations. More occurred among speakers with aphasia (a language disorder that can occur post-stroke) relative to the control group (2/14)
But, we wanted to understand more about the hallucination text itself. We taxonomized hallucinations by types of harm, and found nearly 40% of hallucinations showcased harms of perpetuating violence, inaccurate associations, or false authority. What do these mean? (3/14)
Harms perpetuating violence involve misrepresentation of a speaker’s words that could become part of a formal record (e.g. in a courtroom trial); we present 3 subcategories of examples: physical violence, sexual innuendo, and demographic stereotyping (4/14)
Harms of inaccurate associations involve misrepresentation of the real world that could lead to inaccuracies (e.g. in patient medical notes). 3 subcategories include made-up names, social relationships, and health statuses (5/14)
Finally, harms of false authority involve misrepresentation of the speaker source, which could facilitate phishing / prompt injection attacks. These include Youtuber-speak (“like and subscribe”), thanking specific entities, and linking to websites (real or not) (6/14)
This all raises the question: why are these hallucinations happening? The Youtuber-speak is consistent with the reporting on Whisper transcribing 1 million hours of Youtube audio (nytimes.com/2024/04/06/tec…), but this doesn’t explain the existence of hallucinations (7/14)
We present 2 hypotheses. 1st, we believe this has to do with OpenAI-specific modeling choices. We don’t see hallucinations like this in competing speech recognition tools on the market (8/14)
2nd, we find that speech with longer non-verbal durations (e.g. disfluencies from taking longer to speak, stuttering, pausing often – all symptoms of aphasia) tends to yield more Whisper hallucinations. We see this difference between aphasia and control speakers in our sample (9/14)
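One simple way to approximate non-verbal duration is energy-based silence detection, e.g. with `pydub`. The sketch below is illustrative (file name and thresholds are placeholders), not the measurement procedure from the paper.

```python
# Rough sketch: estimate total non-verbal (silent) duration in a clip.
# Assumes `pip install pydub` plus ffmpeg; thresholds are illustrative only.
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_file("speech_sample.wav")

# Treat stretches of at least 500 ms quieter than -40 dBFS as non-verbal
silent_spans = detect_silence(audio, min_silence_len=500, silence_thresh=-40)
nonverbal_ms = sum(end - start for start, end in silent_spans)

print(f"Non-verbal: {nonverbal_ms / 1000:.1f}s of {len(audio) / 1000:.1f}s total")
```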
This is consistent with many user complaints that silence in audio leads to Whisper hallucinations, and is something that Whisper seems to have gotten better about over time: github.com/openai/whisper… (10/14)
We’re concerned about the allocative & representational harms arising for speakers with more pauses in speech (not just people with speech impairments, but also the elderly and non-native language speakers) for whom Whisper could disproportionately generate hallucinations (11/14)
These hallucinations can exacerbate existing societal biases and algorithmic harms across medical, hiring, legal, and education decisions. And worse, they’re difficult to detect in downstream transcriptions unless you know to look for them! So, what to do? (12/14)
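One crude screening heuristic is to flag transcript segments containing boilerplate that commonly shows up in Whisper hallucinations (Youtuber-speak, thank-yous, links). The sketch below is illustrative and far from exhaustive; it is not a detection method from the paper.

```python
# Crude screening heuristic: flag transcript segments containing phrases that
# often appear in Whisper hallucinations. The phrase list is illustrative only.
import re

SUSPECT_PATTERNS = [
    r"like and subscribe",
    r"thanks? for watching",
    r"\bwww\.\S+|\bhttps?://\S+",   # links that were never spoken
]

def flag_suspect_segments(segments):
    """Return (index, text) pairs whose text matches a suspect pattern."""
    flagged = []
    for i, text in enumerate(segments):
        if any(re.search(p, text, flags=re.IGNORECASE) for p in SUSPECT_PATTERNS):
            flagged.append((i, text))
    return flagged

transcript = ["The patient reported mild dizziness.",
              "Thanks for watching, don't forget to like and subscribe!"]
print(flag_suspect_segments(transcript))
```

A check like this only catches known boilerplate; it won’t find made-up names or health statuses, which is why human review still matters in high-stakes settings.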
OpenAI should (a) make Whisper users aware of potential hallucinations & advise against use in high-stakes decisions, (b) ensure inclusion of diverse speakers in the design process, & (c) work to update Whisper modeling / data collection to mitigate hallucinations (13/14)
Many thanks to the folks we’ve chatted with and/or who directly inspired our work; we hope to continue the conversation! (14/14) @Aphasia_Inst @TAPUnlimited @jurafsky @Diyi_Yang @sayashk @sulin_blodgett @hannawallach @o_saja @jennwvaughan @eytanadar @IsabelleZaugg @Grady_Booch