ASR
Automatic Speech Recognition
A model that turns audio into text. The reverse of TTS. Modern neural ASR rivals human transcription accuracy on clean audio.
ASR (Automatic Speech Recognition) is the model family that converts an audio waveform into text. It is the mirror of TTS: TTS goes text → audio, ASR goes audio → text. Older systems combined Hidden Markov Models with hand-crafted acoustic and language models; modern systems are fully neural.
Typical modern ASR architecture:
1. Acoustic encoder: waveform → mel-spectrogram → hidden representation.
2. Decoder: hidden representation → tokens (words/characters), usually transformer-based.
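A minimal sketch of step 1 (waveform → mel-spectrogram) using torchaudio; the file name is a placeholder and the parameter values are illustrative Whisper-style defaults, not a requirement:

```python
import torchaudio

# "speech.wav" is a placeholder; assumes 16 kHz mono audio.
waveform, sample_rate = torchaudio.load("speech.wav")

# Whisper-style front end: 25 ms window (n_fft=400), 10 ms hop, 80 mel bins.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,
    hop_length=160,
    n_mels=80,
)
mel = to_mel(waveform)
print(mel.shape)  # (channels, 80, time_frames): this is the encoder's input
```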
OpenAI's Whisper model (2022, open-source) changed the game: 96 languages, robust to noise, transformer encoder + decoder. Successors and competitors followed: Whisper large-v3 and Distil-Whisper on the open side, and commercial APIs such as AssemblyAI, Deepgram, and Speechmatics.
New-gen features:
- Diarization: who is speaking (Speaker 1 / Speaker 2)
- Streaming: real-time transcription
- Translation: translate while transcribing (Whisper "translate" mode)
- Punctuation + timestamps: formatted output, not raw text
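As a concrete sketch, the open-source openai-whisper package exposes both plain transcription (with segment timestamps) and the translate mode listed above; the file name and model size here are placeholder choices:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("small")

# Plain transcription with per-segment timestamps.
result = model.transcribe("interview.mp3")
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s → {seg['end']:.1f}s] {seg['text']}")

# "Translate" mode: speech in any supported language → English text.
english = model.transcribe("interview.mp3", task="translate")
print(english["text"])
```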
Think of a professional court reporter. They listen to a hearing and type it word for word. They tolerate accents, whispers, background noise. Modern ASR does the same job 100× faster at 1/100 the cost.
Building a podcast production tool:
1. Guest speaks on Zoom (1-hour recording).
2. Upload to the Whisper API: a 1-hour mp3 → transcript in ~30 seconds.
3. Diarization (a separate step; Whisper itself returns timestamps, not speaker labels): turns labeled "Guest:" / "Host:".
4. Timestamps on each sentence.
5. Cost: $0.36 (Whisper API at $0.006/min × 60 min).
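A sketch of step 2 against the hosted OpenAI API; field names follow the current openai Python SDK, and the file name is a placeholder. The speaker labels in step 3 would come from a separate diarization tool (e.g. pyannote):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("episode.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",  # includes per-segment timestamps
    )

for seg in transcript.segments:
    print(f"[{seg.start:.1f}s] {seg.text}")

# Cost check: 60 min × $0.006/min = $0.36
```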
In 2020 this cost $80-150 from a transcription service plus a 24-hour wait. Now it's automatic. The same pipeline powers:
- Meeting notes (Otter.ai, Fireflies)
- Call analytics (Gong, Chorus)
- Live captions (YouTube, Zoom)
- Voice commands (Siri, Alexa, Google Assistant)
Where it's used:
- Podcast/video transcription
- Automating meeting notes
- Call-center analytics (transcribe → summarize with an LLM; see the sketch after this list)
- Accessibility — live captions for deaf users
- Voice command interfaces (smart speaker, automotive, IVR)
- Search over audio data (index the transcripts)
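A minimal sketch of that transcribe → summarize pattern, assuming the OpenAI SDK; the model names, file name, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# "support_call.mp3" is a hypothetical call recording.
with open("support_call.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="whisper-1", file=f)

summary = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Summarize this support call: issue, resolution, follow-ups."},
        {"role": "user", "content": result.text},
    ],
)
print(summary.choices[0].message.content)
```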
Where it struggles:
- High-accuracy legal/medical documentation: human review is still needed
- Very low-quality audio (phone lines, heavy noise): quality drops
- Less-spoken languages (Whisper supports 96, but not all equally well)
- Hard real-time, low-latency use (live translation): modern models still add ~500 ms of latency
ASR has hallucinations too
Whisper is known to produce phrases like "thanks for watching" during silence (a residue of its training data). Long silences and instrumental music raise the hallucination risk; pre-filter the audio with voice activity detection (VAD), as in the sketch below.
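A sketch of VAD pre-filtering with the webrtcvad package, assuming 16 kHz, 16-bit mono PCM and 30 ms frames (one of the frame sizes webrtcvad accepts):

```python
import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle ground

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # samples × 2 bytes per sample

def speech_frames(pcm: bytes):
    """Yield only the frames VAD classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame

# Feed b"".join(speech_frames(raw_pcm)) to the ASR model instead of raw audio.
```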
Accent and dialect performance
A model that is near-perfect on standard Istanbul Turkish may be weak on the Black Sea dialect; Indian and Scottish accents similarly trouble English models. Match training, or at least evaluation, data to your user demographic.
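One way to check coverage is to measure word error rate (WER) on a small dialect-specific test set. A sketch with the jiwer package; the file names and reference texts are placeholders:

```python
import whisper
from jiwer import wer  # pip install jiwer

model = whisper.load_model("small")

# (audio file, human reference transcript) pairs from your target demographic.
test_set = [
    ("black_sea_sample_1.wav", "reference transcript one"),
    ("black_sea_sample_2.wav", "reference transcript two"),
]

references = [ref for _, ref in test_set]
hypotheses = [model.transcribe(path)["text"] for path, _ in test_set]

print(f"WER: {wer(references, hypotheses):.2%}")  # lower is better
```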
Code-switching (language mixing)
Mixed-language speech, e.g. a Turkish speaker saying "'the meeting'e katıldım" ("I joined the meeting"), can confuse ASR. A multilingual model plus post-processing correction is needed.
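One possible shape for that post-processing step, sketched as an LLM call; the prompt and model name are illustrative assumptions, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

def fix_code_switching(raw_transcript: str) -> str:
    """Ask an LLM to repair words the ASR garbled at language boundaries."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": ("This transcript mixes Turkish and English. Fix words "
                         "the ASR garbled at language boundaries; keep the "
                         "language mix itself intact.")},
            {"role": "user", "content": raw_transcript},
        ],
    )
    return response.choices[0].message.content

print(fix_code_switching("I joined the meeting'e katıldım"))
```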