AI Dictionary
Intermediate · ~2 min read · #asr #speech-to-text #whisper

ASR

Automatic Speech Recognition

A model that turns audio into text. The reverse of TTS. Modern neural ASR rivals human transcription accuracy.

[Diagram: speech → text (ASR). An audio wave (.mp3 / .wav) feeds an ASR model (Whisper / Deepgram), which outputs the transcript "Merhaba dünya, nasılsın?" ("Hello world, how are you?") with confidence 0.94. Caption: Whisper, Deepgram, AssemblyAI: modern transcription rivals human accuracy.]
Definition

ASR (Automatic Speech Recognition) is the model family that converts an audio waveform to text. The mirror of TTS: TTS goes text → audio, ASR goes audio → text. Old systems used Hidden Markov Models and hand-crafted acoustic models; modern systems are fully neural.

Typical modern ASR architecture:
1. Acoustic encoder: waveform → mel-spectrogram → hidden representation.
2. Decoder: hidden representation → tokens (words/characters), usually transformer-based.
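A minimal sketch of step 1, the spectrogram front end, assuming librosa and an illustrative file name; the window, hop, and mel settings below match Whisper's published front end (25 ms windows, 10 ms hop, 80 mel bins at 16 kHz):

```python
import librosa
import numpy as np

# Load as mono 16 kHz, the rate Whisper-style models expect.
waveform, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Waveform -> mel-spectrogram -> log scale.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Shape (80, num_frames): this is what the acoustic encoder consumes;
# a transformer decoder then maps its hidden states to text tokens.
print(log_mel.shape)
```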

OpenAI's Whisper model (2022, open-source) changed the game: 96 languages, robust to noise, transformer encoder + decoder. Then came Whisper-v3, Distil-Whisper, AssemblyAI, Deepgram, Speechmatics.
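A minimal usage sketch with the open-source whisper package (model size and file name are illustrative):

```python
import whisper

# Pretrained checkpoints range from "tiny" to "large".
model = whisper.load_model("small")

# Returns the transcript plus detected language and per-segment timestamps.
result = model.transcribe("interview.mp3")
print(result["language"])
print(result["text"])
```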

New-gen features:
  • Diarization: who is speaking (Speaker 1 / Speaker 2)
  • Streaming: real-time transcription
  • Translation: translate while transcribing (Whisper "translate" mode)
  • Punctuation + timestamps: formatted output, not raw text
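Two of these are visible in the same whisper package: task="translate" translates into English while transcribing, and each returned segment carries start/end timestamps (the file name is again illustrative):

```python
import whisper

model = whisper.load_model("small")

# task="translate" transcribes and renders the result in English.
result = model.transcribe("interview.mp3", task="translate")

# Per-segment timestamps arrive alongside the text.
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text']}")
```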

Analogy

Think of a professional court reporter. They listen to a hearing and type it word for word. They tolerate accents, whispers, background noise. Modern ASR does the same job 100× faster at 1/100 the cost.

Real-world example

Building a podcast production tool:
1. Guest speaks on Zoom (1-hour recording).
2. Upload to the Whisper API: a 1-hour mp3 → transcript in ~30 seconds.
3. Diarization: separated into "Guest:" / "Host:" labels.
4. Timestamps on each sentence.
5. Cost: $0.36 (Whisper at $0.006/min).
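A sketch of step 2 against OpenAI's hosted endpoint, assuming a recent openai Python SDK and an illustrative file name. Note the hosted API returns timestamps but no speaker labels, so the diarization in step 3 would come from a separate tool (e.g., pyannote):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("episode.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",  # include segment timestamps
    )

for seg in result.segments:
    print(f"[{seg.start:7.1f}s] {seg.text}")

# Cost check: 60 min x $0.006/min = $0.36 for the one-hour episode.
```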

In 2020 this cost $80-150 from a transcription service, plus a 24-hour wait. Now it's automatic. The same pipeline powers:
  • Meeting notes (Otter.ai, Fireflies)
  • Call analytics (Gong, Chorus)
  • Live captions (YouTube, Zoom)
  • Voice commands (Siri, Alexa, Google Assistant)

When to use
  • Podcast/video transcription
  • Automating meeting notes
  • Call-center analytics (transcribe → summarize with LLM)
  • Accessibility — live captions for deaf users
  • Voice command interfaces (smart speaker, automotive, IVR)
  • Search over audio data (index the transcripts)
When not to use
  • High-accuracy legal/medical documentation — human review is still needed
  • Very low-quality audio (phone lines, heavy noise) — quality drops
  • Less-spoken languages (Whisper supports 96, but not all equally well)
  • Hard real-time, low-latency use (live translation) — modern models still add ~500 ms
Common pitfalls

ASR has hallucinations too

Whisper tends to produce phrases like 'thanks for watching' during silences (a residue of its training data). Long silences and instrumental music raise the hallucination risk. Pre-filter with voice activity detection (VAD), as sketched below.
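A minimal VAD pre-filter sketch with the webrtcvad package (file name, frame length, and aggressiveness are illustrative choices); only frames flagged as speech are kept for the ASR model:

```python
import wave

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (permissive) to 3 (strict)

with wave.open("recording.wav", "rb") as wf:
    sample_rate = wf.getframerate()  # must be 8, 16, 32, or 48 kHz
    assert wf.getnchannels() == 1 and wf.getsampwidth() == 2  # 16-bit mono PCM
    frame_ms = 30  # webrtcvad accepts 10, 20, or 30 ms frames
    samples_per_frame = sample_rate * frame_ms // 1000

    speech = []
    while True:
        frame = wf.readframes(samples_per_frame)
        if len(frame) < samples_per_frame * 2:  # 2 bytes per 16-bit sample
            break
        if vad.is_speech(frame, sample_rate):
            speech.append(frame)

# Feed only the concatenated speech frames to the ASR model, skipping
# the silences that trigger hallucinations.
speech_audio = b"".join(speech)
```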

Accent and dialect performance

A model that is near-perfect on standard Istanbul Turkish may be weak on Black Sea dialect; Indian and Scottish accents trouble English models. Match the model's training coverage to your user demographic.

Code-switching (language mixing)

Mixed-language speech, such as the Turkish-English "meeting'e katıldım" ("I joined the meeting"), can confuse ASR. A multilingual model plus a post-processing correction pass is needed.
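A hedged sketch of such a correction pass using an LLM; the model name, prompt, and sample sentence are illustrative assumptions, not a prescribed recipe:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative code-switched ASR output (Turkish with English terms).
raw_transcript = "bugün sabahki meeting'e katıldım ve roadmap'i review ettik"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "You fix ASR errors in mixed Turkish-English transcripts. "
                "Correct misrecognized words and punctuation, but do not "
                "translate: keep each word in its original language."
            ),
        },
        {"role": "user", "content": raw_transcript},
    ],
)
print(response.choices[0].message.content)
```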