AI Dictionary
Intermediate · ~2 min read · #tts #speech #audio

TTS

Text-to-Speech

A model that turns written text into natural-sounding speech. Modern neural TTS is often nearly indistinguishable from a human voice.

Diagram: TEXT ("Merhaba dünya, nasılsın?" — Turkish for "Hello world, how are you?") plus language, tone, and voice ID → TTS (neural vocoder, speech synthesis) → AUDIO WAVE (.mp3 / .wav). ElevenLabs, OpenAI TTS, Google — one sentence becomes audio in ~1s.
Definition

TTS (Text-to-Speech) is the model family that converts text into audio waveforms. Old TTS systems (rule-based, concatenative) sounded robotic; modern neural TTS (Tacotron, FastSpeech, VITS, ElevenLabs, OpenAI TTS) reaches near-indistinguishable human quality.

Typical architecture is two-stage:
  1. Acoustic model: text → mel-spectrogram (a time-frequency representation).
  2. Vocoder: mel-spectrogram → raw waveform (HiFi-GAN, WaveNet, or newer diffusion vocoders).
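The two-stage split can be sketched as a toy pipeline (pure NumPy, illustrative only — real acoustic models and vocoders are trained neural networks, and the "mel" frames and sine-wave synthesis here are stand-ins):

```python
import numpy as np

SR = 22050   # sample rate (Hz)
HOP = 256    # waveform samples generated per mel frame
N_MELS = 80  # mel bins per frame

def acoustic_model(text: str) -> np.ndarray:
    """Stage 1 stand-in: text -> mel-spectrogram.
    A real model (Tacotron, FastSpeech) predicts these frames;
    here we just emit one random frame per character."""
    rng = np.random.default_rng(len(text))
    return rng.random((len(text), N_MELS))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: mel-spectrogram -> raw waveform.
    A real vocoder (HiFi-GAN, WaveNet) learns this mapping; here
    each frame becomes HOP samples of a sine wave whose pitch
    tracks the frame's loudest mel bin."""
    t = np.arange(HOP) / SR
    chunks = []
    for frame in mel:
        f0 = 80 + 10 * frame.argmax()  # fake pitch from dominant bin
        chunks.append(np.sin(2 * np.pi * f0 * t))
    return np.concatenate(chunks)

mel = acoustic_model("Hello world")
wave = vocoder(mel)
print(wave.shape)  # one HOP-sized chunk of samples per input character
```

The point of the split is modularity: the acoustic model handles linguistics (which sounds, how long, what pitch contour), while the vocoder handles signal quality, so either stage can be swapped independently.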

Modern additions:
  • Voice cloning: mimic a person's voice from ~30 seconds of audio
  • Emotion/tone control: prompt-driven tone, pace, mood
  • Multilingual: 30+ languages from one model
  • Streaming TTS: convert text to speech as it arrives

Analogy

Like reading a book aloud. But the reader is highly skilled — tone, emphasis, emotion, pronunciation all correct. They know which word to stress, how long to pause at a comma, how to raise pitch on a question. Old TTS read like a robot; new TTS sounds like a pro voice actor.

Real-world example

A podcast production tool: you give it a written script, AI voices it. Steps:
  1. POST to ElevenLabs: text + voice_id ("Sarah", "Adam") + model_id ("eleven_turbo_v2_5").
  2. In ~3-5 seconds: a 1-minute audio file (.mp3, .wav).
  3. Multiple characters → different voice_ids; merge the clips in a DAW.
  4. Cost: ~$0.30 per minute (ElevenLabs).
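Step 1 can be sketched with the standard library alone. The endpoint path, `xi-api-key` header, and body fields follow the ElevenLabs docs as best recalled here; the API key and voice ID are placeholders, so verify names against the current API reference before use:

```python
import json
import urllib.request

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder
VOICE_ID = "voice-id-for-sarah"      # hypothetical voice ID

def build_tts_request(text: str) -> urllib.request.Request:
    """Build the POST from step 1: text + voice_id + model_id."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    body = json.dumps({"text": text, "model_id": "eleven_turbo_v2_5"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    )

req = build_tts_request("Welcome to episode twelve of the show.")
# Sending the request returns audio bytes; uncomment with a real key:
# with urllib.request.urlopen(req) as resp:
#     open("episode.mp3", "wb").write(resp.read())
print(req.full_url)
```

For multi-character scripts (step 3), the same function would be called once per line of dialogue with a different `VOICE_ID`, and the resulting clips stitched together afterwards.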

Before 2023 this quality required a studio, a voice artist, and post-production (hours of work and thousands of dollars). Now: one API call. Audiobooks, e-learning narration, assistant voices — all standard.

When to use
  • Accessibility — voice written content for visually impaired users
  • Assistant/chat product — convert LLM text answers to speech
  • Podcast/audiobook production — instead of voice actors
  • Call-center IVR — beyond static menus, dynamic speech
  • Language learning — pronounce words/sentences
When not to use
  • High artistic quality (film dub) — human voice actors still win
  • Real-time low-latency (live translation) — TTS adds 100-500ms
  • Static repetitive content — basic old-school TTS is cheaper
  • When deception risk is real — voice cloning has ethical/legal issues
Common pitfalls

Voice cloning ethics & law

Cloning someone's voice without consent is illegal in many jurisdictions. ElevenLabs and other providers enforce consent policies. Production use needs explicit consent records.

Intonation and nuance still hard

Models capture the base tone but miss nuances like sarcasm, surprise, or feigned cheerfulness. SSML (Speech Synthesis Markup Language) tags can help guide delivery manually.
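Where an engine accepts SSML, tags can be injected straight into the input text. A minimal sketch (`<speak>`, `<emphasis>`, and `<break>` are standard SSML elements, but which ones a given TTS engine honors varies by vendor, so check its docs):

```python
def emphasize(text: str, word: str, pause_ms: int = 300) -> str:
    """Wrap one word in SSML emphasis and add a trailing pause."""
    marked = text.replace(
        word, f'<emphasis level="strong">{word}</emphasis>', 1
    )
    return f'<speak>{marked}<break time="{pause_ms}ms"/></speak>'

# Stressing different words changes the implied meaning of the sentence,
# which is exactly the nuance plain TTS tends to flatten.
ssml = emphasize("I never said she stole it", "never")
print(ssml)
```

The resulting string is sent to the engine in place of plain text; engines without SSML support will typically read the tags aloud, so gate this on the target vendor.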

Multilingual claims can mislead

A model may claim "50-language support," but quality in Turkish may lag well behind English. Test your specific language before shipping — agglutinative languages (Turkish, Finnish) often suffer.