AI Dictionary
Intermediate · ~2 min read · #tts #speech #audio

TTS

Text-to-Speech

A model that turns written text into natural-sounding speech. Modern neural TTS is often nearly indistinguishable from a human voice.

Diagram: TEXT ("Merhaba dünya, nasılsın?" — Turkish for "Hello world, how are you?") plus language, tone, and voice ID → TTS (neural vocoder, speech synthesis) → AUDIO WAVE (.mp3 / .wav). ElevenLabs, OpenAI TTS, Google — one sentence becomes audio in ~1s.
Definition

TTS (Text-to-Speech) is the model family that converts text into audio waveforms. Old TTS systems (rule-based, concatenative) sounded robotic; modern neural TTS (Tacotron, FastSpeech, VITS, ElevenLabs, OpenAI TTS) reaches near-indistinguishable human quality.

Typical architecture is two-stage:
  1. Acoustic model: text → mel-spectrogram (a time-frequency representation).
  2. Vocoder: mel-spectrogram → raw waveform (HiFi-GAN, WaveNet, or newer diffusion vocoders).
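The two-stage split can be sketched as a toy pipeline (pure NumPy, illustrative only — real acoustic models and vocoders are trained neural networks, and the "mel" frames and sine-wave synthesis here are stand-ins):

```python
import numpy as np

SR = 22050   # sample rate (Hz)
HOP = 256    # waveform samples generated per mel frame
N_MELS = 80  # mel bins per frame

def acoustic_model(text: str) -> np.ndarray:
    """Stage 1 stand-in: text -> mel-spectrogram.
    A real model (Tacotron, FastSpeech) predicts these frames;
    here we just emit one random frame per character."""
    rng = np.random.default_rng(len(text))
    return rng.random((len(text), N_MELS))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: mel-spectrogram -> raw waveform.
    A real vocoder (HiFi-GAN, WaveNet) learns this mapping; here
    each frame becomes HOP samples of a sine wave whose pitch
    tracks the frame's loudest mel bin."""
    t = np.arange(HOP) / SR
    chunks = []
    for frame in mel:
        f0 = 80 + 10 * frame.argmax()  # fake pitch from dominant bin
        chunks.append(np.sin(2 * np.pi * f0 * t))
    return np.concatenate(chunks)

mel = acoustic_model("Hello world")
wave = vocoder(mel)
print(wave.shape)  # one HOP-sized chunk of samples per input character
```

The point of the split is modularity: the acoustic model handles linguistics (which sounds, how long, what pitch contour), while the vocoder handles signal quality, so either stage can be swapped independently.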

Modern additions:
  • Voice cloning: mimic a person's voice from ~30 seconds of audio
  • Emotion/tone control: prompt-driven tone, pace, mood
  • Multilingual: 30+ languages from one model
  • Streaming TTS: convert text to speech as it arrives

Analogy

Like reading a book aloud. But the reader is highly skilled — tone, emphasis, emotion, pronunciation all correct. They know which word to stress, how long to pause at a comma, how to raise pitch on a question. Old TTS read like a robot; new TTS sounds like a pro voice actor.

Real-world example

A podcast production tool: you give it a written script, AI voices it. Steps:
  1. POST to ElevenLabs: text + voice_id ("Sarah", "Adam") + model_id ("eleven_turbo_v2_5").
  2. In ~3-5 seconds: a 1-minute audio file (.mp3, .wav).
  3. Multiple characters → different voice_ids; merge the clips in a DAW.
  4. Cost: ~$0.30 per minute (ElevenLabs).
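Step 1 can be sketched with the standard library alone. The endpoint path, `xi-api-key` header, and body fields follow the ElevenLabs docs as best recalled here; the API key and voice ID are placeholders, so verify names against the current API reference before use:

```python
import json
import urllib.request

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder
VOICE_ID = "voice-id-for-sarah"      # hypothetical voice ID

def build_tts_request(text: str) -> urllib.request.Request:
    """Build the POST from step 1: text + voice_id + model_id."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    body = json.dumps({"text": text, "model_id": "eleven_turbo_v2_5"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    )

req = build_tts_request("Welcome to episode twelve of the show.")
# Sending the request returns audio bytes; uncomment with a real key:
# with urllib.request.urlopen(req) as resp:
#     open("episode.mp3", "wb").write(resp.read())
print(req.full_url)
```

For multi-character scripts (step 3), the same function would be called once per line of dialogue with a different `VOICE_ID`, and the resulting clips stitched together afterwards.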

Before 2023 this quality required a studio, a voice artist, and post-production (hours of work and thousands of dollars). Now: one API call. Audiobooks, e-learning narration, assistant voices — all standard.

When to use
  • Accessibility — voice written content for visually impaired users
  • Assistant/chat product — convert LLM text answers to speech
  • Podcast/audiobook production — instead of voice actors
  • Call-center IVR — beyond static menus, dynamic speech
  • Language learning — pronounce words/sentences
When not to use
  • High artistic quality (film dub) — human voice actors still win
  • Real-time low-latency (live translation) — TTS adds 100-500ms
  • Static repetitive content — basic old-school TTS is cheaper
  • When deception risk is real — voice cloning has ethical/legal issues
Common pitfalls

Voice cloning ethics & law

Cloning someone's voice without consent is illegal in many jurisdictions. ElevenLabs and other providers enforce consent policies. Production use needs explicit consent records.

Intonation and nuance still hard

Models capture the base tone but miss nuances like sarcasm, surprise, or feigned cheerfulness. SSML (Speech Synthesis Markup Language) tags can help guide delivery manually.
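Where an engine accepts SSML, tags can be injected straight into the input text. A minimal sketch (`<speak>`, `<emphasis>`, and `<break>` are standard SSML elements, but which ones a given TTS engine honors varies by vendor, so check its docs):

```python
def emphasize(text: str, word: str, pause_ms: int = 300) -> str:
    """Wrap one word in SSML emphasis and add a trailing pause."""
    marked = text.replace(
        word, f'<emphasis level="strong">{word}</emphasis>', 1
    )
    return f'<speak>{marked}<break time="{pause_ms}ms"/></speak>'

# Stressing different words changes the implied meaning of the sentence,
# which is exactly the nuance plain TTS tends to flatten.
ssml = emphasize("I never said she stole it", "never")
print(ssml)
```

The resulting string is sent to the engine in place of plain text; engines without SSML support will typically read the tags aloud, so gate this on the target vendor.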

Multilingual claims can mislead

A model may claim "50-language support," but quality in Turkish may lag well behind English. Test your specific language before shipping — agglutinative languages (Turkish, Finnish) often suffer.