Text-to-Speech

Wojciech Achtelik

AI Engineer Lead

July 2, 2025

Glossary Category

Voice AI

Text-to-Speech (TTS) is the speech-synthesis technology that converts written text into natural-sounding audio, today typically using neural networks. A modern pipeline tokenizes the input, converts characters to phonemes, predicts mel-spectrograms with an acoustic model such as Tacotron 2, FastSpeech 2, or a GPT-style autoregressive model, and then transforms the spectrogram into a waveform with a neural vocoder such as WaveGlow or HiFi-GAN. Advanced systems support zero-shot voice cloning and emotion control through style tokens, letting apps render brand voices or multilingual output from a single model. Quality and speed are measured by mean opinion score (MOS), the average of listener ratings on a 1-to-5 scale, and real-time factor (RTF), the synthesis time divided by the duration of the audio produced. An RTF well below 1 enables streaming voice assistants, while a high MOS helps audiobooks remain engaging. Deployment options range from cloud APIs (Amazon Polly, Azure Neural TTS, ElevenLabs) to on-device models that can simplify GDPR compliance by keeping text and audio local. Text-to-Speech powers screen readers, accessibility features, call-center bots, and Retrieval-Augmented Generation (RAG) chatbots that can “talk back,” turning static content into interactive voice experiences.
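The two metrics named above are simple to compute once the raw numbers are available. The following is a minimal sketch, not a reference implementation: the function names `real_time_factor` and `mean_opinion_score` and the sample values are hypothetical, and it assumes listener ratings on the common 1-to-5 scale.

```python
from statistics import mean


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean the system generates audio faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings: list[int]) -> float:
    """Average listener rating, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return float(mean(ratings))


# Illustrative numbers only: 0.5 s of compute for a 2.0 s clip,
# and five made-up listener ratings.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
mos = mean_opinion_score([4, 5, 4, 3, 4])
print(f"RTF={rtf:.2f}, MOS={mos:.1f}")
```

In practice a MOS is usually reported with a confidence interval over many listeners and utterances, and RTF depends heavily on hardware and batch size, so both numbers are only comparable under matching test conditions.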
