Speech-to-Text

Bartosz Roguski

Machine Learning Engineer

July 2, 2025

Glossary Category

Voice AI

Speech-to-Text is the process of converting spoken audio into written text using automatic speech recognition (ASR) models. A typical pipeline captures a waveform, converts it into acoustic features such as a log-Mel spectrogram, and feeds those features into a neural network, often a Transformer-based model such as Whisper or Conformer, that outputs token sequences; some models, such as wav2vec 2.0, learn their features directly from the raw waveform instead. Post-processing restores punctuation and casing and can add speaker labels through diarization, while language-model rescoring improves accuracy on domain jargon. Modern APIs offer real-time streaming, word-level timestamps, and automatic translation, enabling captions, voice UIs, and meeting notes. Accuracy depends on audio quality, on how well the training data covers speakers' accents, and on latency constraints, since a tighter latency budget leaves the model less context to work with. Evaluation relies on word error rate (WER), the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript, and on real-time factor (RTF), the processing time divided by the audio duration. Fine-tuning on enterprise call logs or medical vocabulary can cut WER by 30% or more, and on-device models can support GDPR compliance by keeping data local. By turning voice into searchable, analyzable text, Speech-to-Text powers chatbots, analytics dashboards, and Retrieval-Augmented Generation (RAG) pipelines.
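The two evaluation metrics named above, WER and RTF, are simple enough to compute by hand. The following is a minimal sketch in plain Python, not a reference implementation: the function names are hypothetical, text normalization is reduced to lowercasing and whitespace splitting (an assumption; production scoring usually also strips punctuation and normalizes numbers), and `relative_wer_reduction` assumes that a figure such as "30%" is meant as a relative rather than an absolute reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed as the word-level Levenshtein distance divided by the
    number of words in the reference transcript.
    """
    # Assumption: lowercasing + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_wer_reduction(baseline_wer: float, tuned_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed reading)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - tuned_wer) / baseline_wer
```

For instance, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" counts one deleted word out of six, a WER of about 0.17, and transcribing 10 seconds of audio in 2 seconds gives an RTF of 0.2. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.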
