Automatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) is the technology that converts spoken audio into machine-readable text by mapping acoustic signals to linguistic units. A modern ASR pipeline captures waveforms, computes Mel spectrograms, and feeds them to a neural acoustic model—often a Transformer like Whisper, Conformer, or wav2vec 2.0—to predict token sequences. A language model then decodes these tokens into coherent words, while post-processors add capitalization, punctuation, and speaker diarization. Advanced systems deliver real-time streaming with word-level timestamps and multilingual support, and can be fine-tuned on domain audio to cut word-error rate (WER) by 30 % or more. ASR powers captions, voice assistants, call-center bots, and Retrieval-Augmented Generation pipelines that ground LLM responses in spoken data. Key metrics include WER, real-time factor (RTF), and latency; challenges—accent variance, background noise, privacy—are mitigated by noise-robust training, on-device inference, and encryption.