Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Bartosz Roguski
Machine Learning Engineer
June 24, 2025

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting is a latency-reduction technique that lets an LLM produce a quick "draft" answer before the retriever finishes its full search. The draft's predicted keywords and entities seed pre-fetch queries to the vector or hybrid index, so relevant passages arrive just as the model reaches its final decoding stage. If the initial speculation drifts off-topic, a verifier sub-model prunes low-confidence tokens and reruns retrieval on the refined prompt.

This overlapped pipeline turns idle waiting time into useful computation, cutting end-to-end response latency by 30-50% without raising hallucination risk. Engineers tune draft length, confidence thresholds, and verifier strictness to balance speed and accuracy.

The method shines in real-time chatbots, voice agents, and mobile apps where every 100 ms matters, yet factual grounding cannot be sacrificed. Speculative RAG extends concepts from speculative decoding and asynchronous search, bringing production-grade responsiveness to knowledge-grounded LLM services.
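The overlapped draft-and-retrieve loop above can be sketched in a few dozen lines. This is a minimal, self-contained illustration, not a production implementation: the draft model, verifier confidence, keyword extractor, and index are all toy stand-ins (`draft_answer`, `extract_keywords`, `retrieve`, and `CORPUS` are hypothetical names), and real systems would swap in an actual LLM, a trained verifier, and a vector or hybrid index. What it does show is the core idea: drafting and retrieval run concurrently, and the draft's confidence decides whether its keywords widen the pre-fetch or the pipeline falls back to question-only retrieval.

```python
import asyncio

# Toy document store standing in for a vector or hybrid index.
CORPUS = [
    "Paris is the capital of France.",
    "The Eiffel Tower is a landmark in Paris.",
    "Rome is the capital of Italy.",
]


async def draft_answer(question: str) -> tuple[str, float]:
    """Fast draft model: returns a speculative answer and a confidence score."""
    await asyncio.sleep(0.01)  # stand-in for quick draft decoding
    if "france" in question.lower():
        return "Paris", 0.9
    return "unknown", 0.2


def extract_keywords(text: str) -> list[str]:
    """Crude keyword/entity extraction used to seed pre-fetch queries."""
    stop = {"the", "is", "of", "what", "a", "in"}
    words = [w.strip("?.,").lower() for w in text.split()]
    return [w for w in words if w and w not in stop]


async def retrieve(keywords: list[str]) -> list[str]:
    """Stand-in for an index lookup; keyword overlap plays 'similarity'."""
    await asyncio.sleep(0.05)  # simulated index latency
    return [doc for doc in CORPUS
            if any(kw in doc.lower() for kw in keywords)]


async def speculative_rag(question: str, threshold: float = 0.5) -> str:
    # Kick off drafting and question-seeded retrieval concurrently,
    # so index latency is hidden behind draft decoding.
    draft_task = asyncio.create_task(draft_answer(question))
    prefetch = asyncio.create_task(retrieve(extract_keywords(question)))

    draft, confidence = await draft_task
    if confidence >= threshold:
        # Draft looks on-topic: widen the pre-fetch with its keywords.
        extra = asyncio.create_task(retrieve(extract_keywords(draft)))
        docs = (await prefetch) + (await extra)
    else:
        # Verifier rejects the draft: fall back to question-only retrieval.
        docs = await prefetch

    context = " ".join(dict.fromkeys(docs))  # dedupe, preserve order
    return f"Answer: {draft} | Grounded in: {context}"


if __name__ == "__main__":
    print(asyncio.run(speculative_rag("What is the capital of France?")))
```

The `threshold` parameter is the confidence knob mentioned above: raising it makes the verifier stricter (more fallbacks, higher latency, fewer off-topic drafts), while lowering it trusts the draft more aggressively.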