Retriever-Reader Architecture
Retriever-Reader Architecture is a two-stage question-answering framework in which a retriever first fetches a small set of relevant documents from a large corpus, and a reader then performs deep language understanding to extract or generate the answer. The retriever uses fast methods—BM25, dense vector search, or a hybrid of the two—to narrow billions of passages to the top-k candidates in milliseconds. The reader—often a Transformer fine-tuned on SQuAD or an instruction-tuned LLM—scans those candidates with full attention and returns either a text span (extractive QA) or a free-form response (generative RAG).

This division of labor balances speed and accuracy: a lightweight retriever keeps latency low, while a heavyweight reader focuses compute on the most promising text. Key tunables are the k value, the re-ranking depth, and the reader's context window. Metrics such as recall@k for the retriever and exact match/F1 for the reader gauge system health.

Widely used in search engines, chatbots, and legal discovery tools, the Retriever-Reader Architecture grounds large language models in retrieved evidence, cutting both hallucinations and token costs.
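The two-stage pipeline can be sketched in plain Python. The corpus, queries, and the keyword-overlap "reader" below are illustrative assumptions, not part of any real system: the retriever implements the standard BM25 scoring formula over a toy three-passage corpus, and the reader is a trivial stand-in where a production system would run a fine-tuned Transformer.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; a real deployment indexes millions of passages.
CORPUS = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "BM25 is a bag-of-words ranking function used by search engines.",
    "The Transformer architecture was introduced in Attention Is All You Need.",
]

def tokenize(text):
    # Naive whitespace tokenizer with light punctuation stripping.
    return [t.strip(".,?!").lower() for t in text.split()]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, k=2):
    """Retriever stage: narrow the corpus to the top-k candidates."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

def read(query, passages):
    """Toy extractive 'reader': return the candidate passage with the most
    query-term overlap. A real reader would predict an answer span."""
    q_terms = set(tokenize(query))
    return max(passages, key=lambda p: len(q_terms & set(tokenize(p))))

question = "Where is the Eiffel Tower located?"
answer = read(question, retrieve(question, CORPUS, k=2))
```

Note how only the k retrieved passages ever reach the reader: that is the bargain the architecture strikes, cheap scoring over everything, expensive understanding over almost nothing.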