Old-School Keyword Search to the Rescue When Your RAG Fails

With all the AI hype, semantic search powered by language models has become the default choice for retrieval-augmented generation (RAG) systems. But is it always the best approach? While semantic search excels at understanding meaning and varied phrasing, it often stumbles on precise identifiers, confusing similar product names or mixing up date ranges. In this post, we explore how hybrid search combines semantic embeddings with keyword matching to deliver both the conceptual understanding and the literal precision that real-world queries demand.
Vector search and retrieval-augmented generation have emerged as critical AI workflow tools to make business data more structured and address (and amend) the disconnect between enterprise models and execution.
- Bianca Lewis, executive director of the OpenSearch Software Foundation, October 3, 2025, on “How RAG continues to ‘tailor’ well-suited AI”
 
Why your RAG might confuse similar products or dates
Clients often reach out in frustration, saying that their RAG systems produce hallucinated or inaccurate results despite clean data and a properly configured vector database. In these cases, the unstructured data is often not to blame; the culprit is frequently the retrieval strategy itself, which must match the type of content you are actually searching for.
Pure semantic search can struggle with precise identifiers and exact matches. For instance, a query about “IBM 5150” might return results about the similar “IBM 5100,” and searching for “Q3 2024 revenue” could mistakenly pull up Q4 2023 or Q2 2024 results instead. This happens because semantic embeddings capture conceptual similarity rather than exact terminology.
To overcome these pitfalls, modern RAG systems increasingly adopt hybrid search approaches that combine semantic understanding with keyword matching (like BM25) to catch both the conceptual meaning and the literal precision that users demand.
Understanding dense vs. sparse search: the key to hybrid retrieval strategies
Semantic search is “dense” because it represents documents as rich numerical vectors that capture both meaning and context, while sparse search treats documents as simple lists of the keywords they contain. Sparse search is like a chef who marks only the specific ingredients each recipe contains (tomatoes: yes, basil: yes, cinnamon: no), whereas semantic search describes the dish’s overall flavor profile, which is why it can find “savory Italian dishes” even when no recipe lists those exact words. BM25 (Best Match 25) is one of the most commonly used sparse search algorithms.
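To make the contrast concrete, here is a toy sketch of the two representations; the example sentence and the embedding values are invented purely for illustration.

```python
# Toy contrast between a sparse keyword "checklist" and a dense embedding.
# The sentence and the embedding values are invented for illustration.
from collections import Counter

doc = "Tomato and basil sauce over fresh pasta"

# Sparse: which exact words appear, and how often each occurs.
sparse_repr = Counter(doc.lower().split())
print(sparse_repr)  # Counter({'tomato': 1, 'and': 1, 'basil': 1, ...})

# Dense: a fixed-length vector of floats from an embedding model, capturing
# overall meaning rather than specific words (real models output hundreds
# of dimensions).
dense_repr = [0.12, -0.54, 0.33, 0.08]
```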
BM25 scores matches by counting how often your search terms appear, recognizing that finding “Tesla” three times in a 200-word article about electric cars is far more relevant than finding it three times in a 10,000-word history of inventors, while giving almost no weight to common filler words like “the” and “and” that appear in nearly every document.
Note: While this overview focuses on BM25, other popular sparse retrieval algorithms exist, such as TF-IDF, BM11, and Dirichlet-smoothed language models; for brevity, we use BM25 as the example.
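To make the scoring mechanics concrete, here is a minimal sketch using the open-source rank_bm25 package (one of several BM25 implementations); the two-document toy corpus is invented for illustration.

```python
# Minimal BM25 scoring sketch with the rank_bm25 package (pip install rank-bm25).
# The toy corpus is invented for illustration.
from rank_bm25 import BM25Okapi

corpus = [
    "Tesla builds electric cars and battery storage systems",
    "A long history of inventors, from Edison to Tesla and many others",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "tesla electric cars".split()

# One relevance score per document; the document that actually repeats the
# query terms scores higher than the one that only mentions Tesla once.
print(bm25.get_scores(query))
```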
When a hybrid search approach makes the difference: real query examples
To understand where hybrid search truly shines, let us examine three real queries from a vintage computer database. These examples reveal how different search approaches handle the same information and why combining both methods often yields the best results.
Example 1: Semantic search finds the answer when keywords fail
Query: Which computer did UK schools use? Expected answer: BBC Microcomputer System
Dense (semantic) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | BBC Microcomputer System | 0.5560 | 
| 2 | Bell & Howell Computer | 0.4665 | 
| 3 | VideoBrain Family Computer | 0.4554 | 
| 4 | Tomy Tutor | 0.4378 | 
| 5 | Commuter 1083 | 0.4351 | 
Sparse (BM25) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | Sinclair ZX Spectrum | 7.0326 | 
| 2 | BBC Microcomputer System | 6.5383 | 
| 3 | Apple I | 5.8093 | 
| 4 | NEC APC | 5.5303 | 
| 5 | TRS-80 Color Computer | 1.8753 | 
This example demonstrates where semantic search excels. The query contains no direct reference to “BBC Microcomputer System,” yet semantic search places it first because the system’s description includes relevant contextual information about educational use.
Notice something interesting about the BM25 results: Sinclair ZX Spectrum ranks first despite not mentioning schools at all. Why? The keyword matching algorithm latched onto “UK,” a term strongly associated with the popular British computer. This shows keyword search’s limitation: it cannot distinguish between a computer that was popular in the UK versus one specifically used in UK schools.
Hybrid search (50% semantic + 50% BM25)
| Rank | Computer | Score | 
|---|---|---|
| 1 | **BBC Microcomputer System** | 0.9548 | 
| 2 | Sinclair ZX Spectrum | 0.5717 | 
| 3 | Apple I | 0.3883 | 
| 4 | NEC APC | 0.3628 | 
| 5 | Bell & Howell Computer | 0.2322 | 
The hybrid approach surfaces the correct answer decisively with a significant score gap between the first and second results.
Example 2: When queries explicitly mention alternatives
Query: Which computer was used by UK Schools, was it Spectrum? Expected answer: BBC Microcomputer System
Dense (semantic) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | Sinclair ZX Spectrum | 0.61 | 
| 2 | BBC Microcomputer System | 0.51 | 
| 3 | VideoBrain Family Computer | 0.43 | 
| 4 | Commuter 1083 | 0.43 | 
| 5 | Spectravideo CompuMate | 0.43 | 
Sparse (BM25) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | BBC Microcomputer System | 16.89 | 
| 2 | VideoBrain Family Computer | 9.41 | 
| 3 | Sinclair ZX Spectrum | 8.66 | 
| 4 | NEC APC | 8.31 | 
| 5 | Commodore Amiga 600 | 7.87 | 
This result reveals an intriguing pattern. Semantic search actually gets distracted by the direct mention of “Spectrum” in the query, ranking it first, perhaps because the embedding model captures the semantic context of the ZX Spectrum’s popularity in the UK market. Meanwhile, BM25’s keyword matching correctly weighs the combination of “UK,” “Schools,” and the comparative structure of the question.
Hybrid search (50% semantic + 50% BM25)
| Rank | Computer | Score | 
|---|---|---|
| 1 | **BBC Microcomputer System** | 0.80 | 
| 2 | Sinclair ZX Spectrum | 0.69 | 
| 3 | VideoBrain Family Computer | 0.36 | 
| 4 | Commuter 1083 | 0.23 | 
| 5 | NEC APC | 0.18 | 
The hybrid approach finds the sweet spot, correctly identifying BBC Microcomputer System as the primary answer while still acknowledging the mentioned alternative.
Example 3: Precision matters for technical specifications
Query: Which Apple II clone shipped with 128K RAM? Expected answer: Franklin ACE 2100
Dense (semantic) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | Apple IIc | 0.61 | 
| 2 | Apple IIc Plus | 0.60 | 
| 3 | Apple II | 0.59 | 
| 4 | Apple IIGS | 0.59 | 
| 5 | Franklin Ace 100 | 0.56 | 
Sparse (BM25) – top 5 results
| Rank | Computer | Score | 
|---|---|---|
| 1 | Franklin ACE 2100 | 13.78 | 
| 2 | Franklin Ace 100 | 13.19 | 
| 3 | Franklin Ace 1000 | 12.24 | 
| 4 | Apple I | 10.48 | 
| 5 | Apple IIGS | 10.43 | 
This example represents a common real-world scenario where specific details matter, similar to someone searching for “silver iPhone SE 3rd gen with 128GB storage” in an e-commerce context. Here, keyword search performs better because the exact phrases “Apple II clone” and “128K RAM” serve as crucial distinguishing factors. Semantic search, by contrast, focuses on general similarity within the Apple product family, missing the specific technical requirements.
Hybrid search (50% semantic + 50% BM25)
| Rank | Computer | Score | 
|---|---|---|
| 1 | Franklin Ace 100 | 0.86 | 
| 2 | Apple IIc Plus | 0.78 | 
| 3 | **Franklin ACE 2100 (expected)** | 0.77 | 
| 4 | Apple IIGS | 0.76 | 
| 5 | Apple IIc | 0.73 | 
The hybrid approach places Franklin ACE 2100 in third position. While not perfect, the hybrid system keeps the correct answer visible, whereas semantic search alone would bury it outside the top results. Interestingly, rephrasing the query to “Which Apple II fork came with 128K RAM?” returns Franklin ACE 2100 first in keyword search and third in hybrid, suggesting the result is not merely a coincidence of matching obvious terms like “clone” or “shipped.”
Key patterns in hybrid search performance
These real-world queries illustrate fundamental patterns in retrieval behavior:
- Semantic search excels when understanding context and intent matters more than matching specific terms
 - Keyword search dominates when precision details (specs, model numbers, exact attributes) are the distinguishing factors
 - Hybrid search provides insurance against the weaknesses of either approach, consistently keeping relevant results near the top
 
While it is difficult to draw sweeping conclusions from a handful of examples without knowing the complete source data and implementation details, these cases offer a valuable glimpse into the challenges of building effective retrieval systems. They show how both methods can complement each other, with each compensating for the other’s blind spots.
When to choose hybrid search over single-method retrieval
The key takeaway? If your users ask questions across this spectrum, from broad conceptual queries to hyper-specific technical searches, you are likely better served by a hybrid approach than betting everything on a single retrieval method.
You can deploy this retrieval system and experiment with these examples yourself. We have prepared a comprehensive evaluation using a dataset of 169 vintage computer descriptions and 507 LLM-generated (gpt-4.1) evaluation queries. The dataset is small enough to run locally without any costs, yet representative enough to demonstrate the advantages of hybrid search for documents heavily loaded with technical keywords. How does it work? Each test query follows this structure:
```json
{
  "expected_doc_name": "Datapoint 2200",
  "question": "Which computer directly inspired the Intel 8008 microprocessor?",
  "question_type": "identifying_query",
  "evidence_quote": "architecture directly inspired Intel's 8008 microprocessor",
  "difficulty": "medium",
  "tags": [],
  "confuser_entity": ""
}
```
- Each of the 169 computer documents has been converted into embeddings (numerical representations of meaning) using the jina-embeddings-v3 model via Jina AI’s API and indexed for retrieval
 - For each of the 507 test queries, we generate embeddings using the same method
 - HNSW indexing with cosine similarity retrieves the most similar documents from the semantic search component, while BM25 handles keyword matching
 - Results from both methods are normalized and combined using configurable weights (e.g., 50% semantic + 50% keyword)
 - We examine the top 5 ranked results and compare them against the expected document name
 - We repeat the process for different ratios of dense (semantic) versus sparse (BM25) weighting to find the optimal balance
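A minimal sketch of the normalization and weighted fusion step described above; the function names, placeholder scores, and document labels are illustrative, not the actual pipeline code.

```python
# Sketch of weighted score fusion, assuming we already have per-document scores
# from the dense (cosine similarity) and sparse (BM25) retrievers.

def min_max_normalize(scores):
    """Rescale raw scores to the 0..1 range so dense and sparse are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(dense_scores, sparse_scores, dense_weight=0.5, top_k=5):
    """Combine normalized dense and sparse scores with a configurable weight."""
    dense_n = min_max_normalize(dense_scores)
    sparse_n = min_max_normalize(sparse_scores)
    fused = {
        doc: dense_weight * dense_n.get(doc, 0.0)
             + (1 - dense_weight) * sparse_n.get(doc, 0.0)
        for doc in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Placeholder scores, not taken from the evaluation data.
dense = {"Doc A": 0.56, "Doc B": 0.44, "Doc C": 0.41}
sparse = {"Doc B": 7.0, "Doc A": 6.5, "Doc D": 5.8}
print(hybrid_rank(dense, sparse, dense_weight=0.5))
```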
 
Evaluation results: dense (semantic) vs sparse (BM25) weight configurations
| Weight Configuration: Dense/Sparse | Accuracy@1 | Accuracy@3 | Accuracy@5 | MRR@1 | MRR@3 | MRR@5 | 
|---|---|---|---|---|---|---|
| 100/0% | 66.47% | 83.04% | 86.19% | 66.47% | 74.10% | 74.85% | 
| 70/30% | 78.30% | 89.35% | 93.89% | 78.30% | 83.23% | 84.28% | 
| 50/50% | 85.40% | 96.84% | 98.82% | 85.40% | 90.66% | 91.14% | 
| 30/70% | 89.15% | 97.24% | 98.03% | 89.15% | 92.93% | 93.13% | 
| 0/100% | 87.57% | 96.25% | 97.63% | 87.57% | 91.55% | 91.87% | 
Understanding the metrics:
- Accuracy@k measures the percentage of queries where the correct answer appears anywhere in the top k results. For example, Accuracy@3 means “how often is the right answer in the top 3 results?”
 - MRR@k (Mean Reciprocal Rank) measures how high the correct answer ranks, giving more credit when it appears first rather than third. It is calculated as the average of 1/rank for the first correct answer (e.g., rank 1 = 1.0, rank 2 = 0.5, rank 3 = 0.33).
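A minimal sketch of how these two metrics could be computed from ranked retrieval output; the data structures and names are illustrative, not the actual evaluation code.

```python
# Accuracy@k and MRR@k from ranked results. `results` maps each query to its
# ranked list of retrieved document names; `expected` maps each query to the
# correct document.

def accuracy_at_k(results, expected, k):
    """Fraction of queries whose correct document appears in the top k results."""
    hits = sum(1 for q, ranked in results.items() if expected[q] in ranked[:k])
    return hits / len(results)

def mrr_at_k(results, expected, k):
    """Average of 1/rank of the correct document within the top k (0 if absent)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc == expected[q]:
                total += 1.0 / rank
                break
    return total / len(results)

# Toy example: two queries with ranked lists of document names.
results = {"q1": ["Datapoint 2200", "Apple I"], "q2": ["IBM 5100", "IBM 5150"]}
expected = {"q1": "Datapoint 2200", "q2": "IBM 5150"}
print(accuracy_at_k(results, expected, 3))  # 1.0
print(mrr_at_k(results, expected, 3))       # 0.75
```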
 
Key observations: Sparse (keyword) search significantly outperforms pure dense (semantic) search for this dataset, achieving 87.57% top-1 accuracy compared to 66.47% for pure embeddings. The optimal configuration, 70% sparse / 30% dense, reaches 89.15% top-1 accuracy, while a balanced 50/50 split still delivers strong performance at 85.40%, demonstrating that weighted hybrid combinations provide robust results across different query types.
Worth noting: This retrieval system could be further improved with metadata filtering. For queries like “computers from 1981 with 128K RAM,” you could first filter by production year and memory specifications before running hybrid search, dramatically reducing the search space and improving both speed and accuracy.
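As a rough illustration of that idea, the sketch below filters on structured fields before any hybrid scoring runs; the document schema (the “year” and “ram_kb” fields) and the sample records are assumed for illustration, not taken from the actual dataset.

```python
# Metadata pre-filtering before hybrid retrieval: narrow the candidate set by
# structured fields first, then score only the survivors.

def prefilter(docs, year=None, ram_kb=None):
    """Keep only documents whose structured fields match the requested values."""
    candidates = docs
    if year is not None:
        candidates = [d for d in candidates if d.get("year") == year]
    if ram_kb is not None:
        candidates = [d for d in candidates if d.get("ram_kb") == ram_kb]
    return candidates

# Placeholder records with assumed metadata fields.
docs = [
    {"name": "Computer A", "year": 1981, "ram_kb": 128},
    {"name": "Computer B", "year": 1984, "ram_kb": 64},
]

# "computers from 1981 with 128K RAM": filter first, then run hybrid search
# over the much smaller candidate set.
print([d["name"] for d in prefilter(docs, year=1981, ram_kb=128)])  # ['Computer A']
```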
Weighing the hybrid approach: what your RAG gains and what it costs
Hybrid search is not a silver bullet; it is a deliberate design choice that trades operational simplicity for more robust relevance and control over your knowledge base. The core advantage lies in complementary coverage: sparse search reliably catches exact identifiers, while dense search captures intent and handles paraphrasing. Beyond better coverage, hybrid RAG offers explainability and control, allowing you to surface which specific terms matched, apply field boosts, and still benefit from semantic recall. This transparency makes debugging easier and helps users understand why they are seeing particular results.
| What you gain | What it costs | 
|---|---|
| Coverage: Catches both exact terms (IDs, dates, SKUs) and fuzzy intent (synonyms, paraphrases) | Dual systems: Two pipelines to run and monitor instead of one | 
| Explainability: Shows which words and fields matched for easier debugging and user trust | Synchronization: Must keep keyword analyzers and embeddings aligned as content changes | 
| Precision: Exact queries like “IBM 5150” or “2024-Q3” hit reliably | Query complexity: Result merging and score fusion add steps to each query | 
| Rich filtering: Facets and filters (date, category, entity) work naturally via keyword fields | Schema maintenance: Requires designing and keeping fields (titles, entities, tags) up to date | 
The real trade-off is operational complexity. You are now running two retrieval pathways that need to stay synchronized. Hybrid search introduces sparse-specific configuration, which means tokenization rules, text analyzers, keyword field structure, and fusion weights that balance the two scoring systems. The added overhead comes from keeping keyword indexes updated alongside embeddings and managing score fusion at query time.
Finding the right fit: when hybrid search makes sense
The decision to adopt hybrid search depends less on theoretical benefits and more on whether it solves problems you are actually experiencing. Before committing to hybrid retrieval in your RAG solutions, evaluate these key factors:
- Query diversity: Do your users ask both precise questions (“find invoice INV-2024-0847”) and exploratory ones (“how do I resolve payment disputes”)? Mixed patterns favor hybrid, while homogeneous queries may work with a single method.
 - Failure pattern analysis: Examine where your system breaks. Wrong quarters in date searches or confused product models signal a need for keyword precision, while users constantly rephrasing their queries suggest that semantic search is needed to fill the gap.
 - Content characteristics: Documents heavy with technical identifiers, model numbers, legal citations, or precise dates benefit more from hybrid systems than narrative content does. Consider what percentage of your corpus contains these exact-match elements.
 - Scale and operational burden: A static archive of 10,000 documents is a one-time setup cost, while millions of rapidly updating documents require continuous reprocessing of both keyword indexes and embeddings. Consider whether your team has the bandwidth to keep dual retrieval systems synchronized as content changes.
 
If you are uncertain, start simple: deploy pure semantic search first, log problematic queries, and let real failure patterns guide your next move. Test hybrid retrieval on a representative sample, measure the impact against your specific pain points, and adopt it only when the gains justify the added complexity your team can sustain.
Ready to see how fine-tuned RAG can improve your search results?
Meet directly with our founders and PhD AI engineers. We will demonstrate real implementations from 30+ agentic projects and show you the practical steps to integrate them into your specific workflows: no hypotheticals, just proven approaches.