When clean text is not enough: structured extraction for RAG

Garbage in, garbage out – every seasoned data scientist knows poor data can derail Retrieval-Augmented Generation (RAG). Yet there is a gap between having clean text and having retrieval-ready content. While raw text in a vector database may suffice for basic use cases, this article explores how LLMs can restructure that text into retrieval-ready formats for more refined search.
Is perfectly parsed text good enough for RAG?
Real-world data rarely arrives in perfect condition, regardless of its source. Data quality extends beyond just file formats or OCR accuracy. Even if you have clean text, perfectly parsed from an image or PDF, it still might not be in the best shape for building an effective RAG system. The key often lies in how you structure that extracted data, making it more comprehensible and easier to leverage in different contexts. Fortunately, large language models excel at structured data extraction.
Vector search and retrieval-augmented generation have emerged as critical AI workflow tools to make business data more structured and address (and amend) the disconnect between enterprise models and execution.
– Bianca Lewis, executive director of the OpenSearch Software Foundation, October 3, 2025, on "How RAG continues to 'tailor' well-suited AI"

Note: The order of steps 3 and 4 can vary depending on implementation. You might chunk first and then extract structure from each chunk, or extract structure from the full document and then chunk the enriched content.
What semantic search does well and where it falls short
In the simplest vector search implementation, you take a block of unstructured text (say, parsed from a PDF), perhaps split it into smaller chunks based on character count or paragraph breaks, create embeddings (numerical representations that capture meaning) for each chunk, and load them into a vector database. When you query the database, your question is semantically compared to those prepared chunks. Sounds like exactly what we need, right?
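To make that baseline concrete, here is a minimal sketch of the naive pipeline, assuming the sentence-transformers package and a plain NumPy array standing in for a real vector database; the file name parsed_article.txt and the embedding model are illustrative choices, not part of the original setup.

```python
# Naive RAG indexing and search: chunk -> embed -> store -> query.
# Assumes sentence-transformers is installed; parsed_article.txt is a placeholder
# for whatever raw text your PDF extraction step produced.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = open("parsed_article.txt").read()
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]   # naive paragraph chunking

chunk_vectors = model.encode(chunks, normalize_embeddings=True)     # one embedding per chunk

def search(query: str, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector                            # cosine similarity (normalized vectors)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(search("revolutionary IBM computer from the 80s"))
```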
In many cases, this straightforward approach works well, but we often find stumbling blocks hidden in the details. Examples? A RAG system will handle general phrases like “revolutionary IBM computer from the 80s” beautifully, but might struggle to distinguish between “IBM 5100” and “IBM 5150,” which is something classic keyword search handles easily. We run into similar problems with “IBM computer released in 1981” if the same database contains other models from around that time, as similar dates become confusing when relying purely on semantic search.
Making your retrieval system more robust and accurate
To address these limitations and build a more reliable retrieval system, engineers employ various techniques:
- Metadata extraction and filtering – narrow down chunks by specific attributes (like year or document type) before semantic search even begins
- Hybrid search – combine semantic vectors with traditional keyword matching (BM25, exact matches) to catch both conceptual and literal queries
- Query understanding – analyze the user’s question to route it appropriately or reformulate it for better retrieval
- Contextual Retrieval – Anthropic’s approach, which enriches each chunk with surrounding context before embedding
- Summary-based indexing – embed condensed versions of documents rather than raw text, filtering out noise while preserving key information
- Graph RAG – structure knowledge as interconnected nodes and relationships (patient → diagnosed_with → condition → treated_by → medication), then use semantic search to navigate the graph. Instead of searching through flat text chunks, you can traverse meaningful connections between entities, making it easier to answer complex questions that require understanding how different pieces of information relate to each other.
These approaches range from straightforward to sophisticated, but here is what they have in common: they all benefit from, or outright require, structured information extracted from your raw text.
Think about what each technique actually requires: metadata fields for filtering, keywords for hybrid search, contextual summaries for Contextual Retrieval, entity relationships for Graph RAG. None of these exist in your raw extracted text; you have to create them through structured information extraction. This is where structured extraction becomes critical. It is not just about making your data prettier, but about creating the foundation that allows advanced retrieval techniques to work at all.
| Retrieval technique | Structured elements required |
|---|---|
| Metadata Filtering | dates, categories, document types, tags |
| Hybrid Search | keywords, exact identifiers, product codes |
| Query Understanding | keywords, intent markers, domain terms |
| Contextual Retrieval | document title, section summaries, key concepts |
| Graph RAG | entities, relationships, attributes |
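To illustrate how one of these rows plays out in code, here is a rough hybrid search sketch that blends BM25 keyword scores with embedding similarity; it assumes the rank_bm25 and sentence-transformers packages, and the two sample chunks and the 50/50 weighting are illustrative.

```python
# Hybrid search sketch: blend BM25 keyword scores with embedding similarity so
# exact identifiers like "IBM 5150" are not lost in purely semantic matching.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "The IBM 5150 launched in August 1981 with an Intel 8088 processor.",
    "The IBM 5100 was a portable computer introduced in 1975.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)
bm25 = BM25Okapi([c.lower().split() for c in chunks])                # keyword index

def hybrid_search(query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
    """Score = alpha * semantic similarity + (1 - alpha) * rescaled BM25."""
    semantic = chunk_vectors @ model.encode([query], normalize_embeddings=True)[0]
    keyword = np.array(bm25.get_scores(query.lower().split()))
    if keyword.max() > 0:
        keyword = keyword / keyword.max()                            # rescale BM25 to [0, 1]
    combined = alpha * semantic + (1 - alpha) * keyword
    return [chunks[i] for i in np.argsort(combined)[::-1][:k]]

print(hybrid_search("IBM 5150"))                                     # the exact model number now carries weight
```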
How structured data extraction works in real-world RAG systems
To make this concrete, imagine you are building a search engine for thousands of articles stored as PDF files. Each article is a wall of text without clear headers or titles and with tables scattered throughout to add further complexity. So you use a PDF extraction tool and end up with raw text like the example below.
The IBM Personal Computer, model 5150, revolutionized the computing industry when it launched in the early 1980s. This ground breaking machine featured an Intel 8088 processor running at 4.77 MHz, starting with 16KB of RAM that could be expanded through additional memory chips and expansion cards. The system came with a distinctive beige chassis and IBM’s commitment to an open architecture, which allowed third-party manufacturers to create compatible hardware and software.
The 5150 featured five expansion slots, enabling users to add graphics cards, memory expansions, and various peripheral controllers. Its CGA graphics capability and optional floppy disk drives made it suitable for both business applications and personal use. The computer’s modular design philosophy encouraged customization and upgrades, making it accessible to businesses and enthusiasts alike.
| Event | Month | Year |
|---|---|---|
| Initial Release | August | 1981 |
| Peak Sales Period | — | 1982-1984 |
| Final Units Shipped | April | 1987 |
| Official Discontinuation | — | 1987 |
Storage options included cassette tape interface or one or two 5.25-inch floppy disk drives with 160KB capacity each. The base configuration started at $1,565 but a fully loaded system could exceed $3,000. IBM initially projected sales of 250,000 units over five years but exceeded that within the first year alone.
Meanwhile, the Apple II, which had been released in 1977, continued to dominate the education and home markets with its color graphics and extensive software library. Apple’s machine benefited from an earlier entry into the personal computer space and had established strong brand loyalty, particularly among schools and creative professionals who valued its superior graphics capabilities and user-friendly design.
The IBM 5150 established standards that persist in modern computing including the IBM PC compatible architecture and the PC expansion bus system. Its success spawned an entire industry of clone manufacturers and positioned IBM as the enterprise standard.
IBM-PC-5150-SYS-001-1981-US
Thanks to structured extraction, you can transform this wall of text into a more organized form tailored to your specific needs.
```json
{
  "name": "IBM PC 5150",
  "category": "Personal Computer",
  "keywords": [
    "IBM",
    "PC",
    "5150",
    "Intel 8088",
    "open architecture",
    "1980s"
  ],
  "summary": "The IBM Personal Computer model 5150, launched in 1981, featured an Intel 8088 CPU, up to 256KB RAM, five expansion slots, CGA graphics and optional floppy drives, establishing the PC architecture.",
  "tables": [
    {
      "title": "IBM PC 5150 Timeline",
      "data": [
        { "Event": "Initial Release", "Date": "August", "Year": 1981 },
        { "Event": "Peak Sales Period", "Date": "", "Year": "1982-1984" },
        { "Event": "Final Units Shipped", "Date": "April", "Year": 1987 },
        { "Event": "Official Discontinuation", "Date": "", "Year": 1987 }
      ]
    }
  ],
  "year_end": 1987,
  "year_start": 1981,
  "identification_number": "IBM-PC-5150-SYS-001-1981-US"
}
```
This transformation is what structured extraction delivers. What was once an unstructured wall of text is now a queryable, filterable dataset. Your vector database can now support metadata filtering by year (1981-1987), enable Graph RAG by linking the IBM 5150 entity to its specifications and timeline events, and power contextual retrieval with the pre-generated summary. The extracted table data is preserved in a structured format, and that identification number at the end becomes a searchable field rather than an orphaned string buried in text. But this power comes with a cost… literally.
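As a sketch of how these fields might flow into a retrieval stack, the snippet below loads a trimmed version of the extracted record into Chroma (chosen here purely as an example vector store) and applies a metadata filter before the semantic comparison; the collection name and query are illustrative.

```python
# Sketch: store the extracted fields as metadata alongside the summary embedding,
# then filter on that metadata before semantic matching. Uses Chroma's in-memory
# client as an example; any vector store with metadata filtering would work.
import chromadb

record = {
    "name": "IBM PC 5150",
    "category": "Personal Computer",
    "summary": "The IBM Personal Computer model 5150, launched in 1981, featured an Intel 8088 CPU, five expansion slots, CGA graphics and optional floppy drives.",
    "year_start": 1981,
    "year_end": 1987,
    "identification_number": "IBM-PC-5150-SYS-001-1981-US",
}

client = chromadb.Client()
collection = client.create_collection("articles")
collection.add(
    ids=[record["identification_number"]],
    documents=[record["summary"]],                     # embed the pre-generated summary
    metadatas=[{
        "category": record["category"],
        "year_start": record["year_start"],
        "year_end": record["year_end"],
    }],
)

# The metadata filter narrows candidates before the semantic comparison runs.
results = collection.query(
    query_texts=["early IBM personal computer"],
    n_results=1,
    where={"year_start": {"$lte": 1981}},
)
print(results["ids"])
```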
Tradeoffs involved in structured extraction
When you are working with hundreds of thousands of documents or more, the cost of processing each one through an LLM can add up quickly. You need to carefully weigh whether the improved retrieval accuracy justifies the extraction costs, or at minimum, ensure it makes business sense for your use case.
The good news? You can extract everything you need in a single pass. Rather than making multiple API calls to get summaries, then keywords, then metadata separately, modern LLMs can pull all structured elements at once—titles, dates, categories, entity relationships, and summaries all in one operation. This dramatically reduces both cost and document processing time.
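A single-pass extraction call might look roughly like the sketch below, which uses an OpenAI-style chat completions API in JSON mode; the model name, schema hint, and input file are illustrative, and the exact request shape differs between providers and SDK versions.

```python
# One request pulls summary, keywords, metadata and tables together, instead of
# separate calls per element. The schema hint mirrors the JSON example above.
import json
from openai import OpenAI

client = OpenAI()                                      # reads OPENAI_API_KEY from the environment

SCHEMA_HINT = (
    "Return JSON with keys: name, category, keywords (list), summary, "
    "tables (list), year_start, year_end, identification_number."
)

def extract(raw_text: str) -> dict:
    """Extract all structured elements from one document in a single call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                           # placeholder; any JSON-mode-capable model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You extract structured data from articles. " + SCHEMA_HINT},
            {"role": "user", "content": raw_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

structured = extract(open("parsed_article.txt").read())
print(structured["keywords"])
```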
Batch processing for structured extraction: a set-it-and-forget-it solution
Batch processing offers a compelling alternative for large-scale extraction tasks, with major providers like OpenAI, Anthropic, Google Gemini, and Mistral offering cost discounts of up to 50% compared to standard API calls. But the savings are not just financial, as batch processing simplifies your entire workflow by handling all the operational complexity for you.
Instead of managing thousands of individual API calls with retry logic, rate limit handling, and error tracking, you package all your requests into a single file (typically JSONL format), upload it, and let the provider handle everything. Most batch jobs complete within 24 hours, though many finish much faster. You also get significantly higher rate limits: a single batch can contain hundreds of thousands of requests, a volume that would be impractical to handle through real-time APIs.
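As an illustration of how little orchestration this takes, here is a sketch that builds a JSONL batch file and submits it through OpenAI's batch API; other providers use similar but not identical formats, and the document contents, model name, and file name are placeholders.

```python
# Build one JSONL line per document, upload the file, and start a batch job.
# The provider then handles retries, rate limits and scheduling.
import json
from openai import OpenAI

client = OpenAI()
documents = {"doc-001": "raw text of article 1 ...", "doc-002": "raw text of article 2 ..."}

with open("extraction_batch.jsonl", "w") as f:
    for doc_id, text in documents.items():
        request = {
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": "Extract name, keywords, summary and dates as JSON."},
                    {"role": "user", "content": text},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

batch_file = client.files.create(file=open("extraction_batch.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)                              # poll later and download the output file
```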
For structured extraction across large document collections, this means you can process your entire archive overnight without worrying about connection failures, implementing queuing systems, or babysitting the process. Upload your documents in the evening, retrieve the structured results in the morning.
Lowering costs with open-weight models for structured extraction
Despite the potential costs discussed above, you do not need expensive frontier models to test whether structured extraction improves your RAG system. We have created a comparison showing outputs from five popular open-weight models processing the same extraction task, so you can see firsthand how each model handles the IBM PC example above and compare their extraction accuracy. A short list with the model names is provided below.
| Model ID | Model Name |
|---|---|
| openai/gpt-oss-20b | GPT-OSS 20B |
| openai/gpt-oss-120b | GPT-OSS 120B |
| moonshotai/kimi-k2-instruct-0905 | Kimi K2 Instruct |
| meta-llama/llama-4-maverick-17b-128e-instruct | Llama 4 Maverick |
| meta-llama/llama-4-scout-17b-16e-instruct | Llama 4 Scout |
To make testing even easier, we have written a simple Python script that lets you experiment with these models through Groq’s free tier (no credit card required). One standout option is OpenAI’s gpt-oss-20b, which delivers impressive extraction quality and can even run locally on a good laptop if you want complete control over your data and zero API costs.
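For orientation, a call through Groq’s OpenAI-compatible chat completions API might look roughly like this; this is not the script mentioned above, just a minimal sketch that assumes a GROQ_API_KEY environment variable and JSON-mode support for the chosen model.

```python
# Minimal Groq free-tier sketch: send the raw article text and ask for JSON back.
import json
from groq import Groq

client = Groq()                                        # reads GROQ_API_KEY from the environment

raw_text = open("parsed_article.txt").read()           # placeholder for your parsed document
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract name, category, keywords, summary and years as JSON."},
        {"role": "user", "content": raw_text},
    ],
)
print(json.dumps(json.loads(response.choices[0].message.content), indent=2))
```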
Making the right choice for your RAG system
Structured extraction is not all-or-nothing; the question is how much structure you actually need. Start by examining where your retrieval system falls short: imprecise date searches, missing product codes, inability to filter by type. These pain points reveal exactly which structured elements will deliver value.
Test on a small scale first. Extract structure from a representative sample of 100 or 1,000 documents and measure the impact. Does metadata filtering reduce irrelevant results? Does hybrid search catch those exact model numbers semantic search misses? Clear improvements validate the approach before you commit to large-scale processing.
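If it helps, the measurement itself can be as simple as a hit-rate check; everything below (the toy corpus, the labeled queries, and the two placeholder retrievers) is illustrative and should be replaced with your own data and search functions.

```python
# Toy evaluation: does filtering on an extracted field (year_start) raise the
# fraction of queries whose expected document lands in the top-k results?
corpus = {
    "IBM-PC-5150-SYS-001-1981-US": {"text": "IBM PC 5150 Intel 8088 launched 1981", "year_start": 1981},
    "IBM-5100-PORTABLE-1975-US": {"text": "IBM 5100 portable computer launched 1975", "year_start": 1975},
}
labeled_queries = [
    {"query": "IBM computer released in 1981", "expected_id": "IBM-PC-5150-SYS-001-1981-US", "year": 1981},
]

def search_plain(query: str) -> list[str]:
    """Placeholder retriever: rank documents by word overlap with the query."""
    overlap = lambda doc: len(set(query.lower().split()) & set(doc["text"].lower().split()))
    return sorted(corpus, key=lambda doc_id: overlap(corpus[doc_id]), reverse=True)

def search_filtered(query: str, year: int) -> list[str]:
    """Same retriever, but drop documents whose extracted year_start does not match."""
    return [doc_id for doc_id in search_plain(query) if corpus[doc_id]["year_start"] == year]

def hit_rate(results_per_query: list[list[str]], k: int = 5) -> float:
    """Fraction of queries whose expected document appears in the top k results."""
    hits = sum(q["expected_id"] in results[:k] for q, results in zip(labeled_queries, results_per_query))
    return hits / len(labeled_queries)

print("plain:   ", hit_rate([search_plain(q["query"]) for q in labeled_queries]))
print("filtered:", hit_rate([search_filtered(q["query"], q["year"]) for q in labeled_queries]))
```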
Key factors to evaluate:
- Volume – 1,000 documents is negligible; 10 million requires cost justification
- Query patterns – Broad semantic searches need less structure; precise filtering and exact matching need more
- Update frequency – Static archives are one-time costs; frequently updated datasets require continuous processing
- Model selection – Frontier models are not always necessary; smaller models handle simple extraction at lower cost
- Batch processing – For large archives, batch APIs offer up to 50% savings and eliminate most of the operational complexity
The goal is pragmatic optimization: extract enough structure to enable the retrieval techniques that solve your actual problems, without over-engineering for theoretical gains. Start with your pain points, validate the solution on a subset, then scale with confidence once you know which structured elements actually improve your users’ search experience.
Ready to see how sophisticated RAG systems can transform business workflows?
Meet directly with our founders and PhD AI engineers. We will demonstrate real implementations from 30+ agentic projects and show you the practical steps to integrate them into your specific workflows—no hypotheticals, just proven approaches.
