Beyond Frontier Models: Testing Lightweight LLMs for Document Processing in RAG

Structured extraction is one of the most effective ways to enhance Retrieval-Augmented Generation systems, enabling everything from metadata filtering to Graph RAG. By enriching walls of text with summaries and keywords, you create powerful retrieval capabilities. But this comes with a cost: the pre-processing overhead of running each document through an LLM before indexing.
Do you really need expensive frontier models to get good results? We evaluated how well smaller open-weight models handle extraction tasks. This article presents evaluation results for smaller variants of DeepSeek, Gemma, Llama, Mistral, Phi, and Qwen, demonstrating how to use the Pydantic Evals framework with deterministic tests and LLM-as-a-judge approaches.
Vector search and retrieval-augmented generation have emerged as critical AI workflow tools to make business data more structured and address (and amend) the disconnect between enterprise models and execution.
– Bianca Lewis, executive director of the OpenSearch Software Foundation, October 3, 2025, in "How RAG continues to ‘tailor’ well-suited AI"
Why use smaller open-weight models?
Why bother with small open-weight models when capable LLMs are so widely available? While individual consumers can rely on ChatGPT web sessions, businesses often face stricter requirements. Public-sector bodies and NGOs in particular frequently must run models locally, out of necessity rather than preference.
Three factors drive this need:
- Data privacy and security – Sensitive documents and extracted information stay on-premise. No data leaves your infrastructure through third-party APIs. This ensures compliance with data protection regulations and internal security policies.
- Cost efficiency – API costs disappear entirely. Operating expenses become predictable and controllable. This matters especially when processing large document volumes at scale, where per-token pricing quickly adds up.
- Independence and control – No vendor lock-in limiting your options. You maintain full control over model deployment and customization. Service availability does not depend on external providers or their pricing.
These advantages make smaller models compelling for many organizations. The critical question is whether you must sacrifice accuracy to gain these benefits. Our evaluation suggests you do not.
However, there is no shortcut to finding the right model. You need systematic evaluation with your specific data and tasks. This section shares our findings from testing several models on document extraction, followed by recommendations for conducting your own evaluation processes.
How we conducted our experiments
We designed our evaluation to balance rigor with practical constraints:
- Simple prompting for comparability – A relatively short and simple prompt ensured comparable results across models. Though techniques like few-shot prompting can improve performance, we kept the baseline prompt consistent across all model tests.
- Complex schema for stress testing – The extraction schema was fairly complex and particularly challenging for the smaller models tested. This pushed models beyond simple data extraction to test their limits.
- Selective metrics – Rather than evaluating every extracted field, we picked a few metrics highlighting different aspects of extraction quality. These are explained in detail in the table below.
- GPT-4.1 baseline – A strong general-purpose model (GPT-4.1) created reference extractions as our baseline for comparison.
- Pydantic Evals framework – Pydantic Evals (part of the PydanticAI ecosystem) provided the infrastructure to create evaluators and run the evaluation systematically; a minimal setup sketch follows this list.
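To make this concrete, here is a minimal sketch of wiring a case, a dataset, and a custom evaluator together with Pydantic Evals. The schema, task, and case contents are illustrative placeholders rather than our production code, and the pydantic_evals interfaces shown may differ slightly between versions.

```python
from dataclasses import dataclass

from pydantic import BaseModel
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


class ComputerRecord(BaseModel):
    """Illustrative extraction schema (our real schema is considerably larger)."""
    name: str
    manufacturer: str
    release_year: int | None = None
    summary: str = ""


@dataclass
class SummaryLength(Evaluator):
    """Deterministic check: the summary must fall inside the expected length range."""
    min_chars: int = 100
    max_chars: int = 300

    def evaluate(self, ctx: EvaluatorContext) -> float:
        return 1.0 if self.min_chars <= len(ctx.output.summary) <= self.max_chars else 0.0


async def extract_computer_record(document_text: str) -> ComputerRecord:
    # Placeholder task: in the real pipeline this calls a local model via Ollama
    # and parses its JSON output into ComputerRecord.
    return ComputerRecord(name="IMSAI 8080", manufacturer="IMSAI", summary=document_text[:200])


dataset = Dataset(
    cases=[
        Case(
            name="imsai-8080",
            inputs="The IMSAI 8080 was an early S-100 microcomputer ...",
            expected_output=ComputerRecord(
                name="IMSAI 8080", manufacturer="IMSAI Manufacturing Corporation"
            ),
        )
    ],
    evaluators=[SummaryLength()],
)

report = dataset.evaluate_sync(extract_computer_record)
report.print()
```

Each metric in the next section was implemented as an evaluator of this kind, returning a 0.0–1.0 score per document.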
Picking the right metrics
The table below shows the selected evaluation metrics we used to assess extraction quality. These examples illustrate the evaluation process and do not include the complete set of fields we extracted from each document.
| Metric | Related attributes in schema | Eval method | Rationale: What aspect of extraction we are testing |
| --- | --- | --- | --- |
| Summary length & ending | summary | Deterministic | Checks that the summary is neither too short nor too long (roughly 100–300 characters) and that it ends with clean sentence punctuation (optionally followed by a closing quote or bracket). Scores 1.0 if both conditions are met, otherwise 0.0. Favors concise, well-formed snippets over fragments. |
| Name similarity | name | Deterministic (RapidFuzz) | Normalizes both baseline and extracted names (lowercasing, collapsing whitespace) and compares them using fuzz.ratio(). Returns a similarity score from 0.0–1.0, rewarding close matches while tolerating minor punctuation and formatting differences. |
| Manufacturer similarity | manufacturer | Deterministic (RapidFuzz) | Uses the same fuzz.ratio() approach as name similarity on the manufacturer field. Captures cases where models expand or slightly rephrase manufacturer names while still staying semantically aligned. |
| Release year accuracy | release_year | Deterministic | Treats the release year as an integer and scores 1.0 only for exact matches between expected and extracted values (after type conversion), otherwise 0.0. Tests whether models correctly anchor facts to the right year. |
| Keyword coverage | keywords (expected) + summary/description (actual) | Deterministic (fuzzy set coverage) | Normalizes expected and extracted keywords, then uses fuzzy matching and set coverage to measure how many important concepts appear in the model output, even when phrased differently. Ensures summaries highlight the right core ideas, not just any detail. |
| Coverage | full extraction payload | Deterministic | Checks whether the model produced any meaningful extraction for a document (non-empty fields, lists, or nested structures). Returns 1.0 when at least one field is populated, 0.0 otherwise. |
| Summary quality (LLM judge) | summary | LLM-as-a-judge (moonshotai/kimi-k2-0905) | Evaluates summary quality on three dimensions: fidelity to baseline facts (0-4 pts), comparison to the baseline summary (0-3 pts), and historical context/narrative richness (0-3 pts). Produces a 0–10 score normalized to 0.0–1.0, providing a nuanced quality signal beyond what deterministic metrics can capture. |
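To make the deterministic checks concrete, here is a small sketch of two of them as plain scoring functions. The helper names and the exact regular expression are illustrative, but the logic follows the table: RapidFuzz's fuzz.ratio() for name similarity, and a length-plus-punctuation check for summaries.

```python
import re

from rapidfuzz import fuzz


def summary_length_and_ending(summary: str, min_chars: int = 100, max_chars: int = 300) -> float:
    """1.0 when the summary is in range and ends with clean sentence punctuation."""
    in_range = min_chars <= len(summary) <= max_chars
    clean_ending = bool(re.search(r"""[.!?]["')\]]?\s*$""", summary))
    return 1.0 if in_range and clean_ending else 0.0


def _normalize(s: str) -> str:
    return " ".join(s.lower().split())


def name_similarity(expected: str, actual: str) -> float:
    """Fuzzy string similarity (0.0-1.0) after lowercasing and whitespace collapsing."""
    return fuzz.ratio(_normalize(expected), _normalize(actual)) / 100.0


print(name_similarity("Poqet Computer Corporation", "Poqet"))  # ~0.32, matching the mismatch table below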
Key design principles:
- Character-based fuzzy matching (fuzz.ratio) for name and manufacturer fields handles punctuation and formatting variations gracefully
- Keyword coverage uses simple substring matching for efficiency and interpretability
- Release year matching is exact (no tolerance) to ensure precision in temporal data extraction
- Coverage evaluator checks for name presence as the minimum viable extraction signal
- LLM judge focuses on the description field for semantic/contextual evaluation where deterministic metrics are insufficient (a prompt sketch follows after this list)
- All evaluators return 0.0-1.0 scores for consistent aggregation and comparison
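The LLM-as-a-judge metric is the one place where we step outside deterministic scoring. Below is a minimal sketch of how such a judge can be called through OpenRouter's OpenAI-compatible endpoint; the rubric wording and the JSON parsing are illustrative assumptions, not our exact production prompt.

```python
import json
import os

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible endpoint

JUDGE_PROMPT = """You are grading an extracted summary of a historical computer.
Score it on three dimensions and reply with JSON only:
- "fidelity": 0-4 points, factual agreement with the baseline facts
- "comparison": 0-3 points, quality relative to the baseline summary
- "context": 0-3 points, historical context and narrative richness

Baseline summary:
{baseline}

Candidate summary:
{candidate}
"""

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


def judge_summary(baseline: str, candidate: str) -> float:
    """Return a 0.0-1.0 quality score by normalizing the judge's 0-10 total."""
    response = client.chat.completions.create(
        model="moonshotai/kimi-k2-0905",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(baseline=baseline, candidate=candidate),
        }],
    )
    scores = json.loads(response.choices[0].message.content)
    total = scores["fidelity"] + scores["comparison"] + scores["context"]
    return min(max(total / 10.0, 0.0), 1.0)
```

In practice, a function like this can be wrapped in a pydantic_evals Evaluator so it runs alongside the deterministic checks.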
Evals and metrics design challenges
Our approach prioritizes deterministic evaluation methods. We keep the use of LLM-as-a-judge (where another model evaluates extraction results) limited to cases where deterministic metrics are not sufficient in order to reduce ambiguity and additional evaluation costs. However, even something as straightforward as comparing product names can be tricky depending on your specific dataset. Consider these evaluation results:
Manufacturer mismatches
| Expected (GPT-4.1) | Actual (Gemma3-12b) | Our score |
| --- | --- | --- |
| Apple (manufactured by Sharp) | Sharp | 0.29 |
| Poqet Computer Corporation | Poqet | 0.32 |
| NCR Corporation | NCR | 0.33 |
| Altos Computer Systems | Altos | 0.37 |
| Tano Corporation | Tano | 0.40 |
Though they receive low scores, the Gemma3 results are not actually bad. On investigation, we found that the full name “Poqet Computer Corporation” does not appear in the source data: GPT-4.1 expanded “Poqet” to the full corporate name, and we used those results as our baseline. This illustrates a common challenge: you rarely have ideal reference data and often must create it synthetically. The apparent discrepancies reflect the baseline model going beyond the source text, not a failure of the evaluated model. The same trade-off between strict metrics and nuanced judgment becomes even more pronounced when you evaluate summarization quality, which we discuss below.
Evaluation results across models
We evaluated 13 models on structured extraction from 168 historical computing documents, comparing their outputs against a GPT-4.1-generated baseline. While this benchmark is model-generated rather than human-verified ground truth, it provides a consistent reference point for comparison. All models were run via Ollama with quantization.
Our evaluation combined deterministic metrics (name similarity, manufacturer similarity, release year accuracy, keyword coverage, alias coverage, summary length, and field coverage) with an LLM judge (moonshotai/kimi-k2-0905 via OpenRouter) scoring summary quality on three dimensions: fidelity to baseline facts (0-4 pts), comparison to baseline summary (0-3 pts), and context/narrative quality (0-3 pts). The composite scoring applies z-score normalization with these weights: Summary Quality (30%), Coverage (10%), and 12% each for other metrics. Note that z-score normalization can amplify small metric differences and the judge model significantly shapes quality scores.
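For reference, the sketch below shows roughly how such a weighted composite can be computed with pandas. The per-model values are abbreviated to the first three models from the results table, and the final min-max rescale to the 0.00–1.00 range is our assumption about how the published scores were normalized.

```python
import pandas as pd

# Per-model mean metric scores (rows: models, columns: metrics), abbreviated here
# to three models from the results table below.
scores = pd.DataFrame(
    {
        "summary_quality": [0.46, 0.40, 0.38],
        "coverage": [0.97, 0.99, 1.00],
        "name": [0.87, 0.93, 0.94],
        "manufacturer": [0.85, 0.85, 0.84],
        "year": [0.93, 0.96, 0.98],
        "keywords": [0.51, 0.49, 0.49],
        "summary_length": [0.90, 0.88, 0.90],
    },
    index=["gpt-oss-20b", "llama3.1-8b", "llama3.2-3b"],
)

weights = {
    "summary_quality": 0.30,
    "coverage": 0.10,
    # remaining metrics get 12% each
    "name": 0.12, "manufacturer": 0.12, "year": 0.12, "keywords": 0.12, "summary_length": 0.12,
}

# z-score normalize each metric across models, then apply the stated weights
zscores = (scores - scores.mean()) / scores.std(ddof=0)
composite = sum(zscores[col] * w for col, w in weights.items())

# The published table runs from 1.00 down to 0.00; a min-max rescale of the
# weighted score is our assumption about that final normalization step.
final = (composite - composite.min()) / (composite.max() - composite.min())
print(final.sort_values(ascending=False))
```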
Performance rankings
Evaluation Comparison (by composite score)
| Rank | Model | Score | Success | Name | Manufacturer | Year | Keywords | Coverage | Summary Length | Summary Quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | gpt-oss-20b | 1.00 | 163/168 | 0.87 | 0.85 | 0.93 | 0.51 | 0.97 | 0.90 | 0.46 |
| 2 | llama3.1-8b | 0.98 | 167/168 | 0.93 | 0.85 | 0.96 | 0.49 | 0.99 | 0.88 | 0.40 |
| 3 | llama3.2-3b | 0.97 | 168/168 | 0.94 | 0.84 | 0.98 | 0.49 | 1.00 | 0.90 | 0.38 |
| 4 | qwen3-8b | 0.97 | 161/168 | 0.90 | 0.83 | 0.89 | 0.53 | 0.96 | 0.86 | 0.46 |
| 5 | phi4-14b | 0.91 | 167/168 | 0.93 | 0.86 | 0.98 | 0.51 | 0.99 | 0.83 | 0.31 |
| 6 | gemma3-12b | 0.89 | 162/168 | 0.89 | 0.86 | 0.93 | 0.45 | 0.96 | 0.90 | 0.41 |
| 7 | deepseek-r1-7b | 0.70 | 167/168 | 0.87 | 0.85 | 0.91 | 0.48 | 0.99 | 0.77 | 0.29 |
| 8 | gemma3-4b | 0.53 | 152/168 | 0.85 | 0.78 | 0.86 | 0.43 | 0.90 | 0.77 | 0.37 |
| 9 | qwen3-4b | 0.49 | 168/168 | 0.92 | 0.87 | 0.93 | 0.54 | 1.00 | 0.23 | 0.13 |
| 10 | gemma3-1b | 0.36 | 156/168 | 0.67 | 0.76 | 0.83 | 0.51 | 0.93 | 0.89 | 0.26 |
| 11 | qwen3-1.7b | 0.19 | 144/168 | 0.78 | 0.76 | 0.79 | 0.42 | 0.86 | 0.57 | 0.30 |
| 12 | deepseek-r1-1.5b | 0.02 | 161/168 | 0.76 | 0.69 | 0.83 | 0.38 | 0.96 | 0.55 | 0.17 |
| 13 | mistral-7b | 0.00 | 146/168 | 0.79 | 0.72 | 0.82 | 0.43 | 0.87 | 0.35 | 0.19 |
The composite score above weights summary quality heavily (30%), which may not reflect your priorities. The rank-based view below offers an alternative perspective: it treats all metrics equally and shows each model’s average rank across the individual metrics, revealing which models perform most consistently regardless of how you weight the final score.
Rank-Based View (average rank across metrics; lower is better)
| Rank | Model | AvgRank | Name | Manufacturer | Year | Keywords | Coverage | Summary Length | Summary Quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | llama3.2-3b | 3.71 | 1 | 7 | 2 | 7 | 1 | 3 | 5 |
| 2 | phi4-14b | 3.86 | 2 | 2 | 1 | 3 | 5 | 7 | 7 |
| 3 | llama3.1-8b | 4.29 | 3 | 5 | 3 | 6 | 4 | 5 | 4 |
| 4 | gpt-oss-20b | 4.57 | 8 | 4 | 5 | 5 | 6 | 2 | 2 |
| 5 | gemma3-12b | 4.71 | 6 | 3 | 4 | 9 | 7 | 1 | 3 |
| 6 | qwen3-8b | 5.57 | 5 | 8 | 8 | 2 | 9 | 6 | 1 |
| 7 | qwen3-4b | 5.71 | 4 | 1 | 6 | 1 | 2 | 13 | 13 |
| 8 | deepseek-r1-7b | 6.86 | 7 | 6 | 7 | 8 | 3 | 8 | 9 |
| 9 | gemma3-1b | 8.86 | 13 | 10 | 11 | 4 | 10 | 4 | 10 |
| 10 | gemma3-4b | 9.14 | 9 | 9 | 9 | 11 | 11 | 9 | 6 |
| 11 | qwen3-1.7b | 11.14 | 11 | 11 | 13 | 12 | 13 | 10 | 8 |
| 12 | deepseek-r1-1.5b | 11.29 | 12 | 13 | 10 | 13 | 8 | 11 | 12 |
| 13 | mistral-7b | 11.29 | 10 | 12 | 12 | 10 | 12 | 12 | 11 |
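The rank-based view above can be reproduced from the same per-metric score table: rank the models on each metric, then average the ranks per model. A short sketch, reusing the scores DataFrame from the composite-score example earlier (the tie-handling method is an assumption):

```python
# Rank models per metric (1 = best score) and average the ranks per model.
# Assumes the per-metric `scores` DataFrame from the composite-score sketch above.
ranks = scores.rank(ascending=False, method="min")
avg_rank = ranks.mean(axis=1).sort_values()
print(avg_rank)  # lower is better
```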
Key findings
Four models cluster at the top with composite scores of 0.97–1.00, separated by just 0.03: gpt-oss-20b (1.00), llama3.1-8b (0.98), llama3.2-3b (0.97), and qwen3-8b (0.97). A noticeable gap appears after rank 4, with Phi-4 (14B) dropping to 0.91. Most models excel at field extraction: average coverage is 95%, and llama3.2-3b and qwen3-4b achieve a perfect 100%. Summary quality, by contrast, proved universally challenging. No model exceeded 0.46 on judge scores (average: 0.32), with gpt-oss-20b and qwen3-8b tied for best at 0.46. This gap confirms that summary generation is the hardest aspect of structured extraction.
The most striking finding is llama3.2-3b’s exceptional efficiency: at just 3 billion parameters, it nearly matches the 8B llama3.1’s performance (0.97 vs 0.98 composite) and achieved perfect 100% success rate (168/168 vs 167/168). This challenges conventional assumptions about parameter count requirements. The qwen3-4b paradox also stands out, achieving perfect coverage and topping several metrics (manufacturer 0.87, year 0.93, keywords 0.54) yet ranking only 9th overall due to frequently cutting summaries mid-sentence, triggering zero judge scores. Meanwhile, gpt-oss-20b won overall with 1.00 composite despite the lowest success rate (163/168), compensating with superior summary quality. Phi-4 demonstrates clear specialization: excelling at structured fields (0.98 year accuracy, 0.93 name similarity, 0.99 coverage) while struggling with narrative generation (0.31 judge score, ranking 11th on that metric).
Summary of best lightweight LLMs
Performance patterns reveal clear size-efficiency trends. In the 1-3B range, llama3.2-3b (3B) delivers elite performance rivaling much larger models, while tiny models like gemma3-1b (1B) and qwen3-1.7b struggle with consistency despite decent coverage. The 4-8B tier shows the strongest overall performance: llama3.1-8b and qwen3-8b both achieve composite scores ≥0.97, balancing coverage reliability with respectable summary quality. At 12-20B parameters, the picture becomes more nuanced. Phi-4 (14B) ranks 5th with 0.91 composite, excelling at structured extraction but weaker on summaries. Meanwhile, gpt-oss-20b and gemma3-12b demonstrate that larger models can excel at narrative generation (judge scores 0.46 and 0.41) without necessarily improving field extraction accuracy.
Mistral-7b’s poor performance (0.00 composite, last place) stands out as particularly surprising given its 7B parameter count. However, these results reflect only this specific structured extraction task with our particular prompting approach and schema. We make no claims about overall model quality, as the same models can show dramatically different results in other contexts with alternative prompts, schemas, or use cases. Model performance is highly task-dependent, which is why systematic evaluation with your specific data remains essential. Well-tuned lightweight LLMs for local use consistently outperform poorly-optimized larger ones, making the 3-8B range the optimal choice for production use, balancing quality, efficiency, and resource requirements.
Fact-focused vs contextual summaries
One of the most challenging tasks during extraction, and at the same time a strong indicator of how deeply a model understands a text, turned out to be creating summaries of the processed documents. Strictly speaking, summarization goes beyond pure extraction, but in practice it is a common requirement.
Despite deliberately not specifying the desired summary style in our system instructions, we observed a clear pattern. Smaller models produced more fact-focused summaries, while larger models attempted to present broader context and better synthesize information.
Both approaches ground summaries in factual information, but differ in scope and intent. Fact-focused summaries catalog what a computer was, while contextual summaries build on those facts to explain why it mattered in the broader arc of computing history.
Note that contextual summaries are not always preferable. If you expect dry, maximally objective fact-based summaries, specify this explicitly in your prompt.
How different models handle summarization tasks
One small model, Llama3.2-3b, performed surprisingly well for its size at creating summaries. Its descriptions consistently balanced between fact-focused and contextual approaches. Despite not being perfect, it outperformed many larger models. We expected summaries within a specific character range (100–300 characters), which proved difficult for many models, including some larger 7B models. The otherwise capable Mistral-7b performed poorly here, often returning mechanically truncated fragments cut off around 300 characters without proper editing.
The selection of examples below shows significant variability in results, meaning a much smaller model running locally can match or even exceed top-tier commercial models. Notice how the Llama3.2-3B and Gemma3-12b descriptions of the Commodore 64 outperform GPT-4.1’s.
| Computer | Llama3.2-3B | Gemma3-12b | GPT-4.1 |
| --- | --- | --- | --- |
| IMSAI 8080 | The IMSAI 8080 is an early S-100 based system that was designed to be highly expandable. It uses the Intel 8080A CPU at 2.0 MHz and has a maximum RAM of 64K. | The IMSAI 8080 was a popular early microcomputer based on the Intel 8080A processor. It was designed to address the supply issues with the Altair 8800 and became known for its expandability via the S-100 bus. | The IMSAI 8080, released in 1975 by IMSAI Manufacturing Corporation, is an Intel 8080A-based kit or assembled computer, famous as the first successful Altair 8800 clone and for its expandable S-100 bus architecture. |
| Atari 800 | The Atari 800 is a home computer designed by Jay Miner and released in November 1979. It features a MOS 6502 CPU, 8K base RAM, and 48K max RAM, as well as a 24 x 40 text display and 320 x 192 monochrome display. | Announced in December 1978 and released in November 1979, the Atari 800 was a home computer with a MOS 6502 processor running at 1.8 MHz and a base RAM of 8 KB, expandable to 48 KB. It offered various display resolutions and storage options including a 90 KB floppy drive and cassette recorder. | The Atari 800 is a modular 8-bit home computer introduced by Atari in late 1979, notable for its expandability, advanced graphics, and gaming capabilities. |
| Commodore 64 | The Commodore 64 was a groundbreaking personal computer released in late 1982 by Commodore Business Machines, Inc. It built upon the success of its predecessor, the VIC-20, and went on to become one of the most popular computers of all time. | The Commodore 64 was created by Commodore Business Machines, Inc. and released in late 1982. It was designed to improve upon the features of the VIC-20 and became the best-selling single computer system with approximately 17 million units sold. | The Commodore 64 is an 8-bit home computer released in late 1982 by Commodore, selling about 17 million units, more than any other single model. |
What about the smallest models? Does this mean models around 1B parameters cannot handle summaries? Not necessarily. These models are designed with fine-tuning in mind. Even tiny ones like Gemma3-270M can produce quality summaries after proper adaptation. A common practice with smaller models is to separate extraction and summarization into distinct steps, as they may struggle to handle both tasks simultaneously. First perform extraction, then use the structured, cleaned data for summarization with models fine-tuned specifically for the purpose.
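A sketch of that two-step flow with local models is shown below. It assumes the ollama Python client and locally pulled models; the model tags, prompts, and helper names are illustrative rather than the exact pipeline we used.

```python
import json

import ollama  # assumes the models referenced below are already pulled locally

EXTRACTION_PROMPT = (
    "Extract the computer's name, manufacturer and release_year from the document "
    "and answer with JSON only.\n\nDocument:\n{document}"
)
SUMMARY_PROMPT = (
    "Write a 100-300 character summary of this computer based only on these fields:\n{fields}"
)


def extract_then_summarize(document: str) -> dict:
    # Step 1: structured extraction with a small general-purpose model.
    extraction = ollama.chat(
        model="llama3.2:3b",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(document=document)}],
        format="json",  # ask Ollama to constrain the output to valid JSON
    )
    fields = json.loads(extraction["message"]["content"])

    # Step 2: summarize the cleaned, structured fields with a (possibly fine-tuned) summarizer.
    summary = ollama.chat(
        model="gemma3:1b",  # stand-in for a model fine-tuned specifically for summarization
        messages=[{"role": "user", "content": SUMMARY_PROMPT.format(fields=json.dumps(fields))}],
    )
    fields["summary"] = summary["message"]["content"].strip()
    return fields
```

Splitting the work this way keeps each prompt short and lets you swap in a task-specific summarizer without touching the extraction step.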
Key findings
Our testing revealed several patterns:
- No universal solution exists. One model may excel at a task while another, similar model fails. Everything depends on your specific use case and requirements.
- Responsiveness to prompts improves with model size. Smaller models respond poorly to prompt instructions and few-shot learning. Medium-sized models (7B) show some improvement, while larger models handle these techniques more reliably.
- Fine-tuning extends smaller model capabilities. For relatively simple tasks with limited token usage (such as processing one-page documents), you can fine-tune smaller models on consumer hardware (24GB RAM) using quantized models distributed through Ollama or MLX. These often outperform larger general-purpose models.
- Summarization is harder than extraction across all model sizes. Smaller models (2-4B and 7B) often handle basic data extraction adequately but struggle to create quality summaries. Models around 12B parameters like Gemma-12b begin to produce consistent summaries that properly account for context. The notable exception was Llama3.2-3b, which we discussed earlier.
Recommendations: run your own lightweight LLM evaluations
Our evaluation covers a relatively small dataset, and your results will vary with different documents and extraction tasks. The goal here is to demonstrate a systematic evaluation approach you can apply to your own use case. Here is how to make the model choice systematically:
- Establish systematic evaluation. Without proper test cases and reproducible benchmarks, you cannot reliably compare models. Anecdotal testing will mislead you.
- Test with your actual data. These results reflect our specific documents and extraction schemas. Your data may behave differently. Build a representative test set from your real-world documents.
- Prompt engineering matters as much as model selection. How you structure your extraction prompt significantly affects accuracy across all model sizes. Invest time in prompt refinement alongside model testing.
- Consider fine-tuning for specialized needs. If your task has specific patterns or domain language, a fine-tuned smaller model may outperform a larger general-purpose one.
Structured extraction remains one of the most powerful ways to enhance retrieval-augmented generation systems. The evaluation approach demonstrated here shows you can achieve this without relying on frontier models. With systematic testing, you can confidently deploy smaller models that deliver the extraction quality your RAG pipeline needs while maintaining control, privacy, and predictable costs.