How we built a production RAG pipeline for geopolitical intelligence using LlamaParse and Qdrant

We at Vstorm built a production RAG system for a geopolitical intelligence firm with two pipelines working in tandem. The ingestion pipeline runs every two hours: it polls Google Drive, parses reports with LlamaParse’s agentic mode using a custom prompt, chunks at paragraph boundaries, generates hybrid vectors (dense OpenAI embeddings plus sparse BM25), and upserts into Qdrant with a metadata payload. The retrieval pipeline handles each analyst query: scope validation, query embedding, LLM-parsed temporal and thematic filtering, hybrid search with a minimum score threshold, Reciprocal Rank Fusion, a recency boost, and a cited LLM answer with a follow-up question. This article documents every decision in both pipelines.
Most RAG pipelines fail before the LLM ever generates a word. The failure happens at ingestion, when a PDF is passed through a basic loader that strips tables, discards images, and returns a flat string that carries none of the document’s structural meaning. Then the LLM receives degraded context and produces unreliable answers, so the team diagnoses a retrieval problem while the real problem remains upstream.
We at Vstorm encountered this directly when building a production LlamaParse RAG pipeline for a geopolitical intelligence firm. The engagement required us to make dense analytical reports, full of embedded charts, probability tables, and annotated data, queryable at scale. The system has two distinct pipelines: an ingestion pipeline that keeps the vector store up to date, and a retrieval pipeline that handles each analyst query from validation through to a cited, source-grounded answer. What we built, and why we made the technical decisions we did, is the subject of this article.
Why document parsing is where most RAG pipelines break
The standard approach to PDF ingestion relies on libraries such as PyPDF2 or pdfminer. These tools extract text character by character. They work adequately for clean, text-only documents, but start to break down when presented with anything more complex.
A geopolitical report is not a clean document. It contains multi-column layouts, embedded charts with labeled axes, data tables spanning multiple pages, and images that carry analytical context. A basic PDF loader reads such a document and returns a string in which table rows have collapsed into comma-separated fragments, chart labels have been detached from their data, and page headers have merged with paragraph text. The chunks derived from this string are semantically incoherent and the vector embeddings built from those chunks carry noise into every retrieval call.
Any team that has tried to build a retrieval system over a corpus of real-world documents, such as annual reports, legal contracts, technical specifications, or intelligence briefings, will recognize this pain point. The symptom is an LLM that gives plausible-sounding but factually incorrect answers. The cause is that the retrieved context did not accurately represent the source documents.
Fixing retrieval accuracy by tuning the LLM or adjusting chunking strategy is treating a downstream symptom. The correct intervention is upstream: using a parser that preserves document structure.
The engagement: geopolitical intelligence at scale
We built a production LlamaParse RAG pipeline for our work with a geopolitical intelligence firm that produces analytical reports on political risk, regulatory change, and macroeconomic developments across multiple regions. The reports are research-dense: each contains narrative analysis, probability tables showing scenario forecasts, charts, and structured data across dozens of pages. Prior to the system we built, that knowledge existed only as individual geopolitical reports posted on their forum and PDF files downloaded from Google Drive. An analyst looking for prior coverage of a specific country or policy scenario had to manually search document archives, open likely candidates, and read them in full to find the relevant passages. There was no way to query across the corpus.
The pipeline we built runs continuously, every two hours. It begins with standard change detection and download of new or updated PDF documents; each PDF is then submitted to LlamaParse, which is configured through three operational settings before parsing begins.
- Image parsing. Image parsing can be enabled or disabled per document type. For reports containing charts and annotated visuals, it is enabled so that analytical content carried in the visuals is not lost; when it is disabled, images are skipped.
- Custom parsing prompt. A custom prompt instructs LlamaParse to preserve document structure and extract probability tables with their three columns intact: scenario, current probability, and historical probability. Without explicit instruction, general-purpose parsing can merge or fragment these columns.
- Parsing mode. LlamaParse offers three modes — fast, cost-effective, and agentic. The pipeline uses the agentic mode, which delivers the highest accuracy. The output is a markdown file with page-level metadata — page number and page footer — preserved as structured fields alongside the extracted text.
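The three settings above reduce to a small configuration passed to the parser. A minimal sketch follows; the dictionary keys and the instruction text are illustrative assumptions, not the exact `llama_parse` client API (parameter names vary across client versions), and the real call would look something like `LlamaParse(**PARSER_CONFIG)`:

```python
# Illustrative parser settings for the ingestion stage. These names
# are assumptions for the sketch, not the client's exact parameters.
PARSER_CONFIG = {
    "result_type": "markdown",   # layout-aware markdown output
    "parse_images": True,        # toggle per document type; True keeps chart content
    "mode": "agentic",           # highest-accuracy of the three parsing modes
    "parsing_instruction": (
        "Preserve the document's structure. Extract probability tables "
        "with their three columns intact: scenario, current probability, "
        "historical probability. Do not merge or reorder columns."
    ),
}
```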
Users can now submit a natural-language query, such as “what is our coverage of export control developments in Southeast Asia across the past 18 months,” and receive source-grounded answers drawn from the full document corpus, with citations to the specific reports, pages, and sections that support each finding.
For more on how Vstorm approaches RAG engineering, see our RAG development service page.
What LlamaParse does that standard PDF loaders do not
LlamaParse is a document parsing platform built for the complex-document ingestion requirements of enterprise RAG systems. It supports over 90 document formats and, as of early 2025, had processed over one billion documents (LlamaIndex, 2025). Unlike standard PDF loaders, it offers three parsing modes — fast, cost-effective, and agentic — each trading cost for accuracy. The agentic mode uses a vision-capable model to understand document layout, not just extract characters.
Two configuration options were directly relevant to this engagement. First, image parsing: LlamaParse can be instructed to process embedded visuals and represent their content in the markdown output. For geopolitical reports, this is not optional. Charts showing risk indices over time and maps annotated with event data carry analytical meaning that a text-only parse discards entirely.
Second, the custom parsing prompt. LlamaParse accepts natural-language instructions that direct the model’s attention during parsing. We used this to specify two objectives: preserve the overall document structure and extract the probability tables that appear throughout the reports, specifically their three columns (scenario, current probability, historical probability). Without this instruction, the agentic model may reformat or merge those columns in ways that break downstream retrieval of probability data. The prompt makes the extraction deterministic for the structures we care about.
The output is a markdown file. Alongside the text content, LlamaParse preserves page-level metadata, like page number and page footer, as structured fields. These flow directly into the chunk payload in the next stage, giving every retrieved chunk a precise document location.
Dean Barr, Applied AI Lead and Data Scientist at Carlyle, has observed that LlamaParse’s handling of nested tables, complex spatial layouts, and image extraction is essential for maintaining data integrity in advanced RAG and agent-based model development (LlamaIndex customer testimonials). This matches what we have found in practice.
| Dimension | Standard PDF loader | LlamaParse |
| --- | --- | --- |
| Text extraction | Raw text, structure lost | Layout-aware markdown output |
| Tables | Garbled or dropped | Preserved as structured data |
| Charts and images | Ignored | Parsed with context |
| Multilingual content | Variable quality | 100+ languages supported |
| Output format | Unstructured string | Clean markdown, ready for chunking |
| Enterprise scale | File-by-file | Millions of pages, enterprise-grade |
The key difference is not format support, as most loaders can handle PDFs. The difference is structural fidelity. LlamaParse produces layout-aware markdown in which table rows remain coherent, chart labels stay associated with their data, and section hierarchies are preserved. When this output is chunked and embedded, the resulting vectors carry accurate semantic content. Retrieval quality improves because the inputs are accurate.
Chunking, vectorization, and storage: a closer look at the end of the pipeline
Good parsing is necessary but not sufficient. The decisions made in the three subsequent stages — chunking, vectorization, and storage — determine whether the retrieval layer returns precise context or generic fragments. Following parsing by LlamaParse, the document moves through the rest of the pipeline.
Chunking. The parsed markdown is split at paragraph boundaries. We do not split mid-paragraph. The target chunk size is 1,000 characters, with a hard limit of 1,200. There is no overlap between chunks. For this document type, overlap would add noise rather than continuity: each paragraph in a geopolitical report makes a self-contained point, and retrieving the tail of an adjacent paragraph alongside it dilutes the signal. Chunk numbering resets per page, which preserves the relationship between a chunk’s position and the page metadata inherited from LlamaParse.
Dense vectors. Each chunk is embedded using OpenAI’s embedding model, producing a dense vector that captures the semantic meaning of the text. Dense retrieval is good at finding chunks that are conceptually related to a query, even when the exact terminology differs. A query about “political instability in East Africa” will surface a chunk that discusses “governance fragility in the Horn of Africa” because the semantic representations are close in vector space, even though the words do not overlap.
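“Close in vector space” has a concrete meaning. Qdrant computes this internally at query time; the sketch below assumes cosine similarity, the usual metric for OpenAI embeddings (the article does not name the distance metric, so treat that as an assumption):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two dense vectors: 1.0 for identical direction,
    near 0.0 for unrelated (orthogonal) meanings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```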
Sparse vectors. Each chunk is also scored using BM25, a term-frequency algorithm that measures how important each word in the chunk is relative to the full document collection. The BM25 vector is sparse: most values are zero, with non-zero scores only for words that are statistically significant in the chunk. Sparse retrieval handles what dense retrieval misses: exact matches on specific terminology. Geopolitical content is dense with proper nouns — names of sanctions regimes, regulation codes, country-specific policy instruments — that may not be well-represented in an embedding model’s training distribution. BM25 catches these precisely.
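A minimal version of the BM25 weighting shows why proper nouns score so highly. This sketch uses the standard BM25 formula with default parameters (our production tokenization and parameter choices differ):

```python
import math
from collections import Counter

def bm25_sparse_vector(chunk_tokens, doc_freqs, n_chunks, avg_len, k1=1.2, b=0.75):
    """Weight each term in a chunk with BM25: rare, repeated terms score
    high; terms common across the collection score near zero. The result
    is sparse -- only the chunk's own terms receive a weight."""
    tf = Counter(chunk_tokens)
    length_norm = k1 * (1 - b + b * len(chunk_tokens) / avg_len)
    vector = {}
    for term, freq in tf.items():
        df = doc_freqs.get(term, 0)  # number of chunks containing the term
        idf = math.log(1 + (n_chunks - df + 0.5) / (df + 0.5))
        vector[term] = idf * freq * (k1 + 1) / (freq + length_norm)
    return vector
```

A term like a sanctions-regime name, appearing in one chunk out of thousands, gets a large idf and dominates the sparse score for queries that mention it exactly.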
Hybrid upsert to Qdrant. Each chunk is upserted to Qdrant carrying both vectors. At query time, both the dense and sparse scores are combined — dense retrieval finds semantically related content, sparse retrieval surfaces exact terminology matches, and the combined ranking outperforms either approach alone. The chunk payload stores 11 fields: unique chunk ID, page number, chunk number within the page, content type (text, table, or list), document title, document type, theme, topics, published date, links, and file hash. This payload supports two things: filtered retrieval (an analyst can scope a query to reports published after a given date, or to a specific document type), and deduplication (the file hash prevents a document from being reprocessed if it is detected unchanged during the two-hour Google Drive poll).
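The payload and the deduplication check are straightforward to sketch. Field names below are illustrative stand-ins for the production schema, and the hash function choice (SHA-256) is an assumption; the point is that an unchanged file hashes to the same value, so the poll can skip it:

```python
import hashlib

def file_hash(pdf_bytes: bytes) -> str:
    """Content hash used for deduplication: an unchanged file produces
    the same hash, so reprocessing can be skipped."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def build_payload(meta: dict) -> dict:
    """Assemble the 11-field payload stored alongside both vectors.
    Key names are illustrative; the production schema may differ."""
    return {
        "chunk_id": meta["chunk_id"],
        "page_number": meta["page_number"],
        "chunk_number": meta["chunk_number"],    # resets per page
        "content_type": meta["content_type"],    # "text" | "table" | "list"
        "document_title": meta["document_title"],
        "document_type": meta["document_type"],
        "theme": meta["theme"],
        "topics": meta["topics"],
        "published_date": meta["published_date"],
        "links": meta["links"],
        "file_hash": file_hash(meta["file_bytes"]),
    }
```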
Qdrant was chosen for three reasons: it is open-source and carries no vendor lock-in; it supports self-hosted deployment, which the client required for data residency reasons — geopolitical intelligence cannot be sent to third-party cloud infrastructure for processing; and its payload filtering is first-class, supporting the metadata-scoped queries the system needs to serve.
From query to cited answer: the retrieval pipeline
Ingestion is one half of the system. The other half is what happens when a user submits a query. The retrieval pipeline runs eight steps, each of which makes a specific decision that affects the quality of the final answer.
Step 0 — Scope validation. Before the query is embedded or searched, it is validated for topic relevance. The system accepts queries on geopolitical, geo-economic, sanctions, tariffs, security, and policy topics only. Out-of-scope queries — questions about unrelated subjects, requests for general knowledge, off-topic conversational input — are rejected with a graceful explanation rather than passed to the retrieval stack. This is not a content filter in the security sense; it is a scope guard that protects the quality of results. A system designed for geopolitical intelligence should not return spurious results because an analyst asked the wrong kind of question by mistake.
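In production this check is made by an LLM classifier; the sketch below is a deliberately crude keyword stand-in to show the shape of the gate (the stem list and matching rule are illustrative assumptions, not our prompt):

```python
# Crude stand-in for the LLM-based scope check: accept only queries
# touching the system's topic areas. Stems and matching are illustrative.
IN_SCOPE_STEMS = ("geopolit", "geo-econom", "sanction", "tariff",
                  "security", "policy", "export", "regulat")

def is_in_scope(query: str) -> bool:
    """Return True if any query word starts with an in-scope stem."""
    words = query.lower().split()
    return any(word.startswith(stem) for word in words for stem in IN_SCOPE_STEMS)
```

Out-of-scope queries short-circuit here and receive an explanation instead of entering the retrieval stack.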
Step 1 — Query embedding. The validated query is embedded using the same OpenAI model used during ingestion. This alignment is essential: if the query and chunk vectors were produced by different models, similarity scores would be meaningless. The resulting dense vector captures the semantic intent of the question and is passed to the parallel search that follows.
Step 2 — Temporal intent parsing and filter construction. Before search runs, the LLM analyses the query for temporal and thematic intent and constructs filter parameters. Queries containing “latest” or “recent” map to a 60-day published date filter; “upcoming” or “future” map to 30 days; references to a specific year resolve to an exact date range. Filters for document type and theme can also be applied. By default, only chunks from the latest version of each document are returned, preventing superseded analysis from surfacing alongside current material.
Step 3 — Hybrid search. Dense vector search and sparse BM25 search run in parallel, each returning the top 20 results above a minimum score threshold of 0.4. Results below that threshold are discarded before fusion. The two result sets are then merged using Reciprocal Rank Fusion (RRF), which rewards chunks that rank highly in both methods. This makes the hybrid approach more reliable than either method alone: dense retrieval handles conceptual queries, BM25 handles exact terminology, and RRF produces a single ranked list that benefits from both.
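RRF itself is a few lines. Each result earns 1/(k + rank) from every list it appears in, so a chunk ranked well by both dense and sparse search accumulates the highest fused score (k=60 is the constant from the original RRF paper; whether our deployment uses that default is not stated, so treat it as an assumption):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of ids: each id scores sum of 1/(k + rank)
    across lists, rewarding ids ranked highly by multiple methods."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that "b", ranked first by sparse and second by dense, beats "a", which only the dense list surfaced at rank one.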
Steps 4 and 5 — Recency boost and re-ranking. The fused result list is re-scored using exponential decay with a 30-day half-life and a 20% weight. A document published 30 days ago receives a decay factor of 0.5; one published 60 days ago, 0.25. The boost is applied on top of the retrieval score, not as a replacement for it — a highly relevant older document will still outrank a marginally relevant newer one.
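An additive sketch consistent with the numbers quoted (whether the production boost is additive or multiplicative is not stated, so this is one plausible reading):

```python
def recency_boosted(retrieval_score, age_days, half_life_days=30.0, weight=0.2):
    """Add an exponentially decaying recency bonus on top of the
    retrieval score: the decay factor is 0.5 at one half-life
    (30 days) and 0.25 at two (60 days); the 20% weight keeps
    relevance dominant over freshness."""
    decay = 0.5 ** (age_days / half_life_days)
    return retrieval_score + weight * decay
```

With these parameters the maximum possible bonus is 0.2, so a strongly relevant six-month-old report still outranks a marginal one published today.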
Step 6 — Context assembly. The top five to ten chunks — selected after re-ranking — are passed to the LLM as context, with their full content and metadata payloads included.
Step 7 — Generation and citations. The LLM generates an answer using only the supplied context and formats citations inline as [1], [2], and so on. Only sources that are actually cited in the answer are shown to the analyst. Unreferenced chunks from the retrieval pool are not listed, preventing the answer from appearing more comprehensively sourced than it is.
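Filtering the source list down to what was actually cited is a simple post-processing pass over the generated answer. A sketch, assuming the inline [n] citation format described above:

```python
import re

def cited_only(answer: str, sources: list):
    """Keep only the sources referenced inline as [n] in the answer;
    unreferenced chunks from the retrieval pool are dropped so the
    answer does not appear more sourced than it is."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [(i, src) for i, src in enumerate(sources, start=1) if i in cited]
```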
Step 8 — Follow-up question. When the system has retrieved documents — that is, when the query was in scope and results were returned above the relevance threshold — the LLM appends a follow-up question. The question stays within the same topic area and surfaces the next logical line of inquiry the analyst may not have thought to pursue, keeping the research conversation moving forward without requiring a query to be reformulated from scratch.
For context on how retrieval-augmented generation works and how agentic systems extend this pattern, see our article on agentic RAG.
Ready to see how agentic AI transforms business workflows?
Meet directly with our founders and PhD AI engineers. We will demonstrate real implementations from 30+ agentic projects and show you the practical steps to integrate them into your specific workflows—no hypotheticals, just proven approaches.
Where this pattern applies beyond geopolitical intelligence
The pipeline we built for the geopolitical intelligence firm is not specific to that domain. The same parse–chunk–retrieve architecture applies wherever an organisation’s knowledge lives in complex document formats that standard loaders cannot faithfully ingest.
We built the same pattern for Mapline.ai, a US-based real estate technology company, where the task was to accelerate property due diligence. Due diligence involves analysing large volumes of legal and technical documents — title reports, environmental assessments, planning documents — under time pressure. The AI agent we built for Mapline turned weeks of document review into minutes. You can read the case study here.
The pattern also applies to product companies whose core offering depends on document intelligence. A product such as Tyce.ai, an AI agent for drafting contracts and professional documents, illustrates this clearly. The quality of what such a product can generate depends entirely on the accuracy of its ingestion layer. Weak ingestion of unstructured data produces generic outputs. Accurate ingestion of structured documents produces outputs that reflect the specific content, terminology, and structure of the source material.
Finance, insurance, healthcare, and manufacturing all present the same core problem: large volumes of unstructured data locked in PDFs, reports, and scanned records, with high accuracy requirements on retrieval. LlamaIndex’s enterprise customer base reflects this directly, with deployments in financial due diligence, invoice processing, technical documentation search, and clinical record review.
What the full system requires beyond the pipeline
The ingestion and retrieval steps described above are necessary conditions for a working system. They are not sufficient conditions for a production system. Two factors outside the pipeline itself consistently determine whether a RAG pipeline crosses from working prototype to deployed tool that practitioners rely on.
Data residency by design. For clients handling sensitive content — geopolitical intelligence, legal documents, medical records, financial research — the question of where data is processed and stored is not a deployment detail; it is an architectural constraint that must be resolved before tooling is selected. In this engagement, the decision to use Qdrant’s self-hosted deployment and LlamaParse’s on-premise enterprise option was made at the outset, not retrofitted later. Teams that choose cloud-first defaults and add on-premise options afterwards typically face integration rework that delays production launch.
Knowledge transfer. We build systems that our clients can maintain and extend without us. That means documented architectural decisions — including why RRF was selected over score normalisation, why 0.4 was chosen as the minimum threshold, why the recency half-life is 30 days rather than 14 or 60 — reproducible configurations, and hands-on training for the teams who will operate and iterate on the system. A well-engineered system that only its builders understand is a support dependency, not a client asset.
Both factors are embedded in how we approach every engagement through our TriStorm methodology, from use case assessment through to a deployed, observable system the client owns entirely.
Conclusion
The geopolitical intelligence engagement produced a system with two distinct engineering surfaces. On the ingestion side: Google Drive polling, LlamaParse agentic parsing with a domain-specific prompt, paragraph-boundary chunking, hybrid vectorisation, and an 11-field metadata payload that makes every stored chunk filterable and deduplicated. On the retrieval side: scope validation, temporal intent parsing, hybrid search with a minimum threshold, RRF fusion, exponential recency decay, cited answer generation, and a follow-up question that advances the research conversation.
Each of these decisions was made in response to a specific constraint or failure mode. None of them are defaults. The result is a LlamaParse RAG pipeline that analysts rely on in daily work, running entirely on infrastructure they control, and that the client team can maintain and extend without us.
If your organisation holds knowledge in complex documents, and most do, the question is not whether this class of engineering applies to your situation. The question is whether each layer of your pipeline has been built with the same level of intention.