Engineering a zero-hallucination agentic RAG system for clinical triage guidelines
Vstorm built a HIPAA-compliant agentic RAG system for Schmitt-Thompson Clinical Content, translating proprietary clinical triage guidelines into a zero-hallucination AI pipeline.

Schmitt-Thompson Clinical Content (STCC) is one of the most widely deployed clinical decision-support publishers in North America.
Its triage guidelines — structured clinical frameworks that determine urgency and route patients to the appropriate level of care — are in active use across more than 400 health systems and health plans, and in an additional 10,000 physician practices.
Industry: Healthcare
Headquarters: United States
Company size: 10+
STCC provides the most comprehensive triage and advice content, spanning the continuum of delivery:
After Hours, used by call-center nurses, and Office Hours, used in practices and clinics.
Vstorm’s impact, the TL;DR:
- Vstorm built a HIPAA-compliant agentic RAG system that treats STCC's proprietary clinical triage guidelines as the exclusive source of truth, with zero hallucination events recorded in testing.
- The core engineering challenge was reliably parsing logic-based, non-natural-language content.
- Vstorm’s team designed a four-stage agentic pipeline, compared PDF versus database retrieval (database won), and evaluated three GPT model versions across nine proof-of-concept builds.
- Tested against 329 validated clinical scenarios across 16 guidelines, the system achieved disposition accuracy within five percentage points of set benchmarks on 13 of 16 guidelines.
- Two deterministic failure modes were identified and are in active development.
“The STCC telehealth triage guidelines are used by thousands of nurses at hundreds of healthcare facilities around the world, marking STCC as the gold standard in the industry.” — Patty Maynard, Chief Operating Officer, STCC
Healthcare AI is moving quickly. 85% of healthcare organisations surveyed by McKinsey have either implemented or are actively pursuing generative AI-based solutions. STCC recognised that its content, built over decades and validated against expert clinical benchmarks, had to be made machine-accessible. The risk of doing nothing was clear: competitors building AI-native tools would fill the gap, and public LLMs were already at risk of absorbing proprietary guideline content through uncontrolled data exposure.
“We approached this as a safety program first. If we can’t measure accuracy against nurse-validated scenarios, we shouldn’t claim progress.” — Matthew Thompson, Product Manager, STCC / TAG
STCC came to Vstorm to build a controlled, validated, HIPAA-compliant agentic system that could operate at clinical accuracy without any chance of hallucination.
- Strategic alignment and planning: deep-dive workshops to align the technical roadmap with business objectives.
- Proof of Value: rapid prototyping to validate the approach and demonstrate ROI before full commitment.
- Process augmentation: embedding our experts within your team for knowledge transfer and sustained impact.
The first challenge was not architecture; it was data. STCC guidelines are not written in natural language. They are structured as decision trees: sequential yes/no questions arranged in descending order of urgency, each linked to a recommended disposition level such as "Go to ED Now" or "Home Care."
The engineering problem: logic-based content in a language model world
This format is efficient for human practitioners but presents a specific failure risk for large language models. Standard LLMs are trained to interpret natural prose. When presented with logic-indicator syntax, such as branching conditionals, tabular urgency tiers, or Boolean qualifiers, they can misread the logic and produce confident but incorrect outputs. In a clinical setting, a misread is not an edge case to tolerate. It is a patient safety risk.
The engineering brief was therefore precise: build a system that treats the guidelines as the exclusive source of truth, introduces no external inference, and reports clearly when it cannot find an answer rather than generating one.
Architecture: a four-stage pipeline
Our team at Vstorm designed a four-stage agentic pipeline to handle the full scenario-to-disposition workflow.
- Stage 1 — Patient context extraction. A dedicated LLM agent extracts patient age and gender from the incoming scenario. This information filters the applicable guideline set before any clinical reasoning begins, narrowing the retrieval space immediately.
- Stage 2 — Reason for visit identification. A second dedicated agent identifies the primary and secondary reasons for the visit. This structured extraction step provides clean, unambiguous inputs for the retrieval stage rather than passing raw prose directly to the vector store.
- Stage 3 — Guideline selection via RAG. The extracted reason for visit is matched against a pre-indexed guideline vector store using retrieval-augmented generation. An LLM then classifies the most appropriate guideline from the retrieved candidates. This stage is where guideline selection errors are most likely to occur, particularly in clinically ambiguous presentations where multiple guidelines are plausible.
- Stage 4 — Parallel TAQ evaluation. Triage Assessment Questions (TAQs) are evaluated concurrently across all disposition levels using parallel LLM calls. Multi-part TAQs, which contain compound conditions, are parsed using Boolean logic to prevent partial matches from producing incorrect dispositions. The system outputs a recommended disposition, a SOAP-formatted clinical note, and an optional LLM-generated rationale.
The entire pipeline is built on the PydanticAI framework as the agent environment, with Logfire providing observability across every stage. Every decision is traceable. Every step is auditable.
Retrieval method: why we moved from PDF to database
In earlier proof-of-concept builds, guideline content was delivered to the retrieval layer via pre-parsed PDF. This approach was fast to prototype but introduced structural noise: PDF parsing produced fragmented chunks that broke the logical continuity of individual guidelines, particularly around multi-step TAQ sequences.
Our engineering team moved to a relational database retrieval model. Guidelines were indexed as structured records with explicit relationships between questions, urgency tiers, and disposition outcomes. The results confirmed the decision: database retrieval outperformed pre-parsed PDF delivery in both disposition accuracy and operational scalability across the full test suite. The relational structure preserved the internal logic of each guideline, giving the LLM a clean, consistently formatted context at every retrieval call.
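A minimal sketch of such a relational store, using SQLite for illustration. The schema is an assumption, not STCC's actual data model; the point is that questions, urgency tiers, and disposition outcomes become explicit, ordered relations, so each retrieval call hands the LLM a guideline whose internal logic is intact rather than fragmented PDF chunks.

```python
import sqlite3

# Assumed schema for illustration: each guideline owns an ordered list
# of TAQs, and each TAQ links to exactly one disposition tier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guideline (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE disposition (
    id           INTEGER PRIMARY KEY,
    label        TEXT NOT NULL,      -- e.g. 'Go to ED Now', 'Home Care'
    urgency_rank INTEGER NOT NULL    -- 1 = most urgent
);
CREATE TABLE taq (
    id             INTEGER PRIMARY KEY,
    guideline_id   INTEGER NOT NULL REFERENCES guideline(id),
    disposition_id INTEGER NOT NULL REFERENCES disposition(id),
    position       INTEGER NOT NULL, -- preserves in-guideline ordering
    question       TEXT NOT NULL
);
""")

# Hypothetical sample rows, not real STCC content.
conn.execute("INSERT INTO guideline VALUES (1, 'Headache')")
conn.executemany("INSERT INTO disposition VALUES (?, ?, ?)",
                 [(1, "Go to ED Now", 1), (2, "Home Care", 9)])
conn.executemany("INSERT INTO taq VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, "Worst headache of life?"),
                  (2, 1, 2, 2, "Mild headache, no red flags?")])

def fetch_guideline(guideline_id: int) -> list[tuple[str, str]]:
    """Retrieve one guideline's TAQs in descending order of urgency,
    keeping every question-to-disposition link intact for the context."""
    return conn.execute("""
        SELECT t.question, d.label
        FROM taq t JOIN disposition d ON d.id = t.disposition_id
        WHERE t.guideline_id = ?
        ORDER BY d.urgency_rank, t.position
    """, (guideline_id,)).fetchall()
```

Because ordering comes from `urgency_rank` and `position` columns rather than from a parser's chunk boundaries, multi-step TAQ sequences arrive in the same, consistent layout on every retrieval call.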
Model selection: iterative testing across three GPT versions
The team evaluated three GPT model versions across proof-of-concept builds v0.3.x through v9.1: GPT-4.1, GPT-5, and GPT-5.1. All models were operated with HIPAA compliance configurations, a non-negotiable requirement given the sensitivity of patient scenario data.
GPT-5 and GPT-5.1 consistently outperformed GPT-4.1 across the guideline test suite. The performance gap was most pronounced in clinically complex scenarios requiring multi-step reasoning across branching TAQ sequences. GPT-5.1 was selected as the production model, offering the strongest balance of disposition accuracy and reasoning reliability within the compliance constraints.
Testing and validation: 329 scenarios, 16 guidelines
Clinical scenarios used to evaluate the system were developed through a rigorous 14-step validation process managed jointly by STCC and Vstorm. STCC Nurse Editors drafted scenarios for each guideline; scenarios were revised or discarded where editorial agreement could not be reached. Remaining scenarios were submitted to a panel of five expert telehealth triage nurses for validation. Only scenarios where the panel’s aggregate Correct Recommended Disposition (CRD) rate reached 95% or higher were retained and reviewed by the Senior Medical Editor.
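The 95% retention threshold in the panel step reduces to a simple filter. The data shape (`panel_votes` mapping scenario ids to one boolean per panel nurse) is an illustrative assumption; the real workflow has 14 steps and ends with Senior Medical Editor review.

```python
def retained_scenarios(panel_votes: dict[str, list[bool]],
                       threshold: float = 0.95) -> list[str]:
    """Keep only scenarios whose aggregate Correct Recommended
    Disposition (CRD) rate across the nurse panel meets the threshold.

    Each boolean records whether one panel nurse's disposition matched
    the intended one. (Illustrative structure only.)
    """
    return [sid for sid, votes in panel_votes.items()
            if votes and sum(votes) / len(votes) >= threshold]
```

For a five-nurse panel the threshold is strict: a single dissenting nurse (4/5 = 80%) is enough to drop a scenario from the benchmark set.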
The final test set comprised 329 validated clinical scenarios across 16 adult after-hours telehealth guidelines. Chatbot disposition accuracy was measured against this expert-validated benchmark: not theoretical correctness, but the standard established by practising clinicians.
“Research, testing and validation of the content is at the core of what we do. Nurses need to trust the guidelines and the trust is built on reliability.” — Laurie O’Bryan, RN, Nurse Editor, STCC / TAG
Results
Full-scenario disposition accuracy was strong across the majority of guidelines tested. Thirteen of 16 guidelines demonstrated chatbot accuracy within five percentage points of their respective benchmarks. Two guidelines exceeded the benchmark: Abdominal Pain — Male (100% vs. 96.2%) and Headache (100% vs. 97.4%).
Three guidelines underperformed the benchmark: Neurologic Deficit (85.0% vs. 96.0%), Cough — Acute Productive (84.0% vs. 98.3%), and Urinary Symptoms (84.0% vs. 98.4%).
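The comparison above reduces to a five-percentage-point band around each expert benchmark, with exceeding the benchmark also counting as a pass. A minimal check, using the figures reported in this case study (guideline names ASCII-hyphenated for code):

```python
def within_band(chatbot_pct: float, benchmark_pct: float,
                band: float = 5.0) -> bool:
    """True if chatbot accuracy is within `band` percentage points of
    the expert benchmark; exceeding the benchmark also counts."""
    return chatbot_pct >= benchmark_pct - band

# Per-guideline figures reported above (chatbot %, benchmark %).
results = {
    "Abdominal Pain - Male":    (100.0, 96.2),
    "Headache":                 (100.0, 97.4),
    "Neurologic Deficit":       (85.0, 96.0),
    "Cough - Acute Productive": (84.0, 98.3),
    "Urinary Symptoms":         (84.0, 98.4),
}
```

Applying the band to these five confirms the split reported above: the first two pass (both exceed their benchmarks), the last three fall outside it.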
Post-analysis identified two primary failure modes. The most frequent was no positive TAQ identified: the system correctly selected the guideline but failed to match any TAQ to the clinical scenario, producing an incomplete disposition. The second failure mode was incorrect guideline selection in cases with overlapping or ambiguous chief complaints, scenarios where two or more guidelines were plausible matches at the retrieval stage.
Both failure modes are deterministic in nature. They do not arise from hallucination or fabricated reasoning, but from retrieval and matching gaps that can be addressed through targeted engineering improvements.
The testing also confirmed zero hallucination events across the full scenario set. Strict prompt engineering, combined with the yes/no TAQ response format, provided no structural pathway for the model to generate unsupported content. When the system could not identify an answer, it reported the absence of a match and provided traceable source references or the explicit absence of them.
Next steps
The system remains in active development. Vstorm’s team is addressing both identified failure modes directly. TAQ matching logic is being extended to handle scenarios where no positive TAQ is present: a condition the current pipeline flags but does not yet resolve. Guideline selection is being improved for ambiguous chief complaint presentations through enhanced retrieval ranking and disambiguation logic.
In parallel, Vstorm is developing a locally deployable model variant designed to run on client-owned infrastructure. This addresses the requirements of healthcare environments with stricter data residency constraints, where patient history and treatment records cannot leave the organisation’s perimeter. The local model is being engineered with reinforced anti-hallucination constraints and explicit data boundary controls, enabling a higher degree of patient context to be passed into the system safely.
Meet directly with our founders and PhD AI engineers. We will demonstrate real implementations from 30+ agentic projects and show you the practical steps to integrate them into your specific workflows: no hypotheticals, just proven approaches.



