How to build a HIPAA-compliant AI agent: architecture patterns and deployment checklist

Wojciech Achtelik
AI Tech Lead

Nicholas Berryman
Writer

April 23, 2026

Most healthcare AI compliance failures happen not at the infrastructure layer but in the data flow: ungoverned agent memory, unsigned BAAs, and PHI passed beyond its minimum-necessary scope. This article covers what constitutes PHI in an agentic context, three production-tested architecture patterns (HIPAA-eligible cloud endpoints, on-premise local models, and PHI de-identification layers), the current vendor BAA landscape, the three compliance gaps most engineering teams miss, and a seven-area pre-deployment checklist. Grounded in Vstorm’s work for a US healthcare provider serving 100,000+ members.


Most healthcare AI projects that fail HIPAA compliance do not fail at the infrastructure layer. The servers are encrypted. The access logs exist. The failure happens in the data flow: an agent calling an API without a signed Business Associate Agreement, a memory store retaining patient data without a retention policy, a scheduling tool receiving a full clinical record when it only needed a calendar slot.

The scale of this gap is measurable. A peer-reviewed study published in PMC found that 66% of US physicians now use AI tools in their practice, yet only 23% of health systems have signed BAAs with their AI vendors. That is not an infrastructure problem. It is an architectural one.

This guide addresses it directly. It covers what counts as protected health information (PHI) in an AI agent context, three healthcare AI architecture patterns and their compliance implications, the vendor decisions most engineering teams get wrong, and a pre-deployment checklist drawn from production deployments, including our own work with a US healthcare provider serving more than 100,000 members.

One regulatory note before beginning: HHS OCR published a proposed update to the HIPAA Security Rule in January 2025, the first major revision in over a decade. OCR has placed finalisation on its regulatory agenda for May 2026, though the rule has not yet been enacted and remains subject to the current administration’s priorities. The proposed changes (mandatory encryption, mandatory network segmentation, mandatory MFA) set the compliance direction organisations should be designing for regardless of the final timeline.


What counts as PHI in an AI context

Before designing any HIPAA-compliant AI agent, engineers need a precise definition of what they are protecting, and it is more expansive than most assume.

HIPAA’s Safe Harbor method identifies 18 categories of information that, when linked to a patient, constitute Protected Health Information. These range from names and dates to IP addresses, device identifiers, and biometric data. The compliance challenge in an AI context is not the identifiers in isolation; it is combination risk. A diagnosis code alone (for example, E11.65 — Type 2 diabetes with hyperglycemia) is not PHI. The same code paired with a date of service and a five-digit ZIP code creates a uniquely identifiable record.

AI agents create this risk at every inference call. A scheduling agent that receives a full patient record to confirm an appointment has just transmitted PHI to wherever that record travelled, including the LLM API, the memory store, and every system log along the way. The HIPAA minimum necessary standard requires that each agent tool call receives only the specific data fields required for its function. An appointment confirmation does not need diagnostic history. A billing verification does not need clinical notes.
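One way to make the minimum necessary standard architectural rather than aspirational is a per-tool field allowlist enforced at the boundary, so a tool can only ever receive the fields it is registered for. The sketch below is illustrative; the tool names and field names are hypothetical, not a standard schema.

```python
# Minimum-necessary filter sketch: each tool declares an explicit field
# allowlist, and anything not listed never leaves the PHI boundary.
# Tool and field names here are hypothetical examples.

TOOL_ALLOWLISTS = {
    "confirm_appointment": {"patient_id", "appointment_time", "clinic_location"},
    "verify_benefits": {"patient_id", "insurance_member_id", "plan_code"},
}

def minimum_necessary(record: dict, tool_name: str) -> dict:
    """Return only the fields the named tool is authorized to receive."""
    allowed = TOOL_ALLOWLISTS[tool_name]
    return {k: v for k, v in record.items() if k in allowed}

full_record = {
    "patient_id": "P-1042",
    "appointment_time": "2026-05-01T09:30",
    "clinic_location": "Suite 210",
    "diagnosis_codes": ["E11.65"],   # never needed to confirm a slot
    "ssn": "###-##-####",            # never needed to confirm a slot
}

payload = minimum_necessary(full_record, "confirm_appointment")
# payload carries no diagnostic history and no SSN
```

The design choice worth noting: the allowlist is declarative and lives next to the tool registration, so a security review can audit what each tool *can* see without reading the agent's prompt logic.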

Today, most healthcare organisations manage PHI access through static, role-based controls applied at the system level: a nurse accesses the scheduling module, a physician accesses the clinical record. Agentic AI breaks this model. An agent can traverse multiple systems autonomously within a single workflow, accessing data across role boundaries that were never designed to interact. That is where the compliance architecture must be rebuilt from the ground up.

| PHI identifier category | How it appears in an AI agent context |
| --- | --- |
| Names | Patient greeting in a conversational interface or chat confirmation message |
| Dates (except year) | Appointment confirmation payload; discharge date in a clinical summary prompt |
| Geographic data smaller than a state | ZIP code passed to a scheduling agent to find nearby facilities |
| Phone numbers | Patient contact field retrieved for an outreach workflow |
| Email addresses | Included in intake form data passed to an LLM for triage routing |
| Social Security numbers | Pulled from EHR during benefits verification by a billing agent |
| Medical record numbers | Used as a lookup key in a multi-system agent workflow |
| IP addresses | Captured in web-based intake form logs; stored in session state |
| Biometric identifiers | Voice prints from a voice agent interaction recorded for quality review |
| Full-face photographs | Patient ID image attached to an intake document processed by a document agent |


Three architecture patterns

There is no single default architecture for a HIPAA-compliant AI agent. The correct pattern depends on one question: where is PHI allowed to flow, and under what operational controls? In practice, most healthcare AI systems fall into one of three deployment patterns.

All three can be compliant. They differ in where PHI is processed, how much operational responsibility stays with the healthcare organisation, and which failure modes are most likely in production.

Pattern 1 — HIPAA-eligible cloud endpoints (AWS Bedrock / Azure OpenAI)

For most healthcare organisations without in-house GPU capacity, this is the default production pattern. The model is consumed as a managed service from a major cloud provider that designates the endpoint as HIPAA-eligible and signs a BAA covering it. PHI may then reach the model under contractual safeguards and tenant isolation, without the team operating any inference infrastructure of its own.

The pattern fits when the workflow exceeds what a small self-hosted model can reliably do, particularly tasks involving multi-step reasoning over clinical context or structured extraction from unstructured notes.

Consumption is usually metered per token, though providers also offer provisioned-throughput modes that trade flexibility for latency guarantees. Capacity planning still has to account for per-minute request and token quotas, which can cap real throughput well below the model’s nominal performance and are not always elastic on short notice.
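A quick way to see how quotas cap real throughput is to compare the request-per-minute limit against what the token-per-minute limit actually allows at your average payload size. The quota numbers below are hypothetical placeholders; substitute the limits on your own endpoint.

```python
# Back-of-envelope capacity check under per-minute quotas.
# All quota values are hypothetical examples, not any provider's defaults.

tpm_quota = 400_000            # tokens per minute allowed on the endpoint
rpm_quota = 500                # requests per minute allowed
avg_prompt_tokens = 2_500      # measured from your real workload
avg_completion_tokens = 500

tokens_per_request = avg_prompt_tokens + avg_completion_tokens
max_rpm_by_tokens = tpm_quota // tokens_per_request

# Effective throughput is whichever quota binds first.
effective_rpm = min(rpm_quota, max_rpm_by_tokens)
print(effective_rpm)
```

In this example the token quota, not the request quota, is the binding limit: 400,000 TPM at 3,000 tokens per request allows only 133 requests per minute, well under the nominal 500 RPM.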

The BAA is the starting point, not the finish line. AWS Bedrock and Azure OpenAI both offer HIPAA-eligible configurations, but their default abuse-monitoring and data-retention behaviours differ. The compliance perimeter does not end at the model. PHI also flows through the orchestration layer, retrieval and memory stores, and observability pipelines, each of which must sit under the same BAA or be architected never to receive PHI in the first place.

The shared-responsibility model leaves all of this to the deploying organisation, and misconfiguration, not provider failure, is the more likely route to a breach.

Pattern 2 — On-premise deployment with a local model

In this pattern, inference runs entirely inside infrastructure controlled by the healthcare organisation, so PHI never leaves the network for model processing.

It is the right choice when data residency rules, internal security policy, or the sensitivity of the workflow rules out external providers altogether, with clinical documentation and diagnostic support being the canonical cases.

As of early 2026, this is more operationally practical than it was even a year ago. Open-weight models now run at viable speeds on high-end hardware, which puts a serious tier of capability within reach of an in-house deployment. The constraint is no longer whether a useful model can be run locally, but which one, and at what cost.

Larger models demand a non-trivial GPU footprint, and once utilisation is high enough, the hardware bill can exceed what the same workload would cost in tokens from a managed endpoint. Teams running heavily loaded systems often recover the economics by serving smaller or quantised models tuned to the specific workflow rather than reaching for the largest checkpoint available.

The harder limit is capability. Even the strongest open-weight models still trail frontier proprietary systems on agentic workloads (tool use, long-horizon planning, and reliable adherence to complex instructions), and that gap matters most in exactly the multi-step workflows where AI agents earn their keep. Local models can perform very well on narrower, well-scoped internal tasks, but the performance has to be validated against the real workflow rather than assumed from public benchmarks.

Finally, compliance is not a free by-product of keeping the model in-house. Databases, vector stores, caches, internal APIs, and logging systems can all still store or expose PHI, and removing one external processor does not by itself make the surrounding architecture compliant.

Pattern 3 — PHI de-identification layer

In this pattern, PHI is stripped or pseudonymised before any prompt reaches an external model, so the agent only ever operates on de-identified data, and re-identification for output delivery is handled by a separate, tightly scoped process.

Each sensitive value (a name, a date of birth, an MRN) is replaced with an opaque surrogate such as [PATIENT_8f3a], and the mapping back to the original value lives in a separate encrypted store that only the re-identification step can read. The model sees the surrogates, generates its response against them, and a downstream service substitutes the real values back in before the output reaches the clinician.
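The surrogate round trip can be sketched in a few lines. This is a minimal illustration, not a production de-identification layer: it catches only one structured pattern (an MRN-like token) via regex, where a real system would add a PHI-specific NER model for free text and keep the mapping store encrypted and separately access-controlled.

```python
import re
import secrets

# Sketch of surrogate-based pseudonymization with a separate mapping store.
# The MRN pattern and surrogate format are illustrative assumptions.
MRN_PATTERN = re.compile(r"\bMRN-\d{6}\b")

def deidentify(text: str, mapping: dict) -> str:
    """Replace each MRN with an opaque surrogate; record the mapping."""
    def replace(match: re.Match) -> str:
        surrogate = f"[PATIENT_{secrets.token_hex(2)}]"
        mapping[surrogate] = match.group(0)   # held in an encrypted store in production
        return surrogate
    return MRN_PATTERN.sub(replace, text)

def reidentify(text: str, mapping: dict) -> str:
    """Substitute the real values back in before output delivery."""
    for surrogate, original in mapping.items():
        text = text.replace(surrogate, original)
    return text

store: dict = {}
safe = deidentify("Schedule follow-up for MRN-483920 next week.", store)
assert "MRN-483920" not in safe          # the model only ever sees the surrogate
restored = reidentify(safe, store)
assert restored == "Schedule follow-up for MRN-483920 next week."
```

Note that only the re-identification step needs read access to the mapping store; the agent and the model endpoint never do, which is what bounds the blast radius if either is compromised.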

The mechanics differ sharply between structured and unstructured inputs. Structured fields arriving from an EHR, such as a FHIR resource, are redacted by field name, which is reliable and easy to audit. Free text, such as clinical notes or chat transcripts, is the hard part, and is typically handled by a PHI-specific named-entity recognition model backed by a regex pass for high-precision identifiers and an audit log covering every redaction across all 18 HIPAA Safe Harbor identifiers.

The honest framing of this pattern is defence in depth rather than a substitute for a BAA. NER-based de-identification on messy clinical prose is good but not perfect, and a single missed identifier is a PHI leak into the model provider. Sending the output to a non-BAA endpoint on the theory that de-identified data is no longer PHI is a position OCR has consistently declined to endorse for free text.

Layered on top of a Pattern 1 deployment, however, it does meaningful work: it enforces minimum-necessary at the architecture level rather than leaving it to discipline at the application layer, and it bounds the sensitivity of whatever does reach the model. The cost is that the de-identification layer becomes critical infrastructure in its own right: anything it misses leaks, anything that breaks in the mapping store breaks re-identification on the way back, and the team has to validate and monitor it as carefully as the model itself.


Ready to see how agentic AI transforms business workflows?

Meet directly with our founders and PhD AI engineers. We will demonstrate real implementations from 30+ agentic projects and show you the practical steps to integrate them into your specific workflows—no hypotheticals, just proven approaches.


BAA requirements: what your vendor stack actually needs to sign

A Business Associate Agreement is a legally binding contract required by HIPAA whenever a vendor handles PHI on behalf of a covered entity. Sending PHI to any system without one in place is a direct violation, regardless of how well-encrypted the connection is. Penalties can reach $50,000 per violation.

The most commonly missed gap in healthcare AI architecture is the LLM boundary. Engineers build a secure internal system, then route PHI to an LLM API without verifying BAA status. The table below reflects the current landscape as of Q1 2026.

| LLM vendor / service | BAA available | Conditions | PHI permitted under BAA |
| --- | --- | --- | --- |
| OpenAI API | Yes | Enterprise and API tiers only; zero-retention mode must be enabled | Yes — GPT-4o, GPT-4, o1, o3, Whisper, DALL-E via API |
| ChatGPT (Free, Plus, Team) | No | No BAA available at any consumer tier | No |
| Anthropic API (Claude) | Yes | Available via direct sales agreement only | Yes — API access under signed agreement |
| Claude (Free, Pro, Max, Team) | No | No BAA available at consumer or standard business tiers | No |
| Azure OpenAI Service | Yes | Requires configuration; data residency and access controls must be set explicitly | Yes — under correct configuration |
| AWS Bedrock | Yes | Requires configuration; carries its own integration complexity | Yes — under correct configuration |

Two additional requirements are frequently overlooked. First, the BAA must extend to the cloud provider the LLM vendor runs on. If the vendor’s infrastructure sits on a platform that has not signed a subcontractor BAA, the chain is broken. Second, the BAA must explicitly prohibit the vendor from using PHI to train shared models. Before signing, ask the vendor directly: does my data leave my dedicated instance, and is it used to fine-tune your global base model? Both answers must be no.


Agent memory, audit trails, and the compliance gaps most teams miss

Encryption and signed BAAs do not close every agentic leak. PHI can still spread through retrieval indexes, prompt caches, and logs that cannot reconstruct what happened after an incident.

Gap 1 — Agent memory and state persistence

Agent memory is not only chat history. It includes tool outputs, vector stores, and observability traces. Any component that receives or can reveal ePHI belongs inside the HIPAA system boundary: access-controlled, encrypted where required or reasonable and appropriate under the Security Rule, BAA-covered where a vendor maintains it, and governed by retention and deletion policy.

RAG systems are the most frequently underestimated part of this boundary. Embeddings are not de-identification. OWASP LLM08:2025 identifies vector and embedding risks including unauthorized retrieval, cross-tenant leakage, and inversion. Published inversion research shows that embeddings can leak source-text information, so a vector database containing clinical notes should be treated as a PHI repository, not as a harmless derivative.

Provider-side caching needs the same review. OpenAI’s standard in-memory prompt caching is Zero Data Retention eligible, with cached prefixes usually evicted after five to ten minutes of inactivity and sometimes active for up to one hour. Extended prompt caching is different: key/value tensors derived from customer content may be retained in GPU-local storage for up to 24 hours and are not ZDR-eligible, even though the original prompt text is not what is persisted to local storage and other ZDR protections may still apply. A BAA may permit PHI processing, but it does not make every caching mode equivalent.

Gap 2 — Audit trails for autonomous decisions

HIPAA’s audit controls require mechanisms to record and examine activity in systems that contain or use ePHI, and unique user identification is what makes those events attributable. HIPAA does not prescribe agent-specific log fields, but in an agentic workflow the practical unit of audit is the tool call, not the chat session. A defensible implementation should capture the user, agent, authorizer where applicable, record scope, operation, fields disclosed, policy decision, timestamp, model and prompt version, output, reviewer, and final action.
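One workable shape for that tool-call-level event is a frozen record written at every authorization decision. The field names below are a working convention of this sketch, not a HIPAA-mandated schema; the point is that each disclosure is attributable to a unique human identity, a specific agent, and a specific policy decision.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Illustrative audit event for one agent tool call. Field names are a
# working convention, not a prescribed HIPAA log format.

@dataclass(frozen=True)
class ToolCallAuditEvent:
    user_id: str              # unique human identity that delegated the action
    agent_id: str             # which agent acted
    operation: str            # e.g. "read", "write"
    record_scope: str         # which record was touched
    fields_disclosed: tuple   # exact fields that left the boundary
    policy_decision: str      # allow/deny plus the rule that fired
    timestamp: str            # UTC, ISO 8601
    model_version: str
    prompt_version: str

event = ToolCallAuditEvent(
    user_id="u-clinician-17",
    agent_id="intake-agent-v2",
    operation="read",
    record_scope="patient:P-1042",
    fields_disclosed=("appointment_time", "clinic_location"),
    policy_decision="allow:minimum-necessary-rule-4",
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_version="model-2026-01",
    prompt_version="prompt-v12",
)
print(asdict(event)["operation"])
```

Making the record frozen and serialising it immediately (here via `asdict`) keeps the event append-only from the application's point of view, which matters for the tamper-evidence requirements discussed next.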

Shared service-account keys break this chain because they prove that “the system” accessed PHI, not who delegated the access or why. Logs should be tamper-evident and retained according to the organisation’s HIPAA documentation, record-retention, and legal-hold policies; required Security Rule documentation has a six-year retention requirement. Under the 2025 Security Rule NPRM, AI software that creates, receives, maintains, transmits, or can affect ePHI would reasonably be expected to appear in the proposed technology asset inventory and risk analysis.
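Tamper evidence itself can be achieved with a simple hash chain, where each log entry commits to the hash of the previous one, so altering any historical record breaks verification from that point forward. This is a sketch of the mechanism only; a production system would anchor the chain externally and sign entries.

```python
import hashlib
import json

# Hash-chained audit log sketch: each entry's hash covers the previous
# entry's hash plus its own canonicalised record body.

GENESIS = "0" * 64

def append_entry(chain: list, record: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"record": record, "prev": prev_hash, "hash": entry_hash})

def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry fails."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        if entry["prev"] != prev:
            return False
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log: list = []
append_entry(log, {"op": "read", "scope": "patient:P-1042"})
append_entry(log, {"op": "write", "scope": "patient:P-1042"})
assert verify(log)

log[0]["record"]["scope"] = "patient:P-9999"   # tamper with history
assert not verify(log)                          # the chain detects it
```

The chain makes after-the-fact edits detectable, but it does not by itself satisfy retention: deletion policy still has to be applied at the chain level, not to individual entries.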

Gap 3 — Decision-level accountability

For treatment, triage, claims, escalation, or care management, PHI-access logs are not enough. The organisation needs a decision record: what data entered the model, what the model returned, who reviewed it, and what action followed. HIPAA governs the PHI flow; for clinical decision-support use cases, FDA guidance and, where certified health IT is involved, ONC decision-support criteria supply the better operational principle: the professional should be able to independently review the basis for a recommendation and treat the agent output as an input, not the final decision. Time-critical, diagnostic, or directive outputs may also raise FDA medical-device questions beyond HIPAA.

“A HIPAA risk analysis is essential for identifying where ePHI is stored and what security measures are needed to protect it. Completing an accurate and thorough risk analysis that informs a risk management plan is a foundational step to mitigate or prevent cyberattacks and breaches.”

Paula M. Stannard, Director, HHS Office for Civil Rights (HHS.gov, August 2025)

Pre-deployment checklist

The checklist below maps seven compliance areas to their requirements and to the failure pattern that typically appears when each is skipped. It applies to any HIPAA-compliant AI agent deployment regardless of architecture pattern.

| Area | Production gate | What failure looks like |
| --- | --- | --- |
| PHI map and scope | Map every place the agent touches ePHI, then limit each tool to the fields it actually needs. | Scheduling receives diagnoses or notes because the whole chart was passed by default. |
| BAA and model boundary | Confirm BAAs for every vendor in the PHI path. If PHI reaches the model, use a covered configuration; if not, validate de-identification. | PHI reaches an LLM, analytics tool, vector store, or support system outside the contracted boundary. |
| Identity and access | Use unique user IDs, least privilege, per-tool authorization, and MFA where practical. Avoid shared service accounts. | Logs show that “the app” accessed PHI, but not which user delegated the action. |
| Encryption and storage | Protect ePHI in transit and at rest, including prompts, outputs, logs, exports, embeddings, caches, backups, and local debug files. | The main database is encrypted, but traces or vector indexes hold PHI with weaker controls. |
| Audit and retention | Log agent actions at the tool-call level and define deletion rules for memory, traces, embeddings, backups, and vendor-held data. | Security cannot reconstruct which agent action disclosed which PHI, or delete it later. |
| Incidents and change control | Prepare playbooks for PHI leaks, model changes, prompt updates, retrieval changes, rollback, and notification paths. | A prompt update changes disclosure behaviour with no security review or recovery plan. |
| Clinical accountability | For triage, treatment support, claims, or care management, require validation, human review where needed, and a decision record. | The agent output becomes the decision, while logs capture only the data access. |


What this looks like in production

We built a multi-channel pre-appointment AI agent for a US healthcare provider serving more than 100,000 members across multiple states. The agent operated across voice, messaging, and web interfaces, enabling patients to share updates and concerns before their appointment, and giving clinical teams structured intake data before the encounter began.

The compliance architecture requirement was not an add-on. Because the agent operated across three channel types, PHI boundaries had to be enforced at every channel boundary independently: the voice interaction, the messaging payload, the web intake form submission, and the EHR write-back. Each channel carried different data format risks, different logging requirements, and different human-in-the-loop thresholds.

The outcome: each doctor saves more than five hours per week, and patient engagement increased by over 20%. Neither result was achievable without the compliance architecture being correct, because without it, the system would not have been deployable in a regulated environment at all. For teams considering agentic AI in healthcare, this is the practical lesson: compliance architecture is not the constraint on what you can build. It is the foundation that makes deployment possible.


Conclusion

HIPAA compliance for an AI agent is not a checklist applied after the system is built. It is a set of architectural decisions made before the first tool call is written: where PHI flows, which vendors are in the data path, how agent memory is governed, and what audit trail exists for every autonomous decision the system makes.

The 2025 HIPAA Security Rule proposals, whether finalised on schedule or not, reflect the direction regulators are moving: from addressable flexibility to mandatory specificity. Organisations that design for the proposed standard now will not be caught retrofitting when it is enacted.

For mid-market healthcare organisations without in-house agentic AI expertise, the cost of getting this wrong is not only regulatory. An ungoverned PHI access event in a multi-agent system can touch thousands of records before it is detected. The architecture patterns and checklist above are drawn from deployments where that risk was designed out from the start, not discovered during an audit.



Last updated: April 23, 2026
