Self-Hosted LLMs vs. Cloud AI: A Practical Comparison for Healthcare
For healthcare organizations handling protected health information (PHI), self-hosted LLMs provide stronger data residency guarantees and eliminate third-party data exposure, but they require more infrastructure investment. Cloud AI is faster to deploy and easier to scale, but introduces compliance complexity around BAAs, data routing, and vendor lock-in. The right choice depends on your data sensitivity, technical capacity, and risk tolerance.
Why This Decision Matters Now
The AI in healthcare market reached $21.66 billion in 2025 and is projected to grow at a 38.6% CAGR through 2030 (MarketsandMarkets, 2025). At the same time, healthcare data breaches cost an average of $9.77 million per incident in 2024, the highest of any industry for the fourteenth consecutive year (IBM/HIPAA Journal, 2024). Every AI tool that touches PHI expands your attack surface and compliance obligations.
The question isn't whether to adopt AI. It's how to adopt it without creating new vectors for data exposure.
The Core Comparison: Self-Hosted LLMs vs. Cloud AI
| Dimension | Self-Hosted LLMs | Cloud AI (API-Based) |
|---|---|---|
| Data Residency | PHI never leaves your network. Full control over storage location, encryption keys, and access logs. | Data transits to provider data centers. Region selection available but routing not always transparent. |
| HIPAA Compliance | You own the entire compliance surface. No BAA needed with an AI vendor because there is no AI vendor. | Requires a signed BAA. Compliance depends on using only "in-scope" services - not all AI features qualify. |
| Latency | Sub-10ms network latency for on-prem. Inference speed depends on your GPU hardware. | Network round-trip of 50-200ms typical. Provider-side inference is fast due to optimized clusters. |
| Upfront Cost | Higher. GPU servers (NVIDIA A100/H100), networking, and ops staffing required. | Lower. Pay-per-token pricing with no hardware investment. |
| Ongoing Cost | Predictable. Hardware amortization plus electricity and maintenance. | Variable. Scales with usage, can spike unpredictably at volume. |
| Customization | Full fine-tuning, LoRA adapters, custom tokenizers, domain-specific training on your own data. | Limited to provider-offered fine-tuning APIs. Your training data may be subject to provider policies. |
| Maintenance Burden | High. You manage updates, security patches, model versioning, and GPU driver compatibility. | Low. Provider handles infrastructure, scaling, and model updates. |
| Model Quality | Open models (Llama 3, Mistral) approach but don't always match frontier closed models on general benchmarks. | Access to latest frontier models (GPT-4o, Claude, Gemini) with state-of-the-art performance. |
Deployment Options for Self-Hosted LLMs
Self-hosted doesn't mean one-size-fits-all. There are three common deployment architectures, each with different security and operational profiles.
Fully Air-Gapped (On-Premises, No Internet)
- The model, inference engine, and all data stay on hardware with no external network connection
- Maximum security posture - eliminates remote attack vectors entirely
- Model updates require manual transfer via secure media
- Best for: organizations handling the most sensitive PHI, military/VA health systems, or environments with strict regulatory mandates
On-Premises with Internet Connectivity
- Models run on local hardware, but the network allows outbound connections for updates, monitoring, and telemetry
- Enables remote management and automated patching
- Requires careful firewall rules to prevent PHI egress
- Best for: hospital systems with existing data center infrastructure and a mature IT security team
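The "careful firewall rules" point can be made concrete with a deny-by-default egress policy. Below is a minimal sketch of the allow-list check an egress proxy would enforce; the hostnames are hypothetical, illustrative only:

```python
from urllib.parse import urlparse

# Hypothetical allow-list: only update and monitoring endpoints are permitted.
# Anything not listed here -- including any destination a compromised process
# might try to reach -- is blocked, so PHI cannot leave over an unknown channel.
EGRESS_ALLOWLIST = {
    "registry.example-modelhub.com",    # model/weight updates (illustrative name)
    "telemetry.example-monitoring.com", # metrics export (illustrative name)
}

def egress_allowed(url: str) -> bool:
    """Return True only if the destination host is explicitly allow-listed."""
    host = urlparse(url).hostname or ""
    return host in EGRESS_ALLOWLIST

# Deny-by-default in action:
assert egress_allowed("https://registry.example-modelhub.com/llama3/weights")
assert not egress_allowed("https://attacker.example.com/exfil")
```

In production this policy lives in the firewall or egress proxy, not application code, but the shape is the same: an explicit allow-list, never a block-list.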
Private Cloud (Azure/AWS Private Endpoints)
- Models run in your own cloud tenancy (VPC/VNet) with private endpoints - no public internet exposure
- Data stays within your cloud account; the provider never accesses it
- You control encryption keys via AWS KMS or Azure Key Vault
- Offers cloud scalability without the shared-tenancy risk of API-based AI services
- Best for: organizations already running workloads in AWS/Azure who want to add AI without new physical infrastructure
Open-Source Models Worth Evaluating
The open-source LLM landscape has matured rapidly. Here are the models most relevant to healthcare deployments:
Meta Llama 3 / 3.1 (8B, 70B, 405B parameters)
The current standard for open-weight models. Llama 3 70B performs competitively with GPT-4 on many benchmarks and runs well on a dual-GPU server. The 8B variant is viable for edge deployments and lower-complexity tasks like appointment scheduling or patient FAQ responses. The community has produced healthcare-specific fine-tunes built on Llama 3 for clinical note summarization and medical Q&A.
Mistral / Mixtral (7B, 8x7B, 8x22B)
Mistral's Mixture-of-Experts (MoE) architecture activates only a subset of parameters per inference, delivering strong performance with lower compute requirements. Mixtral 8x7B offers roughly GPT-3.5-level quality at a fraction of the hardware cost. Good for high-throughput use cases like prior authorization triage or coding assistance where you need fast, cost-efficient inference.
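The MoE efficiency claim is easy to quantify using Mistral's published figures for Mixtral 8x7B (~46.7B total parameters, ~12.9B active per token with top-2 routing):

```python
# Mixtral 8x7B: top-2-of-8 expert routing means each token only touches a
# fraction of the total parameters, so inference compute scales with the
# active count, not the total. Figures are Mistral's published numbers.
total_b, active_b = 46.7, 12.9
print(f"active fraction per token: {active_b / total_b:.0%}")  # ~28%
```

That ~28% active fraction is why an MoE model can deliver large-model quality at closer to small-model serving cost.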
Microsoft Phi-3 / Phi-4 (3.8B - 14B)
Small language models optimized for constrained environments. Useful for on-device or edge inference where GPU resources are limited. Strong at structured extraction tasks like pulling ICD-10 codes from clinical notes.
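For a sense of what "structured extraction" means here, a deterministic regex baseline for ICD-10 codes is a useful reference point; a small model earns its keep on the cases a pattern like this misses (codes described in prose, negated diagnoses). This sketch simplifies the real ICD-10-CM grammar:

```python
import re

# ICD-10-CM codes: a letter, two digits, then an optional dot and 1-4 more
# alphanumerics (e.g. "I10", "E11.9", "U07.1"). This pattern is a
# simplification; real validation should check against the official code set.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9]{2}(?:\.[0-9A-Z]{1,4})?\b")

def extract_icd10_codes(note: str) -> list[str]:
    """Pull candidate ICD-10 codes out of free-text clinical prose."""
    return ICD10_PATTERN.findall(note)

note = "Assessment: E11.9 type 2 diabetes without complications; I10 essential hypertension."
print(extract_icd10_codes(note))  # ['E11.9', 'I10']
```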
Domain-Specific Models
Models like Google's Med-PaLM 2 (closed) demonstrated that domain fine-tuning dramatically improves clinical accuracy. Open alternatives are emerging: BioMistral, Meditron, and PMC-LLaMA are fine-tuned on biomedical literature and show improved performance on medical licensing exam questions (USMLE) compared to their base models.
The BAA Reality Check
A Business Associate Agreement is the legal mechanism that allows a cloud provider to handle PHI on your behalf under HIPAA. All three major cloud providers offer BAAs:
- Microsoft Azure: BAA available by default through the Online Services Data Protection Addendum. Covers Azure OpenAI Service, Azure Machine Learning, and most core Azure services. (Microsoft Learn)
- AWS: BAA available through AWS Artifact. Covers a designated list of HIPAA-eligible services including Amazon Bedrock and SageMaker.
- Google Cloud: BAA available. Covers Vertex AI and a list of designated services.
Here's what most people miss: a BAA does not make a service HIPAA-compliant. It makes the provider contractually liable for their part. You're still responsible for:
- Configuring the service correctly (encryption, access controls, audit logging)
- Using only the services listed as "in-scope" in the BAA - not every feature of Azure or AWS qualifies
- Ensuring your prompts and workflows don't inadvertently expose PHI through non-covered services
- Managing the "minimum necessary" standard - don't send an entire patient record when you only need a medication list
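The "minimum necessary" point translates directly into code: build the prompt payload from an explicit field allow-list rather than passing the whole record. A minimal sketch, with hypothetical field names:

```python
# Hypothetical patient record -- field names are illustrative, not a real schema.
FULL_RECORD = {
    "name": "Jane Doe",
    "dob": "1961-04-02",
    "ssn": "XXX-XX-XXXX",
    "diagnoses": ["E11.9", "I10"],
    "medications": ["metformin 500 mg", "lisinopril 10 mg"],
}

def minimum_necessary(record: dict, fields: set[str]) -> dict:
    """Project a record down to an explicit allow-list of fields.

    Allow-listing (keep only what the task needs) fails safe; block-listing
    (strip known identifiers) silently leaks any field you forgot to name.
    """
    return {k: v for k, v in record.items() if k in fields}

# A medication-interaction check needs the medication list, nothing else.
payload = minimum_necessary(FULL_RECORD, {"medications"})
assert "ssn" not in payload and "name" not in payload
print(payload)  # {'medications': ['metformin 500 mg', 'lisinopril 10 mg']}
```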
| BAA Consideration | Cloud AI with BAA | Self-Hosted (No BAA Needed) |
|---|---|---|
| Covered Services | Only designated "in-scope" services; new AI features may not be covered immediately | N/A - you control the full stack |
| Data Use for Training | BAA typically prohibits, but verify; some providers have separate training data policies | Your data is never used for external training |
| Breach Notification | Provider must notify you; timelines vary by contract | You detect and manage breaches directly |
| Subcontractor Chain | Provider may use subprocessors; your PHI passes through multiple entities | No subcontractors involved |
| Audit Rights | Limited to provider's compliance reports (SOC 2, HITRUST); direct audit usually not available | Full audit access to every component |
Decision Framework: Which Architecture Fits Your Organization?
Use this checklist to determine your best path. Work through each question in order:
Step 1: Data Classification
- Will the AI system process, generate, or have access to PHI? If no, cloud AI with standard security controls is likely sufficient. If yes, continue.
- Does the PHI include high-sensitivity categories (behavioral health, substance abuse, HIV status, psychotherapy notes)? If yes, strongly consider self-hosted or air-gapped deployment.
Step 2: Regulatory Requirements
- Does your organization operate under regulations beyond HIPAA (state privacy laws, GDPR for international patients, 42 CFR Part 2)? If yes, self-hosted simplifies multi-regulation compliance because data never crosses jurisdictional boundaries.
- Do your payer contracts or institutional policies prohibit PHI processing by third parties? If yes, self-hosted is required.
Step 3: Operational Capacity
- Do you have (or can you hire) ML engineering staff to manage model deployment, updates, and monitoring? If no, consider a managed self-hosted platform like Compass AI that handles deployment complexity while keeping data on your infrastructure.
- Do you have existing GPU infrastructure or budget for it? If no, private cloud deployment (your VPC + cloud GPUs) offers a middle path.
Step 4: Use Case Evaluation
- Is the use case patient-facing or does it influence clinical decisions? If yes, prioritize model validation, bias testing, and auditability - all easier with self-hosted models you can fully inspect.
- Is it an internal productivity use case (summarizing documentation, coding assistance)? If yes, cloud AI with a BAA may be acceptable with proper data minimization.
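The four steps above can be encoded as an ordered decision tree. This is one simplified reading of the checklist (the branch ordering and labels are editorial choices, not a formal policy):

```python
def recommend_architecture(
    touches_phi: bool,
    third_party_prohibited: bool,
    high_sensitivity: bool,
    has_ml_ops_staff: bool,
    has_gpu_budget: bool,
) -> str:
    """One way to encode the checklist: hard constraints first, then
    sensitivity, then operational capacity."""
    if not touches_phi:
        return "cloud AI with standard security controls"
    if third_party_prohibited:
        return "self-hosted (required by policy)"
    if high_sensitivity:
        return "self-hosted, consider air-gapped"
    if not has_ml_ops_staff:
        return "managed self-hosted platform"
    if not has_gpu_budget:
        return "private cloud (your VPC + cloud GPUs)"
    return "self-hosted on-premises"

# Example: PHI, no policy bar, standard sensitivity, ops staff, no GPU budget.
print(recommend_architecture(True, False, False, True, False))
# -> private cloud (your VPC + cloud GPUs)
```

Real decisions layer in use-case risk (Step 4) and budget specifics, but pinning the logic down like this exposes disagreements early.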
The Honest Tradeoffs
Cloud AI wins on time-to-value. You can have a working prototype in hours, not weeks. For non-PHI use cases, pilot projects, and low-risk internal tools, cloud APIs are pragmatic.
Self-hosted wins on control and long-term risk reduction. When you're processing thousands of clinical notes per day, the calculus shifts: you eliminate vendor dependency, reduce per-inference cost at scale, and maintain an airtight compliance posture. The upfront investment is real - expect $50K-$200K for a production-grade GPU cluster, depending on model size and throughput requirements - but the marginal cost of each additional inference approaches zero.
For many healthcare organizations, the practical path is a hybrid approach: cloud AI for non-PHI workloads and internal tools, self-hosted models for anything that touches patient data. Platforms like Compass AI exist specifically to reduce the operational burden of self-hosted deployment, providing the data sovereignty of on-prem with a managed experience closer to cloud.
Frequently Asked Questions
Can I use ChatGPT or GPT-4 with patient data if I sign a BAA?
OpenAI offers a BAA for ChatGPT Enterprise and the API, but the BAA only covers specific services and configurations. Consumer ChatGPT is explicitly excluded. Even with a BAA, you must ensure prompts are minimized to necessary PHI only, and you remain responsible for access controls, audit logging, and encryption at rest. The BAA transfers some liability but not all responsibility.
How do open-source models compare to GPT-4 for clinical tasks?
Llama 3 70B and Mixtral 8x22B approach GPT-4 performance on general reasoning and medical knowledge benchmarks. For specialized clinical tasks like note summarization or ICD coding, fine-tuned open models often match or exceed general-purpose frontier models because they're trained on domain-specific data. The gap has narrowed significantly since 2024, and for most healthcare NLP tasks, open models are production-ready.
What GPU hardware do I need to run a self-hosted LLM?
For a 7-8B parameter model (Llama 3 8B, Mistral 7B): a single NVIDIA A10G or L4 GPU with 24GB VRAM is sufficient. For 70B models: plan for 2-4 NVIDIA A100 (80GB) or H100 GPUs. Quantized versions (4-bit GPTQ/AWQ) cut weight memory to roughly a quarter of fp16 with modest quality loss, which makes Llama 3 70B feasible on two A100s with headroom for batching and KV cache.
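These sizing rules come from simple arithmetic: weight memory is parameter count times bytes per parameter, plus an allowance for KV cache and activations. A back-of-envelope sketch (the 20% overhead factor is a rough rule of thumb, not a guarantee):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int,
                     overhead: float = 1.2) -> float:
    """Approximate serving VRAM: raw weight bytes plus ~20% for KV cache
    and activations. Real usage varies with batch size and context length."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

# Llama 3 8B at fp16: ~19 GB -> fits a single 24 GB GPU.
print(round(weight_memory_gb(8, 16), 1))   # 19.2
# Llama 3 70B at fp16: ~168 GB -> needs multiple 80 GB GPUs.
print(round(weight_memory_gb(70, 16), 1))  # 168.0
# Llama 3 70B at 4-bit: ~42 GB -> comfortable across two A100s.
print(round(weight_memory_gb(70, 4), 1))   # 42.0
```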
Does self-hosted mean I don't need to worry about HIPAA at all?
No. HIPAA applies to how you handle PHI regardless of where the AI runs. Self-hosted eliminates the business associate relationship with an AI vendor, but you still must implement the HIPAA Security Rule: access controls, encryption, audit logs, risk assessments, and workforce training. What it does eliminate is third-party data exposure and the compliance complexity of managing vendor BAAs.
What's the total cost of ownership for self-hosted vs. cloud over three years?
For a mid-size deployment processing 10,000 inference requests per day: cloud AI API costs typically run $3,000-$15,000/month depending on model and token volume. A self-hosted setup with two A100 GPUs costs roughly $60K-$80K upfront for hardware, plus $1,500-$3,000/month for power, cooling, and staff time. At moderate-to-high volume, self-hosted breaks even around 12-18 months and becomes significantly cheaper over a three-year horizon.
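The break-even claim follows from a one-line calculation: upfront hardware cost divided by the monthly saving over cloud. A sketch using mid-range figures from the estimate above (all inputs hypothetical):

```python
def breakeven_months(cloud_monthly: float, selfhosted_upfront: float,
                     selfhosted_monthly: float) -> float:
    """Months until cumulative self-hosted cost drops below cloud cost."""
    monthly_saving = cloud_monthly - selfhosted_monthly
    if monthly_saving <= 0:
        return float("inf")  # self-hosted never breaks even at this volume
    return selfhosted_upfront / monthly_saving

# Mid-range figures: $8K/mo cloud, $70K hardware, $2,250/mo ops.
months = breakeven_months(cloud_monthly=8_000,
                          selfhosted_upfront=70_000,
                          selfhosted_monthly=2_250)
print(round(months, 1))  # 12.2
```

Note the sensitivity to volume: at low usage the cloud bill drops, the monthly saving shrinks, and break-even stretches out or never arrives, which is why the hybrid approach above reserves self-hosting for sustained high-volume PHI workloads.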