Introduction
The rise of large language models has created unprecedented opportunities for workflow automation. However, sending sensitive data to third-party APIs poses unacceptable risks for healthcare, finance, and government sectors. This guide demonstrates how to deploy privacy-preserving AI infrastructure using open-source models, with a focus on HIPAA compliance and cost-effective scaling.
HIPAA Compliance Note: While this guide provides technical infrastructure for private AI deployment, HIPAA compliance requires additional administrative, physical, and technical safeguards beyond the scope of this article. Consult with compliance experts for your specific use case.
Why Open Source Models?
Open-source models offer critical advantages for sensitive applications:
- Data sovereignty: All processing occurs within your infrastructure
- No data retention: Unlike APIs, no third party logs or trains on your data
- Auditable security: Model code and weights can be inspected
- Cost predictability: Fixed infrastructure costs vs per-token API pricing
- Customization: Fine-tune models on domain-specific data
- Regulatory compliance: Easier to meet HIPAA, GDPR, and industry requirements
Ollama Models: Comprehensive Comparison
Ollama provides an easy-to-use platform for running open-source models locally or in your infrastructure. Here's a detailed comparison of popular models:
| Model | Size | VRAM | Best For | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Gemma:2b | 2B params | 4GB | Fast responses, edge deployment | Very fast, low resource | Limited reasoning |
| Gemma:7b | 7B params | 8GB | General automation, chatbots | Good balance, efficient | Complex tasks limited |
| Gemma:27b | 27B params | 20GB | Complex workflows, analysis | Strong reasoning, versatile | Higher resource needs |
| Llama 3:8b | 8B params | 8GB | General purpose, coding | Excellent coding, fast | Moderate reasoning |
| Llama 3:70b | 70B params | 48GB | Advanced reasoning, research | Superior quality, versatile | High compute cost |
| Mixtral:8x7b | 47B params | 32GB | Multi-task, high quality | MoE efficiency, multilingual | Complex deployment |
| Mistral:7b | 7B params | 8GB | Cost-effective quality | Great value, efficient | Less capable than 70B+ |
| CodeLlama:13b | 13B params | 12GB | Code generation, analysis | Specialized for code | Limited general knowledge |
Model Selection Guidance
For Medical/HIPAA Applications
Recommended: Gemma:27b or Llama 3:70b
Healthcare requires high accuracy and strong reasoning. Gemma:27b provides excellent performance at moderate cost, while Llama 3:70b offers the highest quality for critical applications.
For Document Processing/Analysis
Recommended: Gemma:27b or Mixtral:8x7b
Both models excel at understanding and analyzing documents. Mixtral's mixture-of-experts architecture handles diverse document types efficiently.
For High-Volume, Simple Tasks
Recommended: Gemma:7b or Mistral:7b
For classification, extraction, or simple Q&A, 7B models offer the best cost-performance ratio. Deploy multiple instances for high throughput.
For Code-Heavy Workflows
Recommended: CodeLlama:13b or Llama 3:70b
CodeLlama specializes in code, while Llama 3:70b handles both code and natural language exceptionally well.
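The guidance above can be sketched as a simple lookup. This is a hypothetical helper, not part of any Ollama API; the model tags and VRAM figures mirror the comparison table and should be treated as rough minimums:

```python
# Hypothetical model picker based on the selection guidance above.
# VRAM figures follow the comparison table; adjust for your hardware and quantization.
RECOMMENDATIONS = {
    "medical": ["gemma:27b", "llama3:70b"],
    "documents": ["gemma:27b", "mixtral:8x7b"],
    "high_volume": ["gemma:7b", "mistral:7b"],
    "code": ["codellama:13b", "llama3:70b"],
}

VRAM_GB = {
    "gemma:7b": 8, "gemma:27b": 20, "llama3:70b": 48,
    "mixtral:8x7b": 32, "mistral:7b": 8, "codellama:13b": 12,
}

def pick_model(use_case: str, available_vram_gb: int) -> str:
    """Return the first recommended model that fits the VRAM budget."""
    for model in RECOMMENDATIONS[use_case]:
        if VRAM_GB[model] <= available_vram_gb:
            return model
    raise ValueError(f"No recommended model for {use_case!r} fits in {available_vram_gb}GB")
```

For example, `pick_model("medical", 24)` selects `gemma:27b`, falling back to the smaller recommendation only when the preferred model does not fit.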
Kubernetes Deployment
Let's deploy Gemma:27b on Kubernetes with GPU support, load balancing, and automatic scaling.
Step 1: Create Namespace and ConfigMap
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: ai-automation
labels:
name: ai-automation
compliance: hipaa
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ollama-config
namespace: ai-automation
data:
OLLAMA_HOST: "0.0.0.0:11434"
OLLAMA_MODELS: "/models"
OLLAMA_NUM_PARALLEL: "4" # Number of parallel requests
OLLAMA_MAX_LOADED_MODELS: "2" # Keep 2 models in memory
  OLLAMA_GPU_LAYERS: "999" # Offload all layers to GPU
Step 2: Create Persistent Volume for Models
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ollama-models-pvc
namespace: ai-automation
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd # Use fast SSD for model loading
resources:
requests:
      storage: 100Gi # Gemma:27b needs ~30GB; leave room for multiple models
Step 3: Deploy Ollama with Gemma:27b
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-gemma27b
namespace: ai-automation
labels:
app: ollama
model: gemma27b
spec:
replicas: 2 # Start with 2 replicas
selector:
matchLabels:
app: ollama
model: gemma27b
template:
metadata:
labels:
app: ollama
model: gemma27b
spec:
# Node affinity for GPU nodes
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu
operator: Exists
- key: gpu-type
operator: In
values:
- a100
- v100
# Pod Anti-affinity to spread across nodes
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ollama
topologyKey: kubernetes.io/hostname
containers:
- name: ollama
        image: ollama/ollama:latest # pin a specific release for reproducible deployments
imagePullPolicy: Always
ports:
- containerPort: 11434
name: http
protocol: TCP
envFrom:
- configMapRef:
name: ollama-config
resources:
requests:
memory: "24Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "1"
volumeMounts:
- name: models
mountPath: /models
# Startup probe - model loading can take time
startupProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30 # Allow 5 minutes for startup
# Liveness probe
livenessProbe:
httpGet:
path: /api/tags
port: 11434
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe
readinessProbe:
httpGet:
path: /api/tags
port: 11434
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
        # Lifecycle hook to pull the model on start
        # (postStart runs alongside the entrypoint, so wait for the server to be listening)
        lifecycle:
          postStart:
            exec:
              command:
              - "/bin/sh"
              - "-c"
              - "sleep 10 && ollama pull gemma:27b"
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models-pvc
# Security context for HIPAA compliance
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Toleration for GPU nodes
tolerations:
- key: nvidia.com/gpu
operator: Exists
        effect: NoSchedule
Step 4: Create Service and Ingress
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: ollama-service
namespace: ai-automation
labels:
app: ollama
spec:
type: ClusterIP
  sessionAffinity: ClientIP # sticky sessions for better cache reuse
sessionAffinityConfig:
clientIP:
timeoutSeconds: 3600 # 1 hour session stickiness
ports:
- port: 80
targetPort: 11434
protocol: TCP
name: http
selector:
app: ollama
---
# ingress.yaml (optional, for external access)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ollama-ingress
namespace: ai-automation
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
# Increase timeout for long-running generation
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
# TLS for HIPAA compliance
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
tls:
- hosts:
- ai.yourdomain.com
secretName: ollama-tls
rules:
- host: ai.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ollama-service
port:
            number: 80
Step 5: Horizontal Pod Autoscaler
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ollama-hpa
namespace: ai-automation
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ollama-gemma27b
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
  # Scale on GPU utilization (requires a custom metrics pipeline, e.g. DCGM exporter + prometheus-adapter)
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "75"
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60 # Quick scale up
policies:
- type: Percent
value: 100
      periodSeconds: 30
Deploy Everything
# Apply all manifests
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
# Monitor deployment
kubectl get pods -n ai-automation -w
# Check logs
kubectl logs -n ai-automation -l app=ollama --tail=100 -f
# Test the service
kubectl port-forward -n ai-automation svc/ollama-service 11434:80
# Make a test request
curl http://localhost:11434/api/generate -d '{
"model": "gemma:27b",
"prompt": "Summarize the key points from this medical report...",
"stream": false
}'
Azure GPU Infrastructure with Terraform
Now let's provision the Azure infrastructure to run our Kubernetes cluster with GPU nodes.
Terraform Configuration
# main.tf
terraform {
required_version = ">= 1.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.0"
}
}
backend "azurerm" {
resource_group_name = "terraform-state-rg"
storage_account_name = "tfstate"
container_name = "tfstate"
key = "ai-automation.tfstate"
}
}
provider "azurerm" {
features {}
}
# Variables
variable "environment" {
description = "Environment name"
type = string
default = "production"
}
variable "location" {
description = "Azure region"
type = string
default = "eastus"
}
variable "cluster_name" {
description = "AKS cluster name"
type = string
default = "ai-automation-aks"
}
variable "node_count" {
description = "Initial number of GPU nodes"
type = number
default = 2
}
variable "max_node_count" {
description = "Maximum number of GPU nodes for autoscaling"
type = number
default = 10
}
# Resource Group
resource "azurerm_resource_group" "main" {
name = "ai-automation-${var.environment}-rg"
location = var.location
tags = {
Environment = var.environment
Project = "AI-Automation"
Compliance = "HIPAA"
}
}
# Virtual Network
resource "azurerm_virtual_network" "main" {
name = "ai-automation-vnet"
address_space = ["10.0.0.0/16"]
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
}
resource "azurerm_subnet" "aks" {
name = "aks-subnet"
resource_group_name = azurerm_resource_group.main.name
virtual_network_name = azurerm_virtual_network.main.name
address_prefixes = ["10.0.1.0/24"]
}
# Log Analytics Workspace for monitoring
resource "azurerm_log_analytics_workspace" "main" {
name = "ai-automation-logs"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
sku = "PerGB2018"
  retention_in_days   = 90 # adjust to your retention policy; HIPAA documentation retention runs to six years
}
# AKS Cluster
resource "azurerm_kubernetes_cluster" "main" {
name = var.cluster_name
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = var.cluster_name
kubernetes_version = "1.28"
# System node pool (non-GPU, for system workloads)
default_node_pool {
name = "system"
node_count = 2
vm_size = "Standard_D4s_v3"
vnet_subnet_id = azurerm_subnet.aks.id
type = "VirtualMachineScaleSets"
enable_auto_scaling = true
min_count = 2
max_count = 5
node_labels = {
"workload-type" = "system"
}
node_taints = [
"CriticalAddonsOnly=true:NoSchedule"
]
}
identity {
type = "SystemAssigned"
}
# Network profile
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
service_cidr = "10.1.0.0/16"
dns_service_ip = "10.1.0.10"
}
# Enable monitoring
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
# HIPAA compliance features
role_based_access_control_enabled = true
azure_policy_enabled = true
# Enable encryption at rest
disk_encryption_set_id = azurerm_disk_encryption_set.main.id
tags = {
Environment = var.environment
Compliance = "HIPAA"
}
}
# GPU Node Pool for AI workloads
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
name = "gpupool"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_NC24ads_A100_v4" # Azure VM with NVIDIA A100
node_count = var.node_count
# Autoscaling
enable_auto_scaling = true
min_count = var.node_count
max_count = var.max_node_count
# GPU-specific labels and taints
node_labels = {
"workload-type" = "gpu"
"gpu-type" = "a100"
"nvidia.com/gpu" = "true"
}
node_taints = [
"nvidia.com/gpu=true:NoSchedule"
]
# Use premium SSD for fast model loading
os_disk_type = "Ephemeral"
os_disk_size_gb = 200
vnet_subnet_id = azurerm_subnet.aks.id
tags = {
Environment = var.environment
GPU = "A100"
}
}
# Disk Encryption Set for HIPAA compliance
resource "azurerm_key_vault" "main" {
name = "ai-automation-kv"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
tenant_id = data.azurerm_client_config.current.tenant_id
sku_name = "premium"
enabled_for_disk_encryption = true
purge_protection_enabled = true
network_acls {
default_action = "Deny"
bypass = "AzureServices"
}
}
resource "azurerm_key_vault_key" "disk_encryption" {
name = "disk-encryption-key"
key_vault_id = azurerm_key_vault.main.id
key_type = "RSA"
key_size = 4096
key_opts = [
"decrypt",
"encrypt",
"sign",
"unwrapKey",
"verify",
"wrapKey",
]
}
resource "azurerm_disk_encryption_set" "main" {
name = "ai-automation-des"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
key_vault_key_id = azurerm_key_vault_key.disk_encryption.id
identity {
type = "SystemAssigned"
}
}
# Grant DES access to Key Vault
resource "azurerm_key_vault_access_policy" "disk_encryption" {
key_vault_id = azurerm_key_vault.main.id
tenant_id = data.azurerm_client_config.current.tenant_id
object_id = azurerm_disk_encryption_set.main.identity[0].principal_id
key_permissions = [
"Get",
"WrapKey",
"UnwrapKey",
]
}
data "azurerm_client_config" "current" {}
# Container Registry for private images
resource "azurerm_container_registry" "main" {
name = "aiautomationacr"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
sku = "Premium"
admin_enabled = false
# HIPAA compliance
public_network_access_enabled = false
network_rule_bypass_option = "AzureServices"
}
# Grant AKS access to ACR
resource "azurerm_role_assignment" "aks_acr_pull" {
scope = azurerm_container_registry.main.id
role_definition_name = "AcrPull"
principal_id = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
# Outputs
output "cluster_name" {
value = azurerm_kubernetes_cluster.main.name
}
output "kube_config" {
value = azurerm_kubernetes_cluster.main.kube_config_raw
sensitive = true
}
output "cluster_endpoint" {
value = azurerm_kubernetes_cluster.main.kube_config[0].host
}
output "acr_login_server" {
value = azurerm_container_registry.main.login_server
}
Deploy Infrastructure
# Initialize Terraform
terraform init
# Review planned changes
terraform plan -out=tfplan
# Apply infrastructure
terraform apply tfplan
# Get AKS credentials
az aks get-credentials \
--resource-group ai-automation-production-rg \
--name ai-automation-aks \
--overwrite-existing
# Verify GPU nodes
kubectl get nodes -l nvidia.com/gpu=true
# Install the NVIDIA GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
# Verify GPU is available (a toleration is required because GPU nodes are tainted)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
Cost Analysis: GPU Cluster at Scale
Understanding costs is critical for budgeting private AI infrastructure. Here's a detailed breakdown:
Azure GPU VM Pricing (East US, as of Nov 2024)
| VM Size | GPU | VRAM | Hourly | Monthly (24/7) | Best For |
|---|---|---|---|---|---|
| NC6s_v3 | 1x V100 | 16GB | $3.06 | $2,203 | Dev/test, small models |
| NC12s_v3 | 2x V100 | 32GB | $6.12 | $4,406 | Medium scale |
| NC24ads_A100_v4 | 1x A100 | 80GB | $3.67 | $2,642 | Production (our choice) |
| NC48ads_A100_v4 | 2x A100 | 160GB | $7.35 | $5,292 | Large models, high throughput |
| NC96ads_A100_v4 | 4x A100 | 320GB | $14.69 | $10,577 | Maximum performance |
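The monthly column is simply the hourly rate times a 720-hour (30-day) month; a quick sanity check of the table's arithmetic:

```python
# Sanity-check the pricing table: monthly (24/7) = hourly rate x 720 hours.
# Rates are the illustrative East US figures quoted above, not live Azure prices.
HOURLY = {
    "NC6s_v3": 3.06,
    "NC24ads_A100_v4": 3.67,
    "NC96ads_A100_v4": 14.69,
}

def monthly_cost(vm_size: str, hours: int = 720) -> int:
    """Round to whole dollars, matching the table."""
    return round(HOURLY[vm_size] * hours)
```

`monthly_cost("NC24ads_A100_v4")` reproduces the $2,642 figure for the production node choice.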
Complete Monthly Cost Scenarios
Small Deployment (Startup/POC)
Medium Deployment (Growing Business)
Large Deployment (Enterprise Scale)
Cost Comparison: Self-Hosted vs API Services
Break-Even Analysis
Comparing our medium deployment ($14,470/mo) to GPT-4 API pricing ($0.03/1K tokens input, $0.06/1K tokens output):
- Average request: 500 input tokens, 500 output tokens
- Cost per API request: $0.015 + $0.030 = $0.045
- Our infrastructure: $14,470/mo for 500 req/hr capacity (~360,000 requests/month)
At that volume, the equivalent API spend is roughly $16,200/month, so self-hosting breaks even near 321,000 requests/month. Beyond raw cost, self-hosting also provides:
- Data Privacy: No data leaves your infrastructure (required for HIPAA)
- No Rate Limits: Full control over throughput
- Customization: Fine-tune models on proprietary data
- Predictable Costs: No surprise bills from usage spikes
- Latency: local inference avoids API round-trips and rate-limit queuing; hosted APIs commonly add 500-2000ms of overhead
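Under the illustrative numbers in this section (a $14,470/month cluster versus per-token API pricing), the break-even volume can be computed directly; a minimal sketch:

```python
# Break-even between a fixed self-hosted cost and per-request API pricing.
# All figures are the illustrative numbers from this section, not vendor quotes.
INFRA_MONTHLY = 14_470          # medium deployment, $/month
API_INPUT = 0.03 / 1000         # $ per input token
API_OUTPUT = 0.06 / 1000        # $ per output token

def api_cost_per_request(in_tokens: int = 500, out_tokens: int = 500) -> float:
    """Cost of one API request at the assumed token mix."""
    return in_tokens * API_INPUT + out_tokens * API_OUTPUT

def breakeven_requests_per_month() -> int:
    """Monthly request volume at which self-hosting matches API spend."""
    return round(INFRA_MONTHLY / api_cost_per_request())
```

At $0.045 per request this lands near 321,000 requests/month; above that volume the fixed cluster cost wins on price alone.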
Cost Optimization Strategies
1. Use Reserved Instances
Azure Reserved Instances offer up to 72% savings. For our medium deployment, this reduces GPU costs from $13,210/mo to approximately $3,699/mo — saving $9,511/mo.
2. Implement Spot Instances for Non-Critical Workloads
Use Spot instances for batch processing or development environments. Spot pricing offers up to 90% discount but can be interrupted with 30-second notice.
3. Aggressive Autoscaling
Scale down to minimum replicas during off-peak hours. If your workload has 50% lower demand at night, this saves ~25% on GPU costs.
4. Model Selection by Use Case
Route simple queries to Gemma:7b (2x cheaper to run) and complex ones to Gemma:27b. This can reduce average costs by 30-40% while maintaining quality.
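A rough model of that saving: if a 7B request costs about half a 27B request (the "2x cheaper" assumption above, not a benchmark), the blended cost falls with the share of traffic you can route to the smaller model:

```python
# Blended relative cost when a fraction of queries goes to a cheaper model.
# small_model_ratio = 0.5 encodes the assumed "2x cheaper" 7B-vs-27B cost ratio.
def blended_cost(simple_fraction: float, small_model_ratio: float = 0.5) -> float:
    """Average cost per request, relative to sending everything to the big model."""
    return simple_fraction * small_model_ratio + (1 - simple_fraction) * 1.0
```

Routing 60-80% of traffic to the 7B model yields a blended cost of 0.6-0.7, i.e. the 30-40% saving quoted above.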
5. Quantization
Use 4-bit or 8-bit quantized models where accuracy requirements allow. This can reduce VRAM requirements by 4x, allowing you to use smaller, cheaper VMs.
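As a back-of-the-envelope rule, weight memory is parameter count times bytes per parameter, so going from 16-bit to 4-bit cuts it roughly 4x. The estimate below covers weights only and ignores KV cache and runtime overhead:

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Real usage adds KV cache, activations, and runtime overhead on top of this.
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params (billions) x bits / 8."""
    return params_billion * bits / 8
```

For a 27B model, 16-bit weights need about 54GB while a 4-bit quantization needs about 13.5GB, which is why quantized builds fit on far cheaper VMs.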
Example Workflow: Medical Report Processing
Let's implement a complete example: processing medical reports with Gemma:27b for HIPAA-compliant automation.
import asyncio
import httpx
from datetime import datetime
from typing import Dict, List
import json
class MedicalReportProcessor:
"""
HIPAA-compliant medical report processor using private Ollama deployment.
All data stays within your infrastructure.
"""
def __init__(self, ollama_endpoint: str = "http://ollama-service.ai-automation.svc.cluster.local"):
self.endpoint = ollama_endpoint
self.model = "gemma:27b"
self.client = httpx.AsyncClient(timeout=60.0)
async def extract_structured_data(self, report_text: str) -> Dict:
"""
Extract structured data from unstructured medical reports.
"""
prompt = f"""Extract the following information from this medical report.
Return ONLY a valid JSON object with these exact fields:
{{
"patient_demographics": {{
"age": <number or null>,
"gender": "<M/F/Other or null>"
}},
"chief_complaint": "<string or null>",
"diagnosis": ["<diagnosis 1>", "<diagnosis 2>"],
"medications": ["<med 1>", "<med 2>"],
"procedures": ["<procedure 1>"],
"lab_results": {{
"<test_name>": "<value and unit>"
}},
"follow_up": "<string or null>"
}}
Medical Report:
{report_text}
JSON Output:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.1, # Low temperature for factual extraction
"num_predict": 2000,
}
}
)
result = response.json()
# Parse the JSON from model output
try:
extracted = json.loads(result["response"])
return extracted
except json.JSONDecodeError:
# Fallback: extract JSON from text
text = result["response"]
start = text.find("{")
end = text.rfind("}") + 1
return json.loads(text[start:end])
async def summarize_report(self, report_text: str, max_words: int = 200) -> str:
"""
Generate concise clinical summary.
"""
prompt = f"""Provide a concise clinical summary of this medical report in {max_words} words or less.
Focus on: diagnosis, key findings, treatment plan, and follow-up.
Medical Report:
{report_text}
Clinical Summary:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.3,
"num_predict": 300,
}
}
)
return response.json()["response"]
async def classify_urgency(self, report_text: str) -> str:
"""
Classify report urgency for triage.
"""
prompt = f"""Classify the urgency of this medical report into one of these categories:
- CRITICAL: Life-threatening condition requiring immediate attention
- URGENT: Serious condition requiring prompt attention within hours
- ROUTINE: Standard follow-up or non-urgent matter
Respond with ONLY the category name (CRITICAL, URGENT, or ROUTINE).
Medical Report:
{report_text}
Urgency:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.0, # Deterministic for classification
"num_predict": 10,
}
}
)
urgency = response.json()["response"].strip().upper()
# Validate output
if urgency not in ["CRITICAL", "URGENT", "ROUTINE"]:
return "ROUTINE" # Default to routine if unclear
return urgency
async def process_report(self, report_text: str) -> Dict:
"""
Full pipeline: extract data, summarize, and classify.
"""
# Run all tasks concurrently for speed
structured_data, summary, urgency = await asyncio.gather(
self.extract_structured_data(report_text),
self.summarize_report(report_text),
self.classify_urgency(report_text)
)
return {
"structured_data": structured_data,
"summary": summary,
"urgency": urgency,
"processed_at": datetime.utcnow().isoformat()
}
async def batch_process(self, reports: List[str], max_concurrent: int = 5) -> List[Dict]:
"""
Process multiple reports with concurrency control.
"""
semaphore = asyncio.Semaphore(max_concurrent)
async def process_with_semaphore(report):
async with semaphore:
return await self.process_report(report)
tasks = [process_with_semaphore(report) for report in reports]
return await asyncio.gather(*tasks)
# Example usage
async def main():
processor = MedicalReportProcessor()
sample_report = """
Patient: John Doe, 58-year-old male
Chief Complaint: Chest pain and shortness of breath
History: Patient presents with substernal chest pain radiating to left arm,
onset 2 hours ago. Associated with diaphoresis and nausea.
Exam: BP 160/95, HR 102, RR 22, SpO2 94% on room air
Cardiovascular: Tachycardic, regular rhythm, no murmurs
Respiratory: Bilateral crackles at bases
EKG: ST elevations in leads II, III, aVF
Troponin: 2.4 ng/mL (elevated)
Assessment: Acute inferior wall myocardial infarction
Plan:
- Aspirin 325mg PO given
- Heparin bolus and drip initiated
- Cardiology consulted for urgent catheterization
- Transfer to CCU
"""
result = await processor.process_report(sample_report)
print("Urgency:", result["urgency"])
print("\nSummary:", result["summary"])
print("\nStructured Data:", json.dumps(result["structured_data"], indent=2))
if __name__ == "__main__":
    asyncio.run(main())
Monitoring and Observability
Production AI systems require comprehensive monitoring:
# prometheus-servicemonitor.yaml
# Note: Ollama does not expose a Prometheus /metrics endpoint natively;
# this manifest assumes a metrics exporter sidecar serving /metrics on the pod.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ollama-metrics
namespace: ai-automation
spec:
selector:
matchLabels:
app: ollama
endpoints:
- port: http
path: /metrics
interval: 30s
---
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ollama-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "AI Automation - Ollama Performance",
"panels": [
{
"title": "Requests per Second",
"targets": [
{
"expr": "rate(ollama_requests_total[5m])"
}
]
},
{
"title": "GPU Utilization",
"targets": [
{
"expr": "nvidia_gpu_duty_cycle"
}
]
},
{
"title": "Average Response Time",
"targets": [
{
"expr": "rate(ollama_request_duration_seconds_sum[5m]) / rate(ollama_requests_total[5m])"
}
]
},
{
"title": "Model Memory Usage",
"targets": [
{
"expr": "ollama_model_memory_bytes"
}
]
}
]
}
    }
Security Best Practices
For HIPAA compliance and general security:
- Encryption at rest: Use Azure Disk Encryption with customer-managed keys for all persistent storage
- Encryption in transit: TLS 1.3 for all network communications, including internal cluster traffic
- Network isolation: Deploy in private VNet, use Azure Private Link for external access
- Access control: Azure RBAC for cluster access, Kubernetes RBAC for workload permissions
- Audit logging: Enable Azure Monitor audit logs and retain them per your compliance program (HIPAA documentation retention requirements extend to six years)
- Vulnerability scanning: Use Azure Defender for containers, scan images before deployment
- Secrets management: Store credentials in Azure Key Vault, inject via CSI driver
- Network policies: Implement Calico policies to restrict pod-to-pod communication
Conclusion
Private AI automation with open-source models is not only feasible but often more cost-effective than API services for high-volume applications. With proper infrastructure, organizations can achieve:
- Complete data sovereignty and HIPAA compliance
- Predictable, optimized costs at scale
- Lower latency and higher throughput
- Customization capabilities for domain-specific needs
- Independence from third-party API providers
At Subzero Research, we help organizations design, deploy, and optimize private AI infrastructure for healthcare, finance, and government applications. Our expertise spans distributed systems, GPU optimization, and regulatory compliance.
Need Help with Your Private AI Infrastructure?
We provide consulting services for architecture design, HIPAA compliance, and deployment of private AI systems.
Contact Us