Introduction
The rise of large language models has created unprecedented opportunities for workflow automation. However, sending sensitive data to third-party APIs poses unacceptable risks for healthcare, finance, and government sectors. This guide demonstrates how to deploy privacy-preserving AI infrastructure using open-source models, with a focus on HIPAA compliance and cost-effective scaling.
HIPAA Compliance Note: While this guide provides technical infrastructure for private AI deployment, HIPAA compliance requires additional administrative, physical, and technical safeguards beyond the scope of this article. Consult with compliance experts for your specific use case.
Why Open Source Models?
Open-source models offer critical advantages for sensitive applications:
- Data sovereignty: All processing occurs within your infrastructure
- No data retention: Unlike APIs, no third party logs or trains on your data
- Auditable security: Model code and weights can be inspected
- Cost predictability: Fixed infrastructure costs vs per-token API pricing
- Customization: Fine-tune models on domain-specific data
- Regulatory compliance: Easier to meet HIPAA, GDPR, and industry requirements
Ollama Models: Comprehensive Comparison
Ollama provides an easy-to-use platform for running open-source models locally or in your infrastructure. Here's a detailed comparison of popular models:
| Model | Size | VRAM | Best For | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Gemma:2b | 2B params | 4GB | Fast responses, edge deployment | Very fast, low resource | Limited reasoning |
| Gemma:7b | 7B params | 8GB | General automation, chatbots | Good balance, efficient | Complex tasks limited |
| Gemma:27b | 27B params | 20GB | Complex workflows, analysis | Strong reasoning, versatile | Higher resource needs |
| Llama 3:8b | 8B params | 8GB | General purpose, coding | Excellent coding, fast | Moderate reasoning |
| Llama 3:70b | 70B params | 48GB | Advanced reasoning, research | Superior quality, versatile | High compute cost |
| Mixtral:8x7b | 47B params | 32GB | Multi-task, high quality | MoE efficiency, multilingual | Complex deployment |
| Mistral:7b | 7B params | 8GB | Cost-effective quality | Great value, efficient | Less capable than 70B+ |
| CodeLlama:13b | 13B params | 12GB | Code generation, analysis | Specialized for code | Limited general knowledge |
Model Selection Guidance
For Medical/HIPAA Applications
Recommended: Gemma:27b or Llama 3:70b
Healthcare requires high accuracy and strong reasoning. Gemma:27b provides excellent performance at moderate cost, while Llama 3:70b offers the highest quality for critical applications.
For Document Processing/Analysis
Recommended: Gemma:27b or Mixtral:8x7b
Both models excel at understanding and analyzing documents. Mixtral's mixture-of-experts architecture handles diverse document types efficiently.
For High-Volume, Simple Tasks
Recommended: Gemma:7b or Mistral:7b
For classification, extraction, or simple Q&A, 7B models offer the best cost-performance ratio. Deploy multiple instances for high throughput.
For Code-Heavy Workflows
Recommended: CodeLlama:13b or Llama 3:70b
CodeLlama specializes in code, while Llama 3:70b handles both code and natural language exceptionally well.
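The guidance above can be sketched as a simple lookup. This is a hypothetical helper, not part of any Ollama API; the model tags and VRAM figures mirror the comparison table and should be treated as rough minimums:

```python
# Hypothetical model picker based on the selection guidance above.
# VRAM figures follow the comparison table; adjust for your hardware and quantization.
RECOMMENDATIONS = {
    "medical": ["gemma:27b", "llama3:70b"],
    "documents": ["gemma:27b", "mixtral:8x7b"],
    "high_volume": ["gemma:7b", "mistral:7b"],
    "code": ["codellama:13b", "llama3:70b"],
}

VRAM_GB = {
    "gemma:7b": 8, "gemma:27b": 20, "llama3:70b": 48,
    "mixtral:8x7b": 32, "mistral:7b": 8, "codellama:13b": 12,
}

def pick_model(use_case: str, available_vram_gb: int) -> str:
    """Return the first recommended model that fits the VRAM budget."""
    for model in RECOMMENDATIONS[use_case]:
        if VRAM_GB[model] <= available_vram_gb:
            return model
    raise ValueError(f"No recommended model for {use_case!r} fits in {available_vram_gb}GB")
```

For example, `pick_model("medical", 24)` selects `gemma:27b`, falling back to the smaller recommendation only when the preferred model does not fit.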
Kubernetes Deployment
Let's deploy Gemma:27b on Kubernetes with GPU support, load balancing, and automatic scaling.
Step 1: Create Namespace and ConfigMap
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: ai-automation
labels:
name: ai-automation
compliance: hipaa
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ollama-config
namespace: ai-automation
data:
OLLAMA_HOST: "0.0.0.0:11434"
OLLAMA_MODELS: "/models"
OLLAMA_NUM_PARALLEL: "4" # Number of parallel requests
OLLAMA_MAX_LOADED_MODELS: "2" # Keep 2 models in memory
  OLLAMA_GPU_LAYERS: "999" # Offload all layers to GPU
Step 2: Create Persistent Volume for Models
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ollama-models-pvc
namespace: ai-automation
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd # Use fast SSD for model loading
resources:
requests:
      storage: 100Gi # Gemma:27b needs ~30GB; leave room for multiple models
Step 3: Deploy Ollama with Gemma:27b
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-gemma27b
namespace: ai-automation
labels:
app: ollama
model: gemma27b
spec:
replicas: 2 # Start with 2 replicas
selector:
matchLabels:
app: ollama
model: gemma27b
template:
metadata:
labels:
app: ollama
model: gemma27b
spec:
# Node affinity for GPU nodes
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu
operator: Exists
- key: gpu-type
operator: In
values:
- a100
- v100
# Pod Anti-affinity to spread across nodes
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ollama
topologyKey: kubernetes.io/hostname
containers:
- name: ollama
        image: ollama/ollama:latest # pin a specific release for reproducible deployments
imagePullPolicy: Always
ports:
- containerPort: 11434
name: http
protocol: TCP
envFrom:
- configMapRef:
name: ollama-config
resources:
requests:
memory: "24Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "1"
volumeMounts:
- name: models
mountPath: /models
# Startup probe - model loading can take time
startupProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30 # Allow 5 minutes for startup
# Liveness probe
livenessProbe:
httpGet:
path: /api/tags
port: 11434
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe
readinessProbe:
httpGet:
path: /api/tags
port: 11434
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
        # Lifecycle hook to pull the model on start
        # (postStart runs alongside the entrypoint, so wait for the server to be listening)
        lifecycle:
          postStart:
            exec:
              command:
              - "/bin/sh"
              - "-c"
              - "sleep 10 && ollama pull gemma:27b"
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models-pvc
# Security context for HIPAA compliance
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
# Toleration for GPU nodes
tolerations:
- key: nvidia.com/gpu
operator: Exists
        effect: NoSchedule
Step 4: Create Service and Ingress
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: ollama-service
namespace: ai-automation
labels:
app: ollama
spec:
type: ClusterIP
  sessionAffinity: ClientIP # sticky sessions for better cache reuse
sessionAffinityConfig:
clientIP:
timeoutSeconds: 3600 # 1 hour session stickiness
ports:
- port: 80
targetPort: 11434
protocol: TCP
name: http
selector:
app: ollama
---
# ingress.yaml (optional, for external access)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ollama-ingress
namespace: ai-automation
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
# Increase timeout for long-running generation
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
# TLS for HIPAA compliance
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
tls:
- hosts:
- ai.yourdomain.com
secretName: ollama-tls
rules:
- host: ai.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ollama-service
port:
            number: 80
Step 5: Horizontal Pod Autoscaler
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: ollama-hpa
namespace: ai-automation
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: ollama-gemma27b
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
  # Scale on GPU utilization (requires a custom metrics pipeline, e.g. DCGM exporter + prometheus-adapter)
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "75"
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60 # Quick scale up
policies:
- type: Percent
value: 100
      periodSeconds: 30
Deploy Everything
# Apply all manifests
kubectl apply -f namespace.yaml
kubectl apply -f configmap.yaml
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
# Monitor deployment
kubectl get pods -n ai-automation -w
# Check logs
kubectl logs -n ai-automation -l app=ollama --tail=100 -f
# Test the service
kubectl port-forward -n ai-automation svc/ollama-service 11434:80
# Make a test request
curl http://localhost:11434/api/generate -d '{
"model": "gemma:27b",
"prompt": "Summarize the key points from this medical report...",
"stream": false
}'
Azure GPU Infrastructure with Terraform
Now let's provision the Azure infrastructure to run our Kubernetes cluster with GPU nodes.
Terraform Configuration
# main.tf
terraform {
required_version = ">= 1.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.0"
}
}
backend "azurerm" {
resource_group_name = "terraform-state-rg"
storage_account_name = "tfstate"
container_name = "tfstate"
key = "ai-automation.tfstate"
}
}
provider "azurerm" {
features {}
}
# Variables
variable "environment" {
description = "Environment name"
type = string
default = "production"
}
variable "location" {
description = "Azure region"
type = string
default = "eastus"
}
variable "cluster_name" {
description = "AKS cluster name"
type = string
default = "ai-automation-aks"
}
variable "node_count" {
description = "Initial number of GPU nodes"
type = number
default = 2
}
variable "max_node_count" {
description = "Maximum number of GPU nodes for autoscaling"
type = number
default = 10
}
# Resource Group
resource "azurerm_resource_group" "main" {
name = "ai-automation-${var.environment}-rg"
location = var.location
tags = {
Environment = var.environment
Project = "AI-Automation"
Compliance = "HIPAA"
}
}
# Virtual Network
resource "azurerm_virtual_network" "main" {
name = "ai-automation-vnet"
address_space = ["10.0.0.0/16"]
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
}
resource "azurerm_subnet" "aks" {
name = "aks-subnet"
resource_group_name = azurerm_resource_group.main.name
virtual_network_name = azurerm_virtual_network.main.name
address_prefixes = ["10.0.1.0/24"]
}
# Log Analytics Workspace for monitoring
resource "azurerm_log_analytics_workspace" "main" {
name = "ai-automation-logs"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
sku = "PerGB2018"
  retention_in_days   = 90 # adjust to your retention policy; HIPAA documentation retention runs to six years
}
# AKS Cluster
resource "azurerm_kubernetes_cluster" "main" {
name = var.cluster_name
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = var.cluster_name
kubernetes_version = "1.28"
# System node pool (non-GPU, for system workloads)
default_node_pool {
name = "system"
node_count = 2
vm_size = "Standard_D4s_v3"
vnet_subnet_id = azurerm_subnet.aks.id
type = "VirtualMachineScaleSets"
enable_auto_scaling = true
min_count = 2
max_count = 5
node_labels = {
"workload-type" = "system"
}
node_taints = [
"CriticalAddonsOnly=true:NoSchedule"
]
}
identity {
type = "SystemAssigned"
}
# Network profile
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
service_cidr = "10.1.0.0/16"
dns_service_ip = "10.1.0.10"
}
# Enable monitoring
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
# HIPAA compliance features
role_based_access_control_enabled = true
azure_policy_enabled = true
# Enable encryption at rest
disk_encryption_set_id = azurerm_disk_encryption_set.main.id
tags = {
Environment = var.environment
Compliance = "HIPAA"
}
}
# GPU Node Pool for AI workloads
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
name = "gpupool"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_NC24ads_A100_v4" # Azure VM with NVIDIA A100
node_count = var.node_count
# Autoscaling
enable_auto_scaling = true
min_count = var.node_count
max_count = var.max_node_count
# GPU-specific labels and taints
node_labels = {
"workload-type" = "gpu"
"gpu-type" = "a100"
"nvidia.com/gpu" = "true"
}
node_taints = [
"nvidia.com/gpu=true:NoSchedule"
]
# Use premium SSD for fast model loading
os_disk_type = "Ephemeral"
os_disk_size_gb = 200
vnet_subnet_id = azurerm_subnet.aks.id
tags = {
Environment = var.environment
GPU = "A100"
}
}
# Disk Encryption Set for HIPAA compliance
resource "azurerm_key_vault" "main" {
name = "ai-automation-kv"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
tenant_id = data.azurerm_client_config.current.tenant_id
sku_name = "premium"
enabled_for_disk_encryption = true
purge_protection_enabled = true
network_acls {
default_action = "Deny"
bypass = "AzureServices"
}
}
resource "azurerm_key_vault_key" "disk_encryption" {
name = "disk-encryption-key"
key_vault_id = azurerm_key_vault.main.id
key_type = "RSA"
key_size = 4096
key_opts = [
"decrypt",
"encrypt",
"sign",
"unwrapKey",
"verify",
"wrapKey",
]
}
resource "azurerm_disk_encryption_set" "main" {
name = "ai-automation-des"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
key_vault_key_id = azurerm_key_vault_key.disk_encryption.id
identity {
type = "SystemAssigned"
}
}
# Grant DES access to Key Vault
resource "azurerm_key_vault_access_policy" "disk_encryption" {
key_vault_id = azurerm_key_vault.main.id
tenant_id = data.azurerm_client_config.current.tenant_id
object_id = azurerm_disk_encryption_set.main.identity[0].principal_id
key_permissions = [
"Get",
"WrapKey",
"UnwrapKey",
]
}
data "azurerm_client_config" "current" {}
# Container Registry for private images
resource "azurerm_container_registry" "main" {
name = "aiautomationacr"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
sku = "Premium"
admin_enabled = false
# HIPAA compliance
public_network_access_enabled = false
network_rule_bypass_option = "AzureServices"
}
# Grant AKS access to ACR
resource "azurerm_role_assignment" "aks_acr_pull" {
scope = azurerm_container_registry.main.id
role_definition_name = "AcrPull"
principal_id = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
# Outputs
output "cluster_name" {
value = azurerm_kubernetes_cluster.main.name
}
output "kube_config" {
value = azurerm_kubernetes_cluster.main.kube_config_raw
sensitive = true
}
output "cluster_endpoint" {
value = azurerm_kubernetes_cluster.main.kube_config[0].host
}
output "acr_login_server" {
value = azurerm_container_registry.main.login_server
}
Deploy Infrastructure
# Initialize Terraform
terraform init
# Review planned changes
terraform plan -out=tfplan
# Apply infrastructure
terraform apply tfplan
# Get AKS credentials
az aks get-credentials \
--resource-group ai-automation-production-rg \
--name ai-automation-aks \
--overwrite-existing
# Verify GPU nodes
kubectl get nodes -l nvidia.com/gpu=true
# Install the NVIDIA GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
# Verify GPU is available (a toleration is required because GPU nodes are tainted)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
Cost Analysis: GPU Cluster at Scale
Understanding costs is critical for budgeting private AI infrastructure. Here's a detailed breakdown:
Azure GPU VM Pricing (East US, as of Nov 2024)
| VM Size | GPU | VRAM | Hourly | Monthly (24/7) | Best For |
|---|---|---|---|---|---|
| NC6s_v3 | 1x V100 | 16GB | $3.06 | $2,203 | Dev/test, small models |
| NC12s_v3 | 2x V100 | 32GB | $6.12 | $4,406 | Medium scale |
| NC24ads_A100_v4 | 1x A100 | 80GB | $3.67 | $2,642 | Production (our choice) |
| NC48ads_A100_v4 | 2x A100 | 160GB | $7.35 | $5,292 | Large models, high throughput |
| NC96ads_A100_v4 | 4x A100 | 320GB | $14.69 | $10,577 | Maximum performance |
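The monthly column is simply the hourly rate times a 720-hour (30-day) month; a quick sanity check of the table's arithmetic:

```python
# Sanity-check the pricing table: monthly (24/7) = hourly rate x 720 hours.
# Rates are the illustrative East US figures quoted above, not live Azure prices.
HOURLY = {
    "NC6s_v3": 3.06,
    "NC24ads_A100_v4": 3.67,
    "NC96ads_A100_v4": 14.69,
}

def monthly_cost(vm_size: str, hours: int = 720) -> int:
    """Round to whole dollars, matching the table."""
    return round(HOURLY[vm_size] * hours)
```

`monthly_cost("NC24ads_A100_v4")` reproduces the $2,642 figure for the production node choice.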
Complete Monthly Cost Scenarios
Small Deployment (Startup/POC)
Medium Deployment (Growing Business)
Large Deployment (Enterprise Scale)
Cost Comparison: Self-Hosted vs API Services
Break-Even Analysis
Comparing our medium deployment ($14,470/mo) to GPT-4 API pricing ($0.03/1K tokens input, $0.06/1K tokens output):
- Average request: 500 input tokens, 500 output tokens
- Cost per API request: $0.015 + $0.030 = $0.045
- Our infrastructure: $14,470/mo for 500 req/hr capacity (~360,000 requests/month)
At that volume, the equivalent API spend is roughly $16,200/month, so self-hosting breaks even near 321,000 requests/month. Beyond raw cost, self-hosting also provides:
- Data Privacy: No data leaves your infrastructure (required for HIPAA)
- No Rate Limits: Full control over throughput
- Customization: Fine-tune models on proprietary data
- Predictable Costs: No surprise bills from usage spikes
- Latency: local inference avoids API round-trips and rate-limit queuing; hosted APIs commonly add 500-2000ms of overhead
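Under the illustrative numbers in this section (a $14,470/month cluster versus per-token API pricing), the break-even volume can be computed directly; a minimal sketch:

```python
# Break-even between a fixed self-hosted cost and per-request API pricing.
# All figures are the illustrative numbers from this section, not vendor quotes.
INFRA_MONTHLY = 14_470          # medium deployment, $/month
API_INPUT = 0.03 / 1000         # $ per input token
API_OUTPUT = 0.06 / 1000        # $ per output token

def api_cost_per_request(in_tokens: int = 500, out_tokens: int = 500) -> float:
    """Cost of one API request at the assumed token mix."""
    return in_tokens * API_INPUT + out_tokens * API_OUTPUT

def breakeven_requests_per_month() -> int:
    """Monthly request volume at which self-hosting matches API spend."""
    return round(INFRA_MONTHLY / api_cost_per_request())
```

At $0.045 per request this lands near 321,000 requests/month; above that volume the fixed cluster cost wins on price alone.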
Cost Optimization Strategies
1. Use Reserved Instances
Azure Reserved Instances offer up to 72% savings. For our medium deployment, this reduces GPU costs from $13,210/mo to approximately $3,699/mo — saving $9,511/mo.
2. Implement Spot Instances for Non-Critical Workloads
Use Spot instances for batch processing or development environments. Spot pricing offers up to 90% discount but can be interrupted with 30-second notice.
3. Aggressive Autoscaling
Scale down to minimum replicas during off-peak hours. If your workload has 50% lower demand at night, this saves ~25% on GPU costs.
4. Model Selection by Use Case
Route simple queries to Gemma:7b (2x cheaper to run) and complex ones to Gemma:27b. This can reduce average costs by 30-40% while maintaining quality.
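A rough model of that saving: if a 7B request costs about half a 27B request (the "2x cheaper" assumption above, not a benchmark), the blended cost falls with the share of traffic you can route to the smaller model:

```python
# Blended relative cost when a fraction of queries goes to a cheaper model.
# small_model_ratio = 0.5 encodes the assumed "2x cheaper" 7B-vs-27B cost ratio.
def blended_cost(simple_fraction: float, small_model_ratio: float = 0.5) -> float:
    """Average cost per request, relative to sending everything to the big model."""
    return simple_fraction * small_model_ratio + (1 - simple_fraction) * 1.0
```

Routing 60-80% of traffic to the 7B model yields a blended cost of 0.6-0.7, i.e. the 30-40% saving quoted above.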
5. Quantization
Use 4-bit or 8-bit quantized models where accuracy requirements allow. This can reduce VRAM requirements by 4x, allowing you to use smaller, cheaper VMs.
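As a back-of-the-envelope rule, weight memory is parameter count times bytes per parameter, so going from 16-bit to 4-bit cuts it roughly 4x. The estimate below covers weights only and ignores KV cache and runtime overhead:

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Real usage adds KV cache, activations, and runtime overhead on top of this.
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params (billions) x bits / 8."""
    return params_billion * bits / 8
```

For a 27B model, 16-bit weights need about 54GB while a 4-bit quantization needs about 13.5GB, which is why quantized builds fit on far cheaper VMs.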
Example Workflow: Medical Report Processing
Let's implement a complete example: processing medical reports with Gemma:27b for HIPAA-compliant automation.
import asyncio
import httpx
from datetime import datetime
from typing import Dict, List
import json
class MedicalReportProcessor:
"""
HIPAA-compliant medical report processor using private Ollama deployment.
All data stays within your infrastructure.
"""
def __init__(self, ollama_endpoint: str = "http://ollama-service.ai-automation.svc.cluster.local"):
self.endpoint = ollama_endpoint
self.model = "gemma:27b"
self.client = httpx.AsyncClient(timeout=60.0)
async def extract_structured_data(self, report_text: str) -> Dict:
"""
Extract structured data from unstructured medical reports.
"""
prompt = f"""Extract the following information from this medical report.
Return ONLY a valid JSON object with these exact fields:
{{
"patient_demographics": {{
"age": <number or null>,
"gender": "<M/F/Other or null>"
}},
"chief_complaint": "<string or null>",
"diagnosis": ["<diagnosis 1>", "<diagnosis 2>"],
"medications": ["<med 1>", "<med 2>"],
"procedures": ["<procedure 1>"],
"lab_results": {{
"<test_name>": "<value and unit>"
}},
"follow_up": "<string or null>"
}}
Medical Report:
{report_text}
JSON Output:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.1, # Low temperature for factual extraction
"num_predict": 2000,
}
}
)
result = response.json()
# Parse the JSON from model output
try:
extracted = json.loads(result["response"])
return extracted
except json.JSONDecodeError:
# Fallback: extract JSON from text
text = result["response"]
start = text.find("{")
end = text.rfind("}") + 1
return json.loads(text[start:end])
async def summarize_report(self, report_text: str, max_words: int = 200) -> str:
"""
Generate concise clinical summary.
"""
prompt = f"""Provide a concise clinical summary of this medical report in {max_words} words or less.
Focus on: diagnosis, key findings, treatment plan, and follow-up.
Medical Report:
{report_text}
Clinical Summary:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.3,
"num_predict": 300,
}
}
)
return response.json()["response"]
async def classify_urgency(self, report_text: str) -> str:
"""
Classify report urgency for triage.
"""
prompt = f"""Classify the urgency of this medical report into one of these categories:
- CRITICAL: Life-threatening condition requiring immediate attention
- URGENT: Serious condition requiring prompt attention within hours
- ROUTINE: Standard follow-up or non-urgent matter
Respond with ONLY the category name (CRITICAL, URGENT, or ROUTINE).
Medical Report:
{report_text}
Urgency:"""
response = await self.client.post(
f"{self.endpoint}/api/generate",
json={
"model": self.model,
"prompt": prompt,
"stream": False,
"options": {
"temperature": 0.0, # Deterministic for classification
"num_predict": 10,
}
}
)
urgency = response.json()["response"].strip().upper()
# Validate output
if urgency not in ["CRITICAL", "URGENT", "ROUTINE"]:
return "ROUTINE" # Default to routine if unclear
return urgency
async def process_report(self, report_text: str) -> Dict:
"""
Full pipeline: extract data, summarize, and classify.
"""
# Run all tasks concurrently for speed
structured_data, summary, urgency = await asyncio.gather(
self.extract_structured_data(report_text),
self.summarize_report(report_text),
self.classify_urgency(report_text)
)
return {
"structured_data": structured_data,
"summary": summary,
"urgency": urgency,
"processed_at": datetime.utcnow().isoformat()
}
async def batch_process(self, reports: List[str], max_concurrent: int = 5) -> List[Dict]:
"""
Process multiple reports with concurrency control.
"""
semaphore = asyncio.Semaphore(max_concurrent)
async def process_with_semaphore(report):
async with semaphore:
return await self.process_report(report)
tasks = [process_with_semaphore(report) for report in reports]
return await asyncio.gather(*tasks)
# Example usage
async def main():
processor = MedicalReportProcessor()
sample_report = """
Patient: John Doe, 58-year-old male
Chief Complaint: Chest pain and shortness of breath
History: Patient presents with substernal chest pain radiating to left arm,
onset 2 hours ago. Associated with diaphoresis and nausea.
Exam: BP 160/95, HR 102, RR 22, SpO2 94% on room air
Cardiovascular: Tachycardic, regular rhythm, no murmurs
Respiratory: Bilateral crackles at bases
EKG: ST elevations in leads II, III, aVF
Troponin: 2.4 ng/mL (elevated)
Assessment: Acute inferior wall myocardial infarction
Plan:
- Aspirin 325mg PO given
- Heparin bolus and drip initiated
- Cardiology consulted for urgent catheterization
- Transfer to CCU
"""
result = await processor.process_report(sample_report)
print("Urgency:", result["urgency"])
print("\nSummary:", result["summary"])
print("\nStructured Data:", json.dumps(result["structured_data"], indent=2))
if __name__ == "__main__":
    asyncio.run(main())
Monitoring and Observability
Production AI systems require comprehensive monitoring:
# prometheus-servicemonitor.yaml
# Note: Ollama does not expose a Prometheus /metrics endpoint natively;
# this manifest assumes a metrics exporter sidecar serving /metrics on the pod.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ollama-metrics
namespace: ai-automation
spec:
selector:
matchLabels:
app: ollama
endpoints:
- port: http
path: /metrics
interval: 30s
---
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: ollama-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "AI Automation - Ollama Performance",
"panels": [
{
"title": "Requests per Second",
"targets": [
{
"expr": "rate(ollama_requests_total[5m])"
}
]
},
{
"title": "GPU Utilization",
"targets": [
{
"expr": "nvidia_gpu_duty_cycle"
}
]
},
{
"title": "Average Response Time",
"targets": [
{
"expr": "rate(ollama_request_duration_seconds_sum[5m]) / rate(ollama_requests_total[5m])"
}
]
},
{
"title": "Model Memory Usage",
"targets": [
{
"expr": "ollama_model_memory_bytes"
}
]
}
]
}
    }
Security Best Practices
For HIPAA compliance and general security:
- Encryption at rest: Use Azure Disk Encryption with customer-managed keys for all persistent storage
- Encryption in transit: TLS 1.3 for all network communications, including internal cluster traffic
- Network isolation: Deploy in private VNet, use Azure Private Link for external access
- Access control: Azure RBAC for cluster access, Kubernetes RBAC for workload permissions
- Audit logging: Enable Azure Monitor audit logs and retain them per your compliance program (HIPAA documentation retention requirements extend to six years)
- Vulnerability scanning: Use Azure Defender for containers, scan images before deployment
- Secrets management: Store credentials in Azure Key Vault, inject via CSI driver
- Network policies: Implement Calico policies to restrict pod-to-pod communication
Conclusion
Private AI automation with open-source models is not only feasible but often more cost-effective than API services for high-volume applications. With proper infrastructure, organizations can achieve:
- Complete data sovereignty and HIPAA compliance
- Predictable, optimized costs at scale
- Lower latency and higher throughput
- Customization capabilities for domain-specific needs
- Independence from third-party API providers
At Subzero Research, we help organizations design, deploy, and optimize private AI infrastructure for healthcare, finance, and government applications. Our expertise spans distributed systems, GPU optimization, and regulatory compliance.
Need Help with Your Private AI Infrastructure?
We provide consulting services for architecture design, HIPAA compliance, and deployment of private AI systems.
Contact Us