Statistical Components and Metrics Guide
Purpose
This document explains the statistical components used in this platform and, more importantly, why each one matters for this specific claim-risk product.
It is not a generic ML glossary. Each metric is tied to platform behavior, operational decisions, and release governance.
1. Statistical Components Used in the Platform
1.1 Supervised model families
- Denial risk (`POST /v1/score` with `workflow="healthcare.denial"`): binary classification model (LightGBM) producing a denial probability.
- Prior auth (`POST /v1/score` with `workflow="healthcare.prior_auth"`): binary classification model (LightGBM) producing the probability that prior auth is required/at risk.
- Reimbursement (`POST /v1/score` with `workflow="healthcare.reimbursement"`): regression model (LightGBM) producing expected allowed amount.
- Appeals (`POST /v1/claims/appeals/generate`): deterministic structured generation with quality checks (not a probability classifier).
Unified response contract notes:
- top-level `score` is standardized across workflows
- for denial/prior-auth, `score` is a calibrated probability
- for reimbursement, `score` is normalized risk (`1 - reimbursement_ratio`)
- workflow-specific numeric outputs are nested under `details.denial`, `details.prior_auth`, or `details.reimbursement`
- denial details include explicit routing metadata (`operating_point`, `distribution_profile`, `routing_source`)
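To make the contract concrete, here is a minimal sketch of reading the unified response shape described above. The `sample` payload is a hypothetical illustration, not the authoritative schema; only the fields named in this section (`score`, `details.denial`, and the three routing metadata keys) are taken from the document.

```python
def extract_denial_routing(response: dict) -> dict:
    """Pull the standardized score plus denial routing metadata
    from a unified scoring response (hypothetical payload shape)."""
    details = response.get("details", {}).get("denial", {})
    return {
        "score": response["score"],  # calibrated denial probability
        "operating_point": details.get("operating_point"),
        "distribution_profile": details.get("distribution_profile"),
        "routing_source": details.get("routing_source"),
    }

# Illustrative response body (field values are assumptions)
sample = {
    "score": 0.71,
    "details": {
        "denial": {
            "operating_point": "balanced",
            "distribution_profile": "commercial_beta",
            "routing_source": "env_default",
        }
    },
}
print(extract_denial_routing(sample)["operating_point"])  # balanced
```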
1.2 Feature engineering and interaction terms
The platform uses claim context features and engineered interactions, including:
- payer and CPT interaction keys
- modifier flags and network status
- billed amount transforms (`log1p`, per-unit features)
- top-payer interaction features in prior auth
Why this matters here:
- payer behavior is heterogeneous; interactions are required to capture payer-specific denial/auth patterns
- pure global features underperform on high-volume payer segments
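A minimal sketch of the feature types listed above. The input field names (`payer_id`, `cpt`, `billed_amount`, `units`, `network_status`) and the exact key format are illustrative assumptions; only the feature categories (payer and CPT interaction keys, network flags, `log1p` and per-unit amount transforms) come from this section.

```python
import math

def engineer_features(claim: dict) -> dict:
    """Sketch of claim-context feature engineering (field names assumed)."""
    billed = claim["billed_amount"]
    units = max(claim.get("units", 1), 1)
    return {
        # payer x CPT interaction key: lets the model learn payer-specific CPT behavior
        "payer_cpt_key": f'{claim["payer_id"]}|{claim["cpt"]}',
        "log1p_billed": math.log1p(billed),   # compress heavy-tailed billed amounts
        "billed_per_unit": billed / units,    # per-unit amount feature
        "in_network": int(claim.get("network_status") == "in_network"),
    }

feats = engineer_features(
    {"payer_id": "PAYER_A", "cpt": "99213", "billed_amount": 250.0,
     "units": 2, "network_status": "in_network"}
)
print(feats["payer_cpt_key"])  # PAYER_A|99213
```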
1.3 Probability calibration
Denial probabilities are post-hoc calibrated (sigmoid vs isotonic) and the selected calibrator artifact is locked. The current frozen calibration selection is isotonic, with lock data in `tests/contracts/calibration_artifact_snapshot.json`.
Why this matters here:
- downstream actions are threshold-driven; uncalibrated probabilities cause poor triage and unstable review queues
- calibration quality directly affects trust in "0.71 means high risk" style decisions
- calibration method/version are exposed in denial and prior-auth responses (`calibration_method`, `calibration_version`) so field debugging matches runtime logs
Concrete example:
- if the model predicts `0.70` denial risk across `100` similar claims, about `70` should actually deny; ECE measures how close observed outcomes are to that expectation.
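The ECE intuition above can be written out directly. This is a standard equal-width binned ECE sketch, not the platform's evaluation code:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted |mean predicted - observed rate| per probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)   # mean predicted probability
        obs = sum(y for _, y in bucket) / len(bucket)     # observed positive rate
        ece += (len(bucket) / n) * abs(avg_p - obs)       # weight by bin population
    return ece

# 100 claims scored 0.70, exactly 70 deny: perfectly calibrated, ECE = 0
print(expected_calibration_error([0.70] * 100, [1] * 70 + [0] * 30))  # 0.0
```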
1.4 Thresholding and operating points
Denial scoring supports explicit operating modes:
- `synthetic_v1`: `high_recall=0.21`, `balanced=0.25`, `high_precision=0.31`
- `cms_v1`: `high_recall=0.24`, `balanced=0.28`, `high_precision=0.34`
- `commercial_beta`: `high_recall=0.22`, `balanced=0.26`, `high_precision=0.32`
Pinned regime model versions:
- `synthetic_v1`: denial model version `0.3.3`
- `cms_v1`: denial model version `0.3.3`
- `commercial_beta`: denial model version `0.3.3` (default profile)
Configured via:
- `DENIAL_OPERATING_POINT`
- `DENIAL_DEFAULT_DISTRIBUTION_PROFILE`
- `DENIAL_OPERATING_THRESHOLD` (explicit override)
Why this matters here:
- practices with limited reviewer capacity may choose higher precision
- practices focused on denial avoidance may choose higher recall
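The profile/operating-point lookup can be sketched as follows, using the threshold values listed in this section. The function shape is an illustration (real configuration flows through the env vars above), and it assumes flag-at-or-above-threshold semantics:

```python
# Threshold values copied from the operating-point table in this section
THRESHOLDS = {
    "synthetic_v1":    {"high_recall": 0.21, "balanced": 0.25, "high_precision": 0.31},
    "cms_v1":          {"high_recall": 0.24, "balanced": 0.28, "high_precision": 0.34},
    "commercial_beta": {"high_recall": 0.22, "balanced": 0.26, "high_precision": 0.32},
}

def route_claim(score, profile="commercial_beta",
                operating_point="balanced", explicit_override=None):
    """Return True when a claim should be flagged for review (sketch).
    explicit_override mirrors the DENIAL_OPERATING_THRESHOLD escape hatch."""
    if explicit_override is not None:
        threshold = explicit_override
    else:
        threshold = THRESHOLDS[profile][operating_point]
    return score >= threshold

print(route_claim(0.27))                                    # True  (0.27 >= 0.26)
print(route_claim(0.27, operating_point="high_precision"))  # False (0.27 <  0.32)
```

A capacity-constrained practice would pass `operating_point="high_precision"`; a denial-avoidance-focused practice would pass `"high_recall"`.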
1.5 Segment support gating
Support profiles gate confidence by segment using sample volume and quality thresholds:
- prior auth: payer-level thresholds using sample count, positive count, and AUC
- reimbursement: CPT-level thresholds using sample count and R2
Runtime support logic lives in `app/support_profiles.py`.
Why this matters here:
- avoids over-claiming quality on low-volume long-tail payers/CPTs
- keeps pilots honest and commercially defensible
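A hypothetical sketch of what segment gating looks like for prior auth. The threshold values and level names below are assumptions for illustration; the authoritative logic lives in `app/support_profiles.py`.

```python
def prior_auth_support_level(sample_count, positive_count, auc,
                             min_samples=200, min_positives=25, min_auc=0.80):
    """Gate confidence for a payer segment on volume and quality (illustrative
    thresholds; real values come from the platform's support profiles)."""
    if sample_count < min_samples or positive_count < min_positives:
        return "low"       # too little data to claim quality on this segment
    if auc < min_auc:
        return "medium"    # enough data, but weak discrimination
    return "high"

print(prior_auth_support_level(500, 60, 0.93))  # high
print(prior_auth_support_level(40, 5, 0.99))    # low: long-tail payer, AUC not trusted
```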
1.6 Drift diagnostics with controlled degradation
Prior auth tracks payer-level drift on:
- CPT mix
- network status
- billed amount band
Drift score uses weighted total variation distance and triggers confidence downgrade when threshold is exceeded.
Runtime logic lives in:
- `app/payer_drift.py`
- `app/prior_auth.py`
Why this matters here:
- payer policy/behavior shifts can break segment performance without changing global metrics
- controlled downgrade is safer than silently serving overconfident scores
If drift persists beyond configured cooldown windows or repeatedly exceeds thresholds, retraining is triggered under the governance cadence and incident process described in docs/model-validation-governance-summary-v1.0.md.
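A weighted total-variation drift score over the three tracked dimensions can be sketched as below. The dimension weights and the example distributions are illustrative assumptions; the authoritative implementation is in `app/payer_drift.py`.

```python
def tvd(p, q):
    """Total variation distance: half the L1 distance between two
    categorical distributions given as {category: probability} dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_score(baseline, current, weights=None):
    """Weighted TVD across tracked dimensions (weights are assumptions)."""
    weights = weights or {"cpt_mix": 0.5, "network": 0.25, "amount_band": 0.25}
    return sum(w * tvd(baseline[dim], current[dim]) for dim, w in weights.items())

baseline = {
    "cpt_mix": {"99213": 0.6, "99214": 0.4},
    "network": {"in": 0.8, "out": 0.2},
    "amount_band": {"low": 0.5, "mid": 0.3, "high": 0.2},
}
current = {
    "cpt_mix": {"99213": 0.3, "99214": 0.7},  # notable CPT-mix shift
    "network": {"in": 0.8, "out": 0.2},
    "amount_band": {"low": 0.5, "mid": 0.3, "high": 0.2},
}
score = drift_score(baseline, current)
print(round(score, 3))  # 0.15 -> downgrade confidence if above configured threshold
```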
1.7 Stratified simulation and bootstrap uncertainty
Evaluation uses stratified sampling with fixed positive targets plus bootstrap confidence intervals.
Core evaluation implementation:
- `scripts/simulate_api_validation.py`
- `scripts/simulate_denial_targets.py`
- `scripts/simulate_reimbursement_targets.py`
- `scripts/cross_model_integrity_report.py`
Why this matters here:
- healthcare labels are imbalanced; random sampling can hide failures
- confidence intervals are required to separate real signal from run-to-run noise
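The bootstrap idea is simple to demonstrate. The sketch below computes a percentile bootstrap CI for the mean of an imbalanced label sample; the real evaluation scripts bootstrap AUC and top-k metrics rather than a raw mean.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of `values`:
    resample with replacement, collect means, take percentile bounds."""
    rng = random.Random(seed)  # fixed seed for reproducible evaluation
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

labels = [1] * 30 + [0] * 70   # imbalanced outcome sample, base rate 0.30
lo, hi = bootstrap_ci(labels)
print(lo <= 0.30 <= hi)  # True: the point estimate sits inside its own CI
```

Two runs whose CIs overlap heavily should not be read as a real quality change; that is exactly the run-to-run noise this section warns about.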
1.8 Outcome feedback telemetry loop
Operational endpoint: `POST /v1/feedback`
Feedback is treated as structured telemetry (not ticketing), linked to prior scoring calls using `request_id` from the `x-request-id` header. Runtime validation enforces:
- `request_id` exists and is recent
- endpoint match (`denial`, `prior_auth`, `reimbursement`)
- caller ownership match
- short note format with PHI-like pattern blocking
Why this matters here:
- provides real-world calibration error signal earlier than aggregate lagging outcomes
- enables mismatch-rate tracking by payer/CPT/outcome class
- supports confidence downgrade and drift sensitivity tuning using production feedback patterns
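As a purely hypothetical sketch of the "short note format with PHI-like pattern blocking" check: the length limit and the two regex patterns below (SSN-shaped strings and long digit runs) are illustrative assumptions, not the patterns the runtime actually enforces.

```python
import re

# Illustrative PHI-like patterns (assumptions, not the production rule set)
PHI_LIKE = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped token
    re.compile(r"\b\d{9,}\b"),             # long identifier-like digit run
]

def feedback_note_ok(note, max_len=280):
    """Accept a short free-text note; reject when it looks like it carries PHI."""
    if len(note) > max_len:
        return False
    return not any(p.search(note) for p in PHI_LIKE)

print(feedback_note_ok("Payer upheld denial on resubmission"))  # True
print(feedback_note_ok("Member SSN 123-45-6789 attached"))      # False
```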
1.9 Latest benchmark context (2026-02-17 UTC)
Primary validation artifacts:
- local full 5-seed pack: `artifacts/test-reports/local-full-5seed-validation-20260216T225006Z.md`
- Fly full 5-seed fast pack: `artifacts/test-reports/fly-full-5seed-validation-20260216T235231Z.md`
- Fly dual-machine rerun: `artifacts/test-reports/fly-two-machine-rerun-20260217T004535Z.md`
Statistical interpretation:
- Local: denial AUC mean `0.7972`, prior-auth AUC mean `0.9708`, reimbursement R2 mean `0.7967`, p95 latency `66.87 ms`.
- Fly fast pack: denial AUC mean `0.8489`, prior-auth AUC mean `0.9436`, reimbursement MAPE mean `0.1067`, p95 latency `3482.47 ms`.
- Fly dual-machine rerun: denial AUC `0.7687`, prior-auth AUC `0.9364`, reimbursement R2 `0.8089`, p95 latency `529.02 ms`.
Operational takeaway:
- model quality metrics remain broadly stable across environments;
- latency profile is the dominant environment-sensitive variable;
- two-machine API scaling improved Fly latency materially vs single-machine runs, but did not clear the strict `< 500 ms` p95 gate in the latest rerun.
2. Acronym Glossary with Platform Relevance
2.1 Classification discrimination and ranking metrics
- AUC (Area Under ROC Curve):
- Definition: probability the model ranks a random positive above a random negative.
- Platform relevance: indicates how well denial/prior-auth models rank risky claims for triage queues; high AUC improves concentration of true denials in the top risk bucket.
- ROC (Receiver Operating Characteristic):
- Definition: curve of true positive rate vs false positive rate across thresholds.
- Platform relevance: supports threshold strategy discussions (`high_recall` vs `high_precision`).
- AP (Average Precision):
- Definition: area under the precision-recall curve.
- Platform relevance: useful under class imbalance and complements AUC for denial/prior-auth risk ranking quality.
2.2 Calibration metrics
- ECE (Expected Calibration Error):
- Definition: weighted average absolute difference between predicted probability and observed frequency in bins.
- Platform relevance: ensures probability outputs can be used operationally; low ECE means risk probabilities are credible for staffing and policy decisions.
- Brier Score:
- Definition: mean squared error between predicted probability and binary outcome.
- Platform relevance: tracks overall probabilistic accuracy; used alongside ECE to detect confidence misalignment.
2.3 Top-k operations metrics
- Top-20 Capture:
- Definition: percent of all positives found in the top 20 percent highest-risk predictions.
- Platform relevance: primary triage metric; shows how much denial volume can be intercepted by reviewing only the riskiest subset.
- Top-20 Precision:
- Definition: percent of claims inside the top 20 percent risk bucket that are actual positives.
- Platform relevance: estimates review efficiency; higher precision means fewer wasted manual reviews.
- Lift:
- Definition: bucket precision divided by base positive rate.
- Platform relevance: prevalence-aware business value metric; especially important for prior auth where base rates are lower than denial.
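The three top-k metrics above can be computed together from one ranked pass over scored claims. A worked sketch with a tiny synthetic sample:

```python
def topk_metrics(scores, labels, frac=0.20):
    """Compute top-k capture, precision, and lift for the top `frac` of claims."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * frac))
    bucket = [y for _, y in ranked[:k]]       # labels of the riskiest k claims
    total_pos = sum(labels)
    base_rate = total_pos / len(labels)
    capture = sum(bucket) / total_pos         # share of all positives intercepted
    precision = sum(bucket) / k               # review efficiency in the bucket
    lift = precision / base_rate              # prevalence-aware value multiplier
    return capture, precision, lift

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]
capture, precision, lift = topk_metrics(scores, labels)
print(round(capture, 3), precision, round(lift, 2))  # 0.667 1.0 3.33
```

Reading the output: reviewing only the top 2 of 10 claims intercepts two of the three true positives, with no wasted reviews, at 3.33x the base positive rate.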
2.4 Regression metrics (reimbursement)
- MAPE (Mean Absolute Percentage Error):
- Definition: mean absolute percent error between predicted and actual allowed amount.
- Platform relevance: primary business gate because finance teams care about percent forecast error.
- R2 (Coefficient of Determination):
- Definition: fraction of variance explained relative to mean baseline.
- Platform relevance: useful diagnostic, but secondary gate because it can be noisy at segment/seed level.
- MAE (Mean Absolute Error):
- Definition: average absolute dollar error.
- Platform relevance: shows absolute miss size in currency terms.
- RMSE (Root Mean Squared Error):
- Definition: square root of mean squared error; penalizes larger misses.
- Platform relevance: sensitivity metric for outlier reimbursement misses.
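All four regression metrics above follow directly from their definitions. A worked sketch on three illustrative allowed amounts:

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE, MAPE, and R2 from their textbook definitions."""
    n = len(actual)
    errs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errs) / n                        # mean dollar miss
    rmse = math.sqrt(sum(e * e for e in errs) / n)             # penalizes big misses
    mape = sum(abs(e) / a for a, e in zip(actual, errs)) / n   # assumes actual > 0
    mean_a = sum(actual) / n
    ss_res = sum(e * e for e in errs)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                                   # vs mean baseline
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Illustrative allowed amounts: actual vs predicted
m = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
print(round(m["MAPE"], 4))  # 0.0667 -> a ~6.7% mean percent forecast error
```

Note why MAPE is the business gate: a $20 miss on a $400 claim and a $10 miss on a $100 claim contribute very differently in percent terms, which is how finance teams read forecast error.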
2.5 Uncertainty, stability, and runtime metrics
- CI (Confidence Interval):
- Definition: uncertainty range around a metric estimate (here from bootstrap resampling).
- Platform relevance: prevents overinterpreting one strong or weak run; used in cross-model integrity checks.
- Bootstrap:
- Definition: repeated sampling with replacement to estimate metric variability.
- Platform relevance: quantifies uncertainty of AUC/top-k metrics in stratified runs.
- p95 latency:
- Definition: 95th percentile response time.
- Platform relevance: hard runtime SLO-style check; API quality is not acceptable if model quality is high but inference latency is unstable.
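For reference, p95 over a batch of raw latencies can be computed with the nearest-rank method sketched below; the production `/metrics` endpoint derives percentiles from Prometheus histograms rather than raw samples.

```python
import math

def p95(latencies_ms):
    """95th percentile via the nearest-rank method over raw samples."""
    ranked = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1]

samples = list(range(1, 101))  # 1..100 ms, one request each
print(p95(samples))  # 95
```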
2.6 Drift and distribution metrics
- TVD (Total Variation Distance):
- Definition: half the sum of absolute distribution differences across categories.
- Platform relevance: used in payer drift scoring to detect shifts in CPT mix/network/amount bands and trigger confidence downgrade.
3. Why AUC Matters in This Platform (Specific Example)
AUC is relevant here because the platform is a triage engine, not just a yes/no classifier.
In denial and prior auth workflows:
- claims are ranked by risk score
- teams review a constrained fraction of claims (for example top 10-20 percent)
- business value depends on concentration of true problems in that reviewed subset
A higher AUC usually improves top-bucket concentration, which drives:
- higher top-20 capture
- higher lift
- better reviewer ROI
Important caveat:
- AUC alone is insufficient. This platform also enforces calibration (ECE), top-k business metrics, and segment stability checks.
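The ranking interpretation of AUC can be computed directly from its definition (probability a random positive outranks a random negative), which is what makes it a natural triage metric. A small sketch, counting ties as half-wins:

```python
def auc(scores, labels):
    """AUC as the pairwise ranking probability: fraction of (positive,
    negative) pairs where the positive is scored higher (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]
print(round(auc(scores, labels), 3))  # 0.833: 5 of 6 pos/neg pairs ranked correctly
```

This pairwise view also shows the caveat concretely: AUC is invariant to any monotone rescaling of scores, so it says nothing about whether `0.71` actually means 71 percent risk; that is what calibration checks add.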
4. How Metrics Map to Business Decisions
- Denial endpoint:
- AUC and top-k metrics decide whether risk ranking is useful.
- ECE decides whether score magnitudes can drive threshold policy.
- Operating point (`0.22`, `0.26`, `0.32` under the default `commercial_beta` profile) decides the recall/precision tradeoff.
- Prior auth endpoint:
- AUC + lift determine triage quality on low-prevalence outcomes.
- support level + drift diagnostics decide confidence messaging and routing intensity.
- Reimbursement endpoint:
- mean MAPE and the monthly within-10-percent rate are the primary business gates.
- R2 is secondary diagnostic for model explainability and variance tracking.
- Appeals endpoint:
- structural pass/completeness/citation metrics ensure generated output is submission-ready and safe.
5. Current Governance and Metric Freezing
Model behavior is protected by frozen lock artifacts checked at runtime startup:
- model artifact lock: `tests/contracts/model_artifact_hashes.json`
- model version lock: `tests/contracts/model_version_pins.json`
- calibration lock: `tests/contracts/calibration_artifact_snapshot.json`
- threshold lock: `tests/contracts/threshold_config_snapshot.json`
- evaluation config lock: `tests/contracts/evaluation_config_snapshot.json`
Why this matters here:
- prevents silent drift from accidental artifact swaps or threshold edits
- keeps simulation evidence aligned with production scoring behavior
Operational deployment probes:
- `GET /health` for runtime metadata visibility
- `GET /readyz` for load-balancer readiness decisions (`200` ready, `503` not ready)
- `GET /metrics` for Prometheus histogram/counter scraping (source of truth, public, `text/plain; version=0.0.4; charset=utf-8`)
- `GET /metrics.json` for compact operator debugging (`metrics_schema_version: "1"`, requires `metrics:read`)
What We Do Not Optimize For
This platform does not optimize for maximizing raw AUC at the expense of calibration quality, stability, or segment-level fairness/support integrity.
6. Practical Reading Order for Non-Statisticians
If you want the fastest path to understanding platform quality:
- Look at top-20 capture and lift first (operational value).
- Check ECE second (probability trustworthiness).
- Check feedback mismatch rates (`POST /v1/feedback` telemetry) by payer/CPT third.
- Check segment support and drift status fourth (where confidence should be downgraded).
- Check latency and stability fifth (production reliability).
- Use AUC/R2 as supporting diagnostics, not standalone go/no-go signals.
7. Scope and Intended Use
These statistics support revenue cycle decision-support workflows. They do not replace human adjudication, clinical judgment, legal review, or payer-final determination.