Statistical Components and Metrics Guide
Purpose
This document explains the statistical components used in this platform and, more importantly, why each one matters for this specific claim-risk product.
It is not a generic ML glossary. Each metric is tied to platform behavior, operational decisions, and release governance.
1. Statistical Components Used in the Platform
1.1 Supervised model families
- Denial risk (`POST /v1/score` with `workflow="healthcare.denial"`): binary classification model (LightGBM) producing a denial probability.
- Prior auth (`POST /v1/score` with `workflow="healthcare.prior_auth"`): binary classification model (LightGBM) producing the probability that prior auth is required/at risk.
- Reimbursement (`POST /v1/score` with `workflow="healthcare.reimbursement"`): regression model (LightGBM) producing expected allowed amount.
- Appeals (`POST /v1/claims/appeals/generate`): deterministic structured generation with quality checks (not a probability classifier).
Unified response contract notes:
- top-level `score` is standardized across workflows
- for denial/prior-auth, `score` is a calibrated probability
- for reimbursement, `score` is normalized risk (`1 - reimbursement_ratio`)
- workflow-specific numeric outputs are nested under `details.denial`, `details.prior_auth`, or `details.reimbursement`
- denial details include explicit routing metadata (`operating_point`, `distribution_profile`, `routing_source`)
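To make the contract concrete, here is a minimal sketch of reading the unified response shape described above. The `sample` payload is a hypothetical illustration, not the authoritative schema; only the fields named in this section (`score`, `details.denial`, and the three routing metadata keys) are taken from the document.

```python
def extract_denial_routing(response: dict) -> dict:
    """Pull the standardized score plus denial routing metadata
    from a unified scoring response (hypothetical payload shape)."""
    details = response.get("details", {}).get("denial", {})
    return {
        "score": response["score"],  # calibrated denial probability
        "operating_point": details.get("operating_point"),
        "distribution_profile": details.get("distribution_profile"),
        "routing_source": details.get("routing_source"),
    }

# Illustrative response body (field values are assumptions)
sample = {
    "score": 0.71,
    "details": {
        "denial": {
            "operating_point": "balanced",
            "distribution_profile": "commercial_beta",
            "routing_source": "env_default",
        }
    },
}
print(extract_denial_routing(sample)["operating_point"])  # balanced
```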
1.2 Feature engineering and interaction terms
The platform uses claim context features and engineered interactions, including:
- payer and CPT interaction keys
- modifier flags and network status
- billed amount transforms (`log1p`, per-unit features)
- top-payer interaction features in prior auth
Why this matters here:
- payer behavior is heterogeneous; interactions are required to capture payer-specific denial/auth patterns
- pure global features underperform on high-volume payer segments
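A minimal sketch of the feature types listed above. The input field names (`payer_id`, `cpt`, `billed_amount`, `units`, `network_status`) and the exact key format are illustrative assumptions; only the feature categories (payer and CPT interaction keys, network flags, `log1p` and per-unit amount transforms) come from this section.

```python
import math

def engineer_features(claim: dict) -> dict:
    """Sketch of claim-context feature engineering (field names assumed)."""
    billed = claim["billed_amount"]
    units = max(claim.get("units", 1), 1)
    return {
        # payer x CPT interaction key: lets the model learn payer-specific CPT behavior
        "payer_cpt_key": f'{claim["payer_id"]}|{claim["cpt"]}',
        "log1p_billed": math.log1p(billed),   # compress heavy-tailed billed amounts
        "billed_per_unit": billed / units,    # per-unit amount feature
        "in_network": int(claim.get("network_status") == "in_network"),
    }

feats = engineer_features(
    {"payer_id": "PAYER_A", "cpt": "99213", "billed_amount": 250.0,
     "units": 2, "network_status": "in_network"}
)
print(feats["payer_cpt_key"])  # PAYER_A|99213
```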
1.3 Probability calibration
Denial probabilities are post-hoc calibrated (sigmoid vs isotonic) and the selected calibrator artifact is locked. The current frozen calibration selection is isotonic, with lock data in `tests/contracts/calibration_artifact_snapshot.json`.
Why this matters here:
- downstream actions are threshold-driven; uncalibrated probabilities cause poor triage and unstable review queues
- calibration quality directly affects trust in "0.71 means high risk" style decisions
- calibration method/version are exposed in denial and prior-auth responses (`calibration_method`, `calibration_version`) so field debugging matches runtime logs
Concrete example:
- if the model predicts `0.70` denial risk across `100` similar claims, about `70` should actually deny; ECE measures how close observed outcomes are to that expectation.
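The ECE intuition above can be written out directly. This is a standard equal-width binned ECE sketch, not the platform's evaluation code:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted |mean predicted - observed rate| per probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)   # mean predicted probability
        obs = sum(y for _, y in bucket) / len(bucket)     # observed positive rate
        ece += (len(bucket) / n) * abs(avg_p - obs)       # weight by bin population
    return ece

# 100 claims scored 0.70, exactly 70 deny: perfectly calibrated, ECE = 0
print(expected_calibration_error([0.70] * 100, [1] * 70 + [0] * 30))  # 0.0
```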
1.4 Thresholding and operating points
Denial scoring supports explicit operating modes:
- `synthetic_v1`: `high_recall=0.21`, `balanced=0.25`, `high_precision=0.31`
- `cms_v1`: `high_recall=0.24`, `balanced=0.28`, `high_precision=0.34`
- `commercial_beta`: `high_recall=0.22`, `balanced=0.26`, `high_precision=0.32`
Pinned regime model versions:
- `synthetic_v1`: denial model version `0.3.3`
- `cms_v1`: denial model version `0.3.3`
- `commercial_beta`: denial model version `0.3.3` (default profile)
Configured via:
- `DENIAL_OPERATING_POINT`
- `DENIAL_DEFAULT_DISTRIBUTION_PROFILE`
- `DENIAL_OPERATING_THRESHOLD` (explicit override)
Why this matters here:
- practices with limited reviewer capacity may choose higher precision
- practices focused on denial avoidance may choose higher recall
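The profile/operating-point lookup can be sketched as follows, using the threshold values listed in this section. The function shape is an illustration (real configuration flows through the env vars above), and it assumes flag-at-or-above-threshold semantics:

```python
# Threshold values copied from the operating-point table in this section
THRESHOLDS = {
    "synthetic_v1":    {"high_recall": 0.21, "balanced": 0.25, "high_precision": 0.31},
    "cms_v1":          {"high_recall": 0.24, "balanced": 0.28, "high_precision": 0.34},
    "commercial_beta": {"high_recall": 0.22, "balanced": 0.26, "high_precision": 0.32},
}

def route_claim(score, profile="commercial_beta",
                operating_point="balanced", explicit_override=None):
    """Return True when a claim should be flagged for review (sketch).
    explicit_override mirrors the DENIAL_OPERATING_THRESHOLD escape hatch."""
    if explicit_override is not None:
        threshold = explicit_override
    else:
        threshold = THRESHOLDS[profile][operating_point]
    return score >= threshold

print(route_claim(0.27))                                    # True  (0.27 >= 0.26)
print(route_claim(0.27, operating_point="high_precision"))  # False (0.27 <  0.32)
```

A capacity-constrained practice would pass `operating_point="high_precision"`; a denial-avoidance-focused practice would pass `"high_recall"`.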
1.5 Segment support gating
Support profiles gate confidence by segment using sample volume and quality thresholds:
- prior auth: payer-level thresholds using sample count, positive count, and AUC
- reimbursement: CPT-level thresholds using sample count and R2
Runtime support logic lives in `app/support_profiles.py`.
Why this matters here:
- avoids over-claiming quality on low-volume long-tail payers/CPTs
- keeps pilots honest and commercially defensible
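A hypothetical sketch of what segment gating looks like for prior auth. The threshold values and level names below are assumptions for illustration; the authoritative logic lives in `app/support_profiles.py`.

```python
def prior_auth_support_level(sample_count, positive_count, auc,
                             min_samples=200, min_positives=25, min_auc=0.80):
    """Gate confidence for a payer segment on volume and quality (illustrative
    thresholds; real values come from the platform's support profiles)."""
    if sample_count < min_samples or positive_count < min_positives:
        return "low"       # too little data to claim quality on this segment
    if auc < min_auc:
        return "medium"    # enough data, but weak discrimination
    return "high"

print(prior_auth_support_level(500, 60, 0.93))  # high
print(prior_auth_support_level(40, 5, 0.99))    # low: long-tail payer, AUC not trusted
```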
1.6 Drift diagnostics with controlled degradation
Prior auth tracks payer-level drift on:
- CPT mix
- network status
- billed amount band
Drift score uses weighted total variation distance and triggers confidence downgrade when threshold is exceeded.
Runtime logic lives in:
- `app/payer_drift.py`
- `app/prior_auth.py`
Why this matters here:
- payer policy/behavior shifts can break segment performance without changing global metrics
- controlled downgrade is safer than silently serving overconfident scores
If drift persists beyond configured cooldown windows or repeatedly exceeds thresholds, retraining is triggered under the governance cadence and incident process described in docs/model-validation-governance-summary-v1.0.md.
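A weighted total-variation drift score over the three tracked dimensions can be sketched as below. The dimension weights and the example distributions are illustrative assumptions; the authoritative implementation is in `app/payer_drift.py`.

```python
def tvd(p, q):
    """Total variation distance: half the L1 distance between two
    categorical distributions given as {category: probability} dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_score(baseline, current, weights=None):
    """Weighted TVD across tracked dimensions (weights are assumptions)."""
    weights = weights or {"cpt_mix": 0.5, "network": 0.25, "amount_band": 0.25}
    return sum(w * tvd(baseline[dim], current[dim]) for dim, w in weights.items())

baseline = {
    "cpt_mix": {"99213": 0.6, "99214": 0.4},
    "network": {"in": 0.8, "out": 0.2},
    "amount_band": {"low": 0.5, "mid": 0.3, "high": 0.2},
}
current = {
    "cpt_mix": {"99213": 0.3, "99214": 0.7},  # notable CPT-mix shift
    "network": {"in": 0.8, "out": 0.2},
    "amount_band": {"low": 0.5, "mid": 0.3, "high": 0.2},
}
score = drift_score(baseline, current)
print(round(score, 3))  # 0.15 -> downgrade confidence if above configured threshold
```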
1.7 Stratified simulation and bootstrap uncertainty
Evaluation uses stratified sampling with fixed positive targets plus bootstrap confidence intervals.
Core evaluation implementation:
- `scripts/simulate_api_validation.py`
- `scripts/simulate_denial_targets.py`
- `scripts/simulate_reimbursement_targets.py`
- `scripts/cross_model_integrity_report.py`
Why this matters here:
- healthcare labels are imbalanced; random sampling can hide failures
- confidence intervals are required to separate real signal from run-to-run noise
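The bootstrap idea is simple to demonstrate. The sketch below computes a percentile bootstrap CI for the mean of an imbalanced label sample; the real evaluation scripts bootstrap AUC and top-k metrics rather than a raw mean.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of `values`:
    resample with replacement, collect means, take percentile bounds."""
    rng = random.Random(seed)  # fixed seed for reproducible evaluation
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

labels = [1] * 30 + [0] * 70   # imbalanced outcome sample, base rate 0.30
lo, hi = bootstrap_ci(labels)
print(lo <= 0.30 <= hi)  # True: the point estimate sits inside its own CI
```

Two runs whose CIs overlap heavily should not be read as a real quality change; that is exactly the run-to-run noise this section warns about.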
1.8 Outcome feedback telemetry loop
Operational endpoint: `POST /v1/feedback`
Feedback is treated as structured telemetry (not ticketing), linked to prior scoring calls using `request_id` from the `x-request-id` header. Runtime validation enforces:
- `request_id` exists and is recent
- endpoint match (`denial`, `prior_auth`, `reimbursement`)
- caller ownership match
- short note format with PHI-like pattern blocking
Why this matters here:
- provides real-world calibration error signal earlier than aggregate lagging outcomes
- enables mismatch-rate tracking by payer/CPT/outcome class
- supports confidence downgrade and drift sensitivity tuning using production feedback patterns
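As a purely hypothetical sketch of the "short note format with PHI-like pattern blocking" check: the length limit and the two regex patterns below (SSN-shaped strings and long digit runs) are illustrative assumptions, not the patterns the runtime actually enforces.

```python
import re

# Illustrative PHI-like patterns (assumptions, not the production rule set)
PHI_LIKE = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped token
    re.compile(r"\b\d{9,}\b"),             # long identifier-like digit run
]

def feedback_note_ok(note, max_len=280):
    """Accept a short free-text note; reject when it looks like it carries PHI."""
    if len(note) > max_len:
        return False
    return not any(p.search(note) for p in PHI_LIKE)

print(feedback_note_ok("Payer upheld denial on resubmission"))  # True
print(feedback_note_ok("Member SSN 123-45-6789 attached"))      # False
```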
1.9 Latest benchmark context (2026-02-17 UTC)
Primary validation artifacts:
- local full 5-seed pack: `artifacts/test-reports/local-full-5seed-validation-20260216T225006Z.md`
- Fly full 5-seed fast pack: `artifacts/test-reports/fly-full-5seed-validation-20260216T235231Z.md`
- Fly dual-machine rerun: `artifacts/test-reports/fly-two-machine-rerun-20260217T004535Z.md`
Statistical interpretation:
- Local: denial AUC mean `0.7972`, prior-auth AUC mean `0.9708`, reimbursement R2 mean `0.7967`, p95 latency `66.87 ms`.
- Fly fast pack: denial AUC mean `0.8489`, prior-auth AUC mean `0.9436`, reimbursement MAPE mean `0.1067`, p95 latency `3482.47 ms`.
- Fly dual-machine rerun: denial AUC `0.7687`, prior-auth AUC `0.9364`, reimbursement R2 `0.8089`, p95 latency `529.02 ms`.
Operational takeaway:
- model quality metrics remain broadly stable across environments;
- latency profile is the dominant environment-sensitive variable;
- two-machine API scaling improved Fly latency materially vs single-machine runs, but did not clear the strict `< 500 ms` p95 gate in the latest rerun.
2. Acronym Glossary with Platform Relevance
2.1 Classification discrimination and ranking metrics
- AUC (Area Under ROC Curve):
- Definition: probability the model ranks a random positive above a random negative.
- Platform relevance: indicates how well denial/prior-auth models rank risky claims for triage queues; high AUC improves concentration of true denials in the top risk bucket.
- ROC (Receiver Operating Characteristic):
- Definition: curve of true positive rate vs false positive rate across thresholds.
- Platform relevance: supports threshold strategy discussions (`high_recall` vs `high_precision`).
- AP (Average Precision):
- Definition: area under the precision-recall curve.
- Platform relevance: useful under class imbalance and complements AUC for denial/prior-auth risk ranking quality.
2.2 Calibration metrics
- ECE (Expected Calibration Error):
- Definition: weighted average absolute difference between predicted probability and observed frequency in bins.
- Platform relevance: ensures probability outputs can be used operationally; low ECE means risk probabilities are credible for staffing and policy decisions.
- Brier Score:
- Definition: mean squared error between predicted probability and binary outcome.
- Platform relevance: tracks overall probabilistic accuracy; used alongside ECE to detect confidence misalignment.
2.3 Top-k operations metrics
- Top-20 Capture:
- Definition: percent of all positives found in the top 20 percent highest-risk predictions.
- Platform relevance: primary triage metric; shows how much denial volume can be intercepted by reviewing only the riskiest subset.
- Top-20 Precision:
- Definition: percent of claims inside the top 20 percent risk bucket that are actual positives.
- Platform relevance: estimates review efficiency; higher precision means fewer wasted manual reviews.
- Lift:
- Definition: bucket precision divided by base positive rate.
- Platform relevance: prevalence-aware business value metric; especially important for prior auth where base rates are lower than denial.
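The three top-k metrics above can be computed together from one ranked pass over scored claims. A worked sketch with a tiny synthetic sample:

```python
def topk_metrics(scores, labels, frac=0.20):
    """Compute top-k capture, precision, and lift for the top `frac` of claims."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * frac))
    bucket = [y for _, y in ranked[:k]]       # labels of the riskiest k claims
    total_pos = sum(labels)
    base_rate = total_pos / len(labels)
    capture = sum(bucket) / total_pos         # share of all positives intercepted
    precision = sum(bucket) / k               # review efficiency in the bucket
    lift = precision / base_rate              # prevalence-aware value multiplier
    return capture, precision, lift

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]
capture, precision, lift = topk_metrics(scores, labels)
print(round(capture, 3), precision, round(lift, 2))  # 0.667 1.0 3.33
```

Reading the output: reviewing only the top 2 of 10 claims intercepts two of the three true positives, with no wasted reviews, at 3.33x the base positive rate.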
2.4 Regression metrics (reimbursement)
- MAPE (Mean Absolute Percentage Error):
- Definition: mean absolute percent error between predicted and actual allowed amount.
- Platform relevance: primary business gate because finance teams care about percent forecast error.
- R2 (Coefficient of Determination):
- Definition: fraction of variance explained relative to mean baseline.
- Platform relevance: useful diagnostic, but secondary gate because it can be noisy at segment/seed level.
- MAE (Mean Absolute Error):
- Definition: average absolute dollar error.
- Platform relevance: shows absolute miss size in currency terms.
- RMSE (Root Mean Squared Error):
- Definition: square root of mean squared error; penalizes larger misses.
- Platform relevance: sensitivity metric for outlier reimbursement misses.
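All four regression metrics above follow directly from their definitions. A worked sketch on three illustrative allowed amounts:

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE, MAPE, and R2 from their textbook definitions."""
    n = len(actual)
    errs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errs) / n                        # mean dollar miss
    rmse = math.sqrt(sum(e * e for e in errs) / n)             # penalizes big misses
    mape = sum(abs(e) / a for a, e in zip(actual, errs)) / n   # assumes actual > 0
    mean_a = sum(actual) / n
    ss_res = sum(e * e for e in errs)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                                   # vs mean baseline
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Illustrative allowed amounts: actual vs predicted
m = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
print(round(m["MAPE"], 4))  # 0.0667 -> a ~6.7% mean percent forecast error
```

Note why MAPE is the business gate: a $20 miss on a $400 claim and a $10 miss on a $100 claim contribute very differently in percent terms, which is how finance teams read forecast error.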
2.5 Uncertainty, stability, and runtime metrics
- CI (Confidence Interval):
- Definition: uncertainty range around a metric estimate (here from bootstrap resampling).
- Platform relevance: prevents overinterpreting one strong or weak run; used in cross-model integrity checks.
- Bootstrap:
- Definition: repeated sampling with replacement to estimate metric variability.
- Platform relevance: quantifies uncertainty of AUC/top-k metrics in stratified runs.
- p95 latency:
- Definition: 95th percentile response time.
- Platform relevance: hard runtime SLO-style check; API quality is not acceptable if model quality is high but inference latency is unstable.
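For reference, p95 over a batch of raw latencies can be computed with the nearest-rank method sketched below; the production `/metrics` endpoint derives percentiles from Prometheus histograms rather than raw samples.

```python
import math

def p95(latencies_ms):
    """95th percentile via the nearest-rank method over raw samples."""
    ranked = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1]

samples = list(range(1, 101))  # 1..100 ms, one request each
print(p95(samples))  # 95
```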
2.6 Drift and distribution metrics
- TVD (Total Variation Distance):
- Definition: half the sum of absolute distribution differences across categories.
- Platform relevance: used in payer drift scoring to detect shifts in CPT mix/network/amount bands and trigger confidence downgrade.
3. Why AUC Matters in This Platform (Specific Example)
AUC is relevant here because the platform is a triage engine, not just a yes/no classifier.
In denial and prior auth workflows:
- claims are ranked by risk score
- teams review a constrained fraction of claims (for example top 10-20 percent)
- business value depends on concentration of true problems in that reviewed subset
A higher AUC usually improves top-bucket concentration, which drives:
- higher top-20 capture
- higher lift
- better reviewer ROI
Important caveat:
- AUC alone is insufficient. This platform also enforces calibration (ECE), top-k business metrics, and segment stability checks.
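The ranking interpretation of AUC can be computed directly from its definition (probability a random positive outranks a random negative), which is what makes it a natural triage metric. A small sketch, counting ties as half-wins:

```python
def auc(scores, labels):
    """AUC as the pairwise ranking probability: fraction of (positive,
    negative) pairs where the positive is scored higher (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]
print(round(auc(scores, labels), 3))  # 0.833: 5 of 6 pos/neg pairs ranked correctly
```

This pairwise view also shows the caveat concretely: AUC is invariant to any monotone rescaling of scores, so it says nothing about whether `0.71` actually means 71 percent risk; that is what calibration checks add.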
4. How Metrics Map to Business Decisions
- Denial endpoint:
- AUC and top-k metrics decide whether risk ranking is useful.
- ECE decides whether score magnitudes can drive threshold policy.
- Operating point (`0.22`, `0.26`, `0.32` under the default `commercial_beta` profile) decides the recall/precision tradeoff.
- Prior auth endpoint:
- AUC + lift determine triage quality on low-prevalence outcomes.
- support level + drift diagnostics decide confidence messaging and routing intensity.
- Reimbursement endpoint:
- mean MAPE and the monthly within-10-percent rate are the primary business gates.
- R2 is secondary diagnostic for model explainability and variance tracking.
- Appeals endpoint:
- structural pass/completeness/citation metrics ensure generated output is submission-ready and safe.
5. Current Governance and Metric Freezing
Model behavior is protected by frozen lock artifacts checked at runtime startup:
- model artifact lock: `tests/contracts/model_artifact_hashes.json`
- model version lock: `tests/contracts/model_version_pins.json`
- calibration lock: `tests/contracts/calibration_artifact_snapshot.json`
- threshold lock: `tests/contracts/threshold_config_snapshot.json`
- evaluation config lock: `tests/contracts/evaluation_config_snapshot.json`
Why this matters here:
- prevents silent drift from accidental artifact swaps or threshold edits
- keeps simulation evidence aligned with production scoring behavior
Operational deployment probes:
- `GET /health` for runtime metadata visibility
- `GET /readyz` for load-balancer readiness decisions (`200` ready, `503` not ready)
- `GET /metrics` for Prometheus histogram/counter scraping (source of truth, public, `text/plain; version=0.0.4; charset=utf-8`)
- `GET /metrics.json` for compact operator debugging (`metrics_schema_version: "1"`, requires `metrics:read`)
What We Do Not Optimize For
This platform does not optimize for maximizing raw AUC at the expense of calibration quality, stability, or segment-level fairness/support integrity.
6. Practical Reading Order for Non-Statisticians
If you want the fastest path to understanding platform quality:
- Look at top-20 capture and lift first (operational value).
- Check ECE second (probability trustworthiness).
- Check feedback mismatch rates (`POST /v1/feedback` telemetry) by payer/CPT third.
- Check segment support and drift status fourth (where confidence should be downgraded).
- Check latency and stability fifth (production reliability).
- Use AUC/R2 as supporting diagnostics, not standalone go/no-go signals.
7. Scope and Intended Use
These statistics support revenue cycle decision-support workflows. They do not replace human adjudication, clinical judgment, legal review, or payer-final determination.