Model Validation & Governance Summary (v1.1)

Document Control

Document version: v1.1
Effective date: 2026-02-17
Repository release version: 0.1.90 (VERSION)
Scope: denial risk, prior auth prediction, reimbursement estimation, appeal generation
Environment profile: synthetic + CMS-hybrid simulation data
Intended audience: engineering, product, pilot partners, and technical compliance reviewers

1. Purpose

This document defines how model quality is validated, how runtime behavior is governed, and how changes are controlled for the claim-risk platform.

Primary goals:

keep model behavior measurable and reproducible
prevent silent runtime drift from configuration or artifact changes
provide transparent pass/fail criteria for pilot readiness

1.1 Intended use and decision responsibility

This platform provides decision-support signals for revenue cycle operations. Outputs are not medical determinations, legal determinations, or payer-final adjudication decisions. Final submission, coding, authorization, and appeal decisions remain with qualified human reviewers.

2. Platform Capability Scope

Model-backed endpoints in scope:

POST /v1/score (primary, workflow-based scoring)
POST /v1/feedback (structured real-world outcome telemetry)
POST /v1/claims/denial/score (legacy/deprecated compatibility endpoint)
POST /v1/claims/prior-auth/predict (legacy/deprecated compatibility endpoint)
POST /v1/claims/reimbursement/estimate (legacy/deprecated compatibility endpoint)
POST /v1/claims/appeals/generate (beta-enterprise, x-public=false)

Supporting endpoint families in scope:

history retrieval endpoints for appeals, prior auth, and reimbursement
workflow + limit discovery endpoints: GET /v1/workflows, GET /v1/limits
token issuance endpoint for monetized/scoped API access: POST /v1/tokens/issue
self-serve control-plane endpoints for one-time setup links and workspace key lifecycle: POST /v1/control-plane/setup-links, POST /v1/control-plane/setup-links/redeem, GET /v1/control-plane/workspace, GET/POST /v1/control-plane/workspace/api-keys

operational probe endpoints: GET /health, GET /readyz, and GET /metrics (public) plus GET /metrics.json (requires metrics:read)

3. Model and Artifact Inventory

Pinned model versions and artifact hashes are frozen in tests/contracts/model_version_pins.json and tests/contracts/model_artifact_hashes.json.

| Model | Pinned version | Artifact path |
|---|---|---|
| Denial LightGBM (regime bundle) | 0.3.3 | artifacts/denial-risk-lightgbm-500k/model.joblib (+ denial_model_synth.joblib, denial_model_cms.joblib, denial_model_commercial.joblib) |
| Prior Auth LightGBM | 0.2.3-prior-auth-on-file | artifacts/prior-auth-lightgbm/model.joblib |
| Reimbursement LightGBM | 0.2.1 | artifacts/reimbursement-lightgbm/model.joblib |

Calibration artifact (denial) is pinned in tests/contracts/calibration_artifact_snapshot.json:

path: artifacts/denial-risk-lightgbm-500k/calibration-pass/probability_calibrator.joblib
selected method: isotonic
selected ECE snapshot: 0.003216288147
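
For readers unfamiliar with the ECE figure above, the following is a minimal sketch of expected calibration error using equal-width probability bins. The bin count and binning scheme are assumptions for illustration; the repository's evaluation code may bin differently, so this sketch is not expected to reproduce the pinned snapshot value.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |observed rate - mean predicted prob|.

    n_bins=10 is an illustrative assumption, not a value taken from the repository.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")

    # Assign each prediction to a bin; p == 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed_rate - mean_prob)
    return ece
```

A lower value means predicted probabilities track observed outcome rates more closely within each bin.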

4. Validation Methodology

4.1 Sampling and evaluation protocol

Stratified sampling with fixed positive targets (not naive random-only sampling)
Bootstrap confidence intervals for discrimination and top-k metrics
Multi-seed evaluation to test stability
Segment-level checks for payer and CPT behavior
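
The top-k metrics mentioned above can be sketched as follows. This sketch assumes that "top-k" means the highest-scoring fraction of claims (for example the top 20%); that reading, and the function name, are assumptions rather than definitions taken from the evaluation code.

```python
def top_k_capture_and_precision(scores, labels, fraction=0.20):
    """Return (capture, precision) for the highest-scoring `fraction` of rows.

    capture   = positives inside the top slice / all positives
    precision = positives inside the top slice / size of the top slice
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")

    k = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(y for _, y in ranked[:k])
    total_positives = sum(labels)

    capture = hits / total_positives if total_positives else 0.0
    precision = hits / k
    return capture, precision
```

Under this definition, capture and precision coincide whenever the size of the top slice equals the number of positives in the sample.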

Evaluation defaults are frozen in tests/contracts/evaluation_config_snapshot.json, including:

bootstrap replicates: 1000
confidence level: 0.95
stratified workflow defaults and top-payer minimums
required prior-auth tuned features
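
As an illustration of how the frozen defaults above (1000 replicates, 0.95 confidence) might be applied, here is a minimal percentile-bootstrap sketch for AUC. The percentile method, the pairwise AUC helper, and the function names are assumptions for illustration; the repository's evaluation harness may use a different estimator.

```python
import random


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, metric=auc, replicates=1000, confidence=0.95, seed=11):
    """Percentile bootstrap interval for `metric` over resampled rows."""
    metric(scores, labels)  # fail fast if the full sample cannot be scored
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([scores[i] for i in idx], [labels[i] for i in idx]))
        except ValueError:
            continue  # resample contained a single class; draw again
    stats.sort()
    alpha = (1.0 - confidence) / 2.0
    lower = stats[int(alpha * replicates)]
    upper = stats[min(replicates - 1, int((1.0 - alpha) * replicates))]
    return lower, upper
```

Fixing the seed keeps the interval reproducible across runs, which matches the document's emphasis on deterministic baselines.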

4.2 Production scoring path alignment

Simulation and governance enforce parity with production scoring behavior:

runtime scoring path expectation: api_runtime
bootstrap source expectation: api_runtime_predictions
tuned prior-auth feature set readiness check included

4.3 Data limitations and applicability boundaries

Current validation is based on synthetic + CMS-hybrid simulation data. Real-world payer behavior can introduce additional distribution shifts, policy nuances, and documentation variability that require recalibration, threshold review, and/or feature updates. Pilot and production performance should be monitored continuously with segment-level diagnostics before expanding support claims.

5. Latest Validation Snapshot (Current Baseline)

This section reflects the latest generated artifacts as of 2026-02-17.

5.1 Local full 5-seed validation pack

Source: artifacts/test-reports/local-full-5seed-validation-20260216T225006Z.md

seeds: 11, 23, 37, 53, 71
denial checks: 20/25
prior-auth checks: 20/20
reimbursement checks: 13/15
appeals checks: 38/40
cross-model checks: 35/35
denial AUC mean: 0.7972
denial top-20 capture mean: 0.5385
denial top-20 precision mean: 0.5385
prior-auth AUC mean: 0.9708
reimbursement R2 mean: 0.7967
reimbursement MAPE mean: 0.1781
overall p95 latency: 66.87 ms

Denial target and reimbursement target companion packs:

denial target source: artifacts/simulations/full5_denial_20260216T224507Z/denial_target_simulation_20260216T224650Z.json
reimbursement target source: artifacts/simulations/full5_reimbursement_20260216T224507Z/reimbursement_target_simulation_20260216T224510Z.json
denial pass-all seeds: 1/5; aggregate ECE mean: 0.0110
reimbursement business gate: PASS; MAPE mean: 0.1192; monthly within 10% mean: 0.9611
frozen high-recall thresholds by regime: synthetic_v1=0.21, cms_v1=0.24, commercial_beta=0.22

5.2 Fly remote 5-seed fast-bounded pack

Source: artifacts/test-reports/fly-full-5seed-validation-20260216T235231Z.md

denial checks: 20/25
prior-auth checks: 20/20
reimbursement checks: 11/15
appeals checks: 40/40
cross-model checks: 30/35
denial AUC mean: 0.8489
prior-auth AUC mean: 0.9436
reimbursement MAPE mean: 0.1067
overall p95 latency: 3482.47 ms

5.3 Fly dual-machine rerun (stability + latency)

Sources:

artifacts/test-reports/fly-two-machine-rerun-20260217T004535Z.md
artifacts/test-reports/fly-two-machine-state-20260217T003423Z.log
artifacts/fly-remote-simulation/two-machine-rerun-20260217T003423Z-seed11/20260217T004412Z/report.json

Validation summary:

denial checks: 3/5
prior-auth checks: 4/4
reimbursement checks: 3/3
appeals checks: 8/8
cross-model checks: 6/7
overall p95 latency: 529.02 ms (misses the < 500 ms latency gate by 29.02 ms)

Machine profile used throughout the dual-machine rerun:

| Service | Profile |
|---|---|
| API | 2 x shared 4 vCPU / 4096 MB (iad) |
| Postgres | shared 4 vCPU / 4096 MB (iad) |
| Token service | shared 2 vCPU / 1024 MB (iad) |

Interpretation:

local and Fly runs keep discrimination/calibration behavior directionally consistent;
strict denial top-20 capture/precision gates are still the primary shortfall area;
Fly p95 latency improves materially with two API machines (3482.47 ms in the fast-bounded pack versus 529.02 ms in this rerun) but remains above the strict < 500 ms gate in this rerun profile.

6. Governance Controls

6.1 Runtime lock enforcement

Runtime lock enforcement is implemented in app/runtime_locks.py and executed during API startup (app/main.py).

Startup validation checks:

model version pin lock (all scoring models + denial regime profile map)
model artifact hash + size lock
calibration artifact snapshot lock
threshold configuration lock
evaluation configuration lock

Primary lock files:

tests/contracts/model_version_pins.json
tests/contracts/model_artifact_hashes.json
tests/contracts/calibration_artifact_snapshot.json
tests/contracts/threshold_config_snapshot.json
tests/contracts/evaluation_config_snapshot.json
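
The artifact hash + size lock can be sketched roughly as below. The lock-entry schema (`sha256`, `size_bytes`) and the function names are hypothetical; the actual layout of tests/contracts/model_artifact_hashes.json and the logic in app/runtime_locks.py are not shown in this document and may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model artifacts are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact_locks(lock_entries, root="."):
    """Return a list of human-readable failures; an empty list means all locks hold.

    lock_entries uses a hypothetical schema:
        {"relative/path": {"sha256": "<hex>", "size_bytes": <int>}}
    """
    failures = []
    for rel_path, expected in lock_entries.items():
        path = Path(root) / rel_path
        if not path.is_file():
            failures.append(f"{rel_path}: missing artifact")
            continue
        actual_size = path.stat().st_size
        if actual_size != expected["size_bytes"]:
            # Size is checked first because it is cheap and catches most swaps.
            failures.append(f"{rel_path}: size mismatch ({actual_size} bytes)")
        elif sha256_of(path) != expected["sha256"]:
            failures.append(f"{rel_path}: sha256 mismatch")
    return failures
```

A startup hook could call `verify_artifact_locks` and refuse to serve traffic when the returned list is non-empty, which is the fail-closed behavior this section describes.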

6.2 Contract and configuration stability

endpoint contract snapshot frozen in tests/contracts/endpoint_contract_snapshot.json
model version pins frozen in tests/contracts/model_version_pins.json
runtime default snapshot frozen in tests/contracts/runtime_config_snapshot.json

6.3 Test quality gates

Automated quality gate from make test-suites:

line coverage threshold: >= 70%
branch coverage threshold: >= 55%
contract artifact safety test ensures JSON contract snapshots contain no secret-like material before runtime image bundling

Latest coverage snapshot (artifacts/test-reports/coverage_summary.md):

line coverage: 80.60%
branch coverage: 68.79%

6.4 Runtime security baseline

Deployment defaults are intentionally hardened to reduce accidental insecure rollout risk:

API auth defaults to enabled (API_AUTH_ENABLED=true)
local DB port binding defaults to loopback (DB_BIND_ADDRESS=127.0.0.1)
runtime compose requires explicit database secret input (POSTGRES_PASSWORD)
token issuance helper requires explicit admin token (TOKEN_SERVICE_ADMIN_TOKEN)
token issuance helper defaults to TLS verification on (TOKEN_SERVICE_INSECURE_SKIP_VERIFY=0)
control-plane setup/session secrets are explicitly scoped (CONTROL_PLANE_SETUP_TOKEN_SECRET, CONTROL_PLANE_SESSION_TOKEN_SECRET)
API and token-service containers run as non-root user (appuser)

These controls are separate from model metrics but are required for production-governance credibility.

7. Runtime Observability and Drift Governance

7.1 Structured logging

Scoring logs include model metadata and support context:

denial logs include model version + calibration method/version + routing metadata (distribution_profile, routing_source)
prior-auth logs include support level, drift status, drift reason, and model/calibration versions
denial and prior-auth API responses also expose calibration_method and calibration_version for field-level debugging parity with logs

Example operational events:

denial_score_scored
prior_auth_prediction_scored
prior_auth_drift_downgrade_rate_alert

7.1.1 Readiness signaling

GET /health reports loaded runtime metadata (model, calibrator, operating point, artifacts)
GET /readyz is the deployment-readiness probe; it returns 200 when dependency checks pass and 503 when critical readiness checks fail (for example, database connectivity)
GET /metrics exposes Prometheus metrics for machine scraping
GET /metrics.json exposes compact operator diagnostics (metrics_schema_version: "1") including request totals, latency percentiles, and error counters; access requires metrics:read

7.2 Payer-level drift diagnostics

Prior-auth drift monitoring tracks feature drift per payer:

CPT mix
network status
billed amount band

When drift exceeds configured thresholds:

confidence support level is auto-downgraded
downgrade reason is included in response/support metadata
alerting is triggered when downgrade rate exceeds configured window threshold

Key alert controls:

PRIOR_AUTH_DRIFT_ALERT_WINDOW_SIZE (default 200)
PRIOR_AUTH_DRIFT_ALERT_MIN_SAMPLES (default 50)
PRIOR_AUTH_DRIFT_ALERT_THRESHOLD (default 0.20)
PRIOR_AUTH_DRIFT_ALERT_COOLDOWN_SECONDS (default 300)
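
One way the four alert controls above could interact is sketched below: a rolling window of downgrade flags, a minimum sample count before any rate is evaluated, a rate threshold, and a cooldown between alerts. The class name and the exact comparison semantics (strictly greater than the threshold) are assumptions; only the default values come from this document.

```python
import time
from collections import deque


class DowngradeRateAlert:
    """Rolling-window downgrade-rate alert with a cooldown (illustrative sketch)."""

    def __init__(self, window_size=200, min_samples=50, threshold=0.20,
                 cooldown_seconds=300, clock=time.monotonic):
        self._events = deque(maxlen=window_size)  # oldest flags drop off automatically
        self.min_samples = min_samples
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._last_alert_at = None

    def record(self, downgraded):
        """Record one prediction; return True when an alert should fire now."""
        self._events.append(bool(downgraded))
        if len(self._events) < self.min_samples:
            return False  # not enough evidence yet
        rate = sum(self._events) / len(self._events)
        if rate <= self.threshold:
            return False
        now = self._clock()
        if self._last_alert_at is not None and now - self._last_alert_at < self.cooldown_seconds:
            return False  # still cooling down from the previous alert
        self._last_alert_at = now
        return True
```

Injecting the clock keeps the cooldown testable without sleeping; a caller would emit an event such as prior_auth_drift_downgrade_rate_alert when `record` returns True.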

7.3 Feedback telemetry controls

Feedback endpoint:

POST /v1/feedback

Runtime controls:

request ownership and recency validation using x-request-id from prior scoring responses
endpoint consistency check (request_id must match submitted endpoint family)
short-note enforcement and PHI-like pattern rejection in notes
per-token feedback rate limiting

Distributed consistency control:

request-id ledger is persisted in rcm.scoring_request_events so feedback validation works across multi-instance deployments
feedback events persist in rcm.feedback_events for downstream mismatch/drift analysis

8. Operating Threshold Governance

Denial threshold strategy is explicit and configurable:

high_recall: 0.22 (default)
balanced: 0.26
high_precision: 0.32

Controls:

DENIAL_OPERATING_POINT selects a named operating mode
DENIAL_OPERATING_THRESHOLD supports explicit override
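
A minimal sketch of how the two controls might resolve to a single threshold is shown below. The precedence (an explicit override wins over the named mode), the range validation, and the function name are assumptions for illustration; the named values and the high_recall default come from the list above.

```python
import os

# Named operating points as listed in this section.
OPERATING_POINTS = {"high_recall": 0.22, "balanced": 0.26, "high_precision": 0.32}


def resolve_denial_threshold(env=None):
    """Resolve the denial decision threshold from environment-style settings.

    Assumed precedence: DENIAL_OPERATING_THRESHOLD (explicit override) first,
    then DENIAL_OPERATING_POINT, then the high_recall default.
    """
    env = os.environ if env is None else env

    override = env.get("DENIAL_OPERATING_THRESHOLD")
    if override not in (None, ""):
        value = float(override)
        if not 0.0 < value < 1.0:
            raise ValueError(f"threshold override out of range: {value}")
        return value

    point = env.get("DENIAL_OPERATING_POINT", "high_recall")
    if point not in OPERATING_POINTS:
        raise ValueError(f"unknown operating point: {point!r}")
    return OPERATING_POINTS[point]
```

Failing loudly on an unknown mode or an out-of-range override avoids silently scoring with an unintended threshold.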

Threshold and support-profile defaults are frozen by snapshot in:

tests/contracts/threshold_config_snapshot.json

9. Change Management Standard (Required for v1+ releases)

Any model/config update must complete this sequence:

retrain/recalibrate and produce new artifacts
update lock snapshots and version pins
run full test suites (unit, integration, functional, performance)
run denial, reimbursement, and cross-model simulation packs
verify governance gates and produce updated reports
release with git commit + tag

Changes are not release-ready if runtime lock validation fails or if frozen snapshots are out of sync with deployed artifacts.

9.1 Model retraining frequency policy

Models are reviewed at least quarterly for retraining eligibility, and retraining is triggered earlier when drift diagnostics, segment-support degradation, or business-metric regressions exceed defined governance thresholds.

9.2 Incident response for model degradation

If material model degradation is detected, the platform reverts to the latest known-good locked artifact/threshold profile and initiates an incident review, with remediation and revalidation required before re-release.

10. Current Risks and Open Actions

Denial model: the strict aggregate target bundle is close but passes on only 1 of 5 seeds in the latest local pack; continue incremental interaction tuning while preserving calibration and stability.
Reimbursement: the primary business gate is stable; continue watching lower-seed R2 variability as a secondary diagnostic, not a primary release blocker.
Prior auth and appeals: strong in the current synthetic regime; continue monitoring for payer-mix shifts in pilot data.

11. Summary

The platform now runs with explicit model governance controls, locked artifacts/configuration, deterministic evaluation baselines, and production-oriented observability.

Validation status is strong for prior auth, reimbursement business metrics, appeals structure, and cross-model integrity. Denial calibration and lift are stable, with remaining work focused on closing the final discrimination/capture margin under strict multi-seed targets.
