Model Validation & Governance Summary (v1.1)
Document Control
- Document version: v1.1
- Effective date: 2026-02-17
- Repository release version: 0.1.90 (VERSION)
- Scope: denial risk, prior auth prediction, reimbursement estimation, appeal generation
- Environment profile: synthetic + CMS-hybrid simulation data
- Intended audience: engineering, product, pilot partners, and technical compliance reviewers
1. Purpose
This document defines how model quality is validated, how runtime behavior is governed, and how changes are controlled for the claim-risk platform.
Primary goals:
- keep model behavior measurable and reproducible
- prevent silent runtime drift from configuration or artifact changes
- provide transparent pass/fail criteria for pilot readiness
1.1 Intended use and decision responsibility
This platform provides decision-support signals for revenue cycle operations. Outputs are not medical determinations, legal determinations, or payer-final adjudication decisions. Final submission, coding, authorization, and appeal decisions remain with qualified human reviewers.
2. Platform Capability Scope
Model-backed endpoints in scope:
- POST /v1/score (primary, workflow-based scoring)
- POST /v1/feedback (structured real-world outcome telemetry)
- POST /v1/claims/denial/score (legacy/deprecated compatibility endpoint)
- POST /v1/claims/prior-auth/predict (legacy/deprecated compatibility endpoint)
- POST /v1/claims/reimbursement/estimate (legacy/deprecated compatibility endpoint)
- POST /v1/claims/appeals/generate (beta-enterprise, x-public=false)
Supporting endpoint families in scope:
- history retrieval endpoints for appeals, prior auth, and reimbursement
- workflow + limit discovery endpoints: GET /v1/workflows, GET /v1/limits
- token issuance endpoint for monetized/scoped API access: POST /v1/tokens/issue
- self-serve control-plane endpoints for one-time setup links and workspace key lifecycle: POST /v1/control-plane/setup-links, POST /v1/control-plane/setup-links/redeem, GET /v1/control-plane/workspace, GET/POST /v1/control-plane/workspace/api-keys
- operational probe endpoints: GET /health, GET /readyz, and GET /metrics (public) plus GET /metrics.json (requires metrics:read)
3. Model and Artifact Inventory
Pinned model versions and artifact hashes are frozen in tests/contracts/model_version_pins.json and tests/contracts/model_artifact_hashes.json.
| Model | Pinned version | Artifact path |
|---|---|---|
| Denial LightGBM (regime bundle) | 0.3.3 | artifacts/denial-risk-lightgbm-500k/model.joblib (+ denial_model_synth.joblib, denial_model_cms.joblib, denial_model_commercial.joblib) |
| Prior Auth LightGBM | 0.2.3-prior-auth-on-file | artifacts/prior-auth-lightgbm/model.joblib |
| Reimbursement LightGBM | 0.2.1 | artifacts/reimbursement-lightgbm/model.joblib |
Calibration artifact (denial) is pinned in tests/contracts/calibration_artifact_snapshot.json:
- path: artifacts/denial-risk-lightgbm-500k/calibration-pass/probability_calibrator.joblib
- selected method: isotonic
- selected ECE snapshot: 0.003216288147
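For readers unfamiliar with the metric, the ECE snapshot above is an expected calibration error. The following is a minimal sketch of the common equal-width binned definition, not the repository's implementation: the function name `expected_calibration_error` and the choice of 10 bins are assumptions made for illustration, and the pinned value may have been produced with a different binning scheme.

```python
from typing import Sequence


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: sum over bins of (bin share) * |mean label - mean prob|.

    Hypothetical helper for illustration; n_bins=10 is an assumed default.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and the same length")

    # Per-bin accumulators: sample count, sum of labels, sum of probabilities.
    counts = [0] * n_bins
    label_sums = [0.0] * n_bins
    prob_sums = [0.0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # Clamp the index so prob == 1.0 lands in the last bin.
        idx = max(0, min(int(prob * n_bins), n_bins - 1))
        counts[idx] += 1
        label_sums[idx] += label
        prob_sums[idx] += prob

    total = len(y_true)
    ece = 0.0
    for count, label_sum, prob_sum in zip(counts, label_sums, prob_sums):
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(label_sum / count - prob_sum / count)
        ece += (count / total) * gap
    return ece
```

A value near zero means predicted probabilities track observed outcome rates within each bin; it says nothing about discrimination, which is why AUC and top-k metrics are reported separately.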
4. Validation Methodology
4.1 Sampling and evaluation protocol
- Stratified sampling with fixed positive targets (not naive random-only sampling)
- Bootstrap confidence intervals for discrimination and top-k metrics
- Multi-seed evaluation to test stability
- Segment-level checks for payer and CPT behavior
Evaluation defaults are frozen in tests/contracts/evaluation_config_snapshot.json, including:
- bootstrap replicates: 1000
- confidence level: 0.95
- stratified workflow defaults and top-payer minimums
- required prior-auth tuned features
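The bootstrap settings above can be illustrated with a percentile bootstrap over AUC. This is a sketch under stated assumptions rather than the evaluation code itself: the helper names `auc` and `bootstrap_ci` are hypothetical, the percentile method is one of several ways to form the interval, and the default seed simply reuses one of the seeds listed later in this document.

```python
import random
from typing import Callable, List, Sequence, Tuple


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    y_true: Sequence[int],
    y_score: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float] = auc,
    n_replicates: int = 1000,   # matches the frozen default above
    confidence: float = 0.95,   # matches the frozen default above
    seed: int = 11,             # assumption: any fixed seed gives reproducibility
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric` (hypothetical helper)."""
    if len(set(y_true)) < 2:
        raise ValueError("need both classes to bootstrap a discrimination metric")
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    while len(stats) < n_replicates:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sample_true = [y_true[i] for i in idx]
        if len(set(sample_true)) < 2:
            continue  # skip degenerate single-class resamples
        stats.append(metric(sample_true, [y_score[i] for i in idx]))
    stats.sort()
    alpha = 1.0 - confidence
    lower = stats[int((alpha / 2) * (n_replicates - 1))]
    upper = stats[int((1 - alpha / 2) * (n_replicates - 1))]
    return lower, upper
```

The pairwise AUC here is quadratic in sample size and is only meant for small illustrative inputs; a rank-based implementation would be the usual choice at evaluation scale.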
4.2 Production scoring path alignment
Simulation and governance enforce parity with production scoring behavior:
- runtime scoring path expectation: api_runtime
- bootstrap source expectation: api_runtime_predictions
- tuned prior-auth feature set readiness check included
4.3 Data limitations and applicability boundaries
Current validation is based on synthetic + CMS-hybrid simulation data. Real-world payer behavior can introduce additional distribution shifts, policy nuances, and documentation variability that require recalibration, threshold review, and/or feature updates. Pilot and production performance should be monitored continuously with segment-level diagnostics before expanding support claims.
5. Latest Validation Snapshot (Current Baseline)
This section reflects the latest generated artifacts as of 2026-02-17.
5.1 Local full 5-seed validation pack
Source: artifacts/test-reports/local-full-5seed-validation-20260216T225006Z.md
- seeds: 11, 23, 37, 53, 71
- denial checks: 20/25
- prior-auth checks: 20/20
- reimbursement checks: 13/15
- appeals checks: 38/40
- cross-model checks: 35/35
- denial AUC mean: 0.7972
- denial top-20 capture mean: 0.5385
- denial top-20 precision mean: 0.5385
- prior-auth AUC mean: 0.9708
- reimbursement R2 mean: 0.7967
- reimbursement MAPE mean: 0.1781
- overall p95 latency: 66.87 ms
Denial target and reimbursement target companion packs:
- denial target source: artifacts/simulations/full5_denial_20260216T224507Z/denial_target_simulation_20260216T224650Z.json
- reimbursement target source: artifacts/simulations/full5_reimbursement_20260216T224507Z/reimbursement_target_simulation_20260216T224510Z.json
- denial pass-all seeds: 1/5; aggregate ECE mean: 0.0110
- reimbursement business gate: PASS; MAPE mean: 0.1192; monthly within 10% mean: 0.9611
- frozen high-recall thresholds by regime: synthetic_v1=0.21, cms_v1=0.24, commercial_beta=0.22
5.2 Fly remote 5-seed fast-bounded pack
Source: artifacts/test-reports/fly-full-5seed-validation-20260216T235231Z.md
- denial checks: 20/25
- prior-auth checks: 20/20
- reimbursement checks: 11/15
- appeals checks: 40/40
- cross-model checks: 30/35
- denial AUC mean: 0.8489
- prior-auth AUC mean: 0.9436
- reimbursement MAPE mean: 0.1067
- overall p95 latency: 3482.47 ms
5.3 Fly dual-machine rerun (stability + latency)
Sources:
- artifacts/test-reports/fly-two-machine-rerun-20260217T004535Z.md
- artifacts/test-reports/fly-two-machine-state-20260217T003423Z.log
- artifacts/fly-remote-simulation/two-machine-rerun-20260217T003423Z-seed11/20260217T004412Z/report.json
Validation summary:
- denial checks: 3/5
- prior-auth checks: 4/4
- reimbursement checks: 3/3
- appeals checks: 8/8
- cross-model checks: 6/7
- overall p95 latency: 529.02 ms (latency gate < 500 ms narrowly missed)
Machine profile used throughout the dual-machine rerun:
| Service | Profile |
|---|---|
| API | 2 x shared 4 vCPU / 4096 MB (iad) |
| Postgres | shared 4 vCPU / 4096 MB (iad) |
| Token service | shared 2 vCPU / 1024 MB (iad) |
Interpretation:
- local and Fly runs keep discrimination/calibration behavior directionally consistent;
- strict denial top-20 capture/precision gates are still the primary shortfall area;
- Fly latency materially improves with two API machines but remains above the strict < 500 ms gate in this rerun profile.
6. Governance Controls
6.1 Runtime lock enforcement
Runtime lock enforcement is implemented in app/runtime_locks.py and executed during API startup (app/main.py).
Startup validation checks:
- model version pin lock (all scoring models + denial regime profile map)
- model artifact hash + size lock
- calibration artifact snapshot lock
- threshold configuration lock
- evaluation configuration lock
Primary lock files:
- tests/contracts/model_version_pins.json
- tests/contracts/model_artifact_hashes.json
- tests/contracts/calibration_artifact_snapshot.json
- tests/contracts/threshold_config_snapshot.json
- tests/contracts/evaluation_config_snapshot.json
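To make the artifact hash + size lock concrete, here is a minimal sketch of how a startup check of this kind could work. It is not the code in app/runtime_locks.py: the names `RuntimeLockError`, `sha256_of`, and `verify_artifact_locks` are hypothetical, the use of SHA-256 is an assumption, and the lock-file schema shown in the comment is invented for illustration and may not match tests/contracts/model_artifact_hashes.json.

```python
import hashlib
import json
from pathlib import Path


class RuntimeLockError(RuntimeError):
    """Raised when a deployed artifact does not match its frozen lock entry."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large model artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact_locks(lock_file: Path, root: Path) -> None:
    """Fail fast if any locked artifact is missing or differs in size or hash.

    Assumed (hypothetical) schema:
        {"<relative artifact path>": {"sha256": "<hex>", "size_bytes": <int>}}
    """
    locks = json.loads(lock_file.read_text(encoding="utf-8"))
    problems = []
    for rel_path, expected in locks.items():
        artifact = root / rel_path
        if not artifact.is_file():
            problems.append(f"{rel_path}: missing")
            continue
        # Compare size first: it is cheap and catches most accidental swaps.
        if artifact.stat().st_size != expected["size_bytes"]:
            problems.append(f"{rel_path}: size mismatch")
        elif sha256_of(artifact) != expected["sha256"]:
            problems.append(f"{rel_path}: sha256 mismatch")
    if problems:
        raise RuntimeLockError("; ".join(problems))
```

Raising during startup, rather than logging and continuing, is what turns a snapshot into a lock: a process serving an unpinned artifact never becomes healthy.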
6.2 Contract and configuration stability
- endpoint contract snapshot frozen in tests/contracts/endpoint_contract_snapshot.json
- model version pins frozen in tests/contracts/model_version_pins.json
- runtime default snapshot frozen in tests/contracts/runtime_config_snapshot.json
6.3 Test quality gates
Automated quality gate from make test-suites:
- line coverage threshold: >= 70%
- branch coverage threshold: >= 55%
- contract artifact safety test ensures JSON contract snapshots contain no secret-like material before runtime image bundling
Latest coverage snapshot (artifacts/test-reports/coverage_summary.md):
- code coverage: 80.60%
- branch coverage: 68.79%
6.4 Runtime security baseline
Deployment defaults are intentionally hardened to reduce accidental insecure rollout risk:
- API auth defaults to enabled (API_AUTH_ENABLED=true)
- local DB port binding defaults to loopback (DB_BIND_ADDRESS=127.0.0.1)
- runtime compose requires explicit database secret input (POSTGRES_PASSWORD)
- token issuance helper requires explicit admin token (TOKEN_SERVICE_ADMIN_TOKEN)
- token issuance helper defaults to TLS verification on (TOKEN_SERVICE_INSECURE_SKIP_VERIFY=0)
- control-plane setup/session secrets are explicitly scoped (CONTROL_PLANE_SETUP_TOKEN_SECRET, CONTROL_PLANE_SESSION_TOKEN_SECRET)
- API and token-service containers run as non-root user (appuser)
These controls are separate from model metrics but are required for production-governance credibility.
7. Runtime Observability and Drift Governance
7.1 Structured logging
Scoring logs include model metadata and support context:
- denial logs include model version + calibration method/version + routing metadata (distribution_profile, routing_source)
- prior-auth logs include support level, drift status, drift reason, and model/calibration versions
- denial and prior-auth API responses also expose calibration_method and calibration_version for field-level debugging parity with logs
Example operational events:
- denial_score_scored
- prior_auth_prediction_scored
- prior_auth_drift_downgrade_rate_alert
7.1.1 Readiness signaling
- GET /health reports loaded runtime metadata (model, calibrator, operating point, artifacts)
- GET /readyz is the deployment-readiness probe and returns:
  - 200 when dependency checks pass
  - 503 when critical readiness checks fail (for example database connectivity)
- GET /metrics exposes Prometheus metrics for machine scraping
- GET /metrics.json exposes compact operator diagnostics (metrics_schema_version: "1") including request totals, latency percentiles, and error counters; access requires metrics:read
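The 200/503 contract of the readiness probe can be sketched independently of any web framework. The function below is a hypothetical illustration, not the platform's handler: the name `readiness_response`, the check names, and the response body shape are all assumptions.

```python
from typing import Callable, Dict, Tuple


def readiness_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency check and map the outcome to a status code and body.

    Returns 200 only when every check passes; any failure or exception yields 503.
    The body layout is an assumed shape for illustration.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (for example a refused DB connection) counts as failed.
            results[name] = False
    ready = all(results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

For example, `readiness_response({"database": lambda: True})` yields a 200 with every check marked as passing, while a check that raises is reported as failed instead of crashing the probe.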
7.2 Payer-level drift diagnostics
Prior-auth drift monitoring tracks feature drift per payer:
- CPT mix
- network status
- billed amount band
When drift exceeds configured thresholds:
- confidence support level is auto-downgraded
- downgrade reason is included in response/support metadata
- alerting is triggered when downgrade rate exceeds configured window threshold
Key alert controls:
- PRIOR_AUTH_DRIFT_ALERT_WINDOW_SIZE (default 200)
- PRIOR_AUTH_DRIFT_ALERT_MIN_SAMPLES (default 50)
- PRIOR_AUTH_DRIFT_ALERT_THRESHOLD (default 0.20)
- PRIOR_AUTH_DRIFT_ALERT_COOLDOWN_SECONDS (default 300)
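One way the four alert controls could interact is a sliding window over recent predictions with a minimum-sample guard and a cooldown. The class below is a sketch under that assumption: `DowngradeRateAlert` is a hypothetical name, the defaults mirror the documented values, and the exact semantics in the platform (for example strict versus inclusive threshold comparison) are not confirmed by this document.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class DowngradeRateAlert:
    """Sliding-window downgrade-rate alert with cooldown (hypothetical sketch)."""

    def __init__(
        self,
        window_size: int = 200,          # PRIOR_AUTH_DRIFT_ALERT_WINDOW_SIZE
        min_samples: int = 50,           # PRIOR_AUTH_DRIFT_ALERT_MIN_SAMPLES
        threshold: float = 0.20,         # PRIOR_AUTH_DRIFT_ALERT_THRESHOLD
        cooldown_seconds: float = 300.0, # PRIOR_AUTH_DRIFT_ALERT_COOLDOWN_SECONDS
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window: Deque[bool] = deque(maxlen=window_size)
        self._min_samples = min_samples
        self._threshold = threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_alert_at: Optional[float] = None

    def record(self, downgraded: bool) -> bool:
        """Record one prediction; return True if an alert should fire now."""
        self._window.append(downgraded)
        if len(self._window) < self._min_samples:
            return False  # too few samples for a stable rate
        rate = sum(self._window) / len(self._window)
        if rate <= self._threshold:
            return False  # assumption: alert only when the rate strictly exceeds the threshold
        now = self._clock()
        if self._last_alert_at is not None and now - self._last_alert_at < self._cooldown:
            return False  # still cooling down from the previous alert
        self._last_alert_at = now
        return True
```

The minimum-sample guard prevents a handful of early downgrades from firing an alert, and the cooldown keeps a sustained drift episode from emitting one alert per request.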
7.3 Feedback telemetry controls
Feedback endpoint:
POST /v1/feedback
Runtime controls:
- request ownership and recency validation using x-request-id from prior scoring responses
- endpoint consistency check (request_id must match submitted endpoint family)
- short-note enforcement and PHI-like pattern rejection in notes
- per-token feedback rate limiting
Distributed consistency control:
- request-id ledger is persisted in rcm.scoring_request_events so feedback validation works across multi-instance deployments
- feedback events persist in rcm.feedback_events for downstream mismatch/drift analysis
8. Operating Threshold Governance
Denial threshold strategy is explicit and configurable:
- high_recall: 0.22 (default)
- balanced: 0.26
- high_precision: 0.32
Controls:
- DENIAL_OPERATING_POINT selects a named operating mode
- DENIAL_OPERATING_THRESHOLD supports explicit override
Threshold and support-profile defaults are frozen by snapshot in:
tests/contracts/threshold_config_snapshot.json
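The interaction between the named operating point and the explicit override can be sketched as follows. This is an illustration, not the platform's resolver: `resolve_denial_threshold` is a hypothetical name, and both the precedence (explicit override wins over the named mode) and the open-interval bounds check are assumptions.

```python
import os
from typing import Mapping, Optional

# Named operating points and default taken from this section.
OPERATING_POINTS = {"high_recall": 0.22, "balanced": 0.26, "high_precision": 0.32}
DEFAULT_OPERATING_POINT = "high_recall"


def resolve_denial_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the denial decision threshold from environment-style settings.

    Assumed precedence: DENIAL_OPERATING_THRESHOLD, then DENIAL_OPERATING_POINT,
    then the documented default operating point.
    """
    env = os.environ if env is None else env

    override = env.get("DENIAL_OPERATING_THRESHOLD", "").strip()
    if override:
        value = float(override)  # raises ValueError on non-numeric input
        if not 0.0 < value < 1.0:
            raise ValueError(f"threshold must be in (0, 1), got {value}")
        return value

    point = env.get("DENIAL_OPERATING_POINT", "").strip() or DEFAULT_OPERATING_POINT
    if point not in OPERATING_POINTS:
        raise ValueError(f"unknown operating point: {point!r}")
    return OPERATING_POINTS[point]
```

Rejecting unknown mode names and out-of-range overrides at resolution time keeps a typo in configuration from silently changing the effective threshold.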
9. Change Management Standard (Required for v1+ releases)
Any model/config update must complete this sequence:
- retrain/recalibrate and produce new artifacts
- update lock snapshots and version pins
- run full test suites (unit, integration, functional, performance)
- run denial, reimbursement, and cross-model simulation packs
- verify governance gates and produce updated reports
- release with git commit + tag
Changes are not release-ready if runtime lock validation fails or if frozen snapshots are out of sync with deployed artifacts.
9.1 Model retraining frequency policy
Models are reviewed at least quarterly for retraining eligibility, and retraining is triggered earlier when drift diagnostics, segment-support degradation, or business-metric regressions exceed defined governance thresholds.
9.2 Incident response for model degradation
If material model degradation is detected, the platform rolls back to the latest known-good locked artifact/threshold profile and initiates an incident review, with remediation and revalidation required before re-release.
10. Current Risks and Open Actions
- Denial model: strict aggregate target bundle is close but not consistently full-pass across all seeds; continue incremental interaction tuning while preserving calibration and stability.
- Reimbursement: primary business gate is stable; continue watching lower-seed R2 variability as a secondary diagnostic, not a primary release blocker.
- Prior auth and appeals: strong in the current synthetic regime; continue monitoring for payer-mix shifts in pilot data.
11. Summary
The platform now runs with explicit model governance controls, locked artifacts/configuration, deterministic evaluation baselines, and production-oriented observability.
Validation status is strong for prior auth, reimbursement business metrics, appeals structure, and cross-model integrity. Denial calibration and lift are stable, with remaining work focused on closing the final discrimination/capture margin under strict multi-seed targets.