Sentinel Signal

Model Validation & Governance Summary (v1.1)

Source: docs/model-validation-governance-summary-v1.0.md

Document Control

  • Document version: v1.1
  • Effective date: 2026-02-17
  • Repository release version: 0.1.90 (VERSION)
  • Scope: denial risk, prior auth prediction, reimbursement estimation, appeal generation
  • Environment profile: synthetic + CMS-hybrid simulation data
  • Intended audience: engineering, product, pilot partners, and technical compliance reviewers

1. Purpose

This document defines how model quality is validated, how runtime behavior is governed, and how changes are controlled for the claim-risk platform.

Primary goals:

  • keep model behavior measurable and reproducible
  • prevent silent runtime drift from configuration or artifact changes
  • provide transparent pass/fail criteria for pilot readiness

1.1 Intended use and decision responsibility

This platform provides decision-support signals for revenue cycle operations. Outputs are not medical determinations, legal determinations, or payer-final adjudication decisions. Final submission, coding, authorization, and appeal decisions remain with qualified human reviewers.

2. Platform Capability Scope

Model-backed endpoints in scope:

  • POST /v1/score (primary, workflow-based scoring)
  • POST /v1/feedback (structured real-world outcome telemetry)
  • POST /v1/claims/denial/score (legacy/deprecated compatibility endpoint)
  • POST /v1/claims/prior-auth/predict (legacy/deprecated compatibility endpoint)
  • POST /v1/claims/reimbursement/estimate (legacy/deprecated compatibility endpoint)
  • POST /v1/claims/appeals/generate (beta-enterprise, x-public=false)

Supporting endpoint families in scope:

  • history retrieval endpoints for appeals, prior auth, and reimbursement
  • workflow + limit discovery endpoints: GET /v1/workflows, GET /v1/limits
  • token issuance endpoint for monetized/scoped API access: POST /v1/tokens/issue
  • self-serve control-plane endpoints for one-time setup links and workspace key lifecycle:
      • POST /v1/control-plane/setup-links
      • POST /v1/control-plane/setup-links/redeem
      • GET /v1/control-plane/workspace
      • GET/POST /v1/control-plane/workspace/api-keys
  • operational probe endpoints: GET /health, GET /readyz, and GET /metrics (public) plus GET /metrics.json (requires metrics:read)

3. Model and Artifact Inventory

Pinned model versions and artifact hashes are frozen in tests/contracts/model_version_pins.json and tests/contracts/model_artifact_hashes.json.

| Model | Pinned version | Artifact path |
|---|---|---|
| Denial LightGBM (regime bundle) | 0.3.3 | artifacts/denial-risk-lightgbm-500k/model.joblib (+ denial_model_synth.joblib, denial_model_cms.joblib, denial_model_commercial.joblib) |
| Prior Auth LightGBM | 0.2.3-prior-auth-on-file | artifacts/prior-auth-lightgbm/model.joblib |
| Reimbursement LightGBM | 0.2.1 | artifacts/reimbursement-lightgbm/model.joblib |

Calibration artifact (denial) is pinned in tests/contracts/calibration_artifact_snapshot.json:

  • path: artifacts/denial-risk-lightgbm-500k/calibration-pass/probability_calibrator.joblib
  • selected method: isotonic
  • selected ECE snapshot: 0.003216288147
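
For readers unfamiliar with the ECE figure above, the following is a minimal sketch of expected calibration error using equal-width probability bins. The bin count and binning scheme are assumptions; the document does not state how the pinned snapshot value was computed, so this will not necessarily reproduce it.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted prob|."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_predicted = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed_rate - mean_predicted)
    return ece
```

A lower value means predicted probabilities track observed outcome rates more closely within each bin.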

4. Validation Methodology

4.1 Sampling and evaluation protocol

  • Stratified sampling with fixed positive targets (not naive random-only sampling)
  • Bootstrap confidence intervals for discrimination and top-k metrics
  • Multi-seed evaluation to test stability
  • Segment-level checks for payer and CPT behavior
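
As an illustration of stratified sampling with fixed positive targets, the sketch below draws a set number of positive and negative rows under a fixed seed so that each evaluation seed yields a reproducible sample. The function name and arguments are hypothetical and do not come from the repository.

```python
import random
from typing import List, Sequence


def stratified_sample(
    labels: Sequence[int], n_positive: int, n_negative: int, seed: int
) -> List[int]:
    """Return row indices containing exactly n_positive positives and n_negative negatives."""
    rng = random.Random(seed)  # seeded generator keeps the draw reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) < n_positive or len(negatives) < n_negative:
        raise ValueError("not enough rows in one of the strata")

    chosen = rng.sample(positives, n_positive) + rng.sample(negatives, n_negative)
    rng.shuffle(chosen)  # avoid positives-first ordering in the output
    return chosen
```

Running this once per seed (for example 11, 23, 37, 53, 71) gives independent samples with the same class balance, which is what makes per-seed metrics comparable.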

Evaluation defaults are frozen in tests/contracts/evaluation_config_snapshot.json, including:

  • bootstrap replicates: 1000
  • confidence level: 0.95
  • stratified workflow defaults and top-payer minimums
  • required prior-auth tuned features
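
The bootstrap settings above (1000 replicates, 0.95 confidence) can be illustrated with a percentile bootstrap interval for AUC. This is a standard-library sketch under stated assumptions: the percentile method, the rank-based AUC, and the function names are illustrative choices, not the evaluation code frozen in the snapshot.

```python
import random
from typing import List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    replicates: int = 1000,
    confidence: float = 0.95,
    seed: int = 11,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUC; single-class resamples are redrawn."""
    auc(scores, labels)  # fail fast if the input itself has only one class
    rng = random.Random(seed)
    n = len(scores)
    stats: List[float] = []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if 0 < sum(sample_labels) < n:
            stats.append(auc([scores[i] for i in idx], sample_labels))
    stats.sort()
    alpha = (1 - confidence) / 2
    lower = stats[int(alpha * replicates)]
    upper = stats[min(replicates - 1, int((1 - alpha) * replicates))]
    return lower, upper
```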

4.2 Production scoring path alignment

Simulation and governance enforce parity with production scoring behavior:

  • runtime scoring path expectation: api_runtime
  • bootstrap source expectation: api_runtime_predictions
  • tuned prior-auth feature set readiness check included

4.3 Data limitations and applicability boundaries

Current validation is based on synthetic + CMS-hybrid simulation data. Real-world payer behavior can introduce additional distribution shifts, policy nuances, and documentation variability that require recalibration, threshold review, and/or feature updates. Pilot and production performance should be monitored continuously with segment-level diagnostics before expanding support claims.

5. Latest Validation Snapshot (Current Baseline)

This section reflects the latest generated artifacts as of 2026-02-17.

5.1 Local full 5-seed validation pack

Source: artifacts/test-reports/local-full-5seed-validation-20260216T225006Z.md

  • seeds: 11, 23, 37, 53, 71
  • denial checks: 20/25
  • prior-auth checks: 20/20
  • reimbursement checks: 13/15
  • appeals checks: 38/40
  • cross-model checks: 35/35
  • denial AUC mean: 0.7972
  • denial top-20 capture mean: 0.5385
  • denial top-20 precision mean: 0.5385
  • prior-auth AUC mean: 0.9708
  • reimbursement R2 mean: 0.7967
  • reimbursement MAPE mean: 0.1781
  • overall p95 latency: 66.87 ms
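
The top-20 capture and precision figures can be read with the sketch below. It assumes "top-20" means the top 20% of claims ranked by predicted risk; the document does not define the cutoff, so treat that interpretation and the function name as assumptions.

```python
import math
from typing import Sequence, Tuple


def top_fraction_metrics(
    scores: Sequence[float], labels: Sequence[int], fraction: float = 0.20
) -> Tuple[float, float]:
    """Return (capture, precision) for the highest-scoring `fraction` of rows.

    capture   = positives inside the top slice / all positives
    precision = positives inside the top slice / size of the top slice
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")

    # The small epsilon guards against floating-point noise pushing ceil up by one.
    k = max(1, math.ceil(len(scores) * fraction - 1e-9))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(y for _, y in ranked[:k])
    positives = sum(labels)
    capture = hits / positives if positives else 0.0
    precision = hits / k
    return capture, precision
```

Under this definition, capture and precision are equal whenever the number of positives equals the size of the top slice, which is one possible explanation for the identical means reported above.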

Denial target and reimbursement target companion packs:

  • denial target source: artifacts/simulations/full5_denial_20260216T224507Z/denial_target_simulation_20260216T224650Z.json
  • reimbursement target source: artifacts/simulations/full5_reimbursement_20260216T224507Z/reimbursement_target_simulation_20260216T224510Z.json
  • denial pass-all seeds: 1/5; aggregate ECE mean: 0.0110
  • reimbursement business gate: PASS; MAPE mean: 0.1192; monthly within 10% mean: 0.9611
  • frozen high-recall thresholds by regime: synthetic_v1=0.21, cms_v1=0.24, commercial_beta=0.22

5.2 Fly remote 5-seed fast-bounded pack

Source: artifacts/test-reports/fly-full-5seed-validation-20260216T235231Z.md

  • denial checks: 20/25
  • prior-auth checks: 20/20
  • reimbursement checks: 11/15
  • appeals checks: 40/40
  • cross-model checks: 30/35
  • denial AUC mean: 0.8489
  • prior-auth AUC mean: 0.9436
  • reimbursement MAPE mean: 0.1067
  • overall p95 latency: 3482.47 ms

5.3 Fly dual-machine rerun (stability + latency)

Sources:

  • artifacts/test-reports/fly-two-machine-rerun-20260217T004535Z.md
  • artifacts/test-reports/fly-two-machine-state-20260217T003423Z.log
  • artifacts/fly-remote-simulation/two-machine-rerun-20260217T003423Z-seed11/20260217T004412Z/report.json

Validation summary:

  • denial checks: 3/5
  • prior-auth checks: 4/4
  • reimbursement checks: 3/3
  • appeals checks: 8/8
  • cross-model checks: 6/7
  • overall p95 latency: 529.02 ms (misses the < 500 ms latency gate by 29.02 ms)

Machine profile used throughout the dual-machine rerun:

| Service | Profile |
|---|---|
| API | 2 x shared 4 vCPU / 4096 MB (iad) |
| Postgres | shared 4 vCPU / 4096 MB (iad) |
| Token service | shared 2 vCPU / 1024 MB (iad) |

Interpretation:

  • local and Fly runs keep discrimination/calibration behavior directionally consistent;
  • strict denial top-20 capture/precision gates are still the primary shortfall area;
  • Fly latency materially improves with two API machines but remains above the strict < 500 ms gate in this rerun profile.
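
The p95 latency gate discussed above can be sketched as a nearest-rank percentile compared against a strict limit. The percentile method and function names are assumptions; the actual reports may compute p95 differently (for example with interpolation).

```python
import math
from typing import Dict, Sequence, Union


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_gate(
    latencies_ms: Sequence[float], gate_ms: float = 500.0
) -> Dict[str, Union[float, bool]]:
    """Evaluate the strict '< gate_ms' rule on the p95 of the observed latencies."""
    p95 = percentile_nearest_rank(latencies_ms, 95)
    return {"p95_ms": p95, "gate_ms": gate_ms, "passed": p95 < gate_ms}
```

With a strict inequality, a p95 of exactly 500 ms fails the gate, and the 529.02 ms result above fails it as well.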

6. Governance Controls

6.1 Runtime lock enforcement

Runtime lock enforcement is implemented in app/runtime_locks.py and executed during API startup (app/main.py).

Startup validation checks:

  • model version pin lock (all scoring models + denial regime profile map)
  • model artifact hash + size lock
  • calibration artifact snapshot lock
  • threshold configuration lock
  • evaluation configuration lock

Primary lock files:

  • tests/contracts/model_version_pins.json
  • tests/contracts/model_artifact_hashes.json
  • tests/contracts/calibration_artifact_snapshot.json
  • tests/contracts/threshold_config_snapshot.json
  • tests/contracts/evaluation_config_snapshot.json
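
To make the artifact hash + size lock concrete, here is a minimal sketch of a startup check that compares files on disk with a lock file. The lock-file schema (relative path mapped to sha256 and size_bytes), the exception name, and the function names are assumptions for illustration; the actual format of tests/contracts/model_artifact_hashes.json and the logic in app/runtime_locks.py are not shown in this document.

```python
import hashlib
import json
from pathlib import Path
from typing import List


class RuntimeLockError(RuntimeError):
    """Raised when a deployed artifact does not match its frozen lock entry."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact_locks(lock_file: Path, root: Path) -> None:
    """Fail startup if any locked artifact is missing, resized, or modified."""
    # Assumed schema: {"relative/path": {"sha256": "...", "size_bytes": 123}}
    locks = json.loads(lock_file.read_text())
    problems: List[str] = []
    for rel_path, expected in locks.items():
        artifact = root / rel_path
        if not artifact.is_file():
            problems.append(f"{rel_path}: missing")
            continue
        if artifact.stat().st_size != expected["size_bytes"]:
            problems.append(f"{rel_path}: size mismatch")
        if sha256_of(artifact) != expected["sha256"]:
            problems.append(f"{rel_path}: sha256 mismatch")
    if problems:
        raise RuntimeLockError("; ".join(problems))
```

Collecting every mismatch before raising, instead of stopping at the first one, gives operators a complete picture of which artifacts are out of sync in a single failed start.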

6.2 Contract and configuration stability

  • endpoint contract snapshot frozen in tests/contracts/endpoint_contract_snapshot.json
  • model version pins frozen in tests/contracts/model_version_pins.json
  • runtime default snapshot frozen in tests/contracts/runtime_config_snapshot.json

6.3 Test quality gates

Automated quality gate from make test-suites:

  • line coverage threshold: >= 70%
  • branch coverage threshold: >= 55%
  • contract artifact safety test ensures JSON contract snapshots contain no secret-like material before runtime image bundling

Latest coverage snapshot (artifacts/test-reports/coverage_summary.md):

  • code coverage: 80.60%
  • branch coverage: 68.79%

6.4 Runtime security baseline

Deployment defaults are intentionally hardened to reduce accidental insecure rollout risk:

  • API auth defaults to enabled (API_AUTH_ENABLED=true)
  • local DB port binding defaults to loopback (DB_BIND_ADDRESS=127.0.0.1)
  • runtime compose requires explicit database secret input (POSTGRES_PASSWORD)
  • token issuance helper requires explicit admin token (TOKEN_SERVICE_ADMIN_TOKEN)
  • token issuance helper defaults to TLS verification on (TOKEN_SERVICE_INSECURE_SKIP_VERIFY=0)
  • control-plane setup/session secrets are explicitly scoped (CONTROL_PLANE_SETUP_TOKEN_SECRET, CONTROL_PLANE_SESSION_TOKEN_SECRET)
  • API and token-service containers run as non-root user (appuser)

These controls are separate from model metrics but are required for production-governance credibility.

7. Runtime Observability and Drift Governance

7.1 Structured logging

Scoring logs include model metadata and support context:

  • denial logs include model version + calibration method/version + routing metadata (distribution_profile, routing_source)
  • prior-auth logs include support level, drift status, drift reason, and model/calibration versions
  • denial and prior-auth API responses also expose calibration_method and calibration_version for field-level debugging parity with logs

Example operational events:

  • denial_score_scored
  • prior_auth_prediction_scored
  • prior_auth_drift_downgrade_rate_alert
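
As a rough illustration of what a structured denial_score_scored event might look like, the sketch below emits one JSON log line carrying the metadata listed above. The field names calibration_method, calibration_version, distribution_profile, and routing_source come from this document; the event and model_version keys, the logger name, and the function are assumptions.

```python
import json
import logging

logger = logging.getLogger("scoring")  # hypothetical logger name


def denial_scored_event(
    model_version: str,
    calibration_method: str,
    calibration_version: str,
    distribution_profile: str,
    routing_source: str,
) -> str:
    """Build and log one JSON line so downstream tooling can filter on fields."""
    payload = {
        "event": "denial_score_scored",
        "model_version": model_version,
        "calibration_method": calibration_method,
        "calibration_version": calibration_version,
        "distribution_profile": distribution_profile,
        "routing_source": routing_source,
    }
    line = json.dumps(payload, sort_keys=True)  # stable key order eases diffing
    logger.info(line)
    return line
```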

7.1.1 Readiness signaling

  • GET /health reports loaded runtime metadata (model, calibrator, operating point, artifacts)
  • GET /readyz is the deployment-readiness probe and returns:
      • 200 when dependency checks pass
      • 503 when critical readiness checks fail (for example database connectivity)
  • GET /metrics exposes Prometheus metrics for machine scraping
  • GET /metrics.json exposes compact operator diagnostics (metrics_schema_version: "1") including request totals, latency percentiles, and error counters; access requires metrics:read
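
The 200/503 behavior described for GET /readyz can be sketched framework-independently as a function that runs named dependency checks and maps the result to a status code. The check names and response body shape are hypothetical; the actual handler is not shown in this document.

```python
from typing import Callable, Dict, Tuple


def readyz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Return (status_code, body): 200 if every check passes, otherwise 503."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (for example a refused DB connection) counts as failed.
            results[name] = False
    ready = all(results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

Treating exceptions as failed checks keeps the probe itself from erroring out, so an orchestrator always receives a clear 200 or 503.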

7.2 Payer-level drift diagnostics

Prior-auth drift monitoring tracks feature drift per payer:

  • CPT mix
  • network status
  • billed amount band

When drift exceeds configured thresholds:

  • confidence support level is auto-downgraded
  • downgrade reason is included in response/support metadata
  • alerting is triggered when downgrade rate exceeds configured window threshold

Key alert controls:

  • PRIOR_AUTH_DRIFT_ALERT_WINDOW_SIZE (default 200)
  • PRIOR_AUTH_DRIFT_ALERT_MIN_SAMPLES (default 50)
  • PRIOR_AUTH_DRIFT_ALERT_THRESHOLD (default 0.20)
  • PRIOR_AUTH_DRIFT_ALERT_COOLDOWN_SECONDS (default 300)
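
One way the four alert controls could interact is sketched below: a sliding window of recent downgrade flags, a minimum sample count before any alert, a rate threshold, and a cooldown between alerts. The defaults mirror the values listed above, but the class, its method names, and the exact comparison rules are assumptions rather than the platform's implementation.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class DowngradeRateAlert:
    """Tracks recent prior-auth downgrades and decides when an alert should fire."""

    def __init__(
        self,
        window_size: int = 200,         # PRIOR_AUTH_DRIFT_ALERT_WINDOW_SIZE
        min_samples: int = 50,          # PRIOR_AUTH_DRIFT_ALERT_MIN_SAMPLES
        threshold: float = 0.20,        # PRIOR_AUTH_DRIFT_ALERT_THRESHOLD
        cooldown_seconds: float = 300,  # PRIOR_AUTH_DRIFT_ALERT_COOLDOWN_SECONDS
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window: Deque[int] = deque(maxlen=window_size)
        self._min_samples = min_samples
        self._threshold = threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_alert_at: Optional[float] = None

    def record(self, downgraded: bool) -> bool:
        """Record one prediction; return True if an alert should be emitted now."""
        self._window.append(1 if downgraded else 0)
        if len(self._window) < self._min_samples:
            return False  # too few samples for a meaningful rate
        rate = sum(self._window) / len(self._window)
        if rate <= self._threshold:
            return False
        now = self._clock()
        if self._last_alert_at is not None and now - self._last_alert_at < self._cooldown:
            return False  # still cooling down from the previous alert
        self._last_alert_at = now
        return True
```

The injectable clock is a design convenience: tests can advance time deterministically instead of sleeping through the cooldown.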

7.3 Feedback telemetry controls

Feedback endpoint:

  • POST /v1/feedback

Runtime controls:

  • request ownership and recency validation using x-request-id from prior scoring responses
  • endpoint consistency check (request_id must match submitted endpoint family)
  • short-note enforcement and PHI-like pattern rejection in notes
  • per-token feedback rate limiting

Distributed consistency control:

  • request-id ledger is persisted in rcm.scoring_request_events so feedback validation works across multi-instance deployments
  • feedback events persist in rcm.feedback_events for downstream mismatch/drift analysis
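
The feedback checks above (ownership, recency, endpoint consistency, note length, PHI-like content) can be sketched as a single validation function over a request-id ledger. Everything concrete here is an assumption for illustration: the in-memory dict standing in for rcm.scoring_request_events, the 30-day recency limit, the 280-character note limit, and the regular expressions. Rate limiting is omitted.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class ScoredRequest:
    request_id: str
    token_id: str
    endpoint_family: str
    scored_at: datetime


# Hypothetical limits and patterns; the real values are not given in this document.
MAX_NOTE_CHARS = 280
MAX_AGE = timedelta(days=30)
PHI_LIKE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),       # SSN-shaped number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),     # email address
    re.compile(r"\b\d{9,}\b"),                  # long digit run (member or account id)
]


def validate_feedback(
    ledger: Dict[str, ScoredRequest],
    request_id: str,
    token_id: str,
    endpoint_family: str,
    note: str,
    now: datetime,
) -> List[str]:
    """Return a list of validation errors; an empty list means the feedback is accepted."""
    entry = ledger.get(request_id)
    if entry is None:
        return ["unknown request_id"]

    errors: List[str] = []
    if entry.token_id != token_id:
        errors.append("request_id not owned by this token")
    if now - entry.scored_at > MAX_AGE:
        errors.append("request_id too old")
    if entry.endpoint_family != endpoint_family:
        errors.append("endpoint family mismatch")
    if len(note) > MAX_NOTE_CHARS:
        errors.append("note too long")
    if any(pattern.search(note) for pattern in PHI_LIKE_PATTERNS):
        errors.append("note contains PHI-like pattern")
    return errors
```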

8. Operating Threshold Governance

Denial threshold strategy is explicit and configurable:

  • high_recall: 0.22 (default)
  • balanced: 0.26
  • high_precision: 0.32

Controls:

  • DENIAL_OPERATING_POINT selects a named operating mode
  • DENIAL_OPERATING_THRESHOLD supports explicit override

Threshold and support-profile defaults are frozen by snapshot in:

  • tests/contracts/threshold_config_snapshot.json
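
A minimal sketch of how the two controls might resolve to a single threshold follows. The named values are the ones listed above; the precedence (an explicit DENIAL_OPERATING_THRESHOLD wins over DENIAL_OPERATING_POINT), the range check, and the function name are assumptions.

```python
import os
from typing import Mapping, Optional

OPERATING_POINTS = {"high_recall": 0.22, "balanced": 0.26, "high_precision": 0.32}
DEFAULT_OPERATING_POINT = "high_recall"


def resolve_denial_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the denial threshold from environment-style settings."""
    env = os.environ if env is None else env

    # Assumed precedence: an explicit numeric override beats the named mode.
    override = env.get("DENIAL_OPERATING_THRESHOLD")
    if override not in (None, ""):
        value = float(override)
        if not 0.0 < value < 1.0:
            raise ValueError(f"threshold out of range: {value}")
        return value

    point = env.get("DENIAL_OPERATING_POINT", DEFAULT_OPERATING_POINT)
    if point not in OPERATING_POINTS:
        raise ValueError(f"unknown operating point: {point!r}")
    return OPERATING_POINTS[point]
```

Rejecting unknown mode names instead of silently falling back to the default fits the document's goal of preventing silent runtime drift from configuration changes.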

9. Change Management Standard (Required for v1+ releases)

Any model/config update must complete this sequence:

  1. retrain/recalibrate and produce new artifacts
  2. update lock snapshots and version pins
  3. run full test suites (unit, integration, functional, performance)
  4. run denial, reimbursement, and cross-model simulation packs
  5. verify governance gates and produce updated reports
  6. release with git commit + tag

Changes are not release-ready if runtime lock validation fails or if frozen snapshots are out of sync with deployed artifacts.

9.1 Model retraining frequency policy

Models are reviewed at least quarterly for retraining eligibility, and retraining is triggered earlier when drift diagnostics, segment-support degradation, or business-metric regressions exceed defined governance thresholds.

9.2 Incident response for model degradation

If material model degradation is detected, the platform reverts to the latest known-good locked artifact/threshold profile, and an incident review is opened. Remediation and revalidation must be completed before re-release.

10. Current Risks and Open Actions

  • Denial model: the strict aggregate target bundle is close to passing but does not yet pass on all seeds (1/5 pass-all seeds in the latest local pack); continue incremental interaction tuning while preserving calibration and stability.
  • Reimbursement: the primary business gate is stable; continue watching lower-seed R2 variability as a secondary diagnostic, not a primary release blocker.
  • Prior auth and appeals: strong in the current synthetic regime; continue monitoring for payer-mix shifts in pilot data.

11. Summary

The platform now runs with explicit model governance controls, locked artifacts/configuration, deterministic evaluation baselines, and production-oriented observability.

Validation status is strong for prior auth, reimbursement business metrics, appeals structure, and cross-model integrity. Denial calibration and lift are stable, with remaining work focused on closing the final discrimination/capture margin under strict multi-seed targets.