zuub-precog

gavel Rule-Based Adjudication

The system checks each dental procedure against the patient's insurance plan rules — frequency limits, age restrictions, bundling conflicts, and waiting periods — to predict whether the claim will be approved or denied, without calling the payer.

Rule-based adjudication mirrors the logic payers use internally. By pre-evaluating claims against known plan rules, practices avoid unnecessary verification calls. Each rule fired is traceable and auditable, providing a clear decision trail for compliance and payer negotiations.

CDT codes are matched against payer-specific YAML rule definitions. Each rule type (frequency, age, bundling, waiting period, alternate benefit) implements a Matches(claim) interface. Rules fire independently; results are aggregated into a coverage determination. Conflict resolution: deny wins over cover when rules contradict.

for _, rule := range payer.Rules {
if rule.Matches(claim) { hits = append(hits, rule) }
}
outcome := adjudicate(hits)

speed Confidence Score

A single number from 0% to 100% representing how certain the system is about its prediction. Higher confidence means the system has strong evidence — matching rules, historical data, and complete patient information — to support its answer.

The confidence score drives the predict-vs-call decision. Predictions above the routing threshold bypass manual verification, directly reducing operational cost. The score is decomposed into four auditable factors, enabling targeted data collection to improve weak areas. Practices typically see 60-70% of claims routed with confidence above 85%.

Weighted linear combination of four normalized factors:

confidence = w₁·ruleMatch + w₂·historicalRate + w₃·dataCompleteness + w₄·payerReliability

Default weights: w₁=0.40, w₂=0.25, w₃=0.20, w₄=0.15
Each factor ∈ [0, 1], output clamped to [0, 1]

Weights are configurable per payer. The model is intentionally linear for interpretability — each factor's contribution is directly readable from the decomposition.

call_split Routing Decision

Based on the confidence score, the system either gives you an instant prediction or recommends calling the payer. High-confidence predictions save time; low-confidence cases are flagged for manual verification so nothing falls through the cracks.

The routing threshold is the business lever controlling the precision/recall trade-off. A higher threshold means fewer automated predictions but higher accuracy. A lower threshold means more automation but higher risk of incorrect predictions. The optimal threshold depends on the cost of a wrong prediction vs. the cost of a verification call.

Binary decision based on confidence threshold comparison:

if confidence ≥ threshold → route to prediction (skip call)
if confidence < threshold → route to manual verification

Default threshold: 0.85
Configurable per payer via rule YAML

The threshold can also be derived from the decision-theoretic Value of Information (VoI) calculation when cost parameters are available.

check_circle Rule Match Score

Measures how well the patient's procedure matches known insurance rules. When the system finds clear rules that apply — like "cleanings are covered twice per year" — this score is high. When no rules match, the score is low, signaling uncertainty.

The rule match score directly reflects the depth of payer-specific knowledge in the system. A consistently low score for a payer indicates a gap in rule coverage — an actionable signal to invest in rule acquisition for that payer. Higher rule coverage correlates with reduced call volume and improved prediction accuracy.

ruleMatch = matchedRules / max(expectedRules, 1)

matchedRules: count of rules that fired with definitive outcome
expectedRules: total rules applicable to CDT code category
Normalized to [0, 1]. Returns 0.5 (neutral) when no rules exist for the payer/code combination.

trending_up Historical Rate

Looks at how often this type of procedure has been approved or denied in the past for this payer. If 95% of similar claims were approved, the system has strong historical evidence to predict approval again.

Historical approval rates provide empirical grounding for predictions. This factor improves automatically as the system processes more claims, creating a data flywheel. Payers with more historical data produce more reliable predictions, directly reducing risk exposure over time.

historicalRate = |approvalRate - 0.5| × 2

Maps approval rate to confidence: 50% approval → 0 (maximum uncertainty)
0% or 100% approval → 1.0 (maximum confidence)
Requires minimum sample size (n ≥ 10) to activate; returns 0 otherwise.

checklist Data Completeness

Checks whether all the information needed for a prediction is present — payer ID, member info, procedure details, tooth numbers, and surfaces. More complete data means a more reliable prediction.

Data completeness is the most actionable confidence factor — it identifies exactly which missing fields would improve prediction quality. Practices that achieve 90%+ data completeness see measurably higher prediction accuracy, directly reducing unnecessary payer calls and improving patient experience at check-in.

dataCompleteness = presentFields / totalFields

Required fields (weight 1.0): payer_id, member_id, cdt_code, service_date
Optional fields (weight 0.5): tooth_number, surfaces, group_number, dob

Score = Σ(field_present × weight) / Σ(weights)

verified_user Payer Reliability

Some insurance companies follow their published rules consistently; others make unpredictable decisions. This score reflects how trustworthy a specific payer's behavior has been — reliable payers get higher scores.

Payer reliability quantifies adjudication consistency across a payer's claim history. Low-reliability payers represent higher risk and should trigger manual verification more frequently. This metric also serves as a negotiation tool — demonstrating to payer representatives where their adjudication deviates from published guidelines.

payerReliability = 1 - (disagreements / totalPredictions)

disagreements: cases where outcome contradicted rule-based prediction
Requires minimum 20 resolved claims to activate.
Payers with no history default to 0.5 (neutral).
Exponentially weighted: recent claims weighted higher (λ = 0.95).

target Brier Score

Measures how accurate each coverage prediction is on a 0-to-1 scale. A score of 0 means the prediction was perfect. Lower numbers mean the system is making better predictions over time. When the Brier Score stays low, front desk staff can trust the system's recommendations without calling the payer.

The Brier Score is the standard metric used by the National Weather Service and insurance actuaries to evaluate probabilistic forecasts. It directly measures calibration — whether a "90% covered" prediction is actually covered 90% of the time. Tracking Brier Score over time demonstrates model reliability to auditors and payer partners. A score below 0.15 indicates clinical-grade prediction accuracy. Each 0.01 improvement translates to approximately 2-3% fewer unnecessary payer calls.

BS = (1/N) Σ (pᵢ - yᵢ)²

pᵢ ∈ [0,1]: predicted probability
yᵢ ∈ {0,1}: binary outcome (1 = covered)

Strictly proper: uniquely minimized when pᵢ = true probability.
Decomposes via Murphy (1973): BS = REL - RES + UNC

Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review 78(1).

functions Log Loss

Penalizes confident wrong predictions more heavily than uncertain ones. If the system says "95% covered" and the claim is denied, that costs much more than saying "55% covered" and being wrong. This keeps the system honest about what it doesn't know.

Log Loss is the industry-standard metric for evaluating probabilistic classifiers in machine learning and actuarial science. Unlike accuracy, it rewards well-calibrated probabilities and severely penalizes overconfident errors. Financial institutions and insurance companies use Log Loss to ensure risk models don't systematically underestimate uncertainty.

LL = -(1/N) Σ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]

Strictly proper scoring rule. Unbounded above (→ ∞ as p → wrong direction).
Probabilities clamped to [1e-15, 1-1e-15] for numerical stability.
Perfect calibration: LL = H(Y) (entropy of outcomes).

Good, I.J. (1952). "Rational Decisions." Journal of the Royal Statistical Society B, 14(1), 107-114. DOI ↗

globe Spherical Score

A normalized accuracy measure that balances between being confident and being correct. Unlike other scores, it ranges from 0 to 1 where higher is better, making it intuitive to read alongside confidence percentages.

The Spherical Score provides a bounded, interpretable complement to Log Loss. Its 0-to-1 range makes it suitable for dashboards and executive reporting. As a proper scoring rule, it cannot be gamed — the best strategy is always to report true beliefs. Used in forecasting competitions and weather prediction evaluation.

SS = pₖ / √(Σ pⱼ²)

where k is the realized outcome class.
For binary: SS = p_correct / √(p² + (1-p)²)
Range: [1/√2, 1]. Higher is better.
Strictly proper. Less sensitive to extreme probabilities than Log Loss.

Roby, T.B. (1965). "Belief states and the uses of evidence." Behavioral Science, 10(3), 255-270. DOI ↗

compare Brier Skill Score

Shows how much better the system's predictions are compared to simply guessing the historical average. A positive score means the system adds value beyond naive prediction. The higher the score, the more the system is earning its keep.

The Brier Skill Score benchmarks prediction quality against a no-skill baseline. A BSS of 0.30 means the model reduces prediction error by 30% compared to always guessing the base rate. This directly quantifies the ROI of the prediction system — a BSS of 0 means no value added, while higher scores justify the investment in rule development and data collection.

BSS = 1 - BS / BS_ref

BS_ref = base_rate × (1 - base_rate) [climatological Brier Score]
BSS = 1: perfect. BSS = 0: no skill. BSS < 0: worse than baseline.
Equivalent to: BSS = (RES - REL) / UNC

Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12(4), 595-600. DOI ↗

analytics Beta Posterior

As the system sees more claims for a payer, it continuously updates its belief about how likely coverage is. Early on with few data points, it's uncertain. After many claims, the estimate becomes precise and stable. This is how the system "learns" from experience.

Bayesian updating provides a principled framework for combining prior knowledge with observed data. Unlike simple averages, it naturally handles small sample sizes by incorporating prior beliefs, preventing wild swings from limited data. This approach is standard in actuarial science and clinical trials — domains where decisions must be made under uncertainty.

Prior: Beta(α₀, β₀) — default: Beta(1, 1) = Uniform
After n claims with k approvals:

Posterior: Beta(α₀ + k, β₀ + n - k)
Mean: (α₀ + k) / (α₀ + β₀ + n)
Mode: (α₀ + k - 1) / (α₀ + β₀ + n - 2) for α,β > 1

Conjugate prior for Bernoulli likelihood. Closed-form update — no MCMC required.

Gelman, A., et al. (2013). Bayesian Data Analysis, 3rd ed. Chapman & Hall/CRC. Ch. 2: Single-parameter models. Book ↗

expand Credible Intervals

Instead of just saying "85% chance of coverage," the system shows a range: "between 78% and 92%." This range gets narrower as the system sees more data, giving you a clear picture of how precise the estimate is.

Credible intervals communicate prediction uncertainty in an intuitive way. A narrow interval (85% ± 3%) signals high precision and justifies automated routing. A wide interval (85% ± 20%) flags high uncertainty, warranting manual review. This transparency builds trust with clinical staff and supports defensible decision-making.

95% Credible Interval: [Beta.ppf(0.025, α, β), Beta.ppf(0.975, α, β)]

Computed via inverse CDF of Beta posterior.
Equal-tailed interval (not HPD for simplicity).
Width ≈ 2 × 1.96 × √(αβ / ((α+β)²(α+β+1)))

Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. DOI ↗

Interpretation: 95% probability that the true coverage rate lies within this interval, given the observed data and prior.

priority_high Posterior Threshold P(p > 0.90)

Answers the question: "How likely is it that coverage is above 90%?" This gives a direct probability that the prediction meets the routing threshold, accounting for all the uncertainty in the estimate.

This metric directly addresses the business question: "Can we trust this prediction enough to skip the call?" Rather than comparing a point estimate to a threshold, it computes the probability of exceeding the threshold, incorporating uncertainty. A posterior probability of 95% means there's only a 5% chance the true rate is below the threshold — a quantified risk statement for decision-makers.

P(p > τ) = 1 - Beta.cdf(τ, α, β)

τ = routing threshold (default 0.90)
α, β = posterior Beta parameters

Decision rule: route to prediction if P(p > τ) > γ
γ = decision confidence (default 0.95)

precision_manufacturing Fisher Information / CRLB

Tells the system the theoretical minimum uncertainty for a given amount of data. This answers "how many more claims do we need to see before the prediction becomes reliable enough?" — helping plan data collection efforts.

Fisher Information quantifies the diminishing returns of additional data. Early claims dramatically reduce uncertainty, but each additional claim provides less marginal value. This informs resource allocation: if the CRLB shows that 50 more claims would only reduce the interval by 1%, the investment in manual data collection may not be justified.

Fisher Information: I(p) = n / (p(1-p))
Cramér-Rao Lower Bound: Var(p̂) ≥ 1/I(p) = p(1-p)/n

For Beta(α,β) posterior:
Var = αβ / ((α+β)²(α+β+1))
Effective sample size: n_eff = α + β - 2 (for non-informative prior)

radio_button_checked Shannon Entropy

Measures how uncertain a prediction is, in bits. A prediction of "50/50 coverage" has maximum entropy (maximum uncertainty), while "99% covered" has very low entropy. The system uses this to identify which predictions need more evidence.

Shannon Entropy provides an information-theoretic measure of prediction uncertainty that's independent of the scoring method. High-entropy predictions represent the highest-value targets for data collection or rule development — they're the cases where additional information would have the greatest impact on routing efficiency.

H(p) = -[p·log₂(p) + (1-p)·log₂(1-p)]

Range: [0, 1] bits for binary outcome
Maximum at p = 0.5: H = 1 bit
Minimum at p ∈ {0, 1}: H = 0 bits
Convention: 0·log(0) = 0

Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3), 379-423. DOI ↗

swap_horiz KL Divergence

Measures how far the system's prediction is from "I have no idea" (50/50). A large divergence means the system has strong evidence one way or the other. A small divergence means the prediction isn't much better than guessing.

KL Divergence quantifies the information advantage of the prediction system over having no information. It provides a single metric for the total value of the system's knowledge about a specific claim. Tracking average KL Divergence over time demonstrates increasing system knowledge and justifies continued investment in rule development.

D_KL(P || Q) = Σ P(x)·log₂(P(x)/Q(x))

P = predicted distribution, Q = Uniform(0.5, 0.5)
D_KL = p·log₂(2p) + (1-p)·log₂(2(1-p))
Range: [0, 1] bits. Asymmetric. Always ≥ 0.

Kullback, S. & Leibler, R.A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics, 22(1), 79-86. DOI ↗

lightbulb Information Gain

Shows how much uncertainty was removed by the prediction. If you started with no information and the system reduced your uncertainty by 80%, that's a high information gain — the prediction is genuinely useful.

Information Gain directly measures the value of the prediction in reducing decision uncertainty. An average Information Gain of 0.7 bits (out of 1.0 maximum) means the system resolves 70% of the uncertainty in each claim, translating to proportional reductions in the need for manual verification. This is the core value metric for the prediction system.

IG = H(prior) - H(posterior)
= 1 - H(p) [when prior is Uniform]

Equivalently: IG = D_KL(posterior || prior)
Range: [0, 1] bits. Higher = more information gained.

balance Cross-Entropy & Bayes Factor

Compares the prediction system against a simple "always guess the average" strategy. The Bayes Factor tells you how many times more likely the system's prediction is to be correct compared to the naive approach. Values above 10 mean strong evidence in favor of the system.

The Bayes Factor provides an intuitive strength-of-evidence measure for the prediction system vs. a null model. Unlike p-values, Bayes Factors have a natural interpretation: BF = 20 means "the data is 20 times more likely under our model than under random chance." Industry convention: BF > 10 is "strong evidence," BF > 100 is "decisive." This metric is directly presentable to non-technical stakeholders.

Cross-Entropy: H(y, p) = -(1/N) Σ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]

Bayes Factor: BF = P(data|M₁) / P(data|M₀)
Log BF = Σ log(pᵢ(yᵢ)/0.5) = N·[H(null) - H(y,p)]

M₁: prediction model. M₀: null (base rate only).
BF > 1 favors model. Jeffreys scale for interpretation.

Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford University Press. Appendix B: Significance tests. Book ↗

payments Expected Monetary Value (EMV)

Translates each prediction into dollars and cents. It calculates the expected financial outcome of trusting the prediction vs. calling the payer, accounting for the cost of the call, the risk of a wrong prediction, and the value of getting it right.

EMV converts prediction quality into direct financial terms. By comparing EMV(predict) vs. EMV(call), each routing decision has a quantified dollar impact. Aggregated across thousands of claims, this provides a clear ROI picture: "The prediction system saved $X by avoiding Y calls while correctly predicting Z% of outcomes."

EMV(predict) = p·V_correct + (1-p)·C_wrong
EMV(call) = V_correct - C_call

V_correct: value of correct prediction (avoided call cost)
C_wrong: cost of incorrect prediction (rework, patient impact)
C_call: direct cost of verification call

Route to predict when: EMV(predict) > EMV(call)

Raiffa, H. (1968). Decision Analysis: Introductory Lectures on Choices under Uncertainty. Addison-Wesley. HBS ↗

query_stats Value of Information (VoI)

Answers "Is it worth calling the payer?" If the value of getting perfect information is less than the cost of the call, calling isn't worth it. This helps the system make smarter routing decisions based on actual cost-benefit analysis.

Value of Information provides the economic foundation for the routing decision. When VoI > cost of verification call, calling the payer is justified. When VoI < call cost, the prediction is sufficient. This framework transforms a subjective "should we call?" into an objective, auditable financial decision. Tracking VoI across payers identifies where verification processes are most and least cost-effective.

VoI = EMV(with perfect info) - EMV(current prediction)
= |C_wrong| × min(p, 1-p)

If VoI > C_call → call payer (information is worth more than it costs)
If VoI ≤ C_call → trust prediction

Threshold: p* where VoI = C_call → p* = C_call / |C_wrong|

Howard, R.A. (1966). "Information Value Theory." IEEE Trans. Systems Science and Cybernetics, SSC-2(1), 22-26. DOI ↗

workspace_premium EVPI

The theoretical maximum value of having perfect information. This puts a ceiling on how much it's worth spending to improve predictions — if EVPI is $2 per claim, no verification method worth more than $2 makes economic sense.

Expected Value of Perfect Information caps the maximum investment justifiable for improving prediction quality. If aggregate EVPI across all claims is $50K/month, any investment in better rules, more data, or payer integrations that costs less than $50K/month and closes the gap has positive ROI. This is a standard decision-analysis metric used in healthcare and insurance portfolio management.

EVPI = E[max(V₁,V₂,...)] - max(E[V₁],E[V₂],...)

For binary predict/call decision:
EVPI = |C_wrong| × min(p, 1-p)

Note: EVPI = VoI when the only alternative is a single perfect-info source.
EVPI → 0 as p → 0 or p → 1 (certainty reduces info value).

Pratt, J.W., Raiffa, H., & Schlaifer, R. (1995). Introduction to Statistical Decision Theory. MIT Press. Ch. 4: Value of Information. MIT Press ↗

scatter_plot Monte Carlo Simulation

Runs 1,000 simulated scenarios to model a range of possible financial outcomes. Instead of a single estimate, you see the full spread — best case, worst case, and everything in between. This helps understand the real risk of each prediction.

Monte Carlo simulation is the gold standard for risk analysis in financial services and insurance. By generating 1,000 scenarios, it reveals the full distribution of potential outcomes — including tail risks that point estimates miss. This approach is required by Basel III banking regulations and Solvency II insurance frameworks for risk quantification.

1. Sample p ~ Beta(α, β) [α = conf×20, β = (1-conf)×20]
2. For each sample: outcome = p > 0.5 ? +V_covered : -C_patient
3. Repeat N = 1000 times
4. Sort outcomes for quantile estimation

Uses inverse CDF method for Beta sampling.
Seed is not fixed — results vary slightly between runs.

Metropolis, N. & Ulam, S. (1949). "The Monte Carlo Method." J. American Statistical Association, 44(247), 335-341. DOI ↗

warning VaR / CVaR

Value at Risk tells you "in the worst 5% of cases, the loss could be at least this much." Conditional VaR goes further: "in those worst cases, the average loss is this much." These metrics help you understand worst-case financial exposure.

VaR and CVaR are regulatory-standard risk metrics used by every major financial institution. VaR(95%) sets the "normal worst case" boundary, while CVaR captures average tail losses — the scenarios that matter most. Presenting these alongside predictions demonstrates enterprise-grade risk management, supporting compliance requirements and payer contract negotiations.

VaR(α) = -quantile(outcomes, α)
VaR(95%) = -outcomes[floor(N × 0.05)] [sorted ascending]

CVaR(α) = -mean(outcomes where outcome ≤ -VaR(α))
= expected loss given loss exceeds VaR

CVaR ≥ VaR always. CVaR is a coherent risk measure; VaR is not.

Rockafellar, R.T. & Uryasev, S. (2000). "Optimization of Conditional Value-at-Risk." Journal of Risk, 2(3), 21-41. DOI ↗

show_chart Sharpe & Sortino Ratios

Risk-adjusted performance metrics borrowed from finance. They answer: "Is the system getting good results for the amount of risk it takes?" A high ratio means the system makes reliable predictions without taking big gambles.

The Sharpe Ratio (Nobel Prize-winning metric) and Sortino Ratio measure risk-adjusted returns. Applied to insurance predictions, they answer: "Is the prediction system generating savings efficiently, or are good results masking high variance?" A Sharpe > 1.0 is considered "good" in financial markets. The Sortino variant only penalizes downside risk, better matching the asymmetric cost structure of incorrect coverage predictions.

Sharpe = (mean(outcomes) - risk_free) / std(outcomes)
Sortino = (mean(outcomes) - risk_free) / downside_std

downside_std = √(mean(min(0, outcome - target)²))
risk_free = 0 (no-action baseline)

Sharpe penalizes all volatility; Sortino only penalizes losses.
Sortino > Sharpe when upside variance > downside variance.

Sharpe, W.F. (1966). "Mutual Fund Performance." Journal of Business, 39(1), 119-138. DOI ↗
Sortino, F.A. & Price, L.N. (1994). "Performance Measurement in a Downside Risk Framework." Journal of Investing, 3(3), 59-64.

waterfall_chart Factor Waterfall

Shows exactly how much each factor contributes to the final confidence score. Like a waterfall chart in financial reporting, you can see which factors pushed confidence up and which pulled it down, making the prediction fully transparent.

Factor decomposition provides full auditability for every prediction. Regulators, payers, and practice managers can see exactly why a prediction was made — not just the outcome. This SHAP-like decomposition is the standard for explainable AI in regulated industries, supporting compliance with emerging AI transparency requirements.

contribution_i = w_i × (factor_i - baseline_i)

baseline_i = expected value of factor_i across all predictions
Σ contribution_i = confidence - base_confidence

Analogous to SHAP values for linear models.
For linear models: SHAP values = weighted factor deviations from mean.

tornado Tornado Chart

Identifies which single factor, if changed, would most likely flip the routing decision. If the tornado chart shows "historical rate" as the widest bar, that means getting more claims data for this payer would have the biggest impact on prediction reliability.

The Tornado Chart prioritizes improvement investments. By showing which factors are closest to flipping the routing decision, it directly identifies where to focus resources. If payer reliability is the swing factor for 30% of uncertain predictions, investing in payer-specific rule development has a quantifiable expected return.

For each factor_i:
low_i = confidence with factor_i set to 0.0
high_i = confidence with factor_i set to 1.0
delta_i = high_i - low_i = w_i (factor weight)
flip_dist_i = |threshold - confidence| / w_i

Sort by delta_i descending → tornado ordering.
flip_dist_i < 1.0 → this factor alone could flip the decision.

stacked_bar_chart Murphy (1973) Decomposition

Breaks down prediction quality into three parts: reliability (are predicted percentages accurate?), resolution (can the system distinguish easy from hard cases?), and uncertainty (how inherently unpredictable are the outcomes?). This diagnosis shows exactly where to improve.

The Murphy decomposition separates prediction quality into actionable components. High reliability means the system's probabilities are trustworthy. High resolution means it successfully differentiates cases. High uncertainty is an inherent property of the domain, not a system flaw. This decomposition is used by NOAA, the European Centre for Medium-Range Weather Forecasts, and actuarial certification boards.

BS = REL - RES + UNC

REL = (1/N) Σ nₖ(pₖ - ōₖ)² — reliability (lower is better)
RES = (1/N) Σ nₖ(ōₖ - ō)² — resolution (higher is better)
UNC = ō(1 - ō) — uncertainty (fixed for dataset)

K bins by predicted probability. nₖ = count in bin k.
pₖ = mean prediction in bin k. ōₖ = observed frequency in bin k.

Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology 12(4).

ssid_chart Calibration Diagram

A visual check: do predictions match reality? When the system says "80% chance of coverage," is coverage actually approved about 80% of the time? The closer the line follows the diagonal, the more calibrated (trustworthy) the predictions are.

The calibration diagram is the single most important visual for assessing prediction trustworthiness. A well-calibrated system (line near diagonal) means the probabilities can be taken at face value for decision-making. Systematic deviations reveal biases: above the diagonal means the system is under-confident (leaving value on the table), below means over-confident (creating risk). This visualization is standard in actuarial reporting.

1. Bin predictions into K equal-width bins (default K=10)
2. For each bin k: plot (mean predicted pₖ, observed frequency ōₖ)
3. Perfect calibration: all points on y = x diagonal

Confidence bands via: ōₖ ± 1.96 × √(ōₖ(1-ōₖ)/nₖ)
Minimum bin count: nₖ ≥ 10 for statistical validity.

lab_research Hosmer-Lemeshow Test

A statistical test that checks if the prediction system's probabilities are reliable across the full range. A "pass" means the system is well-calibrated; a "fail" means certain probability ranges are systematically off and may need adjustment.

The Hosmer-Lemeshow test is a standard goodness-of-fit measure for logistic and probabilistic models in healthcare and insurance. A passing result (p > 0.05) provides statistical evidence that the model is well-calibrated — a key requirement for regulatory submissions and actuarial certifications. This test is required by FDA guidance for clinical prediction models.

H-L statistic: χ² = Σ (Oₖ - Eₖ)² / (Eₖ(1 - pₖ̄))

G groups (typically 10 deciles of predicted probability).
Oₖ = observed positives in group k. Eₖ = expected = nₖ × p̄ₖ.
Under H₀ (good calibration): χ² ~ Chi²(G-2)
p-value > 0.05 → fail to reject → calibration is adequate.

Hosmer, D.W. & Lemeshow, S. (1980). "Goodness of fit tests for the multiple logistic regression model." Communications in Statistics A9.

moving Cumulative Scoring

Tracks prediction accuracy across your session, showing running scores that update after each prediction. As you process more patients, you can see the system's accuracy stabilize, giving confidence that the predictions are consistent, not just lucky.

Cumulative scoring provides real-time performance monitoring. Convergence of the running Brier Score demonstrates prediction stability — essential for quality assurance. Sudden jumps indicate either a system issue or an unusual claim mix, triggering investigation. This continuous monitoring approach mirrors SPC (Statistical Process Control) used in manufacturing quality management.

BS_cumulative(t) = (1/t) Σᵢ₌₁ᵗ (pᵢ - yᵢ)²
LL_cumulative(t) = -(1/t) Σᵢ₌₁ᵗ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]

Convergence rate: O(1/√t) by CLT.
Alert if |BS(t) - BS(t-1)| > 3σ (process control limits).

bar_chart Discrimination Histogram

Visualizes how well the system separates claims that will be covered from those that won't. Good separation means covered claims get high predictions and denied claims get low predictions — with minimal overlap in the middle.

Discrimination is the complement of calibration — it measures whether the system can distinguish good from bad outcomes. Two overlapping distributions indicate poor discrimination; separated distributions indicate strong discrimination. In medical testing terms, this is analogous to the sensitivity/specificity trade-off visible in a single chart.

Separate predictions by outcome: P₁ = {pᵢ : yᵢ=1}, P₀ = {pᵢ : yᵢ=0}
Plot histogram of P₁ and P₀ on same axis [0, 1].

Good discrimination: P₁ concentrated near 1, P₀ near 0.
Quantified by: AUC = P(P₁ > P₀) [Wilcoxon-Mann-Whitney]
Discrimination slope = mean(P₁) - mean(P₀)

trending_flat Confidence Trend

A time-series view of prediction confidence across the session, with outcome markers showing whether each prediction was correct. Helps identify patterns — are certain types of claims consistently uncertain? Is confidence improving as the system processes more data?

The Confidence Trend chart provides a process-level view of system performance over time. Upward trends indicate improving predictions (data flywheel effect). Downward trends or increased variance signal potential issues — payer rule changes, unusual patient mix, or data quality degradation. This is the operational dashboard metric for day-to-day system health monitoring.

Time series: {(tᵢ, confᵢ, yᵢ)} for i = 1..N

Plotted as scatter + line with markers:
● green = correct prediction (conf > 0.5 and covered, or conf < 0.5 and denied)
● red = incorrect prediction

Optional: EWMA overlay with λ = 0.3 for smoothed trend.
Routing threshold shown as horizontal reference line.

hub Core Prediction Pipeline

tune Confidence Factor Breakdown

scoreboard Proper Scoring Rules

schema Bayesian Inference

info Information Theory

account_tree Decision Theory

casino Risk Analytics (Monte Carlo)

tune Sensitivity Analysis

straighten Calibration & Reliability

timeline Session-Level Tracking