Case study · Published July 2026

Case study: reverse causation in a risk model

A walkthrough of a real, de-identified RigorMD report on a clinical prediction model. The model is rigorously built and exemplary in its reporting — and still earns a serious rating, because the central claim outruns the evidence.

§01 The study

The manuscript develops a calculator to tell surgeons which patient, at which moment, is at high enough risk of pulmonary embolism (PE) to justify extended prophylaxis — and to update that risk after an unplanned readmission or reoperation. Built and temporally validated in 4.8 million operations, the discharge model discriminates well (AUC 0.811) with excellent calibration, and the reporting is exemplary: a full TRIPOD checklist, released coefficients and code, a working calculator. The deterministic layer found the numbers reconcile throughout.

And yet the engine graded it serious. This is the case that shows severity is not a measure of sloppiness. A meticulous study can carry a serious finding if its headline claim depends on an inference the data cannot support.

§02 The central finding: an accuracy gain built on reverse causation

The paper's headline advance is discrimination rising from AUC 0.811 to 0.892 when unplanned readmission and reoperation are added to the model. The problem is reverse causation: those returns are frequently the occasion on which the PE is diagnosed (68% of readmissions for suspected clot were PE), and the registry records the return and the PE in the same 30-day window without ordering them. A predictor that is really the outcome arriving makes the model look far more predictive than it is.

What earns this a serious rather than fatal rating is that the authors saw it and bounded it: they remove the suspected-clot subgroup, re-anchor follow-up (HR 1.08, 95% CI 0.81–1.44), and discard PEs diagnosed within days of the event — and a real residual signal survives. But the number a clinician would deploy is still 0.892, and the increment is the paper's central contribution. The general version of this trap — a predictor ascertained concurrently with the outcome — is the TRIPOD/PROBAST question in our TRIPOD walkthrough.
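The mechanism in this section can be reproduced on purely synthetic data. The sketch below is an illustration, not the manuscript's model or RigorMD's code: the baseline risk score, the 5% background return rate, the 60% return rate among PE patients, and the weight given to a return are all hypothetical values chosen for the demonstration. It compares the AUC of a baseline score with the AUC of the same score plus a "return" flag that is mostly triggered by the outcome itself.

```python
import math
import random


def auc(scores, labels):
    """Mann-Whitney AUC, using average ranks so tied scores count as 0.5."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of this tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(n=20000, seed=0):
    """Return (baseline AUC, AUC with the outcome-driven return flag added)."""
    rng = random.Random(seed)
    base, augmented, pe = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # hypothetical discharge-time risk score
        p_pe = 1 / (1 + math.exp(-(-4.0 + 1.2 * x)))
        y = 1 if rng.random() < p_pe else 0
        # Reverse causation: the return is usually *caused by* the PE.
        # 5% background return rate, 60% when a PE occurred (both made up).
        p_return = 0.60 if y else 0.05
        r = 1 if rng.random() < p_return else 0
        base.append(x)
        augmented.append(x + 3.0 * r)  # 3.0 is an arbitrary weight, not a fitted coefficient
        pe.append(y)
    return auc(base, pe), auc(augmented, pe)


if __name__ == "__main__":
    base_auc, augmented_auc = simulate()
    print(f"baseline AUC:         {base_auc:.3f}")
    print(f"with return flag AUC: {augmented_auc:.3f}")
```

Under these made-up parameters the second AUC comes out clearly higher, even though the return flag carries no information that was available before the PE occurred. The gain is real as arithmetic and illusory as prediction, which is the distinction the finding turns on.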

§03 Two compounding findings

A public calculator deployed on temporal-only validation. The model is validated on a later time slice of the same data source — no external validation in an independent health system, no prospective impact study. A deployed calculator makes a stronger claim than internal or temporal validation can support; the appropriate label is investigational.

An estimand mismatch. The registry records no anticoagulant exposure, and roughly a third of comparable patients already receive extended prophylaxis. So the model predicts PE risk under current mixed treatment — not the untreated risk that a “should I treat?” decision actually turns on. The flagged group's observed PE rate is already partly post-prophylaxis, so the counterfactual benefit of adding prophylaxis is not identified from these data.
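A small calculation shows why the untreated risk is not identified. The sketch below assumes, hypothetically, that the observed rate is a simple mixture, observed = untreated × ((1 − f) + f × RR), where f is the treated fraction and RR is the relative risk of PE under prophylaxis. It borrows the flagged-group rate of 2.92% and the "roughly a third" treated fraction from the report, and it assumes that fraction applies to the flagged group. The RR values are invented, because the registry cannot supply one, and the mixture itself assumes prophylaxis was given independently of baseline risk, which real practice is unlikely to satisfy.

```python
def implied_untreated_risk(observed_risk, treated_fraction, relative_risk):
    """Back out the untreated risk implied by a mixed-treatment observed risk.

    Assumes observed = untreated * ((1 - f) + f * RR), i.e. prophylaxis was
    assigned independently of baseline risk (a strong, illustrative assumption).
    """
    if not 0 <= treated_fraction <= 1:
        raise ValueError("treated_fraction must be between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative_risk must be positive")
    mix = (1 - treated_fraction) + treated_fraction * relative_risk
    return observed_risk / mix


OBSERVED_RISK = 0.0292      # flagged-group PE rate quoted in the report
TREATED_FRACTION = 1 / 3    # "roughly a third"; assumed to hold for the flagged group

if __name__ == "__main__":
    # Hypothetical relative risks under prophylaxis; none comes from the registry.
    for rr in (1.0, 0.7, 0.4):
        untreated = implied_untreated_risk(OBSERVED_RISK, TREATED_FRACTION, rr)
        print(f"assumed RR {rr:.1f} -> implied untreated risk {untreated:.2%}")
```

Across these assumed relative risks the implied untreated risk ranges from 2.92% to about 3.65%. The data fit every one of them equally well, so the quantity a treatment decision needs depends entirely on an input the registry does not contain.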

§04 The forensic layer, and where severity comes from

The deterministic checks passed cleanly: the timing hazard ratio reconciles with its interval and p-value, the cohort split and PE incidence agree, and the reclassification table's cells, counts, rates, and number-needed-to-evaluate all reconcile (PPV 2.92%, about one PE per 34 flagged). As in the colectomy case, clean arithmetic and a serious grade coexist — because the serious finding lives in the design, not the numbers.
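Two of these reconciliations can be redone by hand from the figures quoted in this walkthrough. The sketch below is not RigorMD's implementation, and its function names are hypothetical. It assumes the hazard ratio's interval is a Wald interval, symmetric on the log scale, so that the point estimate should sit at the geometric mean of the bounds and a two-sided p-value can be backed out from the interval width. It also treats number-needed-to-evaluate as the reciprocal of PPV.

```python
import math
from statistics import NormalDist


def log_scale_midpoint(lower, upper):
    """Geometric mean of the CI bounds; should match a ratio estimate's point value."""
    return math.exp((math.log(lower) + math.log(upper)) / 2)


def implied_two_sided_p(estimate, lower, upper, level=0.95):
    """Two-sided p-value implied by a ratio estimate and its Wald CI (assumed form)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def number_needed_to_evaluate(ppv):
    """Flagged patients per true case, taken as 1 / PPV."""
    return 1 / ppv


if __name__ == "__main__":
    # Figures quoted in this walkthrough: HR 1.08 (95% CI 0.81-1.44), PPV 2.92%.
    print(f"CI midpoint on log scale: {log_scale_midpoint(0.81, 1.44):.2f}")
    print(f"implied two-sided p:      {implied_two_sided_p(1.08, 0.81, 1.44):.2f}")
    print(f"flagged per PE:           {number_needed_to_evaluate(0.0292):.1f}")
```

The midpoint comes back as 1.08, matching the reported hazard ratio, and 1 / 0.0292 is about 34.2, matching "about one PE per 34 flagged". The implied p-value is about 0.60, which is what an interval spanning 1 would lead one to expect.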

The severity is confined to the central claim, and the report is precise about it: the model is a strong risk marker, not a validated prevention strategy. That distinction is the whole report. Sizing the finding as serious, rather than dismissing a well-built model or over-rating a disclosed limitation, is the calibration the grade is meant to carry.

§05 Read the full report

The report includes the discrimination figure, the full six-domain scorecard, the deterministic checks, and before/after language for the sentences that overreach — for instance, rewording “a practical strategy to reduce missed opportunities for prevention” into a claim the evidence supports.

Read the full PE risk-model report, see how the engine works, or review pricing (the pre-submission review is $25). For the contrasting case where the same clean arithmetic yields only a moderate grade, see the colectomy cohort case study.

How to read this. This walkthrough describes a de-identified report for demonstration. RigorMD flags methodological and statistical issues for your judgment; it does not certify a manuscript, replace peer review, or replace a statistician's input on study design.