Case study · Published July 2026

Case study: catching what the record already knew

A concordance test: run the engine on a landmark trial with a documented, public correction, and check whether its independent central finding matches the reason the correction exists — then whether it sizes the problem correctly.

§01 Why test against the public record

A report is only as trustworthy as its track record. One way to earn that is to run the engine on papers that were later corrected or retracted for a documented methodological reason, and check whether its single most important finding matches the documented reason. Because the engine's findings are quote-grounded — each tied to verbatim text in the paper — the concordance is read from the manuscript, not reverse-engineered from the outcome. This is a methods exercise in a validation context, the way the literature routinely critiques published work; it is not a commentary on any author.

The first case is PREDIMED (Estruch et al.), a landmark trial of a Mediterranean diet for cardiovascular prevention. The original 2013 report (N Engl J Med 2013;368:1279-90) was retracted and republished in 2018 after questions about its randomization.

§02 The documented reason, and how it surfaced

The trigger was statistical detective work. John Carlisle's survey of baseline data in 5,087 randomized trials flagged trials whose baseline distributions were too similar or too divergent to be consistent with genuine random allocation, and PREDIMED was among the papers it drew attention to. On investigation, the authors confirmed the concern: randomization had not been done properly for a substantial minority of participants. The journal issued a retraction-and-republication notice (“Because of irregularities in the randomization procedures…”) and published a reanalysis (N Engl J Med 2018;378:e34).
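To make the idea of "too similar or too divergent" concrete, here is a minimal sketch in the spirit of a baseline-distribution check. It is not Carlisle's actual procedure: it assumes a normal approximation for differences in means, ignores rounding in the reported summaries, and uses made-up summary rows rather than any trial's data. The function names and the input layout are hypothetical.

```python
import math

def baseline_p_value(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided p-value for a difference in baseline means (normal approximation)."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of the p-values and Uniform(0, 1)."""
    ps = sorted(p_values)
    n = len(ps)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ps))

# Hypothetical summary rows, NOT trial data:
# (mean_a, sd_a, n_a, mean_b, sd_b, n_b) for one baseline variable each.
rows = [
    (67.0, 6.0, 500, 67.0, 6.1, 500),
    (29.9, 3.7, 500, 29.9, 3.8, 500),
    (100.0, 10.0, 500, 100.1, 10.0, 500),
]

p_values = [baseline_p_value(*row) for row in rows]
distance = ks_distance_from_uniform(p_values)
print(p_values, distance)
```

Under proper randomization, baseline p-values should be spread roughly uniformly between 0 and 1. A set that clusters near 1 (arms implausibly alike) or near 0 (arms implausibly different) sits far from uniform, which is what the distance is meant to surface. Three variables are far too few to conclude anything; the sketch only shows the shape of the calculation.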

§03 The engine's independent finding

Appraising the 2018 republished article — the current authoritative version, which itself documents the deviations — the engine's top finding was compromised randomization for roughly 1,588 of 7,447 participants (about 21%): household members assigned to a relative's arm, one site randomizing by clinic rather than by individual, another applying its randomization tables inconsistently. For that subset, the trial is effectively quasi-randomized, and the effect there is identified by propensity adjustment rather than randomization — so unmeasured confounding cannot be excluded.

That finding is the same issue as the documented retraction reason. Both engines flagged it independently, and the deterministic layer separately recomputed the reported hazard-ratio intervals and reconciled the event counts with no inconsistencies. The concordance is on the primary methodological problem: the thing the record already knew. Randomization integrity is a core CONSORT question, worked through in our CONSORT walkthrough.
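The article does not describe how the deterministic layer recomputes intervals, so the following is only one common consistency check, sketched under stated assumptions. It assumes a Wald-type 95% interval built on the log scale, in which case the reported hazard ratio should sit close to the geometric midpoint of its bounds. The function names, the tolerance, and the example numbers are hypothetical and are not taken from the trial.

```python
import math

def ci_is_log_symmetric(hr, lower, upper, tol=0.02):
    """True if the HR lies near the geometric midpoint of its CI (log scale).

    The tolerance absorbs rounding of the published values to two decimals.
    """
    midpoint = math.sqrt(lower * upper)
    return abs(math.log(hr) - math.log(midpoint)) <= tol

def implied_se(lower, upper, z=1.959964):
    """Standard error of log(HR) implied by a 95% CI, assuming a Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Illustrative values only, not figures from the paper.
print(ci_is_log_symmetric(0.70, 0.55, 0.90))  # interval consistent with the point estimate
print(ci_is_log_symmetric(0.70, 0.55, 1.20))  # upper bound does not match the point estimate
print(round(implied_se(0.55, 0.90), 4))
```

A failed check does not prove an error. Intervals from other methods, such as profile likelihood or bootstrap, need not be symmetric on the log scale, so a mismatch is a prompt to look closer rather than a verdict.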

§04 The harder half: sizing it correctly

Catching the issue is only half the test. The engine rated it moderate — not serious or critical — because the republished paper discloses every deviation and shows the protective effect survives propensity-adjusted reanalysis and the omission of the 1,588 affected participants. A tool that screamed “critical” at a transparently corrected trial would be miscalibrated. Concordance on the problem; restraint on the grade. This is the same calibration discipline the sample reports show from the other direction — see the colectomy and PE risk-model cases.

The honest framing matters: one case is an anecdote, not a benchmark. As cases accumulate, the plan is to report detection performance — precision and recall by error class — including the cases where the engine misses or over-flags. The goal is a public track record, not a highlight reel.
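As a sketch of what "precision and recall by error class" could look like once more cases exist, the snippet below tallies hits, over-flags, and misses per class. The record layout, class labels, and cases are hypothetical placeholders, not results. It also assumes the documented correction reason is the full ground truth, which undercounts precision whenever the engine flags a real issue the record never mentioned.

```python
from collections import defaultdict

def precision_recall_by_class(cases):
    """Per-class precision and recall from documented vs. flagged error classes."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for case in cases:
        documented = set(case["documented"])
        flagged = set(case["flagged"])
        for cls in documented & flagged:
            counts[cls]["tp"] += 1  # engine matched the documented reason
        for cls in flagged - documented:
            counts[cls]["fp"] += 1  # over-flag relative to the record
        for cls in documented - flagged:
            counts[cls]["fn"] += 1  # documented reason the engine missed

    result = {}
    for cls, c in counts.items():
        flagged_total = c["tp"] + c["fp"]
        documented_total = c["tp"] + c["fn"]
        result[cls] = {
            # None marks an undefined ratio (nothing flagged / nothing documented).
            "precision": c["tp"] / flagged_total if flagged_total else None,
            "recall": c["tp"] / documented_total if documented_total else None,
        }
    return result

# Hypothetical cases for illustration only.
cases = [
    {"documented": ["randomization"], "flagged": ["randomization"]},
    {"documented": ["randomization"], "flagged": ["unit_of_analysis"]},
    {"documented": ["data_error"], "flagged": ["data_error", "randomization"]},
]

metrics = precision_recall_by_class(cases)
print(metrics)
```

Reporting both numbers per class keeps misses and over-flags visible side by side, which is the point of publishing a track record and not only the hits.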

§05 See the full benchmark

The validation page lays out the claim map, the matched quotes, the six-domain scorecard, and the honesty notes on which version was appraised and why.

Read the full concordance benchmark, see how the engine works, or read a sample report and review pricing; the pre-submission review is $25.

How to read this. This is methodological commentary in a validation context. It does not allege misconduct and does not certify or de-certify any paper — the cited authors corrected the record themselves, which is what makes it a clean benchmark. RigorMD flags methodological and statistical issues for your judgment; it does not replace peer review or a statistician's input on study design.