A concordance test: run the engine on a landmark trial with a documented, public correction, and check whether its independent central finding matches the reason the correction exists — then whether it sizes the problem correctly.
A report is only as trustworthy as the track record of the engine that produced it. One way to build that record is to run the engine on papers that were later corrected or retracted for a documented methodological reason, and check whether its single most important finding matches that reason. Because the engine's findings are quote-grounded, each tied to verbatim text in the paper, the concordance is read from the manuscript, not reverse-engineered from the outcome. This is a methods exercise in a validation context, the way the literature routinely critiques published work; it is not a commentary on any author.
The first case is PREDIMED — Estruch et al., a landmark trial of a Mediterranean diet for cardiovascular prevention. The original 2013 report (N Engl J Med 2013;368:1279-90) ↗ was retracted and republished in 2018 after questions about its randomization.
The trigger was statistical detective work. John Carlisle's survey of baseline data in 5,087 randomized trials ↗ flagged trials whose baseline distributions were too similar or too divergent to be consistent with true random sampling — and PREDIMED was among the papers this method drew attention to. On investigation, the authors confirmed the concern: randomization had not been done properly for a substantial minority of participants. The journal issued a retraction-and-republication notice ↗ — “Because of irregularities in the randomization procedures…” — and published a reanalysis (N Engl J Med 2018;378:e34) ↗.
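The idea behind that kind of baseline screening can be sketched in a few lines. The Python below is a simplified illustration, not the survey's actual procedure: it assumes a normal approximation, the summary statistics are made up for the example, and `baseline_p_value` is a hypothetical helper name. Under proper randomization, p-values for baseline differences should be spread roughly evenly between 0 and 1; many values crowded near 1 suggest arms that are too similar, and many near 0 suggest arms that are too divergent.

```python
import math

def baseline_p_value(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided p-value for a difference in baseline means between two arms.

    Simplifying assumption: normal approximation from summary statistics only.
    """
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical baseline ages (mean, SD, n) for two arms; not data from any trial.
identical_arms = baseline_p_value(67.0, 6.0, 2500, 67.0, 6.0, 2500)
divergent_arms = baseline_p_value(67.0, 6.0, 2500, 68.0, 6.0, 2500)

print(identical_arms)  # 1.0: no difference at all between the arms
print(divergent_arms)  # very small: a one-year gap is large at this sample size
```

A single extreme value means little on its own; a screen of this kind looks at the pattern across many baseline variables in the same trial.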
Appraising the 2018 republished article (the current authoritative version, which itself documents the deviations), the engine's top finding was compromised randomization for 1,588 of 7,447 participants (about 21%): household members assigned to a relative's arm, one site randomizing by clinic rather than by individual, and another applying its randomization tables inconsistently. For that subset, the trial is effectively quasi-randomized, and the effect there is identified by propensity adjustment rather than randomization, so unmeasured confounding cannot be excluded.
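The proportion quoted above is easy to check by hand. A minimal Python sketch, using only the two counts cited in the text (the variable names are illustrative):

```python
affected = 1588  # participants with compromised randomization, as cited above
enrolled = 7447  # total participants, as cited above

share = affected / enrolled
unaffected = enrolled - affected

print(f"{share:.1%}")  # 21.3%
print(unaffected)      # 5859
```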
That finding is the same issue as the documented reason for the retraction. Both engines flagged it independently, and the deterministic layer separately recomputed the reported hazard-ratio intervals and reconciled the event counts, finding no inconsistencies. The concordance is on the primary methodological problem, the one the record already documented. Randomization integrity is the question worked through in our CONSORT walkthrough.
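The text does not describe how the deterministic layer performs these checks, so the following Python is only a sketch of one common approach, under stated assumptions. It relies on the fact that a Wald-type confidence interval for a hazard ratio is symmetric on the log scale, so the point estimate should sit near the geometric mean of the bounds. The function names, the rounding tolerance, and the example numbers are all hypothetical and are not taken from the trial.

```python
import math

def ci_is_log_symmetric(hr, lower, upper, tol=0.02):
    """Check that a reported HR lies near the geometric mean of its CI bounds.

    Assumes a Wald-type interval; tol absorbs rounding to two decimals.
    """
    if not (0 < lower <= hr <= upper):
        return False
    implied_hr = math.sqrt(lower * upper)
    return abs(implied_hr - hr) <= tol

def implied_log_se(lower, upper, z=1.96):
    """Standard error of log(HR) implied by a 95% interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def events_reconcile(per_arm_events, reported_total):
    """Check that per-arm event counts add up to the reported total."""
    return sum(per_arm_events) == reported_total

# Hypothetical values for illustration only.
print(ci_is_log_symmetric(0.70, 0.50, 0.98))  # True
print(ci_is_log_symmetric(0.70, 0.50, 1.30))  # False
print(events_reconcile([10, 12, 15], 37))     # True
```

A check like this cannot say whether an estimate is right; it can only say whether the reported numbers are consistent with one another.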
Catching the issue is only half the test. The engine rated it moderate — not serious or critical — because the republished paper discloses every deviation and shows the protective effect survives propensity-adjusted reanalysis and the omission of the 1,588 affected participants. A tool that screamed “critical” at a transparently corrected trial would be miscalibrated. Concordance on the problem; restraint on the grade. This is the same calibration discipline the sample reports show from the other direction — see the colectomy and PE risk-model cases.
The honest framing matters: one case is an anecdote, not a benchmark. As cases accumulate, the plan is to report detection performance — precision and recall by error class — including the cases where the engine misses or over-flags. The goal is a public track record, not a highlight reel.
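Reporting precision and recall by error class could look something like the Python sketch below. The record layout, the class labels, and the example cases are hypothetical placeholders, not results. Precision is the share of the engine's flags that match a documented problem, and recall is the share of documented problems the engine flagged.

```python
from collections import defaultdict

def precision_recall_by_class(cases):
    """cases: iterable of (error_class, documented, flagged) tuples.

    documented: the public record confirms a problem of this class.
    flagged: the engine reported a problem of this class.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for error_class, documented, flagged in cases:
        tally = counts[error_class]
        if flagged and documented:
            tally["tp"] += 1  # correct detection
        elif flagged:
            tally["fp"] += 1  # over-flag
        elif documented:
            tally["fn"] += 1  # miss

    results = {}
    for error_class, tally in counts.items():
        flagged_total = tally["tp"] + tally["fp"]
        documented_total = tally["tp"] + tally["fn"]
        results[error_class] = {
            # None when the denominator is zero and the metric is undefined.
            "precision": tally["tp"] / flagged_total if flagged_total else None,
            "recall": tally["tp"] / documented_total if documented_total else None,
        }
    return results

# Made-up cases for illustration: one hit, one over-flag, one miss, one hit.
example_cases = [
    ("randomization", True, True),
    ("randomization", False, True),
    ("randomization", True, False),
    ("statistics", True, True),
]
scores = precision_recall_by_class(example_cases)
print(scores["randomization"])  # {'precision': 0.5, 'recall': 0.5}
print(scores["statistics"])     # {'precision': 1.0, 'recall': 1.0}
```

Keeping misses and over-flags as separate tallies is what lets both kinds of failure show up in the published numbers.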
The validation page lays out the claim map, the matched quotes, the six-domain scorecard, and the honesty notes on which version was appraised and why.
Read the full concordance benchmark, see how the engine works, or read a sample report and review pricing; the pre-submission review is $25.