A report is only as trustworthy as its track record. This page tests RigorMD against an external, independent yardstick: published papers that were later corrected or retracted for a documented methodological reason. We run the engine on the paper, then check whether its single most important finding matches the reason the correction exists. The engine's findings are quote-grounded — each is tied to verbatim text in the paper — so the concordance is earned from the manuscript, not reverse-engineered from the outcome.
This is the first case. More will be added, scored the same way, with precision and recall reported by error class as the set grows. It is a methods exercise — critiquing published work in a validation context, the way the literature routinely does — not a commentary on any author.
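As a rough illustration of how per-class precision and recall could be tallied once more cases exist, here is a minimal sketch. It is not RigorMD's scoring code: the function name `score_by_error_class`, the class labels, and the rule that each case contributes one documented reason and at most one central finding are all assumptions made for the example.

```python
from collections import defaultdict


def score_by_error_class(cases):
    """Tally precision and recall per error class.

    `cases` is a list of (documented_class, flagged_class) pairs, where
    flagged_class is the engine's central finding, or None if it raised none.
    Both labels are hypothetical strings chosen for this sketch.
    """
    tp = defaultdict(int)  # central finding matched the documented reason
    fp = defaultdict(int)  # engine named this class, but the documented reason differed
    fn = defaultdict(int)  # documented reason was this class, but the engine missed it

    for documented, flagged in cases:
        if flagged == documented:
            tp[documented] += 1
        else:
            fn[documented] += 1
            if flagged is not None:
                fp[flagged] += 1

    scores = {}
    for cls in set(tp) | set(fp) | set(fn):
        p_den = tp[cls] + fp[cls]
        r_den = tp[cls] + fn[cls]
        scores[cls] = {
            # None marks an undefined ratio (no predictions or no documented cases)
            "precision": tp[cls] / p_den if p_den else None,
            "recall": tp[cls] / r_den if r_den else None,
        }
    return scores


# Illustrative input only: three made-up cases, not real benchmark results.
example = [
    ("randomization", "randomization"),
    ("randomization", "statistics"),
    ("statistics", None),
]
result = score_by_error_class(example)
```

With a single concordant case, both ratios for that class are trivially 1.0, which is why the figures only become informative as the set grows.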
Estruch R, et al. Primary Prevention of Cardiovascular Disease with a Mediterranean Diet Supplemented with Extra-Virgin Olive Oil or Nuts. N Engl J Med 2018;378:e34 (the republished version). The original (N Engl J Med 2013;368:1279-90) was retracted in 2018.
The documented retraction reason (randomization irregularities) and the engine's independent central finding (compromised randomization) are the same issue. Concordance on the primary methodological problem.
The engine flags the randomization problem as central, then rates it moderate (not serious or critical) because the republished paper discloses every deviation and shows the protective effect survives propensity-adjusted reanalysis and the omission of the 1,588 affected participants. A tool that screamed “critical” at a transparently-corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.
| Severity | Domain | Finding | Disclosed? |
|---|---|---|---|
| Moderate | 01 · Design | ~21% of participants not individually randomized (household same-arm; clinic-level at one site; inconsistent tables at another) — the documented retraction reason. | Yes |
| Moderate | 03 · Statistics | For the non-randomized fraction, the effect is identified by propensity-score adjustment rather than randomization, so unmeasured confounding cannot be excluded. | Yes |
| Moderate | 01 · Design | Open-label trial with early asymmetric control-group support; mitigated by blinded adjudication of hard endpoints. | Yes |
| Mild | 02 · Alignment | Conclusion well-calibrated to the corrected evidence; minor transportability caveat (high-risk Mediterranean population, supplied foods). | Yes |
The deterministic layer recomputed the reported hazard-ratio intervals and reconciled the event counts (96 + 83 + 109 = 288) with no inconsistencies. Both the Claude and OpenAI engines independently flagged the randomization problem.
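The kind of check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's actual deterministic layer: the function names are hypothetical, the interval check assumes a Wald-type confidence interval that is symmetric on the log scale, and the hazard-ratio numbers in the usage lines are made up for the example rather than taken from the paper. Only the event counts (96, 83, 109, 288) come from the text above.

```python
import math


def events_reconcile(arm_events, reported_total):
    """Return True if the per-arm event counts sum to the reported total."""
    return sum(arm_events) == reported_total


def hr_matches_interval(hr, lower, upper, decimals=2):
    """Check a hazard ratio against its reported confidence interval.

    Assumes a Wald-type interval, which is symmetric on the log scale, so the
    point estimate should equal the geometric mean of the two bounds. The
    tolerance allows for values reported to `decimals` decimal places.
    """
    implied_hr = math.sqrt(lower * upper)
    tolerance = 10 ** (-decimals)
    return abs(implied_hr - hr) <= tolerance


# Event counts quoted in the text above.
counts_ok = events_reconcile([96, 83, 109], 288)

# Hypothetical hazard ratios, for illustration only (not the paper's values).
consistent = hr_matches_interval(0.70, 0.50, 0.98)
inconsistent = hr_matches_interval(0.70, 0.50, 0.70)
```

A check like this cannot say whether an estimate is right, only whether the reported numbers agree with one another, which is why it sits alongside the quote-grounded findings and does not replace them.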
Quote-grounded. Every finding cites verbatim text from the paper; the randomization finding is read directly from the methods section, not inferred from the retraction.
Which version was appraised. We appraised the 2018 republished article (the current authoritative version, which itself documents the deviations) and compared the engine's central finding with the public retraction notice. Appraising the corrected version, rather than piling on the withdrawn original, is the fairer test and the more useful demonstration: it shows the engine lands on the right issue and credits the authors' transparent fix.
Scope. This is methodological commentary in a validation context. It does not allege misconduct, does not certify or de-certify any paper, and is not clinical advice. The cited authors corrected the record themselves; that correction is exactly what makes this a clean benchmark.