Validation · concordance benchmark · published papers with a documented public correction
RigorMD Validation

Does the engine catch what the record already knows?

A concordance benchmark · Case 1 of an ongoing series
Method · Blind appraisal vs public record
Outcome · Did the central finding match?
Concordance
case 1 · PREDIMED
source: NEJM
2 engines + deterministic
central finding · MATCH

§00 What this benchmark is

A report is only as trustworthy as its track record. This page tests RigorMD against an external, independent yardstick: published papers that were later corrected or retracted for a documented methodological reason. We run the engine on the paper, then check whether its single most important finding matches the reason the correction exists. The engine's findings are quote-grounded — each is tied to verbatim text in the paper — so the concordance is earned from the manuscript, not reverse-engineered from the outcome.
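The scoring step described above can be sketched in a few lines. This is a minimal illustration, not RigorMD's implementation: the `PublicRecord` and `EngineReport` types, the error-class labels, and the three outcome strings (including the intermediate "flagged, not central") are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    paper: str
    error_class: str  # hand-labelled from the correction or retraction notice


@dataclass(frozen=True)
class EngineReport:
    paper: str
    central_error_class: str         # class of the engine's top finding
    other_error_classes: tuple = ()  # classes of its secondary findings


def score_case(report: EngineReport, record: PublicRecord) -> str:
    """Compare the engine's central finding with the documented reason."""
    if report.paper != record.paper:
        raise ValueError("report and record describe different papers")
    if report.central_error_class == record.error_class:
        return "match"
    if record.error_class in report.other_error_classes:
        return "flagged, not central"
    return "miss"


# Hypothetical labels for case 1; the real label set is not published here.
record = PublicRecord("PREDIMED", "randomization_integrity")
report = EngineReport("PREDIMED", "randomization_integrity",
                      ("confounding", "open_label"))
print(score_case(report, record))
```

Keeping the public record as a separate, hand-labelled object is the point of the design: the engine's output is produced first and only compared with the record afterwards.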

This is the first case. More will be added, scored the same way, with precision and recall reported by error class as the set grows. It is a methods exercise — critiquing published work in a validation context, the way the literature routinely does — not a commentary on any author.

§01 Case 1 — PREDIMED

Estruch R, et al. Primary Prevention of Cardiovascular Disease with a Mediterranean Diet Supplemented with Extra-Virgin Olive Oil or Nuts. N Engl J Med 2018;378:e34 (the republished version). The original (N Engl J Med 2013;368:1279-90) was retracted in 2018.

The public record
The 2013 report was retracted: “Because of irregularities in the randomization procedures, we wish to retract the following article…” (Retraction and Republication, N Engl J Med 2018;378:2441). The data were reanalyzed and republished.
RigorMD's central finding (independent)
The engine's top finding: compromised randomization for ~1,588 of 7,447 participants (≈21%): household members assigned to a relative's arm, one site randomizing by clinic, another using its randomization tables inconsistently. For that subset the trial is effectively quasi-randomized.
Match
The documented retraction reason (randomization irregularities) and the engine's independent central finding (compromised randomization) are the same issue.
Concordance on the primary methodological problem
Verdict
central finding · randomization integrity
matches record · yes
severity · moderate
calibration · supported
Concordant — and correctly de-escalated
Why the grade is moderate, not critical

Catching the issue is half the test; sizing it correctly is the other half

The engine flags the randomization problem as central, then rates it moderate, not serious or critical, because the republished paper discloses every deviation and shows that the protective effect survives both propensity-adjusted reanalysis and the omission of the 1,588 affected participants. A tool that screamed “critical” at a transparently corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.

01 Design / claim fit · Moderate
02 Results / conclusion alignment · Mild
03 Statistical appropriateness · Moderate
04 Reporting guideline adherence · Clean
05 Numerical / statistical consistency · Clean
06 Clinical interpretability / verdict · Moderate

§02 What the engine reported

| Severity | Domain | Finding | Disclosed? |
| --- | --- | --- | --- |
| Moderate | 01 · Design | ~21% of participants not individually randomized (household same-arm; clinic-level at one site; inconsistent tables at another); this is the documented retraction reason. | Yes |
| Moderate | 03 · Statistics | For the non-randomized fraction, the effect is identified by propensity-score adjustment rather than randomization, so unmeasured confounding cannot be excluded. | Yes |
| Moderate | 01 · Design | Open-label trial with early asymmetric control-group support; mitigated by blinded adjudication of hard endpoints. | Yes |
| Mild | 02 · Alignment | Conclusion well-calibrated to the corrected evidence; minor transportability caveat (high-risk Mediterranean population, supplied foods). | Yes |

The deterministic layer recomputed the reported hazard-ratio intervals and reconciled the event counts (96 + 83 + 109 = 288) with no inconsistencies. Both the Claude and OpenAI engines independently flagged the randomization problem.
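Two checks of the kind a deterministic layer might run are sketched below. The event counts are the ones quoted above; the hazard ratio and interval bounds are hypothetical numbers chosen for illustration, not values from the paper, and the function names and the 2% tolerance are assumptions. The interval check relies on a standard property: a Wald interval is symmetric around log(HR), so the geometric mean of its bounds should reproduce the point estimate up to rounding.

```python
import math


def events_reconcile(arm_events, reported_total):
    """True if the per-arm event counts add up to the reported total."""
    return sum(arm_events) == reported_total


def ci_log_symmetric(hr, lower, upper, tol=0.02):
    """True if sqrt(lower * upper) is within a relative `tol` of the HR.

    The tolerance absorbs rounding of the published figures.
    """
    implied_hr = math.sqrt(lower * upper)
    return abs(implied_hr - hr) / hr <= tol


# Event counts as reported in this section.
print(events_reconcile([96, 83, 109], 288))

# Hypothetical hazard ratios, for illustration only.
print(ci_log_symmetric(0.70, 0.50, 0.98))  # bounds consistent with the HR
print(ci_log_symmetric(0.70, 0.50, 1.40))  # bounds inconsistent with the HR
```

Checks like these need no model judgment, which is why they can sit in a separate layer and serve as a cross-check on the two engines.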

§03 Method & honesty notes

Quote-grounded. Every finding cites verbatim text from the paper; the randomization finding is read directly from the methods section, not inferred from the retraction.
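One way to enforce quote-grounding mechanically is to reject any finding whose quote cannot be located in the paper text. The sketch below assumes that matching is an exact, case-sensitive substring test after normalizing only layout noise (Unicode compatibility forms and whitespace runs); the function names and the finding structure are hypothetical, and the sample text is a placeholder, not a quotation from any paper.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Undo layout noise only: compatibility characters and whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_grounded(quote: str, paper_text: str) -> bool:
    """True if the non-empty quote appears verbatim in the paper text."""
    q = normalize(quote)
    return bool(q) and q in normalize(paper_text)


# PLACEHOLDER text, not a quotation from any paper.
paper_text = "Placeholder methods sentence one.\nPlaceholder   methods\nsentence two."
findings = [
    {"id": "F1", "quote": "Placeholder methods sentence two."},
    {"id": "F2", "quote": "A sentence that is not in the text."},
]
ungrounded = [f["id"] for f in findings if not is_grounded(f["quote"], paper_text)]
print(ungrounded)
```

Normalizing whitespace but not wording keeps the test strict: a paraphrase fails, while a quote that merely wraps across a line break in the PDF still passes.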

Which version was appraised. We appraised the 2018 republished article, the current authoritative version, which itself documents the deviations, and compared the engine's central finding with the public retraction notice. Appraising the corrected version, rather than piling on the withdrawn original, is the fairer test and the more useful demonstration: it shows that the engine lands on the right issue and credits the authors' transparent fix.

Scope. This is methodological commentary in a validation context. It does not allege misconduct, does not certify or de-certify any paper, and is not clinical advice. The cited authors corrected the record themselves; that correction is exactly what makes this a clean benchmark.

One case is an anecdote; a benchmark is a set. This is case 1. As cases accumulate, we will report detection performance (precision and recall) by error class, including the cases where the engine misses or over-flags. The goal is an honest, public track record — not a highlight reel.
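Per-class precision and recall could be tallied along the following lines once the set is large enough. This is a sketch under stated assumptions: each observation is a hypothetical `(error_class, flagged, documented)` triple, "documented" means the public record names that error class for the paper, and the example observations are synthetic, not benchmark results. Treating a flag without a documented counterpart as a false positive is itself a modeling choice, since an undocumented problem is not necessarily an absent one.

```python
from collections import defaultdict


def precision_recall_by_class(observations):
    """Tally true positives, false positives and false negatives per class."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for error_class, flagged, documented in observations:
        if flagged and documented:
            counts[error_class]["tp"] += 1
        elif flagged:
            counts[error_class]["fp"] += 1  # over-flag
        elif documented:
            counts[error_class]["fn"] += 1  # miss
    scores = {}
    for error_class, c in counts.items():
        flagged_total = c["tp"] + c["fp"]
        documented_total = c["tp"] + c["fn"]
        scores[error_class] = {
            # None marks an undefined ratio (empty denominator).
            "precision": c["tp"] / flagged_total if flagged_total else None,
            "recall": c["tp"] / documented_total if documented_total else None,
        }
    return scores


# Synthetic observations, for illustration only.
observations = [
    ("randomization", True, True),      # caught
    ("randomization", False, True),     # missed
    ("unit_of_analysis", True, False),  # over-flagged
]
scores = precision_recall_by_class(observations)
print(scores)
```

Returning `None` rather than 0 for an empty denominator keeps "never tested" distinct from "tested and failed", which matters while the case set is still small.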