Validation — Concordance Benchmark

§00 What this benchmark is

A report is only as trustworthy as its track record. This page tests RigorMD against an external, independent yardstick: published papers that were later corrected or retracted for a documented methodological reason. We run the engine on the paper, then check whether its single most important finding matches the reason the correction exists. The engine's findings are quote-grounded — each is tied to verbatim text in the paper — so the concordance is earned from the manuscript, not reverse-engineered from the outcome.

This is the first case. More will be added, scored the same way, with precision and recall reported by error class as the set grows. It is a methods exercise — critiquing published work in a validation context, the way the literature routinely does — not a commentary on any author.

§01 Case 1 — PREDIMED

Estruch R, et al. Primary Prevention of Cardiovascular Disease with a Mediterranean Diet Supplemented with Extra-Virgin Olive Oil or Nuts. N Engl J Med 2018;378:e34 (the republished version). The original (N Engl J Med 2013;368:1279-90) was retracted in 2018.

The public record

The 2013 report was retracted: “Because of irregularities in the randomization procedures, we wish to retract the following article…” (Retraction and Republication, N Engl J Med 2018;378:2441). The data were reanalyzed and republished.

RigorMD's central finding (independent)

The engine's top finding: compromised randomization for ~1,588 of 7,447 participants (≈21%). Household members were assigned to a relative's arm, one site randomized by clinic, and another used its randomization tables inconsistently, so for that subset the trial is effectively quasi-randomized.
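The ≈21% figure follows directly from the two participant counts quoted in the finding. A minimal Python check, where the function name is illustrative and not part of RigorMD:

```python
def affected_share(affected: int, total: int) -> float:
    """Return the fraction of participants affected by the deviations."""
    if total <= 0:
        raise ValueError("total must be positive")
    return affected / total


# Counts as quoted in the finding above.
share = affected_share(1588, 7447)
print(f"{share:.1%}")  # 21.3%
```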

Match

The documented retraction reason (randomization irregularities) and the engine's independent central finding (compromised randomization) are the same issue.

Concordance on the primary methodological problem

Verdict

central finding: randomization integrity
matches record: yes
severity: moderate
calibration: supported

Concordant — and correctly de-escalated

Catching the issue is half the test; sizing it correctly is the other half

The engine flags the randomization problem as central, then rates it moderate (not serious or critical) because the republished paper discloses every deviation and shows that the protective effect survives both propensity-adjusted reanalysis and the omission of the 1,588 affected participants. A tool that screamed “critical” at a transparently corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.

§02 What the engine reported

Severity	Domain	Finding	Disclosed?
Moderate	01 · Design	~21% of participants not individually randomized (household same-arm; clinic-level at one site; inconsistent tables at another) — the documented retraction reason.	Yes
Moderate	03 · Statistics	For the non-randomized fraction, the effect is identified by propensity-score adjustment rather than randomization, so unmeasured confounding cannot be excluded.	Yes
Moderate	01 · Design	Open-label trial with early asymmetric control-group support; mitigated by blinded adjudication of hard endpoints.	Yes
Mild	02 · Alignment	Conclusion well-calibrated to the corrected evidence; minor transportability caveat (high-risk Mediterranean population, supplied foods).	Yes

The deterministic layer recomputed the reported hazard-ratio intervals and reconciled the event counts (96 + 83 + 109 = 288) with no inconsistencies. Both the Claude and OpenAI engines independently flagged the randomization problem.
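This page does not describe how the deterministic layer is implemented. As a hedged sketch of the kind of checks it names, the Python below reconciles the per-arm event counts quoted above and tests whether a hazard ratio sits at the geometric midpoint of its confidence interval, which holds up to rounding when the interval was built symmetrically on the log scale. The function names and the tolerance are assumptions. The hazard ratio and interval in the example are illustrative placeholders, not values taken from the paper.

```python
import math


def counts_reconcile(arm_events: list[int], reported_total: int) -> bool:
    """True when the per-arm event counts add up to the reported total."""
    return sum(arm_events) == reported_total


def ci_is_log_symmetric(hr: float, lower: float, upper: float,
                        rel_tol: float = 0.02) -> bool:
    """True when hr is close to sqrt(lower * upper).

    A Wald-type interval is symmetric around log(hr), so the point
    estimate should equal the geometric mean of the bounds. rel_tol is
    an assumed allowance for two-decimal rounding in the published values.
    """
    implied_hr = math.sqrt(lower * upper)
    return math.isclose(hr, implied_hr, rel_tol=rel_tol)


# Event counts as quoted in the text above.
print(counts_reconcile([96, 83, 109], 288))   # True

# Placeholder values for illustration only (not from the paper).
print(ci_is_log_symmetric(0.70, 0.50, 0.98))  # True
```

A failed symmetry check does not prove an error on its own, since some intervals are legitimately asymmetric on the log scale; in a sketch like this it would only mark the figure for a closer look.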

§03 Method & honesty notes

Quote-grounded. Every finding cites verbatim text from the paper; the randomization finding is read directly from the methods section, not inferred from the retraction.

Which version was appraised. We appraised the 2018 republished article (the current authoritative version, which itself documents the deviations) and compared the engine's central finding with the public retraction notice. Appraising the corrected version, rather than piling on the withdrawn original, is the fairer test and the more useful demonstration: it shows the engine lands on the right issue and credits the authors' transparent fix.

Scope. This is methodological commentary in a validation context. It does not allege misconduct, does not certify or de-certify any paper, and is not clinical advice. The cited authors corrected the record themselves; that correction is exactly what makes this a clean benchmark.

One case is an anecdote; a benchmark is a set. This is case 1. As cases accumulate, we will report detection performance (precision and recall) by error class, including the cases where the engine misses or over-flags. The goal is an honest, public track record — not a highlight reel.
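The scoring scheme for the full set is not specified yet. One plausible way to compute precision and recall by error class is sketched below in Python, assuming each case is reduced to a pair: the documented error class from the public record and the class of the engine's central finding (or None if it flagged nothing). The function name, the class labels, and the example cases are all hypothetical and are not benchmark results.

```python
from collections import Counter
from typing import Iterable, Optional


def precision_recall_by_class(
    cases: Iterable[tuple[str, Optional[str]]],
) -> dict[str, dict[str, Optional[float]]]:
    """Score (documented_class, flagged_class) pairs per error class.

    A match counts as a true positive for that class. A mismatch counts
    as a miss (false negative) for the documented class and, if the
    engine flagged something, an over-flag (false positive) for the
    flagged class. A metric with a zero denominator is reported as None.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for documented, flagged in cases:
        if flagged == documented:
            tp[documented] += 1
        else:
            fn[documented] += 1
            if flagged is not None:
                fp[flagged] += 1

    scores = {}
    for cls in sorted(set(tp) | set(fp) | set(fn)):
        flagged_total = tp[cls] + fp[cls]
        documented_total = tp[cls] + fn[cls]
        scores[cls] = {
            "precision": tp[cls] / flagged_total if flagged_total else None,
            "recall": tp[cls] / documented_total if documented_total else None,
        }
    return scores


# Made-up cases for illustration only.
example_cases = [
    ("randomization", "randomization"),  # hit
    ("randomization", "statistics"),     # miss plus over-flag
    ("statistics", "statistics"),        # hit
    ("outcome_switching", None),         # miss, nothing flagged
]
for cls, score in precision_recall_by_class(example_cases).items():
    print(cls, score)
```

Reporting None rather than 0 for an empty denominator keeps "never flagged" distinct from "flagged and always wrong", which matters while the case set is still small.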