A structured, severity-scored read of the methods and statistics — anchored in GRADE certainty, reporting standards, and the bias frameworks (RoB2 / ROBINS-I) journals actually use. It flags concerns for you to weigh; it does not certify a paper.
Every manuscript is appraised across six domains and severity-scored (mild, moderate, serious, critical) in each: design and claim fit, the alignment of results with the stated conclusion, statistical appropriateness, reporting-guideline adherence (the EQUATOR family — STROBE, CONSORT, PRISMA, and peers), numerical and statistical consistency, and clinical interpretability. The overall grade is weighted by how central a problem is to the headline claim, so a serious flaw confined to a peripheral analysis does not by itself sink the paper.
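As a rough illustration of centrality weighting, the sketch below multiplies a numeric severity by a centrality weight and takes the largest product as the overall score. The numeric scale, the 0-to-1 centrality weight, and the names `Finding` and `overall_score` are all assumptions made for this example; the actual weighting RigorMD uses is not described here.

```python
from dataclasses import dataclass

# Hypothetical numeric mapping for the four severity labels.
SEVERITY = {"mild": 1, "moderate": 2, "serious": 3, "critical": 4}


@dataclass
class Finding:
    domain: str
    severity: str      # one of the keys of SEVERITY
    centrality: float  # assumed scale: 0.0 (peripheral) to 1.0 (headline claim)


def overall_score(findings):
    """Return the largest centrality-weighted severity, or 0.0 with no findings."""
    return max((SEVERITY[f.severity] * f.centrality for f in findings), default=0.0)


# The same serious flaw, once in a peripheral analysis and once in the primary one.
peripheral = [Finding("statistical appropriateness", "serious", 0.25)]
central = [Finding("statistical appropriateness", "serious", 1.0)]

print(overall_score(peripheral))  # 0.75
print(overall_score(central))     # 3.0
```

Under these assumed weights, the peripheral flaw scores below a mild finding on the headline claim, which is the behaviour the paragraph describes.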
For the paper’s primary conclusion, RigorMD reads two things separately: how confidently the authors state the claim (their language), and how much certainty the evidence actually warrants on the GRADE scale (very low, low, moderate, high). That certainty is downgraded by the GRADE method itself, for risk of bias, inconsistency, indirectness, imprecision, and publication bias, not by a vote.
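The downgrading step can be sketched as simple level arithmetic. This follows the general GRADE convention that each of the five domains can lower certainty by one or two levels, with "very low" as the floor; the function name `grade_certainty` and the choice to pass the starting level in explicitly are assumptions for this example, not a description of RigorMD's internals.

```python
LEVELS = ["very low", "low", "moderate", "high"]
DOMAINS = (
    "risk of bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication bias",
)


def grade_certainty(start, downgrades):
    """Apply per-domain downgrades (0, 1, or 2 levels each) to a starting level."""
    for domain, steps in downgrades.items():
        if domain not in DOMAINS:
            raise ValueError(f"unknown GRADE domain: {domain}")
        if steps not in (0, 1, 2):
            raise ValueError(f"downgrade for {domain} must be 0, 1, or 2")
    index = LEVELS.index(start) - sum(downgrades.values())
    return LEVELS[max(index, 0)]  # certainty cannot fall below "very low"


# A randomized trial conventionally starts at "high".
print(grade_certainty("high", {"risk of bias": 1, "imprecision": 1}))  # low
```

Because the result depends only on the starting level and the recorded downgrades, two runs with the same inputs give the same certainty.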
The headline you see is the gap between those two, computed deterministically — not judged by the model. A humble claim backed by limited evidence reads as supported; an over-stated claim on the same evidence reads as not supported or overly confident; a claim that runs against its own results reads as counter to the results. The same evidence can earn a different headline depending only on how strongly the authors phrased the conclusion — which is the point.
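One way to picture the deterministic gap is as a lookup on two ordinal scales. In the sketch below, the four claim-strength labels, the function name `headline`, and the thresholds separating "overly confident" from "not supported" are hypothetical choices for illustration; only the GRADE levels and the headline wording come from the description above.

```python
CERTAINTY = {"very low": 0, "low": 1, "moderate": 2, "high": 3}

# Hypothetical ordinal scale for how strongly the authors phrase the claim.
CLAIM = {"tentative": 0, "cautious": 1, "confident": 2, "definitive": 3}


def headline(claim_strength, certainty, contradicts_results=False):
    """Map the claim-versus-certainty gap to a headline label."""
    if contradicts_results:
        return "counter to the results"
    gap = CLAIM[claim_strength] - CERTAINTY[certainty]
    if gap <= 0:
        return "supported"
    if gap == 1:  # assumed threshold
        return "overly confident"
    return "not supported"


# Same evidence (low certainty), different phrasing of the conclusion.
print(headline("cautious", "low"))    # supported
print(headline("definitive", "low"))  # not supported
```

The two calls share the same certainty level and differ only in claim strength, so the change in headline comes entirely from the authors' wording.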
Each finding is written twice. The clinician spine is one or two plain sentences — what the study can and cannot support, and the clinical “so what” — with no jargon or effect-size arithmetic. Folded beneath it, a “For your statistician” panel carries the technical companion: the bias mechanism, its named RoB2 / ROBINS-I domain, the GRADE rationale, and the concrete remedy. Read the spine to decide; open the panel to defend it to a reviewer.
Two independent LLM engines appraise the manuscript blind, and the appraisal is repeated across several passes; only findings that recur across passes are graded, so a one-off observation does not become a verdict. A deterministic layer recomputes statistics where the reported numbers allow — a flag there is a calculation you can check — and every quote is verified against your own text before it is shown. Serious and critical findings are then adversarially verified: the engine tries to refute its own finding, and anything it cannot defend is withdrawn or right-sized before the report reaches you.
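Two of the steps above lend themselves to a short sketch: keeping only findings that recur across passes, and recomputing a p-value from a reported ratio and its 95% confidence interval. The recomputation uses the standard normal approximation on the log scale (standard error recovered from the interval width). The function names, the `min_passes` threshold, and the tolerance are assumptions for this example; the checks RigorMD actually runs are not specified here.

```python
import math
from collections import Counter


def recurring_findings(passes, min_passes=2):
    """Keep finding IDs that appear in at least `min_passes` separate passes."""
    counts = Counter(fid for one_pass in passes for fid in set(one_pass))
    return {fid for fid, n in counts.items() if n >= min_passes}


def p_from_ratio_ci(estimate, lower, upper, z_crit=1.959964):
    """Two-sided p-value implied by a ratio (HR, OR, RR) and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def is_consistent(reported_p, estimate, lower, upper, tol=0.01):
    """True if the reported p-value is within `tol` of the recomputed one."""
    return abs(p_from_ratio_ci(estimate, lower, upper) - reported_p) <= tol


# Finding "A" recurs in all three passes; "B" and "C" are one-off observations.
print(recurring_findings([{"A", "B"}, {"A"}, {"A", "C"}]))  # {'A'}

# Illustrative numbers only: a hazard ratio of 0.80 with 95% CI 0.65 to 0.98.
print(round(p_from_ratio_ci(0.80, 0.65, 0.98), 3))
print(is_consistent(0.20, 0.80, 0.65, 0.98))  # False
```

A reported p-value that disagrees with the recomputed one is the kind of flag a reader can verify by hand from the numbers in the paper.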