Every check we run

§01 How to read this page

A general-purpose chatbot reads the number you typed and agrees with it. The checks below do the opposite: wherever the underlying figures are present, RigorMD rebuilds the statistic and compares it to your printed value. A deterministic flag is a calculation you can redo yourself — it never passes back through a model that could soften it, and it can only raise a grade, never lower one.

The checks are grouped by what they are, not padded into one number. Recomputation and impossibility checks are the arithmetic core — the ones a chatbot structurally cannot stand behind. Reference and integrity checks verify what is cited and what was left in the draft. Reporting-consistency screens are lighter hygiene a careful reviewer would also raise. Each check lists what it needs to run and the most it can flag; if the underlying numbers are not in the text, the check simply does not fire.

§02 Recomputation & arithmetic-impossibility checks

The core. Where the reported numbers let us rebuild a statistic, we do — and flag the ones that do not reconcile. This is the part a chatbot cannot honestly reproduce.

Check	What it catches	What it needs	Max severity
GRIM	A reported mean that no whole-number data could produce — e.g. a 1–5 scale mean of 2.53 from n = 5.	integer-summed values + N	Serious
p-value from a 2×2 table	Recomputes χ² from the table and flags a result whose printed p disagrees — including a call that flips across the 0.05 line.	a reconstructable 2×2 table	Critical
Denominator vs analytic N	Subgroups that sum to more patients than the study reported analyzing.	subgroup counts + stated N	Critical
Percentage vs count / denominator	Recomputes each percentage from its own n / N and flags the ones that do not reconcile.	a count + its denominator	Moderate
SD / SEM mix-up	Two spreads for one quantity whose ratio is exactly √n, the arithmetic signature of a mislabeled standard error.	two spreads + N	Moderate
Cross-location consistency	The same quantity printed at different values across the abstract, body, and tables — past what rounding explains.	the value stated in ≥ 2 places	Moderate
Distribution plausibility (Altman–Bland)	A non-negative measure whose mean − 2·SD falls below zero, so a normal summary is unlikely and a median (IQR) would fit better.	mean + SD, a non-negative quantity	Mild
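Several rows in the table above reduce to a few lines of arithmetic. The sketch below is illustrative only, not RigorMD's implementation: the function names, the rounding tolerances, and the 5% slack on the √n ratio are all assumptions made for the example.

```python
import math


def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM: can any whole-number total over n items round to the reported mean?"""
    half_unit = 0.5 * 10 ** -decimals      # largest gap that rounding alone explains
    nearest_total = round(mean * n)        # integer total closest to mean * n
    candidates = (nearest_total - 1, nearest_total, nearest_total + 1)
    return any(abs(total / n - mean) <= half_unit + 1e-9 for total in candidates)


def percentage_reconciles(count: int, denom: int, reported_pct: float,
                          decimals: int = 1) -> bool:
    """Recompute a percentage from its own n / N, allowing for rounding."""
    half_unit = 0.5 * 10 ** -decimals
    return abs(100 * count / denom - reported_pct) <= half_unit + 1e-9


def looks_like_sd_sem_pair(spread_a: float, spread_b: float, n: int,
                           rel_tol: float = 0.05) -> bool:
    """True when the two spreads differ by a factor of about sqrt(n).

    rel_tol is an assumed allowance for rounding in the printed values.
    """
    big, small = max(spread_a, spread_b), min(spread_a, spread_b)
    if small <= 0:
        return False
    return math.isclose(big / small, math.sqrt(n), rel_tol=rel_tol)


def normal_summary_doubtful(mean: float, sd: float) -> bool:
    """Altman-Bland screen for a non-negative measure: mean - 2*SD below zero."""
    return mean - 2 * sd < 0


# The table's own GRIM example: a mean of 2.53 from n = 5 on a 1-5 scale.
print(grim_consistent(2.53, 5))   # False: only 2.4, 2.6, ... are reachable
print(grim_consistent(2.60, 5))   # True: a total of 13 gives exactly 2.6
```

With n = 5 every achievable mean is a multiple of 0.2, which is why 2.53 cannot occur no matter what the individual responses were.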

Quoted — manuscript, Results

“…showed no significant difference in 30-day readmission (p = 0.04).”

a “significant” p read straight from a 2×2 table

Recomputed — deterministic

test: Pearson χ²
statistic: 3.10
df: 1
reported p: 0.04
recomputed p: 0.078

Discordant — prose contradicts a “significant” p
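The recomputed p above can be reproduced by hand. A minimal sketch, assuming an uncorrected Pearson χ² (the page does not say whether a continuity correction is applied) and using hypothetical function names: for 1 degree of freedom the upper-tail probability has the closed form erfc(√(χ²/2)), so only the standard library is needed.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero; chi-square is undefined")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def flips_across_alpha(reported_p: float, recomputed_p: float,
                       alpha: float = 0.05) -> bool:
    """True when the reported and recomputed p fall on opposite sides of alpha."""
    return (reported_p < alpha) != (recomputed_p < alpha)


# The statistic from the worked example above.
recomputed = p_value_df1(3.10)
print(round(recomputed, 3))                   # 0.078
print(flips_across_alpha(0.04, recomputed))   # True: the call crosses 0.05
```

When the four cell counts are available, `pearson_chi2_2x2` supplies the statistic that feeds `p_value_df1`.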

§03 Reference & submission-integrity checks

What is cited, and what was left in the draft. Only the reference identifiers leave the worker — never your manuscript text.

Check	What it catches	What it needs	Max severity
Reference resolution	Every cited DOI and PMID is resolved against Crossref, doi.org, and NCBI PubMed; identifiers that do not resolve — or resolve to a different article — are flagged, with the date checked.	the reference list	up to Serious
Leftover AI-assistant text	Scaffolding a chatbot leaves behind — “as an AI language model”, “regenerate response”, and the like.	—	up to Moderate
Placeholder text	Draft placeholders left in the submission — “[insert …]”, lorem ipsum, an uppercase TODO.	—	up to Moderate

We fail open, and we never cry fabrication. If a registry times out or is unreachable, the reference is recorded as not checked — never as a problem. A flag means an identifier authoritatively did not resolve as of the date shown, phrased exactly that way. RigorMD does not allege that a reference is fabricated.
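The fail-open rule can be pictured as a small status mapping. This is a sketch under stated assumptions, not the production code: `RefStatus`, `check_reference`, and the injected `lookup` callable are hypothetical names, and comparing titles is only one plausible way to decide that an identifier "resolves to a different article".

```python
from enum import Enum
from typing import Callable, Optional


class RefStatus(Enum):
    RESOLVED = "resolved"
    NOT_RESOLVED = "did not resolve"
    DIFFERENT_ARTICLE = "resolves to a different article"
    NOT_CHECKED = "not checked"


def check_reference(identifier: str, cited_title: str,
                    lookup: Callable[[str], Optional[str]]) -> RefStatus:
    """Classify one reference.

    `lookup` is a hypothetical stand-in for a registry query: it returns the
    registered title, returns None when the registry authoritatively has no
    such identifier, and raises on a timeout or connection failure.
    """
    try:
        registered_title = lookup(identifier)
    except (TimeoutError, ConnectionError):
        # Fail open: an unreachable registry is never recorded as a problem.
        return RefStatus.NOT_CHECKED
    if registered_title is None:
        return RefStatus.NOT_RESOLVED
    # Assumed matching rule: case-insensitive title comparison.
    if registered_title.strip().lower() != cited_title.strip().lower():
        return RefStatus.DIFFERENT_ARTICLE
    return RefStatus.RESOLVED


def registry_down(_identifier: str) -> Optional[str]:
    """Stub that simulates a registry timing out."""
    raise TimeoutError("registry timed out")


print(check_reference("10.1000/example", "A Trial", registry_down).value)  # not checked
```

Note that no status in the enum says "fabricated": the strongest statement the sketch can make is that an identifier did not resolve.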

§04 Reporting-consistency screens

Lighter hygiene — presentation issues a careful reviewer would also raise. They do not change your result; they signal how carefully the numbers were reported. We keep them separate on purpose, so they never dress up as forensics.

Screen	What it flags	Severity
Impossible p-values	A p-value printed as an exact zero (p = 0.000); a p-value cannot be exactly 0 — report p < 0.001.	Mild
Threshold-only p-values	Results that only ever report a threshold (“p < 0.05”) and never an exact p-value.	Mild
Over-precise p-values	A p-value carried to four or more decimal places.	Mild
Percentages with no denominator	A percentage reported with no n / N stated anywhere.	Mild
Over-precise percentages	Two-decimal percentages from a denominator under 100 — more precision than the data carry.	Mild
Unlabeled ± spread	A “mean ± value” with no label saying whether the ± is an SD, an SEM, or a CI half-width; an SD and an SEM alone differ by a factor of √n.	Mild
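Most of these screens are pattern matches rather than recomputations. The sketch below shows how two of them might look; the regular expression, the function names, and the flag labels are assumptions for illustration and will miss formats a production screen would need to handle.

```python
import re

# Matches forms such as "p = 0.000", "P<.05", "p = 0.03412".
P_VALUE = re.compile(r"\bp\s*([=<>])\s*(0?\.\d+|0|1(?:\.0+)?)", re.IGNORECASE)


def screen_p_values(text: str) -> list[tuple[str, str]]:
    """Flag p-values printed as an exact zero or carried to 4+ decimal places."""
    flags = []
    for match in P_VALUE.finditer(text):
        operator, value = match.group(1), match.group(2)
        decimals = len(value.split(".")[1]) if "." in value else 0
        if operator == "=" and float(value) == 0.0:
            flags.append((match.group(0), "exact zero"))
        elif decimals >= 4:
            flags.append((match.group(0), "over-precise"))
    return flags


def over_precise_percentage(pct_text: str, denominator: int) -> bool:
    """Two or more decimals on a percentage whose denominator is under 100.

    The screen as described names two-decimal percentages; treating longer
    ones the same way is an assumption.
    """
    match = re.fullmatch(r"\d+\.(\d+)\s*%?", pct_text.strip())
    decimals = len(match.group(1)) if match else 0
    return decimals >= 2 and denominator < 100


sample = "Readmission fell (p = 0.000); stay did not (p = 0.03412), p < 0.001 overall."
print(screen_p_values(sample))
# [('p = 0.000', 'exact zero'), ('p = 0.03412', 'over-precise')]
```

The threshold-only and no-denominator screens would need document-level context (whether any exact p or any n / N appears at all), so they are left out of this sketch.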

§05 Beyond arithmetic: the two-engine appraisal

The deterministic checks above catch what is arithmetically wrong. Judging whether the design supports the claim takes reading, so two independent engines (Anthropic’s Claude and OpenAI’s models) appraise the manuscript blind to each other, across all six domains, and the appraisal is repeated across several passes. Only findings that recur are graded, so a one-off observation never becomes a verdict. The two reads are reconciled into one consensus, and disagreement is surfaced, not hidden.
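The recurrence rule can be sketched as a simple count across passes. This assumes each pass has already been reduced to a set of normalized finding keys and that "recur" means appearing in at least two passes; neither detail is stated on this page, and the names below are hypothetical.

```python
from collections import Counter


def recurring_findings(passes: list[set[str]], min_passes: int = 2) -> set[str]:
    """Keep only the findings that appear in at least `min_passes` passes."""
    counts = Counter(key for found in passes for key in found)
    return {key for key, seen in counts.items() if seen >= min_passes}


# Hypothetical finding keys from three passes of one engine.
passes = [
    {"randomization", "attrition"},
    {"randomization"},
    {"randomization", "outcome-switching"},
]
print(recurring_findings(passes))   # {'randomization'}
```

Because each pass is a set, a finding mentioned twice within one pass still counts once.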

Serious and critical findings are then adversarially verified — the engine is made to argue against its own finding, and anything it cannot defend is withdrawn or right-sized before the report reaches you. The deterministic results are merged into that consensus verbatim: a recomputed number can raise a grade, never lower one, and the adversarial pass cannot talk a calculation away. The full method — the six domains, GRADE certainty, and the conclusion-calibration headline — is on the methods page.
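The "raise, never lower" rule amounts to taking the higher of two grades. A minimal sketch, assuming the four severity labels used on this page are ordered Mild < Moderate < Serious < Critical; the `Severity` enum, its `CLEAR` level, and `merge_grade` are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    CLEAR = 0       # assumed level for "no flag raised"
    MILD = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4


def merge_grade(consensus: Severity, deterministic: Severity) -> Severity:
    """Merge a deterministic result into the consensus grade.

    Taking the maximum means a recomputed number can raise the grade
    but can never lower it.
    """
    return max(consensus, deterministic)


print(merge_grade(Severity.MODERATE, Severity.CRITICAL).name)   # CRITICAL
print(merge_grade(Severity.SERIOUS, Severity.MILD).name)        # SERIOUS
```

In this sketch the adversarial pass would only ever adjust the `consensus` argument, which is one way to guarantee it cannot talk a calculation away.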

§06 What these checks do not do

Recomputation works only where the underlying numbers are in the text — a reconstructable 2×2 table, integer-summed means, a stated denominator, a labeled spread. RigorMD does not re-run your regressions, hazard ratios, survival models, or other multivariable analyses from the manuscript alone; those need your raw data. It does not detect fabrication or fraud, and it is not a substitute for a qualified biostatistician or for peer review.

Calibration cuts both ways. On our public concordance benchmark, the engine independently found the randomization problem behind a real retraction, and graded it moderate, not critical, because the authors disclosed every deviation. A tool that screamed “critical” at a transparently corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.

A clear check is the absence of a specific, detectable problem — not a guarantee that the statistics or the study are correct. RigorMD flags concerns for the authors to weigh. It does not certify correctness, validity, or fitness for publication. See a worked example on the sample report, or start a review.