RigorMD runs two kinds of review. A deterministic layer recomputes the numbers it can rebuild from what you reported — arithmetic, not opinion — and a two-engine appraisal reads the methods and claims. This page lists all of it, and is honest about where each check applies and where it stops. It flags concerns for you to weigh; it does not certify a paper.
A general-purpose chatbot reads the number you typed and agrees with it. The checks below do the opposite: wherever the underlying figures are present, RigorMD rebuilds the statistic and compares it to your printed value. A deterministic flag is a calculation you can redo yourself — it never passes back through a model that could soften it, and it can only raise a grade, never lower one.
The checks are grouped by what they are, not padded into one number. Recomputation and impossibility checks are the arithmetic core — the ones a chatbot structurally cannot stand behind. Reference and integrity checks verify what is cited and what was left in the draft. Reporting-consistency screens are lighter hygiene a careful reviewer would also raise. Each check lists what it needs to run and the most it can flag; if the underlying numbers are not in the text, the check simply does not fire.
The core. Where the reported numbers let us rebuild a statistic, we do — and flag the ones that do not reconcile. This is the part a chatbot cannot honestly reproduce.
| Check | What it catches | What it needs | Max severity |
|---|---|---|---|
| GRIM | A reported mean that no whole-number data could produce — e.g. a 1–5 scale mean of 2.53 from n = 5. | integer-summed values + N | Serious |
| p-value from a 2×2 table | Recomputes χ² from the table and flags a result whose printed p disagrees — including a call that flips across the 0.05 line. | a reconstructable 2×2 table | Critical |
| Denominator vs analytic N | Subgroups that sum to more patients than the study reported analyzing. | subgroup counts + stated N | Critical |
| Percentage vs count / denominator | Recomputes each percentage from its own n / N and flags the ones that do not reconcile. | a count + its denominator | Moderate |
| SD / SEM mix-up | Two spreads for one quantity that differ by exactly √n — the arithmetic signature of a mislabeled standard error. | two spreads + N | Moderate |
| Cross-location consistency | The same quantity printed at different values across the abstract, body, and tables — past what rounding explains. | the value stated in ≥ 2 places | Moderate |
| Distribution plausibility (Altman–Bland) | A non-negative measure whose mean − 2·SD falls below zero, so a normal summary is unlikely and a median (IQR) would fit better. | mean + SD, a non-negative quantity | Mild |
“…showed no significant difference in 30-day readmission (p = 0.04).”a “significant” p read straight from a 2×2 table
What is cited, and what was left in the draft. Only the reference identifiers leave the worker — never your manuscript text.
| Check | What it catches | What it needs | Max severity |
|---|---|---|---|
| Reference resolution | Every cited DOI and PMID is resolved against Crossref, doi.org, and NCBI PubMed; identifiers that do not resolve — or resolve to a different article — are flagged, with the date checked. | the reference list | up to Serious |
| Leftover AI-assistant text | Scaffolding a chatbot leaves behind — “as an AI language model”, “regenerate response”, and the like. | — | up to Moderate |
| Placeholder text | Draft placeholders left in the submission — “[insert …]”, lorem ipsum, an uppercase TODO. | — | up to Moderate |
Lighter hygiene — presentation issues a careful reviewer would also raise. They do not change your result; they signal how carefully the numbers were reported. We keep them separate on purpose, so they never dress up as forensics.
| Screen | What it flags | Severity |
|---|---|---|
| Impossible p-values | A p-value printed as an exact zero (p = 0.000); a p-value cannot be exactly 0 — report p < 0.001. | Mild |
| Threshold-only p-values | Results that only ever report a threshold (“p < 0.05”) and never an exact p-value. | Mild |
| Over-precise p-values | A p-value carried to four or more decimal places. | Mild |
| Percentages with no denominator | A percentage reported with no n / N stated anywhere. | Mild |
| Over-precise percentages | Two-decimal percentages from a denominator under 100 — more precision than the data carry. | Mild |
| Unlabeled ± spread | A “mean ± value” with no label saying whether the ± is an SD, an SEM, or a CI — they differ by ×√n. | Mild |
The deterministic checks above catch what is arithmetically wrong. Judging whether the design supports the claimtakes reading — so two independent engines (Anthropic’s Claude and OpenAI’s models) appraise the manuscript blind to each other, across all six domains, and the appraisal is repeated across several passes. Only findings that recur are graded, so a one-off observation never becomes a verdict. The two reads are reconciled into one consensus, and disagreement is surfaced, not hidden.
Serious and critical findings are then adversarially verified — the engine is made to argue against its own finding, and anything it cannot defend is withdrawn or right-sized before the report reaches you. The deterministic results are merged into that consensus verbatim: a recomputed number can raise a grade, never lower one, and the adversarial pass cannot talk a calculation away. The full method — the six domains, GRADE certainty, and the conclusion-calibration headline — is on the methods page.
Recomputation works only where the underlying numbers are in the text — a reconstructable 2×2 table, integer-summed means, a stated denominator, a labeled spread. RigorMD does not re-run your regressions, hazard ratios, survival models, or other multivariable analyses from the manuscript alone; those need your raw data. It does not detect fabrication or fraud, and it is not a substitute for a qualified biostatistician or for peer review.
Calibration cuts both ways. On our public concordance benchmark, the engine independently found the randomization problem behind a real retraction — and graded it moderate, not critical, because the authors disclosed every deviation. A tool that screamed “critical” at a transparently-corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.