Transparency · the statistical-audit catalog · every check we run, stated with its limits

Every check we run on your manuscript

RigorMD runs two kinds of review. A deterministic layer recomputes the numbers it can rebuild from what you reported — arithmetic, not opinion — and a two-engine appraisal reads the methods and claims. This page lists all of it, and is honest about where each check applies and where it stops. It flags concerns for you to weigh; it does not certify a paper.

16 deterministic checks · Recomputed, not guessed · See them in a real report →

§01 How to read this page

A general-purpose chatbot reads the number you typed and agrees with it. The checks below do the opposite: wherever the underlying figures are present, RigorMD rebuilds the statistic and compares it to your printed value. A deterministic flag is a calculation you can redo yourself — it never passes back through a model that could soften it, and it can only raise a grade, never lower one.

The checks are grouped by what they are, not padded into one number. Recomputation and impossibility checks are the arithmetic core — the ones a chatbot structurally cannot stand behind. Reference and integrity checks verify what is cited and what was left in the draft. Reporting-consistency screens are lighter hygiene a careful reviewer would also raise. Each check lists what it needs to run and the most it can flag; if the underlying numbers are not in the text, the check simply does not fire.

§02 Recomputation & arithmetic-impossibility checks

The core. Where the reported numbers let us rebuild a statistic, we do — and flag the ones that do not reconcile. This is the part a chatbot cannot honestly reproduce.

| Check | What it catches | What it needs | Max severity |
| --- | --- | --- | --- |
| GRIM | A reported mean that no whole-number data could produce, e.g. a 1–5 scale mean of 2.53 from n = 5. | integer-summed values + N | Serious |
| p-value from a 2×2 table | Recomputes χ² from the table and flags a result whose printed p disagrees, including a call that flips across the 0.05 line. | a reconstructable 2×2 table | Critical |
| Denominator vs analytic N | Subgroups that sum to more patients than the study reported analyzing. | subgroup counts + stated N | Critical |
| Percentage vs count / denominator | Recomputes each percentage from its own n / N and flags the ones that do not reconcile. | a count + its denominator | Moderate |
| SD / SEM mix-up | Two spreads for one quantity that differ by a factor of exactly √n, the arithmetic signature of a mislabeled standard error. | two spreads + N | Moderate |
| Cross-location consistency | The same quantity printed at different values across the abstract, body, and tables, past what rounding explains. | the value stated in ≥ 2 places | Moderate |
| Distribution plausibility (Altman–Bland) | A non-negative measure whose mean − 2·SD falls below zero, so a normal summary is unlikely and a median (IQR) would fit better. | mean + SD, a non-negative quantity | Mild |
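
Three of these checks reduce to a few lines of arithmetic. The sketch below is illustrative only and is not RigorMD's implementation: the function names, the 2% tolerance on the √n ratio, and the choice to accept either half-up or half-even rounding are all assumptions made for the example.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def grim_consistent(mean_text: str, n: int) -> bool:
    """True if some whole-number total over n values rounds to the printed mean."""
    mean = Decimal(mean_text)
    # Precision as printed by the authors, e.g. "2.53" -> 0.01
    quantum = Decimal(1).scaleb(mean.as_tuple().exponent)
    base_total = math.floor(mean * n)
    # Try the integer totals around mean * n under two common rounding rules
    for total in range(base_total - 1, base_total + 3):
        exact = Decimal(total) / Decimal(n)
        for rounding in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if exact.quantize(quantum, rounding=rounding) == mean:
                return True
    return False


def looks_like_sd_sem_pair(spread_a: float, spread_b: float, n: int,
                           rel_tol: float = 0.02) -> bool:
    """True if the two spreads differ by a factor of about sqrt(n)."""
    smaller, larger = sorted((spread_a, spread_b))
    if smaller <= 0 or n < 2:
        return False
    return math.isclose(larger / smaller, math.sqrt(n), rel_tol=rel_tol)


def normal_summary_implausible(mean: float, sd: float) -> bool:
    """Altman-Bland style screen for a quantity that cannot go below zero."""
    return mean - 2 * sd < 0
```

With the table's own example, `grim_consistent("2.53", 5)` is false: five whole-number responses can only average to a multiple of 0.2, so no total rounds to 2.53.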
Quoted (manuscript, Results):
“…showed no significant difference in 30-day readmission (p = 0.04).”
a “significant” p read straight from a 2×2 table

Recomputed (deterministic):
test: Pearson χ²
statistic: 3.10
df: 1
reported p: 0.04
recomputed p: 0.078
Discordant: prose contradicts a “significant” p
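
The recomputation in this example can be redone by hand or in a few lines. The sketch below assumes an uncorrected Pearson χ² on a 2×2 table with cells a, b (first row) and c, d (second row); the function names are hypothetical, and the manuscript's actual cell counts are not shown on this page, so only the statistic-to-p step is checked against the figures above.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("a row or column total is zero; chi-square is undefined")
    return n * (a * d - b * c) ** 2 / margins


def p_value_df1(chi2: float) -> float:
    """Upper-tail p for chi-square with 1 degree of freedom.

    With df = 1 the statistic is a squared standard normal, so the tail
    probability is erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))


def flips_across_threshold(reported_p: float, recomputed_p: float,
                           alpha: float = 0.05) -> bool:
    """True if the printed and recomputed p fall on opposite sides of alpha."""
    return (reported_p < alpha) != (recomputed_p < alpha)


recomputed = p_value_df1(3.10)
print(round(recomputed, 3), flips_across_threshold(0.04, recomputed))
```

A statistic of 3.10 on 1 degree of freedom gives p of about 0.078, which sits on the other side of 0.05 from the printed 0.04.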

§03 Reference & submission-integrity checks

What is cited, and what was left in the draft. Only the reference identifiers leave the worker — never your manuscript text.

| Check | What it catches | What it needs | Max severity |
| --- | --- | --- | --- |
| Reference resolution | Every cited DOI and PMID is resolved against Crossref, doi.org, and NCBI PubMed; identifiers that do not resolve, or that resolve to a different article, are flagged with the date checked. | the reference list | up to Serious |
| Leftover AI-assistant text | Scaffolding a chatbot leaves behind: “as an AI language model”, “regenerate response”, and the like. | | up to Moderate |
| Placeholder text | Draft placeholders left in the submission: “[insert …]”, lorem ipsum, an uppercase TODO. | | up to Moderate |
We fail open, and we never cry fabrication. If a registry times out or is unreachable, the reference is recorded as not checked — never as a problem. A flag means an identifier authoritatively did not resolve as of the date shown, phrased exactly that way. RigorMD does not allege that a reference is fabricated.
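
The fail-open rule amounts to keeping "not checked" as a distinct outcome from "did not resolve". A minimal sketch of that idea follows, with loud assumptions: `lookup` is a hypothetical callable standing in for a registry query (no real Crossref or PubMed API is called here), and comparing titles is only one assumed way to notice that an identifier resolved to a different article.

```python
from datetime import date
from enum import Enum
from typing import Callable, Optional, Tuple


class RefStatus(Enum):
    RESOLVED = "resolved"
    DID_NOT_RESOLVE = "did not resolve"
    DIFFERENT_ARTICLE = "resolved to a different article"
    NOT_CHECKED = "not checked"


def check_reference(
    identifier: str,
    cited_title: str,
    lookup: Callable[[str], Optional[str]],  # hypothetical registry query
    today: Optional[date] = None,
) -> Tuple[RefStatus, str]:
    """Return a status plus the date checked; never a fabrication claim."""
    checked_on = (today or date.today()).isoformat()
    try:
        registry_title = lookup(identifier)
    except (TimeoutError, ConnectionError):
        # Fail open: an unreachable registry is recorded, not flagged
        return RefStatus.NOT_CHECKED, checked_on
    if registry_title is None:
        return RefStatus.DID_NOT_RESOLVE, checked_on
    if registry_title.strip().lower() != cited_title.strip().lower():
        return RefStatus.DIFFERENT_ARTICLE, checked_on
    return RefStatus.RESOLVED, checked_on
```

Keeping the four outcomes separate means a timeout can never be reported as a problem with the reference itself.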

§04 Reporting-consistency screens

Lighter hygiene — presentation issues a careful reviewer would also raise. They do not change your result; they signal how carefully the numbers were reported. We keep them separate on purpose, so they never dress up as forensics.

| Screen | What it flags | Severity |
| --- | --- | --- |
| Impossible p-values | A p-value printed as an exact zero (p = 0.000); a p-value cannot be exactly 0, so report p < 0.001. | Mild |
| Threshold-only p-values | Results that only ever report a threshold (“p < 0.05”) and never an exact p-value. | Mild |
| Over-precise p-values | A p-value carried to four or more decimal places. | Mild |
| Percentages with no denominator | A percentage reported with no n / N stated anywhere. | Mild |
| Over-precise percentages | Two-decimal percentages from a denominator under 100, which is more precision than the data carry. | Mild |
| Unlabeled ± spread | A “mean ± value” with no label saying whether the ± is an SD, an SEM, or a CI; an SD and an SEM alone differ by a factor of √n. | Mild |
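
The three p-value screens are pattern checks on the text. The sketch below shows one way they could be written; the regular expressions and the function name are assumptions for illustration and would miss many real-world notations (for example "P=.03" inside a table cell or a p-value in scientific notation).

```python
import re

# "p = 0.000" or "p = .00": an exact zero, with no further digit following
P_EXACT_ZERO = re.compile(r"\bp\s*=\s*0?\.0+(?!\d)", re.IGNORECASE)
# Four or more decimal places after the point
P_OVER_PRECISE = re.compile(r"\bp\s*[=<>]\s*0?\.\d{4,}", re.IGNORECASE)
# Any exact p ("p = ...") versus any threshold p ("p < ..." or "p > ...")
P_EXACT = re.compile(r"\bp\s*=\s*0?\.\d+", re.IGNORECASE)
P_THRESHOLD = re.compile(r"\bp\s*[<>]\s*0?\.\d+", re.IGNORECASE)


def screen_p_values(text: str) -> list:
    """Return (screen name, matched text) pairs for the p-value screens."""
    flags = []
    for match in P_EXACT_ZERO.finditer(text):
        flags.append(("impossible p-value", match.group(0)))
    for match in P_OVER_PRECISE.finditer(text):
        flags.append(("over-precise p-value", match.group(0)))
    # Document-level screen: thresholds appear but no exact p is ever given
    if P_THRESHOLD.search(text) and not P_EXACT.search(text):
        flags.append(("threshold-only p-values", ""))
    return flags
```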

§05 Beyond arithmetic: the two-engine appraisal

The deterministic checks above catch what is arithmetically wrong. Judging whether the design supports the claim takes reading, so two independent engines (Anthropic’s Claude and OpenAI’s models) appraise the manuscript blind to each other, across all six domains, and the appraisal is repeated across several passes. Only findings that recur are graded, so a one-off observation never becomes a verdict. The two reads are reconciled into one consensus, and disagreement is surfaced, not hidden.
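
The recurrence rule can be pictured as a simple count across passes. This is a hedged sketch, not the actual pipeline: the page does not say how many passes are run or how many recurrences are required, so `min_passes` is a hypothetical parameter, and findings are reduced to plain string labels.

```python
from collections import Counter


def recurring_findings(passes: list, min_passes: int = 2) -> set:
    """Keep only findings that appear in at least min_passes separate passes.

    Each pass is a set of finding labels, so a finding counts once per pass.
    """
    counts = Counter(finding for one_pass in passes for finding in one_pass)
    return {finding for finding, seen in counts.items() if seen >= min_passes}
```

For example, a concern raised in two of three passes is kept, while one raised only once is dropped before grading.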

Serious and critical findings are then adversarially verified — the engine is made to argue against its own finding, and anything it cannot defend is withdrawn or right-sized before the report reaches you. The deterministic results are merged into that consensus verbatim: a recomputed number can raise a grade, never lower one, and the adversarial pass cannot talk a calculation away. The full method — the six domains, GRADE certainty, and the conclusion-calibration headline — is on the methods page.
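
The "raise, never lower" merge is equivalent to taking a maximum over an ordered severity scale. The sketch below assumes the four severity labels used in the tables on this page, plus a hypothetical `NONE` level; the names and the function are illustrative only.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0  # hypothetical level for "no finding"
    MILD = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4


def merge_grade(appraisal: Severity, deterministic_flags: list) -> Severity:
    """Combine the appraisal grade with deterministic flags.

    Taking the maximum means a recomputed flag can raise the grade but
    can never pull it below what the appraisal already found.
    """
    return max([appraisal, *deterministic_flags])
```

So a Moderate appraisal with a Critical recomputation becomes Critical, while a Serious appraisal with only a Mild flag stays Serious.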

§06 What these checks do not do

Recomputation works only where the underlying numbers are in the text — a reconstructable 2×2 table, integer-summed means, a stated denominator, a labeled spread. RigorMD does not re-run your regressions, hazard ratios, survival models, or other multivariable analyses from the manuscript alone; those need your raw data. It does not detect fabrication or fraud, and it is not a substitute for a qualified biostatistician or for peer review.

Calibration cuts both ways. On our public concordance benchmark, the engine independently found the randomization problem behind a real retraction — and graded it moderate, not critical, because the authors disclosed every deviation. A tool that screamed “critical” at a transparently-corrected trial would be miscalibrated. Concordance on the problem, restraint on the grade.

A clear check is the absence of a specific, detectable problem — not a guarantee that the statistics or the study are correct. RigorMD flags concerns for the authors to weigh. It does not certify correctness, validity, or fitness for publication. See a worked example on the sample report, or start a review.