
The GRIM test: catching impossible means

One of the simplest forensic checks in the literature is a single line of arithmetic. It cannot be argued with, it needs no raw data, and it catches an error that is otherwise invisible on the page.

§01 What the GRIM test actually checks

GRIM stands for Granularity-Related Inconsistency of Means. The idea, introduced by Nicholas Brown and James Heathers in Social Psychological and Personality Science (2016), is almost embarrassingly simple. When you average integer-valued responses from a whole number of participants (a count of comorbidities, a Likert item scored 1–5, anything measured in whole units), the mean cannot land on just any decimal. It must be a whole-number total divided by the sample size.

With 20 patients, every possible mean of an integer variable is some multiple of 1/20 = 0.05: 3.00, 3.05, 3.10, and so on. A reported mean of 3.03 is therefore impossible — no set of 20 integers can produce it. The value is either a typo, a miscalculation, or evidence that the reported sample size is not the sample size actually analyzed. GRIM does not tell you which; it tells you the number on the page cannot be right as printed.
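The grid of attainable means can be listed exactly with rational arithmetic. What follows is a minimal sketch, assuming an integer-valued variable; the helper name `attainable_means` and the bounds 3 and 4 are illustrative choices, not anything taken from the original paper.

```python
from fractions import Fraction

def attainable_means(n, low, high):
    """Every mean that n integer responses can produce between two integer bounds.

    Illustrative helper: the sum of n integers is itself an integer, so each
    attainable mean is total / n for some integer total.
    """
    return [Fraction(total, n) for total in range(low * n, high * n + 1)]

grid = attainable_means(20, 3, 4)

# The first few grid points, as exact fractions spaced 1/20 apart.
print([str(m) for m in grid[:4]])

# Membership test: is a printed value on the grid at all?
print(Fraction("3.03") in grid)
print(Fraction("3.05") in grid)
```

Working in exact fractions rather than floats avoids any doubt about binary rounding: 3.05 is on the grid for n = 20 and 3.03 is not.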

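That is the whole test: take the reported mean, multiply by the reported n, and check whether the result is close enough to an integer. "Close enough" has a precise meaning: rounding to d decimal places can move the mean by at most half a unit in the last place, so the product may sit at most n × 0.5 × 10^-d from an integer (0.1 for n = 20 and two decimals; 3.03 × 20 = 60.6 misses by 0.4). If it does not, the triple of (mean, n, decimal places) is internally inconsistent. No dataset required.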
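The check described above can be sketched in a few lines of Python. This is one possible implementation under stated assumptions, not the authors' reference code: the function name `grim_consistent` is hypothetical, the mean is passed as a string so that trailing zeros (and therefore the number of reported decimals) are preserved, and a mean is accepted if either round-half-up or round-half-even reproduces it, since papers rarely say which rule their software used.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN, ROUND_HALF_UP

def grim_consistent(mean_str, n):
    """Return True if some integer total over n responses rounds to the printed mean.

    mean_str: the mean exactly as printed, e.g. "3.10" (a string keeps the decimals).
    n:        the reported sample size.
    """
    reported = Decimal(mean_str)
    # One unit in the last printed decimal place, e.g. 0.01 for "3.10".
    quantum = Decimal(1).scaleb(reported.as_tuple().exponent)
    # Only the two integer totals that bracket mean * n can possibly match.
    floor_total = int((reported * n).to_integral_value(rounding=ROUND_FLOOR))
    for total in (floor_total, floor_total + 1):
        exact_mean = Decimal(total) / Decimal(n)
        for rule in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if exact_mean.quantize(quantum, rounding=rule) == reported:
                return True
    return False

print(grim_consistent("3.03", 20))
print(grim_consistent("3.05", 20))
```

Rebuilding the mean from candidate totals and comparing it with the printed value is equivalent to the multiply-and-compare description, but it sidesteps having to choose a tolerance by hand.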

§02 How often it finds something

Often enough to be worth running. In the original paper, Brown and Heathers pulled 260 articles from three psychology journals and found 71 with enough integer data reported to apply the test. Of those 71, about half — 36 articles — contained at least one impossible mean, and 16 contained more than one. These were published, peer-reviewed papers.

Most of those inconsistencies are harmless typos. But some are not: a mean that cannot come from the stated n is a thread, and pulling it sometimes unravels a mislabeled sample size, a dropped exclusion, or a table that does not describe the analysis it sits beside. The value of the check is not that every hit is fraud — almost none are — but that it is deterministic. It either passes or it does not, and a reviewer who runs it will see exactly what you missed.

§03 When it applies to a clinical manuscript

GRIM needs three things: a mean of a variable measured in integers, a known sample size, and enough reported decimal places to have resolution. It works best when the sample is small and the decimals are generous. With 20 patients and two decimals, only one candidate value in five is a multiple of 0.05, so four of every five candidate values are impossible and a typo has little room to hide. With 400 patients, every two-decimal value is reachable (this holds for any n of 100 or more), and the test has no power unless the mean is reported to more decimal places.
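How much resolution a given n leaves can be computed directly. The sketch below counts which decimal endings are reachable for a sample size; the name `reachable_fraction` is hypothetical, and it assumes round-half-up and a non-negative mean.

```python
from fractions import Fraction

def reachable_fraction(n, places=2):
    """Share of printed decimal endings (.00 to .99 for places=2) that n integers can produce."""
    scale = 10 ** places
    reachable = set()
    for total in range(n):  # fractional parts 0/n, 1/n, ..., (n-1)/n
        # Round total/n half up, expressed in units of 10**-places, using integers only.
        rounded = (2 * total * scale + n) // (2 * n)
        # A value that rounds up to the next whole number wraps around to ending .00.
        reachable.add(rounded % scale)
    return Fraction(len(reachable), scale)

for n in (20, 50, 100, 400):
    share_impossible = 1 - reachable_fraction(n)
    print(n, float(share_impossible))
```

For n = 20 the impossible share is 0.8, and from n = 100 upward it is 0 at two decimals, which is why the test is only informative for small samples or generously reported means.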

In clinical papers the check bites on exactly the numbers authors report most casually: mean number of comorbidities, mean lymph nodes harvested, mean length of stay in whole days, mean count of prior operations, mean score on an integer-scored instrument. A related check, GRIMMER, extends the same logic to reported standard deviations and variances. Neither needs your data — which is precisely why a reviewer, or an automated screen, can run them before you get a chance to explain.

A caution worth stating plainly: GRIM flags an inconsistency; it does not diagnose the cause and it does not certify a paper that passes. A clean GRIM result means the reported means are arithmetically possible, nothing more. This is the same posture the rest of a good statistical review takes — it flags for your judgment; it does not certify.

§04 Running the check before a reviewer does

You can do this by hand for your headline means: multiply each reported mean by its n and confirm the product is within rounding of an integer. It takes a few minutes and it is the cheapest credibility insurance in the paper. The same discipline the GRIM test embodies — recomputing a printed number from its own components rather than trusting it — is the core of deterministic statistical forensics.
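For a table of headline means, the by-hand pass can be scripted so that each flagged value is shown next to the nearest means the stated n could actually produce. This is only a sketch: the rows are invented for illustration, the names `neighbours` and `screen` are hypothetical, and round-half-up is assumed throughout.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_UP

# Hypothetical table rows, invented for illustration: (label, mean as printed, n).
ROWS = [
    ("Comorbidities", "2.35", 20),
    ("Length of stay (days)", "6.43", 14),
    ("Prior operations", "1.27", 18),
]

def neighbours(mean_str, n):
    """The two attainable means (after rounding) that bracket the printed value."""
    reported = Decimal(mean_str)
    quantum = Decimal(1).scaleb(reported.as_tuple().exponent)
    floor_total = int((reported * n).to_integral_value(rounding=ROUND_FLOOR))
    return [
        (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        for total in (floor_total, floor_total + 1)
    ]

def screen(rows):
    """Label each row passed or flagged and list its nearest attainable means."""
    report = []
    for label, mean_str, n in rows:
        options = neighbours(mean_str, n)
        status = "passed" if Decimal(mean_str) in options else "flagged"
        report.append((label, status, [str(o) for o in options]))
    return report

for label, status, options in screen(ROWS):
    print(f"{label}: {status}; nearest attainable means {options}")
```

Listing the neighbouring attainable means is a design choice for the author's benefit: a flagged 1.27 with n = 18 sits between 1.22 and 1.28, which hints at a one-digit typo rather than a wrong sample size.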

RigorMD builds that layer into every pre-submission review: alongside two independent engine appraisals, a deterministic pass recomputes what the reported values allow — proportions from their numerators and denominators, p-values against their test statistics, counts against their totals — and reports each check as passed, flagged, or not checkable from the submitted files. You can see the forensic layer at work in a full sample report →, read how the engine works, or review pricing — the pre-submission review is $25.

For the adjacent check — a p-value that does not follow from its own test statistic — see when a p-value doesn't match its test statistic, and for the wider set of screens journals now run, see how journals catch statistical errors.

How to read this. The GRIM test is a consistency check, not a verdict. A failed check means a reported mean is arithmetically impossible as printed; a passed check means only that it is possible. RigorMD flags methodological and statistical issues for your judgment; it does not certify a manuscript, replace peer review, or replace a statistician's input on study design.