When a p-value doesn't match its test statistic

A p-value is not an independent number. It is computed from a test statistic and its degrees of freedom, so it can be recomputed — and when the recomputation disagrees with the printed value, something is wrong.

§01 Why a p-value can be checked at all

Report a result as t(58) = 2.10, p = 0.04 and you have written down three linked quantities. Given the test statistic (2.10), the degrees of freedom (58), and whether the test is one- or two-tailed, the p-value is fully determined: it is a lookup, the tail area of a known distribution. A reader can recompute it and get, for the conventional two-tailed test, p ≈ 0.040. The reported value agrees, so the triple is internally consistent.

Now suppose the paper instead reads t(58) = 2.10, p = 0.004. The statistic and degrees of freedom still imply p ≈ 0.040. The printed 0.004 does not follow from them. That is an internal inconsistency: at least one of the three numbers is wrong, and the reader cannot tell which. The same logic applies to F, χ², z, and r statistics: any statistic reported with its degrees of freedom (a z statistic needs none) carries its own p-value inside it.
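The recomputation described above can be sketched in a few lines. This is a minimal standard-library sketch, not the code of any particular tool: the function names are hypothetical, the test is assumed to be two-tailed, and the tail area is obtained from the regularized incomplete beta function evaluated with a textbook continued fraction.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Each iteration applies the even-numbered and then the odd-numbered term.
        for num in (
            m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0)),
        ):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            if abs(c) < tiny:
                c = tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def two_tailed_p_from_t(t, df):
    """Two-tailed p-value implied by a t statistic and its degrees of freedom."""
    x = df / (df + t * t)
    return reg_inc_beta(df / 2.0, 0.5, x)

p = two_tailed_p_from_t(2.10, 58)
print(f"t(58) = 2.10 implies p = {p:.3f}")  # about 0.040, as in the text
```

A reported p = 0.04 matches this value at the reported precision; a reported p = 0.004 does not. If SciPy is available, `scipy.stats.t.sf` gives the same tail area and is the more practical choice.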

§02 How common this is

Common enough that an entire tool exists to detect it. A large audit applied the statcheck algorithm to eight psychology journals from 1985 to 2013 and recomputed more than 250,000 reported p-values. Roughly half of the papers using null-hypothesis testing contained at least one p-value inconsistent with its own test statistic, and about one in eight contained a gross inconsistency, one large enough to potentially change the statistical conclusion: for example, a result reported as significant that recomputes to non-significant. The inconsistencies were also skewed: errors more often pushed a result across the 0.05 line toward significance than away from it.
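The distinction between an inconsistency and a gross inconsistency can be sketched as a small classifier. This is an illustration of the idea only, not statcheck's actual code: the function names and the rounding rule are assumptions, a z statistic is used so that the standard library suffices, the test is assumed two-tailed, and only exact reports of the form "p = ..." are handled.

```python
import math

def two_tailed_p_from_z(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def classify(z_reported, z_decimals, p_reported, p_decimals, alpha=0.05):
    """Return 'consistent', 'inconsistent', or 'gross inconsistency'.

    The statistic was rounded before printing, so the true value may lie
    anywhere within half a unit of the last reported decimal place.
    """
    half = 0.5 * 10 ** (-z_decimals)
    z_min = max(abs(z_reported) - half, 0.0)
    z_max = abs(z_reported) + half
    p_high = two_tailed_p_from_z(z_min)  # smallest plausible statistic, largest p
    p_low = two_tailed_p_from_z(z_max)   # largest plausible statistic, smallest p

    # Compare at the precision the p-value was reported with.
    tol = 1e-12
    if round(p_low, p_decimals) - tol <= p_reported <= round(p_high, p_decimals) + tol:
        return "consistent"

    # Gross: every plausible recomputed p falls on the other side of alpha.
    reported_significant = p_reported < alpha
    if reported_significant and p_low >= alpha:
        return "gross inconsistency"
    if not reported_significant and p_high < alpha:
        return "gross inconsistency"
    return "inconsistent"

print(classify(2.10, 2, 0.04, 2))   # consistent
print(classify(2.10, 2, 0.004, 3))  # inconsistent
print(classify(1.80, 2, 0.04, 2))   # gross inconsistency
```

Allowing for rounding of the printed statistic keeps the check from flagging results that differ only in the last reported digit.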

Clinical journals are not psychology journals, but the mechanism is identical wherever a test statistic and a p-value are printed together. In one audit of orthopaedic journals, 17% of papers contained a statistical error capable of changing the conclusion. A mismatched p-value is one of the most detectable members of that family.

§03 What a mismatch does and does not mean

What it does mean. The numbers on the page do not reconcile. Most often this is a transcription slip — a decimal moved, a statistic and its p pulled from different model runs, a value updated in the text but not the table. Occasionally it is load-bearing: the recomputed p crosses 0.05 and the sentence built on “significant” no longer stands.

What it does not mean. A mismatch is not proof of misconduct, and a clean recomputation is not proof the analysis was appropriate. Recomputation checks that the reported numbers agree with each other, not that the right test was chosen, that assumptions held, or that the comparison was pre-specified. It is a consistency screen — it flags; it does not certify. The two questions live in different domains: numerical consistency is one thing, statistical appropriateness another.

§04 Catching it before submission

For your own manuscript, the discipline is mechanical: for every result reported as statistic-plus-p, recompute the p from the statistic and its degrees of freedom and confirm they agree — then reconcile the value in the abstract against the value in the table, digit for digit. Free tools implement the statcheck logic for reports that follow standard formatting; anything the tool cannot parse, you check by hand.
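The abstract-against-table reconciliation can be partly automated. The sketch below is a rough illustration under stated assumptions: the regular expression is a hypothetical pattern that covers only simple reports such as "t(58) = 2.10, p = 0.04", and the helper names are invented for this example. Anything it fails to parse still needs checking by hand.

```python
import re

# Hypothetical pattern for simple statistic-plus-p reports.
PATTERN = re.compile(
    r"(?<!\w)(?P<stat>[tFzr]|χ²|χ2)\s*"
    r"(?:\((?P<df>[\d.,\s]+)\))?\s*=\s*"
    r"(?P<value>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<rel>[=<>])\s*(?P<p>\d*\.\d+)"
)

def extract_results(text):
    """Return (stat, df, value, relation, p) tuples, keeping digits as strings."""
    results = []
    for m in PATTERN.finditer(text):
        df = re.sub(r"\s+", "", m.group("df") or "")
        p = m.group("p")
        if p.startswith("."):
            p = "0" + p  # treat ".04" and "0.04" as the same report
        results.append((m.group("stat"), df, m.group("value"), m.group("rel"), p))
    return results

def unreconciled(abstract, body):
    """Results stated in the abstract that do not appear digit for digit in the body."""
    body_results = set(extract_results(body))
    return [r for r in extract_results(abstract) if r not in body_results]

abstract = "The groups differed, t(58) = 2.10, p = 0.004."
body = "Table 2: t(58) = 2.10, p = 0.04; F(2, 57) = 3.15, p = .05"
for result in unreconciled(abstract, body):
    print("No exact match in body:", result)
```

Digits are compared as strings on purpose, so a value updated in the text but not in the table shows up even when both versions are numerically plausible.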

RigorMD runs this recomputation deterministically on every pre-submission review, as one layer beneath two independent engine appraisals. Where the reported values allow it, the forensic pass recovers the implied z or t, recomputes the p-value and the confidence interval, checks that they agree with each other and with the point estimate, and reports the result as consistent or flagged, with the recomputed number shown, not asserted. You can see that check on real reported values in a full sample report (the forensic section recomputes each CI ↔ p triple), read how the engine works, or review pricing: the pre-submission review is $25.
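One common way to relate a confidence interval to a p-value is the normal-approximation back-calculation sketched below. It is offered as an illustration of the general idea, not as RigorMD's implementation: the function names and example numbers are hypothetical, the interval is assumed symmetric on the analysis scale (the log scale for ratios), and small-sample t-based intervals will differ somewhat.

```python
import math
from statistics import NormalDist

def p_from_ci(estimate, lower, upper, level=0.95, ratio=False):
    """Recover the implied z and two-tailed p from an estimate and its CI."""
    if ratio:
        # Ratios (odds, risk, hazard) are approximately normal on the log scale.
        estimate, lower, upper = math.log(estimate), math.log(lower), math.log(upper)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    se = (upper - lower) / (2.0 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def ci_and_p_agree(lower, upper, p_reported, null=0.0, alpha=0.05):
    """A CI that excludes the null value should accompany p < alpha, and vice versa."""
    excludes_null = not (lower <= null <= upper)
    return excludes_null == (p_reported < alpha)

# Hypothetical mean difference of 2.0 with 95% CI 0.1 to 3.9.
z, p = p_from_ci(2.0, 0.1, 3.9)
print(f"implied z = {z:.2f}, implied p = {p:.3f}")
print(ci_and_p_agree(0.1, 3.9, p_reported=0.20))  # False: CI excludes 0 but p is not below 0.05
```

For ratio measures, pass `ratio=True` to `p_from_ci` and `null=1.0` to `ci_and_p_agree`.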

For the sibling arithmetic check on reported means, see the GRIM test; for the full set of screens a manuscript passes through, see how journals catch statistical errors.

How to read this. A p-value that disagrees with its test statistic is an internal inconsistency, not a verdict on the study. RigorMD flags methodological and statistical issues for your judgment; it does not certify a manuscript, replace peer review, or replace a statistician's input on study design.