Sample report · real appraisal, de-identified manuscript · for demonstration only
RigorMD Validation Report

Percentile-based weight-loss benchmarks after sleeve gastrectomy, gastric bypass, and duodenal switch: a multi-site cohort study

A retrospective cross-sectional benchmarking cohort (2,730 primary bariatric patients, five campuses)
Design · Cross-sectional benchmarking cohort
Claim · Descriptive (benchmarking)
Guideline · STROBE / TRIPOD
RM-2026-0207
generated 2026-07-03
report v1.0
2 engines + deterministic
overall · MODERATE
Conclusion calibration

The benchmarks are offered a little more readily than single-system evidence supports

The authors state the counseling-utility claim with moderate confidence; the evidence warrants low certainty (GRADE) — the benchmarks come from one health system, are internally validated only, and rest on a follow-up-captured cohort (55% by three years) whose non-returners were younger, heavier, and less often female, so the absolute bands probably run optimistic. The study is careful, well hedged, and every deterministic check is clean; the gap is one of reach, not error.

01 Design / claim fit · Moderate
02 Results / conclusion alignment · Moderate
03 Statistical appropriateness · Moderate
04 Reporting guideline adherence · Mild
05 Numerical / statistical consistency · No findings
06 Clinical interpretability / verdict · Moderate
07 Contribution & literature positioning · Moderate

§00 Executive summary

This manuscript takes on the one question every bariatric patient asks — how much weight will I lose? Across 2,730 primary sleeve, bypass, and duodenal switch/SADI patients in a five-campus system, it replaces the usual single average with percentile benchmarks — the median, the middle half, and the 10th-to-90th band — for each operation out to three years, using quantile regression with bootstrap intervals. The message (counsel with a range, not a number) is well matched to the design, the procedure ordering is clinically coherent, and every reported number is internally clean.

Two issues bear on how far the absolute benchmarks can be pushed. First, they rest on a follow-up-captured cohort — capture falls to 55% by three years, and the patients who dropped out were younger, heavier, and less often female (the profile that loses less), so the two- and three-year bands probably run optimistic; the authors weighting-tested the between-procedure differences but not the absolute bands. Second, because each patient contributes one weight, the time curve is stitched across calendar eras during the GLP-1 boom — an effect disclosed but not quantified. A seventh domain (contribution and literature positioning) separately flags that the percentile-benchmark method itself is not new — retrieved prior studies build procedure-adjusted weight-loss percentile charts by quantile regression, uncited here — though the DS/SADI-inclusive, contemporary US contribution is real. The deterministic forensic layer found no numerical inconsistencies, and all fifteen references resolve.

§01 Claim map

What the manuscript states, against what the evidence can support.

Stated claim
“Percentile benchmarks allow a surgeon to show patients what weight loss really looks like after each operation, side by side and out to 3 years, as a range rather than a single number — easier and more honest for counseling than the average.”
Supportable claim
“Within this five-campus system, percentile benchmarks describe the distribution of weight loss for each operation and are a more honest counseling aid than an average; the absolute bands are internally calibrated but, given attrition and single-system data, should be read as potentially optimistic pending attrition-weighting and external validation.”

§02 Domain severity scorecard

Seven-domain assessment · 2-engine consensus + literature layer

Domain | Severity | Principal finding
01 Design / claim fit | Moderate | Weight-loss “time” is modeled between patients across 2021–2026 eras; the GLP-1-era shift is disclosed but not quantified.
02 Results / conclusion alignment | Moderate | Counseling-ready framing sits in the abstract; the external-validation caveat is confined to the Limitations.
03 Statistical appropriateness | Moderate | Informative loss to follow-up (82%→55%); only the between-procedure differences were weighting-tested, not the absolute bands.
04 Reporting guideline adherence | Mild | STROBE flow complete; the procedure-by-time interaction F is reported without degrees of freedom.
05 Numerical / statistical consistency | No findings | 14 deterministic checks pass; subgroup counts, percentages, and capture proportions reconcile; all 15 references resolve.
06 Clinical interpretability / verdict | Moderate | Sound, well-hedged benchmarking; the residual is optimistic absolute bands pending attrition-weighting and external validation.
07 Contribution & literature positioning | Moderate | The percentile-benchmark method itself is not new: retrieved prior quantile-regression percentile-chart studies go uncited; the DS/SADI-inclusive contribution is real, but the primacy framing outruns the retrieved record.

Overall severity · Moderate · 2 central findings · sound, well-hedged benchmarking

§03 Major findings

Language calibration: 1 must-change wording · 3 precision polish. Analytic work: 2 need source-data or analytic work.

Severity | Domain | Finding | Author action | Evidence | Locus
Moderate | 03 · Statistics | Benchmarks condition on patients still in follow-up (capture 82%→55% by three years; non-returners younger, heavier, less often female); only the between-procedure differences were weighting-tested, so the absolute bands may run optimistic. | New analysis needed | Quote | Central
Moderate | 01 · Design | One weight per patient means the “time” curve is built between patients across 2021–2026 eras (later points from earlier surgeries), confounding postoperative time with the GLP-1-era secular trend the authors could not adjust for. | New analysis needed | Quote | Central
Moderate | 03 · Statistics | Duodenal switch and SADI, two different operations, are pooled into the smallest group (271; 79 SADI), giving the widest, least-stable bands and a non-significant early separation from sleeve. | Statistical precision | Quote | Peripheral
Moderate | 02 · Alignment | The counseling-ready framing (“what weight loss really looks like”) sits in the abstract while the single-system / external-validation caveat is confined to the Limitations. | Must-change wording | Quote | Peripheral
Mild | 03 · Statistics | Many between-procedure contrasts, percentiles, and thresholds are tested without multiplicity adjustment; the large gaps hold, but borderline contrasts (12-mo DS/SADI vs RYGB, P=0.028) are exploratory. | Statistical precision | Quote | Peripheral
Mild | 04 · Reporting | The procedure-by-time interaction that licenses the curves is reported as F=24.11, P<0.001 without its degrees of freedom, so the test cannot be reconstructed. | Statistical precision | Quote | Peripheral
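
The multiplicity finding in the table above can be made concrete with a Holm step-down adjustment. The sketch below is illustrative only: Holm is one common choice, not a method the report or the manuscript prescribes, and the family of three p-values is an assumption (0.001 and 0.028 are values quoted in this report; 0.20 is a hypothetical stand-in for a non-significant contrast). The real family of contrasts is larger and is not listed in this extract.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical family: 0.001 and 0.028 appear in the report, 0.20 is a placeholder.
family = [0.001, 0.028, 0.20]
adjusted = holm_adjust(family)
print(adjusted)
```

In this assumed family the borderline 0.028 contrast adjusts to 0.056 and no longer clears 0.05, while the large gap stays well below it. That is the sense in which the report calls the borderline contrasts exploratory.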

§04 Detailed domain review

01

Design / claim fit

Moderate

These curves aren’t one patient followed for three years — they stitch together different patients measured at different times, and the three-year points come from people operated on years earlier, during the run-up in weight-loss drugs. Clinically, read the curve as a snapshot of the program’s experience by era, not a promise of how one patient will track; the later-year bands especially carry an era effect that is acknowledged but not quantified.

Why this matters statistically

Finding
Because each patient contributes a single most-recent weight, the weight-loss “trajectory” is reconstructed between patients: the 36-month points come almost entirely from patients operated on in 2021–2023, the 6-month points from 2024–2026. Secular change over that window — above all the rapid uptake of GLP-1/GIP drugs the authors could not adjust for — is confounded with postoperative time.
Bias direction
Secular-trend confounding — the drug-era shift loads onto the time axis, so the apparent post-nadir shape and the later-time bands partly reflect cohort era rather than a within-patient course.
Evidence
“This is a cross-sectional benchmarking design rather than a longitudinal one… The extract reflects a single April 2026 download, so patients operated on more recently could contribute only earlier follow-up intervals.” (Methods / Results)

Single-weight cross-section: postoperative time is a between-patient covariate confounded with operative-year era (right-censoring by the April-2026 download). Unadjusted secular trend (GLP-1/GIP uptake) loads onto the time axis. ROBINS-I: confounding. Disclosed by the authors; the residual is that the synthesized threat to the time axis is not quantified.

Fix is analytic, not fatal: stratify or adjust by operative year and show whether the procedure-by-time curves and the 36-month bands move.
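
A minimal sketch of the operative-year check suggested above, assuming patient-level records were available (they were not provided with the manuscript). The records, the era cut at 2023/2024, and the 10-to-14-month window are all hypothetical, and a plain stratified median stands in for the authors' quantile-regression model.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (operative year, months since surgery, %TBWL).
records = [
    (2022, 12, 27.0), (2023, 11, 25.0), (2023, 13, 29.0),
    (2024, 12, 30.0), (2025, 12, 32.0), (2025, 11, 28.0),
    (2021, 36, 24.0), (2026, 3, 15.0),
]


def era_of(operative_year):
    """Assumed two-era split; the real cut point is an analytic choice."""
    return "2021-2023" if operative_year <= 2023 else "2024-2026"


def median_by_era(rows, month_window):
    """Median %TBWL per operative era among patients measured inside the window."""
    low, high = month_window
    groups = defaultdict(list)
    for operative_year, months, tbwl in rows:
        if low <= months <= high:
            groups[era_of(operative_year)].append(tbwl)
    return {era: median(values) for era, values in groups.items()}


by_era = median_by_era(records, (10, 14))
print(by_era)
```

If the medians at the same postoperative window differ materially between eras, part of the apparent time curve is an era effect. The comparison is only possible at windows where both eras contribute patients, which is itself worth reporting.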

Technical details
Named bias
secular-trend confounding · ROBINS-I: confounding
GRADE
indirectness · remedy: operative-year (calendar-era) sensitivity analysis
03

Statistical appropriateness

Moderate

About half of patients had no weight recorded at three years, and the ones who skipped follow-up were the ones who tend to lose less — so the benchmark you’d quote a patient at two or three years is probably a little rosier than the truth, and the gap is widest exactly where follow-up is thinnest. Clinically, this is the number that goes into counseling, so treat the later-year bands as best-case pending a follow-up-weighted re-analysis.

Why this matters statistically

Finding
Follow-up capture falls from 82.1% at six months to 55.1% at three years, and the patients who did not return were younger, heavier, and less often female — the profile that tends to lose less. Inverse-probability weighting was applied to the between-procedure differences (which stayed stable) but not to the absolute percentile bands the calculator shows patients. Both engines raised this independently.
Bias direction
Attrition (selection into follow-up), likely upward: the absolute two- and three-year bands may overstate weight loss, worst where capture is thinnest.
Evidence
“…a pattern that could bias benchmarks upward if the least adherent patients also lose the least weight.” (Results / Limitations)

Informative loss to follow-up (differential attrition: capture 82.1%→55.1%; non-returners younger / higher-BMI / less-female). IPW sensitivity is reported for the contrasts only; the marginal conditional-quantile benchmarks are not shown robust to missingness. ROBINS-I: selection of participants. This is the dominant finding.

Fix is real analytic work: re-estimate the absolute percentile bands under inverse-probability-of-follow-up weighting and report how far the 24- and 36-month medians and deciles shift.
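
To illustrate what weighting the bands means, the sketch below computes an inverse-probability-weighted median on toy data. Everything in it is an assumption: the six %TBWL values and capture probabilities are hypothetical, `weighted_quantile` is a helper written for this example, and a marginal weighted quantile is a simplification of the weighted quantile regression with bootstrap intervals that a full re-analysis would use. In practice the capture probabilities would come from a model of follow-up on baseline covariates such as age, BMI, and sex.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative share of total weight reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


# Hypothetical returners: (%TBWL at 36 months, estimated probability of being captured).
# Lower weight loss is paired with lower capture, mirroring the attrition pattern described.
returners = [(18.0, 0.30), (22.0, 0.35), (25.0, 0.55),
             (28.0, 0.60), (31.0, 0.80), (35.0, 0.85)]

tbwl = [loss for loss, _ in returners]
ipw = [1.0 / prob for _, prob in returners]  # each returner also stands in for similar non-returners

unweighted_median = weighted_quantile(tbwl, [1.0] * len(tbwl), 0.5)
weighted_median = weighted_quantile(tbwl, ipw, 0.5)
print(unweighted_median, weighted_median)
```

On these toy values the weighted median falls below the unweighted one, which is the direction of bias the finding describes. The size of the shift in the real cohort is exactly what the requested re-analysis would report.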

Technical details
Named bias
attrition / selection into follow-up · ROBINS-I: selection of participants
GRADE
risk of bias · remedy: inverse-probability-of-follow-up weighting on the bands

§05 Forensic checks

Recomputed directly from the manuscript’s reported values — no numerical inconsistencies were found.

Quoted · Table 3 (between-procedure differences)
RYGB − SG at 6 months: 3.0 percentage points (95% CI, 1.2 to 4.8; P = 0.001).
Difference in predicted median %TBWL with 95% CI and p-value

Recomputed (CI ↔ p)
implied z: 3.27
reported p: 0.001
recomputed p: ≈ 0.001
CI symmetry: consistent (linear scale)
Consistent: CI, p-value, and point estimate agree
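
The CI-to-p recomputation above can be reproduced from the three quoted numbers. This is a sketch under a normal (Wald-type) approximation, which is an assumption on our part: the manuscript's intervals are bootstrap-based, so exact agreement is not guaranteed in general. The function name `ci_to_p` is ours.

```python
from statistics import NormalDist


def ci_to_p(estimate, lower, upper, level=0.95):
    """Recover the implied z and two-sided p-value from a symmetric confidence interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    standard_error = (upper - lower) / (2 * z_crit)
    z = estimate / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def ci_is_symmetric(estimate, lower, upper, tolerance=0.05):
    """True if the point estimate sits at the interval midpoint, within rounding."""
    return abs((lower + upper) / 2 - estimate) <= tolerance


# RYGB minus SG at 6 months, as quoted from Table 3.
z, p = ci_to_p(3.0, 1.2, 4.8)
symmetric = ci_is_symmetric(3.0, 1.2, 4.8)
print(f"implied z = {z:.2f}, recomputed p = {p:.4f}, symmetric = {symmetric}")
```

The implied z rounds to 3.27 and the recomputed p rounds to 0.001, matching the panel.
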

Quoted · Table 1 (cohort)
Procedure counts: sleeve 1,796, bypass 663, DS/SADI 271.
Cohort stated as N = 2,730

Denominator check
stated N: 2,730
procedure sum: 2,730
difference: 0
Consistent: counts reconcile to the analytic N

Quoted · Results (follow-up capture)
36 months, 698 of 1,267 (55.1%).
Reported capture proportion

Percentage check
698 / 1,267: 0.5509
reported: 55.1%
recomputed: 55.1%
Consistent: ratio matches the reported percentage
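
The denominator and percentage checks above are simple arithmetic. The sketch below reproduces them from the quoted values; the function names and the half-unit rounding tolerance are our own choices, not a description of how the report's deterministic layer is implemented.

```python
def count_gap(subgroup_counts, stated_total):
    """Summed subgroup counts minus the stated N; 0 means the counts reconcile."""
    return sum(subgroup_counts.values()) - stated_total


def percentage_consistent(numerator, denominator, reported_pct, decimals=1):
    """True if the reported percentage is within half a unit of its last printed digit."""
    recomputed = 100 * numerator / denominator
    return abs(recomputed - reported_pct) <= 0.5 * 10 ** -decimals


# Values quoted from Table 1 and the Results.
procedure_counts = {"sleeve": 1796, "bypass": 663, "DS/SADI": 271}
gap = count_gap(procedure_counts, 2730)
capture_ok = percentage_consistent(698, 1267, 55.1)
print(gap, capture_ok)
```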

§06 Revision priority

In order. The two text edits are cheap; the follow-up-weighting re-analysis speaks to the study’s central question.

  1. Re-estimate the absolute percentile bands (not only the between-procedure differences) under inverse-probability-of-follow-up weighting, and report how far the 24- and 36-month medians and deciles shift. · Moderate
  2. Add an operative-year (calendar-era) sensitivity analysis, and state in the Methods that postoperative time is modeled between patients, not within them. · Moderate
  3. Engage the prior procedure-adjusted quantile-regression percentile-chart literature RigorMD retrieved (see §10), and narrow the primacy framing to the DS/SADI-inclusive contribution the record actually supports. · Moderate
  4. Carry the single-system, internally-validated-only, external-validation caveat into the abstract conclusion. · Moderate
  5. Report the interaction’s degrees of freedom, note the between-procedure p-values are unadjusted for multiplicity, and clear the leftover “Click or tap here to enter text.” placeholder from the checklist pages. · Mild

§07 Language calibration

Suggested wording is triaged by author action. Some wording overstates the evidence and should change; some is recommended risk reduction; some is precision polish; some is left to author discretion.

As written

“Percentile benchmarks allow a surgeon to show patients what weight loss really looks like after each operation.”

Must-change wording

“Within this five-campus system, percentile benchmarks describe the distribution of weight loss for each operation; because they are internally validated only and rest on a follow-up-captured cohort, they should be recalibrated and externally validated before use elsewhere.”

The current wording makes a claim the design or results cannot support.
As written

“…no study has provided percentile-based weight-loss benchmarks across SG, RYGB, and DS/SADI in a single adjusted model.”

Recommended wording

“Procedure-adjusted weight-loss percentile charts have been reported for sleeve and bypass; we extend that approach to DS/SADI in a contemporary multi-site US cohort through 36 months.”

The wording is directionally defensible, but softer wording would reduce reviewer risk.
As written

“DS/SADI led SG by 8.9 and 12.5 points.”

Statistical precision

“DS/SADI led SG by 8.9 and 12.5 points — on the smallest group (271; 79 SADI) with the widest intervals; interpret the DS/SADI bands, especially the tails, as approximate.”

The sentence is acceptable, but could be made more statistically exact.
As written

“…the ordering held across sensitivity analyses.”

Author discretion

“…the procedure ordering held across sensitivity analyses.” The sensitivity analyses confirm the between-procedure differences, not the absolute bands. Well-judged as written; this is a clarity note, not a required change.

A conservative phrasing option; the current wording is defensible.

§08 Journal compliance

Items observable from the extract. The full SOARD pre-submission gate was not run for this sample; these do not affect the methodological grade above.

§09 Reference identifiers

Cited DOI and PMID identifiers, resolved against the public registries — Crossref, the DOI handle registry, and PubMed — as of 2026-07-03. A ✓ means the registry record exists and is consistent with the citation as printed; it does not assess whether the cited work supports the claim it is attached to. An identifier the check could not reach is listed as not checked, never assumed to resolve. Problems found here also appear as findings above. All 15 cited identifiers were checked: 15 resolve.

Identifier | Outcome | Registry | Notes
DOI 10.7326/M17-2786 | ✓ Resolves | Crossref |
DOI 10.1001/jamasurg.2020.5666 | ✓ Resolves | Crossref |
DOI 10.1056/NEJMoa2206038 | ✓ Resolves | Crossref |
DOI 10.1007/s00464-025-12170-w | ✓ Resolves | Crossref |
DOI 10.1016/S2213-8587(25)00226-8 | ✓ Resolves | Crossref |
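
For readers who want to repeat an existence check themselves, here is a minimal sketch against the public Crossref REST API (`https://api.crossref.org/works/{doi}`). It is not RigorMD's implementation. It only tests whether a record exists; it does not compare the record with the citation as printed. The helper names and the three outcome labels are our assumptions, chosen to mirror the rule that an unreachable registry is reported as not checked.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the lookup URL, percent-encoding characters such as parentheses."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def check_doi(doi, timeout=10):
    """Return 'resolves', 'not found', or 'not checked' (registry unreachable)."""
    try:
        with urlopen(crossref_url(doi), timeout=timeout) as response:
            record = json.load(response)
        return "resolves" if record.get("status") == "ok" else "not found"
    except HTTPError as err:
        # A 404 means Crossref has no such record; other HTTP errors are inconclusive.
        return "not found" if err.code == 404 else "not checked"
    except (URLError, TimeoutError):
        # Network failure: never assume the identifier resolves.
        return "not checked"


if __name__ == "__main__":
    print(check_doi("10.7326/M17-2786"))
```

A DOI registered with an agency other than Crossref would come back as not found here even though it exists, so a fuller check would fall back to the DOI handle registry before reporting a problem.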

§10 Contribution & literature positioning

Findings — see below

The prior literature RigorMD retrieved into an evidence pack and compared with this manuscript's positioning as of 2026-07-03. This is a positioning-risk check, not a novelty score: it flags where a claim may overlap, understate, or be contradicted by retrieved prior work. It never certifies that a contribution is novel or first — a clean result means only that no directly overlapping prior study was found in this evidence pack. The retrieval is bounded and time-stamped; treat it as a starting point for your own literature review, not a replacement for it. Positioning risks found here also appear as findings above. Retrieved prior studies build procedure-adjusted weight-loss percentile charts by quantile regression — the manuscript's own core method — but are not engaged; the framing that percentile-based benchmarks were previously unavailable is stronger than the retrieved record supports.

Priors we compared you against. The prior work RigorMD retrieved and compared this manuscript against — disclosed so you can see the evidence pack behind the assessment. Listing a work here is not an instruction to cite it; it is the basis on which the positioning was checked.

Prior work | Year | Identifier | In your references
Centile Charts for Monitoring of Weight Loss Trajectories After Bariatric Surgery in Asian Patients | 2021 | PMID 34363141 | Not in your references
Prediction Model for Chronological Weight Loss After Bariatric Surgery in Korean Patients | 2024 | PMID 38974892 | Not in your references
Weight-Independent Percentile Chart of 2880 Gastric Bypass Patients: a New Look at Bariatric Weight Loss Results | 2016 | PMID 27138602 | Not in your references

07

The “first percentile-based benchmarks” framing overlaps retrieved quantile-regression percentile-chart studies

Moderate

The manuscript positions procedure-specific percentile weight-loss benchmarks as previously unavailable, but RigorMD retrieved prior studies that build procedure-adjusted weight-loss percentile charts by quantile regression — the same core method — for sleeve and bypass, plus earlier single-procedure percentile charts. The genuine, defensible contribution is the addition of duodenal switch/SADI and a contemporary multi-site US cohort through 36 months; the claim of primacy for percentile-based benchmarking itself is stronger than the retrieved record supports, and none of these methodological priors are cited.

Positioning risk
Overstated novelty · Introduction (final paragraph) and Discussion (opening)
Evidence pack
pack-002, pack-007
“Longitudinal centile lines were plotted using the post-estimation predictions of quantile regression models, adjusted for type of procedure, sex, ethnicity, and baseline BMI.” (From the prior work RigorMD retrieved and compared)

Literature assessed as of 2026-07-03. Bounded PubMed retrieval on the manuscript's own concept pair, not a systematic review; a work not surfaced here was not necessarily absent from the literature. Listing a prior is disclosure of what was compared, not an instruction to cite it.

§11 Technical appendix

What could be checked from the submitted files, and what could not.

Checked from submitted files

  • ✓ passed Subgroup counts vs analytic N (procedure, race, BMI category)
  • ✓ passed Table 1 percentages vs numerators / denominators
  • ✓ passed Follow-up-capture proportions (82.1% / 74.3% / 62.3% / 55.1%)
  • ✓ passed Between-procedure CI ↔ p concordance (Table 3)
  • ✓ passed Reference identifiers resolve (15 of 15 DOIs)
  • ⚑ flag Absolute bands not weighting-tested for attrition (see §03)

Not checkable from submitted files

  • — n/a Patient-level reanalysis (no dataset provided)
  • — n/a Attrition-weighted re-estimate of the absolute percentile bands
  • — n/a Operative-year (calendar-era) decomposition of the time curve
  • — n/a External calibration in an independent health system
Scope. This report provides methodological and statistical guidance based on the submitted materials. It does not guarantee publication, replace peer review, certify research validity, or provide clinical treatment advice. Findings marked deterministic are recomputed from the manuscript's own reported values; findings marked quote are traceable to the quoted text. This sample is a real RigorMD appraisal of a de-identified manuscript. The executive summary, the before/after language revisions, and the passed-check panels are illustrative — the delivered report does not yet present them in this form; the findings, severity scorecard, conclusion calibration, journal-compliance, and reference-identifier sections reflect what the delivered report contains. The consensus artifact is downloadable as JSON alongside the PDF on your report page.
