Sample Validation Report — Bariatric Weight-Loss Benchmarks

Conclusion calibration

The benchmarks are offered a little more readily than single-system evidence supports

The authors state the counseling-utility claim with moderate confidence; the evidence warrants low certainty (GRADE) — the benchmarks come from one health system, are internally validated only, and rest on a follow-up-captured cohort (55% by three years) whose non-returners were younger, heavier, and less often female, so the absolute bands probably run optimistic. The study is careful, well hedged, and every deterministic check is clean; the gap is one of reach, not error.

01 Design / claim fit: Moderate

02 Results / conclusion alignment: Moderate

03 Statistical appropriateness: Moderate

04 Reporting guideline adherence: Mild

05 Numerical / statistical consistency: No findings

06 Clinical interpretability / verdict: Moderate

07 Contribution & literature positioning: Moderate

§00 Executive summary

This manuscript takes on the one question every bariatric patient asks — how much weight will I lose? Across 2,730 primary sleeve, bypass, and duodenal switch/SADI patients in a five-campus system, it replaces the usual single average with percentile benchmarks — the median, the middle half, and the 10th-to-90th band — for each operation out to three years, using quantile regression with bootstrap intervals. The message (counsel with a range, not a number) is well matched to the design, the procedure ordering is clinically coherent, and every reported number is internally clean.

Two issues bear on how far the absolute benchmarks can be pushed. First, they rest on a follow-up-captured cohort — capture falls to 55% by three years, and the patients who dropped out were younger, heavier, and less often female (the profile that loses less), so the two- and three-year bands probably run optimistic; the authors weighting-tested the between-procedure differences but not the absolute bands. Second, because each patient contributes one weight, the time curve is stitched across calendar eras during the GLP-1 boom — an effect disclosed but not quantified. A seventh domain (contribution and literature positioning) separately flags that the percentile-benchmark method itself is not new — retrieved prior studies build procedure-adjusted weight-loss percentile charts by quantile regression, uncited here — though the DS/SADI-inclusive, contemporary US contribution is real. The deterministic forensic layer found no numerical inconsistencies, and all fifteen references resolve.

§01 Claim map

What the manuscript states, against what the evidence can support.

Stated claim

“Percentile benchmarks allow a surgeon to show patients what weight loss really looks like after each operation, side by side and out to 3 years, as a range rather than a single number — easier and more honest for counseling than the average.”

Supportable claim

“Within this five-campus system, percentile benchmarks describe the distribution of weight loss for each operation and are a more honest counseling aid than an average; the absolute bands are internally calibrated but, given attrition and single-system data, should be read as potentially optimistic pending attrition-weighting and external validation.”

§02 Domain severity scorecard

Seven-domain assessment · 2-engine consensus + literature layer

#	Domain	Severity	Principal finding
01	Design / claim fit	Moderate	Weight-loss “time” is modeled between patients across 2021–2026 eras; the GLP-1-era shift is disclosed but not quantified.
02	Results / conclusion alignment	Moderate	Counseling-ready framing sits in the abstract; the external-validation caveat is confined to the Limitations.
03	Statistical appropriateness	Moderate	Informative loss to follow-up (82%→55%); only the between-procedure differences were weighting-tested, not the absolute bands.
04	Reporting guideline adherence	Mild	STROBE flow complete; the procedure-by-time interaction F is reported without degrees of freedom.
05	Numerical / statistical consistency	No findings	14 deterministic checks pass; subgroup counts, percentages, and capture proportions reconcile; all 15 references resolve.
06	Clinical interpretability / verdict	Moderate	Sound, well-hedged benchmarking; the residual is optimistic absolute bands pending attrition-weighting and external validation.
07	Contribution & literature positioning	Moderate	The percentile-benchmark method itself is not new — retrieved prior quantile-regression percentile-chart studies go uncited; the DS/SADI-inclusive contribution is real, but the primacy framing outruns the retrieved record.

Overall severity: Moderate · 2 central findings · sound, well-hedged benchmarking

§03 Major findings

Language calibration: 1 must-change wording · 3 precision polish. Analytic work: 2 need source-data or analytic work.

Severity	Domain	Finding	Author action	Evidence	Locus
Moderate	03 · Statistics	Benchmarks condition on patients still in follow-up (capture 82%→55% by three years; non-returners younger, heavier, less often female); only the between-procedure differences were weighting-tested, so the absolute bands may run optimistic.	New analysis needed	Quote	Central
Moderate	01 · Design	One weight per patient means the “time” curve is built between patients across 2021–2026 eras (later points from earlier surgeries), confounding postoperative time with the GLP-1-era secular trend the authors could not adjust for.	New analysis needed	Quote	Central
Moderate	03 · Statistics	Duodenal switch and SADI — two different operations — are pooled into the smallest group (271; 79 SADI), giving the widest, least-stable bands and a non-significant early separation from sleeve.	Statistical precision	Quote	Peripheral
Moderate	02 · Alignment	The counseling-ready framing (“what weight loss really looks like”) sits in the abstract while the single-system / external-validation caveat is confined to the Limitations.	Must-change wording	Quote	Peripheral
Mild	03 · Statistics	Many between-procedure contrasts, percentiles, and thresholds are tested without multiplicity adjustment; the large gaps hold, but borderline contrasts (12-mo DS/SADI vs RYGB, P=0.028) are exploratory.	Statistical precision	Quote	Peripheral
Mild	04 · Reporting	The procedure-by-time interaction that licenses the curves is reported as F=24.11, P<0.001 without its degrees of freedom, so the test cannot be reconstructed.	Statistical precision	Quote	Peripheral

§04 Detailed domain review

01

Design / claim fit

Moderate

These curves aren’t one patient followed for three years — they stitch together different patients measured at different times, and the three-year points come from people operated on years earlier, during the run-up in weight-loss drugs. Clinically, read the curve as a snapshot of the program’s experience by era, not a promise of how one patient will track; the later-year bands especially carry an era effect that is acknowledged but not quantified.

Why this matters statistically

Finding: Because each patient contributes a single most-recent weight, the weight-loss “trajectory” is reconstructed between patients: the 36-month points come almost entirely from patients operated on in 2021–2023, the 6-month points from 2024–2026. Secular change over that window — above all the rapid uptake of GLP-1/GIP drugs the authors could not adjust for — is confounded with postoperative time.
Bias direction: Secular-trend confounding — the drug-era shift loads onto the time axis, so the apparent post-nadir shape and the later-time bands partly reflect cohort era rather than a within-patient course.
Evidence: “This is a cross-sectional benchmarking design rather than a longitudinal one… The extract reflects a single April 2026 download, so patients operated on more recently could contribute only earlier follow-up intervals.” (Methods / Results)

Single-weight cross-section: postoperative time is a between-patient covariate confounded with operative-year era (right-censoring by the April-2026 download). Unadjusted secular trend (GLP-1/GIP uptake) loads onto the time axis. ROBINS-I: confounding. Disclosed by the authors; the residual is that the synthesized threat to the time axis is not quantified.

Fix is analytic, not fatal: stratify or adjust by operative year and show whether the procedure-by-time curves and the 36-month bands move.

Technical details

Named bias: secular-trend confounding · ROBINS-I: confounding
GRADE: indirectness · remedy: operative-year (calendar-era) sensitivity analysis
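
A first pass at the operative-year sensitivity analysis named above could be as simple as comparing medians across operative years within one postoperative window. The sketch below is illustrative only: the record layout, the `median_by_era` helper, the window bounds, and every value are hypothetical, and it uses plain stratified medians rather than the manuscript's quantile-regression model.

```python
from collections import defaultdict
from statistics import median


def median_by_era(records, window):
    """Median %TBWL per operative year, restricted to patients whose
    single recorded weight falls inside the postoperative window (months)."""
    lo, hi = window
    by_year = defaultdict(list)
    for op_year, months, tbwl in records:
        if lo <= months <= hi:
            by_year[op_year].append(tbwl)
    return {year: median(vals) for year, vals in sorted(by_year.items())}


# Hypothetical (operative_year, months_postop, pct_tbwl) rows, NOT study data.
records = [
    (2021, 36, 24.0), (2021, 35, 26.0), (2021, 37, 25.0),
    (2022, 36, 27.0), (2022, 34, 29.0), (2022, 38, 28.0),
    (2024, 6, 20.0),  # outside the 30-42 month window, so ignored
]

era_medians = median_by_era(records, window=(30, 42))
print(era_medians)
```

If the medians for the same window differ materially between operative years, part of the apparent time course reflects era rather than postoperative time.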

03

Statistical appropriateness

Moderate

Nearly half of patients (45%) had no weight recorded at three years, and those who skipped follow-up fit the profile that tends to lose less, so the benchmark you’d quote a patient at two or three years is probably a little rosier than the truth, and the gap is likely widest exactly where follow-up is thinnest. Clinically, this is the number that goes into counseling, so treat the later-year bands as best-case pending a follow-up-weighted re-analysis.

Why this matters statistically

Finding: Follow-up capture falls from 82.1% at six months to 55.1% at three years, and the patients who did not return were younger, heavier, and less often female — the profile that tends to lose less. Inverse-probability weighting was applied to the between-procedure differences (which stayed stable) but not to the absolute percentile bands the calculator shows patients. Both engines raised this independently.
Bias direction: Attrition (selection into follow-up), likely upward: the absolute two- and three-year bands may overstate weight loss, worst where capture is thinnest.
Evidence: “…a pattern that could bias benchmarks upward if the least adherent patients also lose the least weight.” (Results / Limitations)

Informative loss to follow-up (differential attrition: capture 82.1%→55.1%; non-returners younger / higher-BMI / less-female). IPW sensitivity is reported for the between-procedure contrasts only; the absolute conditional-quantile benchmarks are not shown robust to missingness. ROBINS-I: selection of participants. This is the dominant finding.

Fix is real analytic work: re-estimate the absolute percentile bands under inverse-probability-of-follow-up weighting and report how far the 24- and 36-month medians and deciles shift.

Technical details

Named bias: attrition / selection into follow-up · ROBINS-I: selection of participants
GRADE: risk of bias · remedy: inverse-probability-of-follow-up weighting on the bands
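
The remedy above can be illustrated with a weighted empirical quantile, in which each returning patient is weighted by the inverse of an estimated probability of being captured in follow-up. This is a minimal sketch under stated assumptions, not the authors' quantile-regression model: `weighted_quantile` is a hypothetical helper, the follow-up probabilities stand in for the output of some fitted attrition model, and all numbers are invented for illustration.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Quantile q (0..1) of `values` under non-negative `weights`:
    the smallest value whose cumulative weight share reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum_share = np.cumsum(w) / w.sum()
    idx = np.searchsorted(cum_share, q, side="left")
    return v[min(idx, len(v) - 1)]


# Hypothetical observed %TBWL among returners, NOT study data.
tbwl = np.array([18.0, 22.0, 25.0, 28.0, 31.0, 35.0])
# Assumed follow-up probabilities: low losers are less likely to return.
p_followup = np.array([0.30, 0.35, 0.50, 0.70, 0.80, 0.90])
ipw = 1.0 / p_followup  # inverse-probability-of-follow-up weights

unweighted_median = weighted_quantile(tbwl, np.ones_like(tbwl), 0.5)
weighted_median = weighted_quantile(tbwl, ipw, 0.5)
print(unweighted_median, weighted_median)
```

In this toy setup the weighted median sits below the unweighted one, which is the direction of bias the finding anticipates; the size of any shift in the real bands can only come from the patient-level data.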

§05 Forensic checks

Recomputed directly from the manuscript’s reported values — no numerical inconsistencies were found.

Quoted — Table 3 (between-procedure differences)

RYGB − SG at 6 months: 3.0 percentage points (95% CI, 1.2 to 4.8; P = 0.001).

Difference in predicted median %TBWL with 95% CI and p-value

Recomputed (CI ↔ p)

implied z: 3.27
reported p: 0.001
recomputed p: ≈ 0.001
CI symmetry: consistent (linear scale)

Consistent — CI, p-value, and point estimate agree
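
The CI-to-p recomputation above can be reproduced in a few lines. The sketch assumes a symmetric interval on the linear scale and a normal approximation; `p_from_ci` is a hypothetical helper name, and the manuscript's bootstrap intervals need not match this approximation exactly.

```python
import math


def p_from_ci(estimate, lower, upper, z_crit=1.959964):
    """Back out the standard error from a symmetric 95% CI, then the
    z statistic and the two-sided normal p-value."""
    se = (upper - lower) / (2 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return z, p


# RYGB - SG at 6 months, as quoted from Table 3.
z, p = p_from_ci(3.0, 1.2, 4.8)
print(round(z, 2), round(p, 3))
```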

Quoted — Table 1 (cohort)

Procedure counts: sleeve 1,796, bypass 663, DS/SADI 271.

Cohort stated as N = 2,730

Denominator check

stated N: 2,730
procedure sum: 2,730
difference: 0

Consistent — counts reconcile to the analytic N

Quoted — Results (follow-up capture)

36 months, 698 of 1,267 (55.1%).

Reported capture proportion

Percentage check

698 / 1,267: 0.5509
reported: 55.1%
recomputed: 55.1%

Consistent — ratio matches the reported percentage
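
The denominator and percentage checks above are plain arithmetic on the quoted values and can be rerun directly. The variable names are illustrative.

```python
# Procedure counts quoted from Table 1 against the stated analytic N.
counts = {"sleeve": 1796, "bypass": 663, "DS/SADI": 271}
stated_n = 2730
difference = sum(counts.values()) - stated_n

# 36-month follow-up capture quoted from the Results.
captured, due = 698, 1267
capture_pct = round(100 * captured / due, 1)

print(difference, capture_pct)
```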

§06 Revision priority

In order. The two text edits are cheap; the follow-up-weighting re-analysis speaks to the study’s central question.

Re-estimate the absolute percentile bands (not only the between-procedure differences) under inverse-probability-of-follow-up weighting, and report how far the 24- and 36-month medians and deciles shift. (Moderate)
Add an operative-year (calendar-era) sensitivity analysis, and state in the Methods that postoperative time is modeled between patients, not within them. (Moderate)
Engage the prior procedure-adjusted quantile-regression percentile-chart literature RigorMD retrieved (see §10), and narrow the primacy framing to the DS/SADI-inclusive contribution the record actually supports. (Moderate)
Carry the single-system, internally-validated-only, external-validation caveat into the abstract conclusion. (Moderate)
Report the interaction’s degrees of freedom, note the between-procedure p-values are unadjusted for multiplicity, and clear the leftover “Click or tap here to enter text.” placeholder from the checklist pages. (Mild)

§07 Language calibration

Suggested wording is triaged by author action. Some wording overstates the evidence and should change; some is recommended risk reduction; some is precision polish; some is left to author discretion.

As written

“Percentile benchmarks allow a surgeon to show patients what weight loss really looks like after each operation.”

Must-change wording

“Within this five-campus system, percentile benchmarks describe the distribution of weight loss for each operation; because they are internally validated only and rest on a follow-up-captured cohort, they should be recalibrated and externally validated before use elsewhere.”

The current wording makes a claim the design or results cannot support.

As written

“…no study has provided percentile-based weight-loss benchmarks across SG, RYGB, and DS/SADI in a single adjusted model.”

Recommended wording

“Procedure-adjusted weight-loss percentile charts have been reported for sleeve and bypass; we extend that approach to DS/SADI in a contemporary multi-site US cohort through 36 months.”

The wording is directionally defensible, but softer wording would reduce reviewer risk.

As written

“DS/SADI led SG by 8.9 and 12.5 points.”

Statistical precision

“DS/SADI led SG by 8.9 and 12.5 points — on the smallest group (271; 79 SADI) with the widest intervals; interpret the DS/SADI bands, especially the tails, as approximate.”

The sentence is acceptable, but could be made more statistically exact.

As written

“…the ordering held across sensitivity analyses.”

Author discretion

“…the procedure ordering held across sensitivity analyses.” The sensitivity analyses confirm the between-procedure differences, not the absolute bands. Well-judged as written; this is a clarity note, not a required change.

A conservative phrasing option; the current wording is defensible.

§08 Journal compliance

Items observable from the extract. The full SOARD pre-submission gate was not run for this sample; these do not affect the methodological grade above.

✓ Met: Structured abstract — expected structured (Background/Methods/Results/Conclusions); observed 243-word structured abstract (limit 250).
✓ Met: Title length — expected ≤150 characters; observed 138 characters.
✓ Met: Reporting guideline — expected design-matched checklist; observed STROBE checklist provided; TRIPOD-relevant for the model.
? Could not assess: Figure resolution ≥300 dpi — expected ≥300 dpi at print size; observed Not assessable from extracted text.

§09 Reference identifiers

Cited DOI and PMID identifiers, resolved against the public registries — Crossref, the DOI handle registry, and PubMed — as of 2026-07-03. A ✓ means the registry record exists and is consistent with the citation as printed; it does not assess whether the cited work supports the claim it is attached to. An identifier the check could not reach is listed as not checked, never assumed to resolve. Problems found here also appear as findings above. All 15 cited identifiers were checked: 15 resolve.

Identifier	Outcome	Registry
DOI 10.7326/M17-2786	✓ Resolves	Crossref
DOI 10.1001/jamasurg.2020.5666	✓ Resolves	Crossref
DOI 10.1056/NEJMoa2206038	✓ Resolves	Crossref
DOI 10.1007/s00464-025-12170-w	✓ Resolves	Crossref
DOI 10.1016/S2213-8587(25)00226-8	✓ Resolves	Crossref

§10 Contribution & literature positioning

Findings — see below

The prior literature RigorMD retrieved into an evidence pack and compared with this manuscript's positioning as of 2026-07-03. This is a positioning-risk check, not a novelty score: it flags where a claim may overlap, understate, or be contradicted by retrieved prior work. It never certifies that a contribution is novel or first — a clean result means only that no directly overlapping prior study was found in this evidence pack. The retrieval is bounded and time-stamped; treat it as a starting point for your own literature review, not a replacement for it. Positioning risks found here also appear as findings above. Retrieved prior studies build procedure-adjusted weight-loss percentile charts by quantile regression — the manuscript's own core method — but are not engaged; the framing that percentile-based benchmarks were previously unavailable is stronger than the retrieved record supports.

Priors we compared you against. The prior work RigorMD retrieved and compared this manuscript against — disclosed so you can see the evidence pack behind the assessment. Listing a work here is not an instruction to cite it; it is the basis on which the positioning was checked.

Prior work	Year	Identifier	In your references
Centile Charts for Monitoring of Weight Loss Trajectories After Bariatric Surgery in Asian Patients	2021	PMID 34363141	Not in your references
Prediction Model for Chronological Weight Loss After Bariatric Surgery in Korean Patients	2024	PMID 38974892	Not in your references
Weight-Independent Percentile Chart of 2880 Gastric Bypass Patients: a New Look at Bariatric Weight Loss Results	2016	PMID 27138602	Not in your references

07

The “first percentile-based benchmarks” framing overlaps retrieved quantile-regression percentile-chart studies

Moderate

The manuscript positions procedure-specific percentile weight-loss benchmarks as previously unavailable, but RigorMD retrieved prior studies that build procedure-adjusted weight-loss percentile charts by quantile regression — the same core method — for sleeve and bypass, plus earlier single-procedure percentile charts. The genuine, defensible contribution is the addition of duodenal switch/SADI and a contemporary multi-site US cohort through 36 months; the claim of primacy for percentile-based benchmarking itself is stronger than the retrieved record supports, and none of these methodological priors are cited.

Positioning risk: Overstated novelty · Introduction (final paragraph) and Discussion (opening)
Evidence pack: pack-002, pack-007

“Longitudinal centile lines were plotted using the post-estimation predictions of quantile regression models, adjusted for type of procedure, sex, ethnicity, and baseline BMI.” (From the prior work RigorMD retrieved and compared)

Literature assessed as of 2026-07-03. Bounded PubMed retrieval on the manuscript's own concept pair, not a systematic review; a work not surfaced here was not necessarily absent from the literature. Listing a prior is disclosure of what was compared, not an instruction to cite it.

§11 Technical appendix

What could be checked from the submitted files, and what could not.

Checked from submitted files

✓ passed Subgroup counts vs analytic N (procedure, race, BMI category)
✓ passed Table 1 percentages vs numerators / denominators
✓ passed Follow-up-capture proportions (82.1% / 74.3% / 62.3% / 55.1%)
✓ passed Between-procedure CI ↔ p concordance (Table 3)
✓ passed Reference identifiers resolve (15 of 15 DOIs)
⚑ flag Absolute bands not weighting-tested for attrition (see §03)

Not checkable from submitted files

— n/a Patient-level reanalysis (no dataset provided)
— n/a Attrition-weighted re-estimate of the absolute percentile bands
— n/a Operative-year (calendar-era) decomposition of the time curve
— n/a External calibration in an independent health system

Scope. This report provides methodological and statistical guidance based on the submitted materials. It does not guarantee publication, replace peer review, certify research validity, or provide clinical treatment advice. Findings marked deterministic are recomputed from the manuscript's own reported values; findings marked quote are traceable to the quoted text. This sample is a real RigorMD appraisal of a de-identified manuscript. The executive summary, the before/after language revisions, and the passed-check panels are illustrative — the delivered report does not yet present them in this form; the findings, severity scorecard, conclusion calibration, journal-compliance, and reference-identifier sections reflect what the delivered report contains. The consensus artifact is downloadable as JSON alongside the PDF on your report page.

Percentile-based weight-loss benchmarks after sleeve gastrectomy, gastric bypass, and duodenal switch: a multi-site cohort study
