Abstract
Introduction
In the era of evidence-based medicine, synthesizing high-quality data is paramount to shaping best practice. Traditional pairwise meta-analyses remain central to this goal, but their limitations become evident when clinicians must choose between several treatment options not directly compared in trials. Network meta-analysis (NMA) emerged to fill this gap, allowing the combination of direct and indirect evidence to compare multiple interventions within a unified analytic framework.1–3 NMA uses statistical techniques similar to those of meta-regression, but the factors under study are not patient or study characteristics; they are the treatments used within randomized controlled trials.
NMA has seen an exponential rise in use over the past two decades, with publications in gastroenterology increasingly using this approach to inform guidelines, such as for colorectal cancer surveillance and inflammatory bowel disease (IBD) management. However, alongside this growth comes a parallel risk: that methodological complexity and statistical misinterpretation may lead to misuse of the findings. As this paper outlines, NMAs can unintentionally mislead when clinical certainty is obscured by statistical hierarchy.
To address these limitations, we outline the promise and pitfalls of NMAs and present a novel way of displaying NMA results, the GORDON Plot (Grade Of Results Diagram Of Network meta-analysis).4
The promise and problem of NMA
NMA has transformed systematic reviewing by allowing simultaneous comparison of multiple treatments across a web of studies, even when head-to-head trials are missing. This facilitates broader inference and can increase precision by pooling more data.
However, this power introduces complexity. Interpreting indirect comparisons, navigating assumptions of transitivity and coherence, and accounting for multiple effect modifiers are demanding tasks. Even well-conducted NMAs risk being undermined if their findings are not conveyed clearly and accurately to clinicians and decision-makers. Furthermore, methodological errors—particularly in certainty assessment—are common, yet often overlooked.
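To make the idea of an indirect comparison concrete, the following is a minimal sketch of the simplest case: two treatments, A and B, each compared with a common comparator C but never with each other. It assumes effects are expressed on an additive scale such as the log odds ratio, and all numbers are hypothetical. A full NMA model generalizes this idea across the whole network and is not reproduced here.

```python
import math


def indirect_comparison(d_ac, se_ac, d_bc, se_bc):
    """Indirect estimate of A vs B through a common comparator C.

    d_ac, d_bc: effects of A vs C and B vs C on an additive scale
                (for example, log odds ratios).
    se_ac, se_bc: their standard errors.
    Assumes the two sets of trials are independent and similar enough
    for the comparison to be valid (transitivity).
    """
    d_ab = d_ac - d_bc
    # Variances add, so the indirect estimate is less precise
    # than either of the direct estimates it is built from.
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)
    return d_ab, se_ab


def ci95(estimate, se):
    """Approximate 95% confidence interval using a normal approximation."""
    z = 1.96
    return estimate - z * se, estimate + z * se


# Hypothetical inputs: log odds ratios of A vs C and B vs C.
d_ab, se_ab = indirect_comparison(0.60, 0.20, 0.40, 0.20)
low, high = ci95(d_ab, se_ab)
print(f"A vs B (indirect): {d_ab:.2f}, SE {se_ab:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

In this illustration, each source comparison has a standard error of 0.20, yet the indirect estimate has a standard error of roughly 0.28. This is one reason indirect evidence needs careful interpretation.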
Pitfall 1: Complexity obscures clinical interpretation
While statistical sophistication is a strength of NMA, it is also its Achilles heel. Clinicians accustomed to conventional forest plots or pairwise comparisons may struggle to understand league tables, rankograms, or network diagrams. These visualizations, though powerful, often lack intuitive clarity.
This opacity reduces the practical value of NMAs. If key stakeholders—such as guideline developers, clinicians, or patients—cannot readily interpret findings, the utility of the analysis is lost. Worse still, misunderstanding may lead to the adoption of interventions with weak evidence or limited applicability. This disconnect between statistical methods and clinical translation must be addressed.
Pitfall 2: Statistical rankings can mislead
The promise of treatment hierarchies is seductive. Ranking tables suggest simplicity—implying that the “best” treatment can be identified through probability estimates. However, this is rarely the case.
Rankings may reflect small, clinically irrelevant differences. For example, two therapies may have similar effectiveness, but a minor numerical difference may elevate one above the other in the hierarchy. If rankings are interpreted without considering effect size, precision, or relevance to patient priorities, the resulting conclusions can be misleading or even harmful.
Moreover, rankings do not account for the quality of underlying evidence. A top-ranked intervention based on low-certainty data should not be considered superior. Without robust contextualization, numerical rankings can foster false confidence. A recent example is a thorough and complete NMA of therapies for inducing remission in Crohn’s disease,5 which nevertheless ranks therapies by their surface under the cumulative ranking curve (SUCRA) and speaks in superlatives of the “highest estimate.” Some of these treatments are noted in other reviews4,6 to be less relevant because of other methodological issues, and a focus on ranking alone can leave readers with an incomplete message.
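As an illustration of how a ranking metric can turn a small difference into an apparent hierarchy, the sketch below computes SUCRA from rank probabilities using its standard definition: the average of the cumulative probabilities of being ranked in the top 1, top 2, and so on up to the top a − 1 of a treatments. The treatment names and probabilities are hypothetical and are not taken from any of the cited reviews.

```python
def sucra(rank_probs):
    """SUCRA for one treatment.

    rank_probs[j] is the probability that the treatment holds rank j + 1,
    where rank 1 is best. The list must cover every possible rank.
    """
    n_treatments = len(rank_probs)
    if n_treatments < 2:
        raise ValueError("SUCRA needs at least two treatments")
    cumulative = 0.0
    total = 0.0
    # Sum the cumulative probabilities for ranks 1 .. a-1 (the last rank is excluded).
    for p in rank_probs[:-1]:
        cumulative += p
        total += cumulative
    return total / (n_treatments - 1)


# Hypothetical rank probabilities (each row and each rank position sums to 1).
rank_probabilities = {
    "Drug A": [0.40, 0.35, 0.25],
    "Drug B": [0.35, 0.35, 0.30],
    "Placebo": [0.25, 0.30, 0.45],
}

scores = {name: sucra(probs) for name, probs in rank_probabilities.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: SUCRA {score:.3f}")
```

Here Drug A is "ranked first" with a SUCRA of about 0.575 against about 0.525 for Drug B, even though the two have nearly the same chance of being best. The ordering alone says nothing about effect size, precision, or certainty.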
Pitfall 3: Certainty of evidence is often overlooked
GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) offers a structured, transparent method for assessing the certainty of evidence across multiple domains—including risk of bias, inconsistency, indirectness, imprecision, and publication bias. It is widely accepted as the gold standard in evidence synthesis.
Yet GRADE is applied inconsistently in NMAs. In a systematic survey, fewer than one-third of NMAs included any form of certainty rating, and fewer than one in five applied GRADE.7 This omission is not trivial. Without a clear sense of how certain we are in the reported effects, statistical results become clinically hollow.
Moreover, applying GRADE to NMA introduces further complexity, as indirect and network estimates must also be graded—requiring methodological transparency and expertise that many teams lack. Nevertheless, failure to include GRADE devalues the output and weakens its utility for decision-making.
Our recent examples show that applying GRADE6 has a major impact compared with not applying it.5 Doing so lessens the focus on ranking alone: network estimates are considered only in the context of how certain they are. This links to the previous pitfall. The aim is not to remove ranking but to contextualize it in a manner that is core to correct, evidence-based interpretation.
Pitfall 4: The misuse of GRADE—Neglecting outcome balance and decision-making context
When interpreting the findings of an NMA, it is tempting to treat certainty of evidence (as assessed by GRADE) as a series of distinct, discrete judgments tied to each individual outcome. However, GRADE was not only developed to rate certainty in isolation—it was also designed to support clinical decision-making. In fact, the strength of a GRADE-based recommendation relies on balancing multiple outcomes, not just rating them separately.
In practice, any treatment will typically have multiple relevant outcomes. It is not unusual for an NMA to include four or more primary outcomes—often including one or more safety endpoints. Focusing exclusively on the “most favorable” result risks privileging outcomes that are statistically significant but weak in certainty or clinical importance. Conversely, outcomes that are neutral or equivocal may have higher certainty and provide more clinically actionable insights, such as ruling out the value of a treatment.
This is an underappreciated and often overlooked contribution to the evidence base. Unfortunately, reviewers and readers may fail to integrate outcomes into a cohesive synthesis. This synthesis is often left to the guideline developer or practicing clinician. Rarely, it is presented clearly by the review authors themselves. This disconnect between synthesis and interpretation is a critical methodological blind spot and represents one of the most common and problematic pitfalls in NMA reporting.
Pitfall 5: Effect size and clinical meaning get lost
NMAs frequently focus on statistical ranking, which privileges relative statistical differences between treatments. Yet a top-ranked therapy may offer only a marginal benefit, for example a 2% improvement over placebo, which raises important questions about whether such a difference is clinically meaningful.
Clinical magnitude of effect is not just a footnote to statistical output; it should be central to interpretation. This issue was addressed directly in the British Society of Gastroenterology Guidelines for IBD.8 In that initiative, NMAs were used alongside pairwise comparisons, but not before a formal thresholding exercise was conducted. Over 100 expert members of the guideline development group, along with international collaborators, defined the minimum clinically important differences required for outcomes to be considered meaningful.9
This process allowed the guideline authors to assess precision and certainty not in abstract statistical terms but in relation to real clinical practice. For example, a wide confidence interval spanning from a large to a trivial effect would be downgraded for imprecision, while an equally wide interval lying entirely within the range of a large effect (“high to high”) would be interpreted more favorably. More importantly, the magnitude of effect was then aligned with GRADE certainty, allowing the team to report not just “high certainty” but “high certainty of a given magnitude of effect.”
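The logic of judging imprecision against clinical thresholds, rather than by interval width alone, can be sketched as follows. This is a simplified illustration only: the threshold values, band labels, and function names are hypothetical, effects are assumed to be benefits expressed as risk differences, and a real GRADE imprecision judgment considers more than whether an interval crosses a threshold.

```python
def effect_band(value, small_threshold, large_threshold):
    """Classify a single effect value against two clinical thresholds."""
    if value < small_threshold:
        return "trivial"
    if value < large_threshold:
        return "small to moderate"
    return "large"


def spans_multiple_bands(ci_low, ci_high, small_threshold, large_threshold):
    """True if the interval's two limits fall in different magnitude bands.

    An interval that crosses a threshold leaves the clinical magnitude of
    the effect unclear, whatever its width.
    """
    low_band = effect_band(ci_low, small_threshold, large_threshold)
    high_band = effect_band(ci_high, small_threshold, large_threshold)
    return low_band != high_band


# Hypothetical thresholds: below 5% is trivial, 20% or more is large.
SMALL, LARGE = 0.05, 0.20

# Two intervals of the same width (0.28) with different clinical readings.
trivial_to_large = (0.02, 0.30)
large_to_large = (0.22, 0.50)

for label, (low, high) in [("A", trivial_to_large), ("B", large_to_large)]:
    concern = spans_multiple_bands(low, high, SMALL, LARGE)
    print(f"Interval {label}: {low:.2f} to {high:.2f}, imprecision concern: {concern}")
```

Both intervals are equally wide, but only the first crosses a threshold. The second stays within the "large" band throughout, so the magnitude of benefit is not in doubt.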
This nuanced approach exemplifies the level of interpretation required to make NMAs truly decision-supportive. Without it, rankings remain superficial and easily misused.
Principles for responsible NMA use
To support more effective and trustworthy interpretation of NMAs in IBD and other specialties, we recommend five core principles (summarized in Figure 1):

Figure 1. Summary of pitfalls and promise of NMA.
Transparent reporting: Full disclosure of assumptions, methods, and network geometry should be standard. Readers must be able to trace the reasoning behind indirect comparisons and understand the scope of included evidence. Readers may not follow every technical detail of how or why such judgments were made, but the reasoning should be clearly stated and have face validity.
Contextualized rankings: Hierarchies in NMA should never be presented without effect size, confidence intervals, and an explanation of clinical relevance. Rankings must inform, not dictate, decision-making, and presenting them in isolation is reductionist at best and misleading to the point of being dangerous at worst.
GRADE integration: Certainty must be evaluated rigorously for all estimates, whether direct, indirect, or network-derived. GRADE should not be optional or tokenistic, but central to NMA interpretation. It should not be presented as a distinct element or add-on; it should be as central to the results as the confidence intervals.
Outcome balance: Efficacy variables must be clear, decided a priori, and clinically relevant, and all must be weighed alongside safety, tolerability, and patient values. No single outcome should dominate the analytic landscape in isolation. NMAs should have clear protocols or plans that specify the primary or critical outcomes and the rationale for them. The results for all of these outcomes should then be presented, allowing the reader to consider the balance among efficacy outcomes and, more importantly, to weigh this against safety risks.
Clear visualization: Many NMAs give ranking tables, direct and indirect data, network plots, and similar outputs. However, these are not intuitive or clear to the reader, and they do not naturally address the principles above. Innovative tools are needed that provide clarity by integrating ranking and certainty, enabling clinicians and stakeholders to understand both the strengths and limitations of the findings.
To bridge the gap between complexity and clinical utility, we propose the use of the GORDON Plot, an innovative visualization method that aligns two critical dimensions: Ranking Probability (derived from the statistical model of the NMA) and Certainty of Evidence (evaluated using GRADE for each treatment estimate). This technique has been used in several recent IBD publications that show GORDON plots,4,6,10 and it offers great potential to address many of the pitfalls described above. Figure 2 provides an example that demonstrates the clear focus on GRADE and how ranking must be considered in the context of GRADE, with some higher-ranked treatments being very uncertain. To our knowledge, there is currently no alternative that combines ranking, absolute effects, and GRADE judgments in a single visual output, which makes this a unique approach at present.

Figure 2. Dummy/example GORDON plot.
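The underlying idea of reading ranking and certainty together can also be sketched in tabular form. The following is not the GORDON Plot itself, whose layout is defined in the cited publications; it is a minimal, hypothetical data structure that attaches a GRADE certainty level to each ranked treatment and flags low-certainty entries. All names and values are invented for illustration.

```python
from typing import NamedTuple

# The four GRADE certainty levels, from most to least certain.
GRADE_LEVELS = ("high", "moderate", "low", "very low")


class Estimate(NamedTuple):
    treatment: str
    sucra: float      # ranking metric between 0 and 1
    certainty: str    # one of GRADE_LEVELS


def annotate_ranking(estimates):
    """Order treatments by SUCRA and attach certainty to every row.

    Returns tuples of (position, treatment, sucra, certainty, uncertain),
    where `uncertain` is True for low or very low certainty.
    """
    rows = []
    ordered = sorted(estimates, key=lambda e: e.sucra, reverse=True)
    for position, e in enumerate(ordered, start=1):
        if e.certainty not in GRADE_LEVELS:
            raise ValueError(f"Unknown GRADE level: {e.certainty!r}")
        uncertain = e.certainty in ("low", "very low")
        rows.append((position, e.treatment, e.sucra, e.certainty, uncertain))
    return rows


# Hypothetical network results.
example = [
    Estimate("Drug Y", 0.75, "high"),
    Estimate("Drug X", 0.90, "very low"),
    Estimate("Drug Z", 0.40, "moderate"),
]

for position, name, score, certainty, uncertain in annotate_ranking(example):
    flag = "  <- interpret with caution" if uncertain else ""
    print(f"{position}. {name}: SUCRA {score:.2f}, certainty {certainty}{flag}")
```

In this invented example the top-ranked treatment carries very low certainty, while the second-ranked treatment is supported by high-certainty evidence. A ranking table alone would hide that contrast.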
Conclusion
NMA offers a powerful means of comparing complex interventions across IBD treatment landscapes. However, when misapplied or misunderstood, it can mislead clinicians and erode trust in evidence-based practice.
By embedding certainty into ranking and grounding findings in clear, balanced visual communication, NMAs can fulfill their intended role: not to confuse with complexity, but to clarify with confidence.
