Abstract
Objectives
Network meta-analysis is a popular tool for simultaneously comparing multiple treatments and improving treatment effect estimates. However, no widely accepted guidelines are available for classifying the treatment nodes in a network meta-analysis, and the node-making process is often insufficiently reported. We aim to empirically examine the impact of different treatment classifications on network meta-analysis results.
Methods
We collected nine published network meta-analyses with various disease outcomes; each contained some similar treatments that may be lumped. The Bayesian random-effects model was applied to these network meta-analyses before and after lumping the similar treatments. We estimated the odds ratios and their 95% credible intervals in the original and lumped network meta-analyses. We used the adjusted deviance information criterion to assess the model performance in the lumped network meta-analyses, and used the ratios of credible interval lengths and ratios of odds ratios to quantitatively evaluate the estimates’ changes due to lumping. In addition, the unrelated mean effect model was applied to examine the extent of evidence inconsistency.
Results
The estimated odds ratios of many treatment comparisons changed noticeably after lumping, and the precision of many estimates improved substantially. The deviance information criterion values decreased after lumping similar treatments in seven (78%) network meta-analyses, indicating better model performance. Substantial evidence inconsistency was detected in only one network meta-analysis.
Conclusions
Different ways of classifying treatment nodes may substantially affect network meta-analysis results. Including many insufficiently compared treatments and analysing them as separate nodes may not yield more precise estimates. Researchers should report the node-making process in detail and investigate the results’ robustness to different ways of classifying treatments.
Introduction
In many scientific fields, multiple related studies are often available to provide evidence on a common topic. As an effort to synthesize the evidence from different sources and identify potential differences between them, systematic reviews and meta-analyses have become increasingly popular for producing reliable and precise evidence. They are especially useful for assessing treatment effects in comparative effectiveness research for decision makers. Methods for meta-analysis have developed rapidly over the past forty years. 1 In this era of big data, researchers continue to be enthusiastic about gathering evidence from all possible sources and combining it in a more informative form. This motivates the idea of network meta-analysis (NMA), also known as mixed treatment comparison, which compares multiple treatments simultaneously. 2–6 Compared with a traditional meta-analysis of one pair of treatments at a time, NMA has the attractive advantage of allowing comparisons between all available treatments for a certain disease outcome, even if no head-to-head studies compare some of the treatments. By combining both direct evidence (from head-to-head studies) and indirect evidence (from studies with common comparators, say placebo), an NMA likely yields treatment effect estimates with higher precision than traditional pairwise meta-analysis. Its use in health-related research has increased remarkably since 2009. 7
Despite these benefits, researchers take several risks, including heterogeneity between studies, when performing NMAs. 8 Even in traditional pairwise meta-analysis, the extent of heterogeneity is a critical factor that determines whether the collected studies may be properly combined. 9 Because an NMA pools not only multiple studies but also multiple treatments, inconsistent definitions of treatments between studies may introduce an additional source of heterogeneity. Because meta-analysis depends on a comprehensive search for all available studies on the targeted research topic, both to avoid various types of bias 10 , 11 and to avoid wasting information, 12 it is common to collect studies with similar but not identical treatments. Such treatments may be analysed jointly as a single treatment node in some NMAs, while they may serve as separate nodes in others. For example, in an NMA of tocolytic therapy for preterm delivery, Haas et al. 13 grouped usual or standard care without a tocolytic drug together with placebo. They also classified ritodrine, terbutaline, nylidrin, salbutamol, fenoterol, hexoprenaline and isoxsuprine into a group of beta mimetics, and many other active treatments into other groups. In total, the collected studies originally reported 25 distinct treatments, but the classifications led to eight groups; each group was considered a single treatment node in the NMA. In another NMA that compared antihypertensive drugs’ effects on cancer risk, by Bangalore et al., 14 however, placebo and non-placebo controls were treated as two distinct treatment nodes. Lack of consensus on defining and classifying treatments may lead to overlapping NMAs and cause serious confusion about final conclusions. 15
In the current literature, such node-making processes are generally insufficiently reported and lack widely recognized guidelines, not only for pharmacological treatments but also for non-pharmacological ones. 16 , 17 This problem is often referred to as the dilemma between lumping and splitting.
The extreme case of lumping is that all active treatments are classified as one group and all remaining non-active treatments as another group; the NMA is then reduced to a pairwise meta-analysis. This model contains the minimal number of parameters, so its complexity is reduced to the minimum; however, it likely fits the data poorly, and the results may be seriously biased if the treatments lumped in the same group actually differ substantially. On the other hand, the extreme case of splitting is that treatments with any difference in definition are classified as separate nodes in the NMA. Although this may fit the data well, it complicates the NMA model and likely leads to large variances of the effect estimates. With a large number of parameters in the model, the estimation procedures, such as restricted maximum likelihood for frequentist methods and the Markov chain Monte Carlo (MCMC) algorithm for Bayesian methods, may even fail to converge. 18 From the statistical perspective, the dilemma between lumping and splitting treatments is essentially the tradeoff between goodness-of-fit and model complexity in model selection. Various statistical criteria are available to deal with this problem. 19 , 20
This article reanalyses nine NMA datasets, each containing some similar treatments that may be lumped. We examine the effects of different classifications of treatments on their effect estimates. Also, we propose a criterion for assessing the appropriateness of lumping treatments.
Methods
Data sources
We extracted nine NMAs with binary outcomes that contained similar treatments from a total of 58 NMAs investigated by Trinquart et al., 21 which originated from the datasets collected by Veroniki et al. 22 and Bafeta et al. 23 , 24 These NMAs compared treatments for various important disease conditions, including atrial fibrillation, plantar fasciitis, severe erosive oesophagitis and stroke. We denoted each NMA by its first author’s surname and the publication year. All nine NMAs contained similar treatments that may be classified into common groups. For example, these similar treatments included pharmacological ones with different dose levels, intake frequencies or intake methods, and non-pharmacological interventions with different intensity levels.
No ethical approval or patient consent was required for our study, because this article focused on statistical methods for NMAs; all analyses were performed based on published data in the literature.
Lumping treatments
We considered lumping all similar treatments in each treatment class as one node in each NMA. The lumped treatments included: (1) drugs with different dose levels, such as milnacipran 100 and 200 mg/day in the NMA by Roskell et al.; 25 (2) drugs with different intake frequencies, such as calcipotriol b.i.d. (twice a day) and o.d. (once daily) in the NMA by van de Kerkhof et al.; 26 (3) drugs with different intake methods, such as intravenous and oral amiodarone in the NMA by Bash et al.; 27 and (4) non-pharmacological interventions with different intensity levels, such as low-, medium- and high-intensity focused shock wave therapies (with energy flux density ≤0.08 mJ/mm2, 0.08–0.28 mJ/mm2 and ≥0.28 mJ/mm2, respectively) in the NMA by Chang et al. 28
Most studies in the NMAs were two-armed, while the remaining studies were multi-armed (comparing more than two treatments). Some studies’ designs (i.e. treatments compared within studies) changed after lumping similar treatments. If a two-arm study contained two similar treatments to be lumped (e.g. one study in the NMA by Edwards et al. 29 comparing omeprazole 20 and 40 mg), it became single-armed after lumping and thus was removed from the lumped NMA, because such a single-arm study cannot be used in the conventional contrast-based NMA model. 30
Each multi-arm study included at most one group of similar treatments. Each lumped group in all NMAs except that by Chang et al. 28 contained two similar treatments. The lumping in the NMA by Chang et al. 28 involved three similar treatments, while each study in this NMA contained at most two lumped treatments. If a multi-arm study (say, one comparing similar treatments A1 and A2 with another treatment B) contained treatments to be lumped, its similar arms were merged, so it reduced from multi-armed to two-armed.
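As a concrete illustration of this bookkeeping, the following sketch (illustrative Python with hypothetical treatment labels, not part of the original analyses) relabels each study's arms according to a lumping map, merges arms that become identical, and drops studies left with a single arm:

```python
def lump_studies(studies, lump_map):
    """Relabel each study's arms via lump_map, merge arms that become
    identical after lumping, and drop studies left with fewer than two
    arms (single-arm studies cannot enter a contrast-based NMA)."""
    lumped = []
    for arms in studies:
        new_arms = []
        for a in arms:
            g = lump_map.get(a, a)  # unmapped treatments keep their own label
            if g not in new_arms:
                new_arms.append(g)
        if len(new_arms) >= 2:      # remove studies that became single-armed
            lumped.append(new_arms)
    return lumped
```

For example, with `lump_map = {"A1": "A", "A2": "A"}`, a two-arm study comparing A1 and A2 is removed entirely, while a three-arm study comparing A1, A2 and B reduces to a two-arm study of A vs B.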
Statistical analyses
Implementation
We used the Bayesian random-effects model to estimate (log) odds ratios with 95% credible intervals (CrIs) for all treatment comparisons and to account for the heterogeneity between studies in all nine NMAs before and after lumping similar treatments. 3 , 33 The correlation coefficients between treatment comparisons within multi-arm studies were assumed to be 0.5. 3 All treatment comparisons within each NMA were assumed to have a common heterogeneity standard deviation, with a uniform prior bounded between 0 and 10. The Bayesian NMAs were implemented with the R package ‘rjags’ 34 via the MCMC algorithm. We used three Markov chains, each with 200,000 iterations after a 50,000-run burn-in period and a thinning rate of 2 to reduce sample autocorrelations. We checked the chains’ trace plots to assess the MCMC algorithm’s convergence.
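As a minimal sketch of the posterior summaries reported here (illustrative Python rather than the authors' rjags code), the pooled MCMC draws of a log odds ratio can be converted into a posterior median odds ratio and an equal-tailed 95% CrI, whose length on the log scale is the quantity used in later sections:

```python
import numpy as np

def summarize_log_or(draws, level=0.95):
    """Summarize MCMC draws of a log odds ratio: posterior median OR,
    an equal-tailed credible interval on the OR scale, and the CrI
    length on the log scale."""
    draws = np.asarray(draws, dtype=float)
    alpha = (1.0 - level) / 2.0
    lo, med, hi = np.quantile(draws, [alpha, 0.5, 1.0 - alpha])
    return {
        "or": np.exp(med),                 # posterior median odds ratio
        "cri": (np.exp(lo), np.exp(hi)),   # equal-tailed CrI on OR scale
        "cri_length_log": hi - lo,         # CrI length on the log-OR scale
    }
```

In practice the draws would come from the pooled, thinned Markov chains; the summary itself is independent of how the sampler was configured.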
The deviance information criterion and the adjusted deviance information criterion for lumping
We used the deviance information criterion (DIC) to assess the performance of the nine NMAs before and after lumping similar treatments. 20
The DIC is defined as DIC = D̄ + pD, where D̄ is the posterior mean of the deviance, which measures a model’s goodness-of-fit, and pD = D̄ − D(θ̄) is the penalty term, i.e. the effective number of parameters, which measures the model’s complexity; here, D(θ̄) denotes the deviance evaluated at the posterior means of the parameters. A smaller DIC indicates better model performance.
Of note, the number of treatment groups within studies may change due to treatment lumping; some studies were removed as they became single-armed, and some reduced from multi-armed to two-armed. Therefore, the deviance term and thus the DIC may need to be adjusted for fairly comparing NMAs before and after lumping treatments. Recall that the NMA by Chang et al. 28 had three similar treatments to be lumped, while the others had only two similar treatments in each lumped treatment group. In the NMA by Chang et al., 28 only one two-arm study contained two similar treatments, and it became single-armed after lumping; no studies contained all three similar treatments. Therefore, we focused on scenarios of deviance adjustments for lumping two similar treatments within studies.
Specifically, considering a
No studies in the nine NMAs belonged to any other scenarios not specified above. Each NMA had diverse scenarios of deviance adjustments; suppose it contained
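The DIC bookkeeping described above can be sketched as follows (illustrative Python, not the authors' code). The adjustment of adding one deviance unit per treatment group removed by lumping is an assumption inferred from the worked example given later for the NMA by Bash et al. (six removed groups, adjustment of +6):

```python
import numpy as np

def dic(deviance_draws, deviance_at_post_mean):
    """DIC = posterior mean deviance + penalty term p_D, where
    p_D = (mean deviance) - (deviance at the posterior means)."""
    d_bar = float(np.mean(deviance_draws))
    p_d = d_bar - deviance_at_post_mean
    return d_bar + p_d

def adjusted_dic(dic_lumped, removed_groups):
    """Adjust the lumped NMA's DIC for a fair comparison with the
    original DIC, assuming one deviance unit per treatment group
    removed by lumping (as in the Bash et al. example: +6)."""
    return dic_lumped + removed_groups
```

A model with a smaller (adjusted) DIC is preferred, so comparing `adjusted_dic(...)` of the lumped NMA with the original DIC indicates whether lumping improved model performance.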
Comparing the changes of odds ratios due to lumping
We investigated density plots of the estimated odds ratios of similar treatments (say A1 and A2) vs other non-lumped treatments (say B) in the original NMAs before lumping. If the estimates of A1 vs B and A2 vs B dramatically differed for many non-lumped treatments B, then lumping A1 and A2 might not be appropriate. Also, we compared the point estimates of the odds ratios and their 95% CrIs in the original and lumped NMAs. To quantitatively evaluate the effects of lumping similar treatments, we calculated the 95% CrI’s length of the log odds ratio for each treatment comparison. The CrI length’s change due to lumping similar treatments indicated the impact of lumping on the estimate’s precision. Specifically, we used the ratio of the CrI lengths, RCL = (CrI length in the lumped NMA)/(CrI length in the original NMA), so that RCL < 1 indicated improved precision after lumping. Analogously, we used the ratio of the odds ratios, ROR, to quantify the change of each comparison’s point estimate; both ratios can be interpreted as fold changes.
The cutoffs for interpreting the extent of a fold change are often defined case by case. 36–38 A fold change of one indicated no change after lumping. In this article, a fold change larger than 1 was considered unimportant, moderate, substantial or considerable if it was within 1–1.1, 1.1–1.2 or 1.2–1.5, or was >1.5, respectively. Reciprocally, a fold change less than 1 was considered unimportant, moderate, substantial or considerable if it was within 0.91–1, 0.83–0.91 or 0.67–0.83, or was <0.67, respectively.
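These cutoffs can be implemented directly; the sketch below (illustrative Python) classifies a fold change such as an RCL or ROR, treating values above and below one symmetrically via the reciprocal:

```python
def classify_fold_change(r):
    """Classify a fold change (e.g. an RCL or ROR) using the cutoffs
    in the text: unimportant (1-1.1), moderate (1.1-1.2), substantial
    (1.2-1.5) or considerable (>1.5), with reciprocal cutoffs for
    fold changes below one."""
    if r <= 0:
        raise ValueError("fold change must be positive")
    f = r if r >= 1 else 1.0 / r  # fold changes are symmetric on the ratio scale
    if f <= 1.1:
        return "unimportant"
    if f <= 1.2:
        return "moderate"
    if f <= 1.5:
        return "substantial"
    return "considerable"
```

Taking reciprocals reproduces the stated lower cutoffs (1/1.1 ≈ 0.91, 1/1.2 ≈ 0.83, 1/1.5 ≈ 0.67), so a single set of thresholds covers both directions.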
Assessing evidence inconsistency
All analyses above assumed that the direct and indirect evidence was consistent in both original and lumped NMAs. However, evidence inconsistency may appear in NMAs and may impact their results. 39 , 40 Also, lumping treatments may change the structures of treatment comparisons. 41 Consider three nodes A1, A2 and B in an NMA; A1 and A2 are similar treatments to be lumped as a single node A, and B is a non-lumped treatment. There are three cases of changes of treatment comparison structures due to lumping: (i) no direct comparison exists between A1 and B and between A2 and B, so the lumped A and B still have no direct comparison, (ii) A1 and B have direct comparisons while A2 and B do not, so A and B have direct evidence after lumping and (iii) both A1 and A2 have direct comparisons with B, so A and B are still directly compared. The changes of treatment comparison structures may impact the risk of evidence inconsistency.
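The three cases can be identified mechanically; the helper below (illustrative Python with hypothetical treatment names) reports which case applies for a lumped node and a non-lumped treatment, given the set of directly compared pairs in the original network:

```python
def direct_evidence_case(direct_pairs, lumped, b):
    """For a lumped node (a set of similar treatments) and a non-lumped
    treatment b, report which case from the text applies:
    'i'   - no similar treatment is directly compared with b,
    'ii'  - some but not all are, so direct evidence appears only after lumping,
    'iii' - all similar treatments already have direct comparisons with b."""
    compared = [a for a in lumped
                if (a, b) in direct_pairs or (b, a) in direct_pairs]
    if not compared:
        return "i"
    if len(compared) < len(lumped):
        return "ii"
    return "iii"
```

Case (ii) is the one that changes the evidence structure most, since a comparison that was purely indirect before lumping gains direct evidence afterwards.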
Therefore, in addition to using the NMA model with the assumption of evidence consistency, we also applied the unrelated mean effect (UME) model suggested by Dias et al. 42 to allow for evidence inconsistency. In the UME model, each treatment comparison is considered as a separate, unrelated parameter. Like the consistency model, the UME model assumes a common between-study variance for all comparisons, but it makes use of direct evidence only. We compared the consistency model with the inconsistency UME model in terms of DIC. We also compared the two models’ posterior mean deviances of individual data points; values far from their expected value of 1 indicate poor fit and potential evidence inconsistency.
Results
Basic characteristics
Figure 1 shows the nine NMAs’ geometry with the index for each treatment. All treatments were ordered alphabetically in each network, except placebo or control, which (if available) was indexed as the first node. Table 1 presents the NMAs’ basic characteristics, including the outcomes and the numbers of studies, treatments and patients. The NMA by Chang et al. 28 investigated non-pharmacological interventions (shock wave therapies); the remaining eight NMAs compared pharmacological treatments. All nine NMAs originally treated similar treatments as separate nodes; seven NMAs contained studies that directly compared similar treatments. Table 2 lists all treatments with abbreviations, including the similar treatments to be lumped in each NMA.

Network plots of the nine network meta-analyses. Each node represents a treatment, and each edge represents a direct comparison between the corresponding two treatments. The edge’s width is proportional to the number of studies that provide the direct comparison. The dashed circle contains similar treatments that may be lumped.
Characteristics of the nine network meta-analyses.
PASI, psoriasis area and severity index.
Treatment names, abbreviations and classifications in the nine network meta-analyses.
t.i.d., three times a day; b.i.d., twice a day; o.d., once daily.
DICs
Table 3 shows the DICs of the NMAs before and after lumping similar treatments. The penalty terms (representing model complexity) decreased after lumping in eight NMAs; the exception was the NMA by Owen. 32 These decreases were generally expected because lumped NMAs contained fewer treatments and thus the models had fewer effective parameters. The increased penalty term in the NMA by Owen 32 was possibly due to Monte Carlo error. Seven (78%) NMAs performed better after lumping, because their adjusted DICs were smaller than the original DICs before lumping; the exceptions were the NMAs by Owen 32 and van de Kerkhof et al. 26
Original and adjusted deviance information criterion values for the nine network meta-analyses before and after lumping similar treatments. The values of deviance information criterion, deviance and penalty term outside parentheses are produced by the consistency model, and those inside parentheses are produced by the inconsistency unrelated mean effect model.
aReduced groups due to lumping treatments.
bThe value of deviance information criterion for Bayesian model selection.
cThe value of deviance indicating goodness-of-fit.
dThe value of penalty term indicating model complexity.
eAdjusting for the reduced groups after lumping treatments.
fSmaller adjusted DICs produced by the inconsistency model compared with the consistency model.
The NMAs had different scenarios of the DIC adjustment (i.e. the reduced treatment groups due to lumping). For example, six treatment groups were removed due to lumping in the NMA by Bash et al.: 27 two studies belonged to scenario 1, two belonged to scenario 2 and the remaining belonged to scenario 0, so the DIC of the lumped NMA was adjusted by adding 6 for a fair comparison with the original DIC before lumping. In another example of the NMA by Owen, 32 no adjustment was applied to the DIC because all studies belonged to scenario 0.
Density plots
Figures S1–S9 in the Supplementary Material present the density plots of the estimated log odds ratios between lumped and non-lumped treatments in the nine NMAs. The density plots showed different relationships among similar treatments. For example, in the NMA by Bash et al. 27 in Figure S1, the density plots of each pair of similar treatments (2, 3) and (4, 5) were centred on common means, while their shapes were noticeably different, indicating different precisions. The density plots of the similar treatments (8, 9) had similar shapes, but their locations differed. Here, the treatment indexes are detailed in Figure 1. In the NMA by Chang et al. 28 in Figure S2, the three similar treatments (2–4) had similar density plots; however, in the NMA by Edwards et al. 29 in Figure S3, the density plots of the similar treatments (3, 4) differed in both locations and shapes.
Changes due to lumping
Figure 2 presents the changes of the estimated (log) odds ratios with 95% CrIs in the original and lumped networks by Bash et al., 27 and Figures S10–S17 in the Supplementary Material present those in the remaining eight NMAs. The comparisons within the same group of similar treatments (if any) were unavailable in the lumped NMAs, so their results are not shown in these figures; such comparisons are noted in italics on the vertical axes. In the NMA by Bash et al. 27 in Figure 2, many CrIs’ lengths noticeably shrank after lumping similar treatments, and the changes of the estimated odds ratios differed across treatment comparisons.

Estimated log odds ratios with 95% credible intervals in the network meta-analysis by Bash et al. 27 before and after lumping similar treatments.
Figure 3 presents the RCLs and RORs of treatment comparisons in all nine NMAs and the frequencies of their fold changes to different extents. Figure 3(a) focuses on RCLs < 1, which implied improved precision from lumping similar treatments; RCLs > 1, implying lowered precision, are denoted by plus signs (+) at the top of the plot. RCLs < 0.67, implying considerable changes in precision, are likewise denoted by plus signs at the bottom of the plot. Similarly, considerable changes of RORs (<0.67 or >1.5) are denoted by plus signs in Figure 3(b). Some treatment comparisons in four NMAs (i.e. Owen, 32 Reich et al., 44 Roskell et al. 31 and van de Kerkhof et al. 26 ) were less precise after lumping similar treatments, while the other five NMAs yielded improved precision for all treatment comparisons. For example, 17 treatment comparisons in the NMA by Bash et al. 27 had considerable changes in RCLs, and the RORs of all comparisons indicated at least moderate changes. These results were consistent with the change of DIC in this NMA, as shown in Table 3; the DIC decreased from 80.68 to 71.72 (after adjustment), implying better model performance in the lumped NMA. Furthermore, the histograms in Figures 3(c) and 3(d) indicate noticeable changes of RCLs and RORs for many treatment comparisons in all NMAs.

The ratios of 95% credible interval lengths (RCLs, panel a) and the ratios of odds ratios (RORs, panel b) for all treatment comparisons in the nine network meta-analyses and the frequencies of fold changes of RCLs (panel c) and RORs (panel d) to different extents. In panels a and b, the plus signs (+) denote values that are below or above the vertical axis ranges, and the associated numbers are the numbers of such values not shown in the plots. The solid lines indicate no changes, and the dashed, dotted and dash-dotted lines differentiate unimportant, moderate, substantial and considerable fold changes accordingly.
Assessing evidence inconsistency
Table 3 gives the deviance terms and the DICs of both the consistency model and the inconsistency UME model. Most results were fairly close with differences <1, suggesting no substantial evidence inconsistency in most NMAs. However, in the NMA by Phung et al., 43 the adjusted DIC produced by the inconsistency model was much smaller than that by the consistency model, indicating potential evidence inconsistency in this NMA.
Figures S18–S26 in the Supplementary Material show the posterior mean deviances produced by the consistency model and the inconsistency UME model in the original and lumped NMAs. The posterior mean deviances produced by the two models were similar for most NMAs; they were distributed around the expected value 1. However, for the NMA by Phung et al., 43 Figure S22 shows that some posterior mean deviances produced by the consistency model were away from 1; again, they indicated evidence inconsistency.
Discussion
Strengths and limitations
This article included nine NMAs with various disease outcomes, so the results may be representative of a broad class of NMAs. We proposed an adjustment for the DIC to numerically evaluate the benefit of lumping treatments, and we used the RCLs and RORs to quantify the impact of lumping. Potential evidence inconsistency was also considered in our study.
Our study had some limitations. First, for the purpose of illustration, this article considered the case that all similar treatments were lumped. However, in practice, some similar treatments may have truly different effects. As shown by the density plots in Figures S1–S9 in the Supplementary Material, some similar treatments’ density plots were noticeably different. For example, in the NMA by Reich et al., 44 each of the lumped treatments, etanercept and ustekinumab, originally had two different dose levels, and their density plots in Figure S6 show that the treatment effects depended on the dose levels. In this case, directly lumping these treatments across different dose levels may mask possible dose effects, and a dose–response meta-analysis may be used instead to incorporate such effects. 45 In another NMA, by van de Kerkhof et al., 26 Figure S9 indicates that the pairs of betamethasone dipropionate b.i.d. and o.d. and of two-compound formulation b.i.d. and o.d. each had similar density plots, while the pair of calcipotriol b.i.d. and o.d. had noticeably different density plots. Therefore, the former two pairs of similar treatments may be properly lumped, but the latter pair may not.
Second, this article investigated the effects of lumping similar treatments mainly from the statistical perspective, but many clinical considerations about the treatment definitions and effects should be employed in practical NMAs. For example, in the NMA by Bash et al., 27 although the effects of oral and intravenous flecainide were similar in our statistical analyses, the two intake methods may be dramatically different from the clinical perspective, and they should be analysed separately for certain clinical purposes.
Third, this article focused on NMAs with binary outcomes and we investigated the effects of different treatment classifications only on the estimated odds ratios. These may limit the generalizability of our conclusions in NMAs with other types of effect sizes (e.g. risk ratios).
Recommendations and future studies
Meta-analysts should provide detailed information about treatment definitions and justification for classifying treatments when performing NMAs, especially when multiple similar treatments are available. It may not be optimal to analyse all treatments separately, and NMAs with too many insufficiently compared treatments may yield underpowered effect estimates. Some similar treatments may be lumped to effectively increase the estimates’ precision, if the lumping is reasonable from both statistical and clinical perspectives.
Future work includes developing methods to evaluate the effects of lumping similar treatments while maintaining the treatments’ interpretability. It will also be important to account for various types of dose effects in NMAs, instead of simply lumping treatments across doses.
Conclusions
The node-making process has been recognized as an important problem in NMAs; it is poorly reported in many NMAs and often lacks detailed explanation. 15 , 16 This empirical study has shown that different ways of making treatment nodes can substantially affect the results of NMAs. These findings are based on nine published NMAs with similar treatments that could be lumped. The DICs decreased in many NMAs after lumping, indicating better model performance. Also, RCLs and RORs indicated noticeable changes of the estimated odds ratios for many treatment comparisons due to lumping. The UME model did not suggest substantial evidence inconsistency in any NMA except the one by Phung et al. 43