Abstract
Introduction
There is considerable effort in the pattern recognition field to combine the outputs of individual classifiers into a multiple classifier system (MCS), also termed an “ensemble,” which strives for greater robustness than any single classifier it contains. The underlying principle is that greater accuracy can be achieved by combining the outputs of classifiers that are strong in different areas of the decision space. Such classifiers are said to be diverse, and intuitively, selecting diverse classifiers should lead to improved accuracy.
However, the concept of diversity has yet to be formalized, although there is consensus among researchers that diverse classifiers make errors in different areas of the classification domain. Herein, we consider diversity to be the abstract concept that describes differences between the outputs of multiple distinct classifiers, while a diversity metric is a rigorously defined method for quantifying this abstract concept. Currently, there are many proposed diversity metrics, cf. literature, 1–3 without any clear consensus as to which is best.
Although studies have examined the relationship between accuracy and diversity, cf. literature, 4–6 these studies are limited in that only a small part of the possible classification domain was considered. By selecting a different classification threshold for each individual classifier in an MCS, it is possible to examine a much wider range of the classification domain. We introduce an alternate scoring technique that allows selection of an individual classification threshold for each classifier in the MCS.
This paper is organized as follows: the next section presents a review of fusion methods and diversity metrics. The subsequent section discusses underlying theory along with the proposed scoring method. Application results, from using academic datasets, are then presented. The concluding remarks are discussed in the last section.
Background
Prior research
Numerous studies have attempted to show a relationship between diversity measures and the performance of an MCS. 3–16 Some studies have had success in showing this relationship; however, they used diversity measures inherently correlated with accuracy. No such success has been achieved with the more “pure” measures of diversity.
Aksela and Laaksonen 4 studied classifier selection using a number of diversity metrics and fusion techniques and state that diversity metrics that disregard classifier correctness are not optimal for selection purposes. However, diversity metrics that take classifier correctness into account are “cheating” by really measuring accuracy instead of diversity. In essence, it is desirable for the diversity of the errors to be high, but the agreement on the correct outputs should also be high. 4
This idea of diversity being important, but not at the cost of accuracy, is echoed in other research as well. Brown and Kuncheva 6 decomposed diversity into “good” and “bad” components, where increasing good diversity reduces error and increasing bad diversity increases error. However, they only did so for one fusion method and loss function combination; a separate decomposition must be performed for every combination of loss function and fusion method. 6 Brown and Kuncheva 6 also did not provide a way to use the good/bad diversity decomposition for building classifier ensembles. Canuto et al. 7 performed a study on ensemble selection with both hybrid (different types of classifiers) and non-hybrid (all classifiers the same type) ensembles. They determined that classifier selection does have an impact on an ensemble's accuracy and diversity, but they did not show any link between accuracy and diversity. They also showed that hybrid ensembles provide the most diversity; this is one reason we use hybrid ensembles in our research. Gacquer et al. 8 proposed a genetic algorithm for ensemble selection that performs well with a specified accuracy-diversity trade-off of 80/20, indicating that diversity must be of at least some use for selecting ensembles that generalize well over a population. However, they mentioned that this may not be true for small data sets, and it may not be true for all large data sets, either. Hadjitodorov et al. 9 looked at cluster ensembles, an unsupervised learning technique, but still offer valid insight. They claim that accuracy peaks somewhere around medium diversity, and that ensembles with very high or very low diversity are a poor choice.
Alternatively, Kuncheva 10 stated that while no relationship between diversity and accuracy has been conclusively proven, it may still be a useful idea in creating ensemble selection heuristics. Kuncheva and Whitaker 3 noted that the diversity metrics tend to cluster with one another, indicating that there is some agreed upon idea of diversity, but stated that using diversity to enhance the design of ensembles is still an open question. Ruta and Gabrys 11 showed a correlation between one measure of accuracy, majority voting error, and two diversity metrics, the pairwise double-fault measure and the non-pairwise fault majority measure. The non-pairwise fault majority measure was designed specifically for majority voting fusion, and thus is expected to show a relationship with majority voting error. 11
Shipp and Kuncheva 12 considered a large number of diversity metrics and fusion methods but did not find a correlation between ensemble accuracy and diversity. Windeatt 2 proposed a diversity metric that is measured across classes and not classifiers; he showed it to be correlated with the base classifier's accuracy but it did not appear to be correlated with the accuracy of the MCS as a whole. While some of the studies claim a correlation between accuracy and a proposed diversity metric, all of the studies fall short of conclusively proving a link between diversity and accuracy. Part of the problem stems from the fact that there is no formal definition of diversity.
With the current state of research in this area examined, one area that has not been researched is the relationship between accuracy and diversity over the classifier threshold domain space. All previous studies focused on the correlation between accuracy and diversity at a single fixed classification threshold, e.g. a decision threshold of 0.5 for a two-class problem.
Classifier fusion
While it is possible to classify observations with a single classifier, greater accuracy may be achieved by creating multiple classifiers and combining their results. 17 Combining multiple classifiers creates an MCS.
One of the most common structures is parallel combination, conceptualized in Figure 1, which we refer to as the standard method to contrast with our alternate method described later. The standard method is certainly not the only possible structure, and many other possibilities exist for combining classifiers. Fundamentally, all combination rules within the parallel structure fall into three different levels: an abstract level, which only requires class labels as outputs; a rank level, which requires a ranked list of class outputs; and a measurement level, which requires class probabilities. 11

Conceptualization of the standard method.
Majority voting
Majority voting is the simplest abstract level fusion method. It involves selecting the most commonly assigned class as the final assigned class. If no class gets more than one vote, the final assignment is given by the individual classifier with the best accuracy. 1 There are other possible voting methods beyond the simple majority described above, cf., 1 but these are not used in our research.
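The voting and tie-break rule above can be sketched as follows (a minimal illustration; the function names and the representation of the accuracy-based tie-break are ours):

```python
import numpy as np

def majority_vote(labels, classifier_accuracies):
    """Majority-vote fusion for one observation.

    labels: list of class labels, one per classifier.
    classifier_accuracies: accuracy of each classifier, used only when no
    class achieves a plurality (the most accurate classifier's vote wins).
    """
    values, counts = np.unique(labels, return_counts=True)
    winners = values[counts == counts.max()]
    if len(winners) == 1:
        return winners[0]
    # Tie (e.g. every class received one vote): defer to the best classifier.
    best = int(np.argmax(classifier_accuracies))
    return labels[best]
```

With labels `[1, 1, 0]` the plurality class 1 wins; with `[0, 1, 2]` no class has a plurality, so the vote of the most accurate classifier is used.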
Measurement level fusion
Measurement level fusion requires more information than abstract level fusion and can perform better because of that additional information. Measurement level fusion schemes require fuzzy measures on the interval [0, 1] as the classifier outputs. These fuzzy measures are treated as class probabilities or as one of the other measures of evidence: possibility, necessity, belief, or plausibility. There is a wide range of measurement level fusion schemes; only some of the most popular are discussed below. The following symbol conventions are used with measurement level fusion:
Generalized mean
The generalized mean fusion method encompasses many commonly used fusion methods. For the supports d_i(x), i = 1, …, N, given by N classifiers, the generalized mean fusion is

μ_α(x) = ((1/N) Σ_{i=1}^{N} d_i(x)^α)^{1/α}     (1)

The choice of α determines the behavior of the fusion: α → −∞ yields the minimum rule, α = −1 the harmonic mean, α → 0 the geometric mean, α = 1 the arithmetic mean (the mean rule), and α → ∞ the maximum rule.
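A short sketch of generalized mean fusion for a single class (the helper name is ours; supports are assumed to lie on (0, 1] so the α → 0 geometric-mean limit is well defined):

```python
import numpy as np

def generalized_mean(supports, alpha):
    """Generalized mean fusion of per-classifier supports for one class.

    supports: per-classifier supports on (0, 1].
    alpha: exponent controlling the fusion behavior (1 = mean rule,
    large negative ~ minimum rule, large positive ~ maximum rule).
    """
    supports = np.asarray(supports, dtype=float)
    if alpha == 0:
        # The limit as alpha -> 0 is the geometric mean.
        return float(np.exp(np.mean(np.log(supports))))
    return float(np.mean(supports ** alpha) ** (1.0 / alpha))
```

For supports [0.2, 0.8], alpha = 1 recovers the arithmetic mean (0.5), while a large negative alpha approaches the minimum support (0.2).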
Product rule
The product rule multiplies the support given by each classifier; if the posterior probabilities are correctly estimated, the product rule gives the best estimate of the overall class probabilities 1

μ(x) = Π_{i=1}^{N} d_i(x)

However, if one classifier gives very low support to a class, it effectively removes the chance of that class being selected.
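A minimal sketch of the product rule and its veto effect (helper name ours):

```python
import numpy as np

def product_rule(supports):
    """Product-rule fusion: multiply the per-classifier supports for a class."""
    return float(np.prod(np.asarray(supports, dtype=float)))

# One near-zero support vetoes the class regardless of the other classifiers:
# product_rule([0.9, 0.95, 0.001]) is far smaller than product_rule([0.6, 0.6, 0.6]).
```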
Generalized ensemble
The generalized ensemble model (GEM) generalizes the mean rule, also called the basic ensemble model (BEM). 18
At its core, GEM is a weighted average of the support given by each classifier

f_GEM(x) = Σ_{i=1}^{N} α_i f_i(x),  with Σ_{i=1}^{N} α_i = 1

The alphas are selected in a way that minimizes the mean squared error (MSE) of the MCS. This is done by calculating the misfit function, m_i(x) = t(x) − f_i(x), the deviation of each classifier's output from the target, along with its correlation matrix C_ij = E[m_i(x) m_j(x)]. The weights, α_i, are then obtained from the inverse of this correlation matrix

α_i = Σ_j (C^{-1})_{ij} / Σ_k Σ_j (C^{-1})_{kj}     (8)
Perrone and Cooper 18 state that weights calculated using equation (8) create the linear combination of classifier outputs that minimizes the MSE. GEM is proven to be more accurate than the best individual classifier and also more accurate than using BEM for fusing classifier outputs.
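Assuming the closed-form weights of Perrone and Cooper, 18 the GEM weights can be estimated from sample misfits roughly as follows (a sketch; the small ridge term is our addition to guard against a singular misfit correlation matrix):

```python
import numpy as np

def gem_weights(misfits, ridge=1e-8):
    """Estimate GEM weights from per-classifier misfits m_i(x) = t(x) - f_i(x).

    misfits: (n_observations, n_classifiers) array of misfit samples.
    Returns weights alpha_i proportional to the row sums of the inverse
    misfit correlation matrix, normalized to sum to one.
    """
    m = np.asarray(misfits, dtype=float)
    C = (m.T @ m) / m.shape[0]            # C_ij = E[m_i m_j]
    C += ridge * np.eye(C.shape[0])       # guard against singular C
    C_inv = np.linalg.inv(C)
    return C_inv.sum(axis=1) / C_inv.sum()

def gem_predict(outputs, alpha):
    """Weighted average of classifier outputs, shape (n_obs, n_classifiers)."""
    return np.asarray(outputs, dtype=float) @ alpha
```

A classifier whose misfits are consistently smaller receives a larger weight, which is the intended MSE-minimizing behavior.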
Diversity metrics
As discussed in the introduction, researchers do not agree on an exact definition of diversity or a definitive diversity metric. However, as mentioned by Polikar, 1 an effort must be made to make the component classifiers of an MCS as diverse as possible to ensure an efficient MCS. In the sections below, the most common measures of diversity are discussed, as well as how to use them to create a diverse set of classifiers. The challenge with current diversity measures is that the goal of linking diversity to accuracy is hampered by the fact that there is not a one-to-one mapping between diversity and accuracy. For each diversity metric discussed below, an example is provided to demonstrate how different sets of classifier outputs may have the same diversity but vastly different accuracies.
Diversity is easy to understand qualitatively but difficult to rigorously quantify, and many different measures have been proposed; some of the most popular are discussed below. Most diversity metrics are designed for pairwise comparisons of classifiers, although a few global measures, such as entropy and Kohavi–Wolpert variance, can handle more than two classifiers. A common approach is to compare multiple classifiers using pairwise diversity metrics by computing the diversity of every pairwise combination and averaging the results. In the pairwise diversity metrics, the convention used is the letters a, b, c, and d, where a is the proportion of observations both classifiers label correctly, b and c are the proportions where only the first or only the second classifier, respectively, is correct, and d is the proportion both classifiers label incorrectly.
Reference for pairwise diversity metrics, from Choi, Cha, and Tappert. 19
Correlation
One of the most commonly used diversity metrics is the correlation between two classifiers

ρ = (ad − bc) / √((a + b)(c + d)(a + c)(b + d))

Two identical classifiers that produce identical labels have ρ = 1, while maximum diversity corresponds to ρ = −1.
Yule’s Q
Yule's Q statistic is

Q = (ad − bc) / (ad + bc)

Two different MCSs can have the same Yule's Q statistic as long as the products ad and bc keep the same ratio, even though their accuracies may be vastly different.
Disagreement
Disagreement, D, is the proportion of observations on which the two classifiers give different labels

D = b + c

Maximum diversity is achieved when D = 1, i.e. the classifiers disagree on every observation.
Double fault
Double fault, DF, is the proportion of observations that both classifiers misclassify

DF = d

Maximum diversity is achieved when DF = 0, i.e. the classifiers never err on the same observation.
Entropy
Entropy, E, measures how evenly the classifiers' votes are split on each observation

E = (1/N) Σ_{j=1}^{N} (1/(L − ⌈L/2⌉)) min{l(z_j), L − l(z_j)}

where L is the number of classifiers, N the number of observations, and l(z_j) the number of classifiers that correctly classify observation z_j. Entropy is zero when all classifiers agree and one when they are split as evenly as possible.
Kohavi–Wolpert variance
Kohavi–Wolpert variance is similar to the disagreement measure but can be calculated with more than two classifiers; diversity is maximized when Kohavi–Wolpert variance is high. It is calculated as 3

KW = (1/(N L²)) Σ_{j=1}^{N} l(z_j)(L − l(z_j))

where l(z_j) is again the number of classifiers that correctly classify observation z_j. Kuncheva has proven that the Kohavi–Wolpert variance of an MCS is related to the average of all pairwise disagreements. 3 Kohavi–Wolpert variance shares the same weaknesses as the entropy measure.
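The four pairwise metrics above can be computed together from the a, b, c, d proportions; a minimal sketch (function and key names ours):

```python
import math

def pairwise_diversity(a, b, c, d):
    """Pairwise diversity metrics from the 2x2 agreement proportions.

    a: both classifiers correct; b: only the first correct;
    c: only the second correct; d: both wrong (a + b + c + d = 1).
    """
    rho = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    q = (a * d - b * c) / (a * d + b * c)
    return {"correlation": rho,
            "q": q,
            "disagreement": b + c,   # proportion of disagreements
            "double_fault": d}       # proportion both misclassify
```

For two identical classifiers (b = c = 0) both correlation and Yule's Q equal 1 and disagreement is 0, matching the descriptions above.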
Methodology
Theory
Ruta and Gabrys 11 claim the difference between abstract level fusion techniques and measurement level fusion techniques is the information used by each technique, but there is one other difference that is important to this research. With an abstract level fusion method such as Majority Voting, class labels are given by each individual classifier then fused into a single label. Because class labels are given before the fusion takes place, each individual classifier can have its own decision threshold independent of the other classifiers. With a measurement level fusion method such as Mean Fusion, the measurements are fused and then a single label is made. Because there is only one label made (and it comes after fusion), there is only one decision threshold for the entire MCS. Although the measurement level fusion techniques make use of more information (fuzzy measures vs. binary labels), they lose degrees of freedom in that they cannot apply decision thresholds to individual classifiers. The following section proposes an alternate scoring technique that attempts to keep the increased information of the fuzzy measures required for measurement level fusion but allows each classifier output to be transformed independently.
Alternative scoring technique
The proposed alternate scoring technique, conceptualized in Figure 2, transforms class probabilities into scores restricted to the interval [0, 1] through the selection of a separate classification threshold for each individual classifier.

Conceptualization of proposed alternative scoring technique.
The alternative scoring technique procedure takes each classifier's output probabilities and transforms them into scores using that classifier's individual threshold. For an individual classifier, an assignment to class 0 would occur if the support for class 1 falls below that classifier's threshold.

Graphical representation of proposed alternative scoring technique.

Sample accuracy surface over a range of thresholds.
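The exact transform is not reproduced in this excerpt; the sketch below is a plausible form consistent with the description (scores rescaled around the per-classifier threshold, with the losing class's score forced to zero, as noted later in the Results section). The function name and exact rescaling are assumptions:

```python
def alternate_score(p1, theta):
    """Hypothetical sketch of the per-classifier rescoring (assumed form).

    p1: the classifier's support for class 1; theta: that classifier's
    individual threshold. Returns (score_class_0, score_class_1) on [0, 1];
    the losing class's score is forced to exactly zero.
    """
    if p1 >= theta:
        s1 = (p1 - theta) / (1.0 - theta) if theta < 1.0 else 1.0
        return 0.0, s1
    s0 = (theta - p1) / theta if theta > 0.0 else 1.0
    return s0, 0.0
```

At p1 = theta both scores are zero; as p1 moves away from the threshold, the winning class's score grows toward 1.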
Experiment
The primary goal of this research is to discover if a relationship between ensemble accuracy and diversity exists.
Example academic datasets
In order to avoid data-driven results and to examine the relationship between accuracy and diversity across a wide spectrum of problem characteristics, 14 data sets were obtained from the UCI Machine Learning Repository. 20 The selected data sets were: Balance Scale, 21 Breast Cancer Wisconsin, 22 BUPA Liver Disorders, 23 Credit Approval, 24 Glass, 25 Haberman's Survival, 26 Fisher's Iris, 27 Mammographic Masses, 28 Parkinson's, 29 Pima Indians Diabetes, 30 Spambase, 31 SPECTF, 32 Transfusion, 33 and Wisconsin Diagnostic Breast Cancer. 34 These datasets have between 3 and 58 features, tens to thousands of observations, and benchmark accuracy values between 65% and 95%. All data sets have two classes or have been coerced into two-class data sets by grouping similar classes until there are two distinct classes.
Classification algorithms
Six classifiers were employed to examine diversity and accuracy in ensembles:
Quadratic discriminant analysis (QDA)
k-Nearest Neighbors (kNN)
Feed Forward Neural Network (FFNN)
Radial Basis Function (RBF)
Probabilistic Neural Network (PNN)
Support Vector Machines (SVM).
Classifier settings and background were as follows:
QDA was considered, consistent with Wu et al.; 35 if a dataset was rank deficient, then LDA, consistent with Wu et al. 35 and Bihl et al., 36 was used. kNN was employed consistent with Fukunaga and Narendra, 37 using the e1071 package 38 and default options. FFNN was implemented per Meyer et al., 39 with one hidden layer with three nodes (used throughout) and a “softmax” (log-linear model). RBF was implemented per Chen et al. 40 and Demuth, Beale, and Hagan, 41 with a mean squared error goal of 0.0, spread = 1.0, max neurons equal to the number of input vectors, and 25 neurons added between displays. PNNs were considered, per literature, 42–44 with a radial basis function spread of 0.1. SVMs used the e1071 package, cf., 38 with a linear kernel and default e1071 options.
For algorithms with tunable architecture settings, e.g. kNN, FFNN, RBF, PNNs, and SVMs, performance gains would logically be possible by selecting settings for each dataset. However, the authors aimed for repeatability in this study, consistent with Liu and Zaidi, 45 by using global settings which are likely suboptimal overall.
Experiment description
An area not examined in prior research and provided for by our alternate scoring technique is the relationship between accuracy and diversity over the entire domain of individual classifier thresholds. Most prior research has only investigated ensemble performance at a single fixed classification threshold (typically 0.5).
Experiment factor/level description.
BEM: basic ensemble model; GEM: generalized ensemble model.
In our experiments, every possible ensemble of three classifiers was evaluated at every threshold from 0.05 to 0.95 with threshold step sizes of 0.05. The diversity metrics and ensemble performances were saved in a database and used in the analysis performed.
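The size of the resulting experimental grid can be sketched as follows (classifier names from the list above; that all per-classifier threshold combinations were swept jointly for each ensemble is our reading of the design):

```python
from itertools import combinations, product

classifiers = ["QDA", "kNN", "FFNN", "RBF", "PNN", "SVM"]
thresholds = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05 .. 0.95

# Every 3-classifier ensemble, each member with its own threshold.
ensembles = list(combinations(classifiers, 3))   # C(6, 3) = 20 ensembles
grids = list(product(thresholds, repeat=3))      # 19^3 threshold triples

print(len(ensembles), len(grids), len(ensembles) * len(grids))
```

Even this modest setup yields 20 × 6859 ensemble/threshold combinations per fusion method and data set, which is why the results were stored in a database.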
Looking for relationships
There are a number of different ways to look for a relationship between accuracy and diversity within the wealth of data produced by our experimental design. One preprocessing step taken for all procedures was to map the diversity metrics onto the interval [0, 1], where 0 is minimum diversity and 1 is maximum diversity. This mapping facilitates comparisons between accuracy and diversity and allows their relative effects to be compared directly. Some diversity metrics, such as disagreement and entropy, already meet this criterion. The remaining diversity metrics are mapped in the following manner
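One plausible mapping onto [0, 1] is sketched below; the exact forms used are not reproduced in this excerpt, so the mappings here are assumptions based only on each metric's natural range and orientation:

```python
def normalize_diversity(metric, value):
    """Map a raw metric value onto [0, 1], 1 = maximum diversity (assumed forms).

    Correlation and Yule's Q lie on [-1, 1] with -1 the most diverse, so they
    are flipped and rescaled linearly; double fault lies on [0, 1] with 0 the
    most diverse, so it is flipped. Disagreement and entropy are assumed to be
    on [0, 1] and correctly oriented already.
    """
    if metric in ("correlation", "q"):
        return (1.0 - value) / 2.0
    if metric == "double_fault":
        return 1.0 - value
    return value
```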
Correlations
The first logical step in uncovering a relationship between diversity and accuracy is to determine whether there is a linear correlation between the diversity metrics collected and the ensemble accuracies. The correlation between test set diversity and test set accuracy is examined for within-set correlation, and the correlation between test set diversity and validation set accuracy is examined for between-set correlation.
Regression
Another possible way to uncover a relationship between diversity and accuracy is through linear regression. If there is a relationship between diversity and accuracy, then the validation set accuracy may be predictable from test set diversity (which would be very useful in ensemble building). It is probable that test set accuracy is the main predictor of validation set accuracy and that diversity may only explain some of the residual error. To determine if this is the case, four regressions are performed on each data set: one with diversity as the only regressor, one with accuracy as the only regressor, one with both diversity and accuracy as regressors, and one with diversity and accuracy as regressors including their interaction. In each regression, the accuracy from the validation set is used as the dependent variable and all of the independent variables come from the test set. This ensures that the regressions show the actual predictive power of the independent variables and do not reflect spurious correlation within the test set. The regression results are examined to determine the effect of test set diversity and accuracy on validation accuracy. To account for the effects of the diversity metric used, the data set, the ensemble combination, and the fusion technique used, dummy variables are encoded. These dummy variables are included as main effects to allow for a change in the regression intercept, and are also interacted with testing accuracy,
This regression does not take into account the diversity metric, data set, and fusion technique in use. The full regression with dummy variables is
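The four base regressions can be sketched with ordinary least squares (a minimal sketch without the dummy variables; the column names `div`, `acc`, and `val_acc` are hypothetical):

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept column prepended.

    X: (n, p) regressor matrix from the test set; y: validation accuracies.
    Returns [intercept, coefficients...].
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta

# The four models described above, given test-set columns div and acc:
#   fit_ols(div[:, None], val_acc)                            # diversity only
#   fit_ols(acc[:, None], val_acc)                            # accuracy only
#   fit_ols(np.column_stack([div, acc]), val_acc)             # both
#   fit_ols(np.column_stack([div, acc, div * acc]), val_acc)  # + interaction
```

The dummy variables for data set, fusion technique, and diversity metric would simply be appended as additional 0/1 columns of X.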
Ensemble selection
To examine the utility of diversity for determining classifier membership in an ensemble, three ensemble selection schemes are used on the test set and compared against the most accurate ensemble and threshold combination in each validation set. The first scheme selects the ensemble with the highest ensemble test accuracy. The second scheme selects the ensemble whose three classifiers have the highest individual test accuracies. The third scheme selects the ensemble with the highest test diversity. These schemes are performed with each fusion type, and their validation set accuracy is compared to the best ensemble's validation accuracy as determined by the oracle. These comparisons are expressed as percentages for relative comparison across fusion techniques, diversity measures, and data sets. If diversity is a useful metric for selecting classifiers for an ensemble, then the selection schemes that use diversity should compare favorably against those that use accuracy.
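The three selection schemes can be sketched as follows (the record keys are hypothetical names for the quantities described above):

```python
def select_ensembles(records):
    """Apply the three selection schemes to a list of candidate ensembles.

    records: dicts with (hypothetical) keys 'ensemble', 'test_accuracy',
    'member_accuracies', and 'test_diversity' for each candidate.
    Returns the ensemble chosen by each scheme.
    """
    by_ensemble_acc = max(records, key=lambda r: r["test_accuracy"])
    by_member_acc = max(records, key=lambda r: sum(r["member_accuracies"]))
    by_diversity = max(records, key=lambda r: r["test_diversity"])
    return {"ensemble_accuracy": by_ensemble_acc["ensemble"],
            "member_accuracy": by_member_acc["ensemble"],
            "diversity": by_diversity["ensemble"]}
```

Each chosen ensemble would then be scored on the validation set and reported as a percentage of the oracle's validation accuracy.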
Results
In our analysis, we first evaluate the performance of the alternate scoring technique and establish that it does allow examination of a greater range of diversity. Next, we show the linear correlation between accuracy and the different diversity measures, and the relative effects of accuracy versus diversity using regression techniques. Finally, we demonstrate how selecting MCS membership using diversity as the primary criterion performs against using accuracy as the primary criterion, using test set metrics to predict ensemble performance on validation data.
Alternative scoring technique
The alternate scoring technique in general did not provide higher MCS accuracy, but it did allow examination of a greater range of diversity. For three of the fusion techniques, BEM, GEM, and Product Rule (denoted PRO in the tables), the alternate scoring technique was able to achieve a higher level of accuracy. With the two remaining fusion techniques, MIN and MAX, the alternate scoring technique did not achieve a very high level of accuracy. This is attributed to the manner in which the alternate scoring technique forces one of the scores to become zero, which can greatly affect the behavior of these statistics. Table 3 shows a comparison of the alternate scoring technique's maximum and average performance for each fusion technique applied, averaged across all data sets. The alternate scoring technique has the potential to perform as well as the standard method but loses some accuracy in the “tails”: its accuracy averaged across the range of classification thresholds is lower than that of the standard method. While the alternate scoring technique showed a potential improvement in “tuning” some ensemble techniques for better performance, its raw performance is not of interest for this study. The primary reason for applying the technique is to examine a greater range of classification threshold combinations and a greater range of diversity; this focus on achieving greater diversity is why we consider the lower average performance acceptable.
Comparison of standard method to alternative scoring technique-achieved accuracy.
BEM: basic ensemble model; GEM: generalized ensemble model; PRO: product.
Diversity increase
Using the alternate scoring technique allowed the exploration of ensembles over a wider range of diversity. The expectation was that this greater range of diversity would provide greater insight into the relationship between the accuracy and diversity of an MCS. As shown in Table 4, the alternate scoring technique achieves a wider range of diversity for every diversity metric; the diversity ranges are averaged across all data sets. The use of the alternate scoring technique increased the diversity range for every data set and all diversity metrics.
Comparison of standard method to alternative scoring technique-achieved diversity range.
Ensemble combinations
The results of the experiment are described in Table 2.
Correlations
Similar to Kuncheva, 10 we begin our exploration of the relationship between diversity and accuracy by examining the correlation coefficient between the two measures. For each diversity metric and fusion method, we calculated the Pearson's r coefficient between the test diversity and test accuracy to determine if there was any within set correlation. The Pearson's r coefficient between the test diversity and validation accuracy was also examined to determine if there was any between set correlation that could possibly be exploited for ensemble selection. The correlation aggregated by diversity metric is perhaps the most informative, and is presented in Table 5.
Correlations by diversity metric.
The correlation for all diversity metrics is small, and for most of the metrics the sign is opposite to what conventional wisdom states. Conventional wisdom says that higher diversity should lead to higher accuracy and therefore a positive correlation, but most of the observed correlation coefficients are negative. This result is supported, however, by Kuncheva, 10 who shows that for most of the diversity range there is a negative correlation with accuracy, as shown in Figure 5, but once diversity exceeds a certain (fairly high) threshold, the relationship reverses to a positive correlation.

A typical accuracy-diversity scatterplot. Reprinted from Kuncheva. 10
We believe that these results do not show anything new or novel; however, they serve to illustrate some common sense concepts about diversity. The more accurate a group of classifiers is, the less opportunity there is for diversity to exist. In the most extreme case, if all the classifiers are 100% accurate, then the ensemble will have zero diversity. Similarly, if all the classifiers are completely wrong, then zero diversity will exist for all measures except those, such as double-fault, that only measure “half” the picture of diversity. As mentioned in the ‘Diversity metrics’ section, each of the diversity measures we examined can produce multiple accuracy values for the same diversity value. Therefore, the results we observed are expected: there cannot be a one-to-one relationship between diversity and accuracy, so no measure of correlation, linear or non-linear, will be able to show anything but a general trend. With regard to the double-fault measure, recall that double-fault measures the probability that both classifiers will misclassify an observation. We then changed to using the diversity score of one minus the double-fault measure, so that higher values indicate higher diversity.
Regression results
Regression analysis was performed to determine whether a relationship between accuracy and diversity exists and can be used for ensemble selection. With this goal in mind, we use ensemble validation set accuracy as the response and metrics from the test set as the regressors. This process emulates a real-world application of picking an ensemble based on test set performance, with the validation set acting as new observations classified after an ensemble is selected. We performed three regressions: using test set accuracy as the only regressor, using test set diversity as the only regressor, and using both test set accuracy and diversity as regressors. Dummy variables were coded to allow for differences between data sets, fusion techniques, and diversity metrics. The primary focus was the coefficients related to accuracy and diversity, which give insight into the relationship between the two. The results of the regressions are presented in Table 6, including the coefficients of interest as well as two measures of prediction performance (consistent with Kutner, Nachtsheim et al. 46 ): the coefficient of determination (R²) and the root mean square error (RMSE).
Regression coefficients + results.
RMSE: root mean square error.
Readily apparent is that while diversity may be used as a selection criterion, diversity as the only regressor yields the lowest R² and the highest RMSE.
When both accuracy and diversity are included in the linear model, the R² improves only marginally over using accuracy alone.
Ensemble selection results
As a result of creating every possible ensemble combination, it was possible to determine which of the possible ensembles was optimal for classifying each validation set. For each data set and fusion technique, there is an ensemble that delivers the maximum possible accuracy obtainable by choosing the very best combination of classifiers and classification thresholds. We call these best possible ensembles “oracles” because that is the ensemble an all-knowing oracle would select if it desired maximum performance. In our analysis, ensembles were selected based on results from the test set, and the performance those ensembles achieved on the validation set was compared to the best ensemble selected by the oracle. Each selected ensemble's validation accuracy was compared to the oracle validation accuracy as a percentage
The selection criteria used were ensemble test accuracy, individual classifier accuracy, and all six test diversity metrics. The percent performance that each selection criterion achieved, aggregated by fusion method, is shown in Table 7. In Table 7, the best selection techniques based on accuracy and the best based on a diversity measure are shown in bold.
Percent achieved by fusion technique.
BEM: basic ensemble model; GEM: generalized ensemble model; PRO: product.
As shown in Table 7, selecting ensembles based on accuracy achieves the highest performance for all fusion techniques, while selecting ensembles based on diversity gives lower performance. In fact, the lowest performing accuracy-based selection technique is never beaten (and is only tied once) by the highest performing diversity-based selection technique, regardless of the fusion technique used. The double-fault diversity metric performed the best of all the diversity metrics, but this is somewhat expected because of the inherent link between accuracy and the double-fault metric. This analysis shows that, even with an expanded range of diversity, test set accuracy should be the primary criterion for selecting ensembles. If two ensembles tie on the accuracy criterion, diversity may be useful as a secondary criterion to break the tie; this will be investigated in future research.
Conclusions
This research presented an alternate scoring technique that allowed a wider range of diversity to be reached when creating MCSs. It demonstrated that there is not a one-to-one relationship between diversity and accuracy. Among the diversity measures examined, the double fault metric appears to be the best, but this is likely due to its inherent link to accuracy rather than it being a good measure of diversity. We have shown that validation accuracy is related to diversity but that this relationship is greatly outweighed by the relationship between test set accuracy and validation accuracy. With our alternate scoring technique allowing a wider range of ensembles to be examined, we confirm that test set accuracy is still the best way to select ensembles.
