Abstract
1. Introduction
Image segmentation is the foundation of computer vision applications. Its purpose is to partition the image into several independent, meaningful and semantically related regions. An effective and accurate image segmentation algorithm is crucial for many applications, such as content-based image retrieval, object recognition, and object tracking. It also facilitates higher-level image analysis and understanding.
Image segmentation is a hot research topic in academia and industry, where many algorithms have been proposed and evaluated, such as threshold-based segmentation [1], edge-based segmentation [2], region growth segmentation [3, 4], graph-based segmentation [5] and clustering-based segmentation [6, 7]. These algorithms usually have two common problems: 1) a lack of efficient methods for fusing different image features during segmentation and 2) the spatial semantics of images are ignored in the algorithms.
The visual features of images include global features, such as colour and texture, and local features, such as SIFT features. During segmentation, each visual feature has a different effect on different scenes. Algorithms based on a single type of feature can produce good results for some categories of images, but they cannot be applied to broad categories with good results. Fan [8] suggested that fusing multiple types of features could improve the performance and effectiveness of segmentation algorithms. Traditional segmentation methods [9, 10] usually employ a multidimensional feature vector based on several global features, such as colour and texture. However, the dimensions of these features are different, so the segmentation result may be affected more by features with higher dimensions. The other features might only have a limited effect on improving the final segmentation performance. Malisiewicz and Efros [11] did not use a multidimensional feature vector and instead they proposed to calculate the similarity between regions based on a single feature, before fusing the similarities using a positive linear combination function for segmentation. The problem of assigning a weight to the similarity for each feature is very challenging with this algorithm.
The “Bag-of-Words” concept is used widely for text analysis. Recently, it was introduced into image feature extraction and analysis. Most researchers [12-14] follow the approach of using clustered affine-invariant point descriptors as visual words. With this model, images are treated as documents, where each image is represented by a histogram of visual words. Cao [15] and Perronnin [16] produced visual words using global features (colour, texture and shape) and local features (SIFT), respectively. Each region had one visual word based on the global features and a set of visual words based on the local features. The global features and the local features are different, so the visual words produced are also very different. Simply combining these visual words cannot fully leverage the effectiveness of each feature, so the segmentation performance is hindered.
The spatial semantic information of pixels or regions is ignored by most existing segmentation methods. The spatial relationship of the words in a text may not affect content distillation seriously, but the spatial characteristics of images are critical for image segmentation. For example, two connected regions will usually be merged into one during segmentation if their visual features are similar, e.g., an ocean-sky image has the sky region at the top and the ocean region at the bottom. These two connected regions are similar in terms of their visual features. However, they are different objects semantically. If the segmentation process only considers the visual similarity and ignores the spatial information, the result could be incorrect.
After an in-depth analysis of these two common problems, we considered that the construction of a high-dimensional vector (visual words) from several features is inadequate because they cannot fully exploit every feature during segmentation. Instead, a better approach is to combine the segmentation results from every feature and deliver a better final result. Inspired by the cluster ensemble idea, we have built several subsegmentation tasks where each works on a single type of feature. Each feature may deliver the best result for some categories of images, so each subsegmentation task will deliver the best result for some images. The cluster ensemble method can enhance the strengths of some features and circumvent their weaknesses. We also considered that spatial information is a latent semantic for images, so it could be an effective approach for addressing the “semantic gap” issue between low-level visual features and high-level semantics. Therefore, this approach could combine the subsegmentation results effectively to provide the best final segmentation and achieve a much more stable performance over broad categories of images.
Based on the analysis above, we propose a novel cluster ensemble-based image segmentation algorithm. The major contributions of our work are as follows. 1) To improve the quality and stability of segmentation and overcome the problem of fusing different features, we introduce the cluster ensemble concept into image segmentation technology. We use a single but different type of feature, such as colour, texture or SIFT feature, to segment the image separately (subsegmentation), before the subsegmentation results are represented as a hypergraph model. The final segmentation is achieved using a spectral clustering algorithm with this hypergraph model. This algorithm effectively combines the subsegmentations based on different features, which could avoid the limitations of algorithms based on a single type of feature, a feature vector, or visual words and this approach could achieve a much stable performance for broad categories of images. Our algorithm also scales better because we could add more types of features if we find they are good for certain categories of images. The results showed that this method was highly robust to noise, exceptions and variable samples. 2) To exploit spatial information, we used the PageRank idea from Internet applications during image subsegmentation. First, we used the Normalized Cut (N-Cut) [17] algorithm to segment an image into several regions. The regions of the image are treated like web pages on the Internet and the links between the web pages (neighbouring regions) are computed based on the similarity between regions. In this algorithm, the importance of each page is calculated according to the semantic similarity between the linked pages. This is different from the original PageRank algorithm [18], which only considers the number of links. The merging process for regions selects the most similar page based on the semantic similarity from all the linked pages. The effectiveness and speed of region merging is also improved by selecting the most semantically similar page using a greedy policy.
This paper is organized as follows. Section 2 describes the algorithm in detail. Section 3 contains details of our experiments and their results. Section 4 provides our conclusions.
2. The Cluster Ensemble-based Image Segmentation Algorithm
2.1 The Phases of the Algorithm
The phases of our segmentation algorithm are shown in Fig. 1. This includes the following three phases.

Cluster ensemble-based image segmentation algorithm
Phase 1. Given an image, this algorithm begins with an initial over-segmentation algorithm, which partitions it into several homogeneous regions. To ensure that the pixels in a region belong to the same object and to avoid obtaining regions larger than the objects, we over-segment the image using the Normalized Cut (N-Cut) algorithm [17] initially. Actually, any over-segmentation algorithm [19, 20] could be used for this purpose as long as it can provide good over-segmentation results.
Phase 2. There are three parallel subsegmentation tasks during this phase because we select three types of features, i.e., colour, texture and SIFT. If we add more features, we only need to add more tasks. During each subsegmentation task, the feature will be extracted from each region. A linking graph is built where the regions are nodes in the graph. A link is added when the similarity between two adjacent regions is greater than a threshold, where the direction of the link is from the small region to the large region. Based on the PageRank algorithm, the importance of each region is computed according to the semantic similarity between a region and its linked regions. The linked regions will be clustered into clusters according to the linking relations and importance of the nodes. During each task, the subsegmentation results will be produced in parallel using each type of feature.
Phase 3. This is the cluster ensemble phase for the three subsegmentation results. As shown in Fig. 1, the subsegmentation results are represented as a hypergraph. The initial regions produced by N-Cut are the nodes while each cluster from the subsegmentation tasks is a hyperedge on the hypergraph. We achieve the final segmentation result by applying the spectral clustering algorithm to this hypergraph.
2.2 Feature-spatial Semantic-based Subsegmentation
After applying the over-segmentation algorithm in [17], the image is separated into N regions. In this way, the problem of segmenting the image is cast into merging the regions into objects. Each region has multiple connected neighbours. It has been shown that it is better to merge according to the semantic similarity of regions. But how do we merge the regions based on their semantic similarity? We mimic the web page links used on the Internet. The regions can be viewed as pages on the Internet. When two connected regions have similar semantics, there will be a link between them. Otherwise, they will not have a link even when they are neighbours. Thus, we transform the spatial neighbouring relationships of the regions into a linking relationship based on their semantic similarity. Each subsegmentation process is performed as follows.
Extract each type of visual features for each region, e.g., colour, texture or SIFT feature.
Set a similarity threshold and compute the feature similarity between each pair of neighbouring regions.
Iterate through each region and based on the feature similarity of each with its neighbours:
a link is added when the feature similarity between two neighbours is greater than the threshold.
the direction of the link points from the small region to the large region. Thus, if we compare the areas of two regions
After these steps, an image is represented as a linking graph with N nodes. The merging process can be viewed as the jumping probability from one page to another on the Internet. We use the jumping probability
Where
The semantic similarity of linked regions cannot be computed directly. Thus, we use the visual feature similarity between the linked regions to simulate the semantic similarity.
Where
Based on the above description, the equation for the PageRank algorithm has been modified to Eq. (4).
Where
When we change the neighbour relationship to a linking relationship, we always assume that the direction of the link is from the small region to the large region. This assumption is for spatial semantics. The merging weights for the larger region will be higher than for small regions. Thus, the large region has a higher possibility of being merged with surrounding similar neighbours to form an object. This improves the accuracy of segmentation. We also use greedy policy by always picking the region with the maximum weight for merging, which speeds up the process.
Leveraging the linking relationship and the merging weight for each region ensures that image regions will be clustered into different clusters in a distributed manner and will produce several initial subsegmentation results.
2.3 Cluster Ensemble-based Subsegmentation Integration
Cluster ensembles [21] combine multiple clustering results obtained from different sets of features to produce the final result. We use a cluster ensemble policy to combine the initial subsegmentation results, which are based on different visual features of the image, into the final segmentation result.
During this phase, the process is as follows.
The subsegmentation results are represented as a hypergraph model. As stated in Section 2.2, we use the colour, texture and SIFT features and perform the subsegmentation tasks in parallel. How do we map the subsegmentation results into a hypergraph? As shown in Fig. 2, the label vectors
Spectral clustering based on hypergraph integration. When three sub segmentation results are represented as a hypergraph, we use the spectral clustering algorithm to combine the results of sub segmentation into the final segmentation. According to spectral clustering theory, the assignment of two regions to one cluster means that these two regions are similar so they can be merged during several subsegmentation tasks based on different features. Therefore, the integration result will merge them into one region. By contrast, if the two regions are only merged in a few subsegmentation tasks, this means that they may belong to different objects so they should not be merged. For example, two green regions may be grass or bushes. These regions will be merged during a colour-based subsegmentation task, but they will not be merged by texture and SIFT-based subsegmentation tasks. Thus, they will not be merged after the integration.

Hypergraph modelling of three subsegmentation results for cluster ensembles
2.4 The Cluster Number of a Cluster Ensemble
Optimal combined clustering should share the most information with the original clustering of subsegmentations. During segmentation, the sum of the differences for the regions in the object should be lowest when the linked regions with similar features are clustered into one object. Good segmentation requires that all regions are clustered into several objects correctly, so the sum of the differences for all clusters should be lowest. This also applies to the subsegmentations. The best result for all the subsegmentations is the one with the lowest sum of the differences. The number of clusters for this subsegmentation may be used as the cluster number for the final result. This process is performed as follows.
For each image, calculate the sum of differences of each subsegmentation result as
After normalization, we find the subsegmentation with the smallest sum of differences to get the cluster number,
The final cluster number is set as
3. Experiments
To evaluate the proposed approach, extensive experiments have been conducted on four different data sets and comparisons with state-of-the-art approaches have also been performed. Segmentation results from diverse images are presented for intuitive and perceptual judgments, while F-measure and amount of fragmentation are adopted for quantitative evaluation.
3.1 Data sets
Four publicly available data sets are exploited in our experiments, which contains a great diversity of images, so that the assessment can be more objective and convincing.
Berkeley Segmentation Data Set (BSDS500) [22]. This data set is probably the most used one and is very challenging. It includes 500 natural images of all categories, each of which is accompanied with five ground truth segmentations.
Weizmann Segmentation Evaluation Database [23]. There are 200 images in this data set and there are also three manual segmentations for each image. Half of the images contain only a single object in the foreground, the size of which varies from image to image, while the other half contain two objects that are also in different sizes. These objects differ from their surroundings in some or at least one type of low-level features, e.g., texture, colour, intensity, etc. Therefore, segmentation algorithms based on a single type of feature will have great difficulty achieving stable performances on this data set.
Weizmann Horse Database [24]. The data set consists of 328 images of horses that vary in poses, sizes and backgrounds. All the images are manually segmented.
Microsoft Research Cambridge Object Recognition Image Database (MSRC) [25]. In this database, a variety of digital photographs are grouped into categories, including trees, cows, sheep, cars, flowers, etc. The sizes of the images are generally 640 × 480 and they are downsized by half for processing efficiency in our experiments.
3.2 Evaluation Scheme
To demonstrate the effectiveness of the proposed approach, abundant segmentation results from all four data sets are presented for intuitive and perceptual judgments along with comparisons with several state-of-the-art algorithms, i.e., mean shift [26], normalized cuts [17], Gpb [27] and spatial-LTM [15]. Segmentation evaluation can be subjective because people usually have different understandings towards the same image and such distinctions in semantics lead to inconsistent segmentation evaluation. To avoid such divergence, we only focus on the most salient objects in each image.
Quantitative measures are also adopted for evaluation. F-measure [23] is used to assess the consistency between segmentation results and ground truth segmentations. By denoting the precision and recalling the values of segmentation by P and R respectively, the corresponding F-measure is defined as
In addition, the amount of fragmentation, which is defined as the number of segments needed to cover a single object, is computed as well.
3.3 Features and Settings
As previously mentioned, we need to partition each image into over-segmented regions first. There are several methods that can be used to obtain an over-segmentation, such as those from [19, 20]. Here we use N-Cut [17]. The number of over-segmented regions for each image is set to 50 in our experiments.
With regards to the subsegmentation tasks, three features are employed in the experiments, colour, texture and SIFT. Note that these features can be simply replaced by others or more can be added since the subsegmentation scheme does not depend on a specific type of feature and all the subsegmentation results are integrated using spectral clustering on the constructed hypergraphs. The proposed approach enjoys great flexibility and extensibility.
A colour histogram represents the number of pixels that have colours in each of a fixed list of colour ranges. It can be built for any kind of colour space such as HSV or RGB. In our experiments, the HSV colour histogram is computed for each over-segmented region, resulting in 72-dimensional feature vectors. The texture features are based on grey-level co-occurrence matrices and eight statistics, including mean and variance of energy, entropy, inertia and correlation, are used to describe a region. For the SIFT descriptor, first a visual word dictionary with 1000 entries is built according to the Bag of Words (BoW) model. Then a visual word histogram is constructed by mapping the descriptors to the dictionary.
Similarity between over-segmented regions is simply based on the Euclidian distances between corresponding feature vectors and the Gaussian similarity function with a fixed parameter of 0.6. It is set to the mean of the similarity values for the similarity threshold during the construction of the linking graph. And the area threshold is set to 0.05. For a few images that contain very small objects, it is adjusted to 0.01 instead. In addition, the constant factor ε in Eq. (4) is empirically set to 0.85.
3.4 Segmentation and Comparison Results
3.4.1 Results on Weizmann Segmentation Evaluation Database
The proposed approach is compared with three state-of-the-art algorithms for this data set and F-measure and amount of fragmentation are calculated for quantitative comparisons.
Mean Shift [26]. The algorithm measures similarity in both spatial and range domains based on a computed attraction force field. Only intensity cues are used for segmentation. For the majority of the experiments, the parameters for mean shift, i.e., hs, hr and minimum region size M, are set as 8, 7 and 1000 respectively. M shrinks to 500 for images containing very small objects. (Source code and precompiled binary are available at http://coewww.rutgers.edu/riul/research/code/EDISON).
Normalized Cuts (N-Cut) [17]. The segmentation problem is also formulated as graph partitioning and brightness values as well as spatial locations are used for calculation of edge weights. Note that N-Cut segmentation starts from pixels while our approach is based on over-segmented results. In the experiments, the number of segments for N-Cut is set to five. (Matlab implementation is available at http://www.cis.upenn.edu/˜jshi/software).
Contour Detection and Hierarchical Image Segmentation (Gpb) [27]. After the contours have been detected, sequences of threshold values in the range from zero to one are tried for segmentation until the optimal results are met. (Matlab implementation is available at http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html).
F-measure and the amount of fragmentation are shown in Table 1 and segmentation results are illustrated in Fig. 3 and Fig. 4. By comparison, it can be seen that the proposed approach is superior to the others, with a better F-measure and amount of fragmentation. Our algorithm can segment out the salient object in its entirety, particularly for more textured and complex ones. N-Cut and mean shift are significantly outperformed, as indicated in Table 1, since only one type of feature is used in them, which cannot be adapted to a wide range of images. Note that the N-Cut segmentation results are displayed with only segmented region boundaries because the object is sometimes torn apart and covered by several segments. This mainly results from its tendency to partition the “graph” into more balanced clusters. Gpb is relatively more powerful but it suffers from over-segmenting due to strong intra-region variations, which is a common issue with contour-based approaches. Besides, weak boundaries can lead to over-merging and make it hard to determine the threshold for segmentation.

A sample of the results obtained by applying our algorithm to images compared to other algorithms. From top to bottom: Original images, N-Cut, mean shift, Gpb and our method

A sample of the results obtained by applying our algorithm to images compared to other algorithms. From top to bottom: Original images, N-Cut, mean shift, Gpb and our method
Salient Objects Segment Coverage Test Results on the Weizmann Segmentation Evaluation Database
Unlike the algorithms analysed above, the proposed approach takes advantage of multiple types of features and then fuses the subsegmentation results by clustering over the constructed hypergraph. An object may be segmented into several pieces in one subsegmentation, but as long as these pieces are consistent in terms of at least some features, they can be merged in other subsegmentations and form a better result after final integration (see Section 3.4.3 for further discussions).
As stated in the previous section, because of the semantic gap problem, two regions with similar visual features may have different semantics and belong to different objects. In the meantime, two regions with different visual features can have the same semantics and belong to the same object. Take the image of a vase in Fig. 3 for example, many regions within the vase have completely different visual features, like colour, texture, SIFT or contour. Therefore, the methods of N-Cut, mean shift and Gpb can't deliver the whole vase in the final segmentation result. Our algorithm uses the PageRank scheme in each subsegmentation task. The merging of two regions is not simply based on the similarity of visual features. We used linking relationship and merging weight (PR, defined in Eq. 4) instead, which measure more semantic similarity, so we can achieve better segmentation results in such cases.
3.4.2 Results on BSDS500
Comparisons with Gpb are also conducted on BSDS500 [22], with some results presented in Fig. 5 and table 2.

A sample of the results obtained by applying our algorithm to images compared to Gpb. From left to right: Original images, Gpb and our method
Salient Objects Segment Coverage Test Results on BSDS500
Table 2 shows our algorithm is slightly better than Gpb. As shown in row 1-3 of Fig. 5, when the internal contour of salient objects is weak, the segmentation using Gpb is better than our algorithm on the completeness of the objects. Our algorithm uses N-cut in the beginning for over segmentation. At this stage, if some pieces of the objects are segmented into other objects and do not form a single region, it cannot be corrected in the final results. For example, in Fig. 5, the legs of the horse were in the same region as the grass after N-cut. The horse object will miss some legs in the final result of our algorithm. However, when the internal texture or contour of the salient object is very complex, the Gpb algorithm has the problem of over segmentation, as shown in row 4-8 of Fig. 5. In these cases, our algorithm can avoid this over segmentation problem and achieve better results, as explained in Section 3.3.2.
3.4.3 Results on Weizmann Horse Database
Using the Weizmann Horse Database [24], the proposed approach is compared with Spatial-LTM [15], which combines multiple types of features based on a graphical model similar to LDA and enforces spatial coherency by sharing the same topic label within a region. As with the proposed approach, Spatial-LTM also starts from over-segmentation. Comparison results are presented in Fig. 6 and Table 3.

A sample of the results obtained by applying our algorithm to images compared to Spatial-LTM [15]. (a) Original image, (b) Spatial-LTM, (c) Our method
Salient Objects Segment Coverage Test Results on Weizmann Horse Database
It can be clearly seen that our approach outperforms Spatial-LTM, even though it makes use of multiple types of features as well. Spatial-LTM estimates the topic label for each region by maximizing the likelihood, which is the product of all the factors corresponding to the features within this region. The problem is that for one region there is only a single appearance feature (average value of pixel colour and texture) but quite a few visual words (SIFT descriptors). Such an imbalance weakens the influences of the appearance feature and most of the contributions to the final results come from visual words. Consequently, the benefits from multiple types of features are significantly constrained.
By contrast, the proposed approach provides great flexibility for all the features employed and they can work independently and thus more effectively. It is the subsegmentation results instead of the features that are fused, elegantly settling the problem of imbalances between multiple types of features. Fig. 7 presents the three subsegmentation and the final results for three images, illustrating how different features are consolidated and jointly produce a better segmentation. It is not known beforehand which features are more suitable for a certain image, so we use all of them and then fuse all the results.

A sample of the results obtained by subsegmentation. From left to right: Original images, colour-spatial semantic-based subsegmentation, texture-spatial semantic-based subsegmentation, SIFT-spatial semantic-based subsegmentation and the final result for the cluster ensemble
3.4.4 Results on MSRC and Caltech-101
Comparisons with Spatial-LTM [15] are also conducted on MSRC [25] and Caltech-101 [28]. In the MRSC dataset, we selected 210 pictures from seven categories for comparison. In order to be different from previous experiments, the images selected in this experiment have more complex contours, like trees and flowers, etc. Some have multiple objects of different sizes, for example flower, sheep, cow and car. The others have more complex backgrounds, for example buildings and signs. From the Caltech-101 data set, we randomly selected 30 face images for this experiment.
The results for each category are shown in Table 4. Some of the experiment images are showed in Figure 8. In the results of this experiment, it can be clearly seen that the average F-measure score of our approach outperforms Spatial-LTM. In addition, we achieve a better performance than Spatial-LTM in every category we tested. The results demonstrate the effectiveness of our algorithm, which fuses several sub-segmentation results based on different features.

A sample of the results obtained by applying our algorithm to images on MSRC and Caltech-101
Salient Objects Segment Coverage Test Results on MSRC and Caltech-101
In the results of our approach, the best segmentation results are achieved in the categories of trees, cows and faces. Although the sheep images have similar background and contours to the cow images, the visual features of sheep heads and legs are very different from those of sheep bodies. Therefore, during the initial N-cut over-segmentation, the sheep heads and especially the sheep legs, stay in the regions containing large areas of grass. This impacts the segmentation performance for the category of sheep and leads to missing legs or heads in the final segmentation results. The results of this experiment also show that among all eight categories of images, our algorithm acts worst with flower images. We analysed this case and found two reasons. First, as shown in Figure 8, flower images can be very complex containing many flowers and the objects are quite small. In addition, because of the complicated background, it is challenging to segment the salient object from the background. Second, since our algorithm starts from the over-segmentation result of N-cut, some small objects will be missing in the final segmentation results as well, which will impact the overall segmentation performance. In summary, we achieve the object integrity in most of the images even for complex ones.
4. Conclusions
This paper proposes a novel image segmentation algorithm, which creatively leverages the cluster ensemble method to effectively fuse the segmentation results based on different visual features. Also, the idea of PageRank is exploited to incorporate the spatial information of regions, providing better semantic similarity measures. Our algorithm is capable of adapting to various kinds of images since it has a more comprehensive view of images in multiple perspectives. The segmentation naturally benefits from those appropriate features with the effects of inappropriate ones suppressed. The spatial information integrated successfully addresses the problem of partitioning a complex object into multiple pieces and exhibits a better performance at preserving the object integrity.
Extensive experiments have been performed on four data sets with a large number and a wide diversity of images, and comprehensive comparisons have been made with the state-of-the-art approaches. The results demonstrate the effectiveness and superiority of our method and fusing the sub-segmentation results based on multiple features can produce more stable segmentations.
In the next stage, we will conduct more testing with more visual cues in more challenging situations and seek a different technique to produce better over-segmentations. We are also considering exploiting this algorithm in the research on automatic image annotation.
