Abstract
Keywords
Introduction
Road scene understanding is a key technique for the success of many vision-based applications such as autonomous driving, driver assistance, and personal navigation. Semantic segmentation of all the scene objects is an important step toward complete understanding of the scene. This problem, also called scene parsing or, for simplicity, scene understanding, has become an actively studied problem in recent years. The goal of semantic image segmentation is to segment all the objects in an image and identify their categories. It is a challenging task since it combines the traditional problems of object detection, segmentation, and recognition in a single process.1 The problem is even harder for a road scene due to the wide variety of categories and the complex environment. In many cases, the road scene is well structured and can be handled by standard classifier-based approaches. However, the appearance of a road scene varies constantly, which limits the effectiveness of many systems. Besides, the algorithms should also be computationally efficient to achieve real-time capability in practical applications.
Many approaches to this problem have been proposed, and they fall mainly into two categories. The first estimates labels pixel by pixel.2–5 The second divides an image into superpixels and predicts their labels.6–10 Pixel-based methods fail to capture the global structure across the whole image, and most suffer from high computational complexity. Superpixel-based segmentation is more efficient, but it tends to be highly sensitive to the accuracy of the initial segments; sometimes a superpixel contains more than one class and is hard to distinguish by its appearance. Most methods of both kinds use low-level features to train classifiers on annotated data. However, the trained models usually suffer from data set bias.8 Performance declines when the test sets differ greatly from the training sets. Besides, the low-level cues in a local region are too restrictive to represent the varied patterns of a road scene. Contextual information is an effective supplement that helps disambiguate local image information. So most existing methods resort to graphical models such as Markov random fields (MRF) or conditional random fields (CRF) to include context for image parsing.2 Graphical model inference enforces consistency of the labeling and is a classic framework for semantic segmentation.
In the past several years, deep convolutional neural networks (DCNN) have become popular in computer vision research by advancing the state of the art on various high-level vision problems. The success of this technique can be partially attributed to its strong capability to learn mid- and high-level representations of natural images. This property inspires researchers to extract deep features for dense prediction tasks such as semantic segmentation.
To apply a dense CRF as a post-processing step, the common approach is to use the classifier output to compute the unary potentials of the CRF energy function and to leverage color contrast information in the pairwise potentials. While widely used, this approach suffers from several drawbacks. The useful features in the network are neglected when constructing the CRF model. Using color vectors to construct low-level pixel affinity functions often fails to capture semantic relationships between objects.19 Smaller classes are easily lost after several inference steps. It is also time-consuming when multiple object classes are present in an image. In this article, a new road scene segmentation method is proposed to alleviate the limitations of the traditional graph-based methods. A deep encoder–decoder network is applied for fast pixel-wise classification. Then, a hierarchical graph-based inference (HGI) is performed to obtain an accurate segmentation result. We do not directly infer multiple object classes using a single dense CRF model. Instead, all the classes are grouped into fewer categories, each containing at least one object class. We first label the category of each image superpixel. An initial segmentation is obtained by superpixel labeling using MRF inference. For each category, a pixel-level labeling based on a dense CRF is then performed to divide the image into the classes that belong to that category. In addition to color-based Gaussian potentials, the feature maps from the network are also used in the pairwise potentials. After the inference for all categories, the results are integrated to obtain the final segmentation. This hierarchical inference scheme alleviates the confusion between classes belonging to different categories. It performs well on small objects without adding computational burden. Performance evaluations on benchmark data sets show the effectiveness of the proposed hierarchical scheme.
The rest of this article is organized as follows. We first review related work in the second section. The proposed methods are thoroughly described in the third section. After that, we give a detailed description of the experimental settings and show some results and comparisons. The last section concludes this article and briefly introduces future work.
Related work
Scene segmentation is important for understanding the content of images and has recently been approached with a wide variety of methods. Earlier methods usually trained classifiers for pixels or image regions using appearance or specially designed features.2,6,3,4,7,20 Several road scene data sets have become publicly available,21–24 which has promoted the development of supervised methods for scene segmentation. Some data-driven approaches based on databases have also emerged as compelling alternatives to model-based approaches.9 Recently, the success of DCNNs for object classification has led researchers to exploit their feature learning capabilities for the semantic segmentation problem.
The segments predicted by a classifier tend to be blobby and lack fine object boundaries, so most existing methods resort to graphical models such as MRFs or CRFs to refine the classification result. These models represent the joint distribution of labels with unary and pairwise terms: the unary term represents the confidence of assigning a label, while the pairwise terms reflect constraints between connected vertices. An MRF usually incorporates only local relationships between neighboring nodes, which makes it inefficient at capturing distant interactions.2 The CRF model has been successfully applied to multi-class image segmentation. Basic CRF models construct pairwise potentials on neighboring pixels or patches.25,3 This structure is limited in modeling long-range connections within the image and often leads to excessive smoothing of object boundaries. Higher order potentials defined on image regions have been proposed to improve labeling accuracy,2,26,27 but this approach is restricted by the accuracy of the unsupervised image segmentation. An alternative to the adjacent CRF structure is the fully-connected CRF, which builds pairwise potentials on all pairs of pixels in the image. It can model long-range connections and has already been used for semantic labeling,28 but it is more computationally expensive than the adjacent structure. Krahenbuhl and Koltun18 proposed a highly efficient inference algorithm for the fully-connected CRF model. They applied color and position vectors to form Gaussian pairwise potentials and proposed an efficient mean field approximation for MAP inference. However, these contrast-sensitive Gaussian potentials are not sufficient to reflect the connections between pixels, which makes some small objects easy to mislabel.
In the dense CRF model, the pairwise potentials measure the dissimilarity of two connected pixels. Deep network features are good representations of the image and are more discriminative than low-level features, so our motivation is to use DCNN features to form the Gaussian pairwise potentials. For road scene segmentation, a major challenge is posed by the nonuniform distribution of classes in the database. These classes can be divided into stuff (e.g. road, sky, and building) and thing (e.g. people, bikes, and signs) categories. A stuff class has arbitrary shape and occupies most of the image. A thing usually has a fixed shape but varies in size due to the perspective effect; for example, a traffic sign close to the camera appears bigger than one far away. Using a single CRF model to infer both stuff and thing classes is not suitable, since they share the same smoothing kernels in the Gaussian potentials. A small object is easily treated as an isolated region of a stuff class. This is why most parsing systems cascaded with a single graphical model achieve much lower performance on thing classes than on stuff classes. We therefore recognize the different classes in separate processes and propose a hierarchical graphical model. All the classes are grouped into a few categories. We first estimate the category of each image pixel. Then, to label the classes in each category, we perform dense CRF inference with different settings. This strategy enables accurate segmentation of stuff classes and classification of small objects. Using a hierarchical graph model with DCNN features, our system performs well in object classification and better than a single CRF model.
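The stuff/thing grouping above can be sketched in code. The mapping below is purely illustrative: the article does not list the exact class-to-category assignment, so the class names and the two-way split here are assumptions, not the grouping actually used in the experiments.

```python
# Hypothetical grouping of road scene classes into coarse categories.
# The concrete assignment used in the paper is not reproduced here.
CATEGORIES = {
    "stuff": ["road", "sky", "building", "sidewalk", "tree"],
    "thing": ["pedestrian", "bicyclist", "sign", "pole", "car"],
}

def category_of(label):
    """Map a semantic class label to its coarse category name."""
    for cat, classes in CATEGORIES.items():
        if label in classes:
            return cat
    raise KeyError(label)
```

With such a mapping, the category-level inference only has to distinguish a handful of categories, and each class-level CRF only has to separate the classes inside one category.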
Road scene segmentation via hierarchical inference
Overview of the proposed framework
A brief description of the proposed HGI algorithm is given in this subsection. We first train a deep encoder–decoder network for a fast pixel-wise classification. Then, the test image is segmented into different categories. Each category contains one or more semantic classes. Here, we use superpixel MRF labeling for a fast inference. For each category, a pixel-level labeling based on dense CRF is performed to segment image into different classes that belong to this category. After the inference for all categories, the results are integrated together to get the final segmentation. The flowchart of the proposed method is shown in Figure 1.

Flowchart of the proposed method.
Deep network for pixel classification
To predict the class probabilities of the image pixels, we employ the ENet model,29 which adopts ideas from ResNet30 for fast inference. The network architecture consists of an initial block and three basic bottleneck modules: down-sample, basic, and up-sample. These modules are shown in Figure 2. Each bottleneck module has three convolutional layers: a 1 × 1 projection that reduces the channels of the feature map, a main convolutional layer, and a 1 × 1 expansion that increases the channels of the feature map. Batch normalization and PReLU31 are placed between all convolutions.

Network modules: (a) initial block, (b) down-sample block, (c) basic block, and (d) up-sample block.
The initial block shown in Figure 2(a) is used to reduce the input size and extract feature maps. Since images are highly spatially redundant, a compact version of the image makes the inference more efficient. For the down-sample bottleneck shown in Figure 2(b), a max pooling layer is added in one branch, and the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. For the up-sample bottleneck shown in Figure 2(d), an up-sample layer is added in one branch and the main convolutional layer is replaced with a deconvolution. The pooling indices from each down-sample bottleneck are fed into the corresponding up-sample bottleneck, which reduces network parameters and memory requirements. The network avoids aggressive down-sampling, which would cause spatial information loss. To enlarge the receptive field, dilated convolutions32 are used in some basic bottleneck modules.
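The effect of the down-sample and up-sample stages on the feature-map resolution can be traced with a small helper. The stage schedule below is an assumption based on the ENet-style design described above (three halvings followed by resolution-preserving stages and three doublings), not a transcription of Table 1.

```python
def stage_sizes(h, w, scales):
    """Trace the spatial size of the feature map through a sequence of
    stages; scale 0.5 halves the resolution (down-sample bottleneck),
    2 doubles it (up-sample bottleneck or final deconvolution),
    and 1 leaves it unchanged (basic bottleneck)."""
    sizes = [(h, w)]
    for s in scales:
        h, w = int(h * s), int(w * s)
        sizes.append((h, w))
    return sizes

# Assumed ENet-like schedule for a 360 x 480 input: initial block and two
# down-sample stages, resolution-preserving stages, then up-sampling back
# to full resolution.
sizes = stage_sizes(360, 480, [0.5, 0.5, 0.5, 1, 1, 2, 2, 2])
```

For the 360 × 480 example input, this schedule bottoms out at 45 × 60 before the decoder restores the original resolution, which matches the claim that the network avoids aggressive down-sampling.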
The whole network architecture is presented in Table 1. We use an example input image of size 3 × 360 × 480 to report the output size at each stage. Suppose the number of object classes is
Network architecture.a
aAn example input of 360 × 480 is used.
Hierarchical graph-based inference
After the pixel-wise classification, a graph-based inference is usually applied to refine the classification results. It constrains the consistency of the labeling by leveraging image contextual information. A common strategy consists of defining a graph and an energy function whose optimal solution corresponds to the desired segmentation.
To overcome these limitations, we use an HGI scheme that segments the image at different levels. All object classes are grouped into fewer categories, each of which contains at least one class. We first perform category-level segmentation and then label the classes within each category. For the category segmentation, the pixel-wise way tends to be too inefficient, especially for a large image, so we assign category labels to superpixels. A graph-based over-segmentation technique33 is used to obtain the superpixels in this article. A coarse segmentation can then be obtained by labeling the superpixels into different categories, a problem that can be solved by minimizing an MRF energy function.34
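The superpixel labeling step can be sketched as follows. The paper's own MRF solver and energy definition did not survive extraction, so this sketch substitutes a simple iterated-conditional-modes (ICM) loop over a superpixel adjacency graph with a Potts pairwise term; it illustrates the structure of the problem, not the exact inference used in the paper.

```python
import numpy as np

def icm_label(unary, edges, beta=1.0, iters=10):
    """Approximate MRF inference by iterated conditional modes (a simple
    stand-in for the solver used in the paper).
    unary: (n_superpixels, n_categories) cost of assigning each category
    edges: list of (i, j) pairs of adjacent superpixels (Potts term)
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)          # start from unary-only labels
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(n):
            # Potts penalty: beta for each neighbor with a different label
            cost = unary[i].copy()
            for j in nbrs[i]:
                cost += beta * (np.arange(k) != labels[j])
            labels[i] = cost.argmin()
    return labels
```

On a small chain of three superpixels where the middle one weakly prefers a different category, the pairwise term pulls it back to the label of its neighbors, which is exactly the smoothing behavior the category-level MRF is meant to provide.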
Let
where
where
The pairwise potential
where
To solve the energy function, we perform MRF inference using the
After the category-level segmentation, the class-level segmentation is performed. A fully-connected CRF is applied for each category separately. Suppose a category
Suppose image
where
where
where
The first term is the smoothness kernel, which removes small isolated regions. The other is the appearance kernel, which encodes the assumption that nearby pixels with similar color are likely to belong to the same class. The effects of the two kernels are controlled by parameters
where
where
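Since the kernel equations above were lost in extraction, the following sketch shows the standard contrast-sensitive form these two kernels take in the Krahenbuhl and Koltun model: a position-only smoothness kernel plus an appearance kernel over position and color. The weights and bandwidths used here are illustrative defaults, not the values tuned in the paper.

```python
import numpy as np

def pairwise_kernel(pi, pj, Ii, Ij, w_smooth=3.0, w_app=10.0,
                    theta_gamma=3.0, theta_alpha=60.0, theta_beta=20.0):
    """Sum of the two Gaussian kernels of a dense CRF:
    - smoothness kernel: depends only on the pixel positions pi, pj
    - appearance kernel: depends on positions and RGB colors Ii, Ij
    All weights/bandwidths are illustrative assumptions."""
    dp = np.asarray(pi, float) - np.asarray(pj, float)   # position difference
    dI = np.asarray(Ii, float) - np.asarray(Ij, float)   # color difference
    smooth = w_smooth * np.exp(-dp @ dp / (2 * theta_gamma ** 2))
    app = w_app * np.exp(-dp @ dp / (2 * theta_alpha ** 2)
                         - dI @ dI / (2 * theta_beta ** 2))
    return smooth + app
```

The kernel is largest for nearby, similarly colored pixels and decays with spatial and color distance, which is what makes these potentials contrast-sensitive. The proposed method additionally substitutes network feature maps for the raw color vectors in the appearance term.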
The unary potentials are derived from the network softmax layer and the category segmentation result. It is calculated as
where
where
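The exact unary equation in the paper did not survive extraction; the sketch below shows one plausible construction consistent with the description above: the negative log of the network softmax probability, with classes outside the pixel's category (from the coarse segmentation) assigned a large constant cost. The penalty value and masking scheme are assumptions.

```python
import numpy as np

def unary_potentials(softmax_probs, category_mask, big=1e3, eps=1e-12):
    """A plausible unary term combining the network softmax output with
    the category segmentation result (the paper's exact formula is not
    reproduced here).
    softmax_probs: (n_pixels, n_classes) network class probabilities
    category_mask: (n_pixels, n_classes) 1 where the class belongs to
                   the pixel's assigned category, 0 otherwise
    """
    u = -np.log(softmax_probs + eps)              # confidence as a cost
    # Classes outside the category get a large fixed cost, so each
    # class-level CRF only competes among the classes of its category.
    return np.where(category_mask.astype(bool), u, big)
```

This restriction is what lets each class-level CRF operate over a small label set instead of all classes at once.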
Road scene segmentation via hierarchical graph-based inference.
MRF: Markov random field; CRF: conditional random field.
The proposed HGI can alleviate the confusion between classes in different groups. This is its main advantage over methods that infer multiple labels with a single CRF model. Though the problem is decomposed into multiple class-level segmentations, the computational complexity is not increased. From equation (13), it can be found that each update contains a sum over all labels for each label; theoretically, this requires quadratic complexity in the number of labels
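The complexity argument can be made concrete with a rough operation count. The three-way split of the label set below is a hypothetical example, not the grouping used in the paper.

```python
def mean_field_cost(n_pixels, label_sets):
    """Rough per-iteration operation count for mean-field updates:
    each update sums over all labels for each label, i.e. O(n * L^2)
    for n pixels and L labels. `label_sets` lists the label count of
    each CRF that is run."""
    return sum(n_pixels * L * L for L in label_sets)

# Hypothetical split of 11 classes into categories of 5, 3, and 3 classes:
single = mean_field_cost(360 * 480, [11])      # one CRF over all classes
hier = mean_field_cost(360 * 480, [5, 3, 3])   # one CRF per category
```

Because the quadratic term is taken over each small label set separately, the summed cost of the per-category CRFs stays below that of a single CRF over the full label set, consistent with the claim that the decomposition does not increase the computational burden.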
Experiment
In the experiments, we evaluate the performance of the proposed algorithm for road scene segmentation and compare the method with several state-of-the-art techniques. To validate the HGI inference, the results obtained using a single CRF model after network classification are also compared. Both qualitative and quantitative assessments are performed to evaluate the segmentation results.
Data sets and implementations
The experiments are conducted on two road scene segmentation benchmarks: Camvid22 and Cityscapes.24
The Camvid data set consists of 367 training and 233 testing RGB images from several video sequences. The original frame resolution for this data set is
For the quantitative evaluation, we use class average accuracy, global average accuracy, and mean intersection over union (IOU). In all the metrics, the background pixels are ignored. Suppose is the confusion matrix for the
The class average accuracy is defined as the mean value of
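The three metrics can be computed directly from the confusion matrix. The formulation below is the standard one (per-class recall averaged over classes, overall pixel accuracy, and per-class IOU averaged over classes); the paper's own notation was lost in extraction.

```python
import numpy as np

def metrics(conf):
    """Segmentation metrics from a confusion matrix `conf`, where
    conf[i, j] counts pixels of ground-truth class i predicted as
    class j (standard formulation)."""
    tp = np.diag(conf).astype(float)          # correctly labeled pixels
    gt = conf.sum(axis=1).astype(float)       # pixels per true class
    pred = conf.sum(axis=0).astype(float)     # pixels per predicted class
    class_avg = np.mean(tp / gt)              # class average accuracy
    global_avg = tp.sum() / conf.sum()        # global average accuracy
    iou = tp / (gt + pred - tp)               # per-class intersection/union
    return class_avg, global_avg, iou.mean()  # last entry is mean IOU
```

Note that class average accuracy weights every class equally, which is why it is the metric most sensitive to performance on small classes such as signs and pedestrians.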
Experimental results
The quantitative results of the proposed algorithm and the compared methods on the Camvid data set are presented in Table 2, with the highest value for each metric highlighted in boldface. The traditional methods20,9 have low values on both global and class accuracy; in particular, they do not perform well on small classes. SegNet17 achieves the best performance on the global and mean IOU metrics. That network is trained on an additional data set, which brings about a 10% improvement in average accuracy. Since our objective is to evaluate the graph-based inference model, we do not use additional data to train the network, so the network result alone is less competitive. To compare the proposed HGI with single dense CRF inference,18 both are applied to refine the network classification results. The single dense CRF even worsens the accuracy of smaller classes such as sign and pedestrian and brings only a small improvement on the global accuracy and mean IOU metrics. In contrast, the proposed method gives the biggest improvements on the smaller classes and a larger performance increase on all three metrics. HGI outperforms all other methods in terms of the class average metric and has a slightly lower mean IOU value than SegNet.
Quantitative evaluation on Camvid data set.a
CRF: conditional random field; HGI: hierarchical graph-based inference; mIOU: mean intersection over union.
aBoldface represents the highest values for each metric.
A qualitative evaluation is also performed to verify the superiority of the proposed method. Some results on the Camvid data set are exhibited in Figure 3. As can be seen, the network outputs usually predict rough object borders and contain many specks on large classes. While the single dense CRF can remove these isolated pixels, it also removes or mislabels small objects. Our method obtains high-quality predictions at the boundaries between different classes, and the results are closer to the ground truth labels.

Examples of segmentation results on Camvid test set. (a) Image, (b) ground truth, (c) ENet, (d) ENet + CRF, and (e) HGI. CRF: conditional random field; HGI: hierarchical graph-based inference.
We further explored the performance of HGI on the validation set of the Cityscapes data set. The quantitative results are listed in Table 3. It can be seen that the single CRF model has little effect on the network classification result, while HGI yields improvements of different degrees on all three metrics. Some examples from this validation set are shown in Figure 4. We can again observe the fine segmentation of our method at the edges between different classes. These comparisons show that the proposed method outperforms the traditional CRF method, which proves the effectiveness of the proposed hierarchical scheme.
Quantitative evaluation on Cityscapes data set.a
CRF: conditional random field; HGI: hierarchical graph-based inference; mIOU: mean intersection over union.
aBoldface represents the highest values for each metric.

Examples of segmentation results on Cityscapes val set. (a) Image, (b) ground truth, (c) ENet, (d) ENet + CRF, and (e) HGI. CRF: conditional random field; HGI: hierarchical graph-based inference.
To evaluate the computational efficiency, we report the average running time on the two data sets. All the experiments were performed on a computer with a 2.4-GHz Intel Xeon CPU and a Titan X GPU, using MATLAB 2016a for the implementation. The results are listed in Table 4. As can be seen, ENet is faster than the other methods. Since the proposed HGI method performs a series of graph inferences after the network inference, it requires more time than a network-only method. Compared with the traditional dense CRF model, the proposed method has lower computational complexity, which shows that the hierarchical inference is more efficient than inferring all classes together.
Average running time (seconds) comparison on two data sets.
CRF: conditional random field; HGI: hierarchical graph-based inference.
Conclusions
In this work, we introduced a new framework for road scene segmentation that runs automatically without human intervention. A deep encoder–decoder network is applied for fast pixel-wise classification. Then, an HGI is performed to obtain an accurate segmentation result. We do not directly infer multiple object classes using a single dense CRF model. Instead, all the classes are grouped into fewer categories, each containing at least one object class. We first label the category of each image superpixel: an initial segmentation is obtained by superpixel labeling using MRF inference. For each category, a pixel-level labeling based on a dense CRF is performed to divide the image into the classes that belong to that category. In addition to color-based Gaussian potentials, the feature maps from the network are also used in the pairwise potentials. After the inference for all categories, the results are integrated to obtain the final segmentation. This hierarchical inference scheme alleviates the confusion between classes belonging to different categories. It performs well on small objects without adding computational burden. Performance evaluations on benchmark data sets show the effectiveness of the proposed hierarchical scheme.
