Sage Journals: Discover world-class research

Abstract

The road scene segmentation is an important problem which is helpful for a higher level of the scene understanding. This article presents a novel approach for image semantic segmentation of road scenes via a hierarchical graph-based inference. A deep encoder–decoder network is first applied for a fast pixel-wise classification. Then, hierarchical graph-based inference is performed to get an accurate segmentation result. In the inference process, all the object classes are grouped into fewer categories which contains at least one class. The category labels are assigned to image superpixels using Markov random field model. For each category, a pixel-level labeling based on fully connected conditional random fields is performed to divide image into different classes. After the inference for all categories, the results are integrated together to get the final segmentation. In additional to low-level affinity functions, the feature maps from network are integrated in pairwise potentials of the graphical models. This hierarchical inference scheme can alleviate the confusion of classes belonging to different categories. It performs well for small objects without adding more computational burden. Both qualitative and quantitative assessments are adopted to evaluate the proposed method. The results on benchmark data sets prove the effectiveness of the proposed hierarchical scheme, and the performance is competitive with the state-of-the-art methods.

Keywords

Road scene segmentation deep convolutional neural network multi-class image labeling Markov random field conditional random field

Introduction

Road scene understanding is a key technique for the success of many vision-based applications such as autonomous driving, driver assistance, and personal navigation. Semantic segmentation of all the scene objects is an important step toward complete understanding of the scene. This problem, also called scene parsing and scene understanding for simplicity, has become an actively studied problem in recent years. The goal of semantic image segmentation is to segment all the objects in an image and identify their categories. It is a challenge task since it combines the traditional problems of object detection, segmentation, and recognition in a single process.¹ The problem is even harder for a road scene due to the wide variety of categories and complex environment on it. In many cases, the road scene is well structured and can be handled by standard classifier-based approaches. However, the appearance of road scene always varies, which restraints the effectiveness of many systems. Besides, the algorithms should also be computationally efficient to achieve real-time capabilities in the practical applications.

Many approaches to this problem have been proposed which are mainly grouped into two categories. The first one estimates labels pixel by pixel.^2

–5 Another type is to divide an image into superpixels and predict their labels.^6

–10 The pixel-based methods fail to capture the global structure information across the whole image and most suffer from high computational complexity. The use of superpixel for segmentation is more efficient than pixel. But this kind of method tends to be highly sensitive to the accuracy of the initial segments. Sometimes the superpixel contains more than one class and is hard to distinguish by its appearance. Most methods of the two kinds use low-level feature to train classifier on annotated data. However, the trained models usually suffer from data set bias.⁸ The performance will decline when the test sets have large dissimilarity with the training data sets. Besides, the low-level cues in a local region are too restrictive to represent various patterns in the road scene. The contextual information is an effective supplement which can help disambiguate local image information. So most of the existing methods resort to graphical models such as Markov random field (MRF) or conditional random field (CRF) to include context for image parsing.² The graphical model inference ensures consistency of the labeling and is commonly used as classic frameworks for semantic segmentation.

In the past several years, deep convolutional neural networks (DCNN) have become popular in computer vision research by advancing the state-of-the-art of various high-level vision problems. The success of this technique can be partially attributed to the strong capability to learn mid- and high-level representations of natural images. This property inspires researchers to extract DCNN features for scene segmentation problems.^8,11,1 But these methods still rely on traditional classifiers which perform predictions independently on each pixel, the results are usually spatially disjoint. Recently, the end-to-end scheme that directly predicts labels for pixels has been advocated for semantic segmentation.¹² Several deep architectures have advanced the state-of-the-art by applying this learning strategy.^13

–17 However, these methods usually make predictions at a coarse resolution and result in inaccurate object boundaries. Therefore, some networks are cascaded with dense CRF to refine the results.¹⁸ However, it adds additional parameters that are difficult to tune and can only provide a little improvement for high-resolution image prediction.

To apply dense CRF as post-processing step, the common approach is to use output of classifier to calculate unary potentials in CRF energy function and leverage color contrast information in pairwise potentials. While this approach has been widely used, it suffers from some drawbacks. The useful features in the network are neglect when constructing CRF model. Using color vector to construct low-level pixel affinity functions often fails to capture semantic relationships between objects.¹⁹ The smaller classes are easily lost after several inference steps. It is also time-consuming when multi-class objects present in an image. In this article, a new road scene segmentation method is proposed to alleviate the limitations of the traditional graph-based methods. A deep encoder–decoder network is applied for a fast pixel-wise classification. Then, a hierarchical graph-based inference (HGI) is performed to get an accurate segmentation result. We do not directly infer multiple object classes using a single dense CRF model. Instead, all the classes are grouped into fewer categories. Each category contains at least one object class. We first label the category for image superpixels. An initial segmentation is obtained by superpixel labeling using MRF inference. For each category, a pixel-level labeling operation based on dense CRF is performed to divide image into different classes that belong to this category. In additional to color-based Gaussian potentials, the feature maps from network are also used in pairwise potentials. After the inference for all categories, the results are integrated together to get the final segmentation. This hierarchical inference scheme can alleviate the confusion of classes belonging to different categories. It performs well for small objects without adding computational burden. The performance evaluations on benchmark data sets show the effectiveness of the proposed hierarchical scheme.

The rest of this article is organized as follows. We first review related work in the second section. The proposed methods are thoroughly described in the third section. After that, we give a detailed description of the experimental settings and show some results and comparisons. The last section concludes this article and briefly introduces the further work.

Related work

The problem of the scene segmentation is important in understanding the content of images and has been approached with a wide variety of methods recently. The earlier method to this problem usually trained classifiers for pixels or image regions using appearance or specially designed features.^2,6,3,4,7,20 Several road scene data sets have become publicly available,^21

–24 which promote the development of the supervised methods for scene segmentation. Some data-driven approaches based on database have also emerged as compelling alternatives to the model-based approaches.⁹ Recently, the success of DCNNs for object classification has led researchers to exploit their feature learning capabilities for semantic segmentation problem.

The predicted segments by classifier tend to be blobby and lack fine object boundary. So most of the existing methods resort to graphical models such as MRF or CRF to refine the classification result. They can model the joint distribution of labels by defining both unary term and pairwise terms. Unary term represents the confidence of assigning labels while pairwise terms reflect the constraints between connected vertices. MRF usually incorporates local relationships between neighboring nodes into the model, which makes it inefficient at capturing distant interactions.² The CRF model has been successfully applied in multi-class image segmentation. Basic CRF models construct pairwise potentials on neighboring pixels or patches.^25,3 This structure is limited to model long-range connections within the image and often leads to excessive smoothing of object boundaries. Some higher order potentials defined on image regions have been proposed to improve the labeling accuracy.^2,26,27 However, this method is restricted by the accuracy of unsupervised image segmentation. An alternative to the adjacent CRF structure is the fully-connected CRFs that build pairwise potentials on all pairs of pixels in the image. It can model long-range connections and has been already used for semantic labeling in the past,²⁸ but it has more computations than the adjacent structure. Krahenbuhl and Koltun¹⁸ proposed a highly efficient inference algorithm for fully-connected CRFs model. They applied color and position vectors to form Gaussian pairwise potentials and proposed an efficient mean field approximation method for MAP inference. However, these contrast-sensitive Gaussian potentials are not sufficient to reflect the connections between pixels. This makes some small objects easily to be mislabeled.

In dense CRF model, the pairwise potentials are used measure the dissimilarity of the two connected pixels. Deep network features are good representations of image, it is more powerful in type discrimination than low-level features. So, our motivation is to use DCNN features to form the Gaussian pairwise potentials. For road scene segmentation, a major challenge in this task is posed by nonuniform distribution of classes in database. These class can be divided into stuff (e.g. road, sky, and building) and thing (e.g. people, bike and signs) categories. The stuff category has arbitrary shape and takes the most part of the image. The thing usually has fixed shape but varies in size due to the perspective effect. For example, a traffic sign close to the camera is bigger than that far away from it. Using a single CRF model to infer both stuff and object classes is not suitable, since they have the same smooth kernels in Gaussian potentials. The small object will be easily treated as isolated regions of stuff class. This is why most parsing systems that cascaded with a single graphic model get much lower performance on object classes than on stuff classes. We intend to recognize the different classes in separate processes and propose a hierarchical graphical model. All the classes are grouped into a few categories. We first estimate the category for image pixels. For labeling classes in each category, we perform dense CRF inference with different settings. This strategy enables an accurate segmentation for stuff classes and classification for small objects. Using hierarchical graph model with DCNN features, our system performs well in object classification and better than using a single CRF model.

Road scene segmentation via hierarchical inference

Overview of the proposed framework

A brief description of the proposed HGI algorithm is given in this subsection. We first train a deep encoder–decoder network for a fast pixel-wise classification. Then, the test image is segmented into different categories. Each category contains one or more semantic classes. Here, we use superpixel MRF labeling for a fast inference. For each category, a pixel-level labeling based on dense CRF is performed to segment image into different classes that belong to this category. After the inference for all categories, the results are integrated together to get the final segmentation. The flowchart of the proposed method is shown in Figure 1.

Figure 1.

Flowchart of the proposed method.

Deep network for pixel classification

To predict the class probabilities of the image pixels, we employ the ENet model,²⁹ which adopts a view of ResNet³⁰ for fast inference. The network architecture consists of an initial block and three basic bottleneck modules: down-sample, basic, and up-sample. These modules are shown in Figure 2. All bottleneck module has three convolutional layers: a 1 × 1 projection that reduces the channels of the feature map, a main convolutional layer, and a 1 × 1 expansion that increases the channels of the feature map. The batch normalization and PReLU³¹ are placed between all convolutions.

Figure 2.

Network modules: (a) initial block, (b) down-sample block, (c) basic block, and (d) up-sample block.

The initial block shown in Figure 2 is used to reduce the input size and extract feature maps. Since image is highly spatially redundant, the compact version of the image can make the inference more efficient. For the down-sample bottleneck shown in Figure 2(a), a max pooling layer is added in one branch, and the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. For the up-sample bottleneck shown in Figure 2(d), an up-sample layer is added in one branch and the main convolutional layer is replaced with deconvolution. The pooling indices in down-sample bottleneck are fed into the corresponding up-sample bottleneck. This operation allows to reduce network parameters and memory requirements. The network avoids strong down-sample operation which will cause spatial information loss. To enlarge the receptive field, the dilated convolutions³² are used in some basic bottleneck modules.

The whole network architecture is presented in Table 1. We use an example input image with a size of 3 × 360 ×480 to report the output size in each stage. Suppose the number of object classes is L, the output of the last deconvolution layer has L feature maps with a size of 360 × 480. These feature maps are then fed to a multi-class softmax classifier for pixel-wise classification. It generates L probability maps for L object classes.

Table 1.

Network architecture.^a

Module	Output size
Input	$3 \times 360 \times 480$
Initial block	$16 \times 180 \times 240$
Down-sample a	$64 \times 90 \times 120$
Basic $\times 4$	$64 \times 90 \times 120$
Down-sample b	$128 \times 45 \times 60$
Basic $\times 8$	$128 \times 45 \times 60$
Basic dialate	$128 \times 45 \times 60$
Up-sample b	$64 \times 90 \times 120$
Basic	$64 \times 90 \times 120$
Basic	$64 \times 90 \times 120$
Up-sample a	$16 \times 180 \times 240$
Basic	$16 \times 180 \times 240$
Deconvolution	$L \times 360 \times 480$
Softmax	$L \times 360 \times 480$

^aAn example input of 360 × 480 is used.

Hierarchical graph-based inference

After pixel-wise classification, a graph-based inference is usually applied to refine the classification results. It can constrain the consistency of the labeling by leveraging image contextual information. A common strategy to realize this purpose consists of defining a graph and an energy function whose optimal solution corresponds to the desired segmentation. (V, E) is an undirected graph with vertices $v \in V$ and edges $e \in E \subseteq V \times V$ . Each pixel in the image is associated with a vertex, and every neighboring vertex has an edge between them. Each edge e has a corresponding weight $w (v_{i}, v_{j})$ to measure the dissimilarity of the connected vertices.¹ If the weight cannot describe the difference of the nodes correctly, the results may be contrary to the expected segmentation. Besides, it will easily cause confusion for small objects when inferring multi-classes with a single graphical model.

To overcome these limitations, we use an HGI scheme that segment image in different levels. All object classes are grouped into fewer categories which contains at least one class. We first perform category level segmentation and then label the classes for each category. For the category segmentation, the pixel-wise way tends to be too inefficient, especially for a large size image. So, we assign category labels to superpixels. A graph-based over-segmentation technique³³ is used to get superpixels in this article. A coarse segmentation can be obtained by labeling the superpixels into different categories. This problem can be solved by minimizing an MRF energy function.³⁴

Let $S = {s_{i}}$ denote a superpixel set of an image. All object classes are divided into M categories. $g_{i} \in {1, \dots, M}$ is the possible category label of s_i, and it also denotes the set of class labels belonging to it. A pairwise MRF is built for labeling which consists of the sum of unary potentials and the sum of pairwise potentials

E_{1} = \sum_{s_{i} \in S} U_{i} (g_{i}) + λ \sum_{(s_{i}, s_{j}) \in R} V_{i j} (g_{i}, g_{j})

where i and j are superpixel indices, g_i and g_j are candidate labels of superpixels i and j, respectively. R is the set of all pairs of adjacent superpixels, and λ is a weight to balance the unary term $U_{i}$ and pairwise term $V_{i j}$ . In this article, the two energy terms are derived from DCNN. Let P denote the classification result and F denote the feature maps from the last deconvolution layer. The unary potential $U_{i}$ is the cost of assigning label g_i to s_i. It is calculated independently for each superpixel using the classification result. Since P has L probability maps corresponding to L object classes, we combine the maps that belong to the same category to get probability maps for M categories. The unary potential is defined as

U_{i} (g_{i}) = - log \frac{1}{n_{i}} \sum_{(x, y) \in s_{i}} \sum_{l \in g_{i}} P (x, y, l)

where i is the index of the superpixel, l is the object label that belongs to the category g_i, P(x, y, l) is the probability that the pixel (x, y) takes label l and n_i is the number of pixels in s_i.

The pairwise potential V between a superpixel pair {s_i , s_j } is a cost term that penalizes adjacent s_i and s_j for taking different labels. The Potts model³⁵ is used to formulate the cost function which has the following form

V_{i, j} (g_{i}, g_{j}) = w (g_{i}, g_{j}) \cdot T (g_{i}, g_{j})

T (g_{i}, g_{j}) = {\begin{matrix} 1, & g_{i} \neq g_{j} \\ 0, & g_{i} = g_{j} \end{matrix}

where w is a weight to measure the similarity of the two adjacent superpixels. It is calculated by the feature maps

w_{i} (g_{i}, g_{j}) = - log [\frac{1}{n_{i}} \sum_{(x, y) \in s_{i}} \sum_{l \in g_{i}} F (x, y, l) - \frac{1}{n_{j}} \sum_{(x, y) \in s_{j}} \sum_{l \in g_{j}} F (x, y, l)]

To solve the energy function, we perform MRF inference using the α-expansion algorithm³⁶ to get all superpixels category labels of the image.

After the category level segmentation, the class level segmentation is performed. The fully-connected CRFs are applied to for each category separately. Suppose a category g has a set of object classes $g = {l_{1}, l_{2}, \dots, l_{k}}$ , all the other classes that are not in g are denoted as a single label $\tilde{l}$ . We define a new label set $\tilde{g} = {l_{1}, l_{2}, \dots, l_{k}, \tilde{l}}$ . Then, we perform pixel labeling on this augmented domain using fully-connected CRFs. However, the inference of fully-connected CRFs for an image is a tedious task since all pairs of pixels in the image are connected. In this article, a highly efficient CRF inference method¹⁸ is applied for the pixel labeling, which is given as follows.

Suppose image I has N pixels, a random field X which contains N random variables is conditioned on the image. The range of these variables is the label set $\tilde{g}$ . The standard CRF energy function is defined as

E (X | I) = \sum_{i} E_{u} (x_{i}) + \sum_{i < j} E_{p} (x_{i}, x_{j})

where i and j range from $1$ to N, $E_{u}$ is the unary potential which represents the cost of assigning a label in $\tilde{g}$ to variables x_i, and $E_{p}$ is the pairwise potential which indicates the label consistency of two variables x_i and x_j. In the fully-connected CRFs, each variable is connected to other variables. The large size of the image will lead to a huge complexity of inference in this model. An efficient inference method for this problem is proposed in the work of Krahenbuhl and Koltun,¹⁸ in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels

E_{p} (x_{i}, x_{j}) = μ (x_{i}, x_{j}) \sum_{m = 1}^{K} ω^{(m)} k^{(m)} (h_{i}, h_{j})

where w(m) are linear combination weights, and each k(m) is a Gaussian kernel

k^{(m)} (h_{i}, h_{j}) = exp [- \frac{1}{2} {(h_{i} - h_{j})}^{T} Λ^{(m)} (h_{i} - h_{j})]

where K is the number of kernels, $ω (m)$ is linear combination weight, $Λ^{(m)}$ is a label compatibility function, k(m) is a Gaussian kernel, h_i and h_j are feature vectors corresponding to variables x_i and x_j, respectively. For multi-class image segmentation, contrast-sensitive two-kernel potentials are used, which are defined in terms of the color vectors $I_{i}$ and $I_{j}$ and positions p_i and p_j

\begin{array}{l} k (f_{i}, f_{j}) = ω^{(1)} exp (- \frac{{| p_{i} - p_{j} |}^{2}}{2 θ_{α}^{2}}) \\ + ω^{(2)} exp (- \frac{{| p_{i} - p_{j} |}^{2}}{2 θ_{β}^{2}} - \frac{{| I_{i} - I_{j} |}^{2}}{2 θ_{γ}^{2}}) \end{array}

The first term is smoothness kernel which is used to removes small isolated regions. The other one is appearance kernel which believes that nearby pixels with similar color are likely to be in the same class. The effects of the two kernels are controlled by parameters $α, β, and γ$ . However, the color of object always varies, especially of the small targets, which will affect the effectiveness of appearance kernel to measure the similarity of the two pixels. In addition to the contrast-sensitive kernels, we incorporate the deep features in the pairwise potentials. A new feature vector is defined for each pixel based on deep network features maps

H (x, y, l) = {\begin{array}{l} F (x, y, l), & l \in g \\ \sum_{l \notin g} F (x, y, l), & otherwise \end{array}

where $F (x, y, l)$ is the value of pixels (x, y) on the l-th feature map. The new pairwise potentials are defined as

\begin{array}{l} k (f_{i}, f_{j}) = ω^{(1)} exp (- \frac{{| p_{i} - p_{j} |}^{2}}{2 θ_{α}^{2}}) \\ + ω^{(2)} exp (- \frac{{| p_{i} - p_{j} |}^{2}}{2 θ_{β}^{2}} - \frac{{| I_{i} - I_{j} |}^{2}}{2 θ_{γ}^{2}}) \\ + ω^{(3)} exp (- \frac{{| H_{i} - H_{j} |}^{2}}{2 θ_{δ}^{2}}) \end{array}

where $H_{i}$ and $H_{j}$ are the new defined feature vectors for pixels i and j.

The unary potentials are derived from the network softmax layer and the category segmentation result. It is calculated as

E_{u} (x_{i}) = {\begin{cases} - log P_{i} (x_{i}), x_{i} \in g \\ - log (0.01), otherwise \end{cases}

where $P_{i} (x_{i})$ is the label assignment probability at pixel i as computed by the softmax classifier. In the fully-connected CRFs, each variable is connected to other variables. The large size of the image will lead to a huge complexity of inference in this model. A highly efficient inference method based on mean field approximation is used to solve the CRF function. The method computes an approximate distribution Q(X) that can be factorized into the product of independent marginal $Q (X) = Π Q_{i} (X_{i})$ . The solution is to update the following mean field equations iteratively¹⁸

\begin{array}{l} Q_{i} (x_{i} = l) \\ = \frac{1}{Z_{i}} exp [- E_{u} (x_{i}) - \sum_{l^{'} \in L} μ (l, l^{'}) \sum_{m = 1}^{K} ω^{(m)} \sum_{j \neq i} k^{(m)} (h_{i}, h_{j}) Q_{j} (l^{'})] \end{array}

where $Z_{i}$ is the local normalization factor. A high-dimensional filtering based on the permutohedral lattice³⁷ is used to accelerate the updation, which makes the inference process highly efficient in practice. After several iterations, we can divide the image region of this category into different classes. After the process for all categories, the objects of different classes are combined together to get the final segmentation result. We summarize the implementation steps for the HGI segmentation method in Algorithm 1.

Algorithm 1.

Road scenes segmentation via hierarchical graph-based inference.

Input: input image I, number of object classes L, number of categories M.

Network inference:

1. Put the image into the network to generate L probability maps.

Category inference:

2. Over-segment the image into a set of superpixels.

3. Calculate the unary potentials in MRF model based on equation (2).

4. Calculate the pairwise potentials based on equation (3).

5. Perform α-expansion algorithm to label the category of each superpixel.

Class inference:

6. for

k = 1

to M do

7. Create augmented label set

{\tilde{g}}_{k}

for category g_k.

8. Label each pixel a class in

{\tilde{g}}_{k}

via fully-connected CRFs inference.

9. end for

10. Combine the class segmentation results in all categories together.

Output: segmentation map.

Markov random field; CRF: conditional random field.

The proposed HGI can alleviate the confusion between classes in different groups. This is the main advantage over the method that inferring multi-label by a single CRF model. Though the problem is decomposed into multiple class level segmentations, the computational complexity is not increased. From equation (13), it can be found that each update contains a sum over all labels for each label. Theoretically, it requires quadratic complexity in the number of label L and variables N. Suppose $L = 12$ , the computational cost of the update mean field is about $O (122 N^{2})$ . If we divide the 12 classes into 4 groups in average and perform the inference separately, the computational cost will decrease to $O (64 N^{2})$ . Along with the category level segmentation, the whole HGI has a similar running time with the single CRF model in practice.

Experiment

In the experiments, we evaluate the performance of the proposed algorithm for road scene segmentation and compare the method with several state-of-the-art techniques. To validate the HGI inference, the results obtained using a single CRF model after network classification are also compared. Both qualitative and quantitative assessments are performed to evaluate the segmentation results.

Data sets and implementations

The experiments are conducted on two road scene segmentation benchmarks: Camvid²² and Cityscapes.²⁴ Camvid data set consists of 367 training and 233 testing RGB images from several video sequences. The original frame resolution for this data set is $960 \times 720$ . In order to compare with the previous methods, we down-sample the images to $480 \times 360$ for the testing. This data set contains 11 common object classes existed in residential, urban, and mixed road scenes. These classes have been divided into four groups including moving object, fixed objects, road, and sky. Cityscapes is a recent large-scale data set on street scene images from 50 different European cities. It consists of 5000 fine-annotated images, in which 2975 are available for training, 500 for validation, and the remaining 1525 for testing. The images of this data set have resolution $2048 \times 1024$ . During the evaluation, we down-sample the images by $2$ and use validation set for the comparison. This data set contains 19 classes which are belonging to 7 categories: ground, construction, object, nature, sky, human, and vehicle. To determine the parameters in MRF and CRF energy functions, we use grid search process¹⁸ on a subset of the training set.

Both qualitative and quantitative assessments are performed to evaluate the segmentation results. For the quantitative evaluation, we use class average accuracy, global average accuracy, and mean intersection over union (IOU). In all the metrics, the background pixels are ignored. Suppose is the confusion matrix for the L classes, where columns indicate the actual classes and rows the estimated ones. $M_{i j}$ refers to the number of pixels of label i which is labeled as j. The accuracy for class i is defined as

C_{i} = \frac{M_{i i}}{\sum_{j = 1}^{L} M_{i j}}

The class average accuracy is defined as the mean value of $C_{i}$ . The global average accuracy is the percentage of pixels correctly classified. The mean IOU is defined as

mIOU = \frac{1}{L} \sum_{i = 1}^{L} \frac{M_{i i}}{\sum_{i = 1}^{L} M_{i j} + \sum_{j = 1}^{L} M_{i j} - M_{i i}}

Experimental results

The quantitative results of the proposed algorithm and the compared methods on Camvid data set are presented in Table 2. We have highlighted the highest values in boldface for each metric. The traditional methods^20,9 have low values on both global and class accuracy. It can be found that these methods do not perform well on small classes. SegNet¹⁷ achieves best performance on global and mean IOU metrics. This network is trained on additional data set which brings about 10% improvement in average accuracy. Our objective was to understand the performance of the graph-based inference model. We do not use additional data to train the network and the result is not so competitive. To compare the proposed HGI with the single dense CRF inference,¹⁸ both of them are applied to refine the network classification results. The single dense CRF even worsens the accuracy of smaller classes such as sign and pedestrian and only brings small improvement on global accuracy and mean IOU metric. While the proposed method gives the biggest improvements on the smaller classes and brings a better performance increase on all the three metrics. HGI outperforms all other methods in terms of class average metric and has a slightly lower mean IOU value than SegNet.

Table 2.

Quantitative evaluation on Camvid data set.^a

Methods	Building	Tree	Sky	Car	Sign	Road	Pedestrian	Fence	Pole	Pavement	Bicyclist	Class average	Global average	mIOU
Ladicky et al.²⁰	81.5	76.6	96.2	78.7	40.2	93.9	43	47.6	14.3	81.5	33.9	62.5	83.8	n/a
Tighe and Lazebnik⁹	87	67.1	96.9	62.7	30.1	95.9	14.7	17.9	1.7	70	19.4	51.2	83.3	n/a
SegNet¹⁷	89.6	83.4	96.1	87.7	52.7	96.4	62.2	53.45	32.1	93.3	36.5	71.2	90.4	60.1
ENet²⁹	71.6	84.4	92.9	84.2	64.3	94.4	82.4	40	62.2	87.5	58.5	74.8	84.5	54.4
ENet + CRF¹⁸	74.2	86.2	95	78.7	63.8	94.5	78.4	40.5	57.2	88	59.4	74.2	85.5	55.2
HGI	80.2	86.8	93.8	82.2	65.2	94.7	82.9	41.5	56.4	87.7	63.6	76.0	87.1	59.3

CRF: conditional random field; HGI: hierarchical graph-based inference; mIOU: mean intersection over union.

^aBoldface represents the highest values for each metric.

The qualitative evaluation is also performed to verify the superiority of the proposed method. Some results on Camvid data set are exhibited in Figure 3. As can be seen, the network outputs usually predict rough object borders and contain many specks on large classes. As the single dense CRF can remove these isolate pixels, it also removes or mislabels small objects. Our method can obtain a high quality of predictions at the boundary between different classes. The results are more close to the ground truth labels.

Figure 3.

Examples of segmentation results on Camvid test set. (a) Image, (b) ground truth, (c) ENet, (d) ENet + CRF, and (e) HGI. CRF: conditional random field; HGI: hierarchical graph-based inference.

We further explored the performance of HGI using validation set of Cityscapes data set. The qualitative results are listed in Table 3. It can be seen that the single CRF model has a little effect on the result of network classification. While the result of HGI has different degrees of improvements on three metrics. Some examples on this validation set are shown in Figure 4. We can also see the fine segmentation of our method at the edge of different classes. From the comparisons, it shows that the proposed method outperforms the traditional CRF method, which proves the effectiveness of the proposed hierarchical scheme.

Table 3.

Quantitative evaluation on cityscapes data set.^a

Methods	Class average	Global average	mIOU
ENet²⁹	70.3	91.4	53.2
ENet + CRF¹⁸	70.4	90.9	52.4
HGI	71.7	92.2	59.6

CRF: conditional random field; HGI: hierarchical graph-based inference; mIOU: mean intersection over union.

^aBoldface represents the highest values for each metric.

Figure 4.

Examples of segmentation results on Cityscapes val set. (a) Image, (b) ground truth, (c) ENet, (d) ENet + CRF, and (e) HGI. CRF: conditional random field; HGI: hierarchical graph-based inference.

To evaluate the computational efficiency, we report the CPU average running time on two data sets. All the experiments were performed on a computer with 2.4-GHz Intel Xeon CPU and Titan X GPU, using MATLAB 2016a for the implementation. The results are listed in Table 4. As can be seen, ENet is faster than the other methods. Since the proposed HGI method performs a serials of graph inferences after the network inference, it requires more time than a single network-based method. Compared with the traditional dense CRF model, the proposed method has smaller complexity of computation. This proves that the hierarchical inference is efficient than the way of inferring all classes together.

Table 4.

Average running time (seconds) comparison on two data sets.

Methods	Camvid	Cityscapes
Methods	$480 \times 360$	$1024 \times 512$
ENet²⁹	0.05	0.11
ENet + CRF¹⁸	0.94	2.03
HGI	0.87	1.76

CRF: conditional random field; HGI: hierarchical graph-based inference.

Conclusions

In this work, we introduced a new framework for road scene segmentation. It is running automatically without human intervention. A deep encoder–decoder network is applied for a fast pixel-wise classification. Then, an HGI is performed to get an accurate segmentation result. We do not directly infer multiple object classes using a single dense CRF model. Instead, all the classes are grouped into fewer categories. Each category contains at least one object class. We first label the category for image superpixels. An initial segmentation is obtained by superpixel labeling using MRF inference. For each category, a pixel-level labeling based on dense CRF is performed to divide image into different classes that belong to this category. In additional to color-based Gaussian potentials, the feature maps from network are also used in pairwise potentials. After the inference for all categories, the results are integrated together to get the final segmentation. This hierarchical inference scheme can alleviate the confusion of classes belonging to different categories. It performs well for small objects without adding computational burden. The performance evaluations on benchmark data sets show the effectiveness of the proposed hierarchical scheme.

Footnotes

Authors’ note

This article was presented in part at the CCF Chinese Conference on Computer Vision,Tianjin,2017. This article was recommended by the program committee.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research,authorship,and publication of this article: This work was supported by the National Natural Science Foundation of China under Grant No. 61502364.

ORCID iD

Yanzi Deng

References

Farabet

Couprie

Najman

. Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 2013; 35(8): 1915–1929.

Zemel

Carreira-Perpinan

. Multiscale conditional random fields for image labeling. In: 2004 IEEE conference on computer vision and pattern recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004, volume 2. pp. II–695–II–702. IEEE.

Shotton

Johnson

Cipolla

. Semantic texton forests for image categorization and segmentation. In: 2008 IEEE conference on computer vision and pattern recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008, pp. 1–8. IEEE.

Sturgess

Alahari

Ladicky

. Combining appearance and structure from motion features for road scene understanding. In: Proceedings of the British machine vision conference, London, UK, 7–10 September 2009, pp. 62.1–62.11.

Scharwachter

Enzweiler

Franke

. Stixmantics: a medium-level model for real-time semantic scene understanding. In: Proceedings, V: ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, pp. 533–548. Cham, Switzerland: Springer International Publishing.

Hoiem

Efros

Hebert

. Recovering surface layout from an image. Int J Comput Vision 2007; 75(1): 151–172.

Gould

Fulton

Koller

. Decomposing a scene into geometric and semantically consistent regions. In: Computer vision, 2009 IEEE 12th international conference on, Kyoto, Japan, 29 September–2 October 2009, pp. 1–8. IEEE.

Alvarez

Gevers

LeCun

. Road scene segmentation from a single image. In: Proceedings of ECCV, Florence, Italy, 7–13 October 2012, pp. 376–389. Heidelberg, Berlin: Springer Berlin Heidelberg.

Tighe

Lazebnik

. Superparsing. Int J Comput Vision 2013; 101(2): 329–349.

10.

Cordts

Rehfeld

Enzweiler

. Tree-structured models for efficient multi-cue scene labeling. IEEE Trans Pattern Anal 2017; 39(7): 1444–1454.

11.

Farabet

Couprie

Najman

. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In: Proceedings of the 29th international conference on machine learning (ICML-12). ICML ‘12, Edinburgh, Scotland, 26 June-1 July 2012, pp. 575–582.

12.

Long

Shelhamer

Darrell

. Fully convolutional networks for semantic segmentation. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), 2015; 39(4): 3431–3440.

13.

Noh

Hong

Han

. Learning deconvolution network for semantic segmentation. In: The IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp. 1520–1528. IEEE.

14.

Papandreou

Chen

Murphy

. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp. 1742–1750. IEEE.

15.

Zheng

Jayasumana

Romera-Paredes

. Conditional random fields as recurrent neural networks. In: 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp. 1529–1537. IEEE.

16.

Chen

Papandreou

Kokkinos

. DeepLab: semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal 2017; PP(99): 1–1.

17.

Badrinarayanan

Kendall

Cipolla

. SegNet: a deep convolutional encoder-decoder architecture for scene segmentation. IEEE Trans Pattern Anal 2017; 39: 2481–2495.

18.

Krahenbuhl

Koltun

. Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Proceedings advances in neural information processing systems conference, Granada, Spain, 12–17 December 2011. Curran Associates, Inc.

19.

Bertasius

Shi

Torresani

. Semantic segmentation with boundary neural fields. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 3602–3610. IEEE.

20.

Ladicky

Sturgess

Alahari

. What, where and how many? Combining object detectors and CRFs. In: Proceedings, IV ECCV 2010: 11th European conference on computer vision, Heraklion, Crete, Greece, 5–11 September 2010, pp. 424–437. Berlin, Heidelberg: Springer Berlin Heidelberg.

21.

Bileschi

. Streetscenes: towards scene understanding in still images. PhD Dissertation, Massachusetts Institute of Technology, MA, USA, 2006.

22.

Brostow

Fauqueur

Cipolla

. Semantic object classes in video: a high-definition ground truth database. Pattern Recognit Lett 2009; 30(2): 88–97.

23.

Geiger

Lenz

Urtasun

. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16–21 June 2012, pp. 3354–3361. IEEE.

24.

Cordts

Omran

Ramos

. The cityscapes dataset for semantic urban scene understanding. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 3213–3223. IEEE.

25.

Gould

Rodgers

Cohen

. Multi-class segmentation with relative location prior. Int J Comput Vision 2008; 80(3): 300–316.

26.

Kohli

Ladický

Torr

PHS

. Robust higher order potentials for enforcing label consistency. Int J Comput Vision 2009; 82(3): 302–324.

27.

Ladicky

Russell

Kohli

. Associative hierarchical CRFs for object class image segmentation. In: 2009 IEEE 12th international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp. 739–746. IEEE.

28.

Toyoda

Hasegawa

. Random field model for integration of local information and global information. IEEE T Pattern Anal 2008; 30(8): 1483–1489.

29.

Paszke

Chaurasia

Kim

. ENet: A deep neural network architecture for real-time semantic segmentation [Online]. Available at: https://arxiv.org/abs/1606.02147

30.

Zhang

Ren

. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE.

31.

Zhang

Ren

. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp. 1026–1034. IEEE.

32.

Koltun

. Multi-scale context aggregation by dilated convolutions. In: 2016 international conference on learning representations, caribe hilton, San Juan, Puerto Rico, 2–4 May 2016.

33.

Felzenszwalb

Huttenlocher

. Efficient graph-based image segmentation. Int J Comput Vision 2004; 59(2): 167–181.

34.

Szeliski

Zabih

Scharstein

. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans Pattern Anal 2008; 30(6): 1068–1080.

35.

Gridchyn

Kolmogorov

. Potts model, parametric maxflow and k-submodular functions. In: 2013 IEEE international conference on computer vision, Sydney, NSW, Australia, 1–8 December 2013, pp. 2320–2327. IEEE.

36.

Boykov

Veksler

Zabih

. Fast approximate energy minimization via graph cuts. IEEE Trans Pattern Anal 2001; 23(11): 1222–1239.

37.

Adams

Baek

Davis

. Fast high-dimensional filtering using the permutohedral lattice. Comput Graph Forum 2010; 29(2): 753–762.