Abstract
Keywords
Introduction
Image steganography is a technique that embeds secret information into a cover image while modifying the image content and statistical features as little as possible. 1 Secret information can be embedded in two domains: the spatial domain and the frequency domain. Spatial-domain steganography slightly modifies pixel values so that the cover image and the steganographic image have similar visual quality. Frequency-domain steganography is generally applied to JPEG images and works by changing discrete cosine transform (DCT) coefficients. Least significant bit (LSB) steganography 2 is an early spatial-domain algorithm that embeds secret information into the least significant bit of the pixel values of the cover image. The algorithm is simple but changes the statistical features of the image. More recently, many adaptive steganography algorithms have been proposed, such as HUGO, 3 WOW, 1 and S-UNIWARD 4 in the spatial domain, which select texture-rich regions of the image for embedding. The main approach of adaptive steganography is to define a distortion function and calculate the cost of changing each pixel in order to estimate whether the pixel is suitable for modification. These algorithms make steganographic traces difficult to detect and preserve the higher order statistical features of images, which poses great challenges to steganalysis.
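As a minimal sketch of the LSB embedding described above (not taken from any of the cited implementations), the following Python snippet overwrites the lowest bit of each pixel with one message bit. The function names `embed_lsb` and `extract_lsb` and the toy pixel values are illustrative assumptions.

```python
def embed_lsb(pixels, bits):
    """Overwrite the least significant bit of the first len(bits) pixels."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the lowest bit, then set it
    return stego


def extract_lsb(pixels, n_bits):
    """Read the message back from the lowest bit of each pixel."""
    return [p & 1 for p in pixels[:n_bits]]


cover = [154, 155, 200, 17, 64, 255]   # toy 8-bit grayscale values
message = [1, 1, 0, 0, 1, 0]
stego = embed_lsb(cover, message)
recovered = extract_lsb(stego, len(message))
```

Each pixel changes by at most 1, so the modification is visually negligible, yet the lowest bit plane no longer follows the statistics of a natural image.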
Steganalysis is a technique for detecting the traces of steganography. Early steganalysis was mainly based on simple, low-dimensional statistical features. To detect adaptive steganography algorithms, steganalysis algorithms using higher order statistical features have been proposed, such as the spatial rich model (SRM) 5 and several models 6,7 based on it. SRM generates a rich model of the noise component using a variety of high-pass filters and combines it with a support vector machine (SVM) or ensemble classifiers. However, improving the performance of the model requires a larger feature dimension, which increases the computational complexity. In addition, the design of features in SRM depends on experience, which makes it difficult to improve them substantially.
With the development of convolutional neural networks (CNNs), a variety of CNN-based steganalysis algorithms have been proposed to enhance the efficiency of steganalysis. Qian et al. 8 proposed the Gaussian-neuron convolutional neural network (GNCNN), which uses a high-pass filtering (HPF) layer to enhance the steganographic noise and a Gaussian function as the activation function. Xu et al. 9 proposed a network in which an absolute activation (ABS) 5 layer is added after the first convolutional layer to improve statistical modeling in the following layers. In addition, a TanH 10 activation function is used to avoid overfitting. Batch normalization (BN) 11 is also used to prevent training from falling into poor local minima and to optimize the scales and biases of the feature maps. The CNN proposed by Ye et al. 12 uses the high-pass filters that compute residual maps in SRM to initialize the weights of the first convolutional layer. Furthermore, it incorporates the knowledge of the selection channel 13 and uses a new activation function called the truncated linear unit (TLU) to improve performance. Yedroudj et al. 14 proposed a network using the TLU activation function and BN layers, inspired by XuNet 9 and YeNet. 12 Zhang et al. 15 proposed a CNN utilizing separable convolutions 16 and spatial pyramid pooling (SPP). 17 Separable convolutions are used to perform group convolution on the residuals generated by the high-pass filters, and SPP enables the network to steganalyze images of arbitrary size. However, the parameters of manually designed high-pass filters are fixed and cannot be adjusted as the network learns. Furthermore, state-of-the-art networks generally contain a large number of convolutional layers and convolutional kernels, which reduces their efficiency, and in most networks all residuals are input to the network with the same importance.
In IAS-CNN, to enhance the steganographic noise and reduce the impact of image content, a high-pass filter from SRM is used to calculate residual maps. Because the manually designed filter is not necessarily optimal, the parameters of the filter are included in the learning of the network. In adaptive steganography, especially at low payloads, the changes that steganography makes to an image are slight. Inspired by YeNet, 12 to further enhance the steganographic noise signal, the selection channel is incorporated into the network to strengthen residuals in regions that have a high probability of carrying embedded information and to encourage the network to learn key features. Embedding probability maps of images are computed and incorporated into the residuals before feature extraction. Because the depth of the network and the number of parameters influence its efficiency, the network is designed to be lightweight to improve processing speed.
The rest of the article is organized as follows. In section “Proposed CNN,” we introduce the overall architecture of our network and the details of each layer. In section “Experiments,” we present the experimental results and analysis. Finally, the conclusion and future research are summarized in section “Conclusion.”
Proposed CNN
Overall architecture
Figure 1 shows the overall architecture of the proposed network. The network consists of a pre-processing layer, a feature extraction layer, and a classification layer. In the pre-processing layer, one of the SRM filters is used to extract the residual features of the image, and these features are then combined with the knowledge of the selection channel to form the output. The feature extraction layer is composed of five convolutional layers and five average pooling layers. The final classification layer consists of two fully connected layers, two dropout layers, and a two-way softmax. In our network, the two-way softmax is implemented by a fully connected layer followed by a softmax function. Some parameters of the proposed network are shown in the figure. The specific processing of each layer is described below.

Figure 1. Overall architecture of the proposed network.
Pre-processing layer filter
Steganography can be viewed as adding low-amplitude noise to the cover image. The difference between the steganographic image and the cover image is hardly visible in the image content. However, the noise changes the dependencies between neighboring pixels, and steganalyzers can exploit these dependencies to detect the steganographic noise. The high-pass filters designed in SRM can be used to calculate residual maps and capture the noise of steganography. Therefore, we add a pre-processing layer for residual calculation prior to feature extraction in the proposed network. We also utilize the knowledge of the selection channel, which is described in section “Selection channel.” In a CNN, residual extraction can be accomplished by convolution.
Filters from SRM are usually used to simulate the extraction of residuals. To reduce the number of training parameters and improve the speed of processing, we chose a filter from SRM as the convolution kernel in the convolutional computation. The number of residual maps generated from each image in the pre-processing layer is related to the number of filters. If

Figure 2. The changing trend of the number of weights in the second convolutional layer as the number of filters increases.
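The trend can be reproduced with a simple count. The sketch below assumes that the second convolutional layer has 16 kernels of size 3 × 3, as stated in section “Feature extraction layer,” that each pre-processing filter contributes one input channel, and that biases are ignored. The helper name `second_layer_weights` is hypothetical.

```python
def second_layer_weights(n_filters, n_kernels=16, kernel_size=3):
    """Weights of a conv layer whose input has one channel per residual map."""
    return n_filters * n_kernels * kernel_size * kernel_size


# One filter (as in IAS-CNN), five filters, and the 30 filters used by
# several of the networks compared later in the article.
counts = {n: second_layer_weights(n) for n in (1, 5, 30)}
print(counts)
```

Under these assumptions the count grows linearly with the number of filters, from 144 weights for a single filter to 4320 weights for 30 filters.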
Generally, filters of size 3 × 3 or 5 × 5 are chosen as the convolutional kernels of the pre-processing layer. Five filters, selected from the classes “First,” “Second,” “Third,” “SQUARE 3 × 3,” and “SQUARE 5 × 5,” are each used in turn to initialize the first-layer convolutional kernel. Filters in the classes “First,” “Second,” and “SQUARE 3 × 3” are used to initialize convolutional kernels of size 3 × 3. The remaining two filters are used to initialize convolutional kernels of size 5 × 5. Since a manually designed convolutional kernel is not necessarily optimal, the kernel is usually included in the learning of the CNN. In our network, the convolutional kernel of the pre-processing layer is continuously adjusted during training. Before initializing the convolutional kernel of the pre-processing layer, we normalize the selected filter while preserving the form of residual extraction. Taking the SQUARE 3 × 3 filter as an example, in equation (1), we use multiplication to change the center element to −1 while keeping the sum of the kernel values at 0. Our experimental results showed that the first-layer convolutional kernel changed only slightly during training, so it is no longer constrained during training.
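The normalization can be illustrated with a short sketch. The integer kernel below is the SQUARE 3 × 3 filter as it is commonly written for SRM. Treating a scaling by 1/4 as the multiplication meant here is an assumption made for illustration, chosen because it is the factor that turns the center element −4 into −1. The helper names are hypothetical.

```python
# SQUARE 3 x 3 kernel as commonly given for SRM (integer form).
SQUARE_3X3 = [
    [-1, 2, -1],
    [2, -4, 2],
    [-1, 2, -1],
]


def normalize_kernel(kernel):
    """Scale the kernel so that its center element becomes -1."""
    center = kernel[len(kernel) // 2][len(kernel[0]) // 2]
    scale = -1.0 / center
    return [[scale * w for w in row] for row in kernel]


def residual_at(patch, kernel):
    """Residual of the center pixel of a patch the same size as the kernel."""
    return sum(p * w
               for prow, krow in zip(patch, kernel)
               for p, w in zip(prow, krow))


kernel = normalize_kernel(SQUARE_3X3)
flat = [[100, 100, 100]] * 3                      # constant image content
bumped = [[100, 100, 100], [100, 101, 100], [100, 100, 100]]
print(residual_at(flat, kernel), residual_at(bumped, kernel))
```

Because the kernel values sum to 0, constant image content produces a zero residual, while a change of +1 in the center pixel survives as a residual of −1. This is the sense in which the filter suppresses content and keeps the embedding noise.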
For the steganography algorithm WOW at the payload of 0.4, different filters are used to initialize the first layer of the network, and the experimental results are shown in Table 1.
Table 1. The comparison of network performance using different filters.
Table 1 shows that the network using the SQUARE 3 × 3 filter achieves higher accuracy than the networks using the other filters, which indicates that the residuals extracted by the SQUARE 3 × 3 filter are more beneficial to steganalysis. In addition, a convolutional kernel of size 3 × 3 has fewer training parameters than one of size 5 × 5. Thus, we choose the SQUARE 3 × 3 filter in SRM to initialize the first layer of the proposed network.
Selection channel
To improve the performance of the network against adaptive steganographic schemes, we apply the selection channel to the CNN. The embedding probability of each pixel is used to enhance the residuals of regions with a high embedding probability.
Inspired by the work in YeNet, 12 we use the upper bound of the expectation of
where
We use
We calculate the costs of pixel modification and then estimate the embedding probability maps of images.
where
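The step from pixel costs to an embedding probability map can be sketched as follows. This is not a reproduction of the equations of this section. It assumes the ternary form that is common in the literature on adaptive embedding, in which a pixel with cost ρ is changed by +1 or by −1 with probability β = exp(−λρ) / (1 + 2 exp(−λρ)) each, and λ is found by binary search so that the total entropy equals the payload. The function names, the toy costs, and the iteration count are illustrative.

```python
import math


def ternary_entropy(beta):
    """Entropy in bits of a pixel changed by +1 or -1 with probability beta each."""
    total = 0.0
    for p in (beta, beta, 1.0 - 2.0 * beta):
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def probabilities(costs, lam):
    """Change probability of each pixel for a given lambda."""
    out = []
    for rho in costs:
        e = math.exp(-lam * rho)
        out.append(e / (1.0 + 2.0 * e))
    return out


def embedding_probabilities(costs, payload_bits, iterations=60):
    """Binary search for lambda so that the total entropy matches the payload."""
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until the entropy drops below the payload.
    while sum(map(ternary_entropy, probabilities(costs, hi))) > payload_bits:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        bits = sum(map(ternary_entropy, probabilities(costs, mid)))
        if bits > payload_bits:
            lo = mid   # too much entropy: raise lambda to embed less
        else:
            hi = mid
    return probabilities(costs, 0.5 * (lo + hi))


costs = [0.5, 1.0, 2.0, 8.0]   # toy costs: a low cost marks a textured region
betas = embedding_probabilities(costs, payload_bits=0.4 * len(costs))
```

Pixels with a low cost receive a high embedding probability, which is what lets the probability map act as a selection channel.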
Table 2. The performance of the network on WOW at the payload of 0.4
Table 3. The performance of the network on WOW at the payload of 0.4
Tables 2 and 3 show that, in the case of
Table 4. The comparison of network performance when
The selection channel can also be combined into CNN through elementwise summation. We estimated the performance of the method of summation and multiplication

Figure 3. The training loss change of the elementwise summation method.

Figure 4. The training loss change of the elementwise multiplication method.
Comparing the two curves, the training loss declines significantly faster with the elementwise summation method than with the elementwise multiplication method. Thus, we use elementwise summation to incorporate the selection channel into our network.
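The elementwise summation can be sketched as follows. The sketch assumes, following the YeNet formulation cited in this section, that the selection-channel term is the embedding probability map filtered with the absolute values of the pre-processing kernel. The kernel is the SQUARE 3 × 3 filter scaled so that its center is −1. The helper names, the border handling without padding, and the toy inputs are assumptions.

```python
def conv2d_valid(image, kernel):
    """Sliding-window filtering without padding (kernel is symmetric here)."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(cols)]
            for r in range(rows)]


def preprocess(image, beta, kernel):
    """Residual map plus the selection-channel term, summed elementwise."""
    abs_kernel = [[abs(w) for w in row] for row in kernel]
    residual = conv2d_valid(image, kernel)
    bound = conv2d_valid(beta, abs_kernel)
    return [[r + b for r, b in zip(rrow, brow)]
            for rrow, brow in zip(residual, bound)]


KERNEL = [[-0.25, 0.5, -0.25],
          [0.5, -1.0, 0.5],
          [-0.25, 0.5, -0.25]]

image = [[10, 10, 10, 10],
         [10, 12, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
beta = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
out = preprocess(image, beta, KERNEL)
```

In this toy input only one pixel has a nonzero embedding probability, so the selection-channel term shifts the residuals only in the neighborhood of that pixel and leaves the rest of the residual map unchanged.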
The research of Ye et al. indicates that when the activation function of each convolutional layer is ReLU, the propagation and contribution of
Equation (7) represents the output of the second convolutional layer, where
In summary, to better propagate the knowledge of the selection channel, ReLUs are used as the non-linear activation functions from the second convolutional layer to the sixth convolutional layer. Table 5 shows the effect of the selection channel on the accuracy of steganography detection. The steganography algorithm WOW was tested at three payloads: 0.2, 0.4, and 1.0.
Table 5. The effect of selection channel on the accuracy of steganography detection.
Feature extraction layer
After the pre-processing layer, our network produces an output that combines the residuals of the image with the selection channel of the same image. The network then extracts further features before passing them to the classifier. IAS-CNN uses five convolutional layers to extract features. The first four convolutional layers use 16 convolutional kernels of size 3 × 3, while the last one uses 16 convolutional kernels of size 5 × 5. Each convolutional layer is followed by an average pooling layer.
Classification layer
After passing through the layers described above, the network obtains the features extracted from the image. As described in section “Selection channel,” the features are composed of two parts: one extracted from the residual of the image and the other from the selection channel. The features then need to be integrated and classified as cover or stego, which is done in the classification layer. To this end, the classification layer mainly consists of two components: the fully connected layers and the softmax layer. In summary, the classification layer takes the extracted features as input and outputs the classification result.
Generally, most of the learnable parameters of a CNN are in the fully connected layers. A large number of parameters in the fully connected layers can reduce the training efficiency of the network and cause overfitting. To reduce the number of parameters, we set the stride of each pooling layer to 2, which reduces the size of each feature map, and set the number of convolutional kernels in the last convolutional layer to 16. Most existing steganalyzers, such as YedroudjNet and ZhuNet, have more convolutional kernels in the last convolutional layer. We also use the dropout method proposed by Hinton et al. 18 to address overfitting. Dropout can be described as follows: during forward propagation, the activation of each neuron is disabled with a certain probability. In our proposed network, we use two 128-dimensional fully connected layers and add a dropout layer after each of them. The dropout rate of each dropout layer is set to 0.5.
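The dropout step described above can be sketched in a few lines. The sketch uses the "inverted" variant, which rescales the surviving activations during training so that no rescaling is needed at test time. That variant is an assumption, since only the rate of 0.5 is stated here. The function name and the toy activations are illustrative.

```python
import random


def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: zero each activation with probability `rate`."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Surviving activations are divided by `keep` to preserve the expected sum.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


features = [0.2, 1.5, 0.7, 3.0]            # toy fully connected activations
dropped = dropout(features, rate=0.5, rng=random.Random(0))
unchanged = dropout(features, rate=0.0)    # a rate of 0 keeps everything
```

At test time the layer is simply skipped, which corresponds to calling the function with a rate of 0.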
Furthermore, overfitting can also arise from insufficient data, and several existing methods can be used to improve the generalization ability of the network. In addition to adding dropout layers, we mention two methods. The first is regularization, that is, adding a penalty term to the loss function. The second is to use a validation set to judge whether overfitting has occurred by comparing the training loss and the validation loss. Considering computational efficiency, of these two we used only the validation set to avoid overfitting.
In order to output the result of classification, we apply the two-way softmax at the end of IAS-CNN.
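A minimal sketch of the two-way softmax follows. The two logits stand for the outputs of the final fully connected layer. Their values and the ordering (cover first, stego second) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.2, -0.3])   # hypothetical logits for (cover, stego)
label = "cover" if probs[0] >= probs[1] else "stego"
```

The two outputs sum to 1 and can be read as the probabilities that the input image is a cover or a stego image.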
Experiments
Data set
The data set used in this article is BOSSBase v1.01, 19 which contains 10,000 gray-level cover images of size 512 × 512. We rescaled the images to 256 × 256 pixels in all experiments. In addition, we generated steganographic images with the different steganography algorithms, together with the embedding probability maps of the cover images and steganographic images for each algorithm.
Parameters
In each experiment, we divided the data set into three sets: training, validation, and test. The training set contained 8000 cover images and 8000 steganographic images. The validation set contained 1000 cover images and 1000 steganographic images. The test set contained the remaining 1000 cover images and 1000 steganographic images. The three sets did not share any images. Each image has a corresponding embedding probability map.
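One way to obtain such a split is sketched below. It splits by cover index, so that each cover image, its steganographic version, and their probability maps fall into the same subset. Whether the split was random and whether pairs were kept together is not stated here, so both are assumptions, as are the function name and the seed.

```python
import random


def split_indices(n_images=10000, n_train=8000, n_val=1000, seed=0):
    """Shuffle image indices and cut them into train/validation/test parts."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices()
print(len(train), len(val), len(test))
```

Index `i` then selects cover image `i` and the steganographic image generated from it, which gives 8000, 1000, and 1000 pairs with no image shared between subsets.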
A custom initial value was used for the convolutional kernel of the pre-processing layer. The Xavier 20 initializer was used to initialize the convolutional kernels of the following five convolutional layers, and the initial biases of the second to sixth layers were set to zero. IAS-CNN includes two dropout layers, and the dropout rate was set to 0.5 in both. For payloads of 0.2, 0.4, and 1.0, the learning rate of the proposed network was set to 0.01, 0.03, and 0.04, respectively, using the ADADELTA 21 gradient descent algorithm. The ADADELTA parameters were as follows: the decay rate was set to 0.95 and the fuzz factor epsilon was set to 1 × 10^−6. The mini-batch size was 100, with each mini-batch containing 50 cover images and 50 steganographic images. With these settings, the proposed network was trained to minimize the cross-entropy loss.
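The role of the ADADELTA settings can be illustrated with a scalar sketch of the update rule. It uses the decay rate 0.95, epsilon 1 × 10^−6, and learning rate 0.04 given above. Applying the learning rate as a multiplier on the ADADELTA step follows the convention of some deep learning frameworks and is an assumption, as are the function name and the toy objective f(x) = x².

```python
import math


def adadelta_step(x, grad, state, lr=0.04, rho=0.95, eps=1e-6):
    """One ADADELTA update of a scalar parameter; `state` holds both averages."""
    # Running average of squared gradients.
    state["g2"] = rho * state["g2"] + (1.0 - rho) * grad * grad
    # Step scaled by the ratio of the two root-mean-square values.
    delta = math.sqrt(state["d2"] + eps) / math.sqrt(state["g2"] + eps) * grad
    # Running average of squared steps.
    state["d2"] = rho * state["d2"] + (1.0 - rho) * delta * delta
    return x - lr * delta


x = 1.0
state = {"g2": 0.0, "d2": 0.0}
for _ in range(100):
    x = adadelta_step(x, grad=2.0 * x, state=state)   # gradient of x**2
```

The decay rate controls how quickly the two running averages forget old values, and epsilon keeps the first steps from being exactly zero when both averages start at 0.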
Results and analysis
Table 6 shows the detection accuracy of IAS-CNN. Three spatial-domain steganography algorithms, namely HUGO, WOW, and S-UNIWARD, were used to evaluate the performance of the network, and three payloads (0.2, 0.4, and 1.0) were applied to each algorithm. For payloads 0.2 and 0.4, we initialized the network with parameters trained on a data set with a relatively higher embedding rate and then fine-tuned it on the corresponding data set, which differs from the experiments in sections “Pre-processing layer filter” and “Selection channel.”
Table 6. The performance of IAS-CNN.
Table 6 indicates that the accuracy of IAS-CNN increases with the payload. Comparing the results in Tables 5 and 6, we conclude that initializing the network with parameters obtained from training at a relatively higher payload gives better performance in low-payload steganalysis. The embedding probability map is used to incorporate the selection channel into the network. The calculation of the embedding probability map depends on the payload, which is usually unknown. To evaluate the detection ability of IAS-CNN with a mismatched selection channel and to assess the generalization ability of the network, we chose WOW as the steganography algorithm, trained the network with embedding probability maps for payloads of 0.2, 0.4, and 0.6, respectively, and then carried out steganalysis at payloads of 0.2, 0.4, and 1.0. During training, the selection channel and the steganography payload are matched. The experimental results are shown in Table 7.
Table 7. The performance of IAS-CNN when a certain payload selection channel is used for training and other steganography payloads are used for steganalysis.
We use the term probability map payload for the payload corresponding to the embedding probability map and the term steganalysis payload for the payload of the test steganographic image. There are two cases of probability map mismatch: in the first, the probability map payload is lower than the steganalysis payload; in the second, it is higher. Table 7 shows that for steganalysis at a payload of 0.4, the detection accuracy of IAS-CNN trained with embedding probability maps of payload 0.2 is 73.7%, while that of IAS-CNN trained with embedding probability maps of payload 0.6 is 67.2%. The detection performance of the network is therefore better in the first case than in the second. Table 7 also shows that for steganalysis at a payload of 1.0, the detection accuracy of the network gradually improves as the probability map payload increases. Thus, in the first case of mismatch, the closer the two payloads are, the more the selection channel helps to enhance residuals in regions with high embedding probability. In the second case, the detection performance of the network is weakened because the selection channel enhances the residuals of regions where no secret information is embedded.
In order to further evaluate the performance of the proposed network, we compared it with the existing steganalyzers. Figure 5 shows the comparison of IAS-CNN with the steganalysis model SRM based on manually extracted features and the steganalysis model GNCNN based on deep learning. In Figure 5, it can be observed that the detection accuracy of IAS-CNN is higher than that of SRM and GNCNN. 8

Figure 5. The comparison of IAS-CNN with SRM and GNCNN (on the payload of 0.4).
Table 8 shows the comparison of IAS-CNN with other steganalysis models based on deep learning. Table 9 shows the comparison of IAS-CNN with steganalysis models using the knowledge of selection channel.
Table 8. The comparison of IAS-CNN with other steganalysis models.
Table 9. The comparison of IAS-CNN with maxSRMd2 and SC-DA-YeNet (on the payload of 0.4).
In Table 8, IAS-CNN, YedroudjNet, 14 and ZhuNet 15 use the BOSSBase data set, and NSC-YeNet denotes YeNet 12 without the selection channel, also trained on BOSSBase. In Table 9, maxSRMd2 uses the BOSSBase and BOWS2 data sets, and SC-DA-YeNet denotes YeNet with the selection channel, trained on BOSSBase and BOWS2 with data augmentation. Table 8 shows that IAS-CNN performs better than NSC-YeNet. For WOW at payload 0.4, the detection accuracy of YedroudjNet and ZhuNet is between 80% and 90%, and the detection accuracy of IAS-CNN also exceeds 80%. For S-UNIWARD at payloads 0.2 and 0.4, the performance of IAS-CNN is similar to that of YedroudjNet. Table 9 shows that even though IAS-CNN uses about half as much data as maxSRMd2, its performance is similar. For payload 0.4, both achieve accuracies higher than 80% in detecting WOW and higher than 75% in detecting S-UNIWARD. Furthermore, IAS-CNN has advantages when computing resources are limited. First, our network has about 60,000 parameters, fewer than YedroudjNet, ZhuNet, and YeNet. Second, IAS-CNN performs fewer residual extractions: YedroudjNet, ZhuNet, and YeNet use 30 filters to obtain residual maps, while IAS-CNN uses only one. Third, IAS-CNN is more efficient because it requires fewer convolutional computations, whereas ZhuNet and YeNet require more computation because of their deeper networks. In addition, SC-DA-YeNet uses approximately eight times as much data as IAS-CNN. YedroudjNet and IAS-CNN have the same number of convolutional layers, but each layer of YedroudjNet has many more convolutional kernels than the corresponding layer of IAS-CNN.
Conclusion
SRM, a mature steganalysis model, uses hand-crafted high-pass filters to extract a variety of image residuals. However, as steganography becomes more sophisticated, manual feature extraction becomes more complex and difficult. As a result, deep learning, especially CNNs, has been applied to steganalysis. In a CNN, learned image features replace manually designed features. To improve the efficiency of the CNN, an HPF layer is generally used to generate residual maps before feature extraction. However, the hand-crafted filter used in the HPF layer is fixed, that is, it does not change during training. We therefore select a manually designed filter to initialize the pre-processing layer and include it in the learning of the network. In addition, we integrate the knowledge of the selection channel into the image pre-processing to enhance crucial residuals, and we initialize the network with parameters trained on a high-payload data set to improve its performance. Furthermore, IAS-CNN requires fewer residual extractions and convolutional computations, which makes it more suitable for steganalysis when computational capability and storage space are limited. In our proposed network, a single type of filter is used in the pre-processing layer. In the future, we will consider increasing the diversity of feature extraction, for example, by using multiple filters to obtain residuals and then generating minimum and maximum residual maps for further feature extraction.
