Abstract
Introduction
Traditional defect detection relies mainly on experienced professionals; it is subjective and depends on the personal experience of the inspectors. In addition, long working hours greatly reduce the detection rate, and manual detection speed can hardly meet the needs of real-time online detection. Traditional automatic defect detection methods are mainly based on artificially designed feature sets, including the statistical features, 1 structural features, 2 and spectral features 3 of the image. These methods have achieved good results for specific fabric products. However, for new fabric designs, or when the image-capture environment changes, these methods must be modified or even redesigned.
Compared with the previous artificially designed features, deep learning algorithms can automatically learn multi-scale features of an image through a multi-layer network, acquiring both local image information and abstract high-level semantic information. Applying a convolutional neural network (CNN) to fabric image defect detection can handle the multi-deformation and multi-scale nature of the images, makes it possible to construct deep and complex texture defect models, and realizes intelligent detection and location of defects, which is of great significance for improving product quality.
In 2006, Hinton et al. 4 put forward the concept of deep learning for the first time and noted that deep neural network models have strong feature learning ability. In 2014, Ross B. Girshick (RBG) and others used candidate regions to replace sliding windows and CNNs instead of artificially designed features, and proposed R-CNN, a region-based convolutional neural network. 5 Based on this, spatial pyramid pooling (SPP)-NET 6 transforms the multiple convolutions of R-CNN into one convolution, which greatly reduces the computational complexity. Fast R-CNN 7 combines the structural advantages of R-CNN and SPP-NET and uses a multi-task loss function to train the network for target detection; however, Fast R-CNN still cannot meet the real-time requirements of target detection. Faster R-CNN 8 combines candidate region generation and CNN classification into a single network, which improves detection speed and realizes end-to-end training.
In 2015, YOLO 9 adopted an integrated detection scheme that combines candidate frame extraction, CNN feature learning, and non-maximum suppression optimization, 10 making the network structure simpler. Its detection speed is nearly 10 times that of Faster R-CNN, which allowed deep learning target detection algorithms to meet real-time detection requirements under the computing power of the time; however, its detection performance on small targets is poor. YOLOv2 11 improves on the YOLOv1 network structure by adding batch normalization, 12 a high-resolution classifier, 13 convolution with anchor boxes, 14 dimension clusters, 15 and other optimizations to improve the accuracy of target regression and positioning. YOLOv3 16 adds residual networks on the basis of YOLOv2, combines the feature pyramid network (FPN) 17 structure, and uses binary cross-entropy as the loss function. After feature extraction, the two deeper feature maps are up-sampled and merged with the corresponding feature maps of the network; prediction results are then obtained from convolutional layers, achieving both high accuracy and high speed. Based on the good performance of YOLOv3, many researchers have introduced the model into their own research fields and achieved good results.18–20 Based on the above research, we apply a CNN in textile companies to solve the problem of fabric defect detection.
To improve the detection rate of fabric defects, the deep CNN YOLOv3 is used as the basic defect detection framework and is optimized to better detect fabric defects. The remainder of the article is organized as follows. First, we introduce the relevant parts of YOLOv3. Then, the prior boxes of YOLOv3 are modified according to the clustering results of the defect data set, and the network structure is improved. Finally, experimental results are presented and analyzed.
YOLOv3 network
YOLOv3 is an end-to-end target detection algorithm based on regression. Combining a CNN, the non-maximum suppression algorithm, and a feature pyramid, it predicts defect bounding boxes and categories. Each bounding box is classified by independent logistic regression classifiers instead of softmax, and the target class is predicted with a binary cross-entropy loss. Based on these design ideas, YOLOv3 achieves good results in both accuracy and speed.
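This per-class logistic scheme can be sketched in plain NumPy (a minimal illustration, not the paper's implementation; the three-class logits and labels below are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(logits, labels):
    """Binary cross-entropy over independent per-class logits,
    as used by YOLOv3 in place of a softmax classifier."""
    p = sigmoid(logits)
    eps = 1e-7  # guard against log(0)
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Each class gets its own logistic output, so several classes can
# exceed the confidence threshold independently (multi-label friendly).
logits = np.array([2.0, -1.5, 0.3])   # hypothetical scores for 3 classes
labels = np.array([1.0, 0.0, 0.0])    # ground-truth class vector
loss = bce_loss(logits, labels)
```

Because each class is scored independently, overlapping categories do not compete for a single softmax probability mass.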
Feature extraction
Based on the effectiveness of CNNs for feature extraction, YOLOv3 still uses a CNN to extract features. YOLOv3 integrates ideas from YOLOv2, darknet-19, and ResNet to design its feature extraction network. Convolution layers of 3 × 3 and 1 × 1, which perform well, are used, and convolutions with a stride of 2 replace the pooling layers. Scale-invariant features are transmitted to the next convolution level, and shortcut connections are added. Batch normalization and dropout operations are added after each convolution level. The feature extraction network has 53 convolution layers, hence its name darknet-53. Comparing darknet-53 with other networks on ImageNet, its Top-1 accuracy reaches 77.2%.
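The stride-2 convolution that replaces pooling, and the shortcut connection, can be illustrated with a minimal single-channel NumPy convolution (an illustrative sketch, not darknet-53's actual layers; the kernels and input are arbitrary):

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Minimal valid-padding 2-D convolution (single channel), for illustration."""
    k = w.shape[0]
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(x[i*stride:i*stride+k, j*stride:j*stride+k] * w)
    return out

x = np.random.rand(8, 8)
w = np.ones((2, 2)) / 4.0
down = conv2d(x, w, stride=2)  # stride-2 conv halves the map, replacing pooling

# A shortcut connection adds the block input to the block output
# (same-size feature maps), as in ResNet:
y = conv2d(np.pad(x, 1), np.ones((3, 3)) / 9.0)  # 3x3 'same' conv via padding
residual = x + y
```

The stride-2 convolution both downsamples and learns its weights, whereas a pooling layer only downsamples.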
Usually, detection targets have different scales, so the network must be able to detect objects of different sizes simultaneously. However, as the network depth increases, the feature map shrinks gradually; the smaller the target, the harder it is to detect. To detect objects of different sizes at the same time, YOLOv3 adopts the idea of FPN and uses up-sampling and feature fusion to detect objects on feature maps of different sizes, which improves the detection performance for small targets. The YOLOv3 network structure is shown in Figure 1. The detection layers are layers 79, 91, and 103, which detect defects on multi-scale feature maps.

The structure of YOLOv3 network.
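The multi-scale fusion described above can be sketched at the shape level in NumPy (channel counts are hypothetical, and the extra convolutions YOLOv3 applies before and after the merge are omitted):

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical backbone feature maps for a 416 x 416 input:
deep    = np.random.rand(512, 13, 13)  # stride 32
mid     = np.random.rand(256, 26, 26)  # stride 16
shallow = np.random.rand(128, 52, 52)  # stride 8

# FPN-style merge: upsample the deeper map and concatenate along channels,
# so each detection scale sees both semantic and geometric information.
merged_mid = np.concatenate([upsample2x(deep), mid], axis=0)
merged_shallow = np.concatenate([upsample2x(merged_mid[:256]), shallow], axis=0)
```

Detection heads then run on each merged map, so small defects are predicted on the high-resolution 52 × 52 scale.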
Proposed network model
Prior box determination
In YOLOv3, the idea of anchor boxes used in Faster R-CNN is introduced.
Applying larger prior boxes on smaller feature maps can better detect larger objects. The sizes of some defective target boxes are shown in Figure 2. It is difficult to obtain accurate target information directly using the original prior boxes of YOLOv3.

The size of partial sample defect.
There is an overlap between the predicted border and the actual border; the larger the overlap area, the better the model prediction. The overlap can be quantitatively analyzed by calculating the intersection-over-union (IoU).
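IoU can be computed directly from box coordinates; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175, about 0.143
```

An IoU of 1 means the predicted box coincides exactly with the ground truth; 0 means no overlap at all.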
After labeling the gray cloth data set, cluster analysis is carried out, and the relationship between the number of clusters and the average IoU is analyzed.

The clustering results of gray cloth.
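The dimensional clustering can be sketched as k-means with a 1 − IoU distance, treating each box as a width–height pair anchored at the origin (a simplified sketch; the box dimensions below are illustrative, drawn from the sizes reported later for the lattice data):

```python
import numpy as np

def wh_iou(wh, centroids):
    """IoU between (w, h) boxes and centroids, both anchored at the origin."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    area_a = wh[:, 0] * wh[:, 1]
    area_b = centroids[:, 0] * centroids[:, 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster box dimensions with the 1 - IoU distance used for YOLO anchors."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)  # min(1 - IoU) = max IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Illustrative defect-box dimensions (pixels); real use reads the labeled data set.
boxes = np.array([[42, 8], [10, 56], [28, 30], [15, 83], [53, 31], [12, 146]])
anchors = kmeans_anchors(boxes, k=3)
```

Using 1 − IoU instead of Euclidean distance keeps large boxes from dominating the clustering, so the resulting centroids track box shape rather than raw size.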
Network optimization
The CNN extracts target features through layer-by-layer abstraction. One important concept here is the receptive field. If the field is too small, only local features can be observed; if it is too large, too much invalid information is included. Therefore, various multi-scale model structures have been designed, mainly the image pyramid and the feature pyramid. The specific network architectures can be divided into the following: (1) multi-scale input, (2) multi-scale feature fusion, and (3) multi-scale features with predictive fusion.
In the YOLO model, the third structure is used for target detection: prediction is performed at different feature sizes and the results are fused. This structure is represented by the FPN in target detection, which adds high-level features to the adjacent low level to form new features, with each layer making its own predictions. In the YOLOv3 model, when the input image size is 416 × 416 pixels, target recognition is performed on feature maps of sizes 13 × 13, 26 × 26, and 52 × 52, respectively. Smaller prior boxes are used for defect detection on the larger feature maps.
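The mapping from input size to the three grid sizes follows directly from the network strides:

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    """Feature-map sizes for YOLOv3's three detection scales."""
    return [input_size // s for s in strides]

scales = grid_sizes(416)  # [13, 26, 52]
```

Each scale divides the image into that many grid cells, and every cell predicts boxes from the prior boxes assigned to its scale.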
The deeper features in CNN have a large receptive field and rich semantic information. The deeper features are robust to the attitude change, occlusion, and local deformation of the object, but due to the reduction of resolution, geometric details are lost. On the contrary, the shallow features have very small receptive fields and rich geometric details. The resolution is high but the semantic information is relatively scarce. In a CNN, the semantic information of objects can appear in different layers, which is related to the size of the detected objects.
For small objects, shallow features contain useful details. As the number of layers increases, the large receptive field may cause the geometric details in the extracted features to disappear completely, making small objects difficult to detect through deep features. For large objects, semantic information appears in the deeper features.
In the process of network target recognition, the low-level features have rich details of the target and location information, while the high-level targets have rich semantic features. Through the multi-layer convolution and pooling process, the details and location information of the target are gradually reduced, whereas the semantic information is increasing. Figure 4(a) shows the input data, and the feature extraction is performed using a CNN. The low-level features are shown in Figure 4(b) and the high-level ones are shown in Figure 4(c).

(a) Input data, (b) low-level features, and (c) high-level features.
In Figure 4, the defect type of the input data is scratch, and the defect area is small. The defect contour area is obviously different from the normal part. However, with the deepening of convolution and pooling, image texture features will become more and more blurred, which will increase the difficulty of defect recognition. Therefore, feature fusion can be used to detect defects. Combined with the image pyramid, the high-level information obtained by the up-sampling is merged with the low-level features to obtain feature maps of different scales, and the detection layer is added to improve the network structure.
The target detection structure of the improved network model is shown in Figure 5, where the numbers on the left side indicate the number of repeating units. A detection layer is added at the feature map size of 104 × 104. The dimensional clustering of target boxes in the data set is carried out, and the resulting cluster centers are used as the prior boxes of the detection layers.

The framework of the target detection.
Results and analysis
Experimental environment and data
YOLOv3 is a representative multi-scale target detection algorithm that can take into account both small and large targets and performs well on small targets. The Ubuntu operating system is used in the experiments. The processor is an Intel® Core™ i7-6800K CPU @ 3.40 GHz, the memory is 125.8 GiB, and the graphics card is a GTX 1080 Ti. The darknet framework is configured in the experimental environment. It is a relatively lightweight open-source deep learning framework based on C and CUDA. Its main features are easy installation, no dependencies, very good portability, and support for both CPU and GPU computing.
The test results of the YOLO model are greatly influenced by the samples, which need to be diverse and representative. Taking gray cloth and lattice fabric as research objects, defect images were collected with an industrial camera. The defect database is formed by enhancing and expanding the data set by means of rotation and contrast enhancement. The data set is divided into a training set and a test set. The number of defect samples is shown in Figure 6, where the abscissa is the defect category and the ordinate is the number of samples.

Defect sample number: (a) gray sample number and (b) lattice sample number.
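The rotation and contrast enhancement used to expand the data set can be sketched in NumPy (a simplified stand-in for the actual augmentation pipeline; the rotation angle and contrast factor are arbitrary choices):

```python
import numpy as np

def augment(img, k_rot=1, contrast=1.5):
    """Expand a data set by rotation and contrast enhancement (NumPy sketch)."""
    rotated = np.rot90(img, k=k_rot)  # rotate by k_rot * 90 degrees
    mean = img.mean()
    # Stretch pixel values about the mean, clipped to the valid 8-bit range.
    stretched = np.clip((img - mean) * contrast + mean, 0, 255)
    return rotated, stretched

img = np.random.randint(0, 256, (416, 416)).astype(np.float32)
rot, con = augment(img)
```

Each augmented copy keeps the defect's appearance while varying its orientation and intensity, which helps the model generalize beyond the original capture conditions.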
The main defect types in gray cloth include scratch, foreign matter, and fold. The scratches are mainly characterized by fine stripes. The foreign matter is expressed as a region with obvious contrast with the background color. The fold is characterized by partial protrusion or depression. The three types of defects have obvious differences in appearance, and each type of defect sample is similar.
The lattice fabric mainly contains three types of defects: ribbon yarn, broken ends, and hole. Compared with ribbon yarn, broken ends have more broken yarns, and the appearance of broken ends is rectangular. Hole is the defect in some areas of the sample. Each type of defect has similarities, and there are significant differences between the three types of defects.
We name the images according to a uniform rule and set the image size to 416 × 416 pixels. The labelImg software is used to mark each image with the defect category and position. The defect is marked in the image, and a corresponding .xml file is generated, which contains the file name and the ground truth of the corresponding image. To reduce the amount of calculation, the ground truth is normalized to the range 0–1.
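The normalization of a labeled ground truth to the 0–1 range can be sketched as follows, converting the .xml corner coordinates to YOLO's normalized center-size form (the example box is hypothetical):

```python
def to_yolo(box, img_w, img_h):
    """Convert a ground truth (xmin, ymin, xmax, ymax) in pixels to the
    normalized (x_center, y_center, width, height) form in the range 0-1."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0 / img_w,
            (ymin + ymax) / 2.0 / img_h,
            (xmax - xmin) / float(img_w),
            (ymax - ymin) / float(img_h))

norm = to_yolo((100, 50, 300, 150), 416, 416)  # hypothetical defect box
```

Normalizing makes the targets independent of image resolution, so the same labels work at any input scale.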
Network training
Using the gray cloth data as the experimental object, the prior boxes of the YOLOv3 network are modified according to the clustering results, and the experiment compares the original YOLOv3 with the improved network model. The initial learning rate is 0.001 and the total number of iterations is 8000 steps. The learning rate is reduced to 0.0001 and 0.00001 at 7000 and 7500 steps, respectively. Images are iterated in batches of 64. The parameters are initialized from pretrained weights to accelerate the convergence of the loss function. During training, the curves of loss and IoU are drawn as shown in Figure 7, where the abscissa is the number of iterations and the ordinate is the loss value or the mean IoU.

Curve of the training process: (a) loss value curve and (b) IoU curve.
From Figure 7, it can be seen that as the number of iterations increases, the average loss value tends toward zero. At about 2500 iterations, the loss value has decreased to 0.01, with the improved method decreasing more rapidly. The IoU of the predicted box and the actual border is close to 1.
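The step learning-rate schedule described above can be written as a small helper (a sketch of the schedule itself, not the darknet configuration that implements it):

```python
def learning_rate(step):
    """Step schedule: 0.001 initially, 0.0001 from step 7000,
    0.00001 from step 7500, over 8000 total iterations."""
    if step < 7000:
        return 0.001
    if step < 7500:
        return 0.0001
    return 0.00001
```

Dropping the rate late in training lets the weights settle into a minimum after the loss has largely converged.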
Network testing
To verify the accuracy of the model, the trained model is tested on the gray cloth test set; the detection results are shown in Table 1. The actual number of defects is counted, and the number and rate of false detections are calculated. A portion of the test results is shown in Figure 8.
The test results of gray cloth.

The inspection results of gray cloth.
From Table 1, it can be seen that the total error detection rate of the improved network model is 2.19%, while that of the original network model is 4.39%. The improved model is more accurate than the original, reducing the error rate by 2.2%. Because scratch defects in gray cloth are small and close in appearance to the gray cloth background, the error detection rate of scratch is higher, while those of foreign matter and fold are lower.
On this basis, the improved network model is used to detect defects in lattice fabric. Dimensional clustering analysis was carried out on the defect labels of the samples. With 12 clusters, the cluster centers selected as prior boxes are (42, 8), (10, 56), (28, 30), (15, 83), (11, 128), (10, 169), (53, 31), (12, 146), (132, 21), (124, 23), (27, 108), and (35, 102). The accuracy of the model is verified on the test set. The test results are shown in Table 2, and partial test results are shown in Figure 9.
The test results of lattice.

The test results of lattice.
From Table 2, we can see that the total error detection rate of the improved network model is 1.76% and that of the original network model is 3.28% on the lattice data set. The improved model is more accurate than the original, and the error rate is decreased by 1.52%. Because the broken ends in the fabric are small and their color is close to the background of the checked fabric, the false detection rate of ribbon yarn is higher. The characteristics of holes are more distinct from the background of the test samples, so their detection rate is higher and their error rate is only 0.25%.
As each type of defect in the test set has similarity, and there are obvious differences between different defects, the false detection rate is low, and the error rate is mainly caused by missed detection.
In the network model testing, some missed samples are shown in Figure 10, where the arrows indicate the defect locations. For the gray cloth samples, when the amplitude of a wrinkle defect is small, it is difficult to locate and classify the defect during detection. Defects such as holes and foreign objects are also difficult to locate when they lie at the edge of the sample or are small: an edge defect is only partially contained in the detection area, so its features are incomplete and hard to detect. For the lattice samples, the detection accuracy is improved, but some samples remain difficult to detect because of the low contrast between the defect and the background color and the small proportion of the defect in the total sample area.

Partial missed sample.
To verify the performance of the network, the improved network model was compared with other networks on the test data set in terms of average accuracy. Under the same parameters, the experimental results are shown in Figure 11. The improved network model achieves a large improvement on both the gray cloth and the lattice fabric and can better detect fabric defects.

Test results of network models.
During the experiments, the average detection speed of the samples is calculated for comparison. FPS denotes the number of image frames processed per second, and we use it to evaluate network detection speed. On our experimental platform, with input samples of 416 × 416 pixels, the original YOLOv3 model runs at an average of 27.7 fps, while the improved network model runs at 21.8 fps. The improved model's detection time increases slightly, but its missed detection rate decreases: the total error detection rate drops from 4.39% to 2.19% on the gray fabric and from 3.28% to 1.76% on the lattice fabric.
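FPS can be measured by timing the detector over a batch of samples (the stand-in detect function below is hypothetical; a real measurement would call the trained network on actual images):

```python
import time

def measure_fps(detect, samples):
    """Average frames per second of a detector over a list of samples."""
    start = time.perf_counter()
    for s in samples:
        detect(s)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed

# Hypothetical stand-in detector, just to exercise the timing loop.
fps = measure_fps(lambda s: sum(s), [[1, 2, 3]] * 100)
```

Averaging over many samples smooths out per-frame timing noise, which matters when comparing models whose speeds differ by only a few frames per second.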
Conclusion
The YOLOv3 network model is applied to fabric defect detection. To solve the problem that the initial anchor boxes in the YOLOv3 model are not suitable for fabric defect detection, the prior boxes are redetermined by clustering the defect data set, and a detection layer is added to improve the detection of small defects. Experiments on gray cloth and lattice fabric show that the improved model reduces the total error detection rate compared with the original YOLOv3.
