Abstract
Keywords
Introduction
With the rapid development of network infrastructure and mobile devices, video transmission on the Internet is becoming more and more important, especially in Internet of things (IoT) environments. Streaming media technology allows a video to be displayed while it is still being downloaded, 1 which makes it the preferred technology for all kinds of network video applications and has gained wide user acceptance. Mobile intelligent terminal equipment, such as smart phones and tablet personal computers (PCs), has natural advantages in socializing, online video display, video interaction, and many other application scenarios, and has become the mainstream carrier of network video.
However, the network stability of mobile devices is poor and easily affected by factors such as signal fluctuation, shielding by buildings, and interference at the signal source. The negative effects on network transmission are a high packet loss ratio and network delay; the negative effects on user experience are video lag and poor real-time playback, especially when the frame rate and resolution are high.
The main reason for the above problems is that the Internet provides only best-effort service: packets are dropped actively when network congestion appears. This strongly affects real-time streaming media transmission and cannot ensure the quality of video display. 2 In order to enable real-time streaming media to adapt to the characteristics of the Internet and improve the quality of streaming media services, researchers have put forward several solutions and optimization schemes. In the aspect of transmission rate control, Parikh and Kim 3 proposed a scalable approach to mitigate different levels of packet loss by classifying packets in real networks and applying different loss-handling methods to different types of packet loss. Bansal and Jain 4 proposed a method to improve receiver quality by controlling the transmitter based on packet-loss feedback from the real-time transport protocol (RTP) receiver. Perkins and Singh 5 defined a set of minimal “circuit breaker” constraints to control the RTP transmitter, which can protect the network from excessive congestion and enhance the user’s multimedia experience. Yang and Meng 6 proposed a video transmission method based on feedback information, which can avoid transmission delay jitter, reduce the packet loss rate, and prevent network congestion according to the network parameters and information collected by RTP/real-time transport control protocol (RTCP). In the aspect of packet-loss processing, Shen et al. 7 proposed a video transmission method based on the forward error correction (FEC) flag to control the occurrence of packet loss in the FEC coding network. Melliar-Smith et al. 8 proposed a method that recovers burst and random lost packets by parity checking the packets during real-time multimedia communication. Frnda et al. 9 analyzed packet loss and delay in different situations and proposed a quality-of-service (QoS) model for estimating the triple play service, which can improve the quality of video service. In the aspect of buffer mechanisms, Lin et al. 10 proposed a method to recover lost packets using a buffer mechanism of periodic synchronization frames. In the aspect of data source and receiver optimization, Singh et al. 11 proposed a multipath algorithm for real-time streaming, which uses an RTP scheduling algorithm across multiple paths at the sending end and a corresponding jitter algorithm at the receiving end to improve transmission quality.
Meanwhile, security problems exist during the transmission process, such as video forgery, a technique for generating fake video by altering, combining, or creating new video contents. 12 To address this problem, in 2015, Patel and Patel 13 proposed methodologies that use exchangeable image file format (EXIF) image tag information to detect the forged region frames of a given input video. Bozkurt et al. 14 constructed a correlation image using binarized discrete cosine transform (DCT) features extracted from the frames, and then estimated the exact location of the forgery line on the correlation image to detect the forgery. Sitara and Mehtre 15 proposed a frame-shuffling detection method that exploits abnormalities in the spatio-temporal and compressed domains, which can localize and differentiate the type of tampering present in the video. Mathai et al. 16 proposed a video-forgery detection and localization method based on statistical moment features and the normalized cross-correlation factor. Yao et al. 17 detected object-based forgery in advanced video by deep learning.
Most of the above research improves only a single link of the entire real-time video play process, such as optimal transmission of encoded video, improvement of the transmission process, or simple processing at only the sender or receiver. Moreover, these works consider only the improvement of quality or only the security problem. There is little research that considers and optimizes real-time video streaming sending, transmission, decoding, and security as a whole. Especially in heterogeneous, low-bit-rate, high-packet-loss, strongly interfered, and wireless network environments, real-time streaming media transmission technology needs further research and improvement.
In this article, we analyze the whole real-time video display process and propose a method based on the RTP protocol to improve the quality of H.264 real-time video display. The method performs re-ordering of video RTP packets and retransmission of missing keyframes by taking into account the network condition, the video resolution, and the frame rate. It can effectively improve the quality of real-time video display under poor network conditions while ensuring real-time performance. We then discuss the security problem and propose a detection algorithm for real-time video based on a time-related token to solve the problem that the video may be tampered with.
Real-time video streaming display process analysis
In the process of real-time video display, video coding and streaming media transmission are the two most important links. Currently, the most widely used video coding compression standard is H.264/MPEG-4 AVC (H.264). The streaming media transmission protocol usually selects the RTCP/RTP protocol to meet the higher real-time requirements.
Based on the traditional hybrid coding framework with the predictive-transform pattern, H.264 improves the compression rate by using multi-reference frame prediction, motion vectors of 1/4 pixel precision, integer transformation, and intra spatial estimation, but has the drawback of being susceptible to transmission errors. 18 A single bit error may cause serious degradation of the decoding quality or even make the video fail to decode. Delayed packets may be discarded by the decoder because their time has expired.
H.264 encoding is structurally divided into the video coding layer (VCL) and the network abstraction layer (NAL). The VCL carries video-encoded data, and the NAL is responsible for packaging and transmitting the data. An H.264 video file consists of a set of network abstraction layer units (NALUs), which contain the encoder output as packaged video data. 19 According to the type of slice included in the NALU, there are three main types of H.264 frames: I frames, P frames, and B frames. An I frame, with complete decoding information, is independent and has a high data volume; it can generate a complete picture on its own. A P frame uses one-way inter-frame prediction coding, which refers to the previous I frame or P frame, and has a smaller data volume; errors in or loss of the reference frame can prevent a P frame from being decoded normally. A B frame adopts bidirectional inter-frame prediction coding, which needs both the preceding and following frames as reference frames. 20 The instantaneous decoder refresh (IDR) frame is a special I frame: the P frames and B frames that follow it will not use any frame before the IDR frame as a reference.
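The frame categories described above can be recognized from the NAL header, the first byte of each NALU. A minimal sketch (the helper names are ours; the type codes follow the H.264/RFC 6184 numbering):

```python
# NAL header layout: 1 forbidden bit, 2 bits nal_ref_idc, 5 bits nal_unit_type.
NAL_SLICE = 1   # coded slice of a non-IDR picture (P/B frame data)
NAL_IDR = 5     # coded slice of an IDR picture (keyframe)
NAL_SPS = 7     # sequence parameter set
NAL_PPS = 8     # picture parameter set
NAL_FU_A = 28   # fragmentation unit A (RTP payload, RFC 6184)

def nal_unit_type(nalu: bytes) -> int:
    """Extract the 5-bit nal_unit_type from the NAL header byte."""
    return nalu[0] & 0x1F

def is_keyframe(nalu: bytes) -> bool:
    """IDR slices must not be dropped; they reset the reference chain."""
    return nal_unit_type(nalu) == NAL_IDR
```

For example, a NALU starting with byte 0x65 is an IDR slice, while 0x41 is an ordinary non-IDR slice.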
B frames are not essential during the transmission and display process. P frames are not strictly essential either, but are generally indispensable. I frames and IDR frames are indispensable. During transmission, the loss of any type of data frame leads to decoding errors, which manifest as flower screen (mosaic artifacts), frame skipping, and so on.
RTP achieves real-time end-to-end transmission services on top of the user datagram protocol (UDP). When the data transfer volume is large, packet re-ordering and packet loss may occur. Especially when the video resolution and frame rate are high or the network condition is poor, NALU fragments are more likely to be lost, which leads to incomplete video data frames.
Specifically, when using RTP to transmit the NALUs of the video stream, the sender encapsulates each NALU into a series of RTP packets. The sequence number is increased by one for each RTP packet sent, so consecutive packets of the stream carry consecutive sequence numbers, and ideally they arrive at the receiver in exactly this order.
In that ideal case, the received packet is exactly the next packet to be unpacked, so it can be parsed and displayed directly. In real cases, however, due to the complexity and dynamics of the network, the transmission paths of the packets are not the same, and neither are their arrival times at the receiving end. It often happens that some later packets arrive before earlier ones, so the out-of-order problem is inevitable. When that happens, the receiver receives packets whose sequence numbers are not in increasing order.
As a consequence, at the receiver end there must be a packet re-ordering process. For a received packet, if its sequence number equals that of the next packet to be unpacked, it can be processed immediately; otherwise, it must be cached until the missing packets arrive.
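The re-ordering process just described can be sketched as a small buffer keyed by sequence number. This is an illustrative simplification (the class and names are ours; a real player also bounds the waiting time, as the sliding window discussed later does):

```python
class ReorderBuffer:
    """Cache out-of-order RTP packets and release them in sending order."""

    def __init__(self, first_seq: int):
        self.expected = first_seq   # sequence number of the next packet to unpack
        self.pending = {}           # seq -> payload, the out-of-order cache

    def push(self, seq: int, payload: bytes):
        """Cache a packet; return payloads that are now deliverable in order."""
        self.pending[seq] = payload
        out = []
        while self.expected in self.pending:
            out.append(self.pending.pop(self.expected))
            self.expected = (self.expected + 1) & 0xFFFF   # 16-bit seq wraparound
        return out
```

Pushing packet 101 before packet 100 yields nothing; once 100 arrives, both are released in order.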
In addition, the NALUs of high-definition video encoded by H.264 are generally large. Because of the limit of the maximum transmission unit (MTU), RTP needs to adopt the FU-A or FU-B mode to package the video data (FU-A and FU-B are two versions of the fragmentation unit, identified by NALU type numbers 28 and 29, respectively). 21 The FU-A or FU-B approach essentially divides one NALU into multiple RTP packets, which increases the likelihood of out-of-order delivery or packet loss during transmission and aggravates the decline of display quality.
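FU-A fragmentation splits one NALU across several RTP payloads. A sketch of the sender side following RFC 6184 (the payload-size constant is an assumption; real stacks derive it from the path MTU):

```python
MTU_PAYLOAD = 1400   # assumed usable RTP payload size under a typical Ethernet MTU

def fragment_fu_a(nalu: bytes, max_size: int = MTU_PAYLOAD):
    """Split one NALU into FU-A payloads (RFC 6184, NALU type 28).
    Each payload starts with an FU indicator byte and an FU header byte."""
    header = nalu[0]
    fu_indicator = (header & 0xE0) | 28          # keep F/NRI bits, set type = 28
    nal_type = header & 0x1F                     # original type goes in the FU header
    body = nalu[1:]                              # original NAL header byte is dropped
    chunks = [body[i:i + max_size] for i in range(0, len(body), max_size)]
    payloads = []
    for i, chunk in enumerate(chunks):
        start = 0x80 if i == 0 else 0            # S bit marks the first fragment
        end = 0x40 if i == len(chunks) - 1 else 0  # E bit marks the last fragment
        fu_header = start | end | nal_type
        payloads.append(bytes([fu_indicator, fu_header]) + chunk)
    return payloads
```

The receiver reverses this: it strips the two FU bytes, concatenates the fragments, and restores the NAL header from the indicator and FU header bits.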
Compared with the desktop environment, mobile terminals are weaker in computing, storage, and network performance, so the impact on video playback quality is more obvious. Therefore, special treatment is needed to ensure the video display effect.
Video data reconstitution algorithm
In this section, an improved video packet reconstitution algorithm is proposed. The core idea of the algorithm is to provide a variable time range for the RTP packets that belong to the same NALU. All out-of-order and packet-loss problems of the RTP packets within an NALU are handled in this time range. If the NALU is a keyframe, retransmission of the lost RTP packets is supported within the variable time range.
When the player end receives RTP packets, a hash table is used to cache them. In the hash table, the RTP packet sequence number and the RTP packet form a key-value pair: when the player receives an RTP packet, its sequence number and content are inserted into the hash table as a key-value pair. The player end keeps taking the associated RTP packets from the hash table to restore NALUs and decode them, as shown in Figure 1.

Using a hash table to cache RTP packets.
As can be seen from Figure 1, an NALU may be packaged into one or more RTP packets. Therefore, at the receiving end, several RTP packets may have to be received and processed to restore a single NALU.
During the video transmission, the timestamps of the RTP packets belonging to an NALU are the same, and the RTP packet sequence number is continuous. Therefore, when the RTP packet is received, if the timestamp of the packet is changed, it indicates that an RTP packet of a new NALU has been received. And previously received RTP packets can be composed of a complete NALU, which can be directly provided to the decoder for decoding and display.
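The timestamp rule above can be sketched as a small grouping routine: packets sharing a timestamp belong to one NALU, and a timestamp change closes the previous NALU (a simplified illustration with names of our own; it assumes roughly ordered arrival):

```python
def group_nalus(rtp_packets):
    """Group (seq, timestamp, payload) tuples into NALUs.
    All packets of one NALU share a timestamp; a new timestamp means the
    previous NALU is complete and can be handed to the decoder."""
    current_ts, current = None, []
    for seq, ts, payload in rtp_packets:
        if current_ts is not None and ts != current_ts:
            yield current_ts, current        # previous NALU is complete
            current = []
        current_ts = ts
        current.append((seq, payload))
    if current:                              # flush the final NALU
        yield current_ts, current
```

For instance, two packets with timestamp 100 followed by one with timestamp 200 produce two groups, one per NALU.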
Assume that the sequence number and timestamp of the currently received RTP packet are S and T, respectively, and let U denote the set of cached RTP packets whose timestamp equals T.
When the number of elements of the set U equals the number of RTP packets into which the NALU was fragmented, the NALU is complete: it can be restored and submitted directly to the decoder.
When the number of elements of the set U is smaller than the number of fragments, some RTP packets of the NALU are missing or have not yet arrived, as illustrated in Figure 2.

RTP transmission of H.264 NALUs.
When the elements in the set U are incomplete, the sliding window mechanism is used to wait for the late packets.
The sliding window achieves order correction by waiting for packets that have small sequence numbers but arrive late. The sliding window has two time boundaries: the left boundary T_l, which is set at the arrival time of the first RTP packet of the NALU, and the right boundary T_r = T_l + w, where w is the window length, chosen according to the network condition, the video resolution, and the frame rate.
The handling of a received packet is determined by the difference d between its sequence number and the sequence number of the next packet to be unpacked, compared with the window size w.
When 0 < d ≤ w, the packet falls within the window; it is cached in the hash table, and the player waits for the missing packets before unpacking in order.
When d ≤ 0, the packet is either the expected packet (d = 0), which is unpacked directly, or a duplicate or expired packet (d < 0), which is discarded.
When d > w, the missing packets are regarded as lost; if they belong to a keyframe, a retransmission request is sent to the sender, otherwise the incomplete NALU is discarded and the window slides forward.
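The window cases can be condensed into a single decision function (an illustrative sketch; the names and the 16-bit wraparound handling are our assumptions):

```python
def classify_packet(seq_recv: int, seq_expected: int, window: int) -> str:
    """Decide how to handle a received RTP packet according to the
    sliding-window cases described in the text.
    Returns one of: 'unpack', 'cache', 'discard', 'loss'."""
    d = (seq_recv - seq_expected) & 0xFFFF        # 16-bit sequence arithmetic
    if d >= 0x8000:                               # seq_recv is behind: late duplicate
        return 'discard'
    if d == 0:
        return 'unpack'                           # exactly the packet we waited for
    if d <= window:
        return 'cache'                            # in-window: wait for the gap to fill
    return 'loss'                                 # gap too large: treat as packet loss
```

With a window of 5 and expected sequence 10, packet 12 is cached, packet 30 triggers loss handling, and packet 9 is discarded as expired.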
The entire pseudo-code of the algorithm based on the sliding window and keyframe is as follows:
Begin
  P = recvRTPFromNet()
  insert P into the hash table keyed by its sequence number
  unpack and deliver every consecutive packet starting from the expected sequence number
  if the gap to the cached packets exceeds the window w then
    if the missing packets belong to a keyframe then request retransmission
    else discard the incomplete NALU and slide the window forward
End
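A runnable Python sketch of the loop above (recvRTPFromNet is replaced here by an in-memory packet list so the logic is testable; the retransmission request is only recorded, not actually sent):

```python
def reassemble(packets, window=8):
    """Sketch of the reassembly loop: cache RTP packets in a hash table
    (a dict), deliver payloads in sequence order, and record the gaps that
    would trigger a keyframe retransmission request.
    `packets` is a list of (seq, payload) pairs in arrival order."""
    cache, delivered, retransmit = {}, [], []
    expected = min(seq for seq, _ in packets)    # first sequence number of the session

    def flush():
        nonlocal expected
        while expected in cache:                 # pop every now-in-order packet
            delivered.append(cache.pop(expected))
            expected += 1

    for seq, payload in packets:
        cache[seq] = payload
        flush()
        if cache and max(cache) - expected > window:
            # gap exceeded the window: ask the sender to resend it, then move on
            retransmit.extend(range(expected, min(cache)))
            expected = min(cache)
            flush()
    return delivered, retransmit
```

Out-of-order packets within the window are silently re-ordered; only gaps wider than the window produce retransmission requests.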
In practical applications, the video transmission and decoding processes of a multimedia application are mainly implemented by the developers themselves. Our method can be used in the stages of getting video data from the network video recorder/digital video recorder (NVR/DVR), packetization and transmission, and packet reassembly and display. The proposed method does not rely on NVR/DVR manufacturers to modify or update their products, so it is very easy to deploy.
Video-forgery detection method
Due to the lack of security design in existing video surveillance technology, attackers can exploit the following methods to achieve their purposes.
The attacker can implant a virus, Trojan horse, or other malicious code in a monitoring computer or other device to make the monitoring terminal continuously play a non-real-time fake video.
The attacker can transmit a non-real-time pseudo-monitoring video to the monitoring terminal by attacking some point between the camera and the monitoring terminal.
All these attack methods cheat the monitoring staff so that the real situation of the monitored location cannot be observed.
The video security problem during transmission can be viewed as the problem of reliably distinguishing tampered videos from untampered originals. The diagram of video forgery is shown in Figure 3. Assume that frame insertion occurs: frame-m to frame-n is the forged part, and the connections lie between frame-a and frame-m and between frame-n and frame-b. No matter what inter-frame forgery method is used, the frames around a connection must look the same, leaving no visible traces. 22 That means frame-a and frame-m should be the same, and so should frame-n and frame-b.

Video-forgery diagram.
To solve this problem, we propose a video-forgery detection method based on a time-related token. The main idea is to add a field between the RTP packet header and the video data to carry the token. A token should be related to time, must not be computable from the previous token, and must be verifiable on the other side. Moreover, the token values in the RTP packets of one frame (P/I frame) are the same, and the token values of RTP packets in different frames are different. In real-time video transmission, both the sender and the receiver can calculate the token of each frame separately. In this way, when receiving RTP packets, the receiver can verify the consistency of the token in the received packets with its own calculated token, and thus detect the inserted forged video when an attack occurs.
How to generate the token is the key point of this method. We propose an algorithm that meets the requirement, as shown in Figure 4. The generation of tokens must be associated with the start time of playing, to tie the tokens to one video play process. For one play process of real-time video, the token generation sequence must be the same at the video sender and receiver; that is, the token generated by the sender must be consistent with the token generated by the receiver. To achieve this, we design a function for generating random number sequences from seeds, such that the same seed produces the same random number sequence. In this way, we can provide the same seed to the sender and receiver during one transmission, so that they generate the same token at the corresponding step. In this method, attention must be paid to the processing and verification of tokens when packet loss and I-frame retransmission occur.
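The paper does not fix a concrete token construction; one way to meet the stated requirements (time-related, not derivable from the previous token, verifiable by both ends) is an HMAC keyed by a seed that both ends derive from the agreed play start time. A sketch with names of our own:

```python
import hashlib
import hmac

def make_token(seed: bytes, frame_index: int, length: int = 8) -> bytes:
    """Token for one video frame: HMAC-SHA256 over the frame index,
    keyed by a seed both ends derive from the play start time.
    Knowing token(i) does not reveal token(i+1) without the seed."""
    msg = frame_index.to_bytes(8, 'big')
    return hmac.new(seed, msg, hashlib.sha256).digest()[:length]

def verify_frame(seed: bytes, frame_index: int, received: bytes) -> bool:
    """Receiver side: recompute the expected token and compare in
    constant time to avoid leaking information to an attacker."""
    expected = make_token(seed, frame_index, len(received))
    return hmac.compare_digest(expected, received)
```

A frame whose token fails verify_frame is flagged as a possible insertion forgery; an attacker who captures earlier tokens still cannot forge the next one without the seed.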

The diagram of generating token sequence.
Experiment results and analysis
The camera used in our experiment is a DS-2CD4232FWD-I (HIKVISION), and three kinds of real-time video stream are used as sample data. The encoding settings are (A) 15 fps, 720 × 480; (B) 15 fps, 1280 × 720; and (C) 15 fps, 1920 × 1080. An 8-min video stream of each of A, B, and C is chosen as the sender's video input source.
The experiment is carried out on an internal network; the sender PC and the player-end mobile phone are in different subnets of the intranet. The sender PC uses a wired connection, and the mobile phone uses a wireless connection. The network transmission between the two subnets does not fluctuate over a certain period of time, and all the following experiments are completed within one continuous period.
The main goal of the proposed algorithm is to improve the quality of real-time video pictures and ensure the security of real-time video in a network environment with poor transmission conditions, without significant impact on the real-time performance of the video. In order to show the advantages of the algorithm, we mainly compare our algorithm with two normal transmission scenarios. The test scenarios are:
Scene 1: UDP-based standard RTP/RTCP transmission. This transmission mode is used by OpalVOIP, sipdroid, and IMSDroid. During RTP reception, an NALU that has lost RTP packets is discarded to prevent the decoder from receiving a wrong NALU. In this experiment, we use sipdroid as the comparison sample.
Scene 2: TCP-based NALU transmission. Here, the comparison object mainly refers to the TCP-based RTMP protocol. The software using the RTMP protocol is often an SDK provided by camera manufacturers, such as HCNetSDK from HIKVISION, or open-source software such as FFmpeg and ijkplayer. We use ijkplayer as the comparison sample.
Scene 3: our improved RTP protocol method, referred to as RTP+ below.
The experiments compare our algorithm with the other methods from four aspects: video image quality, image delay, transmission efficiency, and transmission security.
Video image quality
In the same network environment, the three types of video stream (A, B, and C) are displayed in the three scenarios above. The numbers of flower-screen and discontinuous-screen events are recorded, and the comparison results are shown in Figure 5. In the figure, the horizontal axis represents the transmission and processing mode, and the vertical axis represents the number of abnormal video displays counted during the 8-min display time.

Picture quality comparison.
According to the results in Figure 5, with the increase of video resolution and data size, the advantage of the algorithm proposed in this article grows. When playing 720p video, the number of abnormal screens in scene 3 is significantly smaller than with scene 1 transmission, but there is no particular advantage over scene 2. When playing 1080p video, the image quality advantage of our method is obvious: we have less video lag than scene 2 and less flower screen than scene 1.
The algorithm in this article discards incomplete NALUs of non-I frames and retransmits incomplete I frames, which reduces the number of flower-screen and video-lag events. Discarding non-I frames may cause frame skipping, but this does not affect the overall display effect during actual monitoring.
Image delay
In order to quantify the screen delay, we define Δt as the difference between the time an I frame is provided to the decoder at the player end and the time it was sent by the sender. In the experiment, we record the sending time of selected I frames and their reception time at the player end and use Δt to measure the delay. We use 1080p, 15 fps H.264 video as the input source and compare the display effect of the different methods. For every scene, we record the sending time and the decoder-delivery time of 30 I frames; the resulting Δt values are shown in Figure 6.

Delay time comparison.
As can be seen from Figure 6, the delay of scene 2 is significantly greater than that of scenes 1 and 3, and for some NALUs it is several times larger. The processing latency of NALUs in scene 1 is minimal. In scene 3, the NALU processing delay is relatively stable, though slightly larger than in scene 1. This result can also be predicted by analysis: when data transmission is heavy, packet loss happens because of congestion. The TCP protocol performs retransmission and acknowledgment, while standard RTP/RTCP is based on UDP, which has no mechanism to guarantee data reliability. Our algorithm retransmits keyframes on top of the out-of-order handling; this retransmission has time-limited efficacy and does not guarantee that lost data can always be retransmitted. Based on these results, the proposed algorithm has obvious advantages when the video resolution and video quality are high.
Transmission efficiency
We use the data received at the player end to compare the video transmission efficiency in scenes 1, 2, and 3. The results are shown in Figure 7, where the horizontal axis represents time and the vertical axis represents the amount of transmitted data. When the video source is 480p, the transmission rates of the three scenes are similar, but as the video resolution increases, the algorithm in this article shows a certain advantage.

Transmission efficiency comparison.
Scene 2 uses TCP to transmit, and due to the mechanisms of TCP, its video transmission speed is slower than UDP. Scenes 1 and 3 use UDP to transmit video, so their transmission speeds are faster, and the algorithm in this article has an obvious advantage over scene 2.
For video transmission in scenes 1 and 3, scene 3 needs to retransmit lost keyframes, so its average transfer rate is lower than that of scene 1. Furthermore, both scenes carry control commands: scene 1 uses RTCP to provide feedback on the data transmission, while scene 3 informs the sender to retransmit only when a keyframe is lost. RTCP packets are sent periodically, whereas keyframe retransmission requests are sent out only when a keyframe is lost, so in terms of control information, the amount of data transmitted in scene 3 is less than in scene 1.
Of course, the network environment has a great effect on the transmission rate comparison. In a very poor network environment, if RTP packet loss is frequent, the possibility of keyframe loss becomes larger and the number of retransmissions increases. If the retransmission request is implemented over UDP, the request itself may be lost due to the poor network, which reduces the number of keyframes that can actually be retransmitted, so the efficiency of our algorithm becomes almost similar to that of scene 1.
Transmission security
In the experiment, the original video sequences and the result of frame insertion are shown in Figure 8. In this case, frames a and b are fake frames inserted between frames 6 and 7. We examine the forgery with our method: it detects that the token generated by the receiver is not consistent with the token in the packet. Meanwhile, the calculation time of the token is positively correlated with the number of frames and the resolution of the original video, as shown in Table 1.

Original video sequence and result of frame insertion.
Calculation time of token with different frames and resolution.
Conclusion
The quality of real-time video transmission is often disturbed by the instability of the cyber-physical-social system (CPSS), and the security of real-time video can be compromised if fake contents are embedded into the video during transmission. Based on the video coding information and the conditions of CPSS-like mobile networks, we analyzed the receiving and display process of the real-time video stream and discussed the reassembly of received data packets at the receiving end. We then proposed an improved data packet reassembly algorithm aimed at the instability of the mobile client network and the resulting video lag, interruption, or even shutdown. The algorithm uses a hash table to cache data packets, performs keyframe retransmission for missing packets, and then uses a sliding window mechanism to handle the video transmission. According to where the difference between the unpacking sequence number and the received packet sequence number falls in the window, anomalies of the video transmission can be handled. The algorithm differs from the traditional method that sorts packets by timestamp and sequence number: it guarantees accuracy while streamlining the packet-sorting process and quickly completing sequential unpacking and display. We also proposed a method based on a time-related token to ensure the security of real-time video transmission: while receiving packets, the receiver examines whether the token in the packet is consistent with its own calculated token, and can thus detect video forgery when an attack occurs.
The experimental results show that the proposed algorithm not only improves video clarity in an excellent network environment but also improves video display in poor network environments, such as mobile terminal equipment and wireless mobile networks with poor sending and receiving signals; the improvements in fluency and accuracy are obvious. When bandwidth is poor, data transmission delay increases and the transmission time of each video frame becomes longer; when the transmission time exceeds the video display waiting time, the picture stops and waits for the data buffer. In such cases, the keyframe retransmissions in the algorithm may aggravate network congestion and further increase the delay. Therefore, how to adjust the retransmission strategy according to the bandwidth situation is a focus of follow-up research. In follow-up studies, we will examine the effects of bandwidth on the algorithm, especially under low-bandwidth conditions, how to further improve the video picture quality, and how to further reduce the screen delay.
