Abstract
1. Introduction
The notion of wireless multimedia sensor networks (WMSNs) is frequently described as the convergence of wireless sensor networks and distributed smart cameras [1]. As a result, an increasing number of video clips is being collected, creating complex data-handling challenges [2]. Moreover, some types of video data are naturally tied to geographical locations; for example, video from traffic monitoring carries little meaning without its associated location information. Most potential WMSN applications therefore require the sensor network paradigm to support location-based multimedia services while manipulating large-scale data, so as to provide a high quality of experience (QoE). This raises an important question: how can an intelligent processing method for georeferenced multimedia be devised? Although the question has been extensively addressed theoretically, a method for mapping video sequences to geographical scenes remains to be described. Meanwhile, as geographic information systems (GIS) converge with multimedia technology, a new paradigm named video-GIS has emerged [3–5]. The major research challenges facing video-GIS are the coding of georeferenced video and the content and types of services that georeferenced video should provide. Further progress is contingent on a deeper understanding of video as well as of the spatial relationships of geographic space. Video-GIS must make the relationship between video analysis methods and the real geographical scene explicit, which motivates a georeferenced multimedia intelligent processing method based on context-based random graphs.
Georeferenced video is a fundamental component of video-GIS development. Prior research on georeferenced video technologies and applications, most of it using video and GPS sensors, has been conducted. In [6, 7], Stefanakis and Peterson and Klamma et al. proposed a unified framework for hypermedia and GIS. Pissinou et al. [8] explored topology and direction in georeferenced video. Hwang et al. and Joo et al. [9, 10] defined metadata for georeferenced video that supports interoperability between GIS and video images. In the field of georeferenced video search, Liu et al. [11] presented a sensor-enhanced video annotation system that searches video clips for the appearance of particular objects. Ay et al. proposed exploiting the geographical properties of videos [12], while Wang gave a method of time-spatial images to extract basic movement information [13]. Although single media have been studied extensively, their semantics in geographic space are poorly understood. Determining the spatial relationships of video elements is one of the most important operations on georeferenced video. For instance, a moving video element changes its position, shape, size, speed, and attribute values over time; understanding how these attributes change, and the rules governing the changes, is of great significance for the geographical description of the video.
Many techniques for video event recognition have been proposed. Among model-driven methodologies, which are well established and mature, the most common conceptualization of fusion systems is the SVM model [14, 15]. However, such a methodology not only fails to handle multi-instance, diverse, and multimodal data, but also requires a large number of training samples. Most subsequent studies have used data-driven methods [16] carefully designed to extract clear and distinct semantics from videos [17–21]. In our event recognition application, we observe that some events may share common motion patterns. Beyond pattern discovery itself, data-driven methods have also been applied to social networks [22–25]. These works have shown high accuracy in differentiating videos and extracting their semantics. However, most multimedia applications involve unknown and uncertain content, which makes it extremely difficult to meet the requirements of real-time stream processing.
Previous studies have shown that multimedia intelligent processing is important to the development of video-GIS and have achieved inspiring progress. However, these solutions suffer from the classical ensemble-average limitation of analyzing only low-level characteristics; the spatial data gathered are therefore sometimes inconclusive and, in part, contradictory. These algorithms usually build or learn a model of the target object first and then use it for tracking, without adapting the model to changes in the appearance of the object (for example, large variation of pose or facial expression) or in the surroundings (for example, lighting variation). Furthermore, they assume that all images are acquired with a stationary camera. Such an approach, in our view, is prone to performance instability, which must be addressed when building a robust visual tracker.
To overcome these problems, we begin by examining models suitable for georeferenced video understanding and behavior analysis. In this paper, we propose a new event recognition framework for consumer videos by leveraging a large amount of video data. A graph structure provides a complex, dynamic, and robust framework for assembling the intricate relationships among objects [26], which suits our goal. However, a movement may exhibit multiple random behaviors, making a deterministic graph structure unsuitable for describing a real video scenario. To circumvent this problem, we adopt the random graph model, which can be seen as a simplified model of the evolution of a communication network [27]. In our research, it substantially simplifies the analysis of the interactions between video objects, revealing new insight into the relationships among objects and their complex interactions. Our analysis focuses on describing the spatial relationships bound to objects using random graph grammar in georeferenced video, developing a scientific analysis of behavior and structured methods for georeferenced video understanding.
2. Preliminary
Surveillance video data is mostly non-ortho image data, so it cannot be matched to geography scene vector data with traditional methods. To solve this problem, the paper adopts a method for mapping video scene imaging data to geography scene vector data, as shown in Figure 1. First, a virtual viewpoint camera is constructed from the camera's interior and exterior parameters. Second, a virtual image of the geography scene is rendered from the vector data through model transformation, viewpoint transformation, and pruning, following the computer graphics rendering pipeline; this yields the correspondence between objects in the virtual image and vector objects. Third, image matching based on features invariant to translation, scale, and rotation aligns the virtual image with the video image. Finally, the correspondence between the video image and the vector data is established via the correspondence between virtual-image objects and vector objects, accomplishing the mapping of the video scene to objects in the geography scene.

Process of the mapping relationship of video scene imaging data to geography scene vector data based on virtual viewpoint.
In the following part, we will introduce several preliminary key steps.
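As a minimal illustration of the virtual-viewpoint imaging step in Figure 1, the sketch below projects a vector-map point into the virtual image with a pinhole model. The function name and parameters are assumptions for illustration, not the paper's actual implementation; camera rotation is assumed to be handled separately.

```python
def project_to_virtual_image(point_world, camera_pos, f, c):
    """Project a 3-D map point into the virtual viewpoint image (pinhole model).

    point_world : (X, Y, Z) world coordinates of a vector-map feature
    camera_pos  : (X, Y, Z) camera centre from GPS
    f, c        : focal length in pixels and principal point (cx, cy)

    Illustrative sketch only: the camera is assumed axis-aligned, looking
    down the +Z axis; real use would apply the exterior orientation first.
    """
    x = point_world[0] - camera_pos[0]
    y = point_world[1] - camera_pos[1]
    z = point_world[2] - camera_pos[2]
    if z <= 0:                      # behind the virtual camera: pruned
        return None
    return (f * x / z + c[0], f * y / z + c[1])

# A point 10 m straight ahead of a camera at the origin projects to the
# principal point of the virtual image.
uv = project_to_virtual_image((0, 0, 10), (0, 0, 0), 800, (320, 240))
```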
2.1. Selection Algorithm of Multicamera Based on Spatial Correlation and Target Priority
A multicamera surveillance system should not only obtain detection and tracking information for motion elements from single cameras but also produce a coherent dynamic scene description from all observations. Meanwhile, every motion element may be tracked by several cameras simultaneously, so how to select cameras for tracking a specific target is particularly important in video sensor networks. Based on spatial correlation [28] and target priority, the paper proposes a multicamera selection algorithm with optimized task allocation that automatically selects cameras according to target priority at each moment.
The algorithm assumes that a camera with no assigned task carries out basic single-camera tracking, which has lower power consumption, and that a high-priority task can preempt a camera serving a lower-priority one. The multicamera selection algorithm is shown in Algorithm 1.
Algorithm 1: Selection algorithm of multicamera based on spatial correlation and target priority (the numbered steps operating on the set of images were lost in extraction).
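The selection logic described above can be sketched as follows, assuming a simple greedy policy: targets are served in descending priority, each taking the free camera with the highest spatial correlation, and a high-priority target may preempt a camera held by a lower-priority one. All names and the scoring interface are hypothetical, since the original listing did not survive extraction.

```python
def select_cameras(cameras, targets, correlation):
    """Greedy multicamera selection by target priority (illustrative sketch).

    cameras     : list of camera ids; each starts in low-power single-camera mode
    targets     : list of (target_id, priority); a higher number means higher priority
    correlation : dict (camera_id, target_id) -> spatial correlation score in [0, 1]
    """
    assignment = {}                                  # camera -> (target, priority)
    for tid, prio in sorted(targets, key=lambda t: -t[1]):
        best, best_score = None, 0.0
        for cam in cameras:
            held = assignment.get(cam)
            if held is not None and held[1] >= prio:
                continue                             # cannot preempt equal/higher priority
            score = correlation.get((cam, tid), 0.0)
            if score > best_score:
                best, best_score = cam, score
        if best is not None:
            assignment[best] = (tid, prio)
        # cameras left unassigned stay in basic single-camera tracking
    return assignment

corr = {('c1', 't1'): 0.9, ('c2', 't1'): 0.4,
        ('c1', 't2'): 0.8, ('c2', 't2'): 0.3}
result = select_cameras(['c1', 'c2'], [('t1', 2), ('t2', 1)], corr)
```

Here the high-priority target t1 takes its best camera c1, and t2 cannot preempt it, so t2 falls back to c2.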
2.2. Organization of Video and Location Data
We have put forward a coding model of video-GIS comprising video together with the camera's position, view direction, and viewing distance. The location data can be collected automatically by small sensors attached to the camera, such as a GPS receiver and a compass (see Figure 2), which eliminates manual work and makes the annotation process accurate and scalable. We therefore investigated the real-time collection, coding, and integration of video and GPS information on the SEED-VPM642 platform, obtaining two location-based streams at different bit rates: the lower bit-rate stream is used for live broadcast over the wireless network, and the higher bit-rate stream for storage on the hard disk.

Experimental hardware and software to acquire georeferenced video.
In the coding of video-GIS, we need to calculate the three-dimensional coordinates of the video object [29]. Since video-GIS coding based on mobile sensors cannot calibrate single video frames against a three-dimensional control field, the most effective approach is to use a digital map together with spatial geometric relations (see Figure 3).

Geometry for calibrating multiple sensors.
Therefore, the geometric relationship among GPS, posture sensor, imaging space, and object space should be built. It is assumed that the axis of imaging space
In detail,
To acquire more precise spatial locating information, we need at least the GPS information and the attitude information generated by a posture sensor. The spatial locating information is therefore described by the combination of GPS and the angular orientation elements (heading, pitch, and roll) obtained by a Micro Inertial Measurement Unit (MIMU), as shown in Table 1.
Sample of GPS and MIMU.
As shown in Table 1, there are two kinds of the spatial locating information:
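For illustration, the MIMU attitude angles in Table 1 can be turned into a rotation matrix relating sensor and world frames. The sketch below assumes a Z-Y-X (heading, then pitch, then roll) convention, which is common in navigation but is an assumption here, since the paper does not state its convention.

```python
import math

def hpr_to_matrix(heading, pitch, roll):
    """Rotation matrix from heading, pitch, roll in degrees (Z-Y-X order).

    R = Rz(heading) @ Ry(pitch) @ Rx(roll); assumed convention for
    illustration only.
    """
    h, p, r = (math.radians(a) for a in (heading, pitch, roll))
    ch, sh = math.cos(h), math.sin(h)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [ch * cp, ch * sp * sr - sh * cr, ch * sp * cr + sh * sr],
        [sh * cp, sh * sp * sr + ch * cr, sh * sp * cr - ch * sr],
        [-sp,     cp * sr,                cp * cr],
    ]

# With a 90-degree heading, the sensor's x axis maps to the world's y axis
# (the first column of the matrix becomes approximately (0, 1, 0)).
R90 = hpr_to_matrix(90.0, 0.0, 0.0)
```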
2.3. Digital Map-Based Image Resolution
The features of a digital map are expressed as vector data in a two-dimensional plane by vertical projection. From the standpoint of this work, a video image is a raster expression of the features that also carries height information, and after vectorization it too can be expressed in a point-line-surface data format. The correspondence between video images and the digital map at the level of points, lines, and surfaces is shown in Table 2.
Correspondence between video images and digital map.
From a technical point of view, we treat map-based image resolution as a three-dimensional measurement problem and use matching between single video frames and the digital map, defined in three steps. The first step is feature extraction from the dense range image, which aims to extract point and line features. Given a fully calibrated video frame, we can identify particular characteristics of the extracted target to meet specific requirements; for instance, the corner of a building or a telegraph pole serves as a fixed line feature that appears perpendicular in the video image. The second step combines the line features into surface features using texture information. The third step is matching against the digital map vector data; the matching covers points to points, points to lines, lines to lines, and lines to planes, as shown in Figure 4.

Mapping from Image to Digital Map.
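The point-to-point case of the matching in Figure 4 can be sketched as a greedy nearest-neighbour assignment. The threshold and interface are illustrative assumptions, and the point-line, line-line, and line-plane cases are omitted.

```python
import math

def match_features(image_points, map_points, max_dist=5.0):
    """Match extracted image features to vector-map features by proximity.

    image_points : list of (x, y) features already projected into map coordinates
    map_points   : list of (id, (x, y)) vector-map vertices
    max_dist     : assumed gating distance in map units

    Greedy nearest-neighbour sketch of the point-to-point case only.
    """
    matches = []
    used = set()
    for ip in image_points:
        best_id, best_d = None, max_dist
        for mid, mp in map_points:
            if mid in used:
                continue
            d = math.hypot(ip[0] - mp[0], ip[1] - mp[1])
            if d < best_d:
                best_id, best_d = mid, d
        if best_id is not None:
            used.add(best_id)
            matches.append((ip, best_id))
    return matches

pairs = match_features([(0, 0), (10, 10)], [('a', (0.5, 0)), ('b', (10, 9.5))])
```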
3. Syntactic Structure
3.1. Syntactic Description of Motion Element
Video motion elements are the entity objects that can be clearly identified visually and are morphologically important, such as pedestrians in surveillance video. Current description methods for motion elements are mainly based on color and texture, which makes it difficult to support the definition of motion elements, behavior analysis, and behavior understanding. For a better description of the dynamic characteristics of video motion elements, the paper first defines some related concepts.
Definition 1.

The definition of
Definition 2.
Definition 3.
Definition 4.
Definition 5.
3.2. Behavior and Interaction of Motion Element
In the georeferenced video stream,

As one of the expression forms of motion element in the video stream,
Under the influence of temporal subspace
Due to the close correlation of spatial relation at any time point
Meanwhile, the measuring value

A diagram of interaction relation.
In the georeferenced video stream, the dynamic update function of interaction relation within the scope of spatial constraint is shown as follows:
Among them,
4. Semantics and Formalization of Georeferenced Video
For accurate description and behavior understanding of motion elements in the georeferenced video stream, the paper proposes an analysis method based on sparse random graphs that observes how characteristics evolve over time, and presents a method for representing and measuring video motion elements with dynamic topological information based on a context-sensitive sparse random graph grammar.
4.1. Formalization of Georeferenced Video
Random graph
Each edge of random graph
Among them,
The motion element vertex of random graph can be defined as follows:
4.2. Evolution Rule
As a posterior method, the dynamic process of motion elements in the video stream can be visually described and displayed by a sparse random graph. The temporal and spatial evolution model of motion elements can accurately describe the basic characteristics and dynamic process of spatial relations. In essence, the dynamic evolution of a sparse random graph is the continuous transition of its state space.
Therefore, the state transition function of sparse random graph can be defined as a mapping relation
Among them,
The dynamic evolution process of sparse random graph includes its character update of motion element vertex
Input: sparse random graph detection and recognition information; Output: the updated sparse random graph
(1) IF the graph is empty
(2)     Create first node
(3) End IF
(4) While new detection information arrives
(5)     IF the detected element has no vertex
(6)         Find nearest node
(7)         Create new edge
(8)         Add the new vertex and edge
(9)     End IF
(10)–(29) [remaining steps lost in extraction]
(30) Return the updated sparse random graph
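The surviving steps of Algorithm 2 suggest an evolution rule along the following lines. The parts lost in extraction are filled with assumptions (class and method names, the removal rule for departing elements), so this is a sketch rather than the paper's exact algorithm.

```python
import math

class SparseRandomGraph:
    """Minimal sketch of the evolution rule recoverable from Algorithm 2.

    A root vertex is created for the first detected motion element; each
    subsequently detected element is linked to its nearest existing vertex;
    vertices of elements that leave the scene are removed together with
    their incident edges.
    """
    def __init__(self):
        self.nodes = {}            # element id -> (x, y) position
        self.edges = set()         # frozenset({id1, id2})

    def observe(self, elem_id, pos):
        if not self.nodes:                         # steps (1)-(3): create first node
            self.nodes[elem_id] = pos
            return
        if elem_id not in self.nodes:              # steps (5)-(8): new element
            nearest = min(self.nodes, key=lambda n: math.dist(self.nodes[n], pos))
            self.nodes[elem_id] = pos
            self.edges.add(frozenset((elem_id, nearest)))
        else:
            self.nodes[elem_id] = pos              # character update of the vertex

    def leave(self, elem_id):
        # Node disappears; its edge set is emptied (Section 5.2).
        self.nodes.pop(elem_id, None)
        self.edges = {e for e in self.edges if elem_id not in e}
```

A short run: observing A, B, C in turn links B to A and C to its nearest neighbour B; when B leaves, both incident edges vanish.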
We obtain the corresponding dynamic evolution model of the sparse random graph using the evolution rule algorithm. Step (2) of the algorithm creates and adds the root vertex
4.3. Random Subgraph
If vertexes
4.4. Early Warning of Video Event
Using the numerical calculation of the interaction relationship, abnormal behavior and emergencies in video can be discriminated based on the random graph grammar, and possible special situations can be warned of in advance. A video event generates two different threat levels, notify and alarm, as shown in Figure 8.

Notify and Alarm processing of video event.
The paper mainly detects unexpected crowd incidents and conflicts among massive video events and proposes a novel two-layer discrimination method consisting of an individual attribute layer and a group attribute layer. Once an abnormal video event occurs, the corresponding real-time status of the random graph must be described, which can be expressed as follows.
Specifically, the detection and selection of the variation range or interval of movement attributes in the random graph can use a sliding window. In the continuous movement attribute value
Once either circumstance occurs, the system enters the notify phase.
When entering the notify discrimination phase, the random subgraph showing a diffusion or flocking status is evaluated numerically. Using the structure entropy computation, the status of the corresponding random subgraph is measured, and the entropy value
The warning degree
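Since the exact entropy formula is not recoverable from the text, the sketch below assumes structure entropy is the Shannon entropy of the subgraph's normalised degree distribution and maps it onto illustrative warning thresholds; both the formula and the threshold values are assumptions, not the calibrated intervals used in the experiments.

```python
import math

def structure_entropy(degrees):
    """Shannon entropy of a graph's normalised degree distribution.

    One common definition of structure entropy; whether it matches the
    paper's exact formula is an assumption.
    """
    total = sum(degrees)
    if total == 0:
        return 0.0
    return -sum((d / total) * math.log2(d / total) for d in degrees if d > 0)

def warning_level(entropy, thresholds=(1.0, 2.0, 3.0)):
    """Map an entropy value onto illustrative Notify/Warning1/Warning2 levels."""
    names = ('Normal', 'Notify', 'Warning1', 'Warning2')
    level = sum(entropy >= t for t in thresholds)
    return names[level]

# A 4-node ring (every vertex of degree 2) has a uniform degree distribution,
# so its entropy is log2(4) = 2 bits.
level = warning_level(structure_entropy([2, 2, 2, 2]))
```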
The discrimination method based on the random graph is called graph-based reasoning (GBR) in this paper, while the improved GBR fused with the traditional CBR method is called GBR-C. Intelligent analysis of different video scenes plays an important role in the real-time detection of abnormal video behaviors and mass incidents. The instantaneous status information of video motion elements is integrated into the random graph model, and random subgraph patterns and behavior rules are summarized with a statistical description. An event that violates the behavior regularity of common video events is a latent exceptional event; the features of the video motion elements involved are extracted and recorded in the object-layer stream for efficient content-based video retrieval.
5. Experiment and Analysis
To verify the feasibility and availability of the proposed framework, the spatial information of motion elements is extracted in real time based on detection and tracking [31, 32]. According to the dynamic change of spatial semantics, a timing description method using the random graph grammar clearly depicts the event development of the video stream.
5.1. Interaction Description

Dynamic change process of interaction relation.
In Figure 9, function
The previous results show that the method accurately depicts the dynamic changes of the interaction relations of video motion elements; such accurate depiction is an indispensable premise for the description of the georeferenced video stream.
5.2. Georeferenced Video Stream Description
Based on the rich spatial semantics of motion elements in the georeferenced video stream, we realize intelligent parsing of georeferenced video content using the context-sensitive sparse random graph grammar. The spatial relationships of motion elements in image space are transformed into object space, and motion status and interaction relations are depicted with the random graph. The continuous transition of the inner state space of the random graph is driven by the dynamic evolution process of the sparse random graph.
With the spatial reference data, the sparse-random-graph evolution driven by the monitored targets is achieved. The people successively appearing within the video surveillance range are labeled A, B, C, and D, as shown in Figure 10. As soon as a moving object appears, a new random graph node represents it; when it leaves the surveillance area, the corresponding node disappears, and the edge set formed by the interactions associated with that node is set to null. Using our video test data, the evolutionary process and the timing evolving description diagram of the video clip from frame 1041 to frame 1712 are shown in Figure 10.

Timing evolving description diagram of the georeferenced video stream.
The timing evolving description diagram can be constructed by automatic intelligent analysis of a video clip, which verifies the correctness and effectiveness of the evolution rule algorithm of the sparse random graph. Within the specific geographical space, the time-varying attributes of random graph nodes, such as behavior state, spatial location, and movement parameters, are visually displayed. The basic recorded information of each video motion element is shown in Algorithm 3.
Algorithm 3: Recorded basic information of each video motion element (listing lost in extraction).
The basic information consists of attribute information, spatial location information, and other movement parameters, as shown in Algorithm 3. The attribute information

Structural description of video motion feature.
The automatically generated file mainly consists of two parts: configuration data and content data. The movement status information about motion element
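The per-element record described above can be sketched as a small data structure whose rows form the content-data part of the generated file. The field names and layout are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class MotionElementRecord:
    """Assumed structure of the per-element record in Algorithm 3:
    attribute information, spatial location, and movement parameters."""
    element_id: str
    category: str              # attribute information, e.g. 'pedestrian'
    lon: float                 # spatial location (from the georeference)
    lat: float
    speed: float               # movement parameters
    heading: float
    frame: int

def to_content_row(rec):
    """Flatten one record into a row of the content-data part of the file."""
    d = asdict(rec)
    return ','.join(str(d[k]) for k in ('element_id', 'category', 'lon', 'lat',
                                        'speed', 'heading', 'frame'))

rec = MotionElementRecord('A', 'pedestrian', 113.5, 34.8, 1.2, 90.0, 1041)
row = to_content_row(rec)
```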
5.3. Performance of Video Event Warning
To validate the proposed early warning method for abnormal video behavior and emergencies, we analyze the performance of various attributes using video test data involving a crowded scene. The experimental analysis covers the real-time warning entropy value of the random subgraph, the warning degree, and the real-time changes of the corresponding subgraph node number and the total graph node number, shown respectively in Figure 12. The horizontal axis indicates the video running time, with 10 seconds per scale unit.

Real-time warning entropy, warning degree, subgraph node number, and total graph node number.
As seen in Figure 12(a), the real-time warning entropy of the random subgraph, computed with the structure entropy method, exhibits random fluctuations. According to the warning degree of abnormal video behavior and emergencies, three different warning threshold intervals are set in our test; the Warning2 degree occurs between 252 and 270 seconds, as shown in Figure 12(b). Warning1 indicates the early warning degree most of the time, meaning that an abnormal video event may emerge. Figure 12(c) shows the real-time node number of the random subgraph within the video surveillance scope, while Figure 12(d) shows the total graph node number.
5.4. Performance Comparisons of Intelligent Analysis Methods
In this section, we compare the proposed method with other methods, namely Coarse-Grained SVM, Fine-Grained SVM [15], and MKL [19]. The comparison uses three sample videos (Table 3) involving events in which a group of people interact with each other. All chosen samples are treated as labeled training data within the target domain.
Three test sample videos.
GBR performs a concise numerical calculation and avoids the computational complexity of the traditional CBR method. Tables 4, 5, and 6 compare the performance of GBR and GBR-C with the other methods on the three videos.
Comparison of crossing sample A with different methods.
Comparison of flocking sample B with different methods.
Comparison of conflict sample C with different methods.
From Tables 4, 5, and 6, we observe that GBR slightly extends the processing time of common video event detection, but the forecasting accuracy for abnormal video behavior and emergencies increases significantly at lower computational complexity. The energy consumption of the sensors, which tracks the transmission cost, is therefore reduced, especially for nonrecurring flocking emergencies with complex video event modeling.
6. Conclusion
In summary, existing approaches are based on low-level visual features alone, which means they lack spatial constraints and coupled analysis with the geographic environment; it is necessary to establish the relationship between video analysis methods and the real geographical scene. We proposed a georeferenced video analysis method based on context-based random graphs. The data are obtained from a wireless network of environmental sensors scattered over the supervised area and a vision sensor monitoring the same geographical area. Experimental results show that the proposed random-graph description of georeferenced video is feasible and efficient. Through intelligent parsing of the georeferenced video data stream, we obtain a novel visual description method using random graphs that clearly depicts the development of video scenes and also makes it possible to browse the video stream quickly. Moreover, the random graph can serve as an effective nonlinear index for content-based video indexing and browsing.
As future work, we will enhance the implemented algorithms with alternative combination rules and fuse audio and video to deal with the uncertainty, imprecision, and incompleteness of the underlying information. In addition, experiments on large amounts of data should be conducted to set the various parameters, such as thresholds, false alarm rates, and fusion weights.
