Sage Journals: Discover world-class research

Abstract

Location prediction affects a wide range of research areas in mobile environments, and the abundant mobility data produced by mobile devices make the topic attractive. Although studies have shown that human mobility is strongly regular, randomness makes people’s future whereabouts hard to predict. Most previous works discover an association between a user’s real-world social relations and variations in trajectory, and then use this association to model the user’s mobility for location prediction. However, these methods normally require specific data, which makes them hard to migrate to other platforms. Moreover, by focusing on social relations, they neglect the potential value of associations among strangers’ trajectories. Based on this argument, this article proposes a novel location prediction approach, trajectory similarity–based location prediction. It applies social contagion theory and introduces a novel trajectory-based similarity computing method, together with trajectory sampling by a covering algorithm to accelerate the similarity computation. Experimental results on a real dataset show that trajectory similarity–based location prediction achieves higher accuracy and stability than state-of-the-art approaches.

Keywords

Mobility modeling, location prediction, social contagion, Markov Chain, covering algorithm

Introduction

Predicting human mobility has long been a hot topic, with applications ranging from understanding human migration patterns and behaviors, through modeling the evolution of epidemics and the spread of disease, to emerging location-based social networks (LBSNs) and improving the quality of service (QoS) of mobile applications.¹ With the rapid development of mobile devices and applications, people leave digital “footsteps” whenever they use mobile devices, such as Global Positioning System (GPS) traces,^2–4 devices’ connection records at access points (APs) like cell towers^5–7 and WiFi,^8,9 and LBSN check-in data.^10–12 These abundant mobility data facilitate the understanding of human mobility and the study of location prediction. Recent research shows that people’s mobility trajectories are strongly regular most of the time: people tend to stay at a few important places (e.g. home, office) and travel between them by fixed routes. Meanwhile, human mobility has a certain degree of randomness: people occasionally go to places they do not usually visit (e.g. a cinema or a restaurant).¹³ We study location prediction based on these two characteristics of human mobility. Location prediction, in general, uses users’ historical digital footprints to build their mobility models by statistical methods and predicts users’ whereabouts with these models and related data.¹⁴ The strategy and QoS of the applications mentioned above all depend on the accuracy of location prediction. Therefore, predicting a user’s location accurately is of great significance to various fields.

Location prediction has been extensively studied in the literature. To predict a user’s location more accurately, researchers have approached the problem from different angles and proposed a variety of algorithms aimed at different kinds of digital footprints. Some researchers take a statistical perspective: they count how often the user appears at each location and then calculate the transition probabilities between locations.^2,3,11 However, this approach relies entirely on the user’s past movements, so it fails when the user appears at new locations or the user’s mobility pattern shifts. To handle users who deviate from their original mobility patterns, other researchers take advantage of users’ relationships. Such methods usually take a specific connection between users that affects mobility, analyze and model movement patterns based on it, and then use the resulting model to modify the prediction given by the model built on personal historical data. These methods can, to a certain extent, predict a user’s location when the user deviates from the original movement patterns. However, they usually depend on specific data (e.g. call records), which makes them hard to transfer to other platforms. Moreover, they normally fix a connection between users first and then analyze how it affects the user’s trajectory. This pre-narrows the connections between users and neglects the mobility similarity among strangers.

Inspired by social contagion theory, we propose a novel location prediction algorithm, called trajectory similarity–based location prediction (TLP), which discovers groups of users with high trajectory similarity and uses their mobility models to modify the prediction results. Social contagion theory points out that, under the influence of a crowd, personal behavior tends to adapt to group behavior.¹⁵ In this way, our method extracts users’ connections directly from the mobility dataset and removes the limitation of pre-choosing the connection between users. We first employ the second-order Markov Chain (2-MMC) to build the user’s personal mobility model, which captures movement regularity. Next, we use the concept of the covering algorithm to compress the large-scale trajectory data and obtain a consistent subset of the trajectory set. Subsequently, we propose a trajectory-based user similarity computing method and select from the subset a group of users highly similar to the tested user, which we name the personal group of similar trajectories. Finally, we apply the mobility models of this group to modify the location prediction result.

The rest of this article is organized as follows. Section “Related works” briefly introduces current approaches to location prediction. Section “The proposed problem and its formulation” states the research problem and formalizes it in detail. Section “A social contagion theory–based approach for location prediction” first presents the system architecture of TLP in a mobile environment, then puts forward the proposed solution to the formalized problem, and finally describes the design and implementation of the proposed algorithm in detail. Section “Experiments” presents experiments and results. Section “Conclusion and future work” draws conclusions and discusses future work.

Related works

In recent years, many user location prediction strategies have used the digital footprints produced by users’ mobile devices: AP nodes (cell towers,^5–7,16 WiFi^8,9), GPS,^2–4,17 and check-in data from the location-based services (LBS) of social networks.^10,11 Because check-in data are sparse and incomplete by nature,¹⁰ and because we aim to predict a user’s continuous locations, we focus on the more continuous and complete digital footprints. Based on whether others’ historical data are used by the prediction algorithm, we classify current location prediction algorithms into two categories: independent model-based algorithms and algorithms that combine others’ trajectories.

Independent model-based algorithm

As mentioned above, from a long-term perspective, human mobility is strongly regular and periodic.^10,13 Independent model-based algorithms leverage these characteristics by analyzing the patterns in a user’s personal historical trajectory data. Ashbrook and Starner² proposed an algorithm that first extracts the most frequently visited places of interest (POIs) from the user’s past GPS data with the k-means clustering algorithm and then calculates the transition rate from one POI to another, which yields a mobility Markov model for predicting the user’s location. However, this algorithm requires the number of cluster centers to be set in advance, and this number affects the prediction accuracy. Gambs et al.³ extract POIs with the density and join (DJ)-clustering algorithm, model the user’s mobility with a multi-order Markov Chain (n-MMC), and predict the user’s locations.

Because the number of cluster centers then varies with the density of the GPS data, the extracted POIs are better suited to representing the pattern of a mobile trajectory. However, the n-MMC algorithm cannot give a correct answer when the user moves toward an unvisited location, because it is based entirely on the user’s historical data. Ying et al.¹⁷ proposed a prediction method based on the semantic information of locations, which converts a fixed physical location into semantic information (e.g. “home,” “work”) and stores the semantic trajectory. When predicting a location, the algorithm returns the best-matched location by searching for the tested user’s context in the semantic trajectories. Yuan et al.¹¹ proposed a location prediction algorithm, $W^{4}$ (who, when, where, what), which includes the temporal, spatial, and behavioral information in the user’s mobile trajectory. $W^{4}$ extracts semantic information from the user’s Twitter text messages, associates it with a statistical probability model based on location data and occurrence time, and predicts the user’s location from the time and the tested user’s contextual information. However, this kind of non-time-linear prediction algorithm is only suitable for forecasting several locations within a time range, not for the next-place prediction problem. Monreale et al.⁴ build a decision tree-like model, called the T-pattern tree, from GPS data gathered by vehicles and forecast the user’s location by finding the best-matched movement trace. Even though past data are covered by the decision tree, high-frequency traces are weighted equally with low-frequency ones, which reduces accuracy when a high-frequency trace is similar to low-frequency ones. The methods mentioned above use only personal historical mobility data for prediction. Although they predict locations based on the regularity of human mobility, human mobility also has a certain degree of randomness.¹³ To address this problem, we use social contagion theory to generate a personal group of similar trajectories for the tested user by calculating trajectory similarity, and then use the mobility models of this group to modify the prediction result obtained from personal data.

Location prediction algorithms including others’ trajectories

Since human mobility is also random and exploratory, researchers have recently focused on how to take advantage of others’ mobility data to improve the accuracy of location prediction. To address problems in location prediction and recommendation, users’ social connections have been taken into consideration.^5,18–20 Musolesi et al.²⁰ proposed the group mobility (GM) location prediction algorithm, which introduces social relationships into the prediction model. Boldrini and Passarella¹⁸ added users’ location preferences to the GM algorithm. Backstrom et al.²¹ extracted users’ social relationships from Facebook and combined them with user-provided location semantic information to predict users’ locations. Hossmann et al.¹⁹ use the contact information in users’ mobile phones as their social relationships. Zhang et al.⁵ discovered the latent connection between users’ call records and their interactions with each other and proposed a location prediction algorithm based on call records. The core idea of these methods is to use external or internal data to obtain connections between users, build a mobility model based on the influence of these connections on the user’s trajectory, and then include this mobility model in the location prediction algorithm. However, some latent connections between users are always neglected. We therefore propose a location prediction algorithm that directly discovers the group of users who influence the tested user and uses their mobility models to modify the result derived from the user’s personal mobility model.

To discover the group of users who influence the tested user’s movement, collaborative filtering, a method widely used in recommender systems, has been applied to location recommendation and prediction by more and more researchers.^10,13,22–24 Lian et al.²² proposed a location recommendation algorithm, GeoMF, which uses the geographic influence between locations and users’ location preferences. The influence between locations is preset rather than derived from users’ movement patterns. Other algorithms also use collaborative filtering for location recommendation.^22,23 However, location recommendation methods are not quite suitable for location prediction. Lian et al.¹³ proposed the collaborative exploration and periodically returning model (CEPR), which uses collaborative filtering for location prediction. CEPR uses a hidden Markov Chain to build a mobility model of the user’s periodic movement pattern for predicting locations and applies collaborative filtering when the user explores new places. However, the accuracy of CEPR is limited by the accuracy with which the user’s movement pattern is initially detected. Wang et al.¹⁰ also proposed a method that builds a mobility model and processing strategy separately for the periodic and the random mobility situations and then merges the two into a universal mobility model to predict the user’s location. However, when predicting the next location, instead of considering the user’s current mobility trace, the algorithm uses the time and the probability of the user appearing at each location, calculated from the user’s historical trajectory. It is therefore more suitable for forecasting the user’s future whereabouts at a macroscopic level than for predicting the user’s next location at a microscopic level.

The main contributions of this article are as follows:

We propose a social contagion theory–based location prediction algorithm.

We present a novel method to calculate users’ similarity based on mobility traces.

We implement a large-scale trajectory data compressing method by leveraging the idea of covering algorithm.

The proposed problem and its formulation

Proposed problem

In a mobile environment, the system records people’s mobility data, that is, their location at every moment, through their mobile devices. This information can be an absolute position such as GPS coordinates, a relative position, or the location of the AP to which the mobile device is connected. The system uses a user’s current and historical location data to build a mobility prediction model and predicts the user’s next location with it. Different mobility prediction models may be constructed from different kinds of historical data or from data derived from mobility data. The former, for example, use exclusively the tested user’s mobility data or combine them with others’ data to build the model. The latter, for example, analyze the average time between phone calls and users’ interactions and build the prediction model from call records. Different modeling methods may use various data, but they normally use accuracy as the evaluation metric. The prediction accuracy is the ratio of the number of correct predictions to the total number of predictions. A better mobility prediction modeling method should achieve higher accuracy. Therefore, it is necessary to design and implement an efficient mobility prediction modeling strategy.
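The accuracy metric described above can be sketched in a few lines. This is a minimal illustration, not part of the original system; the function name `prediction_accuracy` and the AP identifiers are hypothetical.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Return the share of predictions that match the observed next location."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    # Count positions where the predicted location equals the observed one.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)


# Hypothetical AP identifiers: three of the four predictions are correct.
example_accuracy = prediction_accuracy(
    ["ap1", "ap2", "ap3", "ap1"],
    ["ap1", "ap2", "ap4", "ap1"],
)
```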

Problem formulation

The problem of predicting a user’s next location is defined as follows. Assume that there are n users in the current scenario; the user set is $U = \{u_{1}, u_{2}, \dots, u_{n}\}$, where $u_{i}$ represents the ith user. There are m locations, defined as the set $L = \{l_{1}, l_{2}, \dots, l_{m}\}$. After initialization, the gathered users’ mobility data are $D = \{D_{1}, D_{2}, \dots, D_{n}\}$, where $D_{i} = (d_{1}, d_{2}, \dots)$, $i \in [1, n]$, includes all of $u_{i}$’s mobility data $d = (uid, t, l)$ in the scenario; uid is the unique identifier of each user, t is the timestamp, and $l \in L$ is the location. Since the records in $D_{i}$ are arranged in chronological order, $D_{i}$ can be considered the user’s trajectory record. Based on the historical data above, TLP builds a mobility prediction model for every user and predicts the user’s next location as $L_{prediction}$.

Using only the tested user’s historical data for modeling gives barely satisfactory prediction results. Hence, while constructing the user’s mobility prediction model, we need to calculate other users’ degree of association with the tested user, choose the strongly connected users to form a group, and use their mobility data $D_{sub_{i}} = \{D_{j} \mid j \in [1, n]\}$ to build the tested user’s mobility prediction model. Our problem is to integrate the tested user’s mobility trajectory with the changes in trajectory under the influence of the strongly connected user group and to use the integration to predict the user’s location. In this article, we use a Markov Chain to model the user’s mobility data, which results in the user’s state transfer matrix. We then extract the k users whose trajectories are most similar to that of the tested user $u_{i}$ as a group $C_{i}$, $C_{i} = \{u \mid u \in U\}$. Integrating the transfer matrix of $u_{i}$ with that of every user in $C_{i}$, we get the mobility prediction model for $u_{i}$.

In this article, we choose 2-MMC to model the user’s mobility data. $X = \{X_{1}, X_{2}, \dots, X_{m}\}$ represents the set of the user’s current states in the Markov Chain model, where m is the number of states. In the standard Markov Chain model, $X_{i}$ is the user’s current location l; here, $X_{i}$ represents two sequential locations. $Y = \{Y_{1}, Y_{2}, \dots, Y_{n}\}$ is the set of transfer states, where n is the number of such states (m and n are reused here and no longer denote the numbers of locations and users). The transfer matrix is calculated as follows: let $n_{ij}$ be the number of times user u is observed to transfer from state $X_{i}$ to $Y_{j}$; the probability of the user transferring from $X_{i}$ to $Y_{j}$ is then

$p_{ij} = \frac{n_{ij}}{n_{i}}$ (1)

where $n_{i} = \sum_{j} n_{ij}$ is the total number of observed transfers of the user out of $X_{i}$.

There are m states in set X and n states in set Y, accordingly the state transfer matrix of the tested user $u_{i}$ is as follows

$P_{i} = \begin{pmatrix} p_{11} & \dots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{m1} & \dots & p_{mn} \end{pmatrix}$ (2)

The user’s current state $x \in X$ is known. The problem of predicting the user’s next location is thus transformed into finding the maximum probability $p_{\max}$ in row x of $P_{i}$; the transfer state that attains it gives the predicted location

$p_{\max} = \max \{p_{xj} \mid j \in [1, n]\}$ (3)
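Formulas (1) to (3) can be illustrated with a minimal sketch. It assumes that a state $X_{i}$ is the pair (previous location, current location) and that a transfer state $Y_{j}$ is identified with the next location; the names `build_2mmc`, `predict_next`, and the sample visit list are hypothetical and not taken from the original implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

State = Tuple[Hashable, Hashable]  # X_i: (previous location, current location)
Model = Dict[State, Dict[Hashable, float]]


def build_2mmc(locations: List[Hashable]) -> Model:
    """Estimate p_ij = n_ij / n_i from one chronologically ordered location list."""
    counts: Dict[State, Counter] = defaultdict(Counter)
    for prev, cur, nxt in zip(locations, locations[1:], locations[2:]):
        counts[(prev, cur)][nxt] += 1  # n_ij: transfers from X_i to Y_j
    model: Model = {}
    for state, row in counts.items():
        n_i = sum(row.values())  # all transfers observed out of X_i
        model[state] = {nxt: n / n_i for nxt, n in row.items()}
    return model


def predict_next(model: Model, state: State) -> Optional[Hashable]:
    """Return the location with the largest p_xj, or None for an unseen state."""
    row = model.get(state)
    if not row:
        return None
    return max(row, key=row.get)


visits = ["home", "cafe", "office", "home", "cafe", "office", "home", "cafe", "gym"]
model = build_2mmc(visits)
next_place = predict_next(model, ("home", "cafe"))
```

In this toy sequence the state ("home", "cafe") is followed twice by "office" and once by "gym", so the estimated probabilities are 2/3 and 1/3. A state that never occurred in the history returns `None`, which mirrors the limitation of purely personal models discussed in the related work.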

Calculate users’ similarity. First, we convert users’ mobility data into trajectories. The complete trajectory of user $u_{i}$ is $Traj_{i} = \{(l, t_{d})_{t}, \dots\}$, where $l \in L$, $t_{d}$ is the duration of the stay at location l, and t is the timestamp. Combining all users’ trajectories, we get the trajectory set $Traj = Traj_{1} \cup Traj_{2} \cup \dots \cup Traj_{n}$. Finding $u_{i}$’s personal trajectory similarity group $C_{i}$ means finding the k users in Traj whose trajectories are most similar to $u_{i}$’s trajectory $Traj_{i}$. To reduce the complexity of computing the similarity between users’ trajectories, we split each complete trajectory into traces under a specific condition. The condition for splitting $u_{i}$’s trajectory $Traj_{i}$ is $t_{d} \geq T_{CUT}$, where $T_{CUT}$ is a preset time threshold. After splitting, we get $u_{i}$’s trace set $Tr_{i} = \{trace_{1}, \dots, trace_{n} \mid trace_{i} \in L, 1 \leq i \leq n\}$. Combining all the trace sets, we get the global trace set of all users $Tr_{total} = Tr_{1} \cup \dots \cup Tr_{n}$. We give every trace in $Tr_{total}$ a unique number as its identifier, the trace id, and use the trace ids to encode $u_{i}$’s trace set $Tr_{i}$. Assuming there are m elements in $Tr_{total}$, $V_{i} = (v_{1}, v_{2}, \dots, v_{m})$ is the m-dimensional Boolean vector converted from $Tr_{i}$. In this vector, $v_{j}$ corresponds to the jth trace in $Tr_{total}$

$v_{j} = \begin{cases} 0, & \text{jth trace} \notin Tr_{i} \\ 1, & \text{jth trace} \in Tr_{i} \end{cases}$ (4)

The similarity between any two users $u_{i}$ and $u_{j}$ is computed as $count(V_{i} \,\&\, V_{j})$: apply a bitwise AND to the two users’ coded vectors and count the remaining 1s. A higher count means the users are more similar.
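As a small worked example, assume a hypothetical global trace set with five traces and two users whose coded vectors are chosen purely for illustration:

```latex
\begin{aligned}
V_{i} &= (1, 0, 1, 1, 0), \qquad V_{j} = (1, 1, 1, 0, 0) \\
V_{i} \,\&\, V_{j} &= (1, 0, 1, 0, 0) \\
count(V_{i} \,\&\, V_{j}) &= 2
\end{aligned}
```

The two users share the first and third traces, so their similarity is 2.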

To sum up, the proposed approach offers a new way of building a user’s location prediction model: it uses both the tested user’s and others’ trajectories to derive the user’s next location. The details are given in the following section.

A social contagion theory–based approach for location prediction

Design of system architecture

Figure 1 describes the architecture of the proposed TLP approach in a mobile environment and shows the interaction between TLP and the other entities. TLP plays a central role in this architecture. First, the data fusion module gathers users’ real-time connection information from each AP, converts it into mobility data $d = (uid, t, l)$, and stores the data in the location database. TLP pulls the mobility data from the database, computes users’ location prediction models, and stores the models. When an outside service sends TLP a request to predict $u_{i}$’s location, TLP loads $u_{i}$’s location prediction model and sends the result to the outside service.

Figure 1. The architecture of TLP.

The proposed TLP algorithm is a social contagion theory–based location prediction method. It is suitable for forecasting users’ locations in a big data environment and uses the crowd’s mobility patterns to adjust the prediction results. TLP applies the reduction idea of the covering algorithm and the 2-MMC model to predict the user’s next location in a cloud computing environment. Furthermore, it refines the prediction results while demonstrating better spatial and temporal performance.

The location prediction process of TLP is shown in Figure 2. First, TLP gathers all users’ mobility data and converts the data into every user’s trajectory $Traj_{i}$. Second, it scans $Traj_{i}$ to get the state sets X and Y and then calculates the state transfer matrix $P_{i}$. Third, TLP uses the idea of the covering algorithm to extract the consistent subset of users $U_{sub}$. It then computes the similarity between $Traj_{i}$ and the trajectory of every user in $U_{sub}$ and obtains the user’s personal trajectory similarity group $C_{i}$. Finally, TLP combines the state transfer matrix $P_{i}$ with those of the users in $C_{i}$ and obtains the user’s location prediction model $P_{predict}^{(i)}$. When TLP receives a service request for $u_{i}$’s future whereabouts, it returns the predicted location $l_{p}$ according to $P_{predict}^{(i)}$.

Figure 2. The process of TLP.

Algorithm of TLP

In this section, we describe the specific process of TLP. Details are given in the following.

Model building phase

Step 1. Before predicting a user’s location, we first build a personal location prediction model from the user’s mobility data. TLP uses 2-MMC to build $u_{i}$’s initial prediction model, which is the state transition matrix $P_{i}$. To compute the transition matrix, we scan $Traj_{i}$ sequentially and record $X_{i}$, $Y_{j}$, and $n_{ij}$ during the scan. After the scan, we compute the user’s transition matrix $P_{i}$ according to formula (1). To accelerate processing, TLP is implemented in the MapReduce computing framework. The trajectory data are split into s chunks, and each chunk is sent to a computing node. In the map phase, the input data have the format $\langle key, value \rangle$, where key is the offset and value is $(uid, l, t_{d})_{t}$. The mapper sets uid as the new key and $(l, t_{d})_{t}$ as the new value. The new key-value pairs are then sent to the Reduce node, which computes $P_{i}$ for the corresponding uid.
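The re-keying by uid in Step 1 can be imitated in plain Python without a MapReduce cluster. The sketch below is only an in-memory illustration under simplifying assumptions: records are the raw tuples $(uid, t, l)$, stay durations are ignored, and the function names `map_by_uid` and `shuffle_and_reduce` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int, str]  # d = (uid, t, l)
Pair = Tuple[str, Tuple[int, str]]


def map_by_uid(records: Iterable[Record]) -> Iterator[Pair]:
    """Map step: emit each record with uid as the new key."""
    for uid, t, loc in records:
        yield uid, (t, loc)


def shuffle_and_reduce(pairs: Iterable[Pair]) -> Dict[str, List[str]]:
    """Group values by uid and order each user's locations by timestamp."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for uid, value in pairs:
        grouped[uid].append(value)
    # Sorting the (t, l) tuples restores chronological order per user.
    return {uid: [loc for _, loc in sorted(values)] for uid, values in grouped.items()}


records = [("u1", 3, "c"), ("u2", 1, "a"), ("u1", 1, "a"), ("u1", 2, "b"), ("u2", 2, "c")]
trajectories = shuffle_and_reduce(map_by_uid(records))
```

In a real deployment the reducer for each uid would go on to compute $P_{i}$ from the ordered sequence; here the reduce step stops at the ordered location list.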

Step 2. Split the trajectory data $Traj_{i}$ to obtain the trace set $Tr_{i}$. This part of the algorithm is also implemented in the MapReduce framework. First, we split U into s pieces and send them to s Map nodes as input data, with $value = uid$ in each key-value pair. Then we fetch the corresponding $Traj_{i}$ for splitting. The duration of the stay at a location, $t_{d}$, is the splitting criterion: people tend to stay longer at a location where an activity takes place than at a location they merely pass by. Therefore, we set a time threshold $t_{threshold}$. When the duration $t_{d}$ at location l is at least $t_{threshold}$, l becomes a splitting point, which is set as the ending point of the current trace and the starting point of the next. The trace set $Tr_{i}$ is obtained once $Traj_{i}$ has been split completely, and it is stored in the database.
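The splitting rule of Step 2 might look as follows in code. This is a sketch under stated assumptions: a trajectory is a list of (location, stay duration) pairs, a long stay at the very first location only opens a trace, and a trailing fragment without a closing long stay is kept as a trace. The last two choices, the function name `split_trajectory`, and the sample data are not specified in the text.

```python
from typing import List, Tuple

Visit = Tuple[str, float]  # (location l, stay duration t_d)


def split_trajectory(trajectory: List[Visit], t_threshold: float) -> List[Tuple[str, ...]]:
    """Cut a trajectory into traces at locations where t_d >= t_threshold."""
    traces: List[Tuple[str, ...]] = []
    current: List[str] = []
    for loc, t_d in trajectory:
        current.append(loc)
        if t_d >= t_threshold and len(current) > 1:
            traces.append(tuple(current))  # loc ends the current trace ...
            current = [loc]                # ... and starts the next one
    if len(current) > 1:
        traces.append(tuple(current))      # assumption: keep the trailing fragment
    return traces


day = [("home", 600), ("street", 2), ("cafe", 30),
       ("avenue", 3), ("office", 480), ("corridor", 1)]
traces = split_trajectory(day, t_threshold=20)
```

With a threshold of 20 time units, the stays at "cafe" and "office" act as splitting points, so the day is cut into three traces.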

Step 3. Code the trajectory. To make the subsequent data sampling and mobility similarity computation more convenient, we propose a trajectory coding method. During the map phase, each compute node extracts the corresponding trace set $Tr_{i}$. It then processes every trajectory fragment $trace_{j}$ in order, searches for $trace_{j}$ in $Tr_{total}$, and returns the trace id j. If $trace_{j}$ is not yet in $Tr_{total}$, it is added to $Tr_{total}$, a new unique trace id is returned, and the total number of traces is updated, $m = m + 1$; otherwise, the existing trace id j is returned. At the end of the map phase, the new key-value pair $\langle key = uid, value = j \rangle$ is emitted. During the reduce phase, the total number of traces m is read from $Tr_{total}$, an m-dimensional Boolean vector $V_{i}$ is allocated for each uid, and every bit is initialized to 0. The jth bit of $V_{i}$ is then set to 1 for each input key-value pair. The user’s coded trajectory $V_{i}$ is returned at the end.
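The coding step can be sketched with a dictionary that plays the role of $Tr_{total}$. The class `TraceRegistry`, the function `encode_users`, zero-based trace ids, and the sample traces are illustrative assumptions; the sketch runs in one process rather than on Map and Reduce nodes.

```python
from typing import Dict, Hashable, Iterable, List


class TraceRegistry:
    """Stand-in for Tr_total: maps each distinct trace to a unique integer id."""

    def __init__(self) -> None:
        self._ids: Dict[Hashable, int] = {}

    def get_trace_id(self, trace: Hashable) -> int:
        # An unseen trace gets the next free id (m grows by 1); a known trace keeps its id.
        return self._ids.setdefault(trace, len(self._ids))

    def __len__(self) -> int:
        return len(self._ids)


def encode_users(user_traces: Dict[str, Iterable[Hashable]]) -> Dict[str, List[int]]:
    """Return one m-dimensional 0/1 vector V_i per user."""
    registry = TraceRegistry()
    # First pass (map phase in the text): replace every trace by its id.
    ids = {uid: {registry.get_trace_id(tr) for tr in traces}
           for uid, traces in user_traces.items()}
    m = len(registry)
    # Second pass (reduce phase in the text): set bit j when trace j belongs to the user.
    return {uid: [1 if j in trace_ids else 0 for j in range(m)]
            for uid, trace_ids in ids.items()}


vectors = encode_users({
    "u1": [("home", "cafe"), ("cafe", "office")],
    "u2": [("cafe", "office"), ("office", "gym")],
})
```

Both passes are needed because the vector length m is only known after every user’s traces have been registered.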

Step 4. Sample users. The k users of $u_{i}$’s personal trajectory similarity group are selected by comparing $u_{i}$ with all other users. If every pair of users has to be compared, the total number of similarity computations is $n \times (n - 1) / 2$. When the number of users is very large, the similarity computation consumes a great deal of time and computing resources. According to social contagion theory, a user’s behavior tends to be influenced by others and to adjust to the crowd’s behavior. Based on this theory, we extract a consistent subset of users $U_{sub}$, and the tested user’s personal trajectory similarity group $C_{i}$ is then obtained by computing trajectory similarity only with the users in $U_{sub}$.

The sampling process is based on the covering algorithm. The trace ids in the total trace set $Tr_{total}$ are distributed evenly among the Map nodes. In every key-value pair that a Map node receives, the value is i, the id of a trace. The set of users’ trajectory coding vectors is $V = \{V_{1}, V_{2}, \dots, V_{n}\}$. The vectors in V whose ith bit is 1, that is, whose trajectories contain $trace_{i}$, are classified as $G_{positive}$, and the others are classified as $G_{negative}$. Running the covering algorithm, we get the set of circle centers A. Each vector a in A is handled as follows: if $a \in V$, then a is an element of the sample space, and the corresponding uid is set as the output value of the key-value pair; otherwise, the nearest vector in the sample space is computed and its uid is set as the value. The map function repeats this procedure until all trace ids have been processed. In the reduce phase, the function gathers all key-value pairs, removes the duplicates, and merges the remaining uids into the consistent subset of users $U_{sub}$.

Step 5. Find the personal trajectory similarity group. The user set U is split and distributed to the Map nodes. In each input key-value pair, $value = uid$. The map function fetches the $V_{i}$ corresponding to the uid. For the trajectory vector $V_{j}$ of every user in $U_{sub}$, it computes the similarity between $V_{i}$ and $V_{j}$ from the vectors coded by formula (4). The result is formatted as a set $\{(uid, similarity) \mid uid \in U_{sub}\}$, whose elements are sorted in descending order of similarity. The tested user’s personal trajectory similarity group $C_{i}$ consists of the uids of the first k elements of this set. After storing $C_{i}$ in the database, the map function repeats the above procedure until the input is empty.
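Selecting the similarity group in Step 5 amounts to a top-k query over the sampled users. The sketch below assumes that the tested user is excluded from its own group and that ties are broken by uid; neither rule is stated in the text, and the names `similarity` and `similarity_group` are hypothetical.

```python
from typing import Dict, List, Sequence


def similarity(v_i: Sequence[int], v_j: Sequence[int]) -> int:
    """Number of positions where both 0/1 vectors hold a 1 (shared traces)."""
    return sum(a & b for a, b in zip(v_i, v_j))


def similarity_group(uid: str, vectors: Dict[str, Sequence[int]],
                     u_sub: List[str], k: int) -> List[str]:
    """Return the k users of u_sub most similar to uid, best first."""
    scored = [(similarity(vectors[uid], vectors[other]), other)
              for other in u_sub if other != uid]  # assumption: skip the user itself
    # Descending similarity; uid as a deterministic tie-breaker (assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


coded = {
    "u1": [1, 1, 0, 1],
    "u2": [1, 1, 0, 0],
    "u3": [0, 0, 1, 0],
    "u4": [1, 1, 1, 1],
}
group_u1 = similarity_group("u1", coded, u_sub=["u2", "u3", "u4"], k=2)
```

Here u4 shares three traces with u1 and u2 shares two, so they form the group for k = 2.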

Step 6. Construct the prediction model. This step also uses the MapReduce framework. The user set U is split and distributed to the Map nodes. In the input key-value pairs, the value is a uid. For each uid, the map function extracts the corresponding user’s personal mobility model $P_{i}$ and the models of the users in the similarity group. Each model gives the probability of a user transferring from the current state to other states. Hence, in the location prediction problem, the influence of the personal trajectory similarity group on the user can be regarded as the influence of the other users’ transition probabilities on those of the user. Including the influence of the similarity group, the user’s location prediction model can be represented as follows

$P_{predict}^{(i)} = \alpha \sum_{u_{j} \in C_{i}} P_{j} + \beta P_{i}$ (5)

Algorithm 1: Model building phase of TLP.
Input: users’ mobility dataset D;
Output: the location prediction model $P_{predict}^{(i)}$ of every user;
// $D_{i} \Rightarrow Traj_{i}$
for each $d = (uid, t, l) \in D$ do // map function: separate users’ mobility data from each other
  key = uid; $value = (l, t_{d}, t)$; emit $(key, value)$;
end for
for each uid do // reduce function: build the personal mobility prediction model
  scan the user’s trajectory $Traj_{i}$ and compute the second-order Markov transition matrix $P_{i}$;
  emit $(uid, P_{i})$;
end for
initialize the global trace set $Tr_{total}$ as NULL;
for each uid do // map function: split trajectory $Traj_{i}$ to get trace set $Tr_{i}$
  fetch $Traj_{i}$; initialize trace set $Tr_{i}$ as NULL;
  for each $(l, t_{d})_{t} \in Traj_{i}$ do
    if $t_{d} \geq t_{threshold}$ then
      select l to split $Traj_{i}$ and get a trace;
      put the trace into $Tr_{total}$, get the unique integer trace_id, and store the id in $Tr_{i}$;
      $Tr_{i} \leftarrow get\_traceid(Tr_{total}, trace)$;
    end if
  end for
  emit $(uid, Tr_{i})$;
end for // there are m elements in $Tr_{total}$
for each $Tr_{i}$ do // reduce function: code the trajectory
  initialize every bit of the m-dimensional Boolean vector $V_{i}$ as 0;
  $V_{i}[trace\_id] = 1$ for each $trace\_id \in Tr_{i}$;
  emit $(uid, V_{i})$;
end for
for each $trace\_id \in Tr_{total}$ do // map function: user sampling
  classify the $V_{i}$ with $V_{i}[trace\_id] = 1$ into $G_{positive}$ and the others into $G_{negative}$;
  calculate the set A of circle centers of $G_{positive}$ with the covering algorithm;
  calculate the nearest point $V_{i}$ for every element in A; emit $(0, i)$;
end for
use the reduce function to combine the uids into $U_{sub}$;
use the map function to select the first k users in $U_{sub}$ with the highest similarity to $u_{i}$’s trajectory, and combine these users as $C_{i}$;
for each $uid \in U$ do // map function: build the prediction model
  fetch $P_{i}$ and $C_{i}$ and compute the prediction model $P_{predict}^{(i)}$ according to formula (6);
  emit $(uid, P_{predict}^{(i)})$;
end for

Since there are k users in $C_{i}$, the coefficients satisfy $k\alpha + \beta = 1$. Substituting $\beta = 1 - k\alpha$ into formula (5) gives

$P_{predict}^{(i)} = α (\sum_{u_{j} \in C_{i}} P_{j} - k * P_{i}) + P_{i}$ (6)

where $\sum_{u_{j} \in C_{i}} P_{j} - k P_{i}$ is the modification factor of the tested user’s Markov model, $\alpha$ is the modification coefficient, and $P_{predict}^{(i)}$ is the location prediction model of $u_{i}$. The model is stored and the above process is repeated until the input is empty. The pseudocode of the TLP algorithm accompanies this description.
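Formula (6) can be illustrated with a small Python sketch. The representation is an assumption made for illustration: each model is a nested dictionary that maps a state (previous location, current location) to the probabilities of the next location, and `blend_models` is a hypothetical name. The sketch also assumes $k \alpha \leq 1$, so that the implied weight $\beta = 1 - k \alpha$ of the personal model is non-negative.

```python
from typing import Dict, List, Tuple

State = Tuple[str, str]                  # (previous location, current location)
Model = Dict[State, Dict[str, float]]    # state -> {next location: probability}


def blend_models(p_i: Model, group: List[Model], alpha: float) -> Model:
    """Apply formula (6) entry by entry: alpha * (sum_j P_j - k * P_i) + P_i."""
    k = len(group)
    states = set(p_i)
    for p_j in group:
        states.update(p_j)
    blended: Model = {}
    for state in states:
        own_row = p_i.get(state, {})
        locations = set(own_row)
        for p_j in group:
            locations.update(p_j.get(state, {}))
        row: Dict[str, float] = {}
        for loc in locations:
            # Missing entries are treated as probability 0 (an assumption).
            group_sum = sum(p_j.get(state, {}).get(loc, 0.0) for p_j in group)
            base = own_row.get(loc, 0.0)
            row[loc] = alpha * (group_sum - k * base) + base
        blended[state] = row
    return blended
```

When the tested user and all k group members have a complete row for a state, every input row sums to 1, so the blended row sums to $\alpha (k - k) + 1 = 1$ and remains a probability distribution.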

Predicting phase

Step 1. TLP receives a location prediction request for $u_{i}$ from an external service.

Step 2. TLP loads the location prediction model $P_{predict}^{(i)}$ and the user’s current state X (the tested user’s current location and the previous one), obtains the prediction of the user’s next location, and sends it to the external service.
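Step 2 amounts to a lookup in the stored model followed by picking the most probable next location. A minimal sketch, assuming the model is stored as a nested dictionary keyed by the state X; the function name `predict_next` and the `None` fallback for a state that never occurred in the training data are assumptions, since the article does not say how unseen states are handled.

```python
from typing import Dict, Optional, Tuple

State = Tuple[str, str]                  # X = (previous location, current location)
Model = Dict[State, Dict[str, float]]    # state -> {next location: probability}


def predict_next(model: Model, state: State) -> Optional[str]:
    """Return the most probable next location for the given state, if any."""
    row = model.get(state)
    if not row:
        return None  # assumption: no prediction for an unseen state
    return max(row, key=row.get)
```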

From a micro perspective, we apply the covering algorithm to sample the large-scale coded trajectory data. Although the covering algorithm is normally used as a supervised classification algorithm, we use it here for data sampling. Its main idea is to divide the sample space with a series of hyperspheres into two parts, inside and outside the hyperspheres. The computation yields a set of sphere centers $A (a_{1}, a_{2}, \dots, a_{h})$ and a set of radii $R (r_{1}, r_{2}, \dots, r_{h})$ . The hyperspheres represented by A and R contain the positive samples, while the negative samples lie outside. We apply this idea and take whether a trajectory contains a specific trace as the class label. Using the covering algorithm–based sampling method, we obtain the set of sphere centers of the positive samples and select the sample closest to each center as the sampling result. This process is applied to every trace, and the user subset is the union of all selected samples. The trajectory of every user in the subset has a strong influence on others, since samples with strong influence are more likely to be selected. In real life, however, the influence of a person is not confined to one similarity group. Hence, extracting the people who have influence in multiple groups both compresses the data and keeps the subset consistent with the original dataset. The covering algorithm fits this description exactly: a person who has more traces takes part in the sampling process more times, so the probability of being selected is higher, and samples with strong influence are therefore more likely to be extracted. The trajectory sampling method proposed in this article thus preserves the consistency and influence of the original dataset. At the same time, the subset makes the later search for a user’s personal trajectory similarity group more efficient.
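The sampling idea can be sketched in Python as follows. This is a simplified greedy variant written for illustration, not the exact covering algorithm used in the article: the text does not state how centers and radii are chosen, so the sketch assumes Hamming distance on the 0/1 vectors, takes the distance to the nearest negative sample as the radius, and uses the centroid of the covered positive samples as the sphere center. The function names are hypothetical.

```python
from typing import Dict, List, Set


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions in which two 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def cover_representatives(positive: Dict[str, List[int]],
                          negative: Dict[str, List[int]]) -> List[str]:
    """Cover all positive samples with spheres; return the user nearest to each center."""
    uncovered = sorted(positive)  # sorted for a deterministic result
    representatives: List[str] = []
    while uncovered:
        seed = positive[uncovered[0]]
        # Radius: distance to the nearest negative sample, so no negative lies inside.
        radius = min((hamming(seed, v) for v in negative.values()),
                     default=len(seed) + 1)
        members = [u for u in uncovered if hamming(seed, positive[u]) < radius]
        if not members:
            members = [uncovered[0]]  # guard: a sphere always covers its seed
        # Sphere center: centroid of the covered positive samples.
        center = [sum(positive[u][d] for u in members) / len(members)
                  for d in range(len(seed))]
        nearest = min(members, key=lambda u: sum((x - c) ** 2
                                                 for x, c in zip(positive[u], center)))
        representatives.append(nearest)
        uncovered = [u for u in uncovered if u not in members]
    return representatives


def sample_users(vectors: Dict[str, List[int]]) -> Set[str]:
    """Union of the representatives selected for every trace id."""
    m = len(next(iter(vectors.values())))
    sampled: Set[str] = set()
    for trace_id in range(m):
        positive = {u: v for u, v in vectors.items() if v[trace_id] == 1}
        negative = {u: v for u, v in vectors.items() if v[trace_id] == 0}
        sampled.update(cover_representatives(positive, negative))
    return sampled
```

A user whose vector has many bits set belongs to the positive class for many trace ids and therefore gets many chances to be selected, which mirrors the argument above that users with more traces are more likely to end up in the subset.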

The TLP algorithm proposed in this article uses 2-MMC for modeling, a choice based on an intuitive observation. The core idea of Markov prediction is that the next state of the tested user depends only on the n previous states, not on the whole history. Human mobility shows exactly this character: the future location of a user is related only to the n previous locations the user has visited, not to the entire trajectory. We choose 2-MMC to construct the initial mobility model, that is, we use the user’s two previous locations to predict the next place. In the standard (first-order) Markov model, the next location depends only on the current location, which can hurt prediction accuracy badly. For example, consider a user who lives a simple life of home–bus station–work. When the current location is “bus station,” the standard Markov model can only pick one of “home” and “work,” which means that half of these predictions are wrong. With 2-MMC, however, we know that the previous location is “work” and the current one is “bus station,” so the next place is naturally “home.” As the example shows, 2-MMC achieves better accuracy without excessively increasing the state space.
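The home–bus station–work example can be reproduced with a few lines of Python. The sketch below estimates first-order and second-order transition probabilities by simple counting; the function names and the toy trajectory are illustrative assumptions, not data from the article.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def first_order_counts(trajectory: List[str]) -> Dict[Hashable, Counter]:
    """Count transitions current -> next."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for cur, nxt in zip(trajectory, trajectory[1:]):
        counts[cur][nxt] += 1
    return counts


def second_order_counts(trajectory: List[str]) -> Dict[Hashable, Counter]:
    """Count transitions (previous, current) -> next, as in 2-MMC."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        counts[(prev, cur)][nxt] += 1
    return counts


def normalize(counts: Dict[Hashable, Counter]) -> Dict[Hashable, Dict[str, float]]:
    """Turn transition counts into row-wise probabilities."""
    return {
        state: {loc: n / sum(row.values()) for loc, n in row.items()}
        for state, row in counts.items()
    }


# Toy commuter: home -> bus station -> work -> bus station -> home -> ...
trajectory = ["home", "bus station", "work", "bus station"] * 3 + ["home"]
p1 = normalize(first_order_counts(trajectory))
p2 = normalize(second_order_counts(trajectory))
```

Here `p1["bus station"]` assigns equal probability to "home" and "work", whereas `p2[("work", "bus station")]` puts all probability on "home" and `p2[("home", "bus station")]` puts all probability on "work".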

On the whole, the idea of this article comes from social contagion theory. In social contagion theory, personal behavior is unconsciously influenced, to some degree, by the crowd. Accordingly, a person gives up some personal behavior and adopts the crowd behavior as new personal behavior. However, the behavior of a person within the crowd is more similar to that of certain groups which have a strong influence on the person. We define the group of people who have a strong influence on the tested user as the personal trajectory similarity group. Inspired by social contagion theory, when building the location prediction model we use not only the user’s personal mobility data but also the mobility data of users from the personal trajectory similarity group to modify the former. When people speak of a personal trajectory similarity group, they naturally think of the social relations between users in real life (e.g. classmates, coworkers, relatives). We consider that simply using social relations to construct the personal trajectory similarity group is not the best option in terms of social contagion. With respect to location prediction, a personal trajectory similarity group constructed from social relations has a strong influence on the tested user only in certain situations. To address this problem, it is necessary to construct a more comprehensive personal trajectory similarity group and to discover all the contagions between users that are hidden in the historical data. In this article, as part of the TLP algorithm, the proposed procedure for searching the personal trajectory similarity group is based on the idea of collaborative filtering. Trajectory similarity is chosen as the sole criterion for determining the intensity of influence between users. The procedure therefore focuses on finding the users whose trajectories are similar to that of the tested user and leaves out social relations and users’ co-occurring behaviors. With regard to calculating the similarity of trajectories, hardly any two trajectories are exactly the same because of the randomness of user mobility. Hence, a user’s trajectory is considered as a series of traces. After splitting the trajectories into traces, we calculate the mobility similarity between users by counting the number of traces they have in common. Finally, the members of the tested user’s personal trajectory similarity group are selected in descending order of similarity.
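The similarity computation described above can be sketched as follows, assuming each user is represented by a 0/1 vector over the global trace set. The names `shared_traces` and `similarity_group` are hypothetical, and breaking ties by user id is an assumption added only to make the result deterministic.

```python
from typing import Dict, List


def shared_traces(v_a: List[int], v_b: List[int]) -> int:
    """Number of traces that appear in both coded trajectories."""
    return sum(a & b for a, b in zip(v_a, v_b))


def similarity_group(uid: str,
                     vectors: Dict[str, List[int]],
                     candidates: List[str],
                     k: int) -> List[str]:
    """Return the k candidates sharing the most traces with uid, most similar first."""
    scored = [(shared_traces(vectors[uid], vectors[c]), c)
              for c in candidates if c != uid]
    # Descending similarity; ties broken by user id (an assumption).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [c for _, c in scored[:k]]
```

In TLP the candidates would be the sampled subset $U_{sub}$ rather than the full user set, which is what keeps this step cheap.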

Experiments

In this section, we conduct a series of experiments to evaluate and verify the proposed TLP algorithm on a dataset gathered in a real scenario. The two aforementioned algorithms, 2-MMC and NextCell, are chosen as the baselines. In particular, we would like to examine:

The influence of time duration, between current time and the arriving time of next location, on prediction accuracy of the algorithms;

The influence of the size of the user group on prediction accuracy;

The influence of the parameter K, which defines the size of the personal trajectory similarity group, on the accuracy of TLP;

The influence of the social activeness of user on prediction accuracy.

Experiment design

Metrics

We select two metrics to measure the performance of the prediction algorithms—accuracy and time-related accuracy. The accuracy of location prediction is defined as

$accuracy = \frac{n_{correct}}{n_{total}}$ (7)

where $n_{correct}$ represents the number of correct predictions and $n_{total}$ is the total number of the predictions.

The time-related accuracy, $accuracy_{i, j}$ , is the prediction accuracy over the cases in which user j arrives at the next location in the ith hour from the current time. $P_{i, j}$ is the total number of locations at which user j arrives in the ith hour, and $P_{i, j}^{'}$ is the number of correct predictions among these locations:

$accuracy_{i, j} = \frac{P_{i, j}^{'}}{P_{i, j}}, i \in I$ (8)
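Both metrics can be computed from a list of prediction records for one user. The sketch below is illustrative: the record layout (hours until arrival, predicted location, actual location), the function names, and the convention that an arrival after exactly one hour falls into the second interval are assumptions, not details given in the article.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (hours between the current time and the arrival, predicted location, actual location)
Record = Tuple[float, str, str]


def accuracy(records: List[Record]) -> float:
    """Formula (7): number of correct predictions over total predictions."""
    correct = sum(predicted == actual for _, predicted, actual in records)
    return correct / len(records)


def time_related_accuracy(records: List[Record],
                          num_intervals: int = 5) -> Dict[int, float]:
    """Formula (8) for one user: accuracy per 1-hour interval i = 1..num_intervals."""
    total: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for hours, predicted, actual in records:
        interval = int(hours) + 1  # 0 <= hours < 1 is the 1st hour (assumption)
        if interval > num_intervals:
            continue  # arrivals beyond the last interval are ignored
        total[interval] += 1
        correct[interval] += predicted == actual
    return {i: correct[i] / total[i] for i in sorted(total)}
```

Intervals without any arrival are simply absent from the returned dictionary, which avoids a division by zero in formula (8).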

Experiment design details

We choose the MIT Reality Mining Dataset²⁵ to evaluate the performance of the proposed location prediction algorithm. This dataset contains the location information, communications, and mobile application preferences of 106 users, gathered over 9 months. The details of the dataset are given in Table 1.

Table 1.

Details of MIT dataset.

Item	Description
Starting time	September 2004
Ending time	June 2005
Number of users	106
Number of cell towers	32,579
Number of GSM traces	2,667,895
Average number of GSM traces per person per day	47
Logical location	CellArea, CellTower

GSM: Global System for Mobile Communication.

We implement the proposed TLP algorithm on the Hadoop platform with the MapReduce programming framework.²⁶ The cluster has 21 nodes, 1 master node and 20 slave nodes, each equipped with 16 GB of memory and a 2.83 GHz CPU.

The influence of arriving time

In this section, we examine the influence of the time duration between the current time and the arrival time at the next location on prediction accuracy. The reason for conducting this experiment is that many research works predict future locations from a nonlinear time point of view, whereas TLP builds its prediction model from the perspective of trajectory (personal trajectory and trajectory similarity). We therefore design the experiment to examine the influence of time on the trajectory-based location prediction method. We divide the locations by the time duration between each location and its previous one, using 1-hour intervals, which gives five intervals. To examine the impact of social contagion on TLP, we remove the social contagion–based part of TLP and use the remaining personal trajectory–based method, 2-MMC, as one of the baselines. We also include a state-of-the-art social-based prediction algorithm, NextCell,⁵ as another baseline. The result is illustrated in Figure 3. In general, TLP is the least affected by time: its accuracy descends slightly as the time duration increases, and the difference in accuracy between the best (first interval) and the worst (fifth interval) is merely 11%. NextCell performs slightly worse than TLP but shows similar stability, with a difference of about 17% between its highest and lowest accuracy. NextCell is a statistical method based on location and nonlinear time, so time does not make much difference to its accuracy. The accuracy of 2-MMC, however, declines sharply as the time duration increases. At the first interval, the difference between it and TLP is 13%, but the difference rises to 27% at the fifth interval. Since the only difference between the two is the social contagion part, we think that social contagion adapts well to varying time. The reason might be that when the time duration between two locations is short, the user is probably engaged in a periodic behavior or in a series of movements made to finish a task, which means that the bond between locations in the trajectory is tight. When the time duration increases, the user might have already finished that behavior, so the association between locations starts to fade and the user might begin another kind of behavior. Hence, as the link between locations weakens, the accuracy of 2-MMC declines sharply, while the accuracy of TLP, which is supported by the social contagion of other people, is affected only slightly.

Figure 3.

Influence of time duration between locations on the accuracy of the model.

The influence of user group size

In this section, to examine the effect of user group size on accuracy, we randomly extract 30 users and 60 users from the MIT dataset²⁵ and use them, along with all 106 users in the dataset, as three datasets for comparison. The results on the three datasets are shown in Figure 4. As the histogram shows, 2-MMC is the least affected, and its accuracy stays around 45%–50%, because its prediction accuracy depends only on the personal historical trajectory. The accuracy of TLP and NextCell, however, rises with the number of users. NextCell is slightly lower than TLP except when the user number is 60, where the accuracy of NextCell is 63.8%, higher than the 62% of TLP. This is because that dataset contains more phone call records, so NextCell has more useful information to raise its prediction accuracy. When the user number is 106, the accuracies of 2-MMC, NextCell, and TLP are 48.6%, 66.7%, and 71.6%, respectively.

Figure 4.

Influence of user number on prediction accuracy.

The influence of parameter K on TLP

Because the prediction accuracy of TLP depends on the tested user’s personal trajectory similarity group, we adjust the parameter K, which decides the size of the group, and observe the accuracy of the model under different values of K. The result is shown in Figure 5. Initially, when $K = 0$ , TLP is equivalent to 2-MMC, which has the lowest accuracy and declines the most quickly as time increases. As K increases, the model of the tested user has more references with which to modify the prediction, and the accuracy rises. When $K = 10$ and the time interval is the first hour, the accuracy is the highest, 75.5%. Beyond this point, the accuracy descends as K increases (e.g. with $K = 20$ in the first hour, the accuracy is 73.3%). We infer that this happens because there are too many users in the group and the prediction is disturbed by data noise. It should be noted that although the highest accuracy is achieved at $K = 10$ here, different datasets may have different optimal values of K: K represents the balance between useful information and data noise, and the strength of social contagion and of interference varies across datasets.

Figure 5.

Influence of parameter K on accuracy.

The influence of user’s social activeness

Because the social contagion part of TLP constructs the mobility model from the associations between the crowd and a person in the real world, the crowd is not divided into friends and strangers. In this section, we therefore examine the influence of social activeness. We classify the users in the dataset as active, semi-active, or passive based on the total number of communications with other people and the number of friends, both of which are provided in the dataset. We choose one user from each class: user A (active), user B (semi-active), and user C (passive). 2-MMC and NextCell are the baselines in this section. The result is shown in Figure 6. All three algorithms show the same tendency: the accuracy for C is the highest and the accuracy for A is the lowest. For the active user A, the accuracies of 2-MMC, NextCell, and TLP are 38.4%, 53.3%, and 58.7%, respectively. The reason is that the trajectory of the active user shows more randomness, so it is more difficult to predict the user’s next whereabouts. The semi-active user demonstrates more regularity, so the prediction is more accurate. For the passive user in particular, the accuracies of the three algorithms are 90.6%, 93.8%, and 94.3%, respectively, as shown in the histogram. Although it is also affected by the randomness of user mobility, TLP has the smallest gap in prediction accuracy between the best and the worst case, 35.5%, which is a bit smaller than NextCell’s 40.2% and much smaller than 2-MMC’s 52.2%. TLP finds similar users based on trajectory similarity. During prediction, it can use the decisions that others made in the same state as the tested user as a reference for predicting the next location. Hence, whether the user is active or not, TLP can achieve preferable accuracy.

Figure 6.

Influence of user’s social activeness.

To summarize the results above: First, we use the time duration as the variable to see how it affects prediction accuracy; TLP shows better accuracy than the baselines and is the least affected by time. Second, with more users involved as input, TLP achieves higher accuracy, and with the full set of 106 users it outperforms both baselines. Third, we adjust the size of the similarity group by changing the value of parameter K, and the prediction accuracy of TLP peaks at $K = 10$ . Fourth, by comparing users with different levels of social activeness, we find that all algorithms achieve higher accuracy for less socially active users and that TLP is the most stable against this change because it is based on similarity between users, not only friends.

Conclusion and future work

This article proposed TLP, a location prediction algorithm based on social contagion theory. To address both the regularity and the randomness of human mobility, TLP builds a second-order Markov-based personal mobility model and the user’s personal trajectory similarity group based on social contagion theory. In the process of finding the personal group, we split the user’s trajectory and transform it into a coded trajectory. Then, applying the idea of the covering algorithm, we compress the large-scale user trajectory dataset into a consistent subset. Furthermore, we apply the trajectory-based user similarity algorithm to find the user’s personal group in the subset. Finally, we conduct a series of experiments on the performance of the proposed algorithm from multiple aspects with the MIT Reality Mining dataset. The results show that the proposed social contagion–based location prediction algorithm generally achieves higher accuracy and better stability than the 2-MMC and NextCell baselines.

In the future, we plan to use the semantic information of locations (such as user-defined location tags) in users’ trajectories, which should improve prediction accuracy and widen the range of realistic usage. Moreover, a more advanced trajectory model, such as a tree-like model or a graph structure, can be used for similarity computing. Such models can preserve more information than the BOOL vector and achieve higher accuracy in similarity computation. Furthermore, we intend to apply TLP to resource allocation in wireless networks.

Footnotes

Academic Editor: Florentino Fdez-Riverola

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the European Seventh Framework Program (FP7) under Grant No. GA-2011-295222, the National Natural Science Foundation of China under Grant No. 61073009, the National Sci-Tech Support Plan of China under Grant No. 2014BAH02F03, the National Science-Technology Support Project under Grant No. 2014BAH02F02, the National Natural Science Foundation of China under Grant No. 61133011, the Youth Science Foundation of Jilin Province of China under Grant No. 20160520011JH, and the Jilin Provincial Education Office (the 13th Five-Year Plan Science and Technology Research Project (2016) No. 347).

References

1. Noulas, Scellato, Lathia. Mining user mobility features for next place prediction in location-based services. In: Proceedings of the 2012 IEEE 12th international conference on data mining, ICDM ’12, Brussels, 10–13 December 2012, pp.1038–1043. Washington, DC: IEEE Computer Society.

2. Ashbrook, Starner. Learning significant locations and predicting user movement with GPS. In: Proceedings of sixth international symposium on wearable computers (ISWC 2002), Seattle, WA, 7–10 October 2002, pp.101–108. Washington, DC: IEEE Computer Society.

3. Gambs, Killijian, Núñez del. Next place prediction using mobility Markov chains. In: Proceedings of the first workshop on measurement, privacy, and mobility, MPM ’12, Bern, Switzerland, 10–13 April 2012, pp.3:1–3:6.

4. Monreale, Pinelli, Trasarti. WhereNext: a location predictor on trajectory pattern mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, Paris, 28 June–1 July 2009, pp.637–646. New York: ACM.

5. Zhang, Xiong. NextCell: predicting location using social interplay from cell phone traces. IEEE T Comput 2015; 64(2): 452–463.

6. Eagle, Pentland, Lazer. Inferring friendship network structure by using mobile phone data. P Natl Acad Sci USA 2009; 106(36): 15274–15278.

7. Gonzalez, Hidalgo, Barabasi AL. Understanding individual human mobility patterns. Nature 2008; 453(7196): 779–782.

8. Scellato, Musolesi, Mascolo. NextPlace: a spatio-temporal prediction framework for pervasive systems. In: Proceedings of the 9th international conference on pervasive computing, San Francisco, CA, 12–15 June 2011, pp.152–169. Berlin, Heidelberg: Springer.

9. Song, Kotz, Jain. Evaluating location predictors with extensive Wi-Fi mobility data. In: Proceedings of the twenty-third annual joint conference of the IEEE computer and communications societies INFOCOM 2004 (vol. 2), 2004, pp. 1414–1424. DOI: 10.1109/INFCOM.2004.1357026.

10. Wang, Yuan, Lian. Regularity and conformity: location prediction using heterogeneous mobility data. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’15, Sydney, NSW, Australia, 10–13 August 2015, pp.1275–1284. New York: ACM.

11. Yuan, Cong. Who, where, when and what: discover spatio-temporal topics for twitter users. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’13, Chicago, IL, 11–14 August 2013, pp.605–613. New York: ACM.

12. Cho, Myers, Leskovec. Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’11, San Diego, CA, 21–24 August 2011, pp.1082–1090. New York: ACM.

13. Lian, Xie, Zheng. CEPR: a collaborative exploration and periodically returning model for location prediction. ACM Trans Intell Syst Technol 2015; 6(1): 8:1–8:27.

14. Baumann, Kleiminger, Santini. The influence of temporal and spatial features on the performance of next-place prediction algorithms. In: Proceedings of the 2013 ACM international joint conference on pervasive and ubiquitous computing, UbiComp ’13, Zurich, 8–12 September 2013, pp.449–458. New York: ACM.

15. Burt RS. Social contagion and innovation: cohesion versus structural equivalence. Am J Sociol 1987; 92(6): 1287–1335.

16. Calabrese, Smoreda, Blondel. Interplay between telecommunications and face-to-face interactions: a study using mobile phone data. PLoS ONE 2011; 6(7): e20814.

17. Ying JJC, Lee, Weng. Semantic trajectory mining for location prediction. In: Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems, GIS ’11, Chicago, IL, 1–4 November 2011, pp.34–43. New York: ACM.

18. Boldrini, Passarella. HCMM: modelling spatial and temporal properties of human mobility driven by users’ social relationships. Comput Commun 2010; 33(9): 1056–1074.

19. Hossmann, Spyropoulos, Legendre. Putting contacts into context: mobility modeling beyond inter-contact times. In: Proceedings of the twelfth ACM international symposium on Mobile Ad Hoc networking and computing, MobiHoc ’11, Paris, 16–20 May 2011, pp.18:1–18:11. New York: ACM.

20. Musolesi, Hailes, Mascolo. An ad hoc mobility model founded on social network theory. In: Proceedings of the 7th ACM international symposium on modeling, analysis and simulation of wireless and mobile systems, MSWiM ’04, Venice, 4–6 October 2004, pp.20–24. New York: ACM.

21. Backstrom, Sun, Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In: Proceedings of the 19th international conference on world wide web, WWW ’10, Raleigh, NC, 26–30 April 2010, pp.61–70. New York: ACM.

22. Lian, Zhao, Xie. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’14, New York, 24–27 August 2014, pp.831–840. New York: ACM.

23. Schmidt, Winther, Hansen LK. Bayesian non-negative matrix factorization. In: Proceedings of the 8th international conference on independent component analysis and signal separation, ICA 2009, Paraty, Brazil, 15–18 March 2009, pp.540–547. Berlin, Heidelberg: Springer.

24. Liu, Yao. Learning geographical preferences for point-of-interest recommendation. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’13, Chicago, IL, 11–14 August 2013, pp.1043–1051. New York: ACM.

25. Eagle, Pentland. Reality mining: sensing complex social systems. Pers Ubiquit Comput 2006; 10(4): 255–268.

26. Dean, Ghemawat. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th symposium on operating systems design & implementation, OSDI’04 (vol. 6), San Francisco, CA, 6–8 December 2004, pp.107–113. Berkeley, CA: USENIX Association.