Abstract
Keywords
Introduction
With the development and application of data acquisition equipment and technology in the Internet of things, the joint use of multiple data centers is now regarded as essential for many online services. 1 Simultaneously, data center privacy has also become a focus of attention. For example, Europe’s General Data Protection Regulation (GDPR) requires data center users to focus on privacy. 2
Due to the dynamic and heterogeneous nature of multiple data centers, privacy and security are regarded as the most difficult challenges; the misuse of legitimate access to data is a severe information security concern for both organizations and individuals. For example, after the 9/11 attacks, the US government allowed greater sharing of information among government, public security, military, and other departments as a defense against future terrorist strikes. However, this plan led to a massive leak of sensitive data by low-ranked personnel. 3 From a security engineering viewpoint, this problem is partially due to the abuse of privileges that have been granted by authentication or authorization services, both of which may be regarded as functions of access control.
Access control constitutes a fundamental aspect of security and privacy protection. Typically, access control is used to prevent unauthorized users from gaining access to system resources, to prevent authorized users from accessing resources in an unauthorized manner, and to allow authorized users to access resources in an authorized manner. Due to the complex and dynamic nature of the multi-center big data application field, some of these factors cannot be addressed or phrased in terms of traditional access control. In Crampton and Huth, 4 a new access-control architecture was formulated, the realization of which might form part of an overall strategy for addressing the insider problem. In this architecture, trustworthiness and risk-assessment methodologies were combined and extended in traditional role-based access control. Here, a risk-assessment methodology is used to make decisions based on the context of the resource, and the requestor’s trustworthiness is defined as the probability of the requestor not abusing the authorization, which could be derived via statistical analysis of the requestor’s behavior.
Moreover, cross-center data processing applications are typically implemented as workflows, 5 which renders access-control-based privacy protection more complicated. A workflow can be represented by a task graph G = (V, E), in which the vertices V represent the tasks and the edges E represent the dependencies between them. In massive workflow environments, tasks are executed sequentially according to the execution order. As a result, it is difficult to identify, from a running system, entities that deviate from their normal pattern of behavior. The use of user access sequences for dynamic access control provides a new approach. 6 However, sequence anomaly analysis is currently applied mainly to system security detection, and little research has been conducted on its application to user access control. In-depth research on abnormal user behavior in access requests for resources or services is lacking. Therefore, we propose a dynamic and semantic-aware access-control (DSAAC) model that is based on sequence anomaly evaluation and that considers the characteristics of workflows in a multiple data center environment. Through the sequence anomaly evaluation method and the introduction of semantic constraints, users and resources in the access-control process can be restricted and, simultaneously, user-resource associations can be defined; hence, this process is suitable for dynamic access control in a massive-resource environment.
The rest of this article is organized as follows. Section “Related work” presents the concept of access control and different access-control models. Section “Our DSAAC model” presents the scheme and the specific algorithm steps of the dynamic access-control model. Section “Experiments and analysis” presents the experiments and their analysis. Finally, section “Conclusion” concludes the article.
Related work
From privacy and security perspectives, access control is one of the most fundamental security methods. Access control allows authorized users to obtain physical or logical access to various resources and ensures the confidentiality and integrity of system resources.7,8 Early information systems were small in scale, had few users, and supported simple business logic, and their access-control systems mainly relied on access-control lists. As systems grew in complexity and scale, the role-based access-control (RBAC) model was proposed to improve the flexibility of authorization management; it combines the management of user authorization with that of role authorization.
Based on the proposed concepts of tasks and workflows, the task-based access-control (TBAC) model has begun to receive research attention, which focuses on the process of user execution and authorizes each requested resource according to the execution status of the current task.9–11 Many scholars have expanded TBAC according to their needs and have proposed such models as the task-role-based access-control (TRBAC) model12–14 and the extended TRBAC model.15,16 These models are expanded versions of the TBAC model and have realized satisfactory application results in various scenarios. However, extended access-control models of this type still use static policy rules, do not effectively use historical execution information, and cannot dynamically adjust the allocation of permissions according to the historical execution of tasks and other issues. Due to the relative complexity of TBAC in formulating static authorization policies, less research has been conducted on TBAC.
In a massive-resource environment, dynamic authorization is an effective technique for providing dynamic resource authorization by combining user behavior, context, task status, and other attribute information. Dynamic authorization approaches are characterized by the use of not only the policies but also environment features that are estimated in real time to determine access decisions. Among them, the risk-assessment method is an effective dynamic solution for assessing the uncertainty of a user’s behaviors in a complex environment to control the insecurity from a requestor. In Zheng and Cai 17 and Diep et al., 18 a model for risk assessment has been proposed, which provides a useful reference for the consideration of context in risk assessment. An access-control model that is based on user behavior, context vulnerabilities, and resource properties is proposed in Bouchami et al., 19 but it focuses only on defining the risk level between collaborative environments. Lakshmi et al. proposed a model for the identification of insider attackers by adjusting the session based on risk assessment.20,21 Fall et al. 22 also proposed a risk-adaptive authorization mechanism to satisfy the dynamicity in the cloud environment.
In addition, several preliminary studies combined risk and access control in using trust to assign users to roles. In Shaikh et al., 23 a risk-trust authorization mechanism is proposed for assessing user credentials to adjust the access rights dynamically. In Ni et al., 24 risk-based access-control systems that are based on fuzzy inferences are proposed. This study showed that fuzzy inference is a satisfactory approach for implementing a risk-based access-control system.
The current research on dynamic access control mainly focuses on the relationship between the user behavior and the context. In the access-control model that is based on user behavior and context, effective evaluation of user historical data is lacking, and it focuses on the analysis of the current states of users while paying less attention to the attributes of resources.
In workflow systems, a service is a task that is executed over a period and can be formalized into a task sequence. In security research in web and computing environments, anomaly detection in user behavior sequences has become a hot topic, and remarkable results have been obtained. Sendi et al. 6 use a hidden Markov model to analyze user commands, to establish a normal behavior archive for the user’s command sequence, and to compare the current sequence with normal behavior sequences to determine the similarity, which is used to distinguish legal users from intruders. Sequence mining is often used to mine hidden user behaviors to realize better personalized recommendation and resource allocation. Sequence mining is also used to detect abnormal behaviors of users. As discussed in Xie and Yu, 25 a hidden Markov model is used to describe the browsing behaviors of web users and to detect abnormal behaviors. Zhou et al. 26 proposed a method for user behavior anomaly detection that is based on data stream sequence mining. By studying the sequence relationships between subsequences, user behavior anomalies were discovered, and the common problems of high delay and low accuracy in sequence anomaly detection algorithms were overcome. Sequence anomaly detection can provide low-latency and high-efficiency detection even when the amount of data is large.
However, few studies in access-control technology have been based on sequence anomaly detection. A user’s behavior is a chronologically ordered set of observation records. Introducing a method that is based on sequence anomaly detection into access control can therefore realize dynamic and flexible authorization in a multiple data center environment. Thus, we study how to introduce sequence anomaly detection into the access-control process and how to improve the available sequence pattern mining and detection algorithms so that they can provide dynamic authorization management in the TBAC scenario and protect privacy.
Our DSAAC model
In this DSAAC model, users must pass the identity authentication of the system first. Then, the request in the workflow process is formally modeled as a behavior sequence, and the user’s permission for each step of the task execution is decided via the method of sequence anomaly detection. The scheme is illustrated in Figure 1.

Dynamic and semantic-aware access-control model scheme.
First, characteristic attributes and historical behavior requests are extracted from the request object. The characteristic attributes include subject attributes, object attributes, and other related attributes of the request. We model the historical behavior request using a serializing model. Then, we can authorize user access to the system automatically if the current behavior sequence has a high likelihood of being a normal sequence or warn the administrator if the current behavior sequence has a high likelihood of being an abnormal sequence (risk assessment). Finally, administrators can review the decision results, add the positive sample to the sample library for semantic sequence pattern mining, update the sequence pattern library, and provide normal behavior sequence patterns for the sequence anomaly detection module. Next, we will introduce behavior sequence modeling, sequence pattern mining, and sequence anomaly detection in detail.
Behavior sequence modeling
The modeling of a behavior sequence is the representation of the user’s request in a standard behavior sequence format. We represent a request as a behavior sequence via formalization, pruning, and merging (Figure 2). Then, we can identify and use this standard behavior sequence in the pattern mining and anomaly detection process.

Standard behavior sequence template.
Basic unit definition
The most basic unit in the behavior sequence is requests. Requests are also associated with subjects, resources, and tasks, among others. We divide each request into a triple
where request
Standardizing the sequence is equivalent to standardizing the sequence
A task is composed of a standard
where behavior sequence
We establish a sequence pattern library according to the behavior sequence. The sequence pattern library is a database that contains many normal sequence patterns
Semantic-related definition
Because business needs differ across systems, the modeling of basic units alone cannot express all of the tasks, behaviors, and resources involved. For example, in a public security system, the staff must specify a file number to query a case. The file number is distributed according to the level and grouping of the staff, and it is a private attribute of the staff; it has no relation to the forward step. This file number is semantic information. Therefore, we introduce semantic-related definitions. When matching sequence patterns, semantic constraints are used to check the legality of the current behaviors with respect to resource operations. We also use semantic information to guide risk assessment when mining sequence patterns. The following describes the establishment of a semantic-related model.
Semantic recognition
The main objective of semantic recognition is to identify semantics. In special tasks, we must attach semantic information to the tasks. We typically use a URL to represent a task; thus, specified semantic information can be appended through URL parameters. In a user request log, a request record is of the following format:
A request record can be sliced to obtain recognizable information according to the preset regular rule “/id=([0-9a-z]+)&iaction=(\w+)&ast=(\w+)/”, and we can obtain triple{
["id:1407246132s7jn1j8b","action:view","ast:0f"]
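The slicing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the pattern is modeled on the rule quoted above but uses `action` as the parameter name so that it yields the `action` field of the example triple, and the sample request path and the function name `slice_record` are hypothetical.

```python
import re

# Hypothetical rule modeled on the one quoted in the text. The parameter
# names (id, action, ast) follow the example triple; this is an assumption.
RECORD_RULE = re.compile(r"id=([0-9a-z]+)&action=(\w+)&ast=(\w+)")


def slice_record(record: str):
    """Return the semantic triple of a request record, or None if the rule does not match."""
    match = RECORD_RULE.search(record)
    if match is None:
        return None
    request_id, action, ast = match.groups()
    return [f"id:{request_id}", f"action:{action}", f"ast:{ast}"]


# Made-up request path that carries the parameters of the example triple.
record = "/case/query?id=1407246132s7jn1j8b&action=view&ast=0f"
print(slice_record(record))
```

Records that carry no semantic parameters simply return `None`, so they can be passed on without semantic annotation.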
Semantic variables
The semantic variable is defined as a quad
where
Semantic rules
A semantic rule is a regular sentence. A sentence is composed of elements of an alphabet ∑, namely,
For example, consider sentence “User A accesses resource B with semantic constraint person_job_family 1.” The process is as follows.
Use the relevant semantic rules to check the semantics of the statements that are processed above, as expressed in formula (3)
in which
Semantic constraints
Semantic constraints are related constraints from the perspective of resources, such as inclusive relationships, functional relationships, mutually exclusive relationships, self-reflexive relationships, and self-checking constraints. These constraints are combinatorial conditions that are based on common logical relationships, such as (A and B), (A or B), and (not A). For example, suppose that semantic value A is PERSON_JOB_FAMILY and equals 1, and that B is PERSON_LOCATION and equals 2. The constraint (A and B) is satisfied only when A and B hold simultaneously. Corresponding semantic constraints are defined on each step of the task to constrain the operation of resources. A standard semantic constraint is of the following form:
<PERSON_JOB_FAMILY==1 and PERSON_LOCATION==2>
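A constraint of this form could be checked against the semantic values of a request roughly as follows. This is a minimal sketch, not the model's actual checker: the function name `check_constraint` is hypothetical, and it assumes flat constraints built from `NAME==value` terms joined by `and`, `or`, and a leading `not`, without nested parentheses.

```python
def check_constraint(constraint: str, semantics: dict) -> bool:
    """Evaluate a flat semantic constraint against a dict of semantic values.

    Assumed grammar: or-joined clauses of and-joined terms, where each term
    is NAME==value, optionally prefixed by "not ". No parentheses.
    """
    expression = constraint.strip().lstrip("<").rstrip(">")

    def term_holds(term: str) -> bool:
        term = term.strip()
        negated = term.startswith("not ")
        if negated:
            term = term[4:].strip()
        name, value = (part.strip() for part in term.split("=="))
        # Compare as strings so that 1 and "1" are treated alike.
        holds = str(semantics.get(name)) == value
        return not holds if negated else holds

    return any(
        all(term_holds(term) for term in clause.split(" and "))
        for clause in expression.split(" or ")
    )


constraint = "<PERSON_JOB_FAMILY==1 and PERSON_LOCATION==2>"
print(check_constraint(constraint, {"PERSON_JOB_FAMILY": 1, "PERSON_LOCATION": 2}))
print(check_constraint(constraint, {"PERSON_JOB_FAMILY": 1, "PERSON_LOCATION": 3}))
```

A missing semantic value is treated as not matching, so a request that lacks a required attribute fails the constraint.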
Sequence modeling process
Based on the above definition, the process of modeling a user behavior sequence will be described below. Each step of the user’s request involves the corresponding subject and object and semantic information. Behavior sequence modeling is the extraction of information for anomaly detection from the user’s request. The steps are illustrated in Figure 3.

Sequence modeling process.
In the figure, a request
Session information maintenance
Data collection is a prerequisite step for sequence modeling. In this step, we will complete session information maintenance to obtain the forward task. We collect basic subject information and object information that are related to the current request
Semantic environment extraction
Process the basic information of the request to obtain semantic environment information. For example, a standard network request log format is presented in Table 1. Through semantic constraints, useful subject environment semantic information and semantic annotation of object resources can be extracted from basic information, such as the IP address, date, file type, resource domain, and user agent.
Semantic description.
Sequence construction
Sequence construction is the process of merging a single request into a request sequence. In the process of sequence construction, it is necessary to conduct data preprocessing, clean up dirty/noisy data, extract and merge data from various sources, and convert the data into a suitable format. We identify tasks by semantics, and we filter out repeated tasks by pruning and setting time windows. Finally, we merge the tasks to construct the task sequence
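The pruning and merging step can be illustrated with a short Python sketch. It assumes that each request has already been reduced to a (timestamp, task) pair and that a repeated task counts as a duplicate when it recurs within a fixed time window; the function name `build_task_sequence`, the window length, and the task names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_task_sequence(requests: List[Tuple[float, str]], window: float = 5.0) -> List[str]:
    """Merge time-stamped task requests into one task sequence.

    A task that recurs within `window` seconds of its previous occurrence
    is pruned as a repeat. The window slides: every occurrence, pruned or
    not, refreshes the time at which the task was last seen.
    """
    last_seen: Dict[str, float] = {}
    sequence: List[str] = []
    for timestamp, task in sorted(requests):
        previous = last_seen.get(task)
        last_seen[task] = timestamp
        if previous is not None and timestamp - previous <= window:
            continue  # repeated task inside the time window: prune it
        sequence.append(task)
    return sequence


requests = [(0.0, "login"), (1.0, "search"), (2.0, "search"), (3.0, "view"), (20.0, "search")]
print(build_task_sequence(requests))
```

The second `search` request is dropped because it arrives one second after the first, while the one at 20 seconds is kept as a new step of the sequence.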
Sequential pattern mining
A sequential pattern mining algorithm is used to realize risk assessment. The advantage of using sequential pattern mining is that it can mine more instructive behavior patterns without requiring security administrators to formulate many complicated policy rules. Moreover, the pattern library will not store behavior sequential patterns that have the same meaning, which can reduce the burden of database storage. We utilize a closed sequence mining algorithm that is based on behavior.
Core strategy in algorithm design
The core strategy of the closed pattern mining algorithm is to expand the
Due to the particularity of TBAC scenarios, the mining of behavior sequence patterns differs substantially from the traditional mining of sequence patterns. In mining behavior sequence patterns in TBAC scenarios, the following three requirements must be satisfied:
To overcome the problem that the result set is too large and contains many sequences that have the same meaning, the behavior sequence pattern mining algorithm adopts a closed sequence mining approach: only closed sequence patterns are mined, and patterns with identical meaning are filtered out. We expand patterns from the first fixed frequent sequence, which increases the mining efficiency. We also use semantic constraints to extend the patterns.
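The notion of a closed pattern can be made concrete with a small Python sketch. It uses the common definition, under which a frequent pattern is closed if no longer pattern that contains it as a subsequence has the same support; this post-filtering is only an illustration of the definition, not the mining algorithm of this article, and the function names and sample supports are hypothetical.

```python
def is_subsequence(short, long):
    """Return True if `short` occurs in `long` in order (gaps allowed)."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def closed_patterns(frequent):
    """Keep only closed patterns from a dict {pattern tuple: support count}."""
    closed = {}
    for pattern, support in frequent.items():
        absorbed = any(
            len(other) > len(pattern)
            and other_support == support
            and is_subsequence(pattern, other)
            for other, other_support in frequent.items()
        )
        if not absorbed:
            closed[pattern] = support
    return closed


frequent = {("a",): 3, ("a", "b"): 3, ("a", "b", "c"): 2, ("b",): 4}
print(closed_patterns(frequent))
```

Here `("a",)` is dropped because `("a", "b")` carries the same support and therefore the same information, which is the redundancy that closed mining removes from the pattern library.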
Basic definition
The relevant definitions for this behavior-based closure mining algorithm are as follows:
Implementation of the sequential pattern mining algorithm
The flow of the sequential pattern mining algorithm is illustrated in Figure 4.

Sequence pattern mining process.
To overcome the particularity of behavior items, semantic constraints are used for guidance in the mining process. Semantic constraints are defined on each task and are constraints on resources. In this algorithm, we separate
Simultaneously, to reduce the storage space, we use a pseudo projection database in our algorithm. Projected
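The idea of a pseudo projection database can be sketched as follows, in the style commonly used by prefix-projection miners: instead of copying the suffix of every sequence that contains the current prefix, only a pointer (sequence index, offset) is stored. The function name `pseudo_project` and the sample task names are hypothetical, and the sketch is not claimed to match the article's algorithm in detail.

```python
def pseudo_project(database, projection, item):
    """Extend a pseudo-projection by one item.

    `projection` is a list of (sequence index, offset) pointers into
    `database`; no suffix is ever copied. The result points just past the
    first occurrence of `item` at or after each offset.
    """
    extended = []
    for seq_index, offset in projection:
        sequence = database[seq_index]
        try:
            position = sequence.index(item, offset)
        except ValueError:
            continue  # this sequence does not support the extended prefix
        extended.append((seq_index, position + 1))
    return extended


database = [
    ["login", "search", "view"],
    ["login", "view"],
    ["search", "view", "export"],
]
initial = [(index, 0) for index in range(len(database))]
after_login = pseudo_project(database, initial, "login")
after_login_view = pseudo_project(database, after_login, "view")
print(after_login, after_login_view)
```

The number of pointers that survive an extension is the support of the extended prefix, so support counting and projection are obtained in the same pass.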
Sequence anomaly detection
The normal sequence patterns that are obtained via sequence pattern mining can be used as a pattern library in sequence anomaly detection. Sequence anomaly detection includes pattern matching of sequences and semantic checking of current behaviors. Pattern matching of the behavior sequence is the comparison of the normal behavior sequence of the user with the current behavior sequence pattern to judge whether the current user request is abnormal. Define
The detection process of several request sequences in the pattern library and table is illustrated in Figure 5. The patterns in the pattern library are stored as trees. Figure 5(a) shows the stored partial normal sequence patterns in the pattern library. Each task sequence starts at

Abnormal sequence detection process: (a) normal behavior sequences recorded in tree structure in the pattern library, (b) the two request sequences are determined to be normal by pattern matching and semantic checking, and (c) the two request sequences are determined to be abnormal by pattern matching and semantic checking traffic requests.
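Storing the normal patterns as a tree and matching a request sequence against it can be sketched in Python with nested dictionaries. This covers only the pattern-matching part, not the semantic checking or the support-degree calculation; the function names and the task labels `t0` to `t4` are hypothetical.

```python
def build_pattern_tree(patterns):
    """Store normal sequence patterns as a prefix tree of nested dicts."""
    root = {}
    for pattern in patterns:
        node = root
        for task in pattern:
            node = node.setdefault(task, {})
    return root


def matched_prefix_length(tree, sequence):
    """Return how many leading tasks of `sequence` follow a path in the tree."""
    node = tree
    length = 0
    for task in sequence:
        if task not in node:
            break
        node = node[task]
        length += 1
    return length


tree = build_pattern_tree([("t0", "t1", "t2"), ("t0", "t3")])
print(matched_prefix_length(tree, ["t0", "t1", "t2"]))
print(matched_prefix_length(tree, ["t0", "t4"]))
```

A request sequence whose matched prefix is shorter than the sequence itself has left every known normal path at that step, which is the point at which further checks or an alert would apply.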
The support degree of the normal sequence to the current behavior sequence, namely,
Sequence patterns of equal length, as shown in task sequence
The lengths differ, and the prefixes of the current sequence pattern are normal behavior sequences in the matching pattern library, for example,
Select a set of similar behavior patterns for calculating the support degree of the normal behavior pattern with respect to the current behavior, such as in formula (6), wherein
By combining the length similarity and the semantic attribute distance, we can obtain the support degree for the current mode, as expressed in formula (7), where
Finally, we can grant authorization or not according to the support degree of the normal sequence in the pattern library with respect to the current sequence and the threshold value.
Experiments and analysis
In our experiments, the training and test data sets are obtained from the
The experiments were run on a host with a Core i5 processor, 8 GB of memory, and 256 GB of storage, running the Windows 10 operating system.
Performance comparison of mining algorithms
The sequence pattern data-mining model is used to train the legitimate request sequence to obtain the training behavior pattern set. Since the selected training set size and the setting of the minimum support (
Comparison of training set sizes.

Runtimes.
The running time of
The running time is related to the complexity of the algorithm. Since
Experimental analysis of the authorization strategy
First, select several training sets that are generated by the three considered algorithms for testing. To distinguish these training sets, we numbered them; for example, we numbered the five sets of training sets that were generated by
We design sequence rules that are based on the existing sequence and the pattern that is generated by the
The semantic information in the experiment is based on the rule base that is provided by the data platform. Its main function is to identify tasks. Since no relevant professional knowledge is available for reference, this experiment lacks the formulation of semantic constraints. Therefore, in practical applications, security managers can formulate suitable semantic constraints that are based on their professional knowledge. In a perfect semantic library system, the actual application performance of this model should be higher than that in this experiment. The experimental results are presented and analyzed below.
Performance test
We chose the response time as the reference index in the performance test of access control. This experiment compares the three algorithms in terms of the average response time, namely, the time between when a request is sent and when its authorization result is obtained.
According to Figure 7, the times for anomaly detection that are realized using the generated pattern databases that are based on the

Response time.
Accuracy analysis
We use the statistical false-positive rate and the accuracy rate to evaluate our model.
False-positive rate test
The false-positive rate refers to the proportion of legitimate requests in the test data that are incorrectly regarded as abnormal behaviors. First, we compare the false-positive rates of the training sets that are generated using the three algorithms. The experimental results are presented in Figure 8.

False-alarm rate (comparison of the training sets that were generated by three algorithms).
Algorithm

False alarm rate (with semantics algorithm).
As seen from the above figure, after using semantic constraints, the false-positive rate of this algorithm decreased from 5%–7% to 2%–3%, which is close to that of the policy-based access-control model. Because the static policy rules of our model are generated according to legal sequences, the false-positive rate is low.
Accuracy test
The correct rate refers to the proportion of abnormal behaviors that are regarded as abnormal in the test results relative to the number of abnormal requests. We compare the accuracies of the training sets that were generated by the three algorithms, and the experimental results are presented in Figure 10.
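Both evaluation measures can be computed from labeled test results as in the sketch below. It assumes that the false-positive rate is read as the share of legitimate requests that are flagged as abnormal and that the correct rate is the share of abnormal requests that are flagged as abnormal; the function name `evaluate` and the sample data are made up for illustration and are not experimental results.

```python
def evaluate(results):
    """Compute (false-positive rate, correct rate) from labeled results.

    `results` is a list of (is_abnormal, flagged_abnormal) pairs, where the
    first value is the ground truth and the second is the model's decision.
    """
    normal_flags = [flagged for truth, flagged in results if not truth]
    abnormal_flags = [flagged for truth, flagged in results if truth]
    false_positive_rate = sum(normal_flags) / len(normal_flags) if normal_flags else 0.0
    correct_rate = sum(abnormal_flags) / len(abnormal_flags) if abnormal_flags else 0.0
    return false_positive_rate, correct_rate


# Made-up data: 10 legitimate requests (1 flagged) and 4 abnormal requests (3 flagged).
results = (
    [(False, False)] * 9
    + [(False, True)]
    + [(True, True)] * 3
    + [(True, False)]
)
false_positive_rate, correct_rate = evaluate(results)
print(false_positive_rate, correct_rate)
```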

Correct rate (comparison of training sets generated by different algorithms).
According to the experimental results in Figure 10, the correct rate of the training set that was produced by

Correct rate (with the semantics algorithm).
According to the experimental data in Figure 11, when using
Conclusion
In this article, we studied the privacy protection issues of joint analysis across multiple data centers from the perspective of access control. Traditional service-based authorization models use static rules, which render the authorization process inflexible and unable to support the authorization requirements of multiple data center scenarios. According to the characteristics of scientific computing workflows across data centers, we incorporated semantic verification and sequence anomaly detection into the access-control process. The resulting model provides dynamic authorization management in the TBAC model and protects data privacy and security in multiple data center environments.
