Abstract
Keywords
Introduction
Database technology has developed since the mid-1960s and is now an active research area.1 Database-based information systems are becoming part of the information infrastructure of the economy and government, and the value of the information stored in databases keeps increasing. The security of database systems is therefore becoming more and more important. In today's networked environment, database systems face a growing number of security threats, and attacks on them are constant. Database system security technology refers to the measures established to protect the database system software and the data it holds from accidental or malicious illegal copying, destruction, alteration, and leakage. At present, database system security, network security, and operating system security together constitute the most important research fields in information system security.
In general, the security requirements of a database system can be summarized in three aspects: integrity, confidentiality, and availability.2 The integrity of the database system mainly includes physical integrity and logical integrity. Physical integrity means that the data in the database are not affected by physical failures (such as hardware failures and power loss) and that the database can be rebuilt and recovered after catastrophic destruction. Logical integrity refers to the protection of the logical structure of the database, including the semantic integrity of the data and operational integrity. Semantic integrity mainly means that data access logically satisfies the integrity constraints, and operational integrity mainly means that concurrent transactions preserve the logical consistency of the data.
The confidentiality of the database means that unauthorized users are not allowed to access the data.3 It is generally required to identify and authenticate database users and to apply an access control policy which ensures that users can access only the data they are authorized for, so that different users of the same data set can be granted different access rights. At the same time, it should be possible to track and audit users' access operations. In addition, users should be prevented from obtaining unauthorized data from known data through inference, which would result in information leakage.
Database availability means that authorized users should not be denied normal operation of the database, while the efficiency of the system is ensured and users are given a friendly human–computer interface. In general, the confidentiality and availability of a database are in tension. Analyzing and resolving this tension is the main objective of the database security model and of a series of security mechanisms.
Database watermarking
Introduction of database watermarking system
Database watermarking embeds imperceptible and difficult-to-remove marks in the database by means of signal processing, without damaging the content or availability of the database, in order to protect its security.3,4
From the signal processing point of view, the watermark embedded in the database can be seen as a weak signal superimposed on a strong background. As long as the strength of the superimposed watermark signal is below the allowable distortion threshold for database availability, the watermark can be embedded in a way that satisfies the concealment requirement without affecting the availability of the database.
From the perspective of digital communications, database watermark embedding can be understood as transmitting a narrowband signal (the watermark) over a wideband channel (the carrier database) using spread spectrum communication technology.4–6 Although the watermark signal has a certain amount of energy, the energy distributed to any single frequency in the channel is difficult to detect. Decoding (detecting) the watermark is then a problem of detecting a weak signal in a noisy channel. The database watermarking system consists of four parts: watermark generation, watermark embedding, watermark extraction, and watermark detection. The watermark signal, typically a binary bit sequence, is generated under key control.7 It may contain adaptive information extracted from the characteristics of the database, and it may also contain data owner information such as a trademark logo or copyright text; this information generally needs to be preprocessed and converted before it is used to generate the watermark signal. The watermark carrier channel is obtained, with the participation of the key, by extracting features of the relational data and applying an appropriate transformation. The embedding and detection of watermark signals are carried out under key control, as shown in Figure 1.
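As an illustration of generating a binary watermark sequence under key control, the following minimal sketch derives a reproducible bit sequence from a secret key. The use of HMAC-SHA256 in counter mode, the function name `generate_watermark_bits`, and the example key are assumptions made for illustration; they are not the generation algorithm of this article.

```python
import hashlib
import hmac


def generate_watermark_bits(key: bytes, length: int) -> list[int]:
    """Derive `length` pseudo-random bits from `key` (HMAC-SHA256, counter mode)."""
    bits: list[int] = []
    counter = 0
    while len(bits) < length:
        # Each counter value yields 32 fresh bytes that depend on the key.
        block = hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        for byte in block:
            for shift in range(7, -1, -1):  # most significant bit first
                bits.append((byte >> shift) & 1)
        counter += 1
    return bits[:length]


# Hypothetical owner key; the same key always reproduces the same sequence.
watermark = generate_watermark_bits(b"owner-secret-key", 64)
```

Because the sequence depends only on the key, the owner can regenerate it at detection time without storing the watermark itself.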

Figure 1. Database watermark generation and embedding process.
Watermark detection generally does not require the original database. The detected and extracted watermark signal can be used for copyright certification and integrity verification of the database and, furthermore, for locating data tampering and tracking piracy. If the watermark contains data owner information, that information can be obtained by processing the extracted watermark signal. In the database watermarking system, the watermark hiding algorithm can be made public. Watermark generation, transformation processing, watermark carrier channel acquisition, watermark embedding, detection, and extraction are all completed under the control of the key, so the security of the system depends on the key. Figure 2 shows the watermark detection and extraction process.

Figure 2. Database watermark detection and extraction.
The process of database watermarking involves several key algorithmic steps that have a significant impact on the performance of the database watermark.8,9 These steps are watermark signal generation, watermark carrier channel acquisition, relational tuple marking, watermark embedding, and watermark detection and extraction. Together they form the core of the database watermarking model.
Basic model of database watermarking
According to the characteristics of digital watermarking systems, the database watermarking system can be represented by an 8-tuple: Watermark Database (WD) = (R, K, W, G, E, A, D, X).
In the model, R represents the set of relational data sets, K represents the set of watermark keys, and W represents the set of watermark signals.
In general, the database watermark signal can be expressed as
Here, W is the watermark signal domain. The watermark can be an image, graphics, text, and so on, and is generally in binary form, that is, its elements take values in U = {0, 1}. G denotes the watermark generation algorithm, which uses the watermark key K and the relational data set R to generate the watermark signal.
In some cases, the generation of watermark signals also incorporates copyright information given by the data owner, such as trademark patterns and copyright statements. In this case, the copyright information can be merged into a generalized watermark key. E represents the watermark embedding algorithm, which embeds the watermark signal W into the relational data under the control of the key K to obtain the watermarked relational data Rw.
A represents the watermark attack algorithm,5,9 in which the attacker, using a forged key, attacks the watermarked relational data set Rw to obtain an attacked version of Rw.
D denotes the watermark detection algorithm. Furthermore, for different watermark types, the watermark extraction algorithm X may contain different authentication functions. Based on the watermark information W extracted by X, the data owner can prove copyright in the case of a robust watermark. In the case of a fragile watermark, the authenticity and integrity of the data can be verified, tampered data can be located, and the data can be recovered. Digital fingerprinting can additionally be used to track pirates.
Research on database watermarking based on independent component and multiple rolling
Independent Component Analysis introduction
Independent Component Analysis (ICA) originated from blind source separation in signal processing. Blind source separation is the process of recovering each source signal from the received mixed signals alone, based on the statistical characteristics of the sources, without knowing the source signals or the transmission channel parameters.10 ICA is a method of signal processing and data analysis that arose from blind source separation. Because it relies on statistical properties, it is not sensitive to large distortion and noise, so a watermarking algorithm built on it has strong resistance to distortion and noise.
ICA is a blind source separation11 technology developed in recent years. In the absence of any other a priori information, the source signals are extracted directly from the observed signals solely on the basis of the statistical independence of the sources. It is used in image, speech, and biomedical signal processing and has significant practical value.
“Blind source” has two meanings: the source signals cannot be observed, and how the source signals are mixed is unknown. Blind source separation1,9 is therefore a natural choice when it is difficult to build a mathematical model of the transmission characteristics from the sources to the sensors, or when prior knowledge of the transmission is not available. The core problem of blind source separation is the learning algorithm for the separation (or de-mixing) matrix. The basic idea is to extract statistically independent features as the input representation without losing information.
ICA is a method for estimating the mixing matrix in blind source separation. Principal component analysis and singular value decomposition are also linear transformations, but these two can only decompose the data according to energy and remove the second-order correlation between the data, whereas ICA can also remove the higher-order correlation of the input data (Figure 3).12

Figure 3. The blind signal processing model.
The main idea of ICA is to obtain, from a group of observed signals, estimates of the statistically independent source signals.
In 1994, Pierre Comon formalized the concept of ICA, proposed a construction algorithm based on a cost function of higher-order statistics, and obtained an adaptive method for the separating matrix, which makes on-line blind signal separation possible. Tony Bell and Terry Sejnowski proposed the information maximization (InfoMax) method: maximizing the differential entropy of the output of a neural network maximizes the mutual information between its input and output, and a stochastic gradient learning rule is used to reach this maximum. Lee et al. extended the maximum entropy algorithm by estimating the kurtosis of the signals to distinguish super-Gaussian from sub-Gaussian signals, which overcame the limitation that the original algorithm could not separate super-Gaussian and sub-Gaussian signals simultaneously. A fixed-point algorithm for ICA has also been studied, which extracts the independent components from the signal one at a time and yields a fast ICA algorithm.
First, the observation signal is centered,13 which is the most basic and necessary preprocessing step of the ICA algorithm. The process is to subtract the mean vector from the observation signal so that it becomes a zero-mean vector.
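The centering step can be sketched as follows. This is a minimal illustration in plain Python; the function name `center` and the sample observations are assumptions and do not come from the experiments in this article.

```python
def center(observations):
    """Subtract the per-channel mean from every observation vector."""
    n = len(observations)
    dim = len(observations[0])
    # Mean vector: one average per channel (column).
    mean = [sum(row[j] for row in observations) / n for j in range(dim)]
    centered = [[row[j] - mean[j] for j in range(dim)] for row in observations]
    return centered, mean


# Three observations of a hypothetical two-channel signal.
x = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
x_centered, x_mean = center(x)
```

The mean vector is returned as well, so that it can be added back to the estimated sources after separation if the original offset is needed.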

Figure 4. Algorithm principle.
Common ICA algorithm
At present, the common ICA algorithms4,10 can be divided into two broad categories. The first consists of batch algorithms based on a criterion function. Their advantage is that they are suitable for independent components of any distribution, but they generally require complicated matrix or vector operations; examples are joint approximate diagonalization of eigenmatrices (JADE) and fourth-order blind identification (FOBI). The second consists of adaptive algorithms based on the stochastic gradient method, such as mutual information minimization and InfoMax. These algorithms can be implemented by neural networks and converge to a corresponding solution, but their main problem is that convergence is slow, and the correct selection of the learning rate parameter plays a decisive role in convergence.
As can be seen from the data mixing model and its assumptions above, ICA separates signals with very little prior information. The mathematical basis that makes this possible is the central limit theorem of mathematical statistics, which tells us that the distribution of a sum of multiple independent non-Gaussian random variables is closer to a Gaussian distribution than the distributions of the individual variables. Therefore, whether independent signals have been obtained can be determined by measuring the non-Gaussianity of the respective components of the output vector.
There are two common measures of non-Gaussianity: the kurtosis of a random variable and its negentropy (negative entropy).
The kurtosis of the random variable
For the separation process after whitening, the covariance matrix of the output vector is the unit matrix, and the kurtosis becomes
The kurtosis of a Gaussian random variable is zero, and the kurtosis of a non-Gaussian random variable is generally not zero. A random variable with positive kurtosis is called a super-Gaussian variable, and a random variable with negative kurtosis is called a sub-Gaussian variable.
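The sign of the kurtosis can be checked numerically. The sketch below assumes the common definition kurt(y) = E{y^4} − 3(E{y^2})^2 for a zero-mean variable y; the symbol y, the sample sizes, and the choice of the uniform and Laplace distributions as sub-Gaussian and super-Gaussian examples are illustrative assumptions.

```python
import random


def kurtosis(samples):
    """Kurtosis E{y^4} - 3 (E{y^2})^2 of the centered samples."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    m2 = sum(c ** 2 for c in centered) / n  # second moment
    m4 = sum(c ** 4 for c in centered) / n  # fourth moment
    return m4 - 3.0 * m2 ** 2


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
uniform = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # sub-Gaussian
# A Laplace sample: exponential magnitude with a random sign (super-Gaussian).
laplace = [rng.expovariate(1.0) * rng.choice((-1, 1)) for _ in range(n)]

for name, data in (("gaussian", gaussian), ("uniform", uniform), ("laplace", laplace)):
    print(name, kurtosis(data))
```

The Gaussian sample gives a value close to zero, the uniform sample a negative value, and the Laplace sample a clearly positive value, in line with the classification above.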
The negative entropy of the random variable
where the entropy of
Ratio ICA database watermarking scheme
It should be noted that when the traditional ICA method solves for the separation matrix from the mixing matrix A, the estimated signals may come out in an order that is inconsistent with the original signals. Therefore, when ICA is applied to digital watermarking, the uncertainty of the arrangement order must be considered first, so that the watermark information1,14 can be extracted accurately. The amplitude uncertainty of the ICA method, by contrast, can be tolerated when digital image information is embedded as the watermark, because the amplitude is recovered only up to a scale factor: even if the amplitude is scaled or its sign is reversed, the digital image can still be recognized.
Based on the above considerations, the ICA algorithm can be used to embed the digital watermark, which effectively guarantees the accuracy of the watermarking process. The ICA algorithm used here is mainly aimed at separating a mixture of two images, is easy to implement in engineering practice, and is derived from the matrix mixing model, so it is not affected by the probability distribution of the original signals.
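The order and amplitude uncertainties discussed above can be illustrated with a small numerical sketch: swapping and rescaling the sources, while compensating in the mixing matrix, produces exactly the same observation, so the observation alone cannot fix the order or scale. The matrices and values below are arbitrary assumptions chosen for exact arithmetic.

```python
def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


A = [[1.0, 0.5], [0.25, 1.0]]  # hypothetical mixing matrix
s = [2.0, -4.0]                # hypothetical source values
x = matvec(A, s)               # observed mixture

# M swaps the two sources and rescales them (one with a sign reversal).
M = [[0.0, 2.0], [-0.5, 0.0]]
M_inv = [[0.0, -2.0], [0.5, 0.0]]

s_alt = matvec(M, s)       # reordered, rescaled sources
A_alt = matmul(A, M_inv)   # compensating mixing matrix
x_alt = matvec(A_alt, s_alt)

print(x, x_alt)  # both mixtures are identical
```

Any scheme that relies on the extracted components therefore has to resolve the ordering explicitly, whereas a scaled or sign-reversed watermark image remains recognizable.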
Without loss of generality, it is assumed that the two sets of different original signals
The mixed matrix A gets the mixed signal
Set
So
Generation of watermark independent component and marking of watermark locations
The watermark information is generated by mapping a binary image into bit information. In the watermark embedding process, we map the image that carries the copyright information into a binary string to generate the watermark bits, and then embed them repeatedly into the redundant data of the database. During detection, the extracted watermark bits are recombined to form the complete watermark information. The watermark image is processed by the ICA method to generate several watermark independent components. This process is called watermark component generation.
Because database data change frequently, the watermark needs to be embedded in the database in a dispersed form. Since a specific part of the watermark is embedded in a specific tuple, the tuples used for embedding must be found again when the watermark is extracted.8 To find the same tuples, each tuple must be marked, that is, assigned a tag ID that plays a role similar to an identity number. It must be possible to locate the watermark using the same tag after watermark embedding, benign modification, or malicious attack. Therefore, the key step of watermark embedding is to use the marking algorithm to calculate the ID value from the primary key. The basic idea of the marking strategy is to first use a keyed one-way hash function, such as MD5, to determine which tuples are to be marked, based on the user-supplied key, the tuple primary key value, and the proportion of tuples to be marked, and then to determine the attribute to be marked and the bit position within it from the number of markable attributes and bits. In this way, some bits of some numeric values in some tuples of the relational database serve as marks, and the combination of these marked bits across the entire relational database carries the embedded watermark information. The tuple, the attribute, the bit position within the attribute, and the specific bit value are determined by the algorithm from the key, the tuple primary key value, and the proportion of tuples to be marked. The key, the tuple tags, the number of markable attributes, and the number of bits are known only to the owner of the relational database. To ensure that the database can resist attacks, an effective marking algorithm must be strongly robust.
Inspired by the literature,15–17 this article uses a one-way hash function as the marking algorithm. It always outputs a fixed-length hash value for an input message of any length, and it is easy to compute in the forward direction but very difficult to invert (Figure 5).
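The mapping between a binary image and a bit string can be sketched as follows. The row-by-row flattening order, the function names, and the small example image are assumptions for illustration only.

```python
def image_to_bits(image):
    """Flatten a binary image (rows of 0/1 pixels) into a bit list, row by row."""
    return [pixel for row in image for pixel in row]


def bits_to_image(bits, width):
    """Rebuild the image from a bit list, given the original row width."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# A hypothetical 3 x 4 binary logo.
logo = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
bits = image_to_bits(logo)
restored = bits_to_image(bits, width=4)
```

The image width has to be known at detection time in order to recombine the extracted bits into the watermark image.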
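A minimal sketch of such a marking step is given below. It is an assumption-laden illustration, not the marking algorithm of this article: HMAC-SHA256 is swapped in for the hash function, and the names `tuple_id`, `mark_position`, `gamma`, `num_attributes`, and `num_lsb` are hypothetical.

```python
import hashlib
import hmac


def tuple_id(key: bytes, primary_key: str) -> int:
    """Keyed one-way hash of the primary key, interpreted as an integer."""
    digest = hmac.new(key, primary_key.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def mark_position(key, primary_key, gamma, num_attributes, num_lsb):
    """Return (attribute index, bit index) if the tuple is marked, else None."""
    tid = tuple_id(key, primary_key)
    if tid % gamma != 0:  # roughly one tuple in gamma is selected
        return None
    attribute_index = (tid // gamma) % num_attributes
    bit_index = (tid // (gamma * num_attributes)) % num_lsb
    return attribute_index, bit_index


# Hypothetical owner key and primary key values.
owner_key = b"owner-secret-key"
positions = {pk: mark_position(owner_key, pk, 10, 8, 2) for pk in ("1001", "1002", "1003")}
```

Because the position depends only on the key and the primary key, the same tuples and bits can be located again at detection time without access to the original database.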

Figure 5. Watermark independent component generation chart.
Watermark roll representation and watermarking information embedding
Assuming that the vector
The embedding of watermarks in a relational database6,14 is based on the following two assumptions: the watermark is embedded only in numeric attributes of the relational database, and the numeric attribute values can tolerate minor changes. Redundancy in the database refers to the redundant space available for embedding the watermark, that is, the total amount of change that the data in the database can tolerate. Relational database watermarking embeds the copyright information into the database as watermark information. To keep the modification of numerical values within an acceptable range, an acceptance standard must be set that protects the value of the data. If the allowed range of modification is too large, it will affect the normal use of the data; if it is too small, the redundant space is too small and the watermark information cannot be embedded. There are two ways to calculate the redundant space in a database:
Using the mean square error δ: The mean square error is an index that describes the variation (error) of the data. According to the Gaussian theory of accidental error, the error usually lies within (–δ, δ), which can be used to obtain the most suitable allowable distortion value.
Using the absolute error: For some databases, a modification beyond a certain limit changes the meaning of the data, so each data item can only be modified within its own range. The absolute error of each data item cannot exceed a given value.
The main steps of the embedded algorithm are as follows:
The primary key and the numeric attributes of the relational database are arranged in a certain order, and the ID value corresponding to the primary key of each tuple is calculated with the cryptographic hash function SHA: ID=Hash(
The data are selected according to the embedding factor γ, and all tuples are then divided into groups according to the remainder of each tuple ID divided by the length
Embedding the
We think that the mean square error is ideal for processing sampled data that share the same distribution and differ little. However, the value ranges of different fields in a relational database differ, so a value calculated this way is valid only for some data items,10,18 which limits the watermark embedding capacity. For a database, it matters more how much the embedded data change relative to the original data and whether the data remain usable after embedding. Therefore, the absolute allowable error is used as the measure for watermark embedding.
We carefully analyze these allowable errors and select a database with a real relational model for the experiment. From it we filter out all eligible data items, group them, and then discuss the increase in the redundant space of the database and the watermarking method. In watermark coding, combining the key defeats frequency analysis of the database fields and of the watermark distribution, which increases the difficulty of deletion and forgery.
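How an absolute allowable error can gate the embedding of a single bit is sketched below. A simple bit-replacement rule on non-negative integer attribute values is assumed purely for illustration; it is not necessarily the embedding rule of this article, and the names `embed_bit` and `max_abs_error` are hypothetical.

```python
def embed_bit(value: int, bit: int, bit_index: int, max_abs_error: int):
    """Set bit `bit_index` of `value` to `bit` if the change stays within the bound.

    Returns (new value, True) on success, or (original value, False) when the
    modification would exceed the absolute allowable error.
    """
    mask = 1 << bit_index
    marked = (value | mask) if bit else (value & ~mask)
    if abs(marked - value) > max_abs_error:
        return value, False  # skip this data item
    return marked, True


# Hypothetical attribute value with an allowable error of 1 unit.
print(embed_bit(2596, 1, 0, 1))  # least significant bit set, change of 1
print(embed_bit(2596, 1, 3, 4))  # change of 8 exceeds the bound, item is skipped
```

Data items that are skipped carry no watermark bit, which is one reason why the detector later needs to tolerate missing or erroneous bits.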
Watermark extraction
Watermark extraction is the inverse of watermark embedding. In the process of watermark extraction, we need several system parameters, namely, watermark key
In the detection process, the same mechanism as in the embedding process should be used to restore the watermark. At the same time, the database under test may have been modified, or the specified attribute may contain data in which no watermark was embedded, so the detection process may also extract erroneous watermark bits. The extracted watermark information therefore needs to be corrected and judged. The extraction process is similar to the embedding process; the difference is that the watermark bits are extracted from the data and decided by majority voting to obtain an accurate result. The watermark bits are then restored to the original watermark information.
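The majority voting step can be sketched as follows, assuming that each extracted bit comes with the index of the watermark position it belongs to. The function name `majority_vote` and the input format are assumptions for illustration.

```python
from collections import defaultdict


def majority_vote(extracted, watermark_length):
    """Recover each watermark bit from repeated, possibly erroneous copies.

    `extracted` is an iterable of (position, bit) pairs. Positions with no
    votes or with a tie are reported as None (undecided).
    """
    votes = defaultdict(lambda: [0, 0])  # position -> [count of 0s, count of 1s]
    for position, bit in extracted:
        votes[position % watermark_length][bit] += 1
    recovered = []
    for i in range(watermark_length):
        zeros, ones = votes[i]
        if zeros == ones:
            recovered.append(None)
        else:
            recovered.append(1 if ones > zeros else 0)
    return recovered


# Position 0 has one corrupted copy; position 3 received no copies at all.
copies = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (2, 1)]
print(majority_vote(copies, 4))
```

Reporting undecided positions separately lets the detector distinguish missing information from a bit that was actually recovered.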
Simulation experiment
We use the forest cover type data set in the University of California Irvine (UCI) Knowledge Discovery in Databases Archive as the test set. The data set contains 581,012 records, each of which contains 54 attribute fields. To simplify the experiment, we take a subset of the database as the experimental database of this article. We select the ID and eight attributes (Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, and Hillshade_Noon) and take the first 30,000 records to form the experimental database. The experimental environment is a Pentium G620 with 4 GB RAM, SUN JDK 1.41, and Windows 7. The back-end database system is Microsoft Office Access 2010, the front-end programming environment is Java, and the database is accessed through Java DataBase Connectivity (JDBC). The watermark generation algorithm of this experiment is implemented in MATLAB. An attack on the watermark is any process that may weaken the watermark signal. A watermarking system should be able to protect the original copyright information of the data and withstand the various forms of malicious attack that data thieves mount on the relational database. The common malicious attacks are the following (Figure 6).

Figure 6. The flowchart of watermark embedding.
Subset selection attack
In this attack, the attacker does not use all the attributes and tuples of the watermarked database but only a subset of its attributes or tuples, in order to erase the watermark. Subset extraction attacks damage the watermark directly, and part of the watermark information is lost. However, as long as the watermark information in the remaining carrier data reaches a certain threshold, the watermark can be recovered. For tuple extraction attacks, the carrier data can be grouped when the embedding strategy is designed, and the watermark sequence can be embedded repeatedly in each group, which effectively resists subset extraction. For attribute extraction attacks, the integrity of the tuple is destroyed, so the synchronization between the watermark and the tuple is generally destroyed as well. To prevent this, the embedding position can be determined only by the tuple primary key and the key when the watermark is embedded (assuming that deleting or tampering with the primary key would seriously affect the availability of the data).19,20 In this way, the embedding position does not depend on other attributes, the attribute extraction attack does not destroy the synchronization between the watermark and the tuple, and the watermark information can be effectively recovered from the partial attribute data (Figure 7).

Figure 7. Subset selection attack.
The results show that the larger the subset the attacker chooses, the more watermark bits are extracted during detection, and the watermark image can still be extracted well when 30% of the data are selected. The watermark copyright information can be successfully extracted when the attacker selects anywhere from 30% to 100% of the data by subset selection.
Subset addition attack
In this attack, the attacker has stolen all or part of the database and adds a number of malicious tuples to the watermark carrier database. For watermarks whose synchronization depends only on the current tuple, the added tuples generally do not affect the watermark in the original carrier data. The essence of the subset addition attack is to increase the amount of watermark-free data, thereby reducing the embedding strength of the watermark. For watermarks embedded by grouping, synchronization generally depends on the other tuples in the group, so the added tuples may destroy the watermark in the original carrier data. Therefore, the effective way to resist subset addition is to make the embedding of the watermark in a tuple depend only on that tuple, so that the synchronization information of the watermark is kept in the current tuple regardless of the tuples that are maliciously added (Figure 8).

Figure 8. Subset addition attack.
The results show that subset addition has little effect on the copyright information image. The original 30,000 tuples contain the complete watermark information. After the amount of data is increased from 30,000 to 60,000, the watermark information is affected to a certain extent, which makes the watermark image noisy, but the watermark copyright image still shows good robustness.
Subset change attack
In this attack, the attacker has stolen all or part of the database and changes part of the data in the watermark carrier database. Subset change is the most common watermark attack. Depending on the object and method of change, it can be divided into least significant bit (LSB) reset attacks and random data replacement and similar rounding attacks. LSB reset attacks generally target database watermarks embedded with an LSB replacement algorithm and damage the watermark by resetting part of the LSBs. Common LSB reset methods include bit flipping and random bit replacement. The watermark embedded by an LSB replacement algorithm is generally random and balanced. For an LSB bit-flipping attack, if the attacker wants to hit the original watermark carrier bits accurately, this amounts to pushing the bit sequence at the original watermark positions into a low-probability region, which clearly exposes the attacker's piracy intention. In practice, attackers more often use LSB random bit replacement, that is, they reset a portion of the LSBs to random bits. In general, as the proportion of reset bits increases, the damage to the watermark increases, but the availability of the carrier data decreases.
Similarly, random data replacement and similar rounding attacks can also damage the watermark; in essence, they add noise to weaken the watermark signal. At the same time, the availability of the carrier data is correspondingly reduced. Watermark algorithm designers generally increase the embedding factor to improve the robustness of the watermark against such attacks (Figure 9).
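The effect of an LSB random bit replacement attack on a repeatedly embedded watermark can be simulated with a small sketch. It uses a simplified stand-in scheme (watermark bit i mod L written into the LSB of value i, recovered by majority voting), not the ratio ICA scheme or the experimental data of this article; the synthetic carrier values and all names are assumptions.

```python
import random


def lsb_random_replacement(values, fraction, rng):
    """Replace the LSB of a random `fraction` of the values with a random bit."""
    attacked = list(values)
    count = int(len(values) * fraction)
    for i in rng.sample(range(len(values)), count):
        attacked[i] = (attacked[i] & ~1) | rng.randint(0, 1)
    return attacked


def recover(values, watermark_length):
    """Majority vote over LSBs; value i carries watermark bit i mod watermark_length."""
    ones = [0] * watermark_length
    total = [0] * watermark_length
    for i, v in enumerate(values):
        ones[i % watermark_length] += v & 1
        total[i % watermark_length] += 1
    return [1 if 2 * ones[j] > total[j] else 0 for j in range(watermark_length)]


rng = random.Random(1)  # fixed seed for a reproducible run
watermark = [1, 0, 1, 1, 0, 0, 1, 0]
carrier = [rng.randrange(1000, 4000) for _ in range(8000)]  # synthetic values
marked = [(v & ~1) | watermark[i % len(watermark)] for i, v in enumerate(carrier)]

attacked = lsb_random_replacement(marked, 0.4, rng)
changed = sum(a != m for a, m in zip(attacked, marked))
print(changed, recover(attacked, len(watermark)) == watermark)
```

Replacing a bit with a random bit changes it only about half of the time, so the fraction of carrier bits actually flipped is roughly half of the attacked fraction, and redundancy plus voting can absorb it.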

Figure 9. Simulation experiment of subset after attack.
From the watermark attacks described above, it can be seen that attacks such as subset addition and extraction essentially result in the loss of the matching information of some attribute values or of part of the data. We therefore use the most common attack, the subset change attack, as the primary means of evaluation. Using the ratio ICA method, we perform 100 Monte Carlo simulation comparison experiments on 150 × 30 and 300 × 60 watermark images. The simulation results are shown in Figure 9.
In the simulation, the attacker randomly changes 5%, 20%, 40%, and 50% of the database content, and the results are shown in Figure 10. The experimental results show that the watermark can still be identified when 40% of the data are changed, but it becomes difficult to identify when the change exceeds 50%. From the perspective of the relational database, if the data have undergone such large changes, their usefulness is also destroyed. This experiment demonstrates the anti-attack performance of the watermark from both subjective and objective points of view, which indicates that the watermark embedding model is not only algorithmically simple but also has good detection performance and strong controllability.

Figure 10. Simulation experiment under random attack.
Analysis
ICA is a method that has emerged recently and is developing vigorously. Here we combine the ICA method with watermark detection, so that the watermark information embedded in the database is thoroughly scrambled, which effectively prevents unauthorized attacks. The method uses a new ratio ICA extraction algorithm that extracts watermark information from the database accurately and efficiently, and the extraction process does not require the original database. The data show that the method is highly feasible and robust.
Conclusion
With the rapid development of networks and the increasing use of databases, how to effectively protect database copyright has become a new issue in database security. In recent years, some scholars have proposed using digital watermarking to protect the copyright of databases. Although some results have been achieved, the research models and algorithms are not yet mature.
This article presents a database watermarking method based on ICA and multiple data. The ICA method is used to extract the watermark image, and the ratio ICA method is used to resolve the uncertainty of the arrangement order. This method is more intuitive than previous methods, and it readily supports watermark recognition. By introducing various attack models, we evaluate the performance of the algorithm. The simulation results show that the method has good effectiveness, feasibility, and robustness.
