Abstract
Keywords
Introduction
With the rapid development of the information age, data processing technology has become widely used in everyday life. As a common data processing technology, 1–3 information retrieval is the main way for users to query and obtain information. In the narrow sense, information retrieval refers only to the search process itself, 4,5 that is, a user applies a specific method and a search tool to find the required information in an information collection according to his or her needs. In the broad sense, information retrieval is the process of processing, organizing, and storing information in a certain way and then accurately finding the relevant information according to the specific needs of the user. Information retrieval is a research field of library science and computer science whose aim is to provide faster, more accurate, and more complete search methods. Information retrieval, especially text retrieval, has become one of the most influential search tools. On the Internet, it helps people around the world access a wide variety of information at almost no cost, which has strongly supported economic, cultural, and technological development. With the spread of the Internet, digital cameras, and multimedia, people increasingly search for information online, and image queries are becoming indispensable.
Today’s society is developing rapidly in an era of networking and informatization. Computer network technology 6 has become one of the most influential technologies in the world. It covers almost all social and economic fields and has been applied successfully in both civilian and military settings, so its development and application are crucial and will remain so for a long time. In countries around the world, “machine substitution” is a mainstream trend in manufacturing. Robots are also widely used in education, finance, medicine, transportation, security, electricity, and many other fields, which reflects their large application advantages and market potential. The robot industry should be a high-end manufacturing industry, but China’s robot industry has not yet escaped the trap of participating only in the low-end segments of a high-end industry. The industry also lacks top-level design, which undoubtedly hinders its development. Information retrieval is inseparable from, and built on, network technology. However, data often contain sensitive elements, and publishing or sharing them directly leaks user privacy. Therefore, we must consider how to handle sensitive data 7,8 accurately within large volumes of data.
A similarity measure comprehensively assesses how similar two things are. The closer the two things are, the larger their similarity measure; the more distant they are, the smaller it is. Similarity measures 9,10 come in a wide variety, and one is generally selected according to the actual problem. Commonly used measures are the correlation coefficient 11–13 (measuring the proximity between variables) and the similarity coefficient 14 (measuring the proximity between samples). If the samples are given as qualitative data, the proximity between samples can be measured with the matching coefficient, the consistency, and so on. To classify things quantitatively, the degree of similarity between them must also be described quantitatively. A thing often needs to be characterized by multiple variables. For example, if a group of sample points described by
The information retrieval model (IRM) uses mathematical language and tools to abstract information, and its processing in information retrieval, into a mathematical form. It is determined by three aspects: (1) the way queries and documents are represented, (2) the theory used to relate queries and documents, and (3) the algorithm that matches queries against documents. In other words, the IRM is a framework and method for representing the main elements of information retrieval and calculating the degree of matching between them. Since 2000, many experts and scholars in related fields have studied IRMs in search of more suitable models and retrieval methods, and information retrieval theorists have proposed a large number of them. At present, the Boolean, vector space, 15–17 and traditional probability models are widely accepted.
Building on the above background on sensitivity metrics and probability, and on the probability model and the sensitive similarity measure, this article proposes a spectral clustering 18–20 algorithm with an improved similarity measure. It overcomes the sensitivity to the scale parameter, improves clustering accuracy, and achieves a good clustering effect, with the aim of improving the probability model and the accuracy of information retrieval. To better study the probabilistic model of sensitive similarity measure in information retrieval, this article introduces information retrieval, the information retrieval probability model, and the similarity measurement algorithm in detail.
Information retrieval technology
Development of information retrieval technology
Information retrieval technology has developed from early retrieval 21,22 techniques to modern computer-based retrieval. Before computer retrieval appeared, information retrieval passed through the stages of fully manual retrieval systems → semi-mechanical retrieval systems → electromechanical and photoelectric retrieval systems. Before the 1940s, retrieval was mainly manual: search tools 23,24 such as bibliographies, indexes, and abstracts, arranged by document attributes such as classification, subject words, and authors, were used to find the required documents. Representative examples are library catalog cards and well-known search journals such as CA, BA, SCI, and IM, as well as China’s Zhongmu and foreign orders. Although manual retrieval is convenient, flexible, and easy to master, it is slow and unreliable, its efficiency is easily affected by external factors, and it cannot search the literature along several paths and from several angles at the same time. The quality and quantity of manual retrieval services are therefore limited. Eliminating these limitations required new retrieval methods, new retrieval equipment, and a more complete retrieval system. In this context, semi-mechanical retrieval methods and electromechanical and photoelectric retrieval methods were gradually developed in the 1950s and 1960s. The semi-mechanical method is represented by the edge-punched card method and, later, the superimposed-hole card method. In essence, both are manually operated punched-card systems.
Although these systems support a certain degree of multi-attribute search and the combination of subject concepts, and improve search efficiency and retrieval time compared with fully manual search, the actual retrieval speed is still not high, and the retrieval process is still mainly manual. As technology advanced, various mechanical search methods were developed. Information retrieval methods evolve continually as the contradictions within the retrieval system change. The fundamental contradiction in information retrieval is that between the vast literature and information resources and people’s specific needs, and it drives the development of information retrieval theory and technology. On the one hand, the huge increase in knowledge and information has caused an information “explosion” and information “pollution.” On the other hand, people demand accurate, fast, and convenient ways to find the information that is useful to them. This has led to changes in information retrieval technology. In 1946, the United States successfully developed ENIAC, the world’s first general-purpose electronic computer. In the late 1950s, second-generation transistor computers appeared. In 1964, IBM introduced a third-generation integrated circuit 25 computer. In 1970, fourth-generation large-scale integrated circuit computers such as the IBM-370 came out. In 1971, Intel made the world’s first commercial microprocessor. Around 1970, computer networks emerged, and the development of computers gave rise to a new category of technology, information technology, which led to the information revolution. 26 Computer retrieval technology was ushered in against this background of rapidly developing computer technology.
Information retrieval has thus passed through multiple stages of development, and the trend is toward ever greater intelligence. Today, with the rapid development of science and technology, there are more and more kinds of retrieval objects, including not only text information such as documents and data but also media information 27,28 such as images, sounds, and videos. All of these fall within the scope of information retrieval research. Information retrieval has now progressed from networked retrieval to intelligent retrieval. The objects of information retrieval have changed from closed, stable, and consistent collections to open, dynamic, and widely distributed ones. As the Internet becomes more popular, the amount of information resources we face keeps increasing. Obtaining the needed information in the shortest possible time therefore poses great difficulties for computer information retrieval, although continued technical progress makes this goal achievable. Figure 1 shows the framework of intelligent information retrieval.

Figure 1. The framework of intelligent information retrieval.
Principles of information retrieval probability model
The probability model is based on four related principles: the principle of relevance independence, the independence of terms, the relevance of documents, and the probability ranking principle. Based on probability theory, 29,30 the model builds a probabilistic description of documents and queries and uses it to calculate the similarity between a document and a query. The model relies on the distribution of the query keywords in relevant and non-relevant documents, which is expressed as a weight for each keyword. The query results are sorted by the sum of the weights of the query keywords that each document contains. The probability model is simple to implement and works well. It is assumed that both the document
where
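The weighting and ranking idea described in this subsection can be illustrated with a minimal sketch. It assumes the classical Robertson/Sparck Jones relevance weight with 0.5 smoothing, which is one common instance of a keyword weight derived from relevant and non-relevant documents and is not necessarily the exact formula used in this article. The function names, the toy documents, and the relevance judgments are hypothetical.

```python
import math


def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones keyword weight with 0.5 smoothing (assumed form).

    r: relevant documents containing the keyword
    R: relevant documents in total
    n: documents containing the keyword
    N: documents in the collection
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_non_relevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_non_relevant)


def rank_documents(docs, query_terms, weights):
    """Sort document ids by the summed weights of the query keywords they contain."""
    scores = {
        doc_id: sum(weights[t] for t in query_terms if t in terms)
        for doc_id, terms in docs.items()
    }
    # Ties are broken by document id so that the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical toy collection and relevance judgments.
docs = {
    "d1": {"spectral", "clustering"},
    "d2": {"clustering", "robot"},
    "d3": {"robot"},
    "d4": {"spectral", "retrieval"},
}
relevant = {"d1", "d4"}
query = ["spectral", "clustering"]

weights = {}
for term in query:
    containing = {d for d, terms in docs.items() if term in terms}
    weights[term] = rsj_weight(
        r=len(containing & relevant),
        R=len(relevant),
        n=len(containing),
        N=len(docs),
    )

ranking = rank_documents(docs, query, weights)
```

In this toy collection, "spectral" occurs only in relevant documents and receives a large positive weight, whereas "clustering" is spread evenly over relevant and non-relevant documents and receives a weight of zero, so the two documents that contain "spectral" are ranked first.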
Similarity measurement algorithm
Similarity metrics classify things quantitatively, so the degree of similarity between things must also be described quantitatively. A thing often needs to be characterized by multiple variables. If a group of sample points described by
The spectral clustering algorithm is based on spectral graph partitioning theory and treats data clustering as a graph partitioning problem. Finding the optimal graph partition is a non-deterministic polynomial (NP)-hard problem, so in essence spectral clustering solves an approximation of the graph partitioning criterion. Think of all the data samples as fixed vertices
Improved similarity measure
Let
Definition 1
Any two vertices
where
Definition 2
To normalize the similarity value to the range between 0 and 1 and to improve the density-sensitive distance of the DSSC algorithm, the improved manifold distance measurement function is defined as follows
Definition 3
Weighting factor for each data point
Definition 4
Improved manifold distance similarity function
Equation (5) satisfies nonnegativity, reflexivity, symmetry, and the triangle inequality, and it is consistent with both the global consistency and the local consistency clustering hypotheses.
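As an illustration of the kind of manifold distance discussed in Definitions 1 to 4, the following sketch uses the density-sensitive form that is commonly cited for DSSC: each edge has length rho^d − 1, where d is the Euclidean distance, and the distance between two points is the length of the shortest path between them. This is an assumption made for illustration only and is not the improved function of equation (5). The parameter name `rho`, the mapping 1/(1 + D) to a similarity in (0, 1], and the toy points are hypothetical.

```python
import math


def density_sensitive_distances(points, rho=2.0):
    """All-pairs shortest-path distances over edges of length rho**d - 1.

    Assumed DSSC-style form, not equation (5) of this article.
    For rho > 1, one long jump costs more than several short hops,
    so paths through dense regions are preferred.
    """
    n = len(points)
    dist = [
        [rho ** math.dist(points[i], points[j]) - 1.0 for j in range(n)]
        for i in range(n)
    ]
    # Floyd-Warshall: relax every pair through every intermediate vertex.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


def similarity(distance):
    """Map a nonnegative distance to a similarity in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + distance)


# Three collinear points: going from the first to the last through the
# middle point is shorter than the direct edge.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
D = density_sensitive_distances(points, rho=2.0)
S = [[similarity(d) for d in row] for row in D]
```

Because the result is a shortest-path length over nonnegative, symmetric edge lengths, it is nonnegative, symmetric, zero on the diagonal, and satisfies the triangle inequality, which are the same properties stated above for equation (5).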
Algorithm steps
The time complexity of the algorithm is
Experimental results and analysis
To verify the effectiveness of the proposed algorithm, it is compared with the NJW and DSSC algorithms on artificial data sets and on University of California Irvine (UCI) data sets. The performance of each clustering algorithm is measured with the evaluation indicators described below.
Evaluation indicators
Because the number of clusters and the correct class of each data point are known for the data sets in the UCI database, external clustering indicators are sufficient to evaluate the clustering results on these data sets. The two indicators implemented in this article, the Rand index and the F-measure, are both external measures. They measure how well the clustering result for a UCI data set matches its known structure, so that different clustering algorithms can be compared on the same data set. In clustering performance evaluation, a validity index can also identify the partition with the best number of clusters. Both statistics quantify how similar a clustering result is to the expected result.
Rand indicator
The Rand indicator is a commonly used evaluation indicator for clustering results. It measures the degree of agreement between the clustering result and the external reference classes of the data. Each pair of samples is placed either in the same class or in different classes. A pair is matched correctly if the two samples belong to the same class and are also placed in the same cluster, or if they belong to different classes and are also placed in different clusters. The accuracy equals the ratio of the number of correctly matched pairs to the total number of pairs, that is, RI = number of correctly matched pairs/total number of pairs, which is
Let
The Rand index, that is, a cluster structure of the data set is
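The pair-counting definition of the Rand index described in this subsection can be sketched as follows. The function name and the example labels are hypothetical; the computation follows the standard definition, in which a pair of samples counts as correct when the reference classes and the clustering agree on whether the two samples belong together.

```python
from itertools import combinations


def rand_index(true_labels, cluster_labels):
    """Fraction of sample pairs on which the two labelings agree."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("both labelings must cover the same samples")
    if len(true_labels) < 2:
        raise ValueError("at least two samples are needed to form a pair")

    agreeing_pairs = 0
    total_pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        # Correct pair: together in both labelings, or apart in both.
        if same_class == same_cluster:
            agreeing_pairs += 1
        total_pairs += 1
    return agreeing_pairs / total_pairs


# Hypothetical example: the last sample is put into the wrong cluster.
reference = [0, 0, 1, 1]
clustering = [0, 0, 1, 0]
ri = rand_index(reference, clustering)  # 3 of the 6 pairs agree
```

The loop visits all n(n − 1)/2 pairs, which is adequate for data sets of the size used here; for large data sets the same value can be computed from a contingency table instead.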
F-measure indicator
F-measure is a combination of two indicators: Precision and Recall. In order to accurately describe the evaluation index, the number of data points in different cases is represented by variables (take the classification of Iris data sets in the UCI database as an example), as shown in Table 1.
Table 1. Definition of variables.
The total F-measure is
The clustering results of the affinity propagation (AP) algorithm on the Iris data set in the UCI database are shown in Table 2.
Table 2. Confusion matrix of the AP algorithm for clustering the Iris data set.
As can be seen from Table 2, all 50 data points that actually belong to the Setosa class are correctly clustered into the Setosa class. Of the 50 data points that actually belong to the Versicolor class, 45 are correctly clustered into Versicolor and 5 are incorrectly clustered into Virginica. Of the 50 data points that actually belong to the Virginica class, 43 are correctly clustered into Virginica and 7 are incorrectly clustered into Versicolor.
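The counts reported for Table 2 are enough to work through precision, recall, and the F-measure for each class. The sketch below uses the standard definitions (precision = correct points in a cluster/size of the cluster, recall = correct points in a cluster/size of the class, F = 2PR/(P + R)). Combining the per-class values by a class-size-weighted average is an assumption, because the article's own formula for the total F-measure may differ. Variable names are hypothetical.

```python
# Rows are actual classes, columns are assigned clusters (counts from Table 2).
classes = ["Setosa", "Versicolor", "Virginica"]
confusion = [
    [50, 0, 0],   # actual Setosa
    [0, 45, 5],   # actual Versicolor
    [0, 7, 43],   # actual Virginica
]

total_points = sum(sum(row) for row in confusion)

precision, recall, f_measure = {}, {}, {}
for k, name in enumerate(classes):
    correct = confusion[k][k]
    cluster_size = sum(row[k] for row in confusion)  # column sum
    class_size = sum(confusion[k])                   # row sum
    precision[name] = correct / cluster_size
    recall[name] = correct / class_size
    f_measure[name] = (
        2 * precision[name] * recall[name] / (precision[name] + recall[name])
    )

# Assumed aggregation: average of per-class F values weighted by class size.
overall_f = sum(
    sum(confusion[k]) / total_points * f_measure[name]
    for k, name in enumerate(classes)
)

for name in classes:
    print(f"{name}: P={precision[name]:.4f} R={recall[name]:.4f} F={f_measure[name]:.4f}")
print(f"overall F={overall_f:.4f}")
```

For Versicolor, for example, the cluster contains 45 + 7 = 52 points, so the precision is 45/52 and the recall is 45/50, which gives F = 90/102, approximately 0.88.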
To calculate the F-measure indicator, the precision and recall are calculated first as
In the formula,
Artificial data set
The experiment was carried out on the artificial data set, and then the experiment was evaluated for the Rand index. The
Table 3. Artificial data set.

Figure 2. Comparison of Rand evaluation indicators on the artificial data sets.

Figure 3. Comparison of F-measure evaluation indicators on the artificial data sets.
The comparisons in Figures 2 and 3 show that DSSC performs slightly worse on the three circles of 20 data sets, whereas NJW and the algorithm of this article both perform well. This shows that all the spectral clustering algorithms cluster convex data sets well and that the choice of similarity measure has no direct influence on the clustering effect there. The method of this article is slightly better than the NJW algorithm on the Size5 data set. DSSC performs best on the square1 data set, and its results on the square4 data set are also better than those of NJW and the algorithm of this article. The uneven results of DSSC across data sets indicate that it is sensitive to parameter changes and unstable. Taken together, the analysis shows that the proposed algorithm is better than DSSC. This is because the algorithm of this article fully exploits the global characteristics of the data and handles outliers such as noise points well, so it applies to both convex and manifold data sets. Overall, the algorithm of this article is therefore better than the NJW and DSSC algorithms.
UCI data set
To further verify the effectiveness of the proposed algorithm, the UCI data sets Glass, Wine, Iris, and Vehicle are selected, and NJW, DSSC, and the algorithm of this article are compared on them. These data sets carry class labels, so the clustering results can be compared directly with the expected results. Table 4 lists the basic information of these four UCI data sets. Figure 4 compares the Rand evaluation indicators of the three algorithms, and Figure 5 compares their F-measure evaluation indicators. The comparison shows which algorithm clusters each data set best.
Table 4. UCI data set.

Figure 4. Comparison of Rand evaluation indicators on the UCI data sets.

Figure 5. Comparison of F-measure evaluation indicators on the UCI data sets.
Figures 4 and 5 show that the DSSC algorithm has the largest F-measure value on the Glass data set, but its Rand value there is lower than that of the algorithm in this article. On the Iris, Wine, and Vehicle data sets, DSSC does not perform as well as the algorithm in this article. Over the UCI data sets as a whole, the algorithm of this article improves the manifold distance measure and thereby fully exploits the intrinsic links between data points. It therefore performs best overall under both evaluation indicators and is relatively stable, which makes it better than the NJW and DSSC algorithms.
The experimental results on the artificial data sets and the UCI data sets show that the improved manifold distance spectral clustering algorithm proposed in this article achieves a good clustering effect. By taking both global and local consistency into account, it fully reflects the spatial characteristics of the data. It is robust and stable and handles outliers such as noise points well, so the computed similarity agrees better with the real situation and the clustering performance improves.
Conclusion
The similarity measure is very important for the effect of spectral clustering: traditional spectral clustering is sensitive to the scale parameter and handles multi-scale data poorly, and the existing DSSC algorithm is not stable enough. The proposed spectral clustering algorithm overcomes the sensitivity to the scale parameter by improving the similarity measure, improves clustering accuracy, and clusters better than the DSSC algorithm. It handles both convexly distributed data sets and data sets distributed on manifolds, with good robustness. The time complexity of the algorithm in this article is
As the amount of information in various fields continues to increase, so does the demand for information retrieval. Traditional information retrieval methods are gradually being replaced by intelligent information retrieval systems. Intelligent information retrieval satisfies people’s need for diverse information and helps improve retrieval efficiency. Intelligent information retrieval technology based on the Semantic Web enhances the ability of computers to recognize natural language and accelerates knowledge representation and acquisition. However, because many computer information retrieval processes index and retrieve with natural language, queries may be inaccurate. In the era of Internet information in particular, search technology increasingly struggles to meet people’s growing demand for information retrieval. The following problems remain. (1) Content problem: Network information resources are becoming more and more abundant, but whether the retrieved content is accurate and whether the queried network resources can be displayed remain open problems. Searches often return content that does not meet the user’s requirements. Much work is therefore still needed to increase the amount of retrieved content while keeping the query method simple. (2) Object problem: Different people have different information retrieval needs. How to classify these needs so that retrieval is personalized for each user while remaining accurate still needs improvement.
In response to the above problems, we propose corresponding countermeasures. (1) Language intelligence: When keywords are entered into the information retrieval system in natural language, the system should perform search processing and ambiguity analysis and assist the query at the knowledge or concept level. Intelligent prompts from the system then help users obtain the best results. (2) Content specificity: The ability of an information retrieval system to analyze content needs to be improved. Information that is unrelated to the search content should be screened out, and search should be possible not only on the title and full text but also by sound, image, and the like. (3) Technology intelligence: Some intelligent retrieval technologies have emerged in China, including not only automatic indexing and automatic summarization but also automatic tracking and automatic roaming. These search techniques are gradually being improved and optimized. In recent years, concepts such as “smart browsers” and “knowledge sharing agents” have been proposed. In-depth study of the IRM shows that each retrieval model has its own characteristics, advantages, and deficiencies. The models do not develop in step with one another, but they complement one another. In addition, many models are still in an active stage of exploration and experimentation, and each develops differently because its scope of application differs. The general trend of modern network information retrieval technology is toward multifunctionality and intelligence and toward adapting to the shift of information organization from structured to unstructured, so as to meet people’s requirements for acquiring and using information as fully as possible.
Although search technology has developed rapidly in all aspects, many problems remain in information retrieval in the network environment. Examples include the automatic extraction of object features and indexing, querying, and retrieval based on multiple similarity features. Ontology theory, which originates from knowledge engineering and artificial intelligence, can handle natural language understanding and language inference well, and it is a hot topic in information retrieval in the current web environment. As information service personnel, we should continually track and master the latest developments in modern information technology, promote new technology actively, and make full use of modern information technology in our work so that information services benefit the whole society.
