Abstract
Introduction
Large Knowledge Graphs (KGs) support a growing set of real-world knowledge-intensive tasks that rely on the notion of similarity [22,43,99]. Similarity-based search facilitates use cases in a commercial setting, where, e.g., one might search for a dress in a certain price range that is similar to what celebrities wear [99]. The article screening process for systematic reviews of publications can be facilitated by retrieving similar concepts and concept relations [43]. In a biomedical setting, one might integrate knowledge extracted from documents and publications into existing domain ontologies [22]. Workable solutions for knowledge-intensive tasks such as advanced search, information retrieval, and integration rely heavily on robust notions of similarity between concepts, as well as rich and reliable knowledge about those concepts.
As the largest public KG with billions of statements and a large number of concepts, Wikidata [94] provides knowledge at an unprecedented scale that can intuitively support similarity-driven applications. Recognizing that, for instance, the concept of
While estimating similarity between concepts in large KGs is relatively understudied, the related task of word similarity has been very popular in computational linguistics [35,65,78,90]. Early work generally relies on ontology-based similarity metrics [44] that leverage properties of a word in an ontology like WordNet [57]. Ontology-based vector representation models [2,31] leverage the ontological structure in resources like WordNet, e.g., based on random walks, to create word vector representations. More recently, pre-trained word embeddings have been shown to natively capture word similarity at scale [20,51,56]. Early word embeddings have been shown to benefit from knowledge injection [21], again based on lexical resources like WordNet. A variant of the word similarity task, called
In this paper,
In summary, we make the following contributions:
We motivate the task of estimating the similarity between pairs of concepts in Wikidata. While prior work has considered word similarity and well-curated resources like WordNet, this is the first paper that addresses conceptual similarity in large and relatively noisy sources such as Wikidata (Section 2).
We present a computational framework that includes three families of models, publicly available knowledge for self-supervision, and a retrofitting method that leverages this knowledge to customize the models (Section 3). To our knowledge, we are the first to propose a similarity metric that aggregates over KG regions (TopSim), and the first to retrofit similarity estimations based on knowledge in Wikidata and ProBase.
We adapt three benchmarks to Wikidata (Section 4) and perform extensive experiments that investigate the comparative success of different models and their combinations, the impact of retrofitting, and the impact of various background knowledge sets (Section 5).

Example of similarity judgments between natural science and seven other concepts in Wikidata.
The entire code and pointers to the data supporting our experiments can be found on GitHub:
In this section, we provide background information from cognitive studies on similarity and put the problem of automatically estimating conceptual similarity in the context of prior work in the fields of Computational Linguistics and Semantic Web.
Background
Conceptual similarity is a central theoretical construct in psychology, facilitating the transfer of a situation to an original training context [54,62]. Tversky [91] posits that the
Gentner and Markman [29] argue that similarity is like analogy, in the sense that both rely on an alignment between the two compared objects or domains. When provided a pair of similar concepts, for example, hotel-motel, people align them based on their shared properties (e.g., used for accommodation), and are able to easily point out differences (e.g., hotels are in cities, motels are along highways). The authors note that it makes no sense to talk about differences in the absence of a meaningful alignment (e.g., kitten – magazine). Which features are or should be considered when computing similarity? According to [60], the relative importance of a feature depends on the stimulus, the task, and the context. This flexibility of similarity estimation has drawn criticism that the set of features being compared is seemingly arbitrary [32]. Yet, the relatively high inter-annotator agreement across human subjects when judging similarity indicates systematicity in human judgments of similarity [54]. In addition, some of the variation across human subjects can be explained by phenomena like selective learning [83], developmental changes [28], and knowledge and expertise [13]. Similarity judgments are also known to be impacted by the context of the task [79], i.e., the features activated depend on the object we compare against, as well as by the direction of the comparison: people tend to rate the similarity of North Korea to China higher than the reverse [54].
One can distinguish three measures of similarity: indirect, direct, and theoretical [54]. An example for indirect similarity comparison is asking human participants to identify potentially confusable stimuli, such as judging whether an object has been observed before. Similarity can be measured directly, by rating the similarity of stimuli on a numeric scale. The theoretical similarity is observed as a component in human cognition, for instance, when participants categorize an item by comparing its fit in various categories.
In this paper, we consider the task of estimating the literal similarity between two concepts. Given two concept nodes,
Related work
Lastra-Díaz et al. [48] survey a wide range of methods for word similarity, and categorize them into two large families: ontology-based semantic similarity and distributional features captured with word embeddings. The ontology-based metrics are further divided into topological measures, gloss-based measures, and ontology-based vector models. Word embedding metrics are divided into two groups: text-based word embeddings and word embeddings computed over ontologies and distributional models. Multiple ways to use ontologies and distributional models together exist, including joint approaches [69,98], knowledge injection [21,59], and embedding combination [30]. Our framework for conceptual similarity in Wikidata is partially aligned with the categorization of [48] for word similarity. Namely, we include topological measures, as well as ontology-based vector models in the form of graph embeddings. Within word embeddings, we specifically focus on language model embeddings, which have since become the dominant approach. Rather than only combining knowledge and word embeddings, we promote knowledge injection to an independent component in our framework that enriches both language model and knowledge graph node embeddings, in an attempt to emphasize the extensive knowledge available in external sources like Wikidata. We also include embedding combination as a separate framework component, where we contribute a novel combination approach, called TopSim. Following common practice in word similarity tasks, we assume that the similarity of two concepts can be measured on a continuous numeric scale. We compare the relative order of the machine similarity for a dataset against human judgments.
Framework for estimating similarity
Our framework for estimating similarity is visually depicted in Fig. 2. We use graph embedding and text embedding models, as well as ontology-based metrics, as initial similarity estimators. We also concatenate the embeddings in order to combine their scores. We use retrofitting as a knowledge injection mechanism to further tune the individual or the combined embedding models, through distant supervision over millions of weighted pairs extracted automatically from large-scale knowledge graphs. For a given concept pair from Wikidata, the similarity scores generated by the retrofitted embedding models can be combined with the scores by the ontology-based models. We next describe the individual components of our similarity framework.

Overview of our similarity estimation framework.
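The embedding combination step can be illustrated with a minimal sketch: each model's vector is L2-normalized before concatenation so that every model contributes equally to the downstream cosine score. The function names and toy vectors here are hypothetical; in our framework, the per-model embeddings come from the actual KGE and LME models.

```python
import numpy as np

def combine_embeddings(vectors):
    """Concatenate per-model embeddings after L2-normalizing each,
    so that every model contributes equally to the combined vector."""
    normed = [v / np.linalg.norm(v) for v in vectors]
    return np.concatenate(normed)

def cosine_similarity(a, b):
    """Cosine similarity between two concept vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of two concepts from two models (e.g., a KGE and an LME)
kge_q1, lme_q1 = np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5])
kge_q2, lme_q2 = np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5])

v1 = combine_embeddings([kge_q1, lme_q1])
v2 = combine_embeddings([kge_q2, lme_q2])
score = cosine_similarity(v1, v2)  # identical inputs, so the score is 1.0
```

Normalizing each model's vector first prevents a model with larger vector magnitudes from dominating the concatenated representation.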
We distinguish between similarity models based on KG embeddings, language models, and ontology-based topological metrics. We employ representative methods from each category, as well as aggregate methods that combine them in intuitive ways. The goal of our framework is to evaluate a representative sample of metrics and their combinations for the novel task of estimating similarity between Wikidata nodes. To the extent possible, we follow the terminology of [48] and we adapt methods and ideas from prior work on word similarity and concept similarity with WordNet and DBpedia. We leave it to future work to perform an exhaustive evaluation of the wide range of metrics for word and concept similarity developed in the past, e.g., for WordNet [35,48], within our framework.
We adopt popular models based on translation. We note that these models may not achieve state-of-the-art performance across KG tasks and cannot encode semantic information about literals. Including state-of-the-art methods, like RotatE [85] and JointE [106], and literal-aware methods, like LiteralE [46] and KGA [95], is a key future pursuit. As a side note, understanding the quality of KG embeddings trained on large KGs for similarity tasks is a timely pursuit, as KG embeddings have so far mainly been trained on relatively small graphs and evaluated on the task of link prediction [26,95].
For random walk models, we adopt
For more information, we refer the reader to the documentation page:
Our second metric, which we dub We compute the set of IsA parents for both concepts based on their instanceOf (P31) and subclassOf relations (P279). Formally: We compute the common parents for both concepts, formally, We compute the total
This is not to be confused with a measure of the same name but different semantics described in [50].
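The ingredients above (IsA parents via instanceOf (P31) and subclassOf (P279), and their intersection) can be sketched as a parent-overlap score. The Jaccard formula, the helper names, and the toy graph encoding are illustrative assumptions, not the paper's exact definition.

```python
def isa_parents(node, graph):
    """Collect the IsA parents of a node by following its instanceOf (P31)
    and subclassOf (P279) edges; `graph` maps node -> {property: targets}."""
    edges = graph.get(node, {})
    return set(edges.get("P31", [])) | set(edges.get("P279", []))

def parent_overlap(n1, n2, graph):
    """Illustrative overlap score: shared IsA parents over all IsA parents
    (Jaccard), built from the same ingredients as the metric above."""
    p1, p2 = isa_parents(n1, graph), isa_parents(n2, graph)
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / len(p1 | p2)

# Toy graph: fairy tale (Q699) and a hypothetical sibling both have tale
# (Q17991521) as their only IsA parent, so their overlap score is 1.0.
graph = {
    "Q699": {"P279": ["Q17991521"]},
    "Q_sibling": {"P279": ["Q17991521"]},  # hypothetical node
}
score = parent_overlap("Q699", "Q_sibling", graph)
```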
TopSim is an aggregation framework that can be instantiated with an arbitrary set of measures. Let
For each node
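Since TopSim can be instantiated with an arbitrary set of measures, its general aggregation pattern can be sketched as follows. This is only an illustration of averaging base measures over pairs drawn from KG regions; the function names, the region definition, and the averaging scheme are assumptions, not TopSim's exact formulation.

```python
def aggregate_over_regions(n1, n2, measures, region):
    """Average each base measure over the pair itself plus pairs drawn
    from the two nodes' KG regions (here, neighbor sets), then average
    across measures to obtain a single aggregated score."""
    pairs = [(n1, n2)] + [(a, b) for a in region(n1) for b in region(n2)]
    per_measure = []
    for measure in measures:
        scores = [measure(a, b) for a, b in pairs]
        per_measure.append(sum(scores) / len(scores))
    return sum(per_measure) / len(per_measure)

# Toy instantiation: one base measure (node identity) and singleton regions
identity = lambda a, b: 1.0 if a == b else 0.0
region = lambda n: {n}  # each node's region is just the node itself
score = aggregate_over_regions("Q5", "Q5", [identity], region)  # all pairs match
```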
Our KG and LM embedding models, as well as their combinations, can be expected to capture a wide range of rich information about concepts. KGEs will capture information about concepts in relation to their instances in Wikidata, whereas LM embeddings will connect a portion of the graph information with the rich background knowledge that LMs have acquired during pretraining. However, none of these models is directly intended for the task of concept similarity over Wikidata; applying them to estimate similarity is, relatively speaking, a “zero-shot” endeavor. In this section we experiment with retrofitting, a knowledge injection technique aimed at tuning the models to the task of concept similarity in Wikidata. To support retrofitting, we experiment with subsets of two large knowledge graphs: Wikidata and ProBase. We select Wikidata to understand whether directly tuning on the graph used for prediction can improve model performance. We include ProBase because of its rich coverage of is-a knowledge, which is essential for estimating concept similarity.
We define three weighting methods for the pairs generated from these datasets: (1)
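The retrofitting step can be sketched in the style of the classic retrofitting algorithm of Faruqui et al. (2015): each vector is iteratively pulled toward its neighbors in the weighted pair list while an anchor term keeps it close to its original position. The toy vectors and the single hyperparameter `alpha` are illustrative assumptions.

```python
import numpy as np

def retrofit(embeddings, pairs, iterations=10, alpha=1.0):
    """Iteratively pull each vector toward its neighbors in the weighted
    pair list, while the alpha term anchors it to its original position."""
    original = {k: v.copy() for k, v in embeddings.items()}
    new = {k: v.copy() for k, v in embeddings.items()}
    neighbors = {}
    for (u, v), w in pairs.items():
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))
    for _ in range(iterations):
        for node, nbrs in neighbors.items():
            if node not in new:
                continue  # pair member without an embedding
            numerator = alpha * original[node]
            denominator = alpha
            for nbr, weight in nbrs:
                if nbr in new:
                    numerator = numerator + weight * new[nbr]
                    denominator += weight
            new[node] = numerator / denominator
    return new

# Toy pair: pull "money" and "cash" together with weight 1.0
emb = {"money": np.array([1.0, 0.0]), "cash": np.array([0.0, 1.0])}
ret = retrofit(emb, {("money", "cash"): 1.0})
d_before = np.linalg.norm(emb["money"] - emb["cash"])
d_after = np.linalg.norm(ret["money"] - ret["cash"])  # strictly smaller
```

The pair weights let stronger evidence (e.g., higher-confidence child-parent relations) exert a proportionally larger pull on the embedding space.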
Experimental setup
Benchmarks and metrics
This paper is based on the premise that the similarity between Wikidata concepts should be evaluated with large-scale benchmarks. However, while we were able to infer large datasets for data augmentation at training time, large-scale similarity evaluation datasets for Wikidata are not available. As a proxy, we turn to existing word similarity benchmarks [48]. Adapting word similarity datasets to Wikidata requires (predominantly) manual mapping of words to Qnodes, which is laborious and expensive. Therefore, we select three popular word similarity benchmarks for this paper and map them to Wikidata, resulting in the novel benchmarks WD-WordSim353, WD-RG65, and WD-MC30. With these benchmarks, we follow the
Statistics about the three benchmarks can be found in Table 1. We evaluate using three metrics to measure correspondence with human annotations: Kendall-Tau (KT), Spearman rank (SR), and Root Mean Square Error (RMSE). We make the resulting benchmarks available for future evaluations through our project’s GitHub page.
Statistics of our evaluation benchmarks
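The rank-based evaluation can be sketched with a naive Kendall-Tau (without tie correction) and RMSE on toy scores; in practice, library implementations such as scipy.stats.kendalltau and scipy.stats.spearmanr would be used.

```python
import math

def kendall_tau(x, y):
    """Naive O(n^2) Kendall-Tau: (concordant - discordant) pairs over all
    pairs, without tie correction (sufficient for this illustration)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def rmse(x, y):
    """Root Mean Square Error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy scores: perfectly rank-correlated but offset on the numeric scale,
# so the rank metric is maximal while RMSE still reports the offset.
human = [1.0, 2.0, 3.0, 4.0]
machine = [0.5, 1.5, 2.5, 3.5]
kt = kendall_tau(human, machine)  # 1.0
err = rmse(human, machine)        # 0.5
```

The contrast between the two numbers illustrates why we report both rank correlation and RMSE: a method can order pairs perfectly while still being miscalibrated on the numeric scale.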
For Lexicalization, we use the BERT embeddings provided by [45], which have been shown to work relatively well for entity similarity. These embeddings include the following Wikidata properties: P31 (instance of), P279 (subclass of), P106 (occupation), P39 (position held), P1382 (partially coincident with), P373 (Commons Category), and P452 (industry). We include two baseline versions of Lexicalization that use restricted inputs:
Statistics of our subset of Wikidata, which is used as a basis to train the KGE models

Frequency distribution of properties in the data used to train the KGE models.
Scores for the raw methods and their combinations on each of the benchmarks: Kendall-Tau (KT), Spearman rank (SR), and Root Mean Square Error (RMSE). Best values per column are marked in bold
How well do different algorithms and combinations capture semantic similarity?
As can be seen from Table 3, the Abstract model performs best among all language model variants, and overall. It outperforms the other LMs because DBpedia’s abstracts, being written by humans, contain information that is more comprehensive and tailored to concepts than the automatically lexicalized knowledge in Wikidata. The Lexicalization model outperforms the simpler baselines, but it may be improved further by a dynamic selection of properties, e.g., through profiling [45]. Language models perform worst when they consider labels only, which is expected because node labels contain the least information. Adding a description to labels yields a notable improvement, and the Lexicalization method further improves upon Labels + desc. Together, these results show that the exact kind and amount of information fed to language models matters greatly for estimating similarity.
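This sensitivity to input construction can be illustrated with a hypothetical sketch of how a node's statements might be lexicalized into text before embedding. Only the property list follows the paper; the verbalization template, helper structures, and toy data are assumptions.

```python
# Verbalizations for the Wikidata properties used by the Lexicalization model
# (the phrasing is a hypothetical template, not the paper's exact one)
PROP_LABELS = {"P31": "is a", "P279": "is a subclass of",
               "P106": "has occupation", "P39": "holds position",
               "P1382": "partially coincides with",
               "P373": "has Commons category", "P452": "is in industry"}

def lexicalize(node, labels, graph):
    """Render a node's label, description, and selected property values
    as one text, which a language model would then embed."""
    parts = [f"{labels[node]['label']} is {labels[node]['description']}."]
    for prop, verb in PROP_LABELS.items():
        for target in graph.get(node, {}).get(prop, []):
            parts.append(f"{labels[node]['label']} {verb} {labels[target]['label']}.")
    return " ".join(parts)

# Toy input: fairy tale (Q699) is a subclass of tale (Q17991521)
labels = {"Q699": {"label": "fairy tale", "description": "fictional story genre"},
          "Q17991521": {"label": "tale", "description": "narrative genre"}}
graph = {"Q699": {"P279": ["Q17991521"]}}
text = lexicalize("Q699", labels, graph)
# "fairy tale is fictional story genre. fairy tale is a subclass of tale."
```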
The graph embedding methods each abstract the rich information available in Wikidata. Among these methods, the Deepwalk embeddings perform best. These methods are consistently outperformed by the Lexicalization and Abstract methods, suggesting that the graph embeddings’ wealth of information is a double-edged sword: many properties are considered that may not be useful for determining similarity, adding distractions that can decrease performance. The Abstract method has an additional advantage over the graph embeddings in that it is less restricted in terms of the kind of information it can consider, whereas the graph embeddings focus solely on relations and cannot make use of literals directly.
Our topological models (OT category) perform better than the KGE and worse than the LME models on the WD-WordSim353 dataset, and the worst among the categories on the other two datasets. Class performs better than Jiang Conrath on two out of three datasets, though the difference between the methods is not significant. Generally speaking, TopSim is able to combine the different regions in a way that outperforms most of the individual models, and it clearly outperforms the methods that only rely on ontological structure (Class and Jiang Conrath). However, TopSim consistently performs worse than the composite embeddings. Among the combinations, Composite-Best performs the best, indicating that combining a small set of reliable models may be a better strategy than composing a larger set of embeddings together. However, here we note that the composite embeddings do not improve over the Abstract LME score, despite the fact that they include Abstract as one of their embeddings. This indicates that it is difficult to combine models that consider additional information without adding noise that decreases the utility.
Impact of retrofitting across the different benchmarks. Here we show results on retrofitting with WD-all, where the edges are scored with BERT-based cosine similarity. Highest Kendall-Tau (KT) values and increases per column are marked in bold
Retrofitting is overall beneficial for estimating similarity (Table 4). On average across the three benchmarks, it improves the performance of nine out of the eleven methods. The highest overall improvement is observed for the S-Deepwalk method, whose Kendall-Tau score on the WD-MC30 benchmark increases by 0.2. Despite this bump, the new S-Deepwalk score is still relatively low (0.067). We also note a consistent improvement for the simpler methods, like Label and Label + Desc, which can be expected given that these methods do not sufficiently consider taxonomic information before retrofitting. For example, the distance between dissimilar objects, like credit and card, is nearly the same before and after retrofitting the Label method, whereas the distance between highly similar objects like money and cash decreases significantly (from 3.7 to 2.2, on a scale where 4 is the maximum and 1 is the minimum). The impact of retrofitting is lower on methods that already consider richer information, like Abstract and Lexicalization. This is because these methods already integrate taxonomic information, and retrofitting might bring concepts that are nearly identical or merely related too close in the embedding space. For instance, retrofitting decreases the distance between seafood and lobster from 2.8 to 1.3. Still, the impact of retrofitting on Abstract is positive on two out of three benchmarks, leading to new top results on the benchmarks WD-WordSim353 and WD-MC30.
What knowledge is most beneficial for retrofitting?
Impact of different retrofitting knowledge variants on the WD-WordSim353 dataset. Highest Kendall-Tau (KT) increases per column are marked in bold

Similarity scores with the KGTK similarity GUI [40] between fairy tale, and its parent (tale) and siblings.
We analyze the impact of different retrofitting knowledge sources in Table 5. Among the Wikidata variants, we observe that retrofitting with child-parent data performs comparably to using both child-parent and sibling data together, and this finding is consistent across the methods. This result indicates that WD-sibling data is less useful for retrofitting models than parent-child data. We believe that this observation is due to the wisdom-of-the-crowd [86] knowledge creation method of Wikidata, which results in a wide ontology with many children per parent. Each of the child-parent relations is reasonable and connects two relatively similar concepts, but two children of the same parent may be dissimilar and may specialize the parent node in different dimensions [39]. We illustrate this reasoning with the following example. Fairy tale (Q699) is a child of the concept tale (Q17991521), and the similarity between the two concepts is relatively high. At the same time, other children of tale include old-fashioned tale, cumulative tale, urban tale, and Zeichengeschichte (a German television genre), whose similarity with fairy tale is lower, as each of these siblings describes a different aspect of a tale: its location, ethnic tradition, form, or genre. To quantify this phenomenon, we show similarity scores with the KGTK similarity GUI [40] between fairy tale and its parent (tale) and siblings in Fig. 4. As apparent in the figure, the similarity between the child and the parent (fairy tale and tale) is clearly higher than any of the similarities between two siblings (e.g., fairy tale and old-fashioned tale), according to any of the similarity metrics.
Retrofitting with ProBase’s IsA relations yields consistently negative results across all methods. This could be due to the quality of the underlying data, our choice to use the relation counts as similarity estimates, or the imperfect mapping of ProBase nodes to Wikidata. Comparing the results across the different methods, we again observe that the simpler methods and the composite methods benefit most from retrofitting, whereas the more elaborate methods benefit from retrofitting much less.
In summary, our evaluation of a variety of KGE, LME, OT, and combined models reveals several key insights. First, language models are strong indicators of concept similarity in KGs; however, they are extremely sensitive to the kind of input they operate on. Therefore, scalable and reliable lexicalization is a key component of LME-based similarity models. Second, KGE models, which largely transfer knowledge about instances to concepts, are also strong indicators of similarity, but their inability to encode literal content is a key limitation. Third, retrofitting is helpful across the board, though its impact is larger for simpler models that originally did not encode structural information. Fourth, careful selection of knowledge for retrofitting is essential, given the size of the sources of background knowledge and their creation methods. Here, we note that parent-child relations from Wikidata are most useful for retrofitting, whereas knowledge from ProBase generally hurts performance.
Looking forward, we discuss four key considerations for reliable methods and evaluation of concept similarity over Wikidata: dimensional metrics of similarity, extending coverage to entity nodes, large-scale evaluation, and downstream applications.
Conclusions
This paper designed a framework with representative models for estimating the similarity of concepts in Wikidata. We considered language model embeddings, knowledge graph embeddings, and topological information. We developed combinations of these models, and experimented with knowledge injection via retrofitting to two large knowledge graphs: Wikidata and ProBase. The experiments revealed that pairing language models with well-curated information found in abstracts led to optimal performance. Balancing information wealth against noise, on the one hand, and structure against content, on the other, is an important consideration for future model engineering. We found that retrofitting with taxonomic information from Wikidata generally improved performance across methods, with simpler methods benefiting more. Retrofitting with the ProBase KG yielded consistently negative results, indicating that the impact of retrofitting depends directly on the quality of the underlying data. Future work should investigate contextual similarity between concepts, which would characterize the partial identity of concept and entity pairs. The key obstacle to developing reliable metrics for concept similarity in Wikidata lies in the lack of representative evaluation; addressing this challenge is a high-priority task for subsequent research. Finally, applying the similarity metrics at scale to downstream reasoning tasks is a necessary follow-up step to understand their potential for impact and their latent shortcomings.
