Introduction
Knowledge graphs (KGs) have emerged as powerful tools for a range of applications, including information retrieval, question answering, and data federation (Ji et al., 2021). An entity in a KG refers to a distinct and identifiable concept, which can be, for example, a concrete object, an abstract idea, or an event. Entities are represented as nodes, forming the building blocks of the graph. The relationships between entities are modeled as edges, serving as connections that establish associations or interactions between the nodes. These edges convey various meanings, representing attribute properties or relation properties of the entities. Examples of attribute properties include “birth date,” “genre,” or “description,” and examples of relation properties include “located in,” “established by,” or “worked with.” Based on this definition, (Person1, birth date, “1989-09-30”) is an attribute triple, while (City1, located in, Country1) is a relation triple. Essentially, the combination of entities and their interconnecting relationships forms a structured representation of knowledge. Hence, KGs are designed in a way that facilitates the storage, access, semantic understanding of, and reasoning over data, and they are widely used in a variety of domains, including the Semantic Web in general (Bonatti et al., 2019; Ryen et al., 2022; Villazón-Terrazas et al., 2020), cultural heritage (Achichi et al., 2018; Carriero et al., 2019; Dou et al., 2018; Marchand et al., 2020), biomedicine (Ernst et al., 2015; Nicholson & Greene, 2020; Sanou et al., 2022; Unni et al., 2022), sociology (Cao et al., 2020; Tchechmedjiev et al., 2019; Wang, Chen, et al., 2018), and data-driven industries (Bader et al., 2020; Kejriwal et al., 2019).
As data comes from different sources, it is often scattered across multiple KGs, even if it conveys the same information, leading to various challenges. One such challenge is identifying and matching entities from a source KG to their equivalents in a target KG that represent the same real-world object (Ferrara et al., 2011)—a task known as entity alignment (EA). EA in turn facilitates data integration, information retrieval, and entity disambiguation across diverse knowledge sources (Achichi et al., 2019; Beretta et al., 2020; Saha et al., 2018; Zou, 2020).
We start by providing some definitions.
By dataset, we mean an EA dataset, which consists of a pair of KGs: a source and a target KG to be interlinked, together with a reference alignment that helps evaluate or train the models. A reference (or seed) alignment is a manually curated set of correspondences or alignments (often together with a confidence score) between entities across the two different KGs. By unmatchable entities, we mean pairs of entities from the source and target KGs that are not to be aligned (i.e., they refer to different real-world entities).
We distinguish two main types of datasets. A synthetic benchmark dataset refers to a dataset that consists mostly of KGs sampled from larger ones, following a motivation of having smaller KGs that mimic certain characteristics of the real large KGs. Additionally, in machine learning applications, benchmark datasets could potentially be fully synthetic, that is, entirely generated from scratch, not using a real KG as a starting point. Often in benchmark datasets the source entities are matched with their corresponding counterparts in the target KG under the 1-to-1 assumption (meaning that each source entity has exactly one match in the target graph). In this paper, we use the terms “benchmark dataset” and “synthetic benchmark dataset” interchangeably. A real-world dataset, on the other hand, is one that is issued from a real-world scenario and contains unchanged KGs, that is, KGs that have not been sub-sampled from larger ones under conditions such as being sparse or dense, or retaining a degree distribution similar to that of the KGs they are sampled from.
For the purposes of this study, we categorize EA techniques into two main groups: embedding-based and non-embedding-based methods. Non-embedding-based approaches apply user-crafted representations of entities and relations and align the entities across the KGs based on similarity measures or logic axioms. This group of approaches prioritizes symbolic reasoning, logical inferences, and linking specifications defined by domain experts to guide the alignment process (Zeng et al., 2021). Embedding-based approaches, in contrast, automatically represent entities in a feature space and predict alignments based on similarity metrics over the learned embeddings. Embedding refers to representing an object as a vector in a continuous space based on a given number of constraints (e.g., entities that are close in meaning should have vectors that lie close to one another in the embedding space; Zeng et al., 2021). Following this paradigm, embedding-based EA models commonly use an embedding and an alignment module. While the embedding module represents each KG entity as a vector in a low-dimensional embedding space, the alignment module ensures that aligned entities are close together in a unified embedding space or learns a mapping between KGs with respect to the reference alignment. During the model’s training phase, through iterative interactions between the embedding and alignment modules, all entities from both KGs are embedded, and predictions are made regarding which entities are most likely to align.
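To make the two-module design concrete, here is a minimal sketch (with hypothetical toy data) of the prediction step: assuming the embedding module has already placed source and target entities in a unified space, the alignment module predicts matches by nearest-neighbor search over the target embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unified embedding space: 4 source and 4 target entities, dimension 8.
# Target embeddings are near-copies of the source ones, so entity i in the
# source KG should be matched to entity i in the target KG.
src_emb = rng.normal(size=(4, 8))
tgt_emb = src_emb + 0.01 * rng.normal(size=(4, 8))

def predict_alignments(src_emb, tgt_emb):
    """For each source entity, return the index of the closest target entity."""
    # Pairwise Euclidean distances between all source/target embeddings.
    dists = np.linalg.norm(src_emb[:, None, :] - tgt_emb[None, :, :], axis=-1)
    return dists.argmin(axis=1)

preds = predict_alignments(src_emb, tgt_emb)  # -> [0, 1, 2, 3]
```

Real models differ in how the embeddings are learned, but this nearest-neighbor prediction step is common to approaches that embed both KGs in one space.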
Relying on non-embedding-based methods may be more suitable in scenarios dealing with sparse or small datasets, or in situations where there is not a high variety of heterogeneities, such as predicate, class, or graph linking problems (as defined by Salazar et al., 2023). Real-world datasets often do not meet these ideal conditions, and therefore, embedding-based EA methods promise more efficiency, as they are flexible in cross-lingual scenarios, scalable to large KGs, and globally consistent in representations across KGs.
This paper’s focus is on analyzing embedding-based EA models with respect to both synthetic benchmark datasets and real-world datasets, as well as considering the different training and evaluation strategies and model types. Hence, we add to several recent surveys and studies that investigate this question from a critical viewpoint (Ardjani et al., 2015; Shvaiko & Euzenat, 2011; Zeng et al., 2021). For example, Bengio et al. (2013), Hamilton et al. (2017), and Sun et al. (2017) analyze the performance of embedding-based EA models and compare them regarding their performance on benchmark datasets (Euzenat et al., 2013; Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023; Huang et al., 2022; Jiang et al., 2023; Sun et al., 2020; Zeng et al., 2021; Zhang et al., 2022). Previous research has extracted more realistic EA benchmark datasets (Sun et al., 2020) from large knowledge bases such as DBpedia (Lehmann et al., 2015). We enhance the work of the studies cited above by expanding the benchmarks with datasets containing a low percentage of matchable entities to better reflect real-world scenarios, and by evaluating the performance of two of the leading embedding models (RDGCN and BERT-INT) when we include all entities in the target KG as alignment candidates during the model evaluation. Furthermore, we include an in-depth discussion of the evaluation metrics that are commonly used for the EA task, building on preliminary remarks found in Leone et al. (2022). As an overarching question, we consider several recent embedding-based EA models having state-of-the-art performance on synthetic benchmark datasets and analyze their capacities when they deal with heterogeneous real-world data. We show a considerable drop in performance in the latter scenarios.
To help understand this observation, on the one hand, we analyze and compare the real-world and synthetic benchmark datasets with respect to a set of dataset profiling features (Ben Ellefi et al., 2018) studied and applied for the EA task. On the other hand, we pair these observations with a look into the underlying nature of the embedding-based models. Todorov (2019) observes that cutting-edge EA models did not address the particular properties of data well because they prioritized genericity and automation. Indeed, the results of our study demonstrate that while embedding-based models perform well on certain synthetic benchmark datasets, they struggle in real-world scenarios due to insufficient consideration of the inherent characteristics and nature of the data. Finally, in order to be able to compare the embedding-based models to methods from the non-embeddings group, we include in our analyses the DLinker system (Happi Happi et al., 2022), for reasons explained in Section 3.
The main contributions of this analysis paper are:
- A novel look into and comparison of the frameworks of established EA methods having different embedding bases: we propose a novel categorization of embedding-based EA methods based on their embedding approaches.
- A comparison of the features of synthetic benchmark and real-world datasets from aspects related to EA: although it appears difficult to isolate a structure-related meta-feature that explains the performance of all methods on the different datasets (because each method embeds the structure from a different aspect), we find that semantic similarity is the dataset meta-feature that correlates most strongly with the performance of embedding-based EA methods.
- A discussion of the commonly used evaluation metrics for the EA task: we explain how Hit@1 is equivalent to precision and recall under the 1-to-1 assumption in the validation set, and when and why each evaluation metric should be applied.
- An analysis of the performance drop of EA methods on real-world datasets in comparison to their performance on established synthetic benchmarks: we present evidence and probable reasons to explain the observed drop in performance; we go beyond the 1-to-1 assumption during model evaluation and investigate the performance drop of the EA models using both Hit@1 and
- An analysis of the different categories of embedding models with respect to synthetic and real-world datasets: we find interaction training models to be the best-performing category of EA methods on real-world large-scale data.
The paper is organized as follows: Section 2 summarizes related surveys and empirical studies on EA methods, positioning our work within that context. Section 3 outlines the key features of the embedding-based EA methods selected for performance analysis. In Section 4, we compare several benchmark and real-world datasets, highlighting the greater heterogeneity and complexity of the latter. Finally, Section 5 examines the performance of EA models on heterogeneous real-world data and explores the reasons for their performance decline on these datasets.
This section provides an overview of surveys and related analytical studies that critically examine existing EA approaches, with a particular focus on embedding-based methods. In line with the scope of the paper, specific alignment methods are not discussed here.
Several studies have contributed to the understanding and advancement of KG embeddings (KGEs) and their applications, such as link prediction, KG completion and reasoning, and EA (Choudhary et al., 2021; Ji et al., 2021; Lu et al., 2020; Sharma & Talukdar, 2018; Wang et al., 2017). While Sharma and Talukdar (2018) reveal sharp differences in the geometry of embeddings produced by various KGE methods, Tran and Takasu (2019) introduce a multi-embedding interaction mechanism for analyzing KGE models such as DistMult (Zhang et al., 2019) and ComplEx (Trouillon et al., 2016). The latter study unifies and generalizes these models, offering an intuitive perspective for their effective use. Luo et al. (2022) introduce a scalable and open-source Python library for multisource KG embeddings. Supporting joint representation learning, it implements 26 KGE models and 16 benchmark datasets. Moreover, Cao et al. (2024) categorize the existing KGE models based on representation spaces and discuss whether they have algebraic, geometric, or analytical structures.
Several surveys and experimental studies have been conducted on methods for EA across KGs (Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023; Zhao, Jia, et al., 2020; Zhao, Zeng, et al., 2020). Broadly, these studies categorize EA techniques into two main groups: embedding-based methods and traditional approaches (Sun et al., 2020; Zeng et al., 2021; Zhang et al., 2022). Traditional EA methods rely on user-defined rules, Web Ontology Language reasoning, and/or similarity computations based on symbolic features of entities. We refer to these as non-embedding-based methods.
Turning now to embedding-based methods, Sun et al. (2020) created an open-source toolkit named OpenEA. The authors discuss the characteristics and functionalities of embedding-based methods, highlighting how they predict matching entities through nearest-neighbor searches among target entity embeddings. Two combination paradigms are outlined: one encoding KGs in independent spaces and learning a mapping using seed alignment, and another representing KGs in a unified space, considering highly similar embeddings for aligned entities. The study underscores the incorporation of entity relations and attribute properties into embedding modules to enhance accuracy, categorizing relation embeddings into triple-based, path-based, and neighborhood-based groups. Attribute embedding, achieved through correlation or literal methods, is also explored for improving entity similarity assessment. Fanourakis, Efthymiou, Kotzinos, and Christophides (2023) present the meta-features of the OpenEA datasets, which adhere to the 1-to-1 assumption, and explain the technical details of several embedding-based EA models. However, they do not include details regarding the generation of the datasets’ meta-features, such as description similarity. In this work, we provide formulas to compute the extracted meta-features, and we analyze the performance of EA models on both benchmark and real-world datasets that do not follow the 1-to-1 assumption.
Zhang et al. (2022) analyze the performance of translational and graph neural network (GNN)-based EA methods with respect to the seed alignment and dataset sizes, the use (or not) of attribute triples, the presence of multilingual data, and the embedding size. They propose new benchmark datasets sampled from large-scale KGs such as Wikidata (Vrandečić & Krötzsch, 2014) and Freebase (Bollacker et al., 2008) that do not fulfill the 1-to-1 assumption (40% and 75% of the entities in every pair of KGs combined in the datasets do not have matches). The authors then tested several EA models on the newly sampled datasets. However, one issue with this approach is that the 1-to-1 assumption is not a condition that holds only for the datasets: many EA models also rely on that constraint during model evaluation. Hence, even though Zhang et al. generate new data including not only
Zeng et al. (2021) provide a brief overview of research in EA, covering traditional methods, knowledge representation learning, and alignment based on representation learning in KGs. They conducted their research on only a single dataset, DBP15K, which is a synthetic benchmark dataset holding the
Fanourakis, Efthymiou, Christophides, et al. (2023) explore indirect biases of EA methods due to structural diversity in the KGs and introduce a sampling algorithm to generate challenging benchmark datasets by changing the properties of the KGs. In that way, the authors assess EA methods’ robustness against such diversity. Modifications include changing connectivity metrics such as “average node degree,” “max component size,” and “ratio of weakly connected components” to control the level of structural heterogeneity of the generated datasets. In our work, we do not use a sampling algorithm; instead, we experiment with EA methods having different design bases on widely used benchmark datasets and real-world datasets.
Leone et al. (2022) provide a discussion of the evaluation metrics for EA for datasets that do not follow the 1-to-1 assumption. To go beyond this assumption, the authors generate sub-sampled datasets whose KG sizes differ, where each dataset variant includes about 30% unmatchable entities. In comparison to Leone’s study, in addition to the fact that our real-world datasets are original and not obtained by sampling, the proportion of unmatchable entities for each of our real-world datasets is more than 80% (KGs in DOREMUS and AgroLD on average have more than 87% and 82% unmatchable entities, respectively). Furthermore, we report the Hit@k measures for the case that the 1-to-1 assumption does not hold on our datasets to compare the models’ performance with that reported in previous studies.
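As an illustration of the Hit@k measure discussed here, the following sketch computes it from hypothetical ranked candidate lists (toy data, not any real dataset). Note that under the 1-to-1 assumption, with exactly one prediction per source entity, Hit@1 coincides with precision and recall.

```python
# Gold matches from a toy reference alignment.
reference = {"s1": "t1", "s2": "t2", "s3": "t3"}

# Hypothetical model output: target candidates ranked by predicted similarity.
ranked = {
    "s1": ["t1", "t2", "t3"],
    "s2": ["t3", "t2", "t1"],
    "s3": ["t2", "t1", "t3"],
}

def hit_at_k(reference, ranked, k):
    """Fraction of source entities whose gold match appears in the top-k candidates."""
    hits = sum(1 for s, t in reference.items() if t in ranked[s][:k])
    return hits / len(reference)

h1 = hit_at_k(reference, ranked, 1)  # only s1 is correct at rank 1 -> 1/3
h2 = hit_at_k(reference, ranked, 2)  # s1 and s2 correct within top 2 -> 2/3
```

When the 1-to-1 assumption is dropped, unmatchable source entities make such a metric misleading on its own, which is why the choice of metric matters for our real-world datasets.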
To sum up, our work builds on previous research by extending and refining key aspects of EA evaluation. In particular, we study the performance of embedding-based EA methods with distinct representation learning principles on real-world and benchmark datasets. In that, our study stands out for its attention to data quality considerations (Ben Ellefi et al., 2018). While prior studies provide valuable insights into meta-features and sampling methods, we advance this by offering explicit formulas for meta-feature extraction and testing EA models on both benchmark and real-world datasets that do not follow the 1-to-1 assumption. Instead of generating sampled datasets, we focus on real-world original data with over 80% unmatchable entities, providing a more rigorous evaluation. In this way, our work continues and enhances existing research, bringing new perspectives to real-world EA challenges.
Methods for EA via Representation Learning
Certain studies categorize embedding-based methods according to their use of semantic information to represent the KGs (Wang et al., 2023), while others categorize them according to whether they use attribute or relation predicates for embedding learning, their alignment modules (i.e., whether they embed both KGs in the same space or separately), or their learning strategy (supervised, semi-supervised, or unsupervised; Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023; Sun et al., 2020). Based on recent studies (Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023; Jiang et al., 2023; Sun et al., 2020; Wang et al., 2023; Zeng et al., 2021; Zhang et al., 2022) and our analysis, we propose to classify the embedding-based EA models into four groups: (1) translational, (2) GNN-based, (3) graph transformers (GTs)-based, and (4) interaction training models.
Several EA models, such as MTransE (Chen et al., 2017) and IPTransE (Zhu et al., 2017), have been designed using translational techniques such as TransE (Bordes et al., 2013) for KGE and EA across KGs (Sun et al., 2018). A KG is usually represented as a directed graph, in which nodes refer to entities and edges refer to relations between entities, or simply by a set of triples of the form (head, relation, tail).
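The translational intuition behind TransE-style models can be sketched as follows (the vectors are hypothetical toy values, purely for illustration): a triple (h, r, t) is considered plausible when the tail embedding lies close to head + relation.

```python
import numpy as np

# Toy 2-dimensional embeddings chosen by hand so that
# Paris + located_in == France exactly.
emb = {
    "Paris":      np.array([1.0, 0.0]),
    "France":     np.array([1.0, 1.0]),
    "Tokyo":      np.array([5.0, 2.0]),
    "located_in": np.array([0.0, 1.0]),
}

def transe_score(h, r, t):
    """Lower is better: distance between (head + relation) and tail."""
    return float(np.linalg.norm(emb[h] + emb[r] - emb[t]))

good = transe_score("Paris", "located_in", "France")  # 0.0: triple holds
bad  = transe_score("Paris", "located_in", "Tokyo")   # large: implausible triple
```

Models such as MTransE build on this scoring function and add a component that relates the embedding spaces of the two KGs.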
The final group of models we introduce (and that prior works categorize as “others”; Jiang et al., 2023) has a common important characteristic: learning the embeddings of the two KGs simultaneously (Tang et al., 2020; Wang et al., 2023; Yang et al., 2020; Zeng et al., 2020). We refer to this group as the interaction training group. Unlike other methods that embed entire KGs independently and then align entities, interaction training models do not need to embed entire KGs, which makes their inference more adaptable to unseen data. Instead, these models embed pairs of entities from both source and target KGs, simultaneously capturing interactions between the entities. This is done by comparing the entities’ features (using techniques such as aggregation or averaging) to generate interaction vectors, which are then embedded through neural networks or similar techniques. The final predictions are based on a distance margin or threshold: if the interaction embedding of a pair of entities belonging to the source and target KGs scores above the threshold (measured using vector norms), then the entities are aligned. The aim is to keep a margin between the distances of aligned and non-aligned entity pairs. Interaction training methods may use a translational, GNN, or any other base model to initially embed the entities, but in contrast to the three other groups, these models can provide insights into the correlation between the features of entities belonging to two KGs, whether they are aligned or not.
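The interaction-training idea can be sketched minimally as follows. The feature vectors, the element-wise comparison, the scorer, and the threshold below are all illustrative assumptions, not the design of any specific model; real systems learn the scorer with a neural network.

```python
import numpy as np

def interaction_vector(src_feat, tgt_feat):
    """Compare entity features element-wise to capture their interaction."""
    return np.concatenate([np.abs(src_feat - tgt_feat), src_feat * tgt_feat])

def aligned(src_feat, tgt_feat, threshold=0.5):
    """Toy alignment decision: small feature differences => high score."""
    v = interaction_vector(src_feat, tgt_feat)
    # Score only the difference part of the interaction vector; a learned
    # scorer would use the whole vector.
    score = float(np.exp(-np.linalg.norm(v[: len(src_feat)])))
    return score >= threshold

a = np.array([0.9, 0.1, 0.4])   # features of a source entity
b = np.array([0.9, 0.1, 0.4])   # identical target features => aligned
c = np.array([0.0, 0.9, 0.9])   # distant target features  => not aligned

same = aligned(a, b)
diff = aligned(a, c)
```

The key contrast with the other groups is visible even in this sketch: only the pair being compared is processed, so no full-KG embedding is required at inference time.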
After analyzing many comparative studies on benchmark datasets, we decided to focus in this study on the following recently proposed embedding-based EA methods, representative of each of the four groups outlined above: MultiKE (translational model; Zhang et al., 2019), RDGCN (GNN-based model; Wu et al., 2019), i-Align (GT-based model; Trisedya et al., 2023), and BERT-INT (interaction training model). We choose these established models because they are scalable to run on real-world large KGs and have state-of-the-art performance on well-known benchmark datasets (Jiang et al., 2023). We give more details on each of them in the following.
MultiKE considers the two distinct KGs to be aligned as one large KG. To connect these two KGs and augment the number of relation triples, the method connects each entity in the source KG to the neighbors of its counterpart entity in the target KG and vice versa by replacing the head and tail entities of each relation triple with their counterparts in the reference alignment. To further enhance the relation triples, the method identifies matching relation and attribute predicates by comparing their literal or relation embeddings and selecting those that exceed a similarity threshold. Once the predicates are matched across the KGs, each relation is replaced by its counterpart, augmenting the relation triples accordingly. Then, it represents each entity and relation using a variant of TransE. To generate the final EA predictions, the model combines these representations with encoded local names of entities and predicates, which are then fed into a convolutional neural network (Zhang et al., 2019).
BERT-INT begins by generating initial entity embeddings using a pre-trained BERT-based model, leveraging the entities’ descriptions or names/labels. It then constructs a similarity matrix based on the initial embeddings for each pair of training entities. Next, the method creates a neighborhood similarity matrix to co-train each entity pair in the candidate set. For training the interactions of the KGs’ structural embeddings, BERT-INT relies exclusively on the direct neighbors of the entities. It is worth noting that BERT-INT computes interactions between the attributes of entities being compared, rather than simply aggregating attribute information. This approach can reduce the impact of noisy or irrelevant attribute matches. The attribute-view interactions are processed in a unified way along with name/description and neighbor interactions, contributing to the overall alignment decision. It then aggregates all the vectors obtained by the similarity matrices to represent each pair of entities and finalizes the entity pair representations using a multi-layer perceptron.
RDGCN leverages relation information in entity representations employing a two-step process. First, a dual relation graph is constructed based on the input KG (the context graph); this dual graph is simply the line graph of the context graph. In the dual graph, each node represents a type of relation, and two nodes are connected if the corresponding relations share a common head or tail entity in the main KG. Then, a graph attention mechanism is applied to foster interactions between the two graphs. The resulting vertex representations in the context graph are fed to GCN (Kipf & Welling, 2017) layers to capture the graph’s structural information through a message-passing scheme. In the last step, the obtained entity representations are used for aligning pairs of entities.
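The dual relation graph construction described above can be sketched directly from a triple list. The toy KG below is hypothetical; the point is the rule: relation types become nodes, connected when they share a head or a tail entity in the context graph.

```python
from itertools import combinations

# Hypothetical toy context graph as (head, relation, tail) triples.
triples = [
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "France"),
    ("Tokyo", "capital_of", "Japan"),
    ("Kyoto", "has_temple", "Kinkakuji"),
]

def dual_graph(triples):
    """Nodes are relation types; an edge joins two relations that share
    a head entity or a tail entity somewhere in the context graph."""
    heads, tails = {}, {}
    for h, r, t in triples:
        heads.setdefault(r, set()).add(h)
        tails.setdefault(r, set()).add(t)
    edges = set()
    for r1, r2 in combinations(heads, 2):
        if heads[r1] & heads[r2] or tails[r1] & tails[r2]:
            edges.add(tuple(sorted((r1, r2))))
    return edges

edges = dual_graph(triples)
# capital_of and located_in share the head "Paris" (and tail "France"),
# while has_temple shares no entity with the other relations.
```

RDGCN then runs attention over this dual graph and the context graph jointly; the sketch covers only the graph construction step.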
i-Align uses two transformer-based architectures to represent the entities based on their graph structures and textual attribute values. The model uses a graph encoder to aggregate the entities’ structural information which can also effectively handle large KGs. The model’s other transformer obtains the interconnection between the entity attributes using the embeddings of attribute keys and values as inputs. i-Align provides explanations of the alignment results in the form of a set of the most influential attribute predicates and entity neighbors based on the attention weights of its two transformers.
We summarize the main properties of these four methods in Table 1, including their respective results in terms of the Hit@1 measure as reported in the original papers introducing these methods. As we can see, all methods report a Hit@1 of more than 88%, exceeding competing methods in the respective studies. MultiKE and i-Align have been evaluated on the DBP_WD_100K (Sun et al., 2018) and DBP_YG_15K (Zhang et al., 2022) benchmark datasets, respectively, while BERT-INT and RDGCN have been evaluated on the DBP15K (Sun et al., 2017) dataset. All four methods use entity names as an additional input for embedding; i-Align also utilizes the attribute predicates’ names and values in its embedding procedure. To use the maximum descriptive information of entities, BERT-INT employs the entities’ descriptions instead of their names when such descriptions exist.
Comparison of Embedding-Based EA Methods.
Note . EA = entity alignment; KG = knowledge graph; GNN=graph neural network; FR–EN = French–English.
Finally, to be able to compare non-embedding-based and embedding-based methods, we include in our analyses DLinker (Happi Happi et al., 2022) as a representative method of the non-embedding-based group. The method applies an average aggregation over the similarity measures derived from the instance objects, calculated by the longest common subsequence algorithm. DLinker has a performance that is close to that of the best-performing system, LogMap (Jiménez-Ruiz & Cuenca Grau, 2011), on several OAEI entity linking tracks. Furthermore, because it was developed by this paper’s authors’ team, having full control of the tool facilitates further experiments.
In the sequel, we move on to describing and comparing the datasets that we consider in this study, specifically in terms of their different heterogeneity aspects.
In this section, after giving a summary of the datasets we consider and motivating their choice, we study the degrees of their heterogeneities using specific metrics, introduced below. The study showed these datasets to be diverse and highly heterogeneous. Hence, we believe the analysis of the performance of the four chosen models (described above) on this particular collection of datasets would give us adequate insights beyond the specific choice of datasets and models, and in particular, for a better understanding of the challenges for the EA task when dealing with real-world, highly diverse datasets.
Datasets
We proceed to present and analyze the chosen datasets, coming from the two groups identified in the introduction: synthetic benchmark datasets and real-world datasets.
Benchmark Datasets
We consider DBP15K (Tang et al., 2020), which is a benchmark dataset that a significant number of state-of-the-art methods report their results on (Berrendorf et al., 2020; Zeng et al., 2021). DBP15K consists of three pairs of KGs that differ in the used language (French, Japanese, and Chinese). We pick the French–English dataset (DBP15K FR–EN).
Real-World Datasets
Because the heterogeneity of KGs has a broader meaning than linguistic differences (Achichi et al., 2019), and because benchmarks often present idealized scenarios with a limited set of relationships, controlled noise, and specific characteristics (Fundulaki & Ngomo, 2016), we added to our investigation two real-world datasets, DOREMUS (Achichi et al., 2018) and AgroLD (Larmande & Todorov, 2021), that differ from benchmarks in terms of the types of their heterogeneity. DOREMUS is a real-world music-related dataset consisting of three interconnected datasets that describe classical music works and the related events and entities. The data is multilingual with a majority of French text and comes from catalogs and archives of three major French cultural institutions (Radio France, La Philharmonie de Paris, and the French National Library; Lisena et al., 2018). AgroLD consolidates data relevant to the plant science community, including crops such as rice, wheat, and Arabidopsis (Venkatesan et al., 2018). With approximately 900 million triples, AgroLD is the result of annotating and integrating over 100 datasets from 15 diverse sources (Larmande & Todorov, 2021).
To get an idea of the extent to which the KGs in each dataset differ in scale, we show in Table 2 the sizes of the source and target KG for each dataset (denoted by #S and #T, respectively). The remaining columns of the table will be introduced and explained as we proceed in this section.
Comparing the Two KGs of Each Dataset (All Numbers Indicate Percentages Except for the KG Sizes, Which Indicate the Number of Entities).
Note . KG = knowledge graph; JS = Jensen–Shannon; EA = entity alignment.
We start by taking a bird’s-eye view of the datasets and showing the degree distribution of the underlying KGs, that is, of the undirected graphs in these datasets, in Figure 1. We count the number of entities relative to their degrees and visualize this for degrees up to the point where 90% of the nodes have a degree below that threshold. We also considered visualizing up to the median or the median unique degree. However, we believe it is not appropriate to plot up to these values, as the median only represents the point at which 50% of the entities are below or equal to it, and fewer than five entities have the median unique degree.

Degree distribution of each two knowledge graphs (KGs) for each dataset.
The figure shows that the number of nodes having the same degrees is similar in the pair of KGs in both DBP15K and SPIMBENCH datasets, indicating similar degree distributions in these two datasets. However, this is not the case for DOREMUS, AgroLD, ICEWS-WIKI, and ICEWS-YAGO, where we can see that the number of nodes having the same degree is different across the KGs in each dataset.
To get a more in-depth understanding of the dataset heterogeneities, we proceed to compute three statistical and two qualitative metrics for each pair of KGs in our datasets.
To statistically analyze the underlying distribution of degree sequences in each pair of KGs, we opt for applying the JS divergence test. We find the JS divergence, or JS distance (Endres & Schindelin, 2003), to be a suitable statistical metric that captures the amount of overlap between two distributions by using a bi-directional Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951). KL divergence, defined in equation (1), measures how one probability distribution diverges from a second, reference probability distribution.
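The computation can be sketched as follows; the degree sequences below are toy values standing in for the actual degree histograms of a dataset's two KGs.

```python
import numpy as np

def degree_histogram(degrees, max_deg):
    """Empirical degree distribution over degrees 0..max_deg."""
    counts = np.bincount(degrees, minlength=max_deg + 1).astype(float)
    return counts / counts.sum()

def kl(p, q):
    """Kullback-Leibler divergence in bits, restricted to the support of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Symmetrized, bounded divergence: average KL to the mixture of p and q."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

deg_src = [1, 1, 2, 2, 3, 3]   # toy degrees in the source KG
deg_tgt = [1, 1, 1, 1, 5, 5]   # toy degrees in the target KG
max_deg = max(max(deg_src), max(deg_tgt))
p = degree_histogram(deg_src, max_deg)
q = degree_histogram(deg_tgt, max_deg)

d_self = js_divergence(p, p)    # identical distributions -> 0
d_cross = js_divergence(p, q)   # differing distributions -> in (0, 1]
```

Using base-2 logarithms bounds the JS divergence by 1, which makes values comparable across dataset pairs; the square root of this quantity is the JS distance.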
To better recognize the differences in the KGs’ degree distributions, we calculate our second statistical metric, which measures the maximum difference in the percentage of entities with respect to the degrees in each pair of KGs. By looking at the second column of Table 2, we can see that the maximum difference in the percentage of the nodes across the KGs (w.r.t. the node degrees) in DOREMUS is much higher than in all other datasets. This confirms the observation in Figure 1, showing that the percentage of entities having the same degree in the two KGs varies less in the benchmark datasets (DBP15K and SPIMBENCH).
Size Similarity
As a third statistical metric, we calculate the normalized difference between the number of entities in every pair of KGs, so we can compute the similarity in the size of KGs using equation (3):
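Equation (3) is not reproduced here; a plausible formulation consistent with the description (a normalized difference between the entity counts of the two KGs) is sketched below, with the numbers purely illustrative:

```python
def size_similarity(n_src: int, n_tgt: int) -> float:
    """1.0 when both KGs have the same number of entities; approaches 0
    as the sizes diverge. One plausible reading of equation (3)."""
    return 1.0 - abs(n_src - n_tgt) / max(n_src, n_tgt)

print(size_similarity(15000, 15000))  # 1.0  -> identical scale
print(size_similarity(10000, 40000))  # 0.25 -> very different scale
```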
Analyzing the results of the three statistical features (Table 2), in combination with the degree distributions (Figure 1), reveals a higher level of structural heterogeneity in the KGs of DOREMUS and AgroLD as compared to the two synthetic benchmark datasets. Moreover, the JS distance between the degree distributions of the KGs is smaller in the benchmark datasets than in the other datasets. Furthermore, among the non-benchmark datasets, ICEWS-YAGO contains the KGs with the least overlap in their degree distributions, and AgroLD contains the KGs with the largest difference in scale.
Nevertheless, none of these three statistical metrics are indicators of the textual/lingual properties of the entities. We therefore turn our attention to string-level features.
In order to gain a deeper understanding of how dataset heterogeneities affect each model’s performance, we need a qualitative heterogeneity metric, especially for approaches such as BERT-INT, MultiKE, and i-Align that use the textual attribute values of the entities. Even RDGCN, instead of random initialization, uses a representation vector of the entity name as the entity’s initial embedding. In addition, since all four analyzed methods are trained in a supervised manner, they all use a part (30%) of the reference alignment as training data, so the quality of the reference alignment directly affects each model’s performance. We therefore explore two qualitative metrics: one based on the Levenshtein similarity (in this subsection) and one based on the embeddings’ semantic similarity (in the following subsection).
As a first qualitative metric, we compute the average normalized Levenshtein similarity of the attribute values over all pairs of aligned entities in the reference alignment. The Levenshtein, or edit, distance is originally a measure of the closeness of two strings: it quantifies the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other (Yujian & Bo, 2007). After normalization by the length of the longer string, the resulting similarity always lies in the interval [0, 1].
Minor variations in the input data do not affect the performance of language models in embedding texts (Elekes et al., 2017; Heylen et al., 2008), and EA methods that utilize the attribute values of entities (including three of our employed methods) use a language model for the initial embeddings of the entities (Shen et al., 2022; Z. Zhang et al., 2020). We therefore first lemmatized and stemmed all the words in each attribute value. Then, for each pair of entities in the reference alignment, we compared all their attribute values, computed the maximum Levenshtein similarity between each pair of attribute values, and averaged over all pairs. Due to the multilingual nature of DBP15K
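The procedure can be sketched as follows (lemmatization and stemming omitted; the attribute values and entity pairs are made up for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b (DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def norm_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def avg_max_similarity(aligned_pairs):
    """aligned_pairs: list of (src_attr_values, tgt_attr_values) per entity pair.
    Take the max similarity over all cross pairs, then average over the alignment."""
    scores = [max(norm_similarity(x, y) for x in src for y in tgt)
              for src, tgt in aligned_pairs]
    return sum(scores) / len(scores)

pairs = [(["symphony no. 5"], ["symphony num 5", "op. 67"]),
         (["beethoven"], ["beethoven"])]
print(round(avg_max_similarity(pairs), 3))  # 0.929
```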
The results of the Levenshtein measurements reported in Table 2 indicate that, despite possible translation errors in the DBP15K French KG (Wu et al., 2019), the normalized Levenshtein similarity of aligned entities in the benchmark datasets is higher than in the real-world datasets. This suggests that EA methods, particularly those relying on textual descriptions of entities, may perform better on benchmark datasets. Since attribute triples are not included in the ICEWS-WIKI and ICEWS-YAGO datasets, calculating the Levenshtein measure based on attribute values is not possible for these datasets.
Semantic Similarity
While the normalized Levenshtein similarity offers insights into the textual closeness of aligned entities in each dataset, it primarily focuses on character-level differences and does not capture the semantic or contextual similarities between entity pairs. Therefore, we further investigate the semantic similarity between the aligned entities in the KGs.
As mentioned, approaches using entity or predicate features as input usually utilize a language model to embed the entities. Studies show that the performance of these methods depends on the quality of the initial embeddings (Sun et al., 2020; Wang et al., 2019). Hence, we want to measure the similarity of two aligned KGs (Traverso et al., 2016; Zhu & Iglesias, 2016) based on the initial embeddings of the entities in the reference alignment. Because language models capture the semantic similarity of words, we rely on the entity embeddings. We apply the well-known normalized Euclidean relative distance over the pairs of entities of the reference alignment, which is a common choice (Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023), given as follows:
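The exact formula appears in the cited work; a common form of the normalized Euclidean (relative) distance, the Euclidean distance divided by the sum of the vector norms, which bounds the value to [0, 1], is sketched here as an assumption:

```python
import numpy as np

def normalized_euclidean_distance(u, v):
    """||u - v|| / (||u|| + ||v||): 0 for identical vectors, 1 for opposite ones."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) + np.linalg.norm(v)
    return np.linalg.norm(u - v) / denom if denom else 0.0

print(normalized_euclidean_distance([1, 0], [1, 0]))   # 0.0 -> identical
print(normalized_euclidean_distance([1, 0], [-1, 0]))  # 1.0 -> opposite
```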
To visualize how the semantic similarity in different synthetic benchmark and real-world datasets differs, we applied t-distributed stochastic neighbor embedding (t-SNE; Van der Maaten & Hinton, 2008). t-SNE is a dimensionality reduction technique commonly used for visualizing high-dimensional data in lower-dimensional space, typically 2D or 3D. In Figure 2, we visualize the entity embedding spaces of the SPIMBENCH and DOREMUS datasets that, according to Table 2, are datasets with a low and a high level of EA semantic similarity, respectively. 8
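A minimal sketch of this visualization step, with random vectors standing in for the BERT-based entity embeddings (real runs would use the models' actual embeddings and draw the grey connection lines with matplotlib):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
src = rng.normal(size=(25, 32))                     # stand-in source embeddings
tgt = src + rng.normal(scale=0.1, size=src.shape)   # aligned targets, perturbed

points = np.vstack([src, tgt])
xy = TSNE(n_components=2, perplexity=10, init="random",
          random_state=0).fit_transform(points)     # reduce to 2D

src_xy, tgt_xy = xy[:25], xy[25:]
# each pair (src_xy[i], tgt_xy[i]) would be drawn as one grey line, e.g.:
# for a, b in zip(src_xy, tgt_xy): plt.plot([a[0], b[0]], [a[1], b[1]], c="grey")
print(xy.shape)  # (50, 2)
```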

Reduced-dimension BERT-based initial entity embeddings of SPIMBENCH (to the left) and DOREMUS (to the right).
The dark blue and red points represent the seed alignment of the KGs, and each entity in the seed alignment is connected to its counterpart by a grey line. Looking at the grey lines, which show the distance between the initial embeddings of the entities in the reference alignment, one can easily see how far apart the aligned entities lie in the real-world DOREMUS dataset. In SPIMBENCH, only two aligned entity pairs are far apart; for the other aligned samples, the distance is much shorter than in the DOREMUS dataset. It is important to note that in Figure 2, for the SPIMBENCH dataset, several red dots in the bottom-right corner appear unlinked. In reality, they are connected to nearly overlapping dark blue dots: the initial embeddings of the source and target entities are so similar that the connecting line becomes imperceptible. Additionally, some red dots in the middle-left of the figure are surrounded by light-blue dots. These light-blue dots represent entities in the source KG that have very similar initial embeddings but are not part of the seed alignment, meaning they have no corresponding match in the target KG. The plots confirm the higher level of semantic heterogeneity in the real-world DOREMUS dataset as compared to the benchmark dataset SPIMBENCH.
In this section, we present the results of implementing and applying the selected EA models on the chosen datasets. We first explain the challenges of using the models on real-world and less well-known benchmark datasets and how we overcome these issues, this being part of the lessons learnt in this work. Next, we discuss the evaluation metrics employed by the models and present the results of our experiments. Further on, we provide an overview of how the models perform on both benchmark and real-world datasets. We also investigate how these performances relate to the dataset features in light of the discussion in Section 4. Finally, we look into the inference capacities of the models when facing the full-scale graphs (instead of their corresponding validation sets).
Datasets Preparation for Applying the EA Models
For each dataset, we have a file for the source KG, a file for the target KG, and a file containing the reference alignments in XML, Turtle, or N-Triples format. To feed the data to each model, we prepare a series of files following the naming convention and formats required by that model (e.g., json, pkl, txt). In this process, we encountered issues related either to the dataset design itself (e.g., the use of blank nodes) or to the design of the model’s input (e.g., missing instructions about the proper model input). Even with correctly formatted inputs, runtime errors can still occur unexpectedly due to minor changes in the input data, which necessitates a thorough data-validation process to ensure the models function correctly. Data validation in an ML pipeline ensures that training data is error-free and accurate, preventing issues that could degrade model performance during deployment and safeguarding against errors introduced during data processing (Polyzotis et al., 2018). Hence, we need to manage the data lifecycle of the inputs to each embedding-based EA model (Gudivada et al., 2017) and address as many problems as we encounter to prepare suitable data. After writing the code to prepare the proper input files for the four models and validating it on the different benchmark and real-world datasets, we share the code in a GitHub repository 9 to allow researchers to employ these methods on their own datasets. We include links to the original models in our GitHub repository.
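As an illustration of this preparation step (not the exact pipeline), the sketch below splits simple N-Triples lines into relation and attribute triples, the two kinds of input most models expect; the example URIs are hypothetical:

```python
import re

# very simplified N-Triples pattern: <s> <p> <o-or-literal> .
TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')

def split_triples(nt_lines):
    """Separate relation triples (URI object) from attribute triples (literal object)."""
    relations, attributes = [], []
    for line in nt_lines:
        m = TRIPLE.match(line.strip())
        if not m:
            continue  # skip blank nodes / malformed lines found in real data
        s, p, o = m.groups()
        if o.startswith('<'):                     # object is a URI -> relation triple
            relations.append((s, p, o.strip('<>')))
        else:                                     # object is a literal -> attribute triple
            attributes.append((s, p, o.strip('"')))
    return relations, attributes

lines = [
    '<http://ex.org/City1> <http://ex.org/locatedIn> <http://ex.org/Country1> .',
    '<http://ex.org/Person1> <http://ex.org/birthDate> "1989-09-30" .',
]
rel, attr = split_triples(lines)
print(len(rel), len(attr))  # 1 1
```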
Note that, despite the high heterogeneity of ICEWS-WIKI and ICEWS-YAGO, which makes them more similar to real-world KGs, the only model that we employ on them is RDGCN. The reason is that these two datasets contain only relation triples and lack the attribute triples that are essential features for DLinker and the other three methods. We opted not to use additional datasets due to significant structural differences and limited accessibility. The OpenEA datasets, for instance, were generated under the 1-to-1 assumption, omitting unmatched entities. In response to this limitation, Leone et al. (2022) introduced new, more realistic datasets that do not follow this assumption, but these were not accessible in their repository. While the sampling algorithm code is provided to regenerate the datasets, doing so would result in datasets that differ from DOREMUS and AgroLD, as Leone’s datasets are derived from larger KGs, whereas DOREMUS and AgroLD are not. Therefore, we chose not to use additional datasets, in order to maintain consistency in the real-world data analysis.
Evaluation Metrics
There are two commonly used families of evaluation metrics: (1) precision and recall (and the resulting $F_1$ score), and (2) Hit@k.
We define precision, recall, and $F_1$ score in the standard way: $\mathrm{precision} = \frac{TP}{TP + FP}$, $\mathrm{recall} = \frac{TP}{TP + FN}$, and $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
In this work, we make use of both types of metrics: Hit@k allows us to compare our results with the state-of-the-art EA embedding-based methods, which mainly use this metric to report their performance. However, Leone et al. (2022) argue that precision, recall, and $F_1$ score are the more appropriate metrics when the 1-to-1 assumption does not hold.
A question that might come to mind is why, under the 1-to-1 assumption, Hit@1 is equivalent to precision, recall, and $F_1$ score.
To show that precision and recall (and the resulting $F_1$ score) coincide with Hit@1 under the 1-to-1 assumption, we define TP, FP, TN, and FN based on the models’ Hit@1 predictions.
Defining TP, FP, TN, and FN for EA Models Based on Hit@1 Predictions, When the Validation Set of Size
Considering Hit@1 as the final prediction by EA models, for each entity
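The equivalence can be checked numerically: under the 1-to-1 assumption each source entity receives exactly one predicted match, so the number of predictions equals the number of gold pairs and precision, recall, and F1 all coincide with Hit@1. A toy sketch with a made-up similarity matrix (gold matches on the diagonal):

```python
import numpy as np

def hit_at_1(sim):
    """Fraction of rows whose top-ranked column is the diagonal (gold) match."""
    return float(np.mean(sim.argmax(axis=1) == np.arange(sim.shape[0])))

def precision_recall_f1(sim):
    pred = sim.argmax(axis=1)                 # exactly one prediction per source entity
    tp = int(np.sum(pred == np.arange(len(pred))))
    fp = len(pred) - tp                       # wrong predictions
    fn = len(pred) - tp                       # gold pairs never predicted
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.5],   # wrong: row 1 prefers column 2
                [0.1, 0.1, 0.8]])
print(hit_at_1(sim), precision_recall_f1(sim))  # all four values equal 2/3
```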
Real-World Versus Benchmark Datasets
The comparative performance of the selected models on the chosen datasets is given in Table 4; the best-performing method (highest Hit@1 score) for each dataset is highlighted in bold. Following the discussion presented in Section 5.2, we compare the Hit@1 results of the embedding-based models with DLinker, considering cases where the 1-to-1 assumption was met during the evaluation of the embedding-based EA models on each dataset. We can observe an overall drop in the performance of embedding-based models when tested on real-world datasets, as compared to benchmark ones. In what follows, we discuss the results of each model of interest in light of that global observation, while also considering the internal mechanisms that differentiate the models from one another and could provide insights into these observations.
Measured Evaluation Metrics of Analyzed EA Models on the Datasets (All Numbers Indicate Percentages). Following the 1-to-1 Assumption During the Models’ Evaluation, Hit@1 Equals the Precision, Recall, and F1-Score for Each Model.
The reported numbers are derived from the original study.
The performance of the BERT-INT model is strong on datasets such as DBP15K and SPIMBENCH, achieving high Hit@1 rates (99.3% and 82.4%, respectively). However, its performance drops significantly on our real-world datasets DOREMUS and AgroLD (Hit@1 rates of 47.9% and 21.1%, respectively). The reason for this drop is that BERT-INT relies heavily on the quality and amount of textual information (entity descriptions), as observed in Table 1. Hence, datasets with less textual and semantic similarity and fewer descriptive features, such as DOREMUS and AgroLD (Table 2), lead to a decrease in its performance. This emphasizes the importance of high-quality data descriptions for BERT-INT’s success.
Observing the results of RDGCN (Table 4), we can see that its performance exceeds 88% and 77% on DBP15K and SPIMBENCH, respectively, and it also drops significantly in the real-world scenarios (close to 0% Hit@1). RDGCN relies solely on graph structure and a GloVe word embedding 10 of entity names (see Table 1). This, along with the statistical metrics from Table 2, which highlight the greater structural heterogeneity of the real-world datasets, helps explain this outcome. Since RDGCN is adaptable to different initial embeddings, we modified the model to use a multilingual pre-trained BERT model to generate the initial embeddings for entity names. This adjustment improved the Hit@1 and Hit@10 scores on the dataset by only 0.4% and 6%, respectively. This modest improvement is likely due to the fact that most entity names are represented by IDs rather than meaningful text, which limits the impact of the embeddings. We have uploaded the code for generating the initial embeddings with BERT on graphics processing units to our GitHub repository. We also experimented with using BERT-based embeddings of the entities’ descriptions (as used in BERT-INT) as the initial embeddings in RDGCN. To facilitate this, we implemented a method to create a dictionary of entity descriptions for any pair of given RDF graphs, which is available in our repository. This approach led to improved Hit@1 scores across several datasets: 93.70 on DBP15K, 99.53 on SPIMBENCH, 22.49 on DOREMUS, and 7.21 on AgroLD, corresponding to gains of 5.1, 21.83, 21.16, and 7.19 percentage points, respectively. This demonstrates that more informative initial embeddings significantly boost RDGCN’s performance.
Looking at Figure 1, we can see the long-tail issue of AgroLD’s KGs. The long-tail problem in graphs (Malekzadeh Hamedani & Kaedi, 2019; Shi, 2013) describes a situation where a small number of nodes have a substantial number of neighbors, while the majority (referred to as tail nodes) have only a few (Liu et al., 2021). The GNNs used in RDGCN under-represent tail nodes during training, leading to low-quality KG embeddings (Liang et al., 2024), which can explain RDGCN’s drop in performance on this dataset. However, as we observe in Figure 1, SPIMBENCH also has a long tail, which does not appear to be an issue for RDGCN. We found two main differences between SPIMBENCH and AgroLD that could explain the performance drop from one dataset to the other. (1) The number of common neighbors: In the SPIMBENCH dataset, many entities in the reference alignment share common neighbors. On average, 48% of the entities in the reference alignment, across its two KGs, have at least one common neighbor with their linked entities. According to the results presented in Wang et al. (2024), a higher number of common neighbors improves the quality of the embeddings generated by GCNs for these entities. However, in the AgroLD KGs, none of the linked entity pairs share a common neighbor. As Wang et al. (2024) demonstrated, the performance of GNN-based graph embedding models, including GCNs, correlates more strongly with the number of common neighbors than with node degrees. This suggests that the lack of common neighbors in AgroLD could negatively impact the performance of the RDGCN model. (2) The KGs in AgroLD are bipartite: Giamphy et al. (2023) discuss how GNN-based embedding of a large bipartite graph is difficult due to the challenge of merging heterogeneous node- and graph-level information while remaining scalable as the graph grows.
They also propose a list of available resources that perform better on bipartite graph embedding (unfortunately, none of them support multi-relational graph embedding, which our case requires). Moreover, RDGCN uses word embedding models to produce the initial embeddings of the entities from entity names. Because the names (by the model’s default, the last part of the entity URIs) of the musical works and the proteins/genes in DOREMUS and AgroLD are defined by IDs in their respective ontologies, the initial entity embeddings cannot guide the embedding module to a better result. All the aforementioned observations can explain the low performance of RDGCN on DOREMUS (1.2% Hit@1) and AgroLD (<1% Hit@1). Furthermore, the results of RDGCN on the ICEWS-WIKI and ICEWS-YAGO datasets again suggest that high-quality entity names contribute to improving the model’s performance. As a result, these two datasets are structurally less complex for the model than the real-world datasets.
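The common-neighbor statistic mentioned in point (1) can be sketched as follows, where "common neighbor" means a neighbor of the source entity whose aligned counterpart is a neighbor of the target entity (a toy graph is used for illustration):

```python
def common_neighbor_fraction(edges_src, edges_tgt, alignment):
    """Fraction of aligned pairs (s, t) sharing at least one common neighbor.
    alignment: dict mapping source entities to their aligned target entities."""
    def neighbors(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        return adj

    adj_s, adj_t = neighbors(edges_src), neighbors(edges_tgt)
    hits = 0
    for s, t in alignment.items():
        # map s's neighbors into the target KG via the alignment
        mapped = {alignment.get(n) for n in adj_s.get(s, ())}
        if mapped & adj_t.get(t, set()):
            hits += 1
    return hits / len(alignment)

# a1<->b1 and a2<->b2 are aligned and connected on both sides; a3 has no match side
edges_src = [("a1", "a2"), ("a1", "a3")]
edges_tgt = [("b1", "b2")]
print(common_neighbor_fraction(edges_src, edges_tgt,
                               {"a1": "b1", "a2": "b2", "a3": "b3"}))
```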
Although MultiKE outperforms several translational EA methods (Zhang et al., 2019) using a multi-view KGE technique, this model overall performs the weakest among the employed embedding- and non-embedding-based models on the selected datasets. Like its predecessors (Table 4), MultiKE also drops in performance on DOREMUS and AgroLD. Recall that we observed a higher level of structural and qualitative heterogeneity in these two real-world datasets than in the benchmark datasets (Table 2). Hence, the fact that MultiKE relies on both the graph structure and the textual information of entities and their attributes (Table 1) can explain the gap in this model’s performance. Furthermore, while DBP15K is less heterogeneous than SPIMBENCH, we suspect the reason for MultiKE’s worse performance on this dataset is its multilinguality, which cannot be handled by the pre-trained English word2vec model 11 that MultiKE employs for the entities’ local name embeddings. To embed the French text in DOREMUS and DBP15K using MultiKE, we were unable to use a pre-trained multilingual BERT model due to the method’s strong reliance on word2vec. Instead, we added a French word2vec dictionary to the existing English one, which led to significant improvements in MultiKE’s performance on the DOREMUS dataset: Hit@1 increased to 30.7% and Hit@10 rose to 34.4%, which seems reasonable given the predominance of French text in DOREMUS. However, the inclusion of this multilingual collection significantly reduces MultiKE’s performance on DBP15K, with Hit@1 and Hit@10 dropping to 0.53% and 3.18%, decreases of 37 and 40.4 percentage points, respectively. We believe this decline is due to a lack of mappings between the French and English word vectors, which causes conflicts between the embeddings of the two languages.
Finally, i-Align uses two transformer encoders, for text and graph embeddings. As Table 4 shows, it performs better on SPIMBENCH and DOREMUS (75.0% and 53.1% Hit@1, respectively) than on DBP15K (26.6% Hit@1), and its performance drops significantly on AgroLD (4.4% Hit@1). We suspect that the model performs worse on DBP15K than on DOREMUS because only the first 10 characters of the attribute values are considered, while the rest of the sequence is ignored by the textual transformer-based encoder. Due to the curse-of-multilinguality issue in transformers (Blevins et al., 2024; Pfeiffer et al., 2022) and inter-language competition for the model parameters, this limited amount of data may not suffice to train the text transformer’s parameters. Additionally, during our experiments, we discovered that reducing the length of the textual properties of the entities in the BERT-INT model can reduce performance by as much as 19%. This again illustrates the importance of retaining the informative attribute descriptions included in the values.
As a baseline of non-embedding-based approaches, we used the DLinker method. Because this model fundamentally finds the longest common subsequence in the descriptions of a pair of entities belonging to two different KGs, it does not support EA on the multilingual dataset of DBP15K
Furthermore, DLinker finds alignments using the greedy strategy of finding the longest common subsequence, ignoring all other structural or literal information, and yet its performance is significantly better than that of the embedding-based methods. We can therefore conclude that, on real-world data, taking the extra volume of information into account does not improve the quality of the embeddings but rather injects more noise into them. Although this conclusion holds for all categories of embedding-based EA methods, the translational and GNN-based methods, which rely primarily on graph structure, introduce more noise into the embeddings. i-Align also embeds the graph structure, using a graph transformer over the local subgraphs containing the nodes that are highly interconnected with the given entities in the two KGs. By comparing the performance of i-Align with RDGCN and MultiKE, we see that the transformer attention mechanism for local subgraph embedding propagates less noise than GNN message passing and translational systems in the more heterogeneous real-world cases (see the results of these three methods on the DOREMUS and AgroLD datasets in Table 4). Another reason that i-Align outperforms RDGCN and MultiKE seems to be that it relies more heavily on the literals and textual properties of the entities; indeed, both i-Align and BERT-INT perform better than the other methods on the real-world cases. Furthermore, by comparing the performance of i-Align and BERT-INT on AgroLD, we can conclude that an interaction training method that uses almost all textual properties of the entities is the best choice for structurally and semantically heterogeneous large-scale KGs.
Because interaction training methods mostly rely on comparing the properties of pairs of entities from the two KGs, rather than treating the entities as parts of the large KGs they belong to, we conclude that for an embedding-based EA task on real-world datasets, a local comparison of the given entities in the two KGs guides the model to predict higher-quality alignments.
Generalizability Assessment
Inspired by prior work on domain generalization (Fan et al., 2024; Gulrajani & Lopez-Paz, 2021), we examine the robustness of existing EA models by evaluating their performance on datasets that differ from synthetic benchmarks in structure and semantic similarity levels. However, unlike these studies, which focus on training on one domain and testing on a different one, we investigate how well-established EA methods perform when applied to more heterogeneous and less curated datasets, without altering their training or optimization procedures. To quantify this, we compute the average Hit@1 on the real-world datasets DOREMUS and AgroLD. As shown in Table 4, BERT-INT reaches an average Hit@1 of 34.5%, while i-Align, MultiKE, and RDGCN score 28.75%, 2.5%, and 0.675%, respectively. This superior performance suggests that interaction-based models such as BERT-INT generalize more effectively to heterogeneous real-world scenarios than structure-dependent embedding models, although further investigation would be needed to confirm their generalizability across other domains. Overall, our results show that even though embedding-based models perform very well on some benchmark datasets (e.g., 99.3% Hit@1 for BERT-INT on DBP15K), their performance drops considerably on heterogeneous real-world data.
In this section, we focus on the capabilities of the models in the inference phase. As common practice, under the 1-to-1 assumption, models such as RDGCN and BERT-INT consider only a subset of the reference alignment as a validation set during evaluation, ignoring the rest of the space; this corresponds to the subspace of dark-colored points in Figure 2 (reference alignments). Such under-representation of the search space undermines the reliability of the reported results, as well as the efficiency of these methods in predicting correct alignments beyond the validation set. Indeed, some real-world studies have removed the 1-to-1 assumption in dataset generation, allowing for more complex scenarios with non-matchable entities. However, many EA models still only focus on data that involves ground-truth entities, sometimes even using only the ground truth for training. As a result, these models fail to consider the non-matchable entities added to the dataset as candidates, and their performance remains the same as under the 1-to-1 assumption, because the additional non-matchable data is never effectively handled. In Figure 3, we illustrate the comparison space of EA models that impose the 1-to-1 assumption during evaluation. In this context, the similarity matrix used to identify the top-ranked predictions is a square matrix, depicted in green in Figure 3. This matrix excludes comparisons between entities in the validation set and other entities in the KGs, such as non-matchable entities and those used during training. Later in this section, when we extend the comparison space (first comparing source-to-target and then target-to-source), the results show a decrease in Hit@1. This suggests that the best-predicted match in the extended comparison is not always the same as in the more restricted case.
Moreover, for certain entities in the validation set, their most likely alignment may actually be outside the validation set, highlighting a lack of efficient embedding for these entities.
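The limited versus extended comparison can be sketched as follows, with random vectors standing in for learned embeddings; since the limited candidate set is a subset of the extended one (and contains the gold match), Hit@1 in the limited setting can never be lower:

```python
import numpy as np

def hit_at_1(src_emb, cand_emb, gold_idx):
    """gold_idx[i] = index in cand_emb of the correct match for src_emb[i]."""
    sim = src_emb @ cand_emb.T                     # dot-product similarity
    return float(np.mean(sim.argmax(axis=1) == gold_idx))

rng = np.random.default_rng(1)
target = rng.normal(size=(1000, 64))               # full target KG embeddings
val_idx = np.arange(100)                           # validation subset
src = target[val_idx] + rng.normal(scale=0.5, size=(100, 64))  # noisy counterparts

limited = hit_at_1(src, target[val_idx], val_idx)  # 100 candidates per entity
extended = hit_at_1(src, target, val_idx)          # 1000 candidates per entity
print(limited >= extended)                         # more candidates -> harder task
```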

Illustration of the search (comparison) spaces of the EA models in the limited case (imposing the 1-to-1 assumption) and in the extended case (entire graphs).
Therefore, in what follows, we assess the models’ performance on two versions of the datasets: (1) a limited validation-set scenario, and (2) an extended scenario in which each source entity’s candidate search space includes a broader set of entities from the target KG, rather than being restricted to only those in the validation set. In Figure 4, we visualize the results of our experiments on the performance of BERT-INT and RDGCN. We utilized the repository provided by Leone et al. (2022) to measure the $F_1$ score.

Hit@1 and $F_1$ results of BERT-INT and RDGCN in the limited and extended scenarios.
As illustrated by the comparison between the blue and dark-green bars in Figure 4, there is a performance decline of 5.66% and 45.15%, as measured by Hit@1, for RDGCN on the DBP15K
Recall that Hit@1, as shown in Table 4, is equivalent to precision, recall, and $F_1$ score under the 1-to-1 assumption.
The objective of this work is to build upon and complement recent empirical studies in the field of embedding-based EA (Fanourakis, Efthymiou, Christophides, et al., 2023; Fanourakis, Efthymiou, Kotzinos, & Christophides, 2023; Leone et al., 2022; Sun et al., 2020; Zhang et al., 2022), offering a critical perspective on the different models and their limitations, particularly in relation to the challenges posed by various types of datasets and the evaluation process. Therefore, we aim for this study to open new methodological avenues, without focusing on proposing a new model.
We conducted an in-depth analysis of the features of several real-world datasets compared to popular benchmark datasets. We also presented an empirical study analyzing the performance of embedding-based EA models beyond test data, on real-world heterogeneous data. We observed that a number of embedding-based EA models, such as BERT-INT and RDGCN, despite very strong performance on the task of EA over the well-known DBP15K dataset, suffer a drop in performance on real-world data with heterogeneous textual properties. Hence, the results of our study shed light on the benchmark overfitting issue of EA methods discussed in Roelofs (2019) and Todorov (2019), that is, the scenario where a model is tuned excessively to perform well on specific benchmark datasets or evaluation metrics, at the expense of its ability to generalize to new, unseen real-world data.
It appears challenging to identify a single structure-related meta-feature that accounts for the performance drops of all methods across all datasets, as each method captures the structure from a different perspective. However, since heterogeneity is not limited to diversity in size and degree distribution, we observed that the semantic similarity over the reference alignments is well correlated with the performance of the EA models that employ a language model, helping explain the performance issues. Then, by investigating the reasons for the performance fluctuations of EA models with respect to the heterogeneities of real-world datasets, we found interaction training models better suited to the EA task in real-world, especially large-scale, scenarios. Although interaction training models showed promise on real-world data, we could not conduct a deeper analysis of how they handle noise and ambiguity, due to time and resource constraints. We leave this as an important direction for future work.
Most of the existing embedding-based EA methods simplify the inference process by considering the 1-to-1 assumption (Zeng et al., 2021) and use just a limited portion of the embedding space for the evaluation. This seems to be neither fair nor practical when it comes to using them to discover unseen alignments. As a result, there is a need to go toward an inductive learning EA approach, in which models are trained on pairs of entities from two aligned KGs to predict alignments between unseen entities belonging to the same KGs, as well as matches between entities in other unseen KGs. By addressing this challenge, we believe that EA models will be able to uncover a significantly larger number of alignments across different pairs of KGs.
