Abstract
Introduction
Knowledge graphs (KGs) are an established paradigm for effectively and efficiently integrating heterogeneous data (Hitzler, 2021; Hogan et al., 2022; Noy et al., 2019). Many methodologies for creating KGs (and the ontologies that act as their schema (Hitzler & Krisnadhi, 2016)) have been developed over the years (Fernandez-Lopez et al., 1997), which recommend or otherwise emphasize the use of various techniques. These range from the use of upper ontologies (Gangemi et al., 2002; Smith, 1998) and ontology design patterns (Blomqvist et al., 2016; Gangemi & Presutti, 2009; Shimizu et al., 2023) to the use of LLMs, either alone (Meyer et al., 2023) or combined with other methods (Shimizu & Hitzler, 2025).
Evaluation of KGs (or the ontologies that act as their schemas) can be done in many ways (Gómez-Pérez, 2004; Raad & Cruz, 2015), including the use of large language models (Tsaneva et al., 2024), logical and mathematical characteristics (Guarino & Welty, 2004), heuristics (Poveda-Villalón et al., 2014), or competency questions (Mansfield et al., 2021). On the other hand, validation tools (e.g., SHACL (Knublauch & Kontokostas, 2017) or ShEx (Baker & Prud’hommeaux, 2019)) can measure whether or not the KG adheres to its schema.
As such, these approaches also vary widely along which dimensions the evaluation occurs (e.g., is the ontology well-formed?) and how the quality is reported (i.e., quantitative or qualitative reporting). Of particular importance, in any case, is determining whether or not the KG resulting from executing a methodology indeed serves the needs of the stakeholders. For example, competency questions act both as a guide during development (in many methodologies) and as a mechanism to confirm whether the KG appropriately models—and returns—the correct data (Antia & Keet, 2023).
Beyond these particular assessments of quality, however, is also whether or not a KG is appropriate for
The work presented in this article explores how various graph structures, such as those that would be produced via different KG or ontology development methodologies, affect the performance of various KGE models, when evaluated against the link prediction task. To the authors’ knowledge, beyond their own work (Dave et al., 2024) that this paper extends, there has yet to be any comprehensive investigation in this area (although recently a pipeline for
Specifically, these are: an instance graph, which we call SKG-4; SKG-4 plus type annotations, which we call SKG-5; SKG-5 plus superclasses for each type, which we call SKG-6; SKG-5 with reified properties, which we call SKG-5r; SKG-5r with shortcuts, which we call SKG-5rs; and SKG-5rs with added contextual nodes, which we call SKG-5rsc.
We furthermore note that these various representations span a range of complexity. On one hand, they represent a richer ontological reality, but on the other hand, simpler semantics (and thus KG structures) are easier to consume and query. This is in line with how patterns can be used to flatten or expand
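To illustrate the difference between the reified and shortcut variants, consider the following sketch. The property names (`subject`, `predicate`, `object`) and node labels here are illustrative stand-ins, not the exact vocabulary used in the SKGs:

```python
def reify(triple, idx, keep_shortcut=False):
    """Replace a direct triple (h, p, t) with a statement node linked to
    its constituent parts (as in SKG-5r); optionally also keep the original
    direct edge as a 'shortcut' (as in SKG-5rs)."""
    h, p, t = triple
    stmt = f"stmt_{idx}"  # fresh node standing for the reified statement
    out = [(stmt, "subject", h), (stmt, "predicate", p), (stmt, "object", t)]
    if keep_shortcut:
        out.append(triple)  # the shortcut edge bypasses the reification
    return out

# Reification alone replaces one edge with three; the shortcut adds a fourth.
assert len(reify(("alice", "attended", "event1"), 0)) == 3
assert ("alice", "attended", "event1") in reify(
    ("alice", "attended", "event1"), 0, keep_shortcut=True)
```

Contextual nodes (SKG-5rsc) would then attach additional metadata to the statement node in the same fashion.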
Concretely, this article contributes: (a) various synthetic graphs and a mechanism for their generation, (b) the FB15k isotopes: FB15k-238 and FB15k-239, 1 (c) the scripts and configuration files to generate these datasets, (d) a thorough evaluation of the effects that the incorporation of increasing metadata has on the performance of the KGE models in the link prediction task, 2 (e) the creation of SKG-237, a graph mimicking FB15k-237 in structure with respect to node count, edge count, node/edge ratio, and degree centrality, that is trained and validated in the same way as the ones above on TransE, (f) the creation of synthetic knowledge graphs (SKGs) of increasing complexity, showcasing their generation, training, and evaluation on different hyperparameters from our original isotopes, along with visualizations using t-SNE and UMAP, and (g) a discussion of results and insights.
Related Work
Iferroudjene et al. (2023) argue that the removal of Freebase
Overall, we see that deductive reasoning is quite difficult outside of the symbolic algorithms dedicated to it. In particular, neurosymbolic methods (e.g., as found in Hitzler et al., 2023) struggle considerably. As deductive reasoning is a major hurdle for approaching human-level cognition, this provides further motivation for understanding how the presence (or absence) of semantic information impacts KGEs.
The importance of evaluating KGEs with respect to the underlying semantics of the graph has been raised in recent research. When evaluating embedding performance, for instance, Jain et al. (2021) mention the importance of evaluating how well these embeddings preserve the semantic links within the KG in addition to using ordinary metrics. Our goal of understanding embedding behavior in synthetic KGs supports this. The work of Gutierrez Basulto and Schockaert (2018) additionally points out the significance of matching vector space representations to basic ontology rules and terminology, claiming that a more thorough examination of how well embedding methods work with complex semantic structures is essential—a realization that directs our investigation of synthetic KGs.
Additionally, Kang et al. (2021) showed how conditional information can make connections in data clearer, which motivated us to use t-SNE visualizations to uncover important patterns in our embeddings. Their research on demonstrating dataset properties guided our approach of using these visualizations to identify patterns, improve our understanding of groups, and identify data cluster divisions. Building on this visualization method, Damrich et al. (2023) reveal how UMAP and t-SNE can be used to effectively study high-dimensional data. Their usage of similar learning methods to modify embeddings offers a helpful perspective on how visualizing embeddings from models such as TransE might highlight structural ties in the data, which we apply in our own visualizations of our synthetic KGs.
Knowledge Graph Embedding Models
We utilize the DGL-KE library 3 for scalable training and evaluation of KGE models.
KGE models that implement an additive scoring function can be categorized as
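As a concrete illustration of an additive scoring function, the translational model TransE scores a triple (h, r, t) as the negative distance between the translated head and the tail. The following is a minimal sketch (pure Python, not the DGL-KE implementation):

```python
import math

def transe_score(h, r, t):
    """Additive (translational) TransE score: -||h + r - t||_2.

    h, r, t are embedding vectors (lists of floats); a well-modeled triple
    places t close to h + r, yielding a score near zero, while a corrupted
    triple scores more negatively."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy embeddings: the tail is (approximately) the head translated by r.
h, r, t = [0.1, 0.2], [0.3, -0.1], [0.4, 0.1]
good = transe_score(h, r, t)          # close to zero
bad = transe_score(h, r, [1.0, 1.0])  # corrupted tail: more negative
assert good > bad
```

In link prediction, candidate tails (or heads) are ranked by this score, and the rank of the true entity feeds into metrics such as MRR and HITS@k.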
Methodology
In this section, we describe how we created the various synthetic KG and FB15k isotopes developed for our evaluation. Specific implementation details, including hyper-parameters, are detailed in Section 3.4.
Creating SKG-4, SKG-5, and SKG-6
We created a total of six synthetic datasets to further our investigation regarding the graph structure of a KG and how that may affect the link prediction aspect of KGEs. The structure of, or template for, each of these synthetic KGs (SKGs) is shown in Figure 1.

This figure shows the various schema diagrams for the synthetic KG isotopes. We have used consistent coloring across all figures to demonstrate correspondence. For clarity, in SKG-5RSC, we denote
We describe the

This figure shows a basic schema diagram modeling a reified node, with labels corresponding to Figure 1. The reification is represented by the
For more context, you can find some examples in Appendix A.1.
In this study, we currently instantiate each template 1,000 times. This can be improved in the future to produce templates that interlink or somehow connect via nodes. As it stands, each SKG has 1,000 disconnected components.
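The instantiation step can be sketched as follows. The template triples and naming scheme here are hypothetical stand-ins for the actual SKG templates; the point is that each copy receives fresh entity identifiers, so the copies remain disconnected:

```python
# Hypothetical template: triples whose subjects/objects are placeholders
# (prefixed with "?") to be freshened per instantiation.
TEMPLATE = [
    ("?person", "attended", "?event"),
    ("?event", "heldAt", "?venue"),
]

def instantiate(template, n):
    """Stamp out n disjoint copies of the template, suffixing each
    placeholder with the copy index so that components stay disconnected."""
    triples = []
    for i in range(n):
        for s, p, o in template:
            triples.append((f"{s[1:]}_{i}", p, f"{o[1:]}_{i}"))
    return triples

skg = instantiate(TEMPLATE, 1000)
assert len(skg) == 1000 * len(TEMPLATE)
# Entities from different copies never co-occur in a triple, so the
# resulting graph has 1,000 disconnected components.
```

Interlinking the components, as noted above, would amount to adding triples whose subject and object carry different copy indices.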
It is important to clarify that the training, validation, and test splits are constructed to maintain strict disjointness of entities across these subsets, especially for SKG-4 which consists of 1,000 disconnected components. This design ensures that entities appearing in the test set are not seen during training, reflecting a realistic and challenging open-world link prediction scenario. Thus, standard transductive embedding models face limitations, as they cannot leverage embeddings for unseen entities at test time. This setup was deliberately chosen to evaluate model generalization capabilities under such constraints. Furthermore, while SKG-4 presents disconnected graph components, the synthetic dataset generation process preserves internal structural patterns similar to those in real-world KGs, thereby providing meaningful evaluation benchmarks despite the absence of entity overlap across splits.
FB15k-237 is published with the data split to allow for training, evaluation, and validation of KGE models. This research introduces

Graphical overview of adding semantics to FB15k and the method of testing trained models. (a) This represents the types of triples contained in each of the datasets. The yellow ellipses are a set of triples extracted from
We provide Table 1 as a summary of the count of entities, edges, and triples per data split in FB15k-237 and our augmentations.
Comparison of Different Counts for the Freebase Subset and the Created Augmentations.
We constructed a synthetic graph with the same numbers of unique nodes, unique predicates, and triples as FB15k-237. However, the exact
We stress that SKG-237 was not produced randomly. The centrality characteristics of nodes, edge frequency per predicate, and node degree marginal distributions were instead preserved by carefully sampling and connecting entities and relations. We did not attempt to replicate semantic aspects such as topic-driven grouping, inverse relation pairings, and ontological hierarchies. The objective of SKG-237 is to separate semantic content from structural form so that we can determine whether structural similarity is sufficient to support efficient KGEs.
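One way to preserve per-predicate edge frequencies and node degrees while destroying semantic content is to repeatedly swap tails between triples that share a predicate. The following is a simplified sketch of this idea, not the exact procedure used to build SKG-237:

```python
import random

def shuffle_tails(triples, rounds=10, seed=42):
    """Swap tails between random pairs of triples that share a predicate.

    This preserves each node's head/tail degree and each predicate's
    frequency, but decouples tails from their original heads, stripping
    the graph of semantic regularities."""
    rng = random.Random(seed)  # seeded for reproducibility
    by_pred = {}
    for i, (_, p, _) in enumerate(triples):
        by_pred.setdefault(p, []).append(i)
    out = list(triples)
    for _ in range(rounds * len(triples)):
        idxs = by_pred[rng.choice(list(by_pred))]
        if len(idxs) < 2:
            continue  # nothing to swap within this predicate
        i, j = rng.sample(idxs, 2)
        (h1, p, t1), (h2, _, t2) = out[i], out[j]
        out[i], out[j] = (h1, p, t2), (h2, p, t1)
    return out

orig = [("a", "knows", "b"), ("c", "knows", "d"), ("e", "likes", "f")]
mixed = shuffle_tails(orig)
# Per-node degrees and per-predicate triple counts are unchanged.
```

Because every swap keeps both the head multiset and the tail multiset of each predicate fixed, the marginal degree distributions of the original graph survive the shuffling.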
Implementation
Our graphs are generated using a set of scripts which can be found online. Research artifacts include the scripts for generating the SKG and FB15k isotopes, for calculating the ratio and centrality metrics, for generating the visualizations, and a container for training the KGE models, as well as each of the graphs themselves. They are provided through a Zenodo repository 7 and a GitHub repository 8 under the MIT License, which is also included in the repository.
The KGE models, except TransD, are trained through the Deep Graph Library - Knowledge Embedding (DGL-KE) library (Zheng et al., 2020). Experiments using TransD employed
Hyper-parameters play a crucial role in training machine learning models, and adjustments to them have a sizable impact on model performance; choosing them for KG embedding model training is thus a difficult but important issue (Lloyd et al., 2023). Due to the smaller size of these synthetic KGs, and incompatibilities between the graph size and the DGL-KE configuration, we used different hyper-parameter configurations for them. Further, as we were not able to identify the hyper-parameters used in the initial publications of the KGE models, we opted to standardize their values across our experimentation with the implemented models in DGL-KE. As used by DGL-KE, the list of hyper-parameters 9 is found in Table 2.
The Hyper-Parameter Settings Used for the Training and Evaluation During Training of the KGE Models With Respect to FB15k-237, FB15k-238, and FB15k-239.
The experiment consists of four overall analyses:
The Standardized Lowest Hyper-Parameter Settings Used for the Training and Evaluation During Training of the KGE Models for the SKGs.
As a straightforward and widely used model that offers a clear baseline for evaluating embedding accuracy and link prediction performance, we chose to focus on TransE as our starting point. TransE’s translation-based approach fits in well with our goal of investigating how structural features affect model behavior, and initial testing showed that it is very sensitive to graph structure changes. To compare the synthetic KG with complex models created in future investigations, it was the perfect place to start when evaluating how well it represents underlying relationships.
The DGL-KE library provides an evaluation mechanism, configured with
Graph Metrics
Important insights into the dynamics and structure of the underlying data are obtained by investigating graph metrics in the context of KGs. The following explains why each metric was selected and what it implies for each dataset investigated.
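For reference, degree and closeness centrality can be computed as follows. This is a pure-Python sketch on an undirected toy graph; in practice, a graph library such as networkx would typically be used:

```python
from collections import deque

def degree_centrality(adj):
    """Each node's degree normalized by (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj, v):
    """(n - 1) divided by the sum of BFS distances from v to all
    reachable nodes; higher means v is 'closer' to the rest of the graph."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Toy star graph: a hub connected to three leaves.
adj = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
assert degree_centrality(adj)["hub"] == 1.0
assert closeness_centrality(adj, "hub") == 1.0  # hub is one hop from all
```

Betweenness centrality, the third metric reported in the tables, additionally requires counting shortest paths through each node and is omitted here for brevity.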
Results
We report our results along three dimensions, the graph centrality metrics, KGE model performance on the link prediction task (including both the evaluation for the SKG isotopes and the ablation-like study for FB15k isotopes), and visualizations using t-SNE (Kang et al., 2021) and UMAP (Damrich et al., 2023).
Graph Metrics of the Isotopes
Tables 4, 5, and 6 present important metrics, such as the total number of facts, nodes, edges, and the edge-to-node ratio, for the datasets and synthetic KGs. Additionally, they offer broad information for degree centrality, betweenness centrality, and closeness centrality, displaying the average, maximum, and minimum values for each of these metrics across the datasets.
The Graph Metrics for FB15k-237, FB15k-238, and FB15k-239.
The Graph Metrics for SKG-4, SKG-5, and SKG-6.
Note: The arrows in the difference column indicate the direction of change: an upward arrow (
The Graph Metrics for SKG-4, SKG-5, SKG-5r, SKG-5rs, and SKG-5rsc.
Note: The arrows in the difference column indicate the direction of change: an upward arrow (
As can be seen in Table 7, even though SKG-237 matches FB15k-237 in structure with respect to node, fact, and edge counts and degree centrality values, the training/evaluation and visualization results differ substantially, as shown in Table 8.
The Graph Metrics for SKG-237.
The Results of Our Evaluation of SKG-237 for TransE, Using the Standardized Hyper-Parameters.
Note: We note a
Table 9 refers to the evaluation results of TransE.
The Performance Results for the SKG Isotopes Using TransE With the Standardized Hyper-Parameters.
Table 10 reports the model performances when trained with their respective KGs. Models trained according to a specific FB15k-
The Results of Evaluating Each of the Models Against Their Respective Training Data.
Note: This table reports the results of testing each of the models against solely the FB15k-237 training data (i.e.,
The Results of Our Ablation-Like Study, Where We Change Which Component of the Data Against Which We Evaluate.
Note:
As a space-saving measure, the evaluation of FB15k-237 is reported only once, as the second test comparing the various trained models repeats the evaluation of FB15k-237 on its own test data. If there are no
Table 11 reports the result of the aforementioned
Captions of the figures are included in the Appendix so as to not overwhelm the narrative.
Discussion
KGE Performance Over SKG Isotopes
For SKG-4, which has no hierarchical relationships or any sort of additional “semantic complexity,” we observe the strongest performance across most metrics. As this synthetic KG is less complex and generally consistent, embedding models can easily pick up patterns. We note that these values will still be limited due to the low connectivity between template structure instantiations. Yet, when we begin introducing semantic annotations (in the form of
KGE Performance Over FB15k Isotopes
First, across the different isotopes, we see that the inclusion of the additional semantic data drastically improves the performance of
We also test if the presence of additional semantic metadata present during training improves link prediction
The key takeaway, rather than just new models or new evaluations, is that models that are not meant to handle these sorts of relations (notably, TransE) still have an increased performance on the link prediction task when we remove these relations (i.e., the ones that TransE would not handle well) from the evaluation, indicating that their presence
KGE Performance Over SKG-237
Despite utilizing a KG that has the same structural characteristics as FB15k-237 (nodes, predicates, and triples), the TransE model does not do well on link prediction, according to the results. Low HITS@1, HITS@3, and HITS@10 scores, along with poor Mean Reciprocal Rank (MRR) and Mean Rank (MR) values, show that the model struggles to properly rank pertinent entities, even among the top 10 predictions. This suggests that although the synthetic KG shares structural similarities with FB15k-237, it does not contain the semantic relationships that underlie the original dataset. Thus, we note that to some extent TransE requires that the KG indeed more closely mimic real-world data. Further exploration is required to determine the exact connection between recurring entities in triples and the appearance of entities consistently in appropriate domains and ranges of relations. That is to say, we suspect that in order for a KG to be TransE-learnable, a minimum semantics is required in the graph.
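For reference, MR, MRR, and HITS@k are all simple functions of the rank assigned to the true entity in each test triple; a minimal sketch (the example ranks are illustrative, not taken from our results):

```python
def rank_metrics(ranks):
    """Compute MR, MRR, and HITS@{1,3,10} from a list of ranks,
    where rank 1 means the true entity was scored highest."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,                 # mean rank: lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
        **{f"HITS@{k}": sum(r <= k for r in ranks) / n for k in (1, 3, 10)},
    }

# Illustrative ranks for three test triples.
m = rank_metrics([1, 2, 50])
assert m["HITS@1"] == 1 / 3 and m["HITS@10"] == 2 / 3
# MRR weights top ranks heavily: a rank of 50 contributes almost nothing.
```

This weighting is why the low MRR and HITS@k values for SKG-237 indicate that the true entity is consistently ranked far down the candidate list.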
Ablation-Like Study With FB15k Isotopes
The purpose of our ablation-like study is to determine how different training data influences the model and, subsequently, if the end results change for different test data. For example, the
Overall, we see that when looking to improve performance for link prediction, for simple assertional relationships,
TransD performs second best when
Discussion of Graph Metrics
The graphs for SKG-4, SKG-5, and SKG-6 become progressively more complex, as reflected by their increased node and edge counts and edge-to-node ratios, which indicate a higher degree of connectivity. Degree and betweenness centrality show that important central nodes become more frequent, even while many nodes remain less connected. According to closeness centrality, nodes become easier to reach in SKG-5, but somewhat less so in SKG-6.
When reification relationships (r), shortcuts (s), and contextual information (c) are added, the metrics for SKG-4 through SKG-5rsc clearly show a pattern of a growing complexity. A significant increase is seen in the overall number of facts, nodes, and edges; denser graphs are demonstrated by a higher edge-to-node ratio. There is a range of degree centrality values, with some nodes growing closer together while others stay just moderately connected. Although betweenness centrality points to the rise in important nodes, especially in SKG-5rs, overall average values are still low, suggesting that there is no dominant centralization. As shortcuts and context are introduced, nodes become easier to access, thus increasing total graph connection, according to closeness centrality values. These trends demonstrate the graphs’ increasing structural changes and depth as more semantic layers are added.
With the most facts, nodes, and edges, FB15k-239 has the largest graph structures, according to the metrics. In bigger datasets, the edge-to-node ratio drops from FB15k-237 to FB15k-239, indicating a lower relative graph density. While average values drop across the datasets, degree centrality measurements indicate that FB15k-237 has a greater range with higher maximum values, suggesting better balanced connection in larger graphs. Although betweenness centrality varies throughout the datasets, FB15k-237 has somewhat higher maximum values, suggesting that some nodes are essential for connecting. More direct node interaction is suggested by FB15k-237’s greater maximum and average closeness centrality scores.
So far, we have managed to replicate the graph structure of FB15k-237 with respect to node count, edge count, node/edge ratio, and degree centrality.
Most nodes do not act as important information-transfer facilitators, given the low average and maximum betweenness centrality numbers, thereby pointing to a graph structure in which no single node controls the shortest paths.
Also, nodes appear to be in a similar location with respect to their average distance to every other node, based on the very small variety of closeness centrality values. This suggests that a lot of nodes in the graph are relatively easy to find, indicating a balanced connectivity pattern.
Discussion of Visualization Results
The distribution of the training embeddings for FB15k-237, FB15k-238, and FB15k-239 can be seen in Figures 4, 5, and 6, showing discrete clusters within each dataset. The visualizations show distinct regions with dense node clusters and only minimal areas of scattered nodes. This suggests that the model has effectively discovered significant connections between the KG’s entities.

TransE embedding visualizations for FB15k-237.

TransE embedding visualizations for FB15k-238.

TransE embedding visualizations for FB15k-239.
The entities and relations are typically well-separated, suggesting that the model obtained distinct representations of different entity and relation types. The relationships between entities in the embedding space appear to be implied by the model, as shown by the red crosses that depict relations appearing between clusters.
TransE t-SNE and UMAP visualizations are showcased in Figure 7(a) and (b). Notably, the nature of the clustering is quite different. While neither t-SNE nor UMAP is appropriate for strictly defining cluster membership, they can give an understanding of what clusters might exist. In this case, we might suspect that the centrality metrics for SKG-237 are misleading. Future work should include investigations of centrality metrics beyond the average.

TransE embedding visualizations for SKG-237.

(a) t-SNE embeddings for SKG-4 with version-1 hyperparameters. (b) t-SNE embeddings for SKG-4 with version-2 hyperparameters.

(a) UMAP embeddings for SKG-4 with version-1 hyperparameters. (b) UMAP embeddings for SKG-4 with version-2 hyperparameters.

(a) t-SNE embeddings for SKG-5 with version-1 hyperparameters. (b) t-SNE embeddings for SKG-5 with version-2 hyperparameters.

(a) UMAP embeddings for SKG-5 with version-1 hyperparameters. (b) UMAP embeddings for SKG-5 with version-2 hyperparameters.

(a) t-SNE embeddings for SKG-6 with version-1 hyperparameters. (b) t-SNE embeddings for SKG-6 with version-2 hyperparameters.

(a) UMAP embeddings for SKG-6 with version-1 hyperparameters. (b) UMAP embeddings for SKG-6 with version-2 hyperparameters.
The dense mixing observed here may lead to the model mis-ranking predictions because of similar embeddings for different entities. In contrast, the UMAP plot in Figure 7(b) displays elements within close clusters, although this separation might simply represent key structural differences rather than expressing the more complicated semantic connections required for accurate predictions.
Plots for the SKG isotopes are shown in Figures 14–19, displaying the TransE training results. They give us insight into the overall clustering and show that the lack of interconnections between template structure instantiations has a negative impact. Yet, in higher isotopes, we also notice a distinct lack of clustering based on type (i.e., consistent use of a type for the range of a property does not seem to overtly influence the distribution of embeddings).

Visualization of t-SNE and UMAP embeddings for SKG-4.

Visualization of t-SNE and UMAP embeddings for SKG-5.

Visualization of t-SNE and UMAP embeddings for SKG-5r.

Visualization of t-SNE and UMAP embeddings for SKG-5rs.

Visualization of t-SNE and UMAP embeddings for SKG-5rsc.

Visualization of t-SNE and UMAP embeddings for SKG-6.
The t-SNE and UMAP visualizations both demonstrate similar cluster formation for SKG-4, seen in Figure 14, suggesting that entities with similar semantic properties are grouped together. Based on embedding values, the color coding indicates that different entities have different semantic characteristics. Along with SKG-5 and SKG-6, shown in Figures 15 and 19 respectively, these clusters show the most separation, confirming the results discussed above regarding the highest evaluation results.
The remaining visualizations, of the extended versions of SKG-5 (SKG-5r/5rs/5rsc) in Figures 16, 17, and 18 respectively, show many small, tight clusters, again reflecting the evaluation results.
Surprisingly, creating SKG-237 with exactly the same triple, node, and predicate counts and degree centrality as FB15k-237 was not enough to serve as a controlled environment in terms of training and evaluation results.
The semantic connections represented in the synthetic graph may not capture the complex patterns found in FB15k-237, despite its structural features (such as node/edge counts and centrality measurements) being identical. Hence, we took the next step in creating SKG-4/5/6 and the variations of SKG-5 (SKG-5/5r/5rs/5rsc).
In summary, controlling graph structure has yielded important information on KGE’s performance. With its simple structure, SKG-4 provides the best link prediction results, indicating that model performance is improved by minimal complexity. While adding complexity enhances semantic depth, it also makes prediction more difficult. This is true for SKG-5 and its modified versions, which include reification, shortcuts, and contextual information.
Simpler structures, such as SKG-4, are shown to form more distinct clusters, while more complicated graphs create a balance between relationships and group formation. According to these findings, adjusting graph complexity affects how effective KGEs are; simplicity and structural depth must be balanced. The experiment described in this article invites further investigation toward understanding the impact of a KG’s schema on KGE model performance. The results of our experiment suggest that a threshold of semantic inclusion exists that can assist in link prediction for all models. Understanding the effects of graph metrics and structure on embedding outcomes is essential. Getting the best results out of knowledge graph embeddings can be complex, as demonstrated by the impact of these metrics and the careful tuning of training and evaluation parameters.
Future Work
We have identified some next steps in this line of research:
- Replicate the experiment on other benchmarks (e.g., YAGO (Tanon & Weikum, 2020) or WN18RR (Bordes et al., 2013)).
- Replicate the experiment using additional models (e.g., deep learning techniques for KGEs (Dettmers et al., 2018)), which may better incorporate semantics, as well as establish that there are no differences between implementations (e.g., Ali et al., 2021).
- Increase the number of tested isotopes by adding even more semantic metadata and varying graph structures.
- Examine the impact on other downstream tasks (e.g., entity clustering (Wang et al., 2017)).
- Examine how different embedding models are capable (or not) of handling various KG characteristics, especially when the graph may have
Footnotes
Acknowledgments
The authors acknowledge support from the National Science Foundation (NSF) under Grant #2333532; Proto-OKN Theme 3: An Education Gateway for the Proto-OKN. The authors would like to thank Brandon Dave for his earlier contributions to this work, namely through (Dave et al., 2024).
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
