Introduction
Motivation and Problem Statement
An essential property of a truly intelligent application is its ability to access all the information necessary to solve the task it is designed for. With the advent of knowledge graphs (KGs), this long-standing objective in AI of supplying machines with relevant information is gradually becoming a reality (Lenat & Feigenbaum, 2000; Weikum, 2021). KGs are the key technology to tie together data and knowledge (Gutiérrez & Sequeda, 2021). Thereby, they diminish the effort of combining data with other sources (Pour, 2022) or using it in applications of various domains (e.g., agriculture (Chen et al., 2019), manufacturing (Buchgeher et al., 2021), or tourism (Kejriwal, 2022)) and task types (e.g., advertising (He et al., 2016), question answering (Huang et al., 2019), or recommendation (Wang et al., 2021)).
The core idea of KGs is to represent data as a labeled directed graph, with nodes representing concepts or concrete instances and edges representing relations between them.
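This representation can be sketched in a few lines of code. The following is a minimal, purely illustrative example (the entities and relations are made up, not taken from any particular KG) of a KG as a set of labeled directed edges, i.e., subject-predicate-object triples:

```python
# A toy KG as a set of (subject, predicate, object) triples.
# Nodes are instances or concepts, edge labels are relations.
triples = {
    ("Tim_Berners-Lee", "type", "Person"),
    ("Tim_Berners-Lee", "employer", "W3C"),
    ("W3C", "type", "Organisation"),
}

def objects(kg, subject, predicate):
    """All objects reachable from `subject` via an edge labeled `predicate`."""
    return {o for s, p, o in kg if s == subject and p == predicate}

print(objects(triples, "Tim_Berners-Lee", "employer"))  # {'W3C'}
```

Graph-based querying, search, and analytics then operate directly on this edge set.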
Using graphs to represent data has several advantages over relational or NoSQL alternatives, like the flexible definition and reuse of schemas and the large variety of graph-based techniques for querying, search or analytics (Hogan et al., 2021). As shown in Figure 1, nodes in a KG may represent concepts (e.g., the class

An overview of the typical steps addressed during knowledge graph construction.
The trend of entities added to publicly available KGs in recent years indicates that they are far from complete. The number of entities in Wikidata (Vrandečić & Krötzsch, 2014), for example, grew by 26% from October 2020 (85M) to October 2023 (107M). 2 Wikidata describes the largest number of entities and, in terms of entities, subsumes other public KGs to a large extent (Heist et al., 2020). Consequently, the challenge of incompleteness applies to all public KGs, particularly when it comes to long-tail and emerging entities (Färber et al., 2016).
Automatic information extraction approaches can help mitigate this problem if they ensure that the extracted information is of high quality. While the performance of open information extraction systems (i.e., systems that extract information from general web text) has improved in recent years (Kotnis et al., 2022; Liu et al., 2020; Radevski et al., 2023), the quality of extracted information has not yet reached a level that would allow integration into public KGs like DBpedia (Lehmann et al., 2015) without further filtering.
The extraction of information from semi-structured data is generally less error-prone and has already proved to yield high-quality results: DBpedia itself, for example, is extracted primarily from Wikipedia infoboxes; further approaches use the category system of Wikipedia (Mahdisoltani et al., 2013; Xu et al., 2016) or focus on tables (in Wikipedia or on the web) as a semi-structured data source to extract entities and relations (Zhang & Balog, 2020). As highlighted by Weikum (2021), first “picking low-hanging fruit” by focusing on premium sources like Wikipedia to build a high-quality KG is crucial, as it can serve as a solid foundation for approaches that target more challenging data sources.
We present CaLiGraph, a KG automatically constructed from semi-structured content in Wikipedia. CaLiGraph uses DBpedia as a foundation to extract an extensive taxonomy from the category graph in Wikipedia and enriches it with OWL-based axioms describing the semantics of the classes. Further, it uses various information extraction techniques to extract new entities and facts from enumerations and tables in Wikipedia, particularly focusing on constructs where similar entities co-occur. In its most recent version, CaLiGraph describes 1.3 million classes and 13.7 million entities.
In this work, we give a comprehensive overview of CaLiGraph. In particular, our contributions are as follows:
We give an overview of the field of automated KG construction and formulate open challenges in Section 2. We summarize the extraction process of CaLiGraph, including all relevant inputs, in Section 3. We describe the purpose, contents, resources and use cases of CaLiGraph in Section 4. We provide statistics, quality metrics and evaluations of the major CaLiGraph versions as well as comparisons to popular public KGs in Section 5.
The most straightforward way to create a KG is through manual definition. Cyc (Lenat, 1995) and WordNet (Miller, 1995) are notable examples, employing a team of experts to insert the data by hand. While this is feasible for domains with a manageable amount of data, the potential to scale up is very limited (Paulheim, 2018). Freebase (Bollacker et al., 2008) and, more recently, Wikidata (Vrandečić & Krötzsch, 2014) are examples of achieving scalability in manual curation via crowd-sourcing.
In this work, we only consider automatically extracted KGs. Besides manually curated KGs, this excludes KGs relying on human-in-the-loop mechanisms (Pradhan et al., 2020) or on dataset-dependent RML mappings (Arenas-Guerrero et al., 2021; Iglesias et al., 2020) to extract instance data.
Generally, automated KG construction (AKGC) can use various types of input data. The vast majority of approaches work on unstructured text, i.e., they try to extract triples from plain text. For example, from the sentence
The family of approaches that directly extract such triples without a predefined schema or set of entities is called
In the following, we present an AKGC pipeline implemented in the CaLiGraph extraction framework. We use the pipeline to compare popular KGs on the web and to formulate challenges and limitations in AKGC. CaLiGraph differs from the aforementioned approaches in multiple ways:
CaLiGraph relies on a corpus of semi-structured documents (i.e., listings and categories in Wikipedia). In all those lists, entities are typically mentioned only once, using an informative label. This makes the linguistic part of the processing easier, as it requires neither semantic parsing nor coreference resolution. CaLiGraph uses a given KG as its backbone but extends both its ontology and its set of entities. This sets it apart from many other AKGC pipelines, which keep the ontology and/or the set of entities fixed.
KG construction is typically not an end-to-end ML task but consists of multiple steps, each with unique requirements and challenges (Weikum, 2021). Figure 1 lists the steps in the order they are addressed in the CaLiGraph extraction framework, together with actual examples. The pipeline consists of the two high-level blocks of
OC steps:
KGP steps:
While the steps in the
Given the pipeline above, we discuss the construction processes of automatically extracted general-purpose KGs. We only consider publicly accessible KGs and disregard closed-source industry-created KGs like those of Microsoft, Facebook, Amazon, or eBay (Noy et al., 2019). Figure 2 shows a timeline with the major milestones of the public KGs discussed in the following.

A timeline with major milestones of popular public KGs.
DBpedia (Lehmann et al., 2015) aims to represent the knowledge of Wikipedia in a structured form and focuses on infoboxes to extract knowledge.
Ontology Construction
DBpedia provides a Mappings Wiki 3 where the community defines classes, properties, datatypes and restrictions. Further, they map infoboxes to types in the schema and infobox keys to properties.
Knowledge Graph Population
DBpedia defines one entity per article in Wikipedia. A disambiguation of entities is unnecessary as they are marked with hyperlinks in the text. Type assertions are derived from infobox types, and relation assertions are derived from infobox keys.
YAGO
YAGO Suchanek et al. (2007) is built on the idea of combining a small but well-crafted top-level schema with a large but messy taxonomy, thereby creating a unified and cleaned schema. Further, they tap other data sources to ingest additional data from various domains.
Ontology Construction
Up to version 3 (Mahdisoltani et al., 2013), YAGO automatically combines WordNet (Miller, 1995) with the Wikipedia category graph to create a large ontology. They add axioms for some classes derived from the category graph using hand-crafted rules. In version 4 (Pellissier Tanon et al., 2020), they fundamentally change the KG by combining the ontology from Schema.org (Guha et al., 2016) with the one from Wikidata to create a cleaned, “reason-able” version of Wikidata. They define manual mappings between Schema.org classes and Wikidata classes to create the combined ontology and add rudimentary SHACL constraints to ensure data validity.
Knowledge Graph Population
Up to version 3, YAGO performs KGP similarly to DBpedia, using articles as entities and extracting assertions from infoboxes. Additionally, they define an enhancement process where additional entities may be added from any external sources or tools. In version 2 (Hoffart et al., 2013), temporal and geospatial data is integrated, and in version 3 (Mahdisoltani et al., 2013), multilingual data from multiple Wikipedia language chapters is added. In version 4, entities and assertions are taken from Wikidata.
NELL
NELL Mitchell et al. (2018) is an example of extracting a KG from free text. It was originally trained with a few seed examples and continuously ran an iterative coupled learning process. In each iteration, facts were used to learn textual patterns to detect those facts, and patterns learned in previous iterations were used to extract new facts, which served as training examples in later iterations. NELL introduced a feedback loop incorporating occasional human feedback to improve the quality.
Ontology Construction
NELL started with an initial ontology defining hundreds of concepts and binary relations. During runtime, the ontology is extended with additional concepts and relations.
Knowledge Graph Population
NELL is bootstrapped with a dozen examples for each concept and relation. New entities and assertions are added with each iteration.
BabelNet
BabelNet Navigli and Ponzetto (2012) is a KG that integrates encyclopedic and lexicographic knowledge from Wikipedia and WordNet in multiple languages.
Ontology Construction
The ontology consists of concepts derived from senses in WordNet and from articles and categories in Wikipedia (Flati et al., 2014). The two resources are connected by automatically mapping senses to articles. In early versions, only lexical properties were used. In the most recent version, related KGs like Wikidata and YAGO are integrated, taking over their semantic properties as well.
Knowledge Graph Population
Initially, the graph was populated with entities from Wikipedia articles. From WordNet, lexical and semantic pointers between synsets are extracted as relations. Relations between Wikipedia articles were initially extracted as unlabeled relations. In the recent version, there are efforts to extract the semantics of the relations. Further, assertions from related KGs like Wikidata and YAGO are included (Navigli et al., 2021).
DBkWik
DBkWik Hertling and Paulheim (2020) aims to extract and fuse data from thousands of Wikis of arbitrary content from a Wikifarm, for example, Jedipedia 4 or Music Hub. 5
Ontology Construction
DBkWik uses a variation of the DBpedia extraction framework to extract data from Wikis. Contrary to DBpedia, DBkWik has no community-defined mappings. Instead, they generate a shallow schema from the infoboxes of each Wiki and fuse these schemas afterwards. Then, they enrich the unified schema with subclass relations and restrictions for domains and ranges.
Knowledge Graph Population
Entities are derived from articles in the Wikis, and assertions are derived from infoboxes. Similar to the schema, entities must also be matched to avoid duplicates from overlapping Wikis.
Wikidata
Wikidata Vrandečić and Krötzsch (2014) is, to date, one of the largest publicly available KGs. The original motivation was to unify the information in infoboxes across different language editions of Wikipedia by storing the main information about entities in a central KG.
Ontology Construction
Wikidata uses the crowdsourcing approach for classes and properties as well. The ontology is maintained by the Wikidata user base.
Knowledge Graph Population
Entities can be entered and altered by users of Wikidata. Moreover, larger data dumps (called
Limitations and Challenges
In Table 1, we list the advantages and limitations of the previously discussed KGs. In the following, we distill these into an (incomplete) list of challenges (mostly complementary to the challenges mentioned by Weikum (2021)):
Advantages and Limitations of Public General-Purpose KGs.
See https://lod-cloud.net/.
See https://en.wikipedia.org/wiki/Wikipedia:Notability.
We created CaLiGraph to tackle several of these challenges in the context of Wikipedia. We exploit semi-structured data structures like listings and tables to extract information about novel entities (C2). We create a schema from the Wikipedia category graph and enrich it with semantic restrictions describing the meaning of the concepts (C3). All the automated extraction procedures target structured or semi-structured data to minimize errors and ensure high extraction quality (C4).
This section describes the extraction framework of CaLiGraph with respect to the tasks shown in Figure 1. Section 3.1 provides details about the parts of Wikipedia used in the extraction process, Section 3.2 describes how the ontology is created and Section 3.3 describes the KG population process. This section intends to give a crisp overview of the complete construction process of CaLiGraph without going into detail too much. We provide references to additional material within the section for the interested reader.
Wikipedia As Semi-Structured Data Source
Due to its structured and encyclopedic nature, Wikipedia offers favorable conditions for automatic information extraction. Concretely, we select Wikipedia as the data corpus for CaLiGraph as it has several advantages:
Structure
Wikipedia is written entity-centrically with a focus on facts. Due to its encyclopedic style and crowd reviewing process, it has a fairly consistent structure. Wikipedia uses its own markup language (wikitext or Wiki markup), which allows for more concise access to (semi-)structured page elements such as sections, listings, and tables than plain HTML. Listings are often used to provide an overview of a set of entities that are related to the entity an article is about. Section titles are typically used consistently for specific topics (e.g., for the
Entity Links
If a Wikipedia article is mentioned in another article, it is typically linked in the Wiki markup (a so-called
Access
Wikipedia snapshots are published periodically as XML dumps that can be processed conveniently. Many high-quality open-source libraries exist for the interpretation of Wiki markup. In our framework, we use WikiTextParser 8 to process the markup in Python.
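To give an impression of how such markup elements can be accessed, the following is a simplified, self-contained sketch that pulls internal links and enumeration items out of Wiki markup with regular expressions. The sample markup is made up, and the actual framework relies on the WikiTextParser library rather than hand-written patterns:

```python
import re

# Illustrative Wiki markup: a section with a two-level enumeration.
SAMPLE = """== Discography ==
* [[Song One]] (1970)
* [[Song Two|Two]] (1972)
** Remastered edition
"""

# [[Target]] or [[Target|label]] internal links; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def enumeration_items(markup):
    """Yield (depth, text) pairs for '*'-style enumeration items."""
    for line in markup.splitlines():
        match = re.match(r"^(\*+)\s*(.*)$", line)
        if match:
            yield len(match.group(1)), match.group(2)

links = WIKILINK.findall(SAMPLE)
print(links)  # ['Song One', 'Song Two']
items = list(enumeration_items(SAMPLE))
print(items[0])  # (1, '[[Song One]] (1970)')
```

The depth information (number of leading asterisks) is what makes multi-level enumerations accessible as structured data rather than plain text.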
DBpedia
With DBpedia (Lehmann et al., 2015), a well-established Wikipedia-based KG is already available. As it is extracted primarily from infoboxes, the information in DBpedia is very accurate and thus a perfect source for distant supervision (Mintz et al., 2009).
In the remainder of this section, we briefly explain the main Wikipedia elements exploited to extract CaLiGraph. Provided statistics are computed on the Wikipedia dump the most recent CaLiGraph version 3.1.1 is based on (August 2022, English).
Articles
An article in Wikipedia describes a concept of the real world. In the following, we will refer to this concept as the
Wikipedia contains 6.1 million articles in English (excluding non-encyclopedic pages like disambiguation pages and redirects).
Listings
With listings, we refer to (semi-)structured elements in Wikipedia that contain several items. In many cases, a listing represents a concept, with each item describing a concrete instance of this concept.
We are particularly interested in listings with items explicitly mentioning the entities they describe. We refer to these entities as
In Figure 3, we show four different listings in the form of tables (Figures 3(a) and (d)) and enumerations (Figures 3(b) and (c)). For example, in Figure 3(d), the soap opera characters are considered SEs, while the actors are not, as the listing focuses on the characters. While listings are usually formatted as enumerations or tables, there is no convention for how their information is structured. For example, SEs can be listed somewhere in the middle of a table (instead of in the first column), and enumerations can have multiple levels. Further, SEs may already be marked as entities through Wiki markup (blue or red links), but this is not always the case.

Listings in Wikipedia containing the mention
Of the 6.1 million articles in Wikipedia, 2.1 million contain at least one listing in the form of an enumeration or a table. We find 3.5 million enumerations and 1.4 million tables in these articles. 10 On average, listings have 11.7 items with a median of 7.
List pages are a special kind of Wikipedia pages that serve the sole purpose of listing entities with a common property. The list page
Wikipedia contains 89K list pages with 159K tables and 381K enumerations. On average, listings have 21.8 items with a median of 9 items.
Categories
Contrary to list pages, categories are a formal construct in Wikipedia and serve the purpose of categorizing pages in a hierarchical structure. This structure, the Wikipedia Category Graph (WCG), is a directed but not acyclic graph. It contains not only categories used for categorizing articles but also ones used for administrative purposes (e.g., the category
Wikipedia contains 2.2 million categories, with 11K list categories and 311K categories used for non-encyclopedic purposes like maintenance. We regard categories as being of the latter kind if they are not transitive subcategories of the category
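The transitive-subcategory test mentioned above amounts to a reachability check in the category graph. The following is an illustrative sketch; the category names, including the root category, are made up for the example and do not reflect the actual filtering configuration:

```python
from collections import deque

# category -> set of parent categories (toy data for illustration)
parents = {
    "1970s_soul_songs": {"Soul_songs"},
    "Soul_songs": {"Songs_by_genre"},
    "Songs_by_genre": {"Main_topic_classifications"},
    "Wikipedia_maintenance": set(),
}

def is_transitive_subcategory(cat, root, parents):
    """BFS upwards through parent links; True if `root` is reachable."""
    queue, seen = deque([cat]), {cat}
    while queue:
        current = queue.popleft()
        if current == root:
            return True
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False

print(is_transitive_subcategory("1970s_soul_songs", "Main_topic_classifications", parents))  # True
print(is_transitive_subcategory("Wikipedia_maintenance", "Main_topic_classifications", parents))  # False
```

Categories from which the encyclopedic root is not reachable are treated as administrative and discarded.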
Ontology Construction
In the following, we explain how the CaLiGraph extraction framework builds an ontology from categories and lists in Wikipedia Heist and Paulheim (2020) and how the classes are enriched with expressive axioms (Heist & Paulheim, 2019) (cf. upper part of Figure 1).
Class & Property Definition
All encyclopedic categories and list pages in Wikipedia are considered candidate classes for the CaLiGraph taxonomy. Additionally, we reuse and link to classes of the DBpedia ontology. By doing so, we can effortlessly enrich CaLiGraph with additional parts of the DBpedia ontology, like relations and disjointness axioms.
The category candidates contain many categories that are suitable classes for a taxonomy like
Taxonomy Induction
After removing non-taxonomic categories, we first build a taxonomy from the remaining categories, list categories, and list pages. To combine those, we use the existing connections in Wikipedia. Figure 4 shows an example of how these groups are connected. While all edges in the figure could be used to form the taxonomy, some edges should be discarded. For example, the category

Hierarchical relationships between categories, list categories and list pages in Wikipedia.
Removing wrong class subsumption axioms is a crucial step for different reasons. First, the Wikipedia category graph is not acyclic, but a class hierarchy should be. In fact, some of the subcategory axioms are questionable, such as making

Examples of non-taxonomic nodes and edges (marked in red) that must be removed from the respective category graph or list graph (Heist & Paulheim, 2020).
If this step were omitted and every subcategory assertion were converted into a subclass assertion, it would lead to a massive number of false type assertions in the final graph (in the case above: typing a song with
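Since a class hierarchy must be acyclic, cycle-closing edges have to be removed from the category graph. The following sketch shows one simple way to do this, dropping edges that point back to a node on the current depth-first path; the toy graph is illustrative, and the actual framework uses additional signals to decide which edge of a cycle to discard:

```python
# Break cycles in a node -> children graph by dropping back edges
# encountered during a depth-first traversal.
def remove_back_edges(children):
    """Return a copy of `children` without cycle-closing edges."""
    acyclic = {node: [] for node in children}
    state = {}  # node -> 'active' (on current DFS path) or 'done'

    def visit(node):
        state[node] = "active"
        for child in children.get(node, ()):
            if state.get(child) == "active":
                continue  # back edge closing a cycle: drop it
            acyclic[node].append(child)
            if state.get(child) != "done":
                visit(child)
        state[node] = "done"

    for node in children:
        if state.get(node) != "done":
            visit(node)
    return acyclic

graph = {"Songs": ["Soul_songs"], "Soul_songs": ["Songs"]}  # a 2-cycle
print(remove_back_edges(graph))  # {'Songs': ['Soul_songs'], 'Soul_songs': []}
```

Which of the two edges survives depends on the traversal order here; a production pipeline would instead rank the edges by plausibility before discarding one.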
As a final step, we connect the taxonomy of categories and lists to the DBpedia ontology. We map the categories to DBpedia classes using type axioms derived from DBpedia resources and linguistic signals (see next section for how the type axioms are created). After mapping, the CaLiGraph ontology consists of all the information in DBpedia like classes, properties, axioms (e.g., class disjointnesses), resources, and additional classes from categories and list pages. In the following, elements of the T-box (i.e., classes and properties) of CaLiGraph will be prefixed with
While category names are plain strings, we aim to uncover the semantics of the categories. To that end, we want to extract both type and relation assertions from categories and assign them to entities in those categories. Formally, we can learn two types of axioms:
In Figure 4, we may learn the following ontology axioms:
Given that we have an instance, e.g.,
The derived type axioms serve as the basis for a mapping from categories to DBpedia. The relation axioms are added to the CaLiGraph ontology as restrictions similar to
The
Candidate Selection
We identify sets of categories that most likely share a common type or relation. In Figure 1, an example of a category set is the set of subclasses of the category
Pattern Mining
We use the category sets to identify linguistic patterns as pre- or postfixes for all possible types and relations. For example, we may learn the pattern that categories ending in
Pattern Application
We apply the patterns to all categories in Wikipedia to extract axioms like (1) and (2). Here, we combine the likelihood of a pattern with the signals from a category to judge whether applying the pattern to the category is possible.
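The three steps above can be sketched in a few lines. The following toy example mines postfix patterns from category names with a known type and applies the most confident pattern to unseen categories; the category names, types, and support threshold are illustrative, and the actual framework combines pattern likelihoods with further category signals:

```python
from collections import Counter, defaultdict

# Categories whose type is already known (e.g., via DBpedia mappings).
mapped = [
    ("1970s soul songs", "Song"),
    ("Japanese pop songs", "Song"),
    ("Bridges in France", "Bridge"),
]

def mine_postfix_patterns(mapped, n_words=1):
    """For each name ending, keep the most frequent type and its count."""
    counts = defaultdict(Counter)
    for name, dtype in mapped:
        postfix = " ".join(name.lower().split()[-n_words:])
        counts[postfix][dtype] += 1
    return {p: c.most_common(1)[0] for p, c in counts.items()}

patterns = mine_postfix_patterns(mapped)
print(patterns["songs"])  # ('Song', 2)

def apply_patterns(category, patterns, min_support=2):
    """Predict a type for `category`, or None if no frequent pattern fits."""
    postfix = category.lower().split()[-1]
    dtype, support = patterns.get(postfix, (None, 0))
    return dtype if support >= min_support else None

print(apply_patterns("Italian folk songs", patterns))  # 'Song'
print(apply_patterns("Bridges in Spain", patterns))    # None (no frequent pattern)
```

The same mechanism extends to relation axioms by mining prefixes and postfixes per property instead of per type.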
This section describes the steps taken to populate CaLiGraph with additional entities as well as type and relation assertions (cf., lower part of Figure 1). We first recognize mentions of SEs in listings (Heist & Paulheim, 2022), then we link the mentions to entities in CaLiGraph or create new entities (Heist & Paulheim, 2023), and finally we derive new facts for the entities discovered in listings (Heist & Paulheim, 2021).
Named Entity Recognition
While a few listings already contain disambiguated entities, this is not the case everywhere. Many listings contain only text, mostly because the entities in them do not have a corresponding Wikipedia page describing them. Thus, we need an additional entity recognition step.
To detect SEs in listings, we phrase the problem as a token classification problem (Heist & Paulheim, 2022). We produce a label for every token of the input sequence and aggregate the token labels to predictions of SE mentions. With a transformer-based model, we predict 13 token labels, such as
We pass the context of a listing (e.g., page name and section name) and multiple listing items as textual input to the transformer model. To preserve the information about context and listing layout, we use special tokens. By passing multiple listing items at once, the model can learn the structure of the listing. For example, it may recognize that the SE is always mentioned in the first cell of a table (cf. Figure 3(a) and (d)), or is always followed by a particular sequence of characters (cf. Figure 3(b) and (c)).
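The aggregation of token labels into SE mentions can be illustrated as follows. This is a deliberately simplified sketch using BIO-style tags and made-up tokens, whereas the real model predicts 13 task-specific labels:

```python
# Turn per-token labels into mention strings:
# 'B' starts a mention, 'I' continues it, 'O' is outside any mention.
def aggregate_mentions(tokens, labels):
    mentions, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                 # beginning of a new mention
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I" and current:   # continuation of the open mention
            current.append(token)
        else:                            # 'O' or stray 'I': close the mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["Song", "One", "(", "1970", ")"]
labels = ["B", "I", "O", "O", "O"]
print(aggregate_mentions(tokens, labels))  # ['Song One']
```

The special context tokens mentioned above are simply labeled 'O' and never enter a mention.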
We generate the training data for the mention detection model from entities in list pages, using a heuristic labeling (i.e., weak supervision): as we already know the type of entities in a list page (e.g., entities in
Named Entity Disambiguation
One main challenge of Named Entity Disambiguation (NED) is the inherent ambiguity of mentioned entities in the text. Figure 3 shows four homonymous mentions of distinct entities with the name
With NASTyLinker Heist and Paulheim (2023), we employ an approach for NED in CaLiGraph that can deal with both of these challenges. It produces clusters of mentions and entities based on inter-mention and mention-entity affinities from a bi-encoder. NASTyLinker relies on a top-down clustering approach that assigns mentions to the entity with the highest transitive affinity in case of a conflict. A threshold on the transitive affinity ensures that new entities are created for mentions without an existing counterpart in CaLiGraph.
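The threshold-based assignment can be sketched as follows. This is a strong simplification (a greedy per-mention decision instead of NASTyLinker's top-down clustering over transitive affinities), and all mention names, entity identifiers, and scores are illustrative:

```python
# Assign each mention to its highest-affinity known entity, or create a
# new entity when no affinity exceeds the threshold.
def link_mentions(affinities, threshold=0.8):
    """affinities: mention -> {entity: score}. Returns mention -> entity."""
    assignment, new_id = {}, 0
    for mention, scores in affinities.items():
        entity, score = max(scores.items(), key=lambda kv: kv[1], default=(None, 0.0))
        if score >= threshold:
            assignment[mention] = entity
        else:  # no convincing match: found a new entity
            assignment[mention] = f"NEW_ENTITY_{new_id}"
            new_id += 1
    return assignment

affinities = {
    "Mercury (planet mention)": {"Mercury_(planet)": 0.95, "Mercury_(element)": 0.30},
    "Mercury (band mention)": {"Mercury_(planet)": 0.20, "Mercury_(element)": 0.25},
}
print(link_mentions(affinities))
# {'Mercury (planet mention)': 'Mercury_(planet)', 'Mercury (band mention)': 'NEW_ENTITY_0'}
```

The clustering in the actual approach additionally ensures that conflicting mentions linked to the same entity are resolved consistently across listings.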
Information Extraction
The information extraction efforts in CaLiGraph are currently focused on SEs in Wikipedia listings. Our approach identifies the characteristics of a listing, which are the types and relations shared by all its SEs. Given the example page about
We frame finding descriptive rules for listings based on their context as an association rule mining problem (Heist & Paulheim, 2021). We define rule metrics that take the inherent uncertainty into account and make sure that rules are frequent (rule support), correct (rule confidence), and consistent for all listings (rule consistency). To find a reasonable balance between the correctness and coverage of the rules, we set the thresholds based on a heuristic considering the distribution of NE tags over entities and existing knowledge in CaLiGraph. For the example given above, we identify the following generic rules:
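The support and confidence computations behind such rules can be sketched as follows; the listing contexts, entity types, and numbers are illustrative, and the consistency metric as well as the heuristic thresholding are omitted for brevity:

```python
# Support: fraction of listings matching the rule's context.
# Confidence: fraction of already-known SEs in those listings that
# confirm the predicted type.
def rule_metrics(listings, context, predicted_type):
    """listings: dicts with 'context' and 'entity_types' (None = unknown SE)."""
    matching = [l for l in listings if l["context"] == context]
    support = len(matching) / len(listings)
    known = [t for l in matching for t in l["entity_types"] if t is not None]
    confidence = known.count(predicted_type) / len(known) if known else 0.0
    return support, confidence

listings = [
    {"context": ("Discography", "table"), "entity_types": ["Album", "Album", None]},
    {"context": ("Discography", "table"), "entity_types": ["Album", "Single"]},
    {"context": ("Filmography", "table"), "entity_types": ["Film"]},
]
support, confidence = rule_metrics(listings, ("Discography", "table"), "Album")
print(round(support, 2), round(confidence, 2))  # 0.67 0.75
```

A rule is only kept if both metrics (and, in the actual framework, the consistency across listings) exceed their thresholds.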
This section gives an overview of CaLiGraph as a data source. First, we introduce its versions, purpose, and vocabulary structure. Then, we detail the extraction procedure of CaLiGraph, including sources, provenance, stability, and sustainability. Finally, we explain how CaLiGraph can be accessed and how it is used already.
Description
CaLiGraph and all associated information are accessible via http://caligraph.org. The dataset is licensed under CC BY 4.0, 13 granting everyone the right to use, share, and adapt all material, with the sole obligation of giving proper attribution.
The project to create CaLiGraph was initiated in 2018 (Heist, 2018) and, to date, three major versions have been published. Here is an overview of the versions that we use in the remainder of this work:
The purpose of CaLiGraph is to serve as a large-scale general-purpose KG covering all topics addressed in Wikipedia. In particular, CaLiGraph aims to incorporate all information given in a semi-structured format in Wikipedia. By exploiting the data structure, the extraction mechanisms of CaLiGraph can extract information, especially about long-tail entities, more precisely than from full text. Currently, the focus is on extracting information about entities mentioned in tables and enumerations.
Another feature distinguishing CaLiGraph from most other public general-purpose KGs is its large taxonomy containing expressive class descriptions. An example is shown in the upper part of Figure 1 where
Vocabulary
The CaLiGraph dataset builds on well-established vocabularies like
Extraction Procedure
CaLiGraph is extracted using the CaLiGraph Extraction Framework 14 as described in Section 3. We describe the extraction’s inputs, outputs, and organisation in the following.
Data Sources
The main inputs to the CaLiGraph extraction framework are an XML dump of the English Wikipedia and the English chapter of DBpedia (Lehmann et al., 2015) in the form of triples. Further, we use WebIsALOD (Hertling & Paulheim, 2017) to gather additional hypernyms during taxonomy construction (see Section 3.2.2).
Provenance
In CaLiGraph, we provide provenance information for new classes and entities using
For additional classes, we point to the Wikipedia categories or list pages used for extraction. For the additional entities, we include information about the listings they have been extracted from. For example, suppose we create the new class
Stability
CaLiGraph is built on information from Wikipedia and DBpedia. New releases are dependent on the information from these two resources. As we have no control over the data sources, CaLiGraph gives no guarantees for the stability of ontology and resources between major versions. The changes may affect any information contained in CaLiGraph. For example, it is possible that a page name in Wikipedia and, consequently, a resource in DBpedia changes. This change would then be taken over in CaLiGraph as well. Further, if the structure of the category graph in Wikipedia changes, this can influence the extraction of the CaLiGraph taxonomy. Finally, any changes in listings in Wikipedia may change how facts are extracted.
Sustainability
CaLiGraph is hosted and maintained by the Data and Web Science Group of the University of Mannheim. 16 The release cycle for CaLiGraph was mostly irregular in the past, as new developments were integrated as quickly as possible. There are ongoing efforts to align the release cycle with that of DBpedia and even to integrate the extraction of CaLiGraph into the DBpedia extraction workflow. Beyond that, we plan to improve and extend CaLiGraph further in various ways (see future work in Section 6.3). The Data and Web Science group plans several projects for improving the quality of CaLiGraph, each of which will be accompanied by a new full extraction. As with other efforts carried out by the group, such as WebDataCommons, 17 we are committed to maintaining access to open datasets as well as to developing them further.
Usage
The following describes how to access and interact with CaLiGraph best. Further, we give an overview of potential and existing use cases.
Access
The main web resources to view, use, and extend CaLiGraph are:
In general, CaLiGraph is intended to be used as a knowledge base for various domains similar to DBpedia. Hence, it can be used in similar use cases, for example, for information retrieval or question answering. CaLiGraph is already used in several concrete scenarios:
Qin and Iwaihara (2022) use CaLiGraph as training data for a transformer model to annotate table columns with entity types. Biswas et al. (2021) use CaLiGraph to evaluate models for entity typing using only the surface forms of the entities. In 2021, we submitted CaLiGraph as a dataset for the Semantic Reasoning Evaluation Challenge (Heist & Paulheim, 2021). It has been used in every challenge edition to evaluate reasoning systems (e.g., by Chowdhury et al. (2022)).
In this section, we show statistics about CaLiGraph, summarize all efforts to measure its quality and compare its performance on downstream tasks with DBpedia and YAGO. We use the English chapters of DBpedia in the versions from 2016 (
Statistics
We compare the KGs w.r.t. classes and entities in Table 2. We performed a similar comparison with more public KGs in previous work (Heist et al., 2020), but only with the early CaLiGraph version 1.0.6. However, the results are not directly comparable because, in the previous study, we only considered predicates in the namespace of the respective KG.
Basic metrics of all CaLiGraph Versions and other KGs Based on the English Wikipedia. †Entities are not Disambiguated Properly.
Compared to DBpedia, YAGO and CaLiGraph contain many more classes, largely retrieved from the WCG. The increase in classes and relations in the major CaLiGraph versions is caused by the Wikipedia version used for extraction (
In terms of entities,
Similarly to DBpedia and YAGO, CaLiGraph covers many domains. Figure 6 shows how the entities in

A sunburst diagram of frequent entity types in CaLiGraph.
We compare the type and relation frequencies of the three CaLiGraph versions in Table 3. We use the prominent types mentioned in Heist et al. (2020) to compare types. Unfortunately, the ranks are not perfectly comparable as DBpedia changed its taxonomy, taking effect in
Comparison of Counts and Ranks of Prominent Types and Properties Among CaLiGraph Versions. Prominent Types are Taken From Heist et al. (2020), and Prominent Properties are Selected Based on Their Frequency in CLGv3.
We take the most frequently used ones in
We provide information about the data quality in CaLiGraph concerning its metadata (Section 5.2.1), the vocabulary use (Section 5.2.2), as well as class and instance data (Section 5.2.3).
Metadata
As described in Section 4, the CaLiGraph ontology builds on well-established vocabularies like
Five-Star Rating
According to the five-star rating for linked data vocabulary use defined by Janowicz et al. (2014), the CaLiGraph dataset can be categorized as a four-star dataset and will be a five-star dataset soon:
First star: There is dereferenceable human-readable information about the used vocabulary on http://caligraph.org.
Second star: The information is available as a machine-readable explicit axiomatization of the vocabulary, as the CaLiGraph ontology is published using
Third star: The vocabulary is linked to other vocabularies, e.g., DBpedia (see Section 4.2.2).
Fourth star: Metadata about the vocabulary is available (see Section 5.2.1).
Fifth star: The vocabulary will soon be linked to by other vocabularies, as DBpedia is preparing to provide backlinks to CaLiGraph similar to the ones from CaLiGraph to DBpedia.
Class and Instance Data
In Table 4, we collect all evaluations of CaLiGraph data that were conducted with direct or indirect human supervision. CaLiGraph intends to ingest as much of the semi-structured information in Wikipedia as possible. The results show that most of the information is extracted with an accuracy of over 90%, with entity linking approaches being the only exception.
Collection of Evaluation Results of CaLiGraph Data.
One metric that allows for direct comparison with other knowledge graphs is the accuracy of relation assertions, that is, the fraction of correct triples. Here, CaLiGraph yields an accuracy of about 96%, which is slightly below the reported triple accuracy of DBpedia, YAGO, and Wikidata, which, according to Färber et al. (2017), expose a triple accuracy (called
The CaLiGraph extraction pipeline is a sequence of steps, with later steps depending on the results of earlier ones. It is hence unavoidable that errors propagate through the pipeline. The evaluations listed in Table 4 quantify such errors explicitly. In the results of NASTyLinker (Heist & Paulheim, 2023), these errors are not contained in the final accuracy of 89.4% for single mentions and 82.3% for all mentions; considering the SE labeling errors (Heist & Paulheim, 2022), the results for single mentions would decrease by 5.4% and those for all mentions by 3.3%. The results for extracting facts from listings (Heist & Paulheim, 2021) already include errors caused by incorrectly parsed entities; these errors are responsible for an accuracy decrease of 2.6% for entity types and 0.2% for relations.
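To make the effect of such error propagation concrete, the following minimal sketch (not part of the CaLiGraph tooling; `compound_accuracy` is a hypothetical helper) compounds per-stage accuracies under the simplifying assumption that stage errors are independent:

```python
# Illustrative only: if each pipeline stage is correct with probability a_i
# and stage errors are independent, the end-to-end accuracy is the product
# of the per-stage accuracies.
def compound_accuracy(stage_accuracies):
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

# Two stages at 95% each already drop the end-to-end accuracy below 91%.
print(round(compound_accuracy([0.95, 0.95]), 4))  # 0.9025
```

In practice, the decreases reported above are measured empirically rather than derived from an independence assumption, so the sketch only conveys the qualitative effect.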
Named entity disambiguation has an important impact on the overall quality of CaLiGraph. In Heist and Paulheim (2023), we explored that impact by comparing it to simple baselines. The most trivial disambiguation baseline, namely assuming that all SEs with the same label denote the same entity, leads to both lower precision (91.4% compared to 97.0%) and lower recall (73.5% compared to 87.0%) than the named entity disambiguation used in CaLiGraph. The drop in recall is larger because it is more likely that one entity has multiple labels than that one label refers to multiple entities.
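The same-label baseline can be illustrated with pairwise precision and recall over mention clusterings. The sketch below (with the hypothetical helper `pairwise_prf`; not the evaluation code used in the paper) shows how a single label covering two entities hurts precision, while two labels for one entity hurt recall:

```python
from itertools import combinations

def pairwise_prf(pred_clusters, gold_clusters):
    """Pairwise precision/recall of a mention clustering against gold clusters."""
    def pairs(clusters):
        s = set()
        for c in clusters:
            s.update(combinations(sorted(c), 2))
        return s
    p, g = pairs(pred_clusters), pairs(gold_clusters)
    precision = len(p & g) / len(p) if p else 1.0
    recall = len(p & g) / len(g) if g else 1.0
    return precision, recall

# Baseline: mentions with the same label form one cluster.
mentions = [("m1", "John Smith"), ("m2", "John Smith"), ("m3", "J. Smith")]
by_label = {}
for mid, label in mentions:
    by_label.setdefault(label, []).append(mid)
pred = list(by_label.values())       # [["m1", "m2"], ["m3"]]
gold = [["m1", "m3"], ["m2"]]        # m1 and m3 are the same entity
print(pairwise_prf(pred, gold))  # (0.0, 0.0): one wrong merge, one missed merge
```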
Moreover, only by clustering entity mentions and allowing for NIL entities is it possible to attribute information to entities that do not have their own Wikipedia page. This is the most notable difference between CaLiGraph and other KGs derived from Wikipedia, which only contain entities that have a dedicated Wikipedia page.
KGrEaT (
Experimental Setup
We consider CaLiGraph (
We report the results for two entity mapping scenarios: precision-oriented mapping and recall-oriented mapping. Both scenarios link the task dataset’s entities to KG entities using
We compute the results using four embedding methods:
Results and Discussion
Table 5 shows the average rank of the KGs w.r.t. the datasets of a task. In both scenarios, DBpedia shows superior performance in the Clustering, Entity Relatedness, and Semantic Analogies tasks, YAGO works best for Document Similarity, and CaLiGraph for Regression and Recommendation. While DBpedia has a tendency to work better in the precision-oriented scenario, CaLiGraph works better in the recall-oriented scenario.
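The aggregation behind Table 5 can be sketched as follows: rank the KGs per dataset by their score, then average the ranks over all datasets of a task type (`average_ranks` is a hypothetical helper, and the scores below are made up for illustration):

```python
from statistics import mean

def average_ranks(scores_per_dataset, higher_is_better=True):
    """scores_per_dataset: list of {kg_name: score} dicts, one per dataset."""
    ranks = {}
    for scores in scores_per_dataset:
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, kg in enumerate(ordered, start=1):
            ranks.setdefault(kg, []).append(rank)
    return {kg: mean(rs) for kg, rs in ranks.items()}

# Made-up clustering scores on two datasets.
clustering = [
    {"DBpedia": 0.81, "YAGO": 0.78, "CaLiGraph": 0.80},
    {"DBpedia": 0.66, "YAGO": 0.69, "CaLiGraph": 0.61},
]
print(average_ranks(clustering))  # DBpedia averages rank 1.5 here
```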
Evaluation Results of the KGs Given as Average Rank per Task Type.
Note: The results are computed for a precision-oriented mapping scenario and a recall-oriented mapping scenario. The best results are bold, second-best are underlined.
On a dataset level (see Tables 7 and 8 in Appendix B for details), it becomes clear that the choice of a KG for a given task is always dependent on the domain. As expected, DBpedia performs well on DBpedia-based datasets. The superior performance of
Dataset Coverage (per cent) of the KGs Evaluated with KGrEaT for the Precision- and Recall-oriented Mapping Scenarios. Datasets Marked with a Dagger are Independent of DBpedia.
KGrEaT Evaluation Results of the KGs Aggregated by Task Type, Dataset and Metric for the
Summary
With CaLiGraph, we presented a KG created from Wikipedia categories and lists, offering a rich taxonomy with semantic class descriptions, and going beyond the one-entity-per-page paradigm of DBpedia and YAGO, thus offering a much larger set of entities. We gave an overview of its extraction framework and summarized relevant information for potential users of the KG. The comparison of CaLiGraph to other popular public KGs shows that, despite its wealth of classes and entities, using CaLiGraph is favorable only in some scenarios; there is no one-size-fits-all solution.
Limitations
In Section 2.3, we identified several challenges in the field of AKGC. We made progress on some of them, while others are yet to be addressed. For CaLiGraph, we can formulate some of these limitations in more detail:
Error Accumulation
AKGC in CaLiGraph is executed as a pipeline of automatic processing steps. Errors in early steps are propagated to subsequent steps and may create distortions with a high impact on the outcome. For example, in the recent version of CaLiGraph, an extraction error made
Entity Ambiguity
Ambiguity is one of the biggest challenges when identifying and disambiguating mentions of entities in text. As information about long-tail entities is sparse during extraction, the quality of such entities in CaLiGraph is not yet satisfactory.
Wikipedia Dependency
Currently, the CaLiGraph extraction targets a single version of Wikipedia only. Any information not contained in that version can consequently not be part of the KG. Further, we have no direct influence on the content of Wikipedia and hence can deal with potential problems only at extraction time.
DBpedia Dependency
CaLiGraph builds on the ontology of DBpedia, taking over all types and properties. While types are extended, the set of properties remains fixed, and knowledge can only be modeled within the bounds defined by the DBpedia ontology.
Future Work
For future work in CaLiGraph, the focus is divided between improving the quality of the existing KG and extending its coverage to incorporate more knowledge.
Improving Extraction Quality
While error propagation is currently problematic in CaLiGraph, it is also a chance to improve the overall quality of the graph by gradually improving the individual parts. Fixing errors in the early stages of the extraction may positively influence the complete extraction pipeline. To that end, we plan to implement a more rigorous error-monitoring system to capture errors early and monitor all parts of the pipeline to identify opportunities for improvement.
As a concrete improvement, we plan to replace or augment the taxonomy induction step of Section 3.2.2 with a Transformer model that is tuned on identifying subclass relationships (e.g., from Hertling & Paulheim, 2023). This may improve the class hierarchy substantially as we currently rely on manually combined hypernym information from multiple sources.
We plan to put more emphasis on the dependencies between SEs expressed through co-occurrence. This might be particularly helpful when trying to disambiguate entities in text. We are only implicitly using the context of an entity mention during disambiguation. Explicitly providing information about related entities might improve the disambiguation capabilities of NASTyLinker.
Extending KG Coverage
We plan to extend CaLiGraph in the three dimensions of ontology, assertions and data sources. To extend the ontology, we can discover additional axioms by extending the Cat2Ax approach from categories to list pages. Additionally, we may derive more axioms by relying on common sense knowledge from another KG (e.g., CSKG Ilievski et al., 2021). We further plan to discover new properties by using the existing data in CaLiGraph as a foundation to automatically exploit dependencies between co-occurring entities where the relation underlying the co-occurrence pattern is not in the ontology yet.
Like YAGO, we can extend the coverage of CaLiGraph to more dimensions like temporal or geospatial information. As the KG currently reflects only the point in time when the Wikipedia dump was created, we consider incorporating edits in Wikipedia pages to reflect the temporal dimension. Alternatively, we explore the possibility of extracting CaLiGraph from multiple dumps and merging the results to include a temporal perspective.
CaLiGraph currently targets only the English Wikipedia chapter. An extension to other languages would not only provide multilingual labels; the automatic extraction mechanisms may also derive much more complementary information from the diverse language chapters. The main challenge here is to merge the information derived from all the language chapters into a unified KG. Finally, we may extend the extraction to other data sources. As most extraction methods in the pipeline are built for encyclopedic content, a first step is to follow the example of DBkWik and target other Wikis than Wikipedia.
