Abstract
Introduction
A Knowledge Graph (KG) uses a graph-based model to represent real-world entities, their attributes, and relationships [40]. Entities are anything that can be uniquely identified and described, such as people, places, things, or concepts, but also the relationships between those. The “graph” metaphor stems from the idea of depicting statements representing relationships between entities as directed graph edges. A wide range of information can be represented using KGs, including encyclopedic knowledge, scientific data, corporate data, and – along with meta-information attached to statements – also contextual information, such as who
The goal of said standards is to enable interoperability, but also the ability to unambiguously describe the (allowed) schema and semantics of knowledge graphs, which in turn is crucial for maintaining KG quality, as more and more KGs are published across the Web in a decentralized, collaboratively created fashion.
Since its creation by the Wikimedia Foundation in 2012, Wikidata has become one of the largest such KGs, publicly available on the Web, with more than 100M items1
In terms of supporting the above-mentioned Semantic Web standards, the Wikidata KG is available in standard RDF format and can be queried via a public SPARQL endpoint. Yet, Wikidata adheres neither to OWL/RDFS nor to SHACL: while other knowledge graphs often have predefined formal ontologies or schemas defined in RDFS and OWL, Wikidata takes a different approach, with its community focusing on the development of the data layer (A-Box) while the terminology layer (T-Box) evolves alongside it. This means that Wikidata does not have a single, pre-defined formal ontology [50] adhering to RDFS/OWL’s well-defined semantics. In fact, while some Wikidata properties, such as
In the current paper, we focus on the largest and most widely supported amongst these constraint approaches in Wikidata, namely the
When it comes to how property constraints should be interpreted/checked, there is a description in natural language for each constraint type available on a respective help page, for instance, the
In the present paper, we explore the use of both SHACL and SPARQL as tools for formalizing Wikidata’s property constraints; the use of these standardized tools should provide more accurate, open, and efficient means of identifying and addressing inconsistencies in Wikidata and resolving potential ambiguities. To this end, our main contributions are as follows:
We provide a gentle and comprehensive introduction to Wikidata’s specific, namespace-based RDF reification model, with many illustrative examples that show how Wikidata’s wide range of different property constraints are represented using this model.
We study to what extent the expressiveness of the SHACL-Core language is sufficient to express Wikidata property constraints and come to the conclusion that the SHACL-Core language is not expressive enough to represent all Wikidata property constraints: Among the 32 investigated property constraint types, SHACL-Core lacks components to express two of them. In addition, we argue that another four constraint types are not reasonably, or only partially expressible.
For the Wikidata property constraints expressible in SHACL-Core, we present a tool to automatically translate such constraints; the tool can benefit also other Wikibase KGs that import Wikidata property constraints.
We show how the non-SHACL-Core-expressible remaining constraints can be formalized in full SHACL (using the SHACL-SPARQL extension), and argue for an, in our opinion, more effective formalization in SPARQL alone.
We consequently unambiguously formalize all 32 Wikidata property constraint types as SPARQL queries which provide a declarative means to express constraints, directly operationalizable via Wikidata’s SPARQL endpoint.6 Notably, as it turns out, some constraint types can only be partially evaluated online due to incomplete RDF representation of Wikidata’s own RDF data model on Wikidata’s SPARQL query endpoint.
We present a comparison of our SPARQL approach to the current Wikidata violation reports, demonstrating the feasibility of using SPARQL to actually check constraints: particularly, we highlight potential ambiguities and reasons for deviations in violations found with our approach compared to the Wikidata violation reports; we believe that our approach as such helps clarifying such ambiguities in a reproducible manner.
We note that, due to the known scalability limits of Wikidata’s SPARQL endpoint, we still run into timeouts in checking some of the most violated constraints; yet we argue that our work can be understood as providing challenging benchmarks for both (i) SHACL validators and (ii) SPARQL engines, based on the real-world use case of Wikidata; as such, we extend and go beyond recent related benchmarks.7 For instance, our constraint checking queries are not restricted to “truthy” statements, as opposed to the recent WDbench [6] SPARQL benchmark.
The remainder of this paper is structured as follows. Section 2 presents an exhaustive, tutorial-style introduction to Wikidata’s property constraints, diving into Wikidata’s RDF meta-modeling, and explaining how property constraints are represented within this model.
Section 3 discusses how to represent the semantics of Wikidata property constraints in SHACL-Core. We also present
Yet, as not all Wikidata property constraints are expressible in SHACL-Core, in Section 4 we instead present a complete mapping of
As a demonstration of feasibility, we present a detailed analysis and experiments, comparing violations found by our approach with the officially reported constraint violations by Wikidata itself in Sections 5 and 6.
Finally, after discussing related works on constraint formalization and quality analysis for Wikidata and other KGs in Section 7, we conclude in Section 8 with pointers to future research directions.
As mentioned already in the introduction, standard ontological inference as a means to detect inconsistencies is not directly applicable to the approach taken by Wikidata. Firstly, Wikidata’s data model may arguably be described as extending RDF’s plain, triple-based model, by various meta-modeling features for adding references and other contextual qualifiers to statements. Indeed, Wikidata’s data model is mapped to RDF via a specific reification mechanism. Secondly, there is neither a strict distinction between the data and terminology layers nor does Wikidata’s terminology rely on OWL/RDFS [31]. Rather, the terminology layer evolves in the background as editors add/update new facts, potentially introducing new properties and classes in a community-based approach. Additionally, proprietary, community-driven, ad-hoc processes have been set up within Wikidata to define constraints on the terminology used. In particular, the
In order to provide the required background, in the following subsections we introduce the RDF data representation adopted by Wikidata (Section 2.1) with several examples, followed by illustrating details of how Wikidata’s property constraints are represented within this data model (Section 2.2), in particular focusing on qualifiers used as “parameters” for constraint definitions (Section 2.3). Finally, we discuss both (i) challenges in understanding the exact meaning of these property constraints (the semantics of which are largely described in natural language only), as well as (ii) potential issues in verifying them on Wikidata’s RDF representation (Section 2.4).
In this section, we describe how Wikidata’s data model – and specifically property constraints – are modeled in RDF and can be queried with SPARQL.9 For details, we refer the reader to
To this end, let us start with the bare basics of RDF and SPARQL and then gradually dive into the specifics of Wikidata. When talking about RDF, as usual, we will refer to

Simple RDF (sub-)graph example from Wikidata, showing two statements/triples, indicating that Lionel Messi (URL:
The most important namespaces used in Wikidata’s RDF representation; we omit some additional standard namespaces such as
Here, URIs are represented as namespace-prefixed identifiers, e.g.
For queries over RDF data, we will use SPARQL [32]: indeed, the RDF representation of Wikidata can be fully queried by means of Wikidata’s SPARQL query service,10 available at
Here, the triple pattern
Note that, interestingly, the query from above also returns
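For concreteness, a query of this kind can be sketched as follows (a minimal sketch; P27, “country of citizenship”, is assumed here as the example property, and this is not necessarily identical to the query referred to above):

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# All (truthy) countries of citizenship (P27) of Lionel Messi (Q615)
SELECT ?country WHERE {
  wd:Q615 wdt:P27 ?country .
}
```

Such a query over the wdt: (“truthy”) namespace only matches direct claims; statements with ranks and qualifiers are reached via the p:/ps:/pq: namespaces instead.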
Other methods to access RDF from Wikidata, apart from Wikidata’s query service, include complete RDF dumps,12 cf. For instance, supporting content negotiation, the result of running
In a nutshell, a dataset speaks “authoritatively” about a URI, or likewise a namespace prefix, if it is published/accessible on the same pay-level-domain. For instance, Wikidata is authoritative for all URIs (and, resp., namespaces) which start with

Subgraph example from Wikidata. Direct claims can be stated (using
In this reification model, Wikidata uses URIs that represent hashes for “anonymous” reified
As a side note, let us emphasize that Wikidata’s use of such “hashed” statement nodes is a deliberate choice to avoid the use of cf.
Besides Items and Properties, since a large part of Wikidata is also specialized in linguistic knowledge and multilinguality, another special kind of entities, so-called
Such direct claims can be further described and annotated with meta-information. That is, for each claim, a separate,
Fig. 2 presents a subgraph of Wikidata that illustrates this RDF representation, containing two claims about
Fig. 3 shows two claims about the capital of the US, one of which (the current capital) has a
which will only return the current capital, whereas if we wanted to query for
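The two flavors of querying can be sketched as follows (assuming, as in the figure, P36 = “capital”, Q30 = United States of America, and P582 = “end time”; these are sketches, not necessarily the exact queries elided above):

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p:   <http://www.wikidata.org/prop/>
PREFIX ps:  <http://www.wikidata.org/prop/statement/>
PREFIX pq:  <http://www.wikidata.org/prop/qualifier/>

# Variant 1 – truthy claims only: returns just the current capital
SELECT ?capital WHERE { wd:Q30 wdt:P36 ?capital . }

# Variant 2 – full statements: returns all capitals, together with the
# "end time" (P582) qualifier where present (i.e., historical capitals)
SELECT ?capital ?endTime WHERE {
  wd:Q30 p:P36 ?stmt .
  ?stmt ps:P36 ?capital .
  OPTIONAL { ?stmt pq:P582 ?endTime . }
}
```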

A subgraph showing claims about two
Note that similarly,
The upper half of Fig. 4 illustrates the modeling of quantity values in Wikidata, in this case about Lionel Messi’s height. The lower part of the figure illustrates another heavily used feature of Wikidata, namely references: the property “reference URL” (

A subgraph containing a
Figure 5 illustrates labels and descriptions, where English, Spanish, and Arabic labels and descriptions for Item Q615 (

A subgraph illustrating additional RDF triples for representing (multi-lingual) labels and descriptions of Wikidata entities, leveraging RDF’s language-tagged literals.
As illustrated in Fig. 6, the lexeme

A subgraph about the English noun “football”, including normal claims, but also Wikidata-specific additional vocabulary to talk about languages.
As we can see in the example, lexemes can be involved in normal (
Figure 7(a) summarizes the modeling of regular statements and ranks, including the involved namespaces, in a more abstract manner. As shown in Fig. 7(b), the RDF model also contains triples to “navigate” between the differently prefixed URIs per property ID (PID); we will need to make use of these connections in our modeling of constraints in SHACL and SPARQL later on, but let us first turn to how these constraints themselves are actually represented within the RDF model.

The Wikidata meta-model in RDF and its namespace usage, illustrated for (a) statements and claims, (b) properties, and (c) property constraint definitions. Dashed lines represent equivalent entities. Figures 2 and 3 illustrate concrete instantiations of the “Wikidata Statement” block (a), while Fig. 8(c) illustrates the “property constraint definition” block (c). Abbreviations: QID = entity ID, PID = property ID.
Wikidata property constraints make use of the described modeling to represent specific community-defined constraint types, where specific instantiations of a
Wikidata property constraint types: incl. information about their usage in constraint definitions, about whether and how we could express them in SHACL-Core and SPARQL, as well as which qualifiers they use (verified on the status at the time of writing using a variation of this query: https://w.wiki/7KrH). The short links in the SPARQL column are direct links into our GitHub repository, available at https://github.com/nicolasferranti/wikidata-constraints-formalization/; besides the SPARQL formalizations, you will also find there all corresponding SHACL shapes (where expressible).
(Continued)

Example of a Wikidata property constraint and data graphs with different behaviors (as of 2022-03-29).
Whereas each constraint type is modeled as an item – for instance, the
In terms of parameters, constraint-type specific
For instance, the
Fig. 8a illustrates how these qualifiers are concretely instantiated for an IRS constraint on the property
connects the property
Further, the IRS-specific qualifier
whereas the respective allowed values are defined via four additional triples using the
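The constraint definition just described can itself be read off the RDF graph with a query along the following lines (a sketch: P1469 is assumed to be the running-example property “FIFA player ID”, Q21503247 the item-requires-statement constraint type, and P2306/P2305 the qualifiers for the required property and its allowed values):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p:  <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>

# Read the parameters of the IRS constraint(s) declared on P1469
SELECT ?requiredProp ?allowedValue WHERE {
  wd:P1469 p:P2302 ?c .            # P2302: "property constraint"
  ?c ps:P2302 wd:Q21503247 ;       # item-requires-statement constraint type
     pq:P2306 ?requiredProp .      # the property required on subject items
  OPTIONAL { ?c pq:P2305 ?allowedValue . }  # its allowed values, if restricted
}
```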
Figures 8b and 8c present data subgraphs for two different items,
As a second example, let us look at another constraint type, the so-called

Another example of a Wikidata property constraint and data graphs (as of 2022-08-20).
As illustrated in Fig. 9, the property
The single-value constraint on P36 in Fig. 9a lists (amongst others) the
We hope that the previous subsection has sufficiently illustrated the most relevant aspects of modeling and parameterizing property constraints. Rather than in terms of fully elaborated examples, let us summarize all mentioned and remaining qualifiers used in the context of constraint modeling and parameterization in the following. To this end, Table 2 provides an overview of which qualifiers are used in current descriptions of constraints of different types. For each of the used qualifiers, we will provide a description of how they are used in the context of the different constraint types listed in Table 2, along with specific constrained properties:
– (Section 2.3.1) qualifiers which are essential for modeling the semantics of constraints and for verifying them, i.e., which will be essential for our formalization in SHACL and SPARQL;
– (Section 2.3.2) qualifiers which essentially mark concrete items as exceptions to constraints, or deactivate whole constraints, that do not need to be verified;
– (Section 2.3.3) qualifiers which have no semantic relevance for formalizing the (verification of) constraints as such, but serve other, mostly descriptive purposes.
Core constraint qualifiers
Case 1:
Case 2: another path, a label (i.e., the path
a description (i.e.,
As an interesting side observation, we note that the similar-in-spirit
We note that the actual definition of the single-value constraint on P36 on Wikidata lists even more separator properties. I.e., note that P2309 is itself restricted by a
Case 1: “as main values”, i.e., using the namespaces for claims (
Case 2: “as qualifiers”, i.e., using the
Case 3: “as references”, i.e., only in reference statements about claims, identifiable via the
for the subject items of constrained property
Just like any other “regular” claims about Items, constraint-definition-statements about Properties also have a
Notably, at the time of writing, there were 17 property constraint definitions with a different status, cf.
i.e., additionally using qualifier Interestingly, other sub-types of musical ensembles, such as
stating that entities that are instances of these values24
The above summary of the qualifier properties used to parameterize constraints should have illustrated that the semantics of Wikidata’s vocabulary used to describe constraints are not always uniquely determined: indeed, the interpretation of constraint qualifiers depends on (i) the context in which, (ii) the particular combination with other qualifiers in which, and (iii) the particular constraint types for which they are used.
Before we continue in Section 3 with more details on how these qualifier properties are interpretable as parameters in SHACL-Core shapes for verifying different constraint types, let us discuss some additional challenges that potentially complicate these formalizations, and motivate our idea to design bespoke translations per
The above description of Wikidata property constraints modeling in RDF defines how constraints are represented but not how they should be checked. To understand how to check constraints, a description property ( The description page of our running example
It is important to note here that the Wikidata community has defined all existing property constraint types in a manner where these types and their modeling have grown organically. In our formalizations of such constraints, we tried to stay as close as possible to the – partially heterogeneous – interpretations derivable from the natural language descriptions of constraint types. We could find and document several cases of textual ambiguity, leaving room for different interpretations and as a result, for different implementations of the respective constraint checks. We illustrate some of these issues.
Let us first note that in Wikidata constraint type descriptions, it is rarely explicitly/uniformly specified whether
Indeed, some of these constraints resemble known RDFS axioms: for instance, the
For instance, in our formalizations of the
The interested reader might have noted the last sentence in parentheses, which indirectly informed our respective interpretation of
This potential issue is, by the way, not restricted to IRS constraints, cf. Footnote 25 on p. 2349, which illustrates a similar questionable example for the potential non-consideration of subclass-inferencing in the context of a
Along these lines, similar issues of interpretation may arise, in the context of (sub-)properties (P). Recall the
Yet, the claim that
We consequently do not consider (transitive) subproperty relationships in our formalization, in line with the observed behavior within Wikidata: indeed, although
The description page for
Values are to be considered different if they use different values for
Values are to be considered different if they use different values for
indeed, the latter interpretation (
It is arguable whether we should consider exceptions – marked as
Deprecated constraints are, interestingly, still being tested and reported by Wikidata’s Database Reports; likewise, no distinction is made regarding the constraint status – denoted by
Differences in Wikidata RDF serialization
Another remarkable finding arose for us when taking a closer look at
Wikibase Item (
Wikibase property (
Wikibase MediaInfo (
Lexeme (
Sense (
Form (
Wikidata Item (
Notably, though, these types can only be partially checked, depending on the RDF serialization, for various reasons:
incoherent serialization with respect to different entity types;
differences between RDF serializations on the SPARQL endpoint vs. RDF dumps;
checks requiring unintuitive workarounds.
returns over 10000 results (apparently covering all properties), on the other hand, the query
returns 0 results on the Wikidata SPARQL endpoint.
For instance, the SPARQL endpoint returned over 13M instances of
Indeed, these subtleties and incoherence within Wikidata’s RDF serialization(s) make it hard to define generic parameters to test allowed entity types constraints automatically, forcing case-by-case implementations (either in terms of SHACL shapes or in a SPARQL query), depending on whether working on the SPARQL endpoint or on Wikidata’s RDF dump.
In the context of this paper, we have designed both our SHACL-Core formalization (cf. Fig. 12 below) of
In summary, all of these examples and issues should motivate the following disclaimer: the SHACL-Core shapes and SPARQL queries proposed in this paper were created from the available descriptions and aim to reduce the margin of interpretation in dealing with Wikidata constraints while keeping as close as possible to the documented interpretations. As such, all our SHACL and SPARQL formalizations discussed in the following Sections 3 and 4 (and linked from Table 2) reflect our best-effort
In this section, we will present Wikidata property constraint types in terms of SHACL shapes, i.e., deploying the official W3C standardised language to express constraints on RDF graphs.
To this end, we will first provide some necessary background on RDF and SHACL, introducing some notions that will be useful in the rest of the section (Section 3.1), whereafter we will dive into details of formalization and expressibility of particular Wikidata property constraint types (Section 3.2), again mostly driven by examples; for a full list of SHACL formalizations per constraint type, we refer to Table 2.
SHACL validation
The SHACL standard specifies constraints through so-called
A
The shapes graph in Fig. 10a describes the shape :
Consider the RDF graph represented by Fig. 8b:33 Note that the nodes and edges are labeled with both names and their (namespace-abbreviated) URIs, but we assume the corresponding RDF graph is implicitly clear to the reader.
This is not the case for the RDF graph in Fig. 8c, since
Overall, the shapes introduced in Example 9 define exactly the intended semantics of the
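To fix intuitions, a shape in the spirit of Fig. 10a can be sketched in Turtle as follows; the target property (P1469, “FIFA player ID”), the required property (P31, “instance of”), and the allowed value (Q5, “human”) are assumptions for illustration and need not coincide with the actual constraint definition:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix wd:  <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix :    <http://example.org/shapes#> .

:IRS_P1469_shape a sh:NodeShape ;
    # applies to every item with a (truthy) P1469 statement
    sh:targetSubjectsOf wdt:P1469 ;
    # such items must have at least one P31 value among the allowed items
    sh:property [
        sh:path wdt:P31 ;
        sh:qualifiedValueShape [ sh:in ( wd:Q5 ) ] ;
        sh:qualifiedMinCount 1 ;
    ] .
```

Note the use of sh:qualifiedValueShape with sh:qualifiedMinCount 1 rather than a plain sh:in, since only at least one (not every) P31 value needs to fall within the allowed set.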

SHACL shape
The SHACL specification allows for shapes to refer to other shapes which may result even in cyclic references and
The constructs defining a shapes graph can syntactically be viewed as concepts in expressive Description Logics [9], a well-known family of decidable fragments of first-order logic. That is, the shape components can be viewed as logical constructs, such as existential and universal quantifications, qualified number restrictions, constants, or regular path expressions. For instance, the shapes graph in Fig. 10a can be expressed as the tuple containing the target
Let us proceed to describe more systematically now, how Wikidata property constraints can be translated to SHACL-Core shapes.
The
Constraint types
On the contrary, constraints
Further building upon and extending the discussion of Example 9, we discuss and illustrate translations to SHACL guided by the constraint qualifier properties to parameterize them in the following, where we go through the constraint qualifiers in the same order as in Section 2.3.
As noted in Section 2.3.1, the P2306 qualifier may, also in the context of other constraint types, refer to more generic paths
where the listed allowed units in line 6 denote
This concludes the discussion of the treatment of
Yet, along the lines of our discussion in Section 2.4.3 above, where we remarked that Wikidata does not seem to make an explicit distinction between constraint statuses in its Database Reports, we also do not consider the
In summary, by “templating” the respective qualifier parameter translations from Wikidata’s constraint representation to SHACL shapes based on the illustrative examples above, we can cover most constraint types, and additionally carry over some useful descriptive information to dedicated SHACL-Core constructs that are not strictly needed for validation as such, but can be used by validators to generate explanatory output.
A prototype, reading constraint definitions in Wikidata’s representation and accordingly creating their shapes representation on-the-fly is presented in Section 3.3 below. Table 2 presents the entire set of analyzed constraint types, their Wikidata IDs, as well as a column to state whether it was possible to map the constraint type to SHACL (and SPARQL, respectively, see Section 4). The particular SHACL encodings, created by using the above-introduced “mappings” between Wikidata qualifiers and SHACL-Core components, can be found in an online GitHub repository accompanying our paper.35
In order to demonstrate the feasibility of an automated translation, we developed
In a nutshell, we first generalized the example SHACL shapes (such as the one in Fig. 10a) to become templates for specific constraint types, e.g. by replacing the specific qualifier values assigned to a single property, illustrated by the following “template abstraction” for IRS constraints:
Our wd2shacl tool then populates these templates according to the actual qualifier values instantiated for a specific property.
Available at

Wikidata to SHACL architecture. Dashed lines represent abstract classes.
After collecting constraint types and qualifiers for the input property
As shown in Table 2, the vast majority of Wikidata constraint types can be rewritten into SHACL-Core shapes (26 out of 32); yet, three could only be partially translated, one cannot be expressed in a reasonable way, and for a further two we did not find any way to express them in SHACL-Core at all. Let us discuss these, and the involved challenges, in more detail:
Firstly, the
Next, the
We leave it as an open question at this point whether there exists a more concise formulation in terms of more complex, possibly nested SHACL shapes. Don’t try this at home!
The
Wikibase item
Wikibase MediaInfo
Wikibase lexeme
Wikibase form
Wikibase sense
Wikibase property

SHACL-Core shape for verifying an
The last three constraint types in our problematic list (
On the contrary, it is straightforward to verify the uniqueness or difference of a property value with respect to the claimed subject in the absence of separators: as such, both (

SHACL shapes encoding: from simple SHACL-Core shapes to the SPARQL formalization.
The scenario changes though when a separator qualifier property is introduced. According to both possible interpretations
To illustrate this, consider again the instantiation of the separator qualifier for the
Therefore, the variants of all three,
We show in the next section how these remaining constraint types can be expressed with formalisms beyond SHACL-Core.
Beyond its core language, SHACL provides a mechanism to refine constraints in terms of full SPARQL queries through a SPARQL-based constraint component (
Figure 13c extends the SHACL-SPARQL shape to check the existence and – in that case – the equality of the
However, as there may be several qualifiers, this shape is not sufficient. Figure 13d generalizes the SHACL shape to consider as violations entities that have any qualifiers with equal values, as such implementing interpretation
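For illustration, such a SHACL-SPARQL shape can be sketched along the following lines (assuming P36 as the constrained property and P582 as its separator qualifier, which are assumptions for illustration; prefixes are inlined in the query string for brevity, whereas the standard mechanism uses sh:prefixes/sh:declare):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix p:  <http://www.wikidata.org/prop/> .
@prefix :   <http://example.org/shapes#> .

:SingleValueP36 a sh:NodeShape ;
    sh:targetSubjectsOf p:P36 ;
    sh:sparql [
        sh:message "two P36 statements share the same separator value" ;
        sh:select """
            PREFIX p:  <http://www.wikidata.org/prop/>
            PREFIX ps: <http://www.wikidata.org/prop/statement/>
            PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
            SELECT $this ?value WHERE {
              $this p:P36 ?s1 , ?s2 .
              FILTER (?s1 != ?s2)
              ?s1 ps:P36 ?value .
              # violation: both statements agree on the separator qualifier
              ?s1 pq:P582 ?sep .
              ?s2 pq:P582 ?sep .
            }
        """ ;
    ] .
```

This sketch implements one of the two interpretations discussed above, flagging pairs of statements whose separator values coincide; the treatment of statements lacking the separator altogether would differ between the interpretations.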
Towards SPARQL
In summary, we note the following limitations for the implementation of Wikidata property constraints via SHACL.
In summary, we observe that not all Wikidata constraints could be directly mapped to shapes in SHACL-Core.
Firstly, we can cover only a subset of SHACL-expressible Wikidata property constraints in SHACL-Core;
Secondly, our approach introduced so far instantiates separate SHACL shapes for each property constraint definition, even if we used SHACL-SPARQL;
Thirdly, the capacity of checking these constraints (there were over 72K constraint definitions in total at the time of writing) against the whole Wikidata graph – to the best of our knowledge – goes beyond the scalability (and feature coverage) of existing SHACL validators.
As for the first item, clearly, the above-mentioned issues regarding the expressivity of Wikidata property constraints within SHACL-Core limit its applicability. Moreover, non-core features are unfortunately not mandatorily (and thus rarely) implemented by SHACL validators so far.
As for the second and third items, we argue that due to the limitations associated with the expressibility of the SHACL constraints and the lack of tools capable of efficiently validating large graphs, a direct SPARQL translation potentially presents itself as a more generic, flexible, and operationalizable approach for validating Wikidata Property constraints.
Let us demonstrate this idea with a straightforward SPARQL translation of a simple
An even more direct and crisp, and also executable formulation of this constraint can be easily constructed by the following SPARQL query:
In fact, we claim that violations of this particular constraint type, i.e. the
Indeed, this single query checks
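To illustrate the idea of one query per constraint type, such a query can be sketched as follows for the item-requires-statement constraint type (Q21503247); the handling of allowed values (P2305), exceptions, and statement ranks is deliberately simplified, so this is a sketch rather than our exact published query:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p:  <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?item ?constrainedProp WHERE {
  # every property declaring an item-requires-statement constraint ...
  ?prop p:P2302 ?c .
  ?c ps:P2302 wd:Q21503247 ;
     pq:P2306 ?requiredProp .
  # ... navigating from the property entities to their direct-claim predicates
  ?prop         wikibase:directClaim ?constrainedProp .
  ?requiredProp wikibase:directClaim ?requiredDirect .
  # violation: an item uses the constrained property but lacks the required one
  ?item ?constrainedProp ?someValue .
  FILTER NOT EXISTS { ?item ?requiredDirect ?anyValue . }
}
```

The wikibase:directClaim triples are exactly the “navigation” links between differently prefixed property URIs mentioned in Section 2; they allow a single query to range over all constrained properties at once.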
Overall, we hope the illustrative examples in this section have sufficiently motivated that a direct translation of property constraints to SPARQL has advantages over SHACL for various reasons. That is, while in this section we have in principle made a case for using (a subset of) Wikidata’s property constraints as a “playground” to automatically generate a large testbed for SHACL(-Core) validators (and have also sketched how to extend this approach to SHACL-SPARQL), we also hope to have convinced the reader that this approach is not (yet) practically feasible, and in the end have made a case for directly generalizing our approach to
As opposed to the prototypical nature of the previous section, here we aim at a fully
The availability of Wikidata’s database reports web page,42
For instance, our example item-requires statement constraint on FIFA player ID is reported at
As a summary of these database reports, Fig. 14 shows the development of property constraints over time for the 10 most violated constraints: according to the Wikidata database reports web page, we observed that since the introduction of Wikidata property constraints in 2015, the total number of constraints has grown from 19 in 2015 to 32 in 2023; new constraints were created, evolved, or ceased to exist. Data in Fig. 14 point to an increase in the number of violations for the

#violations for top 10 most violated constraint types (logarithmic scale).
Rather, we aim at proposing to rethink the
as for (i), Wikidata as an RDF graph can be queried through a SPARQL query service – by expressing constraint violations per constraint type as SPARQL queries, we can benefit from the query language’s operationalizable nature, and various existing SPARQL implementations, that scale to billions of triples.
as for (ii), SPARQL itself is a declarative language, with well-understood theoretical properties, and mappable to other logical languages, such as Datalog [8,49,51].

SPARQL queries general template with exemplification.
In this section, we describe the overall structure of our Wikidata constraint validation approach using SPARQL queries. We again illustrate it via our running example from Fig. 8.
Figure 15a presents the generic structure followed by each SPARQL query proposed in this paper. We generalize the queries into different “blocks” which fulfill different functions, such that each block can contain multiple triple patterns, as exemplified in Fig. 15b. Figure 15b represents the concrete query for the
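As a textual approximation of Fig. 15a, the block structure can be sketched as a commented query skeleton (the block names are ours; P2302 is the “property constraint” property, and P2303 is assumed here to denote the exception qualifier):

```sparql
SELECT ?entity ?value WHERE {
  # (1) constraint-definition block:
  #     read the constraint statement (p:P2302/ps:P2302) of the constrained
  #     property together with its parameterizing qualifiers (pq:...)
  # (2) data block:
  #     match the claims of the constrained property on candidate entities
  # (3) exception block:
  #     FILTER NOT EXISTS over items marked as exceptions (P2303)
  # (4) violation-condition block:
  #     FILTER / FILTER NOT EXISTS encoding the semantics of the
  #     respective constraint type
}
```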
We have encoded all 32 constraint types (some of which are in separate queries for different variations) in SPARQL queries, following similar patterns corresponding to
Again, we note that apart from Wikidata itself, there is an increasing number of other Wikibase instances listed in the Wikibase Registry44
We now show how to formally express the constraint types that could not be directly represented with SHACL-Core in SPARQL.

Experiments
We designed an experiment to evaluate the semantics of our SPARQL queries, verifying our approach against the Wikidata Database reports: we compared the violations obtained by our queries with the violations published in the Wikidata Database reports (cf. Footnote 5). Unlike DBpedia, where a version of the KG is pragmatically generated and made available every three months,46
In order to still ensure comparability of results as far as possible, the conducted experiment was designed on a sample of constraint violations collected according to the following steps:
We identified the top-5 most violated constraint types from Wikidata’s violation statistics table on December 16, 2022:
We ranked the associated properties in descending order of the number of violations for each of these constraint types.
We executed our SPARQL queries to collect the violations of five different properties for the five constraint types, totaling 25 violation sets available in our GitHub repository.47
The ad-hoc violation checking system used in Wikidata takes about a day to execute and publish results; thus, our queries were executed one day before the data became available. Consequently, we extracted the set of corresponding violations published by the Wikidata portal for the same properties on the next day. For instance, the
Finally, we structured and compared the violations reported by the Wikidata Database reports with the violations retrieved by the SPARQL queries on the Wikidata endpoint.
As the queries were executed on the SPARQL endpoint and our target was the properties with the highest numbers of violations, we also had to consider timeout-related issues due to limitations of the Wikidata environment itself: owing to the high number of triples associated with some of the targeted properties, the 60-second query limit imposed by Wikidata’s SPARQL endpoint is not enough to process the entire target set. Therefore, it was necessary to discard the target properties that timed out and proceed with the property with the next highest number of violations (in steps 2+3 above), to arrive at 5 properties for each of the 5 chosen constraint types. Note that, in order to have a reasonable basis for comparison, the SPARQL endpoint is the only option at the moment, since the database reports are computed on this state of the KG. In future work, we intend to create a benchmark to facilitate the testing of different approaches to collecting violations, including testing of other engines and environments; more on that in the related and future work sections below.
In the next subsections, we provide a table for every constraint type containing the list of properties analyzed (Property ID), the total number of violations the Wikidata database reports claimed to have found (# of violations), the total number of violations made available by the database reports on the specific HTML pages for each property, the number of violations available (
One-of constraint
The first results concern the
One-of constraint violations
For
(Note: it is necessary to be logged in to Wikidata to see violations in the UI.)
The four violations not captured by our approach for the property
Lastly, in
For
Item requires statement constraint violations
In Table 4, note that for the top 3 properties (P1559, P1976, and P2539), our approach found all the available violations and some extra violations that unfortunately cannot be compared because the results available in the Wikidata database reports are incomplete. For
The 90 violations not found by our SPARQL query for
The statistics table of the Wikidata database reports points to the Single-value constraint as the third most violated constraint type. We notice that this statistic takes into account
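In its simplest form — assuming Q19474404 as the single-value constraint type item, and leaving aside separator qualifiers, exceptions, and the best-rank variant — this check can be sketched as a single aggregation query:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Items with more than one statement for a property that declares
# a single-value constraint (simplified sketch).
SELECT ?item ?p (COUNT(?stmt) AS ?n) WHERE {
  ?propEntity p:P2302/ps:P2302 wd:Q19474404 .
  ?propEntity wikibase:claim ?p .
  ?item ?p ?stmt .
}
GROUP BY ?item ?p
HAVING (COUNT(?stmt) > 1)
```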
Single value/best single value constraints violations
In
The analysis of
The occurrence of properties from the astronomy domain, such as
The required qualifier constraint follows the same principle described for IRS: the same property can have multiple instances of the required qualifier constraint, each requiring a different property to be used as a qualifier for a given statement. For this constraint type, which again is very common, three properties were skipped due to timeouts on the Wikidata SPARQL endpoint, and the properties with the next highest violation rates were selected instead. The results are available in Table 6, showing that the whole set of available violations (VA) was found by our SPARQL approach (OV).
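A simplified sketch of this check — assuming Q21510856 as the required qualifier constraint type item and P2306 as the qualifier naming the required property, and again ignoring exceptions and ranks — reads as follows:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Statements on a constrained property that lack the required qualifier.
SELECT ?item ?stmt ?required WHERE {
  ?propEntity p:P2302 ?constraint .
  ?constraint ps:P2302 wd:Q21510856 ;   # required qualifier constraint
              pq:P2306 ?required .      # the property required as a qualifier
  ?propEntity wikibase:claim ?p .
  ?item ?p ?stmt .
  ?required wikibase:qualifier ?pq .    # pq: predicate of the required property
  FILTER NOT EXISTS { ?stmt ?pq ?anyValue . }
}
```

Note that one statement may violate several such constraint instances at once, yielding one result row per missing required qualifier.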
Required qualifiers constraint violations
Value requires statement constraint violations
Finally,
For
In summary, common reasons for mismatches include – as a matter of interpretation – whether only truthy statements or also non-preferred and deprecated statements should be checked for constraint violations. Other deviations could arguably also be identified as a matter of interpretation. As we discussed, our constraints could be adapted to the respective different interpretations relatively easily with minor modifications of our query patterns. Overall, while we only conducted these analyses on a sample, we argue that the experiment has confirmed our opinion that a declarative and adaptable formulation of Wikidata property constraints in terms of SPARQL queries is feasible and could add to the clarification of the constraints’ actual semantics. The deviations between the Wikidata UI pages and the Wikidata database reports confirm our opinion that such clarification is dearly needed.
Regarding the practical feasibility of the proposed approach, it is important to acknowledge the generic nature of the SPARQL queries proposed in this study. The absence of hard-coded parameters ensures adaptability to diverse constraint parameters, with queries exclusively relying on the Wikidata data model to match the triple patterns that generate specific constraint violations. This flexibility enables generic queries to check violations across all properties instantiating a specific constraint type. On the downside, as we have discussed, such generic queries potentially lead to scalability problems for current SPARQL engines. As a practical workaround, constraints can also be checked by instantiating our generic queries per property, mitigating scalability challenges to a certain extent: in the context of the Wikidata ecosystem, the envisaged solution is therefore promising for deployment as a background process capable of processing batches of properties. By doing so, the system can systematically uncover candidate inconsistencies, according to the available resources and relevance of properties.
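Such per-property instantiation amounts to fixing the otherwise unbound property variable, e.g. via a VALUES clause; the following sketch restricts the generic one-of check from above to a single, hypothetical property wd:P1234 (any batch of properties could be listed instead):

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?item ?value WHERE {
  VALUES ?propEntity { wd:P1234 }          # batch of one; extend as needed
  ?propEntity p:P2302 ?constraint .
  ?constraint ps:P2302 wd:Q21510859 .      # one-of constraint
  ?propEntity wikibase:claim ?p ;
              wikibase:statementProperty ?ps .
  ?item ?p ?stmt .
  ?stmt ?ps ?value .
  FILTER NOT EXISTS { ?constraint pq:P2305 ?value . }
}
```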
Related work
Constraints play an important role in specifying rules for data, defining the requirements to prevent it from becoming corrupt, and ensuring its integrity. There has been significant research on the development of constraint representations and validation techniques specifically for knowledge graphs.
Constraint languages for graph data
RDF has long served as the W3C-recommended graph-based data model for presenting information in the Semantic Web, whereas a standardized language to express and validate data graphs has only recently been introduced. Ontology languages like RDFS, OWL, and its sublanguages, which have been standardized along with RDF, have been widely used for modeling the data through axiomatic structures. For instance, DBpedia, like other open knowledge graphs (e.g., YAGO, GeoNames), makes use of ontologies to model the data, which have been employed also for detecting (some) inconsistencies (e.g., [11,48]). However, ontologies have been particularly criticized for their limited use when checking the conformance of data graphs. Indeed, the primary utility of ontologies lies in facilitating deductive reasoning tasks, such as node classification or evaluating overall satisfiability, and not in describing constraints on KGs. With the growing emphasis on data accuracy for graph-based applications, the absence of constraint languages similar to those found in relational [3] and semi-structured data [2] contexts became noticeable. To address this gap, multiple strategies have emerged. Hogan [38] used rule-based fragments of OWL/RDFS for scalable inconsistency identification and repair suggestions. The idea of using scalable variants of bespoke Datalog-based reasoning for constraint checking and verification, originally proposed in Hogan’s thesis, may be argued to be not unlike our approach: SPARQL has been shown to be equally expressive as non-recursive Datalog with negation [8], where features like property paths only mildly add harmless, linear recursion [51]. Another line of research regards extending ontology languages to treat axioms as integrity constraints under the closed-world assumption [45,58].
In particular, to address the lack of dedicated constraint languages for graph data, novel schema formalisms for RDF graph validation like the Shape Expressions language (ShEx) [13,29,56,57] were proposed before SHACL became a W3C recommendation. ShEx is a formal modeling and validation language for RDF data, which allows for the declaration of expected properties, cardinalities, and the type and structure of their objects. ShEx is closely related to SHACL, and in some cases, it is possible to translate SHACL shapes into ShEx shape expressions since their expressiveness is similar for common cases [29]. For instance, the shapes graph presented in Fig. 10 can be respectively represented in ShEx as follows:
Yet, we leave a full discussion about whether our approach, and therefore all existing constraint types transfer over to ShEx as an open question to future work. Additionally, validation languages based on ShEx supporting the Wikibase data model have been recently proposed in the literature [28], however, they still lack support for many Wikibase constructs and there is no operational validator yet.
Furthermore, SPARQL-based approaches to validate knowledge graphs can also be found in the literature [42,59], including the SPARQL Inferencing Notation (SPIN) 51
According to Corman et al.’s translation,
Unfortunately, this query currently times out on Wikidata’s SPARQL endpoint, mainly due to the nested negation resulting from a modular translation. The
In addition, not all Wikidata constraints could be directly represented in SHACL-Core. We therefore had to devise specific SPARQL queries for each of the 32 Wikidata constraint types to generate viable and functional solutions.
Data restrictions within Wikidata are also discussed by the community and implemented through further projects using other pre-established technologies. For instance, the
meaning that the absence of a “mother” does not lead to inconsistencies, which indicates that the objective of such a schema is rather to assist in the “design” of classes than to perform constraint checking in the strict sense. Also, although there are some ShEx to SHACL conversion tools,58
Erxleben et al. [24] exploit properties describing taxonomic relations in Wikidata to extract an OWL ontology from Wikidata. The authors also propose the extraction of schematic information from property constraints and discuss their expressibility in terms of OWL axioms. However, whereas we focus herein concretely on covering all property constraints as a means to find possible violations in the data, Erxleben and colleagues rather stress the value of their corresponding OWL ontology as a (declarative) high-level description of the data, without claiming complete coverage of all Wikidata property constraints.
Martin and Patel-Schneider [44] discuss the representation of Wikidata property constraints through multi-attributed relational structures (MARS), as a logical framework for Wikidata. Constraints are represented in MARS using extended multi-attributed predicate logic (eMAPL), providing a logical characterization for constraints. Despite covering 26 different constraint types, to the best of our knowledge, the authors have not performed experiments to evaluate the accuracy of the proposed formalization, nor its efficiency, and do not discuss implementability. In fact, the theoretical framework partially skips over the subtleties of checking certain constraints in practice. As an example, the translation of cf.
Abián et al. [1] propose a definition of contemporary constraint that was indeed later adopted by Wikidata property constraints. Shenoy et al. [55] present a quality analysis of Wikidata focusing on correctness, checking for weak statements under three main indicators: constraint violation, community agreement, and deprecation. The premise is that a statement receives a low-quality score when it violates some constraint, highlighting the importance of constraints for KG refinement. Boneva et al. [12] present a tool for designing/editing shape constraints in SHACL and ShEx suggesting Wikidata as a potential use case, but – to the best of our knowledge – without exhaustively covering or discussing the existing Wikidata property constraints.
Apart from works specifically on constraints for Wikidata, in [48] the authors systematically identify errors in DBpedia, using the DOLCE ontology as background knowledge to find inconsistencies in the assertional axioms. They feed target information extracted from DBpedia and linked to the DOLCE ontology into a reasoner checking for inconsistencies. Earlier, Bischof et al. [11] had already highlighted logical inconsistencies in DBpedia which can be detected using OWL QL, rewritten to SPARQL 1.1 property paths – not unlike our general approach.
Despite the partially negative result that some of our SPARQL queries time out, and despite the fact that – as we discussed above – we did not find SHACL validators that would allow us to check our constraint violations at the scale of Wikidata, we believe that, besides our primary goal of clarifying Wikidata property constraint semantics, our results should be considered as a real-world challenge
As for SHACL, real-world performance benchmarks still seem to be rare. Schaffenrath et al. [53] have presented a benchmark consisting of 58 SHACL shapes over a graph of 1M N-quads sampled from a tourism knowledge graph, evaluated with different graph databases, emphasizing that “larger data exceeded […] available resources”. The shapes we present, on the one hand, target an (orders of magnitude) larger dataset, but on the other hand can also be evaluated locally on a per-entity level, thus providing a benchmark of a quite different nature. Also, the evolving nature of Wikidata makes for a dynamic/evolving benchmark that can be evaluated/scaled along with the natural evolution and growth of Wikidata itself. Next, [27] presents a synthetic SHACL benchmark derived from the famous LUBM ontology benchmark, while also emphasizing the current lack of real-world benchmarks for SHACL.
Closest to our own work, but orthogonal in focus, is a recent paper by Rabbani et al. [52], which addresses the problem of automatically
Finally, apart from serving as a basis for novel benchmarks for SHACL, our SPARQL formalization particularly extends, and in our opinion complements, the existing landscape of real-world SPARQL benchmarks. Indeed, the – to the best of our knowledge – only benchmark for Wikidata, WDbench [6], covers a significantly different kind of Wikidata queries than we do. WDbench is a benchmark extracted from Wikidata query logs, focusing on queries that time out on the regular Wikidata query endpoint, but it is restricted to queries on truthy statements only, that is, for instance, not covering queries on qualifiers. Our queries, on the contrary, by definition all relate to querying qualifiers and, as such, require the whole Wikidata graph and cannot be answered on the truthy statements alone. Yet, similar to the WDbench queries, many of the queries we present, particularly on very common properties, suffer from timeouts, as our experiments confirm. Thus, while the approach we present proves feasible in principle, it calls for novel, more scalable approaches to efficiently solve such SPARQL queries that currently time out. As the queries we focus on typically only affect local contexts of entities and properties, we hope they could be solved, e.g., by clever modularisation and partitioning techniques.
Conclusions and future work
We have formalized all 32 different property constraint types of Wikidata using SPARQL and discussed ways to encode them with W3C’s recommended mechanism for formalizing constraints over RDF Knowledge Graphs, SHACL. This study made it possible to clarify to which extent SHACL-Core can represent community-defined constraints of a widely used real-world KG. One of our results is a collection of practical SHACL-Core constraints that can be used on a large and growing real-world dataset. Indeed, the non-availability of practical SHACL performance benchmarks has already been emphasized by [27], and we believe our work could be a significant step forward towards leveraging Wikidata as a large benchmark dataset for SHACL validators. Other results include clarifications of heretofore uncertain issues, such as the representability of permitted entities and exceptions in Wikidata property constraints within SHACL [55]. We could also argue the non-expressibility of certain Wikidata constraints, due to the impossibility of comparing values obtained through different paths matching the same regular path expression within SHACL-Core.
As we could show, all these issues could be addressed when using SPARQL to formalize and validate constraints, where all 32 constraints could in principle be formalized. In this context, as a partially negative result, one of the main limitations of the work lay in the performance limitations of Wikidata’s query endpoint, which calls for more scalable query interfaces and bespoke evaluation mechanisms. On the positive side, these limitations give rise to further research considering property constraint violation detection as a SPARQL performance benchmark as such. As a first next step in this direction, we aim to compare our results from the Wikidata SPARQL endpoint with a local installation, comparing different graph databases or lightweight query approaches such as HDT [25] to support a queryable version of Wikidata constraint checks independent of the SPARQL endpoint, which, for reasons of immediate comparison with the current Wikidata violation reports, was beyond the scope of the present paper.
Wikidata property constraints are dynamically evolving and maintained by the community, as shown by new constraint types such as
In future work, we plan to use and build on the results of this paper to further systematically collect and analyze the kinds of constraint violations in Wikidata and study their patterns as well as their evolution over time. Understanding data that violates the constraints and its evolution is fundamental to identifying modeling or other systematic data quality issues and proposing further refinements, but also repairs, especially in collaboratively and dynamically created KGs such as Wikidata. Proposing refinements is a process that can be envisioned when taking into account the repair information declaratively represented in and retrievable through operationalizable constraints.
We have established SPARQL as a declarative and operationalizable means to implement Wikidata’s property constraints and also briefly discussed its relationship to other potential formalisms, such as Datalog and Description Logics. In order to further clarify the exact formal properties of Wikidata’s property constraints, further research on a concise and bespoke formal language, e.g. in terms of extended DLs, which captures all and only the required features, would be an interesting route for further work; attempts such as MARS [44] already provide promising starting points in this direction.
