Abstract
Keywords
Introduction
Approaching its tenth birthday, Wikidata [66] is now the most prominent general-purpose user-contributed KG in research and industry [2]. By August 2022, Wikidata had nearly 100 million data items and more than 1.7 billion statements [72]. Besides being collaborative and multilingual, Wikidata has the unique ability to assign one or more sources to each statement [27]. According to its introduction, Wikidata is a secondary database that collects statements along with their provenance [73]. Providing provenance in Wikidata is called referencing. In Wikidata, “references are used to point to specific sources that back up the data provided in a statement” [40]. Figure 1 shows an example of referencing in Wikidata, where Albert Einstein’s sex or gender statement is backed by a reference.

An example of referencing in Wikidata for Albert Einstein’s sex or gender statement.
Linked Data quality is a multi-dimensional concept [14,23,27,32,42,43,53,62,76] including availability, completeness, etc., in which providing the source of facts is considered part of the trust-related dimensions.
Although some KGs, e.g. DBpedia [3], support referencing at the resource (item) level, Wikidata is the only open general-purpose KG that supports referencing at the statement (facts and claims about items) level. Wikidata has an active user community contributing to and refining its content.
There is no KG comparable to Wikidata in terms of size and topic coverage. Due to the large volume of data, evaluating the entire Wikidata over 40 metrics requires expensive hardware and considerable processing time. We use subsets of Wikidata to evaluate the assessment framework and the implemented tools. Along with facilitating the processing of Wikidata’s large volume, subsets provide a comparison platform to review differences in referencing quality scores across different thematic parts of Wikidata [12]. We use three topical subsets [11] and four random subsets of Wikidata of different sizes. Topical subsets allow us to analyze Wikidata referencing in multiple topics, while random subsets enable us to approximate the referencing quality of the entire Wikidata. Thus, by evaluating RQSS over Wikidata subsets, we provide a comprehensive statistical overview of Wikidata referencing quality.
This study is the most comprehensive evaluation of Wikidata references across different dimensions and complements previous subjective research [2,58]. Our contributions are (i) defining the first comprehensive referencing quality assessment framework for Linked Data based on the Wikidata data model, (ii) developing RQSS, a referencing quality scoring system that automatically monitors the referencing quality of Wikidata datasets, and (iii) providing statistical scores of the referencing quality of Wikidata subsets during the evaluation of RQSS. In Section 2, we review related work on data quality and state-of-the-art Wikidata reference quality assessments. Section 3 presents the referencing assessment framework, its dimensions, and metric definitions. Section 4 gives an overview of the implemented metrics and the structure of RQSS. In Section 5 we provide the evaluation results of RQSS over Wikidata topical and random subsets. Section 6 presents the limitations we faced during the study and the countermeasures we deployed to overcome them. In Section 7 we discuss the main points of the study and summarize the lessons learned during this research. Finally, in Section 8 we present our conclusions and discuss future work.
The research question and objectives require a thorough survey of the Linked Data quality criteria and the Wikidata referencing quality literature. Linked Data quality has been studied widely, but referencing quality in Linked Data is rarely investigated.
Data quality
Data quality is defined as “fitness for use” [49]. In the literature, the quality of data is considered a multidimensional concept. Wang and Strong [67] categorised data quality into four main categories, each consisting of one or more dimensions:
There are many studies on the quality of Linked Data. Zaveri et al. [76] provided the most comprehensive aggregation of data quality dimensions by surveying 21 data quality papers up to 2012. From this core set, they identified 23 data quality dimensions categorized into 6 categories. Färber et al. [27] extended the criteria of Wang and Strong [67] into 11 dimensions and 34 metrics and then evaluated five KGs: Freebase [24], Wikidata, YAGO [26], Cyc [30], and DBpedia [3]. The score of each metric in their evaluation is between 0 and 1. With this scoring system, users can assign a weight to each metric based on their quality priorities. Debattista et al. [23] examined nearly 3.7 billion triples from 37 Linked Data datasets. They used 27 metrics based on the Zaveri et al. survey. They also performed a Principal Component Analysis (PCA) over their evaluation results to find the minimum number of metrics that can inform users about the quality of Linked Data datasets. None of these studies comprehensively investigates referencing quality metrics in Wikidata.
Wikidata quality has been investigated broadly. Piscopo et al. [59] surveyed 28 papers on Wikidata quality, mostly published in 2017. They stated that trustworthiness needs to be investigated further in Wikidata. Shenoy et al. [63] proposed a framework to recognize low-quality statements in Wikidata. They created a historical dataset of removed Wikidata statements by computing the differences between 311 consecutive weekly dumps and applied the learned removal patterns to current statements to identify low-quality ones. Abian et al. [1] investigated the imbalances of Wikidata in gender, recency, and geographical data with respect to user needs. They used Wikipedia page-view information to infer user needs and applied them to random Wikidata items to find the gaps.
Trust and referencing
The ability to provide the provenance of data is placed under the trust-related dimensions of data quality.
Wikidata references quality
The studies on Wikidata referencing quality are few and limited. In its quality rules, Wikidata recommends that provided references should be relevant (i.e., directly applicable to and supportive of the content or context of the associated fact) and authoritative (i.e., deemed trustworthy, up-to-date, and free of bias) [74]. Piscopo et al. [58] examined the authoritativeness and relevance of Wikidata’s English external sources. They first evaluated a small set of sample references (<300 statements) through microtask crowdsourcing. The results of this sampling were then fed to a machine-learning algorithm that measured the relevance and authoritativeness of all English external sources. The final results showed that about 70% of Wikidata’s external sources are relevant and 80% are authoritative. This approach has recently been reproduced and extended on the Wikidata snapshot of 16 April 2021 [2]. The more recent study considered both English and non-English external sources; however, it is still limited to relevance and authoritativeness. Piscopo et al. [60] showed that Wikidata has a more diverse pool of external references (in terms of country of origin) than Wikipedia and that it benefits from external datasets (such as library catalogues). Curotto and Hogan [21] proposed an approach to index English Wikipedia references as a source for Wikidata statements; however, this proposal includes no plan to evaluate the quality of the indexed references.
Referencing quality assessment framework
A robust evaluation of data quality requires rigorous and formally defined criteria. Data quality criteria can be categorized into different dimensions based on measurement objectives. Although the definition of data quality criteria varies across contexts, e.g., Linked Data and structured data, the data quality dimensions themselves are consistent. Considering references as metadata, the data quality dimensions remain applicable, but appropriate reference-specific criteria should be defined for each dimension.
In this section, we select quality dimensions definable in the context of references and then define reference-specific quality metrics for each dimension. We base our dimension selection on the Zaveri et al. survey [76], which is, to the best of our knowledge, the most comprehensive collection of Linked Data quality metrics. At the beginning of each category and dimension, a brief survey of the Linked Data definition and metrics is provided. Then, the informal definition of the metrics is presented. The formal definitions, discussions, and additional considerations in computing the metrics can be found in Appendix A. Table 1 shows these dimensions with those that apply to references shown in bold.
Linked data quality categories and dimensions as collected in [76]. Categories and dimensions in bold are applicable to references and are defined in this report
In terms of computation, there are two types of metrics in this framework: objective and subjective. Subjective metrics cannot be computed without human intervention; we highlight such metrics with the label (Subjective).
(Accessibility).
This category includes dimensions related to the access and retrieval of data. There are five dimensions in this category: availability, licensing, interlinking, security, and performance [76]. In the context of referencing, only performance is not applicable.

According to Zaveri et al., “Availability of a dataset is the extent to which information (or some portion of it) is present, obtainable, and ready for use” [76]. Several metrics are defined for availability in terms of Linked Data. It can be measured via the accessibility of the server and the existence of SPARQL endpoints [27,42], the existence of RDF dumps [27,42], the uptime of URIs [27,42], and proper dereferencing of URIs (in-links, back-links, or forward-links) [23,27,42,43]. The suitability of data for consumers is another (subjective) metric considered in the literature [27,42]. In the context of references, we define the following metric for availability: The ratio of resolvable external URIs to the total number of external URIs.

“Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions” [76]. In datasets, the licensing criteria are the existence of a human-readable [23,43] or machine-readable license [23,27,43], permissions to use the dataset [29] (as cited in [76]), and indication of attribution [29] (as cited in [76]). In the context of references, we define the following metric for the licensing status of external URIs: The ratio of external URIs with a human- and/or machine-readable license to the total number of external URIs.

“Security is the extent to which access to data can be restricted and hence protected against its illegal alteration and misuse” [76]. Security is not covered as much as the other Accessibility dimensions. According to Zaveri et al. [76], Flemming’s study [29] is the only work that includes a definition for this dimension. While governmental or medical datasets often hold sensitive information accessed by numerous users, rendering them prime targets for potential attackers, Flemming’s tool lacks any metric to assess this aspect. Zaveri et al. (based on Wang and Strong [67]) mentioned secure access to data (e.g., via SSL or login credentials) and proprietary access to data as metrics of security. In the context of references, secure access to external URIs is important. An unsecured external link decreases trust in the provenance of data and enables security threats such as man-in-the-middle attacks [18]. Therefore, the following metric can be considered for security in the context of references: The ratio of external URIs that support TLS/SSL [65] connections to the total number of external URIs.

In Linked Data, “interlinking refers to the degree to which entities that represent the same concept are linked to each other, be it within or between two or more linked data sources” [76]. This dimension is measured by data network parameters such as interlinking degree, clustering coefficient, centrality, and sameAs chains [38]. In the context of references, we define the following metric for interlinking: The ratio of reference properties that are connected to another property in an external ontology to the total number of reference properties.

In Linked Data, the performance of a dataset deals with its degree of responsiveness to a high number of requests. According to Zaveri et al., “performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source the more efficiently a system can process data” [76]. The measures for evaluating this dimension are the usage of hash-URIs instead of slash-URIs [29] (as cited in [76]), low latency [14,23,29], high throughput [23], and scalability of a data source [29] (as cited in [76]). This dimension is not meaningful in the context of references.
(Availability).
(Availability of External URIs).
(Licensing).
(External URIs Domain Licensing).
(Security).
(Security of External URIs).
(Interlinking).
(Interlinking of Reference Properties).
(Performance).
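As an implementation sketch, the Availability and Security metrics above reduce to simple ratios once each external URI has been probed. The probe structure and function names below are illustrative assumptions, not RQSS’s actual API, and real probing would issue HTTP requests rather than use canned results:

```python
from dataclasses import dataclass

@dataclass
class UriProbe:
    uri: str
    resolvable: bool    # did an HTTP HEAD/GET request succeed?
    supports_tls: bool  # does the URI respond over HTTPS?

def availability_score(probes):
    """Ratio of resolvable external URIs to all external URIs."""
    if not probes:
        return 0.0
    return sum(p.resolvable for p in probes) / len(probes)

def security_score(probes):
    """Ratio of external URIs supporting TLS/SSL to all external URIs."""
    if not probes:
        return 0.0
    return sum(p.supports_tls for p in probes) / len(probes)

# Hypothetical probe results for four external reference URIs.
probes = [
    UriProbe("https://doi.org/10.1000/x", True, True),
    UriProbe("http://example.org/old-page", True, False),
    UriProbe("http://gone.example.net", False, False),
    UriProbe("https://viaf.org/viaf/1", True, True),
]
print(availability_score(probes))  # 0.75
print(security_score(probes))      # 0.5
```

Separating the probing step from the scoring step keeps the ratio computation deterministic and testable, independent of network conditions.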
(Intrinsic).
The intrinsic category contains dimensions that are independent of the user’s context. This category focuses on whether information correctly and compactly represents real-world data and whether the information is logically consistent in itself [76]. Dimensions that belong to this category are accuracy, consistency, and conciseness [76]. According to Zaveri et al., “Accuracy is defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of syntax errors. Accuracy is classified into (i) syntactic accuracy, which refers to the degree to which data values are close to its corresponding definition domain, and (ii) semantic accuracy, which refers to the degree to which data values represent the correctness of the values to the actual real-world values” [76]. Accuracy is an important aspect of data quality, as it is sometimes treated as a synonym of quality in the literature [27]. Bizer and Cyganiak [15] suggest outlier detection methods (e.g., distance-based, deviation-based, and distribution-based methods [50]) as metrics of accuracy. Checking the use of proper data types for literals and ensuring that literals abide by those data types is also used as a metric for accuracy [23,27,42]. In evaluating the quality of five open KGs, Färber et al. [27], based on Batini et al. [4], considered two syntactic metrics (syntactic validity of RDF documents and syntactic validity of literals) and one semantic metric (semantic validity of triples) for measuring accuracy. We use these three metrics in the context of references. The first is the ratio of statement nodes whose referencing metadata sub-graph matches the Wikidata data model to the total number of statement nodes; Figure 2 shows the Wikidata referencing data model.

The RDF model of Wikidata references, derived from [69].

The second is the ratio of reference literal values that match the Wikidata-specified literal rules to the total number of literals.
Figure 3 shows an example of a regular expression specified for a reference-specific property. The third is the ratio of reference triples that, based on their corresponding statement, exactly match a gold standard set of ⟨statement, references⟩ pairs to the total number of reference triples.

Combining the definitions of multiple studies, Zaveri et al. stated that a knowledge base is consistent if it is “free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms” [76]. Assessing this dimension depends on the knowledge inference methods (e.g., OWL or RDFS) used in the knowledge base. The rate of entities that are members of disjoint classes [23,27,42] is one of the common criteria for this dimension. Other common metrics for checking consistency in Linked Data are the usage of undefined classes [23,42], ontology hijacking [23,42], OWL inconsistencies [23,42], the extent of values’ compliance with the domain/range of data types [23,27], and the misuse of predicates [16]. In the context of references, consistency can be measured by three metrics: (i) use of consistent (reference-specific) predicates, (ii) compatibility of values with the domain and range of reference-specific properties, and (iii) compatibility of different references of an item/statement. The first is the ratio of reference properties specified to be used in reference triples to the total number of reference properties; in Wikidata, the set of reference-allowed properties can be fetched from the qualifiers of the property scope constraint. The second is the ratio of reference properties whose values are consistent with the ranges specified by Wikidata to the total number of reference properties; in Wikidata, the ranges of a property can be fetched from the qualifiers of the value-type constraint. The third is the ratio of multiple-referenced statements whose references are consistent with each other to the total number of multiple-referenced statements.
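The syntactic validity of reference literals can be sketched as a ratio of pattern matches. The property keys and regular expressions below are simplified stand-ins, not Wikidata’s actual format constraints (which are attached to properties as constraint qualifiers):

```python
import re

# Illustrative format constraints; real Wikidata patterns come from the
# format constraints declared on each property.
FORMAT_CONSTRAINTS = {
    "retrieved_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "item_id": re.compile(r"Q\d+"),
}

def literal_validity(literals):
    """literals: (property_key, value) pairs; score only checkable ones."""
    checked = [(p, v) for p, v in literals if p in FORMAT_CONSTRAINTS]
    if not checked:
        return 1.0  # vacuously valid when no literal has a known pattern
    valid = sum(bool(FORMAT_CONSTRAINTS[p].fullmatch(v)) for p, v in checked)
    return valid / len(checked)

sample = [("retrieved_date", "2021-04-16"),
          ("retrieved_date", "April 2021"),
          ("item_id", "Q36578")]
print(literal_validity(sample))  # 2 of 3 literals match their pattern
```

Using `fullmatch` rather than `match` ensures a literal must conform to the whole pattern, not merely begin with it.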
According to Zaveri et al., “conciseness refers to the redundancy of entities, be it at the schema or the data level. Conciseness is classified into (i) intensional conciseness (schema level) which refers to the case when the data does not contain redundant attributes and (ii) extensional conciseness (data level) which refers to the case when the data does not contain redundant objects” [76]. Redundancy at both the schema and instance levels is covered in the Mendes et al. [53] framework, and Debattista et al. [23] considered instance-level redundancy in their investigation of Linked Data. In the context of references, redundancy at the instance level is not considered a negative point in the quality of references, because different but equivalent references increase trust in the data. Note that redundancy at the instance level is different from exact duplication: exact duplication occurs when an entire triple is repeated in a dataset due to serialization errors; such duplications are rare and can be ignored. We consider redundancy at both the schema and instance levels. The existence of different predicates pointing to the same provenance information is the schema-based metric of conciseness. To illustrate conciseness at the instance level of references, we also provide a metric to measure reference sharing [11].
Reference sharing in the Wikidata data model: statement nodes 1, 2, and 3 are all derived from the same source.

The schema-level metric is the ratio of reference properties with another equivalent reference property to the total number of reference properties. The instance-level metric is the ratio of reference nodes that are shared by more than one statement to the total number of reference nodes; Figure 6 shows reference sharing in the Wikidata data model. (Accuracy).
(Syntactic Validity of Reference Triples).


(Consistency).
(Consistency of Reference Properties).


(Conciseness).

(Ratio of Reference Sharing).
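The reference-sharing ratio counts reference nodes attached to more than one statement. A minimal sketch, assuming the graph has already been flattened into (statement, reference node) edges (a hypothetical layout, not RQSS’s internal representation):

```python
from collections import Counter

def reference_sharing_ratio(statement_reference_edges):
    """Edges are (statement_node, reference_node) pairs from the graph."""
    usage = Counter(ref for _stmt, ref in statement_reference_edges)
    if not usage:
        return 0.0
    shared = sum(1 for count in usage.values() if count > 1)
    return shared / len(usage)

# Mirroring Figure 6: statements s1-s3 share one reference node.
edges = [("s1", "ref:a"), ("s2", "ref:a"), ("s3", "ref:a"), ("s4", "ref:b")]
print(reference_sharing_ratio(edges))  # 1 of 2 reference nodes is shared
```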
(Trust).
This category contains dimensions that illustrate the perceived trustworthiness of the dataset [76]. These dimensions are reputation, believability, verifiability, and objectivity [76]. In KGs, having references at different levels is a metric of trustworthiness [27]. When defining trustworthiness in the context of references, we emphasize external sources presented as references. Zaveri et al. defined reputation as “a judgment made by a user to determine the integrity of a data source” [76]. Reputation is the social aspect of trust in the Semantic Web [36]; thus, reputation criteria try to measure the opinions of users about datasets [5,35]. Investigating the opinions of users can be done explicitly through questionnaires and decentralized voting, as in Gil and Artz’s study [35]. On the other hand, implicit methods such as relying on page ranks can be used as a metric for reputation [5,35]. Golbeck and Hendler [36] proposed an algorithm for computing the reputation of objects considering the incoming links to the object. We use the following metric to measure the referencing reputation of the dataset: The average of the external URIs’ page ranks. Zaveri et al. define believability as “the degree to which the information is accepted to be correct, true, real and credible” [76]. Believability is sometimes considered a synonym for trustworthiness. Believability considers the data-consumer side in the trust category and is closely related to the reputation of the dataset [27]. Believability is a highly subjective dimension that requires acquiring the data users’ opinions [37,39]. However, there are objective metrics to measure believability, e.g., the use of trust ontologies in data [48] and clarifying the provenance of data [23,27]. In the context of references, we define the metric for the believability dimension based on whether references are added by humans or by machines.
The ratio of human-added reference triples to the total number of reference triples. Verifiability is defined as the “degree by which a data consumer can assess the correctness of a dataset” [76]. Verifiability indicates the possibility of verifying the correctness of the data [27]. A dataset is verifiable if there exist concrete means of assessing the correctness of its data. Therefore, providing the provenance of facts [23,27] and the use of digital signatures to sign RDF datasets [19] are suggested metrics for this dimension. Subjective methods, such as using unbiased trusted third-party evaluators, are also suggested in the literature [14]. In the context of references, the document type of a reference is the subject of measurement. We score sources (external or internal) based on their document type and define the metric as follows: The average of the type verifiability scores of the sources. The predefined document types, graded from high to low, are scholarly articles, well-known trusted knowledge bases, books and encyclopedic articles, and finally magazines and blog posts. Objectivity is defined as “the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial” [76]. Whereas believability focuses on the subject side (data consumer), objectivity considers the object side (data provider) of the dataset [27]. Verifiability has a direct impact on objectivity [54]. Bizer [14] considered three subjective criteria to measure objectivity: the neutrality of the publisher, confirmation of facts by various sources, and checking the bias of data. In the context of references, we define objectivity as the ratio of statements that have more than one provenance: The ratio of multiple-referenced statements (statements with more than one reference) to the total number of referenced statements. (Reputation).
(External URIs Reputation).
(Believability).
(Human-added References).
(Verifiability).
(Verifiable Type of References).
(Objectivity).
(Multiple References for Statements).
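The type-verifiability metric averages per-source grades. The grade table below follows the ordering stated in the text (scholarly articles highest, magazines and blog posts lowest), but the numeric values are illustrative assumptions, not the weights RQSS actually uses:

```python
# Illustrative grade table; the ordering follows the text, the numbers are
# assumptions for the sake of the example.
TYPE_SCORES = {
    "scholarly_article": 1.0,
    "trusted_knowledge_base": 0.75,
    "book_or_encyclopedia": 0.5,
    "magazine_or_blog": 0.25,
}

def verifiability_score(source_types):
    """Average type-verifiability score over all sources."""
    if not source_types:
        return 0.0
    return sum(TYPE_SCORES.get(t, 0.0) for t in source_types) / len(source_types)

sources = ["scholarly_article", "magazine_or_blog", "trusted_knowledge_base"]
print(verifiability_score(sources))  # (1.0 + 0.25 + 0.75) / 3
```

Unknown document types default to 0.0 here; a real implementation would need a policy for unclassifiable sources.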
(Dynamicity).
Dimensions of this category monitor the freshness and frequency of data updates [76]. According to Zaveri et al. [76], these dimensions are currency, volatility, and timeliness. Färber et al. [27] and Wang and Strong [67] considered dynamicity as the timeliness dimension in the contextual category; Bizer [14], however, considered it as the timeliness dimension in the intrinsic category. More recently, Ferradji et al. [28] measured currency, volatility, and timeliness in Wikidata. Measuring the dimensions of this category is based on date/time values. There are different properties in the context of references that capture the date/time of a reference; in PROV-O [52], properties such as prov:generatedAtTime serve this purpose. A SPARQL query service over the Wikidata edit history has also been described in the literature. According to Zaveri et al., “currency measures how promptly the data is updated” [76]. This dimension is usually measured by computing the distance between the latest modification time of the data and the observation time [53]. Sometimes the release time of the data is also included in the calculation [62]. Another way to measure currency is to consider the time it takes for a change corresponding to a known real-world event to be made to a dataset [76]. For example, the time Wikidata takes to update a wrestler’s statement after a new Olympic medal is a currency measurement. Using up-to-date references is very important in some cases, e.g., medical facts. In the context of references, currency can be measured via two metrics: the freshness of reference triples and the freshness of external URIs. The first is the average time elapsed since the last update of reference triples, relative to their total existence duration; the second is the average time elapsed since the last update of external URIs, relative to their total existence duration. According to Zaveri et al., “volatility refers to the frequency with which data varies in time” [76]. While currency focuses on the updates of data, volatility reports the frequency of change in the data. Volatility can give the user an expectation of the next update.
Volatility, alongside currency, can be a metric for the validity of data [76]. In the context of references, we define the volatility metric as the average of the frequency-of-update scores of external URIs. “Timeliness measures how up-to-date data is, relative to a specific task” [76]. This dimension is a combination of currency and volatility and indicates whether data is as up-to-date as it should be. Since the definition of timeliness is related to the task at hand, we define the metric as the fraction of the external URIs’ freshness score over their volatility.
(Freshness of Reference Triples).
(Freshness of External URIs).
(Volatility).
(Volatility of External URIs).
(Timeliness).
(Timeliness of External URIs).
(Contextual).
The contextual category includes dimensions that mostly depend on the context of the task at hand [76]. There is some variability in the literature as to which dimensions belong to this category; Färber et al. [27] considered timeliness and trustworthiness, along with relevancy, in this category. According to Zaveri et al. [76], completeness indicates the extent to which the dataset covers real-world structures and instances. It is an extensive dimension that contains several sub-categories in some sources, e.g., Furber et al. [32] and Mendes et al. [53], who considered completeness at the schema and data-instance levels. Zaveri et al. [76] provided a comprehensive definition, according to which, “completeness refers to the degree to which all required information is present in a particular dataset. In terms of Linked Data, completeness comprises the following aspects: (a) Schema completeness, the degree to which the classes and properties of an ontology are represented, thus can be called “ontology completeness”, (b) Property completeness, measure of the missing values for a specific property, (c) Population completeness is the percentage of all real-world objects of a particular type that are represented in the datasets and (d) Interlinking completeness has to be considered especially in Linked Data and refers to the degree to which instances in the dataset are interlinked” [76]. This definition reflects the criteria used to measure completeness in Linked Data: schema completeness, property completeness, population (data instance) completeness, and interlinking completeness. In the context of references, we provide metrics for schema, property, and population completeness.
Class Schema Completeness of References: The ratio of classes in the dataset with defined reference-specific properties at the schema level to the total number of classes. Property Schema Completeness of References: The ratio of properties in the dataset with defined reference-specific properties at the schema level to the total number of properties. Schema-based Property Completeness of References: The average completeness ratio of reference properties in the dataset relative to their schema-defined reference properties for each property; the completeness ratio of a given reference property represents the proportion of statements with its corresponding schema-defined property to the total number of referenced statements with that specific reference property. Property Completeness of References: The average completeness ratio of reference properties in the dataset relative to their corresponding fact classes at the instance level; this ratio indicates the proportion of referenced facts with a specific reference property to the total number of facts with the corresponding property at the instance level. Population Completeness of References: The ratio of referenced statements in the dataset drawn from a selected set of facts to the total number of statements with the same fact properties. According to Zaveri et al., “Amount-of-data refers to the quantity and volume of data that is appropriate for a particular task” [76]. In the context of Linked Data, this dimension represents the coverage of the dataset for a specific task and includes statistics on the number of entities, the number of properties, and the number of triples [76]. In the context of references, this dimension can include quantitative statistics of references. Beghaeiraveri et al. [11] provided a statistical review of six Wikidata subsets that is relevant to this dimension. They investigated the number of reference nodes, the total number of reference triples, the distribution of triples per reference node, the usage frequency of reference-specific properties, and the percentage of shared references.
For all of these concepts, we formally define a quantitative metric in the Amount-of-data dimension. In these metrics, having quantitative statistics and the distribution of scores helps users estimate the coverage of references. The ratio of distinct reference nodes to the total number of statements in the dataset indicates the richness of reference metadata in capturing diverse sources for facts. The ratio of distinct reference triples to the total number of statements in the dataset provides an overview of the referencing depth and richness in capturing multiple details for each fact. The complement of the ratio of distinct reference nodes to the total number of reference triples in the dataset represents the average number of triples associated with each reference node, indicating the level of detail in referencing. The ratio of distinct reference literals to the total number of reference triples in the dataset. Note that the Wikidata data model has three types of reference values: external sources, internal sources, and literals (Fig. 7).
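The Amount-of-data ratios can be computed in one pass over the reference triples. The triple layout and the `is_literal` predicate below are assumptions made for the toy example; in practice the literal/resource distinction comes from the RDF term type:

```python
def amount_of_data_metrics(n_statements, ref_triples, is_literal):
    """ref_triples: (reference_node, reference_property, value) triples."""
    triples = list(ref_triples)
    nodes = {node for node, _prop, _val in triples}
    n_literals = sum(1 for _n, _p, val in triples if is_literal(val))
    return {
        "nodes_per_statement": len(nodes) / n_statements,
        "triples_per_statement": len(triples) / n_statements,
        "triples_per_node": 1 - len(nodes) / len(triples),
        "literals_per_triple": n_literals / len(triples),
    }

triples = [
    ("ref:a", "statedIn", "src:Q1"),                  # internal source
    ("ref:a", "retrieved", "2021-04-16"),             # literal value
    ("ref:b", "referenceURL", "http://example.org"),  # external source
]
m = amount_of_data_metrics(2, triples,
                           lambda v: not v.startswith(("src:", "http")))
print(m["triples_per_node"])  # 1 - 2/3: ref:a carries two triples
```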
Different types of reference values in Wikidata.

According to Zaveri et al., “Relevancy refers to the provision of information which is in accordance with the task at hand and important to the users’ query” [76]. In Linked Data, relevancy metrics check the existence of meta-information attributes and the extent of using relevant external links. In the context of references, we define two metrics: The ratio of reference triples deemed relevant to their associated facts to the total number of reference triples, and the complement of the ratio of shared reference triples that are deemed irrelevant to their corresponding fact to the total number of fact-reference triples. (Completeness).
(Class/Property Schema Completeness of References).
(Schema-based Property Completeness of References).
(Property Completeness of References).
(Population Completeness of References (Subjective)).
(Amount-of-data).
(Ratio of Reference Nodes per Statement).
(Ratio of Reference Triples per Statement).
(Ratio of Reference Triples per Reference Node).
(Ratio of Reference Literals per Reference Triple).

(Relevance of Reference Triples (Subjective)).
(Relevance of Shared References (Subjective)).
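One of the completeness metrics above, property completeness of references, can be sketched as an average over per-reference-property ratios. The data layout (per-statement sets of reference properties and a schema map) is a hypothetical simplification, not the actual RQSS input format:

```python
def property_completeness(statements, schema):
    """statements: (fact_property, set_of_reference_properties_used) pairs;
    schema: fact_property -> set of reference properties defined for it."""
    scores = []
    for fact_prop, defined_refs in schema.items():
        refs_per_stmt = [refs for p, refs in statements if p == fact_prop]
        if not refs_per_stmt:
            continue
        for ref_prop in defined_refs:
            present = sum(ref_prop in refs for refs in refs_per_stmt)
            scores.append(present / len(refs_per_stmt))
    return sum(scores) / len(scores) if scores else 0.0

# Two P21 statements (one unreferenced) and one fully referenced P569 statement.
stmts = [("P21", {"statedIn"}), ("P21", set()), ("P569", {"referenceURL"})]
schema = {"P21": {"statedIn"}, "P569": {"referenceURL"}}
print(property_completeness(stmts, schema))  # (1/2 + 1/1) / 2
```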
(Representational).
Representational dimensions indicate the proper presentation and ease of understanding of data for the user. According to Zaveri et al. [76], in Linked Data these dimensions are representational-conciseness, representational-consistency, understandability, interpretability, and versatility. According to Zaveri et al., in the context of Linked Data, “representational-conciseness refers to the representation of the data which is compact and well-formatted on the one hand and clear and complete on the other hand” [76]. The literature measures this by keeping URIs short and free of SPARQL parameters [23,43] and by avoiding the use of RDF reification, containers, and collections [23,27,43]. As references are statements about statements, reification is inevitable [27]. However, short URIs in external sources can help machines process references. The corresponding metric is the average of the length scores of the external sources’ URLs, with higher scores given to shorter URLs. Consistency in representation refers to “the degree to which the format and structure of the information conform to previously returned information as well as data from other sources” [76]. Representational-consistency metrics assess the degree of using existing terms in the context [27] and established terms that are already used in the dataset [23]. In the context of referencing, although there is no standard vocabulary, there are well-known general ontologies, e.g., Dublin Core Metadata [68] and the W3C PROV-O [52]. In addition, some ontologies use their own specific properties for references, e.g., Genealogy. The corresponding metric is the complement of the ratio of distinct reference properties to the total number of reference triples; for accurate insight, diversity is measured based on the number and variety of reference properties used across all reference triples. Understandability deals with the readability and accessibility of data for humans. According to Zaveri et al., “understandability refers to the ease with which data can be comprehended, without ambiguity, and used by a human information consumer” [76].
Metrics for evaluating understandability in Linked Data look for the percentage of entities, classes and properties with human-readable metadata, e.g., using The ratio of reference properties in the dataset that have associated human-readable labels to the total number of distinct reference properties. The ratio of reference properties in the dataset that have associated human-readable descriptions to the total number of distinct reference properties. The average of the external source references reachability scores, with higher scores given to sources that are easy for human users to reach. According to Zaveri et al., “Interpretability refers to technical aspects of the data, that is, whether the information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer” [76]. Interpretable data increases reusability and facilitates integration with other datasets [76]. This dimension also considers technical aspects of data representation [27] and is a way to measure how easy it is for machines to explore the data. The interpretability criteria in Linked Data are using well-defined and unique identifiers across the dataset [14,23] and avoiding the usage of RDF blank nodes [23,27,43]. In the context of references, we define a metric based on avoiding blank node usage in references. The complement of the ratio of blank nodes in the union set of all reference nodes, reference properties, and objects in the dataset to the total number of elements in that union set. According to Zaveri et al., “Versatility refers to the availability of the data in an internationalized way, the availability of alternative representations of data and the provision of alternative access methods for a dataset” [76]. In Linked Data, versatility has metrics such as providing different serializations for data [23,27] and multilingualism [23,27,33]. In the context of references, multilingualism helps speakers of various languages verify the facts. 
Furthermore, facts about non-English cultures and languages require sources in those languages. The ratio of reference properties in the dataset that have associated labels in languages other than English to the total number of distinct reference properties. The ratio of reference properties in the dataset that have associated descriptions in languages other than English to the total number of distinct reference properties. The ratio of non-English sources, including both internal and external references, to the total number of non-literal sources in the dataset. The ratio of facts in the dataset that have at least one non-English source reference to the total number of facts. (Representational-conciseness).
(External Sources URL Length).
(Representational-consistency).
(Understandability).
(Human-readable labelling of Reference Properties).
(Human-readable Commenting of Reference Properties).
(Handy External Sources).
(Interpretability).
(Usage of Blank Nodes in References).
(Versatility).
(Multilingual labelling of Reference Properties).
(Multilingual Commenting of Reference Properties).
(Multilingual Sources).
(Multilingual Referenced Statements).
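The multilingual labelling metric (Multilingual labelling of Reference Properties) reduces to a simple ratio over the labels retrieved for each reference property. The sketch below is illustrative: the function name and the dictionary shape mapping property IDs to per-language labels are our own assumptions, not RQSS's actual interface.

```python
def multilingual_label_ratio(property_labels):
    """Ratio of reference properties that have at least one label in a
    language other than English to the total number of distinct reference
    properties. `property_labels` maps a property ID to its
    {language code: label} dictionary."""
    if not property_labels:
        return 0.0
    non_english = sum(
        1 for labels in property_labels.values()
        if any(lang != "en" for lang in labels)
    )
    return non_english / len(property_labels)
```

The same shape applies to the commenting variant of the metric by passing property descriptions instead of labels.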
Alternative metric categorizations
While Section 3.1 presents the metrics in the Zaveri et al. categorization (Table 1), the metrics can also be classified in alternative categorizations based on their novelty in the context of references and the part of the referencing they focus on. Table 2 shows the classification of all defined metrics based on the metric targets, i.e., the part of referencing on which the quality review is conducted. Table 3 separates our referencing quality metrics into three categories in terms of their coexistence with traditional Linked Data quality criteria. Note that the novel metrics still fit within traditional Linked Data dimensions and categories. For example, the Human-added References metric is a new metric that has not previously appeared among Linked Data quality criteria; however, as it investigates the believability of a reference to the users, it fits in the Believability dimension.
The classification of referencing quality assessment metrics based on the target of evaluation. Metrics in italic are subjective
The categorization of referencing quality assessment metrics based on their relation with traditional Linked Data criteria. Metrics in
The Referencing Quality Scoring System (RQSS) is a data quality assessment methodology [76] that aims to measure the referencing quality of Wikidata and other Wikibase-hosted datasets.4

Main components of RQSS and part of its data pipeline.
Full Wikidata dumps can be downloaded from
Due to the limitations of our available resources, we cannot apply RQSS to the whole of Wikidata, which currently has more than 100 GB of data containing 1.2 billion statements representing 100 million items. RQSS is used to compute the scores and present the graphical charts of three topical and four random Wikidata subsets. Through subsetting, we establish a comparison platform and gain valuable insight into the referencing quality in different topics and also Wikidata as a whole.
Subsetting overview
We extract three topical subsets corresponding to three Wikidata WikiProjects: Gene Wiki [17], Music, and Ships [11].6 Gene Wiki WikiProject: The Wikidata full JSON dump of 3 January 2022 can be downloaded from The script can be found in
Table 4 shows for each subset the number of items, statements, references, and statements that have at least one reference. We note that the referencing rate in random subsets is generally higher than in the topical subsets. We also observe that items are missing from each of the random subsets, i.e. none of the random subsets contains the expected number of items, but this rate is consistent across the four subsets. Wikidata item identifiers start with Q, followed by an incremental number. At the end of December 2021, the maximum Q-ID in Wikidata was 110,272,953. The random generator script is set to generate the given number of random Q-IDs (100K, 500K, or one million) between Q1 and Q110272953.9 The script can be found in
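The sampling described above can be sketched as follows. This is our reconstruction, not the actual script: the function name and the use of `random.sample` to obtain distinct identifiers are assumptions.

```python
import random

def generate_random_qids(count, max_qid=110272953, seed=None):
    """Sample `count` distinct random Wikidata Q-IDs between Q1 and
    Q<max_qid>, where the default max_qid is the highest Q-ID at the
    end of December 2021."""
    rng = random.Random(seed)
    # random.sample guarantees the sampled integers are distinct
    return [f"Q{n}" for n in rng.sample(range(1, max_qid + 1), count)]
```

Note that a sampled Q-ID may not correspond to an existing item (items can be deleted or redirected), which is consistent with the missing-item rate observed in the random subsets.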
Initial statistics of the Wikidata subsets: the number of items, statement nodes, reference nodes, and referenced statements (statements with at least one reference)
Table 5 shows the intersection between the random subsets, i.e., the number of overlapping items. Relative to the combined size of each pair of subsets, the amount of overlap is negligible. However, the uniformity of the referencing and missing-item rates across the four random subsets of different sizes reveals the need for a deeper look at the main classes of instances inside the subsets. We call this process finding Note that the pie chart belongs to December 2019 when Wikidata had about 71 million items. The script can be found in
The number of overlapping items in random subsets
Figure 9 shows the topic coverage of the four random subsets. All four subsets have a similar topic coverage. In all subsets, the majority belongs to the The lists of the distinct items in each random subset can be found in

Topic coverage of the four random subsets. Note that the colours are consistent across the four charts.
In this section, we analyse the quality scores obtained by running RQSS over topical and random subsets in detail metric by metric. We also evaluate the correctness of RQSS by matching the obtained results with the previous knowledge from Wikidata. During this evaluation, we will discuss valuable information from the data composition in Wikidata.
Availability: Availability of external URIs, licensing: External URIs domain licensing, and security: Security of external URIs
Table 6 shows the details of the availability, licensing and security of external URIs in each subset (Metrics 1, 2, and 3). To check the availability of external URIs, RQSS enforces a 10-second request time-out and a 60-second response time-out. For security, RQSS configures HTTP requests to verify TLS certificates. To check whether a license exists for URI domains, RQSS probes the HTML home page of the domain to find any trace of licensing terms.13 See the
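A minimal sketch of such availability and security checks using only the standard library is shown below. It is a simplification of RQSS's actual request handling: `urllib` accepts a single time-out rather than separate request and response time-outs, and the function and helper names are our own.

```python
import ssl
import urllib.error
import urllib.request

def check_external_uri(url, timeout=60):
    """Return (available, secure): whether `url` responds with a non-error
    status within the time-out, and whether its TLS certificate verifies."""
    context = ssl.create_default_context()  # verifies TLS certificates
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status < 400, url.startswith("https://")
    except urllib.error.URLError as err:
        if isinstance(err.reason, ssl.SSLCertVerificationError):
            return True, False  # reachable, but certificate verification failed
        return False, False     # unreachable within the time-out
    except OSError:
        return False, False

def dimension_score(flags):
    """Average a list of per-URI boolean results into a score in [0, 1]."""
    return sum(flags) / len(flags) if flags else 0.0
```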
RQSS results of availability of external URIs. (Availability), external URIs domain licensing (Licensing), and security of external URIs (Security)
Availability and security scores are high while licensing is low. Random subsets get better scores than topical subsets in general. The results of random subsets are similar due to their similar topic coverage. Between topical subsets, Gene Wiki has the highest, and Music has the lowest scores.
Table 7 shows the RQSS results for interlinking of reference properties (Metric 4). To check the interlinking, RQSS seeks the number of values for The query can be found in
RQSS results for interlinking of reference properties

The distribution of reference properties equivalents (between those with
RQSS results for reference triple syntax accuracy
Figure 11 shows the top three reference properties in terms of having literal values in each subset. External ID properties have the majority in all subsets except Ships. In Ships and the two 100K random subsets,
RQSS results for reference literal syntax accuracy

The top three reference properties with the highest percentage of literals in each subset.
RQSS results for consistency of reference properties
RQSS results for range consistency of reference triples
Similar to [11], we count all incoming connections to each reference node to see whether the reference node is used as a reference for more than one statement. Table 12 shows the ratio of reference sharing for each subset. As a factor of conciseness, reference sharing is a positive point. The ratio for random subsets is higher than for topical subsets. We believe this is because scholarly articles form the majority of the random subsets (as they do of Wikidata as a whole): many reference nodes whose value is an article are shared between all related items. Amongst topical subsets, Gene Wiki has the highest score, further evidence of bot activity in this subset. Column ‘Maximum’ in the table shows the highest number of incoming edges to a reference node. Column ‘Mean’ shows the average number of incoming edges. While the average number of incoming edges is 14, there are reference nodes shared between thousands of statements.
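Counting incoming connections can be sketched as a single pass over the dump: in Wikidata's RDF serialization, statement nodes link to reference nodes via `prov:wasDerivedFrom`. The line-oriented N-Triples parsing below is a simplification of a full RDF parse.

```python
from collections import Counter

DERIVED_FROM = "<http://www.w3.org/ns/prov#wasDerivedFrom>"

def reference_sharing(ntriples_lines):
    """Count incoming prov:wasDerivedFrom edges per reference node and
    return (per-node counts, ratio of reference nodes shared by more
    than one statement)."""
    incoming = Counter()
    for line in ntriples_lines:
        parts = line.split(None, 2)  # subject, predicate, "object ."
        if len(parts) == 3 and parts[1] == DERIVED_FROM:
            incoming[parts[2].rstrip(" .\n")] += 1
    shared = sum(1 for c in incoming.values() if c > 1)
    return incoming, (shared / len(incoming) if incoming else 0.0)
```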
RQSS results for reference sharing
We use Pydnsbl to check whether URI domains are among the public black-listed domains on the web.16
RQSS results for the reputation of external URIs (Pydnsbl)
In the absence of an effective solution to retrieve the revision history of Wikidata, RQSS reads the HTML history pages of items on the Wikidata website front end. Figure 12 shows the ‘View History’ tab of

‘View History’ tab of
RQSS results for human-added references. Computing Gene Wiki scores timed out after three unsuccessful attempts and more than 90 days of processing
Table 14 shows the number and the percentage of referenced items, the number of referenced facts (distinct properties used) of the referenced items, the score of the metric, and the number of fact properties for which no historical metadata is available. While the initial ⟨item, referenced statement property⟩ pairs were extracted quickly, the results for Gene Wiki were not available after three unsuccessful attempts and more than 90 days of processing, due to the huge number of external HTTP requests and the HTML rendering required. The scores vary between random and topical subsets. Due to the presence of active bots in the Gene Wiki WikiProject, such as Pathwaybot17
We retrieve all IRI-based reference node values from the subsets. For Q-ID values, we get the type of value from Wikidata on 21 August 2022. For external URI values, we only check if the URI belongs to our well-known datasets list obtained through the authors’ experience.19 The list of datasets can be found in
RQSS results for the type of sources
RQSS counts the number of reference nodes connected to each statement node via
RQSS results for having multiple references for statements

The distribution of references connected to statements (between statements with
RQSS results for fact-reference freshness. Computing Gene Wiki scores timed out after three unsuccessful attempts and more than 90 days of processing
RQSS results for freshness of external URIs
To compute Metric 19, RQSS uses the Ultimate Sitemap Parser Python package.20
RQSS results for class and property schema completeness in referencing
RQSS results for schema-based property completeness of references

The distribution of completeness ratios of the 193 schema-level ⟨fact property, reference property⟩ (
RQSS results for property completeness of references

The distribution of completeness ratios ⟨fact property, reference property⟩ (
By extracting the number of statement nodes, reference nodes, reference triples and reference literals, RQSS computes the amount-of-data ratios. Besides that, RQSS retrieves the number of outgoing reference triples and outgoing literal values for each reference node. Figure 16 shows the scores of the four Amount-of-data metrics. Gene Wiki has the highest score in all metrics except Metric 25. Note that the definition of Metric 27 inverts the ratio and subtracts it from one to map the ratio into a number between 0 and 1. Figure 17 shows the distribution of triples and literals per reference node. The average number of triples per reference node in Gene Wiki is 3.5, higher than in the other subsets, as the Metric 27 score shows. The random subsets have nearly identical distributions over both ratios, and their metric scores and distributions are also very close to Gene Wiki’s, showing that Wikidata as a whole is in good condition concerning the amount of data.


The distribution of triples and literals per reference node. Red lines are medians and triangles are means. Outliers are ignored due to readability.
RQSS decodes the percent-encoding of each external URI and counts the number of characters. Table 22 shows the details of external URI lengths in each subset and the scores. There are no URIs longer than 2083 characters in any of the subsets. Music and Ships score better than Gene Wiki and the random subsets. The results show an inverse relation between referencing URI lengths and the activity of bots.
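The length scoring can be sketched as below. The linear mapping against the common 2083-character limit is our illustrative assumption, not necessarily RQSS's exact formula.

```python
from urllib.parse import unquote

MAX_URL_LENGTH = 2083  # a widely observed practical limit on URL length

def url_length_score(url, max_length=MAX_URL_LENGTH):
    """Score an external-source URL between 0 and 1 after decoding its
    percent-encoding; shorter URLs receive higher scores."""
    return max(0.0, 1.0 - len(unquote(url)) / max_length)
```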
RQSS results for URI length of external sources
Table 23 shows the results for reference property diversity. The scores of all subsets are higher than 0.9. Smaller random subsets have lower scores: their property diversity is close to that of the larger subsets, because random selection yields a broad variety of statements, while their number of reference triples is much smaller. Figure 18 shows the top five properties with the highest frequency of use in each subset. The frequency of property usage in topical subsets is similar to [11] and shows that sources in Music and Ships are more internal (Wikimedia-based projects). The distribution of frequency and type of properties in random subsets is similar. Apart from
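The diversity metric, defined as the complement of the ratio of distinct reference properties to the total number of reference triples, reduces to a few lines. The property IDs in the test are real Wikidata reference properties (P248 “stated in”, P813 “retrieved”, P854 “reference URL”), used here only as illustration.

```python
def reference_property_diversity(reference_properties):
    """Complement of the ratio of distinct reference properties to the
    total number of reference triples (one property occurrence per
    triple). Heavy reuse of a few properties yields a score near 1."""
    if not reference_properties:
        return 0.0
    return 1.0 - len(set(reference_properties)) / len(reference_properties)
```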
RQSS results for the diversity of reference properties

Five properties with the highest frequency of use in each subset.
RQSS results for human-readable labelling and commenting of reference properties

The distribution of the number of labels and comments in reference properties. Red lines are medians, triangles are means, and circles are outliers.
RQSS results for handy external sources

The share (percent) of different handy external source types.
RQSS checks the number of blank nodes amongst reference nodes and reference value nodes (Fig. 2). Table 26 shows the number of nodes in each reification part, the number of blank nodes, and the scores. The results show a very small number of blank nodes, occurring only in reference values. Note that the ‘Value Nodes’ column is the
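The blank-node metric (Usage of Blank Nodes in References) can be sketched as below, identifying blank nodes by the N-Triples `_:` prefix. The triple-tuple input format is our simplifying assumption.

```python
def blank_node_score(reference_triples):
    """Complement of the ratio of blank nodes in the union of reference
    nodes, reference properties, and reference objects to the total
    number of elements in that union."""
    elements = set()
    for subj, pred, obj in reference_triples:
        elements.update((subj, pred, obj))
    if not elements:
        return 1.0
    blanks = sum(1 for e in elements if e.startswith("_:"))
    return 1.0 - blanks / len(elements)
```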
RQSS results for blank nodes in referencing reification
RQSS results for multilingual labelling and commenting of reference properties

The distribution of the number of non-English labels and comments in reference properties. Red lines are medians, triangles are means, and circles are outliers.
RQSS results for multilingual internal/external sources

Five most frequent non-English languages used in sources.
RQSS results for multilingual referenced statements
In addition to the statistical analytics and referencing scores, this comprehensive and in-depth study of Wikidata references brings several challenges, the solution of which requires novel techniques. The first and most important is querying the massive size of Wikidata. The public SPARQL endpoint is neither intended, nor suitable, for performing quality tests. Storing, processing and querying the 100 GB Wikidata dumps is beyond most computing resources available to researchers. Aiming to establish a local SPARQL endpoint on a full Wikidata dump, we were not able to deploy the Wikibase Docker containers due to the lack of root privileges (i.e. requisite administrative permissions for installing applications and running commands) and sufficient hardware resources, especially permanent storage space on our server.22 The Wikibase Docker image can be found in
The size problem and technical limitations with Wikibase Docker (lack of root privileges and sufficient resources) meant that we had to query a lot of metadata (e.g. languages of sources in Metric 39 or equivalence of reference properties in Metric 4) directly from the Wikidata public endpoint. This is not good practice, as seven months elapsed between our data dump and the date of the experiment. The best practice would be to include all metadata in the subsets, or to index the 03 January 2022 full dump in a local triplestore and query it. The first solution is not possible with current subsetting tools. The second solution, however, requires expensive infrastructure.23 A Google Cloud computation engine with sufficient resources would cost more than $571 per month. Estimated by Google Cloud Pricing Calculator:
The lack of a permanent and easy access method to the Wikidata revision history impacted this study. Our approach utilised the HTML history web pages, which are inaccurate due to missing information. Wikimedia revision dump files are more than 3 TB compressed, making them far harder to process locally than the Wikidata dumps. Accessing the revision history is required for any quality study, and establishing permanent ways to access the historical metadata is the data provider’s responsibility. In several metrics, we hypothesize that the variation in scores is related to the amount of bot versus human activity, but distinguishing bots from humans requires pattern recognition over activities, which in turn requires access to detailed revision metadata. The same is true for freshness and date-time metadata.
In several metrics where accessing accurate data is impossible, we use proxies. For example, in Metric 13, we use the concept of black-listed domains as the reputation proxy. This approach has limitations: as the number of black-listed domains is low, the metric returns unrealistically high scores. A better solution would be to have a ranking system for Wikidata’s external sources individually. A ranking algorithm can update the visits of external sources periodically and deliver better insight into the reputation of external sources.
The problem of subjective metrics is another matter of importance. One of these metrics is relevancy. The high relevance of references can increase the quality score of other objective metrics. In subsets such as Ships, many reference values are Wikidata ship instance items that are relevant to the statement they reference, but good referencing practice would be to link to external sources to verify the data [74]. For example, the claim for the power of a nuclear ship engine should refer to governmental documentation, encyclopedia articles, or military magazines, not an item within Wikidata. In such cases, we need an approach to distinguish non-relevant and non-sensible provenance values.
Despite the limitations discussed in Section 6, this research reveals important and promising results. The findings of this study provide a resounding affirmative to the question “can the quality of referencing in Wikidata be assessed effectively by relying on the Linked Data quality definitions and metrics?”, by defining a framework consisting of 40 quality metrics across different data quality dimensions, coming both from the Linked Data quality literature and from novel definitions. The most important achievement of this research is that statistical analysis can identify data quality weaknesses in the context of referencing. The results revealed that while Wikidata exhibits high scores in areas like accuracy and security of references, there are opportunities for improvement in dimensions such as completeness, verifiability, objectivity, and multilingualism. For multilingualism, which is a flagship defining characteristic of Wikidata, our results indicate low performance. Our analysis critiques these scores and suggests the most efficient routes to improvement. Although low scores in criteria such as the completeness of referencing are expected (and hard to improve due to the data volume and rapid growth of Wikidata), in other dimensions, such as interlinking, the quality can be improved by treating a small amount of data, i.e., only reference properties. The quality scores also uncovered interrelationships between different quality dimensions. For example, we observed that the human-added ratio has a strong indirect effect on verifiability (verifiable type of sources) and a direct effect on objectivity (multiple references per fact). Another relationship was that having multiple references for facts affects multilingualism positively. The comprehensive review also gives us good insight into subjective versus quantitative criteria. 
Given the rapid advancements of Large Language Models (LLMs) and their capacity to access real-time data from the Web, an intriguing direction for future research is to explore the feasibility of delegating subjective criteria to LLMs. This approach could potentially alleviate the challenges associated with collecting human opinions at scale.
Another question that RQSS, as the main deliverable of this study, addresses is “to what extent is there a difference in the quality of references provided by humans and bots?”. Our initial hypothesis was that strong bot activity would lead to higher overall referencing quality scores. The research found this hypothesis to be wrong. While bots perform well in tasks such as adding new provenance metadata and adhering to schemas, they lag in dimensions such as using referencing-specific properties consistently, maintaining the freshness of references, representational conciseness, and providing multilingual sources. The human-added referencing ratio is lower in the random subsets than in the topical subsets except Gene Wiki, where the highly bot-active subset exhibited patterns similar to the random subsets in many metrics.
One of the primary lessons gleaned from this research is the importance of subsetting in assessing the quality of a KG. By examining both topical and random subsets in a unified comparison, our study illuminates the quality of referencing within specific Wikidata WikiProjects (such as Gene Wiki, Music, and Ships), which represent thematic aspects of the Wikidata knowledge base, alongside random subsets that reflect the entirety of the KG. This approach provides valuable insights into the referencing quality across different thematic areas and the whole Wikidata, and can be used in future quality assessments. Besides subsets, the framework can be deployed on other Wikidata projects such as Scholarly Articles, Astronomy, or Law, to allow maintainers and editors to identify weaknesses in the quality of references based on the scores. It can also be directly applied to other KGs hosted in Wikibase instances that follow the Wikidata model, e.g., the EU Knowledge Graph [25].
Conclusions
In this study, we investigated the referencing quality of a collaborative KG, Wikidata. We first defined a comprehensive framework for assessing referencing metadata based on previously defined Linked Data quality dimensions. We used the Wikidata data model to define formal referencing quality metrics. We implemented all objective metrics as the Referencing Quality Scoring System (RQSS) and then deployed RQSS over three topical and four random Wikidata subsets. We gathered valuable information on the referencing quality of Wikidata. RQSS scores show that Wikidata is strong in the accuracy, availability, security, and understandability of referencing, but relatively weak in the completeness, defined schemas, verifiability, objectivity and multilingualism of referencing. In more detail, in the accessibility category, Wikidata subsets have an average of 0.95 for availability and 0.92 for security, but 0.06 for licensing and 0.12 for interlinking. In the intrinsic category, the average score is 0.99 for accuracy, 0.56 for consistency and 0.65 for conciseness. In the trust category, the average score of the subsets is 0.99 for reputation, 0.5 for believability and 0.35 for verifiability, but only 0.02 for objectivity. In the currency category, the average is 0.94 for the freshness of fact-reference pairs but 0.09 for the freshness of external URIs. In the contextual category, the average schema completeness is less than 0.01, while the averages for schema-based and instance-based property completeness are 0.39 and 0.35 respectively, and the amount-of-data average is 0.34. In the representational category, the average of the subset scores is 0.88 for representational-conciseness, 0.99 for representational-consistency, 0.85 for understandability, 0.99 for interpretability, and 0.59 for versatility. 
RQSS reveals the interrelation between different referencing quality dimensions and highlights efficient ways to address the weaknesses in referencing quality in Wikidata, especially in reference properties.
The results show that several metrics return a score very close to 0 or 1 in all subsets. These metrics can be divided into three categories:
Metrics that return high scores in Wikidata random and topical subsets, but might behave differently in other non-Wikidata Wikibase-derived datasets. Syntactic Validity of Reference Triples, Usage of Blank Nodes in References, and the Labelling-Commenting metrics (both English and multilingual) belong to this category. In current Wikidata dumps, due to active maintenance, poor scores in such metrics are rare. However, these metrics are essential for the framework when end users assess a non-Wikidata, Wikibase-derived dataset or aim to find those rare inconsistencies.
Metrics that return low scores in Wikidata because the measured target is very recent. Schema-based metrics in the Completeness dimension belong to this category. The concept of EntitySchemas in Wikidata is recent compared with the KG's lifetime. Again, these metrics are required to monitor schema-based referencing quality in Wikidata and in other Wikibase-derived datasets.
The External URIs Reputation metric, which uses deny-listed URIs as a proxy to measure URL reputation (instead of using page ranks). Until a more reliable measurement is found, this metric can be ignored in referencing quality assessments, unless end users want to find those deny-listed URIs in order to achieve a 100% score.
Our evaluation faced multiple challenges: the large volume of the Wikidata dump and the lack of proper documentation for establishing local copies of the data, namely regarding the Docker images; the lack of a feasible approach to access the Wikidata revision history; and the impact of subjective quality issues on objective metrics. RQSS is the first reusable comprehensive referencing quality investigation and gives us valuable insight into referencing quality strengths and weaknesses. Adding support for subjective criteria in relevancy, authoritativeness and consistency, by deploying a combination of convolutional networks learned over human opinions, would further strengthen the RQSS framework. Another important future step is to overcome the challenges of massive data and historical metadata. Although RQSS can effectively calculate referencing quality scores, and the analysis of the scores provided valuable information about Wikidata, RQSS scores should be evaluated by human experts to ensure their usefulness. Finally, the RQSS assessment framework should be generalized to all RDF KGs. In the current version, RQSS and its assessment framework are based on the Wikidata data model. This means that the Python implementation and the formal definitions use Wikidata terminology, vocabulary, and the Wikidata RDF model. In addition, several pieces of metadata necessary for computing the metrics come directly from Wikidata, e.g., schemata and historical information. The good news is that the nature of the referencing quality metrics and dimensions can be reproduced for any other KG. In all KGs that support referencing, references must be available, complete, reputable, etc. Even the type of calculation can be generalized with few changes. For example, in the Amount-of-data dimension, for KGs in which references are bound to items rather than statements, one can compute the ratios per item instead of per statement. 
The current implementation can be applied to any Wikibase-derived dataset with minor changes in prefixes and namespaces. Generalizing RQSS for any RDF KG enables data quality researchers to compare provenance quality across different KGs.
