Abstract
Introduction
Galleries, Libraries, Archives and Museums (GLAM institutions) have traditionally provided access to digital collections. The wide range of material formats includes text, image, video, audio and maps.
As technologies have evolved over the years, GLAM organisations have adapted to the new environments in terms of new skills, service design or digital research [41]. Institutions have started to make their collections accessible for computational uses such as data science, Machine Learning, and Artificial Intelligence [33,34]. Recently, the Lab concept has emerged as a means to publish digital collections as datasets amenable to computational use as well as to identify innovative and creative ways of reusing them [32]. GLAM institutions are engaging users and researchers to conduct research on the digital collections.
The
Applying the LOD concepts to the digital collections provided by libraries has become highly popular in the research community. Many institutions have adopted Resource Description Framework (RDF) to describe their content. In addition, collaborative editing approaches have been proposed using Wikidata and Wikibase to highlight research opportunities in community-based collections, as well as community-owned infrastructure, to facilitate open scholarship practices [2,22]. The use of LOD enhances the discoverability and impact of digital collections by transforming isolated repositories (data silos) into valuable datasets that are connected to external repositories.
However, the use of Semantic Web technologies requires complex technical skills and professional knowledge in different fields, hindering their adoption. Many aspects must be taken into account such as the vocabulary to describe the resources, the identification of external repositories to create the links and the system to store the final dataset. As in the case of other types of structured data, LOD suffers from quality problems such as inaccuracy, inconsistency, and incompleteness, impeding its full potential of reuse and exploitation.
Bibliographic records published as LOD by libraries have gained value in contexts outside of the library domain in order to connect and reuse resources [19,26]. It is crucial to provide libraries with higher-quality and richer metadata for reuse in a linked data environment, not only to create contextual information, but also to facilitate the work of the library staff [3].
Recent studies have focused on the next generation of metadata in libraries to develop quality assurance practices [44]. Some approaches have assessed the quality of LOD using several methods and techniques [18,56]. A preliminary query-based approach assesses the quality of the LOD published by four relevant libraries [10]. Shape Expressions (ShEx) have emerged as a concise, formal, modelling and validation language for RDF structures, addressing the Semantic Web community’s need to ensure data quality for RDF graphs [36,47]. However, to the best of our knowledge, none of these previous approaches provides a systematic and reproducible method with a user-friendly syntax to assess the quality of LOD published by libraries using ShEx as a main component.
The objective of the present study was to introduce a systematic and reproducible approach to analyse the data quality of LOD published by libraries. The methodology was then applied to four LOD repositories issued by relevant institutions. The collection of ShEx schemas provided as a result of this study is publicly available and can be used to reproduce the results and extend the examples provided using new rules based on additional properties and vocabularies.
The main contributions of this paper are as follows: (a) a methodology to assess the quality of the LOD published by libraries using ShEx; (b) the results obtained after the quality assessment; and (c) the ShEx definitions to assess LOD published by libraries.
The paper is organised as follows: after a brief description of the state of the art in Section 2, Section 3 describes the methodology employed to evaluate LOD in libraries using ShEx. Section 4 shows the results of the methodology’s application. The paper concludes with an outline of the adopted methodology, and general guidelines on how to use the results and future work.
Related work
Background
The Semantic Web is a web of data that is machine-readable and includes a collection of technologies to describe and query the data, as well as to define standard vocabularies. Linked Data was introduced by Tim Berners-Lee [6] as an essential component of the Semantic Web to create relationships between datasets. Thus, the Resource Description Framework (RDF) [53] lies at the heart of the Semantic Web as it provides a standard model for data interchange on the Web and extends the Web’s linking structure by means of URIs. In addition, SPARQL provides a standardised query language for data represented as RDF in which a query can include a list of triple patterns, conjunctions, disjunctions, and optional patterns [54].
Libraries have traditionally provided the descriptive metadata of bibliographic records using standards such as MARC.
In this sense, several initiatives provide a more expressive and modern framework for bibliographic information based on Semantic Web technologies. Some examples include: Functional Requirements for Bibliographic Records (FRBR), the family of conceptual models [24], and Resource Description and Access (RDA) specification [40], the IFLA Library Reference Model (LRM) [25], the Bibliographic Ontology (BIBO) [15], the Bibliographic Framework Initiative (BIBFRAME) [31] and FRBRoo [52]. However, translating the old records into the new format is not an easy task [1], since libraries usually host large catalogues, including many types of resources that often require a manual revision to transform the data with accuracy.
Several major libraries (e.g., the OCLC, the British Library, the National Library of France), publishers, and library catalogue vendors have applied LOD to their catalogues in an effort to make these records more useful to users. For instance, the
LOD promotes cultural heritage discovery and access by providing a resource context through the linking of bibliographic catalogue records to external repositories such as Wikidata, GeoNames and the Virtual International Authority File (VIAF). GLAM institutions are increasingly embracing the value of contributing information to open knowledge and collaborative projects such as Wikidata. In this sense, many institutions have linked their collections to Wikidata by means of dedicated properties. For instance, the property
For researchers, data quality is a key factor when choosing a dataset for reuse [17,18]. Accordingly, several methods and tools have recently emerged that make it possible to assess the quality of datasets built using Semantic Web technologies. In addition, the research community has highlighted the need for reproducible research by providing articles, data and code [4].
The development of tools to support data validation has accelerated over the past few years [35]. SemQuire is a quality assessment tool for analysing quality aspects of a particular LOD dataset. It recommends a series of 55 intrinsic, representational, contextual and accessibility quality metrics [30]. Stardog Integrity Constraint Validation (ICV) allows users to write constraints that are translated to SPARQL in order to assess RDF triples in a repository [9]. DistQualityAssessment is an open source implementation of quality assessment of large RDF datasets using Apache Spark [42]. Luzzu is a platform that assesses Linked Data quality using a library of generic and user-provided domain specific quality metrics [16].
Shapes Constraint Language (SHACL) is a World Wide Web Consortium (W3C) specification for validating graph-based data against a set of conditions [55]. As a result of the validation process, SHACL provides a validation report described with the SHACL Validation Report Vocabulary that reports the conformance and the set of all validation results. It provides advanced features such as SHACL-SPARQL that can be used to express restrictions based on a SPARQL query.
ShEx enables RDF validation through the declaration of constraints on the RDF model [37]. ShEx schemas are defined using terms from RDF semantics such as node, which corresponds to an IRI, a blank node or a literal, and graph, which is a set of triples described as subject, predicate, object. ShEx enables the definition of node constraints to determine the set of a node’s allowed values, including their associated cardinalities and datatypes.
ShEx also enables the definition of constraints on the allowed neighbourhood of a node called Shape, in terms of the allowed triples that contain this node as subject or object. Listing 1 shows an example of ShEx to validate entities of type person described using FOAF.

A ShEx Shape to validate a person described using the FOAF ontology. Person1 matches PersonShape including the required properties
There are several implementations of ShEx including shex.js for Javascript,3
The international research community has become increasingly interested in applying and using ShEx for the validation of RDF data. One example is the description and validation of Fast Healthcare Interoperability Resources (FHIR) for RDF transformations by means of ShEx [45]. Moreover, ShEx is employed in several Wikidata projects to ensure data quality by developing quality-control pipelines [48]. ShEx is also used to facilitate the creation of RDF resources that are validated upon creation [49]. Another approach proposes a set of mappings that can be used to convert from XML Schema to ShEx [20].
While ShEx and SHACL behave similarly with simple examples, ShEx is more grammar-oriented (shapes look like grammar rules) and SHACL is more constraint-oriented. ShEx provides an abstract syntax that can be easily serialized to several formats. SHACL uses inference (e.g. checking
Comparison of data quality assessment tools for LOD
Table 1 contrasts all the tools mentioned above to assess LOD by using the following features: (i) published as open source; (ii) available for use or download; (iii) using a grammar-oriented and friendly syntax; and (iv) installation required to start using it.
The definition of LOD quality criteria has been attracting ever more interest. A LOD quality model specifies a set of quality characteristics and quality measures related to Linked Data, together with formulas to calculate measures [39]. A set of data quality criteria for analysing large-scale cross-domain LOD repositories provides 34 data quality dimensions grouped into 4 data quality categories [18].
With regard to libraries, a methodology for assessing the quality of linked data resources based on SPARQL query templates has been presented together with an extensive evaluation of five LOD datasets, including the BNE [29]. Another example is based on Europeana; it describes an approach for capturing multilinguality as part of data quality dimensions, including completeness, consistency and accessibility [11]. A new method and the validation results of several catalogues using MARC as a metadata format identifies the structural features of the records and most frequent issues [28]. Moreover, an extensible quality assessment framework which supports multiple metadata schemas describes the requirements that must be considered during the design of such software [27]. A previous computational analysis is based on art historical linked data to assess the authoritativeness of secondary sources recording artwork attributions [14].
A recent methodology provides the dimensions and data quality criteria to assess the LOD published by libraries (see Table 2) [10]. In particular, the dimension category includes the criterion
Let
Then we can define the metrics
In the case of an empty set of relation constraints (
However, in these previous works this criterion is used to assess only the properties
These efforts provide an extensive demonstration of how LOD can be assessed, specifying how each criterion can be evaluated. Nevertheless, to the best of our knowledge, no evaluation has been conducted regarding the consistency of statements with regard to LOD relation constraints published by libraries using ShEx.
The data quality criteria to assess LOD classified by category and dimension
This section introduces the methodology to assess the data quality of LOD published by libraries using ShEx. The procedure is described in Fig. 1 and is based on 3 steps, which are detailed in the following subsections: (i) identification of resources; (ii) definition of ShEx schema; and (iii) validation. The validation step’s output is a report describing the results of the evaluation.
Prior works to assess LOD are based on query-based methodologies that can be complex to reproduce for non-expert users. We used ShEx in this approach because: i) it provides a grammar-based language, similar to regular expressions, to define the rules with which to assess the data; ii) ShEx schemas can be reused to reproduce the results; and iii) ShEx schemas can be used as a starting point to be extended with additional classes and properties. In addition, ShEx can be used without installing any software.
Although the publication of LOD repositories has recently been on the rise, in some cases, and for a number of reasons, the URL is no longer available, making reuse difficult. In addition, their exploitation and analysis require specific knowledge about Semantic Web technologies. Nevertheless, promoting them by way of prototypes and reuse examples may help to lower the barriers to entry.
In addition to the publication of the LOD repository, metadata can be enriched using external repositories. This information can also be assessed in order to identify duplicates as well as to validate the number of external links.

Methodology for assessing the data quality of LOD repositories published by libraries using ShEx.
The identification of resources is a crucial step when analysing the elements and properties to be assessed by means of ShEx.
Publication workflows in libraries are becoming ever more complicated as metadata maintenance is a dynamic and evolving process [5]. In this sense, bibliographic information is stored as metadata using common entities (e.g. author, work, date). Metadata comes in an increasing number of options, including FRBR, BIBFRAME, RDA, Dublin Core (DC), schema.org, Europeana Data Model (EDM) and FRBRoo. In addition, the vocabulary used to describe the contents can be complex, as in the particular case of FRBR based vocabularies in which entities typed as Work follow a hierarchical organisation that includes several layers.
Main entities described by LOD vocabularies used by libraries to publish bibliographic information
Table 3 shows an overview of the main entities in LOD vocabularies used by libraries to publish their catalogues. The prefixes used to abbreviate RDF vocabularies can be found in the appendix (Table 11).
The resources are identified by means of the SPARQL query in Listing 2 that shows an example of how to retrieve the different classes stored in an RDF repository. Several classes can be used to type the same resource. For instance, a book can be typed as For an overview of URI patterns see,

A SPARQL query to retrieve the different classes stored in an RDF repository
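As a minimal sketch of this step, the query below lists the distinct classes used in a repository and shows how it could be encoded for an endpoint that follows the SPARQL protocol. The endpoint URL is a placeholder and the helper name `build_get_url` is hypothetical; neither is taken from the original listing.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Lists every distinct class used to type at least one resource.
CLASSES_QUERY = """
SELECT DISTINCT ?class
WHERE { ?resource a ?class }
ORDER BY ?class
"""

def build_get_url(endpoint: str, query: str) -> str:
    """Return the SPARQL-protocol GET URL for a query (no request is sent)."""
    return endpoint + "?" + urlencode({"query": query.strip()})

# Placeholder endpoint, used for illustration only.
url = build_get_url("https://example.org/sparql", CLASSES_QUERY)
print(url)
```

Building the URL without sending it keeps the sketch independent of any particular library endpoint; the same query text can be pasted into a SPARQL online editor.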
Once we extracted the main resources described in the repository and identified their type, we extracted the properties of each class using SPARQL queries. For instance, Listing 3 shows a SPARQL query to retrieve the different properties used by the class

A SPARQL query to retrieve the different properties used by the class
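The property extraction can be sketched in the same way. The function below builds a query that lists the distinct properties used by the instances of a given class; the function name is hypothetical and `foaf:Person` is chosen only as an illustrative class, not as the class used in the original listing.

```python
def properties_query(class_iri: str) -> str:
    """Build a SPARQL query listing the distinct properties used by instances of a class."""
    if not class_iri.startswith(("http://", "https://")):
        raise ValueError("expected an absolute IRI")
    return (
        "SELECT DISTINCT ?property\n"
        "WHERE {\n"
        f"  ?resource a <{class_iri}> ;\n"
        "            ?property ?value .\n"
        "}\n"
        "ORDER BY ?property"
    )

# foaf:Person is used here only as an illustrative class.
query = properties_query("http://xmlns.com/foaf/0.1/Person")
print(query)
```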
A collection of RDF triples can be assessed by means of a ShEx definition to determine whether the collection meets the requirements defined in the schema.
Based on the entities and their properties identified in the previous step, ShEx schemas are defined to assess RDF data. ShEx can be represented in a compact syntax (ShExC), intended for human consumption, or in JSON structures (ShExJ), intended for machine processing [36].
ShEx has several serialization formats [21]:
a concise, human-readable compact syntax (ShExC);
a JSON-LD syntax (ShExJ) which serves as an abstract syntax;
an RDF representation (ShExR) derived from the JSON-LD syntax.
Following other approaches [47], the ShEx-based validation workflow for libraries consists of:
writing a schema for the data type in question;
transferring that schema into the library model of items, statements, qualifiers and references;
writing a ShEx manifest for the library-based schema.
A manifest file includes several properties: (i) a label for the schema; (ii) a ShEx schema; (iii) a data label describing the dataset; (iv) a data property including a SPARQL endpoint; (v) the SPARQL query to retrieve the data; and (vi) a status property with the value
When defining the ShEx schemas to assess a dataset, previous examples of ShEx can be reused as a starting point. For instance, if a dataset is based on FOAF, we could use previous examples based on this vocabulary to define the new ShEx schema.

An example of manifest file to test entities typed as person – that corresponds to the class
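To make the six manifest properties concrete, the following sketch assembles one manifest entry and serialises it as JSON. The key names, the file name and the endpoint are assumptions made for illustration and should be checked against the validator's own documentation; the status value is left as a placeholder because it depends on what the validator expects.

```python
import json

def manifest_entry(schema_label, schema_url, data_label, endpoint, query, status):
    """Assemble one manifest entry from the six properties described above.

    The key names are assumptions made for this sketch, not the
    validator's documented format.
    """
    return {
        "schemaLabel": schema_label,
        "schemaURL": schema_url,
        "dataLabel": data_label,
        "data": endpoint,
        "query": query,
        "status": status,
    }

entry = manifest_entry(
    schema_label="Person (FOAF)",
    schema_url="person.shex",               # hypothetical file name
    data_label="Sample of persons",
    endpoint="https://example.org/sparql",  # placeholder endpoint
    query="SELECT ?person WHERE { ?person a <http://xmlns.com/foaf/0.1/Person> } LIMIT 10",
    status="EXPECTED_STATUS",               # placeholder for the expected status value
)
manifest = json.dumps([entry], indent=2)
print(manifest)
```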
In addition, the definition of ShEx constraints for an existing dataset and its validation can be performed by means of graphical tools aimed at novices and experts; they enable combination and modification functionalities allowing the building of complex ShEx schema [8].
The last step consists of checking the conformance of the entity data from the library with the ShEx manifest defined in the previous step.
The ShEx2 Simple Online Validator10
The ShEx2 Simple Online Validator allows users to select a manifest using a button in the left-hand list. Once a manifest is selected, a query can be chosen using a button in the right-hand list. The validate button then produces a list of results according to the items retrieved by the query selected in the previous step. Users are allowed to edit the schema and query inputs in order to re-execute the query and the validation. The list of results may include errors detailing the resource and the property involved, together with a textual description. Some examples of errors and their interpretation are:
mismatched datatype: indicates that the tool cannot match an input value with the data type it expects for the value. A common problem is related to the class
missing property: the cardinality constraint requires at least one value for the property, but none was found.
exceeds cardinality: the property has more values than the cardinality constraint allows.
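The three error types above can be illustrated with a small, self-contained check. This is a simplified stand-in written for this explanation, not a ShEx implementation: Python types play the role of RDF datatypes, and the `Constraint` class, the `check` function and the sample node are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Roughly what the constraints below would look like in ShExC (illustrative only):
#   <PersonShape> {
#     foaf:name xsd:string ;
#     foaf:age  xsd:integer ? ;
#     foaf:mbox xsd:string +
#   }

@dataclass
class Constraint:
    predicate: str
    datatype: type                  # Python type standing in for an RDF datatype
    min_count: int = 1
    max_count: Optional[int] = 1    # None means unbounded

def check(node: dict, constraints: list) -> list:
    """Return (predicate, error type) pairs for every violated constraint."""
    errors = []
    for c in constraints:
        values = node.get(c.predicate, [])
        if len(values) < c.min_count:
            errors.append((c.predicate, "missing property"))
        elif c.max_count is not None and len(values) > c.max_count:
            errors.append((c.predicate, "exceeds cardinality"))
        if any(not isinstance(v, c.datatype) for v in values):
            errors.append((c.predicate, "mismatched datatype"))
    return errors

person_shape = [
    Constraint("foaf:name", str, 1, 1),
    Constraint("foaf:age", int, 0, 1),
    Constraint("foaf:mbox", str, 1, None),
]
# Two names, an age stored as text, and no mailbox.
node = {"foaf:name": ["Ada", "A. Lovelace"], "foaf:age": ["36"]}
errors = check(node, person_shape)
print(errors)
```

The sample node triggers one error of each kind, which mirrors how the validator reports the resource and the property involved alongside a textual description.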
Prototypes and tools such as the one illustrated in the example enable the reproducibility of the results. Researchers may thus replicate, reuse and extend findings, and thereby drive scientific progress. Nevertheless, there are some aspects to consider when using a LOD dataset published by a library for assessment. For instance, in order to use the ShEx2 Simple Online Validator, the library must provide a SPARQL endpoint via the secure HTTPS protocol.
This section presents the application of the methodology introduced in Section 3 to three use cases based on relevant libraries. An additional use case is provided to show how the methodology can be adapted to other contexts.
After having identified the main classes and properties for each LOD repository, a file including the ShEx schema was manually created for each class (e.g. bnf-manifestation.shex), detailing the prefixes used and including the constraints. In order to use the schemas, we created manifests, based on each LOD repository, containing a list of items described as follows: i) the SPARQL query as well as the SPARQL endpoint that gathers the items to be tested; ii) a label describing the schema and the data used; and iii) the ShEx schema used to assess the data. The ShEx definitions are grouped by library in a manifest file (see Table 5). Since ShEx2 Simple Online Validator can process a manifest file by adding the parameter See, for instance,
When creating the ShEx schemas, preliminary tests were performed to pass the validation and after several iterations, we succeeded in addressing all the issues. For some classes that included a large number of resources, the properties were extracted manually, since the SPARQL endpoint produced some errors due to the complexity and time of the query. For instance, when using many properties in a ShEx definition, we may receive a 414 HTTP error (URI Too Long).
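A 414 response arises because the whole query travels in the request URL. As a hedged sketch of one possible workaround outside the online validator, the SPARQL 1.1 Protocol also allows the query to be sent in a URL-encoded POST body. The length threshold below is an assumption, since server limits vary, and it is also assumed that the endpoint accepts POST requests.

```python
from urllib.parse import urlencode
from urllib.request import Request

MAX_URL_LENGTH = 2000  # assumed threshold; actual server limits vary

def build_request(endpoint: str, query: str) -> Request:
    """Prepare (but do not send) a SPARQL request, switching to POST for long queries."""
    params = urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    url = f"{endpoint}?{params}"
    if len(url) <= MAX_URL_LENGTH:
        return Request(url, headers=headers, method="GET")
    # SPARQL 1.1 Protocol: the query may be sent as a URL-encoded POST body instead.
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    return Request(endpoint, data=params.encode("ascii"), headers=headers, method="POST")

short_req = build_request("https://example.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
long_query = "SELECT * WHERE { ?s ?p ?o } # " + "x" * 3000
long_req = build_request("https://example.org/sparql", long_query)
print(short_req.get_method(), long_req.get_method())
```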
The selection of a LOD repository is a critical factor as well as a complicated task since many institutions are publishing their metadata as LOD. Choosing the right subject ensures the possibility of replicating existing results as well as presenting new challenges to researchers.
In this sense, benchmarks provide an experimental basis for evaluating and comparing the performance of computer systems [23,43] as well as the possibility of replicating existing results [46]. Previous research has focused on four LOD repositories published by libraries – BVMC, BnF, BNB and BNE – that serve as a benchmark and has discussed the methodology employed to evaluate linked data in libraries [10]. Other approaches provide a list of potential LOD datasets for reuse such as the LOD Cloud14
There are many aspects to consider when using a LOD repository. For instance, open licenses and clear terms of use and conditions are key when reusing datasets. Depending on the requirements, a SPARQL endpoint may be necessary in order to assess the information provided by the repository. Table 4 shows an overview of LOD repositories published by libraries and the vocabulary used.
Overview of LOD repositories published by libraries
In some cases, organisations provide a dump file instead of having a public SPARQL endpoint available. The Library of Congress, for example, suggests downloading the bulk metadata and using a SPARQL engine such as RDF4J to create custom queries.
The ShEx Online Validator requires a public SPARQL endpoint that uses HTTPS to test the entities. However, some organisations, such as the BNE and BVMC, do not offer HTTPS for their services. To showcase the reusability of our methodology to assess LOD, we identified datasets in Wikidata and the current LOD Cloud whose description contains terms such as
BNB Linked Data platform
BnF
National Library of Finland (NLF)
In order to show how our methodology can be adapted to other domains, we selected an additional dataset from LOD Cloud, the Linked Open Vocabularies (LOV).
The SPARQL endpoints publicly available are used to assess the LOD datasets. The main difference between the repositories is the vocabulary used to describe the information, in particular the entities and properties.
BnF and BNB are linked to Wikidata by means of specific properties. In this way, and in addition to the ShEx definitions created according to the vocabularies used by the libraries, we have created a ShEx schema per library to assess whether the resources linked to Wikidata were typed as human (
The BNB Linked Data Platform provides access to the British National Bibliography18
The dataset is accessible through different interfaces: (i) a SPARQL online editor; (ii) a SPARQL endpoint for remote access; and (iii) a web interface providing a search box to enter a plain text term.
The BNB dataset has been modelled and represented in RDF using a number of standard schemas including the British Library Terms,20

Main classes retrieved from BNB LOD platform based on BIBO, SKOS and FOAF controlled vocabularies, and how they interact to create meaning.
A ShEx definition was created for each class to perform the assessment. As an example, the definition corresponding to See, for instance,

A ShEx Shape to validate the resources typed as
In addition, we created a further ShEx schema to assess the resources linked to the BNB Linked Data platform of Wikidata by means of the property
The data.bnf.fr project endeavours to make the data produced by Bibliothèque nationale de France more useful on the Web using Semantic Web technologies.
The dataset integrates several databases including the BnF main catalogue, BnF archives and manuscripts, and Gallica. The data model is based on FRBR, FOAF and SKOS as main vocabularies and provides links to external repositories such as GeoNames, Library of Congress and VIAF.
An overview of the main classes stored in the LOD repository has been extracted and is shown in Fig. 3. A new vocabulary has been defined to describe roles in which resources are linked to the Library of Congress subject headings (LCSH).

Overview of the main classes based on FRBR, FOAF and SKOS retrieved from data.bnf.fr and how they interact to create meaning.
Once we extracted the main resources described in the repository and identified their type, we extracted the properties for each class using SPARQL queries. For instance, Listing 6 shows a SPARQL query to retrieve the different properties used by the class

A SPARQL query to retrieve the different properties used by the class
A ShEx schema was defined for each class to perform the validation as is shown in Listing 7. As in the previous use case, all the ShEx schemas were included in a manifest file that is used by the validation tool as an input.

A ShEx to validate the resources typed as
Moreover, we created an additional ShEx schema to check whether the resources linked from Wikidata to data.bnf.fr by means of the property
The Finnish National Bibliography was published as LOD in 2017. The dataset, which contains about 40 million RDF triples and is based on schema.org and BIBFRAME, was extracted from MARC bibliographic records. It covers a wide range of materials, including books, maps, journals and digitized documents.
A ShEx schema was defined for each class based on schema.org (e.g. CreativeWork, Periodical, Person and Place) to perform the validation. Properties are mainly based on schema.org as well as additional vocabularies such as the agent and the unconstrained version of the RDA element sets. Groups of related items such as Periodicals and CreativeWorkSeries were assessed by means of the properties
Linked Open Vocabularies
The purpose of LOV is to promote and facilitate the use of well documented vocabularies in the Linked Data environment [50]. The vocabulary collection is maintained by the LOV team of curators and is constantly growing (749 as of April 2021).
The data model is based on specifications to describe vocabularies, including Vocabulary of a Friend (VOAF) and VANN (a vocabulary for annotating vocabulary descriptions), and additional vocabularies such as FOAF, schema.org and DC.
A ShEx schema for the most relevant classes is provided such as
Results and discussion
In order to assess the datasets, the main resources were identified and validated using a random sample of 1000 items retrieved per entity and library from their SPARQL endpoints. A total of 37 ShEx definitions were created to validate the LOD published by libraries. Table 5 shows the description of the classes, ShEx and manifest files used to assess the BNB and the BnF. Fig. 4 shows the ShEx validation interface consuming the manifest file and presenting the evaluation results for the BNB Linked Data platform.
Description of the classes, ShEx definitions and manifest files used to assess the BNB and the BnF provided in the GitHub project

The ShEx validator interface that uses the manifest file provided for the BNB Linked Data platform to assess each of the ShEx definitions showing the results. Online access to run the examples is available at
Evaluation overview for the four datasets. For each dataset we display the total number of triples, the number of classes and properties assessed, the total number of evaluated items, how many tests passed, failed and did time out (TO). The last column shows the result for the data quality criterion
Table 6 provides an overview of the data quality evaluation. All the assessed repositories obtained a high score, notably the BNB, the NLF and the BnF.
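As a rough illustration of how such a score could be derived from the pass and fail counts, the sketch below computes the share of completed tests that passed. The formula is an assumption made for this example (timed-out tests are left out, and an empty test set is scored as 1.0); the counts are hypothetical and are not taken from Table 6.

```python
def consistency_score(passed: int, failed: int) -> float:
    """Share of completed tests that passed; 1.0 when nothing was tested (assumed convention)."""
    total = passed + failed
    return 1.0 if total == 0 else passed / total

# Hypothetical counts for a sample of 1000 items.
score = consistency_score(passed=900, failed=100)
print(score)
```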
The BNB reached the highest score. We applied constraints to it based on several properties including See, for example,
The BnF obtains a high score, even though some constraints are violated. For instance, resources typed as
Regarding the NLF dataset, we identified properties such as See, for example,
The lower score for LOV can be attributed to the lack of values for several properties. Among the LOV dataset errors are the following:
422 resources typed as
162 persons without a property
162 organisations without a
In general, the ShEx schemas could be improved by setting the constraint
Moreover, some resources may not include sufficient information to be assessed. For instance, the resources typed as
Evaluation results aggregated by class for the BNB dataset
Evaluation results aggregated by class for the BnF dataset
Evaluation results aggregated by class for the NLF dataset
Evaluation results aggregated by class for the LOV dataset
The results of the assessment are useful for librarians in several ways since they provide valuable information with which to refine and improve their LOD catalogues. It is thus possible to identify potential properties that are not properly used to describe the bibliographic information. In the same way, they can measure the extent to which entities are described by means of a sufficient number of properties. For example, a librarian could be interested in assessing if the authors contain at least a name, a date of birth and an identifier matching a specific pattern.
The ShEx schemas provided in this study can be used as a starting point for other institutions willing to assess their LOD. In this way, the schemas could be further refined with additional node constraints as well as the incorporation of new vocabularies to assess further datasets. The adoption and use of this methodology in other contexts is also feasible as is shown in the variety of datasets and vocabularies used to assess the methodology.
With regard to the methodology, this approach is limited to one data quality dimension. In order to improve the methodology, additional data quality dimensions and criteria could be used such as license, completeness and trustworthiness (see Table 2). In addition, the ShEx schemas are based on the most relevant classes in each dataset.
Libraries are using Semantic Web technologies to publish and enrich their catalogues. While LOD repositories can be reused in innovative and creative ways, data quality has become a crucial factor for identifying a dataset for reuse.
Based on previous research, we defined a methodology described in Section 3 to assess the quality of LOD repositories published by libraries that uses ShEx as a main component. The methodology was applied to four use cases, resulting in a collection of ShEx schemas that can be tested online and reused by other institutions as a starting point to evaluate their LOD repositories. Our evaluation showed that ShEx can be useful to assess LOD published by libraries. In addition, ShEx can be used as documentation since it provides a human-readable representation that helps librarians and researchers to understand the data model.
Future work to be explored includes the improvement of the ShEx definitions and the inclusion of additional use cases. Moreover, the extension of the ShEx validation tool in terms of library requirements, such as common classes and properties used by libraries, will be analysed.
