Introduction
The COVID-19 pandemic is complex and multifaceted and touches on almost every aspect of current life [25]. Coordinating efforts to systematize and formalize knowledge about COVID-19 in a computable form is key to accelerating our response to the pathogen and to future epidemics [24]. There are already attempts at creating community-based ontologies of COVID-19 knowledge and data [37], as well as efforts to aggregate expert data [73]. Many open data initiatives have been started spontaneously [22,62,106]. The interconnected, multidisciplinary, and international nature of the pandemic creates both challenges and opportunities for using knowledge graphs [8,22–24,35,37,46,73,108]. However, there have been no systematic studies of crowd-sourced knowledge graph generation by spontaneous groups of self-coordinated users under the pressure of rapidly unfolding phenomena such as the pandemic. Our paper fills this gap.
For applications of knowledge graphs in general, common challenges include the timely assessment of the relevance and quality of any piece of information with regard to the characteristics of the graph, and the integration with other pieces of information within or external to the knowledge graph. Common opportunities are mainly related to leveraging such knowledge graphs for real-life applications, which in the case of COVID-19 could be, for instance, outbreak management in a specific societal context or education about the virus or about countermeasures [8,22–24,35,37,46,73,108]. While this manuscript as a whole emphasizes the opportunities, we think it is worthwhile to highlight some of the challenges early on.
COVID-19 data challenges
The integration of different data sources always poses a range of challenges [19], for example in terms of interoperability (e.g. differing criteria for COVID-19 deaths across jurisdictions), granularity (e.g. number of tests performed per jurisdiction and time period), quality control (e.g. whether aggregations of sub-national data fit with national data), data accessibility (e.g. whether they are programmatically and publicly accessible, and under what license) or scalability (e.g. how many sources to integrate, or how often to sync between them).
Integrating COVID-19 data presents particular challenges: First, human knowledge about the COVID-19 disease, the underlying pathogen and the resulting pandemic is evolving rapidly [53], so systems representing it need to be flexible and scalable in terms of their data models and workflows, yet quick in terms of deployability and updatability. Second, COVID-19-related knowledge, while very limited at the start of the pandemic, was still embedded in a broader set of knowledge (e.g. about viruses, viral infections, past disease outbreaks and interventions), and these relationships – which knowledge bases are meant to leverage – are growing along with the expansion of our COVID-19 knowledge [105]. Third, the COVID-19 pandemic has affected almost every aspect of our globalized human society, so knowledge bases capturing information about it need to reflect that. Fourth, despite the disruptions that the pandemic has brought to many communities and infrastructures [25], the curated data about it should ideally be easily and reliably accessible for humans and machines across a broad range of use cases [82].
Organization of the manuscript
In this research paper, we report on the efforts of the Wikidata community (including our own) to meet the COVID-19 data challenges outlined in the previous section by using Wikidata as a platform for collaboratively collecting, curating and visualizing COVID-19-related knowledge at scales commensurate with the pandemic. While the relative merits of Wikidata with respect to other knowledge graphs have been discussed previously [1,30,84], we focus on leveraging the potential of Wikidata as an existing platform with an existing community in a timely fashion for an emerging transdisciplinary application like the COVID-19 response.
As active editors of Wikidata, the authors have contributed a significant part of the data modelling, usage frameworks and crowdsourcing of COVID-19 information in the knowledge graph since the beginning of the pandemic. We are consequently in a unique position to share our experience and to give an overview of how to use Wikidata to host COVID-19 data, integrate it with non-COVID-19 information and feed computer applications in an open and transparent way.
The remainder of the paper is organized as follows: we start by introducing Wikidata in general (Section 2) and describe key aspects of its data model in the context of the COVID-19 pandemic (Section 2.1). Then, we give an overview of the language support (Section 2.2) and database alignment (Section 2.3) of COVID-19 information in Wikidata. Subsequently, we present snapshots of applications of Wikidata’s COVID-19 knowledge graph to visualizing multidisciplinary information about COVID-19 (Section 3). These visualizations cover biological and clinical aspects (Section 3.1), epidemiology (Section 3.2), research outputs (Section 3.3) and societal aspects (Section 3.4). Finally, we discuss the outcomes of the open development of the COVID-19 knowledge graph in Wikidata (Section 4), draw conclusions and highlight potential directions for future research (Section 5).
Wikidata as a semantic resource for COVID-19
Wikidata is a large-scale, collaborative, open-licensed, multilingual knowledge base that is both human- and machine-readable. Notably, it is available in the standardized RDF (Resource Description Framework) format, where data is organized into entities (items) and the relationships, called properties, that connect them to each other and to outside data [102].
Wikidata is a peer production project, developed under the umbrella of the Wikimedia Foundation, which also hosts Wikipedia and an ecosystem of open collaborative websites around it. Similarly to Wikipedia, it relies on community-driven development and design and is both non-hierarchical and largely uncoordinated [47]. As a result, it develops entirely organically, based on the editor community’s consensus, which may be implicit (e.g. by the absence of modifications) or explicit (e.g. a policy on how to handle biographical information about living people). This community develops the ontologies and typologies used in the database.
This community-centric approach is both a blessing and a curse. On the one hand, it makes methodical planning of the whole structure and its granularity very difficult, if not impossible [59]: there simply is no central coordination system, and all major design decisions have to be approved through a consensus of all interested contributors. On the other hand, harnessing the knowledge and skills of a broad range of human and automated contributors provides for an unparalleled flexibility and versatility of uses, and allows for rapidly addressing emerging and urgent phenomena, such as disease outbreaks.
The novelty of a bottom-up developed knowledge graph lies in its entirely organic growth of taxonomies and content, negotiated continuously by the involved parties. While the benefits of peer production and collaborative editing are well known, they are particularly visible for contemporary and fast-changing topics [54]. This is because crowd-sourced coordination requires neither a long decision-making process nor a chain of command. Additionally, the bottom-up approach allows for a better optimization of topic coverage, by relying on the spontaneous “free market” forces of individual editors. It is already known that the search habits of users seeking medical content changed dramatically as a result of the COVID-19 pandemic [13]. However, the exact dynamics of how this peer network responded to the challenge, in particular to the urgent need for new taxonomies and knowledge graphs, have not been a topic of systematic analysis. Our paper fills this gap.
With respect to the COVID-19 data challenges (cf. Section 1.1), Wikidata addresses them in several ways: First, it was designed for web-scale data with flexible and evolving data models that can be updated quickly and frequently [98,102], and its existing community has been using it to capture COVID-19-related knowledge right from the start; for example, the core item “COVID-19 pandemic” (Q81068910) was created on 2020-01-05. As such items became available, they were quickly put to use for enriching the knowledge graph around them. For instance, when the paper “Recent advances in the detection of respiratory virus infection in humans” was published on 2020-01-15, the item about it (Q82838328) had been linked to the “SARS-CoV-2” item within less than three days.
An important caveat is that data integration through Wikidata poses some particular challenges of its own, such as data licensing (being in the public domain, Wikidata can essentially only ingest public-domain data [76]) or multilinguality (e.g. how to handle concepts that are hard to translate [88]), and for certain kinds of data (e.g. health data from individual patients), it is not suitable, although appropriately configured instances of the underlying technology stack might be [80].
Here, we present how various types of data related to the COVID-19 pandemic are currently represented in Wikidata thanks to the flexible structure of the database and how useful visualizations for different subsets of the data linked to COVID-19 within the Wikidata knowledge base can be generated.
In Wikidata, each concept has an item (a human, disease, drug, city, etc.) that is assigned a unique identifier (Q-number; brown in Fig. 1), and optionally a label, description and aliases in multiple languages (yellow in Fig. 1). The assignment of a single language-independent identifier for each entity in Wikidata helps minimize the size of the knowledge graph and avoids issues seen in databases such as DBpedia, where separate items are needed for each language [1]. This feature is enabled by the use of the Wikibase software (a MediaWiki variant adapted to support structured data) to drive Wikidata, instead of systems that represent entities using textual expressions, particularly Virtuoso in the context of DBpedia [1] and NewsReader [101].

Fig. 1. Data structure of a Wikidata item. The simple, consistent structure of a Wikidata item makes it both human- and machine-readable. Each Wikidata item has a unique identifier (brown). Items can have labels, descriptions and aliases in multiple languages (yellow). They can include any number of statements having predicates (red), objects (blue), qualifiers (black) and references (purple), where the subject is the item. Finally, where additional Wikimedia resources are available about an item’s topic, those are listed (green).
The true richness of the knowledge base comes from the connections between the items: statements in the form of RDF triples (subject-predicate-object) where the subject is the respective item, the predicate is a Wikidata property (red in Fig. 1), and the object is another Wikidata item or piece of information (blue in Fig. 1). The properties that relate items are similarly each assigned an identifier (P-number). Some properties relate a Wikidata item as the object and can be taxonomic (e.g. instance of [P31] or subclass of [P279]).
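This item-statement structure can be sketched in a few lines of Python. The identifiers used below are real Wikidata IDs quoted in this article, while the class itself is a simplified illustration, not the actual Wikibase data model.

```python
from dataclasses import dataclass, field

# Simplified sketch of a Wikidata statement: a subject item (Q-number), a
# property (P-number), an object (item or literal), plus optional qualifiers
# and references, mirroring the structure shown in Fig. 1.
@dataclass
class Statement:
    subject: str    # Q-number of the item the statement is about
    predicate: str  # P-number of the Wikidata property
    obj: str        # another item's Q-number, or a literal value
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)

# A real triple from the article's domain: COVID-19 (Q84263196) is an
# instance of (P31) disease (Q12136).
s = Statement(subject="Q84263196", predicate="P31", obj="Q12136")
```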
The only situation where DBpedia retrieves precise relational statements (e.g. dbp:symptoms, dbp:treatment) as well as non-relational statements (e.g. dbo:confirmedCases, dbp:arrivalDate) from Wikipedia is when the information is extracted from infoboxes [83]. Even then, the infobox-based creation of DBpedia statements suffers from several inconsistencies requiring logical constraints and human effort for their efficient elimination, even though the properties are quite well defined in infoboxes [83]. A practical example of this problem is the DBpedia item about Ahmed Al-Qadri, a former Syrian Minister of Agriculture and Agrarian Reform, as of June 20, 2021.
Wikidata’s use of RDF also gives it an advantage over competing semantic data formats.
In the context of the COVID-19 pandemic, many aspects of the SARS-CoV-2 outbreak have been represented ontologically in Wikidata, building on pilot work that was started at the onset of the Zika outbreak [28] and led to the formation of WikiProject Zika Corpus.
Wikidata is apt to cover gaps in ontologies, as any user is entitled to create new classes and propose new properties. In contrast to DBpedia, which is based on scheduled scraping of Wikipedia, the openness of the Wikidata data model allows flexible, immediate representation by any stakeholder interested in a subject.
The core of the COVID-19 knowledge graph in Wikidata is formed by three main items (red in Fig. 2): COVID-19 (Q84263196), SARS-CoV-2 (Q82069695) and the COVID-19 pandemic (Q81068910).

Fig. 2. Simplified skeleton of the data model of COVID-19 information on Wikidata. The three main COVID-related items (the ‘C3 items’) are represented in red, selected classes of items related to these are shown in blue, with the relations between them represented as arrows. The number of statements relating to each item from the relevant class is indicated next to the item (in the case of scholarly articles, relations to each of the three COVID-related items are indicated by colour). Relation types regularly used to define items within Wikidata classes are omitted (e.g. instance of [P31]).
These three core items then link to a vast array of items related to all aspects of the disease, its causative virus, and the resulting pandemic (>17,000 Wikidata items as of 20 August 2020; blue in Fig. 2). Here, ‘COVID’ and ‘C3’ stand for any subset of the three core items.
When comparing the number of COVID-related Wikidata items with the number of COVID-related entries in the English DBpedia as of May 26, 2021, we find that only 8727 DBpedia entities have been defined for COVID-19 information, presumably only the entities having a corresponding article in the English Wikipedia.
The collaborative work in Wikidata to populate and curate this data has been largely accomplished by WikiProject COVID-19. These COVID-19-related items are linked to their respective classes or types using instance of [P31] and subclass of [P279] statements.
In addition to relational statements that link items within the knowledge base, non-relational statements link to external identifiers or numerical values [27]. Wikidata items are assigned their identifiers in external databases, including semantic resources, using human efforts and tools such as Mix’n’match [65]. These links make Wikidata a key node of the open data ecosystem, not only contributing its own items and internal links, but also bridging between other open databases (Fig. 3). Wikidata therefore supports alignment between disparate knowledge bases and, consequently, semantic data integration [11] and federation [65] in the context of the linked open data cloud [21]. Such statements also permit the enrichment of Wikidata items with data from external databases when these resources are updated, particularly given the regularly changing characteristics of COVID-19. By contrast, DBpedia mainly uses Wikipedia for its enrichment, which does not support coverage of all aspects of the analyzed disease [1,45,60]. Examples of Wikidata properties used to define external identifiers can be found in Table 1.
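This bridging role can be sketched as a join between two external databases through a single Wikidata item. P486 (MeSH descriptor ID) and P699 (Disease Ontology ID) are real Wikidata properties, but the identifier values below are placeholders, not real database records.

```python
# Sketch of using Wikidata as a bridge between two external databases: given
# an item's external-identifier statements, records from two resources that
# never reference each other directly can be joined through the item.
# Identifier values are placeholders for illustration only.
WIKIDATA_IDS = {
    "Q84263196": {          # COVID-19
        "P486": "MESH-X",   # MeSH descriptor ID (placeholder value)
        "P699": "DOID-Y",   # Disease Ontology ID (placeholder value)
    },
}

def bridge(qid: str, prop_a: str, prop_b: str):
    """Return the identifier pair aligned through one Wikidata item."""
    ids = WIKIDATA_IDS[qid]
    return ids.get(prop_a), ids.get(prop_b)

pair = bridge("Q84263196", "P486", "P699")
```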

Fig. 3. Wikidata in the Linked Open Data Cloud. Databases are indicated as circles (with Wikidata indicated as ‘WD’), with grey lines linking databases in the network if their data is aligned; source dataset last updated May 2020.
Examples of Wikidata properties used to define non-relational statements
Numerical statements are assigned to disease outbreak items for the COVID-19 pandemic to outline the evolution of the epidemiological status of different entities, from countries to provinces, cities and cruise ships. The properties used to define these statistical statements are shown in Table 1 and include data about the morbidity, mortality, testing and clinical management of COVID-19 at the level of continents, countries and constituent states, and also many smaller entities. Some Wikidata properties used to store this epidemiological information have been created in response to COVID-19, covering quantities such as:

Confirmed active cases
Confirmed recovery rate
Confirmed patient-days
New confirmed cases
New confirmed deaths
New clinical tests
New confirmed recoveries
This set of COVID-19 information is integrated into Wikidata using human efforts and tools such as QuickStatements. QuickStatements (QS) is a web service that can modify Wikidata based on simple text commands.
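A minimal sketch of what such a text command looks like, assuming the tab-separated subject/property/value format of QS version 1: P780 (symptoms and signs) is a real Wikidata property, while Q38933 is assumed here to be the item for fever; identifiers should be verified before running any real batch.

```python
# Sketch of composing a QuickStatements (QS) command line. QS v1 batches are
# tab-separated lines of the form: subject item, property, value.
# Q38933 is assumed to denote "fever" -- verify before any real edit.
def qs_line(subject: str, prop: str, value: str) -> str:
    return "\t".join([subject, prop, value])

line = qs_line("Q84263196", "P780", "Q38933")  # COVID-19 -> symptom -> fever
```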
An application programming interface (API) is a machine-friendly interface of a web service that can be used to feed another computer program with needed information. The Wikidata API is available at https://www.wikidata.org/w/api.php.
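For example, a request URL fetching the full JSON record of one item can be composed as follows; this sketch uses the standard `wbgetentities` action of the API, and no network request is actually made here.

```python
from urllib.parse import urlencode

# Sketch of building a Wikidata API request URL (query construction only;
# the request itself is not sent).
API = "https://www.wikidata.org/w/api.php"

def entity_request_url(qid: str) -> str:
    """URL that returns the full JSON record of one Wikidata item."""
    params = {
        "action": "wbgetentities",  # fetch entity data
        "ids": qid,                 # e.g. Q84263196 (COVID-19)
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = entity_request_url("Q84263196")
```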
Wikidata properties can have constraint declarations associated with them that represent conditions on their use. As an example, the property drug or therapy used for treatment [P2176] has a type constraint stating that the items described by it should be instances of health problem [Q2057971], and a value-type constraint stating that the referenced items should be instances of medication [Q12140].
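The two constraints on P2176 can be mimicked locally with a toy checker. The class and property identifiers below are the real ones named above, but the flat instance-of map is a simplification: real constraint checking also follows subclass-of (P279) chains.

```python
# Toy local check of Wikidata-style property constraints for P2176
# ("drug or therapy used for treatment"): the subject must be a health
# problem (Q2057971) and the value a medication (Q12140).
CONSTRAINTS = {
    "P2176": {"subject_class": "Q2057971", "value_class": "Q12140"},
}

INSTANCE_OF = {               # toy item -> class map (stand-in for P31 claims)
    "Q84263196": "Q2057971",  # COVID-19 modelled directly as a health problem
    "QDRUG": "Q12140",        # hypothetical medication item
}

def violates(prop: str, subject: str, value: str) -> bool:
    c = CONSTRAINTS[prop]
    return (INSTANCE_OF.get(subject) != c["subject_class"]
            or INSTANCE_OF.get(value) != c["value_class"])

ok = not violates("P2176", "Q84263196", "QDRUG")
```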
In 2019, Wikidata added a new namespace to define entity schemas using the Shape Expressions (ShEx) language [78,97]. Entity schemas can be used to define expectations about the topology associated with some entities. Entity schemas are human-readable and machine-processable, facilitating their creation by domain experts and their use for validation. During the pandemic, entity schemas related to COVID-19 were created, such as virus taxon [E192], strain [E174], disease [E69], virus strain [E170], virus gene [E165] and coronavirus pandemic local outbreaks [E188] [105]. A remarkable aspect of entity schemas in Wikidata is their collaborative nature, which allows the entity schema ecosystem to evolve as users create new schemas with different constraints or reuse existing schemas by importing them.
Another approach to validating the data has been the use of SPARQL queries. SPARQL is available as part of the Wikidata Query Service and can be used not only to query the knowledge graph, but also to detect inconsistencies and to check logical constraints and more complex heuristic patterns. It is also possible to check the edit history and use the ORES service to eliminate database vandalism [97].
Although Web Ontology Language (OWL) can define knowledge graphs with a richer semantic characterization of data models by providing a layer of Description Logics such as in DBpedia [1], the infrastructure developed for the validation of RDF data in Wikidata helps assure a high level of consistency of the Wikidata knowledge graph.
In the context of COVID-19, numerical statements related to epidemiology are constantly changing, and Wikidata’s structure benefits it in terms of recency. This can be illustrated, for instance, by a query for the number of COVID-related items that were last modified in May and June 2021 (as of June 21, 2021).
Wikidata’s language-independent data model makes it well adapted for multilingual representation. In English, French, German and Dutch, its biomedical language coverage is comparable to other semantic resources such as SNOMED-CT and ICD-10 (International Classification of Diseases, 10th Revision) [49]; ICD-10 supports Arabic, Chinese, English, French, Russian, Spanish, Albanian, Armenian, Azeri, Basque, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, Persian, Finnish, German, Greek, Hungarian, Icelandic, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Mongolian, Norwegian, Polish, Portuguese, Serbian, Slovak, Slovenian, Swedish, Thai, Turkish, Turkmen, Ukrainian, and Uzbek.

Fig. 4. Language representation of COVID-19-related statements. A-D) Language coverage for items and properties used in statements where either the object or subject is one of the three COVID-related items (as per Fig. 2; note: log y-axis). The eight most common languages in Wikidata are shown: en=English, fr=French, de=German, es=Spanish, zh=Chinese, ar=Arabic, ja=Japanese, ru=Russian. E) Percentage of the items covered, in order from highest to lowest coverage, faceted by categories A-D. Data shown for the top 150 languages in each category (note: languages are not necessarily in the same order for each), as of August 15, 2020.
The better coverage in English is explained in part by the higher support of this language in both biomedical language resources [34] and Wikipedia [91]. Cooperation with publishers such as Cochrane has a significant effect on English Wikipedia coverage, too [48]. The significant coverage of languages like French, Spanish, German, Chinese and Swedish in medical Wikidata fits with their support by major biomedical multilingual databases: ICPC-2 [86], for example, supports 24 languages (Afrikaans, Basque, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovenian, Spanish, and Swedish).
The support of other natural languages can also be explained by the use of bots that extract multilingual terms representing clinical concepts based on natural language processing and machine learning techniques. An example of such a Wikidata bot is Edoderoobot 2, which specifically works on labelling, thereby translating structured data into prose in the respective language.
These correlations can be interrogated by querying Wikidata to find out the current status of the editing of this knowledge graph and of Wikipedia in 307 languages (Table S3; top-ranking items for each variable are summarised in Tables 3 and 4). Query results largely match previously published trends for Wikipedia and Wikidata (Table 2), though we note that Arabic (ar) and Chinese (zh) appear in the top 10 languages in the Wikidata COVID-19 subset, while being absent from the top 10s for the other sets described in Table 4. Coverage differed across languages and variables, and most of the distributions showed marked positive skew. To account for this skew and for data spanning multiple orders of magnitude, a nonparametric analysis of correlations (Spearman’s rho) was used; it found large-magnitude associations (rho 0.65 to 0.97, median = 0.84, Supplementary Table S4), statistically significant even following stringent Bonferroni correction.
Languages ranked by medical content from the literature: number of medical Wikipedia articles, number of Wikidata labels, number of native speakers, and number of Wikidata users. Style code: italic for languages appearing in all four lists; bold for those appearing in only one
Languages ranked by medical content from Wikidata queries (as of August 11, 2020). The medical Wikipedia query yields Wikipedia articles associated with Wikidata items that have a disease ontology ID [P699] or are in the tree of any of the following classes: medicine [Q11190], disease [Q12136], medical procedure [Q796194] or medication [Q12140]. The medical Wikidata labels query yields labels of Wikidata items that have a disease ontology ID [P699] or a MeSH descriptor ID [P486] or are in the tree of any of the same four classes. The Wikipedia and Wikidata users column provides a snapshot from the Wikidata dashboard that lists Wikidata users who also edit Wikipedia by number of such users per Wikipedia language. Style code: italic for languages appearing in all three lists; bold for those appearing in only one
Similarly, the current representation of COVID-19 Wikidata items in natural languages seems to be linked with COVID-19-related Wikipedia pages, edits and pageviews for a given language, as shown in Table 4, and this is confirmed by the high Pearson correlation between these measures.
To investigate the possible causes of these highly correlated datasets, we compared them to two external metrics for each language: the number of native speakers of each language [26] and the maximum Human Development Index for countries where that language is an official language [99]. This data was available for fewer languages.
The observation here that current language coverage in Wikidata and Wikipedia correlates more closely with countries’ development index than with the number of speakers of each natural language aligns with previous work demonstrating a low correlation of Wikidata coverage with the number of speakers [52].
We interpret this as a potential ‘need gap’, where languages that have a large number of relatively low-income speakers remain relatively underserved. To address this, it may be necessary to encourage and/or support contribution by speakers of under-resourced and underrepresented languages to medical Wikidata projects, analogous to the corresponding Wikipedia projects. Current efforts to enhance the coverage and language support of medical knowledge in Wikipedia are mainly driven by Wikimedia Medicine.

Fig. 5. A) All-versus-all pairwise correlations of the language coverage variables described above.
In addition to its intrinsic value, increased language coverage would also help ensure culturally relevant contextualizations in Wikidata’s medical and other domains.
Languages ranked by COVID-19-related content from Wikidata queries and other live data (as of August 13, 2020). The COVID-19 pandemic Wikipedia pageviews column represents daily average user traffic (averaged over 2020) to the article about the COVID-19 pandemic in the respective language. The COVID Wikidata labels query sorts languages by the number of labels of Wikidata items with a direct link to and/or from any of the core COVID-19 items – Q84263196 (COVID-19), Q81068910 (COVID-19 pandemic) and Q82069695 (SARS-CoV-2) – excluding items about humans (3131) or scholarly publications (40164). The COVID Wikipedia articles query filters those Wikidata items for associated Wikipedia articles and sorts languages by the number of such articles. The values in the COVID Wikipedia edits column represent the revision counts per Wikipedia language as taken from the dashboard listing Wikimedia projects by total number of revisions to COVID-19-related articles. Style code: Italic for languages appearing in all four lists; bold for those appearing in only one
As shown in the “Data model” section, Wikidata items are linked to their equivalents in other semantic databases using statements where the property provides details about a given resource and the object is the external identifier of the item in the aligned database. Similarly to Wikidata items, these database alignment properties are defined by labels, descriptions and aliases in various languages and by statements describing logical conditions for their usage including formatting constraints and allowed values of subject classes [97].
The alignment of Wikidata entities to entries in other databases is a collaborative process which, as with everything in Wikidata, is done via a combination of manual and automatic curation. As an example of automation, items concerning scholarly entries (i.e. articles and reports) were often aligned to other databases using DOIs (Digital Object Identifiers) as unique keys. As Wikidata is an open database, the precision of the alignments is largely based on trust in the community, and misalignments are promptly corrected once identified. At the scale of curation happening on Wikidata, quality issues in aligned databases regularly surface, e.g. invalid DOIs stated in PubMed and Europe PMC.
As of September 1, 2020, 5302 Wikidata properties defined external identifiers. In the circumstances of the COVID-19 outbreak, such mappings can be enumerated with a SPARQL query.
Scholarly articles and clinical trials have been linked to numerous external identifiers, particularly the Digital Object Identifier (DOI), the PubMed ID, the Dimensions Publication ID, the PubMed Central ID (PMCID) and the ClinicalTrials.gov Identifier (Table S5). Most of these identifiers are added thanks to WikiProject WikiCite, which aims to support bibliographic information on Wikidata [67,70,107]. The current representation of external identifiers for the scientific literature in Wikidata seems to be similar to that for bibliographic data in the knowledge graph in general; as of September 3, 2020, Wikidata included 36,208,373 scholarly articles.
However, this Wikidata coverage of the availability of COVID-19-related publications in external research databases does not seem to fully represent the records of COVID-19 literature in the aligned resources. By way of comparison, we performed a simple search for “COVID-19” in a set of literature databases and found 103,796 COVID-19-related records available on PubMed alone.
Wikidata’s relatively incomplete coverage of the literature is mainly explained by the fact that its scientific metadata is developed through latent crowdsourcing of information from multiple sources by bots and human efforts, rather than through real-time screening of external scholarly resources [92,107]. In addition to such sampling biases, there are also differences in annotation workflows, e.g. in terms of the multilinguality of, or the hierarchical relationships between, topic tags in Wikidata versus comparable systems like Medical Subject Headings.
As for the diseases and symptoms related to COVID-19, Wikidata maps to multiple external identifiers in the main biomedical semantic databases such as MeSH, ICD-10 (International Classification of Diseases, Tenth Revision) and the Unified Medical Language System (UMLS).
Since Wikidata is multidisciplinary, it has extensive matching to humans and sovereign states (Table S7) as well as films, computer applications and disease outbreaks (Table S8), including alignment to various metadata databases like VIAF (Virtual International Authority File) and the Internet Movie Database (IMDb).
Concerning drugs, proteins, genes and taxons, Wikidata items are mainly assigned external identifiers in the major knowledge bases for the life sciences, such as MassBank, the Interim Register of Marine and Nonmarine Genera (IRMNG), the Protein Data Bank (PDB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), the IUPHAR/BPS Guide to Pharmacology and the National Center for Biotechnology Information (NCBI) databases.
Despite the volume and variety of database alignment in Wikidata, particularly related to COVID-19, the Wikidata statements providing external identifiers do not, by themselves, specify the extent of matching between the subject and its equivalent in the aligned database. By contrast, DBpedia assigns different properties for database matching according to the level of correspondence between the aligned entities (e.g. rdfs:seeAlso, skos:broader, or owl:sameAs) [94]. As a solution to this matter, a new Wikidata property entitled “mapping relation type” (P4390) has been created. This property is used as the predicate of a qualifier on a statement providing an external identifier of an item. The object of this qualifier has to be one of the SKOS generic mapping relation types: “close match” (Q39893184), “exact match” (Q39893449), “narrow match” (Q39893967), “broad match” (Q39894595) or “related match” (Q39894604). When the object is an “exact match”, the two aligned items are equivalent. However, when the object is a “broad match”, this means that the external entity is a hypernym of the corresponding Wikidata item (i.e. skos:broader), etc.
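The qualifier mechanism described above can be sketched as follows; P486 (MeSH descriptor ID) and the SKOS mapping-relation items are real, while the identifier value is a placeholder.

```python
# Sketch of an external-identifier statement carrying a "mapping relation
# type" (P4390) qualifier. The identifier value is a placeholder.
statement = {
    "subject": "Q84263196",      # COVID-19
    "property": "P486",          # MeSH descriptor ID
    "value": "EXAMPLE-ID",       # placeholder external identifier
    "qualifiers": {
        "P4390": "Q39893449",    # mapping relation type: exact match
    },
}

def is_exact_match(stmt: dict) -> bool:
    """True when the alignment is qualified as a SKOS exact match."""
    return stmt["qualifiers"].get("P4390") == "Q39893449"
```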
One of Wikidata’s key strengths is that each item can be understood by both machines and humans. It represents data in the form of items and statements, which are navigable in a web interface and shared as semantic triples [102]. However, where a computer can easily hold the entire knowledge base in its memory at once, the same is obviously not true for a human.
Since we still rely on human interpretation to extract meaning out of complex data, it is
necessary to pass that data from machine to human in an intuitive manner [57]. The main way of doing this is by visualising some
subset of the data, since the human eye acts as the input interface with the greatest
bandwidth. Because Wikidata is available in the RDF format, it can be efficiently queried using SPARQL (a recursive acronym for “SPARQL Protocol and RDF Query Language”), the current version of which is SPARQL 1.1.
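As a minimal illustration of the pattern such queries follow, the query below counts items typed as disease outbreaks; it is a sketch runnable at the Wikidata Query Service (assuming Q3241045 as the item for “disease outbreak”):

```sparql
# Sketch: count Wikidata items declared as instances (P31)
# of "disease outbreak" (Q3241045).
SELECT (COUNT(?outbreak) AS ?count) WHERE {
  ?outbreak wdt:P31 wd:Q3241045 .
}
```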
Given the breadth of Wikidata’s COVID-19-related information (examples in Supplementary Figure S1), specific subsets of that information can be extracted using SPARQL. The queries used in this work are drawn from several collections:
- WikiProject COVID-19 (WPCOVID) queries: extracts from the query collection of Wikidata’s WikiProject COVID-19;
- SARS-CoV-2-Queries: extracts from the book “Wikidata Queries around the SARS-CoV-2 virus and pandemic” [2];
- SPEED queries: extracts from the Wikidata-based epidemiological surveillance dashboard for the COVID-19 pandemic in Tunisia;
- Scholia queries: queries underlying COVID-19-related visualizations from the Wikidata-based scholarly profiling tool Scholia [83];
- Covid-19 Summary queries: queries visualizing COVID-19 information in Wikidata linked to the epidemiological information of the outbreak and to the characteristics of the infected famous people.
A simple demonstration of Wikidata’s encoding of SARS-CoV-2’s basic biology is in its
genetics (Fig. 6) and resulting symptoms (Fig. 7). The viral genome contains 11 genes that encode 30 proteins
(and variants), which are currently known to interact with over 170 different human
proteins. Although there are two genome browsers based on Wikidata [66,79], neither yet
displays the SARS-CoV-2 genome. SPARQL visualizations provide a broader way to explore
biomedical knowledge about the studied virus and the related infectious disease. As the
knowledge graph grows, this allows linking together complex knowledge on biochemistry (e.g. genes and proteins), biology (e.g. host taxa) and clinical medicine (e.g. interventions) [104]. Such queries can be expanded by considering the qualifiers that modulate biomedical statements. These qualifiers allow the assignment of weights to assertions according to their importance and certainty. For instance, some treatments are indicated as hypothetical, and some symptoms are listed as rare, as defined by their qualifiers.
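A sketch of such a qualifier-aware query is shown below; it lists the symptoms of COVID-19 together with whatever qualifiers modulate each symptom statement (assuming Q84263196 as the item for COVID-19 and P780 as the “symptoms” property; the query does not presuppose which qualifier properties are used):

```sparql
# Sketch: symptoms (P780) of COVID-19 (Q84263196), with any qualifiers
# (e.g. frequency or certainty annotations) attached to each statement.
SELECT ?symptomLabel ?qualifierPropLabel ?qualifierValue WHERE {
  wd:Q84263196 p:P780 ?statement .
  ?statement ps:P780 ?symptom .
  OPTIONAL {
    ?statement ?pq ?qualifierValue .
    ?qualifierProp wikibase:qualifier ?pq .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

Keeping the qualifier pattern inside OPTIONAL ensures that unqualified symptom statements are still returned.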
Epidemiology
Wikidata also contains the necessary information to calculate common epidemiological measures for different countries, such as mortality per day per capita, or the correlation between case numbers and mortality rates. In some cases, this is stored as aggregate data.

[Figure: SARS-CoV-2 interactions with the human proteome as of September 14, 2020.]

[Figure: Symptoms of COVID-19 and similar conditions as of September 10, 2020.]

[Figure: Summary epidemiological data on the COVID-19 pandemic as of September 10, 2020.]
In some cases, summary data is also time-resolved, allowing inquiry into its change over time (Supplementary Figure S2), capturing features that are not depicted in several statistical predictions of the epidemiological evolution of COVID-19 outbreaks [12] but are clearly seen in other data sources, such as the mortality peak at the beginning of a disease outbreak [111]. Wikidata’s granularity (i.e. the representation of COVID-19 information at the scale of individual cases, days and incidents) and collaborative editing have also kept it highly up to date on queries such as the most recent deaths of notable persons due to COVID-19. This result is difficult to achieve with other datasets (Supplementary Figure S3), and mirrors Wikipedia’s well-known rapid response in updating information on deaths [55,56].
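The per-capita measures mentioned above can be derived directly in SPARQL. The sketch below divides reported deaths by population per country; the modelling assumptions (per-country outbreak items typed as disease outbreaks Q3241045, with “country” P17 and “number of deaths” P1120, and population P1082 on the country item) reflect common Wikidata practice but should be verified against the current data model:

```sparql
# Sketch: COVID-19 mortality per capita by country, under the
# modelling assumptions stated above.
SELECT ?countryLabel ?mortalityPerCapita WHERE {
  ?outbreak wdt:P31 wd:Q3241045 ;    # instance of: disease outbreak
            wdt:P17 ?country ;       # country of the outbreak
            wdt:P1120 ?deaths .      # number of deaths
  ?country wdt:P1082 ?population .   # population of the country
  BIND (?deaths / ?population AS ?mortalityPerCapita)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?mortalityPerCapita)
```

Because the arithmetic happens in the query itself, the result stays current as the underlying counts are updated.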
A large portion of Wikidata is dedicated to publication metadata and citation links.
There are several ways to investigate the relevant topics in publications regarding
COVID-19. Firstly, topic keywords can be extracted directly from the titles of articles
with COVID-19 as a main topic (Fig. 9A). This is a useful
and rapid first approximation of topics covered by those publications, extracted as plain
text. These can be expanded upon by querying for the main subjects annotated on the articles themselves.

[Figure: COVID-19 publication topics as of September 10, 2020.]
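Such a topic query can be sketched as follows, assuming “main subject” (P921) as the topic property and Q84263196 as the COVID-19 item; it surfaces the subjects most frequently co-annotated with COVID-19 on articles:

```sparql
# Sketch: main subjects (P921) that co-occur with COVID-19 (Q84263196)
# on scholarly articles, ranked by frequency.
SELECT ?topic ?topicLabel (COUNT(?article) AS ?count) WHERE {
  ?article wdt:P921 wd:Q84263196 ;
           wdt:P921 ?topic .
  FILTER (?topic != wd:Q84263196)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?topic ?topicLabel
ORDER BY DESC(?count)
LIMIT 20
```

Unlike keyword extraction from titles, this approach returns disambiguated items that can themselves be traversed further.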
Because Wikidata is agnostic to the exact type of research output, its structure is equally suited to representing information on research publications, preprints (Fig. S5), clinical trials (Fig. S6) or computer applications (Fig. S7). However, preprints are not yet thoroughly covered in Wikidata, a limitation for this context as preprints have become particularly important during the rapid pace of COVID-19 research [10,64]. Further, Wikidata’s rich biographical and institutional data makes extracting information on authors, institutions or others straightforward (Fig. S8), and eventually for other contributors too [71].
Further emphasising the multidisciplinary nature of Wikidata, there are also significant social aspects of the pandemic contained in the knowledge base. This includes simple collation of information, such as regional official COVID websites, and unofficial but common hashtags (Fig. S9), or relevant images under Creative Commons licenses (Fig. S10). It also includes more cross-disciplinary information, such as companies that have reported bankruptcy, with the pandemic recorded as the main cause (Fig. 10), or the locations of those working on COVID (Fig. S8B).

[Fig. 10: Bankruptcies of publicly listed businesses due to the COVID-19 pandemic as of September 13, 2020.]
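One plausible way to express the underlying query is sketched below; it assumes a common Wikidata modelling in which a business (Q4830453) carries a “significant event” (P793) statement with value “bankruptcy” (Q146083), qualified by “has cause” (P828) pointing at the COVID-19 pandemic (Q81068910). All of these identifiers are assumptions about the data model rather than the authors' published query:

```sparql
# Sketch: businesses whose bankruptcy is recorded with the COVID-19
# pandemic as its cause, under the assumed modelling described above.
SELECT ?businessLabel ?date WHERE {
  ?business wdt:P31/wdt:P279* wd:Q4830453 .  # any type of business
  ?business p:P793 ?event .                  # significant event
  ?event ps:P793 wd:Q146083 ;                # event: bankruptcy
         pq:P828 wd:Q81068910 .              # has cause: COVID-19 pandemic
  OPTIONAL { ?event pq:P585 ?date . }        # point in time, if recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```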
However, this also exemplifies how misleading missing data can be: Wikidata currently has
highly inconsistent coverage of companies that are not publicly listed, which heavily
biases the results. For example, yearly updated socio-economic information is often lacking, which further limits such analyses.
Many knowledge graphs have been rapidly developed to represent various types of COVID-19-related information, including government responses [22], epidemiology [108], clinical data [73], scholarly outputs and outcomes [46], economic impacts [8], physiopathology [24], social networking [23] among other features related to the COVID-19 pandemic. These semantic databases are mainly built using a combination of human efforts and crowdsourcing techniques [22]. Such resources can also be developed through the automatic extraction – using natural language processing techniques – of information from scholarly publications about the outbreak, as is the case with the COVID-19 Open Research Dataset [106].
Despite the importance of such resources, they tend to cover a narrow range of aspects of the disease, and despite the challenges (cf. Section 1.1), more integrated approaches are necessary to support advanced decision-making related to the outbreak. In response, integrated semantic databases have been launched to combine more divergent information, such as CIDO (combining clinical data with genomics) [37] and COVID-19 data hub (combining epidemiological data with social interactions) [35].
While clearly a valuable part of the data ecosystem, these projects rely on small groups of data curators, a model that has struggled to keep pace when data and scholarly literature grow sharply, as is the case with topics like COVID-19 [53]. This observation fits with the considerably limited volume of knowledge graphs exclusively enriched and verified by a dedicated expert group – such as OpenCyc – when compared to the volume of open and collaborative knowledge graphs, particularly Wikidata, YAGO, DBpedia and Freebase [30].
Whereas most knowledge graphs tend to be specialized and developed by a limited team,
Wikidata deliberately takes a multidisciplinary, multilingual position anchored in the
linked open data ecosystem. It is this breadth, combined with its interoperability, that
makes it unique among even other user-generated collaborative projects. Indeed, it becomes
uniquely suited to highly dynamic topics such as the COVID-19 pandemic [104,105]. In
comparison to other resources like DBpedia, Wikidata is not just edited by machines and built from data automatically extracted from textual resources like Wikipedia [60]. Rather, it complements automated edits from trusted sources with enrichments and adjustments by a community of over 25,000 active human users on a daily basis.
These factors have facilitated Wikidata’s rapid growth since its creation in 2012 into an
interdisciplinary network of >90 million items, richly interconnected by more than a
billion statements [97,98]. In the context of the COVID-19 outbreak, Wikidata has proven its
efficiency in representing multiple facets of the pandemic ranging from biomedical
information to social impacts. This stands in marked contrast to other integrated semantic
graphs that only combine two to three distinct features of the pandemic (e.g. CIDO [37], COVID-19 data hub [35], or COVID-19 Living Data).
Despite the advantages of collaborative editing and free reuse of open knowledge graphs like Wikidata for supporting and enriching COVID-19 information, these two features have several drawbacks related to data quality and legal concerns. It is true that the use of fully open licenses (CC0 or public domain) in centralized knowledge graphs removes all legal barriers to their reuse in other knowledge graphs or to driving knowledge-based systems, and encourages the development of intelligent support for tasks related to COVID-19. However, the application of CC0 to these databases prevents them from integrating information from semantic resources and datasets with only partially open licenses (e.g. CC BY and MIT), as such licenses require either the attribution of the source work to its authors or the use of the same license when processing the data [36,77]. This situation is analogous to the status of regular group O red blood cells as a universal donor but restricted recipient [63].
It is worth noting that crowd-sourced collaborative editing is often prone to the law of
diminishing returns: the quality of human curation reaches a certain point, beyond which it
is difficult to achieve additional major improvements. For instance, the quality of Wikidata relies partly on content such as automatically extracted infoboxes, which may only be verified and checked by editors some time later. However, research shows that the Wikidata community is already quite responsive to the needs of the database for all practical purposes. It is also worth remembering that machine-based systems are not immune to this effect either [69]. Although collaborative
editing contributed to the development of large-scale information about all aspects of the
disease, there are currently still significant gaps and biases in the dataset that can lead
to imprecise results if not interpreted with caution. For example, the coverage of COVID-19 outbreaks on cruise ships remains incomplete. (As of February 18, 2021, there are only 62 Wikidata administrators.) The validation schemas for COVID-19 information in Wikidata are publicly available.
In this research paper, we have demonstrated the ability of open and collaborative knowledge graphs such as Wikidata to represent and integrate multidisciplinary COVID-19 information, and the use of SPARQL to generate summary visualizations about the infectious disease, the underlying pathogen, the resulting pandemic and related topics. We have shown how the community-driven approach to editing without centralized coordination has contributed to the success of Wikidata in tackling emerging and rapidly changing phenomena, such as the pandemic. We have also discussed the disadvantages of collaborative editing for systematic knowledge representation, mainly the difficulty of ensuring sustainability for COVID-19 information in open knowledge graphs, the tricky validation of conflicting semantic data, the lack of coverage of several aspects of the analyzed pandemic, and the significant underrepresentation of advanced semantics for several types of Wikidata statements. We then described how the Wikimedia community is currently trying to solve these problems through a series of advanced technical and organizational solutions. As an open semantic resource in the RDF format, Wikidata has become a hub for COVID-19 knowledge thanks to its alignment with major external resources and its broad multidisciplinarity. The insertion of information in the Linked Open Data format provides the flexibility to integrate many facets of COVID-19 data with non-COVID-19 data. Through its multilingual structure, these inputs are contributed to (and reused by) people all over the world, with different backgrounds. Effectively, the WikiProject COVID-19 has made COVID-19 knowledge more FAIR: Findable, Accessible, Interoperable and Reusable [104].
An important aspect of Wikidata’s FAIRness is the Wikidata SPARQL query service, which makes the live knowledge graph queryable by anyone.
As Wikidata is community-oriented and broadly themed, virtually any researcher can take advantage of its knowledge, and contribute to it. SPARQL queries can complement and enrich research publications, providing both an overview of domain-specific knowledge for original research, as well as serving as the base for systematic reviews or scientometric studies. Of note, SPARQL queries can be inserted into living publications, which can keep up to date with the advancements both in human knowledge and its coverage on Wikidata.
Another part of FAIRness is user-friendly programmatic data access. Wikidata database dumps are available for download and local processing in RDF, JSON and XML formats.
Even though Wikidata is rich in COVID-19 knowledge, there is always room for improvement. As a collaborative endeavour, Wikidata and the WikiProject COVID-19 are likely to become further enriched over time. By the collective efforts of contributors, we hope that the database will grow in quality and coverage, supporting other types of information – such as the outcomes of the ongoing COVID-19-related research efforts – and contributing to higher pandemic preparedness globally.
Conflict of interest
All authors of this paper are active members of WikiProject Medicine, the community curating clinical knowledge in Wikidata, and of WikiProject COVID-19, the community developing multidisciplinary COVID-19 information in Wikidata. DJ is a non-paid voluntary member of the Board of Trustees of the Wikimedia Foundation, the non-profit publisher of Wikipedia and Wikidata.
Data availability
Source files for most of the tables featured in this work are available at https://github.com/csisc/WikidataCOVID19SPARQL under the CC0 license [96]. The figures involved in this research study and the source code of the SPARQL queries used in this work are also made openly available.
