Abstract
Introduction
One of the most important criteria for the evaluation of a scientific contribution is the coherent organisation of the textual narrative that describes it, most often published as a scientific article or book. In most academic disciplines, such writings have well-established models of organisation and rhetorical structure, by which scholars and contributors generally abide. These expectations are promoted by academic publishers, who ask for standardised models in the submissions they receive, constructed to efficiently describe the content’s organisation in logical sections. Such models not only express the expected structure of the article or book, but facilitate the detection of omissions, redundancies or incorrect sequences. Unfortunately, the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, each expressed in a bespoke document type definition (DTD). There is thus a need to integrate these different languages into a single, unifying framework that may be used for all content, regardless of provenance and scientific context. For instance, a recent report by Beck [3] explains the requirements for an XML vocabulary of scientific journals to be acceptable for inclusion in PubMed Central.
Several studies exist that discuss models and theories for describing the structural, rhetorical and argumentative functions of texts. Such detailed descriptions in machine-readable form (e.g. [31]) have become a necessity for high-volume data access and comprehension both by humans and machines [8,10]. They are also a strict requirement for the complex process of semantic publishing [36,37]. Being able to simplify and automate the time-consuming process of annotating the structural and rhetorical behaviours of document components (such as identifying front/body/back matters, Abstract, Results, etc.) may be instrumental in providing a number of services to publishers, open archives, and scientists themselves. For instance, the correct identification of structural patterns in academic documents could be used to generate lists and summaries automatically (e.g., tables of contents, lists of figures), to render the content in a web browser, or to provide full-scale converters between different component vocabularies, readily usable by delivery and publication platforms.
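This paper describes the Document Components Ontology (DoCO), an OWL ontology for describing the structural and rhetorical components of documents.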
The rest of this paper is organised as follows. In Section 2 we discuss some relevant work about models describing document components. In Section 3 we give an overview of DoCO, presenting its foundations and formal characterisation to describe the organisation of documents according to both structural patterns and rhetorical structures. In Section 4 we illustrate how DoCO is presently being used for annotation and document component retrieval, two high-value tasks in literature management and analysis. Finally, in Section 5 we present further development planned for the near future.
Semantic Publishing and Referencing ontologies
In the past, several groups have proposed Semantic Web models, such as RDFS vocabularies and OWL ontologies, to describe particular aspects of the publishing domain. Most of these are concerned with the description of the metadata of bibliographic resources (e.g., DC Terms, PRISM (http://www.prismstandard.org/resources/mod_prism.html) and BIBO).
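The original suite of Semantic Publishing and Referencing (SPAR) ontologies comprises eight distinct modules. The following are seven of these, while the last one, DoCO, is discussed in Section 3:
FaBiO, the FRBR-aligned Bibliographic Ontology;
CiTO, the Citation Typing Ontology;
BiRO, the Bibliographic Reference Ontology;
C4O, the Citation Counting and Context Characterisation Ontology;
PRO, the Publishing Roles Ontology;
PSO, the Publishing Status Ontology;
PWO, the Publishing Workflow Ontology.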
The above seven ontologies, along with the Document Components Ontology (DoCO), form the original set of SPAR ontologies. This set has more recently been extended with four complementary ontologies that broaden the coverage of the publishing domain. These are as follows:
SCoRO, the Scholarly Contributions and Roles Ontology;
FRAPO, the Funding, Research Administration and Projects Ontology;
the DataCite Ontology, based on the DataCite schema;
BiDO, the Bibliometric Data Ontology.
Still being actively maintained and expanded, the SPAR ontologies have drawn the attention of the Semantic Publishing community, as a reference point for standardising entity descriptions and fostering interoperability between services – as discussed in Section 4.
To the best of our knowledge, the first concrete attempt at describing document components by means of Semantic Web technologies is SALT [21]. Currently the SALT ontologies are not available at their original URLs, but we are informed that they will be hosted elsewhere in future.
Similar to the above are the SWAN Discourse Elements Ontology and the SWAN Discourse Relationships Ontology.
In [4], Ciccarese and Groza introduce ORB, the Ontology of Rhetorical Blocks.
A detailed review and analysis of other RDF/OWL vocabularies and ontologies targeting the description of document components in terms of argumentative elements is presented by Schneider et al. in [35].
Other non-OWL proposals describing the structures that may be used in documents also exist. An example is the
From a more syntactical point of view, Tannier et al. [38] associate each (XML) element in a document with one of three different categories:

Diagram describing the composition and the classes of the
Zou et al. [41] make Tannier et al.’s classification more extreme, defining only two categories of document elements:
Finally, several XML vocabularies have been developed in the past years and are currently used by scholarly publishers (e.g., the Elsevier Journal Article DTD; see the Elsevier XML DTDs and transport schemas at http://www.elsevier.com/author-schemas/elsevier-xml-dtds-and-transport-schemas).
Even if each of the aforementioned works proposes to model document components according to a particular perspective (e.g., structural vs. rhetorical, minimalistic vs. all-inclusive), a generic model harmonising all these aspects is still missing. DoCO is our tentative attempt to cover all these different perspectives, since it is an OWL model for describing all the extrinsic and intrinsic characterisations of document components.
There is an intrinsic difficulty in defining certain document components as purely rhetorical or purely structural. Even a well-known, easily identifiable component such as the paragraph cannot be considered as being strictly structural (i.e., carrying only a syntactic function), since it intrinsically carries rhetoric as well, through its natural language sentences. Paragraphs therefore have more than a syntactic function.
However, document markup languages often define a paragraph as a pure structural component, without any reference to its rhetorical function:
“A paragraph is typically a run of phrasing content that forms a block of text with one or more sentences” [22];
“Paragraphs in DocBook may contain almost all inlines and most block elements” [40].
The above definitions emphasise the structural connotation of the paragraph, that it “forms a block of text” or that it “contains” other elements, and this connotation is amplified by our direct experience as readers. It is the structural aspect that readily stands out in a book or webpage and that helps us, as readers, to distinguish a paragraph from the surrounding text. Yet this is insufficient for describing this element in its entirety. For instance, what is missed is the characterisation of a paragraph as a “self-contained unit of a discourse in writing dealing with a particular point or idea” (Wikipedia article about “Paragraph”).
The
The creation of DoCO was undertaken by studying different corpora of documents (mainly scientific literature and web documents on different topics) and publishers’ guidelines, from two perspectives – the structural and the rhetorical – as was also done by past works on document patterns [13–15]. We also undertook some informal interviews with researchers in different fields and with academic publishers, in order to gather as much information as possible about document components and their use. In addition, when developing DoCO and all its imported ontologies, we followed all the best practices already adopted in [5] and [6], which are directly inspired by the OBO Foundry Principles.

A Graffoo diagram [17] showing the eight concrete patterns for document structures (bottom classes, in blue) and their relationships to high-level and abstract patterns (top classes, in yellow). (Color figure online)

Following these principles, DoCO and its imported ontologies:
are open for use by all;
possess a unique identifier space (namespace);
are published in distinct successive versions;
have clearly specified and delineated content;
are orthogonal to other SPAR ontologies;
include textual definitions for all terms;
use relationships (object and data properties) that are unambiguously defined;
strive to be well documented;
are meant to serve a plurality of independent users;
have been developed collaboratively.
DoCO imports the Discourse Elements Ontology and the Pattern Ontology.
In the next subsections we briefly introduce our theory of structural patterns as described in [14], and the rhetorical components that usually appear in scholarly articles, which represent the theoretical underpinnings of DoCO. Then, we introduce some of the document components of DoCO relevant for the description of scientific articles. We provide their formal definitions using DL formulas.
We have been investigating patterns of textual documents to understand how their structure can be segmented into atomic components that can be addressed independently and manipulated for different purposes. Instead of defining a large number of complex and diversified structures, in [13] we proposed a small number of structural patterns.
These patterns for textual documents were fully described in [14] and modelled as an OWL ontology called the Pattern Ontology.
All the patterns are defined in terms of two main kinds of entities, themselves characterised by two different properties.
These patterns are briefly introduced in Table 1. They facilitate the creation of unambiguous, manageable and well-structured documents. The regularity of pattern-based documents (defined by means of markup languages such as DocBook or LaTeX) then makes it possible to perform complex operations easily, even when knowing very little about the documents’ markup vocabulary. This in turn enables designers to implement more reliable and efficient tools [14], make hypotheses regarding the meanings of document fragments [15], identify special cases, and study global properties of sets of documents [13].
Eight (plus three) structural patterns for descriptive documents
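The way the eight basic patterns partition document elements can be sketched in a few lines of Python. This is a minimal illustration and not the Pattern Ontology itself: it assumes, following the characterisation in [14], that a pattern is determined by whether an element directly holds text, whether it holds other elements, and whether it occurs in a mixed-content context or in an element-only one. The function name and its three boolean parameters are hypothetical, and the three specialisations of Container (Table, Record and HeadedContainer) are deliberately left out.

```python
def classify_pattern(has_text: bool, has_elements: bool, in_mixed_context: bool) -> str:
    """Return the name of one of the eight basic patterns (illustrative only).

    has_text         -- the element directly contains text (assumed property)
    has_elements     -- the element contains other elements (assumed property)
    in_mixed_context -- the element appears where text and elements are interleaved
    """
    if has_text and has_elements:
        # Text interleaved with sub-elements.
        return "Inline" if in_mixed_context else "Block"
    if has_text:
        # Text only, no sub-elements.
        return "Atom" if in_mixed_context else "Field"
    if has_elements:
        # Sub-elements only, no direct text.
        return "Popup" if in_mixed_context else "Container"
    # No content at all.
    return "Milestone" if in_mixed_context else "Meta"


# A paragraph-like element: text plus inline markup, placed inside a section.
paragraph_pattern = classify_pattern(True, True, False)
# A section-like element: only sub-elements, placed inside another container.
section_pattern = classify_pattern(False, True, False)
```

Under these assumptions a paragraph-like element falls under Block and a section-like element under Container, which is the reading relied upon when DoCO classes are later defined on top of the patterns.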
The pure rhetorical characterisation of document components is not necessarily linked to the structural organisation that a scholarly article may have. For example, some scientific journals (such as the Journal of Web Semantics29
Journal of Web Semantics Guide for Authors: http://www.elsevier.com/journals/journal-of-web-semantics/1570-8268/guide-for-authors.
The characterisations of these purely rhetorical components, which are not always linked explicitly to a particular structure, are defined in the Discourse Elements Ontology.
Note that it is still possible to apply two different rhetorical characterisations to the same block of text. For instance, in journal articles it is common to have a section entitled “Materials and Methods”, which can be characterised rhetorically by using both the classes
In this subsection, we introduce those classes of DoCO that bring together both the purely structural elements of a document (i.e., the structural patterns introduced in Section 3.1) and generic rhetorical characterisations (i.e., the rhetorical components recounted in Section 3.2). We focus particularly on the structures that usually define the main components of scientific papers.
As already mentioned, DoCO contains more classes than those described here in the text, to enable description of other kinds of bibliographic entities, such as books and poems, in addition to scientific articles. For a full list, see the ontology itself.
The class
In this and the following description logic excerpts, we use some properties that are defined in imported ontologies. In particular,
Potentially there exist two different ways of organising footnotes, since their structural semantics can depend on the particular (markup) language used to express them, as discussed in [15]. The first is a container-based behaviour, as adopted by JATS [24], that allows one to specify footnotes (through the element
Any table in DoCO is described as a po:Table that contains at least one po:Container, without referring explicitly to its rows, columns and cells. In the current version of DoCO, the explicit formalisation of these finer-grained elements was purposely avoided.
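Read as a description logic axiom, the statement about tables could be rendered roughly as follows. This is a sketch derived from the prose alone: the class name doco:Table and the use of po:contains as the containment property are assumptions, and the axiom published in the ontology may be phrased differently.

```latex
\[
\texttt{doco:Table} \;\sqsubseteq\; \texttt{po:Table} \;\sqcap\; \exists\, \texttt{po:contains}.\,\texttt{po:Container}
\]
```

The existential restriction captures "at least one po:Container" while saying nothing about rows, columns or cells, in line with the decision to leave those finer-grained elements unformalised.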
A
Commonly, in scientific publications, figures and tables are placed in captioned boxes (i.e., a
Captioned boxes can be used to define a space within a document that contains either a figure (i.e.,
A
This class is particularly useful to describe other, more specific kinds of lists: table of contents, list of figures, list of tables, etc. In particular, the class
All above textual or graphical constructs are usually contained within broader elements that aim to describe the overall organisation of the document structure. First, we have the
Following the front matter, the
The
The aforementioned elements are composed of other textual structures used for a coarse-grained and hierarchical organisation of text, such as
Articles normally, and even chapters sometimes, have particular kinds of sections that have a particular structural and rhetorical function, such as the
The latter kind of section/chapter, defined by the class
Sections and other high-level constructs such as chapters, captioned boxes or the document itself, can be introduced by a
Starting from the above definition, it is then easy to describe particular kinds of titles, such as
The following excerpt, written in Turtle [32], is an example of how DoCO may be used to describe some of the components characterising this article:
The main container (i.e., the paper) is described through FaBiO [29], while the order among the various components has been described by means of the Collections Ontology (CO).
A more detailed version of this example, describing the paper in RDF according to DoCO, is available in [28].
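As a rough illustration of the kind of description discussed above, the following Python sketch assembles a few Turtle statements for a paper with a front matter, a body matter and a list of titled sections. It is a sketch under stated assumptions, not the published example: the ex: namespace and resource names are hypothetical, the namespace URIs, class names (doco:FrontMatter, doco:BodyMatter, doco:Section, doco:SectionTitle, fabio:JournalArticle) and properties (po:contains, po:containsAsHeader, dcterms:title) should be checked against the ontologies themselves, and the ordering of components through the Collections Ontology is omitted for brevity.

```python
# Prefixes assumed for this sketch; "ex" is a hypothetical namespace.
PREFIXES = {
    "doco": "http://purl.org/spar/doco/",
    "fabio": "http://purl.org/spar/fabio/",
    "po": "http://www.essepuntato.it/2008/12/pattern#",
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "http://example.org/paper/",
}


def literal(text):
    """Quote a plain string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_paper(section_titles):
    """Build (subject, predicate, object) triples for a paper and its sections."""
    triples = [
        ("ex:paper", "a", "fabio:JournalArticle"),
        ("ex:paper", "po:contains", "ex:front-matter"),
        ("ex:paper", "po:contains", "ex:body-matter"),
        ("ex:front-matter", "a", "doco:FrontMatter"),
        ("ex:body-matter", "a", "doco:BodyMatter"),
    ]
    for number, title in enumerate(section_titles, start=1):
        section = f"ex:section-{number}"
        heading = f"ex:section-{number}-title"
        triples += [
            ("ex:body-matter", "po:contains", section),
            (section, "a", "doco:Section"),
            (section, "po:containsAsHeader", heading),
            (heading, "a", "doco:SectionTitle"),
            (heading, "dcterms:title", literal(title)),
        ]
    return triples


def to_turtle(triples):
    """Serialise the triples as one Turtle statement per line, after the prefixes."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)


paper_triples = describe_paper(["Introduction", "Related work"])
paper_turtle = to_turtle(paper_triples)
```

Keeping the triples as plain tuples makes the sketch independent of any RDF library; in practice a dedicated RDF toolkit would be preferable for validation and serialisation.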
This section presents an evaluation of the uses of DoCO, made by listing its adoption in different application scenarios involving the works of different research groups. In particular, we discuss some relevant applications of DoCO in tools and algorithms for the annotation and processing of scholarly articles developed in the past years by two of our research groups, one at the University of Bologna, and another at the University of Manchester. In addition, at the end of this section, we briefly list other external works that concretely use DoCO for different purposes within the Semantic Publishing community.
Processing scholarly articles: PDFX
PDFX35
The PDFX web service:
The identified elements are ultimately stored in an XML file with a tag hierarchy that closely follows the ANSI/NISO Journal Article Tag Suite standard (JATS) [24]. The semi-structured nature of the XML serves as a quick and convenient access route to any of the article’s components.
A “class” attribute has been added to each XML element in order to facilitate interoperability with other services. This attribute is derived from the tag given to an element in the identification stage, and is set in accordance with DoCO. This procedure facilitates aligning the structure recognition output of PDFX with the inputs that other text processing pipelines expect, thus adding a valuable metadata layer to the original publication. A multitude of different-purpose workflows can treat the PDF-to-DoCO-compliant-XML conversion as a pre-processing step, greatly widening their application domain in terms of accepted input.
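A downstream workflow could exploit such a “class” attribute along the following lines. The sketch below uses Python's standard xml.etree.ElementTree module to group the text of elements by their class value. The sample document, its element names and the exact attribute values (e.g., "DoCO:Paragraph") are invented for illustration and should not be read as actual PDFX output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, hand-written sample; not real PDFX output.
SAMPLE = """<article>
  <front class="DoCO:FrontMatter">
    <title-group class="DoCO:Title">An example title</title-group>
  </front>
  <body class="DoCO:BodyMatter">
    <section class="DoCO:Section">
      <h1 class="DoCO:SectionTitle">Introduction</h1>
      <region class="DoCO:Paragraph">First paragraph.</region>
      <region class="DoCO:Paragraph">Second paragraph.</region>
    </section>
  </body>
</article>"""


def index_by_class(xml_text):
    """Map each value of the 'class' attribute to the text of the elements carrying it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    # iter() walks the tree in document order, so texts keep their reading order.
    for element in root.iter():
        label = element.get("class")
        if label is not None:
            index[label].append("".join(element.itertext()).strip())
    return index


components = index_by_class(SAMPLE)
paragraphs = components["DoCO:Paragraph"]
```

Because the lookup relies only on the attribute and not on element names, the same function would work unchanged on any XML vocabulary annotated in this way.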
Utopia Documents36
Utopia Documents –
DoCO provides a disciplined way for PDFX and Utopia Documents to interoperate. In particular, for any visualised PDF document, Utopia Documents runs the PDFX service in the background, using information about identified structural elements to provide additional user functionality. DoCO is used as a mechanism for tagging the output of PDFX and other Utopia Documents plugins in an interchangeable way; thus if plugins want to exchange tables/figures and references, they use DoCO annotations. Additionally, third-party plugins that are used for text mining can use the tagged structure to tune their behaviour as they pass through the document (e.g., some algorithms may want to include/exclude certain sections, or to become more or less sensitive, or to include/exclude captions or references during processing). For example, the mention of a particular gene or protein in the Introduction or Discussion sections of a paper is likely to have a very different meaning to the mention of it in the Materials and Methods section, where it is likely to be an “ingredient”.
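The gene or protein example can be made concrete with a small Python sketch of a section-aware text-mining step. Everything in it is hypothetical: the Mention type, the section labels and the "topic" and "ingredient" roles are names chosen for illustration, and no actual Utopia Documents plugin interface is implied.

```python
from dataclasses import dataclass

# Hypothetical section labels, as they might be derived from tagged document structure.
INGREDIENT_SECTIONS = {"Materials and Methods"}
SKIPPED_SECTIONS = {"References"}


@dataclass
class Mention:
    term: str      # the entity found in the text, e.g. a gene name
    section: str   # the label of the section in which it was found


def interpret(mentions):
    """Label each mention with the role its enclosing section suggests."""
    results = []
    for mention in mentions:
        if mention.section in SKIPPED_SECTIONS:
            # Some sections are excluded from processing altogether.
            continue
        role = "ingredient" if mention.section in INGREDIENT_SECTIONS else "topic"
        results.append((mention.term, mention.section, role))
    return results


labelled = interpret([
    Mention("BRCA1", "Introduction"),
    Mention("BRCA1", "Materials and Methods"),
    Mention("BRCA1", "References"),
])
```

The same term thus receives a different interpretation depending on where it occurs, and mentions in excluded sections are dropped before any further analysis.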
The rhetorical element types that PDFX can differentiate
Utopia Documents works as follows. When a user opens an article, Utopia Documents uses PDFX to analyse the document’s structure. DoCO
Although the most frequently occurring structural components of documents are expressed in most XML vocabularies used by scholarly publishers – e.g., the Elsevier Journal Article DTD, DocBook and JATS – they are often expressed by different elements. For instance, the element
In making steps towards addressing this issue, we have recently used DoCO as a theoretical base for the development of an ontology-aware algorithm to retrieve the meaning of markup structures in XML article sources [15], without explicitly looking either at the particular markup language used, or the actual content of the document. The algorithm was developed by starting from the actual specification of DoCO classes, and then tuned according to other statistical and topological principles (e.g. the frequency of markup elements, their position within the document, etc.).
The algorithm (fully introduced in [15]) is neither an intelligent nor an adaptive algorithm, but rather a prescriptive one that uses the logical characterisations of DoCO components as a basis to identify them in documents through an iterative process.
We performed a preliminary test (fully described in [15]) on a dataset consisting of 117 scientific papers encoded in DocBook and published between 2008 and 2011 in the Balisage Conference Series.
We acknowledge that this analysis was subjective and solely based on our understanding of the semantics of the element, its definition schema and its documentation.
We are currently extending the algorithm in order to try to recognise additional DoCO components.
In addition to our work described in the previous sections, we list here some of the most important activities within the Semantic Publishing community that work with or reference DoCO, according to a bipartite classification: works that use DoCO for internal project goals, and works that discuss its use for modelling document components.
Adoptions of DoCO as part of existing works
Use of DoCO for modelling documents
Conclusions
In this paper we introduced DoCO, the Document Components Ontology.
Technically speaking, DoCO is a model that provides a general structured vocabulary of document components, based on our previous work on document patterns [14] and other existing works on the rhetorical characterisation of documents, such as [20,21]. DoCO was developed in order to be used in a complementary way with other ontologies describing different aspects of the publishing domain and scientific discourse. It can, for example, be used in conjunction with CiTO to identify the specific sections, paragraphs, figures or tables to which a citation relates, instead of citing the paper as a whole. It can likewise be used with the SALT Rhetorical Ontology to explicitly characterise sentences or pieces of text as carrying a particular argumentative function.
In particular, in this article we formally described the DoCO components that most commonly appear within scientific articles, such as paragraphs, figures, tables, sections, chapters, references, front/body/back matters, and the like. In addition, we described tools and methods that use DoCO for different purposes, such as annotating PDF documents or retrieving the intended semantics of components of scholarly articles.
As future work, building on the encouraging results we obtained from our tests described in Section 4.3, we plan to refine the heuristics used in the algorithm for automated document component analysis, so as to increase the precision and recall for each element relative to the gold standard. We also plan to extend the set of DoCO structures handled, to enable automated identification of other significant document components such as mathematical formulas, block quotes and front matter metadata (authors, affiliations, e-mail addresses for corresponding authors, etc.).
An initial mapping between DoCO and DocBook is already described in [15]. We plan to add further mappings, for example to JATS metadata elements, in the near future.
In addition, we are working on extending the current implementation of PDFX in order to identify other document components, including those which are purely rhetorical (e.g., methods, materials, experiment, data, result, evaluation, discussion). All these components will have adequate DoCO annotations in the XML conversion outputs. Another planned development for PDFX concerns the automatic conversion of all the structures retrieved and declared in the XML outputs into RDF according to DoCO and other relevant models, such as EARMARK [16] and SALT [21].
