Abstract
Keywords
Introduction
Ontology matching is the non-trivial task of finding correspondences between entities
of two or more given ontologies or schemas. It is an integral part of ensuring
semantic interoperability. The matching can be performed manually or through the use
of an automated matching system. Ontology matching is a problem for Open Data (e.g.
matching publicly available domain ontologies or interlinking concepts in the
A major challenge for matching ontologies is that they are typically designed within a given context and with deep background knowledge that is not explicitly expressed in the schema definition [73]. To automate the ontology matching process, external background knowledge is therefore required so that the automated matching system can interpret, for example, textual labels and descriptions of the elements within the schemas that are to be matched.
Current surveys in the ontology matching [14,19,223,238] and schema matching [12,318] domains classify matching systems according to their matching technique (strongly influenced by Euzenat and Shvaiko [74,290] as well as Rahm and Bernstein [265]) with minor or no emphasis at all on the background knowledge used.
In the area of context-based matching, i.e. matching with intermediate resources, Locoro et al. [213] present an abstract seven-step process for context-based matching together with an experimental evaluation of different parameter configurations. The proposed framework is flexible but experimentally focused on ontologies as background knowledge and a path- and logic-based exploitation approach. The survey at hand takes a broader look at the types of background sources and different exploitation strategies used in research including, for instance, unstructured data and statistical or neural approaches.
A recent survey by Trojahn et al. [334]
provides a detailed perspective into foundational ontologies in ontology matching
which includes, among other use cases, the exploitation of those for the task of
matching domain ontologies. The survey presented here is broader in the sense that
foundational ontologies are considered only as
Thiéblin et al. [327] review complex matching systems, i.e. systems that are capable of generating correspondences involving multiple entities, transformation functions, and logical constructors. The matching systems covered in their survey use different knowledge representation models (including table-based or document-based schemas, for instance). The systems are characterized based on the correspondence output and the underlying process type which generated the complex alignment. Background knowledge is not discussed and does not play a major role in the current implementations of complex matching systems. The survey at hand is complementary in the sense that it focuses on systems producing simple equivalence correspondences through the use of background knowledge.
This comprehensive survey reviews an extensive set of ontology matching and integration systems published in the last two decades in terms of the background knowledge used and in terms of the strategy that is applied to exploit the external background knowledge. It further covers the approaches used to link schema concepts to background knowledge. Based on the extensive collection of reviewed systems, we provide a comprehensive overview of background knowledge sources and strategies used in the past. Furthermore, this survey reveals a number of blind spots that have not yet been thoroughly explored.
In the following, the selection method for publications used in this survey is presented (Section 2.1). Afterwards, the core theoretic concepts are introduced in Section 3, namely schema matching and ontology matching (OM). In Section 4, background knowledge is defined, its usage in ontology matching systems is analyzed, and the most used resources are presented. Thereupon, classification systems for background knowledge sources (Section 5), concept linking approaches (Section 6), and exploitation approaches (Section 7) are presented together with examples. In Section 8, we outline interesting directions for future work in the research field.
Selection of publications
We further manually added Back then the competition was actually
referred to as
The number of retrieved papers for each search parameter can be found in
Table 1. The BibTeX files can be found in the
GitHub repository of this survey.
Search parameters and the associated number of papers
The resulting set of papers constitutes the final set of publications used for identifying relevant works for this survey. In total, 1,814 papers were considered in this study.
Inclusion and exclusion criteria for the papers in this survey
Papers considered in this survey had to be written in English (C1), had
to be accessible through the infrastructure of a large German research
university (C2), and could not be a duplicate of another paper (C3). It is
important to note that multiple publications on the same topic (such as a
matching system) do not qualify as duplicates despite their potentially large
content overlap. This is rooted in the observation that there are often multiple
versions and papers of a single matching system which evolves over time (for
example
We explicitly exclude works limited solely to instance matching or entity linking (C4). We further focus on matching systems that produce simple correspondences rather than complex ones (C5). Lastly, we only cover papers that present an actual system, i.e. a background knowledge-based (C6) ontology matching system implementation (C7) for which an evaluation is presented. The usage of the background knowledge must be appropriately documented (C8). In total, 341 papers fulfilled the inclusion criteria of this survey.
All matching systems were systematically evaluated in terms of (i) the background knowledge sources used, (ii) the strategy deployed to link ontology concepts to the background knowledge source, and (iii) the strategies the matching systems apply to exploit the background knowledge sources.
All data points and code used for the quantitative analysis of this survey are
available online.
The schema matching problem within the data integration process

Process for integrating two schemas, compiled from [344].
The focus of this paper is a special case of the first step of the DI process,
schema matching. It is important to note that a schema is not bound to a
technology stack. It is, for example, possible that the same schema is
implemented on different technology stacks such as different database types.
Many formalization notations for schemas have evolved over time – for example in
the area of (conceptual) entity relationship models
The ontology matching problem
This definition is a merge of
previous definitions by Gruber [112] and Borst [29].
A matching system can be seen as a function Originally called
Ontology integration (also referred to as ontology enrichment, ontology
inclusion, or ontology extension) describes the process of extending a given
target ontology
Phases of ontology integration: pre-processing, matching, merging, and post-processing.
In this article, we also cover papers and systems which address the ontology
integration problem where background knowledge plays a significant role in the
matching phase. In figures and tables, those systems are notated with a
subscript
Prior to 2010,
participants submitted resulting alignments directly. The submission of
packaged tools (at first in the form of URLs of Web services running on
the participants’ site) instead of results was started in 2010. Since
2012, the submission of packaged tools is the standard evaluation
procedure at the OAEI.
The
discontinuation of tracks is often due to missing track organizers.
Reasons may be the high effort connected to evaluating other
researchers’ matching systems and writing summarizing reports or a
change in the research focus. However, most track data is still
available for download and for further usage.

OAEI schema matching tracks since the inception of the initiative. Explicitly excluded are complex matching tracks and instance matching tracks. The knowledge graph track is not a pure schema matching task but a combined one where schemas and instances have to be matched simultaneously. The library track has been organized multiple times with completely different datasets and by different researchers using the same track name. Therefore, the track streams have been divided into three groups (A, B, C).
Depicted are all schema matching tasks of the OAEI 2020 and 2021 together
with the best performing systems in terms of
The tracks which were considered
are listed in Fig. 2. Figures 3 and 4 do not
include other evaluation tracks such as team participations in the
SemTab [157] track. Due to
very high similarity, the following matching systems have been merged in
the figure:
Figure 5 reveals that the number of participating schema matching systems has
dropped slightly from its peak in 2012, although the current participation total
is still high compared to the early days of the initiative. Figure 5 has been
compiled from Figs 3 and 4; hence, the concrete
number of schema matching systems is counted each year excluding pure
instance matching systems. The OAEI does not calculate this statistic.
In addition, we found that over the years the OAEI counted
inconsistently with regard to participation (for example counting
participating teams in 2012 but matching systems in 2013 on their
results Web page).
Table 3 lists all schema matching tracks from 2020 and 2021 together with the best performing system and the background knowledge sources used by those. As visible in the table, all of those systems make use of external knowledge datasets. AML, which is the best performing system in multiple tracks, exploits multiple external knowledge sources.
Background knowledge
We define background knowledge in matching as any knowledge source that is
external to the matching process and is used to obtain the final alignment.
Hence, within the matching process, external knowledge can be used in the form
of an existing alignment (

All OAEI schema matching systems (which participated in the tracks listed in Fig. 2) and their evaluation time frame since the inception of the OAEI; Part 1 of 2 from 2012–2021.

All OAEI schema matching systems (which participated in the tracks listed in Fig. 2) and their evaluation time frame since the inception of the OAEI; Part 2 of 2 from 2004–2021.

The number of ontology matching systems participating in the OAEI from inception to date.
Background knowledge can significantly improve the performance of ontology
matching systems. This is clearly visible when analyzing different OAEI systems:
when comparing LogMap and LogMapBio [150] in the OAEI 2021 campaign, for instance,
the latter system achieves a significantly higher recall on the OAEI Anatomy
dataset. Other examples can be found through a comparison of AML [88] and Gomma. There is
no results paper for the OAEI 2013 participation of Gomma; however, the
system is described in the paper of the 2012 campaign [111].
In [87], Faria et al. evaluate strategies for matching biomedical ontologies. The experiments show a clear performance increase when background knowledge is used. In terms of exploitation strategies, the authors recommend using cross-references (if available) over lexical expansion.
While evaluating an approach to build a background knowledge resource for ontology matching, Annane et al. [17] also analyze the performance of the YAM++ matching system with and without background knowledge, finding that the matcher configuration which uses background knowledge significantly outperforms the version without additional resources. They report that the better performance is mainly due to a higher recall.
In an extensive survey on the systems participating in the OAEI Anatomy track from 2007 to 2016, Dragisic et al. report that “[f]or the systems that participated with a version using biomedical auxiliary sources and a version not using biomedical auxiliary sources, the F-measure for the one with biomedical auxiliary sources was always higher” [68].
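The recall and F-measure figures referred to above compare a system alignment against a reference alignment. The following is a minimal sketch of that computation, assuming that an alignment is represented as a set of (source entity, target entity) pairs. The entity names and both example alignments are hypothetical and are not taken from any OAEI result.

```python
def evaluate_alignment(system, reference):
    """Return (precision, recall, f1) of a system alignment against a reference.

    Both arguments are sets of (source_entity, target_entity) pairs.
    """
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference alignment between two toy anatomy ontologies.
reference = {
    ("a:Heart", "b:Heart"),
    ("a:Lung", "b:Pulmo"),
    ("a:Skin", "b:Cutis"),
    ("a:Liver", "b:Hepar"),
}

# Hypothetical matcher output without background knowledge: only the
# lexically identical pair is found.
without_bk = {("a:Heart", "b:Heart")}

# Hypothetical matcher output with a synonym resource as background knowledge.
with_bk = {
    ("a:Heart", "b:Heart"),
    ("a:Lung", "b:Pulmo"),
    ("a:Skin", "b:Cutis"),
}

scores_without = evaluate_alignment(without_bk, reference)
scores_with = evaluate_alignment(with_bk, reference)
```

In this toy setting, precision stays the same while recall rises from one quarter to three quarters of the reference, which mirrors the pattern reported above where gains from background knowledge are mainly recall gains.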
Missing background knowledge was named as one of the 10 challenges for ontology matching in 2008 [291]; this was re-affirmed in 2013 [73] and it is still under active research.
As there are often multiple potentially beneficial sources of background
knowledge available for ontology matching, some authors propose heuristics to
determine the benefit of a background knowledge source in order to select one
before performing the match operation. Nasser et al. [330] define four criteria for automatic background
knowledge selection:
Based on their universal requirements,
they propose an approach which models the selection task as an information
retrieval problem. Ontologies and background sources are indexed using TF-IDF;
the ontologies are then regarded as queries on the background knowledge
sources.
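The retrieval view of background knowledge selection can be illustrated with a small sketch. It is not the implementation of the cited approach: the source names, their textual content, the whitespace tokenizer, the smoothed IDF variant, and the cosine ranking are all assumptions made for illustration. Each candidate background source is indexed as a TF-IDF vector, and the labels of the ontology to be matched act as the query.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps purely alphabetic tokens."""
    return [token for token in text.lower().split() if token.isalpha()]


def tfidf_vector(tokens, idf):
    """Weight raw term frequencies by IDF; terms unknown to the index are dropped."""
    term_counts = Counter(tokens)
    return {term: count * idf[term] for term, count in term_counts.items() if term in idf}


def build_index(sources):
    """Index background sources (name -> text) as TF-IDF vectors."""
    tokenized = {name: tokenize(text) for name, text in sources.items()}
    document_frequency = Counter()
    for tokens in tokenized.values():
        document_frequency.update(set(tokens))
    n = len(tokenized)
    # Smoothed IDF (an assumption of this sketch).
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in document_frequency.items()}
    vectors = {name: tfidf_vector(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_sources(ontology_text, sources):
    """Use the ontology labels as a query and rank sources by cosine similarity."""
    idf, vectors = build_index(sources)
    query = tfidf_vector(tokenize(ontology_text), idf)
    scores = {name: cosine(query, vector) for name, vector in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical background sources, each reduced to a bag of labels.
background_sources = {
    "anatomy_thesaurus": "heart lung liver kidney artery vein muscle",
    "finance_glossary": "bond stock derivative exchange interest loan",
    "geo_gazetteer": "river mountain city country lake",
}

# Hypothetical labels of the ontology that is to be matched.
ontology_labels = "heart artery vein valve"

ranking = rank_sources(ontology_labels, background_sources)
```

The first entry of `ranking` is the source that shares the most distinctive vocabulary with the ontology. A real selection component would additionally need the criteria mentioned above, which this sketch does not model.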
In the LogMapBio system, Chen et al. [45] apply a relatively simple lexical algorithm to identify suitable
mediating ontologies from BioPortal [104,352]. In the OAEI 2020
campaign, the system achieved a significantly higher recall and
Faria et al. [89] propose a heuristic
called
Knowledge sources and matching systems that use them part 1 of 4.
Referenced is the first documented usage by the matching system. Systems
that did not participate in the OAEI are italicized. Named systems are
referred to using their system name
Knowledge sources and matching systems that use them part 2 of 4. Referenced is the first documented usage by the matching system. Systems that did not participate in the OAEI are italicized. Named systems are referred to using their system name
Knowledge sources and matching systems that use them part 3 of 4. Referenced is the first documented usage by the matching system. Systems that did not participate in the OAEI are italicized. Named systems are referred to using their system name
Knowledge sources and matching systems that use them part 4 of 4. Referenced is the first documented usage by the matching system. Systems that did not participate in the OAEI are italicized. Named systems are referred to using their system name
Tables 4 to 7 list
all background knowledge sources that have been used by the systems evaluated in
this survey together with the actual systems that use the corresponding
knowledge source. As multiple papers exist for some systems, the first
documented usage of the knowledge source by the matching system is referenced.
Consequently, there is no guarantee that the latest system still uses the
specified sources.

Cumulative usage of a particular knowledge source of all systems in this survey within the years 2000 to 2021.
Figure 6 shows the cumulative usage of background
knowledge sources that have been referenced in at least four different
publications. The by far most often used external knowledge resource is
In Fig. 7, background knowledge source usage is
plotted over time. As in the figure before, only sources are depicted which are
used at least four times by the papers included in this survey. What is visible
from the figure (and also from Tables 4, 5, 6, 7, 8, and 9) is that background knowledge has been used from
very early on. In the first OAEI in 2004, for example, the The search engine is not online
anymore.

Number of publications of this survey using a particular knowledge source over time.
Matching systems using WordNet; Part 1 of 2. Referenced is the first
documented usage by the matching system. Systems that did not
participate in the OAEI at some point in time are italicized. Ontology
integration systems are indicated by a subscript
Matching systems using WordNet; Part 2 of 2. Referenced is the first
documented usage by the matching system. Systems that did not
participate in the OAEI at some point in time are italicized. Ontology
integration systems are indicated by a subscript
In the following, the ten most used external resources in ontology matching (see Fig. 6) are briefly introduced.
ICD stands for “International
Classification of Diseases”. SNOMED-CT
stands for “Systematized Nomenclature of Medicine Clinical
Terms”.
Classification system
Multiple approaches for categorizing general matching techniques have been
proposed [74,265,290]. The
matching techniques further studied in this survey can be broadly categorized as
This is naturally not precise.
WordNet and other lexical resources, for example, are not classified as
formal/informal resource-based but instead as language-based according
to Euzenat and Shvaiko.

Aggregated number of publications of this survey using external background knowledge in ontology matching. Domain-specific background knowledge sources are colored in light gray, general-purpose background knowledge sources are colored in black.

Classification of background knowledge sources that are used for matching.
Structured sources appear in different variations ( Theoretically, the other structured
resources can also be mono- or multilingual – however, the focus of the
knowledge provided there is rather factual and the language is typically
not the core property of the knowledge resource. Therefore, we decided
against a subdivision here in favor of clarity.
An overview of the proposed classification system is presented in Fig. 9; in Table 10, all resources covered in this survey are categorized according to the presented classification system. In the following, we will further define each structured resource and provide examples for all fine-grained categories.
German book title, translates to
Background knowledge sources sorted according to their type
We further differentiate between (i)
An example for a general-purpose single SW dataset would be
An example for domain-specific linked SW dataset in this sense would be some or
all
Further properties of background knowledge sources that are not used here for the
proposed classification are (i)
The resource size may limit the utility provided by the source – a small general
knowledge thesaurus, for example, may only be of limited use – but may at the
same time also limit the exploitation strategy that can be used; the
The task-dependency also limits the options to exploit the source (see Section 7). A very specific Web-API providing only a very specific service may limit the strategy to the simple call of the service.
While license permissions are not of utmost concern to the research community, they are very important in the enterprise world when it comes to the actual application of matching systems in the real world for commercial purposes.
The level of authoring or trust of a knowledge source affects the
exploitation strategy as well. Generally, four main categories can be observed:
(1)
Categorization of linking approaches
In order to exploit an external knowledge source, the concepts in one or both of the
ontologies to be matched need to be linked to the knowledge source. The linking
process is also known as
While many publications address the concrete application of a background source for ontology matching, few discuss the actual linking problem. However, since linking is the first step in exploiting a knowledge source, it significantly determines the quality of the outcome. In a visionary paper by Sabou et al. [278], online ontologies obtained with a Semantic Web search engine were used for ontology matching. Out of the 1,000 correspondences checked manually, 217 false ones were identified. The authors find that out of those, 53% are due to anchoring errors. This emphasizes the need for a solid anchoring strategy.
The linking process is typically dependent on the knowledge source used and can be as simple as forwarding a label (e.g. when using the Google search API) or as complicated as the ontology matching problem itself (e.g. when another knowledge graph shall be used).
For linking, we distinguish two goals: (i) finding at most one link for each concept
in an ontology and (ii) finding up to many links for each concept in an ontology.
Multiple links can be sensible in the case of partial linking; for example, a
concept with label “derivatives exchange” may be linked to “derivatives” and
“exchange” in cases where there is no match for the complete concept. Other reasons
for multi-linking are datasets with homonyms
In terms of classifying linking approaches, we propose a classification system consisting of four categories: (i) given links, (ii) direct label linking, (iii) fuzzy linking, (iv) Word Sense Disambiguation (WSD). The proposed classification system is summarized in Fig. 10. In the following, we will introduce each category in detail and provide examples. It is important to note that not every linking strategy can be applied on each dataset; WSD, for instance, can only be applied if there are multiple senses available in the background dataset.
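Three of the four categories, together with the partial linking of multi-token labels mentioned above, can be sketched as a simple fallback chain. This is an illustrative sketch, not a method taken from any of the surveyed systems: the vocabulary, the `given_links` mapping, the fallback order, and the similarity cutoff are assumptions. WSD is left out because it requires sense inventories and context, which a flat vocabulary does not provide.

```python
import difflib


def link_concept(label, vocabulary, given_links=None, fuzzy_cutoff=0.85):
    """Link an ontology label to entries of a background vocabulary.

    Returns a list of (entry, strategy) pairs. The list has at most one
    element unless the label can only be linked partially, token by token.
    """
    given_links = given_links or {}
    normalized = label.strip().lower()

    # (i) Given links: an explicit mapping provided with the dataset.
    if label in given_links:
        return [(given_links[label], "given")]

    # (ii) Direct label linking: exact lookup of the normalized label.
    if normalized in vocabulary:
        return [(normalized, "direct")]

    # (iii) Fuzzy linking: best string match above a similarity cutoff.
    close = difflib.get_close_matches(normalized, sorted(vocabulary), n=1, cutoff=fuzzy_cutoff)
    if close:
        return [(close[0], "fuzzy")]

    # Partial linking: link each token of the label separately.
    return [(token, "partial") for token in normalized.split() if token in vocabulary]


# Hypothetical background vocabulary.
vocabulary = {"derivatives", "exchange", "bond", "interest rate"}

direct = link_concept("Bond", vocabulary)
fuzzy = link_concept("intrest rate", vocabulary)
partial = link_concept("derivatives exchange", vocabulary)
given = link_concept("FI_001", vocabulary, given_links={"FI_001": "bond"})
```

For the label "derivatives exchange", neither a direct nor a sufficiently close fuzzy match exists in the toy vocabulary, so the sketch falls back to two partial links, one per token.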

Categorization of linking approaches.
Some authors consider WordNet metrics such
as the
In Section 5, the background knowledge resources used in ontology matching have been presented and categorized. The second main dimension of this survey is the exploitation strategy of the background resource. In many cases, there are multiple options to beneficially use an external knowledge source.
We classify exploitation strategies into four groups: (i) factual queries, (ii) structure-based approaches, (iii) statistical/neural approaches, and (iv) logic-based approaches. A factual query is the request for one or more data records contained in the background resource. Structure-based approaches exploit structural elements in the background knowledge source. Statistical or neural approaches apply statistics or deep learning on the background knowledge source or consume an existing pre-trained model. Lastly, logic-based approaches employ reasoning with the externally provided resource. In the following, the categories are further described and extensive examples are provided. An overview of the proposed classification system is provided in Fig. 11.

Overview of the types of background knowledge exploitation strategies.
There is in some cases no clear boundary
between structure-based and statistical approaches since structure-based
approaches typically apply statistics. We classify an approach to be
structure-based if the focus is the exploitation of the structure of the
knowledge source.
Due to their nature, structure-based approaches are not (obviously) applicable to factual databases or pre-trained neural models.
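To make the notion of structure-based exploitation concrete, the following sketch derives a similarity score purely from the hierarchy of a background taxonomy. The toy taxonomy is hypothetical, each concept is assumed to have at most one parent and the hierarchy is assumed to be acyclic, and the inverse path length is used only as a generic example of a structural measure.

```python
def ancestors_with_distance(concept, parent_of):
    """Map the concept and each of its ancestors to the number of edges to reach it."""
    distances = {}
    current, steps = concept, 0
    while current is not None:
        distances[current] = steps
        current = parent_of.get(current)
        steps += 1
    return distances


def path_similarity(a, b, parent_of):
    """Inverse of (1 + shortest path length) between two concepts via a common ancestor."""
    up_a = ancestors_with_distance(a, parent_of)
    up_b = ancestors_with_distance(b, parent_of)
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    shortest = min(up_a[c] + up_b[c] for c in common)
    return 1.0 / (1.0 + shortest)


# Hypothetical background taxonomy given as child -> parent edges.
parent_of = {
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
    "dog": "animal",
    "animal": "entity",
}

sim_car_truck = path_similarity("car", "truck", parent_of)      # path length 2
sim_car_bicycle = path_similarity("car", "bicycle", parent_of)  # path length 3
sim_car_dog = path_similarity("car", "dog", parent_of)          # path length 5
```

The score uses no record lookup beyond the hierarchy itself, which is why such measures presuppose a background source with an explicit structure.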
Neural approaches either apply artificial neural networks directly to the background
knowledge source or re-use existing pre-trained models. For example, the background
knowledge source may be transformed into a vector space [256], or the background knowledge source may already be a vector
space that can be used directly to link the schemas to be matched [140]. We also count neural
APIs into this category;
The term
The steps are namely: (i) ontology
arrangement, (ii) contextualization, (iii) ontology selection, (iv) local
inference, (v) global inference, (vi) composition, and (vii)
aggregation.
However, we did not find broad usage of logic-based exploitation approaches in past
and current (OAEI and non-OAEI) ontology matching systems that go beyond singled out
experiments. Approaches that fall into this category are Sabou et al. who use
SUMO stands for
“suggested upper merge ontology”, DOLCE stands for “descriptive ontology for
linguistic and cognitive engineering”, and OpenCyc is a subset of the Cyc
knowledge base by Cycorp that is not available anymore.

A logic-based exploitation strategy on an external ontology, initially
presented by Sabou et al. [276],
adapted.
In Section 5, we proposed a classification system for background knowledge sources and in Section 7 we presented a classification system for exploitation approaches. In this section, we combine the two into a matrix and position the systems evaluated in this survey within it. We use this matrix as a starting point for discussing white spots in the area of background knowledge-based ontology matching. We further outline interesting observations, shortfalls, and biases found in the ontology matching domain.
Systems in the background knowledge type / exploitation method type matrix
(domain-specific background knowledge)
Systems in the background knowledge type / exploitation method type matrix (general-purpose background knowledge)
Tables 11 (domain knowledge) and 12 (general knowledge) present the systems evaluated in this study in a source/strategy matrix. The exploitation strategy (columns) in the table follows the proposed classification which is summarized in Fig. 11. The rows represent the background knowledge type and follow the proposed classification which is summarized in Fig. 9. Irrelevant combinations of source and strategy are marked in the tables with a hyphen. Empty or rarely filled cells hint at yet underexplored and potentially interesting research directions in the area of background knowledge-based ontology matching.
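The construction of such a source/strategy matrix and the detection of empty cells can be sketched as follows. The row and column labels follow the structured source types and the strategy groups named in this survey, and the two cells marked as irrelevant follow the remark that structure-based approaches are not applicable to factual databases or pre-trained neural models. The system names and their placement are hypothetical placeholders, not entries of Tables 11 and 12.

```python
from collections import defaultdict

SOURCE_TYPES = [
    "lexical and taxonomical",
    "factual database",
    "Semantic Web dataset",
    "pre-trained neural model",
]
STRATEGIES = ["factual query", "structure-based", "statistical/neural", "logic-based"]

# Combinations that are not meaningful and would be marked with a hyphen.
IRRELEVANT = {
    ("factual database", "structure-based"),
    ("pre-trained neural model", "structure-based"),
}


def build_matrix(records):
    """Group systems by (source type, strategy) cell."""
    matrix = defaultdict(set)
    for system, source_type, strategy in records:
        matrix[(source_type, strategy)].add(system)
    return matrix


def white_spots(matrix, irrelevant=IRRELEVANT):
    """Return all relevant cells that no system occupies."""
    return [
        (source_type, strategy)
        for source_type in SOURCE_TYPES
        for strategy in STRATEGIES
        if (source_type, strategy) not in irrelevant
        and not matrix.get((source_type, strategy))
    ]


# Hypothetical classification records: (system, source type, strategy).
records = [
    ("SystemA", "lexical and taxonomical", "factual query"),
    ("SystemB", "lexical and taxonomical", "structure-based"),
    ("SystemC", "Semantic Web dataset", "factual query"),
    ("SystemD", "lexical and taxonomical", "factual query"),
]

matrix = build_matrix(records)
spots = white_spots(matrix)
```

Cells returned by `white_spots` correspond to the empty cells discussed above; rarely filled cells could be found analogously by thresholding the size of each set.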
From the tables we see that general purpose background knowledge is used more
often than domain-specific background knowledge. Note that systems
that use WordNet (see Tables 8 and 9) are not explicitly listed for better
clarity in Table 12. The low usage of factual databases
may be due to the fact that the community prefers knowledge presented in
a graph.
Regarding the strategy, it is quickly visible that factual queries are used most often. When it comes to underexplored research directions of background knowledge usage, logic-based and neural strategies are interesting and promising. Pre-trained embedding models and architectures, for instance, were rarely used up to 2020 but may be very promising given breakthroughs in other scientific communities. An increase in publications in this category in 2021 may indicate that scientific interest is already moving in this direction. Structural approaches are almost completely limited to the English WordNet. The exploration of structural methods on multilingual datasets as well as on Semantic Web datasets may yield interesting results, given the good results on the English WordNet and given that this class of approaches is typically intuitive and can be comprehended by humans (unlike neural models).
If we take a closer look at the domain-specific knowledge sources used, it is striking that almost all datasets are from the biomedical domain. This may be due to a particularly prolific bioinformatics community that values open standards and open data; however, the skew of ontology matching publications towards the biomedical domain must be pointed out. In Fig. 6 (cumulative background knowledge usage), all domain-specific datasets are from the biomedical domain. This domain focus is also visible when looking at OAEI tracks, where almost all domain-specific problems are from this domain. This fact is likely self-reinforcing: new researchers use existing evaluation datasets and existing background knowledge and quickly find themselves in this domain area.
Nonetheless, ontology matching is a problem in all domains that are concerned
with data management which makes it ubiquitous. Enterprise schema matching and
integration challenges in the business world, for example, are not reflected at
all in OAEI tracks. In the years 2016 and 2017, there
was a
An interesting research direction is, therefore, also to broaden the domain focus of the ontology matching problem and to evaluate which background datasets and exploitation strategies are applicable in other domains. To support research efforts in this area, new and publicly available benchmark datasets from more domains are required. New challenges may come to light, such as domain-specific knowledge sources not being broadly available [250]. The provisioning of further evaluation datasets in other domains is a clear desideratum.
A further bias besides a domain-focus is the focus on monolingual ontology matching. At the OAEI, there is currently only one multilingual matching task with few participants. The techniques currently applied are purely lookup-based despite advances in machine translation.
Multilingual ontology matching requires the addition of external resources; hence, we can find many multilingual background sources in Tables 4 to 7. However, when we compare the resource/strategy matrix in Tables 11 and 12, we quickly see that there are many systems that use general-purpose multilingual resources but there is not a single system that uses domain-specific multilingual resources. This may be due to the fact that there are at the moment no benchmark datasets for more advanced multilingual matching tasks available – despite this being a relevant problem in the real world. The current multilingual evaluation datasets are all from the conference domain with a rather low level of domain-complexity.
It could be further observed that, although many diverse multilingual resources
such as Wikidata or EuroVoc (EuroVoc is a multilingual
thesaurus by the Publications Office of the European Union)
Interesting research directions are the exploration of new multilingual matching methods and datasets as well as the exploration of multilingual matching challenges in domain-specific settings. The provisioning of further evaluation datasets is also a desideratum with regard to multilinguality. Given well-performing and publicly available deep-learning models from the NLP domain, their application should also be considered for the ontology matching task.
Another language-based bias is the focus on aligning schemas that are
semantically described in the English language. The research community currently
mainly solves English–English alignment problems. It has to be
mentioned here that this survey only considers publications published in
English (see C1 in Table 2) which may skew
the observations. However, given that English is the lingua franca in
the ontology matching community, we assume that this skew is
small.
An interesting research direction is, therefore, the exploration of non-English rooted ontology matching problems with non-English background knowledge sources. As with the multilingual bias, the community would greatly benefit from the provisioning of more evaluation datasets.
While multiple automatic background knowledge selection approaches have been proposed (see Section 4.2), we did not find significant usage of documented automated selection processes in the publications reviewed for this survey. To date, the majority of matching systems are either bound to one predefined background knowledge source or use a few hand-picked resources. With the exception of LogMapBio, most matching systems which apply an automated selection approach are presented in the context of background knowledge selection. Hence, self-configuring matching systems that select their own background resources based on a particular matching problem are still an interesting area of research. Very recent approaches, such as the usage of pre-trained language models that are fine-tuned on the matching task, do not solve this task (but instead emphasize its importance since the pre-trained model also needs to be selected).
Linking
Our analysis of how concepts are linked into the background knowledge source revealed that most matching systems do not use elaborate linking approaches but rely on a direct string lookup. While this may be sufficient for some background datasets, there are indications that in some cases linking is a significant factor in the performance of background knowledge-based matching systems [277,278].
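The difference between a direct string lookup and a more tolerant linking step can be illustrated with a minimal sketch. The toy background knowledge source, the `bk:` identifiers, the function names, and the similarity threshold below are assumptions made for illustration only; they are not taken from any of the surveyed systems.

```python
from difflib import SequenceMatcher

# Hypothetical toy background knowledge source: label -> concept identifier.
BACKGROUND = {
    "heart": "bk:Heart",
    "myocardial infarction": "bk:MyocardialInfarction",
    "blood vessel": "bk:BloodVessel",
}


def normalize(label):
    """Lowercase the label, replace underscores, and collapse whitespace."""
    return " ".join(label.replace("_", " ").lower().split())


def direct_link(label):
    """Direct linking: exact lookup of the normalized label."""
    return BACKGROUND.get(normalize(label))


def fuzzy_link(label, threshold=0.85):
    """Fuzzy linking: return the concept whose label is most similar to the
    query, provided the similarity reaches the (assumed) threshold."""
    query = normalize(label)
    best_key, best_score = None, 0.0
    for key in BACKGROUND:
        score = SequenceMatcher(None, query, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return BACKGROUND[best_key] if best_score >= threshold else None
```

In this sketch, `direct_link("Blood_Vessel")` finds a concept after normalization, whereas the plural form `"Myocardial Infarctions"` is missed by the direct lookup and only found by `fuzzy_link`. Neither function resolves homonyms, which is where disambiguation becomes relevant.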
One reason why linking is neglected might be that Word Sense
Disambiguation (WSD) is perceived as too hard. Another reason might be
that schemas to be integrated are often derived from the same domain, which
significantly reduces the amount of
However, when large external knowledge bases are to be matched or when the schemas to be matched are large and diverse, such as in the case of knowledge graph matching, WSD may significantly improve the results obtained with external background knowledge. This finding is in line with a recent publication on knowledge graph matching by Hertling and Paulheim [130], who show that state-of-the-art matching systems perform poorly when matching unrelated or weakly related knowledge graphs due to non-disambiguated homonyms.
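How disambiguation can help with homonyms may be sketched with a simple overlap heuristic in the spirit of gloss-overlap methods such as the Lesk algorithm. The sense inventory, the identifiers, and the function name are hypothetical, and the heuristic is a deliberately simplified assumption rather than the method of any surveyed system.

```python
# Hypothetical sense inventory: label -> {candidate concept: descriptive terms}.
SENSES = {
    "mouse": {
        "bk:Mouse_Animal": {"rodent", "animal", "mammal", "tail"},
        "bk:Mouse_Device": {"computer", "device", "pointer", "hardware"},
    }
}


def disambiguate(label, context_labels):
    """Pick the candidate sense whose descriptive terms overlap most with the
    tokens of neighboring labels in the schema; None if nothing overlaps."""
    candidates = SENSES.get(label.lower(), {})
    context = {tok for lab in context_labels for tok in lab.lower().split()}
    best, best_overlap = None, 0
    for sense, terms in candidates.items():
        overlap = len(terms & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best
```

Here, the label `"Mouse"` is linked to the device sense when its neighbors are, for instance, `"Input Device"` and `"Computer Hardware"`, and to the animal sense when its neighbors are `"Rodent"` and `"Mammal"`. When the context provides no evidence, the sketch returns `None` instead of guessing.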
An interesting research direction is consequently the development, evaluation, and comparison of multiple linking approaches and their effect on the performance of automated matching systems. We also see a need for the provisioning of additional matching gold standards in the area of knowledge graph matching as well as matching of weakly related schemas.
Conclusion
Since the early 2000s, the understanding of the (automated) ontology matching problem as well as the development of advanced matching systems have greatly improved. Nonetheless, the ontology matching problem is not solved and will remain an interesting research area for years to come. One key to coming closer to a solution is the deeper integration of background knowledge within the ontology matching process.
In this survey, we reviewed all ontology matching systems that participated in the OAEI from 2004 until today, as well as systematically selected ontology matching systems, with respect to which background knowledge sources they use, which linking approach they employ, and how they use the external knowledge. We classify background knowledge into multiple structured and unstructured classes according to their purpose (domain-specific or general-purpose). The main structured knowledge source types are (i) lexical and taxonomical resources, (ii) factual databases, (iii) Semantic Web datasets, and (iv) pre-trained neural models. The main unstructured resource types are (i) textual and (ii) non-textual. In our review, we found that mostly general-purpose structured knowledge is used in ontology matching. Most systems to date make use of simple lexical and taxonomical sources. As yet underexplored sources of background knowledge are unstructured resources, pre-trained neural models, general-purpose knowledge graphs, and linked data.
We further presented a classification system for linking strategies consisting of four categories: (i) given links, (ii) direct linking, (iii) fuzzy linking, and (iv) Word Sense Disambiguation. Although linking is important when it comes to exploiting external knowledge sources, we found that most systems use direct label linking.
Concerning the strategy that is used to exploit knowledge sources, we presented a classification system consisting of four categories: (i) factual queries, (ii) structure-based approaches, (iii) logic-based approaches, and (iv) statistical/neural approaches. We found that a factual look-up strategy is most commonly used. Structure-based strategies are almost exclusively applied to WordNet. Despite a clear vision, logic-based approaches have not gained much traction in recent years. A novel research area in terms of exploitation strategies is neural approaches, which are currently barely used but have shown very good results in other domains.
In our survey, we found multiple biases when it comes to ontology matching with background knowledge: (i) a focus on biomedical matching tasks, (ii) a focus on monolingual matching, and (iii) a focus on matching schemas rooted in the English language. In particular, the business world, where integration problems are plentiful and multi-faceted, is hardly considered in current research efforts. Although the focus of this survey is the usage of external knowledge within the ontology matching process, we consider the identified biases to be generally applicable.
