Traditional approaches for querying the Web of Data often involve centralised
warehouses that replicate remote data. Conversely, Linked Data principles allow
for answering queries live over the Web by dereferencing URIs to traverse remote
data sources at runtime. A number of authors have looked at answering SPARQL
queries in such a manner; these link-traversal based query
execution (LTBQE) approaches for Linked Data offer up-to-date
results and decentralised (i.e., client-side) execution, but must operate over
incomplete dereferenceable knowledge available in remote documents, thus
affecting response times and “recall” for query answers. In this paper, we study
the recall and effectiveness of LTBQE in practice on the Web of Data.
Furthermore, to integrate data from diverse sources, we propose lightweight
reasoning extensions to help find additional answers. Starting from
state-of-the-art approaches, which (1) consider only dereferenceable
information and (2) follow rdfs:seeAlso links, we propose extensions that
consider (3) owl:sameAs links and reasoning, and (4) lightweight
RDFS reasoning. We then estimate the recall of link-traversal query techniques
in practice: we analyse a large crawl of the Web of Data (the BTC’11 dataset),
looking at the ratio of raw data contained in dereferenceable documents vs. the
corpus as a whole and determining how much more raw data our extensions make
available for query answering. We then stress-test LTBQE (and our extensions) in
real-world settings using the FedBench and DBpedia SPARQL Benchmark frameworks,
and propose a novel benchmark called QWalk based on random
walks through diverse data. We show that link-traversal query approaches often
work well in uncontrolled environments for simple queries, but need to retrieve
an infeasible number of sources for more complex queries. We also show that our
reasoning extensions increase recall at the cost of slower execution, though
they often increase the rate at which results are returned; for complex queries,
however, reasoning aggravates performance issues.
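
As a rough illustration of the idea, the following minimal sketch answers a single triple pattern by dereferencing URIs, with and without following owl:sameAs links. It is not the system evaluated in this paper: the `WEB` dictionary, its `ex:` and `other:` URIs, and the `answers` function are hypothetical stand-ins for real HTTP lookups, and rdfs:seeAlso links and RDFS reasoning are left out.

```python
from collections import deque

SAME_AS = "owl:sameAs"

# Toy stand-in for the Web (an assumption for illustration): each
# dereferenceable URI maps to the triples its document would return.
WEB = {
    "ex:alice": [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", SAME_AS, "other:alice"),
    ],
    "other:alice": [
        ("other:alice", "foaf:knows", "ex:carol"),
    ],
}


def answers(subject, predicate, follow_same_as=False):
    """Return the objects ?x matching (subject, predicate, ?x).

    Only documents obtained by dereferencing the subject URI (and, if
    enabled, its owl:sameAs aliases) are consulted.
    """
    aliases = {subject}        # URIs treated as the same resource
    queue = deque([subject])   # URIs still to dereference
    results = set()
    while queue:
        uri = queue.popleft()
        # A URI missing from WEB models a non-dereferenceable URI.
        for s, p, o in WEB.get(uri, []):
            if s in aliases and p == predicate:
                results.add(o)
            elif follow_same_as and p == SAME_AS and (s in aliases or o in aliases):
                # owl:sameAs is symmetric: adopt whichever side is new.
                for alias in (s, o):
                    if alias not in aliases:
                        aliases.add(alias)
                        queue.append(alias)
    return results


baseline = answers("ex:alice", "foaf:knows")
extended = answers("ex:alice", "foaf:knows", follow_same_as=True)
# Recall of the baseline relative to the answers found with owl:sameAs.
recall = len(baseline) / len(extended)
```

In this toy setting the baseline finds only the answer in the document for `ex:alice`, while following owl:sameAs costs one extra lookup and surfaces an answer stated under the alias URI, which mirrors the recall versus execution cost trade-off described above.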