
Abstract

Gaining insight into a complex problem often requires combining data from multiple datasets. For this reason, SPARQL query support within a federated environment is an important feature. However, several pitfalls have been encountered in practice, significantly complicating the use of SPARQL queries in such setups. These challenges include uninformative error responses, performance bottlenecks and unintended semantic changes introduced by SPARQL endpoints. To address these pitfalls, this paper introduces a newly implemented SPARQL query debugger, which is available as a web application at https://sparql-debugger.elixir-czech.cz. It has been developed for the purpose of monitoring, in real time, the execution of SPARQL queries that incorporate the service pattern. This monitoring is crucial for error detection and performance optimization. Detailed service execution data (such as SPARQL requests and responses, durations, etc.) can help identify the specific instance of a service responsible for a problem, even if it is deeply nested within the service execution tree. The tool is based on the principle of redirecting all requests to a debugging proxy server, so it can be used with all SPARQL-compliant endpoints without the need for their modification. The debugging tool presented in the paper enables the identification and resolution of issues that are otherwise difficult to address and has proven its effectiveness in practice.

Keywords

SPARQL federated query debugger

1. Introduction

The Semantic Web is becoming increasingly pivotal across various fields that require managing complex and diverse datasets interoperably. One notable example is bioinformatics, where researchers focus heavily on the principles of Findable, Accessible, Interoperable and Reusable (FAIR) data (Wilkinson et al., 2016) and seamless data integration, both of which are inherently supported by Semantic Web technologies. In practice, this entails publishing data in RDF, annotating it with ontologies and querying it using SPARQL (Bansal et al., 2022; Galgonek & Vondrasek, 2021; Pinero et al., 2020; Rutz et al., 2022; SIB RDF Group Members, 2023; UniProt Consortium, 2021; Zahn-Zabal et al., 2020). In bioinformatics, interoperability is important because gaining insights into a biological problem often necessitates combining data from multiple research domains. Because biological datasets are typically produced by independent teams highly specialized in different fields of interest, it is important to be able to create queries that span multiple datasets.

There are several approaches to spanning multiple RDF datasets. One option is to load the selected datasets, or relevant portions of them, into a single triplestore, making it possible, for example, to transform and augment the original data. Another approach is to transparently split a query into subqueries and dynamically identify the appropriate endpoints for each subquery, as implemented by tools such as FedX (Schwarte et al., 2011) or Comunica (Query Federation with Comunica, 2024). Alternatively, subqueries can be explicitly directed to specific endpoints using the standard SPARQL Federated Query extension (SPARQL 1.1 Federated Query, 2013). The advantages of this approach are that it does not require the use of any additional software and that it gives query developers a complete overview of how the query is federated. The work presented here focuses on the last of these approaches, and the term federated query refers exclusively to the use of the service pattern within a SPARQL query.

In the SPARQL Federated Query extension, target endpoints are explicitly denoted by service patterns. Note that these patterns can be nested, resulting in a federated query that can be represented as a service pattern tree. From an operational perspective, as services are executed by individual endpoints, the evaluation of such a query effectively constitutes a service execution tree. Note that the two trees may differ in general, even though requests to evaluate service patterns are always initiated by endpoints that evaluate their direct parent patterns.
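To make the notion of a service pattern tree concrete, the following minimal sketch extracts it from a query string by tracking brace depth. It is an illustration only, not the debugger's parser: the `ServicePattern` class and the `service_pattern_tree` function are hypothetical names, and the sketch assumes that service endpoints are written as IRIs in angle brackets and that no braces occur inside string literals, IRIs or comments.

```python
import re
from dataclasses import dataclass, field

# Matches "SERVICE <iri> {" and "SERVICE SILENT <iri> {" (variable endpoints are ignored).
SERVICE_RE = re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<([^>]*)>\s*\{", re.IGNORECASE)


@dataclass
class ServicePattern:
    """One node of the service pattern tree (hypothetical helper type)."""
    endpoint: str
    children: list = field(default_factory=list)


def service_pattern_tree(query: str) -> list:
    """Return the top-level service patterns of `query`, each with its nested patterns."""
    roots = []
    stack = []   # (node, brace depth at which the node's group pattern was opened)
    depth = 0
    pos = 0
    while pos < len(query):
        match = SERVICE_RE.match(query, pos)
        if match:
            node = ServicePattern(match.group(1))
            # Attach to the innermost open service pattern, or to the roots.
            (stack[-1][0].children if stack else roots).append(node)
            depth += 1                      # the regex consumed the opening brace
            stack.append((node, depth))
            pos = match.end()
            continue
        ch = query[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if stack and stack[-1][1] == depth:
                stack.pop()                 # this brace closes a service pattern
            depth -= 1
        pos += 1
    return roots
```

Applied to a query with one service pattern nested inside another and a third one next to them, the function returns two root nodes, the first of which has a single child.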

In practice, several pitfalls have been encountered that significantly complicate the use of federated SPARQL queries. First of all, in the event of an error, error messages from nested services are often not propagated to the top level and are effectively swallowed, resulting in uninformative query error responses. Furthermore, when a query takes an unusually long time to execute, it is not clear what the cause is, leaving users without an understanding of the problem. Last but not least, some SPARQL endpoints may silently modify subqueries they delegate to other service endpoints, so these endpoints might then interpret the modified subqueries differently, potentially leading to unexpected results.

To overcome these pitfalls, the debugging tool presented here has been developed to monitor the entire service execution tree of a federated query. The ability to monitor federated queries is crucial for both error detection and performance optimization. Detailed execution data can help identify the specific service pattern responsible for an error, even if it is nested deep within the service execution tree. Moreover, tracing can reveal service patterns that suffer from high latency or are executed too many times. This is often related to execution strategies employed by SPARQL endpoints. For instance, using the nested loop join (Buil-Aranda et al., 2014) strategy, a specific service pattern may be executed multiple times with different substitutions of its variables based on values computed beforehand, which can result in an enormous number of remote requests. By contrast, resolving a service pattern in its original form can lead to excessively large responses. By pinpointing these bottlenecks, users can optimize their queries for better performance.
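The trade-off between the two strategies can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model of any particular SPARQL engine: `CountingService` is a hypothetical stand-in for a remote endpoint that merely counts requests and transferred rows, and both join functions are deliberately simplified.

```python
class CountingService:
    """Stand-in for a remote endpoint; counts requests and transferred rows."""

    def __init__(self, rows):
        self.rows = rows          # list of (id, label) pairs held by the "endpoint"
        self.requests = 0
        self.rows_sent = 0

    def query(self, bound_id=None):
        """Evaluate the service pattern, optionally with the id variable substituted."""
        self.requests += 1
        result = [r for r in self.rows if bound_id is None or r[0] == bound_id]
        self.rows_sent += len(result)
        return result


def nested_loop_join(left_ids, service):
    """One remote request per value computed beforehand."""
    out = []
    for value in left_ids:
        out.extend(service.query(bound_id=value))
    return out


def unbound_join(left_ids, service):
    """A single remote request for the pattern in its original form, joined locally."""
    wanted = set(left_ids)
    return [r for r in service.query() if r[0] in wanted]
```

With three values on the left-hand side and a thousand rows at the endpoint, the nested loop join issues three small requests, whereas the unbound variant issues one request that transfers all thousand rows. With many left-hand values the balance reverses, which is the kind of bottleneck the trace is meant to expose.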

Although general monitoring platforms, such as Datadog (Datadog, 2024), offer alternatives to our debugging tool, they require configuration and deployment on service endpoints, which may not always be feasible. By contrast, tools such as Virtuoso (Virtuoso Universal Server, 2024) and Jena (Apache Jena, 2024) offer detailed information on executions of directly nested services, but they cannot debug the execution within those services. Therefore, despite the importance of this functionality, the tool presented here is, to the best of our knowledge, the first to provide comprehensive tracing across all levels of a service execution tree, regardless of which SPARQL engine is used at each endpoint.

2. Implementation

The presented tool is provided as a web application¹ intended for SPARQL query developers, designed to encapsulate federated query debugging within an intuitive interface. The application features a custom YASGUI (Rietveld & Hoekstra, 2013) component for query editing, allowing users to work on multiple queries simultaneously using integrated tabs. The debugging process is initiated by pressing the Debug button. As services are executed by endpoints, the service execution tree is rendered continuously, showing the progress in real time. For each service call, trace information, including the state, HTTP status, request and response data, duration and number of solutions, is displayed. If a service pattern is invoked multiple times with, for example, varying variable substitutions, these calls are aggregated into a special bulk execution node. These nodes, highlighted in yellow, are collapsed by default to enhance clarity and include information about the total number of calls and their combined duration. It is also possible to run a query directly without debugging by triggering the Run button. Furthermore, both the debugging process and the query execution without debugging can be terminated using the Cancel button. In the case of debugging, this can save significant endpoint resources because the expansion of the service execution tree (i.e. the calling of remote services) is halted. In addition, the application also provides a selection of federated query examples.

2.1. Concept

It is generally assumed that SPARQL developers cannot feasibly modify service endpoints (e.g. configure them or install additional extensions) and that their interaction with endpoints is limited only to querying via the SPARQL protocol. The debugging tool presented here is therefore implemented with a proxy server at its core. This server intercepts and wraps the execution of each service together with detailed trace information such as the request data, response data, status, etc. When the debugging process is initiated, a query is sent to the proxy server; this execution acts as the root of the entire service execution tree.

An example of query debugging is illustrated in Figure 1. In this example, a query, denoted Q, is evaluated at endpoint-1 and contains one service pattern SQ evaluated at endpoint-2. All evaluations are performed through the proxy server.

Figure 1.

Debugging proxy server.

Because all service endpoints are treated as black boxes, interaction with them is only possible through SPARQL requests and responses. To properly trace services, the original service endpoint URLs in the SPARQL query request are substituted with the proxy server’s URL. Additionally, essential information must be encoded into these URLs to ensure that, when services are intercepted by the proxy server, they can be properly traced and executed. The URL of the original service endpoint has to be encoded, allowing the proxy server to know which actual service endpoint to call. Likewise, the ID of its parent in the service execution tree needs to be encoded to identify where the new service execution node should be added in the tree. Lastly, the encoded query ID retains information about the query scope.
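The encoding described above can be sketched as follows. The proxy base URL and the parameter names `endpoint`, `queryId` and `parentId` are assumptions made for this illustration; the actual URL layout used by the proxy server may differ (for example, it could use path segments instead of query parameters).

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical address of the debugging proxy server.
PROXY_BASE = "https://proxy.example.org/service"


def build_proxy_url(endpoint: str, query_id: int, parent_id: int) -> str:
    """Encode the original endpoint, the query ID and the parent node ID into a proxy URL."""
    params = {"endpoint": endpoint, "queryId": query_id, "parentId": parent_id}
    # urlencode percent-encodes the endpoint URL, so its own "?" and "&" survive intact.
    return f"{PROXY_BASE}?{urlencode(params)}"


def decode_proxy_url(url: str) -> dict:
    """Recover the information the proxy needs when it intercepts a service call."""
    fields = parse_qs(urlparse(url).query)
    return {
        "endpoint": fields["endpoint"][0],
        "queryId": int(fields["queryId"][0]),
        "parentId": int(fields["parentId"][0]),
    }
```

On interception, the decoded `endpoint` tells the proxy where to forward the request, `parentId` determines where the new node is attached in the service execution tree, and `queryId` ties the call to the query being debugged.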

To demonstrate the debugging process on a practical example, a federated SPARQL query (Figure 2) from the BioSODA website (Exploring Biological Data Using SPARQL, 2024), which retrieves genes that are orthologs of a gene expressed in the fruit fly brain, is used.

Figure 2.

Example federated SPARQL query.

The corresponding query service pattern tree and service execution tree are shown in Figures 3 and 4.

Figure 3.

Example service pattern tree.

Figure 4.

Example service execution tree.

The query is executed at the Oma-browser endpoint². It is processed from top to bottom using the nested loop join execution strategy, where for each substitution of the SPARQL variable ?id the second service pattern is executed. These executions are then aggregated into a new bulk execution node in the visualization.

Consider a scenario where the second service pattern subquery contains another nested service pattern. In such a case, it is impossible to substitute all service endpoints in the query with the proxy server endpoint at once when the query is submitted to the proxy server. Each proxy URL must encode the identifier of the parent in the service execution tree. However, for a nested service pattern, this identifier is only determined after a specific instance of bulk service execution has started at the proxy server. As a result, only the service endpoint URLs at the first level of nesting in the query are initially replaced with proxy URLs. The remaining service endpoint URLs are gradually replaced as the nested service endpoints are invoked through the debugging proxy server, elevating them to the first level of nesting.

Each query can contain more than one directly nested service. In the service execution tree, these executions share the same parent, but it is necessary to distinguish which nested service they belong to in order to create the bulk execution node. This is achieved by encoding a sequential number for each nested service, designated as the serviceCall parameter, into the proxy URL during the proxy URL substitution process.
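The first-level substitution and the sequential serviceCall numbering can be sketched together. This is a simplified illustration, not the proxy server's implementation: apart from `serviceCall`, the parameter names and the proxy base URL are assumptions, and the brace tracking ignores braces inside literals, IRIs and comments.

```python
import re
from urllib.parse import urlencode

SERVICE_RE = re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<([^>]*)>\s*\{", re.IGNORECASE)
PROXY_BASE = "https://proxy.example.org/service"   # hypothetical proxy address


def rewrite_first_level(query: str, query_id: int, parent_id: int) -> str:
    """Replace only the service endpoints that are not nested in another service pattern."""
    out = []
    pos = 0
    depth = 0
    open_services = []   # brace depths of the service patterns currently open
    service_call = 0     # sequential number of the first-level service pattern
    while pos < len(query):
        match = SERVICE_RE.match(query, pos)
        if match:
            depth += 1   # the regex consumed the opening brace
            if not open_services:
                service_call += 1
                params = urlencode({
                    "endpoint": match.group(1),
                    "queryId": query_id,
                    "parentId": parent_id,
                    "serviceCall": service_call,
                })
                # Keep everything around the IRI and swap only the IRI itself.
                out.append(query[pos:match.start(1)])
                out.append(f"{PROXY_BASE}?{params}")
                out.append(query[match.end(1):match.end()])
            else:
                out.append(match.group(0))   # nested: left for a later proxy call
            open_services.append(depth)
            pos = match.end()
            continue
        ch = query[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if open_services and open_services[-1] == depth:
                open_services.pop()
            depth -= 1
        out.append(ch)
        pos += 1
    return "".join(out)
```

When the proxy later receives the subquery of a rewritten service, the previously nested patterns appear at the first level of that subquery, so the same function can be applied again with the newly created node as the parent.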

Following the sample BioSODA query, Figure 5 presents an example of a request sent to the root query endpoint by the proxy server, while Figure 6 presents a request sent to the second service. Both requests are generated by the proxy server, with endpoints being enumerated by it.

Figure 5.

Request to a root query SPARQL endpoint generated by the debugging proxy server.

Figure 6.

Request to a SPARQL endpoint generated by the debugging proxy server.

Note that during bulk execution, the service execution tree may differ from the service pattern tree. This discrepancy can also occur when service endpoints apply optimizations, such as grouping triple patterns evaluated by the same endpoint, as, for instance, with the Exclusive Groups in FedX (Schwarte et al., 2011). Another such case is issuing special SPARQL ASK requests to determine data availability (Saleem et al., 2018).

2.2. Performance Issues

Queries can be traced by the debugging proxy server in parallel. Each execution of a service corresponds to an HTTP request handled by the proxy server. Service execution trees can become large and deeply nested. Moreover, some SPARQL engines are capable of executing multiple service calls concurrently within the same query execution. Taken together, this creates significant performance pressure on the debugging proxy server, especially when handling parallel executions at scale.

Note that when a service is executed at an endpoint, it is initiated by a corresponding proxy service execution, which waits for the result. Additionally, all preceding nodes in the service execution tree must also wait for their corresponding service executions to complete. As a result, numerous parallel threads are required, many of which may be blocked while awaiting responses. To address this, Java Virtual Threads are utilized, allowing each proxy server request to be handled by its own virtual thread. The benefit of this approach is that virtual threads are lightweight, enabling the system to efficiently manage thousands of them. If virtual threads get blocked, they do not block system threads, ensuring that the system remains both responsive and scalable.

A feature allowing users to cancel queries, which terminates all virtual threads associated with a specific query execution, is also available. Additionally, this feature instructs the proxy server to reject any new proxy calls related to the query being cancelled, even if they are initiated by the original endpoints during service execution after the cancellation has begun.
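The rejection of late proxy calls can be sketched as a shared registry of cancelled query IDs that is consulted before a call is forwarded. This covers only the rejection part, not the termination of running threads, and all names below (`CancellationRegistry`, `QueryCancelled`, `handle_proxy_call`) are hypothetical.

```python
import threading


class QueryCancelled(Exception):
    """Raised when a proxy call belongs to a query that has been cancelled."""


class CancellationRegistry:
    """Thread-safe set of cancelled query IDs."""

    def __init__(self):
        self._cancelled = set()
        self._lock = threading.Lock()

    def cancel(self, query_id):
        with self._lock:
            self._cancelled.add(query_id)

    def is_cancelled(self, query_id):
        with self._lock:
            return query_id in self._cancelled


def handle_proxy_call(registry, query_id, forward):
    """Forward the call to the real endpoint unless its query has been cancelled."""
    if registry.is_cancelled(query_id):
        # Endpoints may still issue service calls after cancellation has begun;
        # refusing them here stops the service execution tree from growing.
        raise QueryCancelled(f"query {query_id} was cancelled")
    return forward()
```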

2.3. Frontend

To visualize query execution tracing, an npm package that renders the service execution tree is provided. This package is implemented as an independent React component. The component’s API consists of callbacks receiving a SPARQL query and the endpoint where it should be executed. These parameters are then sent to the proxy server, which initiates the query execution and starts notifying the component about updates to the service execution tree.

After query debugging is started, the service execution tree is rendered in real time. Instead of having the browser poll the server for updates, the proxy server actively pushes tree node changes to the browser, which re-renders only the affected parts of the tree dynamically. This real-time update is achieved using the Server-Sent Events (SSE) protocol (Server-Sent Events, 2024). Compared to the WebSocket protocol (The WebSocket Protocol, 2024), SSE offers a simpler and more efficient solution for this use case, as it requires only one-way communication from the proxy server to the browser, following an initial handshake.
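For readers unfamiliar with SSE, the following sketch shows how a tree node update could be serialized in the event stream format, in which each event is a block of `field: value` lines terminated by a blank line. The event name and the JSON payload fields are assumptions made for illustration; the message layout actually used by the proxy server is not described here.

```python
import json


def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Event carrying a JSON payload."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    # json.dumps without indentation yields a single line, so one data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"
```

A server would write such blocks to a response with the `text/event-stream` content type, and the browser would receive them through the standard `EventSource` interface.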

Note that it is impossible to determine when a bulk execution node has fully completed until the parent execution is finished. Therefore, the bulk execution time is only partially calculated during execution. Additionally, it is assumed that executions within the bulk can occur in parallel. As a result, the execution time is calculated as the time interval between the start of the first endpoint call and the completion of the last call within the bulk. Consequently, the displayed execution time continues to increase until it is definitively set when the parent execution node is completed.
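The calculation can be sketched as follows. The representation of calls as `(start, end)` pairs with `None` marking a call that is still running is an assumption of this sketch, as is the choice to ignore running calls until they complete.

```python
def bulk_duration(calls):
    """Interval between the first start and the last completion seen so far.

    `calls` is a list of (start, end) timestamps in seconds; `end` is None
    for a call that has not completed yet.
    """
    starts = [start for start, _ in calls]
    ends = [end for _, end in calls if end is not None]
    if not starts or not ends:
        return 0.0
    return max(ends) - min(starts)
```

For calls `(0.0, 2.0)`, `(1.0, 4.5)` and a running call started at `3.0`, the function returns `4.5`, which is less than the summed individual durations because the first two calls overlap. Once the third call completes at `6.0`, the value grows to `6.0`, mirroring how the displayed time keeps increasing until the parent node finishes.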

Part of the visualized service execution tree, including bulk execution nodes, is shown in Figure 7.

Figure 7.

Visualized service execution tree.

3. Discussion

The presented approach has been designed so that, during debugging, each query is evaluated in the same way as during direct execution. Nevertheless, the service execution trees may differ. For example, if a service endpoint is declared with the same URL as its parent, the service may bypass a nested service call, executing locally and potentially using internal optimizations. This can lead to discrepancies in the execution structure. In some cases, even the results can differ. Certain SPARQL endpoints, such as Wikidata, enforce a whitelist of allowed service URLs. If the proxy server is not included in this whitelist, service calls made during debugging may result in errors, while the same service call could succeed in a direct query execution without the proxy. This discrepancy occurs because the proxy is treated as an unauthorized endpoint during the debugging process.

3.1. Case Study from Practice

The usefulness of the tool in practice is demonstrated on the example of a query that returns no results, even though it is known that some solutions matching the query exist. The query (Figure 8) should return proteins that catalyse reactions involving cholesterol-like compounds. It is evaluated by the UniProt endpoint (UniProt Consortium, 2021) and also includes nested services using the Rhea (Bansal et al., 2022) and IDSM endpoints (Galgonek & Vondrasek, 2021).

Figure 8.

Retrieval of a list of UniProtKB/Swiss-Prot human proteins that catalyse Rhea reactions involving cholesterol-like compounds.

By tracing the query using the SPARQL debugger, it can be observed that the UniProt endpoint alters the original object term "0.9"^^xsd:double for the predicate sachem:cutoff to the term 0.9 in the innermost service call. In accordance with the SPARQL specification, the IDSM endpoint interprets 0.9 as a decimal rather than a double. Because IDSM is strongly typed, this change results in IDSM not returning any results for the given subquery, which in turn means that the entire query returns no results. As a workaround, the user can replace "0.9"^^xsd:double in the query with "9E-1"^^xsd:double, which is interpreted as a double even if the type is stripped; this resolves the issue.
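The reason the workaround helps follows from the SPARQL grammar for numeric shorthand literals: a token with a decimal point but no exponent is an xsd:decimal, whereas any token with an exponent is an xsd:double. The sketch below classifies a shorthand token accordingly; the function name `shorthand_datatype` is hypothetical, and the regular expressions are a transcription of the INTEGER, DECIMAL and DOUBLE productions with an optional sign.

```python
import re

INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
DOUBLE = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)[eE][+-]?[0-9]+")


def shorthand_datatype(token: str) -> str:
    """Return the datatype a SPARQL parser assigns to an unquoted numeric token."""
    if INTEGER.fullmatch(token):
        return "xsd:integer"
    if DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if DOUBLE.fullmatch(token):
        return "xsd:double"
    raise ValueError(f"not a numeric shorthand literal: {token!r}")
```

Here `shorthand_datatype("0.9")` yields `xsd:decimal`, while `shorthand_datatype("9E-1")` yields `xsd:double`, although both tokens denote the same numeric value.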

In our experience, the UniProt endpoint first tries to use the nested loop join strategy and executes the respective service for different substituted values. However, this approach takes a long time and does not lead to the desired outcome. In subsequent attempts, the endpoint typically chooses to call the service directly without any substitution, which allows it to obtain the result in a short time.

These observations, made possible by the SPARQL debugger, demonstrate its practical utility.

4. Conclusion

The presented software provides SPARQL developers with a debugging tool designed to offer detailed insights into the execution of complex federated queries. It has already proven effective in practice and has helped identify and resolve several errors and performance issues that were previously difficult to address. The proxy server exposes a REST API that enables integration with other applications independently of the debugger frontend.

The source code for both the proxy server³ and the frontend⁴ is available on GitHub. Additionally, a deployment of the SPARQL federated query debugger⁵ is ready for use online.

Footnotes

ORCID iDs

Marek Moos

Jakub Galgonek

Funding

The authors received the following financial support for the research, authorship and/or publication of this article: This work was supported by the CHIST-ERA grant TRIPLE, by the Technology Agency of the Czech Republic (TAČR) within the National Recovery Plan, project No. TH86010003. Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Apache Jena (2024). A java framework for semantic web and linked data applications. Accessed November 21, 2024. https://jena.apache.org/index.html

Bansal; Morgat; Axelsen, K. B.; Muthukrishnan; Coudert; Aimo; Hyka-Nouspikel; Gasteiger; Kerhornou; Neto, T. B.; Pozzato; Blatter, M. C.; Ignatchenko; Redaschi; Bridge (2022). Rhea, the reaction knowledgebase in 2022. Nucleic Acids Research, 50(D1), D693–D700. https://doi.org/10.1093/nar/gkab1016. https://www.ncbi.nlm.nih.gov/pubmed/34755880

Buil-Aranda; Polleres; Umbrich (2014). Strategies for executing federated queries in SPARQL1.1 (pp. 390–405). ISBN 978-3-319-11914-4. https://doi.org/10.1007/978-3-319-11915-1_25

Datadog (2024). Cloud monitoring as a service. Accessed November 19, 2024. https://www.datadoghq.com/

Exploring Biological Data Using SPARQL (2024). Accessed November 21, 2024. https://biosoda.expasy.org/build_biosodafrontend/

Galgonek; Vondrasek (2021). IDSM ChemWebRDF: SPARQLing small-molecule datasets. Journal of Cheminformatics, 13(1), 38. https://doi.org/10.1186/s13321-021-00515-1. https://www.ncbi.nlm.nih.gov/pubmed/33980298

Pinero; Ramirez-Anguita, J. M.; Sauch-Pitarch; Ronzano; Centeno; Sanz; Furlong, L. I. (2020). The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research, 48(D1), D845–D855. https://doi.org/10.1093/nar/gkz1021. https://www.ncbi.nlm.nih.gov/pubmed/31680165

Query Federation with Comunica (2024). Accessed November 19, 2024. https://comunica.dev/docs/query/advanced/federation/

Rietveld; Hoekstra (2013). YASGUI: Not just another SPARQL client? CEUR Workshop Proceedings, 1056, 1–9. https://doi.org/10.1007/978-3-642-41242-4_7. ISBN 978-3-642-38708-1

Rutz; Sorokina; Galgonek; Mietchen; Willighagen; Gaudry; Graham, J. G.; Stephan; Page; Vondrasek; Steinbeck; Pauli, G. F.; Wolfender, J. L.; Bisson; Allard, P. M. (2022). The LOTUS initiative for open knowledge management in natural products research. Elife, 11, e70780. https://doi.org/10.7554/eLife.70780. https://www.ncbi.nlm.nih.gov/pubmed/35616633

Saleem; Potocki; Soru; Hartig; Ngonga Ngomo, A.-C. (2018). CostFed: Cost-based query optimization for SPARQL endpoint federation. Procedia Computer Science, 137, 163–174. https://doi.org/10.1016/j.procs.2018.09.016

Schwarte; Haase; Hose; Schenkel; Schmidt (2011). FedX: Optimization techniques for federated query processing on linked data (pp. 601–616). ISBN 978-3-642-25072-9. https://doi.org/10.1007/978-3-642-25073-6_38

Server-Sent Events (2024). Accessed November 21, 2024. https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-event

SIB RDF Group Members (2023). The SIB Swiss Institute of Bioinformatics Semantic Web of data. Nucleic Acids Research, 52(D1), D44–D51.

SPARQL 1.1 Federated Query (2013). Accessed November 19, 2024. https://www.w3.org/TR/sparql11-federated-query/

UniProt Consortium (2021). UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Research, 49(D1), D480–D489. https://doi.org/10.1093/nar/gkaa1100. https://www.ncbi.nlm.nih.gov/pubmed/33237286

Virtuoso Universal Server (2024). Accessed November 21, 2024. https://virtuoso.openlinksw.com/

The WebSocket Protocol (2024). Accessed November 21, 2024. https://datatracker.ietf.org/doc/html/rfc6455

Wilkinson, M. D.; Dumontier; Aalbersberg, I. J.; Appleton; Axton; Baak; Blomberg; Boiten, J. W.; da Silva Santos, L. B.; Bourne, P. E.; Bouwman; Brookes, A. J.; Clark; Crosas; Dillo; Dumon; Edmunds; Evelo, C. T.; Finkers; Gonzalez-Beltran; Gray, A. J.; Groth; Goble; Grethe, J. S.; Heringa; t Hoen, P. A.; Hooft; Kuhn; Kok; Lusher, S. J.; Martone, M. E.; Mons; Packer, A. L.; Persson; Rocca-Serra; Roos; van Schaik; Sansone, S. A.; Schultes; Sengstag; Slater; Strawn; Swertz, M. A.; Thompson; van der Lei; van Mulligen; Velterop; Waagmeester; Wittenburg; Wolstencroft; Zhao; Mons (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18. https://www.ncbi.nlm.nih.gov/pubmed/26978244

Zahn-Zabal; Michel, P. A.; Gateau; Nikitin; Schaeffer; Audot; Gaudet; Duek, P. D.; Teixeira; Rech de Laval; Samarasinghe; Bairoch; Lane (2020). The neXtProt knowledgebase in 2020: Data, tools and usability improvements. Nucleic Acids Research, 48(D1), D328–D334. https://doi.org/10.1093/nar/gkz995. https://www.ncbi.nlm.nih.gov/pubmed/31724716