Abstract
Introduction
The quest for transparency in the decision-making processes of artificial intelligence (AI) systems, particularly within the healthcare sector, underscores the urgency and importance of explainable artificial intelligence (XAI), 1 which is reflected in the increase in publications in the field, as seen in Figure 1. As AI continues to make inroads into critical healthcare applications, the imperative for algorithms that not only perform with high accuracy but also offer insight into their decision-making process becomes increasingly paramount. This is a matter not just of technical necessity but of ethical obligation, ensuring that healthcare professionals can trust and understand the AI-driven tools they rely on for diagnosing, treating, and managing diseases. Applications of machine learning in medicine include, but are not limited to, diabetes prediction, 2 cardiovascular disease detection, 3 asthma management, 4 and skin cancer detection. 5

PubMed annual publication data for mHealth articles relating to machine learning and artificial intelligence.
XAI is at the forefront of reconciling the sophisticated functionalities of machine learning models with the imperative for outcomes that can be understood and interpreted by humans. This reconciliation is not just a technical challenge but a fundamental necessity in healthcare and medicine, where the stakes involve human lives, and the implications of decisions extend to the very fabric of patient care. In an arena characterized by complexity and the need for precision, the opacity of traditional machine learning models can hinder trust and collaboration between healthcare professionals and the technology designed to aid them.
XAI endeavors to peel back the layers of complexity that shroud advanced algorithms, making the rationale behind AI-driven decisions accessible and comprehensible to clinicians, patients, and other stakeholders. This transparency is vital for validating the reliability of AI recommendations, facilitating informed clinical judgments, and fostering a deeper integration of AI tools into healthcare practices. By illuminating how models arrive at their conclusions, XAI not only enhances the trustworthiness of AI applications but also enables healthcare professionals to apply their own judgment and expertise in interpreting AI insights, ensuring that the final decision reflects a synthesis of the best of both worlds.
Furthermore, by providing a framework through which AI processes can be examined and questioned, XAI supports the continuous improvement and refinement of AI models. This iterative process, informed by feedback from clinical practice, ensures that AI tools evolve in alignment with the needs and values of the healthcare community, ultimately leading to more personalized, effective, and ethically sound healthcare solutions. In doing so, XAI plays a pivotal role in realizing the potential of AI to transform healthcare, making it more responsive, efficient, and tailored to the unique needs of each patient.
Addressing this challenge, our paper proposes a novel methodology for enhancing the transparency and understandability of AI decisions in healthcare. By employing a knowledge transference technique between a complex, non-transparent black-box algorithm, Dl4jMlpClassifier, 6 and a more interpretable white-box J48 7 model, we aim to elucidate the rationale behind AI-generated health insights. This approach not only enhances the credibility and reliability of AI applications in healthcare but also empowers healthcare professionals by providing them with actionable, understandable intelligence. In doing so, we not only advance the field of medicine and healthcare but also contribute to the broader discourse on the ethical application of AI in sensitive domains, ensuring that technology serves humanity in the most beneficial and comprehensible manner possible.
Method
This section presents the information needed to understand and reproduce the process of knowledge transfer between black-box and white-box algorithms, together with the dataset used in the experiment.
Dataset information
To utilize machine learning in a healthcare scenario, we need patient data to train the classification algorithm and to test its performance. In this paper, we used a dataset of COVID-19 8 patients where the aim is to predict the onset of long COVID-19. 9 The data was gathered and provided by Zdravstveni dom dr. Adolfa Drolca Maribor, a Slovenia-based non-profit public healthcare institute performing healthcare services on the secondary and tertiary levels. The study was conducted from October 2020 to July 2022 and focused on long COVID-19 patients to assess the effect of vaccination on long COVID-19 symptoms 10 and the effect of diabetes and being overweight on post-COVID-19 syndrome. 11 Because of the focus of the study, the majority of the patients in the dataset were long COVID-19 patients, and less than 20% were COVID-19 patients. The dataset contains data from multiple medical consultations for each patient. In this article, the focus was on the data gathered during hospitalization, so as to assess the possibility of the onset of long COVID-19 after hospital discharge. The dataset is highly unbalanced, as is usual for medical data. It includes age, gender, body mass index (BMI), co-morbidities, complications, symptoms, and the ratio of forced expiratory volume in 1 s (FEV1) to forced vital capacity of the lungs (FVC). 12 The dataset contains 553 instances, of which 312 are male patients and 241 are female patients. During the course of our study, we decided to exclude instances with missing data from the analysis. This choice was primarily driven by the nature of the dataset and the attributes selected, as specified by the medical expert who collected the data. A significant number of instances within the dataset had multiple missing values, with some instances even lacking data across all selected attributes.
Given the critical importance of complete and accurate information in medical data analysis and to maintain the integrity and reliability of our findings, we deemed it necessary to exclude such incomplete instances. This approach ensured that our analysis was based on the most reliable and pertinent data available. Table 1 presents the characteristics of the 128 remaining instances.
Patient characteristics table.
Experiment environment
The experiment was performed with Weka version 3.8.6. 13 Weka is a program containing a collection of more than 40 machine-learning algorithms that can be used for evaluation and testing without any background in computer science, which makes it accessible to professionals outside the field. Two algorithms were used to demonstrate the transference of knowledge: Dl4jMlpClassifier, the black-box model, and J48, the white-box model. Dl4jMlpClassifier was chosen because it is inherently a black-box model and because it is the main deep learning algorithm provided by Weka. J48, on the other hand, was chosen because it is a decision tree algorithm, one of the most widely used white-box algorithms in healthcare. Both were run with their base parameters, so no parameter adjustments were performed, and the base accuracy was measured using 10-fold cross-validation. 14
Experiment process
The idea of knowledge transference between models depends on the mislabeled data of the black-box model, since the resulting classes of the black-box algorithm are used to train the white-box model. To prove the change in decision behavior, we need to check the base accuracy of the black-box model and the white-box model to establish a baseline. The base rationale and performance of the J48 classifier are expected to change to align with the black-box classifier, and for validation purposes, we need the initial performance data; this initial data, however, is not needed for the utilization of the approach as such. The next step is to introduce the decisions of the black-box model to all instances of the initial data, temporarily overwriting the actual class values. This data is then used to train the new white-box model, which should now have adopted the decision-making of the black-box model. We now have three models, and to determine if the prediction of the new model improved from the infusion of the black-box, we check the performance metrics between the original white-box algorithm and the knowledge-infused white-box. We check the same metrics between the black-box algorithm and the knowledge-infused white-box algorithm. The last stage is to provide visualizations of both decision tree algorithms and scan for any significant differences. The whole process is captured in Figure 2.

The process of knowledge transference.
Extended testing
Considering the limitations of the small sample size in the initial dataset and the application of only one classification algorithm, we expanded our analysis to include a larger, publicly available dataset containing 1,032,572 instances and 21 attributes, which, like our main dataset, focuses on COVID-19. 15 Boolean features use 1 for “yes” and 2 for “no,” with 97 and 99 indicating missing data. Key features include sex (1 for female, 2 for male), age, COVID-19 test classification (1–3 for positive, 4+ for negative or inconclusive), patient type (1 for home, 2 for hospitalization), and various health conditions like pneumonia, diabetes, COPD, asthma, immunosuppression, hypertension, cardiovascular disease, chronic renal disease, other diseases, obesity, and tobacco use. It also details whether patients were intubated, admitted to the ICU, or died, with specific codes for missing or non-applicable data. The data includes 140,038 instances with pneumonia and 892,534 without. Gender is evenly distributed, and all age groups are represented, although the majority of patients are above 20 years of age. The focus of the extended testing was on pneumonia prediction. To thoroughly test the presented methodology, we applied seven different classification algorithms using the Weka software, following the previously described methodology.
Results
In this section, we will discuss the results of applying our approach to the long COVID-19 dataset. Initially, we will present the performance of the selected algorithms using 10-fold cross-validation. Subsequently, we will present the results obtained after knowledge transfer and then focus on the presentation layer of the white-box algorithm. The last part tests the presented approach on a large dataset and with multiple classification algorithms.
Base performance
In Table 2, the performance at first glance is in favor of the J48 white-box algorithm, but a closer look at the confusion matrix reveals that the model simply prioritizes the long COVID-19 patients and performs poorly at predicting COVID-19 patients. Because we want to be able to distinguish between the two types of COVID-19, the Dl4jMlpClassifier algorithm gives a better result. The classification accuracy of the white-box algorithm is a consequence of its inability to distinguish a single COVID-19 patient from the long COVID-19 patients. The deep learning algorithm had an inferior classification accuracy compared to the white-box algorithm but was able to classify almost half of the COVID-19 patients correctly. However, without a visual explanation layer, we are not able to confidently say that the use of the deep learning model would be warranted. What we can confidently say is that the use of a decision tree classification model without any capability to distinguish between long COVID-19 and COVID-19 would be senseless.
Results of the initial test phase presenting the accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-Measure, receiver operating characteristic curve area (ROC), precision-recall curve area (PRC), and the confusion matrix.
In our deep learning classification analysis, the model exhibited strong performance across various metrics. The true positive rate (TPR) was 0.836, indicating effective identification of actual positive cases, while the false positive rate (FPR) was controlled at 0.224. The model's Precision, a measure of accurate positive predictions, was high at 0.889. Its ability to balance correct positive predictions with the total number of actual positives was reflected in an F-Measure of 0.854. Additionally, the model showed excellent capability in distinguishing between classes, as indicated by the receiver operating characteristic (ROC) Area of 0.875 and the precision-recall curve (PRC) Area of 0.905.
Our decision tree classifier demonstrated varied performance across several key metrics. The TPR, at 0.852, indicates the classifier was effective in identifying true positives. However, the FPR was relatively high at 0.820, suggesting a significant proportion of negatives were incorrectly classified as positives. The classifier's precision was 0.788, indicating that around 78.8% of its positive predictions were correct. The F-Measure, balancing precision and recall, stood at 0.810. Notably, the ROC area was 0.496, showing a limited ability to distinguish between classes. The PRC area was 0.772, reflecting a reasonable balance between precision and recall in the context of the given data distribution.
Performance after knowledge transfer
To be able to compare the classification results of all three models, we needed to test them on the complete original dataset in order to have comparable output. In Table 3, we can see that the base J48 model still seems to be the best-performing of the three when focusing on the detection of long COVID-19 patients, and that the Dl4jMlpClassifier model and the knowledge-infused J48 model performed similarly. To validate that there was an improvement in the decision-making of the white-box algorithm, we examined the area under the curve (AUC) 16 values of all three models. The base model has an AUC value of 0.500, which clearly indicates that the model is not suitable for use. The AUC score of the deep learning algorithm was 0.913, justifying its application in the decision-making process, and the AUC of the knowledge-infused white-box algorithm drastically increased to 0.913, indicating that we successfully transferred knowledge from one model to the other.
Results of the comparison of the performance of all three models presenting the accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-Measure, receiver operating characteristic curve area (ROC), precision-recall curve area (PRC), and the confusion matrix.
The decision tree algorithm exhibited a distinctive performance profile in our analysis. The TPR was 0.867, indicating that it correctly identified about 86.7% of the positive cases. Interestingly, the FPR was also 0.867, suggesting a high rate of false positives, with the same proportion of negative cases being incorrectly identified as positive. Precision and F-Measure values could not be calculated, indicating limitations in assessing the algorithm's exact accuracy and the balance between precision and recall. The ROC area, a measure of the model's ability to distinguish between classes, was 0.500, indicating no better discriminative power than random chance. Lastly, the PRC area was 0.770, showing a moderate ability of the model to balance precision and recall for different thresholds despite the challenges in other areas.
The deep learning algorithm showcased strong performance in classification, as evidenced by its metrics. TPR was 0.875, indicating a high effectiveness in correctly identifying positive cases. The FPR stood at a relatively low 0.218, suggesting good control over incorrectly classified negative cases. Precision was notably high at 0.903, indicating that more than 90% of positive predictions were accurate. The F-Measure, which balances precision and recall, was notable at 0.885. Further, the ROC area was 0.914, demonstrating a strong ability to distinguish between classes. The PRC area, at 0.925, underscores the model's excellent performance in maintaining high precision and recall across various thresholds.
Upon comparing the performance results of the infused decision tree model and the deep learning model, it can be asserted that both models have exhibited the same level of performance. Therefore, the performance metrics that were obtained from evaluating the infused decision tree model were identical to the metrics presented in the previous paragraph for the deep learning algorithm.
Visualization of classifiers
To finally conclude which white-box model is more viable, we visualized both models. Since the base model had only one leaf as a decision, we could only visualize the knowledge-infused decision tree, seen in Figure 3. While the base J48 decides that all patients belong to the long COVID-19 class, the knowledge-infused decision tree provides a clear explanation behind its reasoning. The first four decision factors are shortness of breath, gender, cough, and FEV1/FVC. Each of the first three decisions leads either to the next decision node or to the decision long COVID-19. If shortness of breath was present, the patient is classified as a long COVID-19 patient. Otherwise, the decision-making goes to the gender node. At the gender node, a female patient is classified as a long COVID-19 patient, while for a male patient the decision-making process goes to the next decision node. At the cough decision node, a patient with cough is classified as a long COVID-19 patient; otherwise, the process continues to the FEV1/FVC decision node. At the FEV1/FVC decision node, if the value is above 70, the patient is classified as a COVID-19 patient; otherwise, the process continues to the last decision node. The last deciding factor is the BMI. 17 The BMI node splits into three branches, where a BMI below 29.040 leads to a classification of COVID-19 and a value above 29.300 leads to a long COVID-19 classification.

The visual output of the knowledge-infused white-box algorithm.
Extended testing of the base approach
In order to determine the practicality of the approach under consideration, our study was broadened to encompass a more extensive dataset, incorporating a variety of algorithms provided by the Weka platform. As outlined in Table 4, the performance metrics for five of the seven evaluated algorithms exhibit a remarkable resemblance to those of their counterparts enhanced through knowledge infusion. The remaining two algorithms, highlighted in orange in the table, despite not achieving perfect alignment, demonstrated a discernible convergence in their performance metrics. The provided metrics are calculated through Weka's integrated functions, and the data included in the table represents the weighted averages of the metrics. Central to this expanded analysis was the prediction of pneumonia, necessitating not merely an examination of the alignment in outcomes caused by knowledge transfer but also a heightened focus on identifying patients predisposed to developing pneumonia. Through this extended testing framework, we succeeded in enhancing the accuracy of pneumonia prediction by as much as 18%. It is noteworthy that the DecisionTables algorithm was the sole exception, registering a decrement in predictive performance for pneumonia. Nonetheless, this did not detract from the evidence of effective knowledge transfer, as indicated by the closely aligned outcomes between the original and the knowledge-infused models. To look more thoroughly at the alignment of decision-making between the base black-box model and the infused model, we performed two statistical tests for all seven pairs. First, we checked for significant differences in decision-making using McNemar's 18 test, where we only observed significant differences for the two orange-marked algorithms in Table 4, which also had differences in base metrics.
Additionally, we calculated Cohen's kappa 19 to measure agreement between the black-box and infused white-box models; only the RandomTree algorithm and its knowledge-infused counterpart had a kappa value below 0.950. However, the exact kappa value was 0.810, still indicating a high alignment of decision-making.
Results of extended testing presenting the accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-Measure, and the receiver operating characteristic curve area (ROC).
Discussion
Digital health is a global discipline whose goal is to produce sustainable, innovative, and health-focused solutions with the help of digital technologies. One integral part of modern digital health is mHealth, since mobile digital technologies are omnipresent in our society and allow for a fast-paced, holistic interaction between the health system and the general public. Integration of machine learning and AI into mHealth applications has been a general trend of the past decade and is gaining traction rapidly with the introduction of sophisticated black-box classification algorithms, of which the most prominent are deep learning algorithms. The introduction of black-box algorithms into the area of digital health brings with it a new problem: the ability to trust and understand the decision-making process is obscured. The problem of transparency and trustworthiness of black-box decision-making algorithms is being tackled by XAI, whose aim is to include a layer of transparency in the decision-making of black-box algorithms. With our method, we wanted to provide a methodology that presents an explanation layer by transferring knowledge from a black-box algorithm to a white-box algorithm.
Current scientific literature already provides some solutions that can be applied to mitigate the presented dilemma. The two most prominent XAI attempts at providing a solution are SHapley Additive exPlanations (SHAP) 20 and Local Interpretable Model-Agnostic Explanations (LIME). 21 These methods have already been applied in the context of COVID-19 detection. 22,23 They primarily aim to elucidate the influence of individual attributes on model decisions. However, to the best of our knowledge, no approach similar to ours exists, and we are the first to introduce this methodology. Our approach, as detailed in this paper, diverges from this attribute-centric perspective. Instead, it focuses on transferring knowledge from models lacking an intrinsic explanation layer to ones with a structured, comprehensible output. This method is particularly advantageous in medical applications, where the clarity and coherence of decision-making processes are paramount. In the context of the long COVID-19 database, our approach not only facilitated the transfer of decision-making from a black-box model to a white-box model but also revealed nuances in the decision-making process that would otherwise remain obscured to medical professionals. The decision tree visualizations, a key aspect of our methodology, provide an intuitive and holistic understanding of decision pathways, which is often more aligned with the practical needs and interpretative habits of medical practitioners compared to the more fragmented insights offered by SHAP and LIME. Furthermore, the integration of decision trees with SHAP and LIME can offer a more comprehensive XAI framework, leveraging the strengths of each method to provide a multi-faceted understanding of complex models.
Implications and future directions
The main problem with black-box algorithms is their apparent lack of rationale when it comes to the decision-making process. With the help of the presented approach, we provide an additional tool for XAI in mHealth that can bring us one step closer to trustable integration of machine learning algorithms into digital health. With the added trust in the decision-making of black-box algorithms, the use of mHealth applications might evolve to the next stage, where health professionals can provide even more targeted and widespread benefits. The concept of this research could further be applied to extract the noise from the data to improve the default performance of white-box algorithms and provide an analysis of the excluded data. Additionally, the output of white-box algorithms that perform excellently but have complex decision-making reasoning could be simplified at a low accuracy cost.
Limitations
Expanding the scope of testing beyond a framework like Weka is essential to achieve a more detailed and nuanced analysis. A more elaborate and complex test bench would enable us to conduct in-depth testing, providing insights into the model's performance in varied scenarios, especially in cases with significant differences in the AUC between models with varying levels of explainability.
Moving beyond pre-packaged environments like Weka would allow for a more customized and granular testing approach, including the ability to tweak model parameters, experiment with different data preprocessing techniques, and apply advanced validation strategies. This would not only enhance our understanding of the model's strengths and weaknesses in different contexts but also provide a more comprehensive evaluation of its real-world applicability and scalability. The goal is to ensure that our model not only excels in controlled test environments but also maintains its efficacy and reliability in diverse, real-world scenarios.
Another clear limitation is that the approach can only be applied if the black box algorithm is not performing with 100% accuracy because the mistakes of the algorithm enable us to transfer decision-making from one algorithm to another. Since the relabeling of class values based on the output of the black-box algorithm is used to train the new white-box algorithm, a black-box classifier with 100% accuracy would not change any labels of the class, and no change in the decision-making could be expected.
Conclusion
The possibility to combine the strengths of two approaches to create an improved, more versatile, and trustworthy solution, for instance, combining the performance of a black-box algorithm with the explanation layer of a white-box algorithm, presents a strong tool to be included in the toolbox of XAI, especially in high-risk domains where there is a dire need for machine learning because of the vast amounts of data being accumulated. While the presented approach does not solve the problem of XAI, it certainly presents an important avenue for application and research. With automation, this approach could both greatly improve the performance of white-box algorithms and present a validation and support tool in tandem with other XAI solutions. Additionally, exploring the reasons behind the knowledge transfer mechanism could present an additional avenue of application, such as outlier and misclassification detection.
