Sage Journals: Discover world-class research

Abstract

Objective:

Machine learning systems are increasingly used in high-stakes domains such as healthcare, where predictive accuracy must be accompanied by explainability to ensure trust, validation, and regulatory compliance. This study aims to evaluate the effectiveness of widely used local and global explanation methods in real-world clinical settings.

Methods:

We introduce a structured evaluation methodology for the quantitative comparison of explainability techniques. Our analysis covers five local model-agnostic methods—local interpretable model-agnostic explanations (LIME), contextual importance and utility, RuleFit, RuleMatrix, and Anchor—assessed using multiple explainability criteria. For global interpretability, we consider LIME, Anchor, RuleFit, and RuleMatrix. Experiments are conducted on diverse healthcare datasets and tasks to assess performance.

Results:

The results show that RuleFit and RuleMatrix consistently provide robust and interpretable global explanations across tasks. Local methods show varying performance depending on the evaluation dimension and dataset. Our findings highlight important trade-offs between fidelity, stability, and complexity, offering critical insights into method suitability for clinical applications.

Conclusion:

This work provides a practical framework for systematically assessing explanation methods in healthcare. It offers actionable guidance for selecting appropriate and trustworthy techniques, supporting safe and transparent deployment of machine learning models in sensitive, real-world environments.

Keywords

Explainable artificial intelligence (XAI), model interpretability, black-box models, healthcare AI

Introduction

The integration of artificial intelligence (AI) systems across various domains significantly enhances their efficacy in diverse tasks. These AI systems primarily rely on complex machine learning (ML) models, achieving outstanding performance in tasks such as prediction, recommendation, and decision-making support.^1–5 However, many of these models are “black-box models” whose internal processes are opaque, a property prevalent in current AI systems that rely on deep learning models and on ensemble methods such as bagging and boosting.⁶ While these models boast exceptional performance, their lack of transparency poses inherent risks, including potential biases derived from training on unfair data.⁷ Such lack of transparency can lead to decisions that are not fully interpretable and that may contravene ethical principles. The integration of ML models into AI products and applications, as witnessed in the current business landscape, heightens the risks associated with compromising safety and trust, particularly in high-stakes decision-making domains like medicine, finance, and automation.⁸ The enactment of the General Data Protection Regulation (GDPR) by the European Parliament in May 2018 introduced provisions regarding automated decision-making, emphasizing the right to explanation.⁹ These provisions aim to enable individuals to acquire “meaningful explanations of the underlying logic” behind automated decisions. While legal experts differ in their interpretations of these provisions, there is a consensus regarding the pressing need to implement such principles, representing a notable contemporary scientific challenge.

In response to the practical and ethical challenges posed by the opacity of black-box models in AI, the field of eXplainable AI (XAI) has witnessed a surge in the development of explanation methods from both academia and industry.¹⁰ XAI seeks to address these concerns by making the internal logic of AI models more interpretable and accessible, which is critical for fostering trust, transparency, and accountability, especially in high-stakes domains such as healthcare, finance, and law. As AI becomes increasingly integrated into decision-making pipelines, ensuring that stakeholders can understand and verify algorithmic outputs is both a technical and ethical imperative.⁹ Importantly, these efforts must consider the diverse nature of data encountered, including structured data such as electronic health records and unstructured data like medical images¹ and clinical notes, each of which presents unique challenges for explainability.^11,12 Furthermore, there have been significant advances in XAI approaches specifically tailored for healthcare applications, such as interpretability techniques for medical imaging, natural language explanations for clinical notes, and causal inference models for patient outcomes,^13–15 all of which enrich the current XAI landscape and should be considered to provide a holistic overview.

However, the landscape of XAI methods is highly heterogeneous and fragmented, posing significant challenges for both researchers and practitioners.¹⁶ Explanation techniques differ not only in their algorithmic formulation but also in the type and granularity of interpretability they provide, where interpretability refers to the extent to which a human can understand or predict a model’s behavior. For instance, feature attribution methods such as SHAP¹⁷ and integrated gradients¹⁸ quantify the contribution of individual input features to a specific model prediction. These methods offer fine-grained, localized insights suitable for instance-level analysis. By contrast, rule-based approaches like RuleFit¹⁹ and anchors²⁰ derive human-readable if-then rules or logic programs that approximate the decision boundaries of a model, making them more amenable to global or structural interpretation. In an effort to bridge the gap between interpretability and predictive accuracy, recent research has explored knowledge transfer approaches that leverage complex black-box models to inform inherently interpretable white-box algorithms, aiming to combine the strengths of both.²¹

Moreover, explanation techniques vary in scope and assumptions: local methods provide explanations for individual predictions—typically by approximating the model’s decision boundary in a small neighborhood of the input—while global methods aim to summarize the overall decision logic of the model across the entire dataset.²² Some methods are inherently model-agnostic—meaning they can be applied to any black-box model without needing access to internal parameters—while others are model-specific, requiring access to internal gradients or structures (e.g. weights or layers). This diversity complicates the development of a unified evaluation framework: metrics such as fidelity and identity are often applied to local methods, whereas distributional similarity or rule complexity might be more relevant for global or rule-based approaches. Yet, these metrics are not universally applicable, and their interpretation often depends on the nature of the explanation task being evaluated. Reproducibility and stability—critical in high-risk domains—are also inconsistently assessed across methods, particularly those involving random sampling or local approximation.

To address the complexity and fragmentation in evaluating XAI methods, we propose a unified evaluation framework that systematically compares both local and global explanation techniques across multiple interpretability paradigms. Our work builds on previous studies—including our own earlier benchmark focused on local model-agnostic methods in healthcare²³—by significantly expanding the scope to include rule-based and global explanation methods, and by aligning evaluation metrics more precisely with explanation tasks. In contrast to prior work that often focuses narrowly on a single method type or metric, this framework evaluates explanation quality using a suite of clearly defined and task-relevant metrics.²⁴ We conduct experiments on real-world healthcare datasets, a domain where explanation reliability and usability are especially critical. The selected methods represent a balanced spectrum of explanation paradigms—including feature attribution techniques and rule-based models—and we explicitly define the XAI task setting (e.g. local vs. global) for each method. Our goal is twofold: (1) to offer evidence-based guidance for practitioners seeking interpretable solutions in safety-critical applications; and (2) to contribute to the broader discourse on how explanation quality can be rigorously, fairly, and transparently evaluated across heterogeneous XAI techniques.

Related work

The domain of explainable AI methods has arisen more recently within academic circles in comparison to the broader expanse of AI literature. This emergence has been prompted by the widespread integration of AI models across diverse sectors in contemporary society, signifying its growing importance.^1,23,25,26 XAI tools are commonly categorized into local or global methods, each designed to elucidate the decision-making processes of AI models at a different level. The local methods specifically target the comprehension of an AI algorithm’s behavior at a more granular, low-level hierarchical tier.^1,27–29 The global set of XAI approaches, in contrast, is directed towards comprehending the behavior of AI algorithms at a higher hierarchical level.³⁰ These methodologies assist users in gauging how features contribute to predictions across entire datasets or collections of datasets. Another way to categorize explanation techniques is based on whether the technique works only on a specific black-box model (model-specific) or can be applied to any black-box model (model-agnostic). Model-agnostic techniques are applicable to any model, offering valuable insights into the factors influencing their decisions. These tools operate post-hoc, meaning they are applied after the model has been trained. Importantly, they do not require access to the model’s internal details; rather, they only require the ability to query the model for predictions.¹¹

Another branch of literature focuses on the quantitative and qualitative evaluation of explanation methods, assessing the quality and utility of returned explanations and considering aspects like their effectiveness and relevance.^6,16,23,31–33 Quantitative evaluation primarily focuses on metrics such as fidelity³⁴ and stability.³⁵ These metrics assess the ability of explainers to mimic the behavior of the underlying model, and the consistency of their output. Despite the availability of these quantitative metrics, there is no wide agreement about a specific set of metrics to determine the “best” explainer. This highlights the complexity and subjectivity inherent in the evaluation process. Wilson et al.³⁶ propose three proxy metrics aimed at evaluating the quality of explanations within these frameworks. These metrics, namely completeness, correctness, and compactness, play a pivotal role in gauging the effectiveness and reliability of explanations. Completeness pertains to the audience’s ability to verify the validity of the explanation, specifically the extent to which it covers instances. Correctness focuses on the accuracy of the explanation. Compactness concerns the degree of succinctness an explanation can achieve, such as the number of conditions within a decision rule that elucidates a specific instance. Qualitative assessment focuses on evaluating the usability of explanations from the end-user’s perspective, encompassing aspects such as providing meaningful insights, ensuring safety, social acceptance, and fostering trust. Doshi et al.^37–39 propose qualitative evaluation criteria that fall into three groups: functionally grounded metrics, application-grounded evaluation methods, and human-grounded metrics. Functionally grounded metrics leverage formal definitions to evaluate explainability, eliminating the need for human validation. Application-grounded evaluation methods involve human experts to validate specific tasks, typically in domain-specific settings. In contrast, human-grounded metrics evaluate explanations through non-expert human users, aiming to measure the overall understandability of explanations in simplified tasks.

Methods

This section outlines the methodology used to benchmark explainability techniques. It introduces the explanation frameworks considered in this study and describes the metrics employed to evaluate their performance. These methods were selected based on their relevance, popularity, and compatibility with both local and global interpretability goals.

Evaluated explainability frameworks

Local interpretable model-agnostic explanations (LIME)

LIME²⁸ is a local model-agnostic explainer that produces feature importance vectors by fitting an interpretable model around the instance being explained, $x$. LIME assigns weights to randomly generated samples based on their proximity to $x$, emphasizing nearby instances. Leveraging the proximity measure $\pi_{x}$ to capture local context, LIME perturbs $x$ by selecting nonzero elements uniformly. The perturbed instances $z \in R$ are fed into the black-box model $b$ to obtain predictions $b(z)$, and the resulting pairs are used to train the explanation model $g(\cdot)$, often a sparse linear model. The resulting local explanation comprises the coefficients of the fitted linear model. Additionally, LIME incorporates the submodular pick module (SP-LIME) to select representative instances and their explanations, ensuring non-redundancy and global representativeness.
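The local-surrogate idea described above can be illustrated with a minimal sketch. It is not the LIME library's implementation: it assumes purely numeric features, Gaussian perturbations, an exponential proximity kernel, and a lightly ridge-regularized linear surrogate in place of LIME's sparse linear model. The function names and all default parameter values are hypothetical.

```python
import math
import random


def solve_linear_system(A, y):
    """Solve A w = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def lime_like_explanation(b, x, n_samples=500, scale=0.5,
                          kernel_width=1.0, ridge=1e-6, seed=0):
    """Fit a proximity-weighted linear surrogate g around x; return its slopes.

    b is the black-box scoring function (assumed to return a number).
    """
    rng = random.Random(seed)
    p = len(x) + 1  # one slope per feature plus an intercept
    A = [[ridge if i == j else 0.0 for j in range(p)] for i in range(p)]
    rhs = [0.0] * p
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]           # perturb x
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weight = math.exp(-dist2 / kernel_width ** 2)           # proximity pi_x(z)
        row = [1.0] + z
        target = b(z)                                           # query the black box
        # Accumulate the weighted normal equations (X^T W X) w = X^T W y.
        for i in range(p):
            rhs[i] += weight * row[i] * target
            for j in range(p):
                A[i][j] += weight * row[i] * row[j]
    coefficients = solve_linear_system(A, rhs)
    return coefficients[1:]  # drop the intercept


# Usage: for a black box that is itself linear, the surrogate recovers its slopes.
explanation = lime_like_explanation(lambda z: 2.0 * z[0] - 3.0 * z[1], [1.0, 0.5, -0.2])
```

The ridge term only keeps the linear system well conditioned; swapping it for an L1 penalty would bring the sketch closer to the sparse surrogate mentioned in the text.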

Contextual importance and utility (CIU)

CIU is a local, post-hoc, model-agnostic explainer grounded in the concept that the significance of a feature can vary across different contexts.⁴⁰ CIU employs two algorithms for elucidating predictions rendered by black-box methods.^41,42 Originating from established decision-making theory, this method posits that the importance of a feature and the efficacy of its values dynamically evolve contingent upon other feature values.⁴² Contextual importance (CI) serves to approximate the aggregate importance of a feature within the prevailing context, while contextual utility (CU) offers an estimation of the favorability, or lack thereof, of the current feature value concerning a designated output class.
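As a rough illustration of CI and CU, the sketch below follows the definitions commonly given in the CIU literature: CI is the output range reachable by varying one feature divided by the overall output range, and CU is the position of the current output within the reachable range. These formulas are not stated in the text above, so they are an assumption here, as are the function name, the grid-based search, and the default output range of [0, 1].

```python
def ciu(b, x, feature, lo, hi, out_min=0.0, out_max=1.0, n_grid=101):
    """Estimate contextual importance (CI) and contextual utility (CU).

    b        : black-box function returning a score in [out_min, out_max]
    x        : instance being explained (list of feature values)
    feature  : index of the feature to vary
    lo, hi   : value range of that feature (assumed known)
    """
    current = b(x)
    outputs = [current]
    for k in range(n_grid):
        z = list(x)
        z[feature] = lo + (hi - lo) * k / (n_grid - 1)  # vary one feature only
        outputs.append(b(z))
    c_min, c_max = min(outputs), max(outputs)
    ci = (c_max - c_min) / (out_max - out_min)
    # CU is undefined when the feature cannot move the output; 0.5 is an
    # arbitrary neutral convention used only in this sketch.
    cu = (current - c_min) / (c_max - c_min) if c_max > c_min else 0.5
    return ci, cu


# Usage with a toy scoring function that depends only on feature 0.
toy_model = lambda z: 0.2 + 0.6 * z[0]
ci_0, cu_0 = ciu(toy_model, [0.5, 0.3], feature=0, lo=0.0, hi=1.0)
```

In this toy case feature 0 can move the output across 60% of its full range, and the current value sits halfway through that reachable range.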

Anchor

Anchor is a model-agnostic method providing rule-based explanations called anchors.²⁰ A rule anchors a prediction when changes to the other feature values of the instance do not affect the outcome. For an instance $x$ and a candidate rule $r$, $r$ forms an anchor if $r(x)$ equals $b(x)$ (where $b(x)$ is the black-box model’s prediction). The process involves perturbing $x$ to create synthetic instances, then extracting anchors with precision exceeding a user-defined threshold. Anchor utilizes a multi-armed bandit algorithm for synthetic data generation and employs a bottom-up approach with beam search to identify anchors.
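The notion of precision used above can be sketched as follows. This is a simplified stand-in, not the Anchor algorithm itself: it estimates precision by plain Monte Carlo sampling from a background dataset instead of a multi-armed bandit, and it grows the rule greedily (a beam of width one). All function names and defaults are hypothetical.

```python
import random


def anchor_precision(b, x, fixed, data, n_samples=300, rng=None):
    """Share of perturbed instances that keep the prediction b(x) when the
    features listed in `fixed` are held at their values in x."""
    rng = rng or random.Random(0)
    target = b(x)
    hits = 0
    for _ in range(n_samples):
        z = list(rng.choice(data))   # draw a background instance
        for j in fixed:              # overwrite the anchored features
            z[j] = x[j]
        hits += b(z) == target
    return hits / n_samples


def greedy_anchor(b, x, data, threshold=0.95, seed=0):
    """Add one feature at a time until the estimated precision reaches the threshold."""
    rng = random.Random(seed)
    fixed = []
    precision = anchor_precision(b, x, fixed, data, rng=rng)
    while precision < threshold and len(fixed) < len(x):
        candidates = [j for j in range(len(x)) if j not in fixed]
        scored = [(anchor_precision(b, x, fixed + [j], data, rng=rng), j)
                  for j in candidates]
        precision, best = max(scored)
        fixed.append(best)
    return sorted(fixed), precision


# Usage: the toy model predicts 1 only when features 0 and 1 are both 1.
background = [[a, c, d] for a in (0, 1) for c in (0, 1) for d in (0, 1)]
toy_model = lambda z: int(z[0] == 1 and z[1] == 1)
anchor_features, anchor_prec = greedy_anchor(toy_model, [1, 1, 0], background)
```

For this toy model the search fixes features 0 and 1, after which every perturbed instance keeps the original prediction.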

RuleFit

RuleFit is a global explainability framework designed for constructing rule ensembles.¹⁹ The process involves two main phases: rule extraction and weight optimization. In the extraction phase, RuleFit trains a tree ensemble model on a sample $S$ and decomposes each tree into a collection of rules, forming a set $R$. In the weight optimization phase, RuleFit optimizes a weight vector $\alpha \in \mathbb{R}^{M}$ for the rules in $R$ using sample $S$. To maintain sparsity and ensure interpretability, RuleFit employs $L_{1}$-regularization (Lasso penalty). By navigating these phases, RuleFit constructs a rule ensemble model that captures data complexity while providing a globally interpretable representation through rules and associated weights.
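The extraction phase can be sketched on a single toy tree. The nested-dict tree format, the feature names, and the helper names below are hypothetical and chosen only for illustration. The sketch assumes that every non-root node contributes one rule, namely the conjunction of split conditions on the path leading to it. The $L_{1}$-regularized weight fit is not implemented; the weights are simply passed in.

```python
# Toy decision tree as nested dicts (hypothetical format, not a library API).
# Internal nodes carry "feature", "threshold", "left", "right"; leaves are empty dicts.
TOY_TREE = {
    "feature": "age", "threshold": 50,
    "left": {},
    "right": {
        "feature": "bp", "threshold": 140,
        "left": {},
        "right": {},
    },
}


def extract_rules(node, path=()):
    """Return one rule (a tuple of conditions) per non-root node, in pre-order."""
    rules = []
    if path:
        rules.append(path)
    if "feature" in node:  # internal node: descend into both branches
        f, t = node["feature"], node["threshold"]
        rules += extract_rules(node["left"], path + ((f, "<=", t),))
        rules += extract_rules(node["right"], path + ((f, ">", t),))
    return rules


def rule_fires(rule, x):
    """True if instance x (a dict of feature values) satisfies every condition."""
    return all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in rule)


def rule_features(rules, x):
    """Binary rule features r_m(x) used as inputs to the linear model."""
    return [int(rule_fires(rule, x)) for rule in rules]


def rule_ensemble_score(alpha0, alpha, rules, x):
    """Linear combination alpha0 + sum_m alpha_m * r_m(x); alpha is assumed given."""
    return alpha0 + sum(a * r for a, r in zip(alpha, rule_features(rules, x)))


rules = extract_rules(TOY_TREE)
patient = {"age": 63, "bp": 150}
fired = rule_features(rules, patient)
```

With a sparse weight vector, most rules drop out of the score, which is what keeps the resulting ensemble readable.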

RuleMatrix

RuleMatrix is a model-agnostic explanation technique designed to offer both local and global explanations with a focus on visualizing extracted rules.⁴³ The technique comprises four main steps. Initially, the distribution of the provided training data $X$ is modeled. A joint distribution estimation technique is employed to accommodate both discrete and continuous features concurrently. Next, a set of data, referred to as $X_{sample}$, is generated from the estimated joint distribution. The number of samples is a customizable parameter and can exceed the quantity of the original training data, if necessary. Then, the original black-box model being explained, $F$, is utilized to label $X_{sample}$. Lastly, $X_{sample}$ and the corresponding labels $Y_{sample}$ are employed to train a rule list using the scalable Bayesian rule list (SBRL) algorithm proposed by Yang et al.⁴⁴

Table 1 presents a comprehensive summary of the explanation methods considered in our study, including framework name, publication year, citations per year, along with their capabilities in handling various data types distinguishing between tabular (TAB) and any data (ANY). Furthermore, Table 1 classifies these explanation methods into two distinct categories: Global explanations (G) and Local explanations (L).

Table 1.

Key characteristics of explanation methods used in this study.

Tool	Local versus global	Citations/year	Year	Data type
LIME	L&G	2063	2016	ANY
CIU	L	5	2020	TAB
Anchor	L&G	375	2018	ANY
RuleFit	G	83	2008	TAB
RuleMatrix	L&G	42	2018	TAB

LIME: local interpretable model-agnostic explanations; CIU: contextual importance and utility; TAB: tabular; ANY: any data; G: global explanations; L: local explanations.

Evaluation metrics for explainability methods

This section provides a comprehensive overview of established quantitative evaluation metrics utilized in the subsequent benchmarking process.

Identity: Aims to ensure that for two identical instances, their explanations are also identical.²⁴

Separability: Aims to ensure that explanations for two dissimilar instances differ, operating under the assumption that all features used in the model are relevant to prediction and that the model has no degrees of freedom.²⁴

Fidelity: Aims to evaluate how well an explanation model mimics the black-box model $b$. There exist various implementations of fidelity, contingent upon the type of explainer under analysis.³⁴ In this work, for methods entailing the development of a surrogate model $g$ to emulate $b$, fidelity is assessed by comparing the predictions of $b$ and $g$ using the instances employed in training $g$.

Speed: Aims to measure the explainability framework’s efficiency by quantifying the average time required for generating an explanation.²³

Stability: Aims to assess whether similar instances yield comparable explanations.²⁴ In this work, stability is quantified through the Lipschitz constant,¹⁸ represented as $L_{x} = \max_{x' \in N_{x}} \frac{\| e_{x} - e_{x'} \|}{\| x - x' \|}$, where $x$ represents the instance being explained, $e_{x}$ stands for its explanation, and $N_{x}$ is a neighborhood of instances that are similar to $x$. In this work, we report the stability percentage, calculated as $(1 - L_{x}) \times 100\%$.

Monotonicity: Aims to ensure a consistent and orderly relationship between feature attributions and the corresponding expectations, providing a reliable measure of the correctness and reliability of the explanation.⁴⁵ Monotonicity for feature attributions ($a_{i}$) is quantified through Spearman’s correlation coefficient, denoted as $\rho_{S}(a, e)$.⁴⁶ Here, $a = (\dots, |a_{i}|, \dots)$ is a vector consisting of the absolute values of the feature attributions for a function assuming value $y^{*} = f(x^{*})$ at a point $x^{*}$, and $e = (\dots, E(l(y^{*}, f_{i}); X_{i} \mid x^{*}_{-i}), \dots)$ contains the corresponding (estimated) expectations calculated using the following equation:

$|a_{i}| \propto E(l(y^{*}, f_{i}) \mid x_{-i}^{*}) = \int_{x_{i}} l(y^{*}, f_{i}(x_{i})) \, p(x_{i}) \, d x_{i}$ (1)

where $f_{i}$ is the restriction of the function $f$ to feature $i$, while keeping the other features at specific values $x_{-i}^{*} = (x_{1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{N})$. The function $l$ represents the performance measure of interest (in our study, $l$ corresponds to cross-entropy). In the context of rule-based explanation techniques, adapting the concept of monotonicity involves replacing the traditional feature attributions with rule-based feature importance metrics derived from the coverage of rules.

Non-sensitivity: Ensures that the explainability method attributes zero importance only to those features on which the black-box model $f$ lacks functional dependence.⁴⁵ This concept aligns with the sensitivity axiom as proposed by Sundararajan et al.¹⁸ The metric for non-sensitivity involves two sets: $A_{0} \subset \{1, \dots, N\}$, which is a subset containing the indices $i$ representing features with assigned zero attribution ($a_{i} = 0$), and $X_{0} = \{ i \in \{1, \dots, N\} \mid E(l(y^{*}, f_{i})) = 0 \}$, representing the subset of indices $i$ for features where the model $f$ demonstrates no functional dependency. Here, $f_{i}$, $y^{*}$, and $l$ are defined in the same manner as discussed in the context of the monotonicity metric previously. The non-sensitivity is quantified by calculating the cardinality of the symmetric difference, $|A_{0} \triangle X_{0}|$, with $|\cdot|$ representing the cardinality of a set. In the context of rule-based explanation techniques, the set $A_{0}$ represents the features with zero attribution within the rules.

Effective complexity: It measures the degree of feature interaction and its impact on the model’s accuracy.⁴⁵ Let $a^{(i)}$ represent the attributions ordered in ascending order based on their absolute values, and $x^{(i)}$ denote the corresponding features. Let $M_{k} = \{x^{(N-k)}, \dots, x^{(N)}\}$ be the set of the top $k$ features. Given a predefined tolerance level $\epsilon > 0$, the effective complexity, denoted as $k^{*}$, is defined as $k^{*} = \operatorname{argmin}_{k \in \{1, \dots, N\}} |M_{k}| \ \text{s.t.} \ E(l(y^{*}, f_{-M_{k}}) \mid x^{*}_{M_{k}}) < \epsilon$, where $f_{-M_{k}}$ refers to the restriction of the model $f$ to the non-important features, with values fixed for the important features in $M_{k}$. This restricted model, similar to the definition provided earlier, allows us to focus on the non-important features when assessing effective complexity. Here, $y^{*}$ and $l$ are defined in the same manner as discussed in the context of the monotonicity metric previously. In the context of rule-based explainability techniques, adapting the concept of effective complexity involves replacing the traditional feature attributions with rule-based feature importance metrics derived from the coverage of rules. In this work, we used $\epsilon = 0.1$. A low effective complexity indicates the ability to overlook certain features, even if they exert a minimal impact (analogous to non-sensitivity⁴⁷), resulting in decreased cognitive load. While this reduction in cognitive load may lead to a sacrifice in terms of factual accuracy, it offers a heightened level of flexibility. Users can freely alter the values of less important features without causing substantial deviations in predictions.
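Several of the metrics defined above reduce to a few lines of arithmetic. The sketch below covers fidelity (as label agreement between $b$ and $g$), the Lipschitz-based stability estimate, Spearman-based monotonicity, and non-sensitivity. It assumes Euclidean norms, explanations given as numeric vectors, and precomputed expectation vectors $e$. The function names and the zero tolerance are hypothetical, and this is not the benchmark's actual code.

```python
import math


def fidelity(b, g, instances):
    """Share of instances on which surrogate g reproduces black-box b."""
    return sum(b(z) == g(z) for z in instances) / len(instances)


def l2(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def lipschitz_estimate(x, e_x, neighbours):
    """L_x = max over (x', e_x') in the neighbourhood of ||e_x - e_x'|| / ||x - x'||."""
    return max(l2(e_x, e_n) / l2(x, x_n) for x_n, e_n in neighbours)


def stability_percentage(x, e_x, neighbours):
    return (1.0 - lipschitz_estimate(x, e_x, neighbours)) * 100.0


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a, e):
    """Spearman correlation rho_S(a, e): Pearson correlation of the ranks."""
    r_a, r_e = ranks(a), ranks(e)
    m_a, m_e = sum(r_a) / len(r_a), sum(r_e) / len(r_e)
    cov = sum((p - m_a) * (q - m_e) for p, q in zip(r_a, r_e))
    var_a = sum((p - m_a) ** 2 for p in r_a)
    var_e = sum((q - m_e) ** 2 for q in r_e)
    return cov / math.sqrt(var_a * var_e)


def non_sensitivity(attributions, expectations, tol=1e-12):
    """|A_0 symmetric-difference X_0| for zero attributions vs. zero expectations."""
    a0 = {i for i, a in enumerate(attributions) if abs(a) <= tol}
    x0 = {i for i, e in enumerate(expectations) if abs(e) <= tol}
    return len(a0 ^ x0)


# Usage on toy values.
stab = stability_percentage([0.0, 0.0], [1.0, 0.0],
                            [([1.0, 0.0], [1.0, 0.5]), ([0.0, 2.0], [1.0, 0.0])])
mono = spearman([0.1, 0.5, 0.3], [1.0, 3.0, 2.0])
nons = non_sensitivity([0.0, 0.2, 0.0], [0.0, 0.1, 0.3])
```

In the toy call, the largest explanation change per unit of input change is 0.5, the attribution and expectation vectors are ranked identically, and exactly one feature receives zero attribution despite a non-zero expectation.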

The following three metrics (entropy ratio, Kullback-Leibler divergence, and Gini coefficient) are used to assess the explainability of a model based on the distribution of feature importance.⁴⁸ They allow for a comparison between the actual feature importance distribution and a uniform benchmark distribution. Let us consider a set of feature importances denoted as $F = \{F_{1}, F_{2}, \dots, F_{F}\}$. Let $P(\cdot)$ denote a normalization function which normalizes $F$ by dividing each feature importance value by the sum of the absolute values of all feature importances. The normalized value for a specific feature importance, $j$, is expressed as $p_{j} := |F_{j}| / \sum_{i=1}^{F} |F_{i}|$. Consequently, the vector of normalized feature importance, $P(F)$, takes the form of a probability measure $P(F) := (p_{1}, p_{2}, \dots, p_{F})$, where $p_{j} \geq 0$ and $\sum_{i \in F} p_{i} = 1$. The distribution of these normalized feature weights can be characterized using the following three metrics. Define the uniform benchmark distribution as $U = (\bar{p}, \bar{p}, \dots, \bar{p})$, where $\bar{p} = 1/F$. For rule-based explainability techniques, the feature importance is derived from the coverage of rules.

Entropy ratio: It measures the spread of information within the feature importance distribution concerning a uniform benchmark. The entropy ratio is calculated as $S_{ER}(P(F)) = \sum_{j=1}^{F} p_{j} \log p_{j} / \sum_{j=1}^{F} \bar{p} \log \bar{p}$. It compares the entropy of the actual feature importance distribution with the entropy of a uniform benchmark distribution. A lower ratio indicates that the explanation concentrates on a smaller set of features, implying higher model explainability.

Kullback-Leibler divergence: It quantifies the difference between the feature importance distribution and the uniform benchmark distribution. The Kullback-Leibler divergence is calculated as $S_{KL}(P(F)) = D_{KL}(P(F) \,\|\, U) = \sum_{j=1}^{F} p_{j} \log \frac{p_{j}}{\bar{p}}$. A lower divergence suggests a closer similarity between the two distributions, indicating a more explainable model. It computes the information lost when one distribution is used to approximate the other.

Gini coefficient: It evaluates the inequality or spread within the feature importance distribution. The Gini coefficient is calculated as $S_{G}(P(F)) = \frac{\sum_{j=1}^{F} \sum_{j'=1}^{F} |p_{j} - p_{j'}|}{2 F^{2} (\sum_{j} p_{j} / F)}$. A lower coefficient signifies a more evenly distributed importance among features, potentially resulting in a more complex explanation. Conversely, a higher coefficient indicates a more concentrated importance, possibly leading to a more explainable model.

Correctness: It gauges the truthfulness of the explanation concerning how well it aligns with the black-box model’s workings.⁴⁹ Ideally, an excellent explanation strives for utmost truthfulness. It is imperative to clarify that this characteristic pertains to the descriptive accuracy⁵⁰ of the explanation, independent of the predictive accuracy of the black-box model.

Compactness: It measures the size of the explanation, guided by human cognitive constraints.⁵¹ To enhance comprehension, explanations should prioritize sparsity, brevity, and non-redundancy, steering clear of overly extensive presentations. In this study, compactness assessment for rule-based explainability techniques considers the length of generated rules, while for feature attribution techniques, it is quantified by the count of features with non-zero attributions. In this work, the compactness scores are expressed as percentages, reflecting the proportion of concise elements in the overall explanation. Higher compactness percentages signify more efficient and succinct explanations.
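The three distribution-based scores can be computed directly from a vector of feature importances. The sketch below is one way to evaluate the formulas above, assuming natural logarithms and the usual convention that terms with $p_{j} = 0$ contribute zero. The function names are hypothetical.

```python
import math


def normalize(importances):
    """p_j = |F_j| / sum_i |F_i|."""
    total = sum(abs(v) for v in importances)
    return [abs(v) / total for v in importances]


def entropy_ratio(p):
    """sum_j p_j log p_j divided by sum_j pbar log pbar, with pbar = 1/F."""
    n = len(p)
    numerator = sum(pj * math.log(pj) for pj in p if pj > 0)
    denominator = math.log(1.0 / n)  # equals sum_j (1/F) log (1/F)
    return numerator / denominator


def kl_to_uniform(p):
    """D_KL(P || U) = sum_j p_j log(p_j / pbar)."""
    n = len(p)
    return sum(pj * math.log(pj * n) for pj in p if pj > 0)


def gini(p):
    """Mean absolute difference between all pairs, scaled by twice the mean."""
    n = len(p)
    mean = sum(p) / n
    return sum(abs(a - b) for a in p for b in p) / (2 * n * n * mean)


# Usage: a uniform explanation versus one concentrated on a single feature.
uniform = normalize([1.0, 1.0, 1.0, 1.0])
concentrated = normalize([5.0, 0.0, 0.0, 0.0])
scores = {
    "uniform": (entropy_ratio(uniform), kl_to_uniform(uniform), gini(uniform)),
    "concentrated": (entropy_ratio(concentrated), kl_to_uniform(concentrated), gini(concentrated)),
}
```

For a fixed number of features the first two scores are linked by $S_{KL} = \log F \, (1 - S_{ER})$, which follows directly from the definitions above.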

Experimental setup

Overview of experimental workflow: This study follows a standardized and reproducible pipeline for evaluating explainability methods across multiple healthcare datasets. At a high level, we begin by training a predictive model for each dataset using an automated model selection approach. The resulting model is treated as a fixed black-box. We then apply both local and global explanation techniques to interpret the model’s behavior and evaluate these explanations using quantitative metrics.

Datasets: This study employs a diverse collection of 10 publicly available tabular healthcare datasets, predominantly sourced from the UCI Machine Learning Repository⁵² and Kaggle (see Table 2). These datasets were selected due to their relevance to real-world clinical decision-making, as each one addresses a specific medical condition such as heart failure, breast cancer, or diabetes, conditions commonly studied in the context of model interpretability and healthcare risk prediction. In addition, the datasets vary in terms of feature dimensionality, class distribution, and sample size, enabling the evaluation of explainability methods across heterogeneous data scenarios. Moreover, these datasets are widely used as community benchmarks, which facilitates meaningful comparison with prior research in the field.

Tasks and models: We focused on classification tasks and adopted a data-driven approach to ensure unbiased and high-performing model selection. To achieve this, we leveraged Auto-sklearn, an Automated Machine Learning (AutoML) framework.⁶³ Auto-sklearn allows us to select machine learning models based solely on their performance with the given data, providing a level playing field for the evaluation of interpretability techniques. The application of Auto-sklearn serves to mitigate potential biases linked to manual model selection, thereby enabling a thorough and systematic exploration of diverse models. In our experimental protocol, we follow a standardized pipeline for all datasets: (i) partitioning the dataset into 70% for training and 30% for testing, (ii) utilizing Auto-sklearn to identify the optimal pipeline based on the training data, optimizing for accuracy, and (iii) applying each local explanation technique to elucidate the model’s prediction for each instance in the testing dataset. We report the average of the considered evaluation metrics across instances in the testing dataset. For global explainability techniques, we explicate the black-box model based on the training data.

Explanation method selection: To ensure a representative and balanced evaluation, we selected explanation methods based on three key criteria: (i) their popularity and widespread use in the XAI literature, (ii) their model-agnostic applicability, and (iii) their coverage of diverse interpretability paradigms. Specifically, we included techniques spanning perturbation-based approaches (e.g. LIME), contrastive rule-based methods (e.g. anchors), classical rule-learning algorithms (e.g. RuleFit and RuleMatrix), and sensitivity-based methods (e.g. CIU). This selection reflects both local and global explanation strategies, enabling us to compare performance across methodological families in healthcare-specific scenarios. Our choices were guided by prior benchmarking and survey studies,^6,16,23 ensuring that the selected methods are relevant, diverse, and generalizable.

Hardware resources: Our experiments were conducted in a CPU environment running on Ubuntu 22.04 LTS, equipped with an 8-core Intel i7 Processor @ 2.80 GHz and 16 GB of RAM.
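The data-handling part of the protocol described above (a 70/30 split, then averaging each local metric over the test instances) can be sketched as follows. The AutoML model search is deliberately left out, and the function names, the random seed, and the callback signatures are hypothetical rather than the study's code.

```python
import random


def train_test_split_70_30(instances, seed=0):
    """Shuffle once, then keep 70% for training and 30% for testing."""
    rng = random.Random(seed)
    indices = list(range(len(instances)))
    rng.shuffle(indices)
    cut = int(round(0.7 * len(instances)))
    train = [instances[i] for i in indices[:cut]]
    test = [instances[i] for i in indices[cut:]]
    return train, test


def average_local_metric(explain, metric, test_set):
    """Explain every test instance and average one metric over the test set.

    explain(x) returns an explanation for x; metric(x, explanation) returns a number.
    """
    scores = [metric(x, explain(x)) for x in test_set]
    return sum(scores) / len(scores)


# Usage with placeholder data and a placeholder explainer and metric.
train_set, test_set = train_test_split_70_30(list(range(10)))
mean_size = average_local_metric(lambda x: [x], lambda x, e: float(len(e)), test_set)
```

Global explainers would instead be fitted once on the training portion, matching the last sentence of the protocol.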

Table 2.

Summary of datasets used in this study. Dataset references are provided in the bibliography.

Dataset name	Dataset description	Number of features	Number of classes	Number of instances
Thyroid disease⁵³	Dataset is a collection of 10 thyroid disease databases from the Garavan Institute and other sources.	20	3	6240
Heart disease⁵⁴	Dataset contains attributes, such as age, sex, cholesterol levels, and blood pressure from the Cleveland Clinic Foundation.	13	2	1025
Hepatitis⁵⁵	Dataset contains attributes, such as age, sex, steroid, and antivirals of blood donors and hepatitis C patients.	18	2	155
Echocardiogram⁵⁶	Dataset contains data for classifying if patients will survive for at least one year after a heart attack.	9	2	131
Breast Cancer Wisconsin (original)⁵⁷	Dataset contains features representing the characteristics of cell nuclei present in images of breast mass.	8	2	682
SPECT heart⁵⁸	Dataset contains continuous and binary features from cardiac single-photon emission computed tomography images.	22	2	267
Diabetes⁵⁹	Dataset contains 10 years of clinical care records from 130 US hospitals and integrated delivery networks.	41	2	98,053
Diabetic retinopathy Debrecen⁶⁰	Dataset contains features extracted from the Messidor image to predict whether it contains signs of diabetic retinopathy or not.	19	2	1151
HIV-1 protease cleavage⁶¹	Dataset contains lists of octamers and a flag depending on whether HIV-1 protease will cleave in the central position.	8	2	2371
Heart failure clinical records⁶²	Dataset contains medical records of heart failure patients, collected during their follow-up period.	12	2	299

Results

This section presents the quantitative outcomes of our benchmarking study on local and global explanation methods using a range of evaluation metrics. Results are reported objectively without interpretation.

Local interpretability results

Table 3 summarizes the performance of LIME, CIU, RuleFit, RuleMatrix, and Anchor across several datasets on eight local interpretability metrics:

Table 3.

Metric scores for different local explanation techniques on tabular data.

Dataset name	Method	Identity (%)	Separability (%)	Fidelity (%)	Speed (in seconds)	Stability (%)	Monotonicity	Non-sensitivity	Effective complexity
Thyroid	LIME	0.00	100.0	26.74	5.24	91.15	-0.1439	1.00	0.00
	CIU	0.00	100.00	28.24	4.09	57.76	0.58	0.99	0.00
	RuleFit	100.00	20.74	96.28	0.00003	96.79	0.41	18.22	0.00
	RuleMatrix	100.00	47.98	98.01	0.03044	93.52	0.32	18.87	0.00
	Anchor	76.79	43.44	90.97	11.67	68.14	0.29	17.53	0.00
Heart disease	LIME	0.00	100.00	43.71	8.43	74.32	-0.0046	0.00	1.82
	CIU	0.00	100.00	46.17	5.53	82.10	0.1226	0.00	2.19
	RuleFit	100.00	70.28	80.54	0.00009	80.54	0.0212	11.95	2.11
	RuleMatrix	100.00	92.27	91.43	0.00017	79.38	0.0671	11.21	1.75
	Anchor	50.97	96.71	82.43	35.91	70.04	0.1821	9.77	1.77
Hepatitis	LIME	0.00	100.00	29.06	2.87	79.49	-0.2394	1.00	8.00
	CIU	82.05	100.00	30.34	1.71	64.10	0.21	0.48	8.23
	RuleFit	100.00	78.17	58.97	0.00009	56.41	-0.0558	16.46	9.10
	RuleMatrix	100.00	71.07	79.49	0.00021	61.54	-0.0023	16.59	8.21
	Anchor	41.03	95.27	87.62	3.43	87.18	-0.2543	7.45	3.39
Echocardiogram	LIME	0.00	100.00	63.13	0.73	87.88	0.1502	0.00	0.00
	CIU	6.06	100.00	65.66	0.91	81.82	0.2393	0.42	4.55
	RuleFit	100.0	75.21	63.64	0.00073	93.94	0.2087	8.73	4.24
	RuleMatrix	100.00	37.56	78.79	0.00019	87.88	-0.1623	8.82	3.55
	Anchor	57.58	66.58	96.46	2.46	51.52	0.2434	7.45	3.55
Breast Cancer Wisconsin (original)	LIME	0.00	100.00	75.83	4.57	92.39	-0.0794	0.00	5.44
	CIU	0.00	100.00	72.22	4.035	92.98	0.33	0.00	5.42
	RuleFit	100.00	50.97	91.23	0.00017	92.98	-0.3619	7.56	5.49
	RuleMatrix	100.00	62.67	94.74	0.00009	84.21	-0.3074	7.22	5.29
	Anchor	58.48	74.80	80.73	44.83	61.40	0.0254	5.98	5.44
SPECT heart	LIME	0.00	100.00	63.13	0.168	57.22	-0.0281	0.00	0.00
	CIU	0.00	100.00	25.13	0.239	88.77	0.2201	0.00	0.00
	RuleFit	100.00	67.52	87.70	0.00022	63.64	0.0776	21.41	0.00
	RuleMatrix	100.00	83.77	73.79	0.00007	51.87	0.12	20.61	0.00
	Anchor	49.20	98.15	81.65	0.40	78.07	0.15	18.92	0.00
Diabetes	LIME	0.00	100.00	21.43	5.72	60.82	0.0177	8.94	0.00
	CIU	0.00	100.00	27.76	5.305	52.45	0.0159	5.99	0.06
	RuleFit	100.00	68.02	59.64	0.00004	54.08	0.0080	40.19	0.00
	RuleMatrix	100.00	94.56	63.06	0.00058	59.79	0.1038	39.03	0.00
	Anchor	19.18	97.99	88.51	29.4	54.89	0.0193	35.11	0.00
Diabetic retinopathy Debrecen	LIME	0.00	100.00	34.03	0.2793	60.07	-0.0125	0.00	5.88
	CIU	0.00	100.00	32.29	0.5747	65.28	-0.0532	0.01	5.46
	RuleFit	100.00	63.27	62.85	0.00013	62.15	-0.0523	18.06	7.87
	RuleMatrix	100.00	89.68	60.42	0.00021	58.68	0.0197	17.60	6.52
	Anchor	30.21	95.01	91.59	1.25	54.17	0.0127	15.57	8.43
HIV-1 protease cleavage	LIME	0.00	100.00	74.03	2.67	86.67	0.0699	0.00	2.22
	CIU	59.36	99.99	73.64	5.95	98.48	-0.0827	0.00	2.15
	RuleFit	100.00	51.83	22.42	0.00005	87.86	0.0734	6.32	2.27
	RuleMatrix	100.00	0.16	32.72	0.00005	100.00	-0.036	8.00	2.20
	Anchor	30.86	97.30	76.39	29.18	83.64	-0.0358	3.49	2.36
Heart failure clinical records	LIME	0.00	100.00	42.89	2.94	84.00	-0.1693	0.00	0.00
	CIU	0.00	100.00	53.33	3.27	72.00	0.29	0.00	0.64
	RuleFit	100.00	62.77	76.00	0.00036	77.33	0.1445	10.39	0.77
	RuleMatrix	100.00	44.36	36.00	0.00034	70.67	0.0441	11.61	0.76
	Anchor	77.33	76.04	94.12	12.29	69.33	0.1807	10.06	0.64

LIME: local interpretable model-agnostic explanations; CIU: contextual importance and utility.

Identity: RuleFit and RuleMatrix consistently achieved 100%, indicating perfect reproducibility. LIME and CIU frequently scored 0%.

Separability: LIME and CIU obtained high scores across datasets, while RuleFit and RuleMatrix showed more variability.

Fidelity: RuleFit, RuleMatrix, and Anchor achieved high fidelity scores. LIME and CIU performed lower on this metric.

Speed: RuleFit and RuleMatrix provided the fastest explanations. Anchor was the slowest.

Stability: RuleFit and RuleMatrix had high stability. Anchor had the lowest.

Monotonicity: CIU achieved the highest monotonicity scores. RuleFit and Anchor were moderate. LIME scored the lowest.

Non-sensitivity: LIME and CIU scored highest in identifying non-influential features. Rule-based methods performed lower.

Effective complexity: LIME and CIU generated the simplest explanations. RuleFit and RuleMatrix had higher complexity.
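To make the identity and separability patterns above concrete, the following sketch computes both scores for two toy explainers. It assumes the usual axiomatic reading of these metrics (identical instances should receive identical explanations; distinct instances should receive distinct explanations); the exact operationalization used in this study may differ. The functions `identity_score` and `separability_score` and both toy explainers are hypothetical illustrations, not the evaluated methods.

```python
import random
from itertools import combinations


def identity_score(explain, instances, runs=2):
    """Percent of instances whose explanation is identical across repeated runs."""
    same = 0
    for x in instances:
        first = explain(x)
        if all(explain(x) == first for _ in range(runs - 1)):
            same += 1
    return 100.0 * same / len(instances)


def separability_score(explain, instances):
    """Percent of pairs of distinct instances that receive distinct explanations."""
    pairs = [(a, b) for a, b in combinations(instances, 2) if a != b]
    if not pairs:
        return 100.0
    distinct = sum(explain(a) != explain(b) for a, b in pairs)
    return 100.0 * distinct / len(pairs)


def rule_explain(x):
    # Deterministic toy rule explainer: the rule that fires for the instance.
    return ("x0 > 0.5",) if x[0] > 0.5 else ("x0 <= 0.5",)


_rng = random.Random(0)


def sampling_explain(x):
    # Toy sampling-based explainer: weights depend on fresh random draws,
    # so two calls on the same instance return different explanations.
    return tuple(round(v + _rng.gauss(0.0, 0.1), 6) for v in x)


instances = [(0.2, 1.0), (0.9, 0.0), (0.7, 1.0)]

rule_identity = identity_score(rule_explain, instances)
rule_separability = separability_score(rule_explain, instances)
sampling_identity = identity_score(sampling_explain, instances)
sampling_separability = separability_score(sampling_explain, instances)
```

The deterministic rule explainer reproduces its output exactly but assigns the same rule to two of the three instances, whereas the sampling-based explainer never repeats itself and therefore never collides. This mirrors, in miniature, the contrast in Table 3 between the rule-based methods and the perturbation-based ones.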

Global interpretability

Table 4 summarizes the performance of the global explanation methods (LIME, Anchor, RuleFit, and RuleMatrix) across the ten datasets on seven global interpretability metrics:

Identity: RuleFit and RuleMatrix scored 100% across datasets. LIME and Anchor frequently scored 0%.

Separability: LIME and Anchor provided diverse instance-level explanations. RuleFit and RuleMatrix were more homogeneous.

Compactness: LIME yielded the most compact explanations, followed by RuleFit and RuleMatrix.

Correctness: RuleFit and RuleMatrix demonstrated higher alignment with black-box model predictions. LIME had lower scores.

Entropy ratio: Lower entropy ratios were observed for LIME, RuleFit, and RuleMatrix, indicating focus on fewer features.

Kullback-Leibler divergence: RuleFit and RuleMatrix achieved the lowest divergence values.

Gini coefficient: RuleFit and RuleMatrix obtained higher Gini coefficients, reflecting concentrated feature attributions. LIME exhibited more distributed importance.
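The three distributional metrics above can be illustrated on a vector of feature importances. The sketch below assumes common textbook definitions: entropy normalized by the entropy of a uniform distribution, Kullback-Leibler divergence measured against a uniform reference, and the standard Gini coefficient. These are assumptions for illustration; the reference distribution and normalization used in this study may differ, and all function names are hypothetical.

```python
import math


def normalize(importances):
    """Turn raw importances into a probability distribution over features."""
    total = sum(abs(v) for v in importances)
    if total == 0:
        raise ValueError("at least one importance must be non-zero")
    return [abs(v) / total for v in importances]


def entropy_ratio(p):
    """Shannon entropy of p divided by the entropy of a uniform distribution."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    return h / math.log(len(p))


def kl_from_uniform(p):
    """KL(p || uniform): zero when importance is spread evenly."""
    n = len(p)
    return sum(q * math.log(q * n) for q in p if q > 0)


def gini(p):
    """Gini coefficient: 0 for equal shares, (n - 1) / n when one feature has all."""
    xs = sorted(p)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * sum(xs)) - (n + 1) / n


# Two extreme cases over four features.
uniform = normalize([1, 1, 1, 1])        # importance spread evenly
concentrated = normalize([0, 0, 0, 5])   # all importance on one feature
```

Under these definitions, a lower entropy ratio and a higher Gini coefficient both indicate that importance is concentrated on fewer features, which matches the direction of interpretation given in the text.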

Discussion

The benchmarking results reveal distinct trade-offs among the evaluated interpretability methods. RuleFit and RuleMatrix demonstrated strong performance in reproducibility, fidelity, and correctness, which are critical for deployment in clinical and regulatory contexts. However, their explanations often exhibited higher complexity, which may increase the cognitive effort required for interpretation. Anchor offered competitive correctness and fidelity but showed notable limitations in reproducibility and latency. These constraints can reduce its applicability in time-sensitive or reliability-critical settings.

LIME and CIU stood out in terms of separability and low effective complexity, making them attractive for settings where simplicity and differentiation between instances are important. However, both often scored 0% in identity, indicating a lack of consistency in their explanations. LIME and Anchor also often scored 0% in global identity, which raises concerns about the reproducibility and reliability of the explanations these methods generate in practice.

These findings underscore that no explanation method dominates across all metrics or use cases. The appropriate choice depends on the application’s interpretability demands, whether prioritizing auditability, simplicity, model alignment, or computational efficiency. Finally, our evaluation does not incorporate human-centered assessments such as usability or clinician feedback. This limits our ability to judge the practical effectiveness of explanations in real-world decision-making. Future work should incorporate expert-in-the-loop studies and assess performance on diverse data modalities.
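One way to check the observation that no method dominates is a simple Pareto comparison. The sketch below uses the Heart disease rows of Table 3 and only five metrics; it assumes that higher is better for identity, separability, fidelity, and stability, and that lower is better for speed. The helper names `oriented` and `dominates` are hypothetical, and the result applies only to this one dataset and this choice of metrics.

```python
# Heart disease rows of Table 3:
# (identity %, separability %, fidelity %, stability %, speed in seconds)
rows = {
    "LIME":       (0.00, 100.00, 43.71, 74.32, 8.43),
    "CIU":        (0.00, 100.00, 46.17, 82.10, 5.53),
    "RuleFit":    (100.00, 70.28, 80.54, 80.54, 0.00009),
    "RuleMatrix": (100.00, 92.27, 91.43, 79.38, 0.00017),
    "Anchor":     (50.97, 96.71, 82.43, 70.04, 35.91),
}


def oriented(values):
    """Flip the sign of speed so that larger is better for every entry."""
    *higher_is_better, speed = values
    return (*higher_is_better, -speed)


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    a, b = oriented(a), oriented(b)
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Methods that no other method dominates (the Pareto front).
non_dominated = [
    m for m in rows
    if not any(dominates(rows[o], rows[m]) for o in rows if o != m)
]

# Methods that dominate every other method (empty if there is no overall winner).
best_everywhere = [
    m for m in rows
    if all(dominates(rows[m], rows[o]) for o in rows if o != m)
]
```

On these rows, `best_everywhere` is empty, so no single method wins on all five metrics, and the only dominated method is LIME (by CIU). Adding the remaining metrics or switching datasets could change which methods are dominated.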

Table 4.

Metric scores for different global explainability techniques on tabular data.

Dataset name	Method	Identity (%)	Separability (%)	Compactness (%)	Correctness (%)	Entropy ratio	Kullback-Leibler divergence	Gini coefficient
Thyroid	LIME	2.13	48.11	91.49	42.28	0.68	1.24	0.00035
	Anchor	0.00	24.74	95.00	92.07	0.71	0.86	0.00170
	RuleFit	100	22	80	94.01	0.70	0.83	0.00230
	RuleMatrix	100	20	90	93.34	0.71	0.79	0.00210
Heart disease	LIME	0.00	21.68	92.50	32.50	0.82	0.65	0.00038
	Anchor	0.00	24.73	92.31	97.94	0.87	0.35	0.00261
	RuleFit	100	17	90	98.02	0.84	0.20	0.00311
	RuleMatrix	100	18	91.3	96.45	0.83	0.21	0.00365
Hepatitis	LIME	2.22	26.67	97.78	40.00	0.78	0.84	0.00032
	Anchor	0.00	37.36	61.11	97.84	0.93	0.20	0.00103
	RuleFit	100	30	61	98.6	0.79	0.12	0.00212
	RuleMatrix	100	35	60	97.45	0.80	0.11	0.00263
Echocardiogram	LIME	0.00	72.18	70.59	26.47	0.78	0.77	0.00054
	Anchor	0.00	48.31	88.89	98.27	0.79	0.47	0.00624
	RuleFit	100	45	65.5	99.4	0.70	0.31	0.00671
	RuleMatrix	100	43	69.3	99.01	0.71	0.32	0.00723
Breast Cancer Wisconsin (original)	LIME	8.33	48.71	95.83	33.33	0.66	1.07	0.00129
	Anchor	0.00	40.11	87.50	99.59	0.84	0.33	0.00627
	RuleFit	100	39	86.5	93.03	0.86	0.29	0.00821
	RuleMatrix	100	35	80	92.43	0.88	0.21	0.00917
SPECT heart	LIME	4.54	11.73	86.36	50.00	0.85	0.57	0.00028
	Anchor	0.00	58.10	86.36	99.68	0.91	0.29	0.00079
	RuleFit	100	10	86	99.35	0.88	0.18	0.00139
	RuleMatrix	100	9.72	85	99.40	0.87	0.19	0.00109
Diabetes	LIME	5.74	18.75	91.95	46.37	0.73	1.17	0.00001
	Anchor	0.00	20.89	78.05	95.31	0.88	0.43	0.00030
	RuleFit	100	17	75	94.5	0.80	0.23	0.00041
	RuleMatrix	100	16	77.31	93.1	0.83	0.35	0.00039
Diabetic retinopathy Debrecen	LIME	0.00	37.52	66.18	28.38	0.63	1.55	0.00018
	Anchor	0.00	26.58	68.42	96.93	0.95	0.16	0.00088
	RuleFit	100	25	65.12	97.1	0.72	0.13	0.00089
	RuleMatrix	100	21	66.1	98.3	0.72	0.14	0.00091
HIV-1 protease cleavage	LIME	0.00	21.69	81.25	16.88	0.91	0.44	0.00002
	Anchor	0.00	36.75	62.50	87.72	0.95	0.11	0.00401
	RuleFit	100	20	55.3	88.21	0.92	0.09	0.00451
	RuleMatrix	100	20.5	53.21	85.3	0.93	0.11	0.00501
Heart failure clinical records	LIME	0.00	61.13	97.37	31.58	0.76	0.89	0.00047
	Anchor	0.00	46.58	50.00	97.64	0.80	0.49	0.00368
	RuleFit	100	50	40	96.03	0.80	0.30	0.00467
	RuleMatrix	100	52	35	98.01	0.83	0.25	0.00511

LIME: local interpretable model-agnostic explanations.

Conclusion

This study quantitatively evaluates local (LIME, CIU, RuleFit, RuleMatrix, and Anchor) and global (LIME, Anchor, RuleFit, and RuleMatrix) explanation methods using healthcare datasets. Among global explainability techniques, RuleFit and RuleMatrix consistently perform well, providing clear and robust model interpretations. In contrast, LIME, which prioritizes simplicity, exhibits lower correctness, indicating potential misalignment with the underlying model. For local explainability, RuleFit and RuleMatrix excel in identity, stability, and fidelity metrics, demonstrating strong alignment with model behavior. The variation in performance across different metrics highlights the need for nuanced method selection, emphasizing the importance of prioritizing criteria based on the specific application and carefully considering trade-offs between explainability dimensions.

Despite the comprehensive scope of this study, several limitations warrant consideration. First, the analysis focuses on a specific subset of explainability techniques, primarily rule-based and surrogate model approaches. This excludes other prominent families of XAI methods, such as gradient-based saliency techniques and counterfactual explanations, which may behave differently under the same evaluation metrics. Second, the experimental evaluation is limited to structured, tabular datasets derived from clinical settings. Consequently, the generalizability of the findings to unstructured data domains (e.g. medical imaging, clinical notes, or time-series data) remains unexamined. Third, the study emphasizes quantitative assessment metrics (including identity, fidelity, and stability) but does not incorporate human-centered evaluations. In particular, the cognitive and practical interpretability of the generated explanations from a domain expert’s perspective (e.g. clinicians or healthcare practitioners) is not assessed, which is critical for real-world deployment.
Lastly, beyond the per-explanation speed reported in Table 3, the current evaluation does not explore the computational cost or scalability of the methods, factors that may influence their feasibility in time-sensitive or resource-constrained environments. Furthermore, it is important to acknowledge that not all explanations are equivalent in purpose, granularity, or interpretive utility. Different XAI methods offer fundamentally distinct forms of explanations, ranging from rule-based abstractions to local perturbation-driven approximations, each suited to different users and contexts. For instance, a high-fidelity explanation may not be understandable to a non-technical audience, while a more intuitive explanation may oversimplify model behavior. These epistemic trade-offs highlight the need for incorporating expert-in-the-loop validation protocols to assess explanation relevance, plausibility, and actionability from a domain-specific perspective. In healthcare, engaging clinicians to validate whether the explanations align with clinical reasoning and decision-making pathways is essential. Such human-centered evaluations are a critical next step for ensuring the practical reliability and ethical deployment of explainable AI in high-stakes environments.

Footnotes

Acknowledgements

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R756), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

ORCID iD

Radwa El Shawi

Ethical approval

This study used publicly available, de-identified datasets and did not involve any experiments on human subjects. Therefore, ethical approval was not required.

Contributorship

Radwa El Shawi and Nada Ahmed: conceptualization; Krish Agrawal and Radwa El Shawi: methodology; Krish Agrawal: software; Radwa El Shawi and Nada Ahmed: validation; Krish Agrawal and Radwa El Shawi: formal analysis; Krish Agrawal, Radwa El Shawi, and Nada Ahmed: investigation; Krish Agrawal: data curation; Krish Agrawal and Radwa El Shawi: writing–original draft preparation; Krish Agrawal, Radwa El Shawi, and Nada Ahmed: writing–review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been partially funded by the project Increasing the knowledge intensity of Ida-Viru entrepreneurship co-funded by the European Union. This work has been partially funded by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R756), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Guarantor

Radwa El Shawi and Nada Ahmed are the guarantors for this article and accept full responsibility for the integrity and accuracy of the work.

References

1. Shawi, Al-Mallah. Interpretable local concept-based explanation with human feedback to predict all-cause mortality. J Artif Intell Res 2022; 75: 833–855.

2. Chen, Wei, Huang, et al. A novel and interpretable approach to data augmentation for enhanced prediction of cutting tool remaining useful life. J Comput Des Eng 2025; 12: 121–139.

3. Tariq, Recio-Garcia, Cetina-Quiñones, et al. Explainable artificial intelligence twin for metaheuristic optimization: double-skin facade with energy storage in buildings. J Comput Des Eng 2025; 12: 16–35.

4. Lee, Jung, Kang, et al. Deep learning-based framework for monitoring wearing personal protective equipment on construction sites. J Comput Des Eng 2023; 10: 905–917.

5. Changdar, Bhaumik. Physics-based smart model for prediction of viscosity of nanofluids containing nanoparticles using deep learning. J Comput Des Eng 2021; 8: 600–614.

6. Guidotti, Monreale, Ruggieri, et al. A survey of methods for explaining black box models. ACM Comput Surv (CSUR) 2018; 51: 1–42.

7. Kurenkov. Lessons from the pulse model and discussion. The gradient, 2020.

8. Chouldechova. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 2017; 5: 153–163.

9. Goodman, Flaxman. EU regulations on algorithmic decision-making and a “right to explanation”. In: ICML workshop on human interpretability in machine learning (WHI 2016), New York, NY. http://arxiv.org/abs/1606.08813v1.

10. Miller. Explanation in artificial intelligence: insights from the social sciences. Artif Intell 2019; 267: 1–38.

11. Molnar. Interpretable machine learning. Lulu.com, 2020.

12. Yagin, El Shawi, Algarni, et al. Metabolomics biomarker discovery to optimize hepatocellular carcinoma diagnosis: methodology integrating automl and explainable artificial intelligence. Diagnostics 2024; 14: 2049.

13. Selvaraju, Cogswell, Das, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp.618–626.

14. Jain, Wallace. Attention is not explanation. arXiv preprint arXiv:1902.10186, 2019.

15. Guo, Liu, et al. Causality-based feature selection: methods and evaluations. ACM Comput Surv (CSUR) 2020; 53: 1–36.

16. Adadi, Berrada. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 2018; 6: 52138–52160.

17. Lundberg, Lee. A unified approach to interpreting model predictions. In: Advances in neural information processing systems, volume 30. https://arxiv.org/abs/1705.07874.

18. Sundararajan, Taly, Yan. Axiomatic attribution for deep networks. In: International conference on machine learning, pp.3319–3328. https://arxiv.org/abs/1703.01365.

19. Friedman, Popescu. Predictive learning via rule ensembles, 2008.

20. Ribeiro, Singh, Guestrin. Anchors: high-precision model-agnostic explanations. In: Proceedings of the AAAI conference on artificial intelligence, volume 32.

21. Žlahtič, Završnik, Blažun Vošner, et al. Transferring black-box decision making to a white-box model. Electronics 2024; 13: 1895.

22. Dwivedi, Dave, Naik, et al. Explainable ai (XAI): core ideas, techniques, and solutions. ACM Comput Surv 2023; 55: 1–33.

23. ElShawi, Sherif, Al-Mallah, et al. Interpretability in healthcare: a comparative study of local machine learning interpretability techniques. Comput Intell 2021; 37: 1633–1650.

24. Honegger. Shedding light on black box machine learning algorithms: development of an axiomatic framework to assess the quality of methods that explain individual predictions. arXiv preprint arXiv:1808.05054, 2018.

25. Elshawi, Al-Mallah, Sakr. On the interpretability of machine learning-based model for predicting hypertension. BMC Med Inform Decis Mak 2019; 19: 1–32.

26. Buckmann, Joseph, Robertson. An interpretable machine learning workflow with an application to economic forecasting. Technical report, Bank of England, 2022.

27. Albini, Long, Dervovic, et al. Counterfactual shapley additive explanations. In: Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pp.1054–1070.

28. Ribeiro, Singh, Guestrin. “Why should i trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp.1135–1144.

29. ElShawi, Sherif, Al-Mallah, et al. ILIME: local and global interpretable model-agnostic explainer of black-box decision. In: Advances in databases and information systems: 23rd European conference, ADBIS 2019, Bled, Slovenia, September 8–11, 2019, Proceedings 23, pp.53–68. Springer.

30. Craven, Shavlik. Extracting tree-structured representations of trained networks. Adv Neural Inf Process Syst 1995; 8: 24–30.

31. Samek, Montavon, Vedaldi, et al. Explainable AI: interpreting, explaining and visualizing deep learning, volume 11700. Cham: Springer Nature, 2019. DOI: 10.1007/978-3-030-28954-6.

32. Carvalho, Pereira, Cardoso. Machine learning interpretability: a survey on methods and metrics. Electronics 2019; 8: 832.

33. Arrieta, Díaz-Rodríguez, Del Ser, et al. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion 2020; 58: 82–115.

34. Guidotti, Monreale, Giannotti, et al. Factual and counterfactual explanations for black box decision making. IEEE Intell Syst 2019; 34: 14–23.

35. Alvarez Melis, Jaakkola. Towards robust interpretability with self-explaining neural networks. Adv Neural Inf Process Syst 2018; 31: 7775.

36. Silva, Fernandes, Cardoso, et al. Towards complementary explanations using deep neural networks. In: Understanding and interpreting machine learning in medical image computing applications: first international workshops, MLCN 2018, DLF 2018, and iMIMIC 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16–20, 2018, Proceedings 1, pp.133–140. Springer.

37. Doshi-Velez, Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

38. Doshi-Velez, Kortz, Budish, et al. Accountability of ai under the law: the role of explanation. arXiv preprint arXiv:1711.01134, 2017.

39. Doshi-Velez, Kim. Considerations for evaluation and generalization in interpretable machine learning. In: Explainable and interpretable models in computer vision and machine learning, 2018, pp.3–17.

40. Anjomshoae, Kampik, Främling. Py-CIU: a python library for explaining machine learning predictions using contextual importance and utility. In: IJCAI-PRICAI 2020 workshop on explainable artificial intelligence (XAI), January 8, 2020.

41. Främling. Decision theory meets explainable ai. In: International workshop on explainable, transparent autonomous agents and multi-agent systems, pp.57–74. Springer.

42. Anjomshoae, Främling, Najjar. Explanations of black-box model predictions by contextual importance and utility. In: Explainable, transparent autonomous agents and multi-agent systems: first international workshop, EXTRAAMAS 2019, Montreal, QC, Canada, May 13–14, 2019, Revised Selected Papers 1, pp.95–109. Springer.

43. Ming, Bertini. Rulematrix: visualizing and understanding classifiers with rules. IEEE Trans Vis Comput Graph 2018; 25: 342–352.

44. Yang, Rudin, Seltzer. Scalable bayesian rule lists. In: International conference on machine learning, pp.3921–3930. PMLR.

45. Nguyen, Martínez. On quantitative aspects of model interpretability. arXiv preprint arXiv:2007.07584, 2020.

46. Sprent, Smeeton. Applied nonparametric statistical methods. CRC Press, 2016.

47. Ylikoski, Kuorikoski. Dissecting explanatory power. Philos Stud 2010; 148: 201–219.

48. Munoz, da Costa, Modenesi, et al. Local and global explainability metrics for machine learning predictions. arXiv preprint arXiv:2302.12094, 2023.

49. Kulesza, Stumpf, Burnett, et al. Too much, too little, or just right? Ways explanations impact end users’ mental models. In: 2013 IEEE symposium on visual languages and human centric computing, pp.3–10. IEEE.

50. Murdoch, Singh, Kumbier, et al. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci 2019; 116: 22071–22080.

51. Bhatt, Weller, Moura. Evaluating and aggregating feature-based model explanations. arXiv preprint arXiv:2005.00631, 2020.

52. UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html (2020, accessed 1 February 2003).

53. UCI Machine Learning Repository. Thyroid disease data set, 1987. https://archive.ics.uci.edu/ml/datasets/thyroid+disease (accessed 2025).

54. UCI Machine Learning Repository. Heart disease data set, 1988. https://archive.ics.uci.edu/ml/datasets/Heart+Disease (accessed 2025).

55. UCI Machine Learning Repository. Hepatitis data set, 1988. https://archive.ics.uci.edu/ml/datasets/Hepatitis (accessed 2025).

56. UCI Machine Learning Repository. Echocardiogram data set, 1990. https://archive.ics.uci.edu/ml/datasets/Echocardiogram (accessed 2025).

57. UCI Machine Learning Repository. Breast Cancer Wisconsin (original) data set, 1992. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original) (accessed 2025).

58. UCI Machine Learning Repository. Spect heart data set, 2000. https://archive.ics.uci.edu/ml/datasets/SPECT+Heart (accessed 2025).

59. Strack, et al. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database records. Biomed Res Int 2014; 2014: 781670.

60. UCI Machine Learning Repository. Diabetic retinopathy debrecen data set, 2008. https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set (accessed 2025).

61. UCI Machine Learning Repository. HIV-1 protease cleavage data set, 2000. https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage (accessed 2025).

62. UCI Machine Learning Repository. Heart failure clinical records data set, 2012. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records (accessed 2025).

63. Feurer M, Klein A, Eggensperger K, et al. Efficient and robust automated machine learning. Adv Neural Inf Process Syst 2015; 28.

XAI-Eval: A framework for comparative evaluation of explanation methods in healthcare
