Abstract
Introduction
Artificial intelligence (AI) has become increasingly prevalent, and its impact is especially notable in healthcare. AI chatbots address challenges in medical access, particularly for underserved or remote populations (Laymouna et al., 2024; Usher et al., 2024). While many conditions require in-person check-ups, digital healthcare and AI chatbots offer 24/7 personalized support, improving accessibility (Aggarwal et al., 2023). These tools have also successfully promoted positive lifestyle changes, such as reducing physical inactivity and smoking (Aggarwal et al., 2023). Prior studies have examined the applications and challenges of Large Language Model (LLM) integration in healthcare, revealing that chatbots face limitations in multi-step tasks involving complex decision-making (Dam et al., 2024).
Additionally, hallucination continues to pose a serious threat to the accuracy of the information provided. Ongoing research suggests that, as a consequence of Gödel's First Incompleteness Theorem (Gödel, 1931), there is a non-zero probability of hallucination occurring regardless of changes to system architecture (Banerjee et al., 2024). In this context, Multi-Agent Systems (MAS) offer a promising alternative for addressing these large-scale issues. The distribution of tasks among multiple agents, and the ability to delegate work autonomously, can leverage specialization of labor to create a checks-and-balances system that mitigates some of these problems (Park et al., 2025). MAS integration into healthcare has gained support, with studies positing that frameworks such as crewAI (https://www.crewai.com/) and Langchain (https://www.langchain.com/) offer ways to revolutionize medicine (Borkowski and Ben-Ari, 2024). For instance, a 2024 study investigated the use of MAS in patient monitoring and found that the MAS outperformed traditional single-agent systems such as Q-Learning and Double DQN in tracking patient vitals (Shaik et al., 2023). While MAS integration has been investigated in areas such as pain management (Kamal et al., 2023), pre-hospital care (Safdari et al., 2017), and wellness monitoring (Humayun et al., 2022), its implementation remains limited in emergency remote care scenarios, where rapid first-aid interventions are critical for preventing life-threatening outcomes (Park et al., 2024). Moreover, auditory and visual information, which can aid immediate action, has never been incorporated into any such MAS. To address these gaps, our study aims to develop a multi-modal MAS for emergency situations and evaluate its effectiveness in remote care applications, exploring its ability to reduce follow-up questions and gain user trust relative to a single-agent system (Park et al., 2020).
Over recent years, the popularity of chatbot technology has grown exponentially, revolutionizing the way AI is used in individuals' daily lives. The adoption of chatbots has shown a consistent upward trend since 2016, peaking in 2021 with 79 studies conducted in this domain in that year alone (Alsharhan et al., 2023). These AI-based conversational agents have been incorporated into several fields, including customer service, banking, healthcare, education, e-commerce, tourism, and hospitality (Alsharhan et al., 2023), reflecting the rapidly growing interest in this transformative technology. Notably, in the healthcare sector, 52% of patients already obtain their health information through chatbots (Market Research Future, 2021). Due to their conversational nature and ability to simulate human-like interactions, chatbots are accessible and practical tools for streamlined communication and personalized support. One of the most popular chatbots, OpenAI's ChatGPT, has more than 200 million weekly users (Kelly et al., 2025). As this trend continues, AI chatbots are set to be a defining technology of the next decade (Kelly et al., 2022), holding immense potential across various domains, including healthcare.
Methodology
This study aims to develop a MAS (https://github.com/AD-txigfwbexxk23/Multi-Agent-Systems-MAS-for-Remote-Healthcare; Figures 1 and 2) for remote and emergency care situations, and to test this system for trust and efficiency. An AI system was developed using the crewAI library for Python 3.12.7, in which multiple agents focused on different tasks are controlled by a single master agent. These agents include a Symptom Analysis Agent, an Advisor Agent, a Verification Agent, a User Proficiency Agent, and a Risk Assessment Agent. Communication between agents is facilitated through the crewAI framework, where each agent, powered by a distinct AI model, contributes task-specific results to the master agent. The master agent consolidates and displays these results. Instructions for each agent are tailored to emergency remote healthcare situations. A memory file stores all user prompts, enabling the system to track past interactions and establish connections that refine subsequent responses. Figure 4 provides a more in-depth look at this system and how the components of the crewAI framework come together. Tools are accessible to each agent, allowing agents to search the web and place phone calls via the Twilio API. Agent declaration involves loading each agent with predefined roles and prompt templates, which are used to optimize healthcare-relevant tasks. These tasks include logic and instructions based on established medical information. Once the system is executed, the master agent issues task requests to subordinate agents, retrieves their outputs, and integrates the responses.
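The master–subordinate delegation pattern described above can be illustrated with a minimal plain-Python sketch. This is not the actual crewAI implementation: the agent roles come from the paper, but the handler logic and class names here are hypothetical placeholders for illustration only.

```python
# Illustrative sketch of master-agent delegation (NOT the actual crewAI code):
# a master agent issues task requests to specialist agents and
# consolidates their task-specific results into one response.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpecialistAgent:
    role: str                     # e.g. "Symptom Analysis Agent"
    handle: Callable[[str], str]  # hypothetical task logic for this specialty


class MasterAgent:
    def __init__(self, agents: Dict[str, SpecialistAgent]):
        self.agents = agents
        self.memory: List[str] = []  # stores past user prompts, as in the paper

    def run(self, prompt: str) -> str:
        self.memory.append(prompt)  # track interactions to refine later responses
        # Delegate the prompt to every specialist and gather their outputs.
        results = {name: agent.handle(prompt)
                   for name, agent in self.agents.items()}
        # Consolidate the task-specific outputs into a single displayed response.
        return "\n".join(f"[{agent.role}] {results[name]}"
                         for name, agent in self.agents.items())


# Hypothetical specialist logic, for demonstration only.
crew = MasterAgent({
    "symptoms": SpecialistAgent("Symptom Analysis Agent",
                                lambda p: f"parsed symptoms from: {p}"),
    "risk": SpecialistAgent("Risk Assessment Agent",
                            lambda p: "risk level: high"),
})
print(crew.run("Person collapsed and is not breathing"))
```

In the real system, each `handle` would be an LLM-backed crewAI task with its own prompt template and tool access; the sketch only shows the delegation-and-consolidation flow.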

MAS interface.

crewAI framework.
The system also supports multi-modal capabilities, providing both visual and auditory instructions to enhance user assistance in critical scenarios. The interface is implemented as a website for accessibility and testing purposes. Ergonomics was a crucial consideration in the user interface (UI) design, particularly in supporting accurate, real-time actions in high-stakes scenarios (Park et al., 2023). The system architecture was designed to minimize cognitive workload through an easy-to-understand layout, with the most relevant information presented through a clear visual hierarchy. This included optimizing button placement, screen layout, and visual feedback mechanisms to reduce reaction time and physical strain. Buttons are positioned on the left-hand panel in a vertical arrangement, ensuring they remain spatially distinct and easy to locate with minimal eye movement. High-priority actions are placed near the top of the panel to reduce interaction time. At the same time, consistent colour coding, such as red for confirmation, provides immediate visual cues to reinforce intended actions. This structure minimizes decision fatigue and supports intuitive navigation during emergencies.
Furthermore, the multimodal capabilities supporting auditory input and output are directly informed by principles of human-factors ergonomics. These include reducing cognitive load through redundant cueing, supporting situational awareness under time pressure, and minimizing user error by aligning interface behaviours with natural human response patterns. By applying these principles, the system enhances usability and ensures that users can make fast decisions in high-stakes medical scenarios.
Trust is a critical factor in a high-risk domain like emergency remote healthcare, where decisions must be made quickly and often with limited oversight. A lack of appropriate trust, whether over-reliance or under-reliance, can lead to misuse, disuse, or delayed action, directly impacting patient outcomes (Hoff & Bashir, 2015). A quantum-based modelling system utilizing the Qiskit library (https://www.ibm.com/quantum/qiskit) was developed to measure trust during the validation process, quantifying trust from 0% to 100%. In this model, a qubit representing trust starts in superposition. During the interaction, user prompts are analyzed for sentiment, which directly affects the trust qubit's state by rotating it toward |0⟩ or |1⟩ on the Bloch sphere. This approach is well-suited for the MAS validation because it allows real-time monitoring of user confidence and quantification of trust. Unlike post-hoc surveys, which are prone to recall bias, coarse resolution, irrational responses, and social desirability effects, the quantum trust model offers a more objective, fine-grained, and behaviorally grounded measurement of trust evolution over time. In parallel, we also administered a modified Trust Scale for the AI Context (TAI) survey (Scharowski et al., 2024) after each trial to benchmark the quantum-based model, supporting its use as a novel and viable approach for quantifying trust in human-AI interactions. The TAI survey differs slightly from previous publications, as the questions were altered to fit the study context.
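The rotation arithmetic behind the trust qubit can be sketched as follows. This is a minimal illustration in plain Python math rather than Qiskit, so it stays self-contained; the fixed rotation step size and the signed sentiment score are assumptions for demonstration, since the study's actual step size and sentiment analyzer are not specified here.

```python
# Minimal sketch of the quantum trust model (assumed step size; not the
# study's actual Qiskit implementation). The qubit's polar angle theta on
# the Bloch sphere encodes trust: theta = 0 is |0> (no trust),
# theta = pi is |1> (full trust).
import math


class TrustQubit:
    def __init__(self):
        # Equal superposition: trust starts at 50%.
        self.theta = math.pi / 2

    def update(self, sentiment: float, step: float = math.pi / 8):
        # Positive sentiment rotates the state toward |1> (more trust),
        # negative sentiment toward |0> (less trust).
        self.theta += sentiment * step
        # Clamp to the valid range of the polar angle.
        self.theta = min(max(self.theta, 0.0), math.pi)

    def trust(self) -> float:
        # Probability of measuring |1>, reported on a 0-100% scale:
        # P(|1>) = sin^2(theta / 2).
        return 100 * math.sin(self.theta / 2) ** 2


q = TrustQubit()
print(round(q.trust()))  # 50 (superposition -> 50% trust)
q.update(+1.0)           # one positively scored user prompt
print(q.trust() > 50)    # True
```

Each analyzed prompt nudges the angle, so the reported trust percentage evolves continuously over the interaction, which is what enables the real-time traces described above.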
Validation of the MAS (IRB number: REB25-0370) (Figure 3) was conducted through a human-subject experiment comparing three sets of performance outcomes: (1) number of follow-up questions, (2) change in trust from the initial state, and (3) trust survey responses. Ten university students (M: 7, F: 3, Age.M = 18.1, Age.STDEV = 0.7) participated in the validation study, in which they navigated two emergency situations: CPR and EpiPen administration (data from two participants were removed due to equipment issues). There were three inclusion criteria: 1) participants should have limited or no prior CPR or EpiPen training; 2) the overall participant pool must be gender balanced (e.g., three male, five female or 4:4 if eight total); 3) participants must not have impairments that prevent participation (e.g., physical or mental disabilities).

Human-subject study (CPR).
In each scenario, participants performed first aid on a mannequin, guided by information provided by either the MAS or a publicly available GPT-4o mini model. Each participant completed six trials, covering all emergencies with both chatbot mediums. Participants initiated each session with a predefined query and were allowed to ask follow-up questions for further clarification or guidance. After each interaction, the number of follow-up questions was recorded, and trust was measured using the quantum model. Additionally, trust levels were assessed after each trial via survey scales. We hypothesized that the MAS would elicit significantly fewer follow-up questions and higher trust than the baseline condition (GPT-4o mini model).
Results
Figure 4 describes the four analyses. For the number of queries, GPT and MAS generated markedly different interaction patterns. A linear mixed-effects analysis with participant as a random intercept showed that MAS reduced the number of clarifying queries by 2.88 ± 0.54 prompts (

Graphical results.
For the Trust in Automation (TPA) questionnaire, reverse-coded items were inverted before aggregation (Cronbach’s α = .84). A two-way repeated-measures ANOVA revealed a main effect of AI support,
Real-time predicted-trust scores (0%–100%) (quantum model) displayed a parallel pattern. A mixed-effects model of the seven-step traces yielded a significant AI main effect (β = +.078 ± .035,
Subjective (trust survey) and predicted trust (quantum model) were strongly associated. A repeated-measures correlation controlling for participant yielded
Discussion
One of the major findings of this study is that the MAS consistently outperformed GPT on both the reduction in number of queries and the trust metrics. Approximately three fewer clarifying prompts per trial represent a 70% reduction in interaction burden, an advantage that is practically important in time-critical care, where additional dialogue incurs cognitive load and delays (Wickens & Hollands, 2000). Convergent evidence from both self-report and algorithmic confidence signals indicates that users judged the MAS to be more reliable and less error-prone. This finding aligns with literature showing that structured, context-limited automation promotes appropriate reliance, whereas open-ended dialogue can engender uncertainty and over- or under-trust (Lee & See, 2004; Hoff & Bashir, 2015).
The trajectory analyses add a temporal dimension to these conclusions. GPT began at parity but lost trust rapidly with each additional query, suggesting that every unanswered follow-up amplifies doubts about competence. The MAS, by resolving the task in one or two exchanges, truncated the opportunity for such erosion. The presence of a significant AI × Task interaction on predicted trust but not on survey trust implies that real-time metrics are sensitive to transient task factors, information that post-hoc questionnaires smooth over. Integrating live trust estimates could therefore allow adaptive interfaces that escalate human oversight when confidence collapses and throttle guidance when confidence stabilizes (Parasuraman & Riley, 1997).
The strong correlation between predicted trust (quantum model) and subjective trust (survey) supports the validity of algorithmic proxies for human sentiment, echoing recent reports that physiological or behavioral models can anticipate trust calibration in human–AI teaming (Caldwell et al., 2022; Wang et al., 2024). Nonetheless, the mixed model revealed a residual main effect of AI after controlling for predicted trust, indicating that design elements not captured by the probability metric, such as formatting, terminology, or perceived transparency, also shape trust.
Future work should aim to replicate these findings with larger and more diverse participant samples, as well as across a broader range of clinical procedures. This would improve the generalizability of the advantages observed for the MAS. Additionally, future studies should explore whether the performance benefits of MAS persist when baseline LLMs such as ChatGPT are enhanced with domain-specific fine-tuning or constrained by safety guardrails. Beyond emergency-based remote healthcare, MAS frameworks may offer efficiency and trust benefits in other healthcare contexts, including chronic disease management, diagnostics, and mental health support. Further development of the quantum trust model is also warranted; incorporating additional design factors may improve its predictive accuracy.
To summarize, our MAS not only reduces interaction demands but also sustains calibrated trust, and its own confidence signal serves as a useful proxy for post-task sentiment. These properties recommend MAS-style interfaces for emergency medical applications where speed, clarity, and appropriate reliance are paramount.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this research was provided by 2025 Transdisciplinary Connector Grant (University of Calgary). The views and opinions expressed are those of the authors.
