Abstract
Introduction
Artificial intelligence (AI) has become increasingly prevalent, and its impact is especially notable in healthcare. AI chatbots address challenges in medical access, particularly for underserved or remote populations (Laymouna et al., 2024; Usher et al., 2024). While many conditions require in-person check-ups, digital healthcare and AI chatbots offer 24/7 personalized support, improving accessibility (Aggarwal et al., 2023). These tools have also successfully promoted positive lifestyle changes, such as reducing physical inactivity and smoking (Aggarwal et al., 2023). Prior studies have examined the applications and challenges of Large Language Model (LLM) integration in healthcare, revealing that chatbots face limitations in multi-step tasks involving complex decision-making (Dam et al., 2024).
Additionally, hallucination continues to pose a serious threat to the accuracy of the information provided. Ongoing research suggests that, as a consequence of Gödel's First Incompleteness Theorem (Gödel, 1931), there is a non-zero probability of hallucination occurring regardless of changes to system architecture (Banerjee et al., 2024). In this context, Multi-Agent Systems (MAS) offer a promising alternative for addressing these large-scale issues. The distribution of tasks among multiple agents, and the ability to delegate work autonomously, can leverage specialization of labor to create a checks-and-balances system that mitigates some of these problems (Park et al., 2025). MAS integration into healthcare has gained support, with studies positing that frameworks such as crewAI (https://www.crewai.com/) and Langchain (https://www.langchain.com/) offer ways to revolutionize medicine (Borkowski and Ben-Ari, 2024). For instance, a 2024 study investigated the use of MAS in patient monitoring and found that the MAS outperformed traditional single-agent systems such as Q-Learning and Double DQN in tracking patient vitals (Shaik et al., 2023). While MAS integration has been investigated in areas such as pain management (Kamal et al., 2023), pre-hospital care (Safdari et al., 2017), and wellness monitoring (Humayun et al., 2022), its implementation remains limited in emergency remote care scenarios, where rapid first-aid interventions are critical for preventing life-threatening outcomes (Park et al., 2024). Moreover, auditory and visual information, which can aid immediate action, has never been incorporated into any such MAS. To address these gaps, our study aims to develop a multi-modal MAS for emergency situations and evaluate its effectiveness in remote care applications, exploring its ability to reduce follow-up questions and gain user trust relative to a single-agent system (Park et al., 2020).
Over recent years, the popularity of chatbot technology has grown exponentially, revolutionizing the way AI is used in individuals' daily lives. The adoption of chatbots has shown a consistent upward trend since 2016, peaking in 2021 with 79 studies conducted in this domain in that year alone (Alsharhan et al., 2023). These AI-based conversational agents have been incorporated into several fields, including customer service, banking, healthcare, education, e-commerce, tourism, and hospitality (Alsharhan et al., 2023), reflecting the rapidly growing interest in this transformative technology. Notably, in the healthcare sector, 52% of patients already obtain their health information through chatbots (Market Research Future, 2021). Due to their conversational nature and ability to simulate human-like interactions, chatbots are accessible and practical tools for streamlined communication and personalized support. One of the most popular chatbots, OpenAI's ChatGPT, has more than 200 million weekly users (Kelly et al., 2025). As this trend continues, AI chatbots are set to be a defining technology of the next decade (Kelly et al., 2022), holding immense potential across various domains, including healthcare.
Methodology
This study aims to develop a MAS (https://github.com/AD-txigfwbexxk23/Multi-Agent-Systems-MAS-for-Remote-Healthcare; Figures 1 and 2) for remote and emergency care situations, and to test this system for trust and efficiency. An AI system was developed using the crewAI library for Python 3.12.7, in which multiple agents focused on different tasks are controlled by a single master agent. These agents include a Symptom Analysis Agent, an Advisor Agent, a Verification Agent, a User Proficiency Agent, and a Risk Assessment Agent. Communication between agents is facilitated through the crewAI framework, where each agent, powered by a distinct AI model, contributes task-specific results to the master agent. The master agent consolidates and displays these results. Instructions for each agent are tailored to emergency remote healthcare situations. A memory file stores all user prompts, enabling the system to track past interactions and establish connections that refine subsequent responses. Figure 4 provides a more in-depth look at this system and how the components of the crewAI framework come together. Tools are accessible to each agent, allowing agents to search the web and place phone calls via the Twilio API. Agent declaration involves loading each agent with predefined roles and prompt templates, which are used to optimize healthcare-relevant tasks. These tasks include logic and instructions based on established medical information. Once the system is executed, the master agent issues task requests to subordinate agents, retrieves their outputs, and integrates the responses.
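The master–subordinate delegation pattern described above can be illustrated with a minimal plain-Python sketch. This is not the actual crewAI implementation: the agent roles come from the paper, but the handler logic and class names here are hypothetical placeholders for illustration only.

```python
# Illustrative sketch of master-agent delegation (NOT the actual crewAI code):
# a master agent issues task requests to specialist agents and
# consolidates their task-specific results into one response.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpecialistAgent:
    role: str                     # e.g. "Symptom Analysis Agent"
    handle: Callable[[str], str]  # hypothetical task logic for this specialty


class MasterAgent:
    def __init__(self, agents: Dict[str, SpecialistAgent]):
        self.agents = agents
        self.memory: List[str] = []  # stores past user prompts, as in the paper

    def run(self, prompt: str) -> str:
        self.memory.append(prompt)  # track interactions to refine later responses
        # Delegate the prompt to every specialist and gather their outputs.
        results = {name: agent.handle(prompt)
                   for name, agent in self.agents.items()}
        # Consolidate the task-specific outputs into a single displayed response.
        return "\n".join(f"[{agent.role}] {results[name]}"
                         for name, agent in self.agents.items())


# Hypothetical specialist logic, for demonstration only.
crew = MasterAgent({
    "symptoms": SpecialistAgent("Symptom Analysis Agent",
                                lambda p: f"parsed symptoms from: {p}"),
    "risk": SpecialistAgent("Risk Assessment Agent",
                            lambda p: "risk level: high"),
})
print(crew.run("Person collapsed and is not breathing"))
```

In the real system, each `handle` would be an LLM-backed crewAI task with its own prompt template and tool access; the sketch only shows the delegation-and-consolidation flow.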

MAS interface.

crewAI framework.
The system also supports multi-modal capabilities, providing both visual and auditory instructions to enhance user assistance in critical scenarios. The interface is implemented as a website for accessibility and testing purposes. Ergonomics was a crucial consideration in the user interface (UI) design, particularly in supporting accurate, real-time actions in high-stakes scenarios (Park et al., 2023). The system architecture was designed to minimize cognitive workload through an easy-to-understand layout, with the most relevant information presented through a clear visual hierarchy. This included optimizing button placement, screen layout, and visual feedback mechanisms to reduce reaction time and physical strain. Buttons are positioned on the left-hand panel in a vertical arrangement, ensuring they remain spatially distinct and easy to locate with minimal eye movement. High-priority actions are placed near the top of the panel to reduce interaction time. At the same time, consistent colour coding, such as red for confirmation, provides immediate visual cues to reinforce intended actions. This structure minimizes decision fatigue and supports intuitive navigation during emergencies.
Furthermore, the multimodal capabilities supporting auditory input and output are directly informed by principles of human-factors ergonomics. These include reducing cognitive load through redundant cueing, supporting situational awareness under time pressure, and minimizing user error by aligning interface behaviours with natural human response patterns. By applying these principles, the system enhances usability and ensures that users can make fast decisions in high-stakes medical scenarios.
Trust is a critical factor in a high-risk domain like emergency remote healthcare, where decisions must be made quickly and often with limited oversight. A lack of appropriate trust, whether over-reliance or under-reliance, can lead to misuse, disuse, or delayed action, directly impacting patient outcomes (Hoff & Bashir, 2015). A quantum-based modelling system utilizing the Qiskit library (https://www.ibm.com/quantum/qiskit) was developed to measure trust during the validation process, quantifying trust from 0% to 100%. In this model, a qubit representing trust starts in superposition. During the interaction, user prompts are analyzed for sentiment, which directly affects the trust qubit's state by rotating it toward |0⟩ or |1⟩ on the Bloch sphere. This approach is well-suited for the MAS validation because it allows real-time monitoring of user confidence and quantification of trust. Unlike post-hoc surveys, which are prone to recall bias, coarse resolution, irrational responses, and social desirability effects, the quantum trust model offers a more objective, fine-grained, and behaviorally grounded measurement of trust evolution over time. In parallel, we also administered a modified Trust Scale for the AI Context (TAI) survey (Scharowski et al., 2024) after each trial to benchmark the quantum-based model, supporting its use as a novel and viable approach for quantifying trust in human-AI interactions. The TAI survey differs slightly from previous publications, as the questions were altered to fit the study context.
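The rotation arithmetic behind the trust qubit can be sketched as follows. This is a minimal illustration in plain Python math rather than Qiskit, so it stays self-contained; the fixed rotation step size and the signed sentiment score are assumptions for demonstration, since the study's actual step size and sentiment analyzer are not specified here.

```python
# Minimal sketch of the quantum trust model (assumed step size; not the
# study's actual Qiskit implementation). The qubit's polar angle theta on
# the Bloch sphere encodes trust: theta = 0 is |0> (no trust),
# theta = pi is |1> (full trust).
import math


class TrustQubit:
    def __init__(self):
        # Equal superposition: trust starts at 50%.
        self.theta = math.pi / 2

    def update(self, sentiment: float, step: float = math.pi / 8):
        # Positive sentiment rotates the state toward |1> (more trust),
        # negative sentiment toward |0> (less trust).
        self.theta += sentiment * step
        # Clamp to the valid range of the polar angle.
        self.theta = min(max(self.theta, 0.0), math.pi)

    def trust(self) -> float:
        # Probability of measuring |1>, reported on a 0-100% scale:
        # P(|1>) = sin^2(theta / 2).
        return 100 * math.sin(self.theta / 2) ** 2


q = TrustQubit()
print(round(q.trust()))  # 50 (superposition -> 50% trust)
q.update(+1.0)           # one positively scored user prompt
print(q.trust() > 50)    # True
```

Each analyzed prompt nudges the angle, so the reported trust percentage evolves continuously over the interaction, which is what enables the real-time traces described above.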
Validation of the MAS (IRB number: REB25-0370) (Figure 3) was conducted through a human-subject experiment comparing three sets of performance outcomes: (1) number of follow-up questions, (2) change in trust from the initial state, and (3) trust survey responses. Ten university students (M: 7, F: 3, Age.M = 18.1, Age.STDEV = 0.7) participated in the validation study, in which they navigated two emergency situations: CPR and EpiPen administration (data from two participants were removed due to equipment issues). There were three inclusion criteria: 1) participants should have limited or no prior CPR or EpiPen training; 2) the overall participant pool must be gender balanced (e.g., three male, five female or 4:4 if eight total); 3) participants must not have impairments that prevent participation (e.g., physical or mental disabilities).

Human-subject study (CPR).
In each scenario, participants performed first aid on a mannequin, guided by information provided by either the MAS or a publicly available GPT-4o mini model. Each participant completed six trials, covering all emergencies with both chatbot mediums. Participants initiated each session with a predefined query and were allowed to ask follow-up questions for further clarification or guidance. After each interaction, the number of follow-up questions was recorded, and trust was measured using the quantum model. Additionally, trust levels were assessed after each trial via survey scales. We hypothesized that the MAS would elicit significantly fewer follow-up questions and higher trust than the baseline condition (GPT-4o mini model).
Results
Figure 4 describes the four analyses. For the number of queries, GPT and MAS generated markedly different interaction patterns. A linear mixed-effects analysis with participant as a random intercept showed that MAS reduced the number of clarifying queries by 2.88 ± 0.54 prompts (

Graphical results.
For the Trust in Automation (TPA) questionnaire, reverse-coded items were inverted before aggregation (Cronbach’s α = .84). A two-way repeated-measures ANOVA revealed a main effect of AI support,
Real-time predicted-trust scores (0%–100%) (quantum model) displayed a parallel pattern. A mixed-effects model of the seven-step traces yielded a significant AI main effect (β = +.078 ± .035,
Subjective (trust survey) and predicted trust (quantum model) were strongly associated. A repeated-measures correlation controlling for participant yielded
Discussion
One of the major findings of this study is that the MAS consistently outperformed GPT on both the reduction in number of queries and the trust metrics. Approximately three fewer clarifying prompts per trial represent a 70% reduction in interaction burden, an advantage that is practically important in time-critical care, where additional dialogue incurs cognitive load and delays (Wickens & Hollands, 2000). Convergent evidence from both self-report and algorithmic confidence signals indicates that users judged the MAS to be more reliable and less error-prone. This finding aligns with literature showing that structured, context-limited automation promotes appropriate reliance, whereas open-ended dialogue can engender uncertainty and over- or under-trust (Lee & See, 2004; Hoff & Bashir, 2015).
The trajectory analyses add a temporal dimension to these conclusions. GPT began at parity but lost trust rapidly with each additional query, suggesting that every unanswered follow-up amplifies doubts about competence. The MAS, by resolving the task in one or two exchanges, truncated the opportunity for such erosion. The presence of a significant AI × Task interaction on predicted trust but not on survey trust implies that real-time metrics are sensitive to transient task factors, information that post-hoc questionnaires smooth over. Integrating live trust estimates could therefore allow adaptive interfaces that escalate human oversight when confidence collapses and throttle guidance when confidence stabilizes (Parasuraman & Riley, 1997).
The strong correlation between predicted trust (quantum model) and subjective trust (survey) supports the validity of algorithmic proxies for human sentiment, echoing recent reports that physiological or behavioral models can anticipate trust calibration in human–AI teaming (Caldwell et al., 2022; Wang et al., 2024). Nonetheless, the mixed model revealed a residual main effect of AI after controlling for predicted trust, indicating that design elements not captured by the probability metric, such as formatting, terminology, or perceived transparency, also shape trust.
Future work should aim to replicate these findings with larger and more diverse participant samples, as well as across a broader range of clinical procedures. This would improve the generalizability of the advantages observed for the MAS. Additionally, future studies should explore whether the performance benefits of MAS persist when baseline LLMs such as ChatGPT are enhanced with domain-specific fine-tuning or constrained by safety guardrails. Beyond emergency-based remote healthcare, MAS frameworks may offer efficiency and trust benefits in other healthcare contexts, including chronic disease management, diagnostics, and mental health support. Further development of the quantum trust model is also warranted; incorporating additional design factors may improve its predictive accuracy.
To summarize, our MAS not only reduces interaction demands but also sustains calibrated trust, and its own confidence signal serves as a useful proxy for post-task sentiment. These properties recommend MAS-style interfaces for emergency medical applications where speed, clarity, and appropriate reliance are paramount.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this research was provided by 2025 Transdisciplinary Connector Grant (University of Calgary). The views and opinions expressed are those of the authors.
