Abstract
Keywords
Introduction
Chest X-ray (CXR) imaging is a commonly used, low-cost imaging modality for detecting a variety of lung abnormalities. However, because lung abnormalities are highly heterogeneous, reading and correctly interpreting CXR images is often a difficult and time-consuming task for radiologists. Misinterpretation of CXR images can negatively affect patient care and may lead to serious adverse treatment outcomes. It has been reported that 33% of diagnostic errors with relevant imaging occurred due to misinterpretation and that 22% of all errors in diagnostic radiology were made in CXR images [1, 2]. One way to reduce errors and increase diagnostic sensitivity in radiology is double reading between peers [3, 4], which is conducted in different ways, such as two readers with the same degree of sub-specialization reading the same image at different times, or a preliminary report by a junior reader followed by a subsequent reading by a subspecialist with a higher level of sub-specialization [5]. However, both approaches are hampered by the scarcity of radiologists, especially in rural areas where the shortage of radiologists is a major concern. Moreover, the added cost is another important factor that hinders the double reading process.
To overcome the challenges of double reading, the development of artificial intelligence (AI) tools has attracted broad research interest recently, and many studies have reported encouraging technical results in medical image analysis, such as detecting tuberculosis on CXR images [6, 7], classifying benign and malignant lung nodules on CT images [8], and detecting skin cancer from skin photographs [9]. Some laboratory studies have also reported that AI can perform comparably to humans in detecting diabetic retinopathy and malignant melanoma [9, 10]. A commercialized computer-aided detection (CAD) system has already been approved for use as a second reader in the United States to help detect lung nodules that may be missed by radiologists [11].
Most current research on CAD or AI for medical images focuses on detecting or diagnosing one specific abnormality or disease, such as tuberculosis [12], pneumonia [13], or pulmonary nodules [14]. However, the detection of a single abnormality may not reflect the complexity of real-world cases. Although some recent studies have addressed the simultaneous classification of multiple abnormalities, performance evaluation of these AI models is still limited to the region of interest (ROI) level, with only one positive ROI per CXR case [6, 15]. In addition, radiologists still need to confirm the AI results and write a diagnostic report, which is time-consuming.
In this research effort, we have developed a unique AI system named MOM-ClaSeg (Multi-task, Optimal-recommendation, and Max-predictive Classification and Segmentation), which aims to automatically detect multiple lung abnormalities and generate diagnostic reports [16]. A radiologist only needs to review an AI-generated diagnostic report and then either approve it or make minor modifications (if needed). Our hypothesis is that, if successful, using the MOM-ClaSeg AI system will help increase not only the accuracy (i.e., sensitivity) of detecting lung abnormalities, but also the efficiency of abnormality diagnosis (by reducing image reading and report writing time). To test our hypothesis, we conduct a multi-reader and multi-case (MRMC) type observer performance study using a large and diverse CXR image dataset that includes images acquired from 12 hospitals. This MRMC study analyzes and compares diagnostic accuracy and efficiency between double reading (the second radiologist reviewing the initial report from the first radiologist to generate the final report) and single reading with MOM-ClaSeg (a radiologist reviewing the initial report from MOM-ClaSeg to generate the final report) in reading and interpreting CXR images depicting different or multiple lung abnormalities. Specifically, accuracy, sensitivity, specificity, and review time for double and single reading are assessed to evaluate radiologists' performance. To the best of our knowledge, this is the first study to explore whether single reading with AI can achieve performance comparable to double reading.
Further, this is also the first study to evaluate radiologists' diagnostic performance in detecting multiple different pulmonary abnormalities at the case report conclusion level (namely, the combination of multiple different abnormalities per image) instead of the ROI level (namely, a single abnormality per image). This may benefit work routines whether or not double reading is obligatory, as it allows improved diagnostic efficiency with little extra cost.
Methods
Review of MOM-ClaSeg system
As reported in our previous paper [16], the MOM-ClaSeg AI system was originally developed by applying augmented Mask-R-CNN based Generative Pre-trained Text content generation networks to a large dataset of 310,333 confirmed adult CXR images collected from multiple hospitals. This image dataset contains 243,262 abnormal images depicting 65 different abnormalities and 67,071 normal images. Unlike traditional CAD models that detect only a single type of abnormality, MOM-ClaSeg is trained to detect and segment multiple abnormalities of different classes visible on CXR images and then generate diagnostic reports of radiological impression for all detected abnormalities.
In brief, Fig. 1 illustrates the graphical user interface (GUI) of MOM-ClaSeg. As shown in Fig. 1, an image report panel (located in the middle) delineates the boundary contours of the detected and segmented abnormal ROIs. A text report panel provides classification recommendations with three levels of confidence (blue for low, orange for middle, and red for high), as well as an automatically generated report that includes the image impression description and the diagnostic conclusion. The image impression description section summarizes, in a short paragraph, the radiologic manifestations observed on the image. It serves as a guide for the radiologist’s interpretation and helps to identify potential abnormalities that require further evaluation. The diagnostic conclusion section is the final summary of the CXR interpretation and includes the class and location information of each detected abnormality (or ROI). It typically includes a statement on the presence or absence of any abnormalities. The conclusion covers the multiple different abnormalities and their locations on the CXR image, as well as disease progression and recommendations for potential intervention such as follow-up or treatment.

Visual representation of the graphical user interface (GUI) of the MOM-ClaSeg system.
This study was conducted in accordance with the Declaration of Helsinki (as revised in 2013), and this retrospective study was approved by the Shenzhen Center for Chronic Disease Control Institutional Review Board ([2019]SZCCC-2019-014-01) with a waiver of informed consent. The MOM-ClaSeg AI system was installed in a central hospital to automatically screen and diagnose multiple abnormalities on CXR images received from different general hospitals in rural areas via a secured internet connection, and patient identification was removed before image review and diagnosis. From May 22 to July 22, 2022, a total of 28,526 CXR images were retrospectively collected and used as the experiment group (single reading based on AI-generated reports), and from July 22 to September 22, 2022, a total of 7,845 CXR images were collected as the control group (double reading involving two radiologists). The experiment group includes 5,756 abnormal images and 22,770 normal images, while the control group includes 936 abnormal images and 6,909 normal images. All cases were representative of outpatients referred by primary care physicians, self-referral, a TB screening program, etc. The data distribution across the 12 hospitals is provided in Table 1. In summary, this study involves reading a total of 36,374 posterior anterior/anterior posterior CXR images.
Distribution of CXR images from 12 different hospitals
In preparing the proposed MRMC study, 3 expert panel radiologists (>25 years of experience in imaging) served as the gold standard: they read all 36,374 CXR images and classified them as normal or abnormal cases, and further determined the class and location of abnormalities based on the pathology/diagnostic reports. Under a consensus principle, the 3 expert radiologists reviewed a case together to reach consensus whenever inter-reader variability was detected.
Descriptions of the different types of radiologists, reports, and readings are summarized in Table 2. Specifically, a panel of 8 radiologists participated in this blinded MRMC observer performance study. Among them, 6 are first radiologists with an average CXR image reading experience of 5∼10 years, while 2 are second radiologists with an average CXR image reading experience of > 15 years. The MRMC study includes two image reading and diagnosis modes or groups, namely, the control mode or group G1 and the experimental mode or group G2. Figure 2 illustrates the study design for the G1 and G2 modes. As shown in Fig. 2, each CXR image is first read by either one radiologist (G1) or MOM-ClaSeg (G2), and then reviewed by a G1 or G2 radiologist to correct and approve the final diagnostic report. Hence, mode 1 (G1) involves two radiologists (one first and one second radiologist), while mode 2 (G2) involves the MOM-ClaSeg AI system and one radiologist. In the study, the second readers in both the control group and the experiment group were blind to identifying information about the first reader (either one radiologist or MOM-ClaSeg), and they were also unaware of which group they belonged to. The time radiologists spent on image reading and report writing/generating for the diagnosis of each case was also recorded in both reading modes to compare efficiency.
Summarized description of different types of radiologists, reports, and readings

Study design for the retrospective double blinded study. (a) CXR image interpretation flowchart: In the control group (G1), the first radiologist reads the CXR images, and the second radiologist reviews the initial report to make the final decision. In the experiment group (G2), MOM-ClaSeg is used as a first reader to automatically analyze CXR images and generate diagnostic reports, then a radiologist reviews and possibly modifies AI-generated diagnostic reports. (b) Study protocol and measurements: Expert panel sets up gold standard to evaluate the accuracy of final double-reading diagnostic reports in G1 and final single-reading diagnostic reports in G2.
A radiological report typically includes two sections. The first section includes the description (or impression) of image manifestation and radiological conclusion of specific findings or abnormalities. The second conclusion section typically includes multiple different abnormalities, their locations on the CXR image, as well as disease progression, recommendation for potential intervention for follow-up, treatment, etc. Since the types of abnormalities and locations are the crucial diagnostic information from a radiological report, this study evaluates (1) the detection performance based on class of abnormalities and their corresponding locations on CXR images, and (2) overall performance of radiologists based on the conclusion made in the final diagnostic reports in two reading modes or groups (G1 vs. G2).
In this MRMC observer performance study, either a radiologist (in G1 mode) or MOM-ClaSeg AI system (in G2 mode) can mark/segment ROIs of different types of lung abnormalities on each CXR image, and then they could appear in report conclusion. In theory, a single conclusion on one CXR image can potentially consist of 0 to 65 different abnormalities and one conclusion can have the same abnormality appearing in multiple locations. For example, one case conclusion on CXR image by MOM-ClaSeg or radiologist may include 2 nodules at different locations and 1 pneumonia ROI. Another case conclusion consisting of 3 nodules at different locations and 1 pneumonia ROI will be treated as a different type of conclusion. Instead of evaluating the performance on detecting each abnormality (ROI level), we evaluate the performance at report conclusion level which means (1) class of abnormality, (2) number of abnormalities, and (3) locations of each abnormality need to be considered. By comparing with the “gold standard” defined by a panel of 3 expert radiologists, the marked/segmented ROIs are categorized into one of the following classes namely, true-positive (TP), true-negative (TN), or difference (NOT TP or NOT TN) ROIs or cases. Figure 3 illustrates how to determine these classes.

Illustration of defined symbols. (a) and (b) definitions of TP and TN for performance evaluation in ROI and report conclusion-level of this study, respectively. (c) definition of difference (NOT true positive and NOT true negative). For the TP detection, number of total lesions as well as class and location of each lesion need to be identical between ground truth and predicted lesions.
(1) Symbols of ground truth for different abnormalities and AI predicted ROI.
As shown in Fig. 3a, the circle represents ground truth ROI, and the triangle represents AI-predicted ROI. The different color outline represents different classes of abnormalities. An abnormal image with 2 classes of abnormalities is used as an example of positive case.
(2) Definition of TP, TN and their difference (NOT TP or NOT TN) detection on report conclusion level.
Since each image can contain multiple lesions, the report conclusion is considered correct only if all lesions and their corresponding locations (ROIs) are correctly detected and reported in the conclusion section of the diagnostic report. For images that do not contain lesions, the report conclusion is considered correct (TN) only when no lesion is reported in the final diagnostic report.
Thus, following the symbol explanation in Fig. 3, we define TP, TN and difference (NOT TP or NOT TN) to evaluate the performance of the MOM-ClaSeg AI system and radiologists at the report conclusion level. Specifically, a TP is defined as a positive case where all lesions, including the total number, class(es) and location(s), are correctly detected and identified on the image. A TN is defined as a negative case where the final diagnostic report correctly predicts the absence of any lesions on the image (no finding). These definitions provide a standardized and objective approach for evaluating diagnostic performance in the two reading modes of this MRMC observer performance study. Note that false positive (FP) and false negative (FN) are not defined or used in this study because, when multiple lesions exist on a single image, a missed detection at the ROI level can be counted simultaneously as an FP and an FN ROI, which would confuse the evaluation of report conclusion level performance, as shown in Fig. 3c.
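The report conclusion level categories defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used in the study: the encoding of a finding as a `(class, location)` pair with named lung zones, and the function names `classify_case` and `summarize`, are hypothetical. The sensitivity and specificity formulas (TP over all abnormal cases, TN over all normal cases) are also our assumption about how the indices follow from the TP/TN/difference categories.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: one finding = (abnormality class, location label).
Finding = Tuple[str, str]
Case = Tuple[Sequence[Finding], Sequence[Finding]]  # (gold standard, final report)


def classify_case(truth: Sequence[Finding], reported: Sequence[Finding]) -> str:
    """Return 'TP', 'TN' or 'DIFF' for one case at report conclusion level."""
    truth_bag = Counter(truth)        # multiset: number, class and location all count
    reported_bag = Counter(reported)
    if not truth_bag:                 # normal case according to the gold standard
        return "TN" if not reported_bag else "DIFF"
    # abnormal case: every finding must match exactly, with nothing extra
    return "TP" if truth_bag == reported_bag else "DIFF"


def summarize(cases: List[Case]) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity (assumed definitions)."""
    labels = [classify_case(truth, reported) for truth, reported in cases]
    n_pos = sum(1 for truth, _ in cases if len(truth) > 0)
    n_neg = len(cases) - n_pos
    tp, tn = labels.count("TP"), labels.count("TN")
    return {
        "accuracy": (tp + tn) / len(cases),
        "sensitivity": tp / n_pos if n_pos else float("nan"),
        "specificity": tn / n_neg if n_neg else float("nan"),
    }


if __name__ == "__main__":
    truth = [("nodule", "right upper"), ("nodule", "left lower"),
             ("pneumonia", "right lower")]
    extra = truth + [("nodule", "left upper")]   # one additional reported nodule
    toy_cases = [
        (truth, list(truth)),                    # all findings match
        (truth, extra),                          # 3 nodules instead of 2
        ([], []),                                # normal, nothing reported
        ([], [("nodule", "left upper")]),        # normal, but a lesion reported
    ]
    print([classify_case(t, r) for t, r in toy_cases])
    print(summarize(toy_cases))
```

Using a multiset comparison means that a report with one extra or one missing lesion is counted once as a difference case, rather than being split into separate FP and FN counts at the ROI level.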
Although the area under the receiver operating characteristic curve (AUC-ROC) is a common index for evaluating the performance of radiologists in MRMC observer performance studies, this study differs in that it does not require radiologists to rate probability scores for the detected lesions or abnormalities. Also, traditional AUC-ROC analysis only considers a single abnormality on an image. In this study, readers only make a binary decision of either detection or no detection of the lesions. Therefore, based on the above categorization process and the binary decision, we compute the following three evaluation indices, namely, the detection accuracy (all detected TP and TN cases divided by all reading cases in each mode, G1 or G2), as well as the detection sensitivity and specificity, to perform the evaluation and comparison. In addition, the average time spent on image reading and on report writing, generation and approval for each case is computed and compared between the two reading modes. The statistical data analysis is performed using Python 3.8 and SPSS 20 software tools. In the statistical tests,
Results
Clinical characteristics
The main characteristics of the patients, including age and gender, are shown in Table 3. The control group included 7,845 patients, of whom 45.1% were male and 54.9% were female, with an average age of 38.97±14.98 years, while the experimental group included 28,526 patients, of whom 53.3% were male and 46.7% were female, with an average age of 36.88±12.86 years. The difference is statistically significant in the patient age (
Patient demographics and top 25 report conclusions of the control and experiment groups
This section analyzes the detection performance for each abnormality at the ROI level and at the report conclusion level, where (1) the class of abnormality, (2) the number of abnormalities, and (3) the location of each abnormality need to be considered. Because MOM-ClaSeg was evaluated and compared at the report conclusion level, where a report conclusion on a single CXR image may contain multiple abnormalities as stated earlier, a total of 119 and 420 different classes of report conclusions were eventually generated from the control group and the experiment group, respectively. Different classes of report conclusion are formed by combining multiple different abnormalities. Each class of conclusion is a combination of several abnormalities from a total of 65 abnormalities. For example, one class of conclusion can be tuberculosis, nodule, and pneumonia, while another class of conclusion can be just tuberculosis and pneumonia. In the analysis, the top 25 report conclusions were selected from the 119 and 420 classes of conclusions of the two groups, respectively. Among those 50 classes of conclusion from the two groups, there are 12 common classes of conclusions (overlapped gray and black bars), which results in a total of 38 different classes of report conclusions (38 = 25*2–12) listed in Fig. 4. For both the control and experiment groups, pneumonia, secondary pulmonary tuberculosis and nodule ranked as the top 3 report conclusions (Fig. 4). Each of these three conclusion classes happens to contain only one abnormality. The remaining report conclusions were different between the two groups. The distribution of all 38 report conclusions and the 12 common report conclusions were found to be significantly different between the two groups (all

Distribution of top 25 classes of abnormalities on report conclusion-level for both the control and experiment groups. The two groups had 12 common conclusions (overlapped gray and black bars), resulting in a total of 38 classes of abnormalities (38 = 2*25–12). Among those 38 classes, class 34 and 36 consist of three different abnormalities (secondary pulmonary tuberculosis, pneumonia, and pleural effusion for class 34 and secondary pulmonary tuberculosis, pneumonia, and pleural thickening for class 36) and the rest of the 36 classes consist of either 1 or 2 abnormalities.
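The count of 38 merged classes follows from inclusion-exclusion: the union of two top-25 lists that share 12 entries has 25 + 25 - 12 = 38 members. The sketch below illustrates this under stated assumptions: a conclusion class is represented as a `frozenset` of abnormality names (a hypothetical encoding that ignores lesion counts and locations), the helper names `top_k_classes` and `merged_class_count` are ours, and the toy data are invented for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Set, FrozenSet


def top_k_classes(conclusions: Iterable[List[str]], k: int) -> Set[FrozenSet[str]]:
    """Return the k most frequent conclusion classes in one group."""
    counts = Counter(frozenset(c) for c in conclusions)
    return {cls for cls, _ in counts.most_common(k)}


def merged_class_count(top_a: Set, top_b: Set) -> int:
    """Size of the union of two top-k lists: |A| + |B| - |A and B in common|."""
    common = top_a & top_b
    return len(top_a) + len(top_b) - len(common)


if __name__ == "__main__":
    # Invented toy reports, NOT study data.
    control = [["pneumonia"]] * 3 + [["nodule"]] * 2 + [["pneumonia", "pleural effusion"]]
    experiment = [["pneumonia"]] * 4 + [["tuberculosis"]] * 2 + [["nodule"]]
    top_control = top_k_classes(control, k=2)
    top_experiment = top_k_classes(experiment, k=2)
    print(merged_class_count(top_control, top_experiment))

    # Same arithmetic with the sizes reported in the text: 25, 25 and 12 shared.
    a, b = set(range(25)), set(range(13, 38))
    print(len(a & b), merged_class_count(a, b))
```

The same calculation applies to the ROI-level list, where 16 shared abnormalities give 2*25 - 16 = 34 classes.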
Although MOM-ClaSeg is capable of detecting 65 different abnormalities, a total of 48 and 61 classes of abnormalities were eventually generated from the control group and the experiment group, respectively. As at the report conclusion level, the top 25 classes of abnormalities at the ROI level were calculated and ranked for both the control and experiment groups. Fig. 5 displays the distribution of the top 25 classes for both groups. Among these 50 categories of abnormalities, there are 16 common abnormalities (overlapped gray and black bars) between the two groups, leading to a cumulative number of 34 classes of abnormalities (34 = 2*25–16). At the ROI level, pneumonia and secondary pulmonary tuberculosis are the top 2 abnormalities in both groups, consistent with the report conclusion level list. However, pleural thickening, rather than nodule, ranked third on the ROI level list.

Distribution of the top 25 classes of abnormalities on ROI-level for both the control and experiment groups. The two groups had 16 common abnormalities (overlapped gray and black bars), resulting in a total of 34 classes of abnormalities (34 = 2*25–16).
Differences at the report conclusion level between the expert panel (as ground truth) and the final diagnostic reports from the study radiologists were evaluated. Compared with double reading in the control group, the conclusion difference between the gold standard and the detection results in the experiment group (single reading with AI assistance) decreased by 43.82% (from 3.40% to 1.91%), 84.69% (from 12.93% to 1.98%) and 9.95% (from 2.11% to 1.90%) for the total cases, abnormal positive cases, and normal negative cases, respectively (Fig. 6). These percentages are relative decreases with respect to the control group. The results suggest that AI was more helpful in interpreting abnormal cases than normal findings for posterior anterior/anterior posterior CXR images in outpatient settings.
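The reported reductions can be reproduced from the difference rates quoted above, reading each reduction as (control rate - experiment rate) / control rate. The short Python check below is a sketch for verification only; the function name `relative_decrease` and the dictionary layout are ours, and the input values are the percentages stated in the text.

```python
def relative_decrease(control_pct: float, experiment_pct: float) -> float:
    """Relative decrease (in percent) of the experiment rate versus the control rate."""
    return (control_pct - experiment_pct) / control_pct * 100.0


# Conclusion difference rates (%) quoted in the text: (control G1, experiment G2).
difference_rates = {
    "all cases": (3.40, 1.91),
    "abnormal cases": (12.93, 1.98),
    "normal cases": (2.11, 1.90),
}

if __name__ == "__main__":
    for name, (g1, g2) in difference_rates.items():
        print(f"{name}: {relative_decrease(g1, g2):.2f}% relative decrease")
```

Rounded to two decimals, the three values agree with the 43.82%, 84.69% and 9.95% figures given in the text.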

The image review comparison testing flowchart and results for all cases, abnormal positive cases, and normal negative cases.
Table 4 demonstrates that the use of MOM-ClaSeg in single reading significantly improves the accuracy and sensitivity when compared to double reading. The accuracy improved by 1.49% (from 96.60% to 98.09%,
Performance comparison of the control and experiment groups
Review time comparison of the control and experiment groups
It is worth noting that the MOM-ClaSeg is particularly helpful in identifying abnormalities, both in terms of accuracy and efficiency. The degree of decrease in conclusion difference for abnormal cases is more than 8 times (84.69% /9.95% =8.51) that of normal cases. We present representative cases of accuracy and difference results detected by MOM-ClaSeg in Fig. 7. In terms of the efficiency of diagnosis time, the improvement for abnormal cases is also nearly twice that of normal cases. These findings suggest that MOM-ClaSeg can significantly improve the detection accuracy and reduce the review time in identifying abnormalities, making it a potentially useful tool for assisting radiologists in future clinical settings.

Examples of two true positive cases and two difference cases for the experiment group. Yellow outlines were annotated by the MOM-ClaSeg AI system.
Reading and interpreting CXR images depicting possibly multiple lung abnormalities or diseases and then writing diagnostic reports is a challenging and time-consuming task for radiologists, which requires a high level of expertise. Therefore, the development of an effective and robust AI system that can assist radiologists in detecting multiple abnormalities and generating diagnostic reports for CXRs could significantly improve the efficiency and accuracy of radiology reporting. In this paper, we report a unique and successful MRMC observer performance study that demonstrates the feasibility and advantages of using our recently developed MOM-ClaSeg AI system to assist radiologists in reading CXR images and generating final diagnostic reports.
In the medical imaging or informatics field, the application of AI has been increasingly studied aiming to assist radiologists in their diagnostic work. However, most of the current research in this field has focused on specific abnormalities or diseases, such as pulmonary nodules, tuberculosis, and pneumonia, separately, which limits the generalizability of these approaches [6, 17], as they may not reflect the complexity and diversity of real-world cases. Although a small number of studies and existing commercial solutions addressed the simultaneous classification of multiple CXR abnormalities recently, the involved classes of abnormalities were often less than ten [18, 19]. Moreover, the performance evaluation of AI model still focuses on ROI level where each CXR case only contains one positive ROI. However, in real-world practice, a CXR image without identification of the targeted ROIs cannot be interpreted as a normal case. Radiologists need to interpret CXR images to identify as many types of abnormalities as possible [19, 20], and relying on existing AI to assist in detecting only specific types of abnormalities independently is insufficient. Additionally, radiologists still need to confirm the AI results and write a diagnostic report, which is a time-consuming task. This highlights the need for an AI system that can assist radiologists in detecting multiple abnormalities simultaneously in each CXR image and generating diagnostic reports, thus improving their efficiency and accuracy.
The MOM-ClaSeg AI system used in this study has several major advantages that can make new contributions to this research field. First, the MOM-ClaSeg AI system detects multiple abnormalities in CXR images, which is a significant departure from previous studies that focused on individual abnormalities or diseases. Second, the MOM-ClaSeg AI system can automatically generate diagnostic reports. Writing diagnostic reports is a tedious and time-consuming task for radiologists. MOM-ClaSeg’s automatic report generation feature reduces the workload of radiologists and helps produce accurate reports. This is a significant step forward, as it can help in the early detection and diagnosis of pulmonary abnormalities or diseases, ultimately leading to better patient outcomes. Third, MOM-ClaSeg has an easy-to-use, interactive GUI, which shows both ROI-level and case- or image-level detection and diagnostic results. Thus, the MOM-ClaSeg AI system provides radiologists with much more explainable information about the detection or diagnostic results than many previous AI systems that use “black-box” approaches. As a result, by utilizing this new AI system, radiologists can save time and effort on routine tasks, allowing them to focus on more complex cases that require their expertise. MOM-ClaSeg can also enhance diagnostic sensitivity by detecting abnormalities that may have been missed by traditional interpretation methods.
Our MRMC observer performance study also has several unique characteristics aiming to more effectively test and demonstrate the potential clinical utility of the MOM-ClaSeg AI system, which focuses on detecting multiple abnormalities and the evaluation at the report conclusion level. The evaluation methodology commonly used for assessing AI algorithms involves popular AI metrics like AUC-ROC or F-scores for label-based precision and recall evaluations. However, when it comes to actual clinical workflow, minimizing the number of overall errors or misses on a per-image basis is more important. Multiple abnormalities may co-exist in one final diagnostic report, and the number of types of abnormalities present in a single CXR image may be quite large. This suggests that the evaluation methods used for AI algorithms should better be evaluated for clinical purposes based on image-level sensitivity and specificity instead of ROI level evaluations. To address this issue, we redefine the terms true positive, true negative, and difference (NOT true positive or NOT true negative) in our study with both the class and location of each AI-predicted ROI involved. This is in contrast to traditional studies that focus on a specific type of abnormality or disease and rely on single-level evaluation methods. By taking the more comprehensive approach, we believe that this MRMC observer performance study can more effectively mimic future clinical application of the MOM-ClaSeg AI system to significantly improve the accuracy and efficiency of radiology diagnosis. In addition, the distribution of abnormalities is displayed on both report conclusion level and ROI level, which allows radiologists to not only assess the probability of occurrence of each abnormality but also examine the correlation among different types of pulmonary abnormalities present in a single CXR image. 
This comprehensive evaluation is an advantage of our approach in this MRMC study, as it provides a more detailed understanding of the relationships among abnormalities and can aid radiologists in making an accurate diagnosis.
Overall, this is a unique MRMC observer performance study with a large and diverse CXR image dataset. To the best of our knowledge, no similar studies have been conducted and reported in the literature to date. Data analysis results of this study are encouraging and demonstrate that the MOM-ClaSeg AI system has the potential to add value to many clinical applications. For example, in rural or resource-limited areas, access to specialized medical professionals and advanced technology may be limited, leading to delayed diagnoses and treatments. The results of this MRMC study indicate that single reading with AI outperforms double reading in detecting heterogeneous and multiple lung abnormalities, which may alleviate the scarcity of radiologists in such areas and improve the quality and accessibility of CXR image diagnosis. Therefore, MOM-ClaSeg’s ability to assist radiologists in automatically detecting multiple abnormalities and generating diagnostic reports on CXR images may help overcome many existing clinical barriers, potentially leading to faster and more accurate diagnoses and improved patient outcomes.
Despite our encouraging results, we also recognize several limitations in this study. First, it is an unpaired and unbalanced study, with two consecutive months of control group data (double reading) and another two consecutive months of experimental group data (single reading with MOM-ClaSeg) used for retrospective comparison tests. We, however, still consider this a major distinction of our study, since the results have demonstrated the feasibility of MOM-ClaSeg’s assistance in detecting multiple abnormalities on CXR images with statistically significant improvement in efficacy. Second, our performance evaluation methods differ from many previous MRMC observer performance studies, with only accuracy, sensitivity, and specificity used, since we defined only TP, TN, and difference. Because our approach focuses on multiple abnormalities, false positive and false negative ROIs may co-exist in one CXR image, so we were unable to provide F1 score and AUC-ROC measurements. Third, the AI reading time, including loading, operation, analysis and report generation time, was not presented separately, nor was the justification time of the AI tool. As the objective of the research is to assess the role of AI as an aid tool in the double reading process, the reading times of the AI (first reader) and the radiologist (second reader) were calculated together and compared with the time spent in the traditional double reading routine. In future work, we plan to conduct a paired study and calculate the time the AI spends at each step to further evaluate the clinical potential of this new AI system, investigate radiologists with different levels of experience using MOM-ClaSeg, and include lateral CXR images in the analysis.
Conclusion
This research paper presents a unique MRMC observer performance study using the AI-based system MOM-ClaSeg, which can detect multiple abnormalities simultaneously and generate diagnostic reports automatically, to assist radiologists in CXR image interpretation and diagnostic report generation. The study demonstrates the feasibility of using this new AI system as the first reader to help improve diagnostic accuracy and efficiency of a single radiologist. This is an important and promising step toward applying AI systems to help improve diagnostic performance and productivity of radiologists in future clinical practice. Particularly, the MOM-ClaSeg AI system could be a valuable tool in rural or underdeveloped areas where there is a shortage of radiologists. To fully achieve this goal, additional clinical evaluation studies are needed using new image databases and involving other radiologists from different medical institutions in the future.
