Abstract
Introduction
Strabismus is a common pediatric ophthalmic disease 1 that presents as ocular deviation, 2 impairing visual function and affecting the physical and mental health of children. Yoon et al. 3 showed a moderate association between each strabismus type (esotropia, exotropia, and hypertropia) and anxiety disorder, schizophrenia, bipolar disorder, and depressive disorder. In clinical practice, we found that most children with strabismus came to the clinic only after their parents noticed the deviation during a medical check-up or in daily life. In some cases, late detection had already affected the development of stereoscopic vision. Studies have shown that visual development in children occurs from birth through 7 to 8 years of age, and eye disease during this period can lead to irreversible consequences. 4 Therefore, early detection and treatment of strabismus are essential. Currently, the commonly used clinical examination for strabismus is the prism cover test: the type of strabismus is determined by the alternating covering test (ACT) and the cover-uncover test (CUT), and the angle of strabismus is determined by the alternating prism covering test (APCT). Given the young age and poor cooperation of children with strabismus, it is difficult for nonspecialists to diagnose strabismus quickly and accurately through APCT. Moreover, the small number of specialists in strabismus and pediatric ophthalmology cannot meet the demand for large-scale screening of the large pediatric population in China, and there is currently no clinical medical equipment that can replace manual strabismus diagnosis.
Artificial intelligence (AI) was first proposed by McCarthy in 1956, referring to technology used to mimic human behavior. Building on this, Arthur Samuel introduced the concept of machine learning (ML) in 1959, emphasizing systems that learn automatically from experience rather than being explicitly programmed. Deep learning (DL), a subfield of ML, uses neural networks with multiple processing layers to learn latent features in data, loosely analogous to the human brain. 5 DL has been applied in ophthalmology for image recognition (e.g. to classify diabetic retinopathy and retinopathy of prematurity 6 ) and to improve the accuracy and efficiency of examinations 7 (e.g. visual field examinations for glaucoma 8 ). Studies have shown that AI technology can quickly build models, process data, and even simulate the manual strabismus cover test, 9 which makes it possible to develop strabismus screening equipment. On this basis, we combined AI with virtual reality (VR) to build a virtual scenario and designed a new technology for strabismus screening. We collected data from 110 children aged 3 to 18 years in the clinic to evaluate the accuracy of this system for clinical pediatric strabismus screening.
Methods
Participants
The study protocols adhered to the Declaration of Helsinki and were approved by the Ethics Committee of the Beijing Tongren Hospital, Capital Medical University (No. TRECKY2020-088). This study is a clinical diagnostic trial with a gold standard.
The ratio of strabismus patients to nonstrabismus patients at the first visit to the Strabismus and Pediatric Ophthalmology Department was estimated to be 1:2 based on previous outpatient visit numbers, giving a prevalence of 1/3. With a pre-estimated sensitivity (SN) of 0.9, a permissible error (L) of 0.10, and a test level of α = 0.05, the formula gave a total sample size N1 of 105 cases. With a pre-estimated specificity (SP) of 0.85, a permissible error (L) of 0.10, and a test level of α = 0.05, the formula gave a total sample size N2 of 74 cases. Taking the larger of the two values, at least 105 cases needed to be included in this study. Allowing for incomplete data caused by children’s noncooperation during the examination, and assuming a 90% completeness rate, a total sample size of N = 105/0.9 ≈ 117 cases was required.
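The text does not state which formula was used. As a hedged sketch, the reported figures can be reproduced with a formula commonly used for diagnostic accuracy studies (often attributed to Buderer), assuming Z = 1.96 for α = 0.05 and assuming that the number of diseased (or nondiseased) subjects is rounded up before being scaled by prevalence. The function names are illustrative and not taken from the study.

```python
import math
from fractions import Fraction

Z_ALPHA = 1.96  # assumed two-sided normal quantile for alpha = 0.05


def subjects_for_proportion(p: float, margin: float, z: float = Z_ALPHA) -> int:
    """Subjects needed to estimate proportion p within +/- margin, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def total_sample_size(p: float, margin: float, share: Fraction) -> int:
    """Scale the subgroup size up to the whole cohort.

    `share` is the fraction of the cohort in the relevant subgroup:
    prevalence for sensitivity, 1 - prevalence for specificity.
    """
    return math.ceil(subjects_for_proportion(p, margin) / share)


prevalence = Fraction(1, 3)  # strabismus : nonstrabismus = 1 : 2

n1 = total_sample_size(0.90, 0.10, prevalence)      # driven by sensitivity
n2 = total_sample_size(0.85, 0.10, 1 - prevalence)  # driven by specificity
required = max(n1, n2)
enrolled = math.ceil(required / Fraction(9, 10))     # 90% completeness rate

print(n1, n2, required, enrolled)  # prints: 105 74 105 117
```

Exact fractions are used for the prevalence and completeness rate so that the final rounding is not disturbed by floating-point error.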
Because the rate of complete outpatient data was low, 131 subjects were ultimately enrolled consecutively between February 2023 and February 2024 at the outpatient clinic of strabismus and pediatric ophthalmology of Beijing Tongren Hospital, China. These subjects were children aged 3 to 18 years who were attending an ophthalmology clinic for the first time and were able to fixate with both eyes. All subjects had normal intellectual development and no mental illness and were able to cooperate with the examination; their guardians agreed to participate in the project and signed an informed consent form.
System description
The system consists of a main computer, a screen, a head-mounted VR device, and purpose-built AI analysis software. The head-mounted VR device (Figure 1) was the “Qiyou 2S” from Aqiyi corporation, with a resolution of 3840 × 2160. It includes an infrared camera that captures eye movement data in real time and displays them on the computer screen. The AI analysis software uses these data to diagnose strabismus and calculate the angle of ocular deviation.

VR device (Aqiyi Qiyou 2S). VR: virtual reality.
The AI software includes the alternating covering test, the cover-uncover test, and an ocular motility examination, each available in near and far viewing modes. The examination procedure can be selected either in the VR device interface or on the computer screen by the doctor. All examinations can be completed in <2 minutes.
We refer to and update the calculations of Yeh et al. 10 and Miao et al. 11 by setting up a coordinate system that captures the deviation of the pupil in the horizontal and vertical directions and incorporating it into the following equation:
where
The AI model used in this study is a medical image segmentation network called SS-SwinuNet, which is used for image segmentation, recognition, and computation. After the infrared camera built into the VR device captures the eye movement video, key frames are extracted and the eye region in each image is segmented by Swinunet. The model recognizes the pupil and iris, and image processing algorithms are then applied to obtain the eye movement offsets. The development set of the model comprises 10,000 images from TEyeD, a large publicly available dataset of ocular images, and 5000 images of children with clinical strabismus obtained from our hospital database. Of this dataset, 70% was used as the training set, 15% as the validation set, and 15% as the test set. The model achieved an average recognition and segmentation accuracy of 95.36% on the development set. To evaluate the accuracy of SS-SwinuNet in recognizing strabismus in children, an external validation set was created in this study by consecutively collecting eye movement data from 131 patients.
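The 70/15/15 partition of the 15,000 development images can be sketched as follows. This is only an illustration: the text does not say how images were assigned to each subset, so the random image-level shuffle, the fixed seed, and the function name are all assumptions.

```python
import random


def split_indices(n_items: int, seed: int = 0, train_pct: int = 70, val_pct: int = 15):
    """Shuffle item indices and cut them into train/validation/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random image-level split
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


n_images = 10_000 + 5_000  # TEyeD images + hospital strabismus images
train, val, test = split_indices(n_images)
print(len(train), len(val), len(test))  # prints: 10500 2250 2250
```

When several images come from the same child, splitting by subject rather than by image is a common way to keep the same eye from appearing in both the training and test sets.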
Examination process
All subjects were first examined by two strabismus and pediatric ophthalmologists in separate outpatient clinics, and the results were recorded. Subjects were then examined in a separate room with the VR equipment, and the results were recorded. Subjects wore the VR equipment and fixated on the reticle in a white virtual scene. The operator selected the appropriate test mode, and cover was simulated by switching off the screen for one eye. For ACT, the screen was switched off for 2 seconds for one eye and then for 2 seconds for the other eye, repeated for 4 sets. For CUT, the screen for one eye was switched off for 2 seconds and then restored for 2 seconds, which constituted 1 set; this was repeated for 4 sets and then for 4 sets for the other eye. The system automatically recorded the changes in the positions of both eyes and calculated the result.
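The occlusion timing described above can be written out as a simple schedule. The following is a minimal sketch, not the system's actual control code: the `Phase` structure and function names are hypothetical, and which eye is occluded first is not stated in the text, so it is exposed as a parameter.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    start_s: float     # time at which the phase begins, in seconds
    duration_s: float  # how long the phase lasts, in seconds
    left_on: bool      # True if the left-eye screen is lit
    right_on: bool     # True if the right-eye screen is lit


def act_schedule(sets: int = 4, cover_s: float = 2.0, first: str = "right") -> List[Phase]:
    """Alternating cover: black out one eye, then the other, in every set."""
    order = ["right", "left"] if first == "right" else ["left", "right"]
    phases, t = [], 0.0
    for _ in range(sets):
        for eye in order:
            phases.append(Phase(t, cover_s, left_on=(eye != "left"), right_on=(eye != "right")))
            t += cover_s
    return phases


def cut_schedule(eye: str, sets: int = 4, cover_s: float = 2.0, uncover_s: float = 2.0) -> List[Phase]:
    """Cover-uncover for one eye: black out that eye, then restore it."""
    phases, t = [], 0.0
    for _ in range(sets):
        phases.append(Phase(t, cover_s, left_on=(eye != "left"), right_on=(eye != "right")))
        t += cover_s
        phases.append(Phase(t, uncover_s, left_on=True, right_on=True))
        t += uncover_s
    return phases


def total_duration(phases: List[Phase]) -> float:
    return sum(p.duration_s for p in phases)


act = act_schedule()
# start_s restarts at 0 for each eye's CUT block; only durations are summed here.
cut = cut_schedule("right") + cut_schedule("left")
print(total_duration(act), total_duration(cut))  # prints: 16.0 32.0
```

Under these timings, ACT and CUT together take 48 seconds per viewing mode, which is compatible with the stated total examination time of under 2 minutes.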
Figures 2 and 3 show the eye movement data displayed in real time on the screen while the subjects were performing the ACT in near vision mode.

ACT in near mode. ACT: alternating covering test.

ACT in near mode. ACT: alternating covering test.
Statistical analysis
All data analysis was performed with SPSS 29.0. In case of disagreement, the type of strabismus was decided after discussion between the two doctors. The angle of ocular deviation was taken as the average of the two doctors’ measurements. The manual results were used as the gold standard against which the AI system was compared. Agreement in strabismus diagnosis was assessed using the Kappa consistency test, and p < 0.05 was considered statistically significant. The agreement of the strabismus angles obtained by the AI system with the gold standard was assessed using Bland–Altman plots and intraclass correlation coefficients (ICC) with 95% confidence intervals, based on a two-way random-effects model, absolute agreement, and single measurements.
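For readers without SPSS, the two agreement measures can be sketched in plain Python. This is an illustrative implementation of the standard Bland–Altman limits of agreement and of ICC(2,1) (two-way random effects, absolute agreement, single measurement), not the study's analysis script; the function names are hypothetical and the four paired measurements are made-up values used only to exercise the code.

```python
from statistics import mean, stdev


def bland_altman(manual, ai):
    """Return (mean difference, lower limit, upper limit) for manual - AI."""
    diffs = [m - a for m, a in zip(manual, ai)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - half_width, bias + half_width


def icc_2_1(x, y):
    """ICC(2,1) for two raters measuring the same n subjects."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_err = sum(
        (v - r - c + grand) ** 2
        for a, b, r in zip(x, y, row_means)
        for v, c in ((a, col_means[0]), (b, col_means[1]))
    )
    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-method mean square
    ms_err = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)


# Hypothetical deviation angles in PD, for illustration only.
manual_pd = [10, 20, 30, 40]
ai_pd = [12, 18, 33, 41]

bias, lower, upper = bland_altman(manual_pd, ai_pd)
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
print(f"ICC(2,1)={icc_2_1(manual_pd, ai_pd):.3f}")
```

Because ICC(2,1) with absolute agreement penalizes a systematic offset between methods, two methods can be strongly correlated and still have a low ICC.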
Results
In this study, 131 subjects were enrolled consecutively. Of these, 10 patients had combined horizontal and vertical strabismus and were not included in the statistical analysis, and 11 patients with incomplete data were excluded. The final analysis included 110 subjects, whose data are shown in Table 1.
Demographic information of all patients (manual result).
All were diagnosed by PACT at distance (6 m) or near (33 cm). Exotropia: <−10 PD horizontally; esotropia: >10 PD horizontally; vertical: >5 PD vertically; normal: ≤10 PD horizontally and ≤5 PD vertically.
PACT: prism alternating cover test; PD: prism diopter.
The demographic characteristics of the study subjects are listed in Table 1. Mean age was 7.60 ± 2.90 (standard deviation) years; 50 (45.5%) were male and 60 (54.5%) were female. The sensitivity of the AI system in screening for strabismus was 83%, the specificity was 79%, and the accuracy was 82%, showing moderate agreement with the manual results (Kappa = 0.562, p < 0.001).
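The relationship between these screening metrics can be checked from a 2 × 2 table. The cell counts below are not reported in the text; they are hypothetical values chosen to match the group sizes (82 strabismus and 28 normal subjects out of 110) and to reproduce the reported sensitivity (83%), accuracy (82%), and Kappa (0.562), together with the specificity of 79% quoted in the Discussion.

```python
def diagnostic_summary(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Sensitivity, specificity, accuracy, and Cohen's kappa from a 2x2 table."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # observed agreement (equals accuracy)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": observed,
        "kappa": (observed - expected) / (1 - expected),
    }


# Hypothetical counts (assumption): 68 of 82 strabismus cases detected,
# 22 of 28 normal subjects correctly classified.
summary = diagnostic_summary(tp=68, fn=14, fp=6, tn=22)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Accuracy is a prevalence-weighted average of sensitivity and specificity, so it must lie between the two.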
In this study, the data in the exotropia group were tested for normality by the Kolmogorov–Smirnov (K-S) test, and the normal, esotropia, and vertical strabismus groups were tested for normality by the Shapiro–Wilk (S-W) test. The results showed that the AI data of the exotropia group did not conform to a normal distribution in either the near or the far viewing mode (p = 0.0125 and p = 0.0015), and the manual results conformed to a normal distribution (p = 0.0697 and p = 0.0175). The esotropia group conformed to a normal distribution in both near and far modes (p > 0.05). The vertical strabismus group conformed to a normal distribution in near mode (p > 0.05) but did not fully conform in far mode (AI: p = 0.598 and manual: p = 0.0239). The normal group did not conform to a normal distribution in either near or far mode (p < 0.05).
The data of the patients with strabismus were further analyzed and the results are shown in Table 2.
Basic information of AI results.
Mean ± standard deviation. Exotropia: <−10 PD horizontally; esotropia: >10 PD horizontally; vertical: >5 PD vertically; normal: ≤10 PD horizontally and ≤5 PD vertically.
AI: artificial intelligence; ICC: intraclass correlation coefficient; PACT: prism alternating cover test; PD: prism diopter.
The demographic characteristics of the 60 exotropia subjects are listed in Table 2. Mean age was 8.45 ± 2.75 (SD) years (range, 4–18 years); 30 (50%) were male and 30 (50%) were female. The sensitivity for diagnosing exotropia was 76.7%, the specificity was 100%, and the accuracy was 87.2%, with strong agreement with the manual results (Kappa = 0.749, p < 0.001). The mean deviation was −18.9 ± 12.3 (SD) PD at 33 cm and −18.7 ± 14.9 (SD) PD at 6 m (range, −10 to −85 PD). Reproducibility of the AI system against the manual results, expressed as ICC, was low for exotropia in both the near mode (ICC = 0.391; 95% CI, −0.098 to 0.655) and the far mode (ICC = 0.334; 95% CI, −0.067 to 0.618).
Although the AI system showed low reproducibility in screening exotropia, it performed better in esotropia. As listed in Table 2, the mean age of the 18 esotropia subjects was 7.28 ± 3.34 (SD) years (range, 3–18 years); 8 (44.4%) were male and 10 (55.6%) were female. The sensitivity for diagnosing esotropia was 88.9%, the specificity was 98.9%, and the accuracy was 97.3%, with strong agreement with the manual results (Kappa = 0.898, p < 0.001). The mean deviation was 24.0 ± 14.6 (SD) PD at 33 cm and 31.3 ± 17.5 (SD) PD at 6 m (range, 5 to 55 PD). Reproducibility expressed as ICC was high for esotropia in both the near mode (ICC = 0.817; 95% CI, 0.561 to 0.931) and the far mode (ICC = 0.764; 95% CI, 0.456 to 0.910).
In addition, the demographic characteristics of the 4 vertical strabismus subjects are listed in Table 2. Mean age was 6.25 ± 3.40 (SD) years (range, 3–11 years); 2 (50%) were male and 2 (50%) were female. The sensitivity for diagnosing vertical strabismus was 100%, the specificity was 100%, and the accuracy was 100%, with strong agreement with manual results (Kappa = 1, p < 0.001). The mean deviation was 8.3 ± 2.4 (SD) PD at 33 cm (range, 6–10 PD) and 24.0 ± 13.7 (SD) PD at 6 m (range, 10–39 PD). ICC was not statistically significant in either the near or the far mode.
Because only the esotropia group conformed to a normal distribution in the near and far viewing modes, the ICC results described above only roughly reflect the agreement between the two methods. To ensure the accuracy of the results, Bland–Altman plots and linear correlation analysis plots were used in this study, and the results are shown below.
The linear regression plots for the two methods in the near and far modes in the exotropia group are shown in Figures 4 and 5. The exotropia group showed a strong correlation between the two methods in near mode (R = 0.731, p < 0.0001) and a weaker, moderate correlation in far mode (R = 0.561, p < 0.0001).

Linear correlation plots of the manual results versus AI results in 33 cm for the exotropia group. AI: artificial intelligence.

Linear correlation plots of the manual results versus AI results in 6 m for the exotropia group. AI: artificial intelligence.
As shown in Figures 6 and 7, the mean difference in the exotropia group was −23.2 PD in the near viewing mode, with 95% limits of agreement of −49.8 to 3.5 PD. In the far viewing mode, the mean difference was −17.9 PD, with 95% limits of agreement of −50.2 to 14.4 PD, which is well outside an acceptable range for the assessment. In addition, the difference between the measurements of the two methods was statistically significant (p < 0.0001) for exotropia in both the near and far viewing modes, indicating that the two methods did not agree in the exotropia group.

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 33 cm for exotropia. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was −23.2 PD. AI: artificial intelligence; PD: prism diopter.

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 6 m for exotropia. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was −17.9 PD. AI: artificial intelligence; PD: prism diopter.
Figures 8 and 9 show the linear regression plots for the esotropia group in the near and far viewing modes. The esotropia group showed a strong correlation between the two methods in both near and far viewing (R = 0.7595 and R = 0.7652, respectively; p < 0.001).

Linear correlation plots of the manual results versus AI results in 33 cm for the esotropia group. AI: artificial intelligence.

Linear correlation plots of the manual results versus AI results in 6 m for the esotropia group. AI: artificial intelligence.
As shown in Figures 10 and 11, the esotropia group showed excellent agreement between the two methods in the near viewing mode, but not in the far viewing mode.

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 33 cm for esotropia. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was 4.3 PD. AI: artificial intelligence; PD: prism diopter.

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 6 m for esotropia. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was −8.7 PD. AI: artificial intelligence; PD: prism diopter.
In the near mode, the difference between the measurements of the two methods was not statistically significant (p = 0.0957). The Bland–Altman regression equation
Figures 12 and 13 show the linear regression plots of the two methods for the vertical strabismus group in the near and far viewing modes. There was no significant correlation between the two methods in the vertical strabismus group in either the near or the far viewing mode (R = 0.2455 and R = 0, respectively; p > 0.1).

Linear correlation plots of the manual results versus AI results in 33 cm for the vertical strabismus group. AI: artificial intelligence.

Linear correlation plots of the manual results versus AI results in 6 m for the vertical strabismus group. AI: artificial intelligence.
As shown in Figure 14, the mean difference for vertical strabismus in near mode was 2 PD, with 95% limits of agreement of −5 to 9 PD. The difference between the measurements of the two methods was not statistically significant (p = 0.3429). The Bland–Altman regression equation

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 33 cm for vertical strabismus. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was 2 PD. AI: artificial intelligence; PD: prism diopter.

Bland–Altman plot of the difference between manual and AI results versus the average of the manual and AI results in 6 m for vertical strabismus. Upper and lower dotted lines represent the 95% limits of agreement. The solid line represents the mean difference, which was −4.3 PD. AI: artificial intelligence; PD: prism diopter.
Discussion
This article describes the diagnostic accuracy of a system combining VR and AI for single types of pediatric strabismus. The system diagnosed strabismus with moderate agreement with the manual results (Kappa = 0.562), with a sensitivity of 83% and a specificity of 79%. The system performed better in the esotropia and vertical strabismus groups, with high sensitivity and specificity and high agreement with manual results (Kappa = 0.898 for esotropia and Kappa = 1 for vertical strabismus). However, the number of patients in the vertical strabismus group was small, so this result is not clinically representative. In the exotropia group, although the specificity of the system was as high as 100%, the sensitivity was only 76.7%, which may be related to wandering gaze given the young age of the patients. The sensitivity of 88.9% in the esotropia group may also be related to this factor. Taken together, the system performs well in the diagnosis of strabismus and the classification of single strabismus types.
In terms of measuring the angle of ocular deviation, the system was most accurate in patients with esotropia in the near viewing mode. The agreement between the two methods was high (ICC = 0.817), the difference between them was not significant (p > 0.05), and the two were strongly correlated. This may be related to the clinical presentation of esotropia, which is generally more influenced by accommodation and presents as a more stable, constant deviation, and therefore yields a more stable angle of ocular deviation during occlusion. However, agreement and correlation between the two methods were poor in the other groups. The system had very poor agreement in the exotropia group, although the correlation was high. This may be related to the type of strabismus: 46 patients in this study had intermittent exotropia. The deviation angle in intermittent exotropia is affected by control, accommodation, and attention and is therefore unstable; it needs to be measured after fusion has been sufficiently broken in order to obtain stable results. Currently, the monocular occlusion time of the system is 2 seconds in all cases, which does not sufficiently break fusion and may lead to inaccurate measurement of the deviation angle in the exotropia group. The results are consistent with the clinical manifestations and pathogenetic features of the disease, so it is feasible to determine the ocular deviation angle with this system, but the exact degree of strabismus in patients with exotropia needs to be determined through further study by specialists. The vertical strabismus group performed poorly in both near and far viewing modes, but the results are not clinically representative because of the small sample size.
Clinical measurement of the strabismus angle usually relies on APCT, whose accuracy depends on the examiner’s skill and which takes a long time to perform. With the development of technology in recent years, many studies have begun to explore eye-tracking technology as a replacement for APCT in strabismus diagnosis. In 2018, Kohen et al. 12 reported a video-based technique for diagnosing strabismus with high accuracy. However, the study was conducted on adults, and the accuracy in children needs to be further investigated. Park et al. developed a technique for recognizing exotropia through video screenshots, which showed an excellent correlation with the manual results (R = 1), and the Bland–Altman plots also showed no discrepancy, suggesting that the technique is highly accurate in diagnosing exotropia. 13 Those results complement the present study, and the technique should be studied and incorporated in subsequent work. However, the study population in that article was mainly adults, and the diagnostic efficacy for strabismus in children is unclear. Rai et al. 14 developed an eye-tracking system that can be used for the diagnosis of strabismus in patients aged 3 to 41 years, and their results were positively correlated and consistent. This suggests that it is feasible to measure strabismus using an eye-tracking system. However, ensuring that all children understand and cooperate with the examination remains a problem to be solved. Zou et al. 15 applied eye-tracking technology to children and found that it could be used as an alternative to APCT for the diagnosis and measurement of strabismus in children. Both the video-oculography (VOG) technique and the Gazelab technique have since matured, and a study by Cerdan et al. 16 showed that VOG and APCT correlate well and that VOG can be used as an alternative to manual examination of strabismus. Palazón et al. 17 compared these two techniques separately with the cover test; the VOG Perea (VP) showed very high concordance, whereas the Gazelab (GL) system showed poor concordance. The new strabismus screening system in this study was more accurate than the GL but still slightly less accurate than the VP technique, and needs to be further improved in subsequent studies. In addition, VR and augmented reality (AR) technologies have developed rapidly. In this study, a virtual scene built with VR technology was used to diagnose strabismus, whereas Nixon et al. measured ocular deviation with an AR device and explored the performance of AR technology in 19 patients with strabismus and 7 normal controls. Their results showed a moderate correlation (R = 0.62) between the AR technique and APCT. 18 The diagnostic accuracy of that study was not as good as that of the present study, but the accuracy of the measurements in this study for different types of strabismus still needs to be improved. Similarly, Li et al. 19 investigated the diagnostic efficacy of a video ophthalmoscope with low-cost hardware for strabismus in children. The results showed that the device can support telemedicine, diagnose strabismus, and measure the angle of ocular deviation, which makes it possible to widely disseminate low-cost strabismus screening devices in the future. Currently, this study focuses on the diagnosis of common strabismus, whereas the eye-tracking technique developed by Orduna et al. 20 can examine paralytic strabismus with more accurate results than manual examination. Further refinement of the diagnosis of multiple strabismus types is needed in subsequent studies.
This study still has some limitations. First, although the system can meet the need for large-scale screening of pediatric strabismus, it did not account for the effect of angle kappa, which varies with individual differences in children’s development and may therefore lead to false-positive results and wasted medical resources. Second, there were only four patients in the vertical strabismus group, so none of the results for this group are clinically representative, and further study is required.
Conclusion
The system, which combines VR and AI, can be used clinically to screen for pediatric strabismus with high sensitivity and specificity. The system performs well in the diagnosis and classification of strabismus and can accurately calculate the ocular deviation angle in patients with esotropia. However, the calculation of the ocular deviation angle in patients with exotropia and vertical strabismus is still deficient and needs further development.
