Abstract
1. Introduction
Gaze estimation has been used in a variety of applications, such as computer interfaces for the disabled, view navigation in virtual reality systems and driver inattention detection in intelligent vehicles. With the increasing need for an intelligent user interface on a large display, gaze detection technology, which detects where a user is looking, is being considered. Because gaze detection uses the position of the center of the pupil to calculate a user's gaze position, robust and accurate pupil detection is required. In general, the area of the pupil in an image is difficult to directly detect because the pupil occupies such a small portion of a captured face image. For this reason, the preferred approach is to first perform eye detection, followed by accurate pupil detection within the detected eye region.
In previous research, the adaptive boosting (Adaboost) algorithm [9, 10] and the continuously adaptive mean shift (CAMShift) algorithm [12] have been widely used for eye detection. However, the conventional eye detection method based on the Adaboost algorithm has several limitations: it requires considerable processing time and the detection accuracy can be reduced when the eyes are closed or are obstructed by eyeglasses. Another eye detection method based on the CAMShift algorithm has a disadvantage in that the detection error can accumulate over successive images. Other previous pupil detection methods based on binarization [1], deformable template matching [2] and the starburst method [3] suffer from performance degradation caused by variations in illumination or require long processing times. Another approach that can be used to perform iris and pupil detection is using a Fourier series-based deformable model [7]. However, this approach has the disadvantage of high processing complexity.
Given these limitations, we propose a new eye/pupil detection method for gaze detection on a large display. Compared to previous studies, this research is novel in four ways. (1) In order to overcome the performance limitation of conventional methods of eye detection, such as the Adaboost and CAMShift algorithms, we propose adaptive selection of the Adaboost and CAMShift methods. (2) This adaptive selection is based on two parameters: pixel differences in successive images and matching values by CAMShift. (3) A support vector machine (SVM)-based classifier is used with these two parameters as input, which improves the eye detection performance. (4) Within the detected eye region, the center of the pupil is accurately located by means of circular edge detection, binarization and calculation of the geometric center.
The remainder of this paper is organized as follows. Section 2 describes the details of the proposed method. The experimental results are presented in Section 3. Finally, the conclusions are drawn in Section 4.
2. Proposed Method
2.1 Gaze tracking system for the interface on a large display
In our research, we constructed a new interface system for a large display that is based on gaze tracking. As shown in Figure 1, the proposed system consists of one wide view camera (WVC) and one narrow view camera (NVC) that have panning, tilting and focusing functionalities [17]. The face and eye positions of the user are detected by the WVC and then transmitted to the NVC, which is panned, tilted and focused. Detailed explanations of the proposed system are as follows [17]. With the captured image from the WVC, the user's face area is detected by the Adaboost and CAMShift algorithms. In the located face region, the user's eye is detected by combining Adaboost with an adaptive template-matching algorithm. Next, the NVC is panned and tilted based on the detected eye position and is able to capture the magnified eye image. However, owing to the large focal length of the NVC, the depth-of-field of the NVC is small and the eye image from the NVC can be easily blurred according to changes in the Z distance. Thus, auto-focusing of the NVC is performed based on the estimation of the Z distance using the facial width in the WVC image. The focus value of the NVC image is then calculated by the focus assessment mask.

Figure 1. Proposed gaze tracking system for the interface on a large display
In this way, a magnified and focused image of the eye is obtained by the NVC and the gaze position is calculated on the basis of the detected center of the pupil and four specular reflections that are generated by four near-infrared (NIR) illuminators on the four corners of the large display [17]. Specifically, the four NIR illuminators at the four corners of the display in Figure 1 produce four specular reflections (SRs) on the eye, as shown in Figure 2. The rectangle formed by these four SRs represents the display screen because the four NIR illuminators are attached at the four corners of the display. Based on these four SR positions and the detected pupil center, the gaze position on the display can be calculated based on a geometric transform matrix [18].
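The mapping from the pupil center to a display position can be sketched as follows. This is only an illustration under an assumption: it models the "geometric transform matrix" as a projective (homography) mapping from the quadrilateral of the four SRs to the display rectangle, which may differ from the transform defined in [18]. The function names `solve_linear`, `fit_projective` and `gaze_position` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_projective(sr_points, display_corners):
    """Fit the 8 parameters that map the four SR positions onto the display corners.

    Assumed model (not taken from the paper):
        u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1)
        v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1)
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(sr_points, display_corners):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)


def gaze_position(h, pupil_center):
    """Map the detected pupil center into display coordinates."""
    x, y = pupil_center
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)
```

The SR positions and display corners must be listed in the same order (for example, clockwise from the top-left corner), and the display size used for the corners is whatever resolution the target screen has.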
However, considerable noise is usually included in an image captured by the NVC, as shown in Figure 2, which makes it difficult to accurately and distinctly detect the regions of the eye and pupil. For this reason, we propose a new eye/pupil detection method for gaze detection.

Figure 2. Eye images captured by the NVC that contain considerable noise: (a) two circles represent the noise of the hair and eyebrow and (b) three circles show the noise of the eyeglass frame and the specular reflections on the surface of the lens
2.2 Overview of proposed method
Figure 3 shows the overall procedure of the proposed eye/pupil detection method, which is applied to the image captured by the NVC shown in Figure 1. In order to reduce the processing time, the original 1600 × 1200 pixel input image is sub-sampled to a 320 × 240 pixel image. If the input image is the first frame, eye detection is performed by the Adaboost algorithm because there is no previously stored template of the eye region. If the input image is not the first frame and a detected eye region exists in the previous frame, eye detection is performed by the CAMShift algorithm. If the input image is not the first frame but there is no detected (stored) eye region in the previous frame, eye detection is again performed by the Adaboost method. Following eye detection by the CAMShift method in the current frame, two scores are calculated between the eye regions of the previous and current frames: the histogram matching score and the pixel difference score. On the basis of these two scores, the SVM determines whether the detected eye region is correct or incorrect. If the detected eye region is determined to be incorrect, eye detection is performed again by the Adaboost method. If it is determined to be correct, the procedure continues to pupil detection.
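The per-frame selection logic described above can be sketched as a single function. This is a minimal sketch, not the authors' implementation: the detectors and the classifier are passed in as hypothetical callables (`adaboost`, `camshift`, `is_correct`), and their signatures are assumptions made for illustration.

```python
def detect_eye(frame, prev_region, adaboost, camshift, is_correct):
    """Return (eye_region, detector_name) for one frame.

    Assumed callables (hypothetical signatures):
        adaboost(frame) -> region
        camshift(frame, prev_region) -> (region, hist_score, pixel_diff)
        is_correct(hist_score, pixel_diff) -> bool   # the SVM decision
    """
    if prev_region is None:
        # First frame, or no stored eye region from the previous frame.
        return adaboost(frame), "adaboost"

    region, hist_score, pixel_diff = camshift(frame, prev_region)
    if is_correct(hist_score, pixel_diff):
        # CAMShift result accepted: continue to pupil detection.
        return region, "camshift"

    # CAMShift result rejected by the classifier: detect again with Adaboost.
    return adaboost(frame), "adaboost"
```

The caller keeps the returned region and passes it as `prev_region` for the next frame, so that CAMShift is used whenever a previous detection is available.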

Figure 3. Overall procedure of the proposed eye/pupil detection method
2.3 Eye detection by adaptive selection of the Adaboost and CAMShift methods based on an SVM
As explained in Section 2.2, if the input image is the first frame, eye detection is performed by the Adaboost algorithm because there is no previously stored template of the eye region. The Adaboost algorithm is based on Haar-like features and a cascade of weak classifiers [9, 10], which are combined to form a strong classifier. In the execution phase, the eye region is located using the trained weak classifiers. However, the Adaboost algorithm requires considerable processing time and its detection accuracy can be reduced when the eyes are closed or obstructed by eyeglasses. Therefore, if the input image is not the first frame and a detected eye region exists in the previous frame, eye detection is performed by the CAMShift algorithm. The CAMShift algorithm originates from the MeanShift method, and both the MeanShift [11] and CAMShift [12] methods are based on the colour histogram of the object being tracked. The MeanShift method tracks objects under the assumption that the probability distribution of the object is fixed and static [13]. The CAMShift algorithm was introduced to deal with distributions that change over a time sequence [13]. However, an eye detection method based on the CAMShift algorithm has the disadvantage that the detection error can accumulate over successive images. For this reason, this research proposes eye detection by adaptive selection of the Adaboost and CAMShift methods based on an SVM.
In general, a larger frame difference between successive images yields a larger degradation in CAMShift performance. In addition, the histogram matching score between the eye region in the previous and current frames gives credibility to eye detection performed using the CAMShift method. Therefore, after eye detection using the CAMShift method in the current frame, the histogram matching score and pixel difference between the eye region in the previous and current frames are calculated.
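Once the eye region is detected by CAMShift, the histogram of its gray levels is obtained. We use the correlation method to calculate the similarity between the histograms of the eye regions in the previous and current frames, $H_1$ and $H_2$:

$$d(H_1, H_2) = \frac{\sum_I H'_1(I)\, H'_2(I)}{\sqrt{\sum_I H'_1(I)^2 \, \sum_I H'_2(I)^2}} \qquad (1)$$

where $H'_k(I) = H_k(I) - \frac{1}{N}\sum_J H_k(J)$ is the histogram with its mean bin value subtracted and $N$ is the number of histogram bins.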
In addition, the pixel difference is calculated from all the pixels at the same position between the previous and current frames. The two calculated scores (the histogram matching and pixel difference scores) are used as input parameters for the SVM.
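The two scores can be illustrated with a short sketch. The histogram score below is the standard correlation between two mean-subtracted histograms. The pixel difference is assumed here to be the mean absolute gray-level difference over all pixel positions, since the text does not state the exact form. Function names are hypothetical and images are given as flat lists of 8-bit gray levels.

```python
import math


def gray_histogram(pixels, bins=256):
    """Histogram of 8-bit gray levels (pixels is a flat list of ints in 0..255)."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // 256] += 1
    return hist


def histogram_correlation(h1, h2):
    """Correlation between two histograms after subtracting each mean bin value."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) *
                    sum((b - m2) ** 2 for b in h2))
    # Flat histograms carry no shape information; treating them as matching
    # is an arbitrary choice made for this sketch.
    return num / den if den > 0 else 1.0


def mean_abs_pixel_difference(prev_pixels, curr_pixels):
    """Assumed form of the pixel difference score: mean |previous - current|."""
    total = sum(abs(a - b) for a, b in zip(prev_pixels, curr_pixels))
    return total / len(prev_pixels)
```

A correlation close to 1 and a small pixel difference both suggest that the CAMShift result still covers the same eye region as in the previous frame.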
The SVM is a type of supervised learning method that is used as a non-linear classifier with multi-dimensional data. The objective of the SVM is to find the optimal hyper-plane that helps to discriminate between different classes on the basis of the maximum margin of the support vectors [14, 15]. Two classes are manually defined for SVM training. Class 1 represents the eye region correctly detected by the CAMShift method. Class 2 represents the region that is incorrectly detected by the CAMShift method, in which case the eye region needs to be detected again by the Adaboost method. Among various kernels for the SVM, including the linear, polynomial, radial basis function (RBF) and sigmoid kernels in Eq. (2), the RBF kernel was experimentally selected as the optimal one with the parameter γ = 0.0078125 obtained from training data. In this research, the LIBSVM program was used for the SVM classifier [16].
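For reference, the four kernel types named above can be written out in the forms documented for LIBSVM, together with the general shape of the SVM decision function. This is a sketch only: the support vectors, coefficients and bias in any real use come from training, and the function names and the default `coef0` and `degree` values are assumptions for illustration. Only the value γ = 0.0078125 (that is, 2⁻⁷) is taken from the text.

```python
import math

GAMMA = 0.0078125  # 2**-7, the RBF parameter reported in the text


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=GAMMA, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=GAMMA):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=GAMMA, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def svm_decision(x, support_vectors, coefs, bias, kernel=rbf_kernel):
    """f(x) = sum_i coefs[i] * K(sv_i, x) + bias.

    In this sketch a positive value is read as Class 1 (CAMShift result
    correct) and a negative value as Class 2 (detect again with Adaboost).
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefs)) + bias
```

Here the input `x` would be the pair (histogram matching score, pixel difference score) computed for the current frame.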
On the basis of the SVM classification results, if the detected eye region is determined as being an incorrect one (Class 2), eye detection is again performed by the Adaboost method. If it is determined as being the correct one (Class 1), the next step in the pupil detection procedure is performed (see Section 2.4).
2.4 Detection of the pupil region and center of the pupil
Within the eye region detected by the Adaboost or CAMShift algorithm, pixels with very high gray levels are considered to be SRs from the NIR illuminators. As shown in Figure 4(b), an SR has a high gray level, so an area of specular reflection usually creates a distinct boundary that makes it difficult to accurately detect the pupil area. Because the circular edge detection (CED) method locates the pupil boundary based on the gray-level difference between inner and outer circles, the distinct boundary of an SR causes CED to detect an incorrect pupil boundary, as shown in Figure 4(c). Hence, the SRs are erased before CED is performed.

Figure 4. Examples of erasing specular reflections and pupil detection: (a) erasing specular reflection, (b) pupil detection by a circular edge detector after erasing specular reflections and (c) pupil detection by a circular edge detector without erasing specular reflections
Because the exact positions of the SRs are usually difficult to detect, pixels whose gray levels are higher than a threshold are regarded as SRs and are replaced with the values of neighbouring pixels in the horizontal direction whose gray levels are lower than the threshold.
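A minimal sketch of this erasing step is given below. The text does not specify how the horizontal neighbours are chosen, so this version assumes that each SR pixel takes the value of the nearest non-SR pixel in the same row (averaging the left and right neighbours when both are equally near). The function name and the image representation (a list of rows of gray levels) are hypothetical.

```python
def erase_specular_reflections(image, threshold):
    """Replace pixels brighter than `threshold` using the nearest pixel in the
    same row whose gray level does not exceed the threshold."""
    result = [row[:] for row in image]
    for y, row in enumerate(image):
        width = len(row)
        for x, value in enumerate(row):
            if value <= threshold:
                continue  # not a specular reflection
            # Search outwards to the left and right along the row.
            for offset in range(1, width):
                candidates = [row[i] for i in (x - offset, x + offset)
                              if 0 <= i < width and row[i] <= threshold]
                if candidates:
                    result[y][x] = sum(candidates) // len(candidates)
                    break
    return result
```

Reading neighbour values from the original image rather than from the partially filled result keeps the output independent of the scan order.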
Next, the pupil area is located using the CED method. The CED method detects the center of the pupil, where the difference between the inner and the outer boundaries of the pupil area is maximized, while changing the center and radius of the pupil area [1, 18], as follows:
where I(x, y) is the pixel value at position (x, y), (x0, y0) is the center of a circle and r is its radius. Eq. (3) is evaluated for each candidate (r, x0, y0), and the center (x0, y0) and radius r that maximize its value are taken as the detected pupil center and radius, respectively.
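The search described above can be sketched in discrete form. This is an illustration under assumptions, not a reproduction of Eq. (3): for each candidate center and radius it compares the mean gray level on a slightly larger circle with that on a slightly smaller one and keeps the candidate with the largest difference. The names `circle_mean` and `circular_edge_detection`, the `margin` parameter and the number of samples per circle are hypothetical.

```python
import math


def circle_mean(image, x0, y0, r, samples=64):
    """Mean gray level sampled on the circle of radius r centred at (x0, y0)."""
    total = 0.0
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        x = int(round(x0 + r * math.cos(angle)))
        y = int(round(y0 + r * math.sin(angle)))
        total += image[y][x]
    return total / samples


def circular_edge_detection(image, radii, margin=1):
    """Exhaustive search for the (x0, y0, r) that maximizes the difference
    between the outer and inner circle means (the pupil is darker inside).

    Returns (score, x0, y0, r), or None if no candidate fits in the image.
    Each radius in `radii` must be at least `margin`.
    """
    height, width = len(image), len(image[0])
    best = None
    for r in radii:
        reach = r + margin  # keep the outer circle inside the image
        for y0 in range(reach, height - reach):
            for x0 in range(reach, width - reach):
                score = (circle_mean(image, x0, y0, r + margin)
                         - circle_mean(image, x0, y0, r - margin))
                if best is None or score > best[0]:
                    best = (score, x0, y0, r)
    return best
```

In practice the search would be restricted to the detected eye region and to a plausible range of pupil radii, which keeps the exhaustive loop affordable.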
However, because the pupil area does not have a circular shape, but rather a deformed circular shape [7], there is detection error at the center of a pupil that is located by CED. To more accurately detect the center of the pupil, a rectangular region including the pupil area that is detected by CED is defined and this region is binarized. Using the binarized image, the precise center of the pupil is obtained by calculating the geometric center of the black pixels [18].
Figure 5 illustrates the procedure for pupil detection. As shown in Figure 5(a), the box area (not the exact pupil boundary) can be detected by the proposed eye detection method by adaptive selection of the Adaboost and CAMShift methods based on the SVM. Then, an accurate pupil boundary with a circular shape can be detected within the box area using CED via Eq. (3) and the result is shown in Figure 5(b). However, there exists an error in the detected pupil boundary, as seen in Figures 5(b) and (c), because the pupil is not perfectly circular (the detected pupil boundary (white circle line) is not exactly the same as the actual pupil boundary). Thus, in order to alleviate this problem and detect a more accurate pupil center, we perform the following operation.

Figure 5. Procedure for pupil detection: (a) detection of eye region, (b) detection of approximate pupil area by CED, (c) enlarged image of the eye region of (b), (d) binarization of pupil area, (e) accurate detection of the geometric center of the pupil region and (f) enlarged image of the eye region of (e)
Based on the detected center and radius of the pupil boundary (white circle) in Figure 5(b), a rectangular region is defined and binarization is performed, as shown in Figure 5(d). Binarization sets a pixel to white (255) if its gray value is greater than a certain threshold and to black (0) otherwise. As can be seen in Figure 5(b), the pixel values of the pupil are lower (darker) than those of other areas, such as the iris or a specular reflection. After binarization, the pupil area therefore appears as black pixels and the other areas appear as white pixels, as seen in Figure 5(d). Then, the geometric center of the black pixels is calculated and taken as the center of the pupil, as shown in Figures 5(e) and (f).
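The binarization and geometric-center steps can be sketched as follows, assuming the rectangular region is given as a list of rows of gray levels. The function names are hypothetical and the threshold is left as a parameter, since the text does not state its value.

```python
def binarize(region, threshold):
    """White (255) where the gray value exceeds the threshold, black (0) otherwise."""
    return [[255 if value > threshold else 0 for value in row] for row in region]


def geometric_center_of_black(binary):
    """Mean (x, y) position of all black pixels, or None if there are none."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(binary):
        for x, value in enumerate(row):
            if value == 0:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

The returned coordinates are relative to the rectangular region, so the offset of the region's top-left corner must be added to obtain the pupil center in image coordinates.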
3. Experimental Results
The proposed method of eye and pupil detection was tested on a desktop computer equipped with an Intel Core 2 Quad processor operating at 2.3 GHz with 4 GB of RAM. The method was implemented using Microsoft Foundation Class (MFC)-based C++ programming and the OpenCV library. We collected 3,600 images from 18 subjects, including people who wore eyeglasses, contact lenses and neither glasses nor contact lenses. Half of the images were used for SVM training and the remaining ones were used for testing. Figure 6 shows examples of these images.

Figure 6. Examples of images used in the experiments
We compared the accuracy and processing time of the proposed method to other methods, including eye and pupil detection. The results are listed in Tables 1 and 2. The accuracy was measured as the root mean square (RMS) error between the manually selected pupil center and the center that was detected automatically. In Tables 1 and 2, Adaboost indicates eye detection performed by the Adaboost method for all images. In the tables, CAMShift indicates that eye detection is performed by Adaboost only in the first frame. In successive images after the first frame, eye detection is performed by CAMShift using the detected eye/pupil region of the previous frame.
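The accuracy measure can be sketched as below, assuming that the error for each image is the Euclidean distance in pixels between the manually selected and automatically detected pupil centers, and that the RMS is taken over all test images. The function name is hypothetical.

```python
import math


def rms_error(manual_centers, detected_centers):
    """Root mean square of the per-image distances between paired centers."""
    squared = [(mx - dx) ** 2 + (my - dy) ** 2
               for (mx, my), (dx, dy) in zip(manual_centers, detected_centers)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the distances are squared before averaging, a few large misdetections raise this measure more than many small offsets do.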
Adaptive template matching (ATM) is an eye detection method that utilizes Adaboost only in the first frame, with adaptive template matching performed on the basis of the detected eye/pupil region of the previous frame in successive images after the first frame. Rapid eye, which is a block-based eye detection method that employs integral imaging [19], is used for eye detection for all images. Adaptive selection by constant threshold is an eye detection method that employs adaptive selection of the Adaboost and CAMShift methods, using a constant threshold based on the two scores of histogram matching and pixel difference. The only difference between adaptive selection by constant threshold and the proposed method is that the former uses a constant threshold, whereas the latter uses an SVM-based classifier to determine whether the detected eye region is correct. For every eye detection method in Tables 1 and 2, the same pupil detection procedure, shown in Figure 5, was used in order to make a fair comparison between the eye detection methods.
Table 1. Comparison of the RMS error of pupil detection (units: pixels) (NE: naked eye, GL: eye with eyeglasses and CL: eye with contact lens)
Table 2. Comparison of pupil detection processing times (units: ms) (NE: naked eye, GL: eye with eyeglasses and CL: eye with contact lens)
As indicated by the data in Tables 1 and 2, the proposed method has a higher degree of accuracy and its processing speed is faster than that of the other methods. The proposed method can detect the center of the pupil at a speed of approximately 19.4 frames/s with an RMS error of approximately 5.75 pixels, which is superior to the results obtained using other methods.
Figure 7 shows examples of correct and incorrect detection results. The pupil detection error in Figures 7(d) and (e) has two causes. First, reflection noise on the eyeglass surface is included in the left boundary of the pupil. Second, the image is blurred and the eyeglasses reduce the illumination, which lowers the pixel values of the iris region and produces an incorrect binarized image. The eye detection error in Figure 7(f) occurs because eye detection by the Adaboost algorithm in the previous frame was incorrect owing to the proximity of the eye to the lower boundary of the image.

Figure 7. Correct and incorrect detection results: (a) result of correct eye detection, (b) result of correct pupil detection based on (a), (c) result of correct eye detection, (d) result of incorrect pupil detection based on (c), (e) enlarged image of eye region of (d) and (f) result of incorrect eye detection
4. Conclusion
In this paper, we have proposed a new eye/pupil detection method for gaze tracking. To reduce the processing time and to determine the precise position of the pupil, an approximate eye region is searched for using adaptive selection of the Adaboost and CAMShift algorithms based on an SVM. Within the detected eye region, the center of the pupil is accurately located using CED, binarization and calculation of the geometric center.
In future work, we plan to test the proposed method in a variety of environments and with more subjects. In addition, we plan to apply the proposed adaptive selection method to other object-tracking applications, such as the tracking of people and cars.
