Abstract
1. Introduction
Developing robots able to collaborate with humans in real-world domains such as material transport, rescue, health care and manufacturing offers numerous advantages. A requirement to achieve this objective is to develop the robot's ability to accurately and robustly detect and track human partners in order to generate the proper behaviour. This research deals with the issues involved in developing a reliable “person following” behaviour that allows the robot to accompany a human. Once the target person is detected, the robot attempts to drive directly toward the person's location. To this aim, our approach is to endow the robot with multimodal perception that enables it to fuse information from multiple sensors, assimilate these multimodal data in real time and respond at the timescale of the interaction.
The primary requirement of this research has been to investigate the realization of a human tracking system based on low-cost sensing devices. Recently, research on sensing components and software led by Microsoft has provided useful results for extracting the human pose and kinematics [1]. The Kinect motion sensor device offers visual and depth data at a significantly low cost. While the Kinect is a great innovation for robotics, it has some limitations. First, the depth map is only valid for objects that are further than 80cm away from the sensing device. A recent study [2] about the resolution of the Kinect shows that for mapping applications the object must be in the range of 1-3m in order to reduce the effect of noise and low resolution. Second, the Kinect uses an IR projector with an IR camera, which means that sunlight can have a negative effect, since the sun emits in the IR spectrum. Third, the Kinect relies on algorithms for detection of human activities captured by a static camera. In mobile robot applications the sensor configuration is embedded in the robot, which is usually moving. As a consequence, the robot is expected to deal with environments that are highly dynamic, cluttered and frequently subject to illumination changes.
To cope with this, our work is based on the hypothesis that the combination of multiple sensors, a Kinect, a thermopile array sensor (Heimann HTPA thermal sensor) and a Hokuyo laser, can significantly improve the robustness of human detection. Thermal vision helps to overcome some of the problems related to colour vision sensors, since humans have a distinctive thermal profile compared to non-living objects (so that pictures of humans are not detected as false positives) and there are no major differences in appearance between different people in a thermal image. Another advantage is that the sensor data does not depend on light conditions, so people can also be detected in complete darkness. As a drawback, some phantom detections near heat sources such as industrial machines or radiators may appear. Therefore, combining the advantages of different sensing sources is a promising research direction, because each modality has complementary benefits and drawbacks, as has been shown in other works [3], [4], [5], [6] and [7].
We have experimented in a science museum, with different exhibits on display, people moving around and strong illumination changes due to weather conditions.
The rest of the paper is organized as follows: Section 2 presents related work in the area of human tracking, concentrating mainly on work done using multiple sensor fusion for people tracking. Section 3 describes the proposed approach and Section 4 the experimental setup. Section 5 shows experimental results and Section 6 presents conclusions and future work.
2. Related work
People detection and tracking systems have been studied extensively due to the increasing demand for advanced robots that must integrate natural Human-Robot Interaction capabilities in order to perform specific tasks for humans or in collaboration with them. As a complete review of people detection is beyond the scope of this work (extensive surveys can be found in [8] and [9]), we focus on the most closely related work.
To our knowledge, two approaches are commonly used for detecting people using a mobile robot. The first includes vision-based techniques and the second combines vision with other modalities, normally range sensors such as laser scanners or sonar, as in [10], [11]. Recent computer vision literature is rich in people detection approaches in colour images. Most approaches focus on a particular feature: the face [12], [13], the head [14], the upper body or torso [15], the entire body [16], just the legs [17], or multimodal approaches that integrate motion information [3]. All methods for detecting and tracking people in colour images on a moving platform face similar problems and their performance depends heavily on the current light conditions, viewing angle, distance to persons and variability of appearance of people in the image.
Apart from cameras, the most common devices used for people tracking are laser sensors. One of the most popular approaches in this context is to extract the legs' position by detecting moving blobs that appear as local minima in the range image. [18] presents a system for detecting legs and following a person with only laser readings. A probabilistic model of a leg shape is implemented, along with a Kalman filter for robust tracking. [19] addresses the problem of detecting people using multiple layers of 2D laser range scans. Other implementations such as [20] also use a combination of face and laser-based leg detection.
Most existing combined vision-thermal-based methods, [4], [5], [6], [7], concern non-mobile applications in video monitoring applications and especially for pedestrian detection where the pose of the camera is fixed. Another approach, [21], shows the advantages of using thermal images for face detection. They suggest that the fusion of both visible and thermal-based face recognition methodologies yields better overall performance.
However, to the authors' knowledge, there is little published work on using thermal sensor information to detect humans from mobile robots. The main reason for the limited number of applications using thermal vision so far is probably the relatively high price of this kind of sensor. [22] shows the use of thermal sensors and grey-scale images to detect people from a mobile robot. A drawback of most of these approaches is the sequential integration of sensory cues: people are detected by thermal information only and are subsequently verified by visual or auditory cues.
3. Proposed approach
We propose a multimodal approach, which is characterized by parallel processing and filtering of sensory cues, as shown in Figure 1. Since our algorithm is intended to run on a robot, our implementation is based on the ROS system [23].

Approach combining three input cues from the RGB-D sensor, laser and thermal sensor using a particle filter.
The HTPA measures the temperature distribution of the environment and is intended for applications where very high resolution is not necessary, such as person detection, surveillance of temperature-critical surfaces, hotspot or fire detection, energy management and security. The thermopile array detects infrared radiation; we convert this information into an image where each pixel corresponds to a temperature value. The sensor only offers a 32×31 image, allowing a rough resolution of the environment temperature. The benefits of this technology are very low power consumption and high sensitivity.
The Kinect provides depth data that we transform into depth images. It illuminates the scene with near-infrared light, and the sensor measures the disparity between the projected pattern and the pattern observed by the IR camera, providing a 640×480 distance (depth) map in real time (30fps). In addition to the depth sensor, the Kinect also provides a traditional 640×480 RGB image.
The scanning laser range finder chosen for the leg and obstacle detection tasks is a Hokuyo UTM-30LX. This laser provides a measuring area of 270 angular degrees, a depth range from 0.1 to 30m and an angular resolution of 0.25° (1080 readings per scan).

The robotic platform used: a Segway RMP 200 equipped with the Kinect, the Hokuyo laser and the thermal sensor.
3.1 Leg detection
The proposed leg detector works on the laser readings, searching the range scan for the characteristic pattern produced by a pair of human legs, following the approach of Bellotto et al. [3].

Leg pattern, based on the work of Bellotto et al. [3].
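As an illustration, the leg-detection step can be sketched as follows. This is a minimal sketch, not the paper's implementation: the thresholds (`JUMP_THRESH`, `LEG_WIDTH`, `MAX_LEG_GAP`) and the simple chord-width test are illustrative assumptions; the actual detector uses the probabilistic leg model of [3].

```python
import numpy as np

# Hypothetical parameters, chosen for illustration only.
JUMP_THRESH = 0.1         # range discontinuity (m) that starts a new segment
LEG_WIDTH = (0.05, 0.25)  # plausible leg diameter range (m)
MAX_LEG_GAP = 0.4         # max distance (m) between the two legs of one person

def detect_legs(ranges, angle_min, angle_inc):
    """Return candidate person positions (x, y) from one laser scan."""
    ranges = np.asarray(ranges, dtype=float)
    angles = angle_min + angle_inc * np.arange(len(ranges))
    xs, ys = ranges * np.cos(angles), ranges * np.sin(angles)

    # 1. Split the scan into segments at range discontinuities.
    breaks = np.where(np.abs(np.diff(ranges)) > JUMP_THRESH)[0] + 1
    segments = np.split(np.arange(len(ranges)), breaks)

    # 2. Keep segments whose chord length matches a leg.
    legs = []
    for seg in segments:
        if len(seg) < 3:
            continue
        width = np.hypot(xs[seg[-1]] - xs[seg[0]], ys[seg[-1]] - ys[seg[0]])
        if LEG_WIDTH[0] <= width <= LEG_WIDTH[1]:
            legs.append((xs[seg].mean(), ys[seg].mean()))

    # 3. Pair nearby legs; the midpoint of a pair is a person candidate.
    people = []
    for i in range(len(legs)):
        for j in range(i + 1, len(legs)):
            if np.hypot(legs[i][0] - legs[j][0],
                        legs[i][1] - legs[j][1]) < MAX_LEG_GAP:
                people.append(((legs[i][0] + legs[j][0]) / 2,
                               (legs[i][1] + legs[j][1]) / 2))
    return people
```

The chord-width test is the simplest possible leg model; a real detector would also check the segment's curvature, as the leg pattern in the figure suggests.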
3.2 Vest detection
As stated above, RGB-D images are used to track the target and add information to the particle filter. The position of the target is estimated by detecting and tracking the emergency vest in the image, as described below.
The vest detection method assumes that the target person is wearing an emergency vest. The foreseen applications of the people-following behaviour are mainly related to robots supporting emergency personnel, for instance in rescue activities. In this method an RGB filter is applied in order to focus attention on the emergency vest's yellow colour, obtaining a binary image where white pixels correspond to the target colour. The binary image is then filtered by first erasing very small white areas and then applying morphological dilation and erosion operations (the result is shown in Figure 4 (a)). If more than one white region remains, the one with the largest area is selected. The chosen area is used to extract image features (corners with large eigenvalues) to be tracked in subsequent images (see Figure 4 (b)). The optical flow of the detected corners is calculated using the Lucas-Kanade method [24]. In each frame, after calculating the optical flow, the centroid of the corners is extracted as the target's estimated position.

Emergency vest detection: (a) binary image; (b) original image with features.
To increase tracking reliability and avoid errors, especially when the target turns or changes perspective, the image features are recalculated whenever the distance between them changes or some points disappear.
3.3 Thermal detection
People present a thermal profile different from their surrounding environment. The temperature detected in a pixel corresponding to a person is usually around 37 degrees Celsius, with a tendency to be slightly lower due to hair or clothes covering the skin.
We implement a procedure that, given a thermal image, computes a vector of 32 floating point numbers, one per image column, each representing the likelihood that a person is present at the corresponding horizontal angle.
This computation is performed in three steps:
First of all, a likelihood of corresponding to a person is assigned to every thermopile pixel. This likelihood is computed under the assumption that the temperature of a person is normally distributed with mean μ and standard deviation σ.
Then the likelihood matrix is smoothed by convolution with a Gaussian kernel five pixels wide [25].
And finally the maximum of each column is taken, producing the 32-element output vector.
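The three steps above can be sketched as follows. The person-temperature parameters (`MU`, `SIGMA`) and the smoothing sigma are illustrative assumptions, not values from the paper; the thermal image is assumed stored as a 31×32 (rows × columns) array.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Assumed person-temperature model (deg C); the paper only states ~37 C,
# slightly lower under hair and clothes.
MU, SIGMA = 36.0, 2.0

def thermal_likelihood(temps):
    """temps: 31x32 temperature image -> 32-element per-column likelihood vector."""
    temps = np.asarray(temps, dtype=float)
    # Step 1: per-pixel likelihood under a Gaussian person-temperature model.
    lik = np.exp(-0.5 * ((temps - MU) / SIGMA) ** 2)
    # Step 2: smooth the likelihood matrix with a Gaussian kernel (~5-pixel support).
    lik = gaussian_filter(lik, sigma=1.0)
    # Step 3: take the maximum of each column -> one value per horizontal angle.
    return lik.max(axis=0)
```

Each entry of the returned vector can then be looked up by the particle filter according to a particle's angle, since each thermopile column corresponds to a fixed bearing.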
3.4 Fusion using particle filter
Once the three detectors have produced their estimates, a particle filter is used to fuse the sensory cues and track the target person's position over time.
Focusing on the posed problem, the state at time t is modelled as X(t) = f(X(t-1), V(t-1)), where X(t-1) is the previous state vector and V(t-1) is the process noise.
The observation, on the other hand, is defined by the information provided by the three detectors: the leg positions extracted from the laser scan, the vest position extracted from the RGB-D images and the thermal likelihood vector.
Finally the tracking procedure of the particle filter is done as shown in Figure 5.

People following algorithm using particle filter.
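The tracking procedure of the figure can be summarized as a predict-weight-resample loop. The sketch below is a minimal SIR particle filter under stated assumptions: the state dimensions, noise levels and resampling criterion are illustrative, and the three detector likelihoods are abstracted into a single `likelihood_fn`.

```python
import numpy as np

rng = np.random.default_rng(0)

class PersonTracker:
    """Minimal SIR particle filter over the target's (angle, depth) state."""

    def __init__(self, n=500):
        # Spread particles uniformly over an assumed sensor field of view.
        self.p = np.column_stack([rng.uniform(-1.0, 1.0, n),   # angle (rad)
                                  rng.uniform(0.8, 3.0, n)])   # depth (m)
        self.w = np.full(n, 1.0 / n)

    def predict(self, noise=(0.05, 0.1)):
        # Constant-position motion model: X(t) = X(t-1) + V(t-1).
        self.p += rng.normal(0.0, noise, self.p.shape)

    def update(self, likelihood_fn):
        # Re-weight each particle by the fused observation likelihood.
        self.w *= likelihood_fn(self.p)
        self.w /= self.w.sum()
        # Systematic resampling when the effective sample size gets low.
        n = len(self.w)
        if 1.0 / np.sum(self.w ** 2) < n / 2:
            cs = np.cumsum(self.w)
            cs[-1] = 1.0  # guard against floating-point round-off
            idx = np.searchsorted(cs, (rng.random() + np.arange(n)) / n)
            self.p = self.p[idx]
            self.w = np.full(n, 1.0 / n)

    def estimate(self):
        # The weighted mean of the particles is the tracked position.
        return self.w @ self.p
```

In the full system, `likelihood_fn` would be the weighted combination of leg, vest and thermal likelihoods described in the following subsections.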
3.4.1 Leg Likelihood
As discussed above, the laser-based detector provides candidate leg positions. The leg likelihood Plaser(Xt|Zt) of a particle is computed from the distance between the particle and the closest detected pair of legs, so that particles close to a detection receive a higher likelihood.
3.4.2 Vest Likelihood
The information provided by the vest detector, i.e., the estimated position of the target in the image, is used to compute the vest likelihood Pvest(Xt|Zt): particles whose projected position is close to the detected vest centroid receive a higher likelihood.
3.4.3 Thermal Likelihood
The thermal likelihood Ptherm(Xt|Zt) is calculated using the 32-element vector provided by the thermal detector: each particle's angle is mapped to the corresponding thermopile column and the likelihood stored for that column is used. Since the thermopile provides no range information, this cue only constrains the angular component of the state.
3.4.4 Final Likelihood
Once the individual likelihoods are calculated, the final likelihood of a state given an observation is calculated as the weighted combination P(Xt|Zt) = wv·Pvest(Xt|Zt) + wl·Plaser(Xt|Zt) + wt·Ptherm(Xt|Zt), where Pvest(Xt|Zt), Plaser(Xt|Zt) and Ptherm(Xt|Zt) are the likelihoods calculated from the information provided by the vest, leg and thermal detectors respectively, and wv, wl and wt are the weights assigned to each cue.
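The weighted combination can be written as a one-line helper. The default weights here are illustrative assumptions: only the thermal weight of 0.15 is mentioned later in the evaluation, and the weights are assumed to sum to one.

```python
def fused_likelihood(p_vest, p_laser, p_therm, w=(0.425, 0.425, 0.15)):
    """Weighted sum of the three detector likelihoods; weights sum to one."""
    return w[0] * p_vest + w[1] * p_laser + w[2] * p_therm
```

With equal vest and laser weights this reduces to the "same weight" configuration compared against in the experimental results.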
4. Experimental evaluation
This section describes the experimental evaluation of our approach: the tests carried out in order to assess the performance of the different feature detectors considered, with regard to the task described in the previous sections.
We quantitatively evaluate our algorithm using data collected with a mobile robot moving in a Science museum.
During the experiments, the robot was remotely controlled. We asked different people to walk naturally in front of the robot. We collected two sets of data (320×240 RGB-D image + 284 laser readings + 32×31 thermal image) at a frequency of 4Hz in different areas of the museum. The first dataset contains 2862 samples and the second 2300.
Figure 6 shows some images taken by the robot. As can be seen, lighting conditions affect the image processing, as there are glass corridors in the museum (Figure 6 (a) and Figure 6 (b)). In addition, some aesthetic elements are detected as people, such as the big figure in the door in Figure 6 (c).

Eureka! Science Museum
For all the datasets, we hand-annotated the position of the person to be tracked, selecting the centre of the vest as the target point. From the depth images we consider a bound of 20 pixels around this point in order to extract the distance from the robot. The annotation is provided for four images per second. The picture in the right corner of Figure 6 (d) shows a people-following sequence where the annotated position is represented.
In order to establish the relative position between the target person and the robot, a three-dimensional cylindrical coordinate system is defined, taking into account the posed problem and the available sensors, as shown in Figure 7. Each target position is defined by three values (θ, ρ, z): the angle, the distance to the robot and the height.

Robot coordinate system
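A detection in the RGB-D image can be mapped into these cylindrical coordinates as sketched below. The horizontal field of view (~57° for the Kinect) and the 320-pixel image width are assumed values for illustration, not calibrated parameters from the paper.

```python
import math

# Assumed sensor parameters, for illustration only.
H_FOV = math.radians(57.0)  # Kinect horizontal field of view
IMG_W = 320                 # width of the collected RGB-D images

def pixel_to_cylindrical(u, depth_m, height_m=0.0):
    """Map an image column u and measured depth to (theta, rho, z)."""
    # Bearing of column u relative to the optical axis, assuming a
    # simple linear pixel-to-angle mapping.
    theta = (u - IMG_W / 2) / IMG_W * H_FOV
    return theta, depth_m, height_m
```

The laser and thermopile cues can be expressed in the same (θ, ρ, z) frame, which is what allows the particle filter to fuse them.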
Figure 8 represents the range of angle and depth data used in the evaluation process.

Annotated ground truth data for the two datasets used in the experiments (angle and depth position of the target person). The target persons are in an angle range of 50-320 pixels and a depth range of 100-200cm.
5. Experimental results
The performance of the people tracking system is evaluated in terms of estimation error, in distance and in angle, during person-following tasks. Table 1 and Table 2 summarize the mean and standard deviation of the errors obtained by the different tracking systems and their combinations. The evaluation has been performed using the detectors individually (only legs, only vest, only thermal) as well as in weighted combinations.

Mean errors in angle estimation for the different combinations used in the two datasets.

Mean errors in depth estimation for the different combinations used in the two datasets.
Results in terms of error in angle (degrees). L refers to leg detection, V to vest detection and T to thermal detection. The number before each letter is the weight given to that detector.
Results in terms of error in depth (m). L refers to leg detection, V to vest detection and T to thermal detection. The number before each letter is the weight given to that detector. The thermopile sensor is not used to estimate the depth of the tracked person.
Laser and thermal detection individually provide the worst results, as does the combination of the two. The thermal sensor in combination with the rest of the sensors seems to deteriorate the estimates: the laser and vest combination performs better than laser, vest and thermal with equal weights. However, the final combination, with the thermal information weighted at 0.15, improves on the vest and laser combination.
From the results achieved in the working range (1 to 2m distance to the target), it is clear that the best overall behaviour is obtained by the particle filter combining all the sensory cues with the final weights.
6. Conclusions
In this paper, we have introduced a multimodal approach to detect and track people in indoor spaces from a mobile platform. This approach has been designed to manage three kinds of input images, colour, depth and temperature, to detect people. As shown in the experimental evaluation, using complementary sensors and fusing them with a particle filter yields robust and accurate person detection.
In the near future we aim:
To develop improved detectors combining/fusing visual cues using particle filter strategies, including face recognition and motion information, in order to track people's gestures.
To improve the algorithm's parameter adjustment without hand tuning, using machine learning approaches.
To integrate the tracker with the robot's navigation and planning abilities, in order to explicitly consider humans in the loop during robot movement.
To extend the system to other scenarios: this first implementation can be extended toward outdoor scenarios.
