Abstract
Keywords
Introduction
Human action recognition supported by vision-based systems, ambulatory systems, or wireless sensor networks has tremendous potential in the areas of healthcare or wellbeing monitoring.1,2 It is also driven by growing real-world application needs in such areas as ambient-assisted living and security surveillance.3
There are a number of reasons why human action recognition is a very challenging problem. For instance, human actions or activities depend on the coordinated movement of the human joints; that is, the same action can be performed in multiple configurations allowed by the joints involved in the action.
Automatic recognition of human actions in naturalistic conditions, particularly using wearable sensors, is still an open research problem in the field of pervasive computing.4 Currently, action recognition provides information about the behavior and habits of users that enables computing systems to assist them with their daily tasks.5
Some of the advantages of using wearable sensors over video systems for capturing human motion are their robustness to occlusion and to changes in lighting, as well as their portability; visual systems, in contrast, require fairly specific settings to work appropriately.6 Compared with vision-based approaches, the wearable sensor-based approach is effective and relatively inexpensive for data capture and action recognition for certain types of human actions, mainly movements involving the upper and lower limbs, such as walking or running.
Wearable inertial and magnetic sensors have been used in some clinical applications and for monitoring human activities outside clinical environments. These studies are quite diverse and comprise the estimation of spatiotemporal gait parameters and assessment of gait abnormalities,7,8 the recognition of meaningful human expressions involving the hands,9,10 arms,11,12 or body,13,14 and fall detection,15,16 among others. Inertial orientation tracking makes it possible to capture the movements of a person in real time by fixing the wearable sensors to her or his body.
Recently, several studies have focused on the recognition of actions using wearable inertial sensors,17,18 in which raw sensor data are used to build classification models; in a few of them, high-level representations are obtained that are directly related to anatomical characteristics of the human body, some of which have been proposed and studied in the field of computer vision.19–21
This research focuses on extracting a set of features related to human motion, in particular the motion of the upper and lower limbs, using wearable inertial sensors to recognize actions performed in daily living environments, and on contrasting the results with those obtained using features extracted from the raw sensor data.
The remainder of this article is organized as follows. The following section presents the most relevant works related to this study. Section “Methods” details the proposed method to recognize human actions, and the performance of the classifiers is evaluated in section “Experimental results.” Finally, the article is concluded with potential future work in the section “Conclusion.”
Related work
The use of wearable inertial sensors placed on the human body has allowed the recognition of human actions in structured environments under controlled conditions, for example, laboratories or smart homes,22 in which the test subjects are commonly asked to perform a set of predetermined actions, or in which the variability of environmental conditions is reduced. In other studies, data capture is made in real-life settings or unstructured environments, for example, offices or houses,23 in which there is no specific request for actions to be performed and external factors can interfere, such as the intervention of other subjects.
The inertial sensors used for data capture can be embedded in devices such as smartphones24 or smartwatches,25 or they can be encapsulated with other elements, making them portable, as an inertial measurement unit (IMU). From the data obtained by these devices, methods for processing the data can be divided into two approaches: (i) studies in which the raw signals of the sensors are used individually (low-level data), for example, raw acceleration signals, and (ii) studies that combine the raw signals to obtain a representation of the human body in which its anatomical constraints are considered (high-level data), for example, joint angle signals.
Most studies directly use low-level data as input to the proposed action recognition techniques. In recent studies, acceleration data from smartphones have been used: features based on the time and frequency domains were extracted and used with multiple classifiers to recognize five actions;26 convolutional neural networks (CNN) were used for local feature extraction together with simple statistical features to classify six different actions from two data sets;27 and histograms of gradients and a centroid signature-based Fourier descriptor were used to extract features and classify 6 and 13 actions from two data sets using support vector machines (SVM) and k-nearest neighbors (KNN).
Few studies have been identified in which high-level data are used, and all of them use data from IMUs. In two of these studies, actions in a structured environment are recognized. In the first work,33 eating, drinking, and horizontal reaching were classified based on elbow orientation, elbow position, and wrist position relative to the shoulder; in the training stage, the extracted features were clustered.
In two works, the actions were performed outside structured environments. In one work,35 nine everyday and fitness actions were classified based on time- and frequency-domain features extracted from the orientation of the torso, shoulder, and elbow, using decision trees as the inference algorithm; the overall performance of the classifier was 87.18%. Ahmadi et al.36 proposed using the discrete wavelet transform in conjunction with a random forest (RF) algorithm based on flexion/extension of the knees to classify six sports actions performed in an outdoor training environment; a classification accuracy of 98.3% was achieved.
In one work,37 both low-level data (linear accelerations and angular velocities) and high-level data (joint orientations) were used to recognize five activities of daily living (ADL): sit down (on a chair), stand up (from a chair), reach (ground, mid, high), walk, and turn. The method uses a nonlinear transform and an adaptive threshold to detect the peaks that correspond to the actions, achieving 90% accuracy.
Regarding the sensor configurations of the previously cited works, only two sensors were used in two of them,33,36 five sensors in one work,35 and 17 sensors in the two most recent works.34,37 Only in two studies were the proposed methods applied to people with injuries36 or elderly people diagnosed with early stages of Parkinson's disease;37 the remaining works use data from test subjects without apparent mobility problems.
This research focuses on classifying a set of ADL, such as functional mobility, and instrumental activities of daily living (IADL), such as preparing meals, performed by test subjects in their homes under naturalistic conditions. The joint angles of the upper and lower limbs are estimated using information from five sensors placed on the back, right upper arm, right forearm, left thigh, and left leg. A set of features related to joint movements is extracted from the orientation signals (high-level data), in addition to a set of features from the acceleration signals (low-level data), and both sets are used to build classifiers using four inference algorithms: Naive Bayes (NB), KNN, SVM, and RF.
Methods
In Figure 1, the methodological steps proposed to recognize human actions using low- and high-level data are presented. In summary, the data flow starts by filtering the raw sensor data (low-level data) and converting it into local orientations. In the second step, joint angles (high-level data) are estimated using the local orientations as the input of a kinematic model. Then, a set of features is extracted from the raw signals and the joint angles. Finally, the features are used to build models that classify a set of human actions.
Figure 1. The general scheme of the action recognition method.
Signal processing
In this study, wearable inertial and magnetic sensors are used for tracking human motion. As mentioned before, one advantage of using this kind of sensor over vision-based systems is that wearable sensors can be taken out of controlled environments. The main disadvantages of using inertial sensors are, on the one hand, the drift or cumulative error resulting from the integration of the angular velocity read by the gyroscopes and, on the other hand, that using accelerometers and magnetometers is only accurate for tracking objects that move slowly. Several algorithms38–41 have been developed to fuse the readings of these three types of sensors (accelerometer, magnetometer, and gyroscope) to mitigate the issues of each sensor.
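To illustrate the idea behind such fusion algorithms, the following Python sketch blends integrated gyroscope rates with accelerometer-derived inclination angles through a simple complementary filter; this is a generic sketch, not one of the cited filters,38–41 and the array layout and the weight alpha are assumptions made for illustration.

```python
import numpy as np

def complementary_filter(acc, gyro, dt, alpha=0.98):
    """Fuse accelerometer and gyroscope readings to track roll and pitch.

    acc  : (N, 3) array of accelerations (m/s^2) in the sensor frame.
    gyro : (N, 3) array of angular velocities (rad/s) in the sensor frame.
    dt   : sampling period in seconds.
    alpha: weight given to the integrated gyroscope angle (0..1), assumed value.
    """
    n = acc.shape[0]
    roll = np.zeros(n)
    pitch = np.zeros(n)
    for k in range(1, n):
        # Inclination from gravity as measured by the accelerometer
        # (reliable only when the sensor moves slowly).
        acc_roll = np.arctan2(acc[k, 1], acc[k, 2])
        acc_pitch = np.arctan2(-acc[k, 0], np.hypot(acc[k, 1], acc[k, 2]))
        # Short-term angle change from the gyroscope (drifts over time).
        gyro_roll = roll[k - 1] + gyro[k, 0] * dt
        gyro_pitch = pitch[k - 1] + gyro[k, 1] * dt
        # Blend: trust the gyroscope for fast motion, the accelerometer to correct drift.
        roll[k] = alpha * gyro_roll + (1.0 - alpha) * acc_roll
        pitch[k] = alpha * gyro_pitch + (1.0 - alpha) * acc_pitch
    return roll, pitch
```

A full orientation estimate additionally uses the magnetometer to resolve heading; the cited filters handle this with quaternion-based formulations.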
The most widely used wearable sensors for human action recognition are accelerometers.42,43 In particular, acceleration is used for measuring motion; hereafter, the acceleration data are referred to as low-level data.
The wearable sensors can be placed on specific anatomical references of the human body, for example, the right forearm, using velcro straps to firmly attach the sensor to the segment. As an example, Figure 2(a) shows the raw acceleration signals captured during a Mouth care action.
Figure 2. (a) Low-level (linear acceleration signals) and (b) high-level (joint angle signals) data during a Mouth care action.
The raw signals are first filtered to reduce sensor noise, and the local orientation of each sensor is then estimated by fusing its accelerometer, gyroscope, and magnetometer readings. From the configuration of sensors placed on the body segments, the degrees of freedom of the joints to be tracked are obtained.
Joint angles estimation
The human body is composed of bones linked by joints, forming the skeleton, and covered by soft tissue, such as muscles.50
To obtain a movement representation according to the anatomical constraints of the human joints, the local orientations of the sensors are used as the input of a kinematic model of the human body. From this model, a set of joint angles of the upper and lower limbs is estimated. For illustration purposes, Figure 2(b) shows the joint angle signals captured during a Mouth care action.
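As a minimal sketch of how a joint angle can be derived from the local orientations of two adjacent segments, assuming unit quaternions in (w, x, y, z) order (the article's exact kinematic model is not reproduced here), the relative orientation between the two sensors can be computed and converted into a rotation angle:

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a unit quaternion (w, x, y, z)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_mult(q1, q2):
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def joint_angle(q_proximal, q_distal):
    """Rotation angle (rad) between two adjacent segments.

    The relative orientation q_rel rotates the proximal frame into the
    distal frame; its rotation angle is a simple joint-angle estimate.
    """
    q_rel = quat_mult(quat_conj(q_proximal), q_distal)
    w = np.clip(abs(q_rel[0]), -1.0, 1.0)  # clamp to avoid arccos domain errors
    return 2.0 * np.arccos(w)
```

For example, an elbow flexion/extension angle can be approximated in this way from the orientations of the right upper arm and right forearm sensors.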
Feature extraction
In the feature extraction step, the signals are divided into segments, and each segment is reduced to a vector of representative features.
The type of features to be extracted depends on the actions to be recognized. The features selected for recognizing actions in this study are divided into two groups: signal-based features and high-level features. Signal-based features are computed directly from the sensor signals, for example, time- and frequency-domain statistics.
Even though signal-based features have been widely used in the action recognition field, these features do not describe the mobility of human segments or joints, nor do they consider the relationship between them. One of the advantages of using a kinematic model for tracking human motion is that the signals of the joints can be characterized in terms of movements. This representation, based on anatomical terms of movement, can be used not only for classification purposes but also for describing how people perform the actions; these descriptors are from now on called high-level features.
In addition to the signal-based features, a set of high-level features related to the movements of the joints is extracted from the joint angle signals.

High-level features extracted from the joint angle signals.
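As a minimal sketch of the two groups of features, assuming simple time-domain statistics for the signal-based group and range-of-motion style descriptors for the high-level group (the article's exact feature lists are not reproduced here):

```python
import numpy as np

def signal_based_features(window):
    """Simple time-domain statistics per axis of an (N, 3) signal window."""
    feats = []
    for axis in range(window.shape[1]):
        x = window[:, axis]
        feats.extend([x.mean(), x.std(), x.min(), x.max()])
    return np.array(feats)

def high_level_features(joint_angle_window):
    """Movement-related descriptors of a 1-D joint-angle signal (degrees)."""
    x = np.asarray(joint_angle_window, dtype=float)
    range_of_motion = x.max() - x.min()        # total arc of motion reached
    mean_angle = x.mean()                      # average joint position
    mean_change = np.abs(np.diff(x)).mean()    # average angular change per sample
    return np.array([range_of_motion, mean_angle, mean_change])

# Example (assumed arrays): acc_window is an (N, 3) raw acceleration segment and
# elbow_window is an (N,) elbow flexion angle segment for one action instance.
# feature_vector = np.concatenate([signal_based_features(acc_window),
#                                  high_level_features(elbow_window)])
```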
Action classification
A supervised classification approach was used in the last step of the proposed method and is divided into two steps: (i) training and (ii) testing. In the first step, a classification model is learned by an inference algorithm using as input the vectors of features extracted from the segmented signals, also known as the training set. In the second step, the learned model is used to assign a class label to each feature vector of the testing set.
Four inference algorithms were selected because they are appropriate for dealing with problems involving unbalanced data.56,57 First, NB is a Bayesian classification method that assumes independence between the variables or features, based on conditional probabilities.58 Second, KNN classifies new instances according to the class with the greatest number of closest neighbors in the training set,59 where the number of neighbors k must be selected. SVM and RF, introduced in the related work, complete the four algorithms.
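As an illustrative sketch of this supervised classification step, assuming a scikit-learn implementation and placeholder hyperparameters (the article does not specify either), the four inference algorithms can be trained and compared with cross-validation on the extracted feature vectors:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

def evaluate_classifiers(X, y, n_neighbors=3, folds=10):
    """Compare NB, KNN, SVM, and RF on feature vectors X with labels y.

    n_neighbors and folds are placeholder values; the article does not
    report the exact settings used.
    """
    models = {
        "NB": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=n_neighbors),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results
```

Stratified folds are used here because they preserve the class proportions of unbalanced data sets across the training and testing splits.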
Experimental results
Data
An experimental protocol was designed and applied in the daily living environments of five test subjects (26.2 ± 4.4 years). The subjects were asked to perform their daily activities at will, that is, they did not perform any specific action in any particular order. Five LPMS-B wearable sensors (LP-Research, Tokyo, Japan) were placed on the body of the subjects: on the back, right upper arm, right forearm, left thigh, and left leg.
Ten different actions, divided into 90 instances, were selected from the recordings: Walking, Ascending stairs, Descending stairs, Standing, Sitting, Eating, Cooking, Doing housework, Grooming, and Mouth care.
Sequence of a Mouth care action of a test subject. Top images were captured by Google Glass and bottom images were captured by a GoPro camera.
Results
A set of features was extracted from the low-level (raw acceleration signals) and high-level (joint angle signals) data of the recorded action instances.
The classification results, divided into eight data treatments, are summarized in Table 1. The first treatments correspond to the use of low-level data with signal-based features: (i) raw accelerometer data only and (ii) raw accelerometer data together with the magnitude of the acceleration.
Table 1. Correctly classified instances (%) using low- and high-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
The best results using low-level data were obtained using only raw acceleration data, with 73.3% of instances correctly classified overall. Incorporating the magnitude of the acceleration only improves the result of RF, and the overall performance of the classifiers does not exceed that obtained using the raw acceleration data alone.
Regarding the high-level data, the modified set of high-level features, combined with the signal-based features extracted from the joint angle signals, yielded better overall results than the low-level data treatments.
The results obtained when combining low- and high-level data, which corresponds to the last treatment, not only improve the overall classification to 88.5%, higher than that obtained by the rest of the data treatments, but also reach 96.7% using SVM, which is the best classification result.
To highlight the differences between low- and high-level data when classifying each of the studied human actions, Tables 2 and 3 show the sensitivity (true positive rate) and specificity (true negative rate) obtained for each action.
Table 2. Classification results for each action using low-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
Table 3. Classification results for each action using high-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
From Table 2, the actions with the highest rate of correctly classified instances using low-level data were Walking and Mouth care (overall >0.9), while the instances of Descending stairs were not correctly classified by any model. The action with the lowest specificity was Walking, which indicates that the number of instances incorrectly classified as Walking was the highest. In contrast, from Table 3, the actions with the highest sensitivity using high-level data were Eating, Walking, Doing housework, and Standing (overall >0.9), while Ascending stairs was the action with the worst rate. Once again, Walking was the action with the lowest specificity.
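As a small sketch of how the per-action sensitivity and specificity reported in Tables 2 and 3 can be derived from a confusion matrix (the row/column convention is an assumption):

```python
import numpy as np

def per_class_sensitivity_specificity(conf_matrix):
    """Per-class sensitivity and specificity from a square confusion matrix.

    Rows are assumed to be true classes and columns predicted classes.
    """
    cm = np.asarray(conf_matrix, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly classified instances per class
    fn = cm.sum(axis=1) - tp         # missed instances of each class
    fp = cm.sum(axis=0) - tp         # instances wrongly assigned to the class
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)     # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return sensitivity, specificity
```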
Figure 6 shows the box-plots corresponding to the sensitivity values obtained using low- and high-level data from Tables 2 and 3. As can be noted, the true positive rate of five actions, Doing housework, Eating, Sitting, Standing, and Walking, is clearly higher using high-level data than using low-level data. The Walking action could be recognized by the classifiers, regardless of the data level, with a rate close to 1.0 and the lowest combined dispersion. Only two actions were better recognized using low-level data: Mouth care and Ascending stairs.

Finally, Figure 7 shows the confusion matrices of the best classifier for each level of data: the KNN classifier built using signal-based features extracted from the raw acceleration signals (Figure 7(a)), the SVM classifier built using high-level and signal-based features extracted from the joint angle signals (Figure 7(b)), and the SVM classifier built using high-level and signal-based features extracted from both the raw acceleration and joint angle signals (Figure 7(c)). In the matrix shown in Figure 7(a), 20/90 instances were misclassified and 40% of them were misclassified as Walking; in particular, the actions with the highest proportion of incorrectly classified instances were Grooming, Descending stairs, and Standing. In the matrix of Figure 7(b), 6/90 instances were misclassified; the action with the highest proportion of incorrectly classified instances was Ascending stairs. In the matrix of Figure 7(c), 3/90 instances were misclassified. In all three classifiers analyzed, there was at least one misclassified instance of Cooking and one of Descending stairs.
Figure 7. Confusion matrices for classifiers built using (a) low-level data (KNN with signal-based features extracted from raw acceleration signals), (b) high-level data (SVM with high-level and signal-based features extracted from joint angle signals), and (c) both data levels (SVM with all feature sets).
Discussion
The main advantage of using low-level data is that no additional processing is required to represent movement. Although the execution times of the training and testing phases using low-level data and high-level data are similar, the time required to extract the high-level features is longer than the time required to extract the signal-based features (see Table 4). Even so, a series of assumptions has to be made to associate the inertial measurements with human movement. The magnitude of the acceleration is useful as a technique to deal with the displacement of the sensors when they are worn on the human body, or in case the anatomical references to which the sensors are attached are not exactly the same every time. However, when merging the information of the three axes, some relevant information is lost with respect to the plane in which the movements are performed.
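For reference, the acceleration magnitude mentioned above is assumed to be the Euclidean norm of the three axes,

$$\lVert a \rVert = \sqrt{a_x^2 + a_y^2 + a_z^2},$$

which is insensitive to how the sensor is oriented on the segment but, as noted, does not retain the plane in which the movement occurs.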
Table 4. Execution time (s) in the feature extraction step.
All the experiments were run on a PC with a 6-core CPU (Intel Xeon Bronze, 1.7 GHz), 16 GB of RAM, and a 2 TB hard drive.
As shown in Table 1, the average classification rate does not improve when the acceleration magnitude data are included.
Regarding the rest of the classifiers, KNN scores its best result (91.1%) using all sets of high-level features. For its part, RF obtains the same classification result (88.9%) using only signal-based features from high-level data as when using all the feature sets of both data levels; the same occurs with NB (80%), which obtains its best result using the high-level features.
The new set of high-level features contributed to improving the classification results when combined with the signal-based features.
Due to the class imbalance, the instances of Ascending stairs and Descending stairs were erroneously classified in most cases, which is confirmed by the low specificity of the Walking action, the action with the highest number of instances. Despite this, by combining low-level and high-level data, most instances of Ascending stairs and Descending stairs could be correctly classified, while the incorrectly classified instances were confused with Walking, an action with similar locomotion characteristics.
Conclusion
In this research, a set of low- and high-level features was used to classify a set of human actions performed by people in real settings. The proposed set of features was used to recognize a set of actions performed by five test subjects in naturalistic conditions, and its discriminant capability under different conditions was analyzed and contrasted. The average classification rate of the four classifiers built with the proposed set of low- and high-level features for recognizing the ten actions was 88.5%.
One of the advantages of using descriptive features, both low and high level, based on the signals is that they can give information about how the activities were carried out. In the case of the acceleration signals, the features describe the speed/intensity with which the limbs moved, whereas the high-level features describe the total arc of motion reached by the limbs when performing the activities.
To further evaluate the proposed techniques for human action recognition, they will be applied to publicly available data sets that provide orientation data of the limbs of people, that is, rotation matrices, Euler angles, or quaternions. The orientation data can be captured by different types of motion sensors, such as cameras that track the skeleton of people. We also plan to incorporate more sensors into the current sensor network, to capture the motion of more segments of all the limbs, since the current number of sensors corresponds to a minimal configuration. Finally, concurrent and interleaved actions will be considered, because these types of actions are closer to people's daily living, in which interaction with the environment is necessary and essential.
