Abstract
Keywords
Introduction
Human action recognition supported by vision-based systems, ambulatory systems, or wireless sensor networks has tremendous potential in the areas of healthcare or wellbeing monitoring.1,2 It is also driven by growing real-world application needs in such areas as ambient-assisted living and security surveillance.3
There are a number of reasons why human action recognition is a very challenging problem. For instance, human actions or activities depend on the coordinated movement of the human joints; that is, the same action can be performed in multiple configurations allowed by the joints involved in the action.
Automatic recognition of human actions in naturalistic conditions, particularly using wearable sensors, is still an open research problem in the field of pervasive computing.4 Currently, action recognition provides information about the behavior and habits of users that enables computing systems to assist them with their daily tasks.5
Some of the advantages of using wearable sensors over video systems for capturing human motion are their robustness to occlusion and to changes in lighting, as well as their portability; visual systems, in contrast, require fairly specific settings to work appropriately.6 Compared with vision-based approaches, the wearable sensor-based approach is effective and relatively inexpensive for data capture and action recognition for certain types of human actions, mainly movements involving the upper and lower limbs, such as walking or running.
Wearable inertial and magnetic sensors have been used in some clinical applications and for monitoring human activities outside clinical environments. These studies are quite diverse and comprise the estimation of spatiotemporal gait parameters and assessment of gait abnormalities,7,8 the recognition of meaningful human expressions involving the hands,9,10 arms,11,12 or body,13,14 and fall detection,15,16 among others. Inertial orientation tracking makes it possible to capture the movements of a person in real time by fixing the wearable sensors to her or his body.
Recently, several studies have focused on the recognition of actions using wearable inertial sensors,17,18 in which raw sensor data are used to build classification models; in a few of them, high-level representations are obtained that are directly related to anatomical characteristics of the human body, some of which have been proposed and studied in the field of computer vision.19–21
This research focuses on extracting a set of features related to human motion, in particular the motion of the upper and lower limbs, using wearable inertial sensors to recognize actions performed in daily living environments, and on contrasting the results with those obtained using features extracted from the raw sensor data.
The remainder of this article is organized as follows. The following section presents the most relevant works related to this study. Section “Methods” details the proposed method to recognize human actions, and the performance of the classifiers is evaluated in section “Experimental results.” Finally, the article is concluded with potential future work in the section “Conclusion.”
Related work
The use of wearable inertial sensors placed on the human body has allowed the recognition of human actions in structured environments under controlled conditions, for example, laboratories or smart homes,22 in which the test subjects are commonly asked to perform a set of predetermined actions, or in which the variability of environmental conditions is reduced. In other studies, data capture is made in real-life settings or unstructured environments, for example, offices or houses,23 in which there is no specific request for actions to be performed and external factors can interfere, such as the intervention of other subjects.
The inertial sensors used for data capture can be embedded in devices such as smartphones24 or smartwatches,25 or they can be encapsulated with other elements, making them portable, as an inertial measurement unit (IMU). From the data obtained by these devices, methods for processing the data can be divided into two approaches: (i) studies in which the raw signals of the sensors are used individually (low-level data), for example, raw acceleration signals, and (ii) studies that combine the raw signals to obtain a representation of the human body in which its anatomical constraints are considered (high-level data), for example, joint angle signals.
Most studies directly use low-level data as input to the proposed action recognition techniques. In recent studies, acceleration data from smartphones have been used: features based on the time and frequency domains were extracted and used with multiple classifiers to recognize five actions;26 convolutional neural networks (CNN) were used for local feature extraction together with simple statistical features to classify six different actions from two data sets;27 and histograms of gradients and a centroid signature-based Fourier descriptor were used to extract features and classify 6 and 13 actions from two data sets using support vector machines (SVM) and k-nearest neighbors (KNN).
Few studies have been identified in which high-level data are used, and all of them use data from IMUs. In two of these studies, actions in a structured environment are recognized. In the first work,33 eating, drinking, and horizontal reaching were classified based on elbow orientation, elbow position, and wrist position relative to the shoulder; in the training stage, the extracted features were clustered.
In two works, the actions were performed outside structured environments. In one work,35 nine everyday and fitness actions were classified based on time- and frequency-domain features extracted from the orientation of the torso, shoulder, and elbow, using decision trees as the inference algorithm; the overall performance of the classifier was 87.18%. Ahmadi et al.36 proposed using the discrete wavelet transform in conjunction with a random forest (RF) algorithm based on flexion/extension of the knees to classify six sports actions performed in an outdoor training environment; a classification accuracy of 98.3% was achieved.
In one work,37 both low-level data (linear accelerations and angular velocities) and high-level data (joint orientations) were used to recognize five activities of daily living (ADL): sit down (on a chair), stand up (from a chair), reach (ground, mid, high), walk, and turn. The method uses a nonlinear transform and an adaptive threshold to detect the peaks that correspond to the actions, achieving 90% accuracy.
Regarding the sensor configurations of the previously cited works, only two sensors were used in two of them,33,36 five sensors in one work,35 and 17 sensors in the two most recent works.34,37 Only in two studies were the proposed methods applied to people with injuries36 or elderly people diagnosed with early stages of Parkinson's disease;37 the remaining works use data from test subjects without apparent mobility problems.
This research focuses on classifying a set of ADL, such as functional mobility, and instrumental activities of daily living (IADL), such as preparing meals, performed by test subjects in their homes under naturalistic conditions. The joint angles of the upper and lower limbs are estimated using information from five sensors placed on the back, right upper arm, right forearm, left thigh, and left leg. A set of features related to joint movements is extracted from the orientation signals (high-level data), in addition to a set of features from the acceleration signals (low-level data), and both sets are used to build classifiers using four inference algorithms: Naive Bayes (NB), KNN, SVM, and RF.
Methods
In Figure 1, the methodological steps proposed to recognize human actions using low- and high-level data are presented. In summary, the data flow starts by filtering the raw sensor data (low-level data) and converting it into local orientations. In the second step, joint angles (high-level data) are estimated using the local orientations as the input of a kinematic model. Then, a set of features is extracted from the raw signals and the joint angles. Finally, the features are used to build models that classify a set of human actions.
Figure 1. The general scheme of the action recognition method.
Signal processing
In this study, wearable inertial and magnetic sensors are used for tracking human motion. As mentioned before, one advantage of using this kind of sensor over vision-based systems is that wearable sensors can be taken out of controlled environments. The main disadvantages of using inertial sensors are, on the one hand, the drift or cumulative error resulting from the integration of the angular velocity read by the gyroscopes and, on the other hand, that using accelerometers and magnetometers is only accurate for tracking objects that move slowly. Several algorithms38–41 have been developed to fuse the readings of these three types of sensors (accelerometer, magnetometer, and gyroscope) to mitigate the issues of each sensor.
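To illustrate the idea behind such fusion algorithms, the following Python sketch blends integrated gyroscope rates with accelerometer-derived inclination angles through a simple complementary filter; this is a generic sketch, not one of the cited filters,38–41 and the array layout and the weight alpha are assumptions made for illustration.

```python
import numpy as np

def complementary_filter(acc, gyro, dt, alpha=0.98):
    """Fuse accelerometer and gyroscope readings to track roll and pitch.

    acc  : (N, 3) array of accelerations (m/s^2) in the sensor frame.
    gyro : (N, 3) array of angular velocities (rad/s) in the sensor frame.
    dt   : sampling period in seconds.
    alpha: weight given to the integrated gyroscope angle (0..1), assumed value.
    """
    n = acc.shape[0]
    roll = np.zeros(n)
    pitch = np.zeros(n)
    for k in range(1, n):
        # Inclination from gravity as measured by the accelerometer
        # (reliable only when the sensor moves slowly).
        acc_roll = np.arctan2(acc[k, 1], acc[k, 2])
        acc_pitch = np.arctan2(-acc[k, 0], np.hypot(acc[k, 1], acc[k, 2]))
        # Short-term angle change from the gyroscope (drifts over time).
        gyro_roll = roll[k - 1] + gyro[k, 0] * dt
        gyro_pitch = pitch[k - 1] + gyro[k, 1] * dt
        # Blend: trust the gyroscope for fast motion, the accelerometer to correct drift.
        roll[k] = alpha * gyro_roll + (1.0 - alpha) * acc_roll
        pitch[k] = alpha * gyro_pitch + (1.0 - alpha) * acc_pitch
    return roll, pitch
```

A full orientation estimate additionally uses the magnetometer to resolve heading; the cited filters handle this with quaternion-based formulations.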
The most widely used wearable sensors for human action recognition are accelerometers.42,43 In particular, acceleration is used for measuring motion; hereafter, the acceleration data are referred to as low-level data.
The wearable sensors can be placed on specific anatomical references of the human body, for example, the right forearm, using velcro straps to firmly attach the sensor to the segment. As an example, Figure 2(a) shows the raw acceleration signals captured during a Mouth care action.
Figure 2. (a) Low-level (linear acceleration signals) and (b) high-level (joint angle signals) data during a Mouth care action.
The raw signals are first filtered to reduce sensor noise, and the local orientation of each sensor is then estimated by fusing its accelerometer, gyroscope, and magnetometer readings. From the configuration of sensors placed on the body segments, the degrees of freedom of the joints to be tracked are obtained.
Joint angles estimation
The human body is composed of bones linked by joints, forming the skeleton, and covered by soft tissue, such as muscles.50
To obtain a movement representation according to the anatomical constraints of the human joints, the local orientations of the sensors are used as the input of a kinematic model of the human body. From this model, a set of joint angles of the upper and lower limbs is estimated. For illustration purposes, Figure 2(b) shows the joint angle signals captured during a Mouth care action.
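As a minimal sketch of how a joint angle can be derived from the local orientations of two adjacent segments, assuming unit quaternions in (w, x, y, z) order (the article's exact kinematic model is not reproduced here), the relative orientation between the two sensors can be computed and converted into a rotation angle:

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a unit quaternion (w, x, y, z)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_mult(q1, q2):
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def joint_angle(q_proximal, q_distal):
    """Rotation angle (rad) between two adjacent segments.

    The relative orientation q_rel rotates the proximal frame into the
    distal frame; its rotation angle is a simple joint-angle estimate.
    """
    q_rel = quat_mult(quat_conj(q_proximal), q_distal)
    w = np.clip(abs(q_rel[0]), -1.0, 1.0)  # clamp to avoid arccos domain errors
    return 2.0 * np.arccos(w)
```

For example, an elbow flexion/extension angle can be approximated in this way from the orientations of the right upper arm and right forearm sensors.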
Feature extraction
In the feature extraction step, the signals are divided into segments, and each segment is reduced to a vector of representative features.
The type of features to be extracted depends on the actions to be recognized. The features selected for recognizing actions in this study are divided into two groups: signal-based features and high-level features. Signal-based features are computed directly from the sensor signals, for example, time- and frequency-domain statistics.
Even though signal-based features have been widely used in the action recognition field, these features do not describe the mobility of human segments or joints, nor do they consider the relationship between them. One of the advantages of using a kinematic model for tracking human motion is that the signals of the joints can be characterized in terms of movements. This representation, based on anatomical terms of movement, can be used not only for classification purposes but also for describing how people perform the actions; these descriptors are from now on called high-level features.
In addition to the signal-based features, a set of high-level features related to the movements of the joints is extracted from the joint angle signals.

High-level features extracted from the joint angle signals.
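As a minimal sketch of the two groups of features, assuming simple time-domain statistics for the signal-based group and range-of-motion style descriptors for the high-level group (the article's exact feature lists are not reproduced here):

```python
import numpy as np

def signal_based_features(window):
    """Simple time-domain statistics per axis of an (N, 3) signal window."""
    feats = []
    for axis in range(window.shape[1]):
        x = window[:, axis]
        feats.extend([x.mean(), x.std(), x.min(), x.max()])
    return np.array(feats)

def high_level_features(joint_angle_window):
    """Movement-related descriptors of a 1-D joint-angle signal (degrees)."""
    x = np.asarray(joint_angle_window, dtype=float)
    range_of_motion = x.max() - x.min()        # total arc of motion reached
    mean_angle = x.mean()                      # average joint position
    mean_change = np.abs(np.diff(x)).mean()    # average angular change per sample
    return np.array([range_of_motion, mean_angle, mean_change])

# Example (assumed arrays): acc_window is an (N, 3) raw acceleration segment and
# elbow_window is an (N,) elbow flexion angle segment for one action instance.
# feature_vector = np.concatenate([signal_based_features(acc_window),
#                                  high_level_features(elbow_window)])
```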
Action classification
A supervised classification approach was used in the last step of the proposed method and is divided into two steps: (i) training and (ii) testing. In the first step, a classification model is learned by an inference algorithm using as input the vectors of features extracted from the segmented signals, also known as the training set. In the second step, the learned model is used to assign a class label to each feature vector of the testing set.
Four inference algorithms were selected because they are appropriate for dealing with problems involving unbalanced data.56,57 First, NB is a Bayesian classification method that assumes independence between the variables or features, based on conditional probabilities.58 Second, KNN classifies new instances according to the class with the greatest number of closest neighbors in the training set,59 where the number of neighbors k must be selected. SVM and RF, introduced in the related work, complete the four algorithms.
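As an illustrative sketch of this supervised classification step, assuming a scikit-learn implementation and placeholder hyperparameters (the article does not specify either), the four inference algorithms can be trained and compared with cross-validation on the extracted feature vectors:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

def evaluate_classifiers(X, y, n_neighbors=3, folds=10):
    """Compare NB, KNN, SVM, and RF on feature vectors X with labels y.

    n_neighbors and folds are placeholder values; the article does not
    report the exact settings used.
    """
    models = {
        "NB": GaussianNB(),
        "KNN": KNeighborsClassifier(n_neighbors=n_neighbors),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results
```

Stratified folds are used here because they preserve the class proportions of unbalanced data sets across the training and testing splits.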
Experimental results
Data
An experimental protocol was designed and applied in the daily living environments of five test subjects (26.2 ± 4.4 years). The subjects were asked to perform their daily activities at will, that is, they did not perform any specific action in any particular order. Five LPMS-B wearable sensors (LP-Research, Tokyo, Japan) were placed on the body of the subjects: on the back, right upper arm, right forearm, left thigh, and left leg.
Ten different actions, divided into 90 instances, were selected from the recordings: Walking, Ascending stairs, Descending stairs, Standing, Sitting, Eating, Cooking, Doing housework, Grooming, and Mouth care.
Sequence of a Mouth care action of a test subject. Top images were captured by Google Glass and bottom images were captured by a GoPro camera.
Results
A set of features was extracted from the low-level (raw acceleration signals) and high-level (joint angle signals) data of the recorded action instances.
The classification results, divided into eight data treatments, are summarized in Table 1. The first treatments correspond to the use of low-level data with signal-based features: (i) raw accelerometer data only and (ii) raw accelerometer data together with the magnitude of the acceleration.
Table 1. Correctly classified instances (%) using low- and high-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
The best results using low-level data were obtained using only raw acceleration data, with 73.3% of instances correctly classified overall. Incorporating the magnitude of the acceleration only improves the result of RF, and the overall performance of the classifiers does not exceed that obtained using the raw acceleration data alone.
Regarding the high-level data, the modified set of high-level features, combined with the signal-based features extracted from the joint angle signals, yielded better overall results than the low-level data treatments.
The results obtained when combining low- and high-level data, which corresponds to the last treatment, not only improve the overall classification to 88.5%, higher than that obtained by the rest of the data treatments, but also reach 96.7% using SVM, which is the best classification result.
To highlight the differences between low- and high-level data when classifying each of the studied human actions, Tables 2 and 3 show the sensitivity (true positive rate) and specificity (true negative rate) obtained for each action.
Table 2. Classification results for each action using low-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
Table 3. Classification results for each action using high-level data.
NB: Naive Bayes; KNN: k-nearest neighbors; SVM: support vector machine; RF: random forest.
From Table 2, the actions with the highest rate of correctly classified instances using low-level data were Walking and Mouth care (overall >0.9), while the instances of Descending stairs were not correctly classified by any model. The action with the lowest specificity was Walking, which indicates that the number of instances incorrectly classified as Walking was the highest. In contrast, from Table 3, the actions with the highest sensitivity using high-level data were Eating, Walking, Doing housework, and Standing (overall >0.9), while Ascending stairs was the action with the worst rate. Once again, Walking was the action with the lowest specificity.
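As a small sketch of how the per-action sensitivity and specificity reported in Tables 2 and 3 can be derived from a confusion matrix (the row/column convention is an assumption):

```python
import numpy as np

def per_class_sensitivity_specificity(conf_matrix):
    """Per-class sensitivity and specificity from a square confusion matrix.

    Rows are assumed to be true classes and columns predicted classes.
    """
    cm = np.asarray(conf_matrix, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly classified instances per class
    fn = cm.sum(axis=1) - tp         # missed instances of each class
    fp = cm.sum(axis=0) - tp         # instances wrongly assigned to the class
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)     # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return sensitivity, specificity
```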
Figure 6 shows the box-plots corresponding to the sensitivity values obtained using low- and high-level data from Tables 2 and 3. As can be noted, the true positive rate of five actions, Doing housework, Eating, Sitting, Standing, and Walking, is clearly higher using high-level data than using low-level data. The Walking action could be recognized by the classifiers, regardless of the data level, with a rate close to 1.0 and the lowest combined dispersion. Only two actions were better recognized using low-level data: Mouth care and Ascending stairs.

Finally, Figure 7 shows the confusion matrices of the best classifier for each level of data: the KNN classifier built using signal-based features extracted from the raw acceleration signals (Figure 7(a)), the SVM classifier built using high-level and signal-based features extracted from the joint angle signals (Figure 7(b)), and the SVM classifier built using high-level and signal-based features extracted from both the raw acceleration and joint angle signals (Figure 7(c)). In the matrix shown in Figure 7(a), 20/90 instances were misclassified and 40% of them were misclassified as Walking; in particular, the actions with the highest proportion of incorrectly classified instances were Grooming, Descending stairs, and Standing. In the matrix of Figure 7(b), 6/90 instances were misclassified; the action with the highest proportion of incorrectly classified instances was Ascending stairs. In the matrix of Figure 7(c), 3/90 instances were misclassified. In all three classifiers analyzed, there was at least one misclassified instance of Cooking and one of Descending stairs.
Figure 7. Confusion matrices for classifiers built using (a) low-level data (KNN with signal-based features extracted from raw acceleration signals), (b) high-level data (SVM with high-level and signal-based features extracted from joint angle signals), and (c) both data levels (SVM with all feature sets).
Discussion
The main advantage of using low-level data is that no additional processing is required to represent movement. Although the execution times of the training and testing phases using low-level data and high-level data are similar, the time required to extract the high-level features is longer than the time required to extract the signal-based features (see Table 4). Even so, a series of assumptions has to be made to associate the inertial measurements with human movement. The magnitude of the acceleration is useful as a technique to deal with the displacement of the sensors when they are worn on the human body, or in case the anatomical references to which the sensors are attached are not exactly the same every time. However, when merging the information of the three axes, some relevant information is lost with respect to the plane in which the movements are performed.
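For reference, the acceleration magnitude mentioned above is assumed to be the Euclidean norm of the three axes,

$$\lVert a \rVert = \sqrt{a_x^2 + a_y^2 + a_z^2},$$

which is insensitive to how the sensor is oriented on the segment but, as noted, does not retain the plane in which the movement occurs.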
Table 4. Execution time (s) in the feature extraction step.
All the experiments were run on a PC with a 6-core CPU (Intel Xeon Bronze, 1.7 GHz), 16 GB of RAM, and a 2 TB hard drive.
As shown in Table 1, the average classification rate does not improve when the acceleration magnitude data are included.
Regarding the rest of the classifiers, KNN scores its best result (91.1%) using all sets of high-level features. For its part, RF obtains the same classification result (88.9%) using only signal-based features from high-level data as when using all the feature sets of both data levels; the same occurs with NB (80%), which obtains its best result using the high-level features.
The new set of high-level features contributed to improving the classification results when combined with the signal-based features.
Due to the class imbalance, the instances of Ascending stairs and Descending stairs were erroneously classified in most cases, which is confirmed by the low specificity of the Walking action, the action with the highest number of instances. Despite this, by combining low-level and high-level data, most instances of Ascending stairs and Descending stairs could be correctly classified, while the incorrectly classified instances were confused with Walking, an action with similar locomotion characteristics.
Conclusion
In this research, a set of low- and high-level features was used to classify a set of human actions performed by people in real settings. The proposed set of features was used to recognize a set of actions performed by five test subjects in naturalistic conditions, and its discriminant capability under different conditions was analyzed and contrasted. The average classification rate of the four classifiers built with the proposed set of low- and high-level features for recognizing the ten actions was 88.5%.
One of the advantages of using descriptive features, both low and high level, based on the signals is that they can give information about how the activities were carried out. In the case of the acceleration signals, the features describe the speed/intensity with which the limbs moved, whereas the high-level features describe the total arc of motion reached by the limbs when performing the activities.
To further evaluate the proposed techniques for human action recognition, they will be applied to publicly available data sets that provide orientation data of the limbs of people, that is, rotation matrices, Euler angles, or quaternions. The orientation data can be captured by different types of motion sensors, such as cameras that track the skeleton of people. We also plan to incorporate more sensors into the current sensor network, to capture the motion of more segments of all the limbs, since the current number of sensors corresponds to a minimal configuration. Finally, concurrent and interleaved actions will be considered, because these types of actions are closer to people's daily living, in which interaction with the environment is necessary and essential.
