Abstract
Keywords
Introduction
Gait analysis is the study of human movement involving the assessment of quantitative parameters of walking.1–3 Chatzaki et al. 3 proposed that gait can differ between individuals due to the relative physical and psychological state of the individual at that time. Each individual has their own unique gait; however, a relatively ‘normative’ gait should be seen in most individuals without neuromuscular pathology. 3 Initially, gait analyses were mostly conducted in laboratory conditions with the use of multi-camera motion capture systems and force plates, in which the gold-standard technologies were the optoelectronic systems.1,3–8 In recent times, the development of wearable sensors has mitigated several issues associated with traditional gait analysis, including the cost of establishing specialised laboratories and equipment, lengthy set-up and post-processing times, and scheduling difficulties for patients attending gait analysis sessions.1,9,10 Wearable sensors are lightweight and can be used outside of laboratory environments, allowing researchers and clinicians to collect indoor and outdoor movement data from the community (‘free-living gait’).1,4,7–11 Individuals are usually instructed to wear the sensors around their waist, on their wrist or elsewhere in contact with the body.1,6,12 Current devices may include a range of motion sensors: gyroscopes, accelerometers, magnetometers, force sensors, goniometers, inclinometers, strain gauges and more.1,4,12 Accurate placement of devices has also been demonstrated to mimic camera motion capture systems by providing simultaneous real-time positional data. 8 Most inertial wearable sensors include integrated accelerometers, which detect linear motion along reference axes, and gyroscopes, which detect angular motion during ambulation by measuring the Coriolis force.1,12 However, wearable sensors have their limitations.
One disadvantage of wearable sensors is false step detection. 13 This occurs when non-gait activity, such as swinging the legs or arms while seated, gives a false impression of periodic movement in a gait cycle. 13 For wireless wearables there are also concerns regarding data losses during transmission to the host computer. 14 Furthermore, Peters et al. 15 noted the need for further reliability testing of wearable inertial devices, especially consumer-grade devices such as the Fitbit, which are more affordable and accessible for patients to purchase than research-grade wearables. Another notable disadvantage that developers aim to mitigate is drift, especially when devices are placed near metal or magnetic fields, owing to their internal magnetometers.16,17
Data collected from multiple motion sensors require a computing system to process the complex input. 4 The output from the wearable devices can be processed through either custom algorithms or artificial intelligence (AI) models. This allows for outpatient day-to-day gait rehabilitation monitoring, providing information for health diagnostics and disease severity management.1,3,4,9,12 With recent advances in computing capabilities, researchers have been looking to incorporate more AI models due to their proposed capability to analyse large datasets and identify complex patterns in gait. 18 Machine learning (ML) models are a subset of AI models, with supervised and unsupervised learning being the two major types.18,19 Supervised learning methods are employed when both the inputs and outputs are known (or ‘labelled’), and these include linear regression, logistic regression (LR) and decision trees.19–21 Unsupervised learning encourages the machine to discover new associations without human interference.19–21 However, processed outcomes from unsupervised learning can include significant variability and can be more time-consuming to filter manually for the desired data outcomes. 18 Some examples of unsupervised learning include hierarchical clustering and self-organising maps.19,20 For gait classification, prediction and analysis, the approaches are similar: data acquisition by wearables, followed by pre-processing (noise removal, background subtraction), feature extraction, feature selection, classification and analysis of outcome. 22
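As an illustration of this pipeline, the sketch below applies each stage (noise removal, feature extraction, classification) to a synthetic accelerometer trace. All signal parameters, the smoothing window and the cadence cut-off are illustrative assumptions rather than values drawn from any included study.

```python
import numpy as np

def moving_average(signal, window=5):
    """Pre-processing step: smooth the raw accelerometer trace (noise removal)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def extract_cadence(signal, fs=100.0, threshold=1.0):
    """Feature extraction: count rising threshold crossings as steps and
    convert to steps per second."""
    above = signal > threshold
    rising_edges = np.sum(~above[:-1] & above[1:])
    return rising_edges / (len(signal) / fs)

# Synthetic 10 s vertical-acceleration trace at 100 Hz: 2 Hz step rhythm + noise.
rng = np.random.default_rng(0)
fs = 100.0
t = np.arange(0, 10, 1 / fs)
raw = 1.5 * np.sin(2 * np.pi * 2.0 * t) + 0.2 * rng.standard_normal(t.size)

smoothed = moving_average(raw)
cadence = extract_cadence(smoothed, fs=fs)

# Classification: a toy rule flagging cadence well below a nominal range.
label = "slow/abnormal" if cadence < 1.5 else "within nominal range"
print(round(cadence, 1), label)
```

In practice each stage would be replaced by a validated filter, a richer feature set and a trained classifier, but the order of operations matches the pipeline described above.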
In many previous works on gait analysis, ML models have been applied to ambulatory gait analysis and optical motion capture systems, whether in conjunction or as separate fields of study.16,17 Ambulatory gait analysis examines ‘normal walking bouts’ in healthy adults, such as walking and running. Regression models are generally preferred to evaluate gait parameters, such as walking speed, in ambulatory gait analysis.17,23,24 Our present study seeks to inform researchers and clinicians of pathological gait analysis in adults using wearable technologies, which we believe is the first of its kind. There are multiple theories behind ‘normal gait’, for example, the ‘six determinants of gait’, ‘inverted pendulum model’ and ‘dynamic walking’. 25 In brief, pathological gait occurs when an injury or pathology affects the biomechanics of gait, leading to increased energy expenditure compared with ‘normal gait’. 25 Some conditions that present with pathological gait include spinal cord injuries, stroke and limb loss, in which there is a reduction in strength and coordination. 25
There is no general consensus on the best-suited algorithm or model for gait analysis. Carcreff et al. 13 discussed that algorithms, which commonly have fixed thresholds, are best used in specific study populations, namely healthy adults or elderly patients without significant gait impairment. The authors proposed that adaptive threshold algorithms can detect the inter-step variability of out-of-laboratory gait monitoring and can outperform some ML models such as random forest (RF) and support vector machine (SVM). 13 The reasoning is that ‘abnormal gait’ has different gait patterns that can give a false impression of peak swing and/or stance phase, as well as abnormally slow gait. 13
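The contrast between fixed and adaptive thresholds can be shown with a minimal step-counting sketch. This is not the algorithm of Carcreff et al.; the signals and the mean-plus-k-standard-deviations rule are hypothetical illustrations of the general idea.

```python
import numpy as np

def count_steps_fixed(signal, threshold):
    """Fixed-threshold detector: count rising crossings of a preset level."""
    above = signal > threshold
    return int(np.sum(~above[:-1] & above[1:]))

def count_steps_adaptive(signal, k=0.5):
    """Adaptive-threshold detector: the level tracks the signal itself
    (mean + k * std), so low-amplitude or slow gait is still captured."""
    threshold = signal.mean() + k * signal.std()
    above = signal > threshold
    return int(np.sum(~above[:-1] & above[1:]))

t = np.arange(0, 10, 0.01)
# Healthy-like gait: strong 2 Hz oscillation; impaired-like gait: weak 1 Hz.
healthy = 1.5 * np.sin(2 * np.pi * 2.0 * t)
impaired = 0.4 * np.sin(2 * np.pi * 1.0 * t)

fixed_level = 1.0  # tuned on the healthy signal
print(count_steps_fixed(healthy, fixed_level))   # 20: finds the healthy steps
print(count_steps_fixed(impaired, fixed_level))  # 0: misses every impaired step
print(count_steps_adaptive(impaired))            # 10: recovers the impaired steps
```

The fixed level, chosen for the healthy signal, misses the impaired gait entirely, whereas a level derived from the signal's own statistics recovers it, which mirrors the argument for adaptive thresholds in impaired populations.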
ML models are subject to overfitting when algorithm parameters are optimised to training data but do not generalise to ‘real-world’ test data.20,21 To avoid overfitting, a portion of the data can be held out for validation while the remaining data set is used for training.20,21
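A minimal sketch of this hold-out approach, using only the standard library; the feature vectors, labels and split fraction are illustrative assumptions.

```python
import random

def train_validation_split(samples, labels, val_fraction=0.2, seed=42):
    """Hold out a fraction of the data for validation; the model never sees
    these samples during training, which exposes overfitting."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(samples) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in val_idx], [labels[i] for i in val_idx])

# 100 hypothetical gait feature vectors with binary labels.
X = [[float(i), float(i % 7)] for i in range(100)]
y = [i % 2 for i in range(100)]

X_train, y_train, X_val, y_val = train_validation_split(X, y)
print(len(X_train), len(X_val))  # 80 20
```

A model whose accuracy on `X_val` is far below its accuracy on `X_train` is overfitted; cross-validation generalises this idea by rotating which portion is held out.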
This systematic review aims to inform current applications of AI and custom algorithms in the health diagnostics of gait-altering pathologies. It also aims to inform clinicians and researchers of the most suitable and reliable ML models for diagnosing pathological gait, reporting on the accuracy and validity of current AI models. As such, the present review identifies the most common algorithms, data and applications, current issues and future possibilities.
Methods
The PRISMA statement guidelines were followed in identifying, screening and selecting studies for inclusion, and extracting data for this present review (Figure 1).

Flow diagram of selection and screening of included studies from databases and registers.
Eligibility criteria
The focus of this review was on articles written in English, published between 1980 and February 2021, that assessed (health diagnostic) AI algorithms developed with data from inertial wearable devices. These AI algorithms were used to identify or classify pathological gait patterns (from healthy control groups). Studies involving non-inertial wearable devices such as robotic apparatus, exoskeletons or ground reaction force sensors were excluded. Reviews, conference abstracts and books were also excluded. See Figure 1 for the PRISMA flowchart and other exclusion reasons.
Search strategy
Relevant studies were identified through a systematic search for published papers in the following scientific databases from the date of inception to February 28, 2021: Medline (OvidSP), Embase (OvidSP) and Web of Science (ClarivateTM). The search ‘concepts’ were
Search strategy.
Study selection
The literature search was completed by two authors (PN and RDF). Titles and abstracts of all studies identified were screened for relevance. Studies that were clearly not relevant based on the title and abstract screen were excluded from the review. The full text of the record was reviewed if the study appeared relevant or was of uncertain relevance, and a third reviewer (RJM) was consulted until a consensus agreement was reached regarding inclusion/exclusion. The full text of all selected relevant records was reviewed, and eligibility was determined using the eligibility criteria defined above. The quality of each included record was assessed by two authors (PN and AL), and relevant information was extracted.
Data extraction
Data were collected regarding participant characteristics, sensor, algorithm, data methods and accuracy for each study included in our review. Participant information included the number and type of participants, age and sex. Information about the type, model, sampling frequency and on-body location of each wearable sensor was also recorded. The pathology of interest, study environment and model features (gait variables: gait phases, spatiotemporal metrics, joint angles) were also recorded, as were walking speed, distance and time. Walking speed was labelled ‘self-selected’ if participants walked at an individually comfortable speed but were required to maintain that speed; if participants were allowed to walk at their own speed, it was labelled ‘not controlled’.
Quality assessment
To assess the quality of included studies, two different tools were used. To evaluate the method of data acquisition from the subjects, we applied a questionnaire inspired by the Critical Appraisal Skills Program (CASP) for Diagnostic Test Studies 23 as below:
We scored the studies as follows: a criterion that was completely fulfilled scored 1.0, a partly fulfilled criterion scored 0.5 and an absent criterion scored 0.0.
Results
Included studies
Database searches identified 1569 relevant studies. After the removal of duplicates, 948 studies remained. A total of 877 references were excluded at title and abstract screening.
A further 49 articles were excluded upon full-text review, leaving a final 22 studies to be included in the qualitative synthesis (Figure 1: PRISMA flowchart). There were nine studies included in the custom algorithms category and 13 studies in the AI category for comparison.
Study population
The most commonly investigated study population was Parkinson's disease (PD), with nine studies comparing participants with mild or unspecified severity of PD with healthy controls or another gait-altering pathology (Table 2). Other study populations included dementia, stroke, ataxia and sports injuries (post-ACL repair, concussion).
Study populations of included studies.
Inertial wearable sensor
Different wearable devices were employed by the included studies to collect gait metrics. The number of sensors, type of sensors, device, location of devices and sampling frequency are evaluated in Table 3. The most commonly used device was the Opal (APDM), with the L3 to L5 lower back being the preferred attachment location (Table 3). The sampling frequencies in the included studies varied from 50 to 160 Hz. Three studies developed custom-made wearable devices for the study: Tedesco et al., 46 Rovini et al. 44 and Tesconi et al. 31
Wearable devices employed by included artificial intelligence studies.
Protocols
All studies included gait tasks of varying design: Timed Up and Go (TUG),29,30,34,35,45,47 6MWT, 32 instrumented stand and walk task 40 and many more. The discriminating factor for the TUG is that it includes the ‘turning phase’. Notably, in Di Lazzaro et al., 35 SVM reached 97% accuracy with features extracted from the Pull Test, TUG, tremor and bradykinesia items. Four studies examined participants at different speeds37,41,44,47 while one study examined participants at maximum speed. 46 Other studies either did not specify gait speed 35 or allowed participants to conduct the task at their preferred speed. Tedesco et al. 46 achieved an accuracy of 73.07% with a multilayer perceptron (MLP) model with participants conducting the task at maximum speed.
AI classification models
The most commonly used AI model was SVM, with other models including RF and neural networks (NNs) (see Figure 2). 40 Rehman et al. 43 used a radial basis function kernel alongside SVM (SVM-RBF) in their study. One study applied a back-propagation artificial NN (BP-ANN) while another applied a convolutional NN (CNN); both BP-ANN and CNN are classified under NNs. Hsu et al. 36 and Tedesco et al. 46 applied MLP, which is a class of feedforward ANN.

Artificial intelligence models employed by included studies.
Discussion
Our study demonstrates that gait analysis using AI models yields promising results in discerning pathological gait from non-pathological gait. AI technology is an emerging field in medicine in which the search for the most fitting algorithm(s) continues. The present study reviewed 22 papers with custom algorithms and AI algorithms.
A total of 18 of the 22 studies were published from 2018 to 2020, suggesting rapidly increasing interest in this topic (Figure 3). The patient populations selected by the papers varied. Nine studies included participants with PD of varying disease severity.10,26,30,34,35,40,42–44 One study compared PD patients with progressive supranuclear palsy (PSP), an atypical Parkinsonian disorder that clinically shares certain features with PD. 34 This poses the potential for misdiagnosis of PD in PSP patients; therefore, the use of AI models with gait analysis has a role in supplementing the clinical diagnosis. Another study included patients with idiopathic hyposmia (IH), a condition associated with an increased risk of developing PD. 44 Other study populations included post-stroke patients,32,36,39 post-concussion athletes27,36 and dementia patients. 10

Increasing research interest in artificial intelligence-driven gait-analysis.
Gait tasks with specified distance or speed, such as the TUG test,30,34 were the preferred choice of most studies performing gait analysis. Some studies varied the speed and included additional movements (e.g. turning clockwise or head tilt) to examine whether certain features are relevant for classification accuracy.34,41 Kashyap et al. 38 were unique in additionally evaluating speech patterns (consistency in volume, articulation and vocal instability) by asking participants to repeat the consonant-vowel syllable /ta/. Moreover, studies that evaluated PD patients, such as Di Lazzaro et al., 35 included upper limb tests: resting tremor, postural tremor and finger-nose tests. These tasks were extracted from the clinical evaluation scoring system ‘Movement Disorders Society Unified Parkinson's Disease Rating Scale’ (MDS-UPDRS). Several studies also used clinical tools such as the MDS-UPDRS beyond determining the clinical severity of PD in study participants. Di Lazzaro et al. 35 were informative in utilising MDS-UPDRS tasks as their motor tasks, and demonstrated the sensitivity of these clinical tools: nine of 19 features from the Pull Test and TUG were selected for the AI model, despite no subjective abnormalities being rated by clinicians in that cohort. This reinforces the ability of wearable sensor technologies to detect subtle movement abnormalities that are not discernible on visual observation (Tables 4 to 7).
Walking bout of included studies.
Characteristics of included artificial intelligence studies of custom algorithms.
Characteristics of included artificial intelligence studies of machine learning models.
Abbreviations: LR, logistic regression; RF, random forest; SVM, support vector machine; ANOVA, one-way analysis of variance.
CASP methodological quality assessment of studies of custom algorithms and machine learning models.
The classification accuracy of algorithms may be affected by multiple factors, ranging from gait tasks to the selection of features and algorithms. There was no large discrepancy between results from studies that included the TUG and results from studies that used ‘normal’ gait tasks. Therefore, the authors recommend that future research include the TUG as an additional task, where feasible, without omitting the use of a ‘normal’ gait task. In Najafi et al., 7 researchers found that as the distance of the gait task increased, participants’ speed increased by about 3–15%. A total of 16 out of 23 papers did not specify the speed or instructed the participants to carry out the task at their ‘preferred speed’. 7
Testing environments
Studies were generally conducted in a supervised clinical environment, on a level surface free of obstacles, with only four studies conducting the tasks in home environments or varied testing locations.28–30,48 The results may therefore not be suitable for extrapolation to at-home and in-community environments. This limits external validity, as patients might find gait and motor tasks easier to conduct in a clinical environment, while pathological gait might be further exaggerated in certain environments and at certain times of the day. Various authors, such as Brodie et al., 6 have previously detailed the variability in cadence and step time between ‘free-living’ community gait and laboratory-assessed gait. As such, participants may be subject to the ‘Hawthorne effect’, in which they are more aware of their gait, necessitating that the most valid gait analysis be conducted in community or home environments to better reflect ‘free-living gait’. 6
Future studies may explore gait assessments in various environments, on uneven surfaces and at different times of the day. Most included studies were cross-sectional; hence, a longer follow-up period to monitor changes in gait with disease severity and recovery may be warranted. A follow-up period of gait analysis could provide confirmatory findings, as in Rovini et al., 44 which hypothesised that patients classified as having IH have a higher likelihood of developing PD. As another example of the utility of follow-up, the Fino et al. 27 study of athletes post-acute concussion had a short-term follow-up of 10 days for all participants; peak head angular velocity, the main discriminating factor found in the study for classifying athletes, was found to return to healthy control levels.
Machine learning
For AI technology, there are various advantages and disadvantages of one model over another. The selection of features and algorithms may play a major role in model performance when classifying pathological gait from healthy controls.
Feature selection methods allow training datasets to be more precise for models and reduce training times by removing redundant features. Zhang et al. 32 applied the Pearson correlation coefficient, which is one of the measures of the filter method. Other statistical tools used within the filter method include the chi-squared and analysis of variance (ANOVA) tests. In Rehman et al., 43 Di Lazzaro et al. 35 and De Vos et al., 34 ANOVA was used. Di Lazzaro et al. 35 used Kruskal-Wallis feature selection, a non-parametric counterpart of the ANOVA test, as well as relief ranking, which is a type of filter method. The wrapper method is more complex than the filter method and would therefore perform comparatively better with a smaller dataset. It is also more computationally intensive than the filter method and can be subject to overfitting as model complexity increases. A benefit of the filter method is that it does not incorporate a specific ML algorithm. The embedded method combines features from both filter and wrapper methods. The best-known example of an embedded method is LASSO, or ‘least absolute shrinkage and selection operator’. De Vos et al. 34 implemented both ANOVA and LASSO feature selection methods. Similar to the wrapper method, the embedded method also has a high computational requirement.
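As a sketch of the ANOVA-based filter method, the following computes a one-way F-statistic per feature and ranks features without involving any downstream classifier, which is the defining property of filter methods. The data are synthetic; only feature 0 genuinely separates the two classes.

```python
import numpy as np

def anova_f_scores(X, y):
    """Filter-method scoring: one-way ANOVA F-statistic for each feature.
    A high F means the feature's class means are well separated relative
    to the within-class spread."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(X) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(1)
y = np.array([0] * 30 + [1] * 30)
X = rng.standard_normal((60, 3))
X[y == 1, 0] += 3.0   # feature 0 separates the classes; features 1, 2 are noise

scores = anova_f_scores(X, y)
best = int(np.argmax(scores))
print(best)  # 0: the discriminative feature is ranked highest
```

Keeping only the top-k features by this score is the filter step; because the score is classifier-agnostic, it is cheap but cannot capture feature interactions, which is where wrapper and embedded methods come in.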
Feature selection and extraction
An important concept for feature selection is the ‘bias-variance tradeoff’. 49 Bias reflects how much model predictions differ from the correct value, while variance is how much the predictions for a given point vary between different realizations of the model. 49 The ideal scenario for feature selection is to achieve a balance between underfitting (high bias) and overfitting (high variance). 49 This can be prevented with model validation tools such as leave-one-out cross-validation.
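Leave-one-out cross-validation can be sketched in a few lines; here a nearest-centroid rule stands in for an arbitrary classifier, and the two well-separated synthetic classes are illustrative assumptions.

```python
import numpy as np

def loo_cv_accuracy(X, y):
    """Leave-one-out cross-validation: each sample is held out once and
    predicted by a nearest-centroid rule fitted on the remaining samples."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        X_train, y_train = X[mask], y[mask]
        centroids = {c: X_train[y_train == c].mean(axis=0)
                     for c in np.unique(y_train)}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        correct += int(pred == y[i])
    return correct / len(X)

rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((20, 2)),          # class 0 near the origin
               rng.standard_normal((20, 2)) + 4.0])   # class 1 shifted away
y = np.array([0] * 20 + [1] * 20)

acc = loo_cv_accuracy(X, y)
print(acc)
```

Because every sample is scored while excluded from training, the resulting accuracy estimates generalisation rather than memorisation, at the cost of fitting the model n times.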
Feature extraction differs from feature selection in that new features are created from existing features. This method minimises the number of features by discarding the original features and improves the performance of the model. Principal component analysis (PCA) is one of the most widely used linear dimensionality reduction methods, as in Kashyap et al. 38
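A minimal PCA via singular value decomposition illustrates the point: five synthetic, highly correlated “gait features” (really one latent factor plus noise) are replaced by two new components, and the originals are discarded. The data-generating setup is an assumption for illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """Feature extraction by PCA: project centred data onto the directions
    of greatest variance, discarding the original feature axes."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]              # principal directions
    explained = S[:n_components] ** 2 / (S ** 2).sum()
    return Xc @ components.T, explained

rng = np.random.default_rng(3)
base = rng.standard_normal((50, 1))
# Five correlated "gait features" driven by one latent factor plus small noise.
X = base @ rng.standard_normal((1, 5)) + 0.1 * rng.standard_normal((50, 5))

Z, explained = pca(X, n_components=2)
print(Z.shape)                         # (50, 2): five features reduced to two
print(round(float(explained[0]), 2))   # first component carries most variance
```

Because the five inputs share one latent driver, the first principal component captures almost all of the variance, which is exactly the redundancy PCA is designed to exploit.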
SVM, RF and LR
In Hsu et al., 36 Moon et al. 40 and Tedesco et al., 46 ensemble methods were implemented, which can help prevent overfitting. Classically, studies favour the use of hidden Markov model (HMM)-based features and the SVM classifier, as highlighted by Hsu et al. 36 This is further supported by eight of the 13 ML studies applying the SVM algorithm. SVM is a type of supervised learning and can process large datasets with good generalisation capabilities. It generally has good predictive accuracy, as shown by the 98% accuracy in Nukala et al. 41 However, three of the eight studies found that SVM was not the best classifier in terms of application. In Nukala et al., 41 BP-ANN yielded 100% classification accuracy. Rovini et al. 44 compared SVM with RF and naïve Bayes (NB), in which SVM and NB achieved 95% accuracy while RF yielded 97%. In Tedesco et al., 46 SVM classifiers showed a performance accuracy of only 71.18%. One of the benefits of the SVM classifier is that it is based on a maximum-margin hyperplane, with support vectors being the elements closest to the decision surface, which helps it avoid overfitting the data. 21 In Rehman et al., 43 RBF-SVM was used, in which the RBF kernel maps non-linear problems into a space where they become linear, thereby simplifying the SVM model; the results showed that RBF-SVM performed better than RF. 43 The RF algorithm is a supervised learning model derived from multiple decision trees. 19 It is also a class of ensemble learning method; however, it was not categorised together with (gradient) bagging and boosting in the pie chart (Figure 2). 49 RF is widely used in various gait analyses and has shown promising results, as in Rovini et al. 44 The benefits of this algorithm are that it is robust in handling outliers, can work with large datasets and requires simple parametrisation, owing to the random nature of partitioning by which features are selected for splitting.
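The kernel idea behind RBF-SVM can be shown in isolation. The toy points and `gamma` value below are assumptions, not drawn from Rehman et al.; the point is that similarity in the kernel's implicit feature space lets a linear decision rule handle problems that are non-linear in the original gait-feature space.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2). Identical points
    get similarity 1; distant points approach 0."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Two gait-feature vectors: one at the origin, one far away.
A = np.array([[0.0, 0.0],
              [3.0, 3.0]])
K = rbf_kernel(A, A)
print(np.round(K, 4))  # near-identity: self-similarity 1, cross-similarity ~0
```

An SVM trained on this kernel matrix finds a maximum-margin separator that is linear in the kernel space but curved in the original feature space, which is the simplification the text describes.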
Another classical AI algorithm is LR, employed in two of the 13 included studies.34,40 LR uses a sigmoidal curve to explore the relationship between features and the probability of an outcome. 19
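The sigmoidal mapping can be shown with a toy calculation; the weights and feature value here are purely illustrative, not fitted coefficients from any included study.

```python
import math

def sigmoid(z):
    """Map a linear combination of gait features to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted weights: intercept plus one coefficient per feature.
weights = [-1.0, 2.5]   # illustrative values only
features = [0.8]        # e.g. a normalised stride-variability score

z = weights[0] + weights[1] * features[0]
p = sigmoid(z)
print(round(p, 3))  # 0.731: modelled probability of pathological gait
```

Thresholding this probability (commonly at 0.5) turns the regression into a binary pathological/healthy classifier.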
Neural networks
ANN and CNN are classes of NNs but have different functionality. Each ANN has nodes (or neurons) that take in inputs and synapse with other nodes via connections. 19 The algorithm separates the classes by a line, plane or hyperplane in a two- or multi-dimensional space. 19 This is similar to SVM, in which classes are divided by hyperplanes. Furthermore, features are transformed using the sigmoid function, providing class associations, rather than probabilities, as output. 19 It has a high tolerance to noisy data and is suitable for untrained continuous data. 19 The robustness of NN performance is shown when processing complex data input from an unsupervised dynamic environment. 40 CNN is widely used in image classification and segmentation tasks. 19 It passes patches of an image from one node to another; the nodes then apply convolutional filters to extract specific features of the image. Multiple convolutional layers can be stacked, with new filters applied at each layer, before the output is fed forward into an ANN. Steinmetzer et al. 45 adopted single-layer and multi-layer CNNs for gait analysis. The final results were promising: 93.4% accuracy using wavelet transformation and a three-layer CNN algorithm. 45 Despite the complexity of NNs mapping multi-dimensional images, the computational requirements of CNNs are attainable with consumer-grade products. For example, Steinmetzer et al. 45 applied a three-layer CNN model on a consumer-grade Intel Core i7-6700HQ (2.6 GHz, four cores) with 16 GB RAM. The training of the CNN model took around 45 min to complete. 45 Deep learning models remove the need for a manual feature selection process conducted by experienced researchers or data scientists. In general, SVM and RF, albeit commonly used, show varying results in terms of classification accuracy. Although NNs were used in a limited number of studies, the results are promising and should be further explored in future gait analysis studies.
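The convolve-activate-pool operation at the heart of a CNN layer can be sketched directly on a 1-D sensor time series. The filter values, toy signal and pooling window below are hand-chosen illustrations; a trained CNN learns its filter values from data.

```python
import numpy as np

def conv1d(signal, kernel):
    """A single 'valid' 1-D convolution pass: a filter sliding across a
    wearable-sensor time series, as in one node of a CNN layer."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(n)])

t = np.arange(0, 4, 0.01)
signal = np.sin(2 * np.pi * 1.5 * t)  # toy 1.5 Hz gait oscillation

# A hand-chosen edge-detecting filter; a real CNN would learn these values.
kernel = np.array([-1.0, 0.0, 1.0])

feature_map = np.maximum(conv1d(signal, kernel), 0.0)   # convolve + ReLU
window = 20
usable = (len(feature_map) // window) * window
pooled = feature_map[:usable].reshape(-1, window).max(axis=1)  # max-pooling
print(len(signal), len(feature_map), len(pooled))  # 400 398 19
```

Stacking several such layers, each with its own learned filters, and feeding the final pooled features into a fully connected ANN yields the architecture described above.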
Limitations
One of the study's main limitations is the lack of consistent measures for assessing model accuracy. Another limitation is that this study did not take into account software variety and whether it would affect the performance of algorithms. In this study, the authors documented the location of wearable sensors and noted the location most preferred by researchers; however, no conclusions regarding the ideal number and location of wearable sensors could be made. There is a possibility that increasing the number of sensors used may be detrimental to the accuracy of the results; however, this is yet to be definitively determined. This review did not include paediatric gait analysis studies; therefore, the findings are not generalisable to the paediatric population.
For the AI algorithm studies, there were variations in the application of AI algorithms. All studies applied AI algorithms to their ‘walking segment’; however, not all studies included the ‘turning phase’. This may affect classification accuracy, especially for TUG gait tasks, where the ‘sit-to-stand’, ‘stand-to-sit’ and ‘turning’ phases provide valuable information in discerning pathological gait. This is seen in the study conducted by Steinmetzer et al., 45 where the complete TUG task achieved an accuracy of 93.3% compared with 90.3% for gait alone. For non-walking segments, AI algorithms were applied to tasks such as speech, balance, and upper and lower limb movement tasks.
Feasibility
Unlike deep learning models, conventional ML models require less data input. For example, in the study conducted by Hsu et al., 36 the number of hidden layers used for the NN with MLP was 2000, while the RF classifier used 1000 trees. The study analysed the first two strides of six successful gait trials. In the study conducted by Terrier, 50 the minimum number of strides was two continuous strides to reconfigure a pretrained CNN model to identify previously unseen gait. However, this was completed in a supervised laboratory environment with a pressure-sensitive mat. When deciding on the size of the training set, there are no specific reference values for each ML model. The recommended approach is to plot a learning curve (LC): the validation-set curve is plotted alongside the training-set curve to determine whether there is a likelihood of high bias or high variance.
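A learning curve can be sketched by evaluating a fixed validation set while the training set grows. Here a nearest-centroid rule stands in for an arbitrary ML model, and the partially separable synthetic classes are illustrative assumptions.

```python
import numpy as np

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Validation accuracy of a nearest-centroid classifier, used here as a
    simple stand-in for any ML model."""
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in set(y_train)}
    preds = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
             for x in X_test]
    return float(np.mean([p == t for p, t in zip(preds, y_test)]))

rng = np.random.default_rng(4)
n = 200
y = np.array([0] * (n // 2) + [1] * (n // 2))
X = rng.standard_normal((n, 2)) + 2.0 * y[:, None]  # partially separable classes

order = rng.permutation(n)
X, y = X[order], y[order]
X_val, y_val = X[150:], y[150:]   # fixed validation set

# Learning curve: validation accuracy at increasing training-set sizes.
accs = []
for size in (10, 50, 150):
    accs.append(centroid_accuracy(X[:size], y[:size], X_val, y_val))
    print(size, round(accs[-1], 2))
```

If the validation curve plateaus close to the training curve, more data will not help (high bias dominates); a persistent gap between the two curves signals high variance and a need for more training data or regularisation.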
In summary, the authors recommend the use of NNs, in particular CNN and BP-ANN. Even though they have high computational requirements, they delivered consistent accuracy (above 89%) across multiple studies. 33 Moreover, they incorporate characteristics of SVM, the most commonly adopted AI algorithm, as evaluated in this study.
Conclusions
The results of the present review show the need for more research in the field of paediatric gait pathologies, types of Parkinsonism and analysis of free-living gait in the community and at home. It is difficult to determine which algorithms for feature selection, feature reduction, classification and validation are the most suitable for gait analysis. Therefore, larger cohort studies should be conducted with various algorithms for direct comparison of classification accuracy. Duration of follow-up should also be considered. The next step would include longitudinal gait monitoring of patients on trials of disease-modifying medications, both on and off medication. These results could be used for prediction analyses of disease progression. Furthermore, considerations regarding the bias-variance tradeoff and the technological requirements are warranted in future development and use of AI algorithms for gait analysis.
