Abstract
1 Introduction
In many medical applications, the observed data may be assumed to have arisen from a continuous curve or higher dimensional surface that is described by some function. Thus, the glucose levels as measured by a continuous glucose monitor or the tracings generated by an electroencephalogram of a patient are both examples where a continuous function may be used to help describe the underlying data.
In both of these examples, a single function is assumed to underlie the data for an individual patient so that for a sample of patients there is a sample of functions in which the data are observed at particular points. In this regard, the function is the element of interest1 and analysis of the shape of these functions can help inform decisions on classification and prediction.2,3
Functional data analysis describes the statistical methods and techniques used to explore functional data.1 The random variable takes values in a space of functions defined on some continuous interval such as time.1,2 Thus, each realization of the variable is a function, providing infinite-dimensional data, and the space of functions is generally assumed to be a Hilbert space.4 Although our concern here is with univariate functional data, multivariate functional data may also be of interest.1,5,6 In practice, whichever the type of data, the functions are often sampled at a finite set of points.
In many situations, we need to know the hidden structure that explains how these curves and functions vary from one group to another. Thus, in the study of childhood obesity, growth curves of body measurements may be used to group children using a cluster method.7
One such cluster method,
For many of the cluster methods used on functional data, the number of clusters is assumed to be fixed a priori. This makes determining the optimal number of clusters in functional data important and motivates this research. The method proposed here can be used to identify the number of clusters and is a development of the forward search originally used to identify outliers in multivariate data9,10 and later as a clustering method.11
Here we use a forward search based on functional spatial ranks to analyze functional data. This extends previous work that introduced the forward search based on spatial ranks for the cluster analysis of multivariate data.12 The functional forward search introduced here is based on the random-start forward search13 and can be considered a new raw-data method that obviates the need for dimension reduction, since it performs the clustering directly on the discrete observations of the curves or functions.
It is a non-parametric method that can be used to determine the number of clusters and assign each curve to its cluster. When compared with existing methods using numerical examples from real data, it is shown to be an effective tool for cluster analysis.
The paper is organized as follows. In section 2, we discuss the curse of dimensionality in the traditional random start forward search method and the potential of using the forward search based on functional spatial ranks. In section 3, we propose the functional forward search algorithm based on functional spatial ranks. In section 4, we compare the proposed method with other functional data clustering methods using numerical examples before ending with the discussion in section 5.
2 The curse of dimensionality in the traditional forward search
The term “curse of dimensionality” was introduced by Bellman.14 It refers to all problems caused by the analysis of high-dimensional data and, in general, arises from a relative sparsity of observations. For example, in order to run the traditional forward search algorithm based on Mahalanobis distances, we need to choose an initial subset
In contrast, when the number of observations is small compared to the number of variables it is not possible to estimate the variance–covariance matrix, so the algorithm cannot proceed. Strictly, the traditional forward search based on Mahalanobis distances11 cannot be applied to functional data owing to the random variables taking values in an infinite-dimensional space. However, in practice the data consist of curves that have been sampled at a finite set of points, hence it is still possible to use forward search methods providing the dimension,
The forward search, like all multivariate methods based on Mahalanobis distances, suffers when the dimension grows. Since it starts with subsets of size
In contrast, using a forward search based on spatial ranks for clustering multivariate data overcomes this issue,12 as it can be started with subsets of any size since the rank of any observation
3 The functional forward search based on functional spatial ranks
In this section, we propose a functional forward search algorithm based on functional spatial ranks. Part of the novelty of this algorithm is that, unlike the traditional forward search algorithm, it works with functional data. Furthermore, as a raw-data method it determines the number of clusters from the data without any need for parameter estimation. A key element is the need to extend
3.1 The functional spatial rank
A spatial approach to multivariate and functional data appeared as early as 1983, when the spatial median was used for robust location estimation for two-dimensional spatial data.15 The development of non-parametric geometrical approaches led to the introduction of multivariate spatial quantiles16 and the multivariate spatial depth function.17
The functional spatial depth (FSD), proposed by Chakraborty and Chaudhuri,18 extends the notion of spatial depth from
The spatial depth function has been used to provide a nonparametric description of functional data, with the functional version of spatial depth used to identify nonparametric descriptive features such as sample median and quantile curves.20
The functional spatial median has been of particular interest to investigators. For example, Cardot et al.21 used an averaged stochastic gradient algorithm to compute the functional spatial median in a Hilbert space efficiently. This functional spatial median has also been used as a robust measure of center for a data set of electricity loading curves.22 The kernelized functional spatial depth (KFSD) has been proposed23 for the classification of functional data; it is based on the functional spatial depth introduced by Serfling and Wijesuriya.17 In that work, the functional K-nearest neighbour classifier was used as a benchmark procedure.
Suppose that
In practice, the curves are observed at a finite set of points, so that there are discrete observations for each sample path
As a vector, the functional spatial rank provides information on the centrality of an observed curve and its direction. The
Clearly this requires deciding upon a suitable cut-off for an outlier, and one approach is to trim from the sample a proportion of curves with the highest
A simpler approach is to derive the cut-off, C, based on the upper whisker of the boxplot of
In principle, functional spatial ranks can be applied to both regularly and irregularly sampled curves. In the general setting the ranks are defined through integrals rather than finite sums, and these integrals can be estimated numerically to obtain the functional spatial ranks. Alternatively, smoothing functions or spline coefficients may be used to place the irregularly sampled curves on a common grid, after which the equations above can be applied directly to obtain their functional spatial ranks.
3.2 Functional spatial ranks classifier
Before introducing the forward search algorithm, we consider the problem of classifying functional data to particular clusters. In general, it is important to assess whether the curves have been appropriately assigned to a cluster or whether some remain unassigned to any cluster. A further problem that may arise with some algorithms classifying functional data is when some curves are assigned to more than one cluster.
Clearly it is desirable to have a mechanism for assigning each curve in the functional data to an appropriate cluster. Here we use a nonparametric classifier based on the functional spatial ranks that is applied after determining the number of clusters. Assuming we have
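A minimal sketch of such a rank-based classifier is given below. The assignment rule (each curve goes to the cluster within which its functional spatial rank norm is smallest, i.e. where it is most central) and the function names are our own illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def spatial_rank_norm(y, X, dt=1.0):
    """Norm of the functional spatial rank of curve y with respect to sample X
    (an (n, T) array): small when y is central to X, near 1 when outlying."""
    diffs = y - X
    norms = np.sqrt(np.sum(diffs ** 2, axis=1) * dt)
    keep = norms > 0                       # ignore y itself if it appears in X
    r = np.mean(diffs[keep] / norms[keep, None], axis=0)
    return np.sqrt(np.sum(r ** 2) * dt)

def rank_classify(y, clusters, dt=1.0):
    """Assign curve y to the cluster in which it is most central
    (smallest functional spatial rank norm)."""
    scores = [spatial_rank_norm(y, C, dt) for C in clusters]
    return int(np.argmin(scores))
```

Because the rank norm is a centrality measure, this rule also resolves curves left unassigned, or assigned to more than one cluster, by an earlier clustering step.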
3.3 The forward search based on functional spatial rank algorithm
The search proceeds as follows:

1. Selecting random starting points, start the search with an initial subset.
2. Calculate the functional spatial ranks.
3. Compute the rank norms.
4. Grow the subset.
5. Repeat steps 2–4 until the subset contains all the curves.
6. Plot the monitored norms and identify the subset size by finding the highest peak.
7. Apply the functional spatial ranks classifier in section 3.2 to confirm the assignment of each curve and allocate the unassigned/incorporated curves to the proper group.
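The search loop can be sketched as follows. This is only a skeleton under our own assumptions (the subset is grown one curve at a time by absorbing the most central outside curve, and the minimum rank norm is recorded at each step); the paper's exact growth and monitoring rules may differ.

```python
import numpy as np

def forward_search(X, m0=3, dt=1.0, start=None, seed=0):
    """Random-start forward search on discretized curves X (an (n, T) array).

    At each step the functional spatial rank norm of every curve outside the
    current subset is computed relative to that subset, the most central
    outside curve is absorbed, and the minimum norm is recorded.  Sharp rises
    in the recorded trajectory suggest cluster boundaries.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    inside = list(start) if start is not None else list(
        rng.choice(n, size=m0, replace=False))
    outside = [i for i in range(n) if i not in inside]
    trajectory = []
    while outside:
        S = X[inside]
        scores = []
        for i in outside:
            diffs = X[i] - S
            norms = np.sqrt(np.sum(diffs ** 2, axis=1) * dt)
            r = np.mean(diffs / norms[:, None], axis=0)
            scores.append(np.sqrt(np.sum(r ** 2) * dt))
        j = int(np.argmin(scores))          # most central remaining curve
        trajectory.append(scores[j])        # monitored statistic at this step
        inside.append(outside.pop(j))
    return np.array(trajectory)
```

Plotting the trajectory against subset size gives the forward plot: for well-separated clusters the trajectory jumps when one cluster is exhausted and the search is forced to absorb a curve from another group.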
Theoretically, the algorithm can be started with any number of random starting points, since it is not constrained by the rule of
The computation of
When the curves in
4 Numerical examples
In this section, we apply the FSFSR algorithm proposed in section 3.3 to some numerical examples. The first three examples are simulated data generated from three different models. The final two examples use data from real datasets.
To assess the performance of the FSFSR algorithm, it is important to recognize that the algorithm both identifies the number of clusters and assigns all the data to an appropriate cluster. Thus, any performance metric must capture both of these elements and penalize the performance when either it identifies an incorrect number of clusters or wrongly assigns data to a cluster. Thus, we use the following misclassification rate, which is similar to the classification error proposed by Meila.26
For
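A misclassification rate of this kind can be sketched as below: predicted clusters are matched to reference classes so as to maximize agreement, and the unmatched remainder counts as error. The brute-force matching and the function name are our own; this is practical only for a handful of clusters.

```python
import numpy as np
from itertools import permutations

def misclassification_rate(true, pred):
    """One minus the best achievable matched overlap between predicted
    clusters and reference classes, as a fraction of the sample size.
    Returning extra or missing clusters is automatically penalized, because
    unmatched clusters contribute nothing to the overlap."""
    true, pred = np.asarray(true), np.asarray(pred)
    # contingency table: rows = reference classes, columns = predicted clusters
    M = np.array([[np.sum((true == a) & (pred == b)) for b in np.unique(pred)]
                  for a in np.unique(true)])
    if M.shape[0] > M.shape[1]:
        M = M.T                             # orient so rows <= columns
    rows, cols = M.shape
    # best injective assignment of columns to rows by exhaustive search
    best = max(sum(M[i, p[i]] for i in range(rows))
               for p in permutations(range(cols), rows))
    return 1.0 - best / len(true)
```

For example, a clustering that is perfect up to a relabelling scores 0, while splitting a reference class across two predicted clusters is penalized because only one of the two can be matched to that class.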
The adjusted Rand index (ARI) is also a popular metric for measuring the performance of clustering algorithms and has been included here for completeness.27 In contrast to H, which compares clusters by matching sets, the ARI compares clusters by counting the pairs of points on which the clusterings agree or disagree. It also corrects for the expected value of the unadjusted Rand index, where the expectation is based on a random choice of entries in the contingency table with the column and row totals fixed.
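The pair-counting form of the ARI can be computed directly from the contingency table; a self-contained sketch (the function name is our own):

```python
import numpy as np
from math import comb

def adjusted_rand_index(true, pred):
    """ARI via pair counts: (index - expected index) / (max index - expected
    index), where the expected index is taken over random contingency tables
    with the observed row and column totals."""
    true, pred = np.asarray(true), np.asarray(pred)
    M = np.array([[np.sum((true == a) & (pred == b)) for b in np.unique(pred)]
                  for a in np.unique(true)])
    sum_cells = sum(comb(int(x), 2) for x in M.ravel())    # agreeing pairs
    sum_rows = sum(comb(int(x), 2) for x in M.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in M.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(true), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)
```

The correction for chance means the ARI is 0 in expectation for a random labelling and 1 for perfect agreement, though, as noted later, the baseline expected Rand index varies with the contingency table, which complicates comparisons across partitions.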
For comparison, we considered six other methods for identifying the number of clusters and cluster sizes. Note in what follows the term in brackets, after the method name, corresponds to the function and package in
The first five methods were implemented as a raw-data method with discretized data and as a filtering method based on 10 spline coefficients. All six methods were applied to the three simulated and two real datasets and compared with the FSFSR algorithm using the number of clusters identified, H26 and the ARI.27
For the
4.1 Simulated data examples
The first simulated data (model 1) consists of two groups. The first group includes curves that are generated from the process
A mixing proportion of 0.25 was used to generate the two clusters with a sample size

Simulated data, Model 1: (a) the observed curves with two groups, (b) the mean function, (c) the forward plot based on functional spatial ranks.
In Table 1 the performances of all the algorithms are summarized. Many of the algorithms identified the correct number of clusters and achieved perfect classification. However, three algorithms (
Comparison of different clustering approaches applied to Model 1.
Note: When a method is followed by a letter in parentheses it denotes the following: (a) = raw-data methods with discretized data; (b) = filtering methods using spline coefficients from 10 splines.
aResults are based on the mean of 1000 repetitions.
In the second model, there are also two groups. The first group consists of curves generated from the process similar to equation (7) but with a different mean function:
Figure 2(a) shows the simulated curves from the first cluster (black) and second cluster (red). The corresponding mean functions are shown in Figure 2(b). The forward plot based on the functional spatial ranks is shown in Figure 2(c). Again we can clearly see two peaks around

Simulated data, Model 2: (a) the observed curves with two groups, (b) the mean function, (c) the forward plot based on functional spatial ranks.
Comparison of different clustering approaches applied to Model 2.
Note: When a method is followed by a letter in parentheses it denotes the following: (a) = raw-data methods with discretized data; (b) = filtering methods using spline coefficients from 10 splines.
aResults are based on the mean of 1000 repetitions.
For the third model, we combine the two previous models so there are three clusters. The first cluster consists of the generated curves from the process defined in equation (7) but with a different mean function:
The three simulated clusters have sizes 30, 50 and 80. Figure 3(a) and (b) shows the respective simulated curves for the clusters and their mean functions. In Figure 3(c), the forward plot again demonstrates that the functional forward search algorithm has identified the correct number of clusters. There are three peaks at

Simulated data, Model 3: (a) the observed curves with three groups, (b) the mean function, (c) the forward plot based on functional spatial ranks.
A comparison with the other algorithms is provided in Table 3. It can be seen that the FSFSR algorithm is the only algorithm which identifies the correct number of clusters with near perfect classification. In contrast, the other algorithms suffer significant misclassification rates, although the adjusted Rand indices are less affected, with many returning a value of 0.764.
Comparison of different clustering approaches applied to Model 3.
Note: When a method is followed by a letter in parentheses it denotes the following: (a) = raw-data methods with discretized data; (b) = filtering methods using spline coefficients from 10 splines.
aResults are based on the mean of 1000 repetitions.
4.2 Real data examples
In this section we apply the FSFSR algorithm to two real datasets. The first dataset is known as the ECG data and is taken from the UCR Time Series Classification and Clustering Archive. 37 The dataset consists of 200 electrocardiograms from two groups of patients sampled at 96 time points, in which 133 are classified as normal and 67 as abnormal. The data consist of the ECG signals recorded between two electrodes during one heartbeat. The abnormal ECGs reflect a cardiac pathology known as a supraventricular premature beat.
The second dataset, known as the ‘DistalPhalanxOutlineCorrect’ data (hereafter referred to as the Distal data) is also taken from the UCR Time Series Classification and Clustering Archive.38,39 It is designed to test the efficacy of hand and bone outline detection by an image processing algorithm. The outlines of the three bones of the middle finger in each image are summarized by a univariate series of 80 data points representing Euclidean distances of different points around the outline from a central point. Here, we consider the test sample of 276 images. There are two classes based on whether the bones have been correctly delineated by the image processing algorithm (115) or not (161) as determined by human evaluation.
Figure 4(a) shows the observed curves for the ECG data, and Figure 4(b) the forward plot based on the functional spatial ranks. Two clusters are evident, with peaks at 58 and at 120. This suggests that some of the observations have not been captured by either cluster. In order to identify the membership of each cluster, the forward search was stopped at the first peak (

ECG data: panel (a) is the observed curves with two groups and panel (b) is the forward plot based on the functional spatial ranks. Two clusters are evident at subsets with sizes 58 and 120.
Table 4 gives the results for all the methods applied to the ECG data. It is clear that only the FSFSR algorithm gives the correct number of clusters (2), and it has the lowest H (23.5%). Although many of the other methods identify an incorrect number of clusters, their ARI is more favourable than that of the FSFSR algorithm. None is above 0.39, which would suggest poor classification by all the algorithms; however, interpretation of the ARI is not straightforward as the baseline expected value for the Rand index varies as the contingency table varies.40,41
Comparison of different clustering approaches applied to the ECG dataset.
Note: When a method is followed by a letter in parentheses it denotes the following: (a) = raw-data methods with discretized data; (b) = filtering methods using spline coefficients from 10 splines.
aResults are based on the mean of 1000 repetitions.
Comparison of different clustering approaches applied to the Distal dataset.
Note: When a method is followed by a letter in parentheses it denotes the following: (a) = raw-data methods with discretized data; (b) = filtering methods using spline coefficients from 10 splines.
aResults are based on the mean of 1000 repetitions.
The Distal data curves for the 276 images are given in Figure 5(a). It is clearly seen that there is a high level of similarity between the two classes, which makes the distinction between them difficult. Figure 5(b) displays the forward plot based on the functional spatial ranks for the Distal data. Two clusters are evident, with clear peaks at 67 and 175. Before applying the classifier in the algorithm, 36 curves remain unassigned to a cluster and two curves have been incorporated in both clusters. After step 7, each of these 38 curves has been assigned to a single appropriate cluster and H for the algorithm is 0.236.

Distal data: panel (a) is the observed curves with two groups and panel (b) is the forward plot based on the functional spatial ranks. Two clusters are evident at subsets with sizes 67 and 175.
It is clear from Figure 5(a) that one curve (number 220), which starts from the upper left and travels in a different direction from the other curves, is a potential outlier. From the data, the cut-off C for outliers equals 0.9586 and for this curve,
Table 5 gives the results for the Distal data, and as can be seen, only five methods including the FSFSR algorithm gave the correct number of clusters (2). The FSFSR algorithm again records the lowest H (23.6%). For the ARI,
5 Discussion
In this paper, we have proposed a new forward search algorithm for clustering functional data. It is an extension of the forward search methodology based on spatial ranks that was introduced for the multivariate case.12 It may be used to identify the number of clusters in the underlying functional data and does not require any preprocessing of the data, nor the need to perform data registration or dimension reduction before clustering. Furthermore, it may be used in cases when the number of variables exceeds the number of observations or when the cluster size is less than the number of variables; this contrasts with traditional forward searches based on Mahalanobis distances.
An important element of the algorithm is the inclusion of a classifier. This allows the classification of all curves to an appropriate cluster, even when in the early steps some have either not been assigned or have been assigned to more than one cluster.
As the FSFSR algorithm both identifies clusters and classifies functional data, any reasonable comparison should be with methods that are also capable of clustering and classifying. Equally, it is important that the metric used to gauge performance adequately captures both clustering and classifying. To this end, we used the misclassification rate H, which penalizes methods that identify an incorrect number of clusters as well as assessing the error in classification, and the adjusted Rand index (ARI), a popular metric used in classification and clustering.
For the simulated examples, the algorithm was able to identify correctly the number of clusters and the number of simulated curves in each cluster with an H of no more than 0.0063. Indeed for the third, more complex simulated example, it was the only algorithm to correctly identify the number of clusters with a near perfect H and ARI score.
For the two real examples, the FSFSR algorithm identified the correct number of clusters and had the lowest H amongst all the methods. However, in the last example it also had one of the poorest ARI scores and illustrates some of the shortcomings when using these metrics for comparing algorithms. It is clear from the real data examples that when an incorrect number of clusters are returned, H penalizes algorithms more severely than the ARI. In contrast, the ARI adjusts for correct classification by chance which should, in principle, give it an advantage over H.26,40,41 However, since the baseline expected Rand index may be different between two different partitions of the data, it is not clear if two algorithms were to return similar values for the ARI that this would represent equivalence in performance.26,40,41 Thus comparing performances can be difficult using this metric.
One of the limitations of the proposed algorithm is that, in order to identify the subset size corresponding to each peak in the trajectories of the random starts, we have to find the highest
Several authors have demonstrated the use of the forward search based around a Mahalanobis distance metric to detect outliers in multivariate data.43–47 Distributional results are known for the Mahalanobis distance and the minimum Mahalanobis distance, allowing inferential statements to be made. In particular, percentile envelopes that contain most of the data may be estimated so that outlying points lie outside the enveloped region. In contrast, the forward search proposed here has been developed in a nonparametric framework. This makes it more difficult to use envelopes derived from order statistics based on distributional assumptions, approximations for unscaled distances, or asymptotic results, and this requires further research.
When there are a large number of clusters the proposed forward search may produce too many peaks, and this may make it difficult to determine the number of clusters and their sizes. Furthermore, the selection of random starting points as used here can result in multiple peaks, which makes it hard to identify small clusters; hence, a more effective divisive strategy could be used instead.43–47
Although spatial ranks are invariant under orthogonal transformations, they are not invariant under general affine transformations of the data, thus the proposed algorithm is not affine invariant. An affine invariant version of the algorithm could be formulated based on affine invariant spatial ranks48 and this could improve the results if, for instance, the scales of the clusters differed. However, this would make the algorithm computationally expensive and greatly increase the processing time; as a result, we did not use any affine invariant versions of spatial ranks here.
The treatment of outliers is important in cluster analysis as their presence may indicate the existence of clusters or populations not specified in the initial analysis. Equally they may arise due to errors in recording of some form. Potentially both can distort the process of cluster identification and data classification and, in the case of model-based approaches, bias the estimates of associated parameters. This has led some investigators to propose methods such as ‘trimming’ the data of outliers as part of the analysis.24,25
The first part of the FSFSR algorithm, the forward search, identifies the number of clusters and their constituents. In some cases, some of the data may remain unclassified at this stage of the algorithm, as in the case of the Distal dataset. These unclassified data tend to have functional spatial rank norms (
Here we used the calculated upper whisker of the box plot distribution as the threshold for outliers and this identified one potential outlier in the Distal dataset. Although the source of the outlier is unclear, in itself it would be insufficient to conclude that it arose from another population. Other approaches to the detection and treatment of outliers have been described and this remains an active area of research.24,25,43,49
In this study, both simulated and real datasets were used to compare the proposed algorithm with existing methods. One drawback when dealing with real data is that the identification of ‘true clusters’ is often not as clearly defined as in simulated datasets. Thus, errors in the reference classes, and the intrinsic dependence of the reference classification on the problem at hand, may diminish the effectiveness of this approach as a benchmarking procedure.50,51
In this study, we have proposed the FSFSR algorithm and demonstrated its potential as a clustering and classifying method for functional data. A more extensive evaluation of its performance across a greater range of examples is clearly necessary. However, as a data-driven non-parametric method, the approach proposed here is free from assumptions on the underlying distributions of the data and we believe it represents a significant development in functional data analysis.
