Abstract
Over the past few decades, researchers have become more interested in sequence analysis (SA) for the holistic analysis of life-course and other longitudinal data. The usual approach is to construct sequences, calculate pairwise dissimilarities, and then use a clustering algorithm on the dissimilarities for finding groups of similar sequences. Typically, these clusters are then described and interpreted as typologies. Increasingly, researchers are interested in analyzing the relationships between sequences and other characteristics, usually by using cluster membership as a dependent or independent variable in a linear or nonlinear regression model.
Almost unanimously, the clustering methods used in the SA context have been hard or crisp clustering algorithms, such as Ward’s method or partitioning around medoids (PAM). These algorithms find a partitioning where each sequence belongs to one cluster and one cluster only, which easily translates into a categorical variable with internally homogeneous and mutually exclusive groups. Applications using cluster membership as an observed characteristic of the units of analysis in regression models are also common (e.g., Chaparro et al. 2017; Fuller 2015). This approach is often problematic because the implicit assumption is that cluster membership is a fixed and known characteristic of an individual (or other subject), even though there is considerable uncertainty in clustering solutions because of various possibilities of choosing (dis)similarity measures, clustering algorithms, and the number of clusters. Furthermore, individual sequences might be mixtures of two or more ideal types or distant from all ideal types, making the whole concept of classification into clear or true clusters problematic. Failing to account for uncertain and mixed memberships may lead to wrong conclusions about the existence and nature of the studied relationships. Our aim is to bring forward and discuss the potential problems of the “traditional” approach of creating variables from SA clusters and to compare alternative options for creating explanatory variables using dissimilarities between sequences.
Methods
Social scientists have increasingly called attention to how existing methods overstate the certainty with which individual cases are allocated to sequence clusters and overstate within-cluster homogeneity, arguing for the need for methodological developments (e.g., Warren et al. 2015). Studer (2013) and Piccarreta and Studer (2019) discussed the problems with linking SA cluster membership and a covariate. By assigning the same cluster membership value to all sequences in the same cluster, we are neglecting the possible within-cluster variation of the sequences. This is not a problem if the structure of the clustering is strong, that is, there are clear subgroups in the data and we can be fairly certain of cluster memberships.
Furthermore, the relationship between the sequences and the outcome of interest should be sufficiently explained by the cluster memberships (we refer to this as a “class-dependent outcome”). This refers to type A in Figure 1: there are two clear clusters and the value of the outcome—indicated by the shade of the dot—depends on the class only, not on the subject’s position within the class (all within-class variation is random). A simple example of this situation is when changes in childhood family structure explain educational outcomes, such as when parental separation would have the same kind of effect on all children. In this case, children’s position in relation to the clusters (e.g., because of the timing of the separation and possible parental repartnering) would not matter for explaining the relationship between the pattern of childhood family structural changes and later educational outcomes.

Figure 1. Illustration of four data types on the basis of the strength of the clustering tendency and the type of the sequence–outcome link: (A) strong clustering, class-dependent outcome; (B) weak clustering, class-dependent outcome; (C) strong clustering, similarity-based outcome; and (D) weak clustering, similarity-based outcome.
In all other cases, however, the standard approach is potentially problematic. In a type B situation (Figure 1), the sequence–outcome link is similar to that of type A, but the clusters are overlapping. The weak clustering structure is a problem as it leads to misallocation of sequences. Even if this misallocation is random, this can bias the estimates, as in the analogous case of measurement error in covariates (cf. regression dilution/attenuation; e.g., Berglund 2012), and in some cases failing to account for this classification error can lead to too small standard errors and
In the social sciences, we argue, it is often unrealistic to assume that any true underlying clusters exist (contrary to, e.g., pattern recognition applications). However, even if true clusters existed, they would be difficult to identify using existing methods (Warren et al. 2015) and thus the sequence–outcome link cannot be easily reduced to the relationship between fixed cluster memberships and an outcome. Typically, the sequence typology derived from clustering can be regarded as an imperfect assignment of sequences to categories that approximate different ideal types. In this situation, the outcome depends on how strongly the sequences resemble the ideal types, or how they relate to one another (their relative positions). Illustrations of such data with “similarity-based outcomes” are shown in Panels C and D of Figure 1. A simplified example is the relationship between employment trajectories and lifetime accumulated income. In such a case, accounting for other factors, such as education level, individual 1, who has had a long, stable employment career, would have, on average, higher accumulated income than individual 2, who never had a stable job. The accumulated income of individual 3, who entered the labor market at a later age and was consistently employed thereafter, would be somewhere in between those of individuals 1 and 2 (again accounting for educational level). Careers more similar to that of individual 1 would tend to have higher incomes, and careers more similar to that of individual 2 would tend to have lower incomes.
In a type C situation, we have a strong clustering structure from which we can easily name some representative or ideal-type sequences (e.g., normative school-to-work trajectories). In a type D situation, there is merely a weak clustering tendency or no clear structure at all, but different types of trajectories are nevertheless related to different levels of the outcome. In this situation, cluster analysis can be used as a tool for finding some representative sequences that help in assessing and interpreting the sequence–outcome relationship. For a general presentation on the differences of uncertain or mixed memberships in clustering crisp or fuzzy data, see, for example, D’Urso (2007).
To date, there are few proposals to account for the uncertainty of the clustering result. Studer (2018) first brought up the idea of using “fuzzy” or “soft” clustering methods to account for mixed cluster memberships of sequence data in cases where sequences are the outcome of interest. In terms of sequences as a predictor (the interest in this article), to account for classification error, Jalovaara and Fasang (2020) conducted robustness checks by excluding cases with poor silhouette values (reflecting a poor fit to their respective cluster; Rousseeuw 1987). In their study, excluding cases with low silhouette values led to relatively small deviations in estimates but a substantial loss of cases and a considerable increase in standard errors of the estimates. In the following sections, we propose and discuss three alternatives to the traditional hard classification approach.
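The silhouette-based robustness check described above can be sketched as follows. This is a minimal illustration, not the procedure of Jalovaara and Fasang (2020): the toy data, the function name `silhouette_widths`, and the cut-off of 0.5 are all assumptions made for the example, and at least two clusters are assumed.

```python
def silhouette_widths(diss, labels):
    """Return one silhouette width per case from a square dissimilarity matrix."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from case i to each cluster (excluding i itself).
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(diss[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:  # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        a = mean_to[own]                                    # cohesion
        b = min(v for c, v in mean_to.items() if c != own)  # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# Hypothetical toy data: six cases on a line, dissimilarity = absolute difference.
positions = [0, 1, 2, 10, 11, 5]
labels = [0, 0, 0, 1, 1, 0]
diss = [[abs(p - q) for q in positions] for p in positions]

widths = silhouette_widths(diss, labels)
asw = sum(widths) / len(widths)  # average silhouette width (ASW)
threshold = 0.5                  # arbitrary cut-off chosen for this illustration
kept = [i for i, w in enumerate(widths) if w >= threshold]
print(kept, round(asw, 3))
```

In this toy example the case located between the two groups has a low silhouette width and is dropped, which shows how such a check trades sample size for cleaner clusters.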
Membership Probability and Representativeness
If we assume we have fixed cluster memberships and class-dependent outcomes, our main goal is to assign individuals to their correct clusters.
A membership matrix is less straightforward to use in a regression model. Studer (2018) proposed using the membership matrix as the outcome in a Dirichlet regression model, but to our knowledge, no one has yet proposed creating explanatory variables from a membership matrix of sequence data. If we look beyond the SA literature, some work in the latent class analysis (LCA) literature has suggested creating independent variables from latent classes where, similar to cluster analysis, true class memberships are unknown. The most interesting approach is the
Although likely an improvement over hard classification, typically when using soft classification and pseudoclass methods, researchers still assume that each subject belongs to a single cluster, but the methods account for the uncertainty in the cluster assignments. We argue that this dependence on specific clusters is often unrealistic in the social sciences, as many individual characteristics are continuous in nature and there are an unlimited number of different life-courses instead of fixed categories. If we do not believe in the existence of true clusters, but instead assume the relative positions of the sequences matter more, we need to focus on their (dis)similarities directly. Using pairwise dissimilarities in explaining an outcome is practically impossible, so we turn to the concept of
In discussing representativeness of sequences, Gabadinho and Ritschard (2013) consider different options, including frequency, neighborhood density, and centrality. Here, centrality considers the distances or dissimilarities between sequences. Centrality can be calculated as the sum of dissimilarities between a subject and all (other) members in a group. The smaller the sum, the more central the subject; the most central subject is called the
As an example of how membership probability and representativeness differ, consider the situation depicted in Figure 2. Subjects M1, M2, and M3 are the medoids, that is, the most central members of their respective clusters. As such, they are the best single representatives of their clusters. We can be fairly certain they belong to their respective clusters; their membership probabilities are high regarding their own clusters and low regarding all other clusters.
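The centrality-based definition of a medoid used here can be illustrated with a short sketch. The function names and the one-dimensional toy data are assumptions made for the example; any square dissimilarity matrix between sequences could take the place of `diss`.

```python
def centrality(diss, members):
    """Sum of dissimilarities from each member to all other members of the group."""
    return {i: sum(diss[i][j] for j in members if j != i) for i in members}


def medoid(diss, members):
    """The member with the smallest sum of dissimilarities (the most central one)."""
    sums = centrality(diss, members)
    return min(sums, key=sums.get)


# Hypothetical toy data: five subjects on a line, dissimilarity = absolute difference.
positions = [0, 2, 4, 6, 8]
diss = [[abs(p - q) for q in positions] for p in positions]
members = list(range(len(positions)))

sums = centrality(diss, members)
print(sums)                   # smaller sum = more central
print(medoid(diss, members))  # index of the most central subject
```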

Figure 2. Example clusters with strong representatives (medoids M1, M2, and M3) and two types of weak representatives (S1 and S2).
Subjects S1 and S2, on the other hand, are distant from the closest medoid M2, so they are much less representative of cluster 2, and medoid M2 is much less representative of them than of most of the other members. S1 and S2 are, however, different in their positioning. Subject S1 is of a mixed type, almost equally distant from medoids M2 and M3. Its membership probabilities for clusters 2 and 3 are thus similar, close to 0.5. Subject S2, however, is simply a distant subject: it is distant from medoid M2 but even further away from medoids M1 and M3. Even though it does not fit any cluster particularly well, its membership probability for cluster 2 is high, corresponding to strong certainty of being a member of cluster 2. Hence, we see that membership probability itself is not always a good measure of representativeness.
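The contrast between a mixed subject and a distant subject can be made concrete with a small numerical sketch. Everything in it is an assumption for illustration: the coordinates are invented, memberships are computed with a simple inverse-squared-distance rule (a stand-in, not the FANNY algorithm), and representativeness is taken as one minus the distance to a medoid divided by the largest distance in the toy data.

```python
import math

# Hypothetical coordinates loosely mimicking the layout of medoids and subjects.
medoids = {"M1": (0.0, 0.0), "M2": (30.0, 0.0), "M3": (15.0, 25.0)}
subjects = {"S1": (23.0, 12.0), "S2": (42.0, -5.0)}


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def memberships(d):
    """Illustrative soft memberships: normalized inverse squared distances."""
    if 0.0 in d:  # a subject that coincides with a medoid belongs to it fully
        return [1.0 if x == 0.0 else 0.0 for x in d]
    w = [1.0 / x ** 2 for x in d]
    return [x / sum(w) for x in w]


def representativeness(d, d_max):
    """Illustrative representativeness: 1 at a medoid, 0 at the largest distance."""
    return [1.0 - x / d_max for x in d]


points = list(medoids.values()) + list(subjects.values())
d_max = max(dist(p, q) for p in points for q in points)

prob, rep = {}, {}
for name, xy in subjects.items():
    d = [dist(xy, m) for m in medoids.values()]  # order: M1, M2, M3
    prob[name] = memberships(d)
    rep[name] = representativeness(d, d_max)
    print(name, [round(v, 2) for v in prob[name]], [round(v, 2) for v in rep[name]])
```

Under these assumptions S1 splits its membership between clusters 2 and 3, whereas S2 receives a high membership for cluster 2 even though its representativeness with respect to M2 is well below that of the medoid itself (which is 1).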
If we are dealing with a type A or type B situation (class-dependent sequence–outcome relationship), the relative position within the cluster and thus subjects’ representativeness is not an issue, unless we expect to find subjects that are not members of any cluster (outliers). However, in situations of types C and D, representativeness is arguably more important and often a theoretically more justified approach, as we must consider subjects’ positions in relation to others, for example, by comparing them with some theoretical ideal types or medoids.
Creating Variables from Sequences
Table 1 presents different ways of constructing variables from sequences, two of which are based on a crisp clustering algorithm (in this case, the PAM algorithm) and two on a fuzzy clustering algorithm, here the fuzzy analysis (FANNY) algorithm (Kaufman and Rousseeuw 2009).
Table 1. Variable Construction for the Simulation and Empirical Studies Including Two Methods for Crisp Clustering (Using the PAM Algorithm) and Two for Fuzzy Clustering (Using the FANNY Algorithm)
Let
Finally, we construct a variable that takes into account
This leads to
Simulation Study
In this section we illustrate how different approaches succeed in predicting the outcome when the sequence–outcome relationship is class dependent or similarity based. All analyses were done in the R environment (R Core Team 2021), using packages cluster (Maechler et al. 2021), seqHMM (Helske and Helske 2019), TraMineR (Gabadinho et al. 2011), ggplot2 (Wickham 2016), and dplyr (Wickham et al. 2021). The code to reproduce the simulation experiment and additional analyses can be found on GitHub (https://github.com/helske/seqs2vars).
We first generated sequence data by creating three mixture Markov models with varying clustering tendencies, each with four states and four mixture components (“clusters”). We simulated 10,000 sequences of length 20 from each of these models. We then calculated dissimilarities using optimal matching for spell sequences with constant substitution costs (Studer and Ritschard 2016). We chose this measure because it is sensitive to sequencing and thus is well suited for analyzing data generated with a Markovian model. We then clustered the sequences using PAM and FANNY. Assessed using the average silhouette width (ASW; based on PAM) as a measure of clustering tendency (Kaufman and Rousseeuw 2009), the first model generated sequences with strong clustering tendency (ASW of about 0.8), the second generated sequences with a reasonable clustering tendency with some overlap between sequences from different submodels (ASW of about 0.6), and the third generated sequences with a weak clustering tendency (ASW of about 0.3). Figure 3 shows samples of clustered sequences. Using the clustering solutions and the corresponding dissimilarity matrix, we then created several covariate matrices
with

Figure 3. Clusters of sequences simulated from three types of mixture Markov models with varying clustering tendencies (weak, reasonable, and strong).
Using each of these data sets, we ran Monte Carlo simulations in which for each replication we sampled 1,000 of the original sequences and a corresponding
In reality, sequence data are unlikely to be generated by such simple Markovian models, and the relationship between sequences and outcome variables is more complex. Thus, the following results reflect more of a best-case scenario; in practice, the differences between the methods and potential errors could be much larger than observed here.
Figure 4 shows the average RMSE and 95th percentile intervals from 10,000 replications for different data-generating models and estimation methods. We see that for classification-based data, the prediction improved (RMSE decreased) when the clustering tendency strengthened. Not surprisingly, the estimation based on hard classification performed best with strong, clear clusters. Soft classification performed, on average, slightly better in cases where we had classification error (data with a reasonable or weak clustering tendency). The hard classification method produced the widest percentile intervals: its performance was the most inconsistent. When the outcome was generated on the basis of membership probabilities, the clustering tendency did not have a strong effect on the average RMSE when using the estimation method that matched the data-generation process (soft clustering, the best-case scenario), whereas other methods performed best with stronger clustering tendency.

Figure 4. Average root mean squared errors (RMSEs) of predictions from 10,000 simulations with 95th percentile intervals.
On the other hand, when the data were generated on the basis of representativeness (the case we argue is typically the most realistic in social sciences), the clustering tendency did not have a clear effect on the average RMSE for any of the methods, and all methods produced results not far from the theoretical value of 0.25 (the standard deviation of the noise term
We performed additional experiments where the original data generation and covariate creation were done with FANNY-based hard classification and gravity centers, a potential alternative to our representativeness measure (Batagelj 1988). We also tested ranking the methods on the basis of the Bayesian information criterion instead of RMSE (excluding the pseudoclass method, for which the Bayesian information criterion is not defined). These results are available in the supplementary material on GitHub (https://github.com/helske/seqs2vars/tree/main/simulations). These additional simulations were in line with the conclusions of the main results, with FANNY-based hard classification performing similarly to the PAM-based hard classification and the gravity center method being similar to the representativeness method.
Empirical Study
We now illustrate the performance of the four methods with an empirical research problem: predicting a continuous earnings variable or a binary poverty variable with simple two-state sequences of employment trajectories. The timing, length, and frequency of employment and unemployment spells have a profound effect on earnings (Fuller 2015; Gangl 2006). These features of one’s occupational career determine the opportunities for on-the-job human capital accumulation, while also signaling a worker’s competence and unobservable qualities to potential employers (Gangl 2006). Over time, the cumulative effects on earnings can be substantial (Fuller 2015).
The data used in this example come from the Swedish population registers. The data set comprises a sample of all residents of Sweden who turned 18 years old in 1997 and who lived continuously in the country until 2017 (
We were interested in two outcome variables: (1) the probability of being in the lowest income quintile at the end of the sequence (a measure of poverty) and (2) the square root of cumulative income over the entire sequence (in 1,000 SEK). Income in this case is income from wages, business, and other economic activity, including social benefits related to economic activity (e.g., parental leave and sick leave compensations). We also had measurements of characteristics of the individual and their family background at the start of the sequence: region of residence (metropolitan areas, smaller cities, countryside), mother’s education, father’s education, mother’s employment status, father’s employment status, and sex.
We estimated the models for poverty using logistic regression and the models for income using ordinary least squares regression with four different methods to predict the outcome with employment histories. In both cases, we controlled for characteristics of the individual and their family at the start of the sequence.
For the clustering of sequences, we used a dissimilarity measure that is sensitive to the duration of (un)employment spells, namely, optimal matching with a substitution cost of 2 and an indel cost of 1 (Studer and Ritschard 2016). We chose a solution with five clusters for our example, with clusters differing in timing, prevalence, and continuity of employment. Figure 5 shows the medoids and the index plots for the sequences within each cluster of the hard classification (PAM) solution. The first cluster,

Figure 5. Yearly employment state distribution and medoids for five-cluster partitioning-around-medoids solution.
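For readers unfamiliar with optimal matching, the dissimilarity used for the clustering (a substitution cost of 2 and an indel cost of 1) can be sketched with a generic dynamic-programming edit distance. This is a minimal illustration, not the TraMineR implementation used in the study; the function name and the example sequences (E = employed, U = unemployed) are assumptions.

```python
def om_distance(seq_a, seq_b, sub_cost=2.0, indel_cost=1.0):
    """Classic optimal matching distance with constant substitution and indel costs."""
    n, m = len(seq_a), len(seq_b)
    # prev[j] = cost of turning the first i-1 states of seq_a into the first j of seq_b
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            sub = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub_cost
            curr[j] = min(
                prev[j - 1] + sub,       # match or substitute
                prev[j] + indel_cost,    # delete a state from seq_a
                curr[j - 1] + indel_cost # insert a state from seq_b
            )
        prev = curr
    return prev[m]


# Same states, shifted by one year: one deletion plus one insertion is enough.
print(om_distance("UEEE", "EEEU"))
# One differing year: one substitution, or equivalently one deletion and one insertion.
print(om_distance("EEUU", "EEEU"))
```

With these costs a substitution is never cheaper than a deletion followed by an insertion, so spells that are merely shifted in time remain relatively close to each other.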
A classification assigning cluster membership on the basis of the highest membership probability obtained by the FANNY algorithm showed similar qualitative patterns, with the majority of all sequences in each of the five classifications being allocated to the same cluster. There were, however, minor differences in allocation. First, the lowest degree of overlap between the two classifications was 57 percent for the category
Earlier research suggests the varying degrees of attachment to employment and the different lengths of employment spells found in each cluster would have distinct outcomes in terms of poverty and cumulative earnings. This can be studied using the cluster variable as a predictor of these two outcomes, in a similar way to the study by Fuller (2015). In addition, we repeated the analysis using the other approaches described in the simulation study, namely pseudoclass, soft classification, and representativeness.
In this case, we did not believe any true employment clusters exist or that the outcomes would be class dependent. Instead, we assumed the relationship between the work trajectory and the outcome (income or poverty) is similarity based and expected that representativeness would be the most appropriate measure to use. Our analysis highlights the substantial differences in how the different types of sequence variables perform as predictors. Before showing the full results, we illustrate the differences in predicted values with a simple example.
When using hard cluster memberships and setting
where
(
(
(
A hard classification method (PAM) assigns these three sequences to the same cluster, which is characterized by long unemployment spells. Here, sequence (M) is the medoid of the cluster and shows a pattern of mostly unemployment, (A) consists solely of unemployment spells, and (B) is an outlier with a long spell of nearly continuous employment that ends halfway through the period. For the case of hard cluster memberships as predictors, the square root of expected 20-year cumulative earnings for all these sequences is reduced to
Note that cluster membership is reflected in the equation as a single parameter referring to the cluster assigned to all three sequences (in this case the first cluster). For simplicity, if we assume the individuals in question belonged to the baseline category for all other covariates, the predicted value of the square root of the 20-year cumulative earnings (in thousands of Swedish kronor) is 31.16, which translates approximately to SEK 970,000 for the three cases of (M), (A), and (B).
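The conversion from the square-root scale back to kronor is a simple squaring of the predicted value, as the following worked arithmetic shows (this reproduces the rounding in the text; it is a plain back-transformation, with no retransformation correction assumed):

```latex
\[
\bigl(\widehat{\sqrt{\text{earnings}}}\bigr)^{2}
  = 31.16^{2}
  = 970.9456 \ \text{(thousand SEK)}
  \approx \text{SEK } 970{,}000 .
\]
```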
Likewise, for the pseudoclass approach, the equation is
where the coefficients are averages over multiple pseudoclass samples (the estimates are different compared with those from the hard classification method, as reflected by the asterisks). The equivalent square root of predicted earnings is 35.79, translating into approximately SEK 1,280,000 for all of (M), (A), and (B).
As discussed earlier, a key difference between hard clustering and pseudoclass is that pseudoclass assigns cluster memberships on the basis of the estimated membership probabilities from a fuzzy cluster solution. The coefficients represent the averaged cluster membership effect over all the replications, and the standard errors are adjusted to reflect the uncertainty deriving from the probabilistic cluster allocation. In this way, pseudoclass deals with the problem of treating group assignment as certain by adjusting the estimated parameters and standard errors so they reflect the uncertainty in cluster allocation. Yet pseudoclass is similar to hard classification in that it attributes a uniform effect to all members of the same cluster, as our example shows. Also note the difference in estimates between the methods: pseudoclass tends to shrink estimates toward the average (Bray et al. 2015; Lanza et al. 2013), which makes it the most conservative of all methods in terms of finding differences between the groups.
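The pseudoclass logic can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the analysis code of the study: the data are simulated, the "model" is just a cluster mean with its squared standard error, and the replications are pooled with Rubin's rules. All names are hypothetical.

```python
import random
import statistics


def draw_pseudoclasses(probs, rng):
    """Sample one cluster label per subject from its membership probabilities."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in probs]


def cluster_mean_and_var(y, classes, k):
    """Mean outcome in cluster k and the squared standard error of that mean."""
    vals = [yi for yi, c in zip(y, classes) if c == k]
    return statistics.fmean(vals), statistics.variance(vals) / len(vals)


def rubin_combine(estimates, variances):
    """Pool replications: mean estimate, within- plus inflated between-variance."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return pooled, within + (1 + 1 / m) * between


rng = random.Random(1)
# Hypothetical data: 100 subjects mostly in cluster 0 (outcome near 10),
# 100 subjects mostly in cluster 1 (outcome near 20).
probs = [[0.9, 0.1]] * 100 + [[0.2, 0.8]] * 100
y = [10 + rng.gauss(0, 1) for _ in range(100)] + [20 + rng.gauss(0, 1) for _ in range(100)]

estimates, variances = [], []
for _ in range(20):  # 20 pseudoclass replications
    classes = draw_pseudoclasses(probs, rng)
    est, var = cluster_mean_and_var(y, classes, k=0)
    estimates.append(est)
    variances.append(var)

pooled, total_var = rubin_combine(estimates, variances)
print(round(pooled, 2), round(total_var ** 0.5, 2))
```

Because some subjects from the second group are drawn into cluster 0 in every replication, the pooled cluster mean lies above 10, which illustrates the shrinkage toward the overall average mentioned above.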
In contrast, the equations for the soft classification and representativeness methods reflect within-cluster variability by incorporating more parameters and changing the predictors into continuous measures. For soft classification, the equation includes
where
For (M), the predicted value of square root earnings (in 1,000 SEK) is 21.10, translating into about SEK 445,000; for (A), the same predicted value is 19.9, translating into about SEK 396,000; and for (B) the value is 40.41, which translates into about SEK 1,633,000. Thus, the estimation based on soft classification captures the considerable earnings difference that results from differences in the presence of unemployment spells within the three sequences.
In a similar vein, the equation using representativeness incorporates multiple parameters (which do not have to sum to 1):
where
The predicted square root earnings (in 1,000 SEK) for sequence (M) using the representativeness method is 29.80, translating into approximately SEK 888,000. For sequence (A) it is 12.29, which translates to approximately SEK 151,000. For sequence (B), the value is 34.17, translating into about SEK 1,167,000. As in the case of soft classification, representativeness also captures the differences in earnings between the three sequences even when the clustering algorithm has assigned them to the same group.
As illustrated, the four approaches differ in terms of how predictions are calculated, which means they also differ in terms of interpreting the estimated modeling results. Interpretation is most straightforward for hard clustering, as it is interpreted as any categorical variable: parameter coefficients
Here we show model results as AMPs for all four approaches. AMPs and AMEs are similar concepts, except that instead of comparisons with a reference case as in the more typical AMEs, the AMPs, also known as average adjusted predictions, show marginal predictions under some interesting configurations, in our case, at the medoids obtained from the hard classification. Specifically, separately for each medoid, we predicted the outcome for each individual by replacing their observed representativeness values with the representativeness values of the medoid (while keeping other covariates at their observed values) and then calculated the average of the predictions over all individuals. Similarly, for soft classification, we replaced the observed membership probabilities of each individual with those of the medoids, and with hard classification, AMPs are calculated by replacing the observed cluster memberships. Finally, the pseudoclass AMPs are calculated for each pseudoclass replication as with hard classification, and the set of AMPs obtained from all pseudoclass replications are then combined using Rubin’s rules.
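The AMP calculation at a medoid can be sketched as follows. The coefficients, the single control variable `x`, and the representativeness values of the two medoids are invented for illustration; only the procedure (replace the sequence variables, keep the other covariates, average the predictions) follows the description above.

```python
def predict(intercept, beta_seq, beta_x, seq_vars, x):
    """Linear prediction from sequence variables and one control variable."""
    return intercept + sum(b * v for b, v in zip(beta_seq, seq_vars)) + beta_x * x


def average_marginal_prediction(coefs, data, medoid_seq_vars):
    """Set everyone's sequence variables to the medoid's values, keep x, average."""
    preds = [predict(*coefs, medoid_seq_vars, row["x"]) for row in data]
    return sum(preds) / len(preds)


# Hypothetical fitted coefficients: intercept, two sequence variables, one control.
coefs = (10.0, [5.0, -3.0], 2.0)
# Hypothetical individuals with their observed control variable.
data = [{"x": 0}, {"x": 1}, {"x": 1}, {"x": 0}]
# Hypothetical representativeness values of two medoids with respect to both medoids.
medoid_values = {"medoid 1": [1.0, 0.2], "medoid 2": [0.2, 1.0]}

amps = {name: average_marginal_prediction(coefs, data, vals)
        for name, vals in medoid_values.items()}
print(amps)
```

For hard classification the same function applies with one-hot cluster indicators in place of the representativeness values, and for soft classification with the medoid's membership probabilities.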
The top two panels (a and b) in Figure 6 display the AMPs for the clusters (hard and soft classification and pseudoclass) or medoids of each cluster (representativeness) by outcome and estimation approach. The estimates largely agree with each other, predicting worst outcomes for the

Figure 6. Average marginal predictions, root mean squared errors (RMSEs), and Brier scores by estimation method and outcome: (a) average marginal predictions (income), (b) average marginal predictions (poverty), (c) RMSEs (income), and (d) Brier scores (poverty).
The lower panels (c and d) in Figure 6 show the RMSEs and Brier scores that we used to assess the accuracy of the predictions. We computed them using a leave-one-out cross-validation method over 100 folds and estimated confidence intervals by using bootstrapping with 1,000 replications. As expected, representativeness produced more accurate estimates in both cases than did the hard classification and pseudoclass methods; soft classification was close to the performance of representativeness, especially in the continuous case. The Appendix provides further results for the empirical study, such as descriptive statistics, parameter estimates, and information criteria from each model.
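The two accuracy measures are straightforward to compute from held-out predictions. The sketch below uses invented numbers and hypothetical function names; it shows only the scoring step, not the cross-validation or bootstrap procedure.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error for a continuous outcome."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def brier_score(observed, predicted_prob):
    """Mean squared difference between a 0/1 outcome and its predicted probability."""
    n = len(observed)
    return sum((p - o) ** 2 for o, p in zip(observed, predicted_prob)) / n


# Hypothetical held-out values.
income_obs, income_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
poverty_obs, poverty_prob = [1, 0, 1], [0.9, 0.2, 0.6]

print(round(rmse(income_obs, income_pred), 4))
print(round(brier_score(poverty_obs, poverty_prob), 4))
```

For both measures, smaller values indicate more accurate predictions.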
Discussion
In this article, we aimed to bring forward and discuss the problems of the traditional approach of creating variables from SA clusters and to propose some alternative approaches. Our simulation study demonstrated how the type of data-generating process affects the performance of the different methods. In cases with true but unknown clusters, hard classification worked well on data with strong clustering tendency, whereas soft classification was consistently better on data with weaker clustering tendencies (i.e., when classification error is an issue). However, when there were no true clusters to begin with but the sequence–outcome relationship was assumed to be similarity based, representativeness clearly outperformed other methods.
We also studied the performance of the methods on empirical data, where we predicted two types of income-related variables (a continuous cumulative income variable and a binary poverty measure) with simple employment trajectories and control variables. In this case, we assumed the relationship between the sequences and the outcome would be closest to the similarity-based setup and expected that the representativeness measure would result in better predictions than the other methods. This was confirmed by our analyses using cross-validation, but the advantage of using representativeness was not as evident in the empirical case as it was in the simulations. Soft classification was equally good for the continuous outcome, but it performed less well when the outcome was binary.
We argue that in the social sciences, subjects are typically more or less hybrids of multiple ideal types, and the outcome variable of interest is affected by multiple factors with varying magnitudes, which is not properly captured by hard classification into clusters. Earlier LCA literature hypothesized that the pseudoclass method can better account for uncertainty due to clustering. The benefit of the pseudoclass method over the other proposed alternatives is that it tries to adjust for the uncertainty in the classification without altering the interpretation of the model in terms of the corresponding predictors. However, its performance in our simulation and empirical studies was less than convincing, which is in line with recent LCA literature (Bray et al. 2015; Lanza et al. 2013). The pseudoclass method is also computationally the most demanding of the considered methods. Although our pseudoclass approach is based on fuzzy clustering of sequence dissimilarities, not latent class models, on the basis of all these findings we cannot recommend the pseudoclass method as an alternative to the traditional hard classification technique.
Soft classification with mixed memberships accounts for the uncertainty of membership allocation, and as such it is a clear improvement over the traditional hard classification with fixed memberships. The potential problem with soft classification is its inability to deal with cases that are not well represented by any of the ideal types (outliers). Similarity-based approaches such as representativeness take into account the closeness of the sequence to the ideal types while also distinguishing between mixed types and outliers. Other similarity-based measures may also work, such as measures based on multidimensional scaling, especially when the data show clear and easily interpretable principal components or when the goal is to construct a control variable (where interpretation of effects of the sequence variables is not relevant). If outliers are not a big issue, soft classification and representativeness measures are expected to lead to relatively similar results. In this case, soft classification could be favored because of simpler interpretation. In theory, the use of representativeness and membership probabilities can introduce some level of multicollinearity into the models, but we do not see this as a major issue, as multicollinearity affects only the interpretation of individual predictors, and in these cases, the effects of these sequence-related variables are best considered as a whole (as in our examples).
Related to multicollinearity, we fixed the number of clusters and medoids to be the same across the different approaches for comparability. In practice, it may be advisable to use a smaller number of clusters/medoids for representativeness and possibly also for soft classification in comparison with hard classification, because of the continuous nature of these measures. For example, in our simple empirical example, dissimilarities to sequences of “always working” and “never working” capture the same information (causing multicollinearity), so in practice adding only one of these as a representativeness predictor would be sufficient.
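The redundancy between the "always working" and "never working" reference sequences is easy to verify numerically. The sketch below uses the Hamming distance as a simple stand-in for the dissimilarity measure of the empirical study, and the example sequences (E = employed, U = unemployed) are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical two-state employment sequences of length 8.
sequences = ["EEEEEEEE", "UUUUEEEE", "EUEUEUEU", "UUUUUUUU", "EEEEEEUU"]
always = "E" * 8
never = "U" * 8

d_always = [hamming(s, always) for s in sequences]
d_never = [hamming(s, never) for s in sequences]
totals = [a + b for a, b in zip(d_always, d_never)]

print(d_always)
print(d_never)
print(totals)  # constant across sequences: one distance determines the other
```

Because the two distances always add up to the sequence length in this two-state setting, including both as predictors alongside an intercept would make them perfectly collinear.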
To conclude, we demonstrated the importance of considering how sequences and the outcome variable of interest are related, and the need to adjust the analysis accordingly. If true underlying clusters are expected to exist, then hard or soft classification methods should be preferred (depending on how big an issue classification error is expected to be). In social sciences, the whole idea of the existence of any “true clusters” is often implausible. Often the main purpose of cluster analysis is to reduce the complexity of the sequence data, in which case similarity-based approaches or soft classification should be considered. On the basis of our analyses, the representativeness method shows promising results, and perhaps other alternatives will emerge in future work. We hope this article will encourage further discussion and research on combining SA and subsequent modeling.
