Introduction
Market segmentation analysis represents one of the key techniques in tourism research used to develop knowledge about the consumer behavior of, and gain market intelligence about, tourists. In the academic tourism literature, approximately 5% of articles published between 1986 and 2005 related to market segmentation (Zins 2008), showing that the topic represents a key methodology for academic tourism researchers. Between 2011 and 2012 alone, the three main publication outlets for data-driven segmentation studies in tourism published numerous such studies.
Despite the popularity of data-driven market segmentation analysis in tourism (recent examples published in this journal alone include Nicolau 2012; Masiero and Nicolau 2012; Weaver and Lawton 2011; and Needham et al. 2011), areas of segmentation analysis remain where no recommendations are given to data analysts (Dolnicar and Lazarevski 2009). One such area relates to the sample size required for data-driven market segmentation analysis: no guidelines exist that would allow the data analyst to ensure that the available sample size is sufficient for the analysis, because recommendations and techniques used to determine sample sizes for other statistical methods cannot be applied (such as sample sizes derived from power analysis for statistical hypothesis testing or from optimal design for regression analysis). Nevertheless, insufficient sample sizes can have serious negative consequences for the validity of the market segmentation solution, where validity implies that the true structure of the data has been identified.
To understand the potential negative consequences of insufficient sample sizes in data-driven market segmentation studies, we must bear in mind that segmentation analysis is exploratory by its very nature: any segmentation algorithm will always arrive at a grouping of individuals, whether or not that grouping is meaningful. Furthermore, in segmentation analysis each variable represents one dimension in space. A data-driven segmentation study with 20 variables (not uncommon in tourism), for example, means that a mathematical problem is solved in 20-dimensional space. To find groups in 20-dimensional space, many data points are required; otherwise no patterns can be detected and any resulting grouping is essentially arbitrary. This can be illustrated by imagining the simplest possible situation, two-dimensional space, as shown in Figure 1. Figure 1(A) shows 100 data points in this two-dimensional space and clearly reveals that two clusters exist in these data. Figure 1(B) shows the same data situation but with only six data points. Here it is impossible to determine, based on those six data points alone, what the true structure of the data is: the correct solution may contain anything between one and six clusters.

(A) A sample with 100 observations from a bivariate normal mixture with two equally sized components. (B) A subset with 6 observations from the 100.
The problem illustrated in Figure 1 becomes exponentially worse as the number of dimensions increases. It is exacerbated by the fact that no indicator exists to warn the data analyst when the sample size–to–variable number ratio is critical and may thus lead to incorrect conclusions.
The issue of selecting an adequate sample size is largely ignored when a posteriori or data-driven segmentation studies are conducted. In a review of 47 data-driven tourism segmentation studies, Dolnicar (2002) highlighted the problem of potentially insufficient numbers of respondents, given large numbers of variables in the segmentation base. Specifically, Dolnicar (2002) reports that the sample sizes of the 47 reviewed studies ranged from a mere 46 to nearly 8000 respondents, with a median of 461. More than one-third of the data sets had fewer than 400 respondents. Simultaneously, the number of variables ranged from 3 to 56, with approximately two-thirds of studies using between 10 and 22 variables in their segmentation bases. The median ratio of the number of respondents divided by the number of variables is 22.4. The correlation between sample size and number of variables is not significant, indicating that data analysts do not collect larger samples in cases where the data situation is more complex because a high number of variables is included in the segmentation base.
For the present study, we conducted a review similar to that conducted by Dolnicar in 2002 with more recent articles, specifically data-driven market segmentation studies published in the last decade in the Journal of Travel Research.
These results also indicate that in tourism research the sample sizes are at best modest and there is no need to employ subsampling strategies to reduce the computational burden in the segmentation analysis due to large data sets, as suggested for other areas of research where millions of observations are available (cf. Bejarano et al. 2011).
Dolnicar’s (2002) results indicate that despite market segmentation being used extensively in tourism research, the fundamental question of how large a sample must be for a given number of variables has not yet been explicitly considered, and practically no guidance is available to data analysts regarding the required sample size.
The contribution of the present study is to derive sample size requirements for data-driven market segmentation analyses. This will allow data analysts to check whether the sample available for their segmentation analysis is sufficient, given the number of variables in the segmentation base, or whether it may be necessary to either collect more data or reduce the number of variables used in the analysis.
In the present study, sample size requirements are derived by conducting an extensive simulation study using artificial data sets whose correct cluster structure is known. Artificial data are required because for empirical survey data the true segmentation solution is unknown. Consequently, the effects of insufficient sample sizes cannot be studied, because of the lack of a dependent variable (correctness of the segmentation solution). We conduct simulations for a range of scenarios, which have been modeled to be similar in nature to typical empirical tourism data sets to ensure that the final recommendation is adequate—even under the most difficult of data circumstances. Characteristics of typical data sets used in data-driven segmentation studies conducted in the tourism literature have been taken from Dolnicar’s (2002) review of data-driven segmentation studies published in tourism.
Prior Work
Only two recommendations about the appropriate ratio of respondents to number of variables have been published to date. Neither is easily accessible to the English-speaking scientific community: one is a research monograph in German by a Viennese psychologist (Formann 1984), and the other is a recommendation on a help page of the add-on package clusterGeneration (Qiu and Joe 2009) for the statistical software environment R (R Development Core Team 2013).
Formann (1984) proposes including at least 2^k respondents (where k is the number of segmentation variables), and preferably five times that number (5 · 2^k).
Qiu and Joe (2009) suggest that the sample size should amount to a minimum of 10 times the number of variables in the segmentation base times the number of clusters.
According to Qiu and Joe (2009), we may assume that more respondents are needed when data contain more segments or clusters. More generally, we might expect that the sample size requirements will increase with the difficulty of the segmentation task.
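These two rules of thumb are easy to operationalize. The following sketch (in Python; the function names are ours, not from the cited sources) contrasts them and shows how quickly Formann's exponential rule outgrows the multiplicative rule of Qiu and Joe as variables are added:

```python
def formann_rule(n_variables: int) -> int:
    """Formann (1984): at least 2^k respondents for k segmentation
    variables (preferably five times that number)."""
    return 2 ** n_variables

def qiu_joe_rule(n_variables: int, n_clusters: int) -> int:
    """Qiu and Joe (2009): at least 10 * variables * clusters respondents."""
    return 10 * n_variables * n_clusters

# For a 20-variable, 4-cluster problem the two rules diverge sharply:
# formann_rule(20) -> 1048576, qiu_joe_rule(20, 4) -> 800
```

For a typical tourism segmentation base of 20 variables, Formann's rule would already demand over a million respondents, which illustrates why it is rarely actionable in survey research.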
Levels of difficulty of segmentation tasks have been discussed by Dolnicar and Leisch (2010). They argue that typical survey data situations range from natural clusters (where density clusters are present in the data) through reproducible clusters (where the data contain only weak structure) to constructive clustering (where virtually no structure exists in the data that would allow repeated segmentation analyses to arrive at the same results). The further apart market segments are, the more likely it is that the underlying structure is one of natural clusters; the closer they are, the more likely it is that a cluster structure cannot be identified at all, making constructive clustering necessary. This leads to the assumption that the separation between market segments is a key criterion for determining the required sample size.
Finally, it is not uncommon for survey data to contain variables that do not contribute to the cluster structure of the data. Data analysts may include such variables in the analysis because they do not know in advance whether or not they contribute to the segmentation solution. If they do not contribute, they may instead mask the cluster structure and, as a consequence, lead to less homogeneous clusters. Such variables are referred to as noisy variables.
Methodology
Data Generation
Because, unlike power calculations for statistical hypothesis testing, there is no direct way of calculating an adequate sample size for a given segmentation problem, we used simulation analyses. Simulation studies have one major advantage over studies with empirical survey data: the true cluster structure is known. Consequently, whether a given segmentation solution has identified the cluster structure in the data correctly can easily be assessed.
In statistical hypothesis testing, the performance of a statistical test is measured by its power. Segmentation analysis results in a partition of the data, and a natural performance measure is the correctness of the predicted grouping compared to the true grouping. Therefore, correctness represents the key performance criterion and dependent variable in the present study. It is computed as follows: for each simulated respondent, the segment membership resulting from the clustering algorithm is compared to the true membership. Thus, the criterion is how well the original partition of the data is revealed by the clustering algorithm. As noted by Ben-David, Pál, and Simon (2007), the solution returned as the “best” one by the clustering algorithm need not correspond to the best solution with respect to the original partition. However, the use of internal criteria for selecting a solution is unavoidable in clustering where the true partition is unknown. A cross-tabulation of the assignments is used to determine the adjusted Rand index (Hubert and Arabie 1985), a measure of agreement between two partitions of a data set. The Rand index was introduced by Rand (1971) and is defined as the proportion of pairs of objects that are consistently assigned to the same or to different clusters across two partitions. The Rand index does not correct for agreement by chance (Hubert and Arabie 1985), while the adjusted Rand index does. Values of the adjusted Rand index lie between −1 and 1, where 1 indicates that the exact same solution is identified across repeated computations. See the appendix for details on the Rand index and the adjusted Rand index.
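For readers who wish to compute the correctness criterion themselves, the adjusted Rand index can be obtained directly from the cross-tabulation of the two partitions. A minimal, dependency-free Python implementation (ours, for illustration; the formula follows Hubert and Arabie 1985):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie 1985) between two partitions,
    given as equal-length sequences of cluster labels."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency table cells
    a = Counter(labels_a)                      # row sums
    b = Counter(labels_b)                      # column sums
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                  # degenerate case, e.g. one cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions yield a value of 1 regardless of how the cluster labels themselves are numbered, which is exactly the property needed when comparing an algorithm's grouping to the true grouping.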
To ensure that recommendations about adequate sample size are valid across a range of data circumstances encountered in tourism research, artificial data sets with different characteristics are generated by drawing data from different finite mixtures of multivariate Gaussian distributions, where the settings are selected to cover the range of situations as described in the review on previous segmentation studies in tourism given in Dolnicar (2002). Specifically, the settings differ in (1) the number of variables in the segmentation base, (2) the number of respondents, (3) the number of clusters, (4) the level of separation between clusters, and (5) the proportion of noisy variables in the segmentation base.
Where possible, we chose the exact parameters for the above variations in artificial data sets in order to model as closely as possible the characteristics of empirical tourism data sets used in previous segmentation studies in tourism. Information about typical tourism data characteristics used in segmentation studies was taken from the Dolnicar (2002) review article. As shown in Table 1, artificial data sets include 10, 16, or 22 variables in the segmentation base. These values represent the midpoint (16) and the borders of the interval containing the middle two-thirds of prior tourism segmentation studies: 10 and 22. Because 64% of prior tourism segmentation studies grouped respondents into three or four segments, these two numbers of clusters have been chosen.
Overview of Factors Used in the Full-Factorial Design Simulation Study.
With respect to noisy variables, no guidance is available from previous segmentation studies, in which noisy variables were not explicitly accounted for. However, we may assume that situations exist where no noisy variables are present (e.g., when segmentation variables were carefully selected in advance), as well as situations where a substantial proportion of the variables is irrelevant to the clustering structure. Therefore, we created artificial data sets with levels of contamination covering an extensive range: some contained no noisy variables (0% contamination), in some one-quarter of the variables were noisy (25% contamination), and in some half of all variables were noisy (50% contamination). Noisy variables were generated by drawing from a multivariate Gaussian distribution with variation similar to that of the non-noisy variables, independently of the non-noisy variables.
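The data-generation step can be sketched as follows. This is a simplified Python stand-in for the R package clusterGeneration actually used in the study; spherical covariance, equal component probabilities, and the function name are simplifying assumptions of ours:

```python
import random

def draw_segmentation_data(n, n_informative, n_noisy, centers, sd=1.0, seed=1):
    """Draw n observations from a Gaussian mixture whose components are the
    rows of `centers` (equally likely), then append independent noisy
    variables of similar variation that carry no cluster information."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        k = rng.randrange(len(centers))        # true segment membership
        informative = [rng.gauss(centers[k][j], sd)
                       for j in range(n_informative)]
        noisy = [rng.gauss(0.0, sd) for _ in range(n_noisy)]  # no structure
        data.append(informative + noisy)
        labels.append(k)
    return data, labels
```

With a 16-variable segmentation base and 25% contamination, for example, 12 informative and 4 noisy variables would be drawn per respondent; the returned true labels serve as the ground truth against which clustering correctness is measured.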
The degree of cluster separation was controlled using the so-called separation index, as described by Qiu and Joe (2006). The separation index measures the amount of space between two clusters by determining the optimal projection direction for the data and then defining the distance between the groups based on the lower and upper quantiles of the projected data points in each cluster. Values close to 1 indicate well-separated clusters, while values near 0 and negative values indicate touching and overlapping clusters, respectively.
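The quantile-based idea behind the separation index can be illustrated in one dimension. This is our simplified sketch; Qiu and Joe (2006) define the index along an optimal projection direction of multivariate data, and the nearest-rank quantile used here is a crude approximation:

```python
def separation_index(x1, x2, alpha=0.05):
    """One-dimensional sketch of the Qiu and Joe (2006) separation index:
    compares lower/upper alpha-quantiles of two (projected) clusters.
    Near 1 = wide gap; near 0 = touching; negative = overlapping."""
    def quantile(xs, q):
        xs = sorted(xs)
        return xs[round(q * (len(xs) - 1))]    # nearest-rank approximation
    lo1, hi1 = quantile(x1, alpha / 2), quantile(x1, 1 - alpha / 2)
    lo2, hi2 = quantile(x2, alpha / 2), quantile(x2, 1 - alpha / 2)
    if lo1 > lo2:                              # order clusters along the axis
        (lo1, hi1), (lo2, hi2) = (lo2, hi2), (lo1, hi1)
    return (lo2 - hi1) / (hi2 - lo1)           # gap relative to total spread
```

Two clusters with a clear gap between their quantile ranges produce a positive index, while overlapping clusters produce a negative one, matching the interpretation used in the simulation settings.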
The numbers of respondents were chosen following the rule given by Qiu and Joe (2009), implying that they depend linearly on the number of variables, leading to sample sizes ranging from 10 to 100 times the number of variables in the segmentation base.
The full-factorial design of all independent variables led to 540 data settings (see Table 1). For each setting, 50 data sets were created, which were subsequently clustered using the k-means algorithm.
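The clustering step applied to each simulated data set can be sketched with a plain Lloyd's k-means. The study itself ran its simulations in R, so the following dependency-free Python fragment is illustrative only (deterministic initialization with the first k points is a simplification of ours):

```python
def kmeans(points, k, iters=100):
    """Minimal Lloyd's k-means on a list of equal-length coordinate lists.
    Returns the cluster assignment (index 0..k-1) for each point."""
    centers = [list(p) for p in points[:k]]    # deterministic initial centers
    assign = [-1] * len(points)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                      for p in points]
        if new_assign == assign:               # converged
            break
        assign = new_assign
        for c in range(k):                     # recompute cluster means
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign
```

In the simulation, the assignment returned by the clustering algorithm would then be compared to the true segment memberships via the adjusted Rand index to obtain the correctness criterion.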
The artificial data sets were generated using the statistical software environment R and the package clusterGeneration.
Generalized Additive Model
In order to systematically investigate the association between the adjusted Rand index value and sample size, we used a generalized additive model (Hastie and Tibshirani 1986). Generalized additive models are regression models that enable the analyst to flexibly model the functional relationship between the independent and dependent variables. When using a linear regression approach, the functional relationship is restricted to being linear. When approximating a more flexible functional relationship, expansions of the independent variables can be used as predictors. In additive models, the functional relationship is assumed to be given by a smooth function, and spline bases are used for approximation, while flexibility and wiggliness of the function are penalized. The amount of penalization is controlled by a hyperparameter. We use generalized additive models because the influence of sample size is not expected to be linear. Generalized additive models allow the estimation of the effect of sample size on the dependent variable (adjusted Rand index) as a nonlinear, smooth function while controlling for the remaining covariates (number of variables, number of clusters, number of noisy variables, and separation value). The hyperparameter controlling the smoothness of the function was automatically selected using generalized cross-validation. The model was estimated in R using the package mgcv (Wood 2006, 2012). Thin plate regression splines were used as the spline basis. The control variables were added as categorical variables with a fully saturated design. This led to the following model:
RIadj = f(n) + x′β + ε,
where RIadj is the dependent variable corresponding to the adjusted Rand index value, f is a smooth function of the sample size n, x is the fully saturated dummy coding of the control variables (number of variables, number of clusters, number of noisy variables, and separation index) with coefficient vector β, and ε is the error term.
Results indicate that the number of variables does not affect the adjusted Rand index significantly, either alone or in interaction with other independent variables. The R-squared measure of goodness of fit is 0.754 for the models both with and without the covariate “number of variables.” However, this does not mean that the number of variables has no effect at all: because the sample sizes in the simulation were specified as multiples of the number of variables, larger segmentation bases were always paired with proportionally larger samples.
We consider the adequate sample size to be the smallest sample size at which the adjusted Rand index values do not differ significantly from the adjusted Rand index values obtained for higher sample sizes (i.e., up to 100 times the number of variables).
Results
Simulation results are displayed in Figure 2 and Figures A1 to A3 (appendix), which show the smooth function of sample size fitted by the generalized additive model together with its 95% confidence bands.

Smooth function of sample size and its effect on adjusted Rand index values, with 95% confidence interval (dashed lines), for the data aggregated over all levels of number of clusters, noisy variables, and separation indices. The recommended sample size is marked with a cross and a vertical line; the corresponding confidence interval is indicated by dashed lines.
Figure 2 depicts the fitted smooth function of sample size for the aggregated data setting.
In the aggregated data setting, the estimated degrees of freedom of the smoothing term for sample size are 7.49, indicating a clearly nonlinear effect of sample size.
The coefficients of the three factors (number of clusters, number of noisy variables, and separation index) are significant.
The first key result emerging from the simulation study is that, depending on the data situation, the correctness of the segmentation solution can suffer substantially if the sample size is insufficient. The fitted smooth function shows a clear increase, meaning that adjusted Rand index values are higher when larger samples are available. This effect is more pronounced for difficult segmentation problems. As mentioned above, the smoothed functions are almost linear and horizontal in data situations with a clear group structure; in these cases, the effect of additional observations on the correctness of the cluster results is weak. In the remaining cases, additional observations have a considerable positive effect on the adjusted Rand index values, which increases with the difficulty of the task. This means that for harder segmentation tasks, larger sample sizes can yield considerably improved results. In the case of 50% contamination by noisy variables, a separation index of −0.1, and four clusters, the improvement is 0.24 when comparing the adjusted Rand index values for sample sizes of 10 times versus 100 times the number of variables.
In the aggregated data setting, the adequate sample size is 60 times the number of variables in the segmentation base.
The fact that most adequate sample sizes range between 30 and 100 times the number of variables, depending on the data setting, underlines that the required sample size increases with the difficulty of the segmentation task.
Overall, our results indicate that the sample size–to–variable ratios currently used in tourism segmentation studies cannot be considered adequate. Based on the review by Dolnicar (2002), the median ratio is 22.4, meaning that half of the segmentation studies published in tourism use samples smaller than 22.4 times the number of variables in the segmentation base, and half use sample sizes larger than that. A Wilcoxon test confirms that the ratios used in published studies fall significantly short of the sample sizes recommended here.
The complication when attempting to derive sample size requirements as a function of the number of variables in the segmentation base is that each empirical data set is different and the degree of cluster structure is not known in advance. We therefore recommend using a sample of at least 70 times the number of variables in the segmentation base.
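This conservative rule is simple to apply in practice; the following minimal helper (the function name is ours) expresses it directly:

```python
def recommended_minimum_sample(n_variables: int, factor: int = 70) -> int:
    """Conservative sample size rule: at least 70 respondents per
    variable in the segmentation base."""
    return factor * n_variables

# A 22-variable segmentation base (the upper end of typical tourism
# studies) therefore calls for at least 1540 respondents.
```

Data analysts can compare the result against their available sample and, if it falls short, either collect more data or reduce the number of variables in the segmentation base.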
Conclusions
The aim of the present study was to determine the required sample size for data-driven market segmentation studies in tourism. An extensive simulation study using artificial data sets of varying structure and difficulty was conducted, and the effect of a number of typical factors relating to data structure was examined. Our results indicate that in most cases the correctness of segmentation analyses can be significantly improved by increasing the sample size. This effect is stronger for more difficult segmentation tasks. Only in the case of data with a very clear segment structure (a situation yet to be encountered by the authors) does increasing the sample size fail to lead to substantial improvements in correctness.
Because it is impossible in the case of empirical survey data to know the true data structure, we must by default assume that the segmentation task is complex, and consequently the most conservative rule for the sample size requirement resulting from the simulation study should be used: a sample of at least 70 times the number of variables in the segmentation base.
Another conclusion that we may draw from the present study is that noisy variables in the segmentation base increase the complexity of the segmentation task substantially. It is therefore worth carefully selecting the variables to be included in the segmentation base, rather than including an entire question battery by default. Noisy variables in the segmentation base can be avoided by (1) identifying and removing them after data collection (Brusco and Cradit 2001; Carmone, Kara, and Maxwell 1999; Steinley and Brusco 2008) or by (2) ensuring, before data collection, that survey questions are only included if they contain relevant information, as advocated by Rossiter (2002, 2011). Methods for identifying and removing noisy variables can either be employed before clustering, using characteristics of the distribution of the single variables (Steinley and Brusco 2008), or simultaneously during clustering, by taking into account the concordance between cluster solutions implied by different variables (see Brusco and Cradit 2001, who directly build on and improve Carmone, Kara, and Maxwell 1999).
Future work that would further add to our understanding of sample size requirements could investigate the degree to which sample size requirements vary across scale formats and clustering algorithms.
