Abstract
1 Introduction
Individual differences in phonetic production are often systematic, including covariation across multiple segments (Chodroff & Wilson, 2017; Tanner et al., 2020) and cues (Bang et al., 2018; Clayards, 2018; Tanner et al., 2020). Less is known about related individual differences in contrast signaling (Clayards, 2018), or systematicity of cue weighting across individuals. Examining weights of multiple cues can shed light on the nature of individual differences, in particular, whether they are attributable to speech style and degree of clear speech or contrast maintenance and cue trading. This paper investigates individual differences in the use of multiple cues in Mandarin sibilant production as a test case of these hypotheses about individual differences in cue weighting strategies.
1.1 Individual differences in production cue weighting
All speech sound contrasts make use of multiple co-varying phonetic dimensions. The sounds examined here, Mandarin sibilants, have been described as differing in spectral mean or center of gravity, kurtosis, spectral peak, and second formant of the following vowel, to name only a few (Kallay & Holliday, 2012; Lee-Kim, 2011). While speakers and listeners make use of many phonetic dimensions in contrast realization, dimensions differ in their relative strength. Cue weighting quantifies the degree to which an individual dimension contributes to overall perception or production of a contrast.
This paper focuses on production cue weight, which is typically calculated from production data using a classification algorithm (see Schertz & Clare, 2020, for a review). Individual differences in production cue weight have been observed among native speakers of the same language (Shultz et al., 2012), native and non-native speakers (Schertz et al., 2015), non-native speakers with different levels of L2 exposure (Kong & Yoon, 2013), and speakers of a language undergoing sound change (Bang et al., 2018; Coetzee et al., 2018; Kuang & Cui, 2018). However, specific investigations into the relationship between weights of multiple cues in production provide conflicting results with limited case studies.
There are three possible relationships between two cues to a contrast: no correlation, positive correlation, and inverse correlation (Schertz & Clare, 2020). These relationships can be described at token and talker levels (following discussion in Clayards, 2018). Cue relationships at the token level compare phonetic values of two cues across tokens, while cue relationships at the talker level compare weights of two cues across individual speakers. At the token level, it is possible for the values of two cues to exhibit no correlation, co-vary within category, and/or co-vary across category (Schertz & Clare, 2020). At the talker level, it is possible for weights of two cues to exhibit no correlation, positive correlation, or negative correlation.
A positive correlation indicates that speakers who use one cue more distinctively also use the other cue more distinctively. At the token level, values of two cues may be correlated within and across categories if the two cues are produced by the same articulatory mechanism (intrinsic linkage as in Wang & Fillmore, 1961). At the talker level, a positive correlation would be expected if speech style modulates the relationship between primary and secondary cue weights. Speakers who adopt a clearer speech style may enhance contrast on all dimensions. Such a relationship has been observed in the production of English stops. Clayards (2018) found a positive correlation between voice onset time (VOT) and fundamental frequency (F0) cue weights across speakers; those who produced VOT more distinctively also produced F0 more distinctively. However, this correlation was not statistically significant.
The relationship between cues could also be modulated by cue trading (Repp, 1982). The term is traditionally used to describe this pattern at the token level, where speakers exaggerate the value of one cue when producing an ambiguous value of the other cue. In a trading relationship at the talker level, an inverse correlation between cue weights indicates that speakers who produce one cue more distinctively produce the other cue less distinctively. To distinguish between cue trading relationships at the token and talker levels, I will refer to the talker-level trading relationship between cue weights as a “trade-off” relationship rather than “cue trading.” A trade-off relationship between cue weights has been found in previous work. Bang et al. (2018) found that Korean speakers who have higher production cue weight for F0 have lower cue weight for VOT. They argued that this indicates the emergence of F0 as the primary cue in a quasi-tonogenetic sound change. The same relationship has also been found in English stops (Clayards, 2008; Shultz et al., 2012), where it is thought to be related to individual speaker differences rather than community-level sound change.
The talker-level relationship between cue weights across speakers bears on the nature of individual differences. A positive correlation between cue weights indicates that systematic individual differences are modulated by speech style, with those using clear speech enhancing contrast on all dimensions. A trade-off relationship across speakers instead indicates a pressure for contrast maintenance—the contrast between categories is maintained by all speakers using different relative contributions of cues. Previous work is unclear on the relationship between primary and secondary cue weights across speakers and has been limited to studies of VOT and F0 of stops in a few languages. The data here provide a new test case from Mandarin sibilants.
1.2 Mandarin sibilants
Standardized Mandarin exhibits a three-way place contrast among alveolar, retroflex, and alveopalatal sibilant fricatives. Although [s] and [ʂ] are typically used in phonetic transcription of Mandarin, these sibilants have been described with a variety of different places of articulation including alveolar, dental, and denti-alveolar for [s], and retroflex, laminal post-alveolar, and apical post-alveolar for [ʂ] (see Chang & Shih, 2015 for a review of claims about places of articulation). This paper uses the terms “alveolar” and “retroflex” (following Chang & Shih, 2015; Duanmu, 2007; Ladefoged & Wu, 1984). There is less controversy about the place of articulation of alveopalatal [ɕ] (though sometimes the place is also termed “alveolopalatal”).
There is an allophonic restriction on sibilants requiring [ɕ] before high front vowels (Duanmu, 2007; Lin, 2014), but there is disagreement as to how [ɕ] patterns elsewhere. The alveopalatal sibilant is represented as “x” in Pinyin script, a romanized quasi-phonemic orthographic system, and the literature diverges regarding how orthographic “xia” and “xiu” sequences are represented at phonemic and surface levels. Some analyses posit that these are pronounced as represented in the orthography, [ɕia] [ɕiu] (Lee & Zee, 2003). Under this assumption, [ɕ] only appears before high front vowels, and is in complementary distribution with the other sibilants and the velar fricative. Other analyses claim that “xia” and “xiu” orthographic sequences are pronounced as [ɕa] and [ɕu/ɕǝu], such that there is a surface contrast between all three sibilants before [a] and [ǝu/u] (Duanmu, 2007; Lee-Kim, 2011; F. Li, 2008; W.-C. Li, 1999; Lin, 2014). Some who assume these surface pronunciations analyze [ɕ] as derived from underlying /si/, such that /sia/ → [ɕa] (Duanmu, 2007), while others analyze all three sibilants as separate underlying phonemes (Lee-Kim, 2011; W.-C. Li, 1999). This paper follows the assumption of a three-way place contrast among sibilants in the standardized variety in /a/ and /u/ vowel contexts, and treats “xia” and “xiu” orthographic sequences as [ɕa] and [ɕu]. However, this assumption about the phonemic analysis is not crucial for any of the results presented here. The key results of individual variability and the relationship between cue weights hold regardless of the phonological representation of alveopalatal sequences.
There is a merger between /s ʂ/ in Taiwan Mandarin, which is often described as a loss of retroflexion or a substitution of retroflex sibilants with the alveolar sibilant (Chang & Shih, 2015; Chen, 1999; Chiu et al., 2020; Chung, 2006; Jeng, 2006). This is frequently attributed to contact with Southern Min, which lacks retroflex consonants (Chuang & Fon, 2010; Kubler, 1985). The use of retroflexion has socio-indexical value; it is associated with higher education levels and distinguishes standardized Mandarin pronunciation from “dialect-accented” Mandarin (Chang et al., 2013).
Although the merger is typically associated with Taiwan Mandarin, there is individual variability among Taiwan and mainland speakers. For example, Chang (2011) analyzes corpus data and finds that Taiwan speakers have higher /ʂ/ COG relative to Beijing speakers, but both groups display overlapping /s ʂ/ categories to at least some degree. Chiu et al. (2020) find that Taiwan speakers range from complete merger with total category overlap to no merger with no category overlap. In addition, vowel context, sociolinguistic factors, formality of task, and prosodic context have all been shown to enhance the alveolar–retroflex contrast, contributing to contextual variation in degree of merger (Chang & Shih, 2012, 2015; Chuang & Fon, 2010; Chung, 2006; Jeng, 2006; Y. Li, 2009).
Although vowel context effects are well documented, existing literature diverges on which contexts facilitate contrast enhancement versus merger. Chung (2006) describes retroflexion (i.e., /s ʂ/ distinction) as being more common before back-rounded vowels /o u/, and offers an articulatory explanation that the back articulation facilitates anticipatory tongue retraction. Similarly, Chiu et al. (2020) find that some speakers have a pattern of contextually dependent merger only in the /a/ context, producing more distinction before back-rounded /o/. By contrast, Y. Li (2009), Jeng (2006), and Chang and Shih (2012) all find that the sibilants are less distinct in the back-rounded /u/ context. It is possible that these conflicting results were obtained because of differences in type of experimental task. This study includes two vowel contexts, back-rounded /u/ and low-central /a/, allowing for an additional test case of vowel context effects.
1.3 Choice of cues
Acoustic studies of standardized Mandarin have often characterized the sibilants as exhibiting a three-way distinction in spectral center of gravity (COG; also sometimes referred to as spectral mean or the first spectral moment, M1). [ʂ] typically exhibits the lowest COG and [s] the highest, as COG is negatively correlated with length of the front cavity (Kallay & Holliday, 2012; Lee, 1999; Lee-Kim, 2011). COG has been shown to be influenced by coarticulation with following vowels, such that COG is lower when followed by a rounded vowel (Jeng, 2006; Y. Li, 2009). However, Hu (2008) found that individual differences, not vowel context effects, were the major source of variability in both articulation and acoustics.
While COG tends to be the most common measure used in studies of Mandarin, the sibilants have also been shown to differ in other spectral moments and spectral peak locations (Kallay & Holliday, 2012; Lee-Kim, 2011). In particular, sibilants can exhibit multiple major spectral peaks in rounded vowel contexts, which cannot be directly captured by COG measurement. Lee-Kim (2011) shows that Mandarin sibilants before [u] exhibit an amplification of an additional lower frequency peak not seen before [a], which could lower COG measurements. Pape and Żygis (2016) find that Polish retroflex sibilants exhibit an amplification of an additional higher peak in rounded vowel contexts, which is attributed to the lip–teeth cavity created by strong lip protrusion. These additional peaks in rounded vowel contexts sometimes have higher energy than the major peak associated with front cavity resonance.
Although the presence of these multiple spectral peaks can affect COG, the two measures are often correlated. In a study of the /s ʂ/ merger in Taiwan Mandarin speakers, Lee-Kim and Chou (2022b) find that spectral peak distance and COG distance between /s ʂ/ are significantly correlated, and proceed with reporting COG measurements (Lee-Kim & Chou, 2022a). A similar approach is taken in this study; spectral peak and COG values were calculated and examined for correlation. Significant correlations were observed between COG and spectral peak values across tokens as well as between cue weights calculated for COG and spectral peak across speakers. This is the case both for the frequency of the highest amplitude spectral peak and for the frequency of the first major spectral peak. These calculations and graphs are provided in Appendix A.
As single measures, neither COG nor spectral peak provides a complete view of sibilant spectra. COG can average over multiple spectral peaks, but extracting the frequency of the highest amplitude peak can also be misleading as a place cue since additional resonances from lip rounding can exhibit higher energy than the resonance associated with the front cavity (Pape & Żygis, 2016). However, COG is the measure used most consistently in literature on Mandarin sibilants, especially in work on the alveolar–retroflex merger where difference in COG is commonly used to quantify degree of merger (Chang & Shih, 2015; Lee-Kim & Chou, 2022a). In addition, COG has been shown to be robust in reflecting both articulatory differences and native perceptual patterns (Chiu et al., 2020). Multiple perception studies have found the primary perceptual cue for the alveolar–retroflex place contrast to be COG (Chiu et al., 2020; F. Li, 2008; Wu & Lin, 1989), with dialectal variation in discrimination boundary according to the status of the /s ʂ/ merger. Given the consistency in patterning with native perception and use in previous work, the calculations in this paper proceed with examining COG as a primary cue to sibilant place of articulation.
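The difference between the two measures can be illustrated with a toy spectrum. The following is a minimal sketch, not the extraction procedure used in this study: it assumes COG is the power-weighted mean frequency of the spectrum, and the spectra, bin spacing, and function names are hypothetical values chosen only for illustration. Adding a weaker low-frequency peak lowers COG while leaving the frequency of the highest-amplitude peak unchanged.

```python
def center_of_gravity(freqs_hz, power):
    """Power-weighted mean frequency (first spectral moment)."""
    total = sum(power)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


def peak_frequency(freqs_hz, power):
    """Frequency of the highest-amplitude bin."""
    return max(zip(freqs_hz, power), key=lambda fp: fp[1])[0]


# Hypothetical toy spectra (not data from the study): 1 kHz bins, 1-10 kHz.
freqs = [1000.0 * k for k in range(1, 11)]

# A single major peak centered on 6 kHz.
single_peak = [0, 0, 0, 0, 1, 4, 1, 0, 0, 0]

# The same spectrum with an additional, weaker peak at 2 kHz.
double_peak = [0, 3, 0, 0, 1, 4, 1, 0, 0, 0]

cog_single = center_of_gravity(freqs, single_peak)
cog_double = center_of_gravity(freqs, double_peak)

print(cog_single, peak_frequency(freqs, single_peak))
print(cog_double, peak_frequency(freqs, double_peak))
```

In this toy case the highest-amplitude peak stays at 6 kHz in both spectra, while COG drops once the second peak is added, which is the sense in which COG averages over multiple peaks.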
While COG is well-established as a primary cue for the /s ʂ/ contrast, it is not always sufficient to distinguish the alveopalatal from the other two sibilants, regardless of the status of the alveolar–retroflex merger. Instead, the primary cue distinguishing /ɕ/ from /s ʂ/ is typically second formant frequency at the onset of the following vowel (Chiu, 2009; F. Li, 2008; M. Li, 2017). Mandarin speakers frequently maintain some COG distinction between /ʂ ɕ/ (Chiu, 2009; Lee-Kim, 2011). However, the degree to which COG distinction is maintained varies across individual Mandarin speakers. This differs from Polish, which has a similar sibilant system, but tends toward similar COG values for /ʂ ɕ/, relying heavily on F2 as the primary cue (Chiu, 2009; Lee-Kim, 2011).
Overall, previous work on Mandarin sibilants demonstrates variation in how the sibilant contrasts are realized phonetically among individual speakers and regional dialects. While F2 is consistently the primary perceptual cue for contrasts involving /ɕ/, there is variability in degree of COG distinction produced for these contrasts. There is also a merger between the alveolar and retroflex sibilants in some varieties, which can vary across speakers and contexts. Given these findings, we expect to see individual differences in sibilant realization, even among individuals not from regions typically associated with the merger.
2 Experimental design
2.1 Participants
All speakers were between the ages of 18–24 and recruited from the student population at the University of Massachusetts Amherst through linguistics courses and email advertisements to the “Taiwanese and Chinese Students’ Association.” All recruitment materials (emails, sign-up info, and so on) were distributed in Mandarin orthography. Participants were compensated with course credit or US$15 per hour for their participation. Twenty speakers (13 female, 7 male) were recorded and two speakers were excluded because they did not complete the task.
All speakers acquired Mandarin natively in China, relocated to the United States for their undergraduate studies, and continue to use Mandarin on a daily basis. None of the speakers reported early L1 experience with any other languages. Eleven speakers (6 female, 5 male) reported origins in southern/eastern areas that are geographically close to Taiwan and have been associated with the /s ʂ/ merger in previous literature (Shanghai, Jiangsu, Fujian). Nine speakers (7 female, 2 male) reported origins in northern areas that are typically associated with /s ʂ/ distinction (Beijing).
2.2 Stimuli
The stimuli were words and rare words, which were expected to behave as non-words. Because the Mandarin writing system is logosyllabic, use of non-words presents problems for participant reading. Instead of attempting to design new and orthographically natural characters, we used rare words with existing characters as “non-words.” Each stimulus was presented with the simplified Mandarin character and the Pinyin romanization. With the Pinyin presented alongside the characters, the participants were able to pronounce the intended stimulus even if they were unfamiliar with the word. No participants reported trouble reading either orthographic system when asked upon study completion. The stimuli were read in the carrier phrase “wǒ bǎ X dú yī biàn” (“I read X once”), which was presented in Mandarin characters.
To ensure the rare words were actually unknown to participants and could be analyzed as non-words for purposes of lexical frequency balancing, a word frequency judgment task was also conducted after recording. This task was a paper survey, which took about 1–2 minutes to complete. The survey asked the question “How common are each of these words? Circle your answer” and participants answered for all stimuli. Possible answers included “common,” “moderately common,” “rare,” and “I don’t know this word.” The results supported analyzing the rare words as non-words—rare words were all unknown to all participants.
The stimuli were crossed according to the following factors: sibilant (three levels: s ʂ ɕ) × vowel context (two levels: a u) × word status (three levels: high frequency/low frequency/non-word) × number of syllables (two levels) × tone (four levels). Due to limitations of the lexicon, some of the tones are not fully crossed with all other factors. However, in post hoc analyses, there were no significant effects of tone on F2 or COG values (see Results section for further discussion). There were a total of 137 distinct sibilant stimuli. Additional stimuli with word-initial affricates and stops were included as fillers. Word-initial non-sibilant fricatives were not included in the task. The full set of sibilant stimuli is included as supplemental material.
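As a rough check on the size of the design, the factor levels listed above can be enumerated. This is only an illustrative sketch: the syllable-count labels are an assumption (the text states only that the factor has two levels), and the comparison does not imply that each cell of the design holds exactly one stimulus.

```python
from itertools import product

# Factor levels as listed in the text. The syllable-count labels (1, 2) are
# an assumption: the source states only that this factor has two levels.
sibilants = ["s", "ʂ", "ɕ"]
vowels = ["a", "u"]
word_status = ["high frequency", "low frequency", "non-word"]
syllable_counts = [1, 2]
tones = [1, 2, 3, 4]

# Every cell of the fully crossed design.
cells = list(product(sibilants, vowels, word_status, syllable_counts, tones))
full_design_size = len(cells)  # 3 * 2 * 3 * 2 * 4

reported_stimuli = 137  # count of distinct sibilant stimuli reported in the text

print(full_design_size, reported_stimuli)
```

A fully crossed design would have 144 cells, so the 137 reported stimuli are consistent with the statement that some tones could not be crossed with all other factors.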
2.3 Recording
The participants were recorded in a sound-attenuated booth using Audacity software (Audacity Team, 1999-2021). Recordings were collected using an M-Audio Fast Track Pro Mobile Audio Interface and a Shure SM10A head-worn microphone with a sampling rate of 44.1 kHz and a bit depth of 16. The participants were presented with stimuli on a laptop computer inside the booth and were asked to produce the phrases as naturally as possible. Experimenters were trained to give feedback to encourage natural production, which included things such as suggesting the participant speak as if they were talking to a friend rather than giving a presentation. The stimuli were recorded in four separate blocks, each with a different random order, totaling four repetitions of each stimulus for analysis.
2.4 Data processing and analysis
The recordings from each speaker were first scanned by the author and trained research assistants for speech errors. Tokens were excluded if they were speech errors (different word as determined by a native speaker) or included non-speech vocalization (e.g., coughing). Participant 13 produced five such tokens, and all other participants produced one to three each. The recordings were then force-aligned using the Montreal Forced Aligner (MFA; McAuliffe et al., 2017) with a pretrained Mandarin model. This created Praat (Boersma, 2001) TextGrids marking the boundaries between segments. All sibilant and vowel boundaries of the MFA TextGrids were then hand-edited to ensure accuracy of extracted measurements.
A Praat script based on DiCanio (2021) was used to extract spectral COG of the fricatives and formant values of the following vowels. COG was computed from a time-averaged spectrum over the middle 80% of the fricative interval using a Hanning window with size of 0.015 seconds over six windows for the time average. Frequencies below 1,000 Hz were filtered out. As a test of whether the fricative interval is stationary, COG values were also averaged across two multitaper spectra (Chodroff & Wilson, 2014), one extracted from the beginning and one from the end of the middle 80% of the fricative interval. COG values from both methods were significantly correlated (results in Appendix A). Results in the following section are presented with the time-averaged COG, but the same results are obtained if the multitaper-averaged COG is used instead. The formants were estimated using the Burg method and extracted at 10-millisecond intervals throughout the duration of the vowel. The formant measurements were also all hand-checked and manually adjusted when they contained formant tracking errors (more than 100 Hz difference from the hand-checked measurements). All results presented below use F2 values at 20 milliseconds into the vowel, following Nowak (2006). All analysis and data visualization was done in
Linear discriminant analysis (LDA) was used to quantify cue weight in production (Duda et al., 2012; Fisher, 1936; Fukunaga, 1990). LDA is a classification method that relates continuous predictor variables to category labels. The purpose of LDA is to find the linear function that best discriminates a set of categories (here, the sibilant categories), given a set of predicting features (here, the acoustic measures of COG and F2). There is precedent in phonetic literature for using LDA to quantify cue weight in production (see Schertz & Clare, 2020, for a review). Following this, I use the coefficients of linear discriminants as the measure of cue weight from LDA. The coefficients are regression weights used to calculate the probability of category membership (James et al., 2013). They indicate the contribution of each predictor variable to the discriminant function and can be interpreted as indexing the strength of individual predictors.
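The idea of reading cue weights off a discriminant function can be sketched for the two-category, two-cue case. The code below is a minimal illustration of a two-class Fisher discriminant, not the analysis pipeline used in this study: the function name and the toy tokens are hypothetical, the predictors are assumed to be already standardized, and the overall scaling of the coefficients may differ from what a statistics package reports, so only their relative magnitudes are meaningful here.

```python
def fisher_lda_weights(class_a, class_b):
    """Two-class Fisher discriminant for two predictors.

    class_a, class_b: lists of (cog, f2) pairs, assumed standardized.
    Returns (w_cog, w_f2) = inverse pooled covariance times mean difference.
    """
    def class_mean(rows):
        n = len(rows)
        return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)

    def scatter(rows, mu):
        sxx = sum((r[0] - mu[0]) ** 2 for r in rows)
        syy = sum((r[1] - mu[1]) ** 2 for r in rows)
        sxy = sum((r[0] - mu[0]) * (r[1] - mu[1]) for r in rows)
        return sxx, sxy, syy

    mu_a, mu_b = class_mean(class_a), class_mean(class_b)
    axx, axy, ayy = scatter(class_a, mu_a)
    bxx, bxy, byy = scatter(class_b, mu_b)

    # Pooled within-class covariance matrix [[sxx, sxy], [sxy, syy]].
    dof = len(class_a) + len(class_b) - 2
    sxx, sxy, syy = (axx + bxx) / dof, (axy + bxy) / dof, (ayy + byy) / dof
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("singular within-class covariance")

    # Multiply the 2x2 inverse by the difference in class means.
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    w_cog = (syy * dx - sxy * dy) / det
    w_f2 = (-sxy * dx + sxx * dy) / det
    return w_cog, w_f2


# Hypothetical standardized tokens (not data from the study): the two
# categories are well separated in COG and only slightly separated in F2.
category_a = [(0.8, -0.2), (0.8, 0.2), (1.2, -0.2), (1.2, 0.2)]
category_b = [(-1.2, -0.6), (-1.2, -0.2), (-0.8, -0.6), (-0.8, -0.2)]

w_cog, w_f2 = fisher_lda_weights(category_a, category_b)
print(abs(w_cog), abs(w_f2))
```

For these toy tokens the COG coefficient is larger in magnitude than the F2 coefficient, which corresponds to COG being the more heavily weighted cue for that pair of categories.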
The main question of interest is whether individual differences in cue weighting are suggestive of cue trading or degree of clear speech. A linear mixed-effects regression was used to assess this. The dependent variable is F2 cue weight (LDA coefficient) with predictors of Vowel Context, Contrast, COG cue weight (LDA coefficient), and random intercepts for speaker. The trade-off hypothesis predicts a significant negative estimate of COG cue weight for the dependent variable of F2 cue weight, indicating that speakers who use one cue more contrastively use the other less contrastively. The speech style hypothesis predicts a significant positive estimate of COG cue weight on F2 cue weight, indicating that speakers who use one cue more contrastively also use the other cue more contrastively. Full results are presented in the following section.
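The logic of the two hypotheses can be sketched with a simple correlation across speakers. This is a deliberately simplified stand-in, not the model described above: it uses a plain Pearson correlation instead of a linear mixed-effects regression, it ignores vowel context, contrast, random effects, and statistical significance, and the per-speaker weights and function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def predicted_pattern(r):
    """Map the sign of the talker-level relationship onto the hypotheses."""
    if r < 0:
        return "trade-off"
    if r > 0:
        return "speech style"
    return "no relationship"


# Hypothetical per-speaker cue weights (not data from the study):
# speakers with higher COG weight have lower F2 weight.
cog_weights = [0.2, 0.5, 0.8, 1.1]
f2_weights = [1.9, 1.6, 1.2, 1.0]

r = pearson_r(cog_weights, f2_weights)
print(round(r, 3), predicted_pattern(r))
```

In the actual analysis the sign of the COG cue weight estimate plays the role that the sign of `r` plays here, with the significance test determining whether either hypothesis is supported.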
3 Results
This section provides results on individual differences in contrast implementation, cue weighting patterns at the group level, cue relationships within contrasts across speakers, and effects of the /s ʂ/ merger on cue weights across contrasts.
3.1 Individual differences in contrast implementation
In accordance with previous literature, there are individual differences in use of COG versus F2 and degree of category overlap. To provide a visual sample of this variation, speakers were grouped according to whether they display any visual overlap between sibilant categories in COG or F2. Data from example speakers are provided in Figures 1 to 3 and discussed in this section. COG values are on the X-axis and F2 values at 20 milliseconds into the following vowel are on the Y-axis. The /a/ context is in the left panel and the /u/ context is in the right panel. Ellipses were generated with the default settings of the stat_ellipse() function in ggplot2, which displays a 95% confidence interval for a multivariate

Figure 1. Example speakers with three visually distinct sibilant categories. Ellipses show 95% confidence interval of multivariate

Figure 2. Example speakers with some visual overlap between sibilant categories. Ellipses show 95% confidence interval of multivariate

Figure 3. Example speakers with near complete /s-ʂ/ merger. Ellipses show 95% confidence interval of multivariate
Figure 1 shows two example speakers who exhibit three non-overlapping sibilant categories in both vowel contexts. These are representatives of the 10 speakers in the sample who exhibit three distinct sibilant categories with no visual overlap between categories in COG or F2. Speaker 16 in the top panel of Figure 1 is the speaker with the greatest COG distinction between /s ʂ/ in both vowel contexts. While the contrast between /s ɕ ʂ/ is primarily a COG contrast in the /a/ context, the /ʂ ɕ/ contrast is primarily an F2 contrast in the /u/ context, such that there is no three-way COG contrast in this context. Speaker 19 (bottom panel of Figure 1) realizes the /ʂ ɕ/ contrast differently, maintaining more COG distinction between /ʂ ɕ/ in the /u/ context.
Eight speakers exhibit some degree of visual overlap between /s ʂ/, of which two speakers have almost entirely overlapping /s ʂ/ categories in both vowel contexts. These speakers also exhibit intraspeaker variation between vowel contexts; most speakers produce more COG distinction before /a/ than before /u/. Example data from two speakers with some /s ʂ/ overlap are shown in Figure 2 and the two speakers with almost entirely overlapping /s ʂ/ categories in both vowel contexts are shown in Figure 3.
3.2 Group-level cue weighting patterns
An analysis of lexical properties was first conducted to determine the best space for performing the LDAs. Separate linear mixed-effects models were fit for COG and F2 values using the lme4 package (Bates et al., 2007). In both models, Sibilant, Vowel context, the Sibilant × Vowel interaction, Experimental block, Lexical frequency, and Tone were predictors with random intercepts for Speaker and Item. The anova() function in R was used to obtain
A set of LDAs was performed for each speaker within vowel contexts between each pair of sibilants using COG and F2 as the relevant predictors. This was done using the lda() function from the MASS R package (Ripley et al., 2013). Prior to performing the LDAs, the acoustic measurements for COG and F2 were standardized using within-speaker within-vowel
The COG and F2 coefficients for each contrast are plotted in Figure 4. For each pair of contrasting sounds, the two cues of COG and F2 are indicated on the X-axis, with their respective weights on the Y-axis. For the contrasts involving alveopalatal /ɕ/, the average F2 cue weight is higher than the average COG cue weight. This is expected based on previous work noting F2 as the primary cue distinguishing the alveopalatal from the other sibilants. The average F2 cue weight for these contrasts is also higher than the average F2 cue weight for the /s ʂ/ contrast, where COG cue weight is higher. This also is in line with previous work noting COG as the primary cue distinguishing /s ʂ/. There is more variability in COG cue weight for the contrasts involving /ɕ/, as speakers differ in how much COG contrast is maintained when the primary cue is F2.

Figure 4. Cue weights for each contrast collapsed over speakers and vowel contexts.
3.3 Cue relationships within contrasts across individuals
Graphs showing each individual speaker’s cue weights are given in Figure 5, which is partitioned by sibilant contrast and vowel context. Sibilant contrast is given in the rows, and within each row, the /a/ vowel context is in the left panels and the /u/ vowel context is in the right panels. Within each panel, COG cue weights are on the X-axis and F2 cue weights are on the Y-axis. Each point represents the cue weights of an individual speaker. The points are labeled with the speakers’ participant numbers and can be cross-referenced with the graphs in Figures 1 to 3 and Appendix D. The linear best-fit regression line is provided in each panel along with a correlation test.

Figure 5. Cue weights for each contrast across speakers.
The individual weights for each contrast exist on a continuum, including the alveolar–retroflex /s ʂ/ contrast, which is involved in a merger. There is no clear delineation between speakers who exhibit the merger and speakers who contrast /s ʂ/ (at least with respect to the cue weight values here). Within each contrast and vowel context (each individual panel of Figure 5), there is a negative correlation, though this correlation is slight and non-significant for the /s-ʂ/ contrast. To estimate the strength of the correlations, these data were submitted to a linear mixed-effects regression where the dependent variable is F2 cue weight. The predictors are Vowel context, Contrast, and COG cue weight with random intercepts for Speaker. Results for fixed effects are given in Table 1.
Table 1. Fixed Effect Table for Linear Mixed-Effects Regression.
The positive estimates and significant effects of Contrast indicate that F2 cue weight is significantly higher for the contrasts that involve /ɕ/ relative to the alveolar–retroflex contrast (the reference level). This is expected as F2 has been noted as the primary cue for those contrasts. This effect can also be observed in the boxplot in Figure 4, which shows that F2 weights are, on average, higher for contrasts involving /ɕ/. The negative estimate of COG cue weight indicates that there is an inverse relationship between COG cue weight and F2 cue weight for the intercept contrast /s ʂ/, but this relationship is not significant. The significant interactions with the other contrasts show negative estimates, indicating that the inverse relationship between COG cue weight and F2 cue weight for these contrasts is significantly stronger relative to the intercept /s ʂ/ contrast. This means that COG cue weight is significantly negatively correlated with F2 cue weight for the contrasts involving /ɕ/.
3.4 Effects of /s ʂ/ merger on cue weight across contrasts
The previous section analyzes the relationships between cue weights within contrasts and results suggest a trade-off relationship between use of F2 and use of COG. This section examines use of COG across contrasts and the effect of the /s ʂ/ merger on the other sibilant contrasts in the system.
Because of the way the merger tends to be realized, the loss of /s ʂ/ COG contrast simultaneously results in loss of COG contrast with /ɕ/. Figure 3 provides examples of speakers with a (near) complete merger of /s ʂ/. These are the speakers who have some of the lowest COG cue weights for distinguishing /s ʂ/, and their categories are almost entirely overlapping in both vowel contexts. For these and other speakers that tend toward merger in this data set, the merger is realized as an increase in within-category COG variability for /s/ and /ʂ/, rather than a shift of one category toward the other. This means that any COG distinction between /s ʂ/ and /ɕ/ also collapses, as /ɕ/ COG values are typically between those of /s/ and /ʂ/. For speakers with the merger, the merged /s ʂ/ category occupies all the COG space in an individual’s system, leaving no space for COG contrast with /ɕ/.
If this pattern holds across speakers, we would expect a positive correlation between COG cue weights of different contrasts. Speakers that have low COG cue weights due to the /s ʂ/ merger should also exhibit low COG cue weights for the /s ɕ/ and /ʂ ɕ/ contrasts. Speakers that maintain distinction between /s ʂ/ (and therefore have higher COG cue weights for /s ʂ/) should also be more likely to distinguish /ɕ/ using COG, as their phonetic space permits such a distinction.
This positive trend can be observed in the graphs in Figure 6. COG weight for the /s ʂ/ contrast is given on the X-axis, with COG weight for the /s ɕ/ contrast on the Y-axis in the top panel and COG weight for the /ʂ ɕ/ contrast in the bottom panel. As in Figure 5, the graphs are partitioned by vowel context and each point represents the weights of an individual speaker. Within each panel, there is a positive correlation across speakers, indicating that speakers who have lower /s ʂ/ COG cue weight also have lower COG cue weight for the /s ɕ/ and /ʂ ɕ/ contrasts.

Figure 6. COG cue weights across contrasts.
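A per-panel correlation of the kind visualized in the figure could be computed along the following lines. This is a minimal sketch, not the analysis reported in this paper: the per-speaker cue weights below are invented toy values, and the dictionary keys and the helper name pearson_r are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Invented toy COG cue weights, one value per speaker, split by vowel context.
# "s_x" = /s ʂ/ contrast (X-axis), "s_c" = /s ɕ/ contrast (Y-axis, one panel).
toy_weights = {
    "a": {"s_x": [0.10, 0.25, 0.40, 0.60, 0.80],
          "s_c": [0.05, 0.20, 0.30, 0.45, 0.70]},
    "u": {"s_x": [0.15, 0.30, 0.45, 0.55, 0.75],
          "s_c": [0.10, 0.15, 0.35, 0.40, 0.60]},
}

# One correlation per vowel context, mirroring the partition of the graphs.
r_by_vowel = {
    vowel: pearson_r(w["s_x"], w["s_c"]) for vowel, w in toy_weights.items()
}

for vowel, r in r_by_vowel.items():
    print(vowel, round(r, 3))
```

A simple correlation of this kind ignores the repeated measurement of each speaker across vowel contexts, which is one reason to prefer a mixed-effects model for inference.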
To estimate the strength of the correlations shown in Figure 6, a linear mixed-effects model was fit with /s ʂ/ COG cue weight as the dependent variable. Predictors are Vowel context and the interactions between Vowel context and COG cue weight for the /s ɕ/ and /ʂ ɕ/ contrasts, respectively, with random intercepts for Speaker. Results are provided in Table 2. The effect of Vowel compares the /u/ context to the reference level /a/ context and shows no significant difference in COG cue weight for /s ʂ/ between the two contexts. The positive estimates and significant effects of the COG weight by Vowel interactions indicate a significant positive relationship between COG cue weights for the /s ʂ/ contrast and COG cue weights for the other contrasts in the system. This is true of all contrasts and vowel contexts except the /ʂ ɕ/ contrast in the /u/ context, where the positive relationship does not reach significance.
Table 2. Fixed Effect Table for Linear Mixed-Effects Regression.
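One plausible way to write the model described above is given below. This is a sketch under stated assumptions rather than the exact specification used: the symbols are introduced here for illustration, the COG weight slopes are assumed to be estimated separately within each vowel context, and Vowel is assumed to be treatment coded with /a/ as the reference level. Here y_{iv} is the /s ʂ/ COG cue weight of speaker i in vowel context v, and x^{(1)}_{iv} and x^{(2)}_{iv} are that speaker's COG cue weights for the /s ɕ/ and /ʂ ɕ/ contrasts in the same context.

```latex
% Assumed model form; symbol names are illustrative
\[
y_{iv} = \beta_0 + \beta_1\,\mathbb{1}[v = u]
  + \sum_{k \in \{a,\,u\}} \mathbb{1}[v = k]
    \left( \gamma_k\, x^{(1)}_{iv} + \delta_k\, x^{(2)}_{iv} \right)
  + b_i + \varepsilon_{iv},
\qquad
b_i \sim \mathcal{N}(0, \sigma_b^2), \quad
\varepsilon_{iv} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this reading, \beta_1 corresponds to the effect of Vowel, the four slopes \gamma_a, \gamma_u, \delta_a, and \delta_u correspond to the COG weight by Vowel interactions, and b_i is the random intercept for Speaker.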
Overall, these results indicate that speakers who merge the alveolar and retroflex sibilants also exhibit lower COG cue weight for most other contrasts, and speakers who maintain COG distinction between the alveolar and the retroflex also maintain COG distinction between these categories and alveopalatal /ɕ/. This suggests that the /s ʂ/ merger not only involves overlap of those two categories, but also represents an overall decrease in COG contrast throughout the sibilant system.
4 Discussion
This paper has examined the relationship between cue weights in the Mandarin sibilant system at the talker level, both within and across sibilant contrasts. Where cue weights are significantly correlated within a contrast, COG cue weight and F2 cue weight show an inverse relationship across speakers. This is consistent with a cue trade-off account of individual differences in contrast signaling: speakers who produce one cue less distinctively tend to produce the other cue more distinctively. Previous work on weights of multiple cues in production has yielded conflicting results on whether cue trade-off or speech style modulates the relationship between cues. The results here strengthen existing support for a trade-off relationship between cue weights (Bang et al., 2018; Shultz et al., 2012 on stops in Korean and English).
The cue trade-off relationship also has implications for the phonetic realization of all sibilants when /s ʂ/ are merged, as a decrease in COG contrast increases reliance on F2. Across sibilant contrasts, there is a positive relationship between COG cue weights, such that speakers who have low COG cue weights for the /s ʂ/ contrast also have low COG cue weights for the /s ɕ/ and /ʂ ɕ/ contrasts. This is because a complete merger effectively collapses all COG contrast in the system, shifting the contrasts involving /ɕ/ to rely less on COG and more on F2. Therefore, the /s ʂ/ merger not only involves overlap of those categories, but also restructures the acoustic space of the sibilant system.
Although there was an inverse relationship between cue weights for all contrasts in all vowel contexts, the relationship between COG and F2 for the /s ʂ/ contrast did not reach significance. This could be due to the fact that F2 cue weight values are generally lower for /s ʂ/, which is expected as F2 has not been reported as an important cue to that contrast. With all speakers exhibiting low F2 cue weight, there may not be enough interspeaker variation for any systematic relationships to emerge. It is possible that weights of other cues to retroflex place of articulation, such as F3 of the following vowel, may exhibit significant correlations with COG cue weight.
Multiple studies have examined the relationship between cue weights in production and cue weights in perception for the same individuals and have typically found no significant relationship (Kim & Clayards, 2019; Schertz et al., 2015; Shultz et al., 2012), though some have observed positive trends, potentially indicating a weak relationship between cue weight in production and cue weight in perception. Based on this, we would not expect the differences in COG and F2 cue weight observed here to necessarily be predictive of how the speakers would use COG versus F2 in a perception task. Another area for further work would be to determine the relationship between production cue weight and individual differences in perception of Mandarin sibilants.
While previous work attributes the /s ʂ/ merger to “some southern dialects, including Taiwan Mandarin” (Chang & Shih, 2012), these results demonstrate varying degrees of merger from speakers with origins in northern and southern mainland China. The two speakers with a near-complete merger and the two speakers with the highest COG cue weight were all from southern provinces. This could reflect a high degree of interspeaker variation in these varieties (in line with Chiu et al., 2020, who found a range of /s ʂ/ realizations among Taiwan speakers, the variety most associated with the merger). The high COG cue weights from some speakers could also be the result of hyperarticulation and/or hypercorrection of the contrast (Chung, 2006). It is possible that some speakers who would merge in casual contexts produced distinction in the laboratory context. Further work specifically manipulating formality of task to elicit hypo- and hyperarticulated sibilants from the same participants could be done to disentangle these explanations.
Incorporating additional dimensions and parameters is an important area for further work on Mandarin sibilants. While this paper focuses on the two cues shown in perception studies to be the primary cues to sibilant place contrast (see Section 1.3 for a review), additional work could clarify how these cues might be better operationalized. For the purposes of testing the hypotheses in this paper, many formulations of an intrinsic spectral cue to sibilant place of articulation lead to the same overall result: time-averaged COG, COG averaged across two multitaper spectra, frequency of the highest amplitude spectral peak, and frequency of the first major spectral peak. Although all of these measures are correlated in the production data here, no measure individually provides a complete view of sibilant spectra and further work is needed to determine which measure best reflects native Mandarin perception.
5 Conclusion
This paper has presented the results of a production study comparing COG and F2 cue weight of sibilants across Mandarin speakers with origins in mainland China. Interspeaker variation in sibilant realization and relative cue weighting patterns was observed. These individual differences were quantified using cue weights of the COG and F2 dimensions. The observed relationship between COG cue weight and F2 cue weight is one of trading. For all sibilant contrasts involving /ɕ/, the speakers that displayed the least distinction in COG also displayed the most distinction in F2, and vice versa. Because of the increase in COG variability associated with the /s ʂ/ merger, speakers who merge /s ʂ/ simultaneously collapse COG contrast between /s ʂ/ and alveopalatal /ɕ/. Therefore, the merger not only involves overlap of the retroflex and alveolar categories, but also restructures the acoustic space for sibilants such that COG contrast is generally reduced in the sibilant system, and merging speakers shift toward relying fully on F2 to distinguish the remaining contrasts.
Supplemental Material
Supplemental material, sj-xlsx-1-las-10.1177_00238309231152495 for Differential Cue Weighting in Mandarin Sibilant Production by Ivy Hauser in Language and Speech
