Abstract
1 Introduction
The study of social meaning, that is, the association of linguistic form with social categories, by now has a long tradition in sociolinguistics. The focus on social meaning is characteristic of the so-called third wave of sociolinguistics (Eckert, 2012). Work in the first wave primarily aggregated data across large demographic tracts and associated language behavior with sociodemographic characteristics: sociolinguists attempted to map out the speech of large urban centers, such as New York City (Labov, 1972a), Norwich, England (Trudgill, 1974), and Montreal, Canada (Sankoff & Cedergren, 1972), and were thereby able to draw correlations between linguistic variables and macrosocial categories, such as age and social class. Later, work of the second wave, such as Milroy (1980) in Belfast and Rickford (1986) in Guyana, made local sense of macrosocial variables, such as religion and class, and ushered in methodological innovations such as the use of social network analysis and ethnography. The third wave of variation studies (Eckert, 2012) is distinguished by the exploration of the stylistic expression of individuals as part of a social order, where language change is linked to the indexing of variables to personae (Coupland, 2007).
In such contexts, the study of voice quality as a social marker has only relatively recently gained traction (e.g., Podesva, 2007, 2013). In its broadest sense, voice quality refers to the particular combination of settings implemented during the production of speech, including phonatory, articulatory, and muscular settings (Laver, 1994). Often, however, the term voice quality is used in a narrower sense to refer to phonation only, and the different phonation types that result due to changes in laryngeal settings (Keating & Esposito, 2006; Sóskuthy & Stuart-Smith, 2020; Wright et al., 2019). Research on social meanings of voice quality has generally focused on this narrower sense and has largely been concerned with non-modal phonation, such as creaky voice (Lefkowitz & Sicoli, 2007; Mendoza-Denton, 2011; Stuart-Smith, 1999b; Yuasa, 2010). Creaky voice is generally produced by rather low airflow through the glottis resulting in slow and irregular vocal fold vibration which in turn causes a low fundamental frequency (F0) (Davidson, 2020; Keating et al., 2015; Laver, 1980) (though note there are a number of different acoustic realizations of creaky voice; see Keating et al., 2015 and Garellek, 2019 for a description of these). For this reason, it has been associated with performing toughness and gender in analogy to Ohala’s (1994) frequency code, where low frequency is indicative of larger body sizes and high frequency is associated with tininess (in addition to a range of communicative functions and social meanings across different language varieties, see e.g., Gobl & Ní Chasaide, 2003; Yuasa, 2010). Johnson (2006, p. 486) points out that “the cross-language and within-language phonetic arbitrariness of gender” calls into question “unitary abstract phonetic representations” and suggests that gender is subject to performance, highlighting that speakers are social actors. 
This fits well with creak not only as a performance of toughness by gang girls in California (Mendoza-Denton, 2011) but also as a signal of young urban upwardly oriented professional females in California (Yuasa, 2010). Podesva (2013) reports overall higher rates of creak in females than males in his sample; thus, toughness and gender performance are socially constructed through creak and through interpretations of creak.
Breathy voice, on the other hand, has received much less attention as a socially meaningful marker. While it has often been found to be associated with the speech of women and producing a “desirable” sounding female voice (Hall, 1995; Henton & Bladon, 1985; Ito, 2003; Ohara, 2004; Stuart-Smith, 1999a), it is less often interpreted along the lines of social performances or social categories (Podesva, 2013—though see Podesva & Callier, 2015, for a discussion of the role of breathiness, among other voice qualities, in reported speech and displaying affect, and Teshigawara, 2003, for an examination of voice quality including breathiness in portraying characters as either good or evil in Japanese anime). Breathy voice is produced with a glottis open along most of its length, allowing for more air to pass through the vocal folds which nevertheless vibrate regularly. This airflow mechanism makes it articulatorily rather difficult to sustain breathy voice over a longer stretch of speech material (Catford, 1977), which may contribute to why it is less often observed as a social marker. In addition, breathy voice may be considered to have a less clearly defined real-world correspondence compared to F0 and creaky voice. Breathy voice may also lend itself less to acquiring social meaning because of its less salient perceptual properties (Laver, 1980). As neither creaky nor breathy voice is phonologized in English or German, voice quality is not linguistically contrastive, as a change in voice quality does not change the meaning of a particular word or utterance. Taking a functional perspective, creaky voice can be said to take over some structuring properties in languages such as English or German as it often marks the end of phrases or the onset of strong syllables (Fougeron & Keating, 1997; Garellek, 2014, 2015; Henton & Bladon, 1988; Kreiman, 1982; Ogden, 2001). Breathy voice is not known to mark prosodic strengthening at the onset of linguistic domains.
Variation in fine phonetic detail often describes within-category variation. Johnson (2006) rightly pointed out that this poses a problem for theories of speech perception that posit abstract phonological primes to which variants must be matched. One way around this mapping problem is the idea that alternants acquire meaning in themselves through exposure. The learned link between the linguistic form and a social category or identity of users, speakers’ ideological stances, their social demographics or attitudes, and so on (Eckert & Labov, 2017) allows for its interpretation in perception (Jannedy & Weirich, 2014c; Weirich et al., 2020). Lacking a meaningful interpretation, a variant remains what Labov (1966) calls an indicator, without an association to a social category.
When the usage of specific pronunciation variants or voice qualities becomes emblematic and indexical (Silverstein, 2003) of a specific speaker group and it is recognized as such by speakers and hearers, it becomes enregistered (Agha, 2003). With such recognition, it surpasses what Labov calls an indicator (there is a difference but it is not noticed) and it becomes a marker. In this study, we examine three acoustic parameters, two of which are segmental alternants that have been previously found to vary significantly between the multiethnolect Kiezdeutsch (KD) as spoken in Berlin and regional standard German (SG) as spoken in Berlin: coronalization of palatal fricatives and fronting of the diphthong /ɔɪ/. The third parameter is voice quality, which has not previously been investigated in this context. To illuminate the role of these three phonetic factors in the partitioning of the social and linguistic landscape of Berlin, we have conducted a perception test with the goal of understanding the social meaning associated with specific alternants.
1.1 Kiezdeutsch
KD is a multiethnolectal variety of German spoken by young people from multicultural communities that exhibits lexical, syntactic, and phonetic differences from SG (Auer & Dirim, 2003; Jannedy & Weirich, 2014c; Wiese, 2012). It originated in neighborhoods with predominantly Turkish migrant worker populations. Today, adolescents from many ethnic and linguistic backgrounds use features of this variety to varying degrees, particularly in neighborhoods with high levels of multilingualism, such as Kreuzberg.
A rather salient and pervasive feature of KD is the fronting of the standard palatal fricative /ç/ (as in
On the perception side, previous studies have shown that the spectral characteristics in the fricative productions of KD speakers are indeed perceivable and salient for listeners (Jannedy et al., 2011; Jannedy & Weirich, 2014c; Weirich et al., 2020). Jannedy and Weirich (2014c) examined in a perception experiment whether listeners categorize identical acoustic stimuli differently in the context of two different primes: the names of two neighborhoods of Berlin (multilingual Kreuzberg and Zehlendorf, a monolingual/affluent district) and a control condition with no additional information. The acoustic stimuli consisted of natural acoustic stimuli with synthetic fricatives synthesized along a continuum ranging from /ç/ to /ʃ/ as either
Further perception work (Weirich et al., 2020) investigated the strength of associations between phonetic alternations and social attributes in the context of KD and German with a French accent. An Implicit Association Task (IAT) was run with participants categorizing written words as having a positive or negative valence and auditory stimuli containing pronunciation variations of /ç/ as canonical [ç] (labeled
Altogether, these studies show that the phonetic realization of the fricative /ç/ in German carries social meaning, and its realization as [ɕ] or [ʃ] is strongly associated with the speaker group of KD.
Much less research has been conducted on other segmental characteristics of KD. Previous research has pointed to the realization of /ɔɪ/ as a feature of KD (Jannedy & Weirich, 2014a, 2014b; Weirich & Jannedy, 2013). The studies showed that for female speakers, the nucleus of /ɔɪ/ is realized as more centralized in KD speakers compared to SG speakers, with higher F2, particularly from the start to the mid part of the diphthong. For male KD speakers, F2 is also higher, not only at the start but throughout the diphthong. No effect was apparent for F1. Jannedy and Weirich (2014a) also looked at linguistic influences on diphthong centralization and found that while male KD speakers showed a raised F2 value irrespective of segmental environment, for female speakers, the centralization of the diphthong was enhanced by following and preceding obstruents. Syllable structure or sentence accent seems to have less of an effect on diphthong centralization. Ongoing investigations, also including the two other German diphthongs /aɪ/ and /aʊ/, point to the significance of /ɔɪ/-fronting in KD (Jannedy & Weirich, 2013, 2014a; Jannedy, Weirich, Mendelsohn & Schüppenhauer, 2019; Weirich et al., 2024).
KD is a linguistic conglomerate of various speech features constituting a specific speech style (Auer & Dirim, 2003; Wiese, 2012), predominantly used in informal peer-group settings by multilingual speakers from multicultural neighborhoods, and it is this style that is being deployed when speaking among each other in school yards or outside of formal social contexts. However, we have found that this is not always the case. In our previous work exploring the acoustic contrast between /ç/ and /ʃ/ (Jannedy & Weirich, 2017), we tried to elicit the greatest possible contrast between these two sounds by eliciting minimal pairs. According to Labov’s (1972b) taxonomy, minimal pairs should bring out a contrast if speakers have a contrast. A group of speakers did not produce a contrast and even appeared puzzled to see apparently two different ways of orthographically capturing the same sound string. Thus, KD to some is not a performance, but a primary linguistic resource used across different situational and functional settings, independent of addressee. Note that we have exemplified this with minimal pairs, but have also observed the use of other KD features (lack of agreement, usage of bare nouns, etc.) in students’ interactions with their teachers and in the laboratory.
Work on social meaning is predicated on the premise that language possesses different strata of meaning that transcend the confines of lexical constituents. Linguistic choices then, in essence, reflect the multifaceted dimensions of human existence, encompassing identities, affiliations, attitudes, stances, and ideological orientations. Through the examination of these nuanced linguistic choices and variants, and a detailed exploration of phonetic subtleties, we strive to unveil the social meaning entailed in fine phonetic detail in communicative processes. However, it is paramount to acknowledge that the assumed social meaning of fine phonetic detail can only be validated by listeners who are capable of decoding, interpreting, and ascribing significance to the meaning, that is, for whom a variant is enregistered.
In this work, we take this approach and test whether the pronunciation variant of /ɔɪ/ is salient also for perception and used by listeners to infer social information about the speaker.
1.2 Voice quality as a social marker?
As described above, the term voice quality is often used in a narrow sense to refer to differences in phonation resulting from changes in laryngeal settings, and in this paper, we will use the term voice quality in this sense. Although many different non-modal voice qualities can be described (Laver, 1980), two general categories are most often referred to in the literature: breathy voice and creaky voice (Garellek, 2019; Keating & Esposito, 2006). Breathy voice is generally produced with increased glottal opening resulting in additional aspiration noise; creaky voice, however, is generally produced with increased glottal constriction and low and irregular F0 (Garellek, 2019; see Keating et al., 2015 for an overview of the different acoustic manifestations of creaky voice).
In some languages, phonation is a contrastive feature that can signal a change in meaning. This is the case, for example, in Jalapa Mazatec, in which creaky voice quality produces a contrast to the same item produced with modal or breathy voice (Garellek & Keating, 2011). Voice quality is not generally a contrastive phonological feature of Indo–European languages (Keating et al., 2023), although changes in phonation can serve as a cue to segmental contrasts (see e.g., Penney et al., 2018). However, non-modal voice qualities may be exploited for prosodic and sociolinguistic purposes. For example, creaky voice has been shown to mark phrase/utterance finality in multiple languages, such as English (Garellek, 2015; Henton & Bladon, 1988; Kreiman, 1982), Estonian (Aare et al., 2018), Finnish (Ogden, 2001), Swedish (Carlson et al., 2005), and German (Köser, 2014; Peters, 2003).
Many studies have shown that voice quality differences can serve as a social cue and may index elements of a speaker’s identity. For example, in some varieties of British English, female speakers make more use of breathy voice (Henton & Bladon, 1985; Stuart-Smith, 1999a), whereas creaky voice is associated with middle-class male speakers (Esling, 1978; Henton & Bladon, 1988) and working-class males tend to use harsh or whispery voice (Esling, 1978; Stuart-Smith, 1999a). It has also been suggested that female speech may be breathier than male speech in general, due to incomplete glottal closure in females (Södersten & Lindestad, 1990). However, recent studies suggest that there may be a greater prevalence of creaky voice in female speech, particularly in the case of younger women in American English (Abdelli-Beruh et al., 2014; Podesva, 2013; Wolk et al., 2012; Yuasa, 2010), although this may also reflect researcher bias, as female speakers of American English tend to be the primary target of such studies (Dallaston & Docherty, 2020). Aside from indexing gender, voice quality can signal other aspects of a speaker’s identity or persona. For example, in a study on Chicano gang members, creaky voice (along with visual cues, such as the length of eyeliner) was found to be associated with members projecting more “hardcore” personas (Mendoza-Denton, 2011).
Integrating differences in voice quality may also be a way for speakers from migrant backgrounds to express their ethnolinguistic repertoires or identities that reference their ethnocultural heritage and divergence from the mainstream (Clyne et al., 2001), and differences in voice quality have been identified between speakers of standard varieties and (multi)ethnolectal speakers in a number of varieties of English. For example, Newman and Wu (2011) found that American English speakers of Asian (Chinese and Korean) descent produced a breathier voice quality (among other features) than speakers of other non-Asian backgrounds. In New Zealand English, speakers of Māori descent have higher F0 than Pākehā speakers (i.e., those of European descent), and this is considered to index speakers’ ethnic identities (Szakay, 2006; Szakay & King, 2018). In addition to pitch differences, Szakay (2012) found that Māori speakers also produced a creakier voice quality than Pākehā speakers. In something of a contradiction, a subsequent study by Szakay and King (2018) suggested that speakers of Māori descent may in fact use
In Multicultural London English (MLE), a multiethnolect spoken in linguistically diverse areas of London, Szakay and Torgersen (2015) found that male speakers have a breathier voice quality than those who live outside of London and have an Anglo background. They initially also found that female speakers of MLE have a creakier voice quality than their non-London counterparts of Anglo background, though this result was impacted by issues of tracking F0 at low frequencies, and a later re-analysis determined that creaky voice was rather a marker of outer London Anglo speech (Szakay & Torgersen, 2019). More recently in Australian English, Penney and Cox (2021) have identified increased breathiness in monosyllabic CV words in speakers of Lebanese background compared to mainstream Australian English speakers. Loakes and Gregory (2022) have also identified voice quality differences between Indigenous speakers of Australian Aboriginal English and mainstream Australian English, with lower F0 and a creakier voice quality produced by the Australian Aboriginal English speakers.
These results, taken together, demonstrate that voice quality differences may be employed by speakers of (multi) ethnolectal varieties in various language environments. However, voice quality remains understudied in work on multiethnolectal variation, particularly in non-English speaking contexts. Anecdotal observations suggest that higher F0 and increased breathiness may be present in the speech of KD speakers. Therefore, in this study, we present an acoustic comparison of voice quality in adolescent KD speakers and SG speakers from Berlin.
1.3 Hypotheses and structure of the paper
In parts 2 and 3 of this paper, we will present our analysis on voice quality in KD speakers, which was conducted on two sets of production data taken from a previously collected and annotated corpus: data from a conversational task and from a reading task. Based on the previous findings regarding voice quality in similar multiethnolectal groups, we may hypothesize that KD speakers will be found to produce a breathier voice quality than SG speakers. Knowing that voice quality can signal various social stances, we examine whether breathy voice is associated with KD speakers and interpreted to signal group affiliation. Therefore, in parts 5 and 6, a perception test is described which investigates the potential impact of three phonetic cues (coronalization of /ç/, /ɔɪ/-fronting, breathy voice quality) on the attribution of speaker background (i.e., KD speaker or not). We hypothesize that the different cues will have different weights regarding their indexing values: while the association between /ç/ and KD is strong and its role in perception has been shown before, the relative salience of /ɔɪ/ and differences in voice quality remain to be seen. These cues could be factors that add to an existing bias but might not be sufficient to index a speaker’s background on their own.
2 Methods: production
2.1 Data and speakers
The production data analyzed here were extracted from a database of annotated audio-recordings of speech maintained by ZAS Berlin, which contains recordings of male and female KD speakers from multicultural neighborhoods in Berlin that are highly associated with the multiethnolect, and recordings of male and female SG speakers from other parts of the city. To ensure that speakers were balanced for age between KD and SG—and consequently that any observed differences were not due to age grading or differences in processes of laryngeal development—we included only data from adolescent speakers aged 14–17 years, as the number of speakers in each of the groups was most comparable in this age range. Data for one male speaker within this age range were excluded as his mean F0 was substantially higher than all of the other males (256 Hz compared to 99–133 Hz) indicating he had not yet experienced the lowering of F0 associated with voice break in puberty.
Data from three sets of speakers are included in this analysis. Two of the groups comprised KD speakers from the neighborhoods of Wedding, Kreuzberg, and Neukölln: one group of KD speakers was recorded as they engaged in spontaneous conversation with a research assistant in an interview task,
Summary of Speakers and Number of Segments Included in Analyses According to Task.
The speakers were recorded either in a quiet room of their school or in the laboratory at ZAS. The recording sessions were conducted by the third author and a research assistant trained in sociolinguistic interviews. The interviewers were known to the speakers prior to the recording sessions through casual contact in youth centers and their school. Participants were under the impression that their voices were being recorded for research to improve speech technology, which they were excited to contribute to. They were familiarized with the situation beforehand and the atmosphere of the sessions was relaxed, with snacks and drinks provided. Recordings were made with a Sennheiser ME64 microphone to a Tascam DR-05 recorder with a sampling rate of 48 kHz and 16-bit resolution. All data were orthographically transcribed and initially segmented using WebMAUS (Kisler et al., 2017), and all segment boundaries for the diphthongs /ɔɪ/, /aɪ/, and /aʊ/ were hand-corrected. The correction of diphthong segment boundaries was carried out as part of another, unrelated study (Weirich et al., 2024). For practical reasons, we chose to limit this analysis to these diphthong segments so that we could be confident in the accuracy of our segmentation, though we acknowledge that voice quality information can be carried by all vowels (and indeed any segments that are phonetically voiced).
2.2 Acoustic analysis
We extracted estimates of F0, H1*–H2*, and Harmonics-to-Noise ratio (HNR) using VoiceSauce (Shue et al., 2011). For each of these three acoustic measures, values were averaged across each diphthong segment. F0 was calculated using the STRAIGHT algorithm (Kawahara et al., 1999). There are a number of acoustic measures that have been used to describe differences in voice quality, including various measures of spectral tilt, noise, periodicity, and intensity (Gordon & Ladefoged, 2001; Keating et al., 2023). Of these, the most frequently used measurements, at least in recent phonetic research on voice quality, are those which quantify spectral tilt: how sharply harmonic amplitude drops off at higher frequencies (Keating et al., 2023). H1*–H2* is a measure of spectral tilt that calculates the difference in amplitude between the first harmonic (H1) and the second harmonic (H2), with the application of an algorithm to correct for the effect of formant frequencies (as indicated by the asterisks), which may increase the amplitude of harmonics in the vicinity (Iseli et al., 2007). As a measure of spectral tilt, higher values of H1*–H2* are correlated with increased glottal opening (and hence increased breathiness) and lower values of H1*–H2* are correlated with increased glottal constriction (and hence increased creakiness) (Garellek, 2019; Hillenbrand et al., 1994; Holmberg et al., 1995; Keating et al., 2023).
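As a rough illustration of what a harmonic-difference measure of spectral tilt quantifies, the following Python sketch computes the uncorrected amplitude difference between the first two harmonics of a synthetic signal. It is not the VoiceSauce procedure described above: no formant correction is applied (so the result is H1–H2 rather than H1*–H2*), F0 is assumed to be known in advance, and all function names and signal parameters are hypothetical.

```python
import math


def harmonic_amplitude_db(signal, sample_rate, freq):
    """Amplitude (in dB) of the sinusoidal component at `freq`.

    Uses a single-frequency DFT projection; exact when the signal
    contains a whole number of periods of `freq`.
    """
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(signal))
    amplitude = 2 * math.hypot(re, im) / n
    return 20 * math.log10(amplitude)


def h1_minus_h2(signal, sample_rate, f0):
    """Uncorrected H1-H2 (dB): first harmonic minus second harmonic."""
    h1 = harmonic_amplitude_db(signal, sample_rate, f0)
    h2 = harmonic_amplitude_db(signal, sample_rate, 2 * f0)
    return h1 - h2


def synth_two_harmonics(f0, a1, a2, sample_rate=16000, n_samples=1600):
    """Toy signal with only two harmonics (amplitudes a1 and a2)."""
    return [a1 * math.sin(2 * math.pi * f0 * i / sample_rate)
            + a2 * math.sin(2 * math.pi * 2 * f0 * i / sample_rate)
            for i in range(n_samples)]


# Illustrative signals: a steeper tilt (weaker H2) and a flat tilt.
steep = synth_two_harmonics(200.0, 1.0, 0.5)
flat = synth_two_harmonics(200.0, 1.0, 1.0)
print(round(h1_minus_h2(steep, 16000, 200.0), 2))
print(round(h1_minus_h2(flat, 16000, 200.0), 2))
```

In this toy case, halving the amplitude of the second harmonic raises H1–H2 by 20·log10(2), or about 6 dB, which is the direction of change associated with increased glottal opening.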
While higher values of H1*–H2* would generally suggest a breathier voice quality, it can be difficult to determine whether higher values are indeed due to a breathier voice quality compared to a more modal voice quality, or whether they simply represent modal voice quality and the lower values of H1*–H2* indicate a creakier voice quality. However, breathy voice also exhibits lower values of HNR relative to modal voice due to the presence of aspiration. Therefore, following Garellek (2019), we measured HNR in addition to H1*–H2*: higher values of H1*–H2* together with lower values of HNR would provide converging evidence for increased breathiness. HNR was measured in the band below 500 Hz (Garellek, 2012). Importantly, H1*–H2* and HNR have been shown to be among the most informative acoustic measures of phonation differences (Keating et al., 2023) that together capture spectral tilt and noise, which are the important dimensions of voice quality in a psychoacoustic model of voice that links speech production to perception through acoustics (Garellek, 2019; Garellek et al., 2016; Kreiman et al., 2014, 2021).
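The logic of combining the two measures can be sketched as a simple decision rule. The Python example below is only an illustration of that reasoning: it compares raw group means rather than fitting the mixed-effects models used in this study, the function name is hypothetical, and the numerical values are invented for demonstration and are not taken from our data.

```python
from statistics import mean


def breathiness_evidence(h1h2_group, hnr_group, h1h2_ref, hnr_ref):
    """Classify the evidence for a breathier voice quality in a group
    relative to a reference group, from per-item H1*-H2* and HNR values (dB).
    """
    tilt_diff = mean(h1h2_group) - mean(h1h2_ref)   # > 0: steeper tilt
    noise_diff = mean(hnr_group) - mean(hnr_ref)    # < 0: more noise

    if tilt_diff > 0 and noise_diff < 0:
        # Steeper spectral tilt AND more noise point in the same direction.
        return "converging evidence for breathier voice"
    if tilt_diff > 0:
        # Tilt alone cannot separate "breathier group" from "creakier reference".
        return "ambiguous: breathier group or creakier reference"
    return "no evidence for breathier voice"


# Invented values for illustration only.
print(breathiness_evidence([6.1, 5.4, 7.0], [18.0, 20.5, 19.2],
                           [3.2, 4.0, 2.9], [24.1, 25.3, 23.8]))
```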
Prior to modeling, we excluded all very short and long diphthongs with durations of less than 50 or greater than 300 ms. Word initial vowels are generally marked with glottal onsets in SG (Kohler, 1990), which could lead to lower H1*–H2* values in the following vowel. As it was not clear to what extent this would also be the case in the KD speakers, we visually explored whether excluding items that occurred in word initial position would cause a change in the overall patterns visible in the data. Exclusion of items with word initial vowels did not alter the general patterns observed, so word initial items were ultimately retained to increase the overall number of items analyzed. For the conversation task data, this resulted in 3,454 (KD: 2,250; SG: 1,204) items remaining for analysis. For the sentence-reading task data, this resulted in 1,367 (KD: 425; SG: 942) items remaining for analysis.
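The duration-based exclusion can be expressed compactly. The following Python sketch assumes each item is a dictionary with a hypothetical `duration_ms` field; the field names and the example items are invented for illustration, and only the 50 ms and 300 ms thresholds come from the procedure described above.

```python
def filter_by_duration(items, min_ms=50.0, max_ms=300.0):
    """Keep items whose duration lies within [min_ms, max_ms].

    Items shorter than min_ms or longer than max_ms are excluded.
    """
    return [item for item in items
            if min_ms <= item["duration_ms"] <= max_ms]


# Invented example items (not from the corpus).
items = [
    {"label": "OY", "duration_ms": 42.0},    # too short, excluded
    {"label": "aI", "duration_ms": 50.0},    # boundary value, retained
    {"label": "aU", "duration_ms": 180.5},   # retained
    {"label": "OY", "duration_ms": 310.0},   # too long, excluded
]
kept = filter_by_duration(items)
print(len(kept), "of", len(items), "items retained")
```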
2.3 Statistical analysis
Linear mixed-effects regression models were constructed using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages in
3 Results: production
3.1 Analysis of F0
3.1.1 Conversation task data
Figure 1 illustrates the raw F0 (Hz) estimates for the KD and SG groups according to gender for the conversation task data. As can be seen, KD speakers (both female and male) exhibit somewhat higher F0 than their SG counterparts. The results of the linear mixed-effects model are shown in Table 2. Unsurprisingly, we found a significant effect of

Mean F0 values (Hz) in conversation task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on F0 in Conversation Task Data.
3.1.2 Reading task data
Figure 2 illustrates the raw F0 (Hz) estimates for both groups according to gender for the sentence-reading task data. The results of the linear mixed-effects model are shown in Table 3. Again, unsurprisingly, there was a significant effect of

Mean F0 values (Hz) in reading task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on F0 in Reading Task Data.
3.2 Analysis of H1*–H2*
3.2.1 Conversation task data
Figure 3 illustrates the H1*–H2* estimates for both groups according to gender for the conversation task data and shows that in both female and male speakers, higher H1*–H2* values are found for the KD speakers. This supports the hypothesis of a breathier voice quality in the KD group, although we note that as mentioned above, it is also possible that this represents a more constricted (i.e., creakier) voice quality in the SG group, and less constriction in the KD speakers, rather than breathier voice per se. The results of the linear mixed effects model are shown in Table 4. There were significant effects of

Mean H1*–H2* values (dB) in conversation task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on H1*–H2* in Conversation Task Data.
3.2.2 Reading task data
Figure 4 illustrates the H1*–H2* estimates for both groups according to gender for the sentence-reading task data. As was the case in the conversation task data, higher H1*–H2* values were found for the KD speakers for both female and male speakers, indicating a breathier (or at least a less constricted) voice quality compared to the SG group. The results of the linear mixed-effects model are shown in Table 5. There was a significant effect of

Mean H1*–H2* values (dB) in reading task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on H1*–H2* in Reading Task Data.
3.3 Analysis of HNR
3.3.1 Conversation task data
Figure 5 illustrates the HNR estimates for both groups according to gender for the conversation task data. Lower values of HNR are visible in the KD group, particularly in the male speakers. The results of the linear mixed-effects model are shown in Table 6. There was a significant effect of

Mean HNR values (dB) in conversation task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on HNR in Conversation Task Data.
3.3.2 Reading task data
Figure 6 illustrates the HNR estimates for both groups according to gender for the sentence-reading task data. Lower values are visible for the KD speakers in both female and male speakers. The results of the linear mixed-effects model are shown in Table 7. There were significant differences for

Mean HNR values (dB) in reading task data according to group (left, SG; right, KD) and gender (left panel, female; right panel, male).
Summary of Linear Mixed-Effects Model for Effects of Group (Reference Level: SG) and Gender (Reference Level: Female) on HNR in Reading Task Data.
3.4 H1*–H2* and HNR in combination
The results in Sections 3.2 and 3.3 above suggest that participants in the KD group may produce a breathier voice quality compared to those in the SG group. In Section 3.2, higher values of H1*–H2* were observed in the KD group, which is indicative of increased glottal opening as would be expected in breathy voice. In Section 3.3, lower values of HNR were observed in the KD group, which is consistent with increased noise and which is also to be expected in breathy voice. These results were observed in both the conversational speech and in the reading task.
Figures 7 (conversational task) and 8 (reading task) illustrate these results in a combined manner for each individual item, with HNR values shown on the vertical axis and H1*–H2* values shown on the horizontal axis. The middle 50% of all items per category are represented by the ellipses. Overall, in all cases, the KD group exhibits values that are further to the right (higher H1*–H2*) and lower (lower HNR) than the SG group, confirming increased breathiness in this group. This is evident in both of the tasks and across both genders, although the difference appears to be slightly weaker in the females in the conversation data (Figure 7).

HNR (vertical axis) and H1*–H2* (horizontal axis) values for each item in the conversational task data according to group (red = SG; blue = KD) and gender (left panel, female; right panel, male). Ellipses represent the center 50% of items per category.
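One common way to construct an ellipse covering the central 50% of bivariate data is a normal-theory data ellipse, sketched below in Python. This is an assumption made for illustration: the sketch does not reproduce the plotting procedure used for the figures, the function names are hypothetical, and it presumes that the (H1*–H2*, HNR) points of a category are approximately bivariate normal.

```python
import math


def ellipse_threshold(coverage=0.5):
    """Squared Mahalanobis radius enclosing `coverage` of a bivariate normal.

    For two dimensions the chi-square quantile has the closed form
    -2 * ln(1 - coverage).
    """
    return -2.0 * math.log(1.0 - coverage)


def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def inside_ellipse(point, center, cov, coverage=0.5):
    """True if `point` lies within the data ellipse for `coverage`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Squared Mahalanobis distance using the inverse of the 2x2 covariance.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= ellipse_threshold(coverage)
```

Under the normality assumption, roughly half of the items in a category fall inside its ellipse, so non-overlapping ellipse centers for two groups indicate a shift in the bulk of the items rather than in a few extreme values.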
4 Interim discussion
These results provide complementary evidence that KD speakers produce a breathier voice quality than their SG-speaking peers. In both the conversation and reading task data, H1*–H2* values were significantly higher in the KD speakers compared to the SG speakers, which is indicative of breathy voice. Correspondingly, significantly lower values for HNR were also found in the KD speakers, indicating more noise in their speech and also pointing toward a breathier voice quality (Garellek, 2019). Figures 7 and 8 also illustrate that at the level of individual items, the KD group appears to be breathier than the SG group. Taken together, these results provide converging evidence that the KD speakers in our data have a breathier voice quality than their SG counterparts, with the results from both cues pointing in the same direction. The two tasks can also be considered to represent different communicative situations, with a more controlled reading task on the one hand and a spontaneous conversation task on the other, so that more natural speech might be expected in the conversations than in the reading task. That we found similar results across the different tasks (and across different groups of KD speakers) suggests either that our speakers perceived both tasks as similarly formal or that increased breathy voice in KD speakers is a robust effect that is evident across different speaker groups in different communicative situations. We leave it to future research to explore this question in more detail.

HNR (vertical axis) and H1*–H2* (horizontal axis) values for each item in the reading task data according to group (red = SG; blue = KD) and gender (left panel, female; right panel, male). Ellipses represent the center 50% of items per category.
We also observed a tendency for higher F0 values in the KD speakers, particularly in the males, though no significant effect of group was found. Of course, significant effects were found for F0 with regard to
Having established that there appear to be differences in voice quality between these two groups of speakers, the question that needs to be asked is whether listeners are sensitive to these differences, and whether such a difference in voice quality carries social meaning. Listeners can of course perceive differences in voice quality; even in languages where voice quality is not used contrastively, differences in voice quality are known to influence listeners’ perceptions of a speaker’s characteristics or their affective state (e.g., Anderson et al., 2014; Gobl & Ní Chasaide, 2003; Yuasa, 2010). Yet the degree to which listeners associate a particular voice quality with a (multi)ethnolectal group of speakers is not well understood. In American English, Newman and Wu (2011) speculated that listeners may rely on a breathier voice quality as one feature in identifying Asian Americans from those with other ethnicities; however, this notion was not tested directly. In New Zealand English, Szakay (2012) found that listeners used voice quality differences (combined with rhythmic and intonational cues) to identify whether speakers sounded more Māori or more Pākehā, with breathier voice qualities being perceived as more Pākehā sounding and creakier voices as more Māori sounding. This effect was strongest for listeners who were highly integrated within Māori communities. Whether voice quality also indexes group affiliation for listeners from outside of a group, and whether voice quality on its own is sufficient for this, remains to be investigated.
While non-modal voice quality features such as creak may signal the onset of stressed syllables or phrase/utterance finality, breathy voice is not known to be employed for any structural purpose in German. If at all, it might be considered as signaling femininity or intimacy, as in other languages spoken in Europe, such as Dutch and Spanish (Gobl & Ní Chasaide, 2003; Mendoza et al., 1996; Sulter & Peters, 1996; Van Borsel et al., 2009), though to our knowledge, this has not been empirically examined specifically for German. Such a perception would, however, appear to be at odds with the stereotypical image portrayed by many KD speakers, who tend to project a tough, inner city image (Bahlo & Lohse, 2021). In addition, it is not clear whether or to what extent variation in voice quality is salient to listeners, and whether it forms a part of their perception of specific social groups, separate from more salient variation in segmental features. That is, it is not clear whether there is a link between the phonetic form of producing a breathier voice quality and the social characteristic of being a KD speaker (Johnson et al., 1999), in the absence of a segmental marker such as coronalization of /ç/, which has been shown to be robustly linked to listeners’ perception of KD speakers.
Therefore, in the following sections, we report on a perception test designed to examine the social meaning of breathy voice with regard to KD, and whether this is perceived relative to other segmental cues. We examined the effect of breathy voice on the perception of a speaker’s background both on its own and in combination with two segmental cues to KD: coronalization of /ç/, and /ɔɪ/-fronting. Coronalization has been shown to be a salient marker of KD (Jannedy & Weirich, 2014c; Weirich et al., 2020), whereas evidence for /ɔɪ/-fronting has been found in production studies (Jannedy & Weirich, 2013, 2014a, 2014b) but it is not yet clear to what extent listeners associate this with KD. If a breathy voice quality has been enregistered as a feature of KD, we would expect listeners to be more likely to rate someone as a KD speaker when hearing a breathier utterance than when hearing a modally voiced utterance. However, it is possible that voice quality on its own is not sufficient to shift listeners’ perception, but rather that its implementation in conjunction with other cues, such as the coronalization of /ç/, will enhance the perception of a KD speaker. Finally, it is also possible that differences in voice quality exist between KD and SG speakers, but that this phonetic feature has no perceptual relevance for listeners; that is, despite acoustic differences in production, voice quality may not (yet) be enregistered as a perceptually relevant cue to KD.
5 Methods: perception
5.1 Listeners
A total of 171 listeners (diverse: 5, non-binary: 1, female: 56, male: 98, no info: 11) took part in the online perception test. Participants who did not finish the test or who said they did not know how KD sounds were excluded from the analysis, resulting in 140 listeners (diverse: 2, non-binary: 1, female: 50, male: 87). Listener age ranged from 18 to 40 years, with a mean age of 28.1 years. The time participants had lived in Berlin ranged from 1 to 39 years.
5.2 Stimuli
The stimuli used in the perception test consisted of the sentence
A professional voice actor—from Berlin, with a multiethnolectal background and knowledge about the variety KD—was paid to produce the stimuli sentences several times, and was trained in varying his voice quality from modal to breathy voice. We acknowledge that using a voice actor necessarily entails that our stimulus items were not produced by an “authentic” speaker of KD. However, we felt that this was necessary to ensure a level of phonetic control over the variants of the relevant features and to ensure that other features not being examined remained consistent between the items, and to avoid including additional cues to the identity of the speaker. Moreover, as (at least some of) the features being tested appear to be below the level of awareness of speakers, it can be quite difficult to find speakers who can vary these convincingly. The sentence productions differed with regard to voice quality (modal vs. breathy voice) and the variants used for the diphthong and the fricative (supposedly coming from a KD speaker or an SG speaker). Out of all productions, one sentence was chosen for each voice quality condition (based on H1*–H2* measures) to be used as carrier sentences for the other stimuli with varying segmental cues. Based on our knowledge of the acoustic characteristics of the KD and SG variants, the most suitable productions of the two variants for each condition (KD and SG) were chosen and spliced into the respective carrier sentence (i.e., the modal and breathy conditions), both singly and in combination. Great care was taken to cut and splice at zero crossings to avoid auditory discontinuities. 
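The splicing step described above can be illustrated with a minimal sketch. It is not the procedure actually used to build the stimuli, which is not specified beyond cutting at zero crossings; the function names, the restriction to upward (negative-to-positive) crossings, and the toy sample values are all assumptions made for illustration. The sketch snaps each requested cut point to the nearest upward zero crossing in the carrier and inserts the replacement segment between the two snapped points; the replacement segment is assumed to have been cut at upward zero crossings as well.

```python
def nearest_zero_crossing(samples, index):
    """Return the index of the first non-negative sample of the upward
    zero crossing closest to `index` (hypothetical helper)."""
    crossings = [
        i for i in range(1, len(samples))
        if samples[i - 1] < 0 <= samples[i]  # sign change from negative to non-negative
    ]
    if not crossings:
        raise ValueError("signal has no upward zero crossing")
    return min(crossings, key=lambda i: abs(i - index))


def splice(carrier, start, end, segment):
    """Replace roughly carrier[start:end] with `segment`, after moving both
    cut points to the nearest upward zero crossings in the carrier."""
    cut_in = nearest_zero_crossing(carrier, start)
    cut_out = nearest_zero_crossing(carrier, end)
    return carrier[:cut_in] + segment + carrier[cut_out:]


# Toy waveform (invented values), with upward crossings at indices 2, 6 and 9.
carrier = [0.5, -0.5, 0.5, 1.0, -1.0, -0.2, 0.3, 0.8, -0.4, 0.1]
spliced = splice(carrier, 3, 7, [0.1, 0.9, -0.9])
```

Cutting both the carrier and the inserted segment at crossings of the same direction keeps the waveform continuous across the join, which is what avoids audible clicks.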
Thus, the final stimuli used for the perception test consisted of four sentences in the modal condition and four sentences in the breathy condition that varied only in the production of the target cues (1: both variants in SG, 2: KD diphthong and SG fricative, 3: SG diphthong and KD fricative, 4: both variants in KD; see also Table 8 below), while the rest of the sentence did not vary (within the modal and breathy conditions). Attention was also paid to the comparability of the segmental cues in terms of formants and COG between the modal and breathy conditions. While the variation of F0 over the utterance was very similar between the carrier sentences (see Figure 9), mean F0 varied to some extent (breathy mean F0: 106 Hz; modal mean F0: 161 Hz). To control for a possible effect of fundamental frequency on the ratings, mean F0 of all sentences was changed to 120 Hz using the
Acoustic Characteristics of All Eight Stimuli Sentences.
COG (

Oscillogram and spectrogram of the breathy stimulus
Several analyses were made to compare the spectral characteristics of diphthongs and fricatives in KD and SG, and Table 8 shows the acoustic parameters measured in the original files used to create the final stimuli for the perception test. While a clear difference between the fricatives in SG and KD in terms of COG and

Oscillogram and spectrogram of /lɔɪ/ (from “Leute”) in SG (left) and KD stimuli (modal condition).
Table 8 also includes mean H1*–H2* values across all voiced segments in the stimuli sentences and shows clear differences between the modal and the breathy stimuli, with higher values in the breathy stimuli.
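To give a sense of the size of the F0 adjustment reported in this section (breathy carrier 106 Hz, modal carrier 161 Hz, both set to a mean of 120 Hz), the following sketch converts each change into a scaling factor and a shift in semitones. It only quantifies the change; it does not reproduce the resynthesis itself, and the function names are hypothetical.

```python
import math


def scaling_factor(source_hz, target_hz):
    """Multiplicative factor that moves a mean F0 from source to target."""
    return target_hz / source_hz


def semitone_shift(source_hz, target_hz):
    """Signed distance from source to target in semitones
    (12 semitones correspond to a doubling of frequency)."""
    return 12 * math.log2(target_hz / source_hz)


TARGET_HZ = 120.0  # common mean F0 reported for all stimuli
breathy_shift = semitone_shift(106.0, TARGET_HZ)  # positive: raised
modal_shift = semitone_shift(161.0, TARGET_HZ)    # negative: lowered
```

On these figures the breathy carrier is raised by roughly two semitones and the modal carrier lowered by roughly five, so the modal stimuli underwent the larger manipulation.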
5.3 Perception test
The perception test was run via an online platform (Easyfeedback.de), and the link was distributed through the authors’ social networks. Listeners were not paid for participation but were given the option to register for a prize draw of €50 worth of gift vouchers. To minimize the time and effort participants had to invest to take part in the perception test and to ensure participants’ decisions on the perceived background of the speaker were judged against their own past experiences rather than categorized with regard to other stimulus items, each listener was asked to rate just one of the eight stimuli. Each of the eight stimuli was rated by a different group of listeners, varying in number between 15 and 22 listeners. Listeners were asked how likely it was on a scale from 0 (
5.4 Statistical analysis
As in the production study, linear regression models were used for the analysis of the perception data using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages in
Means and Standard Deviations for Listener Age According to Stimulus.
6 Results: perception
Figure 11 illustrates the perception scores for each stimulus separated by voice quality (modal, breathy) and age (here split into a categorical factor: below and above 0 on the centered listener-age variable). The mean ages of these groups are 24.2 (SD: 2.54) and 33.9 (SD: 3.1), respectively. A separation between the ratings is apparent between the stimuli with a KD fricative (stimulus numbers 3 and 4) and the stimuli with an SG fricative (stimulus numbers 1 and 2) in all subgroups, but to varying degrees. The clearest separation between all four stimuli is found for the younger listeners in the modal voice condition: here the expected stepwise rise in the ratings can be seen, with the stimulus with only SG variants showing the lowest score and the stimulus with only KD variants showing the highest score. The stimuli with only one of the segments in its KD version lie in between, with the stimulus with the KD fricative showing higher ratings than the stimulus with the KD diphthong. The smallest differences between the stimuli are found for the older listeners in the modal voice condition. Here, ratings in favor of KD are generally low, while the older listeners tend to rate the speaker as KD when the KD fricative is matched with a breathy voice quality. In contrast, the younger listeners rate the speaker as less KD-like when the KD fricative is matched with a breathy voice quality. This points to a varying effect of voice quality on KD ratings depending on listener age when the KD fricative is contained in the stimulus.
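The age split used for plotting can be sketched as follows. This is an illustration under stated assumptions, not the analysis script: the ages are invented, the function names are hypothetical, and assigning a listener whose age equals the mean exactly to the older group is an arbitrary choice, since the text only specifies "below and above 0".

```python
from statistics import mean


def center(values):
    """Subtract the sample mean so that 0 corresponds to the mean age."""
    m = mean(values)
    return [v - m for v in values]


def split_by_centered_age(ages):
    """Split listeners into a younger (centered age < 0) and an older
    (centered age >= 0) group; the handling of exactly 0 is an assumption."""
    centered = center(ages)
    younger = [age for age, c in zip(ages, centered) if c < 0]
    older = [age for age, c in zip(ages, centered) if c >= 0]
    return younger, older


# Invented listener ages, for illustration only.
ages = [20, 24, 28, 32, 36]
younger, older = split_by_centered_age(ages)
```

The grouping serves the visualization only; in the model, listener age enters as a centered numerical variable, as stated in the caption of Table 10.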

Results of the perception test according to age and voice quality conditions.
A linear model was run testing for effects of voice quality, diphthong variant, fricative variant, and age of listener on the ratings including all interaction terms. The results are presented in Table 10.
Summary of Linear Model for Effects of Voice Quality Condition, Fricative, and Diphthong Variant (Reference Level: VQ_modal, /ɔɪ/-KD, /ç/-KD) and Listener Age (Numerical Variable With 0 As Mean Age and Negative Values Presenting Younger Listeners) on Perception Ratings.
For the reference levels (modal voice, both variants KD-like), we found an effect of age, with older listeners rating the stimuli in general less KD-like (Estimate −0.848,

Model plot visualizing the interaction term age (ageCS) * voice quality (vq) * fricative variant (KD/SG).
7 General discussion
In this study, we explored differences in voice quality between two varieties of urban German as spoken by adolescents in Berlin. We also tested the relative perceptual salience of the segmental alternation /ç/–[ʃ] and of the fronting of /ɔɪ/, previously found to be characteristic in speech production in the German youth-style multiethnolect KD (Jannedy et al., 2011; Jannedy & Weirich, 2014a, 2014b, 2017), and of the breathy voice quality found for speakers of KD as described in this work. Results indicate a perceptual gradience for phonetic alternations detected in KD. The most widely observed, prevalent, and obvious segmental alternation, /ç/–[ʃ], which is strongly associated with KD (Jannedy & Weirich, 2014c; Weirich et al., 2020), was shown to be a highly salient and the most reliable marker in our speech perception experiment. The second segmental alternation, the fronting of /ɔɪ/, also found to be a reliable marker of KD in speech production (Jannedy & Weirich, 2013; Weirich et al., 2024), seems to have been generally detectable, especially in modal voice by the younger listeners (see Figure 12), but it was not reliably interpretable: association rates with KD failed to reach significance, suggesting that this alternation is not yet associated with KD or enregistered as a feature of it. In other words, our results indicate that the fronting of /ɔɪ/ is currently an indicator (reliable difference in production but unnoticed) with the potential to eventually become a marker (reliable difference in production and connected to a social trait) for a wider group of listeners. The seemingly categorical distinction between indicator and marker seems somewhat problematic given that such categorization appears to be highly listener-specific. In fact, this process resembles that of phonologization, where acoustic variation can give rise to new sound patterns or structures in the grammar.
In the social domain, acoustic variation gives rise to patterns in social structure by means of enregisterment (Agha, 2003) “whereby distinct forms of speech come to be socially recognized (or enregistered) as indexical of speaker attributes by a population of language users” (Agha, 2005, p. 38). Thus, the process of enregisterment resembles that of sound change which slowly progresses through populations of speakers and hearers rather than constituting instantaneous switches.
As for the phonation difference, while we did not find an effect of voice quality overall, we found that for older listeners, the combination of breathy voice with the KD-like fricative variant resulted in a greater proportion of KD responses; however, for younger listeners, who we might expect to have increased experience with KD given it is a youth-style multiethnolect, the addition of breathy voice to stimuli containing the fricative marker resulted in a lower proportion of KD responses. That is, for younger listeners, who were overall more likely to identify the speaker as a KD speaker, breathy voice does not appear to be associated with KD and reduces the likelihood that listeners perceive a KD speaker even when they produce coronalized variants of the palatal fricative, which is otherwise a strong cue to a KD speaker. Why then did the addition of breathy voice result in more KD responses for the older listeners, who were less likely to rate the speaker as a KD speaker overall? Two interpretations here seem possible, though one appears to us more plausible: Perhaps older listeners are more sensitive to the phonetic features of KD, to the extent that they have internalized increased breathiness as a marker of the variety. This seems rather unlikely, given their lower sensitivity to the fricative and diphthong variants. Or perhaps older listeners perceived the voice quality difference in the stimuli, and realizing that this was not typical of their past experience with SG, concluded that this must be connected to KD. That is, rather than drawing on their past experience with KD, they were listening for difference from their own production/variety, and the items with both a salient fricative difference
The data presented suggest that segmental differences may be more easily and more reliably detected and learned when they go hand in hand with meaning differences. Such meaning differences may be phonological in nature, for example when a minute change in the articulatory parameter causes an abrupt change in the acoustic space and when a minute change in the acoustics causes an abrupt shift in the perceptual category (Stevens, 1972). Such
The failure of younger listeners to connect a breathier voice quality to KD, although voice quality differences in production have been shown here and also between other social groups, for example in Australia and Great Britain (e.g., Loakes & Gregory, 2022; Penney & Cox, 2021; Szakay & Torgersen, 2015), raises the question of whether we have merely asked the wrong question in the perception experiment or whether a global feature such as voice quality, whose domain of application spans larger stretches of speech rather than being localized to individual words, morphemes, or sounds, is too variable to be rigidly connected to specific speaker groups. However, F0 is also a global parameter and is often connected to social constructs, such as gender performance or authority. According to Ohala’s (1994) biological frequency code hypothesis and its interpretations (cf. Gussenhoven, 2002), a higher F0 is associated with smallness and deference while a lower F0 is associated with tallness and masculinity. In addition, a meta-study by Winter et al. (2021) found that mean F0 was generally lower when speakers conversed with an imaginary superior as compared to an imaginary friend. This fits well with former British Prime Minister Margaret Thatcher striving for authority by lowering her voice (Beattie et al., 1982) so as not to appear submissive and overly feminine. This is corroborated by Klofstad et al. (2015), who found that voters prefer leaders with lower-pitched voices because they are perceived as more competent and as having greater integrity. Thus, F0 as a global parameter does lend itself to social meaning. Nevertheless, in our production study, F0 was not found to differ significantly between the KD and SG groups.
So either voice quality differences are found in multiethnolects but they do not have a social meaning from the perspective of the language user and from the perspective of the hearer and interpreter, or we must also entertain the thought that the stimuli used in the perception study lacked specific acoustic characteristics that would have been the necessary prerequisites for different ratings. For example, it may be possible that the increased breathiness in KD speakers is linked to hoarseness rather than breathy voice per se; this may explain the acoustic effects we found, as hoarse voice (or harsh whispery voice, Esling et al., 2019; Laver, 1980) can also contain a breathy component paired with additional supraglottal constriction. In such a case, perhaps additional acoustic features rather than just increased breathiness would have generated different perception results. The speaker who produced the stimuli was also older than the participants in the production study, thus the perceived speaker age may also have played a role. Another reason that comes to mind as to why voice quality differences went unnoticed in our perception study is that breathiness may become interpretable only in conjunction with other features such as specific requirements on rhythm or F0 (cf. Szakay, 2012). Finally, an alternative explanation may be that a breathier voice quality
8 Conclusion
Beyond physiological aspects, voice quality is learned behavior, just as any other phonological or phonetic expression. The implementation of breathy voice in several languages by specific speaker groups points to a larger pattern that currently has not attracted sufficient attention for associations with these speaker groups and thus social meaning to emerge. In this paper, we showed that in speech production, a voice quality difference is apparent in KD speakers (as has been found in several other multiethnolectal varieties of English); however, this was not picked up on in perception by (younger) German listeners. We firmly believe that deriving conclusions on social meanings from speech production studies alone poses the danger of leaving the meaning of an alternation up to the researcher rather than the speech community in which it occurs
