Abstract

The current paper studies the differences in speech intelligibility in noise measured under reproduced acoustic environments implemented using different recording and rendering techniques. The acoustics of two rooms with different volumes and reverberation times were reproduced by spherical harmonics-based spatial sound reproduction, and the speech intelligibility under the reproduced acoustic environments was compared to that in the original rooms by conducting subjective listening tests. Four implementations of spatial sound reproduction, realised by combining two recording techniques (using first and higher order Ambisonics microphones) and two rendering techniques (using a headphone and a loudspeaker array), were evaluated. The experimental results show that, when the room is highly reverberant, reproduced acoustic environments implemented using either a first- or higher-order Ambisonics microphone and a 32-ch loudspeaker array yielded speech intelligibility not significantly different from that observed in the original real rooms. The same implementation with the higher-order Ambisonics microphone also most accurately replicated the effect of angular separation between the speech and noise sources, as well as of source distance, on speech intelligibility. The results also suggest that the rendering technique has a larger effect than the recording technique on the reproducibility of speech intelligibility in the real room, whereas the recording technique has a larger effect on the reproducibility of the effect of angular separation between speech and noise sources in the real room.

Keywords

speech intelligibility, spatial sound reproduction, spherical harmonics, spatial release from masking, loudspeaker array, binaural

Introduction

Achieving good speech intelligibility is one of the priority goals in designing the acoustics of rooms. Despite being a critical factor in room acoustics design, speech intelligibility can be easily compromised when the space is noisy due to auditory masking.¹ Nevertheless, the human hearing system has various countermeasures to address the issue.² Binaural hearing, that is, hearing sound with two ears, enables us to capture spatial acoustics, which characterises the spatial properties of sound propagation in the space. While binaural hearing brings significant benefits in understanding speech in noise, such as spatial release from masking,^3,4 these benefits are also subject to the spatial acoustics of the environment. Many previous studies have investigated how the spatial acoustics of environments affect these benefits.^5–8

Although it is common practice in speech and hearing research to conduct subjective listening tests for measuring speech perception in noise, the tests have to be conducted under controlled and reproducible settings. Such requirements are normally satisfied by undertaking the test in a laboratory such as an anechoic chamber or an acoustically treated/insulated listening booth. However, when the focus of the study is the effect of the spatial acoustics of environments, it is infeasible to conduct the test in these laboratory settings because the acoustics have to be varied widely to cover the range of acoustical environments typically seen in the real world. Historically, studies looking into the effect of room acoustics on speech perception have typically used stimuli with artificially simulated acoustical effects added (e.g. Rogers et al.⁹ and Masuda¹⁰), which do not necessarily represent real acoustical environments.

Over the last few decades, spatial sound reproduction¹¹ has emerged, allowing users to virtually experience various acoustic environments without visiting the actual sites. Spatial sound reproduction can address this challenge in speech and hearing research thanks to its ability to deliver a realistic yet controllable and reproducible spatial acoustic experience to participants.¹² Several recent studies have adopted this methodology and conducted subjective listening tests examining the effect of varying room acoustics on the intelligibility of speech in noise using state-of-the-art spatial sound reproduction.^7,13,14 However, a question regarding reproducibility remains: “How accurately would the results collected under reproduced acoustic environments replicate the results collected in the real environment (had it ever been feasible)?” To answer this crucial but non-trivial question, a previous study¹⁵ compared speech intelligibility in noise between a real room and its reproduced acoustic environments realised by Ambisonics technology.¹⁶ Ambisonics is one of the most commonly used technologies for spatial sound reproduction, in which the original sound environment is described by spherical harmonics (we call such technology spherical harmonics-based spatial sound reproduction hereafter).¹⁷ The technology is implemented by measuring the sound in the original environment using an array of microphones and encoding the recordings into spherical harmonics; the sound field in the original environment is then reproduced by decoding the recordings and rendering them through either a headphone or a loudspeaker array. In the previous study,¹⁵ a 64-ch loudspeaker array was used for rendering sound, while recording was achieved by either numerical simulation or practical measurement of the room impulse response (RIR) in the real room using a 52-ch spherical microphone array.
The study found the results collected in reproduced acoustic environments that used the measured RIRs provided the highest match to the results collected in the real environments. Another study¹⁸ investigated the differences in speech perception between real and reproduced acoustic environments by recruiting both normal hearing and hearing impaired listeners. The study also used the same 64-ch loudspeaker array with a 32-ch microphone array for recording the RIRs.

While these previous studies have provided great insights into the validity and limitations of the methodology using spatial sound reproduction in speech and hearing research, the microphone and loudspeaker arrays used in these studies are often inaccessible to many researchers due to their high cost. With spherical harmonics-based spatial sound reproduction, the reproduction accuracy of the acoustics in the original space strongly depends on the order of spherical harmonics the implemented system can realise,¹⁹ which is governed by the number of microphones used for recording sound as well as the number of loudspeakers used for rendering sound. Fortunately, a wider variety of equipment designed for implementing spherical harmonics-based spatial sound reproduction has recently become available on the market, some of which is much more affordable and accessible. For microphones, many products use only four microphones, typically arranged in a tetrahedral shape, and can realise only up to first order spherical harmonics; these are known as first order Ambisonics (FoA) microphones. Other equipment uses many more microphones to enable higher order spherical harmonics, known as higher order Ambisonics (HoA) microphones; however, such equipment is often less affordable than its FoA counterparts. Likewise for rendering, the most commonly used technique is binaural rendering via headphones because of their affordability for most users. However, its reproduction accuracy may be compromised when it is used with Ambisonics microphones because the recordings would not allow accurate modelling of the directivity pattern of human ears. Compounded with the fact that the sound recorded by the Ambisonics microphones is already an approximation due to the limited spherical harmonics order, the technique would suffer from low reproduction accuracy.
In addition, it also relies on the average head related transfer function (HRTF) measured by a head and torso simulator, which often differs from an individual’s HRTF. An alternative that addresses these issues is an array of loudspeakers, which physically reproduces the sound field measured in the original environment inside the loudspeaker array where listeners are seated. However, such a system is not as accessible as binaural rendering since it requires dozens of loudspeakers installed in an anechoic chamber. These differences in recording/rendering techniques imply that the results collected under reproduced acoustic environments implemented with different recording/rendering techniques would vary and may not match well with the results collected in the real acoustic environments. For example, a few previous studies investigated the spatial release from masking under reproduced acoustic environments realised by as low as first order spherical harmonics and by rendering sound via headphone²⁰ or loudspeaker array.^15,21 However, no previous study has compared speech intelligibility under reproduced acoustic environments realised by different audio recording/rendering techniques, for example, headphone versus loudspeaker array, as well as against that in real acoustic environments.

The present study investigates the differences in speech intelligibility in noise, including the benefit from spatial release from masking that previous studies focused on, measured under spherical harmonics-based spatial sound reproduction implemented by various recording/rendering techniques. The study discusses how well data collected in real environments can be reproduced using spherical harmonics-based spatial sound reproduction by comparing the results from reproduced acoustic environments and real rooms. The study will answer the research question: (1) “How do varying implementations of spherical harmonics-based spatial sound reproduction differ in replicating speech intelligibility in noise in real rooms?” In addition, the study will also investigate: (2) “How does the effect of spatial acoustics on speech perception vary under different implementations?” Here, “implementation” refers to the recording/rendering equipment rather than the technology for realising spatial sound reproduction itself, such as the order of spherical harmonics used for recording or the algorithms used for rendering. We hypothesise that the implementation using an HoA microphone with a loudspeaker array, which is able to deliver more accurate reproduction of the original acoustic environment using higher order spherical harmonics without suffering from the limited channel number/HRTF mismatches, would deliver the results that best match those collected from the real acoustic environments, including the effect of spatial acoustics. While the present study utilises the SPARTA suite (https://leomccormack.github.io/sparta-site/), an open-source state-of-the-art spatial sound reproduction plugin suite available online, as an example of a spherical harmonics-based spatial sound reproduction technique because of its high accessibility, the study does not intend to discuss the differences in performance between various spatial sound reproduction techniques.
Through findings from the current study, we rather aim to allow users of spherical harmonics-based spatial sound reproduction for speech and hearing research to make an informed decision on the choice of their implementation design.

Methodology

Subjective listening tests were conducted to measure speech intelligibility in noise under spherical harmonics-based spatial sound reproduction implemented by different combinations of recording and rendering techniques. To acquire baseline data, the same test was also conducted in the real (original) rooms whose acoustics were reproduced. The results collected from both the real rooms and reproduced acoustic environments were compared. This section summarises the design of the listening tests along with the details of the spatial sound reproduction implemented in this study.

Venues and their acoustics

To cover the broad range of acoustic conditions seen in the real world, two venues with differing acoustics were used in this study: the seminar room (#405-430) and the MacLaurin Chapel (#107-G03), both at the University of Auckland, New Zealand, shown in Figure 1. The listening tests in real rooms were conducted in these venues. The tests under reproduced acoustic environments were conducted in the anechoic chamber of the Acoustics Laboratory at the University of Auckland (Figure 2) by replicating the acoustics measured in the seminar room and chapel using the selected implementations of spatial sound reproduction discussed in the next section. The seminar room has moderate reverberation, with a reverberation time ($T_{20}$) of approximately 0.7 s, whereas the chapel has longer reverberation, with a $T_{20}$ of about 1.8 s.

Figure 1.

Venues used in the current study for conducting the listening test in the real rooms as well as measuring the room impulse responses to virtually replicate their acoustics using spatial sound reproduction: (a) seminar room (#405-430) and (b) MacLaurin Chapel (#107-G03).

Figure 2.

The first author seated at the centre of the 32-ch loudspeaker array installed in the anechoic chamber at the University of Auckland.

The acoustics at the centre of the rooms when the sound source was located at one of the eight positions in the rooms (4 angles $\times$ 2 distances) shown in Figure 3 were used in this study. Details of how they were measured and used are given in the following sections. The speech clarity (C50) values at the centre of the rooms when the source was located at 2 and 5 m (both at $0^{\circ}$) were 15.9 and 6.8 dB, respectively, in the seminar room and 10.3 and 3.3 dB, respectively, in the chapel. The approximate volumes of the seminar room and the chapel are 400 and 1200 $\mathrm{m}^{3}$, respectively.
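For readers unfamiliar with the metric, C50 is the ratio, in decibels, of the energy arriving within the first 50 ms of an impulse response to the energy arriving afterwards. The following Python sketch illustrates that definition only; it is not the procedure used to obtain the values reported here. The function name `clarity_c50` is hypothetical, and the sketch assumes a single-channel RIR that has already been trimmed so that it starts at the direct sound.

```python
import math


def clarity_c50(rir, fs):
    """Return C50 in dB for a single-channel impulse response.

    Assumption: rir[0] coincides with the arrival of the direct sound.
    rir is a sequence of samples and fs is the sampling rate in Hz.
    """
    split = int(round(0.050 * fs))  # sample index of the 50 ms boundary
    early = sum(x * x for x in rir[:split])  # energy from 0 to 50 ms
    late = sum(x * x for x in rir[split:])   # energy after 50 ms
    if early == 0.0 or late == 0.0:
        raise ValueError("C50 is undefined when either energy term is zero")
    return 10.0 * math.log10(early / late)


# Toy impulse response: a unit direct sound plus one late reflection
# at 60 ms with one tenth of the amplitude (hypothetical data).
fs = 1000
toy_rir = [0.0] * 100
toy_rir[0] = 1.0
toy_rir[60] = 0.1
toy_c50 = clarity_c50(toy_rir, fs)
```

A larger C50 indicates that more of the energy arrives early, which is consistent with the higher values reported for the shorter source distance in both rooms.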

Figure 3.

Dimensions and volumes of the rooms used in the current study with the positions of sound sources and participants. The location of the target speech was fixed at one of the green loudspeaker icons, whereas the location of the noise source was varied between the locations of both green and orange loudspeaker icons. The distance of the target speech and noise was kept the same, that is, either 2 m or 5 m. (a) Seminar room and (b) chapel.

Spatial sound reproduction implementation

The study adopted the SPARTA suite as an example of spherical harmonics-based spatial sound reproduction. It was implemented by combinations of two different recording and rendering techniques with varying accessibility, resulting in four different implementations as summarised in Table 1. Among the four implementations, F-Bi would be the most accessible technique realised by a FoA microphone and a headphone whereas H-Spk would be the most costly setup requiring large numbers of microphones and loudspeakers. Details of the techniques used for recording and rendering are summarised below.

Table 1.

Reproduced acoustic environments tested in the experiment by varying implementation techniques.

Implementation techniques		Recording (Measurement)
		FoA (NT-SF1)	HoA (Eigenmike)
Rendering (Playback)	Binaural	F-Bi	H-Bi
Rendering (Playback)	32-ch loudspeaker array	F-Spk	H-Spk

Recording

As in the previous studies,^15,18 recording of room acoustics was realised by measuring the room impulse responses (RIRs) of the targeted venues. Two types of microphone array were used for measuring the RIRs: (i) an FoA microphone (Røde NT-SF1) and (ii) an HoA microphone (MH acoustics Eigenmike em32). The NT-SF1 is a tetrahedral microphone array with four unidirectional microphones, which is able to record only first order spherical harmonics signals (frequency response: 20 Hz–20 kHz, equivalent noise level: 17 dBA). The Eigenmike em32 has 32 microphones embedded on a rigid sphere, which is able to describe recorded sound by up to fourth order spherical harmonics (frequency response: 30 Hz–20 kHz, equivalent noise level: 15 dBA).
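The relationship between capsule count and achievable order follows from channel counting: an order-$N$ spherical harmonics representation has $(N+1)^{2}$ components, so an array needs at least that many capsules. A minimal Python sketch of this bound is given below. The function names are hypothetical, and the bound ignores the frequency-dependent limits set by array geometry.

```python
import math


def sh_channels(order):
    """Number of spherical harmonics components up to and including `order`."""
    return (order + 1) ** 2


def max_sh_order(n_capsules):
    """Largest order whose component count does not exceed the capsule count.

    This is only the channel-counting upper bound (an assumption of this
    sketch); it says nothing about the usable frequency range.
    """
    if n_capsules < 1:
        raise ValueError("at least one capsule is required")
    return math.isqrt(n_capsules) - 1


foa_order = max_sh_order(4)    # tetrahedral array with four capsules
hoa_order = max_sh_order(32)   # spherical array with 32 capsules
```

With four capsules the bound gives first order, and with 32 capsules it gives fourth order (25 components), consistent with the two microphones described in this section.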

The RIRs were measured by playing a swept sine signal from the loudspeaker (Genelec 8020D) located at one of the eight positions in the rooms while recording the sound with the microphone array located at the centre of the rooms (Figure 3; the measured RIRs are available from https://doi.org/10.17605/OSF.IO/ZWPF3). The length of the measured RIRs was 2 s for the seminar room and 2.5 s for the chapel. Figure 4 shows the energy decay curves²² of the RIRs measured by the FoA and HoA microphones in each room. To remove the effect of the microphones’ directivity, the 0th order spherical harmonics (equivalent to omni-directional) of the full-band RIRs were used to calculate the decay curves. The measured RIRs were then transformed to spherical harmonics signals (called spherical harmonics RIRs hereafter) by the SPARTA plugin Array2SH,²³ which were used in the following rendering process.
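An energy decay curve is commonly obtained by backward integration of the squared impulse response. The Python sketch below illustrates that common definition, on the assumption that it is the one intended here; the source does not state the exact procedure. The function name `energy_decay_curve` is hypothetical, and the sketch normalises each curve to its own total energy, which differs from the normalisation used in Figure 4.

```python
import math


def energy_decay_curve(rir):
    """Backward-integrated energy decay curve in dB.

    Each output sample is the energy remaining from that sample to the end
    of the response, relative to the total energy (so the first value is 0 dB).
    """
    remaining = []
    running = 0.0
    for sample in reversed(rir):  # accumulate energy from the tail backwards
        running += sample * sample
        remaining.append(running)
    remaining.reverse()
    total = running  # after the loop this equals the total energy
    if total == 0.0:
        raise ValueError("impulse response contains no energy")
    return [
        10.0 * math.log10(r / total) if r > 0.0 else float("-inf")
        for r in remaining
    ]


# Toy omnidirectional response with an exponentially decaying envelope
# (hypothetical data, not a measured RIR).
toy_rir = [math.exp(-n / 200.0) for n in range(1000)]
toy_edc = energy_decay_curve(toy_rir)
```

The resulting curve is non-increasing by construction, which is why it is preferred over the raw squared response when comparing decay behaviour between microphones or rooms.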

Figure 4.

Energy decay curve of the room impulse responses (full-band, source angle 0°) measured by the FoA and HoA microphones converted to the 0th order spherical harmonics signal (omni-directional). The energy of direct sound propagating from sources located at 2 m is normalised to 0 dB. Solid lines: 2 m, dashed lines: 5 m, blue lines: FoA microphone, red lines: HoA microphone. (a) Seminar room and (b) chapel.

Rendering

Two rendering techniques were utilised: (i) binaural rendering using a headphone (Sennheiser HD800S) and (ii) rendering from a 32-ch loudspeaker (Genelec 8020D) array. For binaural rendering, the spherical harmonics RIRs were decoded into binaural RIRs by the SPARTA plugin AmbiBIN using the magnitude least-squares (MagLS) method.²⁴ A preset HRTF bundled with the plugin was used. The stimuli used in the listening test were generated by convolving an arbitrary sound source with the binaural RIRs, which were then played through the headphone via an audio interface (Roland OCTA-CAPTURE). The loudspeaker array was configured as shown in Figure 5 and installed in the anechoic chamber (Figure 2). Similar to binaural rendering, the spherical harmonics RIRs were decoded by the SPARTA plugin HO-SIRR²⁵ into 32-ch RIRs corresponding to each loudspeaker by using the coordinates of the loudspeakers as a parameter. The resultant RIRs were then convolved with an arbitrary sound source to generate the stimuli. The rest of the loudspeaker array’s implementation was the same as that reported in Hui et al.¹⁴ For both rendering techniques, the sampling rate of the stimuli was set to 48 kHz.
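Generating a stimulus from a binaural RIR amounts to convolving the dry source signal with the left-ear and right-ear responses separately; the loudspeaker case repeats the same operation once per channel. The pure-Python sketch below illustrates that step, assuming the signal and the RIRs share one sampling rate. `convolve` and `render_binaural` are hypothetical helper names and are not part of SPARTA; for RIRs several seconds long, an FFT-based routine such as `scipy.signal.fftconvolve` would be the practical choice.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sample sequences (direct form)."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # skip silent samples to save work
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, brir_left, brir_right):
    """Return (left, right) ear signals for a mono source and a binaural RIR."""
    return convolve(mono, brir_left), convolve(mono, brir_right)


# Toy data (hypothetical): a two-sample source and three-sample ear responses.
left, right = render_binaural([1.0, 2.0], [1.0, 0.0, 0.5], [0.5, 0.0, 0.0])
```

Target speech and noise rendered this way from their respective source positions can then be summed per ear to form the noisy-speech stimulus.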

Figure 5.

Configuration of the 32-ch loudspeaker array.

Stimuli

The stimuli used in the test consisted of target speech and noise, which were rendered simultaneously to replicate noisy speech. The target speech was chosen from the Bamford-Kowal-Bench (BKB) sentence lists²⁶ in the Speech Perception Assessments New Zealand (SPANZ) corpus²⁷ and the noise was the babble noise from the NOISEX-92 corpus.²⁸ The target speech and noise were played at 50 dBA and 53 dBA ($\pm 1$ dBA), respectively, giving a target-to-masker ratio (TMR) of −3 dB regardless of the room or source distance. Both levels were measured at the participant’s seat, using a sound level meter (MiniDSP UMIK-2 + RoomEQ Wizard (https://www.roomeqwizard.com/)) for the real room test and the test under reproduced acoustic environments using the loudspeaker array, or an ear simulator (MiniDSP EARS) for the test using binaural rendering. In both the real room and reproduced acoustic environments, the position of the target speech was fixed at $0^{\circ}$ while that of the noise was varied between four angles of separation (speech-noise separation), that is, $0^{\circ}$, $\pm 45^{\circ}$ and $180^{\circ}$, to test the effect of spatial release from masking. The same distance (either 2 m or 5 m) was used for both the target speech and noise. Each combination of parameters (source distance and speech-noise separation) was evaluated with four different sentences as repetitions.
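The TMR follows directly from the two playback levels, since a ratio of levels expressed in decibels is a difference. The short Python sketch below works through that arithmetic and shows how a linear gain could be derived to move a target signal to a desired TMR. Both function names are hypothetical, and the gain step is an illustration rather than the calibration procedure used in the study.

```python
def tmr_db(target_level_db, masker_level_db):
    """Target-to-masker ratio in dB from two levels measured in dB."""
    return target_level_db - masker_level_db


def gain_for_tmr(current_tmr_db, desired_tmr_db):
    """Linear amplitude gain to apply to the target to reach the desired TMR."""
    return 10.0 ** ((desired_tmr_db - current_tmr_db) / 20.0)


study_tmr = tmr_db(50.0, 53.0)       # levels reported for target and noise
example_gain = gain_for_tmr(0.0, -3.0)  # attenuation needed from equal levels
```

The reported $\pm 1$ dBA tolerance applies to each measured level, so the realised TMR can deviate from the nominal value by a comparable amount.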

Participants

In total, 76 participants were recruited in Auckland, New Zealand, for the listening test and divided into the following five groups.

• Group 1 ( $N = 15$ (seven female, eight male), mean age = 21.2, SD = 0.42): undertook the test using binaural rendering with FoA microphone (F-Bi)

• Group 2 ( $N = 16$ (eight female, eight male), mean age = 21.7, SD = 1.66): undertook the test using the 32-ch loudspeaker array with FoA microphone (F-Spk)

• Group 3 ( $N = 15$ (two female, 13 male), mean age = 21.3, SD = 0.77): undertook the test using the 32-ch loudspeaker array and binaural rendering with HoA microphone (H-Spk, H-Bi)

• Group 4 ( $N = 15$ (seven female, eight male), mean age = 24.1, SD = 3.56): undertook the test in the real seminar room

• Group 5 ( $N = 15$ (nine female, four male, two prefer not to say), mean age = 21.6, SD = 2.56): undertook the test in the real chapel

All participants were native listeners of English familiar with New Zealand English and self-reported that they had normal hearing.

Procedure

In the listening test, participants were asked to transcribe the speech sentences they heard in noise through a graphical user interface displayed on a monitor located in front of them ($0^{\circ}$). While the speech sentences and their order were kept consistent for all participants, the combination of parameters (source distance and speech-noise separation) applied to each speech sentence was randomised between participants. Overall, participants in Groups 1 to 3 evaluated 64 sentences (2 source distances $\times$ 4 speech-noise separations $\times$ 2 rooms $\times$ 4 repetitions) for each implementation, whereas those in Groups 4 and 5 evaluated 32 sentences (2 source distances $\times$ 4 speech-noise separations $\times$ 4 repetitions). For the test using the 32-ch loudspeaker array (F-Spk and H-Spk), participants were seated at the centre of the loudspeaker array, with the height of the seat adjusted so that their ears were level with the loudspeakers on the middle ring (Figure 5). They were instructed to hold this position during the test and keep their head facing the monitor. For the test using binaural rendering (F-Bi and H-Bi), participants were also seated in the anechoic chamber but wore the headphone to listen to the stimuli (i.e. no sound was played from the loudspeaker array). A similar setting was applied in the real room test, but the participants were seated at the centre of the rooms instead and adjusted the height of their seat to have their ears level with the loudspeakers located at the source positions shown in Figure 3.
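The trial counts above follow from a full factorial design, and the per-participant randomisation can be pictured as shuffling the list of conditions while the sentence order stays fixed. The Python sketch below illustrates this under stated assumptions: the function name `build_trials` and the use of a per-participant seed are hypothetical, and the study may have constrained the randomisation (for example by blocking rooms) in ways not described here.

```python
import itertools
import random

DISTANCES_M = (2, 5)
SEPARATIONS_DEG = (-45, 0, 45, 180)
ROOMS = ("seminar room", "chapel")
REPETITIONS = 4


def build_trials(rooms, seed):
    """Return a shuffled list of (room, distance, separation) conditions.

    Trial i pairs the i-th sentence of the fixed sentence list with the
    i-th condition of the returned list.
    """
    conditions = [
        (room, distance, separation)
        for room, distance, separation in itertools.product(
            rooms, DISTANCES_M, SEPARATIONS_DEG
        )
        for _ in range(REPETITIONS)  # each condition appears four times
    ]
    rng = random.Random(seed)  # hypothetical per-participant seed
    rng.shuffle(conditions)
    return conditions


reproduced_trials = build_trials(ROOMS, seed=1)         # Groups 1 to 3
real_room_trials = build_trials(("chapel",), seed=2)    # e.g. Group 5
```

The first call yields 64 trials per implementation and the second yields 32, matching the counts stated for the reproduced and real-room tests.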

The participants typed in their responses through a keyboard while looking at the monitor in front of them ($0^{\circ}$). Participants could take as long as they needed to type in their answer; however, they were not allowed to listen to the sentences more than once. They were encouraged to take a break whenever they needed to. The length of the listening tests ranged from 30 to 45 min for each participant. Participants were given a voucher worth NZD 20 to thank them for their participation in the experiment. The procedure was approved by the University of Auckland Human Ethics Committee (#024182).

Marking and statistical analysis

To quantify speech intelligibility, the participants’ responses were marked according to the marking schedule recommended in the SPANZ corpus²⁷ by scoring the root of the word as opposed to the whole word. Details of the marking process followed that used in the previous study.¹⁴ The resultant intelligibility scores, normalised from 0 to 1 and called proportion correct hereafter, were statistically analysed by linear mixed effect models using the lme4 package in R²⁹ with the source distance, speech-noise separation and implementation (four implementations specified in Table 1 plus the real room) as fixed effects and participant ID as a random effect. The analysis was applied to the results collected in each room (seminar room or chapel) separately. The model fitting was conducted using the step function from lmerTest,³⁰ including interactions between factors when they improved the model fit. Post-hoc pairwise comparisons were performed using the emmeans package³¹ with p-values corrected by the Tukey method. For evaluation of statistical significance, $p < 0.01$ was used to prioritise minimising the occurrence of Type-I errors.

Results

Figures 6 and 7 display the speech intelligibility scores (proportion correct) predicted by the linear mixed effect models in terms of speech-noise separation split by the source distance and implementation, respectively. For the seminar room, two-way interactions were found between speech-noise separation and implementation ($\chi^{2}(12) = 39.17$, $p < .0001$), and between source distance and implementation ($\chi^{2}(4) = 11.71$, $p < .02$). Likewise for the chapel, significant two-way interactions were found between source distance and speech-noise separation ($\chi^{2}(3) = 12.13$, $p < 0.01$), and between speech-noise separation and implementation ($\chi^{2}(12) = 34.75$, $p < .001$).

Figure 6.

Linear prediction of proportion correct in terms of speech-noise separation split by sound source distance. Error bars show 99% confidence intervals: (a) seminar room and (b) chapel.

Figure 7.

Linear prediction of proportion correct in terms of speech-noise separation split by implementation. Error bars show 99% confidence intervals: (a) seminar room and (b) chapel.

Effect of implementation

From visual inspection of Figure 6, overall the speech intelligibility under reproduced acoustic environments is lower than that in the real room and varies between implementations when the amount of reverberation is moderate (seminar room). When the amount of reverberation is high (chapel), the speech intelligibility scores are mostly similar between the real room and reproduced acoustic environments. These observations are confirmed by Tables 2 and 3, which show the pairwise contrasts of proportion correct between implementations for the seminar room and chapel, respectively. From the contrasts between the real room and reproduced acoustic environments (top four blocks of the tables), in the seminar room, speech intelligibility under reproduced acoustic environments is mostly significantly lower than that in the real room regardless of implementation, except for F-Spk, some cases of F-Bi ($-45^{\circ}$ and $0^{\circ}$ at 2 m) and H-Spk ($\pm 45^{\circ}$). In the chapel, on the other hand, mostly no significant difference is observed between the real room and reproduced acoustic environments regardless of implementation, except for F-Bi ($180^{\circ}$) and H-Bi ($0^{\circ}$ and $180^{\circ}$).

Table 2.

Pairwise contrasts between implementation for the seminar room.

Contrast: F-Bi – Real room
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.13	−2.35	0.13	−0.23	−4.10	<0.001
0	−0.15	−2.59	0.07	−0.25	−4.35	<0.001
45	−0.26	−4.54	<0.0001	−0.36	−6.29	<0.0001
180	−0.30	−5.21	<0.0001	−0.40	−6.96	<0.0001
Contrast: H-Bi – Real room
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.36	−6.38	<0.0001	−0.42	−7.44	<0.0001
0	−0.32	−5.81	<0.0001	−0.38	−6.87	<0.0001
45	−0.29	−5.18	<0.0001	−0.35	−6.24	<0.0001
180	−0.40	−7.10	<0.0001	−0.46	−8.15	<0.0001
Contrast: F-Spk – Real room
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.12	−2.23	0.17	−0.10	−1.88	0.33
0	−0.10	−1.91	0.31	−0.08	−1.56	0.53
45	−0.17	−3.21	0.01	−0.15	−2.85	0.04
180	−0.05	−1.00	0.86	−0.03	−0.64	0.97
Contrast: H-Spk – Real room
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.13	−2.14	0.20	−0.07	−1.17	0.77
0	−0.30	−5.07	<0.0001	−0.24	−4.09	<0.001
45	−0.17	−2.90	0.03	−0.11	−1.92	0.31
180	−0.33	−5.68	<0.0001	−0.28	−4.71	<0.0001
Contrast: F-Bi – H-Bi
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	0.22	3.80	<0.01	0.18	3.10	0.02
0	0.18	3.02	0.02	0.14	2.32	0.14
45	0.03	0.52	0.99	−0.01	−0.18	1.00
180	0.10	1.69	0.44	0.06	0.99	0.86
Contrast: F-Spk – H-Spk
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	0.01	0.17	1	−0.03	−0.51	0.99
0	0.20	3.47	<0.01	0.16	2.79	0.04
45	0	0.05	1	−0.04	−0.63	0.97
180	0.28	4.94	<0.0001	0.24	4.26	<0.001
Contrast: F-Bi – F-Spk
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.01	−0.14	1	−0.14	−2.46	0.10
0	−0.05	−0.88	0.91	−0.17	−3.01	0.02
45	−0.09	−1.66	0.46	−0.21	−3.80	<0.01
180	−0.25	−4.43	<0.001	−0.36	−6.57	<0.0001
Contrast: H-Bi – H-Spk
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.23	−3.85	<0.01	−0.35	−5.79	<0.0001
0	−0.03	−0.45	0.99	−0.14	−2.39	0.12
45	−0.12	−1.99	0.27	−0.24	−3.93	<0.001
180	−0.06	−1.04	0.83	−0.18	−2.98	0.02
Contrast: F-Bi – H-Spk
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.01	−0.14	1	−0.17	−2.71	0.05
0	0.15	2.45	0.10	−0.01	−0.13	1
45	−0.09	−1.46	0.59	−0.25	−4.03	<0.001
180	0.04	0.59	0.98	−0.12	−1.98	0.28
Contrast: H-Bi – F-Spk
Angle	2 m			5 m
Angle	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
−45	−0.24	−4.45	<0.001	−0.32	−5.89	<0.0001
0	−0.23	−4.17	<0.001	−0.30	−5.61	<0.0001
45	−0.12	−2.27	0.16	−0.20	−3.70	<0.01
180	−0.34	−6.39	<0.0001	−0.42	−7.82	<0.0001

p-Values in boldface indicate significant pairs (p < .01).

Table 3.

Pairwise contrasts between implementation for the chapel.

Contrast: F-Bi – Real
Angle	Estimate	t-Ratio	p-Value
−45	−0.06	−1.13	0.79
0	−0.12	−2.38	0.12
45	0.04	0.85	0.92
180	−0.27	−5.19	<0.0001
Contrast: H-Bi – Real
Angle	Estimate	t-Ratio	p-Value
−45	−0.13	−2.61	0.07
0	−0.20	−3.88	<0.01
45	−0.16	−3.05	0.02
180	−0.17	−3.36	<0.01
Contrast: F-Spk – Real
Angle	Estimate	t-Ratio	p-Value
−45	0.03	0.71	0.95
0	−0.09	−1.82	0.36
45	0.03	0.53	0.98
180	0.03	0.60	0.98
Contrast: H-Spk – Real
Angle	Estimate	t-Ratio	p-Value
−45	0.06	1.12	0.79
0	−0.06	−1.05	0.83
45	−0.02	−0.46	0.99
180	−0.05	−0.98	0.87
Contrast: F-Bi – H-Bi
Angle	Estimate	t-Ratio	p-Value
−45	0.07	1.38	0.64
0	0.07	1.37	0.65
45	0.20	3.73	<0.01
180	−0.10	−1.85	0.35
Contrast: F-Spk – H-Spk
Angle	Estimate	t-Ratio	p-Value
−45	−0.03	−0.51	0.99
0	−0.03	−0.58	0.98
45	0.05	0.95	0.88
180	0.08	1.54	0.54
Contrast: F-Bi – F-Spk
Angle	Estimate	t-Ratio	p-Value
−45	−0.09	−1.83	0.36
0	−0.04	−0.74	0.95
45	0.02	0.37	1
180	−0.30	−5.88	<0.0001
Contrast: H-Bi – H-Spk
Angle	Estimate	t-Ratio	p-Value
−45	−0.19	−3.51	<0.0001
0	−0.14	−2.57	0.08
45	−0.13	−2.38	0.12
180	−0.12	−2.16	0.20
Contrast: F-Bi – H-Spk
Angle	Estimate	t-Ratio	p-Value
−45	−0.06	−1.03	0.84
0	0.1	1.66	0.46
45	−0.18	−3.08	0.02
180	−0.03	−0.57	0.98
Contrast: H-Bi – F-Spk
Angle	Estimate	t-Ratio	p-Value
−45	−0.12	−2.13	0.21
0	−0.07	−1.21	0.74
45	0.07	1.23	0.74
180	−0.22	−3.89	<0.01

p-Values in boldface indicate significant pairs (p < .01).

Among the reproduced acoustic environments (bottom six blocks of Tables 2 and 3), while no clear trend can be observed in either room, no significant pairs are detected between F-Spk and H-Spk in the chapel, suggesting that both implementations performed similarly in terms of replicating speech intelligibility. This trend is not clearly observed in the seminar room, where significant differences are observed between F-Spk and H-Spk at $0^{\circ}$ at 2 m and at $180^{\circ}$.

Effect of speech-noise separation

The effect of speech-noise separation can be used to quantify the effect of spatial release from masking⁴ by inspecting the “dip” caused by the difference of proportion correct between $0^{\circ}$ and $\pm 45^{\circ}$.¹⁴ It is also of interest to look into the difference of proportion correct between $0^{\circ}$ and $180^{\circ}$, as previous studies³² have found that the effect of spatial release from masking is lessened at $180^{\circ}$ due to back-to-front confusion.³³

Tables 4 and 5 display the pairwise contrasts of proportion correct between $0^{\circ}$ and the other speech-noise separation angles ($\pm 45^{\circ}$ and $180^{\circ}$) for the seminar room and chapel, respectively. In the real rooms, the dip in the seminar room is significant (i.e. the difference between $0^{\circ}$ and $\pm 45^{\circ}$ is significant) regardless of the source distance (right end block in Table 4). However, in the chapel, the dip is significant at 2 m but not at 5 m (bottom block in Table 5). For reproduced acoustic environments, in the seminar room (four blocks from the left in Table 4), the dip is significant under H-Bi and H-Spk, but under F-Bi and F-Spk, no significant difference is observed for the $0^{\circ}$–$45^{\circ}$ contrast. In the chapel (top four blocks of Table 5), all implementations show a significant dip at 2 m, whereas only F-Spk also shows it at 5 m.

Table 4.

Pairwise comparisons of contrasts between speech-noise separation for the seminar room.

	F-Bi			H-Bi			F-Spk			H-Spk			Real room
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.21	3.85	<0.001	0.16	3.13	<0.01	0.17	3.96	<0.001	0.36	6.50	<0.0001	0.19	4.06	<0.001
0–45	−0.07	−1.23	0.61	−0.21	−4.15	<0.001	−0.11	−2.46	0.07	−0.30	−5.44	<0.0001	−0.18	−3.74	<0.01
0–180	−0.01	−0.17	1	−0.09	−1.71	0.32	−0.21	−4.67	<0.0001	−0.12	−2.19	0.13	−0.16	−3.36	<0.01

p-Values in boldface indicate significant pairs (p < .01).
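The dip criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the values are copied from Table 4, the names `TABLE_4`, `is_significant` and `has_dip` are hypothetical, and the sign convention (each contrast is the first angle minus the second) is an assumption based on how the estimates are reported.

```python
# Contrasts copied from Table 4 (seminar room): (estimate, p-value as printed).
# A printed "<0.01" is treated as an upper bound on the p-value.
TABLE_4 = {
    "F-Bi":      {"-45 to 0": (0.21, "<0.001"),  "0 to 45": (-0.07, "0.61")},
    "H-Bi":      {"-45 to 0": (0.16, "<0.01"),   "0 to 45": (-0.21, "<0.001")},
    "F-Spk":     {"-45 to 0": (0.17, "<0.001"),  "0 to 45": (-0.11, "0.07")},
    "H-Spk":     {"-45 to 0": (0.36, "<0.0001"), "0 to 45": (-0.30, "<0.0001")},
    "Real room": {"-45 to 0": (0.19, "<0.001"),  "0 to 45": (-0.18, "<0.01")},
}

ALPHA = 0.01  # significance threshold stated under the tables


def is_significant(p_text, alpha=ALPHA):
    """Return True when a printed p-value is below alpha."""
    p_text = p_text.strip()
    if p_text.startswith("<"):
        # "<0.01" means p is strictly below 0.01, so the bound may equal alpha.
        return float(p_text[1:]) <= alpha
    return float(p_text) < alpha


def has_dip(contrasts, alpha=ALPHA):
    """A dip needs 0 deg to score significantly lower than both -45 and +45 deg."""
    left_est, left_p = contrasts["-45 to 0"]    # assumed: score(-45) - score(0)
    right_est, right_p = contrasts["0 to 45"]   # assumed: score(0) - score(+45)
    left_ok = left_est > 0 and is_significant(left_p, alpha)
    right_ok = right_est < 0 and is_significant(right_p, alpha)
    return left_ok and right_ok


dip_by_condition = {name: has_dip(c) for name, c in TABLE_4.items()}
for name, dip in dip_by_condition.items():
    print(f"{name}: dip {'significant' if dip else 'not significant'}")
```

Requiring both sides of the dip to be significant is the reading adopted in the text for the seminar room; a looser criterion (either side) would classify F-Bi and F-Spk differently.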

Table 5.

Pairwise comparisons of contrasts between speech-noise separation for the chapel.

F-Bi	2 m			5 m
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.24	4.02	<0.001	0.14	2.31	0.10
0–45	−0.36	−6.05	<0.0001	−0.31	−5.16	<0.0001
0–180	0.05	0.86	0.83	0.01	0.09	1
H-Bi	2 m			5 m
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.24	4.16	<0.001	0.14	2.39	0.08
0–45	−0.23	−4.06	<0.001	−0.18	−3.15	<0.01
0–180	−0.12	−2.14	0.14	−0.17	−2.93	0.02
F-Spk	2 m			5 m
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.29	5.78	<0.0001	0.19	3.78	<0.001
0–45	−0.30	−5.94	<0.0001	−0.25	−4.91	<0.0001
0–180	−0.21	−4.14	<0.001	−0.26	−5.04	<0.0001
H-Spk	2 m			5 m
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.29	4.69	<0.0001	0.19	3.05	0.01
0–45	−0.22	−3.59	<0.01	−0.17	−2.74	0.03
0–180	−0.10	−1.61	0.37	−0.15	−2.35	0.09
Real room	2 m			5 m
Contrast	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
(−45) to 0	0.17	3.25	<0.01	0.07	1.36	0.52
0–45	−0.19	−3.57	<0.01	−0.14	−2.62	0.05
0–180	−0.10	−1.79	0.28	−0.14	−2.69	0.04

p-Values in boldface indicate significant pairs (p < .01).

For the difference between $0^{\circ}$ and $180^{\circ}$, in the real rooms, a significant difference is observed in the seminar room but not in the chapel. In the reproduced acoustic environments, for both the seminar room and chapel, no significant difference is observed under any implementation except F-Spk.

Effect of source distance

From Figure 7, a clear separation can be seen between the two source distances (2 m and 5 m) regardless of implementation or venue. In both the seminar room and chapel, the difference in speech intelligibility between the source distances is significant in all implementations of the reproduced acoustic environment as well as in the real rooms, as reported in Tables 6 and 7, respectively. These results suggest that speech intelligibility improved when the sound source was closer to the listener, regardless of implementation or speech-noise separation.

Table 6.

Pairwise contrasts between distances (2–5 m) for the seminar room.

F-Bi			H-Bi			F-Spk			H-Spk			Real room
Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value	Estimate	t-Ratio	p-Value
0.27	7.12	<0.0001	0.23	6.30	<0.0001	0.15	4.80	<0.0001	0.11	2.81	<0.01	0.17	5.04	<0.0001

p-Values in boldface indicate significant pairs (p < .01).
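As a reading aid for Table 6, and assuming that each t-ratio is the estimate divided by its standard error (the usual definition for such contrasts; the standard errors themselves are not listed), an approximate standard error can be recovered from the two printed columns. For the F-Bi and H-Spk entries, using the rounded values in the table:

$$
\mathrm{SE}_{\text{F-Bi}} \approx \frac{0.27}{7.12} \approx 0.038, \qquad \mathrm{SE}_{\text{H-Spk}} \approx \frac{0.11}{2.81} \approx 0.039
$$

Because the estimates are rounded to two decimals, these values are indicative only.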

Table 7.

Pairwise contrasts between distances (2–5 m) for the chapel.

Angle	Estimate	t-Ratio	p-Value
−45	0.34	10.55	<0.0001
0	0.24	7.41	<0.0001
45	0.29	9.03	<0.0001
180	0.19	6.00	<0.0001

p-Values in boldface indicate significant pairs (p < .01).

Discussion

As discussed in the Introduction, the current study aims to answer two research questions: (1) “How do varying implementations of spherical harmonics-based spatial sound reproduction differ in replicating speech intelligibility in noise in real rooms?” and (2) “How does the effect of spatial acoustics on speech perception vary under different implementations?” We hypothesise that speech perception under the implementation using the HoA microphone and loudspeaker array (H-Spk) would provide the results closest to those observed in the real room, because such an implementation can describe the sound field more accurately with signals of higher spherical harmonics order without being affected by the disadvantages of binaural rendering. For the same reason, we hypothesise that the implementation using lower-order spherical harmonics and binaural rendering (F-Bi) would be the least accurate in replicating speech perception in the real room. Hence, we expect the results from H-Spk to match the real-room results most closely and the results from F-Bi to deviate the most from them. This section discusses the results to test these hypotheses and answer the research questions.

Reproducibility of speech intelligibility in real rooms

To answer question 1, we first examine how well the speech intelligibility scores (proportion correct) collected from the reproduced acoustic environments agree with the results in the real rooms. Because the trend differs somewhat between the rooms, we discuss the results for each room separately.

In the seminar room, according to the post-hoc test results (Table 2), the proportion correct differs significantly between the real room and the reproduced acoustic environments in most cases regardless of implementation. The only exception is F-Spk, for which no significant difference was observed at any speech-noise separation angle examined. Among the reproduced acoustic environments, all implementation pairs showed significant differences at some speech-noise separation angles, indicating that the implementation affects the reproducibility of speech intelligibility. Contrary to the hypothesis, only F-Spk successfully replicated the speech intelligibility in the real room (i.e. provided a proportion correct not significantly different from that of the real room). In Figure 6(a), both H-Bi and H-Spk show a larger drop in proportion correct at $0^{\circ}$ and $180^{\circ}$, at both 2 and 5 m, than that observed in the real room or in the reproduced acoustic environments implemented using the FoA microphone (F-Bi and F-Spk).

This trend suggests that the reduction in speech intelligibility when the target speech and noise are co-located or located front-and-back was exaggerated under the reproduced acoustic environments using the HoA microphone. It is also notable that, according to the magnitude of the estimates in Table 2, the deviation from the real-room results is smaller at all tested angles when the 32-ch loudspeaker array is used for rendering with the same recording technique (F-Bi vs F-Spk and H-Bi vs H-Spk). While further study is required to draw a solid conclusion about the effect of recording technique, the results suggest that implementations using the 32-ch loudspeaker array for rendering would be superior to binaural rendering based on Ambisonics recording for replicating speech intelligibility measured in a moderately reverberant environment (seminar room).

In the chapel, on the other hand, the post-hoc test results (Table 3) show that F-Spk and H-Spk did not differ significantly from the real room in proportion correct, whereas F-Bi and H-Bi yielded significantly lower scores at $0^{\circ}$ and/or $180^{\circ}$. Similarly, among the reproduced acoustic environments, no significant difference was observed between F-Spk and H-Spk at any speech-noise separation. These results suggest that in highly reverberant environments (chapel), spatial sound reproduction can replicate speech intelligibility in the original room as long as the implementation uses the 32-ch loudspeaker array for rendering.

Overall, these observations indicate that the hypothesis regarding the implementation with the best reproducibility was partially supported in the highly reverberant environment (chapel), where H-Spk (along with F-Spk) replicated the speech intelligibility in the real room without significant difference. In contrast, the hypothesis was not supported in the moderately reverberant environment (seminar room), where only F-Spk replicated the speech intelligibility in the real room accurately. Regarding the implementation with the worst reproducibility, while both F-Bi and H-Bi deviated from the real-room results regardless of the room, the deviation was, surprisingly, larger in H-Bi than in F-Bi at most angles and distances. Hence, the hypothesis that F-Bi would be the implementation with the worst reproducibility was not supported.

Although the deviation from the real room was not significant with F-Spk in either room, the deviation was mostly larger in the seminar room than in the chapel according to the magnitude of the estimates (see the F-Spk – Real room contrast in Tables 2 and 3). The same trend can be seen in the reproduced acoustic environments using the other implementations, which resulted in significantly different proportion correct between the real room and reproduced acoustic environments in many conditions. The larger deviation observed in the seminar room may have been caused by the length of the RIRs (2 s) being much longer than the reverberation time of the room (0.7 s). A measured RIR typically includes noise, and the resulting noise floor hinders replication of the energy decay of very late reverberation.³⁴ As can be seen in Figure 4, there is a clear noise floor in the decay curves, especially in the seminar room RIRs after around 0.6 s. This noise floor is likely caused by the ambient noise present in the rooms when the RIRs were measured and/or the internal noise of the equipment used for the RIR measurements. While the level of the noise floor in the RIRs was relatively low, mostly below −40 dB, the noise floor at the tail of the RIRs may have appeared as additional reverberation in the stimuli after convolving the RIRs with speech. This could have made the reproduced environment more reverberant than it was supposed to be, resulting in degraded speech intelligibility due to reverberation.^35,36 While further research would be required to confirm this, methods that remove the effect of the noise floor in the measured RIR, such as truncating the RIR,³⁷ may reduce the large deviation observed in the seminar room; this is left for future studies.
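The truncation idea mentioned above can be illustrated with a short sketch. It is a simplified stand-in, not the procedure of the cited work and not something applied in this study: the helper `truncate_rir_at_noise_floor`, its block length, tail fraction and 5 dB margin are all hypothetical choices, and the RIR is synthetic (decaying Gaussian noise with an assumed floor 45 dB below the initial level, at an assumed 8 kHz sample rate). Only the 2 s length and 0.7 s reverberation time are taken from the text.

```python
import math
import random


def truncate_rir_at_noise_floor(rir, fs, block_ms=10.0, tail_fraction=0.1, margin_db=5.0):
    """Cut an RIR where its short-time level first sinks to the noise floor.

    Hypothetical helper: the floor is estimated from the last `tail_fraction`
    of the response, and the cut is placed at the first block whose level is
    within `margin_db` of that floor.
    """
    block = max(1, int(fs * block_ms / 1000.0))
    n_blocks = len(rir) // block
    levels = []
    for b in range(n_blocks):
        segment = rir[b * block:(b + 1) * block]
        energy = sum(x * x for x in segment) / block
        levels.append(10.0 * math.log10(energy + 1e-20))  # offset avoids log(0)
    n_tail = max(1, int(n_blocks * tail_fraction))
    noise_db = sum(levels[-n_tail:]) / n_tail
    for b, level in enumerate(levels):
        if level <= noise_db + margin_db:
            return rir[:b * block]
    return rir[:n_blocks * block]


# Synthetic RIR for illustration only.
fs = 8000                       # assumed sample rate
rt60, length_s = 0.7, 2.0       # reverberation time and RIR length quoted in the text
floor_amp = 10 ** (-45 / 20)    # assumed noise floor, 45 dB below the initial level
rng = random.Random(0)
rir = []
for n in range(int(fs * length_s)):
    t = n / fs
    decay = 10 ** (-3.0 * t / rt60)  # amplitude falls by 60 dB over rt60
    rir.append(decay * rng.gauss(0, 1) + floor_amp * rng.gauss(0, 1))

trimmed = truncate_rir_at_noise_floor(rir, fs)
print(f"kept {len(trimmed) / fs:.2f} s of {len(rir) / fs:.2f} s")
```

In this toy case the part of the response dominated by the floor is discarded, so it can no longer act as extra reverberation when the RIR is convolved with speech. A practical implementation would need a more careful floor estimate and possibly a fade-out at the cut point.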

Reproducibility of the effect of spatial acoustics

The answer to question 2 is sought by investigating the effect of spatial acoustics, that is, speech-noise separation and source distance, on the results. To test if the aforementioned hypotheses are supported, we particularly investigate if the same effect is consistently observed between the real room and reproduced acoustic environments.

Speech-noise separation – Spatial release from masking

As previously discussed, the effect of spatial release from masking can be evaluated by measuring the depth of the dip between $0^{\circ}$ and $\pm 45^{\circ}$ . In the real rooms, the results suggest the effect was significant at both 2 and 5 m in the seminar room, but only at 2 m in the chapel. This result is aligned with the previous study by Kidd et al.,⁵ which also found the effect of spatial release from masking is lessened when the space is more reverberant.

Under reproduced acoustic environments, on the other hand, previous studies have found that spatial release from masking is observed even under environments represented by first-order spherical harmonics signals (i.e. recorded using an FoA microphone). It has been observed both under binaural rendering in an anechoic environment²⁰ and under loudspeaker array rendering in various reverberant spaces with a source distance of 2.5 m.²¹ In the current study, the effect is found at both source distances in the seminar room for H-Bi and H-Spk. In the chapel, the effect is observed for every implementation at 2 m but only for F-Spk at 5 m (Table 5). Although a direct comparison with the previous studies cannot be made because the acoustic conditions and the spatial sound reproduction techniques are not identical, overall the results of the current study are aligned with those of the previous studies in finding the effect under reproduced acoustic environments.

In terms of agreement between the real room and the reproduced acoustic environments in replicating the effect of spatial release from masking, the result from H-Spk matches that from the real room in both rooms, while H-Bi does so only in the seminar room. These observations suggest that reproduced acoustic environments using the HoA microphone and loudspeaker array rendering (H-Spk) would replicate the effect of spatial release from masking in the real rooms most accurately. The implementations using the FoA microphone (F-Bi and F-Spk) showed a mismatch with the real-room results in the seminar room at $45^{\circ}$ as well as in the chapel at 5 m. This supports the hypothesis that H-Spk has the most accurate reproducibility in terms of the effect of spatial release from masking observed in the real room. It is a plausible finding because sound reproduction using higher-order spherical harmonics provides better localisation accuracy, especially around $\pm 45^{\circ}$,³⁸ while the effect of spatial release from masking is highly dependent on the location of sound sources.³

Another interesting aspect of speech-noise separation is the difference between $0^{\circ}$ and $180^{\circ}$. It is known that the effect of spatial release from masking is lessened when the noise source is located behind the listener,^3,32 that is, when the speech-noise separation is $180^{\circ}$, due to back-to-front confusion.³³ According to Tables 4 and 5, in the real rooms, a significant difference between $0^{\circ}$ and $180^{\circ}$ is observed in the seminar room but not in the chapel. The absence of this lessening in the seminar room, which contradicts the previous findings,^3,32 indicates that acoustical factors other than the reverberation time may have prevented back-to-front confusion from occurring in that room. Hence, the rest of the discussion on this point focuses only on the results in the chapel (where the lessening effect was observed in the real room test). Under reproduced acoustic environments, previous studies found that the lessening trend at $180^{\circ}$ was not clear under the environment implemented with an FoA microphone and a 16-ch loudspeaker array,²¹ whereas the trend became clearer under the implementation using an HoA microphone and a 16-ch loudspeaker array.¹⁴ Hence, in the current study, one would expect the lessening trend to be seen only under the implementations involving spherical harmonics of higher order (i.e. H-Bi or H-Spk). The results in the chapel are to some extent aligned with this expectation. Implementations using the HoA microphone (H-Bi and H-Spk) consistently showed no significant difference between $0^{\circ}$ and $180^{\circ}$, meaning that the effect of spatial release from masking was lessened. In contrast, the lessening trend varied between the implementations using the FoA microphone (F-Bi and F-Spk). While further investigation would be required to rule out other acoustical factors affecting back-to-front confusion, these observations may suggest that back-to-front confusion is reproduced more accurately under the implementations using the HoA microphone and is less dependent on the technique used for rendering.

Source distance – Speech clarity (C50)

Source distance is another key factor in spatial acoustics that affects the intelligibility of speech. In a reverberant enclosure, the level of the direct sound decays with increasing source distance (by the inverse-square law under free-field propagation) while the level of reverberation remains relatively homogeneous across the space, resulting in degraded speech intelligibility at far source distances. The direct-to-reverberant ratio (DRR)³⁹ quantifies this phenomenon as the energy ratio between the direct sound and the reverberation, and it shows a strong correlation with speech intelligibility. This has led to the commonly used speech clarity (C50) metric, a variant of DRR, as one of the objective metrics of speech intelligibility.⁴⁰ Based on this, in the current study, the proportion correct is expected to differ significantly between 2 and 5 m, especially when the space is highly reverberant, that is, in the chapel.
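To make the distance argument concrete, the sketch below computes C50 (the ratio, in dB, of the energy arriving within the first 50 ms to the energy arriving later) for a toy impulse response. Everything apart from that definition is an assumption for illustration: `clarity_c50` and `toy_rir` are hypothetical helpers, the direct sound is modelled as a single impulse with amplitude 1/distance, the reverberant tail is a fixed smooth decay that does not depend on distance, and the 0.7 s reverberation time is simply the seminar room value quoted earlier. None of the numbers are measured data.

```python
import math


def clarity_c50(rir, fs, early_ms=50.0):
    """C50 in dB: energy in the first 50 ms over the energy after 50 ms.

    Assumes rir[0] is the arrival of the direct sound.
    """
    split = int(round(fs * early_ms / 1000.0))
    early = sum(x * x for x in rir[:split])
    late = sum(x * x for x in rir[split:])
    return 10.0 * math.log10(early / late)


def toy_rir(distance_m, fs=8000, rt60=0.7, length_s=1.0, reverb_amp=0.01):
    """Toy response: direct impulse scaled by 1/distance plus a fixed decaying tail."""
    n_samples = int(fs * length_s)
    # Tail amplitude falls by 60 dB over rt60 and is the same at every distance.
    rir = [reverb_amp * 10 ** (-3.0 * n / (fs * rt60)) for n in range(n_samples)]
    rir[0] += 1.0 / distance_m  # direct sound, weaker when the source is farther
    return rir


fs = 8000
c50_near = clarity_c50(toy_rir(2.0, fs), fs)
c50_far = clarity_c50(toy_rir(5.0, fs), fs)
print(f"C50 at 2 m: {c50_near:.1f} dB, at 5 m: {c50_far:.1f} dB")
```

Under these idealisations the late energy is identical at both distances, so the drop in C50 at 5 m comes entirely from the weaker direct sound, which is the mechanism the paragraph above describes.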

According to the results in Tables 6 and 7, in the real rooms, the degradation of speech intelligibility with increasing source distance is clearly observed at all speech-noise separations in both the seminar room and chapel. This result matches the reports of previous studies undertaken in real reverberant acoustic environments, for example, Bradley.⁴¹ Under the reproduced acoustic environments, exactly the same trend was observed in both rooms: the proportion correct was significantly different between the distances under all implementations, speech-noise separations and rooms. Hence, the effect of source distance on speech intelligibility can be accurately reproduced by any of the implementations tested.

Limitations

There are some limitations in the current study that need to be addressed in future studies. Firstly, the findings are limited to the specific hardware used to implement the spatial sound reproduction. The study also does not evaluate the technologies for realising spatial sound reproduction themselves, such as the order of spherical harmonics used for recording or the algorithms used for rendering. Also, due to the limit on the number of stimuli tested by each participant to avoid fatigue, as well as the logistical challenge of conducting experiments in real rooms, the study focussed on only two venues with two distances. Including a greater variety of acoustical environments in the experimental design may uncover more nuanced insights into the interaction between room acoustics and the implementation of spatial sound reproduction. Similarly, because the number of participants in each group was relatively small, the results of the statistical analysis should be interpreted with caution.

Another limitation of the current study is its restricted scope: it examined only a specific design of spatial sound reproduction and used only a specific type of noise and venue, all due to the limited time that each participant could spend on the subjective listening test. Although the spherical harmonics-based spatial sound reproduction techniques^23–25 used in the current study are well established, conclusions may vary if the same test were conducted under environments reproduced by different sound reproduction techniques. We note that the accuracy of spherical harmonics-based sound reproduction is subject to the order of spherical harmonics. Given that the order of spherical harmonics is finite, other commonly used techniques that are free from this restriction, such as using HRTFs measured with a head and torso simulator for binaural rendering, might deliver higher reproducibility than the implementations investigated in this study. For the noise type, a previous study¹⁵ tested both speech and noise interferers and found that the interferer type interacts with the interferer location (which corresponds to speech-noise separation in the current study). Further study involving varying types of noise is needed to investigate how noise type interacts with implementation. Similarly, another study focussing on the effect of varying room acoustics by conducting tests in various rooms would also provide insights into how room acoustics and implementation interact with each other. These points remain open questions for future studies.

Conclusion

The current study has investigated speech intelligibility in noise measured under varying implementations of spherical harmonics-based spatial sound reproduction. To examine the reproducibility, the results were compared against data collected in the real rooms whose acoustics were reproduced. The study hypothesised that the implementation using the HoA microphone with the loudspeaker array (H-Spk) would deliver the results best matching those collected in the real rooms.

In terms of reproducibility of speech intelligibility, the hypothesis was partly supported in the highly reverberant environment, where both the F-Spk and H-Spk implementations successfully replicated the speech intelligibility observed in the real room. In the moderately reverberant room, however, the hypothesis was not supported, as only the F-Spk implementation reproduced the speech intelligibility in the real room. These results suggest that implementations using the 32-ch loudspeaker array for rendering would be superior to those using binaural rendering. For recording, the results of this study suggest that higher reproducibility may be achieved with the applied FoA microphone than with the applied HoA microphone when the room is moderately reverberant. Nonetheless, further study addressing the limitations of the current study would be required to confirm this. Overall, the study found that the technique used for rendering (binaural or 32-ch loudspeaker array) has a larger effect on the reproducibility of speech intelligibility than the technique used for recording (FoA or HoA microphone).

For the reproducibility of the effect of spatial acoustics, the results supported the hypothesis, as the H-Spk implementation correctly replicated the effect of spatial release from masking observed in the real rooms. In terms of the effect of speech-noise separation, the study found that the technique used for recording (FoA or HoA microphone) had a larger effect on the results than the technique used for rendering (binaural or 32-ch loudspeaker array), which is the opposite of what was observed for the reproducibility of speech intelligibility. The study also found that the effect of source distance can be reproduced correctly regardless of implementation.

While the current study has some limitations, these findings provide insights for cases where Ambisonics microphones are used to implement spatial sound reproduction for speech and hearing research that focuses on the effect of spatial acoustics in rooms.

Footnotes

Acknowledgements

The authors thank Mr Geoffrey Zhu for his help in conducting the experiment in the real rooms, Mr Dhruv Jagmohan and Mr Hong Kit Li for helping collect the results from Group 3, and Prof Suzanne Purdy for letting us use the SPANZ speech corpus. Our thanks also go to all the participants who took part in the experiments of this study.

ORCID iDs

Yusuke Hioka

Chung Ting Justine Hui

Yunqi C. Zhang

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by the Engineering Faculty Research Development Fund.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Pulkki, Karjalainen. Communication acoustics: an introduction to speech, audio and psychoacoustics. John Wiley and Sons, 2015.

2. Bronkhorst AW. The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acustica 2000; 86: 117–128.

3. Plomp, Mimpen AM. Effect of the orientation of the speaker’s head and the azimuth of a noise source on the speech-reception threshold for sentences. Acta Acust United Acust 1981; 48(5): 325–328.

4. Litovsky RY. Spatial release from masking in adults. Acoust Today 2012; 8: 18–25.

5. Kidd, Mason, Brughera, et al. The role of reverberation in release from masking due to spatial separation of sources for speech identification. Acta Acust United Acust 2005; 91(3): 526–536.

6. Marrone, Mason, Kidd Jr. The effects of hearing loss and age on the benefit of spatial separation between multiple talkers in reverberant rooms. J Acoust Soc Am 2008; 124(5): 3064–3075.

7. Peng, Pausch, Fels. Spatial release from masking in reverberation for school-age children. J Acoust Soc Am 2021; 150(5): 3263–3274.

8. Puglisi, Warzybok, Astolfi, et al. Effect of reverberation and noise type on speech intelligibility in real complex acoustic scenarios. Build Environ 2021; 204: 108137.

9. Rogers, Lister, Febo, et al. Effects of bilingualism, noise, and reverberation on speech perception by listeners with normal hearing. Appl Psycholinguist 2006; 27(3): 465–485.

10. Masuda. Misperception patterns of American English consonants by Japanese listeners in reverberant and noisy environments. Speech Commun 2016; 79: 74–87.

11. Begault DR. 3-D sound for virtual reality and multimedia. Academic Press, 1994. ISBN: 9780120847358.

12. Picinali, Grimm, Hioka, et al. VR/AR and hearing research: current examples and future challenges. In: Proceedings of forum acusticum, 2023.

13. Hui CTJ, Xiao, et al. Differences in speech intelligibility in noise between native and non-native listeners under ambisonics-based sound reproduction system. Appl Acoust 2021; 184: 108368.

14. Hui CTJ, Hioka, Masuda, et al. Differences between listeners with early and late immersion age in spatial release from masking in various acoustic environments. Speech Commun 2022; 139: 51–61.

15. Ahrens, Marschall, Dau. Measuring and modeling speech intelligibility in real and loudspeaker-based virtual sound environments. Hear Res 2019; 377: 307–317.

16. Yue, de Planque. 3-D ambisonics experience for virtual reality. https://api.semanticscholar.org/CorpusID:202696285 (2017, accessed 17 December 2025).

17. Williams, Mann III. Fourier acoustics: sound radiation and nearfield acoustical holography. Academic Press, 1999.

18. Mansour, Marschall, May, et al. Speech intelligibility in a realistic virtual sound environment. J Acoust Soc Am 2021; 149(4): 2791–2801.

19. Zotter F. Ambisonics - a practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. Springer International Publishing, 2019.

20. Dagan, Shabtai, Rafaely. Spatial release from masking for binaural reproduction of speech in noise with varying spherical harmonics order. Appl Acoust 2019; 156: 258–261.

21. Xiao, Hui CTJ, et al. Speech intelligibility in noise with varying spatial acoustics under ambisonics-based sound reproduction system. Appl Acoust 2021; 174: 107707.

22. Schroeder MR. New method of measuring reverberation time. J Acoust Soc Am 1965; 37(3): 409–412.

23. McCormack, Delikaris-Manias, Farina, et al. Real-time conversion of sensor array signals into spherical harmonic signals with applications to spatially localized sub-band sound-field analysis. In: Audio engineering society convention 144. Audio Engineering Society, 2018.

24. Zaunschirm, Schörkhuber, Höldrich. Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. J Acoust Soc Am 2018; 143(6): 3616–3627.

25. McCormack, Pulkki, Politis, et al. Higher-order spatial impulse response rendering: investigating the perceived effects of spherical order, dedicated diffuse rendering, and frequency resolution. J Audio Eng Soc 2020; 68(5): 338–354.

26. Bench, Kowal, Bamford. The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children. Br J Audiol 1979; 13(3): 108–112.

27. Kim, Purdy SC. Speech perception assessments New Zealand (SPANZ). New Zeal Audiol Soc Bull 2015; 24(1): 9–16.

28. Varga, Steeneken HJM. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 1993; 12(3): 247–251.

29. Bates. Linear mixed model implementation in LME4. 2007.

30. Kuznetsova, Brockhoff, Christensen RHB. lmerTest package: tests in linear mixed effects models. J Stat Softw 2017; 82(13): 1–26. DOI: 10.18637/jss.v082.i13.

31. Lenth. emmeans: estimated marginal means, aka least-squares means. https://cran.r-project.org/package=emmeans (2019, accessed 19 November 2025).

32. Bronkhorst, Plomp. The effect of head-induced interaural time and level differences on speech intelligibility in noise. J Acoust Soc Am 1988; 83(4): 1508–1516.

33. Middlebrooks (ed.). Sound localization. 1st ed, vol. 129. Elsevier B.V, 2015.

34. Abel, Bryan NJ. Methods for extending room impulse responses beyond their noise floor. In: Audio engineering society convention 129. Audio Engineering Society, 2010.

35. Lochner JPA, Burger. The influence of reflections on auditorium acoustics. J Sound Vib 1964; 1(4): 426–454.

36. Nábĕlek, Robinson PK. Monaural and binaural speech perception in reverberation for listeners of various ages. J Acoust Soc Am 1982; 71(5): 1242–1248.

37. Guski, Vorländer. Comparison of noise compensation methods for room acoustic impulse response evaluations. Acta Acust United Acust 2014; 100(2): 320–327.

38. Bertet, Daniel, Parizet, et al. Investigation on localisation accuracy for first and higher order ambisonics reproduced sound sources. Acta Acust United Acust 2013; 99(4): 642–657.

39. Hioka, Niwa, Sakauchi, et al. Estimating direct-to-reverberant energy ratio using d/r spatial correlation matrix model. IEEE Trans Audio Speech Lang Process 2011; 19(8): 2374–2384.

40. Bradley, Reich, Norcross SG. On the combined effects of signal-to-noise ratio and room acoustics on speech intelligibility. J Acoust Soc Am 1999; 106(4 Pt 1): 1820–1828.

41. Bradley JS. Predictors of speech intelligibility in rooms. J Acoust Soc Am 1986; 80(3): 837–845.

Differences in speech intelligibility in noise measured under spatial sound reproduction implemented with varying recording and rendering techniques
