Introduction
For nearly five decades, cochlear implants (CIs) have been developed and refined into highly functional biomedical devices able to provide a sense of hearing to the profoundly deaf. In quiet listening situations in particular, CI users today show remarkable speech understanding, reaching speech intelligibility scores of up to 100% (Lenarz, Sonmez, Joseph, Buchner, & Lenarz, 2012; Wilson & Dorman, 2007; Zeng, 2004). In more adverse listening conditions, however, such as reverberant or noisy environments, the ability of CI users to understand speech degrades much more rapidly than that of normal-hearing listeners (Friesen, Shannon, Baskent, & Wang, 2001; Stickney, Zeng, Litovsky, & Assmann, 2004). Bilateral implantation, that is, the implantation of a CI device in both ears, is increasingly common, and many studies report a benefit in speech understanding when both CIs are used (Chadha, Papsin, Jiwani, & Gordon, 2011; Gifford, Dorman, Sheffield, Teece, & Olund, 2014; Van Deun, van Wieringen, & Wouters, 2010; Van Hoesel & Tyler, 2003; Wanna, Gifford, McRackan, Rivas, & Haynes, 2012), particularly a substantial increase in word recognition (Laszig et al., 2004; Litovsky, Parkinson, Arcaroli, & Sammeth, 2006; Loizou et al., 2009). This increase in performance arises from gaining access to sounds arriving at the other ear, even without recourse to noise reduction strategies, which were developed primarily for use in hearing aids (Allen, Berkley, & Blauert, 1977).
Recent efforts at improving CI performance have been devoted to developing and adapting noise reduction strategies specifically for CIs (Hamacher, Doering, Mauer, Fleischmann, & Hennecke, 1997; Hu, Krasoulis, Lutman, & Bleeck, 2013; Hu et al., 2012; Loizou, Lobo, & Hu, 2005; Nie, Stickney, & Zeng, 2005; Wouters & Van den Berghe, 2001), including noise reduction performed on single input channels (Hu & Loizou, 2010; Mauger, Arora, & Dawson, 2012; Yang & Fu, 2005). Consistent with the development of algorithms for use in hearing aids, however, the majority of signal enhancement research for CI users has employed spatial filtering techniques such as beamforming. This technique offers great potential for signal enhancement and shows a clear benefit for speech intelligibility in unilateral CI users (e.g., Hersbach, Grayden, Fallon, & McDermott, 2013; Spriet et al., 2007; Van Hoesel & Clark, 1995). By combining spatial filtering with single-channel noise reduction techniques, Hersbach, Arora, Mauger, and Dawson (2012) confirmed the benefit to speech intelligibility of beamforming algorithms that pre-process noisy speech signals. In speech-weighted noise, they demonstrated a small but significant additional advantage derived from using single-channel noise reduction in combination with spatial filters.
Because a commercial CI system allowing for binaural pre-processing was not yet technically available, Buechner, Dyballa, Hehrmann, Fredelake, and Lenarz (2014) evaluated a binaural beamforming strategy by combining the signal pre-processing of a commercial binaural hearing aid with a CI processor. Although the benefit in speech intelligibility was evaluated in unilateral CI users, the setup employed two behind-the-ear (BTE) hearing aid processors which performed binaural beamforming on audio input from both ears, generating enhanced beamformer directionality. Significant improvements in speech intelligibility were observed, but, unlike in Hersbach et al. (2012), the addition of single-channel noise reduction did not generate further improvements. Additionally, binaural beamforming algorithms have been shown to provide larger improvements than monaurally independent beamforming algorithms. Kokkinakis and Loizou (2010), for example, reported a significant benefit in speech intelligibility using a four-microphone algorithm to enhance binaural signals, compared to two interaurally independent, two-microphone beamformers.
In general, data from speech intelligibility tests indicate that substantial improvements in speech intelligibility in noise can be achieved for CI users (unilateral as well as bilateral) by employing acoustic filtering such as binaural beamformers. Most tests of speech intelligibility, however, have been performed in artificial listening environments (often anechoic rooms) using stationary noise and partially colocated target speech and noise sources (Fink, Furst, & Muchnik, 2012; Hehrmann, Fredelake, Hamacher, Dyballa, & Buechner, 2012; Hersbach et al., 2012; Kokkinakis & Loizou, 2010; Yang & Fu, 2005). Moreover, while numerous studies each assess a small number of algorithms, comparing the benefits for speech intelligibility across these studies is difficult because of differences in measurement procedures, stimulus characteristics (of both the speech and the noise), and differences between the groups of subjects assessed. Here, we compiled an extensive collection of signal-enhancement algorithms and assessed their capacity to improve speech intelligibility in noise. Three realistic noise scenarios were created in a highly reverberant virtual environment. Eight bilaterally implanted CI subjects participated in adaptive speech intelligibility measurements for all algorithms in each noise condition. The study design aimed to achieve high comparability across algorithms and noise conditions, independent of the specific device (i.e., manufacturer) used by the CI listeners.
The algorithms have been described and instrumentally evaluated in depth in an accompanying study (Baumgärtel et al., 2015). In a second accompanying article (Völker, Warzybok, & Ernst, 2015), the same algorithms were tested in the same noise scenarios with acoustically stimulated hearing-impaired (HI) and normal-hearing (NH) listeners.
Materials and Methods
The eight signal enhancement algorithms evaluated in this study are briefly described in the following section, along with the speech material and test scenarios employed in the evaluation. These methods were described in more detail in Baumgärtel et al. (2015). In the central part of this section, the stimulus presentation details are described, the subject group is introduced, and the speech reception threshold (SRT) measurement procedure is described.
Noise Reduction Algorithms
Eight signal pre-processing strategies were selected for evaluation in this study and implemented to run in real time on a common research platform (Master Hearing Aid (MHA); Grimm, Herzke, Berg, & Hohmann, 2006). This platform offers the possibility to test algorithms in real time without the constraints of actual hearing-aid (HA) or CI processors, such as limited computational complexity or power consumption.
The algorithms have previously been described in detail and evaluated in depth using instrumental measures in Baumgärtel et al. (2015). A list can be found in Table 1 and a brief summary is provided below.
Adaptive differential microphones
To implement the adaptive differential microphone (ADM) processing algorithm, two omnidirectional microphones in each of the BTE processors were combined adaptively to steer a spatial zero toward the most prominent sound source originating in the rear hemisphere (Elko & Anh-Tho Nguyen, 1995). Such independently operating ADMs are already available in most current CI sound processor models.
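As a rough illustration (not the processing used in the study), a first-order ADM can be sketched as two delay-and-subtract cardioids whose mixing weight is adapted to minimize output power. The sketch below assumes a microphone spacing equal to one sample of acoustic travel time (about 2.1 cm at 16 kHz) and uses batch least squares in place of a real-time NLMS update:

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(0)

def delay1(x):
    # one-sample delay
    return np.concatenate(([0.0], x[:-1]))

def adm(front_mic, back_mic):
    """First-order ADM: combine forward- and backward-facing cardioids and
    adapt the weight beta to minimize output power (batch least squares here;
    a running NLMS-style update would be used in real time)."""
    c_f = front_mic - delay1(back_mic)  # cardioid with its null to the rear
    c_b = back_mic - delay1(front_mic)  # cardioid with its null to the front
    beta = np.dot(c_f, c_b) / max(np.dot(c_b, c_b), 1e-12)
    return c_f - beta * c_b

n = fs
target = rng.standard_normal(n)  # source from the front
noise = rng.standard_normal(n)   # source from the rear

# Plane-wave arrivals: a rear source reaches the back mic one sample earlier,
# a frontal source reaches the front mic one sample earlier.
out_noise = adm(delay1(noise), noise)     # rear source only -> cancelled
out_target = adm(target, delay1(target))  # frontal source only -> passed (high-pass filtered)

print(np.mean(out_noise ** 2), np.mean(out_target ** 2))
```

With these assumptions the rear arrival is nulled almost perfectly, while the frontal source survives with only a differential high-pass coloration, which is the qualitative behavior of the ADMs in current CI processors.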
Coherence-based postfilter
This noise reduction technique relied on the assumption that the desired target speech is a coherent sound source, while the interfering background noise is assumed to be incoherent. Consequently, the coherence-based postfilter (coh) assessed the coherence between the signals at the left and right ears to separate the signal into coherent (desired speech) and incoherent (undesired noise) components, enhancing the former while suppressing the latter (Grimm, Hohmann, & Kollmeier, 2009). Here, the coh processing technique was applied in combination with the ADMs.
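The underlying idea can be sketched as follows, assuming recursively smoothed estimates of the interaural auto- and cross-spectra; the framing and smoothing constant are illustrative choices, not the parameters of Grimm et al. (2009):

```python
import numpy as np

def stft_frames(x, nfft=256, hop=128):
    # simple windowed STFT, one row per frame
    w = np.hanning(nfft)
    idx = range(0, len(x) - nfft + 1, hop)
    return np.array([np.fft.rfft(w * x[i:i + nfft]) for i in idx])

def coherence_gain(left, right, alpha=0.9):
    """Per-bin magnitude-squared coherence between the ear signals, smoothed
    recursively over frames. Coherent (target) components yield gains near 1,
    incoherent (diffuse noise) components yield much smaller gains."""
    L, R = stft_frames(left), stft_frames(right)
    phi_ll = np.full(L.shape[1], 1e-12)
    phi_rr = np.full(L.shape[1], 1e-12)
    phi_lr = np.zeros(L.shape[1], dtype=complex)
    gains = []
    for l, r in zip(L, R):
        phi_ll = alpha * phi_ll + (1 - alpha) * np.abs(l) ** 2
        phi_rr = alpha * phi_rr + (1 - alpha) * np.abs(r) ** 2
        phi_lr = alpha * phi_lr + (1 - alpha) * l * np.conj(r)
        gains.append(np.abs(phi_lr) ** 2 / (phi_ll * phi_rr + 1e-12))
    return np.array(gains)

rng = np.random.default_rng(1)
src = rng.standard_normal(16000)                       # stand-in for coherent frontal speech
g_coh = coherence_gain(src, src).mean()                # identical at both ears
g_inc = coherence_gain(rng.standard_normal(16000),
                       rng.standard_normal(16000)).mean()  # independent noise
print(g_coh, g_inc)
```

Averaged over frames, the coherent input stays close to unity gain while independent noise is clearly attenuated, which is the separation the coh postfilter exploits.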
Single-channel noise reduction
The single-channel noise reduction (SCNR) algorithm obtained short-time Fourier transform (STFT)-domain estimates of the noise power spectral density and the speech power through a speech presence probability estimator (Gerkmann & Hendriks, 2012) and temporal cepstrum smoothing (Breithaupt, Gerkmann, & Martin, 2008), respectively. The clean spectral amplitude was subsequently estimated from the speech power and the noise power estimates. The time-domain signal was resynthesized using overlap-add. Single-channel type noise reduction is already available in commercial CI processors.
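The shape of such a spectral gain function, including the lower gain limit used later for the CI evaluation, can be illustrated with a minimal sketch. Here the noise PSD is assumed known, replacing the speech presence probability estimator and cepstral smoothing of the actual algorithm:

```python
import numpy as np

def scnr_gain(noisy_psd, noise_psd, g_min_db=-17.0):
    """Wiener-style spectral amplitude gain with a lower limit (gain floor).
    G_min = -17 dB matches the aggressive CI setting described in the text;
    the real SCNR algorithm estimates noise_psd with a speech presence
    probability tracker and temporal cepstrum smoothing."""
    g_min = 10.0 ** (g_min_db / 20.0)
    snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)           # Wiener gain from the a priori SNR
    return np.maximum(gain, g_min)     # never attenuate below the floor

# Noise-only bins hit the floor (-17 dB); speech-dominated bins approach 1.
noise_only = scnr_gain(np.array([1.0]), np.array([1.0]))
speech_dom = scnr_gain(np.array([100.0]), np.array([1.0]))
print(noise_only, speech_dom)
```

In a full implementation this gain is applied frame by frame to the STFT and the time-domain signal is resynthesized via overlap-add.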
Fixed minimum variance distortionless response beamformer
The fixed minimum variance distortionless response (MVDR) beamformer is a spatial filtering technique, aimed at minimizing the noise power output while simultaneously preserving the desired target speech components. Filters for the left and right hearing devices, WL and WR, were predesigned under the assumption that the target speech source is located in front of the listener. The noise field was assumed to be diffuse. The left and right output signals were calculated by filtering and summing the left and right microphone signals using WL and WR, respectively.
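Under the stated assumptions, the filter design reduces to a closed-form expression: minimize the output noise power subject to a distortionless constraint toward the target. The steering vector and noise covariance below are random placeholders, not measured quantities:

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """MVDR: minimize w^H R w subject to w^H d = 1 (target passes undistorted)."""
    rinv_d = np.linalg.solve(R_noise, d)
    return rinv_d / (np.conj(d) @ rinv_d)

# Four microphones (two BTE devices with two mics each), one frequency bin.
rng = np.random.default_rng(2)
d = np.exp(1j * rng.uniform(0, 2 * np.pi, 4))   # assumed frontal steering vector
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R = A @ A.conj().T + 4 * np.eye(4)              # assumed (diffuse-like) noise covariance

w = mvdr_weights(R, d)
distortion = np.conj(w) @ d                     # equals 1: target preserved
out_noise = np.real(np.conj(w) @ R @ w)         # minimized residual noise power
print(distortion, out_noise)
```

The residual noise power of these weights is never larger than that of a simple matched (delay-and-sum) beamformer with the same distortionless constraint, which is the sense in which MVDR is optimal.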
Adaptive MVDR beamformer
In the adaptive MVDR beamformer algorithm, the fixed MVDR beamformer described above was used to generate a speech reference signal. A noise reference signal was obtained by steering a spatial zero toward the direction of the speech source. A multichannel adaptive filtering stage finally aimed at removing the correlation between the noise reference and the remaining noise component in the speech reference.
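The final adaptive stage can be sketched with a single-tap batch least-squares filter, standing in for the multichannel real-time adaptation. The speech reference is assumed to contain the target plus residual noise that is correlated with the noise reference:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 16000
s = rng.standard_normal(n)        # target component in the speech reference
shared = rng.standard_normal(n)   # noise component common to both references

speech_ref = s + 0.8 * shared                        # fixed-beamformer output
noise_ref = shared + 0.1 * rng.standard_normal(n)    # blocking output (target nulled)

# Adaptive filter: remove from the speech reference whatever is correlated
# with the noise reference (batch least squares; NLMS in a real-time system).
h = np.dot(speech_ref, noise_ref) / np.dot(noise_ref, noise_ref)
out = speech_ref - h * noise_ref

print(h, np.corrcoef(out, s)[0, 1])
```

Because the target was removed from the noise reference by the spatial zero, subtracting the filtered noise reference cancels most of the residual noise while leaving the target essentially intact.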
Common postfilter
Both output signals of a beamformer were transformed to the frequency domain and SCNR processing, as described above, was applied. A common gain function was derived based on the left and right channels and applied to the STFT of the left and right microphone signals. The enhanced signals were resynthesized via overlap-add. By applying the same gain to the left and right channels, this postfiltering technique preserved the interaural level differences. Here, it was used in combination with both, the fixed MVDR beamformer (com PF (fixed MVDR) and the adaptive MVDR beamformer (com PF (adapt MVDR)).
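The ILD-preserving property follows directly from applying one real-valued gain per time-frequency bin to both channels, as this toy sketch shows; the gain rule is an arbitrary placeholder, not the SCNR gain used in the study:

```python
import numpy as np

def common_postfilter(L, R, gain_fn):
    """Derive one gain per bin from both channels and apply it to left and
    right alike, so interaural level differences (ILDs) are untouched."""
    g = gain_fn(0.5 * (np.abs(L) ** 2 + np.abs(R) ** 2))
    return g * L, g * R

rng = np.random.default_rng(4)
L = rng.standard_normal(512) + 1j * rng.standard_normal(512)
R = 0.5 * L  # a 6 dB ILD, as produced by a lateral source

# Any single-channel gain rule can be plugged in; a toy compressive gain here.
Lp, Rp = common_postfilter(L, R, lambda p: 1.0 / np.sqrt(1.0 + p))

ild_before = np.abs(L) / np.abs(R)
ild_after = np.abs(Lp) / np.abs(Rp)
print(np.allclose(ild_before, ild_after))  # True: ILD cues intact
```

An individual postfilter, by contrast, would compute separate gains from each channel, trading ILD preservation for stronger per-ear noise reduction, which is exactly the distinction drawn in the next section.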
Individual postfilter
The individual postfiltering scheme differed from the common postfiltering scheme only in the choice of gain. Here, the gain was derived for the left and right channels individually. While this provided optimal noise reduction, interaural level difference cues were not preserved. This postfiltering technique was used in combination with the adaptive MVDR beamformer (ind PF (adapt MVDR)).
ADM and SCNR can work on the left and right BTE microphone arrays independently, that is, without a binaural link. All other noise reduction algorithms utilize information from the left and right ear simultaneously, resulting in true binaural signal processing.
For the perceptual evaluation with bilaterally implanted CI users, different parameter settings of the SCNR algorithm were chosen than were used in the instrumental evaluation (Baumgärtel et al., 2015). It has previously been shown (Qazi, van Dijk, Moonen, & Wouters, 2012) that CI users are able to tolerate more signal distortion introduced by noise reduction algorithms than NH or HI listeners. Therefore, a more aggressive parameter set was chosen for the bilateral CI evaluation compared to the default setting. Most importantly, the lower limit of the gain function was extended to Gmin = −17 dB (compared to −9 dB in Baumgärtel et al., 2015). This parameter set results in signal distortions not usually tolerated by NH or HI listeners but provided the largest improvements in intelligibility-weighted signal-to-noise ratio (iSNR) in an instrumental parameter comparison (see appendix Figure A1 and Table A1).
Speech and Noise Materials
All noise scenarios were created using virtual acoustics (Kayser et al., 2009) in a highly reverberant environment.
Table 1. List of Signal Enhancement Strategies.
Table 2. Characteristics of Noise Environments Used in Perceptual Evaluation.
Stimulus Presentation
CI and audio level settings
All stimuli were presented directly to the subjects’ clinical processors via audio cable. A digital standard level was chosen to be −35 dB Root Mean Square (RMS) full-scale. The subjects were instructed to adjust the level control of their CI processors until they perceived speech-shaped noise presented at the standard level as reasonably loud but comfortable. Typically, subjects were satisfied with their standard level setting. These CI settings were then used throughout the entire duration of the measurement. As elaborated below, the constant speech level was chosen such that the overall signal level did not exceed −35 dB RMS, thus ruling out signal clipping. This also avoided overstimulation of the subjects as well as signal presentation levels high enough to activate the CI processor’s limiter.
Subjects used their clinical maps for testing. For Cochlear and Advanced Bionics (AB) users, one unused program slot on each subject’s processors was programmed for the duration of the listening tests: all possible signal enhancement techniques were turned off (including automatic dynamic range optimization (ADRO) for Cochlear users). The MED-EL users participating in this study did not use any pre-processing in their everyday programs and, therefore, used their everyday program for testing.
Hardware
All measurement tools were implemented on an Acer Iconia W700 tablet PC running Microsoft Windows 8, using the internal soundcard. For all but one subject (S3), the sound output level of the tablet PC was set to maximum (100), resulting in an average voltage of 21 ± 1 mV RMS at the audio jack for sound signals presented at the set standard level.
During the speech intelligibility measurements, subjects were able to enter their answers self-paced through a graphical user interface (GUI) and the tablet’s touchscreen. One subject (S2) chose not to enter his answers himself, but instead repeated understood words to the instructor who then entered the answers.
Subjects
In total, eight adult subjects participated in this study, all of them experienced users of bilateral CIs. Inclusion criteria for participation in the study were at least 12 months of bilateral CI experience and at least 70% speech intelligibility in quiet using the OLSA test material with both ears and either ear alone. Four male and four female subjects were tested, with a mean age of 44.3 ± 18.3 years. The monaural CI experience ranged from 3 years to 15 years (8.8 ± 4.0 years), the duration of bilateral CI use from 1 year to 10 years (5.0 ± 2.5 years). All subjects experienced periods of hearing impairment before implantation ranging from 3 years to 42 years (23.0 ± 14.6 years).
Table 3. Detailed Subject Information.
For measurements this subject was fitted with two Cochlear Freedom processors (same map, threshold, and maximum comfortable levels as in clinical device).
Subjects were compensated on an hourly basis. All subjects participated in four 1.5 to 2 h sessions. All subjects were volunteers and signed an informed consent form before participating in the measurements. The measurement procedures were approved by the ethics committee at the Carl von Ossietzky Universität.
Measurement Procedure
The general measurement setup employed in this study is depicted in Figure 1.
Preliminary measurements
Before subjects participated in the speech intelligibility measurements in noise, a training session as well as speech intelligibility tests in quiet were performed.
Since the OLSA sentence test shows a pronounced training effect (Wagener, Brand, & Kollmeier, 1999), it was necessary to allow subjects enough time to get acquainted with the speech material. Therefore, three lists of 20 sentences were measured. The speech material was presented bilaterally, without interfering noise at a constant, comfortable level.
Once subjects were trained with the OLSA material, the ceiling performance level was determined by measuring one list (clean speech, constant level) each in the following conditions: presentation to left ear only (left), presentation to right ear only (right), and presentation to both ears (bilateral). The results of this pretest can be found in Table 3, column 10 (OLSA in quiet).
SRT50 measurements
The 50% speech reception threshold (SRT50) is the signal-to-noise ratio (SNR) at which 50% of the words are understood correctly. The SRT50 was measured using an adaptive procedure according to Brand and Kollmeier (2002), implemented within the framework of the AFC software package, a tool designed to run psychoacoustic measurements in Matlab (Ewert, 2013). In short, word scoring for each sentence is used to adaptively determine the SNR of the next OLSA sentence. For each correctly understood word, the SNR is decreased by one step size, while an incorrectly understood word results in an SNR increase by one step size. The step sizes are decreased over the course of the measurement so that the track converges on the SRT50.
During the tests, the speech level was held constant while adjusting the SNR by adaptively varying the noise level. After determining the SNR value for the next presentation, the background noise level was adjusted with a Hanning ramp of 500 ms duration. Following this volume change, 2.5 s of noise-only was presented before presenting the next sentence to allow the algorithms to adapt to the new noise level.
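The word-by-word adaptive rule can be simulated against a hypothetical listener with a logistic psychometric function. The starting level, step-size schedule, and slope below are illustrative assumptions, not the values used by Brand and Kollmeier (2002):

```python
import numpy as np

rng = np.random.default_rng(5)

def word_correct(snr, srt50, slope_db=1.5):
    """One Bernoulli word response from a logistic psychometric function
    (slope_db = 1.5 gives roughly 17%/dB at the midpoint, an assumption)."""
    p = 1.0 / (1.0 + np.exp(-(snr - srt50) / slope_db))
    return rng.random() < p

def run_track(srt50_true, n_sentences=20, words=5,
              step0=1.0, decay=0.85, step_min=0.1):
    snr = srt50_true + 5.0          # start above threshold, as in the study
    step = step0
    snrs = []
    for _ in range(n_sentences):
        snrs.append(snr)
        for _ in range(words):
            # correct word -> SNR down one step; incorrect -> up one step
            snr += -step if word_correct(snr, srt50_true) else step
        step = max(step * decay, step_min)  # assumed decay schedule
    return np.mean(snrs[-10:])      # SRT50 estimate from the converged part

estimates = [run_track(-6.0) for _ in range(200)]
print(np.mean(estimates))
```

Because the expected SNR change per word is zero exactly at the 50%-correct point, the track settles around the true SRT50; averaging many simulated lists recovers it closely.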
CI users show a wide range of performance in speech-in-noise tasks (e.g., Müller, Schon, & Helms, 2002). Furthermore, the input dynamic range of CI speech processors is limited, usually from 25 dB to 65 dB SPL at the microphones. In some cases, sound presented via audio cable is further limited in dynamic range. Both factors taken together make it difficult to set a speech level that is sufficiently loud and at the same time low enough to avoid clipping of the signal or near-infinite compression in the CI processor when high noise levels are added to reach the SRT50 of the best-performing subjects. We therefore chose a subject- and listening-scenario-specific speech level during the measurement procedure, according to each subject’s baseline SRT50 determined in a pretest. Before each session, one list was measured in the current noise condition without pre-processing. This pretest gave the subject the chance to get acquainted with the noisy background and yielded an estimate of the SRT50 values to be expected.
For the speech intelligibility measurements, the speech level was set lower than the standard level by this individual SRT50 plus an additional buffer to allow enough headroom for SNR increases without signal clipping. By employing this procedure, the overall presentation level (speech plus noise near the SRT50) was similar across subjects while each subject still performed measurements at the highest possible speech signal level. The overall presentation level during the measurement did not exceed the standard level, ensuring comfortable loudness for all subjects at all times while avoiding signal presentation at levels that would activate the CI processors’ limiter. This procedure could potentially result in speech levels too soft to be transmitted through the audio input. In the measurement procedure, the initial SNR for each measurement was set 5 dB higher than the SRT50 value determined in the pretest to ensure above-threshold presentation at the beginning of the measurement for all subjects. Subjects were instructed to report to the experimenter if they were unable to understand the first sentence during a measurement. In this case, the speech level was raised until the subject was able to perceive the sentence and kept constant at this level for the remainder of the measurement. In the rare cases where such a level adjustment was necessary, no signal clipping occurred during the following measurements.
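The level bookkeeping can be illustrated with a small sketch; the buffer value and pretest SRT50 are made-up numbers, and the standard level is the −35 dB FS value from the text:

```python
import numpy as np

def presentation_levels(standard_db=-35.0, pretest_srt_db=-10.0, buffer_db=3.0):
    """Illustrative headroom calculation (buffer and pretest SRT50 are
    assumptions). Speech is fixed below the standard level by |SRT50| plus
    a buffer; the noise level then realizes the running SNR."""
    speech_db = standard_db - (abs(pretest_srt_db) + buffer_db)

    def total_level(snr_db):
        # power sum of the fixed speech and the adaptively set noise
        noise_db = speech_db - snr_db
        return 10 * np.log10(10 ** (speech_db / 10) + 10 ** (noise_db / 10))

    return speech_db, total_level

speech_db, total = presentation_levels()
# even at the pretest SRT50, the mixture stays below the standard level
print(speech_db, total(-10.0))
```

With these example numbers the speech sits at −48 dB FS and the speech-plus-noise mixture at the pretest SRT50 remains below −35 dB FS, so neither clipping nor the processor limiter is engaged.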
During the measurements, all algorithms were presented in randomized order, and the noise scenarios were measured in a set order (20T, SCT, and CAN). One list of 20 sentences was measured per condition to determine the SRT50.
Statistical Analysis
Statistical analyses were performed using IBM SPSS (Version 22, IBM Corp., Armonk, NY). A Shapiro-Wilk test was not significant, indicating no significant deviation of the data from normality.
Instrumental Evaluation of Algorithm Performance
In Baumgärtel et al. (2015), the tested algorithms were evaluated using three instrumental measures of speech intelligibility as well as sound quality. First, the intelligibility-weighted signal-to-noise ratio (iSNR; Greenberg, Peterson, & Zurek, 1993) was used, which determines the SNR in different frequency bands and subsequently weights them according to the SII standard (ANSI S3.5, 1997). Second, the short-time objective intelligibility index (STOI; Taal, Hendriks, Heusdens, & Jensen, 2011) was employed, which calculates the correlation of time-frequency segments of a noisy test file and a clean reference file and from these correlations determines a speech intelligibility index. Third, the quality of a (noisy, reverberant) test file with respect to a (clean) reference file was determined using the perceptual evaluation of speech quality (PESQ; Rix, Beerends, Hollier, & Hekstra, 2001).
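The iSNR computation amounts to a band-importance-weighted average of per-band SNRs; the sketch below uses made-up weights and band SNRs rather than the ANSI S3.5 band-importance table:

```python
import numpy as np

def intelligibility_weighted_snr(snr_band_db, band_importance):
    """iSNR: per-band SNRs (dB) combined with normalized band-importance
    weights. The weights here are illustrative, not the ANSI S3.5 values."""
    w = np.asarray(band_importance, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, snr_band_db))

# three illustrative bands: low, mid (speech-dominant), high
isnr = intelligibility_weighted_snr([0.0, 6.0, -3.0], [0.2, 0.6, 0.2])
print(isnr)  # the mid-frequency SNR dominates the weighted result
```

Because the weights emphasize speech-dominant bands, an algorithm improving the SNR mainly in those bands scores a larger iSNR benefit than one improving mostly low- or high-frequency bands.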
For STOI and PESQ, clean speech processed with anechoic HRTFs was chosen as the reference condition, so that spectral colorations were not evaluated negatively in the instrumental assessment. Reverberation and residual noise, however, were rated negatively. Speech and noise signals for 120 OLSA sentences were mixed at a broadband, long-term SNR of 0 dB and evaluated with the three measures introduced above. The results from the instrumental evaluation presented here differ slightly from those previously reported (Baumgärtel et al., 2015) in that they used SCNR settings matching those employed in the perceptual evaluation with bilateral CI users. For better comparability to the perceptually obtained SRT50 improvements, the benefit obtained by each algorithm in each acoustic scenario is represented in terms of better-channel improvements for each measure. For each condition, the better channel of the resulting stereo sound file (left or right) after processing with the algorithm is determined, and the difference to the better channel of the unprocessed reference file is calculated.
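The better-channel improvement used for all three measures reduces to a one-line comparison; the numbers below are illustrative, not values from the study:

```python
def better_channel_improvement(metric_left, metric_right, ref_left, ref_right):
    """Better-channel benefit: best processed channel minus best unprocessed
    channel (assumes higher metric values mean better performance)."""
    return max(metric_left, metric_right) - max(ref_left, ref_right)

# e.g., iSNR of 4.1/5.0 dB after processing vs. 0.7/1.2 dB unprocessed
imp = better_channel_improvement(4.1, 5.0, 0.7, 1.2)
print(imp)  # 3.8
```

Comparing best against best avoids penalizing algorithms whose benefit is concentrated in one ear, which matters for the individual postfilter in particular.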
The better-channel improvements were used to determine the power of each of these measures to predict SRT50 improvements in bilateral CI users. Kendall’s rank correlation was used as an indicator of the predictive power. We assessed the correlation for each measure individually and took into account either each noise condition in isolation (τ20T, τSCT, and τCAN) or pooled data from all noise scenarios to determine an overall correlation (τoverall).
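Kendall's rank correlation compares only the orderings of predicted and measured improvements, so it is insensitive to scale offsets between an instrumental measure and dB SRT50. A minimal sketch with invented numbers:

```python
import numpy as np
from scipy.stats import kendalltau

# Illustrative values only, not the study's data: instrumental predictions
# vs. measured SRT50 improvements for a handful of algorithm/scenario pairs.
predicted = np.array([0.5, 1.2, 3.4, 7.9, 10.6, 2.2])
measured = np.array([0.2, 1.5, 3.0, 8.4, 11.7, 2.0])

tau, p = kendalltau(predicted, measured)
print(tau)  # 1.0 here: the two rankings are identical
```

A τ of 1 indicates a perfectly concordant ranking; values near 0 indicate no ordinal relationship, which is how the per-scenario and pooled correlations in Figure 4 should be read.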
Results
SRT50 Measurements
SRT50 values were determined using the adaptive measurement procedure described in the Methods section for three distinct noise scenarios, and differences in SRT50 relative to the unprocessed baseline condition were calculated.

Figure 1. Schematic representation of the measurement system. Target speech material was convolved with reverberant, binaural HRTFs, resulting in a four-channel signal (two BTE microphones on each side). Speech and virtual acoustic background noise were mixed adaptively. Signal processing was carried out online as the subjects performed measurements. Processed signals were presented to bilaterally implanted CI subjects via the processors’ audio input channel.

Figure 2. Average SRT50 improvements compared to the unprocessed baseline condition for all signal pre-processing strategies tested. Error bars indicate the standard deviation. Asterisks denote results that are statistically significantly different from SRT50,NoPre.

Comparing across the three background noise scenarios, similar baseline (NoPre) performance was found in the cafeteria ambient noise and the single competing talker noise scenarios, while baseline SRT50s in the 20-talker babble condition were about 3 dB higher on average.
A repeated-measures ANOVA revealed significant within-subjects effects of the algorithm condition when averaged across all three noise types.
ADMs without a binaural link serve as a second baseline condition against which all binaural noise reduction strategies were compared. Noise reduction algorithms similar to ADMs are already available in commercial CI processors, and this comparison allows isolating the advantage of the binaural link. Improvements relative to ADM-processed signals were obtained accordingly.

Figure 3. Average SRT50 improvements compared to the ADM-processed condition for all binaural signal pre-processing strategies. Error bars indicate the standard deviation. Asterisks denote results that are statistically significantly different from SRT50,ADM.
Table 4. Pairwise Comparison of Algorithm Performance.
The SCNR algorithm evaluated here did not generate any significant improvements in speech intelligibility. While most subjects showed a slight reduction in SRT50 (i.e., a slight improvement) for the SCNR-processed signals, some subjects experienced a decrease in speech intelligibility (S1, S5, and S8 in 20T; S4 and S5 in CAN; S1 and S4 in SCT; see Figure A2). On average, however, no significant intelligibility impairment or improvement compared to the unprocessed baseline condition was found in any of the three noise scenarios.
Generally, the binaural MVDR beamforming strategies (fixed and adaptive MVDR), with and without postfilters, showed the best improvements in SRT50. One exception to this general finding was the combination of an adaptive MVDR beamformer with a common postfilter in the cafeteria ambient noise scenario, in which no statistically significant differences were obtained, relative to the unprocessed condition. Additionally, this condition generated more variable data than all other algorithm conditions assessed in this noise scenario. Examination of the single-subject data (appendix Figure A2) revealed that, while the majority of subjects were indeed able to derive a benefit from the signal processing, two subjects (S6 and S7) experienced a reduction in speech intelligibility.
In each of the two noise scenarios in which the noise was multidirectional (20T and CAN), no significant difference was found between any of the five MVDR versions.
In the highly spatial and non-stationary single competing talker scenario, the adaptive MVDR algorithms performed significantly better than the fixed beamforming algorithms.
In combination with the adaptive MVDR beamformer, two different postfilter schemes were tested: a common postfilter that applied the same gains to the left and right channels, and an individual postfilter that derived the gains for the left and right channels individually. Although no statistically significant differences in average SRT50 scores were observed for any of the noise scenarios tested, subject-specific differences were observed (see Appendix Figure A2 for single-subject data).
When comparing the SRT50 improvements obtained with the three versions of adaptive MVDR algorithms across different noise scenarios, performance in the highly directional single competing talker noise scenario was significantly better (by at least 9.7 dB) than the improvements obtained in the other two noise scenarios (e.g., adapt MVDR, 20T vs. SCT: 11.7 dB).
Compared to the ADM baseline, in the realistic cafeteria ambient noise scenario, none of the binaural algorithms achieved statistically significantly better results than the ADMs. The fixed binaural MVDR beamformer resulted in a significant SRT50 improvement of 3.4 dB ± 1.5 dB in the 20 talker babble condition and the adaptive MVDR beamformer yielded a significant improvement of 10.6 dB ± 3.0 dB over the ADM baseline in the single competing talker scenario.
The binaural adaptive MVDR beamformer outperformed the monaural, independent ADMs in the single competing talker scenario.
The fixed binaural MVDR beamformer achieved significantly better results than the ADM baseline in the 20-talker babble scenario.
Relation to Instrumental Evaluation
Improvements in instrumental iSNR were determined in the same unit of measurement (dB SNR) as the perceptually obtained improvements in SRT50. From Figure 4 (left panel), it is apparent that improvements in SRT50 are largely accounted for by improvements in iSNR. In most conditions, iSNR improvements were slightly larger than SRT50 improvements. For the adaptive binaural beamformers (with and without postfilters) in the single competing talker scenario, however, estimates of the improvement in iSNR were smaller than the measured SRT50 improvements (yellow numbers 6, 8, and 9, top right corner of the left panel). When data from all noise scenarios were pooled, iSNR and PESQ scores correlated fairly well with the perceptual SRT50 data. When assessing each noise condition individually, however, correlation scores for the 20-talker babble and cafeteria ambient noise scenarios were rather low; only in the single competing talker noise did the instrumentally obtained scores correlate highly with the perceptually measured SRT50. Considering both the individual scores (each noise scenario) and the overall score, the STOI measure provided the best correlation with the measured SRT50 data.
Figure 4. Correlation between perceptually measured and instrumentally predicted speech intelligibility improvements. Kendall’s τ for correlations between the average SRT50 improvements determined from measurements in bilateral CI users in this study and average instrumental results from the iSNR (left panel), STOI (middle panel), and PESQ (right panel) measures. Each algorithm is represented by its corresponding number (compare Table 1). Color codes the three different test scenarios. The dash-dotted line in the left panel represents instances where improvements in iSNR and improvements in SRT50 are equal in magnitude. In the boxes, Kendall’s τ is given for each test scenario independently as well as an overall score across all test scenarios.
Discussion
The aim of this study was to comprehensively evaluate the capabilities of binaural noise reduction algorithms in improving speech intelligibility in noise for bilaterally implanted CI users. Three complex, realistic noise scenarios were created, all including a significant amount of reverberation. Eight bilateral CI users, wearing devices from three different CI manufacturers, participated in the study. Improvements in SRT50 achieved by the algorithms relative to the unprocessed signal as well as to the baseline performance of ADMs without a binaural link were compared across noise scenarios. Improvements relative to the unprocessed signal were additionally related to improvements predicted by instrumental measures.
It was possible to obtain substantial, statistically significant improvements in SRT50 relative to the unprocessed signal in all three noise scenarios tested. While the noise scenarios did include a considerable amount of reverberation and non-stationary interferers, resulting in realistic listening environments, the chosen spatial layout of the scenarios was very beneficial for the algorithms tested, especially for the binaural beamformers. Additionally, the use of HRTFs to create all test materials, paired with signal presentation via audio input rather than in the free field, eliminated any influence of the head movements a listener would make in real listening situations. These head movements, as well as potential movements of the target source, are expected to decrease the efficiency of the tested beamforming algorithms. A possible solution to this issue is steerable beamformers, such as the setup tested by Adiloğlu et al. (2015). Nevertheless, the significant amount of reverberation, the non-stationary interfering noise sources at angles < 45°, and the use of interfering noise material with speech-like spectra create fairly realistic test scenarios that allow for a more accurate estimate of the algorithms’ performance than classical setups (e.g., anechoic rooms, stationary speech-shaped noise).
In the unprocessed condition, differences of about 3 dB in average SRT50 were found between the 20 talker babble scenario on the one hand and the cafeteria ambient noise and single competing talker scenarios on the other, presumably due to the different spectro-temporal properties of the scenarios. The 20 talker babble scenario is stationary and has a high spectral overlap with the speech test material; the strongest energetic masking is therefore expected for this scenario. The cafeteria ambient noise and single competing talker scenarios also contain speech as masking sounds and therefore likewise overlap spectrally with the test material, but their temporal structure is highly non-stationary, potentially allowing for listening in the dips (either no masking noise at all, as in SCT, or spectrally different masking noise, as in CAN), which results in lower unprocessed SRT50.
The SCNR algorithm evaluated here was the only single-channel processing scheme included in the comparison, with all other algorithms performing signal processing based on multichannel input. Multichannel processing generally provides larger improvements than single-channel processing. Therefore, the lack of improvements in speech intelligibility when using the SCNR algorithm was anticipated. Such signal-processing strategies have previously been shown (e.g., Luts et al., 2010) to provide an increase in ease of listening and a reduction in listening effort, but they rarely improve speech intelligibility, especially in non-stationary noise scenarios. The single-channel noise reduction algorithm assessed here was based on a speech-presence probability estimator prone to errors in speech-on-speech masking situations, such as the single competing talker scenario (see Baumgärtel et al., 2015 for further explanation). It is, therefore, noteworthy that, on average, there was no significant reduction in speech intelligibility. SCNR algorithms implemented directly on CI speech processors can circumvent the resynthesis step to the time domain after processing in the frequency domain. We hypothesize that avoiding this step likely reduces signal artifacts, resulting in better speech intelligibility than reported here. Consistent with this interpretation, Buechner et al. (2010) demonstrated statistically significant improvements in speech intelligibility using a commercially available single-channel noise reduction strategy implemented on a CI BTE processor.
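The SCNR scheme evaluated here is based on a speech-presence probability estimator (see Baumgärtel et al., 2015); as a rough illustration of single-channel spectral-gain processing in general, the following sketch applies a simpler, spectral-subtraction-style gain per frequency bin, limited by a spectral floor. All names and parameter values are illustrative assumptions, not the evaluated algorithm:

```python
import numpy as np

def spectral_gain(noisy_mag, noise_mag_est, floor=0.1):
    """Toy power-spectral-subtraction gain with a spectral floor.

    noisy_mag     -- magnitude spectrum of the noisy frame
    noise_mag_est -- estimated noise magnitude spectrum
    floor         -- minimum gain, limiting maximum attenuation
    """
    # A-posteriori SNR per bin (power domain), guarded against /0.
    snr_post = (noisy_mag ** 2) / np.maximum(noise_mag_est ** 2, 1e-12)
    # Subtraction-style gain, clipped to [floor, 1].
    gain = np.clip(1.0 - 1.0 / np.maximum(snr_post, 1e-12), floor, 1.0)
    return gain * noisy_mag

# High-SNR bins pass nearly unchanged; low-SNR bins are attenuated
# down to the floor, which limits musical-noise artifacts.
out = spectral_gain(np.array([10.0, 1.0]), np.array([1.0, 1.0]))
```

Note that in an actual CI processor the gain would be applied per analysis channel before envelope extraction, which is precisely why the time-domain resynthesis step discussed above can be skipped.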
ADMs were used as a second baseline against which the binaural beamforming algorithms were compared. While the difference in performance between the fixed binaural MVDR beamformer and the monaural ADMs was only found to be statistically significant in one scenario (20T), the general trend across all noise scenarios indicates better performance (i.e., larger SRT50 improvements) with the binaural beamforming algorithms. The lack of statistical significance can reasonably be attributed to the large interindividual variability.
The addition of coherence-based noise reduction on top of the ADMs did not result in a statistically significant benefit in SRT50 for bilaterally implanted CI users. This finding is in accordance with results obtained by Luts et al. (2010) in hearing-impaired listeners using the same coherence-based noise reduction algorithm. In Baumgärtel et al. (2015), the combination of coh with the ADMs was shown to increase the iSNR, STOI, and PESQ scores of noisy signals in all scenarios tested here. Discrepancies between the perceptually measured SRT50 and the instrumentally determined speech enhancement are presumably due to distortions in the processed signal that are not appropriately accounted for by the instrumental measures.
The largest improvements in speech intelligibility were observed when the adaptive, binaural MVDR beamforming algorithms were employed in the single competing talker scenario. Since, in this scenario, the target speech source and the interfering noise source are highly directional and spatially separated, significant benefits can accrue from spatial noise-reduction algorithms. The adaptive binaural MVDR beamformer is especially well suited to this task, capable not only of enhancing sounds originating from the front (0°) but also of steering a spatial null toward a competing noise source at a location other than 0° (in this case presumably toward the competing talker located at 90° azimuth), resulting in optimal noise suppression.
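For reference, the MVDR beamformer admits a standard closed-form textbook solution: it minimizes the output noise power subject to a distortionless constraint toward the target direction. With noise covariance matrix $\boldsymbol{\Phi}$ and steering vector $\mathbf{d}$ (pointing toward 0°, here derived from the HRTFs), the weight vector is

\[
\mathbf{w}_{\mathrm{MVDR}} = \frac{\boldsymbol{\Phi}^{-1}\mathbf{d}}{\mathbf{d}^{H}\boldsymbol{\Phi}^{-1}\mathbf{d}},
\]

where the adaptive variant re-estimates $\boldsymbol{\Phi}$ online, which is what allows the spatial null to track an interferer away from 0°. This is the generic formulation only; the implementation details of the algorithms evaluated here are given in Baumgärtel et al. (2015).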
For all binaural beamforming algorithms, larger SRT50 improvements were found in the 20 talker babble scenario than in the cafeteria ambient noise scenario (see Figure 2). All beamforming algorithms are tuned to enhance sound originating from 0° (the frontal position), regardless of the noise environment. In the 20 talker babble condition, no direct interfering sound originates from 0°, allowing for efficient noise suppression by the beamforming algorithms. In the cafeteria ambient noise scenario, several noise sources are spread throughout the cafeteria, some located around 0°. These fairly central sources reduce the beamformers’ effectiveness, leading to slightly smaller improvements than in the 20 talker babble scenario.
When comparing data obtained across all noise scenarios using adaptive MVDR beamformers in bilaterally implanted CI users with data obtained for the same conditions in other subject groups (NH and HI, see Völker et al., 2015), striking differences were found. In the spatially distinct scenario (SCT), bilaterally implanted CI users benefited significantly more from the adaptive than from the fixed MVDR algorithm (ΔSRT50 = 10.4 dB).
The adaptive binaural beamforming strategies make use of the spatial separation between target and interfering sound sources in their signal processing, particularly in the single competing talker scenario. In doing so, however, they distort binaural or spatial cues present in the unprocessed signal. NH and HI listeners who, in the unprocessed condition, could benefit efficiently from binaural release from masking were negatively impacted by this distortion of binaural cues. For the algorithm to generate improvements in speech intelligibility for these listeners, the benefit of the noise reduction had to outweigh the disadvantage introduced by distorting binaural cues. The potential benefit these listener groups could expect from the algorithms was, therefore, reduced by the loss of spatial release from masking due to cue distortion. Bilateral CI users, on the other hand, could make only limited use of binaural unmasking in the unprocessed condition. Consequently, they were not negatively impacted by the binaural cue distortion introduced by the signal processing and were able to access the full SNR improvements provided by the algorithm. The exceptionally large improvements in speech intelligibility (15.2 dB in terms of SRT50), however, were not directly anticipated from the SNR improvements. In the instrumental speech-intelligibility prediction, the intelligibility-weighted SNR (iSNR) measure predicted an improvement of at most 10.8 dB (adapt MVDR). On average, the bilateral CI users gained an additional 4.4 dB in SRT50 on top of the 10.8 dB explained by the iSNR. This can partially be explained by the baseline-SNR dependence of the iSNR improvement provided by the adaptive MVDR: The 10.8 dB improvement was derived at 0 dB baseline SNR, whereas the CI subjects in this particularly favourable condition were measured at −16.5 dB SNR on average.
In an instrumental evaluation performed at −16.5 dB, the iSNR improvement provided by the adaptive binaural MVDR increased by approximately 2 dB relative to the improvement at 0 dB SNR, raising the predicted iSNR improvement to about 13 dB. Further, the instrumental iSNR improvement is defined as the difference between the iSNR at the better ear in the reference condition (NoPre) and the better-ear iSNR after processing. The single competing talker noise scenario featured one prominent interfering speech source located to the right of the listener; the left ear was therefore subjected to less noisy signals at a considerably higher SNR than the right ear (∼5 dB iSNR, independent of input SNR range; Baumgärtel et al., 2015). After processing, however, both ear signals had the same iSNR. Consequently, the instrumental iSNR improvement can be understood as the improvement in iSNR of the signal at the left ear. For real listeners, however, the actual improvement in iSNR depended on their hearing ability with either ear, and three different cases can be distinguished: (a) subjects with substantially better speech intelligibility in the left ear (S3), who should, in theory, perform as predicted by the iSNR; (b) subjects with similar speech intelligibility in both ears (S1, S4, S5, S6, S8), who should also perform as predicted by the iSNR, benefiting from the iSNR improvement and, additionally, from binaural summation (since, after processing, signals with identical intelligibility were presented to both ears). With binaural summation in CI listeners of 0 to 3 dB, typically around 2 dB for those with similar intelligibility in both ears (Schleich et al., 2004), these subjects should theoretically gain about 15 dB in SRT50; and (c) subjects with substantially better speech intelligibility in the right ear (S2, S7), who were forced to rely on their weaker ear in the unprocessed baseline condition.
These subjects not only benefited from the iSNR improvement; the signal processing also provided access to the better-performing right ear, theoretically resulting in SRT50 improvements >15 dB. The average gain in SRT50 across the three subject groups can, therefore, be estimated to be about 15 dB, which is in good agreement with the experimentally determined average SRT50 gain of 15.2 dB. Taken together, all subjects were expected to perform at least as well as predicted by the iSNR, and indeed 7 out of 8 subjects showed an improvement in SRT50 of at least 10.8 dB. For the only exception (S8), the deviation lay within the test-retest confidence of the measurement setup of about 2 dB.
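The estimate above can be summarized as a back-of-the-envelope calculation using only the values stated in the text:

\[
\Delta\mathrm{SRT}_{50} \approx \underbrace{10.8~\mathrm{dB}}_{\text{iSNR gain at } 0~\mathrm{dB}} + \underbrace{\sim 2~\mathrm{dB}}_{\text{baseline-SNR effect at } -16.5~\mathrm{dB}} + \underbrace{\sim 2~\mathrm{dB}}_{\text{binaural summation}} \approx 15~\mathrm{dB},
\]

which matches the measured average improvement of 15.2 dB to within the test-retest uncertainty.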
For all other algorithm–noise combinations, improvements in the measured SRT50 could largely be accounted for by the improvements predicted by the iSNR. For the 20 talker babble and cafeteria ambient noise scenarios, the mean gaps of 1.9 dB and 2.4 dB, respectively, are at the lower end of the typical 2 to 3 dB observed in acoustic evaluations of speech intelligibility (e.g., Van den Bogaert, Doclo, Wouters, & Moonen, 2009). Since the gap is traditionally explained by the detrimental impact of processing artifacts and, in some cases, by the degradation of binaural cues (Van den Bogaert et al., 2009), the slightly smaller gap here indicates that processing artifacts and the non-preservation of binaural cues are of less relevance to most CI subjects than they are for NH and HI listeners. In the single competing talker scenario, no gap was observed; instead, the measured SRT50 improvement exceeded the iSNR prediction by 1.2 dB on average, resulting from the improvements of the bilateral CI users in this test scenario using adaptive binaural MVDR beamforming algorithms, as discussed above.
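The iSNR underlying these predictions is commonly computed as a band-importance-weighted average of per-band SNRs (Greenberg, Peterson, & Zurek, 1993). A minimal sketch follows; the weights shown are purely illustrative placeholders, whereas real implementations use a standardized band-importance function and an appropriate filter bank:

```python
import numpy as np

def isnr(band_snrs_db, weights):
    """Intelligibility-weighted SNR: importance-weighted mean of
    per-band SNRs (in dB). `weights` must sum to 1."""
    band_snrs_db = np.asarray(band_snrs_db, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, band_snrs_db))

# Illustrative band-importance weights for 7 bands (sum to 1) --
# placeholders, not the standardized values.
w = np.array([0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05])

# An algorithm's instrumental iSNR improvement, as used above, is the
# processed iSNR minus the better-ear iSNR of the NoPre reference.
```

Because the weighting emphasizes the bands most important for speech, noise concentrated in low-importance bands changes the iSNR far less than noise overlapping the speech-dominant bands.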
To predict improvements in speech intelligibility provided by the algorithms beyond iSNR improvements, STOI and PESQ measures were employed. STOI has previously been shown to correlate well with speech intelligibility in NH listeners as well as speech intelligibility of vocoded speech (Hu et al., 2012; Taal et al., 2011). Indeed, assessing each noise scenario individually as well as all scenarios taken together, STOI provided the best correlation with the SRT50 data measured in this study. This measure outputs an intelligibility index that can be related to percentage-correct speech intelligibility scores but cannot directly be related to the SRT50 measured here.
Considering that the audio quality measure PESQ is not a measure of speech intelligibility per se, it could not be expected to correlate highly with the perceptual data (e.g., Hu & Loizou, 2007a, 2007b). However, a correlation of τ = 0.79 was observed in the single competing talker scenario, owing to the large range of algorithm performance in this scenario.
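The τ values reported here are Kendall rank correlations, which count concordant versus discordant pairs of observations. A minimal, tie-free sketch (illustrative only, not the exact analysis pipeline used in the study):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a for equal-length sequences without ties:
    (concordant pairs - discordant pairs) / total pairs."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum(1 for i, j in pairs
                     if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (concordant - discordant) / len(pairs)
```

In practice, tied ranks are handled with the tau-b correction, as implemented, for example, by scipy.stats.kendalltau.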
Furthermore, the correlation analyses were performed on average SRT50 results only. Since large interindividual differences in the subjects’ individual SRT50 performance were observed, the predictive value of each instrumental measure for a single subject’s SRT50 performance is expected to be even more limited. This large inter-subject variability is evident from the rather large error bars (see Figure 1; single-subject data can be found in appendix Figure A2), with the largest variations occurring in the single competing talker scenario (ADM+coh) and the cafeteria ambient noise scenario (com PF (adapt MVDR)). In Baumgärtel et al. (2015), the algorithm was evaluated in the same noise scenario using instrumental measures. The fluctuations in the interfering speech source were the same as in the current study; the variations, however, were found to be much smaller. Therefore, the large remainder of the variability has to be attributed to randomly larger standard deviations (as they can occur at a sample size of 8) and potentially to subject-specific factors that were not isolated in this study.
In the case of the common postfilter based on the adaptive binaural MVDR, the algorithm is expected to be influenced by fluctuations in the interfering noise to the same extent as the individual postfilter based on the adaptive binaural MVDR. This, however, is not the case. The majority of the remaining variability for the com PF (adapt MVDR) algorithm is, therefore, likely again an effect of the rather small sample size and potentially of individual differences in subjects’ hearing abilities or preferences.
Summary and Conclusions
The fixed binaural MVDR beamformer investigated here provided good improvements for all subjects in all noise conditions. Depending on the noise environment and subject-specific factors, the addition of adaptive noise cancellation was able to provide even larger speech intelligibility improvements. Both beamforming algorithms (with and without added postfiltering) outperformed the ADMs without a binaural link.
Perceptually measured speech intelligibility improvements correlated reasonably well with instrumentally estimated speech intelligibility improvements. A large portion of the SRT50 improvements could be attributed to an improvement in intelligibility weighted SNR (iSNR).
In comparison with hearing-impaired listeners (see the accompanying study, Völker et al., 2015, for detailed results), bilateral CI users profit much more from the binaural signal preprocessing, especially in listening environments with a large spatial separation of target and interferer. It is, therefore, expected that the development of binaural signal processing for CIs will provide a sizeable benefit in speech intelligibility for bilaterally implanted CI users in certain listening environments, exceeding what is generally found for hearing-impaired listeners.
