Abstract
Keywords
In everyday life, human listeners deal with complex acoustic scenarios, in which target speech and interfering sound sources arise at different locations. This phenomenon is termed the “cocktail party problem” (Cherry, 1953, p. 976). When target speech and interfering sound sources are spatially separated, they differ in their interaural level differences (ILDs), interaural time differences (ITDs), and interaural phase differences (IPDs; Bronkhorst, 2000). Two mechanisms are thought to play a primary role in such adverse spatial acoustic conditions:
Particular back ends include the speech intelligibility index (SII; ANSI S3.5-1997, 1997; Beutelmann & Brand, 2006; Beutelmann et al., 2010; Jelfs et al., 2011; Lavandier & Culling, 2010; Lavandier et al., 2012; Wan et al., 2010), an analysis of the SNR in the modulation frequency domain (Chabot-Leclerc et al., 2016), and the correlation between the clean speech and the noisy speech (Andersen et al., 2016). All of these models are based on assumptions about the EC process made by Durlach (1963), namely that the equalization process has inherent processing errors in level and time, leading to an imperfect alignment of the left and right ear signals and, therefore, to an imperfect cancellation of the masker signal. Durlach’s original specifications of the error parameters were subsequently updated by vom Hövel (1984) to better agree with data by Langford and Jeffress (1964) and Egan (1964). vom Hövel’s results have since been incorporated into the binaural speech intelligibility model by Beutelmann and Brand (2006) and its revised version (Beutelmann et al., 2010; termed here
Our approach has similarities to an earlier blind model by Cosentino et al. (2014) which uses the binaural localization model proposed by Dietz et al. (2011) to estimate the IPD of both the target source and the interfering source. They differentiate between signal and interferer by assuming that the target speech source is located directly in front of the listeners, that is, at 0° in the horizontal plane. Therefore, the other IPD, different to 0°, is associated with the position of the interferer. The estimated IPDs of both target and interferer are then used to calculate the binaural masking level difference (for details, see Culling et al., 2004, 2005). Note that this model assumes that the sound sources can be localized to perform binaural unmasking. Localization certainly plays a role in our everyday communication, but binaural unmasking does not necessarily require a correct localization of target and interferer, because it works best for stimuli with an IPD of π (e.g., Licklider, 1948). Such stimuli have a frequency-dependent ITD which prevents the perception of a clear direction or lateralization.
A similar but more technical approach was presented by Tang et al. (2018), who proposed a blind model for speech intelligibility prediction, which combines a blind source separation algorithm with a nonblind speech intelligibility back end. Like Cosentino et al. (2014), they assumed that the target source is directly in front of two microphones and, additionally, that only one masker source is present. Using these assumptions, the blind source separation algorithm was able to extract estimates of the speech signal and the noise signal from the mixed signals, which were then further analyzed using different speech intelligibility back ends.
A third binaural speech intelligibility model that works blindly was presented by Geravanchizadeh and Fallah (2015). They applied a model of the auditory periphery (Dau et al., 1996) and an EC mechanism to the mixed signals. The back end was a dynamic time warp speech recognizer (Sakoe & Chiba, 1978), which compares the processed mixture of speech and noise with an internally stored reference of the target speech. However, this approach is limited to negative SNRs: Positive SNRs are explicitly excluded because, at positive SNRs, normal EC processing would cancel the signal, not the noise. In the real world, SNRs vary over a wide range, and thus any model has to deal with both negative and positive SNRs. This leads to a problem, in that a blind binaural SNR improvement needs to do two opposing things: the
Aim of This Study
Our aim was to develop a blind, signal-driven binaural processing stage that considers both negative and positive SNRs. To do this, we needed to address two questions: (a) Can a blind EC process be realized without assuming a certain direction of the target or interferer? (b) How should positive SNRs be considered?
Blind Model Proposed in This Study
According to Lord Rayleigh’s Duplex theory (1907) of sound source localization, the binaural auditory system uses ITDs (and thus IPDs) at frequencies below 1500 Hz to localize sound sources, but above 1500 Hz, ILDs are used. Motivated by this theory, in our model, we apply the EC model of binaural unmasking for frequencies up to 1500 Hz, while the better ear is used at frequencies above 1500 Hz.
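The frequency split can be illustrated with a minimal sketch. It assumes the 30 bands between 150 Hz and 8500 Hz used by the filter bank of the model, equal spacing on the ERB-number scale of Glasberg and Moore (1990), and assignment of each band by its center frequency. The scale formula, the assignment rule, and all function names are assumptions made for illustration and are not taken from the model code.

```python
import math

def erb_number(f_hz):
    """ERB-number scale (Glasberg & Moore, 1990); an assumed choice of scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(e):
    """Inverse of erb_number."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37 * 1000.0

def center_frequencies(f_low=150.0, f_high=8500.0, n_bands=30):
    """Center frequencies equally spaced on the ERB-number scale."""
    e_low, e_high = erb_number(f_low), erb_number(f_high)
    step = (e_high - e_low) / (n_bands - 1)
    return [erb_number_to_hz(e_low + k * step) for k in range(n_bands)]

def split_bands(freqs, crossover_hz=1500.0):
    """Band indices for the EC stage (low) and for better-ear selection (high)."""
    ec = [i for i, f in enumerate(freqs) if f <= crossover_hz]
    better_ear = [i for i, f in enumerate(freqs) if f > crossover_hz]
    return ec, better_ear

freqs = center_frequencies()
ec_bands, better_ear_bands = split_bands(freqs)
print(len(ec_bands), "EC bands,", len(better_ear_bands), "better-ear bands")
```

Under these assumptions, the crossover at 1500 Hz divides the filter bank into two halves of equal size.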
The modeling of binaural unmasking below 1500 Hz is achieved in two steps: First, in the equalization step, the left and right ear signals are equalized in level (by amplification and attenuation) and in phase (by delaying the band pass filtered signals). Second, in the cancellation step, two alternative strategies are used. If the SNR is negative, and thus the noise is dominant, the signal from one ear is subtracted from the other which cancels the noise by destructive interference causing a minimization of the model’s output level and therefore an improvement of the SNR. If instead the SNR is positive, then the two ear signals are added, which enhances the signal by constructive interference causing a maximization of the model’s output level. Including both strategies in the binaural processing stage requires a slight modification of the EC mechanism to allow for constructive interference. The concept of allowing a summation of left and right ear signal was previously proposed by Green (1966). But he considered only situations where subtraction was beneficial. Moreover, the aim of his article was neither to propose a binaural model that can be applied to arbitrary SNRs, nor to propose a method to differentiate addition and subtraction as the better strategy.
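The two cancellation strategies can be illustrated with a toy example. The sketch below is not the model implementation: it uses a diotic 500 Hz tone as target, white noise with an interaural delay of 8 samples as masker, error-free equalization, and a circular delay so that the arithmetic is exact. Target and masker are processed separately only to make the effect on each component visible. All signal parameters and function names are hypothetical.

```python
import math
import random

def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)

def circular_delay(x, d):
    """Delay x by d samples with wrap-around (keeps the toy example exact)."""
    d %= len(x)
    return x[-d:] + x[:-d] if d else list(x)

def cancel(left, right, mode):
    """Cancellation step: subtract ('min') or add ('max') the equalized ear signals."""
    sign = -1.0 if mode == "min" else 1.0
    return [l + sign * r for l, r in zip(left, right)]

fs, n, itd_samples = 16000, 1600, 8          # 8 samples = 500 microseconds at 16 kHz
rng = random.Random(0)
target = [math.sin(2 * math.pi * 500 * k / fs) for k in range(n)]  # same at both ears
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise_left, noise_right = noise, circular_delay(noise, itd_samples)  # right ear lags

# Noise-dominated case: equalize on the masker delay, then subtract.
noise_min = cancel(circular_delay(noise_left, itd_samples), noise_right, "min")
target_min = cancel(circular_delay(target, itd_samples), target, "min")

# Speech-dominated case: the target is already aligned, so the ears are added.
noise_max = cancel(noise_left, noise_right, "max")
target_max = cancel(target, target, "max")

print("subtract: target power", power(target_min), "noise power", power(noise_min))
print("add:      target power", power(target_max), "noise power", power(noise_max))
```

In the subtraction case, the aligned masker is removed completely while part of the target survives. In the addition case, the target amplitude doubles while the two masker copies add incoherently.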
To determine whether level minimization or level maximization is required, we use a modulation analysis based on the speech-to-reverberation modulation ratio (SRMR; Santos et al., 2014). The output of the SRMR is the ratio between the modulation energy in low and in high modulation frequency channels. A high value is associated with the presence of speech-like modulation. A low value is associated with the presence of reverberation or noise, as both decrease the modulation depth at low modulation frequencies. The SRMR approach is conceptually similar to the classification of signals based on modulation spectra, which was proposed by Ostendorf et al. (1998).
Whichever of the two EC paths yields the higher SRMR value is then used for further processing in the model. We call this
In this study, we focus on the binaural processing stage of the model and do not modify the SII back end from BSIM2010, which allows for maximum comparability with that model. In addition, the comparability to other nonblind back ends is basically conserved: As the whole model processing is linear (apart from the nonlinear control strategy of selecting the EC parameters and level minimization vs. maximization), it is in principle possible to process target speech and interfering signals separately (see later). We call our new model
Note that the blind model front end is independent of the model back end, as the speech intelligibility measure is not required for optimizing the EC parameters. The front end can be combined with arbitrary back ends, for example, with intrusive back ends like SII (ANSI S3.5-1997, 1997), or with blind back ends, which use principles of automatic speech recognition (e.g., Schädler et al., 2016; Spille et al., 2018).
Organization of This Article
In Experiment I, we evaluated whether the new BSIM2020 model predicts SRT data in stationary noise, located at different azimuths in the horizontal plane, with the same accuracy as the earlier, nonblind BSIM2010 model by Beutelmann et al. (2010).
In Experiment II, we collected new data to test whether the proposed maximization strategy is actually required to model human performance in speech-in-noise experiments at positive and negative SNRs or whether simply switching off EC processing at positive SNRs and using the ear with the better SNR was sufficient. To provoke binaural release from masking at positive SNRs, we low-pass filtered speech and noise (which can be regarded as a simple model of high-frequency hearing loss) and time compressed them (which can be regarded as a simple model of reduced central processing speed). In the model analyses of both experiments, we compared the new modulation-based selection between level minimization and maximization with using either level minimization only or level maximization only to determine if the new selection process was indeed necessary.
Methods
Front-end Binaural Processing
The BSIM2020 model is schematically shown in Figure 1. It uses mixed speech and noise as input signals to determine the EC parameters.

Block Diagram of the General Processing Performed in BSIM2020. The mixed signals at the left and right ear are divided into 30 frequency bands ranging from 150 Hz to 8500 Hz using a gammatone filter bank (Hohmann, 2002). Afterward, frequency bands below 1500 Hz are fed to the EC stage, where a level minimization and a level maximization are performed in parallel, denoted as EC_min and EC_max. The speech-to-reverberation modulation ratio (SRMR; Santos et al., 2014), denoted as select-stage, is used (a) for selecting whether the level minimization or the level maximization produces the best SNR improvement and (b) for determining the better ear. This is indicated by the numbers in the selection stage. For low frequencies, either the EC_min (1) or the EC_max (2) path is selected. For high frequencies (above 1500 Hz), either the left ear channel (3) or the right ear channel is selected. The binaurally processed channels and the better ear channels are combined, and a single-channel output is resynthesized using a gammatone synthesis filter bank. The output can then be analyzed by an arbitrary back end, which is the SII in this study.
First, the mixed signal is divided into 30 ERB-spaced (Moore & Glasberg, 1983) frequency bands ranging from 150 Hz to 8500 Hz using a gammatone filter bank (Hohmann, 2002), simulating the frequency selectivity of the auditory system. Next, EC processing is performed in frequency channels up to 1500 Hz (i.e., in the lower 15 frequency channels). The ILD is estimated in each frequency band by calculating the power in the left and right ear channels and subtracting the level of the right ear channel from the level of the left ear channel. This leads to a signed ILD, where negative values correspond to a higher level at the right ear and positive values to a higher level at the left ear. The signals in the left ear and right ear channel are then amplified or attenuated such that the levels are equalized between the two channels. We assume that the equalization is imperfect, which we implemented via a jitter in the interaural level equalization process that prevents perfect level equalization between the left and right ear channel. The jitter was a normally distributed random variable (vom Hövel, 1984; see supplementary material for more details). The jitter is applied to the signals directly, and thus, a Monte-Carlo simulation is required to model the statistics of the uncertainties. This procedure is similar to the method used by Beutelmann and Brand (2006) and Wan et al. (2010), but different from the method used by Beutelmann et al. (2010), where the uncertainties were incorporated analytically with respect to their expectation values and variances. Afterward, the ITD is estimated in each frequency channel from the phase information of the cross-power spectral density between left and right ear channel. After the estimation, the ITD is compensated for. This equalization process in time again includes binaural processing inaccuracies, which are assumed to be independent realizations of another normally distributed random variable (vom Hövel, 1984).
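The estimation steps of the equalization can be sketched as follows. This is a simplified illustration, not the model code: the ILD correction is split symmetrically between the ears, the ITD is read from the cross-spectrum phase at a single frequency (the band center) instead of the full cross-power spectral density, and the jitter standard deviation `jitter_db` is a hypothetical parameter, because the actual error distributions are specified in the supplementary material.

```python
import cmath
import math
import random

def band_power(x):
    """Mean power of one band signal."""
    return sum(v * v for v in x) / len(x)

def estimate_ild_db(left, right):
    """Signed ILD in dB: positive when the left ear has the higher level."""
    return 10.0 * math.log10(band_power(left) / band_power(right))

def equalize_level(left, right, jitter_db=0.0, rng=None):
    """Equalize levels; the symmetric split and the jitter size are assumptions."""
    error = (rng or random).gauss(0.0, jitter_db) if jitter_db > 0 else 0.0
    half = (estimate_ild_db(left, right) + error) / 2.0
    g_left = 10.0 ** (-half / 20.0)   # attenuates the left ear if it is louder
    g_right = 10.0 ** (half / 20.0)   # amplifies the right ear by the same amount
    return [g_left * v for v in left], [g_right * v for v in right]

def estimate_itd_s(left, right, fc_hz, fs_hz):
    """ITD from the cross-spectrum phase at fc_hz; positive if the right ear lags."""
    w = 2.0 * math.pi * fc_hz / fs_hz
    spec_l = sum(v * cmath.exp(-1j * w * k) for k, v in enumerate(left))
    spec_r = sum(v * cmath.exp(-1j * w * k) for k, v in enumerate(right))
    cross = spec_l * spec_r.conjugate()
    return cmath.phase(cross) / (2.0 * math.pi * fc_hz)

# Toy band signal: the left ear is twice as large, the right ear lags by 8 samples.
fs, fc, n, d = 16000, 500.0, 320, 8
left = [2.0 * math.sin(2 * math.pi * fc * k / fs) for k in range(n)]
right = [math.sin(2 * math.pi * fc * (k - d) / fs) for k in range(n)]

ild = estimate_ild_db(left, right)
itd = estimate_itd_s(left, right, fc, fs)
eq_left, eq_right = equalize_level(left, right)
print("ILD in dB:", ild, "ITD in s:", itd)
```

The single-frequency phase estimate is only unambiguous while the interaural phase stays within half a period, which is one reason why such phase-based processing is restricted to low-frequency bands.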
The cancellation step is performed either by subtracting the left ear channel from the right ear channel (as the left and right ear signals are equalized, the subtraction operation is symmetric and can be performed either way) or by adding the two channels, depending on the SRMR (see later). More details about the processing in BSIM2020 can be found in the supplementary material.
Selection of EC Path and Better Ear Based on Modulation Analysis
In the next step, the better of the EC processing strategies and the better ear are selected blindly to produce a binaurally processed mono signal, which can be analyzed by an arbitrary speech intelligibility back end. To do this, we use the SRMR measure (Santos et al., 2014) applied independently in each of the 30 frequency bands. The envelope of the EC processed signals and the envelope of the two ear signals are extracted by taking the absolute value of the analytical signal (i.e., the Hilbert envelope).
These are then analyzed by a modulation filter bank with eight logarithmically spaced filters ranging from 4 Hz to 128 Hz. In each modulation filter, the power (i.e., energy per block) is computed by taking the squared magnitude of the Fourier transformed envelope (in this study, each sentence was treated as one block). The SRMR is the ratio of the power in the four lowest modulation filters to the power of the four highest modulation filters. For frequency channels up to 1500 Hz, the SRMR is calculated on the outputs of both the minimizing and the maximizing EC paths. The path yielding the higher SRMR value is kept for further processing.
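The ratio and the path selection can be written down in a few lines. The sketch below assumes that the range of 4 Hz to 128 Hz refers to the center frequencies of the modulation filters and that ties are resolved in favor of the minimizing path; neither detail is stated in the text. The modulation powers are made-up numbers.

```python
def modulation_center_frequencies(f_min=4.0, f_max=128.0, n_filters=8):
    """Logarithmically spaced center frequencies (assumed to span f_min to f_max)."""
    ratio = (f_max / f_min) ** (1.0 / (n_filters - 1))
    return [f_min * ratio ** k for k in range(n_filters)]

def srmr(modulation_power):
    """Power in the four lowest modulation filters divided by the four highest."""
    return sum(modulation_power[:4]) / sum(modulation_power[4:])

def select_ec_path(power_min, power_max):
    """Keep the EC path with the higher SRMR (tie goes to 'min', an assumption)."""
    return "min" if srmr(power_min) >= srmr(power_max) else "max"

# Hypothetical modulation powers of one frequency channel for both EC paths.
power_ec_min = [4.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # speech-like envelope
power_ec_max = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # flat, noise-like envelope

print(modulation_center_frequencies())
print(srmr(power_ec_min), srmr(power_ec_max), select_ec_path(power_ec_min, power_ec_max))
```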
Above 1500 Hz, however, the SRMR measure is calculated on the left and right ear signals. If the left-right difference in SRMR exceeds a value of 0.1, then the ear providing the higher value is selected as better ear channel, but otherwise, the ear providing the lower root mean square is selected.
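The better-ear rule for one high-frequency channel reduces to a short decision function. The following is a sketch with hypothetical names; the behavior for exactly equal root mean square values is not specified in the text and is chosen arbitrarily here.

```python
import math

def rms(x):
    """Root mean square of a band signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def select_better_ear(srmr_left, srmr_right, rms_left, rms_right, threshold=0.1):
    """Better-ear rule: a clear SRMR difference decides, otherwise the lower RMS."""
    if abs(srmr_left - srmr_right) > threshold:
        return "left" if srmr_left > srmr_right else "right"
    # SRMR values are too similar: fall back to the ear with the lower level.
    return "left" if rms_left <= rms_right else "right"   # tie handling is assumed

print(select_better_ear(1.8, 1.2, 0.5, 0.3))    # SRMR difference of 0.6 decides
print(select_better_ear(1.25, 1.2, 0.5, 0.3))   # difference of 0.05: lower RMS decides
```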
The selected EC channels are then combined with the selected better ear channels using a gammatone synthesis filter bank to produce a single signal, which is then analyzed by a speech intelligibility back end or listened to by a human listener.
Model Testing and SRT Calculation
The model was tested using sentences from the Oldenburg Sentence test corpus (OlSa; Wagener et al., 1999c). OlSa sentences consist of five-word sentences with a fixed grammatical structure
The speech material was mixed with the noise at 41 different SNRs ranging from –20 dB to +20 dB in steps of 1 dB. For each tested SNR, 30 random sets of the jitter random variables were used in each frequency channel. In total, 12,300 simulations were conducted for each tested condition in the two experiments. We obtained SRTs via intermediate calculation of SII values. To obtain these, the speech and noise signals were processed separately using identical EC parameters and random variables as for the mixed signals.
The SII values were averaged across Monte-Carlo simulations for each of the 10 sentences used in the simulations. To obtain a reference SII, the SII values obtained in the colocated condition at an SNR of –7.8 dB (the average SRT across listeners with normal hearing) were averaged across the 10 sentences. Next, the mean SII (across the 10 sentences and 30 Monte-Carlo simulations) was calculated for each of the 41 tested SNRs and for each azimuth position of the noise, and the SNR that yielded a mean SII closest to the reference SII was selected. This SNR is taken as the estimate of the SRT50.
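The mapping from SII values to an SRT estimate is a nearest-value search over the tested SNRs. The sketch below uses a made-up linear SII curve and a made-up reference value purely for illustration; only the SNR grid and the simulation counts are taken from the text.

```python
def estimate_srt(snrs_db, mean_sii, reference_sii):
    """Return the tested SNR whose mean SII is closest to the reference SII."""
    best = min(range(len(snrs_db)), key=lambda i: abs(mean_sii[i] - reference_sii))
    return snrs_db[best]

snrs = list(range(-20, 21))            # 41 SNRs from -20 dB to +20 dB in 1 dB steps
n_jitter_sets, n_sentences = 30, 10
n_simulations = len(snrs) * n_jitter_sets * n_sentences

# Hypothetical mean SII per SNR for one noise azimuth (not model output).
toy_sii = [min(1.0, max(0.0, (snr + 15) / 30.0)) for snr in snrs]
toy_reference_sii = 0.24               # hypothetical reference value

print(n_simulations, estimate_srt(snrs, toy_sii, toy_reference_sii))
```

Because the search runs over the tested SNRs only, the resolution of the SRT estimate is limited to the 1 dB step size of the SNR grid.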
We should clarify that the model is binaurally blind even though speech and noise are run through the model separately for the SII calculation. The reason is that the model uses the mixed signals to determine the EC parameters; hence, it is binaurally blind. The SII calculation, however, requires separate signals, so we applied the same EC parameters to speech and noise.
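Processing speech and noise separately with parameters derived from the mixture is valid because, for fixed parameters, the EC operations (gain, delay, and sum or difference) are linear. The following sketch checks this superposition property numerically; the parameter values and the function `apply_ec` are hypothetical and only stand in for the actual EC stage.

```python
import random

def apply_ec(left, right, gain_left, gain_right, delay_left, sign):
    """Apply fixed EC parameters: scale both ears, delay the left ear, combine."""
    n = len(left)
    shifted = [left[(k - delay_left) % n] for k in range(n)]   # circular delay
    return [gain_left * l + sign * gain_right * r for l, r in zip(shifted, right)]

rng = random.Random(1)
n = 64
speech_l = [rng.gauss(0.0, 1.0) for _ in range(n)]
speech_r = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise_l = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise_r = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical parameters, standing in for values estimated from the mixture.
params = dict(gain_left=0.8, gain_right=1.25, delay_left=3, sign=-1.0)

mix_l = [s + v for s, v in zip(speech_l, noise_l)]
mix_r = [s + v for s, v in zip(speech_r, noise_r)]

mix_out = apply_ec(mix_l, mix_r, **params)
speech_out = apply_ec(speech_l, speech_r, **params)
noise_out = apply_ec(noise_l, noise_r, **params)

max_dev = max(abs(m - (s + v)) for m, s, v in zip(mix_out, speech_out, noise_out))
print("largest deviation from superposition:", max_dev)
```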
Experiment I—Modeling Speech Intelligibility in Spatially Separated Stationary Noise
To compare the blind estimation process of our new BSIM2020 with the SNR optimization procedure of BSIM2010, we simulated the speech intelligibility experiments conducted by Beutelmann and Brand (2006) with both models.
In those experiments, 10 listeners with normal hearing (21–43 years; audiometric thresholds of 20 dB HL or better between 250 and 8000 Hz) participated. Speech intelligibility experiments were conducted using the OlSa sentences in noise (Wagener et al., 1999a, 1999b, 1999c). An adaptive procedure (Equation 9, Brand & Kollmeier, 2002) was used to determine the SNR at which 50% of the sentences were understood correctly. All measurements were conducted using the Oldenburg Measurement Applications (HörTech gGmbH, Oldenburg, Germany). SRTs were obtained in three acoustical environments (which were denoted as “anechoic,” “office,” and “cafeteria”) and for different directions of the noise source, while the target speech was always presented from an azimuth of 0° (directly in front of the listener). The tested noise directions were –140°, –100°, –45°, 0°, 45°, 80°, 125°, and 180° in the anechoic and office conditions and –135°, –90°, –45°, 0°, 45°, 90°, 135°, and 180° in the cafeteria condition. In the anechoic condition, speech and noise signals were convolved with head-related transfer functions taken from Algazi et al. (2001). For the office and cafeteria conditions, Beutelmann and Brand (2006) used their own recordings of head-related transfer functions. All stimuli were presented binaurally using HD200 headphones (Sennheiser, Wedemark, Germany), which were free-field equalized using a finite impulse response (FIR) filter with 801 coefficients. The SRT was determined using test lists of 20 sentences. The test lists were randomly selected out of 45 lists. The noise level was set to 65 dB SPL, and the speech level was varied adaptively to find the individual SRT.
Results
Figure 2 shows the predicted SIIs from our model for a selection of four of the noise directions (–100°, 0°, 45°, and 180° azimuth) and for the tested SNRs ranging from –20 dB to +20 dB. These four angles were selected because they show the general characteristics representative of all of the simulations. The top-right panel shows the SII curves for colocated speech and noise sources (i.e., S0N0). The different EC processing strategies result in identical SII curves. This result was expected, because there are no interaural differences for the binaural processor to make any use of to enhance the SNR. The same holds for the noise located at 180° azimuth, shown in the bottom right panel. The curves are slightly shifted toward negative SNRs, which is an effect of pinna cues as they slightly improve the SNR for the S0N180 condition compared with the S0N0 condition.

Exemplary SII Curves for Speech (Located at 0°) in Noise (Various Locations). The SII is shown for the three tested processing schemes in the EC mechanism, which are a level minimization at the output (solid green line), a level maximization at the output (dash-dotted blue line), and a modulation-based selection between level minimization and maximization (dashed red line).
In both of the left panels, the effects of the different processing strategies for spatially separated target and masker can be observed. The masker at –100° azimuth provides a larger release from masking than the masker located at 45° azimuth. At negative SNRs, the level minimization provides the best SII, whereas at positive SNRs, the level maximization provides the best SII. The SII of modulation-based blind selection between minimization and maximization converges toward the SII of the level minimization at negative SNRs and to the SII of the level maximization at positive SNRs. For SNRs close to 0 dB, the modulation-based selection provides higher SII values than the level minimization or level maximization alone. This is probably due to the fact that the SRMR measure selects the optimal strategy independently in each frequency channel leading to a synergistic effect.
Figure 3 shows the SRTs predicted using either the level minimizing EC mechanism on its own or the modulation-based blind selection of either the minimizing or maximizing EC processing strategy (results obtained with the maximization strategy and listening only monaurally are reported in the supplementary material). Predictions are shown along with the data by Beutelmann and Brand (2006). The predictions were essentially equally accurate. Using the level minimization, very accurate predictions in terms of the coefficient of determination

Data (Black Dots) and Predictions Obtained for the Anechoic Situation Using the Level Minimization as EC Processing Criterion (Green Diamonds) and Using the Modulation-Based Selection of Level Minimization and Level Maximization (Red Diamonds). Error bars of the obtained data indicate the interindividual standard deviation; error bars of the predictions show the standard deviation across sentences.
In summary, the analysis of this experiment shows that our new model, using modulation-based blind selection of the optimal binaural processing strategy (either minimizing or maximizing of the output level) and the blind selection of the better ear, is able to describe SRTs for spatially separated speech in noise at negative SNRs.
Experiment II—Binaural Speech Intelligibility at Positive SNRs
To test the model at positive SNRs, we collected new data to investigate binaural intelligibility level differences and binaural release from masking at positive SNRs. We used two sets of speech material: the OlSa speech material as in Experiment I and the material of the Göttingen sentence test (GoeSa; Kollmeier & Wesselkamp, 1997), which consists of everyday sentences. Low-pass filtering and time compression were applied to degrade speech intelligibility and shift SRTs to positive SNRs. In Schlueter et al. (2015), it was shown that the SRT50 of listeners with normal hearing for OlSa sentences can be shifted to an SNR of 3 dB if the speech material is time compressed to 25% of its original length. However, in this study, such extreme compression was avoided. Instead, low-pass filtering was applied in addition to time compression. Low-pass filtering additionally lowers speech intelligibility while preserving the usable binaural ITD cues, which is necessary for achieving binaural unmasking. The binaural configuration used depended on the sentences. For the OlSa sentences, binaural release from masking was induced by imposing either an IPD of π or an ITD of 750 μs on the noise. For the GoeSa sentences, however, pilot experiments revealed that SRTs could not be determined reliably if they were compressed to 50% of their original length and low-pass filtered at 1200 Hz. Thus, they were only compressed to 66% of their original length and low-pass filtered at 1500 Hz. Moreover, only an IPD of π of the noise was tested. The IPD condition was chosen because it ought to give the maximal binaural release from masking, whereas the ITD condition was a more realistic scenario because the stimulus can be associated with a certain direction.
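The two binaural noise configurations can be generated from a single noise signal, as in the following sketch. The sampling rate of 48 kHz, the choice of the right ear as the inverted or delayed ear, and the zero padding at the onset are assumptions for illustration; the text does not specify them.

```python
import math
import random

def noise_with_ipd_pi(noise):
    """Interaurally phase-inverted noise: the right ear is the negated left ear."""
    return list(noise), [-v for v in noise]

def noise_with_itd(noise, itd_s, fs_hz):
    """Noise with an ITD: the right ear is delayed by the rounded number of samples."""
    d = round(itd_s * fs_hz)
    right = [0.0] * d + list(noise[:len(noise) - d])
    return list(noise), right, d

def ipd_for_itd(itd_s, f_hz):
    """Interaural phase difference (radians) that a pure delay produces at f_hz."""
    return 2.0 * math.pi * f_hz * itd_s

fs = 48000                                   # assumed sampling rate
rng = random.Random(2)
noise = [rng.gauss(0.0, 1.0) for _ in range(480)]

pi_left, pi_right = noise_with_ipd_pi(noise)
itd_left, itd_right, delay = noise_with_itd(noise, 750e-6, fs)

print("delay in samples:", delay)
print("IPD of the 750 microsecond delay at 500 Hz (rad):", ipd_for_itd(750e-6, 500.0))
```

A fixed delay produces an IPD that grows with frequency and equals π only at one frequency (about 667 Hz for 750 μs), whereas the phase inversion yields an IPD of π in all frequency channels.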
Listeners
A total of 13 listeners with normal hearing (6 male, 9 female, 19–28 years, mean age: 23 years) participated in the experiment. They had no previous experience with sentence test procedures, and audiometric thresholds did not exceed 20 dB HL.
Stimuli
The speech materials from the Oldenburg sentence test and the Göttingen sentence test in noise were used. The OlSa sentences have a fixed syntactical structure (e.g., “Peter sieht vier nasse Tassen.”, in English “Peter sees four wet cups.”), whereas the GoeSa sentences consist of fixed meaningful sentences with a variable syntactical structure and a variable length. These provide more semantical context information and can therefore be considered to be more similar to the sentences used in everyday conversation (e.g., “Ein kleiner Junge war der Sieger.”, in English “A little boy was the winner.”, or “Jetzt wird das Fundament gelegt.”, in English “Now the foundation is laid.”). The sentences were presented in three versions: (a) unprocessed, that is, standard OlSa or GoeSa measurements, (b) time compressed to 66% of their original length and low-pass filtered with a cut-off frequency of 1500 Hz, termed “LP 1500 Hz, TC 0.66,” and (c) time compressed to 50% and low-pass filtered at 1200 Hz, termed “LP 1200 Hz, TC 0.5,” which was only applied to the OlSa sentences.
Both time compression and low-pass filtering were performed using the PRAAT software (Boersma & van Heuven, 2001; Boersma & Weenink, 2018). In PRAAT, the low-pass filters are realized as a one-tailed Hann window with adjustable cut-off frequency (–6 dB) and filter slope (the width between pass and stop band) of 100 Hz, meaning that full attenuation was achieved within 100 Hz. Time compression was done using the pitch-synchronous overlap add algorithm (Moulines & Charpentier, 1990), which allows for time compression without change in pitch. This method was evaluated in Schlueter et al. (2015) and was shown to be a valid method to increase SRTs based on time compressing the speech material.
SRTs were determined as the SNR for 80% speech intelligibility and found by changing the level of the speech adaptively and by using 0.8 as the target value (Equation 9, Brand & Kollmeier, 2002). Five sentence lists with 20 sentences each were provided for training, and then the main binaural conditions with IPD of π or ITD of 750 μs (see earlier) were run.
Results
Figure 4 shows the SRT results obtained for OlSa sentences in the three tested conditions: unprocessed (left panel), time compression to 66% of the original length and low-pass filtering at 1500 Hz (LP 1500 Hz, TC 0.66; middle panel), and time compression to 50% of the original length and low-pass filtering at 1200 Hz (LP 1200 Hz, TC 0.50; right panel). The median SRT80 was found to be at –5.2 dB SNR in the S0N0 condition for the unprocessed OlSa sentences. By applying time compression and low-pass filtering to the OlSa sentences, the median SRTs in the S0N0 condition were shifted to –0.4 dB SNR for the “LP 1500 Hz, TC 0.66” condition and +4.6 dB SNR for the “LP 1200 Hz, TC 0.50” condition. In the S0Nπ condition, the median SRT was found to be 6 dB below the SRT in the S0N0 condition, which was independent of the manipulation applied to the stimuli. In the S0N750 condition, the median SRT was always 1 dB higher (worse) than the median SRT in the S0Nπ condition. However, in both of the binaural conditions, the variation across listeners was increased for the time-compressed and low-pass filtered stimuli.

Boxplots (Median [Horizontal Line], 25%–75% Confidence Interval [Box], 9%–91% Confidence Interval [Whisker] in Black, and Outliers in Red) of the SRT80 Obtained for 13 Listeners With Normal Hearing Using the OlSa. Unprocessed denotes the original stimuli, LP denotes the cut-off frequency of the low-pass filter, and TC denotes the applied time compression. N0S0 denotes the diotic (same signal at both ears) presentation of speech and noise, NπS0 denotes that the noise was interaurally phase inverted, and N750S0 denotes that the noise was interaurally delayed by 750 μs. Predictions obtained with the three EC outputs are shown, where blue squares show the predicted SRTs for the level maximization, green circles show the predicted SRTs for the level minimization, and red diamonds denote the results obtained by combining level minimization and maximization based on modulation analysis. Black diamonds correspond to the diotic, that is, N0S0, model outcome.
Figure 4 also shows the predictions from the BSIM2020 model using the three processing strategies: level minimization only, level maximization only, and modulation-based selection between level minimization and maximization. In the “unprocessed” condition, both the level minimization and the modulation-based strategies resulted in the same predicted SRT, which slightly overestimated the mean binaural release from masking and coincided with the best SRT obtained by the human listeners, which was found to be at –12 dB SNR. The level-maximization strategy was not adequate to describe the results obtained in the listening experiment because the predicted SRT was even higher than in the N0S0 condition.
In the “LP 1500 Hz, TC 0.66” condition, the SRT80 was found to be at –0.4 dB SNR for the S0N0 condition. In the binaural conditions, the SRTs were 5–6 dB lower. The level-minimization strategy resulted in a predicted SRT that coincided with the upper quartile of the obtained data. However, the standard deviation was very large in the S0Nπ condition. The predicted SRT using the level-maximization strategy was worse than in the diotic condition. The predicted SRT obtained using the output of the modulation-based selection strategy was again in the range of the best human listener.
In the “LP 1200 Hz, TC 0.5” condition, the diotic SRT was found to be at +4.6 dB SNR. The SRTs in the binaural conditions were found to be at 1.5 dB SNR for the S0N750 condition and 0.26 dB SNR for the S0Nπ condition. In the binaural conditions, the level-minimization strategy failed to predict the obtained SRTs. The predicted SRT using the level-maximization strategy was obtained at the upper end of the figure, while the modulation-based selection resulted in predicted SRTs close to the median. The model performance was evaluated by calculating the RMSE between the predicted and measured SRTs. For the modulation-based selection strategy, it was 2.5 dB.
For statistical evaluation, a Kolmogorov–Smirnov test (α = .05) was first conducted to test whether the data were normally distributed. It revealed that a normal distribution can only be assumed for the results obtained in the S0N0 condition (LP = 1500 Hz, TC = 0.66) and the S0Nπ condition (LP = 1200 Hz, TC = 0.5). Therefore, a Wilcoxon signed-rank test (α = .01) was conducted for the evaluation of the SRT differences across conditions. A statistically significant effect of inverting the phase of the noise on SRTs was found for all conditions: p(Unprocessed) = 2.4 × 10⁻⁴, p(LP 1500 Hz) = 4.8 × 10⁻⁴, and p(LP 1200 Hz) = 4.8 × 10⁻⁴. The same was shown for the S0N750 condition with p(Unprocessed) = 2.4 × 10⁻⁴ and p(LP 1500 Hz) = 2.4 × 10⁻⁴, but not for the LP 1200 Hz condition, p(LP 1200 Hz) = 0.0105.
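The order of magnitude of these p values can be checked against the exact null distribution of the signed-rank statistic. The sketch below assumes an exact two-sided test on 13 paired SRTs without ties or zero differences; the text does not state which variant of the test was used, so this is a plausibility check rather than a reproduction of the analysis.

```python
def signed_rank_counts(n):
    """Number of sign patterns that produce each positive-rank sum for n pairs."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Subset-sum counting: each rank is either in the positive sum or not.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts

def exact_two_sided_p(w, n):
    """Exact two-sided p value for the smaller rank sum w with n pairs."""
    counts = signed_rank_counts(n)
    tail = sum(counts[: w + 1])
    return min(1.0, 2.0 * tail / 2 ** n)

n_listeners = 13
for w in (0, 1, 10):
    print("smaller rank sum", w, "-> p =", exact_two_sided_p(w, n_listeners))
```

With 13 pairs, the smallest attainable two-sided p value is 2/2¹³, which is about 2.4 × 10⁻⁴ and is reached only when all listeners show a difference in the same direction. A rank sum of 10 gives a p value of about 0.0105, just above the criterion of .01.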
Figure 5 shows the corresponding binaural intelligibility level differences (BILDs), which are the differences in SRTs between the S0N0 conditions and the S0Nπ conditions. In general, the median BILD was comparable across manipulations at about 2–7 dB. However, the variance across listeners was increased for the conditions with time compression and low-pass filtering. A Wilcoxon signed-rank test found no statistically significant difference between the BILDs obtained for the unprocessed stimuli and the time-compressed and low-pass filtered stimuli [p(LP 1500 Hz) = 0.68, p(LP 1200 Hz) = 0.58].

Boxplots (Median [Horizontal Line], 25%–75% Confidence Interval [Box], 9%–91% Confidence Interval [Whisker], and Outliers [Red Crosses]) of BILDs Obtained for 13 Listeners With Normal Hearing Using OlSa Sentences. The noise had an IPD of π. Unprocessed denotes the unmanipulated OlSa sentences, and LP and TC denote the low-pass filter and time compression applied to the OlSa material.
Figure 6 shows the BILDs for the S0N750 condition. Again, these were not significantly different for the low-pass filtered and time-compressed conditions compared with the unprocessed condition [p(LP 1500 Hz) = 0.41, p(LP 1200 Hz) = 0.24].

Boxplots (Median [Horizontal Line], 25%–75% Confidence Interval [Box], 9%–91% Confidence Interval [Whisker], and Outliers [Red Crosses]) of BILDs Obtained for 13 Listeners With Normal Hearing Using OlSa Sentences. The noise had an ITD of 750 μs. Unprocessed denotes the unmanipulated OlSa sentences, and LP and TC denote the low-pass filter and time compression applied to the OlSa material.
Figure 7 shows the results obtained with the GoeSa sentences. Compared with the SRTs obtained with the OlSa sentences, SRTs with the GoeSa sentences were shifted to even more positive SNRs. The standard deviation was also larger. For the unprocessed GoeSa sentences, the median SRT in the S0N0 condition was found to be at –4.6 dB SNR. The median SRT in the S0Nπ conditions was found to be at –7.7 dB SNR, which was significantly lower (Wilcoxon signed-rank:

Boxplots (Median [Horizontal Line], 25%–75% Confidence Interval [Box], 9%–91% Confidence Interval [Whisker], and Outliers [Red Crosses]) of SRT80 Obtained for 13 Listeners With Normal Hearing Using the GoeSa. Unprocessed denotes the original stimuli, LP denotes the cut-off frequency of the low-pass filter, and TC denotes the applied time compression. Predictions obtained with the three EC outputs are shown, where blue squares show the predicted SRT for the level maximization, green circles are the predicted SRTs using the level minimization, and red diamonds denote the results obtained by combining level minimization and maximization based on modulation analysis.
The BSIM2020 model predicted a binaural release from masking for both the unprocessed and the processed GoeSa sentences. The predicted SRTs using the level maximization and the modulation-based selection were close to the median SRT in the binaural condition. However, due to the very large variability across listeners, a direct comparison between measured and predicted SRTs was problematic.
In summary, Experiment II showed that binaural release from masking can be found at negative and positive SNRs. Therefore, it is also necessary to consider positive SNRs in models of binaural speech intelligibility. The SRTs predicted with the modulation-based selection strategy of our BSIM2020 model showed good agreement with the measured SRTs at positive and negative SNRs.
Discussion
Blind Modeling of Binaural Processing
This study introduces a new
The concept of minimizing and maximizing the level at the EC output differs from that of binaural speech intelligibility models in the literature, where the back end (speech intelligibility metric) is typically used to optimize the EC parameters in a top-down process (Andersen et al., 2016; Beutelmann et al., 2010). In such a top-down process, the EC parameters are adjusted to maximize the speech intelligibility metric. Compared with the models proposed by Cosentino et al. (2014), Tang et al. (2018), and Geravanchizadeh and Fallah (2015), our model has the advantage that it does not make assumptions about the SNR, the location of a target speaker or interfering noise source, or the number of interfering noise sources. Moreover, the model is designed in such a way that it can be combined with arbitrary speech intelligibility back ends, because it produces a binaurally processed mono signal. Therefore, it can also be used as a simple binaural beamformer for signal enhancement. Geravanchizadeh and Fallah (2015) proposed a binaural speech intelligibility model with a blind EC processing stage. However, positive SNRs were explicitly excluded, because they assumed 100% speech intelligibility at positive SNRs and, consequently, that binaural processing is only relevant at negative SNRs.
The concept presented here can be regarded as a bottom-up process, where the binaurally processed signals are fed to a modulation-based selection stage, which serves as a simple “gate” that passes the channel with the best representation of speech to the back end. Moreover, the relatively simple binaural processing presented here does not require any assumptions about the location of the target and/or interfering source. Binaural unmasking is not necessarily based on localization, as it works best for stimuli which are interaurally phase inverted (e.g., Levitt & Rabiner, 1967) and, consequently, have frequency-dependent ITDs which are ambiguous with respect to location or lateralization. Nevertheless, binaural unmasking and binaural sound source localization are certainly related, for example, with respect to object formation and stream segregation (e.g., Bronkhorst, 2000). The inclusion of localization cues into the present model will be the subject of future studies.
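The bottom-up idea can be illustrated with a strongly simplified sketch for an S0Nπ stimulus. It is not the BSIM2020 implementation. As assumptions for illustration, the EC search is reduced to two candidate gains (+1 and −1, the latter standing in for a phase shift of π), processing is broadband rather than per frequency band, the “speech” is an amplitude-modulated tone, and a crude coefficient of variation of frame energies replaces the SRMR. All function names and parameter values are hypothetical.

```python
import math
import random

FS = 8000  # sampling rate in Hz (assumed for this sketch)


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def make_s0npi(snr_db, n=FS, seed=0):
    """Return (left, right): diotic toy target plus interaurally inverted noise."""
    rng = random.Random(seed)
    # Toy "speech": 500 Hz carrier with 4 Hz amplitude modulation.
    speech = [0.5 * (1 + math.cos(2 * math.pi * 4 * i / FS))
              * math.sin(2 * math.pi * 500 * i / FS) for i in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))  # set the SNR
    noise = [scale * v for v in noise]
    left = [s + v for s, v in zip(speech, noise)]
    right = [s - v for s, v in zip(speech, noise)]
    return left, right


def ec_outputs(left, right, gains=(-1.0, 1.0)):
    """Blind EC sketch: return (level-minimized, level-maximized) outputs."""
    candidates = [[l - g * r for l, r in zip(left, right)] for g in gains]
    return min(candidates, key=rms), max(candidates, key=rms)


def modulation_score(x, frame=160):
    """Crude stand-in for SRMR: coefficient of variation of frame energies."""
    energies = [sum(v * v for v in x[i:i + frame])
                for i in range(0, len(x) - frame + 1, frame)]
    mean = sum(energies) / len(energies)
    var = sum((e - mean) ** 2 for e in energies) / len(energies)
    return math.sqrt(var) / mean


def blind_select(left, right):
    """Gate: report which EC path carries the more strongly modulated signal."""
    lo, hi = ec_outputs(left, right)
    return "min" if modulation_score(lo) >= modulation_score(hi) else "max"


choice_negative_snr = blind_select(*make_s0npi(-10.0))
choice_positive_snr = blind_select(*make_s0npi(10.0))
```

In this toy setting, level minimization removes the noise at the negative SNR but removes the target at the positive SNR, so the gate is expected to switch from the minimized to the maximized path without any knowledge of the SNR or of the source positions.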
In our study, we evaluated the blind binaural processing using the SII to be able to compare it with an earlier model, BSIM2010 (Beutelmann et al., 2010), which used an SNR optimization for its EC processing, that is, a level-minimization strategy at negative SNRs. Nevertheless, the novel binaural processing stage can be combined with arbitrary speech intelligibility back ends, which may be based either on mixed signals (e.g., Andersen et al., 2017; Schädler et al., 2016; Spille et al., 2018) or on separate speech and noise signals as input (e.g., ANSI S3.5-1997, 1997; Steeneken & Houtgast, 1980; Taal et al., 2011). The latter can be achieved (as is done in this study) by additionally processing speech and noise in isolation but using blindly estimated binaural parameters and processing strategies.
The results of Experiment I demonstrate that the level-minimization strategy of the EC output is sufficient to describe the data obtained in the binaural listening experiment by Beutelmann and Brand (2006). The predictions of the blind selection in the BSIM2020 model were comparable to predictions obtained with the earlier BSIM2010 model.
The bias and RMSE were only slightly increased, which was caused by an interaction between the modulation analysis of the EC processed output and the binaural processing inaccuracies used in the Monte-Carlo simulation. To explain, the binaural processing inaccuracies limit the SNR improvement of the EC process to mirror human performance. In some cases, where the random variables for the jitter are drawn from the tails of the normal distribution, the difference in the SRMR measure between maximized and minimized EC output is very small, because both paths are dominated by the noise. If this is the case, the selection of the theoretically better EC channel is uncertain. If the binaural processing inaccuracies are disabled, the modulation-based selection of either the minimized or maximized output works more robustly. However, as we assume that processing errors in lower binaural processing stages also affect all following processing stages including the assumed selection of the best channel, only results using binaural processing inaccuracies are presented in this study.
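How processing inaccuracies limit the cancellation can be sketched with a small Monte-Carlo simulation for a tonal masker that is identical at both ears. This is a minimal illustration, not the error model used in BSIM2020. The function name, the number of trials, and the standard deviations of the gain and delay errors (1.5 dB and 65 μs) are assumptions chosen for illustration, not parameter values taken from this section.

```python
import math
import random


def mean_residual_power(freq_hz, sigma_gain_db, sigma_delay_s,
                        trials=2000, seed=1):
    """Mean masker power after cancellation, relative to one ear's masker power."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Random gain error (in dB) and delay error (in s) between the two ears.
        gain = 10 ** (rng.gauss(0.0, sigma_gain_db) / 20)
        phase = 2 * math.pi * freq_hz * rng.gauss(0.0, sigma_delay_s)
        # |1 - gain * exp(j * phase)|^2 for a unit-amplitude tone.
        total += 1 + gain * gain - 2 * gain * math.cos(phase)
    return total / trials


perfect = mean_residual_power(500.0, 0.0, 0.0)          # no processing errors
jitter_500 = mean_residual_power(500.0, 1.5, 65e-6)     # assumed error sizes
jitter_1000 = mean_residual_power(1000.0, 1.5, 65e-6)
```

Without errors the masker is cancelled completely, whereas with errors a residual remains that grows with frequency for a fixed delay error. When a draw from the tails of the distributions leaves a large residual, both EC paths are dominated by noise, which is the situation in which the selection described above becomes uncertain.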
In further work, we also attempted to use SRMR as back end for directly predicting SRT based on the mixed speech and noise signals. However, this approach failed, because the increase of the SRMR with increasing SNR was too shallow at negative SNRs to derive SRTs for 50% speech intelligibility. Figure 8 shows the SRMR output for speech in noise based on the signal at the output of the blind binaural processing stage, where the noise is either located at 0° or 125° in the horizontal plane. The SRMR is able to select the better EC processing path and the better ear. However, the binaural benefit is overestimated as the difference in SRMR between 0° and 125° is too large. Consequently, no SRT criterion can be chosen that holds for both conditions, because both curves do not cross. Cosentino et al. (2014) did not run into this problem, because they used the SRMR only to determine the better ear by applying it to the left and right ear signal. The higher SRMR value of both ears was then converted to a dB value using a fitting function, which was derived in their study, assuming that listening with the better ear produces a benefit in the range from 0 to 6 dB. We do not use their mapping function of SRMR to SNR improvement, because our model produces a binaurally processed mono signal.

SRMR of Output of EC Stage as a Function of SNR for Speech in Noise. The noise was either located at 0° in the horizontal plane (red line) or at 125° in the horizontal plane (blue line).
Binaural Release From Masking at Positive SNRs
In Experiment II, binaural speech intelligibility and binaural unmasking were investigated at positive SNRs. To this end, time compression and low-pass filtering of the stimuli were used to increase the SRT80. For the OlSa sentences, the release from masking did not differ significantly across the tested scenarios, demonstrating that the binaural auditory system produces release from masking even at positive SNRs. The BSIM2020 model is able to predict this by using the modulation-based blind selection of level minimization or level maximization at frequencies below 1500 Hz and selecting the better ear above 1500 Hz. In the “LP 1500 Hz, TC 0.66” condition, the standard deviation of the predicted SRT using level minimization was very large. In the “LP 1200 Hz, TC 0.5” condition, the level-minimization strategy failed to predict an SRT, because the target is cancelled from the mixed signals at positive SNRs. This was caused by the flattening and nonmonotonic character of the SII curves at 0 dB SNR, which was also observed in the SII curves based on the level minimization obtained for Experiment I (see Figure 2). However, the effect was larger in Experiment II, because there was no better ear in the S0Nπ stimulus. Moreover, the low-pass filtering enhanced the effect of the EC processing on the obtained SII value, because only the low-frequency region was considered, where binaural processing (and EC processing) can be assumed to take place. This can also be seen in Figure 9, where the SII curves obtained for the low-pass filtered stimuli (cut-off frequency 1200 Hz) are shown. The SII based on the output of the level minimization is a nonmonotonic function, causing the large variance in the predicted SRTs with LP = 1500 Hz and TC = 0.66. The blindly selected output shows the synergistic effects of combining level maximization and level minimization for SNRs close to 0 dB, providing a better agreement between measured and predicted data.
However, the RMSE between predictions using the blind binaural processing stage and measured SRTs is in the range of 2–5 dB for the OlSa sentences.
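Why a nonmonotonic SII curve makes the predicted SRT unstable can be seen from how an SRT is read off such a curve, namely as the SNR at which the SII reaches a fixed criterion. The following sketch assumes linear interpolation between sampled SNRs. The function name, the SII values, and the criteria are hypothetical and only illustrate the principle.

```python
def criterion_crossings(snrs_db, sii_values, criterion):
    """Return every SNR (dB) where the piecewise-linear SII curve crosses the criterion.

    Samples lying exactly on the criterion are ignored in this simple sketch.
    """
    crossings = []
    points = list(zip(snrs_db, sii_values))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (y0 - criterion) * (y1 - criterion) < 0:
            # Linear interpolation between the two neighboring samples.
            crossings.append(x0 + (criterion - y0) * (x1 - x0) / (y1 - y0))
    return crossings


snrs = [-10.0, -5.0, 0.0, 5.0, 10.0]
monotonic_sii = [0.1, 0.2, 0.3, 0.4, 0.5]      # hypothetical curve
nonmonotonic_sii = [0.1, 0.3, 0.2, 0.3, 0.5]   # hypothetical curve with a dip

unique_srt = criterion_crossings(snrs, monotonic_sii, 0.35)
ambiguous_srts = criterion_crossings(snrs, nonmonotonic_sii, 0.25)
```

The monotonic curve yields a single crossing, whereas the curve with a dip yields three candidate SRTs for the same criterion, so that small changes in the curve can move the predicted SRT by several dB.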

SII Curves Obtained for Low-Pass Filtered and Time-Compressed OlSa Sentences. The green curves denote the SII obtained for the output of the level minimization, the blue curves the output of the level maximization, and the red curves the nonintrusively selected output. The black dotted line indicates the SII criterion for SRT80.
The results obtained with the GoeSa sentences were not as clear as for the OlSa sentences. For the GoeSa sentences, binaural release from masking was only observed for the unprocessed condition. In the time-compressed and low-pass filtered condition, the large standard deviation across listeners made it impossible to draw conclusions about the binaural release from masking. Moreover, by applying low-pass filtering to 1500 Hz and time compression to 0.66 of the original length, SRTs were increased by 15 dB for the open-set GoeSa sentences but only by 5 dB for the closed-set OlSa sentences. This finding is in line with observations made by Rennies et al. (2014) and Warzybok et al. (2016), where the combined effect of reverberation and SNR on speech intelligibility (and listening effort) was investigated using OlSa sentences (Rennies et al., 2014) and GoeSa sentences (Warzybok et al., 2016): GoeSa sentences were more affected by additional reverberation than OlSa sentences, meaning that the intelligibility for a certain combination of noise and reverberation is higher for OlSa sentences than for GoeSa sentences. This might indicate that the GoeSa sentences in general are more affected by manipulations (like time compression and low-pass filtering) than OlSa sentences. This might be caused by the special structure of OlSa sentences which consist of 5 words drawn from a pool of 50 words and have a fixed grammatical structure, which makes it easier for the listeners to generate an expectation.
Future Work
While Experiment II showed that it is difficult to quantify binaural unmasking at positive SNRs by measuring SRTs with listeners with normal hearing, listening effort has been shown to be affected by the SNR even at 100% speech intelligibility. In Rennies and Kidd (2018), a binaural benefit was observed in measurements of listening effort, where a spatial separation of the target source from the interfering source resulted in reduced listening effort, similar to the binaural unmasking observed for SRTs presented in the current study. In principle, the BSIM2020 model could be applied to predict binaural listening effort in its current version. This requires the modeling of speech in noise at positive SNRs, which is possible with the BSIM2020 model using a readjustment of the reference SII value to fit the SNR range where listening effort changes but speech intelligibility is already saturated. Instead of converting SII values to SRTs, a conversion of SII values to the different categorical effort scaling values as measured, for example, by the method of Krueger et al. (2017a) would be needed. Note that even when 100% speech intelligibility has been achieved, listening effort can still be further reduced by increasing the SNR, which is related to a further increase of the SII with increasing SNR above the SRT (Krueger et al., 2017b).
Conclusions
This study presents a new binaural speech intelligibility model termed
We found that the model predictions deviated from the experimental data with RMSEs of less than 1 dB (Experiment I) or 2.5 dB (Experiment II). In the second experiment, the increase may have been due to the large variance in the observed data, especially for SRTs at positive SNRs. Blind level minimization of the output of the EC process is sufficient to describe results of listening experiments at negative SNRs, which is usually the case in SRT50 measurements with listeners with normal hearing.
Our experimental results demonstrated that binaural release from masking also occurred at positive SNRs. We did this using time-compressed and low-pass filtered OlSa sentences. These two manipulations preserved the binaural cues and can be regarded as simple simulations of reduced cognitive processing speed and high-frequency hearing loss.
Supplemental Material
sj-pdf-1-tia-10.1177_2331216520975630 - Supplemental material for Modeling Binaural Unmasking of Speech Using a Blind Binaural Processing Stage
Supplemental material, sj-pdf-1-tia-10.1177_2331216520975630 for Modeling Binaural Unmasking of Speech Using a Blind Binaural Processing Stage by Christopher F. Hauth, Simon C. Berning, Birger Kollmeier and Thomas Brand in Trends in Hearing
