Task, Rules and Evaluation

What task is at hand and how it will be evaluated

Objective

The goal is to obtain the best possible enhancement of the target signal at the array’s binaural microphones. In the context of augmented reality, it is not yet known what ‘best possible’ means, but one might presume that the optimal enhancement would render the target talker’s speech fully intelligible and relatively free from excessive reverberation, background noise and interferers. Each dataset will be split into Train, Development and Evaluation batches. The Evaluation batch will be used to assess participants’ performance and will be released three weeks before the end of the challenge.

Participants will be given as input (more details in download):

  • 6-channel microphone array audio from the AR glasses wearer
  • Array orientation as quaternions
  • Direction of arrival (azimuth and elevation) of all sources
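
As an illustration of how this metadata might be used, the sketch below converts a source’s azimuth/elevation pair into a Cartesian unit vector and rotates it from the world frame into the array frame using the orientation quaternion. The (x, y, z, w) quaternion ordering and the angle conventions here are assumptions for this example only; the actual conventions are specified in the dataset documentation.

```python
# Illustrative only: quaternion ordering and angle conventions are assumptions.
import numpy as np
from scipy.spatial.transform import Rotation


def doa_to_unit_vector(azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Convert azimuth/elevation (degrees) to a Cartesian unit vector."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])


def world_to_array_frame(vec_world: np.ndarray, quat_xyzw: np.ndarray) -> np.ndarray:
    """Rotate a world-frame direction into the array frame, given the array
    orientation quaternion (assumed to map array frame to world frame)."""
    rot = Rotation.from_quat(quat_xyzw)  # SciPy expects (x, y, z, w) ordering
    return rot.inv().apply(vec_world)


# Example: source at 30 deg azimuth, 0 deg elevation, identity orientation.
v = doa_to_unit_vector(30.0, 0.0)
print(world_to_array_frame(v, np.array([0.0, 0.0, 0.0, 1.0])))
```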

Extra metadata will also be made available for the Train and Development batches, but it should not become an essential part of the algorithm as it will not be provided in the Evaluation batch:

  • Reference audio
    • Dataset 1: Time aligned close mic audio (monaural)
    • Dataset 2, 3: Simulated direct path of denoised close mic audio (binaural)
    • Dataset 4: Simulated direct path of clean audio (binaural)
  • Binary voice activity detection for all sources
  • All files necessary for TASCAR simulations (Datasets 2, 3 & 4 only)
    • Tascar scene
    • Source audio files
    • Position and orientation files
    • Summary CSV file of all scene modifications with respect to Dataset 2 for every Minute of every Session (Datasets 3 & 4 only)

The required output is one binaural WAV file per target talker. Intrusive metrics will be computed for the enhanced binaural audio with respect to the clean target, and the same enhanced audio will also be used for the perceptual quality evaluation. Details of the metrics and the perceptual evaluation are given below.
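
A minimal sketch of the output stage, assuming a 48 kHz sample rate and illustrative file names (neither is prescribed by this description):

```python
# Illustrative only: file names, talker IDs and sample rate are assumptions.
import numpy as np
import soundfile as sf

fs = 48000                                            # assumed sample rate
enhanced_signals = {                                  # placeholder enhanced outputs
    "talker1": np.zeros((fs, 2), dtype=np.float32),   # shape (n_samples, 2): binaural
    "talker2": np.zeros((fs, 2), dtype=np.float32),
}
for talker_id, enhanced in enhanced_signals.items():
    # One binaural (2-channel) WAV file per target talker.
    sf.write(f"enhanced_{talker_id}.wav", enhanced, fs)
```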

Practical Rules

Any combination or subset of microphones from the array can be used to tackle this problem. There is no limit on the computational cost of the methods, though some indication of complexity/runtime should be reported in the submission. Similarly, algorithmic latency must be reported and must remain under 50 ms. Methods that exceed this threshold can still be submitted but will be considered separately from the other entries. Furthermore, to allow accurate computation of metrics, participants should compensate for (i.e. remove) any delay introduced by the enhancement.
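
As an example, one simple way to meet the delay-compensation requirement is to estimate the enhancer’s lag by cross-correlating its output against an unprocessed reference channel and then advancing the output by that many samples. The sketch below uses placeholder signals and an assumed 48 kHz sample rate.

```python
# Illustrative only: placeholder signals; lag estimated via cross-correlation.
import numpy as np
from scipy.signal import correlate, correlation_lags


def estimate_lag(reference: np.ndarray, delayed: np.ndarray) -> int:
    """Return the integer-sample lag of `delayed` relative to `reference`."""
    corr = correlate(delayed, reference, mode="full")
    lags = correlation_lags(len(delayed), len(reference), mode="full")
    return int(lags[np.argmax(corr)])


def remove_delay(signal: np.ndarray, lag: int) -> np.ndarray:
    """Advance `signal` by `lag` samples and zero-pad back to its original length."""
    if lag <= 0:
        return signal
    return np.concatenate([signal[lag:], np.zeros(lag, dtype=signal.dtype)])


fs = 48000                                    # assumed sample rate
x = np.random.randn(fs)                       # stand-in unprocessed channel
y = np.concatenate([np.zeros(240), x])[:fs]   # "enhanced" output, 5 ms (240 samples) late
lag = estimate_lag(x, y)                      # -> 240
aligned = remove_delay(y, lag)                # time-aligned with the input
```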

Teams who wish to use pre-trained machine learning models as a basis for their algorithms are permitted to do so provided that they meet the following requirements:

  • The pre-trained model must be citable and publicly available within 8 weeks of the Challenge materials being released.
  • The team must inform the organising committee of their intention to use a pre-trained model along with references for any models they are considering.
  • No more than 5 models may be declared by a single team. The organising committee will subsequently distribute a list of all declared and permitted models so that all teams can compete on an equal basis.

The aim of allowing pre-trained models is to let teams take advantage of the extensive computational resources that have already been expended on similar machine learning tasks. The one-month window at the start of the Challenge is intended to allow existing models to be identified, or new ones to be created by teams who have access to wider data/computing resources than can be provided within the scope of the challenge.

The list of nominated pre-trained models can be found here.

Three weeks before the end of the challenge, the withheld Evaluation set will be released without references. Participants must then submit enhanced binaural audio for each scenario in the Evaluation set.

Teams may submit multiple entries to the challenge, provided that there are significant differences between them. There is no limit on the number of people who can contribute to a submission.


Evaluation

Results from the following evaluation methods will always be compared against the provided baseline model, and participants should aim to improve on the baseline’s results. This model and the other provided tools are detailed in the scripts section.


Metrics

Submissions will be scored on a battery of metrics in the categories of Signal to Noise Ratio (SNR), Speech Intelligibility (SI), and Speech Quality (SQ) as detailed in the table below.

For each scene, the metrics will be computed from the individually enhanced binaural signal for each talker and its clean reference. Moments when the wearer of the mic array is speaking will be ignored in the evaluation. The average across all scenes gives the participant’s raw score (a sketch of this computation follows the table below).

Entries will be ranked separately according to their performance on each dataset. A full statistical analysis will be conducted to assess how well each metric predicts perceived quality.

| Category | Metric | Abbreviation | Reference | Python Package |
|----------|--------|--------------|-----------|----------------|
| SNR | Signal to Noise Ratio | SNR | - | PySEPM |
| SNR | Frequency-weighted Segmental SNR | fwSegSNR | Hu et al. 2008 [1] | PySEPM |
| SI | Short-Time Objective Intelligibility | STOI | Taal et al. 2011 [2] | PYSTOI |
| SI | Extended STOI | ESTOI | Jensen et al. 2016 [3] | PYSTOI |
| SI | Modified Binaural STOI | MBSTOI | Andersen et al. 2018 [4] | CLARITY |
| SI | Speech to Artifacts Ratio | SAR | Vincent et al. 2006 [5] | Speech Metric |
| SI | Image to Spatial Ratio | ISR | Vincent et al. 2006 [5] | Speech Metric |
| SI | Speech to Distortion Ratio | SDR | Vincent et al. 2006 [5] | Speech Metric |
| SI | Scale-Invariant SDR | SI-SDR | Le Roux et al. 2019 [6] | Speech Metric |
| SI | Hearing Aid Speech Perception Index | HASPI | Kates et al. 2014 [7][8] | CLARITY |
| SQ | Perceptual Evaluation of Speech Quality | PESQ | Rix et al. 2001 [9] | Speech Metric |
| SQ | PESQ Narrow Band | PESQ-NB | Rix et al. 2001 [9] | Speech Metric |
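
As an illustration of this scoring recipe, the sketch below computes per-ear ESTOI and wideband PESQ for one talker, using the pystoi and pesq packages as stand-ins for the toolboxes listed above. The file names, the per-sample wearer VAD and the per-ear treatment are assumptions made for this example; the official evaluation scripts define the exact procedure.

```python
# Illustrative only: file names, VAD handling and per-ear scoring are assumptions.
import numpy as np
import soundfile as sf
from pystoi import stoi
from pesq import pesq
from scipy.signal import resample_poly

clean, fs = sf.read("reference_talker1.wav")    # clean binaural reference, shape (n, 2)
enhanced, _ = sf.read("enhanced_talker1.wav")   # enhanced binaural output, shape (n, 2)

# Hypothetical per-sample wearer VAD: drop moments where the array wearer speaks.
wearer_vad = np.zeros(len(clean), dtype=bool)
keep = ~wearer_vad
clean, enhanced = clean[keep], enhanced[keep]

scores = {}
for ear, name in enumerate(("left", "right")):
    ref, deg = clean[:, ear], enhanced[:, ear]
    scores[f"estoi_{name}"] = stoi(ref, deg, fs, extended=True)
    # Wideband PESQ expects 16 kHz input, so resample first.
    g = np.gcd(16000, fs)
    ref16 = resample_poly(ref, 16000 // g, fs // g)
    deg16 = resample_poly(deg, 16000 // g, fs // g)
    scores[f"pesq_{name}"] = pesq(16000, ref16, deg16, "wb")

print({k: round(float(v), 3) for k, v in scores.items()})
```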


Listening Tests

Procedure

Since metrics do not always correlate with perceptual results, each algorithm will also be evaluated using crowd-sourced listening tests [10]. Crowd-sourcing the listening tests allows us to reach a broad range of test subjects and avoids any ongoing implications of the global COVID-19 pandemic. To mitigate the potential effects of subjective and hardware biases, detailed instructions on how to perform the listening test and on the required listening equipment/conditions will be provided as an introduction.

Listeners will be asked to rate their relative preference between two alternative enhancement approaches, given the clean target signal as a reference. In contrast to traditional MOS-based evaluations, there is no single ground-truth ideal signal, which is part of what makes SPEAR a particularly interesting application to work on.

Depending on the number of entrants, and at the organizing committee’s discretion, only entries that meet the 50 ms algorithmic latency requirement will be included in the perceptual evaluation.

The listening tests will be performed in two stages:

  1. Direct comparison against the baseline: only entries that improve upon the baseline will qualify for the second stage.
  2. Relative scoring of all qualifying systems.


A Note on AR Considerations

In a real-world AR context, listeners hear sound from the environment as well as the enhanced audio. Depending on the device, passive or active attenuation of the environmental sound is possible. Internal tests suggest that 50 ms of latency is acceptable, although this will depend on the device being used. To keep the focus of the Challenge on the speech enhancement process, only the enhanced signals will be played to listeners. Teams are free to mix in some proportion of the unprocessed binaural signals if they wish.
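
For instance, such a mix could be as simple as a weighted sum of the enhanced and unprocessed binaural signals; the 0.1 passthrough gain in the sketch below is an arbitrary illustration, not a recommendation.

```python
# Illustrative only: the passthrough gain is arbitrary and the signals are placeholders.
import numpy as np


def mix_passthrough(enhanced: np.ndarray, unprocessed: np.ndarray,
                    passthrough_gain: float = 0.1) -> np.ndarray:
    """Blend enhanced and unprocessed binaural signals of equal shape (n_samples, 2)."""
    return (1.0 - passthrough_gain) * enhanced + passthrough_gain * unprocessed


fs = 48000
enhanced = np.zeros((fs, 2))      # placeholder enhanced binaural signal
unprocessed = np.zeros((fs, 2))   # placeholder unprocessed binaural signal
output = mix_passthrough(enhanced, unprocessed, passthrough_gain=0.1)
```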



References
  1. Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008. 

  2. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011. 

  3. J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016. 

  4. A. H. Andersen, J. M. de Haan, Z.-H. Tan, and J. Jensen, “Refinement and validation of the binaural short time objective intelligibility measure for spatially diverse conditions,” Speech Communication, vol. 102, pp. 1–13, 2018. 

  5. E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

  6. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019.

  7. J. M. Kates and K. H. Arehart, “The hearing-aid speech perception index (HASPI),” Speech Communication, vol. 65, pp. 75–93, 2014. 

  8. J. M. Kates and K. H. Arehart, “The hearing-aid speech perception index (HASPI) version 2,” Speech Communication, vol. 131, pp. 35–46, 2021. 

  9. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 749–752, 2001.

  10. M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman, “Fast and easy crowdsourced perceptual audio evaluation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.