Background

Context

Over several decades, microphone array signal processing has led to significant improvements in speech quality and intelligibility for both human and machine listening. Traditionally, beamforming algorithms have had access only to the acoustic signals, so steering the beam requires additional information or inference. Head-worn microphone arrays are particularly challenging because the orientation of the array with respect to the room and the sound sources changes rapidly with head motion. Algorithms deployed in hearing aids, currently the most common type of head-worn array, tend to be relatively conservative, both because adapting to dynamic aspects of the scene takes time and because users need to maintain situational awareness.
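To make the steering problem concrete, here is a minimal sketch of a free-field, far-field delay-and-sum beamformer in Python. The function name, the array layout and the assumption that the direction of arrival is already known are illustrative only, not part of any particular device's processing:

```python
import numpy as np

def delay_and_sum(frames, mic_positions, doa, fs, c=343.0):
    """Free-field, far-field delay-and-sum beamformer (frequency domain).

    frames: (n_mics, n_samples) microphone signals
    mic_positions: (n_mics, 3) microphone positions in metres,
        expressed in the array frame
    doa: (3,) unit vector pointing from the array toward the talker,
        in the same array frame
    """
    n_mics, n_samples = frames.shape
    # A mic displaced along the DOA hears the plane wave earlier by
    # (position . doa) / c seconds relative to the array origin.
    advances = mic_positions @ doa / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    # Delay each channel by its advance so all channels line up on
    # the target, then average across microphones.
    aligned = spectra * np.exp(-2j * np.pi * freqs * advances[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

On a head-worn array the catch is that `doa`, expressed in the array frame, changes every time the head moves, which is exactly the adaptation burden described above.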

Whether listeners have normal or impaired hearing, noisy social situations such as cafes, restaurants and bars present a challenge to speech communication. Assistive listening devices have tended to target the hearing impaired, but augmented reality devices offer the potential to bring aided/enhanced hearing for all to the archetypal cocktail party.

The recent emergence of hearable devices with head-tracking sensors and/or video recording, such as smart headphones, smart glasses and virtual/augmented reality headsets, presents an opportunity for a new class of speech and acoustic signal processing algorithms that use multimodal sensor data to compensate for, or even exploit, changes in head orientation. Such devices therefore have the potential to help listeners solve the cocktail party problem.
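As one illustration of how head-tracking data might be exploited, the sketch below re-expresses a world-frame direction of arrival in the array frame using the device's orientation; the function and the frame conventions are assumptions made for the example, not a prescribed interface:

```python
import numpy as np

def doa_in_array_frame(doa_world, R_head):
    """Re-express a world-frame DOA in the head-worn array frame.

    doa_world: (3,) unit vector pointing from the array toward the
        talker, in world coordinates.
    R_head: (3, 3) rotation matrix mapping array-frame coordinates
        to world coordinates (e.g. integrated from the device's IMU).
    """
    # A rotation matrix is orthogonal, so its transpose is its
    # inverse and maps world coordinates back into the array frame.
    return R_head.T @ doa_world
```

Each head-tracker update can then re-steer the beamformer immediately, rather than waiting for an acoustic localiser to re-converge after the head moves.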

Novelty and Scientific Outcomes

The SPEAR challenge will offer researchers the opportunity to benchmark existing speech enhancement algorithms, and will stimulate new research interest, in the context of head-worn microphone arrays where positional information is available to the algorithm.

The acoustic scenarios will be realistic, involving multiple people, with speakers and listeners moving their heads naturally throughout the discussion. SPEAR could also serve as a platform for assessing how robustly machine learning algorithms trained on simulated data perform on real recordings. Lastly, it is expected that correlating standard objective metrics with listening tests will highlight which metrics best reflect human perception.
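By way of illustration only, such a metric-versus-listening-test comparison might be computed as follows; the scores are made-up placeholder numbers, with STOI and MOS standing in for any objective metric and subjective rating:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-condition scores: an objective intelligibility
# metric (STOI) and mean opinion scores (MOS) from a listening test.
stoi = np.array([0.61, 0.68, 0.74, 0.80, 0.85, 0.91])
mos = np.array([2.1, 2.6, 2.9, 3.4, 3.8, 4.3])

r, _ = pearsonr(stoi, mos)     # linear agreement
rho, _ = spearmanr(stoi, mos)  # rank (monotonic) agreement
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

A metric that correlates strongly with the listening-test scores across conditions would be the preferred proxy for human perception.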