Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge
- Authors: Chen H.; Wu S.; Wang C.; Du J.; Lee C.-H.; Siniscalchi S.M.; Watanabe S.; Chen J.; Scharenborg O.; Wang Z.-Q.; Yin B.-C.; Pan J.
- Publication year: 2024
- Type: Preface/Afterword
- OA Link: http://hdl.handle.net/10447/663741
Abstract
Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), which has proven particularly effective in complex acoustic scenarios. However, even the most sophisticated AVSR systems have shown performance limitations. Inspired by traditional robust speech recognition systems, in which a speech enhancement front-end can significantly improve accuracy, the MISP 2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the submitted systems. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.
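To make the AVTSE front-end idea concrete, below is a minimal sketch of a mask-based audio-visual target speaker extraction module that fuses the mixture spectrogram with the target speaker's lip-movement embeddings. The class name, dimensions, and fusion scheme are illustrative assumptions and do not correspond to any specific MISP 2023 system.

```python
# Hypothetical sketch of an audio-visual target speaker extraction (AVTSE)
# front-end operating on magnitude spectrograms. All names, dimensions, and
# the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class AVTSEFrontEnd(nn.Module):
    """Estimate a time-frequency mask for the target speaker by fusing
    mixture audio features with the target speaker's lip-movement features."""

    def __init__(self, n_freq: int = 257, visual_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)       # encode mixture magnitude spectra
        self.visual_proj = nn.Linear(visual_dim, hidden)  # encode lip-embedding stream
        self.fusion = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag: torch.Tensor, lip_emb: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, n_freq) magnitude spectrogram of the mixture
        # lip_emb: (batch, frames, visual_dim) lip embeddings, assumed to be
        #          pre-aligned to the audio frame rate
        a = self.audio_proj(mix_mag)
        v = self.visual_proj(lip_emb)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)          # per-bin mask in [0, 1]
        return mask * mix_mag                 # enhanced target-speaker magnitude


if __name__ == "__main__":
    model = AVTSEFrontEnd()
    mix = torch.rand(2, 100, 257)     # dummy mixture spectrogram
    lips = torch.rand(2, 100, 512)    # dummy lip-embedding sequence
    enhanced = model(mix, lips)
    print(enhanced.shape)             # torch.Size([2, 100, 257])
```

In a full pipeline of the kind the abstract describes, the enhanced output would then be passed to a back-end ASR system, so that the visual cue helps suppress interfering speakers before recognition.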