The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction
- Authors: Wu, Shilong; Wang, Chenxi; Chen, Hang; Dai, Yusheng; Zhang, Chenyue; Wang, Ruoyu; Lan, Hongbo; Du, Jun; Lee, Chin-Hui; Chen, Jingdong; Siniscalchi, Sabato Marco; Scharenborg, Odette; Wang, Zhong-Qiu; Pan, Jia; Gao, Jianqing
- Publication year: 2024
- Type: Contributo in atti di convegno pubblicato in volume
- OA Link: http://hdl.handle.net/10447/638753
Abstract
Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.