SABATO MARCO SINISCALCHI

Lightweight Audio-Visual Wake Word Spotting with Diverse Acoustic Knowledge Distillation

  • Authors: Li K.W.; Chen H.; Du J.; Zhou H.S.; Siniscalchi S.M.; Niu S.T.; Xiong S.F.
  • Publication year: 2025
  • Type: Journal article
  • OA Link: http://hdl.handle.net/10447/674965

Abstract

Audio-Visual Wake Word Spotting (AVWWS) aims to accurately detect user-defined keywords by leveraging the complementary nature of different modalities in challenging acoustic environments. However, two primary challenges hinder the application of AVWWS models in real-world scenarios: the increase in model parameters introduced by the video modality and the scarcity of paired audio-visual data. To address these issues, we propose a novel diverse acoustic knowledge distillation (DAKD) framework, which uses easily accessible single-modality audio data to train two teacher models and employs cross-modal knowledge distillation to transfer the teachers' generalization and denoising capabilities to the audio-visual student model. This approach mitigates the overfitting risk associated with large parameter counts and limited data. The DAKD framework consists of an audio-visual student model based on the lightweight multi-scale temporal-spatial attention (LMTSA) architecture, a multi-conditional teacher (MCT) model, and a denoising teacher (DNT) model. The LMTSA model integrates compact 3D and 2D blocks based on the ResNet architecture through a simple attention module and accepts multi-scale supervision from word-level and phone-level labels, achieving joint temporal-spatial modeling with minimal parameter usage. The MCT and DNT models are trained on extensive real or simulated far-field speech and on paired near-field and far-field speech, respectively, to transfer generalization to unseen acoustic environments and denoising capability to the audio-visual student model. The effectiveness of the proposed DAKD framework is validated through comprehensive experiments on the MISP2021 dataset and the updated MISP2021 Eval Hard dataset, establishing new benchmarks with fewer parameters. Our code will be available at https://github.com/wikkk-tp/AVWWS_DAKD
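
The abstract does not give the distillation objective, but the two-teacher setup it describes can be illustrated with a minimal, hypothetical sketch: the student's softened output distribution is pulled toward the softened outputs of both the MCT and DNT teachers via temperature-scaled KL divergence. The function names, temperature, and weighting factor `alpha` below are assumptions for illustration, not the authors' formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature softens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): divergence of the student distribution q from teacher p.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_teacher_kd_loss(student_logits, mct_logits, dnt_logits,
                         temperature=2.0, alpha=0.5):
    # Hypothetical combined distillation loss: alpha weights the
    # multi-conditional teacher (MCT) term against the denoising
    # teacher (DNT) term; both targets use the same temperature.
    q = softmax(student_logits, temperature)
    p_mct = softmax(mct_logits, temperature)
    p_dnt = softmax(dnt_logits, temperature)
    return alpha * kl_divergence(p_mct, q) + (1 - alpha) * kl_divergence(p_dnt, q)
```

The loss is zero when the student already matches both teachers and grows as their softened predictions diverge; in practice it would be combined with the word-level and phone-level supervision mentioned above.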