A Two-Stage Approach to Device-Robust Acoustic Scene Classification
- Authors: Hu, Hu; Yang, Chao-Han Huck; Xia, Xianjun; Bai, Xue; Tang, Xin; Wang, Yajian; Niu, Shutong; Chai, Li; Li, Juanjuan; Zhu, Hongning; Bao, Feng; Zhao, Yuanjun; Siniscalchi, Sabato Marco; Wang, Yannan; Du, Jun; Lee, Chin-Hui
- Publication year: 2021
- Type: Conference proceedings contribution published in a volume
- OA Link: http://hdl.handle.net/10447/636669
Abstract
To improve device robustness, a highly desirable feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system relies on an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, the proposed ASC system attains state-of-the-art accuracy on the development set: our best system, a two-stage fusion of CNN ensembles, delivers an average accuracy of 81.9% across multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
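The abstract does not spell out the score combination, so the following is only a minimal sketch of one plausible form of it: each ten-class probability is weighted by the probability of its parent broad class, and the result is renormalized. The class names, the `FINE_TO_BROAD` grouping, and the functions `combine_scores` and `predict` are all assumptions made for illustration and are not taken from the record.

```python
# Hypothetical grouping of ten fine-grained scenes into three broad classes.
# The abstract does not list the classes; this mapping is an assumption.
FINE_TO_BROAD = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "park": "outdoor",
    "public_square": "outdoor",
    "street_pedestrian": "outdoor",
    "street_traffic": "outdoor",
    "bus": "transportation",
    "metro": "transportation",
    "tram": "transportation",
}


def combine_scores(broad_probs, fine_probs):
    """Weight each fine-class probability by its parent broad-class
    probability, then renormalize so the scores sum to one.

    This product rule is an assumed stand-in for the paper's
    "ad-hoc score combination".
    """
    weighted = {
        fine: prob * broad_probs[FINE_TO_BROAD[fine]]
        for fine, prob in fine_probs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("combined scores sum to zero")
    return {fine: score / total for fine, score in weighted.items()}


def predict(broad_probs, fine_probs):
    """Return the fine-grained class with the highest combined score."""
    combined = combine_scores(broad_probs, fine_probs)
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Made-up classifier outputs, for illustration only.
    broad = {"indoor": 0.1, "outdoor": 0.2, "transportation": 0.7}
    fine = {name: 0.05 for name in FINE_TO_BROAD}
    fine["metro_station"] = 0.35  # top choice of the ten-class model alone
    fine["metro"] = 0.25
    print(predict(broad, fine))
```

With these made-up inputs, the ten-class model alone would pick `metro_station`, while the combined score favours `metro` because the three-class model strongly prefers the transportation group. This illustrates how a coarse classifier can correct confusions between acoustically similar scenes from different broad classes.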