A two-stage LDA algorithm for ranking induced topic readability
- Authors: Mariangela Sciandra; Alessandro Albano
- Publication year: 2022
- Type: Contributo in atti di convegno pubblicato in volume
- OA Link: http://hdl.handle.net/10447/564283
Abstract
Probabilistic topic models, such as LDA, are standard text analysis algorithms that provide predictive and latent topic representation for a corpus. However, due to the unsupervised training process, it is difficult to verify the assumption that the latent space discovered by these models is generally meaningful and valuable. This paper introduces a two-stage LDA algorithm to estimate latent topics in text documents and use readability scores to link the identified topics to a linguistically motivated latent structure. We define a new interpretative tool called induced topic readability, which is used to rank topics from the one with the most complex linguistic structure to the one with the lowest semantic content readily. The usefulness of our method is shown with an application to real data, using articles from the New York Times.