Statistically validated network for analysing textual data
- Authors: Simonetti, Andrea; Albano, Alessandro; Tumminello, Michele; Di Matteo, T.
- Year of publication: 2025
- Type: Journal article
- OA Link: http://hdl.handle.net/10447/673424
Abstract
This paper presents a novel methodology, the Word Co-occurrence SVN topic model (WCSVNtm), for document clustering and topic modeling in textual datasets. The method represents the corpus as a bipartite network of words and documents, from which a Statistically Validated Network (SVN) is built by rigorously assessing the statistical significance of word co-occurrences within documents and of document overlap based on shared vocabulary. By applying the Leiden community detection algorithm to the SVN, distinct communities of words can be identified and interpreted as topics; similarly, documents can be grouped by thematic similarity. We demonstrate the effectiveness of our approach on three datasets: a set of 120 Wikipedia articles, the arXiv10 dataset of 100,000 abstracts from scientific papers, and a sampled subset of 10,000 documents from the original arXiv10. To benchmark our results, we compare our approach with several well-established models in the field of topic modeling and document clustering, including the hierarchical Stochastic Block Model (hSBM), BERTopic, and Latent Dirichlet Allocation (LDA). The results show that WCSVNtm achieves competitive performance across all datasets while automatically selecting the number of topics and document clusters, whereas state-of-the-art methods require prior knowledge or additional tuning. Finally, advances in community detection algorithms could further improve our method.
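The validation step behind an SVN is typically a hypergeometric test: a link between two words is kept only if they co-occur in more documents than expected by chance, after multiple-testing correction. The abstract does not give the exact procedure, so the sketch below is a minimal, stdlib-only illustration of that general idea (function names and the Bonferroni choice are assumptions, not the paper's specification):

```python
from math import comb
from itertools import combinations

def hypergeom_sf(k, N, na, nb):
    """P(X >= k) when X ~ Hypergeom(N, na, nb): the chance that two words
    appearing in na and nb of N documents share at least k documents."""
    total = comb(N, nb)
    return sum(comb(na, i) * comb(N - na, nb - i)
               for i in range(k, min(na, nb) + 1)) / total

def validated_links(doc_words, alpha=0.01):
    """Return word pairs whose co-occurrence survives a Bonferroni-corrected
    hypergeometric test -- a sketch of the SVN validation step."""
    N = len(doc_words)
    docs_of = {}  # word -> set of document indices containing it
    for d, words in enumerate(doc_words):
        for w in set(words):
            docs_of.setdefault(w, set()).add(d)
    pairs = list(combinations(sorted(docs_of), 2))
    threshold = alpha / len(pairs)  # Bonferroni correction over all tests
    links = []
    for a, b in pairs:
        k = len(docs_of[a] & docs_of[b])  # observed co-occurrence count
        if k and hypergeom_sf(k, N, len(docs_of[a]), len(docs_of[b])) < threshold:
            links.append((a, b))
    return links

# Toy corpus: two strongly separated word pairs are validated as links.
docs = [["cat", "dog"]] * 10 + [["sun", "moon"]] * 10
print(validated_links(docs))  # [('cat', 'dog'), ('moon', 'sun')]
```

A community detection algorithm such as Leiden would then be run on the graph of validated links to extract topics; the symmetric test on documents sharing vocabulary yields the document clusters.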