Coherence of Descriptors in Topic Modeling
This page contains supplementary material for the paper:
D. O'Callaghan, D. Greene, J. Carthy, P. Cunningham (2015). "An Analysis of the Coherence of Descriptors in Topic Modeling". Expert Systems with Applications.
Summary
In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. A number of studies have proposed measures for analyzing such coherence, but these have largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora, in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that it may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
Data
For evaluation purposes, we created a number of text corpora that have annotated "ground truth" category classes for documents. Details of these corpora are as follows:
| Corpus | Documents | Terms | Classes | Description |
|---|---|---|---|---|
| BBC | 161,469 | 17,079 | 40 | General news articles from bbc.com and bbc.co.uk |
| Guardian | 194,153 | 22,141 | 24 | General news articles from theguardian.com and guardian.co.uk |
| NYT 2003 | 70,134 | 20,429 | 20 | A subset of the New York Times Annotated Corpus from 2003. |
| NYT 2000+ (10%) | 65,335 | 21,461 | 20 | A subset of the New York Times Annotated Corpus, containing a stratified 10% sample of articles from 2000 to 2007. |
| Wikipedia (high-level) | 5,682 | 28,699 | 6 | Subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject. |
| Wikipedia (low-level) | 4,970 | 24,265 | 10 | Another Wikipedia subset from January 2014, where articles are labeled with fine-grained WikiProject sub-groups. |
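The distributional-semantics coherence measure mentioned in the summary can be sketched as the mean pairwise similarity between the word vectors of a topic's descriptor terms. The vectors below are toy values for illustration only; in practice, embeddings such as word2vec vectors trained on a background corpus would be used.

```python
# Sketch: topic coherence as mean pairwise cosine similarity of the
# word vectors of a topic's descriptor terms. Vectors are toy values.
import numpy as np

vectors = {
    "market":    np.array([0.9, 0.1, 0.0]),
    "shares":    np.array([0.8, 0.2, 0.1]),
    "investors": np.array([0.85, 0.15, 0.05]),
    "football":  np.array([0.1, 0.9, 0.2]),
}

def coherence(terms, vectors):
    """Mean pairwise cosine similarity over all term pairs in a descriptor."""
    sims = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            a, b = vectors[terms[i]], vectors[terms[j]]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))

print(coherence(["market", "shares", "investors"], vectors))  # semantically related: high
print(coherence(["market", "shares", "football"], vectors))   # mixed topic: lower
```

A descriptor whose terms occupy a tight neighborhood in the embedding space scores higher, matching the intuition that its top terms are semantically interpretable as a single topic.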
Download
Pre-processed versions of four of the corpora will be made available here shortly for research purposes only.
Unfortunately, due to licensing restrictions, we are unable to make either New York Times corpus available. The original complete corpus is available here. To recreate our corpora, the subset of document IDs that we used for NYT 2003 and NYT 2000+ (10%) will be provided shortly.
Contact
In the meantime, for further information, please contact Derek O'Callaghan.