Coherence of Descriptors in Topic Modeling

This page contains supplementary material for the paper:
D. O'Callaghan, D. Greene, J. Carthy, P. Cunningham (2015). "An Analysis of the Coherence of Descriptors in Topic Modeling", Expert Systems with Applications.


Summary

In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
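
One of the measures mentioned above is based on distributional semantics. As a rough illustration only, the sketch below scores a topic descriptor as the mean pairwise similarity of its top terms under a word2vec model trained with gensim; the exact formulation, training parameters and reference corpus used in the paper may differ, and the variable names here are placeholders.

# Illustrative sketch of a distributional-semantics coherence score:
# the mean pairwise embedding similarity of a topic's top terms.
from itertools import combinations
from gensim.models import Word2Vec

def w2v_coherence(top_terms, model):
    # Skip terms that are missing from the embedding vocabulary.
    terms = [t for t in top_terms if t in model.wv]
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    return sum(model.wv.similarity(a, b) for a, b in pairs) / len(pairs)

# Example usage (placeholders): train embeddings on the tokenised corpus,
# then score each topic descriptor and average across topics.
# sentences = [doc.split() for doc in documents]
# model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
# scores = [w2v_coherence(terms, model) for terms in topic_descriptors]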


Data

For evaluation purposes, we created a number of text corpora in which each document is annotated with a "ground truth" category class. Details of these corpora are as follows:

Corpus                  Documents  Terms   Classes  Description
BBC                     161,469    17,079  40       General news articles from bbc.com and bbc.co.uk.
Guardian                194,153    22,141  24       General news articles from theguardian.com and guardian.co.uk.
NYT 2003                70,134     20,429  20       A subset of the New York Times Annotated Corpus from 2003.
NYT 2000+ (10%)         65,335     21,461  20       A subset of the New York Times Annotated Corpus, containing a stratified 10% sample of articles from 2000 to 2007.
Wikipedia (high-level)  5,682      28,699  6        A subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject.
Wikipedia (low-level)   4,970      24,265  10       Another subset of the January 2014 Wikipedia dump, where articles are labeled with fine-grained WikiProject sub-groups.
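
As an illustration of how these corpora are used in the comparison, the sketch below fits NMF and LDA models with scikit-learn and extracts the top-term descriptors to which the coherence measures are applied. The loader function, file name and parameter settings are placeholders, not the configuration used in the paper.

# Illustrative sketch: fit NMF and LDA and extract top-term topic descriptors.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def top_terms(components, vocab, n_terms=10):
    # Highest-weighted terms per topic, used as the topic descriptors.
    return [[vocab[i] for i in row.argsort()[::-1][:n_terms]] for row in components]

documents = load_corpus("bbc.txt")   # hypothetical loader for one of the corpora above
k = 40                               # e.g. the number of annotated classes

# NMF is typically applied to a TF-IDF weighted document-term matrix ...
tfidf = TfidfVectorizer(stop_words="english", min_df=5)
X = tfidf.fit_transform(documents)
nmf = NMF(n_components=k, init="nndsvd", random_state=1).fit(X)
nmf_descriptors = top_terms(nmf.components_, tfidf.get_feature_names_out())

# ... while LDA operates on raw term frequencies.
counts = CountVectorizer(stop_words="english", min_df=5)
C = counts.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=k, random_state=1).fit(C)
lda_descriptors = top_terms(lda.components_, counts.get_feature_names_out())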


Download

Pre-processed versions of four of the corpora will be made available here shortly for research purposes only.

Unfortunately, due to licensing restrictions, we are unable to make either of the New York Times corpora available. The complete original corpus is available here. To allow our corpora to be recreated, the document IDs used for NYT 2003 and NYT 2000+ (10%) will be provided shortly.
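
Once the ID lists are available, a subset could be rebuilt from a local copy of the New York Times Annotated Corpus along the lines of the sketch below. The file names, ID format and directory layout here are assumptions, not the project's actual tooling.

# Rough sketch: copy the corpus articles whose IDs appear in an ID list file.
from pathlib import Path

def recreate_subset(id_file, corpus_dir, out_dir):
    wanted = {line.strip() for line in open(id_file) if line.strip()}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = 0
    for path in Path(corpus_dir).rglob("*.xml"):
        if path.stem in wanted:          # assumes IDs match the XML file names
            (out / path.name).write_bytes(path.read_bytes())
            kept += 1
    return kept

# recreate_subset("nyt2003_ids.txt", "/data/nyt_corpus", "/data/nyt2003_subset")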

Contact

In the meantime, for further information please contact Derek O'Callaghan.