Stability Analysis + Topic Modeling Data
This page contains supplementary material for the paper:
D. Greene, D. O'Callaghan, P. Cunningham (2014). "How Many Topics? Stability Analysis for Topic Models". Proc. European Conference on Machine Learning (ECML'14). [PDF] [BibTeX]
Summary
Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the over-clustering of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data.
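To make the robustness idea concrete, one simple way to score the agreement between the topics produced by two models (for example, a model fitted on the full corpus and one fitted on a perturbed sample) is the average Jaccard similarity of their top-term rankings. This is a hedged sketch of that general idea, not necessarily the exact measure used in the paper:

```python
def average_jaccard(ranking_a, ranking_b, depth):
    """Mean Jaccard similarity of the top-d term sets, for d = 1..depth.

    ranking_a, ranking_b: lists of terms ordered by topic weight.
    Returns 1.0 for identical rankings, 0.0 for disjoint ones.
    """
    total = 0.0
    for d in range(1, depth + 1):
        top_a, top_b = set(ranking_a[:d]), set(ranking_b[:d])
        total += len(top_a & top_b) / len(top_a | top_b)
    return total / depth


# Two topic descriptors that share their top terms score highly;
# unrelated descriptors score near zero.
score = average_jaccard(["economy", "tax", "budget"],
                        ["economy", "tax", "growth"], depth=3)
```

A model whose topics retain high average scores under perturbations of the data would, by this logic, be considered more stable than one whose topics shift substantially.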
Data
For evaluation purposes, we created a number of text corpora that have annotated "ground truth" category labels for documents. Details of these corpora are as follows:
| Corpus | Documents | Terms | Labels | Description |
|--------|-----------|-------|--------|-------------|
| bbc | 2,225 | 3,121 | 5 | General news articles from the BBC. See here for more details. |
| bbc-sport | 737 | 969 | 5 | Sports news articles from the BBC. See here for more details. |
| guardian-2013 | 6,520 | 10,801 | 6 | New corpus of news articles published by The Guardian during 2013. |
| irishtimes-2013 | 3,246 | 4,832 | 7 | New corpus of news articles published by The Irish Times during 2013. |
| nytimes-1999 | 9,551 | 12,987 | 4 | A subset of the New York Times Annotated Corpus from 1999. |
| nytimes-2003 | 11,527 | 15,001 | 7 | As above, with articles from 2003. |
| wikipedia-high | 5,738 | 17,311 | 6 | Subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject. |
| wikipedia-low | 4,986 | 15,441 | 10 | Another Wikipedia subset from January 2014. Articles are labeled with fine-grained WikiProject sub-groups. |
Download
Pre-processed versions of six of the corpora are made available here for research purposes only.
>> Download Pre-processed text corpora (35MB)
Unfortunately, due to licensing restrictions, we are unable to make the New York Times corpora available. The complete corpus is available from here. To recreate our corpora, the subsets of document IDs that we used for nytimes-1999 and nytimes-2003 are provided here, with each ID prefixed by its ground-truth category label.
File formats
The datasets were pre-processed as follows: stop-words were removed and low-frequency terms (count < 20) were filtered out, then log-based TF-IDF term weighting and L2 document-length normalization were applied. The files contained in the archive above have the following formats:
- *.mtx: The document-term matrix, represented as a sparse coordinate matrix in Matrix Market format.
- *.terms: List of terms in the corpus, with each line corresponding to a column of the sparse data matrix.
- *.docs: List of document identifiers, with each line corresponding to a row of the sparse data matrix.
- *.labels: Assignment of documents to "ground truth" categories, with each line giving the category label for the corresponding row of the sparse data matrix.
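The files above are plain text and straightforward to read. The sketch below parses a tiny in-memory sample in the same layout, so it is self-contained; the corpus prefix (e.g. `bbc`) and the exact file contents are illustrative assumptions, and in practice `scipy.io.mmread()` can load the `.mtx` file directly into a sparse matrix:

```python
# Minimal illustration of the archive's file formats, using in-memory
# samples in place of real files such as bbc.mtx / bbc.terms / bbc.docs.

MTX_SAMPLE = """%%MatrixMarket matrix coordinate real general
3 4 5
1 1 0.52
1 3 0.85
2 2 1.00
3 1 0.37
3 4 0.93
"""

TERMS_SAMPLE = "economy\nmatch\npolicy\nleague\n"  # one term per column
DOCS_SAMPLE = "doc001\ndoc002\ndoc003\n"           # one ID per row


def parse_mtx(text):
    """Parse a Matrix Market coordinate file into (shape, entries dict)."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        i, j, value = ln.split()
        # Matrix Market indices are 1-based; convert to 0-based.
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    assert len(entries) == n_entries
    return (n_rows, n_cols), entries


shape, entries = parse_mtx(MTX_SAMPLE)
terms = TERMS_SAMPLE.splitlines()  # column index -> term
docs = DOCS_SAMPLE.splitlines()    # row index -> document ID
```

With the real archive, reading `*.terms` and `*.docs` line by line gives the column and row labels for the matrix loaded from `*.mtx`.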
Contact
For further information please contact Derek Greene.