20 Newsgroup Subset Datasets
Subsets of the original 20 Newsgroups corpus, in term-document format only. We used the 20NG collection as a source for artificially constructed datasets because it contains a range of topics that overlap to varying degrees. From the collection we derived a large number of smaller datasets for which the correct number of clusters, k, is known.
Contents
We constructed 84 datasets in total, 12 for each value of k in the range [2,8]. The datasets are divided as follows:
- s* - * - *: 42 datasets with clusters that are reasonably compact and well-separated.
- o* - * - *: 42 datasets with clusters that overlap considerably.
Alternatively, the datasets can be divided into sub-groups according to the relative sizes of their clusters:
- *b - * - *: 28 datasets with balanced clusters containing 500 documents each.
- *s - * - *: 28 datasets with unbalanced clusters where one cluster contains 10% of the documents in the dataset.
- *l - * - *: 28 datasets with unbalanced clusters where one cluster contains 60% of the documents.
In all cases the documents were randomly drawn from each class.
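The counts above are mutually consistent, as a short check shows. The sketch below assumes a fully crossed design, that is, two datasets for every combination of separation type, balance type, and k. The source does not state this explicitly, and the name `copies_per_cell` is hypothetical.

```python
from collections import Counter
from itertools import product

separations = ["s", "o"]      # well-separated, overlapping
balances = ["b", "s", "l"]    # balanced, one small cluster, one large cluster
k_values = range(2, 9)        # k in [2, 8]
copies_per_cell = 2           # assumption: not stated in the source

# Enumerate every (separation, balance, k, copy) combination.
grid = list(product(separations, balances, k_values, range(copies_per_cell)))

total = len(grid)
per_separation = Counter(sep for sep, _, _, _ in grid)
per_balance = Counter(bal for _, bal, _, _ in grid)
per_k = Counter(k for _, _, k, _ in grid)

print(total, dict(per_separation), dict(per_balance), dict(per_k))
```

Under that assumption the totals match the figures given: 84 datasets overall, 42 per separation type, 28 per balance type, and 12 per value of k.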
Downloads
- Download well-separated datasets
- Download overlapping datasets
- Full list of 20NG subset contents
File Formats
The datasets have been pre-processed, with stop-word removal and stemming already applied. In addition, terms occurring in fewer than three documents have been eliminated. The files contained in the archives given above have the following formats:
- *.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
- *.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
- *.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
- *.classes: Assignment of documents to natural classes, with each line corresponding to a document.
- *.urls: Links to original articles, where appropriate.
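The files for one dataset can be read together as in the minimal sketch below. It uses only the Python standard library and assumes that the .mtx file is in Matrix Market coordinate format with a numeric value on every entry line. The function names and the `prefix` argument are hypothetical. The .classes and .urls files are left out because their exact line layout is not specified here.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_mtx(path):
    """Parse a Matrix Market coordinate file.

    Returns (n_rows, n_cols, entries), where entries maps a 0-based
    (row, col) pair to the stored value.
    """
    entries = {}
    shape = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip the banner, comments, and blank lines
            fields = line.split()
            if shape is None:
                # The first data line holds: rows, columns, non-zero count.
                shape = (int(fields[0]), int(fields[1]))
                continue
            # Matrix Market indices are 1-based; convert to 0-based.
            row, col = int(fields[0]) - 1, int(fields[1]) - 1
            entries[(row, col)] = float(fields[2])
    if shape is None:
        raise ValueError(f"no size line found in {path}")
    return shape[0], shape[1], entries


def load_dataset(prefix):
    """Load the term-document counts, term list, and document list."""
    n_terms, n_docs, counts = read_mtx(f"{prefix}.mtx")
    terms = read_lines(f"{prefix}.terms")
    docs = read_lines(f"{prefix}.docs")
    # Rows correspond to terms and columns to documents.
    if len(terms) != n_terms:
        raise ValueError("term list length does not match matrix rows")
    if len(docs) != n_docs:
        raise ValueError("document list length does not match matrix columns")
    return terms, docs, counts
```

With the returned values, `counts[(i, j)]` is the frequency of `terms[i]` in `docs[j]`. If SciPy is available, `scipy.io.mmread` reads the same .mtx file directly into a sparse matrix.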