20 Newsgroup Subset Datasets

These datasets are subsets of the original 20 Newsgroups (20NG) corpus, provided in term-document format only. We used the 20NG collection as a source for artificially constructed datasets because it contains a range of topics that overlap to varying degrees. From the collection we derived a large number of smaller datasets for which the correct value of k, the number of clusters, is known.

We constructed 84 sets in total, 12 for each value of k in the range [2,8]. The datasets are divided as follows:

s* - * - *: 42 datasets with clusters that are reasonably compact and well-separated.
o* - * - *: 42 datasets with clusters that overlap considerably.

Alternatively, the datasets can be divided into sub-groups according to the relative sizes of their clusters:

*b - * - *: 28 datasets with balanced clusters containing 500 documents each.
*s - * - *: 28 datasets with unbalanced clusters where one cluster contains 10% of the documents in the dataset.
*l - * - *: 28 datasets with unbalanced clusters where one cluster contains 60% of the documents.

In all cases the documents were randomly drawn from each class.
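
The naming patterns and counts above can be cross-checked with a short sketch. It assumes, hypothetically, that names look like "sb-3-1": a separation letter, a balance letter, then k, then a replicate index, with two replicates per combination. Only the two prefix letters are stated above; the meaning of the remaining fields and the exact separators are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Prefix letters as described in the lists above.
SEPARATION = {"s": "well-separated", "o": "overlapping"}
BALANCE = {"b": "balanced", "s": "one cluster with 10%", "l": "one cluster with 60%"}
K_VALUES = range(2, 9)   # k in [2, 8]
REPLICATES = (1, 2)      # assumption: two datasets per (prefix, k) combination


def parse_name(name):
    """Split a hypothetical name such as 'sb-3-1' into its design factors."""
    prefix, k, replicate = name.split("-")
    return {
        "separation": SEPARATION[prefix[0]],
        "balance": BALANCE[prefix[1]],
        "k": int(k),
        "replicate": int(replicate),
    }


# Enumerate every name implied by the assumed scheme.
names = [
    f"{s}{b}-{k}-{r}"
    for s, b, k, r in product(SEPARATION, BALANCE, K_VALUES, REPLICATES)
]

by_k = Counter(parse_name(n)["k"] for n in names)
by_separation = Counter(n[0] for n in names)
by_balance = Counter(n[1] for n in names)

print(len(names), dict(by_k), dict(by_separation), dict(by_balance))
```

Under these assumptions the enumeration reproduces the figures quoted above: 84 names in total, 12 per value of k, 42 per separation prefix and 28 per balance prefix.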

Downloads

- Download well-separated datasets
- Download overlapping datasets
- Full list of 20NG subset contents

File Formats

The datasets have been pre-processed, with stop-word removal and stemming already applied. In addition, terms occurring in fewer than three documents have been eliminated. The files contained in the archives given above have the following formats:

*.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
*.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
*.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
*.classes: Assignment of documents to natural classes, with each line corresponding to a document.
*.urls: Links to original articles, where appropriate.
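
As a rough illustration of how these files fit together, the sketch below reads one dataset using only the Python standard library. It assumes the .mtx file uses the Matrix Market coordinate layout (comment lines starting with "%", a size line, then 1-based "row column value" entries) and that all files of a dataset share a common stem; the function names and the toy file contents are hypothetical. The per-line layout of the .classes file is not specified above, so its lines are returned unparsed. In practice a library reader such as scipy.io.mmread could replace read_mtx.

```python
import os
import tempfile


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def read_mtx(path):
    """Read a Matrix Market coordinate file into (n_rows, n_cols, entries).

    entries maps 0-based (row, column) pairs to values.
    """
    shape = None
    entries = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip the header and comment lines
            fields = line.split()
            if shape is None:
                # First data line: number of rows, columns and stored entries.
                shape = (int(fields[0]), int(fields[1]))
                continue
            # Matrix Market indices are 1-based; convert to 0-based.
            entries[(int(fields[0]) - 1, int(fields[1]) - 1)] = float(fields[2])
    return shape[0], shape[1], entries


def load_dataset(stem):
    """Load the files sharing a stem and check that terms are rows, documents columns."""
    n_terms, n_docs, counts = read_mtx(stem + ".mtx")
    terms = read_lines(stem + ".terms")
    docs = read_lines(stem + ".docs")
    classes = read_lines(stem + ".classes")  # kept as raw lines
    if len(terms) != n_terms or len(docs) != n_docs:
        raise ValueError("matrix shape does not match the .terms/.docs files")
    return {"counts": counts, "terms": terms, "docs": docs, "classes": classes}


# Toy files standing in for a real archive member (contents are made up).
toy = {
    ".mtx": "%%MatrixMarket matrix coordinate integer general\n"
            "3 2 4\n1 1 2\n2 1 1\n2 2 3\n3 2 1\n",
    ".terms": "graphic\nwindow\nhockey\n",
    ".docs": "doc1\ndoc2\n",
    ".classes": "comp.graphics\nrec.sport.hockey\n",
}
with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "toy")
    for ext, text in toy.items():
        with open(stem + ext, "w", encoding="utf-8") as fh:
            fh.write(text)
    data = load_dataset(stem)

# Frequency of the term in row 1 ("window") within the document in column 1 ("doc2").
print(data["terms"][1], data["docs"][1], data["counts"][(1, 1)])
```

Storing the matrix as a dictionary keeps the sketch dependency-free; for the full datasets a sparse matrix type would be a more practical choice.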
