20 Newsgroup Subset Datasets

Subsets of the original 20 Newsgroups (20NG) corpus, provided in term-document format only. We used the 20NG collection as a source for artificially constructed datasets because it contains a range of topics that overlap to varying degrees. From the collection we derived a large number of smaller datasets for which the correct number of clusters, k, is known.


We constructed 84 sets in total, 12 for each value of k in the range [2,8]. The datasets are divided as follows:

Alternatively, these datasets can be divided into sub-groups whose clusters have different relative sizes:

In all cases the documents were randomly drawn from each class.
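
As a rough illustration of how a subset with a known k could be assembled, the sketch below draws a fixed number of documents at random, without replacement, from each chosen class. It is not the script used to build these datasets: the function name draw_subset, the class labels, the document IDs, the per-class sizes and the seed are all hypothetical.

```python
import random

def draw_subset(docs_by_class, sizes, seed=0):
    """Randomly draw sizes[label] documents, without replacement, from each class.

    docs_by_class: dict mapping a class label to a list of document IDs.
    sizes: dict mapping each selected class label to the number of documents to draw.
    Both structures are assumptions made for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    subset = {}
    for label, n in sizes.items():
        subset[label] = rng.sample(docs_by_class[label], n)
    return subset

# Toy stand-in for the corpus: three hypothetical classes of 50 document IDs each.
docs_by_class = {
    "class_a": [f"a{i}" for i in range(50)],
    "class_b": [f"b{i}" for i in range(50)],
    "class_c": [f"c{i}" for i in range(50)],
}

# Unequal sizes give clusters of different proportions.
sizes = {"class_a": 10, "class_b": 10, "class_c": 5}
subset = draw_subset(docs_by_class, sizes)

# The number of classes drawn from is the known k for this subset.
k = len(subset)
```

Because every document keeps the label of the class it was drawn from, the resulting subset comes with its reference partition, which is what makes the correct k known in advance.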


- Download well-separated datasets
- Download overlapping datasets
- Full list of 20NG subset contents

File Formats

The datasets have been pre-processed, with stop-word removal and stemming already applied. In addition, terms occurring in fewer than three documents have been eliminated. The files contained in the archives given above have the following formats: