# 20 Newsgroup Subset Datasets

These are subsets of the original 20 Newsgroups (20NG) corpus, provided in term-document format only. We used the 20NG collection as a source for artificially constructed datasets because it contains a range of topics that overlap to varying degrees. From the collection we derived a large number of smaller datasets for which the correct value of *k* (the number of clusters) is known.

### Contents

We constructed 84 sets in total, 12 for each value of *k* in the range [2,8]. The datasets are divided as follows:

- s* - * - *: 42 datasets with clusters that are reasonably compact and well-separated.
- o* - * - *: 42 datasets with clusters that overlap considerably.

Alternatively, the datasets can be divided into sub-groups according to the relative sizes of their clusters:

- *b - * - *: 28 datasets with balanced clusters containing 500 documents each.
- *s - * - *: 28 datasets with unbalanced clusters where one cluster contains 10% of the documents in the dataset.
- *l - * - *: 28 datasets with unbalanced clusters where one cluster contains 60% of the documents.

In all cases the documents were randomly drawn from each class.
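
The naming scheme above can be decoded programmatically. The following is a minimal sketch, assuming the first character of a dataset name encodes separation (`s` or `o`) and the second encodes balance (`b`, `s` or `l`), as the patterns listed above suggest. The function name `describe_dataset` is hypothetical, and the remaining parts of the name are left uninterpreted because their meaning is not stated here.

```python
# Meaning of the first two characters of a dataset name, taken from the
# lists above. Anything after these two characters is not interpreted.
SEPARATION = {
    "s": "well-separated clusters",
    "o": "overlapping clusters",
}
BALANCE = {
    "b": "balanced clusters (500 documents each)",
    "s": "unbalanced, one cluster holds 10% of the documents",
    "l": "unbalanced, one cluster holds 60% of the documents",
}


def describe_dataset(name):
    """Return (separation, balance) descriptions for a dataset name.

    Assumes the name starts with a separation code followed by a
    balance code; raises ValueError if either code is unknown.
    """
    if len(name) < 2:
        raise ValueError(f"name too short to decode: {name!r}")
    sep_code, bal_code = name[0], name[1]
    if sep_code not in SEPARATION or bal_code not in BALANCE:
        raise ValueError(f"unrecognised prefix in {name!r}")
    return SEPARATION[sep_code], BALANCE[bal_code]
```

For example, a name beginning with `ol` would be reported as an overlapping dataset in which one cluster holds 60% of the documents.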

### Downloads

- Download well-separated datasets

- Download overlapping datasets

- Full list of 20NG subset contents

### File Formats

The datasets have been pre-processed, with stop-word removal and stemming already applied. In addition, terms occurring in fewer than three documents have been eliminated. The files contained in the archives listed above have the following formats:

- *.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
- *.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
- *.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
- *.classes: Assignment of documents to natural classes, with each line corresponding to a document.
- *.urls: Links to original articles, where appropriate.
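
As an illustration of how these files fit together, here is a minimal loading sketch that uses only the Python standard library. It assumes the `.mtx` file uses the Matrix Market coordinate (sparse) layout with one "row column value" triple per line, and that all files of a dataset share a common path prefix. The function names and the `prefix` argument are hypothetical. The `.classes` lines are returned unparsed because their exact layout is not described here. In practice, `scipy.io.mmread` is a common alternative for reading the matrix itself.

```python
def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_matrix_market(path):
    """Parse a Matrix Market coordinate file.

    Returns (n_rows, n_cols, entries), where entries maps zero-based
    (row, col) pairs to the stored value.
    """
    entries = {}
    size = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip the banner and comment lines
            fields = line.split()
            if size is None:
                # First data line: rows, columns, number of stored entries.
                size = (int(fields[0]), int(fields[1]))
                continue
            # Indices in the file are 1-based; convert to 0-based.
            row, col = int(fields[0]) - 1, int(fields[1]) - 1
            entries[(row, col)] = float(fields[2]) if len(fields) > 2 else 1.0
    if size is None:
        raise ValueError(f"no size line found in {path}")
    return size[0], size[1], entries


def load_dataset(prefix):
    """Load the matrix and its row/column labels for one dataset prefix."""
    n_terms, n_docs, counts = read_matrix_market(prefix + ".mtx")
    terms = read_lines(prefix + ".terms")      # one term per matrix row
    docs = read_lines(prefix + ".docs")        # one identifier per matrix column
    classes = read_lines(prefix + ".classes")  # raw lines, left unparsed
    if len(terms) != n_terms or len(docs) != n_docs:
        raise ValueError("label files do not match the matrix dimensions")
    return {
        "shape": (n_terms, n_docs),
        "counts": counts,
        "terms": terms,
        "docs": docs,
        "classes": classes,
    }
```

With the result in hand, `counts[(i, j)]` gives the frequency of `terms[i]` in document `docs[j]`, and any pair missing from `counts` is an implicit zero.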