BBC Datasets

These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format. 

If you make use of these datasets please reference the publication:

Greene, D. and Cunningham, P. (2006), "Practical solutions to the problem of diagonal dominance in kernel document clustering", Proc. 23rd International Conference on Machine Learning (ICML 2006).

Dataset: BBC

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
  • Documents: 2225, Terms: 9636
  • Natural Classes: 5 (business, entertainment, politics, sport, tech)

Download dataset

Dataset: BBCSport

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005.
  • Documents: 737, Terms: 4613
  • Natural Classes: 5 (athletics, cricket, football, rugby, tennis)

Download dataset

File Formats

The datasets have already been pre-processed as follows: stemming (Porter algorithm), stop-word removal (using a stop word list), and low term frequency filtering (terms with count < 3 removed). The files contained in the archives given above have the following formats:

  • *.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
  • *.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
  • *.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
  • *.classes: Assignment of documents to natural classes, with each line corresponding to a document.
  • *.urls: Links to original articles, where appropriate.
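
As a rough illustration of how these files fit together, the sketch below reads a Matrix Market file in coordinate format and checks it against the term and document lists. It is only a minimal sketch under stated assumptions: the function names (read_lines, read_mtx, load_corpus) and the file prefix are hypothetical, the matrix is assumed to be stored in "coordinate ... general" layout with one "row col value" triple per line, and the term and document files are assumed to hold one entry per line. In practice a library reader such as scipy.io.mmread could be used instead.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_mtx(path):
    """Parse a Matrix Market coordinate file into (n_rows, n_cols, entries).

    entries maps (row, col) -> value using 0-based indices; the file itself
    uses 1-based indices. Assumes a 'coordinate ... general' matrix.
    """
    lines = read_lines(path)
    if not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing Matrix Market header")
    # Skip the header line and any '%' comment lines.
    body = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(tok) for tok in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


def load_corpus(prefix):
    """Load <prefix>.mtx, <prefix>.terms and <prefix>.docs and check they agree.

    Rows of the matrix correspond to terms and columns to documents.
    """
    n_rows, n_cols, entries = read_mtx(f"{prefix}.mtx")
    terms = read_lines(f"{prefix}.terms")
    docs = read_lines(f"{prefix}.docs")
    if len(terms) != n_rows or len(docs) != n_cols:
        raise ValueError("terms/docs do not match the matrix dimensions")
    return terms, docs, entries
```

With the archive unpacked in the working directory, a call such as load_corpus("bbc") (the prefix is an assumption about the file names) would return the term list, the document list, and a dictionary of term frequencies keyed by (term index, document index). The dimension check is a cheap way to confirm that each line of the term file lines up with a row and each line of the document file with a column.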

Last Updated (Wednesday, 03 June 2009)