BBC Datasets

Two news article datasets, originating from BBC News, are provided for use as benchmarks for machine learning research.
These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format. If you make use of these datasets, please consider citing the following publication:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.


Dataset: BBC

All rights, including copyright, in the content of the original articles are owned by the BBC.

>> Download pre-processed dataset

>> Download raw text files


Dataset: BBCSport

All rights, including copyright, in the content of the original articles are owned by the BBC.

>> Download pre-processed dataset

>> Download raw text files


File formats

The following pre-processing steps have already been applied to the data: stemming (Porter algorithm), stop-word removal (using a stop word list), and low term frequency filtering (count < 3).

The files contained in the archives given above have the following formats:
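
As a rough illustration of the pre-processing steps named above (stemming, stop-word removal, low term frequency filtering), the following Python sketch builds a sparse term-by-document count table from raw text. It is not the code used to produce these datasets. Several details are assumptions made for the example: the stop-word list is a small hypothetical one, stop words are removed before stemming, "count < 3" is read as the total count of a term across the whole corpus, and toy_stem is a placeholder that is not the Porter algorithm (a real Porter stemmer could be passed in via the stem argument).

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only; the list actually
# used for the BBC datasets is not reproduced here.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def toy_stem(token):
    """Placeholder stemmer (NOT the Porter algorithm): strips a trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def preprocess(documents, stem=toy_stem, stop_words=STOP_WORDS, min_count=3):
    """Return (vocabulary, counts) for a list of raw text documents.

    counts maps (term_index, document_index) to the term's frequency in
    that document; pairs with zero frequency are simply absent.
    """
    # Lowercase, tokenize on letters, drop stop words, then stem.
    # (Assumption: stop words are removed before stemming.)
    tokenized = [
        [stem(tok) for tok in re.findall(r"[a-z]+", doc.lower())
         if tok not in stop_words]
        for doc in documents
    ]

    # Total count of each stemmed term across the whole corpus.
    totals = Counter(tok for doc in tokenized for tok in doc)

    # Low term frequency filtering: discard terms with count < min_count.
    # (Assumption: "count" means the corpus-wide total.)
    vocabulary = sorted(t for t, c in totals.items() if c >= min_count)
    index = {term: i for i, term in enumerate(vocabulary)}

    # Sparse term-by-document frequency table.
    counts = Counter()
    for doc_id, doc in enumerate(tokenized):
        for tok in doc:
            if tok in index:
                counts[(index[tok], doc_id)] += 1
    return vocabulary, dict(counts)


if __name__ == "__main__":
    docs = [
        "The goals of the match",
        "Goal after goal in the match",
        "A match is a match",
    ]
    vocabulary, counts = preprocess(docs)
    print(vocabulary)
    print(counts)
```

Passing the stemmer as a parameter keeps the sketch free of third-party dependencies while leaving room to substitute a proper Porter implementation.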


Contact

For further information please contact Derek Greene.