BBC Datasets
These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format. If you make use of these datasets, please reference the publication: Greene, D. and Cunningham, P. (2006), "Practical solutions to the problem of diagonal dominance in kernel document clustering", Proc. 23rd International Conference on Machine Learning (ICML 2006).
Dataset: BBC
All rights, including copyright, in the content of the original articles are owned by the BBC.
Dataset: BBCSport
All rights, including copyright, in the content of the original articles are owned by the BBC.
File Formats
The datasets have already been pre-processed: stemming (Porter algorithm), stop-word removal (stop word list) and low term frequency filtering (count < 3) have been applied to the data. The files contained in the archives given above have the following formats:
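As a rough illustration of the pre-processing steps named above (not of the file formats themselves), the following Python sketch applies stop-word removal and low term frequency filtering to a toy corpus. It is only a sketch under stated assumptions: the stop-word list is a tiny hypothetical one, the function name `preprocess` is invented for this example, "count < 3" is assumed to mean the total count of a term across the whole corpus, and Porter stemming is left out to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only; the actual list
# used for the datasets is not reproduced here.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

# Assumption: terms whose corpus-wide count is below 3 are dropped.
MIN_COUNT = 3


def preprocess(docs, stop_words=STOP_WORDS, min_count=MIN_COUNT):
    """Tokenize on whitespace, remove stop words, drop low-frequency terms.

    Stemming is intentionally omitted from this sketch.
    """
    # Lowercase, split on whitespace, and discard stop words.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    # Count each remaining term over the whole corpus.
    counts = Counter(tok for doc in tokenized for tok in doc)
    # Keep only terms that occur at least min_count times overall.
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized]


docs = [
    "the striker scored in the final",
    "the final was a close match",
    "a late goal decided the final",
]
filtered = preprocess(docs)
```

In this toy corpus only one term occurs three or more times, so every other term is filtered out. On a real collection the same threshold removes rare terms while keeping most of the vocabulary that is useful for clustering.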
Last Updated (Wednesday, 03 June 2009)