Datasets


A collection of novel and benchmark datasets produced by members of the Machine Learning Group and used in their experimental work:

Multi-View Twitter Datasets

A collection of Twitter datasets for evaluating multi-view analysis methods.

News Curation Datasets

A collection of Twitter datasets for evaluating criteria for Twitter user list curation.  

YouTube Dataset

A dataset collected to support the investigation of contemporary spam comment activity.

Detecting Grand Tours of Europe with Geo-Tags 

Supplementary data for an analysis of tourist behaviour, based on a collection of 95 million Flickr photos with known precise geographic coordinates (geo-tags).

Irish Economic Sentiment Collection 

A new text sentiment analysis collection, produced from three Irish online news sources.

3Sources Collection 

A multi-view text corpus, constructed from news articles from three online news services.

Synthetic Multi-view Datasets 

A set of synthetic text datasets for the evaluation of multi-view learning algorithms.

Yeast Literature Dataset  

A new text corpus, mined from biomedical literature, which refers to the terms used to describe S. cerevisiae ORFs.

CBR Conference Series Dataset 

The network constructed from the publications of the CBR conference series (1993-2008).

BBC Datasets  

Two text corpora consisting of news articles, particularly suited to evaluating cluster analysis techniques.

Multi-label Image Annotation

An image dataset for multi-label image classification using active learning with SVMs.

20 Newsgroups Subsets

A large number of artificially constructed text datasets.

Bronchiolitis

A dataset for training recommendation systems for bronchiolitis treatment.