| Synthetic multi-view text data |
|
We provide here a set of synthetic multi-view text datasets, constructed from the single-view BBC and BBCSport corpora by splitting news articles into related segments of text.

Dataset construction

From the two news corpora we constructed 6 new datasets containing 2-4 views as follows:
Download

These datasets are made available for non-commercial and research purposes only, and all data is provided in pre-processed matrix format. Note that stemming (Porter algorithm), stop-word removal (stop word list) and low term frequency filtering (count < 3) have already been applied to the data. All rights, including copyright, in the content of the original abstracts are owned by the original authors.

Download Synthetic Multi-View Datasets (April 2009) (4MB)

File formats

The above archive contains 6 different datasets, organised by the originating corpus. Each dataset contains 2-4 views, as indicated by the file prefix (e.g. bbc_seg1of2.*, bbc_seg2of2.*). The view data files have the following formats:
In addition, annotated labels are provided for the news articles. The
identifiers in these files correspond to the article identifiers above:
|
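The pre-processing mentioned in the Download notes (stop-word removal followed by low term frequency filtering with count < 3) can be illustrated with a minimal Python sketch. This is not the code used to build the datasets: the stop-word list and the `filter_terms` helper are hypothetical, Porter stemming is left out because it needs a third-party library, and the sketch assumes the count threshold applies to a term's total count across the corpus.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; the list actually
# used for the datasets is not reproduced here.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}

# Assumption: "count < 3" refers to a term's total count over all documents.
MIN_COUNT = 3


def filter_terms(documents, stop_words=STOP_WORDS, min_count=MIN_COUNT):
    """Remove stop words, then drop terms whose total count is below min_count."""
    # Lower-case and split on whitespace, discarding stop words.
    tokenised = [
        [term for term in doc.lower().split() if term not in stop_words]
        for doc in documents
    ]
    # Count each remaining term over the whole collection.
    counts = Counter(term for doc in tokenised for term in doc)
    # Keep only terms that occur at least min_count times.
    return [[term for term in doc if counts[term] >= min_count] for doc in tokenised]


docs = [
    "the match ended in a draw",
    "a late goal won the match",
    "the match report praised the goal",
]
filtered = filter_terms(docs)
print(filtered)
```

In this toy collection only "match" occurs three times, so it is the only term that survives the filter. The released files already have these steps applied, so the sketch is only useful for understanding why rare terms and common function words are absent from the term lists.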