Yeast Literature Corpus

We provide here a new text corpus, mined from the biomedical literature, containing the terms used to describe Saccharomyces cerevisiae open reading frames (ORFs).

This corpus is made available for non-commercial and research purposes only. If you make use of this dataset, please cite the following publication:

Greene, D., Bryan, K. and Cunningham, P. (2008), "Parallel Integration of Heterogeneous Genome-Wide Data Sources", Proc. 8th International Conference on BioInformatics and BioEngineering (BIBE 2008).

Dataset construction

We retrieved a set of 38,661 yeast-related MEDLINE abstracts from PubMed, corresponding to the references listed in the SGD literature curation database (as downloaded in May 2008). Since the database links references to genes, we can form a "meta-document" for each gene by concatenating all abstracts annotated as pertaining to that gene. From these meta-documents we constructed a bag-of-words model, represented as a term-gene matrix. To pre-process the data we removed dubious ORFs and applied standard stop-word removal and stemming to the abstracts. We subsequently removed terms occurring in fewer than three documents. The final dataset consists of 6,013 ORFs described by 62,859 unique terms.
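
The construction steps can be illustrated with a minimal Python sketch. It is not the code used to build the corpus: the function name, the input layout (a dictionary of abstracts and a dictionary mapping each gene to its reference IDs), the toy stop-word list and the tokenisation rule are all assumptions made for illustration. Stemming is omitted, and the document-frequency threshold is assumed to be counted over gene meta-documents.

```python
import re
from collections import Counter

# Hypothetical toy stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "to"}


def build_term_gene_matrix(abstracts, gene_refs, min_df=3):
    """Build a term-gene count matrix from abstracts linked to genes.

    abstracts: {reference_id: abstract_text}
    gene_refs: {gene: [reference_id, ...]}
    Returns (terms, genes, matrix) where matrix[i][j] is the number of
    times terms[i] occurs in the meta-document of genes[j].
    """
    # 1. Concatenate all abstracts linked to a gene into one meta-document.
    meta_docs = {
        gene: " ".join(abstracts[r] for r in refs if r in abstracts)
        for gene, refs in gene_refs.items()
    }

    # 2. Tokenise, lower-case and drop stop words (stemming is omitted here).
    bags = {}
    for gene, text in meta_docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        bags[gene] = Counter(t for t in tokens if t not in STOP_WORDS)

    # 3. Count in how many meta-documents each term appears (assumption:
    #    "documents" means gene meta-documents) and drop rare terms.
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())
    terms = sorted(t for t, n in doc_freq.items() if n >= min_df)

    # 4. Assemble the matrix: one row per term, one column per gene.
    genes = sorted(bags)
    matrix = [[bags[g][t] for g in genes] for t in terms]
    return terms, genes, matrix


# Made-up example data; the ORF names and abstracts are placeholders.
abstracts = {
    1: "The kinase binds DNA",
    2: "Kinase activity in yeast",
    3: "Yeast kinase and DNA repair",
    4: "DNA repair",
}
gene_refs = {"ORF_A": [1, 2], "ORF_B": [3], "ORF_C": [2, 4]}
terms, genes, matrix = build_term_gene_matrix(abstracts, gene_refs)
```

A dense list of lists is used only to keep the example short. For a matrix of the size reported here (62,859 terms by 6,013 ORFs), a sparse representation would be the more practical choice.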


The corpus is provided in pre-processed matrix format. All rights, including copyright, in the content of the original abstracts are owned by the original authors. 

Download Yeast Literature Corpus (May 2008) (13MB)

File formats

The files contained in the archive given above have the following formats: