Yeast Literature Corpus
We provide a new text corpus, mined from the biomedical literature, containing the terms used to describe Saccharomyces cerevisiae ORFs.
This corpus is made available for non-commercial and research purposes only. If you use this dataset, please cite the following publication:
Greene, D., Bryan, K. and Cunningham, P. (2008), "Parallel Integration of Heterogeneous Genome-Wide Data Sources", Proc. 8th International Conference on BioInformatics and BioEngineering (BIBE 2008).
Dataset construction
We retrieved a set of 38,661 yeast-related MEDLINE abstracts from PubMed, corresponding to the references enumerated in the SGD literature curation database (as downloaded in May 2008). Since the database provides links between references and genes, we can form a "meta-document" for each gene, consisting of the concatenation of all abstracts annotated as pertaining to that gene. From these meta-documents we constructed a bag-of-words model, represented in the form of a term-gene matrix. To pre-process the data, we removed dubious ORFs and applied standard stop-word removal and stemming techniques to the abstracts. We subsequently removed terms occurring in fewer than three documents. Our final dataset consists of 6,013 ORFs described by 62,859 unique terms.
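The construction steps above can be illustrated with a minimal sketch. It is not the code used to build the corpus: the function names, the toy stop-word list, and the input structures (a dictionary of abstracts and a list of reference-gene links) are all assumptions made for illustration. Stemming and the removal of dubious ORFs are omitted, and the sketch assumes that "documents" in the frequency filter means the per-gene meta-documents.

```python
import re
from collections import Counter, defaultdict

# Toy stop-word list (an assumption; the original list is not specified).
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}


def build_meta_documents(abstracts, gene_links):
    """Concatenate all abstracts linked to each ORF.

    abstracts:  dict mapping a reference id to its abstract text
    gene_links: iterable of (reference id, ORF name) pairs
    """
    parts = defaultdict(list)
    for ref_id, orf in gene_links:
        if ref_id in abstracts:
            parts[orf].append(abstracts[ref_id])
    return {orf: " ".join(texts) for orf, texts in parts.items()}


def tokenize(text):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def term_gene_counts(meta_docs, min_docs=3):
    """Build sparse term-gene counts, keeping terms found in >= min_docs meta-documents."""
    counts = {orf: Counter(tokenize(text)) for orf, text in meta_docs.items()}
    # Document frequency: the number of meta-documents containing each term.
    doc_freq = Counter(term for counter in counts.values() for term in counter)
    terms = sorted(term for term, freq in doc_freq.items() if freq >= min_docs)
    orfs = sorted(counts)
    # Rows are terms and columns are ORFs; only non-zero cells are stored.
    matrix = {
        (i, j): counts[orf][term]
        for i, term in enumerate(terms)
        for j, orf in enumerate(orfs)
        if counts[orf][term] > 0
    }
    return terms, orfs, matrix
```

Calling build_meta_documents on the abstracts and links, then passing the result to term_gene_counts, yields the row labels (terms), the column labels (ORFs), and a dictionary of non-zero counts keyed by (row, column).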
Download
The corpus is provided in pre-processed matrix format. All rights, including copyright, in the content of the original abstracts are owned by the original authors.
Download Yeast Literature Corpus (May 2008) (13MB)
File formats
The files contained in the archive given above have the following formats:
- yeast.mtx: Term frequencies stored in a sparse term-gene matrix in Matrix Market format.
- yeast.terms: Complete list of content-bearing terms in the corpus, with each line corresponding to a row of the term-gene matrix.
- yeast.docs: List of ORFs, with each line corresponding to a column of the term-gene matrix.
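As a rough guide to reading these files, the following sketch loads the three files using only the Python standard library. The function names are hypothetical, and the reader assumes yeast.mtx uses the Matrix Market coordinate layout in "general" form (a banner line, optional % comment lines, a size line, then one 1-based "row column value" triple per line). If SciPy is installed, scipy.io.mmread is the more usual way to read the matrix.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_matrix_market(path):
    """Read a coordinate-format Matrix Market file.

    Returns (shape, entries), where entries maps 0-based (row, column)
    pairs to float values. Assumes the "general" (non-symmetric) form.
    """
    shape = None
    entries = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip the banner and comment lines
            fields = line.split()
            if shape is None:
                # The first data line holds: rows, columns, non-zero count.
                shape = (int(fields[0]), int(fields[1]))
                continue
            # Matrix Market indices are 1-based; convert to 0-based.
            row, col = int(fields[0]) - 1, int(fields[1]) - 1
            entries[(row, col)] = float(fields[2])
    return shape, entries


def load_corpus(directory):
    """Load the terms (rows), ORFs (columns), and sparse matrix entries."""
    directory = Path(directory)
    terms = read_lines(directory / "yeast.terms")
    orfs = read_lines(directory / "yeast.docs")
    shape, entries = read_matrix_market(directory / "yeast.mtx")
    if shape != (len(terms), len(orfs)):
        raise ValueError(f"matrix shape {shape} does not match the term and ORF lists")
    return terms, orfs, entries


def terms_for_orf(orf, terms, orfs, entries):
    """Return {term: frequency} for one ORF (one column of the matrix)."""
    col = orfs.index(orf)
    return {terms[r]: value for (r, c), value in entries.items() if c == col}
```

After unpacking the archive into a directory, load_corpus(directory) returns the three structures, and terms_for_orf can then look up the term frequencies for any ORF name listed in yeast.docs.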