The FetchProt Corpus


The first release of the FetchProt corpus is based on 177 full-text
journal articles from the biological domain, analyzed for experiments on
proteins carried out to validate tyrosine kinase activity. The 177 filled
template files contain 591 experiments on wild types and 82 different
mutants of 77 proteins.

In addition to the template files, the corpus includes text versions of
the articles in which the analyzed content is tagged, indicating where in
each article the information in the template is found.
The proteins and experiments are, among other things, linked to UniProt
identity codes and Gene Ontology molecular function codes.

The corpus has been compiled within the FetchProt project, a
collaboration between the Swedish Institute of Computer Science (SICS),
the Center for Genomics and Bioinformatics at Karolinska Institutet
(CGB/KI), and Metamatrix AB. The project has received partial funding
from VINNOVA, the Swedish Agency for Innovation Systems.
The aim of the project is to build a system that helps populate the
EXProt database of proteins with experimentally verified functions by
means of information extraction from full-text scientific journal papers.

More information on the corpus and its analysis can be found in the
documentation at

The corpus is free to download from the project homepage at