R. Overbeek, T. Disz, and R. Stevens, "A Peer-to-Peer Environment for Annotation of Genomes: The SEED," Preprint ANL/MCS-P1169-0604, June 2004.
A genome may be thought of as a set of genes that encode protein sequences. The function of each gene is determined by the activity of the protein it encodes. Genome annotation is the process of assigning functions to genes, and it can be done by any of several methods. The most direct is to determine the function of a gene by experiment. Because genome databases contain vastly more gene sequences than have experimentally determined functions, most genes are assigned a function by indirect methods. These methods include assigning a function based on sequence similarity to genes of known function, assigning a function based on the gene's position in a conserved gene cluster identified through comparative analysis of many genomes, and inferring function via other techniques for detecting functional coupling. Genome annotation is an iterative process that can exploit a variety of domain knowledge sources. For genes that code for enzymes involved in core metabolism, much is known about the biochemical reaction networks in which the enzymes participate. A known reaction pathway (such as those available in biochemistry databases) can provide valuable information that supports inference of function by systematic elimination. We believe this approach will be highly valuable, especially in comparison to simple similarity methods, which are often unreliable, particularly for paralogous genes (genes that share common ancestry but have evolved divergent functions).
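To make the idea of inference by systematic elimination concrete, the following is a minimal sketch and not the SEED's actual implementation. It assumes a pathway is given as a set of required roles, that some genes already have an assigned role, and that similarity search has produced a set of candidate roles for each remaining gene. The function name `infer_by_elimination`, the gene identifiers, and the role names are all hypothetical.

```python
def infer_by_elimination(pathway_roles, assigned, candidates):
    """Fill missing pathway roles when exactly one unassigned gene can fit.

    pathway_roles: set of role names the pathway requires (hypothetical input)
    assigned:      dict mapping gene -> role already known
    candidates:    dict mapping gene -> set of roles suggested by similarity
    Returns a new dict of gene -> role including the inferred assignments.
    """
    assigned = dict(assigned)  # work on a copy; leave the caller's dict intact
    changed = True
    while changed:  # repeat, since each assignment can resolve another ambiguity
        changed = False
        missing = set(pathway_roles) - set(assigned.values())
        for role in sorted(missing):
            # Unassigned genes whose candidate roles include this missing role.
            fits = [gene for gene, roles in candidates.items()
                    if gene not in assigned and role in roles]
            if len(fits) == 1:  # unambiguous: every other gene is eliminated
                assigned[fits[0]] = role
                changed = True
    return assigned


if __name__ == "__main__":
    pathway = {"role_A", "role_B", "role_C"}
    known = {"gene1": "role_A"}
    # gene2 and gene3 stand in for paralogs: both resemble role_B enzymes,
    # so similarity alone cannot decide which of them fills role_B.
    similar = {"gene2": {"role_B"}, "gene3": {"role_B", "role_C"}}
    print(infer_by_elimination(pathway, known, similar))
```

In this toy input, only gene3 can fill role_C, so it is assigned first; that leaves gene2 as the sole remaining candidate for role_B. A real system would also need to handle pathways that are incomplete in a given organism and candidates that fit no role, which this sketch ignores.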