G. X. Yu and E. Marland, "Establishment of a Knowledge Base for Function Annotations in High-Throughput Sequence Analysis," Preprint ANL/MCS-P1004-1002, October 2002. [pdf]
Motivation: The rapid accumulation of sequence data and information describing regulatory and metabolic networks has triggered the development of integrated systems for genome sequence analysis. However, the annotations found in these systems carry a great deal of uncertainty because of heterogeneity in the public databases and limitations in current computational approaches. Conflicting assignments from different computational tools add further uncertainty, and the situation is compounded by a lack of tools for cross-verification. These uncertainties have greatly affected the performance of genome analysis systems, specifically the accuracy of functional assignments to genes. To minimize their effect, a biological knowledge base is needed that provides rules for guiding function annotations and a global reference system for cross-verifying the results obtained with different computational tools.
Results: In this study, we have developed a rule-based knowledge system specifically for automated high-throughput genetic sequence analysis. It includes 22,612 protein functional groups and their evolutionary spaces (distributions), which are characterized by protein sequence conservation, the phylogenetic distribution of protein motifs and domains, and their relationships to biological functions. Our knowledge base demonstrates that tremendous variation exists among protein functional groups. Over half of the protein functional groups are highly diversified in sequence similarity (53.6% and 51.4% by Blast and Blocks measurements, respectively). With regard to protein relationships, we found that Pfam patterns have much higher resolution and broader coverage than Blocks families. Of the 10,604 protein functional groups that Blocks covers, 811 (7.6%) can be uniquely identified. In contrast, Pfam patterns cover 13,803 significant protein functional groups, and 1,899 (almost 14%) of them have unique identifiers. However, most of the relationships between protein functions and protein families or Pfam patterns are complex: each protein family or Pfam pattern can correspond to multiple functions, and vice versa. Hence, these families or patterns need to be further defined, or additional tools need to be introduced, so that each function can be identified through its own unique set of features. One important application of our knowledge base is the cross-verification of protein function annotations obtained by different computational tools. Additional applications of this knowledge base are discussed in the paper.
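As an illustration of the many-to-many relationship between functional groups and patterns, the following Python sketch shows one way a "uniquely identified" group could be computed. It assumes, as a working definition not taken from the paper, that a functional group is uniquely identified when at least one of its patterns maps to no other group. The function names, the toy group names, and the pattern identifiers are all hypothetical; only the counts passed to `percent` come from the abstract.

```python
from collections import defaultdict


def uniquely_identified_groups(group_to_patterns):
    """Return {group: exclusive patterns} for groups with at least one
    pattern that occurs in no other group (assumed definition)."""
    # Invert the mapping: pattern -> set of groups that carry it.
    pattern_to_groups = defaultdict(set)
    for group, patterns in group_to_patterns.items():
        for pattern in patterns:
            pattern_to_groups[pattern].add(group)

    unique = {}
    for group, patterns in group_to_patterns.items():
        # A pattern is exclusive if this group is its only owner.
        exclusive = {p for p in patterns if pattern_to_groups[p] == {group}}
        if exclusive:
            unique[group] = exclusive
    return unique


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Hypothetical toy data: two groups share one pattern, so only the groups
# that also own an exclusive pattern can be identified unambiguously.
toy = {
    "group_A": {"pattern_1", "pattern_2"},
    "group_B": {"pattern_2"},
    "group_C": {"pattern_3"},
}
result = uniquely_identified_groups(toy)

# Shares recomputed from the counts reported in the abstract.
blocks_share = percent(811, 10604)
pfam_share = percent(1899, 13803)
```

In the toy data, `group_B` has no exclusive pattern and so would need further-defined patterns or an additional tool, which mirrors the situation the abstract describes for most functional groups.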