NAME

CDMI_API


DESCRIPTION

The CDMI_API defines the component of the Kbase API that supports interaction with instances of the CDM (Central Data Model). A basic familiarity with these routines will allow the user to extract data from the CS (Central Store). We anticipate supporting numerous sparse CDMIs in the PS (Persistent Store).

Basic Themes:
------------

There are several broad categories of routines supported in the CDMI-API.

The simplest is the set of "get entity" routines, each returning data extracted from instances of a single entity type. These routines all take as input a list of ids referencing instances of a single type of entity. They construct as output a mapping from each input id to a set of fields from that instance of the entity. Each routine allows the user to specify which fields are desired.

        NEEDS EXAMPLE

To use these routines effectively, a user will need to gradually become familiar with the entities supported in the CDM. We suggest perusing the entity-relationship model that underlies the CDM to get a good introduction.

The next simplest set is the "get relationship" routines. These take as input a list of ids for a specific entity type, and they give access to the relationship nodes associated with each entity. Thus,

        NEEDS EXAMPLE

Of the remaining CDMI-API routines, most are used to extract data by "crossing one or more relationships". Thus,

        my $references = $kbO->fids_to_literature($fids)

takes as input a list of feature ids referenced by the variable $fids. It creates a hash ($references) which maps each input key to a list of literature references. The construction of the literature references for a given ID involves crossing relationships from the entity 'Feature' to 'ProteinSequence' to 'Publication'. We have attempted to package this specific search in a convenient form. We anticipate that the number of queries of this last class will grow (especially as new entities are added to the model).

Batching queries:
----------------

A majority of the CS-API routines take a list of ids as input. Each id may be thought of as input to a query that produces an output result. We support processing an input list, since the performance (which is usually governed by network interactions) is much better if you process a batch of items, rather than invoking the API repeatedly for each of the ids. Normally, the output would be a mapping (a hash for Perl versions) from the input ids to the output results. Thus, a routine like

        fids_to_literature

will take a list of feature ids as input. The returned value will be a mapping from feature ids (fids) to publication references.

It is a little inconvenient to batch your requests by supplying a list of fids, but the performance will be much better in most cases. Please note that you are controlling the granularity of each request, and in most cases the size of the input list is not critical. However, you should note that while batching up hundreds or thousands of input ids at a time should work just fine, millions may well cause things to break (e.g., you may exhaust local memory in your machine as the output results are returned). As machines get larger, the appropriate size of the input lists may become largely irrelevant. For now, we recommend that you experiment a bit and use common sense.
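The batching advice above can be sketched as a small helper. This is only an illustration: query_in_batches is a hypothetical name (not part of the CDMI-API), the ids are made up, and the callback stands in for a real call such as sub { $kbO->fids_to_literature($_[0]) }.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the CDMI-API): split a long list of ids
# into batches, call $query once per batch, and merge the returned hashes.
sub query_in_batches {
    my ($ids, $batch_size, $query) = @_;
    my %merged;
    my @pending = @$ids;                       # copy, so the caller's list is untouched
    while (my @batch = splice(@pending, 0, $batch_size)) {
        my $result = $query->(\@batch);        # one request per batch
        @merged{ keys %$result } = values %$result;
    }
    return \%merged;
}

# Stand-in for a real API call; it returns an empty list for every id.
my $calls = 0;
my $fake_query = sub {
    my ($batch) = @_;
    $calls++;
    return { map { $_ => [] } @$batch };
};

my @fids = map { "fid.$_" } 1 .. 2500;         # made-up ids for illustration
my $all  = query_in_batches(\@fids, 1000, $fake_query);
printf "%d results from %d calls\n", scalar(keys %$all), $calls;
```

The batch size of 1000 is an arbitrary choice within the "hundreds or thousands" range suggested above; it is worth experimenting with other values.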


METHODS

fids_to_annotations

  $return = $obj->fids_to_annotations($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is an annotations
fids is a reference to a list where each element is a fid
fid is a string
annotations is a reference to a list where each element is an annotation
annotation is a reference to a list containing 3 items:
	0: a comment
	1: an annotator
	2: an annotation_time
comment is a string
annotator is a string
annotation_time is an int
Description

This routine takes as input a list of fids. It retrieves the existing annotations for each fid, including the text of the annotation, who made the annotation and when (as seconds from the epoch).
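As a minimal sketch of how the returned structure might be consumed, the following walks a hash of the documented shape and prints each annotation with its time converted from epoch seconds. The data and the helper name format_annotation are made up for illustration; no server call is made.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Render one annotation triple [comment, annotator, annotation_time] as text.
sub format_annotation {
    my ($annotation) = @_;
    my ($comment, $annotator, $time) = @$annotation;
    my $when = strftime('%Y-%m-%d %H:%M:%S', gmtime($time));   # epoch seconds -> UTC
    return "$when UTC  $annotator: $comment";
}

# Made-up data in the documented return shape (fid => list of annotations).
my $return = {
    'fid.example.1' => [
        [ 'Function revised',     'annotator_b', 86400 ],
        [ 'Initial function set', 'annotator_a', 0 ],
    ],
};

for my $fid (sort keys %$return) {
    print "$fid\n";
    # Show the annotations oldest first.
    for my $annotation (sort { $a->[2] <=> $b->[2] } @{ $return->{$fid} }) {
        print '  ', format_annotation($annotation), "\n";
    }
}
```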

fids_to_functions

  $return = $obj->fids_to_functions($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a function
fids is a reference to a list where each element is a fid
fid is a string
function is a string
Description

This routine takes as input a list of fids and returns a mapping from the fids to their assigned functions.

fids_to_literature

  $return = $obj->fids_to_literature($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a pubrefs
fids is a reference to a list where each element is a fid
fid is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string
Description

We try to associate features and publications, when the publications constitute supporting evidence of the function. We connect a paper to a feature when we believe that an "expert" has asserted that the function of the feature is basically what we have associated with the feature. Thus, we might attach a paper reporting the crystal structure of a protein, even though the paper is clearly not the paper responsible for the original characterization. Our position in this matter is somewhat controversial, but we are seeking to characterize some assertions as relatively solid, and this strategy seems to support that goal. Please note that we certainly wish we could also capture original publications, and when experts can provide those connections, we hope that they will help record the associations.

fids_to_protein_families

  $return = $obj->fids_to_protein_families($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_families
fids is a reference to a list where each element is a fid
fid is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
Description

Kbase supports the creation and maintenance of protein families. Each family is intended to contain a set of isofunctional homologs. Currently, the families are collections of translations of features, rather than of just protein sequences (represented by md5s, for example). fids_to_protein_families supports access to the features that have been grouped into a family. Ideally, each feature in a family would have the same assigned function. This is not always true, but probably should be.

fids_to_roles

  $return = $obj->fids_to_roles($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a roles
fids is a reference to a list where each element is a fid
fid is a string
roles is a reference to a list where each element is a role
role is a string
Description

Given a feature, one can get the set of roles it implements using fids_to_roles. Remember, a protein can be multifunctional -- implementing several roles. This can occur due to fusions or to broad specificity of substrate.

fids_to_subsystems

  $return = $obj->fids_to_subsystems($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a subsystems
fids is a reference to a list where each element is a fid
fid is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
Description

fids in subsystems normally have somewhat more reliable assigned functions than those not in subsystems. Hence, it is common to ask "Is this protein-encoding gene included in any subsystems?" fids_to_subsystems can be used to see which subsystems contain a fid (or, you can submit as input a set of fids and get the subsystems for each).

fids_to_co_occurring_fids

  $return = $obj->fids_to_co_occurring_fids($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
	0: a fid
	1: a float
Description

One of the most powerful clues to function relates to conserved clusters of genes on the chromosome (in prokaryotic genomes). We have attempted to record pairs of genes that tend to occur close to one another on the chromosome. To meaningfully do this, we need to construct similarity-based mappings between genes in distinct genomes. We have constructed such mappings for many (but not all) genomes maintained in the Kbase CS. The prokaryotic genomes in the CS are grouped into OTUs by ribosomal RNA (genomes within a single OTU have SSU rRNA that is greater than 97% identical). If two genes occur close to one another (i.e., corresponding genes occur close to one another), then we assign a score, which is the number of distinct OTUs in which such clustering is detected. This allows one to normalize for situations in which hundreds of corresponding genes are detected, but they all come from very closely related genomes.

The significance of the score relates to the number of genomes in the database. We recommend that you take the time to look at a set of scored pairs and determine approximately what percentage appear to be actually related for a few cutoff values.
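One way to explore cutoff values, as recommended above, is to filter the returned scored_fids at a few thresholds and inspect what remains. The sketch below uses made-up fids and scores; filter_by_score is a hypothetical helper, and the cutoffs 5 and 10 are arbitrary.

```perl
use strict;
use warnings;

# Keep only the scored_fid pairs [fid, score] whose score is at least $cutoff.
sub filter_by_score {
    my ($scored_by_fid, $cutoff) = @_;
    my %kept;
    for my $fid (keys %$scored_by_fid) {
        $kept{$fid} = [ grep { $_->[1] >= $cutoff } @{ $scored_by_fid->{$fid} } ];
    }
    return \%kept;
}

# Made-up data in the documented return shape (fid => list of [fid, score]).
my $return = { 'fid.a' => [ [ 'fid.b', 12 ], [ 'fid.c', 3 ], [ 'fid.d', 7 ] ] };

for my $cutoff (5, 10) {
    my $kept = filter_by_score($return, $cutoff);
    printf "cutoff %2d keeps %d of %d pairs\n",
        $cutoff, scalar @{ $kept->{'fid.a'} }, scalar @{ $return->{'fid.a'} };
}
```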

fids_to_locations

  $return = $obj->fids_to_locations($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a location
fids is a reference to a list where each element is a fid
fid is a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
Description

A "location" is a sequence of "regions". A region is a contiguous set of bases in a contig. We work with locations in both the string form and as structures. fids_to_locations takes as input a list of fids. For each fid, a structured location is returned. The location is a list of regions; a region is given as a pointer to a list containing

             the contig,
             the beginning base in the contig (from 1),
             the strand (+ or -), and
             the length

Note that specifying a region using these 4 values allows you to represent a single base-pair region on either strand unambiguously (which giving begin/end pairs does not achieve).
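To make the structured form concrete, here is a small sketch that sums the region lengths of a location and writes out a single-base region on each strand. The contig name and coordinates are made up, and location_length is a hypothetical helper.

```perl
use strict;
use warnings;

# Total number of bases covered by a location (a list of regions).
# Each region is [contig, begin, strand, length].
sub location_length {
    my ($location) = @_;
    my $total = 0;
    $total += $_->[3] for @$location;
    return $total;
}

# Made-up location with two regions on a hypothetical contig.
my $location = [
    [ 'contig.example', 1200, '+', 300 ],
    [ 'contig.example', 1650, '+', 150 ],
];
print 'bases in location: ', location_length($location), "\n";

# A single base at position 100 can be written for either strand
# because the strand is carried explicitly.
my @single_base = (
    [ 'contig.example', 100, '+', 1 ],
    [ 'contig.example', 100, '-', 1 ],
);
printf "%s %d %s %d\n", @$_ for @single_base;
```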

locations_to_fids

  $return = $obj->locations_to_fids($region_of_dna_strings)
Parameter and return types
$region_of_dna_strings is a region_of_dna_strings
$return is a reference to a hash where the key is a region_of_dna_string and the value is a fids
region_of_dna_strings is a reference to a list where each element is a region_of_dna_string
region_of_dna_string is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

It is frequently the case that one wishes to look up the genes that occur in a given region of a contig. locations_to_fids can be used to extract such sets of genes for each region in the input set of regions. We define a gene as "occurring" in a region if the location of the gene overlaps the designated region.

locations_to_dna_sequences

  $dna_seqs = $obj->locations_to_dna_sequences($locations)
Parameter and return types
$locations is a locations
$dna_seqs is a reference to a list where each element is a reference to a list containing 2 items:
	0: a location
	1: a dna
locations is a reference to a list where each element is a location
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
dna is a string
Description

locations_to_dna_sequences takes as input a list of locations (each in the form of a list of regions). The routine constructs 2-tuples composed of

     [the input location,the dna string]

The returned DNA string is formed by concatenating the DNA for each of the regions that make up the location.

proteins_to_fids

  $return = $obj->proteins_to_fids($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fids
proteins is a reference to a list where each element is a protein
protein is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

proteins_to_fids takes as input a list of proteins (i.e., a list of md5s) and returns for each a set of protein-encoding fids that have the designated sequence as their translation. That is, for each sequence, the returned fids will be the entire set (within Kbase) that have the sequence as a translation.

proteins_to_protein_families

  $return = $obj->proteins_to_protein_families($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a protein_families
proteins is a reference to a list where each element is a protein
protein is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
Description

Protein families contain a set of isofunctional homologs. proteins_to_protein_families can be used to look up the set of protein_families containing a specified protein. For performance reasons, you can submit a batch of proteins (i.e., a list of proteins), and for each input protein, you get back a set (possibly empty) of protein_families. Specific collections of families (e.g., FIGfams) usually require that a protein be in at most one family. However, we will be integrating protein families from a number of sources, and so a protein can be in multiple families.

proteins_to_literature

  $return = $obj->proteins_to_literature($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a pubrefs
proteins is a reference to a list where each element is a protein
protein is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string
Description

The routine proteins_to_literature can be used to extract the list of papers we have associated with specific protein sequences. The user should note that in many cases the association of a paper with a protein sequence is not precise. That is, the paper may actually describe a closely-related protein (that may not yet even be in a sequenced genome). Annotators attempt to use best judgement when associating literature and proteins. Publication references include [pubmed ID,URL for the paper, title of the paper]. In some cases, the URL and title are omitted. In theory, we can extract them from PubMed and we will attempt to do so.
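Because the URL and title may be omitted, code that displays publication references should tolerate missing fields. The sketch below assumes that an omitted field arrives as undef or as an empty string (the source does not say which); the identifiers, URL, and title are made up, and format_pubref is a hypothetical helper.

```perl
use strict;
use warnings;

# Render a pubref [pubmed ID, URL, title], tolerating an omitted URL or title.
# Assumption: an omitted field is either undef or the empty string.
sub format_pubref {
    my ($pubref) = @_;
    my ($pmid, $url, $title) = @$pubref;
    my $text = "PMID $pmid";
    $text .= ": $title" if defined $title && length $title;
    $text .= " <$url>"  if defined $url   && length $url;
    return $text;
}

# Made-up data in the documented return shape (protein => list of pubrefs).
my $return = {
    'md5.example' => [
        [ '12345678', 'http://example.org/paper', 'An example title' ],
        [ '23456789', '', undef ],
    ],
};

for my $protein (sort keys %$return) {
    print "$protein\n";
    print '  ', format_pubref($_), "\n" for @{ $return->{$protein} };
}
```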

proteins_to_functions

  $return = $obj->proteins_to_functions($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fid_function_pairs
proteins is a reference to a list where each element is a protein
protein is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
	0: a fid
	1: a function
fid is a string
function is a string
Description

The routine proteins_to_functions allows users to access functions associated with specific protein sequences. The input proteins are given as a list of MD5 values (these MD5 values each correspond to a specific protein sequence). For each input MD5 value, a list of [feature-id,function] pairs is constructed and returned. Note that there are many cases in which a single protein sequence corresponds to the translation associated with multiple protein-encoding genes, and each may have distinct functions (an undesirable situation, we grant).

This function allows you to access all of the functions assigned (by all annotation groups represented in Kbase) to each of a set of sequences.
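Since identical sequences may carry distinct functions, a common follow-up is to group the returned [fid, function] pairs by function and flag proteins with more than one. A minimal sketch with made-up MD5 placeholders, fids, and function names (fids_by_function is a hypothetical helper):

```perl
use strict;
use warnings;

# Group [fid, function] pairs by function: function => list of fids.
sub fids_by_function {
    my ($pairs) = @_;
    my %by_function;
    push @{ $by_function{ $_->[1] } }, $_->[0] for @$pairs;
    return \%by_function;
}

# Made-up data in the documented return shape (protein => list of [fid, function]).
my $return = {
    'md5.example.1' => [ [ 'fid.a', 'Function X' ], [ 'fid.b', 'Function X' ], [ 'fid.c', 'Function Y' ] ],
    'md5.example.2' => [ [ 'fid.d', 'Function Z' ] ],
};

for my $protein (sort keys %$return) {
    my $by_function = fids_by_function($return->{$protein});
    next if keys(%$by_function) < 2;           # only one function: nothing to report
    print "$protein has more than one assigned function:\n";
    print "  $_: @{ $by_function->{$_} }\n" for sort keys %$by_function;
}
```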

proteins_to_roles

  $return = $obj->proteins_to_roles($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a roles
proteins is a reference to a list where each element is a protein
protein is a string
roles is a reference to a list where each element is a role
role is a string
Description

The routine proteins_to_roles allows a user to gather the set of functional roles that are associated with specific protein sequences. A single protein sequence (designated by an MD5 value) may have numerous associated functions, since functions are treated as an attribute of the feature, and multiple features may have precisely the same translation. In our experience, it is not uncommon, even for the best annotation teams, to assign distinct functions (and, hence, functional roles) to identical protein sequences.

For each input MD5 value, this routine gathers the set of features (fids) that share the same sequence, collects the associated functions, expands these into functional roles (for multi-functional proteins), and returns the set of roles that results.

Note that, if the user wishes to see the specific features that have the assigned functional roles, they should use proteins_to_functions instead (it returns the fids associated with each assigned function).

roles_to_proteins

  $return = $obj->roles_to_proteins($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a proteins
roles is a reference to a list where each element is a role
role is a string
proteins is a reference to a list where each element is a protein
protein is a string
Description

roles_to_proteins can be used to extract the set of proteins (designated by MD5 values) that currently are believed to implement a given role. Note that the proteins may be multifunctional, meaning that they may be implementing other roles, as well.

roles_to_subsystems

  $return = $obj->roles_to_subsystems($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a subsystems
roles is a reference to a list where each element is a role
role is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
Description

roles_to_subsystems can be used to access the set of subsystems that include specific roles. The input is a list of roles (i.e., role descriptions), and a mapping is returned as a hash with key role description and values composed of sets of subsystem names.

roles_to_protein_families

  $return = $obj->roles_to_protein_families($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a protein_families
roles is a reference to a list where each element is a role
role is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
Description

roles_to_protein_families can be used to locate the protein families containing features that have assigned functions implying that they implement designated roles. Note that for any input role (given as a role description), you may have a set of distinct protein_families returned.

fids_to_coexpressed_fids

  $return = $obj->fids_to_coexpressed_fids($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
	0: a fid
	1: a float
Description

The routine fids_to_coexpressed_fids returns (for each input fid) a list of features that appear to be coexpressed. That is, for an input fid, we determine the set of fids from the same genome that have Pearson Correlation Coefficients (based on normalized expression data) greater than 0.5 or less than -0.5.

protein_families_to_fids

  $return = $obj->protein_families_to_fids($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fids
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

protein_families_to_fids can be used to access the set of fids represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.

protein_families_to_proteins

  $return = $obj->protein_families_to_proteins($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a proteins
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
proteins is a reference to a list where each element is a protein
protein is a string
Description

protein_families_to_proteins can be used to access the set of proteins (i.e., the set of MD5 values) represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.

protein_families_to_functions

  $return = $obj->protein_families_to_functions($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fid_function_pairs
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
	0: a fid
	1: a function
fid is a string
function is a string
Description

protein_families_to_functions can be used to extract the set of functions assigned to the fids that make up the family. Each input protein_family is mapped to a set of 2-tuples composed of a feature id (fid) and the function currently assigned to the fid.

protein_families_to_co_occurring_families

  $return = $obj->protein_families_to_co_occurring_families($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fc_protein_families
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fc_protein_families is a reference to a list where each element is a fc_protein_family
fc_protein_family is a reference to a list containing 3 items:
	0: a protein_family
	1: a score
	2: a function
score is a float
function is a string
Description

Since we accumulate data relating to the co-occurrence (i.e., chromosomal clustering) of genes in prokaryotic genomes, we can note which pairs of genes tend to co-occur. From this data, one can compute the protein families that tend to co-occur (i.e., tend to cluster on the chromosome). This allows one to formulate conjectures for unclustered pairs, based on clustered pairs from the same protein_families.

co_occurrence_evidence

  $return = $obj->co_occurrence_evidence($pairs_of_fids)
Parameter and return types
$pairs_of_fids is a pairs_of_fids
$return is a reference to a list where each element is a reference to a list containing 2 items:
	0: a pair_of_fids
	1: an evidence
pairs_of_fids is a reference to a list where each element is a pair_of_fids
pair_of_fids is a reference to a list containing 2 items:
	0: a fid
	1: a fid
fid is a string
evidence is a reference to a list where each element is a pair_of_fids
Description

co_occurrence_evidence is used to retrieve the detailed pairs of genes that go into the computation of co-occurrence scores. The scores reflect an estimate of the number of distinct OTUs that contain an instance of a co-occurring pair. This routine returns as evidence a list of all the pairs that went into the computation.

The input to the computation is a list of pairs for which evidence is desired.

The returned output is a list of elements, one for each input pair. Each output element is a 2-tuple: the input pair and the evidence for the pair. The evidence is a list of pairs of fids that are believed to correspond to the input pair.
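The nesting of the returned list is easy to get wrong, so here is a minimal sketch that unpacks it. The fids are made up and summarize_evidence is a hypothetical helper; no server call is made.

```perl
use strict;
use warnings;

# Produce one summary line per output element [pair_of_fids, evidence].
sub summarize_evidence {
    my ($return) = @_;
    return map {
        my ($pair, $evidence) = @$_;           # $pair is [fid, fid]; $evidence is a list of such pairs
        sprintf '%s/%s: %d supporting pair(s)', @$pair, scalar @$evidence;
    } @$return;
}

# Made-up data in the documented return shape.
my $return = [
    [ [ 'fid.a', 'fid.b' ], [ [ 'fid.x1', 'fid.y1' ], [ 'fid.x2', 'fid.y2' ] ] ],
    [ [ 'fid.c', 'fid.d' ], [] ],
];

print "$_\n" for summarize_evidence($return);
```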

contigs_to_sequences

  $return = $obj->contigs_to_sequences($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a dna
contigs is a reference to a list where each element is a contig
contig is a string
dna is a string
Description

contigs_to_sequences is used to access the DNA sequence associated with each of a set of input contigs. It takes as input a set of contig IDs (from which the genome can be determined) and produces a mapping from the input IDs to the returned DNA sequence in each case.

contigs_to_lengths

  $return = $obj->contigs_to_lengths($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a length
contigs is a reference to a list where each element is a contig
contig is a string
length is an int
Description

In some cases, one wishes to know just the lengths of the contigs, rather than their actual DNA sequence (e.g., suppose that you wished to know if a gene boundary occurred within 100 bp of the end of the contig). To avoid requiring a user to access the entire DNA sequence, we offer the ability to retrieve just the contig lengths. Input to the routine is a list of contig IDs. The routine returns a mapping from contig IDs to lengths.
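The gene-boundary check mentioned above can be sketched as follows, given a mapping of the documented shape. The contig ID, length, and positions are made up; near_contig_end is a hypothetical helper, and "within 100 bp" is interpreted here as fewer than 100 bases between the position and the nearer end.

```perl
use strict;
use warnings;

# True if a 1-based position lies fewer than $window bases from either end of a contig.
sub near_contig_end {
    my ($position, $contig_length, $window) = @_;
    return ( $position <= $window || $contig_length - $position < $window ) ? 1 : 0;
}

# Made-up lengths in the shape returned by contigs_to_lengths (contig => length).
my $lengths = { 'contig.example' => 5000 };

for my $position (50, 2500, 4950) {
    my $near = near_contig_end($position, $lengths->{'contig.example'}, 100);
    printf "position %4d: %s\n", $position, $near ? 'near an end' : 'interior';
}
```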

contigs_to_md5s

  $return = $obj->contigs_to_md5s($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a md5
contigs is a reference to a list where each element is a contig
contig is a string
md5 is a string
Description

contigs_to_md5s can be used to acquire MD5 values for each of a list of contigs. The quickest way to determine whether two contigs are identical is to compare their associated MD5 values, eliminating the need to retrieve the sequence of each and compare them.

The routine takes as input a list of contig IDs. The output is a mapping from contig ID to MD5 value.
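Given the returned mapping, identical contigs can be found by grouping contig IDs on their MD5 values. The sketch below uses made-up contig IDs and placeholder MD5 strings; group_identical_contigs is a hypothetical helper.

```perl
use strict;
use warnings;

# Return the groups of contig IDs that share an MD5 value (groups of size > 1 only).
sub group_identical_contigs {
    my ($md5_of) = @_;
    my %contigs_with;
    push @{ $contigs_with{ $md5_of->{$_} } }, $_ for sort keys %$md5_of;
    return [ grep { @$_ > 1 } map { $contigs_with{$_} } sort keys %contigs_with ];
}

# Made-up data in the documented return shape (contig => md5).
my $return = {
    'contig.1' => 'md5-aaaa',
    'contig.2' => 'md5-bbbb',
    'contig.3' => 'md5-aaaa',
};

print "identical: @$_\n" for @{ group_identical_contigs($return) };
```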

md5s_to_genomes

  $return = $obj->md5s_to_genomes($md5s)
Parameter and return types
$md5s is a md5s
$return is a reference to a hash where the key is a md5 and the value is a genomes
md5s is a reference to a list where each element is a md5
md5 is a string
genomes is a reference to a list where each element is a genome
genome is a string
Description

md5s_to_genomes is used to get the genomes associated with each of a list of input MD5 values.

The routine takes as input a list of MD5 values. It constructs a mapping from each input MD5 value to a list of genomes that share the same MD5 value. The MD5 value for a genome is independent of the names of contigs and the case of the DNA sequence data.

genomes_to_md5s

  $return = $obj->genomes_to_md5s($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a md5
genomes is a reference to a list where each element is a genome
genome is a string
md5 is a string
Description

The routine genomes_to_md5s can be used to look up the MD5 value associated with each of a set of genomes. The MD5 values are computed when the genome is loaded, so this routine just retrieves the precomputed values.

Note that the MD5 value of a genome is independent of the contig names and case of the DNA sequences that make up the genome.

genomes_to_contigs

  $return = $obj->genomes_to_contigs($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a contigs
genomes is a reference to a list where each element is a genome
genome is a string
contigs is a reference to a list where each element is a contig
contig is a string
Description

The routine genomes_to_contigs can be used to retrieve the IDs of the contigs associated with each of a list of input genomes. The routine constructs a mapping from genome ID to the list of contigs included in the genome.

genomes_to_fids

  $return = $obj->genomes_to_fids($genomes, $types_of_fids)
Parameter and return types
$genomes is a genomes
$types_of_fids is a types_of_fids
$return is a reference to a hash where the key is a genome and the value is a fids
genomes is a reference to a list where each element is a genome
genome is a string
types_of_fids is a reference to a list where each element is a type_of_fid
type_of_fid is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

genomes_to_fids is used to get the fids included in specific genomes. It is often the case that you want just one or two types of fids -- hence, the types_of_fids argument.

genomes_to_taxonomies

  $return = $obj->genomes_to_taxonomies($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a taxonomic_groups
genomes is a reference to a list where each element is a genome
genome is a string
taxonomic_groups is a reference to a list where each element is a taxonomic_group
taxonomic_group is a string
Description

The routine genomes_to_taxonomies can be used to retrieve taxonomic information for each of a list of input genomes. For each genome in the input list of genomes, a list of taxonomic groups is returned. Kbase will use the groups maintained by NCBI. For an NCBI taxonomic string like

     cellular organisms;
     Bacteria;
     Proteobacteria;
     Gammaproteobacteria;
     Enterobacteriales;
     Enterobacteriaceae;
     Escherichia;
     Escherichia coli

associated with the strain 'Escherichia coli 1412', this routine would return a list of these taxonomic groups:

     ['Bacteria',
      'Proteobacteria',
      'Gammaproteobacteria',
      'Enterobacteriales',
      'Enterobacteriaceae',
      'Escherichia',
      'Escherichia coli',
      'Escherichia coli 1412'
     ]

That is, the initial "cellular organisms" has been deleted, and the strain ID has been added as the last "grouping".

The output is a mapping from genome IDs to lists of the form shown above.
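The transformation described above (drop the leading "cellular organisms", append the strain) can be sketched in a few lines. This is an illustration of the rule as stated, not the server's implementation; taxonomy_groups is a hypothetical helper.

```perl
use strict;
use warnings;

# Turn an NCBI taxonomic string plus a strain name into a list of groups.
sub taxonomy_groups {
    my ($ncbi_string, $strain) = @_;
    my @groups;
    for my $piece (split /;/, $ncbi_string) {
        $piece =~ s/^\s+//;                    # trim leading whitespace
        $piece =~ s/\s+$//;                    # trim trailing whitespace
        push @groups, $piece if length $piece;
    }
    shift @groups if @groups && $groups[0] eq 'cellular organisms';
    push @groups, $strain;                     # the strain becomes the last "grouping"
    return \@groups;
}

my $ncbi = 'cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; '
         . 'Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli';
my $groups = taxonomy_groups($ncbi, 'Escherichia coli 1412');
print "$_\n" for @$groups;
```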

genomes_to_subsystems

  $return = $obj->genomes_to_subsystems($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a variant_subsystem_pairs
genomes is a reference to a list where each element is a genome
genome is a string
variant_subsystem_pairs is a reference to a list where each element is a variant_of_subsystem
variant_of_subsystem is a reference to a list containing 2 items:
	0: a subsystem
	1: a variant
subsystem is a string
variant is a string
Description

A user can invoke genomes_to_subsystems to retrieve the names of the subsystems relevant to each genome. The input is a list of genomes. The output is a mapping from genome to a list of 2-tuples, where each 2-tuple gives a variant code and a subsystem name. Variant codes of -1 (or *-1) amount to assertions that the genome contains no active variant. A variant code of 0 means "work in progress", and presence or absence of the subsystem in the genome should be considered undetermined.
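A typical use is to keep only the subsystems with an active variant in a genome. The sketch below assumes the tuple order given in the type declaration (subsystem first, then variant) and treats only the codes named above (-1, *-1, and 0) as inactive or undetermined. The genome ID, subsystem names, and variant codes are made up, and active_subsystems is a hypothetical helper.

```perl
use strict;
use warnings;

# Return the subsystem names whose variant code is not -1, *-1, or 0.
# Assumption: each tuple is [subsystem, variant], as in the type declaration.
sub active_subsystems {
    my ($pairs) = @_;
    my %inactive = map { $_ => 1 } ('-1', '*-1', '0');
    return [ map { $_->[0] } grep { !$inactive{ $_->[1] } } @$pairs ];
}

# Made-up data in the documented return shape (genome => list of tuples).
my $return = {
    'genome.example' => [
        [ 'Subsystem A', '1' ],
        [ 'Subsystem B', '-1' ],
        [ 'Subsystem C', '0' ],
        [ 'Subsystem D', '*-1' ],
        [ 'Subsystem E', '2' ],
    ],
};

for my $genome (sort keys %$return) {
    print "$genome: ", join(', ', @{ active_subsystems($return->{$genome}) }), "\n";
}
```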

subsystems_to_genomes

  $return = $obj->subsystems_to_genomes($subsystems)
Parameter and return types
$subsystems is a subsystems
$return is a reference to a hash where the key is a subsystem and the value is a reference to a list where each element is a reference to a list containing 2 items:
	0: a variant
	1: a genome
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
variant is a string
genome is a string
Description

The routine subsystems_to_genomes is used to determine which genomes are in specified subsystems. The input is the list of subsystem names of interest. The output is a map from the subsystem names to lists of 2-tuples, where each 2-tuple is a [variant-code,genome ID] pair.

subsystems_to_fids

  $return = $obj->subsystems_to_fids($subsystems, $genomes)
Parameter and return types
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a reference to a list containing 2 items:
	0: a variant
	1: a fids
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
variant is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

The routine subsystems_to_fids allows the user to map subsystem names into the fids that occur in genomes in the subsystems. Specifically, the input is a list of subsystem names. What is returned is a mapping from subsystem names to a "genome-mapping". The genome-mapping takes genome IDs to 2-tuples that capture the variant code of the genome and the fids from the genome that are included in the subsystem.
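
The nested shape of the result (subsystem, then genome, then a [variant, fids] pair) can be walked as in the sketch below. It uses mock data in place of a server call; the IDs and the helper flatten_subsystem_fids are hypothetical.

```perl
use strict;
use warnings;

# Mock data shaped like the return value of subsystems_to_fids.
# Genome and feature IDs are invented for illustration.
my $return = {
    'Histidine Degradation' => {
        'kb|g.0' => [ '1', [ 'kb|g.0.peg.10', 'kb|g.0.peg.11' ] ],
        'kb|g.1' => [ '2', [ 'kb|g.1.peg.7' ] ],
    },
};

# Hypothetical helper: flatten the nested mapping into one
# [subsystem, genome, variant, fid] record per fid.
sub flatten_subsystem_fids {
    my ($result) = @_;
    my @records;
    for my $subsystem (sort keys %$result) {
        my $genome_mapping = $result->{$subsystem};
        for my $genome (sort keys %$genome_mapping) {
            my ($variant, $fids) = @{ $genome_mapping->{$genome} };
            push @records, [ $subsystem, $genome, $variant, $_ ] for @$fids;
        }
    }
    return \@records;
}

my $records = flatten_subsystem_fids($return);
print join("\t", @$_), "\n" for @$records;
```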

subsystems_to_roles

  $return = $obj->subsystems_to_roles($subsystems, $aux)
Parameter and return types
$subsystems is a subsystems
$aux is an aux
$return is a reference to a hash where the key is a subsystem and the value is a roles
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
aux is an int
roles is a reference to a list where each element is a role
role is a string
Description

The routine subsystems_to_roles is used to determine the role descriptions that occur in a subsystem. The input is a list of subsystem names. A map is returned connecting subsystem names to lists of roles. 'aux' is a boolean flag passed as an int: if it is 0, auxiliary roles are not returned; if it is 1, they are returned.

subsystems_to_spreadsheets

  $return = $obj->subsystems_to_spreadsheets($subsystems, $genomes)
Parameter and return types
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a row
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
row is a reference to a list containing 2 items:
	0: a variant
	1: a reference to a hash where the key is a role and the value is a fids
variant is a string
role is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

The subsystems_to_spreadsheets routine allows a user to extract the subsystem spreadsheets for a specified set of subsystem names. In the returned output, each subsystem is mapped to a hash that takes as input a genome ID and maps it to the "row" for the genome in the subsystem. The "row" is itself a 2-tuple composed of the variant code and a mapping from role descriptions to lists of fids. We suggest writing a simple test script that fetches, say, the subsystem named 'Histidine Degradation', extracts the spreadsheet, and then uses something like Data::Dumper to make sure that it all makes sense.
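
A test script along those lines might look like the sketch below. To keep it self-contained, it inspects a mock structure shaped like the documented return value instead of calling the server; the genome and feature IDs, the particular role-to-fid assignments, and the helper spreadsheet_rows are assumptions made for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Mock data shaped like the return value of subsystems_to_spreadsheets:
# subsystem -> genome -> [ variant, { role => [fids] } ].
my $spreadsheets = {
    'Histidine Degradation' => {
        'kb|g.0' => [
            '1',
            {
                'Histidine ammonia-lyase (EC 4.3.1.3)' => ['kb|g.0.peg.10'],
                'Urocanate hydratase (EC 4.2.1.49)'    => ['kb|g.0.peg.11'],
            },
        ],
    },
};

# Dump the raw structure with stable key order for easy inspection.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
print Dumper($spreadsheets);

# Hypothetical helper: one [subsystem, genome, variant, role, fids] record
# per spreadsheet cell, with the fids joined into a single string.
sub spreadsheet_rows {
    my ($sheets) = @_;
    my @rows;
    for my $subsystem (sort keys %$sheets) {
        for my $genome (sort keys %{ $sheets->{$subsystem} }) {
            my ($variant, $cells) = @{ $sheets->{$subsystem}{$genome} };
            for my $role (sort keys %$cells) {
                push @rows, [ $subsystem, $genome, $variant, $role,
                              join(',', @{ $cells->{$role} }) ];
            }
        }
    }
    return \@rows;
}

my $rows = spreadsheet_rows($spreadsheets);
print join("\t", @$_), "\n" for @$rows;
```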

all_roles_used_in_models

  $return = $obj->all_roles_used_in_models()
Parameter and return types
$return is a roles
roles is a reference to a list where each element is a role
role is a string
Description

The all_roles_used_in_models routine allows a user to access the set of roles that are included in current models. This is important because far fewer roles are used in models than exist overall. Hence, the returned set represents the minimal set we need to clean up in order to properly support modeling.

complexes_to_complex_data

  $return = $obj->complexes_to_complex_data($complexes)
Parameter and return types
$complexes is a complexes
$return is a reference to a hash where the key is a complex and the value is a complex_data
complexes is a reference to a list where each element is a complex
complex is a string
complex_data is a reference to a hash where the following keys are defined:
	complex_name has a value which is a name
	complex_roles has a value which is a roles
	complex_reactions has a value which is a reactions
name is a string
roles is a reference to a list where each element is a role
role is a string
reactions is a reference to a list where each element is a reaction
reaction is a string
Description

fids_to_feature_data

  $return = $obj->fids_to_feature_data($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a feature_data
fids is a reference to a list where each element is a fid
fid is a string
feature_data is a reference to a hash where the following keys are defined:
	feature_id has a value which is a fid
	genome_name has a value which is a string
	feature_function has a value which is a string
	feature_length has a value which is an int
	feature_publications has a value which is a pubrefs
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string
Description

equiv_sequence_assertions

  $return = $obj->equiv_sequence_assertions($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a function_assertions
proteins is a reference to a list where each element is a protein
protein is a string
function_assertions is a reference to a list where each element is a function_assertion
function_assertion is a reference to a list containing 4 items:
	0: an id
	1: a function
	2: a source
	3: an expert
id is a string
function is a string
source is a string
expert is a string
Description

Different groups have made assertions of function for numerous protein sequences. The equiv_sequence_assertions routine allows the user to gather function assertions from all of the sources. Each assertion includes a field indicating whether the person making the assertion viewed themselves as an "expert". The routine gathers assertions for all proteins having identical protein sequence.

fids_to_regulons

  $return = $obj->fids_to_regulons($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a regulon_size_pairs
fids is a reference to a list where each element is a fid
fid is a string
regulon_size_pairs is a reference to a list where each element is a regulon_size_pair
regulon_size_pair is a reference to a list containing 2 items:
	0: a regulon
	1: a regulon_size
regulon is a string
regulon_size is an int
Description

The fids_to_regulons routine allows one to map fids into the regulons that contain them. Normally a fid will be in at most one regulon, but we support multiple regulons.
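
Because a fid may map to several [regulon, size] pairs, a caller often has to pick one. The sketch below chooses the largest regulon for each fid from a mock result; the IDs, the regulon names, and the helper largest_regulon are hypothetical, and whether a fid without regulons appears with an empty list or is simply absent is not specified here, so the code tolerates both.

```perl
use strict;
use warnings;

# Mock data shaped like the return value of fids_to_regulons.
# Feature IDs, regulon names, and sizes are invented for illustration.
my $return = {
    'kb|g.0.peg.10' => [ [ 'regulon.A', 12 ], [ 'regulon.B', 40 ] ],
    'kb|g.0.peg.11' => [],
};

# Hypothetical helper: return the [regulon, size] pair with the largest
# size, or undef when the list is empty or missing.
sub largest_regulon {
    my ($pairs) = @_;
    my $best;
    for my $pair (@{ $pairs || [] }) {
        $best = $pair if !defined $best || $pair->[1] > $best->[1];
    }
    return $best;
}

for my $fid (sort keys %$return) {
    my $best = largest_regulon($return->{$fid});
    if ($best) {
        print "$fid\t$best->[0]\t$best->[1]\n";
    }
    else {
        print "$fid\t(no regulon)\n";
    }
}
```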

regulons_to_fids

  $return = $obj->regulons_to_fids($regulons)
Parameter and return types
$regulons is a regulons
$return is a reference to a hash where the key is a regulon and the value is a fids
regulons is a reference to a list where each element is a regulon
regulon is a string
fids is a reference to a list where each element is a fid
fid is a string
Description

The regulons_to_fids routine allows the user to access the set of fids that make up a regulon. Regulons may arise from several sources; hence, fids can be in multiple regulons.

fids_to_protein_sequences

  $return = $obj->fids_to_protein_sequences($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_sequence
fids is a reference to a list where each element is a fid
fid is a string
protein_sequence is a string
Description

fids_to_protein_sequences allows the user to look up the amino acid sequences corresponding to each of a set of fids. You can also get the sequence from proteins (i.e., md5 values). This routine saves you from having to look up the md5 value and then access the protein string in a separate call.

fids_to_proteins

  $return = $obj->fids_to_proteins($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a md5
fids is a reference to a list where each element is a fid
fid is a string
md5 is a string
Description

fids_to_dna_sequences

  $return = $obj->fids_to_dna_sequences($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a dna_sequence
fids is a reference to a list where each element is a fid
fid is a string
dna_sequence is a string
Description

fids_to_dna_sequences allows the user to look up the DNA sequences corresponding to each of a set of fids.

roles_to_fids

  $return = $obj->roles_to_fids($roles, $genomes)
Parameter and return types
$roles is a roles
$genomes is a genomes
$return is a reference to a hash where the key is a role and the value is a fid
roles is a reference to a list where each element is a role
role is a string
genomes is a reference to a list where each element is a genome
genome is a string
fid is a string
Description

A "function" is a set of "roles" (often called "functional roles"):

                F1 / F2  (where F1 and F2 are roles) is a function that implements
                          two functional roles in different domains of the protein.
                F1 @ F2  implements multiple roles through broad specificity.
                F1; F2   is thought to implement F1 or F2 (uncertainty).

You often wish to find the fids in one or more genomes that implement specific functional roles. To do this, you can use roles_to_fids.

reactions_to_complexes

  $return = $obj->reactions_to_complexes($reactions)
Parameter and return types
$reactions is a reactions
$return is a reference to a hash where the key is a reaction and the value is a complexes
reactions is a reference to a list where each element is a reaction
reaction is a string
complexes is a reference to a list where each element is a complex
complex is a string
Description

Reactions are thought of as being either spontaneous or implemented by one or more Complexes. Complexes connect to Roles. Hence, the connection of fids or roles to reactions goes through Complexes.

reaction_strings

  $return = $obj->reaction_strings($reactions, $name_parameter)
Parameter and return types
$reactions is a reactions
$name_parameter is a name_parameter
$return is a reference to a hash where the key is a reaction and the value is a string
reactions is a reference to a list where each element is a reaction
reaction is a string
name_parameter is a string
Description

Reaction_strings are text strings that represent (albeit crudely) the details of Reactions.

The "name_parameter" can be 0, 1 or 'only'. If 1, then the compound name will be included with the ID in the output. If 'only', the compound name will be included instead of the ID. If 0, only the ID will be included. The default is 0.

roles_to_complexes

  $return = $obj->roles_to_complexes($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a complexes
roles is a reference to a list where each element is a role
role is a string
complexes is a reference to a list where each element is a complex
complex is a string
Description

roles_to_complexes allows a user to connect Roles to Complexes. From there, the connection exists to Reactions (although in the actual ER model, the connection from Complex to Reaction goes through ReactionComplex). Since Roles also connect to fids, the connection between fids and Reactions is induced.

fids_to_subsystem_data

  $return = $obj->fids_to_subsystem_data($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a ss_var_role_tuples
fids is a reference to a list where each element is a fid
fid is a string
ss_var_role_tuples is a reference to a list where each element is a ss_var_role_tuple
ss_var_role_tuple is a reference to a list containing 3 items:
	0: a subsystem
	1: a variant
	2: a role
subsystem is a string
variant is a string
role is a string
Description

representative

  $return = $obj->representative($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a genome
genomes is a reference to a list where each element is a genome
genome is a string
Description

otu_members

  $return = $obj->otu_members($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a reference to a hash where the key is a genome and the value is a genome_name
genomes is a reference to a list where each element is a genome
genome is a string
genome_name is a string
Description

fids_to_genomes

  $return = $obj->fids_to_genomes($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a genome
fids is a reference to a list where each element is a fid
fid is a string
genome is a string
Description

text_search

  $return = $obj->text_search($input, $start, $count, $entities)
Parameter and return types
$input is a string
$start is an int
$count is an int
$entities is a reference to a list where each element is a string
$return is a reference to a hash where the key is an entity_name and the value is a reference to a list where each element is a search_hit
entity_name is a string
search_hit is a reference to a list containing 2 items:
	0: a weight
	1: a reference to a hash where the key is a field_name and the value is a string
weight is an int
field_name is a string
Description

text_search performs a search against a full-text index maintained for the CDMI. The parameter "input" is the text string to be searched for. The parameter "entities" defines the entities to be searched. If the list is empty, all indexed entities will be searched. The "start" and "count" parameters limit the results to "count" hits starting at "start".
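
The result groups hits by entity name, and each hit is a [weight, fields] pair. The sketch below merges the hits from a mock result into one ranked list. The entity names, field names, and values are invented, the helper ranked_hits is hypothetical, and treating a larger weight as a better match is an assumption that this document does not confirm.

```perl
use strict;
use warnings;

# Mock data shaped like the return value of text_search:
# entity_name -> list of [ weight, { field_name => value } ].
my $return = {
    Genome  => [ [ 3, { id => 'kb|g.0', scientific_name => 'Example organism A' } ] ],
    Feature => [
        [ 7, { id => 'kb|g.0.peg.10', function => 'Example function' } ],
        [ 2, { id => 'kb|g.0.peg.11', function => 'Another example function' } ],
    ],
};

# Hypothetical helper: merge hits from all entities into one list, ordered
# by descending weight (ASSUMPTION: larger weight means a better match).
sub ranked_hits {
    my ($result) = @_;
    my @hits;
    for my $entity (keys %$result) {
        for my $hit (@{ $result->{$entity} }) {
            my ($weight, $fields) = @$hit;
            push @hits, { entity => $entity, weight => $weight, fields => $fields };
        }
    }
    return [ sort { $b->{weight} <=> $a->{weight} or $a->{entity} cmp $b->{entity} } @hits ];
}

my $ranked = ranked_hits($return);
for my $hit (@$ranked) {
    my $fields = $hit->{fields};
    my $summary = join(', ', map { "$_=$fields->{$_}" } sort keys %$fields);
    print "$hit->{weight}\t$hit->{entity}\t$summary\n";
}
```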


TYPES

annotator

Definition
a string

annotation_time

Definition
an int

comment

Definition
a string

fid

Description

A fid is a "feature id". A feature represents an ordered list of regions from the contigs of a genome. Features all have types. This allows you to speak of not only protein-encoding genes (PEGs) and RNAs, but also binding sites, large regions, etc. The location of a fid is defined as a list of "location of a contiguous DNA string" pieces (see the description of the type "location").

Definition
a string

protein_family

Description

A protein_family is thought of as a set of isofunctional, homologous protein sequences. This is not exactly what other groups have meant by "protein families". There is no hierarchy of super-family, family, sub-family. We plan on loading different collections of protein families, but in many cases there will need to be a transformation into the concept used by Kbase.

Definition
a string

role

Description

The concept of "role" or "functional role" is basically an atomic functional unit. The "function of a protein" is made up of one or more roles. That is, a bifunctional protein with an assigned function of

   5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19) / Cytidylate kinase (EC 2.7.4.14)

would implement two distinct roles (the "function1 / function2" notation is intended to assert that the initial part of the protein implements function1, and the terminal part of the protein implements function2). It is worth noting that a protein often implements multiple roles due to broad specificity. In this case, we suggest describing the protein function as

     function1 @ function2

That is, the ' / ' separator is used to represent multiple roles implemented by distinct domains of the protein, while ' @ ' is used to represent multiple roles implemented through broad specificity (rather than by distinct domains).

Definition
a string
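
As a rough illustration of the notation, a function string can be split into its roles on the ' / ' and ' @ ' separators, and on the '; ' separator that some functions use to express uncertainty between roles. The helper function_to_roles is hypothetical and not part of the CDMI_API. Requiring whitespace around '/' and '@' is an assumption, made so that slashes inside a role name are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: split a function string into its roles.
# ASSUMPTION: separators are ' / ', ' @ ' (with surrounding whitespace)
# and ';' followed by whitespace.
sub function_to_roles {
    my ($function) = @_;
    my @roles = split m{\s+[/\@]\s+|\s*;\s+}, $function;
    return [ grep { length } @roles ];
}

my $function = '5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19)'
             . ' / Cytidylate kinase (EC 2.7.4.14)';

my $roles = function_to_roles($function);
print "$_\n" for @$roles;
```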

subsystem

Description

A subsystem is composed of two components: a set of roles that are gathered to be annotated simultaneously and a spreadsheet depicting the proteins within each genome that implement the roles. The set of roles may correspond to a pathway, a complex, an inventory (say, "transporters") or whatever other principle an annotator used to formulate the subsystem.

The subsystem spreadsheet is a list of "rows", each representing the subsystem in a specific genome. Each row includes a variant code (indicating what version of the molecular machine exists in the genome) and cells. Each cell is a 2-tuple:

     [role,protein-encoding genes that implement the role in the genome]

Annotators construct subsystems, and in the process impose a controlled vocabulary for roles and functions.

Definition
a string

variant

Definition
a string

variant_of_subsystem

Definition
a reference to a list containing 2 items:
0: a subsystem
1: a variant

variant_subsystem_pairs

Definition
a reference to a list where each element is a variant_of_subsystem

type_of_fid

Definition
a string

types_of_fids

Definition
a reference to a list where each element is a type_of_fid

length

Definition
an int

begin

Definition
an int

strand

Description

In encodings of locations, we often specify strands. We specify the strand as '+' or '-'.

Definition
a string

contig

Definition
a string

region_of_dna

Description

A region of DNA is maintained as a tuple of four components:

                the contig
                the beginning position (from 1)
                the strand
                the length

We often speak of "a region". By "location", we mean a sequence of regions from the same genome (perhaps from distinct contigs).

Definition
a reference to a list containing 4 items:
0: a contig
1: a begin
2: a strand
3: a length

location

Description

A "location" refers to a sequence of regions.

Definition
a reference to a list where each element is a region_of_dna

region_of_dna_string

Description

We often need to represent regions or locations as strings. We would use something like

     contigA_200+100,contigA_402+188

to represent a location composed of two regions.

Definition
a string
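
A string such as contigA_200+100,contigA_402+188 can be converted to and from the list-of-tuples form of a location, as sketched below. The field layout (contig, underscore, begin, strand, length) is inferred from the example above rather than taken from a formal grammar, and the helpers parse_location_string and format_location are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "contig_begin{+|-}length,..." into a list of
# [contig, begin, strand, length] tuples.
# ASSUMPTION: the layout is inferred from the example string in this document.
sub parse_location_string {
    my ($string) = @_;
    my @location;
    for my $piece (split /,/, $string) {
        # The greedy (.+) allows contig IDs that themselves contain underscores.
        my ($contig, $begin, $strand, $length) =
            $piece =~ /^(.+)_(\d+)([+-])(\d+)$/
            or die "Unrecognized region string: '$piece'\n";
        push @location, [ $contig, $begin + 0, $strand, $length + 0 ];
    }
    return \@location;
}

# Hypothetical helper: the inverse conversion, from tuples back to a string.
sub format_location {
    my ($location) = @_;
    return join ',', map { "$_->[0]_$_->[1]$_->[2]$_->[3]" } @$location;
}

my $location = parse_location_string('contigA_200+100,contigA_402+188');
for my $region (@$location) {
    my ($contig, $begin, $strand, $length) = @$region;
    print "contig=$contig begin=$begin strand=$strand length=$length\n";
}
print format_location($location), "\n";
```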

region_of_dna_strings

Definition
a reference to a list where each element is a region_of_dna_string

location_string

Definition
a string

dna

Definition
a string

function

Definition
a string

protein

Definition
a string

md5

Definition
a string

genome

Definition
a string

taxonomic_group

Definition
a string

annotation

Description

The Kbase stores annotations relating to features. Each annotation is a 3-tuple:

     the text of the annotation (often a record of assertion of function)
     the annotator attaching the annotation to the feature
     the time (in seconds from the epoch) at which the annotation was attached
Definition
a reference to a list containing 3 items:
0: a comment
1: an annotator
2: an annotation_time
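
Since the annotation time is stored as seconds from the epoch, it usually has to be converted before display. A minimal sketch, using an invented annotation tuple and a hypothetical helper format_annotation, with the time rendered in UTC:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: render a [comment, annotator, annotation_time]
# tuple as one line, with the epoch time shown as a UTC timestamp.
sub format_annotation {
    my ($annotation) = @_;
    my ($comment, $annotator, $time) = @$annotation;
    my $stamp = strftime('%Y-%m-%d %H:%M:%S', gmtime($time));
    return "$stamp UTC  $annotator: $comment";
}

# Invented example annotation; the annotator name is a placeholder.
my $annotation = [
    'Set function to Cytidylate kinase (EC 2.7.4.14)',
    'example_annotator',
    0,
];

my $line = format_annotation($annotation);
print "$line\n";
```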

pubref

Description

The Kbase will include a growing body of literature supporting protein functions, asserted phenotypes, etc. References are encoded as 3-tuples:

     an id (often a PubMed ID)
     a URL to the paper
     a title of the paper

The URL and title are often missing (but can usually be inferred from the PubMed ID).

Definition
a reference to a list containing 3 items:
0: a string
1: a string
2: a string
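
When the URL is missing, it can often be reconstructed from the id, as sketched below. The sketch assumes that a purely numeric id is a PubMed ID and that a missing field arrives as undef or an empty string; neither is guaranteed by this document. The helper fill_pubref_url is hypothetical and the id shown is a placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a [id, url, title] pubref with the
# URL filled in when it is missing and the id looks like a PubMed ID.
# ASSUMPTION: an all-digit id is a PubMed ID.
sub fill_pubref_url {
    my ($pubref) = @_;
    my ($id, $url, $title) = @$pubref;
    my $url_missing = !defined $url || $url eq '';
    if ($url_missing && defined $id && $id =~ /^\d+$/) {
        $url = "https://pubmed.ncbi.nlm.nih.gov/$id/";
    }
    return [ $id, $url, $title ];
}

# Placeholder pubref with no URL and no title.
my $filled = fill_pubref_url([ '12345678', '', '' ]);
print "$filled->[0]\t$filled->[1]\n";
```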

scored_fid

Definition
a reference to a list containing 2 items:
0: a fid
1: a float

annotations

Definition
a reference to a list where each element is an annotation

pubrefs

Definition
a reference to a list where each element is a pubref

roles

Definition
a reference to a list where each element is a role

scored_fids

Definition
a reference to a list where each element is a scored_fid

locations

Definition
a reference to a list where each element is a location

proteins

Definition
a reference to a list where each element is a protein

functions

Definition
a reference to a list where each element is a function

taxonomic_groups

Definition
a reference to a list where each element is a taxonomic_group

subsystems

Definition
a reference to a list where each element is a subsystem

contigs

Definition
a reference to a list where each element is a contig

md5s

Definition
a reference to a list where each element is a md5

genomes

Definition
a reference to a list where each element is a genome

pair_of_fids

Definition
a reference to a list containing 2 items:
0: a fid
1: a fid

pairs_of_fids

Definition
a reference to a list where each element is a pair_of_fids

protein_families

Definition
a reference to a list where each element is a protein_family

score

Definition
a float

evidence

Definition
a reference to a list where each element is a pair_of_fids

fids

Definition
a reference to a list where each element is a fid

row

Definition
a reference to a list containing 2 items:
0: a variant
1: a reference to a hash where the key is a role and the value is a fids

fid_function_pair

Definition
a reference to a list containing 2 items:
0: a fid
1: a function

fid_function_pairs

Definition
a reference to a list where each element is a fid_function_pair

fc_protein_family

Description

A functionally coupled protein family is described by three items: the family, a score indicating the coupling strength, and a function.

Definition
a reference to a list containing 3 items:
0: a protein_family
1: a score
2: a function

fc_protein_families

Definition
a reference to a list where each element is a fc_protein_family

aux

Definition
an int

fields

Definition
a reference to a list where each element is a string

complex

Definition
a string

complexes

Definition
a reference to a list where each element is a complex

name

Definition
a string

reaction

Definition
a string

reactions

Definition
a reference to a list where each element is a reaction

complex_data

Description

Reactions do not connect directly to roles. Rather, the conceptual model is that one or more roles together form a complex. A complex implements one or more reactions. The actual data relating to a complex is spread over two entities: Complex and ReactionComplex. It is convenient to be able to offer access to the complex name, the reactions it implements, and the roles that make it up in a single invocation.

Definition
a reference to a hash where the following keys are defined:
complex_name has a value which is a name
complex_roles has a value which is a roles
complex_reactions has a value which is a reactions

feature_data

Definition
a reference to a hash where the following keys are defined:
feature_id has a value which is a fid
genome_name has a value which is a string
feature_function has a value which is a string
feature_length has a value which is an int
feature_publications has a value which is a pubrefs

expert

Definition
a string

source

Definition
a string

id

Definition
a string

function_assertion

Definition
a reference to a list containing 4 items:
0: an id
1: a function
2: a source
3: an expert

function_assertions

Definition
a reference to a list where each element is a function_assertion

regulon

Definition
a string

regulon_size

Definition
an int

regulon_size_pair

Definition
a reference to a list containing 2 items:
0: a regulon
1: a regulon_size

regulon_size_pairs

Definition
a reference to a list where each element is a regulon_size_pair

regulons

Definition
a reference to a list where each element is a regulon

protein_sequence

Definition
a string

dna_sequence

Definition
a string

name_parameter

Definition
a string

ss_var_role_tuple

Definition
a reference to a list containing 3 items:
0: a subsystem
1: a variant
2: a role

ss_var_role_tuples

Definition
a reference to a list where each element is a ss_var_role_tuple

genome_name

Definition
a string

entity_name

Definition
a string

weight

Definition
an int

field_name

Definition
a string

search_hit

Definition
a reference to a list containing 2 items:
0: a weight
1: a reference to a hash where the key is a field_name and the value is a string