PGAP Protein Annotation Evidence Documentation

Protein Family Models - assigning names to proteins

Three major types of evidence are used by the Prokaryotic Genome Annotation Pipeline (PGAP) to assign names and other attributes, such as gene symbols, publications, and EC numbers, to predicted proteins. They are Hidden Markov Models (HMMs), BlastRules, and domain architectures.

These types of evidence are created based on the sequence similarity and structure of the protein family they define. They are hierarchically organized according to their specificity and are assigned family types, which depend on the diversity of the proteins in the family. For example, a broad-specificity HMM of family type ‘domain’ typically hits a large number of proteins, usually with the same or similar domain architectures but low overall sequence similarity. By contrast, an HMM of ‘subfamily’ type may hit fewer proteins, with significant sequence similarity throughout their sequence and the same domain architecture.

Family types, and by extension the naming evidence, are assigned a precedence. If a protein is hit by several pieces of evidence, it inherits the name and attributes from the evidence with the highest precedence. For example, the RefSeq protein WP_000019731.1 is hit by three pieces of evidence: the superfamily HMM TIGR01297.1 (product name: cation diffusion facilitator family transporter), the BlastRuleException NBR007910 (product name: CDF family zinc efflux transporter CzrB), and the domain architecture arch11440813 (product name: cation transporter). It is named ‘CDF family zinc efflux transporter CzrB’ based on BlastRule NBR007910, which has the highest precedence of the three.
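The selection rule can be illustrated with a minimal Python sketch. It is not PGAP code: the Evidence class and the pick_naming_evidence function are hypothetical names introduced for illustration, and the precedence scores are the ones given for each family type in the family types section of this document. How PGAP resolves ties between equal scores is not stated here, so the sketch simply keeps the first hit it encounters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    accession: str     # evidence identifier, e.g. "NBR007910"
    family_type: str   # family type as named in this document
    product_name: str  # name the evidence would assign
    precedence: int    # precedence score of the family type


def pick_naming_evidence(hits):
    """Return the hit with the highest precedence score, or None if there are no hits.

    Assumption: on a tie, the first hit in the list is kept.
    """
    return max(hits, key=lambda e: e.precedence, default=None)


# The three hits on WP_000019731.1 described in the text.
hits = [
    Evidence("TIGR01297.1", "Superfamily HMM",
             "cation diffusion facilitator family transporter", 33),
    Evidence("NBR007910", "BlastRuleException",
             "CDF family zinc efflux transporter CzrB", 95),
    Evidence("arch11440813", "Domain architecture",
             "cation transporter", 60),
]

winner = pick_naming_evidence(hits)
print(winner.accession, "->", winner.product_name)
```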

Hidden Markov Models (HMMs)

An HMM-based protein family is a probabilistic model used to determine which proteins belong or don’t belong to the family. To construct HMMs, multiple sequence alignments (seed alignments) of proteins of known function are converted into a position-specific scoring system to generate an HMM profile. Amino acids at each position on the seed alignment are given a score according to their frequency. Sequence and domain cutoffs are established based on the seed alignments and used as minimum thresholds for query proteins to be classified as members of the HMM's protein family (See how to build an HMM).
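The idea of turning a seed alignment into position-specific scores can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how HMMER builds a profile: a real profile HMM also models insertions and deletions, weights the sequences, and uses more sophisticated priors. The toy alignment, the uniform background frequency, and the add-one pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy seed alignment (hypothetical sequences; '-' marks a gap).
seed_alignment = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MKVL-",
]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequency
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def column_scores(alignment):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + PSEUDOCOUNT * len(ALPHABET)
        scores.append({
            res: math.log2(((counts[res] + PSEUDOCOUNT) / total) / BACKGROUND)
            for res in ALPHABET
        })
    return scores


def score_sequence(seq, scores):
    """Sum per-position scores for an ungapped sequence of the alignment's length."""
    return sum(col[res] for res, col in zip(seq, scores))


profile = column_scores(seed_alignment)
print(round(score_sequence("MKVLA", profile), 2))  # resembles the seed sequences
print(round(score_sequence("WWWWW", profile), 2))  # unrelated sequence scores lower
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so a sequence resembling the seed alignment accumulates a higher total. A cutoff placed between the scores of true members and non-members then acts as the membership threshold.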

The HMMs used by PGAP come from a variety of sources. Some were built from scratch based on publications documenting protein function (NCBIFAM); others were based on the NCBI protein clusters (PRKs). PGAP also uses TIGRFAMs (now owned by NCBI), either as built by TIGR or in modified form. A subset of Pfam HMMs to which NCBI has associated a protein product name is also used. Most Pfam HMMs are built to describe domains found within proteins rather than the proteins themselves. Because they lack a curated product name, they are considered provisional and are therefore not used for functional annotation by PGAP.

Each HMM used in PGAP is assigned an NCBI accession (“NF” prefix, or "TIGR" prefix for TIGRFAMs) with a version that is incremented if the seed alignments or cutoffs for the HMM are modified. For HMMs that originated from an outside source, the source identifiers ("TIGR" and "PF" identifiers for TIGRFAMs and Pfams, respectively) are also provided in the RefSeq protein and the evidence records. For these HMMs, the product names, cutoffs, or even seed alignments may differ from the values assigned originally by the outside source.
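As a small illustration of the accession format, the following Python sketch splits an HMM accession into its prefix, number, and version. The regular expression is an assumption based only on the examples in this document (such as TIGR01297.1 and NF023550); it is not an official grammar, and parse_hmm_accession is a hypothetical helper name.

```python
import re

# Assumption: an accession is a known prefix, a run of digits, and an optional ".version".
ACCESSION_RE = re.compile(r"^(NF|TIGR)(\d+)(?:\.(\d+))?$")


def parse_hmm_accession(text):
    """Split an HMM accession into (prefix, number, version); version is None if absent."""
    match = ACCESSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a recognized HMM accession: {text!r}")
    prefix, number, version = match.groups()
    return prefix, number, (int(version) if version is not None else None)


print(parse_hmm_accession("TIGR01297.1"))
print(parse_hmm_accession("NF023550"))
```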

In PGAP, predicted proteins are matched to HMMs using the hmmsearch program in the HMMER software (V3.2.1). A protein is considered a hit and assigned the product name and other attributes from the HMM if both its sequence score and its domain score are above the cutoffs defined for the HMM (See how HMMs are used in protein annotation by PGAP). HMMs used in PGAP for protein annotation are available on NCBI’s FTP site.
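The two-cutoff test can be written as a short Python sketch. It does not run hmmsearch; it assumes the sequence and domain bit scores have already been read from hmmsearch output. The HmmCutoffs class, the is_hmm_hit function, and the numeric values are hypothetical, and treating a score equal to the cutoff as a hit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmmCutoffs:
    sequence_cutoff: float  # minimum full-sequence bit score
    domain_cutoff: float    # minimum domain bit score


def is_hmm_hit(sequence_score, domain_score, cutoffs):
    """A protein is a hit only if both scores reach their cutoffs.

    Assumption: a score exactly equal to its cutoff counts as a hit.
    """
    return (sequence_score >= cutoffs.sequence_cutoff
            and domain_score >= cutoffs.domain_cutoff)


# Hypothetical cutoffs and scores, for illustration only.
cutoffs = HmmCutoffs(sequence_cutoff=250.0, domain_cutoff=250.0)
print(is_hmm_hit(310.5, 298.2, cutoffs))  # both scores pass
print(is_hmm_hit(310.5, 120.0, cutoffs))  # domain score fails
```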

BlastRules

BlastRules (identifiers starting with the “NBR” prefix) are a type of evidence for functional classification of proteins based on BLAST (Basic Local Alignment Search Tool). A BlastRule consists of one or more 'model' proteins with known biological function, and BLAST identity and coverage cutoffs. Any protein aligning to a model protein above the cutoffs is considered a BlastRule hit.

BlastRules are typically created for proteins which may play significant roles in virulence, antibiotic resistance, evolution, and pathogenicity, as documented in scientific journals. Curators review the literature to determine whether the biological function of studied proteins is conclusive and informative enough for creating a BlastRule. The protein sequences cited in articles are retrieved from the database and used as queries for BLAST searches in a database of proteins with known function. The identity and coverage cutoffs of the BlastRule are determined based on the BLAST results, as well as phylogenetic analyses of the BLAST hits.

During the PGAP annotation process, predicted proteins are searched against a collection of BlastRules using the NCBI BLAST tool. A protein is considered a BlastRule hit and assigned the product name and other attributes from the BlastRule if its alignment to a model protein exceeds the sequence identity and coverage cutoffs of the BlastRule. BlastRules used in PGAP for protein annotation are available on NCBI’s FTP site.
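The BlastRule test can be sketched in Python as follows. This is an illustration under stated assumptions, not PGAP's implementation: the BlastRule class, the helper functions, and the rule accession are hypothetical, percent identity and model coverage are computed with commonly used definitions, and the 94% and 90% cutoffs are the BlastRuleException values given in the family types section of this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRule:
    accession: str
    product_name: str
    min_identity: float  # percent identity cutoff
    min_coverage: float  # percent of the model protein covered by the alignment


def percent_identity(identical_positions, alignment_length):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * identical_positions / alignment_length


def percent_coverage(model_start, model_end, model_length):
    """Aligned span of the model protein (1-based, inclusive) as a percentage of its length."""
    return 100.0 * (model_end - model_start + 1) / model_length


def is_blastrule_hit(rule, identity, coverage):
    """Assumption: values equal to the cutoffs count as a hit."""
    return identity >= rule.min_identity and coverage >= rule.min_coverage


# Hypothetical rule; the accession is a placeholder, not a real BlastRule.
rule = BlastRule("NBR_EXAMPLE", "example transporter", 94.0, 90.0)

identity = percent_identity(identical_positions=285, alignment_length=300)
coverage = percent_coverage(model_start=1, model_end=290, model_length=300)
print(is_blastrule_hit(rule, identity, coverage))
```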

Domain architectures

Proteins can be classified and grouped into evolutionarily conserved families based on their domain architecture, that is, the nature and order of conserved domain signatures identified along the sequence. Very often such domain architectures are associated with a specific function. The Conserved Domain Database (CDD) curation staff maintain a comprehensive collection of common protein domain architectures, derived from pre-computed annotation of proteins with domain footprints. Architectures with significant coverage are reviewed and given names, with an emphasis on architectures prominent in bacteria. The Subfamily Protein Architecture Labeling Engine (SPARCLE) is used by PGAP for the functional characterization and naming of protein sequences that have been grouped by their characteristic conserved domain architecture. Names derived from domain architectures are sometimes rather generic, as domain architectures may encompass a variety of specific functions and/or functionally uncharacterized proteins. Protein domain architectures and related information retrieval services are maintained by the CDD/SPARCLE team at NCBI. Detailed information is available on the NCBI CDD web pages.
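To make the notion of "nature and order of domains" concrete, the Python sketch below represents an architecture as the ordered tuple of domain names along a protein and groups proteins that share it. This is only a toy model and does not reflect how SPARCLE is implemented; the protein identifiers, domain names, and the architecture_of function are hypothetical.

```python
from collections import defaultdict


def architecture_of(domain_hits):
    """Order (start_position, domain_name) hits by position and return the domain names."""
    return tuple(name for _, name in sorted(domain_hits))


# Hypothetical proteins with (start_position, domain_name) footprints.
proteins = {
    "protA": [(120, "ATPase"), (5, "Sensor")],
    "protB": [(8, "Sensor"), (130, "ATPase")],
    "protC": [(10, "ATPase"), (200, "Sensor")],
}

by_architecture = defaultdict(list)
for protein_id, hits in proteins.items():
    by_architecture[architecture_of(hits)].append(protein_id)

for architecture, members in by_architecture.items():
    print(" + ".join(architecture), "->", members)
```

protA and protB share the same domains in the same order and fall into one group, while protC has the same domains in a different order and forms a separate architecture.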

Family types and order of precedence of the naming evidence

If a protein is hit by several pieces of evidence, it inherits the name and attributes from the evidence with the highest-precedence family type. The family types used in the evidence hierarchy for naming proteins are defined below, from the highest to the lowest precedence:

BlastRuleIS (Transposase BlastRule)
BlastRuleIS was originally designed for transposases on insertion sequence (IS) elements with nomenclature from ISFinder. However, the default cutoffs for a BlastRuleIS (99% sequence identity) are stricter than the protein percent identity cutoff suggested by ISFinder (98%) (precedence score = 96).
BlastRuleException (Exception BlastRule)
A BlastRuleException is used to annotate a special group of proteins that have a more specific function within a protein family, such as listeriolysin O, one of many named cholesterol-dependent cytolysins. The identity and model protein coverage cutoffs of a BlastRuleException are set to 94% and 90%, respectively (precedence score = 95).
Exception HMM
An exception HMM recognizes proteins that share a specific chemical function, plus at least one additional distinguishing feature such as having an extended region or belonging to a named subclade. Examples of exception HMMs include specifically named isozymes that are expressed only for certain pathways or biological processes (precedence score = 77).
Equivalog HMM
An equivalog HMM recognizes groups of proteins that are homologous, similar in domain architecture, and consistent enough in their specific function that all can receive the same functionally descriptive name. Equivalog proteins are presumed to have descended from a shared ancestral protein that had the same function. If the member proteins of an equivalog are enzymes, all should share the same EC number (precedence score = 70).
Hypothetical equivalog HMM
Hypothetical equivalog HMMs are treated the same as Equivalog HMMs. Members of this HMM family are expected to have the same specific function, but it may not yet be known what the function is, and member proteins consequently may be assigned rather vague-sounding names (precedence score = 70).
Equivalog domain HMM
Equivalog domain HMMs are treated the same as Equivalog HMMs. The region hit by the HMM is considered sufficient for assigning member proteins a specific functional name, but domain architecture is known to be variable among the proteins within the family (precedence score = 70).
Hypothetical equivalog domain HMM
Hypothetical equivalog domain HMMs are treated the same as Equivalog HMMs. The function is presumed to be consistent for members of the family, but may not yet be known. Domain architecture may be variable across the family, but the region described by the HMM belongs to a conserved core whose presence is considered sufficient for naming member proteins (precedence score = 70).
BlastRuleEquivalog (Equivalog BlastRule)
An equivalog BlastRule resembles an equivalog HMM in design and purpose, but it receives a slightly lower precedence score than that of an equivalog HMM. The percent identity cutoff of BlastRuleEquivalog is set to 80% upon creation by default, and then may be adjusted (precedence score = 69).
BlastRuleSubPlus (Subfamily-plus BlastRule)
This type of BlastRule (subfamily-plus) enforces nearly full-length alignment between a model protein from the rule’s definition and the candidate protein that it matches. Rules of this type provide names to rather narrowly defined protein subfamilies; “plus” means rules of this sort out-rank both CDD domain architectures and subfamily HMMs (precedence score = 65).
Domain architecture
There are two types of conserved domain architectures: superfamily and subfamily architectures. Superfamily architectures consist solely of conserved domain superfamilies, which implies only a general functional category for proteins that have that architecture. Subfamily architectures either contain a mix of conserved domain superfamilies and subfamilies or consist solely of conserved domain subfamilies. Currently, only subfamily domain architectures are used for PGAP annotation of proteins (precedence score = 60).
PfamEq (Pfam equivalog HMM)
Some Pfam HMMs hit exclusively proteins with a single named function, as HMMs built to find equivalogs do. However, such models in Pfam tend to have gathering thresholds permissive enough that additional proteins with only distant homology to the main cohort of proteins may score well enough to be included, despite differing in function. Users should be wary of functional assignments made by such HMMs when match scores, though above cutoff, are unusually low for the family (precedence score = 57).
Subfamily HMM
A subfamily HMM hits collections of proteins that typically show nearly full-length homology, and may share a general function (e.g. NAD-dependent oxidoreductase), but often vary in specific function (precedence score = 55).
BlastRuleSubMinus (Subfamily-Minus BlastRule)
This infrequently used type of BlastRule enforces nearly full-length alignment between a model protein from the rule’s definition and the candidate protein that it matches, but is assigned a low precedence in annotation, as though the name it applies were not very specific (precedence score = 50).
BlastRuleCOLLAB (Collaboration BlastRule)
BlastRuleCOLLABs are designed for rapid import of large numbers of BlastRules supplied by trusted external contributors. Once entered into our evidence system, BlastRuleCOLLABs can be subjected to additional testing and then promotion to different BlastRule types that have higher precedence (precedence score = 41).
PfamAutoEq HMM
Computational analysis has suggested that this HMM, from the Pfam collection, behaves in certain ways like an equivalog HMM, but the standard warning applies that Pfam HMMs typically have permissive cutoffs set to help identify all homologs, rather than stringent cutoffs designed to exclude homologs differing in function from the family that the model describes (precedence score = 37).
Paralog HMM
A few older TIGRFAMs models were built to describe protein families that were abundant in at least one narrow lineage, while rare or previously never seen outside that lineage. Proteins in these families tend to be similar in length and align almost from end to end, and may contain recognizable homology domains shared with proteins outside the family. The paralog HMM is thus a special case of the subfamily HMM (precedence score = 35).
Superfamily HMM
Superfamily HMMs hit collections of proteins that typically show nearly full-length homology, and that in addition may be able to detect essentially all homologs, rather than just one clade from such a collection of proteins. A superfamily HMM can encompass several different subfamilies (precedence score = 33).
Domain HMM
A domain is a localized region of sequence homology that is shared across proteins from different families, whose other regions may be completely unrelated. Because HMMs that detect homology domains find proteins that have a variety of different functions, and may describe only a small fraction of each protein, domain HMMs name proteins in a fairly general way (e.g. NF023550) (precedence score = 30).
Repeat HMM
Compared to domain HMMs, repeat HMMs tend to describe even smaller regions, usually occurring as multiple units arranged in tandem. A single repeat unit may be too small to fold independently. The small size of the repeat region, the correspondingly low cutoff scores necessitated by the small size, and the risk of false-positive sequence matches give repeat HMMs a very low precedence; they are not yet used in PGAP/RefSeq annotation.
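The precedence scores listed in this section can be collected into a lookup table, which makes the ordering and the ties easy to inspect. The Python sketch below is illustrative only: the dictionary keys follow the family type names used in this section, outranks is a hypothetical helper, and Repeat HMM is left out because no score is given for it here.

```python
from collections import defaultdict

# Precedence scores as listed in this section (Repeat HMM has no stated score).
PRECEDENCE = {
    "BlastRuleIS": 96,
    "BlastRuleException": 95,
    "Exception HMM": 77,
    "Equivalog HMM": 70,
    "Hypothetical equivalog HMM": 70,
    "Equivalog domain HMM": 70,
    "Hypothetical equivalog domain HMM": 70,
    "BlastRuleEquivalog": 69,
    "BlastRuleSubPlus": 65,
    "Domain architecture": 60,
    "PfamEq": 57,
    "Subfamily HMM": 55,
    "BlastRuleSubMinus": 50,
    "BlastRuleCOLLAB": 41,
    "PfamAutoEq HMM": 37,
    "Paralog HMM": 35,
    "Superfamily HMM": 33,
    "Domain HMM": 30,
}


def outranks(family_type_a, family_type_b):
    """True if family_type_a has strictly higher precedence than family_type_b."""
    return PRECEDENCE[family_type_a] > PRECEDENCE[family_type_b]


# Group family types by score to show which ones share a precedence level.
tiers = defaultdict(list)
for family_type, score in PRECEDENCE.items():
    tiers[score].append(family_type)

for score in sorted(tiers, reverse=True):
    print(score, ", ".join(tiers[score]))
```

For example, outranks("BlastRuleSubPlus", "Domain architecture") and outranks("BlastRuleSubPlus", "Subfamily HMM") are both true, which matches the description of the "plus" in BlastRuleSubPlus, while the four equivalog HMM types share the score 70.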