Protein Family Models - assigning names to proteins
Hidden Markov Models (HMMs)
BlastRules
Domain architectures
Family types and order of precedence of the naming evidence
BlastRuleIS (Transposase BlastRule)
BlastRuleIS was originally designed for transposases on insertion sequence (IS) elements, with nomenclature from ISFinder. However, the default cutoff for a BlastRuleIS (99% sequence identity) is stricter than the protein percent identity cutoff suggested by ISFinder (98%) (precedence score = 96).
BlastRuleException (Exception BlastRule)
A BlastRuleException is used to annotate a special group of proteins that have a more specific function within a protein family, such as listeriolysin O, one of many named cholesterol-dependent cytolysins. The identity and model protein coverage cutoffs of a BlastRuleException are set to 94% and 90%, respectively (precedence score = 95).
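The two cutoffs above can be read as a simple pass/fail test on an alignment. The following is a minimal sketch, not PGAP's implementation: the `BlastHit` class, its field names, and the `passes_exception_rule` function are hypothetical, and it assumes that both cutoffs are inclusive minimums expressed as percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical summary of one alignment of a model protein to a candidate."""
    percent_identity: float        # 0-100
    model_coverage_percent: float  # share of the model protein aligned, 0-100


# Cutoffs quoted in the text for BlastRuleException.
EXCEPTION_MIN_IDENTITY = 94.0
EXCEPTION_MIN_MODEL_COVERAGE = 90.0


def passes_exception_rule(
    hit: BlastHit,
    min_identity: float = EXCEPTION_MIN_IDENTITY,
    min_model_coverage: float = EXCEPTION_MIN_MODEL_COVERAGE,
) -> bool:
    """Return True when the hit meets both cutoffs (assumed inclusive)."""
    return (
        hit.percent_identity >= min_identity
        and hit.model_coverage_percent >= min_model_coverage
    )


if __name__ == "__main__":
    print(passes_exception_rule(BlastHit(96.0, 95.0)))  # meets both cutoffs
    print(passes_exception_rule(BlastHit(99.0, 80.0)))  # coverage too low
```

Keeping the cutoffs as parameters reflects that other BlastRule types use different values, such as the 99% identity default described for BlastRuleIS.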
Exception HMM
An exception HMM recognizes proteins that share a specific chemical function, plus at least one additional distinguishing feature such as having an extended region or belonging to a named subclade. Examples of exception HMMs include specifically named isozymes that are expressed only for certain pathways or biological processes (precedence score = 77).
Equivalog HMM
An equivalog HMM recognizes groups of proteins that are homologous, similar in domain architecture, and consistent enough in their specific function that all can receive the same functionally descriptive name. Equivalog proteins are presumed to have descended from a shared ancestral protein that had the same function. If the member proteins of an equivalog are enzymes, all should share the same EC number (precedence score = 70).
Hypothetical equivalog HMM
Hypothetical equivalog HMMs are treated the same as Equivalog HMMs. Members of this HMM family are expected to have the same specific function, but it may not yet be known what that function is, and member proteins consequently may be assigned rather vague-sounding names (precedence score = 70).
Equivalog domain HMM
Equivalog domain HMMs are treated the same as Equivalog HMMs. The region hit by the HMM is considered sufficient for assigning member proteins a specific functional name, but domain architecture is known to be variable among the proteins within the family (precedence score = 70).
Hypothetical equivalog domain HMM
Hypothetical equivalog domain HMMs are treated the same as Equivalog HMMs. The function is presumed to be consistent for members of the family, but may not yet be known. Domain architecture may be variable across the family, but the region described by the HMM belongs to a conserved core whose presence is considered sufficient for naming member proteins (precedence score = 70).
BlastRuleEquivalog (Equivalog BlastRule)
An equivalog BlastRule resembles an equivalog HMM in design and purpose, but it receives a slightly lower precedence score than an equivalog HMM. The percent identity cutoff of a BlastRuleEquivalog is set to 80% by default upon creation, and may then be adjusted (precedence score = 69).
BlastRuleSubPlus (Subfamily-plus BlastRule)
This type of BlastRule (subfamily-plus) enforces nearly full-length alignment between a model protein from the rule’s definition and the candidate protein that it matches. Rules of this type provide names to rather narrowly defined protein subfamilies; “plus” means that rules of this sort outrank both CDD domain architectures and subfamily HMMs (precedence score = 65).
Domain architecture
There are two types of conserved domain architectures: superfamily and subfamily architectures. Superfamily architectures consist solely of conserved domain superfamilies, which implies a general functional category for proteins that have that architecture. Subfamily architectures either contain a mix of conserved domain superfamilies and subfamilies or consist solely of conserved domain subfamilies. Currently, only subfamily domain architectures are used for PGAP annotation of proteins (precedence score = 60).
PfamEq (Pfam equivalog HMM)
Some Pfam HMMs hit exclusively proteins with a single named function, as HMMs built to find equivalogs do. However, such models in Pfam tend to have gathering thresholds permissive enough that additional proteins with only distant homology to the main cohort may score well enough to be included, despite differing in function. Users should be wary of functional assignments made by such HMMs when match scores, though above cutoff, are unusually low for the family (precedence score = 57).
Subfamily HMM
A subfamily HMM hits collections of proteins that typically show nearly full-length homology, and may share a general function (e.g. NAD-dependent oxidoreductase), but often vary in specific function (precedence score = 55).
BlastRuleSubMinus (Subfamily-minus BlastRule)
This infrequently used type of BlastRule enforces nearly full-length alignment between a model protein from the rule’s definition and the candidate protein that it matches, but is assigned a low precedence in annotation as if the name it applies is not very specific (precedence score = 50).
BlastRuleCOLLAB (Collaboration BlastRule)
BlastRuleCOLLABs are designed for rapid import of large numbers of BlastRules supplied by trusted external contributors. Once entered into our evidence system, BlastRuleCOLLABs can be subjected to additional testing and then promoted to different BlastRule types that have higher precedence (precedence score = 41).
PfamAutoEq HMM
Computational analysis has suggested that this HMM, from the Pfam collection, behaves in certain ways like an equivalog HMM. The standard warning applies, however: Pfam HMMs typically have permissive cutoffs set to help identify all homologs, rather than stringent cutoffs designed to exclude homologs that differ in function from the family the model describes (precedence score = 37).
Paralog HMM
A few older TIGRFAMs models were built to describe protein families that were abundant in at least one narrow lineage, while rare or previously never seen outside that lineage. Proteins in these families tend to be similar in length and align almost from end to end, and may contain recognizable homology domains shared with proteins outside the family. The paralog HMM is thus a special case of the subfamily HMM (precedence score = 35).
Superfamily HMM
Superfamily HMMs hit collections of proteins that typically show nearly full-length homology, and in addition may be able to detect essentially all homologs, rather than just one clade from such a collection of proteins. A superfamily HMM can encompass several different subfamilies (precedence score = 33).
Domain HMM
A domain is a localized region of sequence homology that is shared across proteins from different families, whose other regions may be completely unrelated. Because HMMs that detect homology domains find proteins that have a variety of different functions, and may describe only a small fraction of proteins, domain HMMs name proteins in a fairly general way (e.g. NF023550) (precedence score = 30).
Repeat HMM
Compared to domain HMMs, repeat HMMs tend to describe even smaller regions, usually occurring as multiple copies arranged in tandem. A single repeat unit may be too small to fold independently. The small size of the repeat region, the correspondingly low cutoff scores necessitated by that small size, and the risk of false-positive sequence matches give repeat HMMs a very low precedence, one that is not yet used in PGAP/RefSeq annotation.
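The precedence scores listed above can be collected into a lookup table to illustrate how the highest-ranking evidence determines a protein's name. This is a minimal sketch under stated assumptions, not PGAP's actual logic: the key strings, the `pick_naming_evidence` function, and the tie-breaking by input order are all hypothetical. Repeat HMMs are left out because the text gives no score for them.

```python
# Precedence scores as quoted in the family type descriptions above.
# The key strings are illustrative labels, not official identifiers.
PRECEDENCE = {
    "BlastRuleIS": 96,
    "BlastRuleException": 95,
    "Exception HMM": 77,
    "Equivalog HMM": 70,
    "Hypothetical equivalog HMM": 70,
    "Equivalog domain HMM": 70,
    "Hypothetical equivalog domain HMM": 70,
    "BlastRuleEquivalog": 69,
    "BlastRuleSubPlus": 65,
    "Domain architecture": 60,
    "PfamEq": 57,
    "Subfamily HMM": 55,
    "BlastRuleSubMinus": 50,
    "BlastRuleCOLLAB": 41,
    "PfamAutoEq HMM": 37,
    "Paralog HMM": 35,
    "Superfamily HMM": 33,
    "Domain HMM": 30,
}


def pick_naming_evidence(hits):
    """Pick the hit with the highest precedence score.

    hits: iterable of (family_type, proposed_name) pairs.
    Returns (family_type, proposed_name, score), or None if no hit has a
    family type with a known score. Ties go to the earliest hit in the
    input, which is an assumption of this sketch.
    """
    ranked = [
        (family_type, name, PRECEDENCE[family_type])
        for family_type, name in hits
        if family_type in PRECEDENCE
    ]
    if not ranked:
        return None
    return max(ranked, key=lambda item: item[2])


if __name__ == "__main__":
    # Illustrative hits for one protein; the names are placeholders.
    hits = [
        ("Subfamily HMM", "general family name"),
        ("BlastRuleException", "specific protein name"),
        ("Domain HMM", "domain-containing protein"),
    ]
    print(pick_naming_evidence(hits))
```

In this example the BlastRuleException hit wins because its score of 95 exceeds the 55 and 30 of the other two evidence types.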