What are Protein Family Models?

The Protein Family Models database contains hidden Markov models (HMMs), conserved domain (CDD) architectures, and BlastRules that represent protein families found in prokaryotes and eukaryotes. Notably, this database includes the models that the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) uses for structural and functional annotation.

HMMs

An HMM-based protein family model is a probabilistic model used to determine whether a protein belongs to the family. HMMs are built by converting multiple sequence alignments (seed alignments) of proteins of known function into a position-specific scoring system, the HMM profile. Each amino acid at each position in the seed alignment is scored according to its frequency at that position. Sequence and domain cutoffs are established based on the seed alignments and serve as the minimum scores a query protein must reach to be classified as a member of the HMM's protein family (see how to build an HMM). The HMMs in the database come from several collections:

NCBIFAMs: developed at NCBI, either de novo based on publications documenting protein function or from NCBI protein clusters (PRKs)
TIGRFAMs: created at The Institute for Genomic Research (now J. Craig Venter Institute) but now owned and maintained by NCBI
Pfams: maintained and distributed by EMBL-EBI. NCBI expert curators may add attributes to the Entrez records, such as product names, gene symbols, or publications, that are not distributed in the EMBL-EBI Pfam releases.
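
The frequency-based scoring and cutoff idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the procedure NCBI or HMMER uses: it builds a gap-free position-specific score table rather than a full profile HMM (no insert or delete states), and the toy seed sequences, the uniform background, the pseudocount, and the rule of taking the lowest seed score as the cutoff are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy seed alignment: equal-length sequences, no gaps.
SEED = ["ACDE", "ACDE", "ACDQ", "GCDE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequency
PSEUDOCOUNT = 1.0                 # assumption: add-one smoothing


def build_profile(seed):
    """Return one dict per alignment column: residue -> log-odds score in bits."""
    profile = []
    for column in zip(*seed):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, query):
    """Sum the per-position scores of a query with the same length as the profile."""
    if len(query) != len(profile):
        raise ValueError("this sketch only scores gap-free, full-length queries")
    return sum(column[aa] for column, aa in zip(profile, query))


def is_member(profile, query, cutoff):
    """A query is classified as a family member if it reaches the cutoff."""
    return score(profile, query) >= cutoff


profile = build_profile(SEED)
# Illustrative cutoff choice: the lowest score among the seed sequences.
cutoff = min(score(profile, s) for s in SEED)
```

With these definitions, `is_member(profile, "ACDE", cutoff)` evaluates the consensus-like sequence against the threshold, while a sequence of residues never seen in the seed alignment scores below it.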

Conserved domain (CDD) architectures

Conserved domain architectures are protein classifiers based on the identity and order of the conserved domain signatures found along a protein sequence. Architectures with significant coverage are reviewed and given names, with an emphasis on architectures prominent in bacteria.
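
As a rough illustration of classifying by the identity and order of domains, the sketch below sorts domain hits by their start position and looks the resulting ordered tuple up in a table of named architectures. The domain names, coordinates, and architecture names are placeholders invented for the example, and the lookup table is a hypothetical stand-in for the curated architecture names.

```python
# Hypothetical table of reviewed architectures: ordered domain names -> name.
NAMED_ARCHITECTURES = {
    ("DomainA", "DomainB"): "example family AB",
    ("DomainB", "DomainA"): "example family BA",
}


def architecture(hits):
    """Order (name, start, end) domain hits by start and return the names."""
    return tuple(name for name, start, end in sorted(hits, key=lambda h: h[1]))


def classify(hits):
    """Return the architecture name, or None if the architecture is unnamed."""
    return NAMED_ARCHITECTURES.get(architecture(hits))


# Hits may arrive in any order; only their position on the protein matters.
example_hits = [("DomainB", 120, 200), ("DomainA", 5, 100)]
example_name = classify(example_hits)
```

Because the key is an ordered tuple, two proteins with the same domains in a different order fall into different architectures, which is the property the paragraph above describes.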

BlastRules

BlastRules are a type of evidence for the functional classification of proteins based on BLAST (Basic Local Alignment Search Tool). A BlastRule consists of one or more 'model' proteins with known biological function, together with BLAST identity and coverage cutoffs. Any protein that aligns to a model protein above the cutoffs is considered a BlastRule hit. Curators review the literature to determine whether the biological function of the studied proteins is conclusive and informative enough to create a BlastRule. The protein sequences cited in articles are used as queries for BLAST searches in a database of proteins with known function. The identity and coverage cutoffs of the BlastRule are determined based on the BLAST results themselves, as well as phylogenetic analyses of the BLAST hits.
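
The hit test described above can be sketched as a simple threshold check. The class and field names here are hypothetical, not an NCBI data format. The sketch also assumes that coverage means the percentage of the model protein spanned by the alignment and that a value equal to a cutoff counts as passing; the text above does not specify either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRule:
    name: str
    model_proteins: frozenset  # identifiers of the 'model' proteins
    min_identity: float        # percent identity cutoff
    min_coverage: float        # percent coverage cutoff


@dataclass(frozen=True)
class Alignment:
    model_id: str              # model protein the query aligned to
    percent_identity: float
    aligned_model_length: int  # residues of the model covered by the alignment
    model_length: int          # full length of the model protein


def is_hit(rule, aln):
    """True if the alignment is to a model protein and meets both cutoffs."""
    if aln.model_id not in rule.model_proteins:
        return False
    coverage = 100.0 * aln.aligned_model_length / aln.model_length
    return (aln.percent_identity >= rule.min_identity
            and coverage >= rule.min_coverage)


# Hypothetical rule and alignments, for illustration only.
rule = BlastRule("example rule", frozenset({"MODEL_1"}), 80.0, 90.0)
good = Alignment("MODEL_1", 92.5, 285, 300)  # 95% coverage
short = Alignment("MODEL_1", 92.5, 150, 300)  # 50% coverage
```

Here `is_hit(rule, good)` passes both cutoffs, whereas `is_hit(rule, short)` fails on coverage even though its identity is high, which shows why both thresholds are part of the rule.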

Revised 31 August 2023