How are orthologs calculated?

NCBI's Eukaryotic Genome Annotation Pipeline identifies ortholog gene groups for the NCBI Gene dataset using a combination of protein sequence similarity and local synteny information.

Orthology is determined between a genome being annotated and a reference genome (for example, human or zebrafish), and pairs of orthologs are tracked as groups. Transitive relationships are inferred within a group, for example medaka <-> zebrafish <-> human <-> mouse. Only genes in the NCBI Gene database are eligible for ortholog calculation. With a few exceptions, ortholog calculation is currently limited to vertebrates and arthropods.
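The grouping of pairwise calls into transitive groups can be illustrated with a small union-find sketch. This is not NCBI's implementation; the function name `group_orthologs` and the gene identifiers are hypothetical and chosen only to mirror the medaka <-> zebrafish <-> human <-> mouse example.

```python
from collections import defaultdict


def group_orthologs(pairs):
    """Merge pairwise ortholog calls into groups (hypothetical helper).

    Genes connected through any chain of pairs end up in the same group,
    which is one simple way to model transitive relationships.
    """
    parent = {}

    def find(x):
        # Register unseen genes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in groups.values())


# Illustrative identifiers only; these are not real Gene records.
pairs = [
    ("medaka:geneA", "zebrafish:geneA"),
    ("zebrafish:geneA", "human:GENEA"),
    ("mouse:GeneA", "human:GENEA"),
    ("zebrafish:geneB", "human:GENEB"),
]
groups = group_orthologs(pairs)
for members in groups:
    print(members)
```

In this toy input, medaka and mouse are never paired directly, yet they land in the same group because each is paired with a gene that links back to human.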

For each protein from the genome being annotated, the reference genome is searched for best and near-best matches based on protein sequence similarity. Candidates are further analyzed for nucleotide sequence similarity across all exons (including UTRs) plus an additional 2 kb of sequence on either side of the gene, and for microsynteny within the local genomic neighborhood (+/- 10 genes). Orthology relationships are assigned only when there is a clear 1:1 relationship, using the microsynteny information to help resolve closely related paralogs. Assignments may be reviewed by a RefSeq curator to further refine the set.
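As a rough illustration of how microsynteny can separate closely related paralogs, the sketch below scores each candidate by counting neighboring genes whose already-known orthologs lie near that candidate, and assigns an ortholog only when one candidate clearly wins. It is a simplified model, not the pipeline's actual scoring: the function names, the tie rule, and the toy gene orders are all assumptions, and the protein and nucleotide similarity steps are omitted.

```python
def neighborhood(order, gene, flank=10):
    """Genes within +/- `flank` positions of `gene` in a chromosome-ordered list."""
    i = order.index(gene)
    return set(order[max(0, i - flank):i] + order[i + 1:i + 1 + flank])


def synteny_score(query, candidate, query_order, ref_order, known_pairs, flank=10):
    """Count query neighbors whose known ortholog is a neighbor of the candidate."""
    ref_near = neighborhood(ref_order, candidate, flank)
    query_near = neighborhood(query_order, query, flank)
    return sum(1 for g in query_near if known_pairs.get(g) in ref_near)


def pick_ortholog(query, candidates, query_order, ref_order, known_pairs, flank=10):
    """Return the single best-supported candidate, or None if there is no clear winner."""
    scored = sorted(
        ((synteny_score(query, c, query_order, ref_order, known_pairs, flank), c)
         for c in candidates),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: no clear 1:1 relationship
    return scored[0][1]


# Toy data (hypothetical): rX1 and rX2 are paralogous candidates for qX.
query_order = ["q1", "q2", "qX", "q3", "q4"]
ref_order = ["r1", "r2", "rX1", "r3", "r4", "r9", "rX2", "r10"]
known_pairs = {"q1": "r1", "q2": "r2", "q3": "r3", "q4": "r4"}

# flank=2 keeps the toy neighborhoods small; the text above describes +/- 10 genes.
best = pick_ortholog("qX", ["rX1", "rX2"], query_order, ref_order, known_pairs, flank=2)
print(best)
```

With the small flank, rX1 shares four conserved neighbors with qX while rX2 shares only one, so rX1 is chosen. If both candidates scored equally, the function would return None rather than guess.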
