Genome Notes
Genome Notes displayed on individual Genome web pages are applied to genome assemblies based on analyses performed by NCBI to alert users that an assembly may not be suitable in particular cases. Most assemblies with warnings and other comments are excluded from the RefSeq collection. Atypical assemblies are shown as a warning at the top of genome pages while all others are provided in the Genome Notes section. Genomes in the atypical category in the list below can be excluded from the Genome Table by checking the “Exclude atypical genomes” checkbox.
Filters to remove such assemblies from Entrez Assembly search results are on the left-hand sidebar under the heading "Exclude" (filters for more reasons can be exposed using the "Customize …" menu). Filters can be applied for each individual reason. A filter to remove all atypical assemblies from the Entrez Assembly search results is on by default. This filter must be cleared for assemblies flagged as atypical to be returned. The "excluded from RefSeq" property available under the "Advanced" search menu can be used to filter out all the genome assemblies that were excluded from RefSeq.
Atypical Assemblies
- Chimeric — The genome assembly contains sequences from two different organisms that are joined together.
- Contaminated — The genome assembly contains sequences from other organisms, cloning vectors, linkers, adapters, or primers.
- Fragmented assembly — A prokaryotic assembly with contig L50 above 500, contig N50 below 5,000, or with more than 2,000 contigs is considered fragmented.
- Genome length too large — The total ungapped sequence length of the assembly is more than 4 standard deviations above the average for genomes of the same species if there are 10 or more genomes for the species, or more than 40% above the average if there are fewer than 10 genomes for the species, or more than 15 Mbp, or is otherwise suspiciously long.
- Genome length too small — The total ungapped sequence length of the assembly is more than 4 standard deviations below the average for genomes of the same species if there are 10 or more genomes for the species, or more than 40% below the average if there are fewer than 10 genomes for the species, or less than 300 Kbp, or is otherwise suspiciously short.
- Hybrid — Genome assembly sequences are from a hybrid between different species, strains, or isolates.
- Low quality sequence — Long stretches of the sequence have a high proportion of ambiguous bases, are low complexity, or provide some other indication that the sequence quality is low.
- Misassembled — Alignment to related genome assemblies or other evidence indicates the genome assembly is likely to contain errors.
- Partial — The genome assembly contains a sequence for only part of the DNA found in a typical cell, e.g., one chromosome out of twenty.
- Sequence duplications — The genome assembly contains one or more large duplications.
- Unverified source organism — Quality analysis demonstrates the taxonomic assignment of the genome assembly is incorrect.
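The numeric thresholds in the list above (fragmented assembly, genome length too large, genome length too small) can be sketched in Python as follows. This is an illustrative sketch only, not NCBI's implementation: the function and parameter names are hypothetical, the sample standard deviation is an assumption (the text does not say which estimator is used), and the "otherwise suspiciously long/short" clause is a judgment call that is not modeled.

```python
from statistics import mean, stdev


def is_fragmented(contig_l50, contig_n50, contig_count):
    """Prokaryotic fragmentation rule: any one condition is enough."""
    return contig_l50 > 500 or contig_n50 < 5_000 or contig_count > 2_000


def length_flag(length_bp, species_lengths_bp):
    """Return 'too large', 'too small', or None for an ungapped length in bp.

    species_lengths_bp holds ungapped lengths of other genomes of the same
    species (hypothetical input; how NCBI builds this set is not stated).
    """
    # Absolute limits apply regardless of species statistics.
    if length_bp > 15_000_000:  # more than 15 Mbp
        return "too large"
    if length_bp < 300_000:  # less than 300 Kbp
        return "too small"
    if not species_lengths_bp:
        return None  # nothing to compare against

    avg = mean(species_lengths_bp)
    if len(species_lengths_bp) >= 10:
        # 10 or more genomes: 4 standard deviations from the average.
        margin = 4 * stdev(species_lengths_bp)  # sample SD is an assumption
    else:
        # Fewer than 10 genomes: 40% of the average.
        margin = 0.40 * avg

    if length_bp > avg + margin:
        return "too large"
    if length_bp < avg - margin:
        return "too small"
    return None
```

For example, with five reference genomes of 4 Mbp each, the 40% rule applies, so a 6 Mbp assembly is flagged as too large while a 5 Mbp assembly is not.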
Organism Source Material
- Derived from metagenome — The genomic sequence was assembled from metagenomic sequencing rather than a pure culture. A small number of these genomes, estimated to be free of contaminants, for species with fewer than 50 non-MAGs, and with Taxonomy check status OK or Inconclusive with best match status 'below-threshold match' are included in RefSeq.
- Derived from single cell — The source material for the assembly was amplified from a single cell resulting in concerns about genome sequence accuracy.
- From large multi-isolate project — The assembly is one of over 100 assemblies for multiple isolates of the same species generated by the same project. Typically, these are pathogen surveillance projects.
- Genus undefined — The lineage does not include a genus and, therefore, the precise taxonomic placement is uncertain. An exception is made for symbionts.
- Metagenome — The assembly is derived from a sample consisting of a mixture of unidentified organisms.
- Missing strain identifier — The prokaryote assembly lacks both strain and isolate identifiers in the appropriate field. Exceptions are made for symbionts and phytoplasmas.
- Mixed culture — The genome assembly is derived from a co-culture of multiple organisms.
- Not used as type — The assembly is derived from a type specimen, but it does not meet the criteria for a type-strain assembly that can be used in ANI analysis.
RefSeq Annotation
- Abnormal gene to sequence ratio — The NCBI PGAP predicts too many or too few genes. The gene-to-sequence ratio is calculated as the number of genes of any type per kb of sequence. The typical range of the ratio is 0.8 to 1.2, and anything outside the 0.5 to 1.5 range is considered abnormal.
- Annotation fails completeness check — The percent completeness, as estimated by CheckM on the NCBI PGAP protein predictions, is more than three standard deviations below the average completeness for the species if there are more than 1000 genomes for the species, or is below 90% or three standard deviations below the average for the species, whichever is smaller, if there are 100-1000 genomes for the species.
- Annotation fails MAG completeness check — The percent completeness, as estimated on the NCBI PGAP protein predictions by CheckM, is below 90%. Applied to MAGs only.
- Low gene count — The number of predicted genes by NCBI PGAP is much lower than expected when compared with other high-quality genome assemblies for the same species.
- Many frameshifted proteins — The percentage of protein-coding genes with frameshifts, as determined by NCBI PGAP, is above 30%, or, if the genome is for a species with more than 10 genomes, is more than three standard deviations from the species average or 5% of annotated genes, whichever is larger.
- Missing rRNA genes — The NCBI PGAP failed to find at least one copy each of the 5S, 16S, and 23S rRNA genes. Applied to complete genomes only.
- Missing tRNA genes — The NCBI PGAP failed to find tRNA genes with anticodons for two or more of the expected 20 amino acids. Applied to complete genomes only.
- RefSeq annotation failed — The annotated genome assembly does not meet RefSeq standards for reasons other than those listed in this article, typically related to metadata errors or inconsistencies.
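The gene-to-sequence ratio and the two completeness checks in the list above can be sketched in Python as follows. This is an illustrative reading of the stated thresholds, not NCBI PGAP code: all function and parameter names are hypothetical, species averages and standard deviations are assumed to be supplied by the caller, and species with fewer than 100 genomes are treated as not assessed because the text gives no rule for them.

```python
def gene_ratio_is_abnormal(gene_count, sequence_length_bp):
    """Genes of any type per kb of sequence; abnormal outside 0.5 to 1.5."""
    ratio = gene_count / (sequence_length_bp / 1_000)
    return ratio < 0.5 or ratio > 1.5


def fails_completeness(completeness_pct, species_avg_pct, species_sd_pct,
                       species_genome_count):
    """Species-relative check on a CheckM completeness estimate (percent)."""
    cutoff_3sd = species_avg_pct - 3 * species_sd_pct
    if species_genome_count > 1000:
        # More than 1000 genomes: three standard deviations below average.
        return completeness_pct < cutoff_3sd
    if species_genome_count >= 100:
        # 100-1000 genomes: the smaller of 90% and the 3-SD cutoff.
        return completeness_pct < min(90.0, cutoff_3sd)
    # Fewer than 100 genomes: no rule is given in the text (assumption:
    # the check is simply not applied).
    return False


def fails_mag_completeness(completeness_pct):
    """MAG-only check: completeness below 90%."""
    return completeness_pct < 90.0
```

For example, an assembly of 4,000,000 bp with 4,000 annotated genes has a ratio of 1.0 genes per kb, which falls inside the typical 0.8 to 1.2 range and is not flagged.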