Genome Size Check

GenBank compares the size of a submitted genome assembly to the expected genome size range for the species to identify outliers that can result from errors such as:

incorrect organism assignment
metagenome submitted as an organism genome
targeted sub-genome assembly not flagged as partial genome representation
gross contamination with other sequences

The NCBI Genome Size Check API can be used to check the size of a genome assembly against the expected genome size range in advance of submission.

Expected Genome Size Range

NCBI calculates an expected genome size range for all species that have at least four genome assemblies in GenBank. The "genome size" is the ungapped length of the genome assembly, that is, gaps and runs of 10 or more Ns are ignored. The expected genome size for eukaryotes is the value for a haploid genome assembly. The rules used to calculate the expected genome size range for a species can be summarized as:
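The ungapped length described above can be approximated from a raw sequence string. The sketch below is illustrative only: the function name ungapped_length is hypothetical, and it assumes that gaps appear in the sequence as runs of N (upper or lower case). Gaps declared in other ways, for example in an AGP file, are not handled.

```python
import re

# Runs of 10 or more Ns are ignored; shorter runs still count as sequence.
# Assumption: lower-case "n" is treated the same as upper-case "N".
_N_RUN = re.compile(r"[Nn]{10,}")


def ungapped_length(sequence: str) -> int:
    """Return the length of `sequence` after dropping runs of 10 or more Ns."""
    return len(_N_RUN.sub("", sequence))


def assembly_ungapped_length(sequences) -> int:
    """Sum the ungapped lengths of all sequences in an assembly."""
    return sum(ungapped_length(seq) for seq in sequences)
```

For an assembly, pass the sequence of every contig or scaffold to assembly_ungapped_length and compare the total against the expected range.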

skip assemblies that are flagged as atypical for reasons related to size or identity of the organism (e.g. too large, partial, or contaminated, as described in Datasets), and any others whose genome size is radically different from the rest of the species, as determined by an interquartile range (IQR) method
if the species has at least four genome assemblies remaining
- calculate the median and standard deviation (std-dev) of the genome assembly sizes
- if 4x std-dev is between 20% and 50% of the median,
  - then the expected genome size range is: median - 4x std-dev to median + 4x std-dev
- if 4x std-dev is less than 20% of the median,
  - then the expected genome size range is: 80% to 120% of median
- if 4x std-dev is more than 50% of the median,
  - then the expected genome size range is: 50% to 150% of median
if the species has one of the few specially selected RefSeq reference genome assemblies, then regardless of the number of assemblies
- the expected genome size range is: 80% to 120% of the size of the reference genome
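The median and std-dev rules above can be sketched as follows. This is not NCBI's implementation: the function name and parameters are hypothetical, the input is assumed to be the assembly sizes that remain after atypical and IQR-outlier assemblies have been skipped, and the use of the sample standard deviation is an assumption because the text does not say which estimator is used. Hard-coded per-species limits are not modeled.

```python
from statistics import median, stdev


def expected_size_range(sizes, reference_size=None):
    """Return (low, high) for a species, or None if no range can be calculated.

    sizes: ungapped lengths of the assemblies left after filtering.
    reference_size: ungapped length of a selected RefSeq reference genome, if any.
    """
    # A selected reference genome applies regardless of the number of assemblies.
    if reference_size is not None:
        return (0.8 * reference_size, 1.2 * reference_size)

    # Otherwise at least four remaining assemblies are required.
    if len(sizes) < 4:
        return None

    med = median(sizes)
    spread = 4 * stdev(sizes)  # assumption: sample standard deviation

    if spread < 0.2 * med:
        return (0.8 * med, 1.2 * med)
    if spread > 0.5 * med:
        return (0.5 * med, 1.5 * med)
    return (med - spread, med + spread)
```

In effect the rules clamp the half-width of the range to between 20% and 50% of the median, so a very tight or very scattered set of public genomes cannot produce an unreasonably narrow or wide range.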

These rules rely on the public genomes because we do not know the expected genome size for every species submitted to GenBank. However, the rules are overridden by hard-coded limits for more than one hundred well-characterized species.

The expected genome sizes are calculated daily and reported in a file in the https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS directory of the NCBI genomes FTP site:

https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz
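A minimal sketch for loading that file is shown below. The layout of the file is not described here, so the code assumes a tab-delimited table whose first line is a header (optionally prefixed with "#") and takes the column names from that header instead of hard-coding them. The function names are hypothetical.

```python
import csv
import gzip
import io
import urllib.request

SIZE_TABLE_URL = (
    "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/"
    "species_genome_size.txt.gz"
)


def download_size_table(url: str = SIZE_TABLE_URL) -> bytes:
    """Fetch the gzip-compressed table and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_size_table(gz_bytes: bytes) -> list[dict[str, str]]:
    """Parse the table into one dict per row, keyed by the header names.

    Assumption: tab-delimited text with a single header line.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]
```

Calling read_size_table(download_size_table()) returns the rows as strings. Inspect the keys of the first row to see which columns the current file provides before converting values to integers.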

Accepted Sizes for Other Genomes

We fall back to using broader size ranges when an expected genome size is not available because the submitted assembly is from:

a species for which there are fewer than four non-atypical public genomes in GenBank
an unidentified species, e.g. Vibrio sp. X123
an unspecified organism, e.g. uncultured proteobacterium or most Candidatus organisms
a metagenome or unclassified organism

In such cases, the accepted genome size ranges are as follows:

organism is in the archaea superkingdom: 100,000 bp to 15,000,000 bp
organism is in the bacteria superkingdom: 100,000 bp to 15,000,000 bp
organism is in the eukaryota superkingdom: 100,000 bp to unlimited (100 Gbases in practice)
organism is in the viruses superkingdom: 100 bp to 15,000,000 bp
metagenome or unclassified organism: at least 200 bp (the minimum length for a Whole Genome Shotgun sequence in the International Nucleotide Sequence Database Collaboration)
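The fallback ranges above can be expressed as a small lookup table. In this sketch the group keys and the function name are hypothetical, bounds are assumed to be inclusive, and the eukaryote upper limit is left open (None) because the text gives it as unlimited, with about 100 Gbases in practice.

```python
# (minimum, maximum) ungapped length in bp; None means no upper limit.
FALLBACK_RANGES = {
    "archaea": (100_000, 15_000_000),
    "bacteria": (100_000, 15_000_000),
    "eukaryota": (100_000, None),  # unlimited; roughly 100 Gbases in practice
    "viruses": (100, 15_000_000),
    "metagenome_or_unclassified": (200, None),
}


def within_fallback_range(group: str, size: int) -> bool:
    """Return True if `size` falls inside the fallback range for `group`.

    Assumption: both bounds are inclusive.
    """
    low, high = FALLBACK_RANGES[group]
    return size >= low and (high is None or size <= high)
```

For example, within_fallback_range("bacteria", 20_000_000) is False because the size exceeds the 15,000,000 bp upper bound.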