Genome Size Check

GenBank compares the size of a submitted genome assembly to the expected genome size range for the species to identify outliers that can result from errors such as:

  • incorrect organism assignment
  • metagenome submitted as an organism genome
  • targeted sub-genome assembly not flagged as partial genome representation
  • gross contamination with other sequences

The NCBI Genome Size Check API can be used to check the size of a genome assembly against the expected genome size range in advance of submission.

Expected Genome Size Range

NCBI calculates an expected genome size range for all species that have at least four genome assemblies in GenBank. The "genome size" is the ungapped length of the genome assembly, i.e. the total length with gaps and runs of 10 or more Ns ignored. The expected genome size for eukaryotes is the value for a haploid genome assembly. The rules used to calculate the expected genome size range for a species can be summarized as:

  • skip assemblies that are flagged as atypical for reasons related to size or organism identity (e.g. too large, partial, or contaminated, as described in Datasets), and skip any others whose genome size is radically different from the rest of the species, as determined by an interquartile range (IQR) method
  • if the species has at least four genome assemblies remaining
    • calculate the median and standard-deviation (std-dev) of the genome assembly sizes
    • if 4 standard-deviations is between 20% and 50% of the median,
      • then the expected genome size range is: median - 4x std-dev to median + 4x std-dev
    • if 4 standard-deviations is less than 20% of the median,
      • then the expected genome size range is: 80% to 120% of median
    • if 4 standard-deviations is more than 50% of the median,
      • then the expected genome size range is: 50% to 150% of median
  • if the species has one of the few specially selected RefSeq reference genome assemblies, then regardless of the number of assemblies
    • the expected genome size range is: 80% to 120% of the size of the reference genome

Those rules use the public genomes because we do not know the expected genome size for every species submitted to GenBank. However, the rules are overridden by hard-coded limits for more than one hundred well-characterized species.
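
The calculation rules above can be sketched in Python as follows. This is an illustration only, not NCBI's implementation. The function name expected_size_range is hypothetical, the input list is assumed to already exclude atypical assemblies and IQR outliers, the sample standard deviation is assumed (the page does not say which form is used), boundary cases at exactly 20% and 50% are assumed to fall in the middle rule, and the hard-coded limits for well-characterized species are not modeled.

```python
import statistics


def expected_size_range(sizes, reference_size=None):
    """Return (minimum, maximum) for a species, or None if no range applies.

    sizes: ungapped assembly lengths remaining after atypical and
           IQR-outlier assemblies have been skipped (assumed done upstream).
    reference_size: ungapped length of a specially selected RefSeq
           reference genome, if the species has one.
    """
    # A reference genome takes priority regardless of assembly count.
    if reference_size is not None:
        return (0.8 * reference_size, 1.2 * reference_size)

    # At least four remaining assemblies are required.
    if len(sizes) < 4:
        return None

    median = statistics.median(sizes)
    # Assumption: sample standard deviation; the source does not specify.
    spread = 4 * statistics.stdev(sizes)

    if spread < 0.2 * median:
        return (0.8 * median, 1.2 * median)
    if spread > 0.5 * median:
        return (0.5 * median, 1.5 * median)
    return (median - spread, median + spread)
```

When the function returns None, the broader fallback ranges described under "Accepted sizes for other genomes" would apply instead.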

The expected genome sizes are calculated daily and reported in a file in the https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS directory of the NCBI genomes FTP site:

https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz
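
A minimal Python sketch for downloading the file and looking at its first lines is shown below. The helper names first_lines and download_size_file are hypothetical, and the sketch deliberately makes no assumption about the column layout, which this page does not describe.

```python
import gzip
import urllib.request

SIZE_FILE_URL = (
    "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/"
    "species_genome_size.txt.gz"
)


def first_lines(gz_bytes, count=5):
    """Decompress gzip-compressed bytes and return the first `count` lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    return text.splitlines()[:count]


def download_size_file(url=SIZE_FILE_URL, timeout=60):
    """Fetch the compressed file and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling first_lines(download_size_file()) fetches the current file and returns its leading lines, which is a quick way to inspect the format before writing a parser.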

Accepted sizes for other genomes

We fall back to using broader size ranges when an expected genome size is not available because the submitted assembly is from:

  • a species for which there are fewer than four non-atypical public genomes in GenBank
  • an unidentified species, e.g. Vibrio sp. X123
  • an unspecified organism, e.g. uncultured proteobacterium or most Candidatus
  • a metagenome or unclassified organism

In such cases, the accepted genome size ranges are as follows:

  • organism is in the archaea superkingdom: 100,000 bp to 15,000,000 bp
  • organism is in the bacteria superkingdom: 100,000 bp to 15,000,000 bp
  • organism is in the eukaryota superkingdom: 100,000 bp to unlimited (100 Gbases in practice)
  • organism is in the viruses superkingdom: 100 bp to 15,000,000 bp
  • metagenome or unclassified organism: at least 200 bp (the minimum length for a Whole Genome Shotgun sequence in the International Nucleotide Sequence Database Collaboration)
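
The fallback ranges above can be expressed as a small lookup table. The following Python sketch is illustrative only: the group keys, the function name fallback_status, and the treatment of the bounds as inclusive are assumptions, and the status strings are borrowed from the API's length_status values.

```python
# Fallback ranges in base pairs; None means no upper limit is stated.
FALLBACK_RANGES = {
    "archaea": (100_000, 15_000_000),
    "bacteria": (100_000, 15_000_000),
    "eukaryota": (100_000, None),  # unlimited (100 Gbases in practice)
    "viruses": (100, 15_000_000),
    "metagenome_or_unclassified": (200, None),
}


def fallback_status(group, ungapped_length):
    """Classify a length against the fallback range for `group`."""
    minimum, maximum = FALLBACK_RANGES[group]
    if ungapped_length < minimum:
        return "too_small"
    if maximum is not None and ungapped_length > maximum:
        return "too_large"
    return "within_range"
```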

Genome Size Check API

URL

https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size

Input parameters

  • species_taxid=<INTEGER>
    • a species level NCBI taxonomy ID, or a sub-species taxonomy ID that will be automatically mapped up to the species level
    • integer
  • length=<VALUE>
    • the size of the genome assembly in base pairs, ignoring gaps and N bases
    • either an integer number of base pairs or a length expressed with standard suffixes: K, M, G, KB, MB, GB, Kbp, Mbp, Gbp
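
To preview how a suffixed length might be interpreted before sending it, a client-side parser could look like the following Python sketch. It is not the server's parser. The name parse_length is hypothetical, and it assumes decimal multipliers (K = 10^3, M = 10^6, G = 10^9, consistent with 3.1Gbp being echoed as 3100000000 in the example response on this page), case-insensitive suffixes, and truncation of any fractional base pair.

```python
import re
from decimal import Decimal

_MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9}
# A number, optionally followed by K/M/G and an optional B or BP.
_LENGTH_PATTERN = re.compile(r"^([0-9]+(?:\.[0-9]+)?)(?:([KMG])(?:BP?)?)?$")


def parse_length(text):
    """Convert strings such as '6264404', '4.41M' or '3.1Gbp' to base pairs."""
    match = _LENGTH_PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized length: {text!r}")
    number, prefix = match.groups()
    # Decimal avoids binary floating-point rounding of values like 4.41.
    value = Decimal(number) * _MULTIPLIERS.get(prefix, 1)
    return int(value)  # assumption: fractional base pairs are truncated
```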

Examples

  1. expected_genome_size?species_taxid=287&length=6264404
  2. expected_genome_size?species_taxid=1773&length=4.41M
  3. expected_genome_size?species_taxid=9606&length=3.1Gbp
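
The example queries above can be assembled and sent from Python with the standard library alone. In this sketch the function names build_request_url and fetch_size_check are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size"


def build_request_url(species_taxid, length):
    """Return the full request URL for a taxonomy ID and assembly length."""
    query = urlencode({"species_taxid": species_taxid, "length": length})
    return f"{API_URL}?{query}"


def fetch_size_check(species_taxid, length, timeout=30):
    """Send the request and return the raw response body as bytes."""
    url = build_request_url(species_taxid, length)
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

For instance, build_request_url(1773, "4.41M") produces the URL for the second example.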

Results

Successful requests return XML, as in the following example.

<?xml version="1.0" encoding="ISO-8859-1"?>
<genome_size_response>
<input>
 <species_taxid>9606</species_taxid>
 <length>3100000000</length>
</input>
 <organism_name>Homo sapiens</organism_name>
 <species_taxid>9606</species_taxid>
 <size_source>species</size_source>
 <genome_count>642</genome_count>
 <expected_ungapped_length>3000000000</expected_ungapped_length>
 <minimum_ungapped_length>2700000000</minimum_ungapped_length>
 <maximum_ungapped_length>3400000000</maximum_ungapped_length>
 <length_status>within_range</length_status>
</genome_size_response>

The two fields under input are the data that was entered.

  • species_taxid is unchanged.
  • length is the parsed value. This can be helpful if the input length had a suffix that is not supported.

The other fields are:

  • species_taxid - the taxonomy ID that was used after mapping the input taxonomy ID to species level
  • size_source - indicates the source used to obtain the genome size range; values that appear on this page include species and insdc-seq-min
  • genome_count - the number of genome assemblies used to calculate the expected size range
  • expected_ungapped_length - the median genome assembly size, only present when size_source is species
  • minimum_ungapped_length - the lower bound of the accepted genome size range
  • maximum_ungapped_length - the upper bound of the accepted genome size range, omitted when size_source is insdc-seq-min
  • length_status - the evaluation of the input length; one of:
    • within_range
    • too_small
    • too_large
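
The response fields described above can be read with Python's standard XML parser. The sketch below parses the example response shown on this page. The function name parse_size_response and the dictionary keys are hypothetical, and fields that may be absent (such as maximum_ungapped_length) are returned as None.

```python
import xml.etree.ElementTree as ET

EXAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<genome_size_response>
<input>
 <species_taxid>9606</species_taxid>
 <length>3100000000</length>
</input>
 <organism_name>Homo sapiens</organism_name>
 <species_taxid>9606</species_taxid>
 <size_source>species</size_source>
 <genome_count>642</genome_count>
 <expected_ungapped_length>3000000000</expected_ungapped_length>
 <minimum_ungapped_length>2700000000</minimum_ungapped_length>
 <maximum_ungapped_length>3400000000</maximum_ungapped_length>
 <length_status>within_range</length_status>
</genome_size_response>
"""


def _optional_int(root, path):
    """Return the element text at `path` as an int, or None if absent."""
    text = root.findtext(path)
    return int(text) if text is not None else None


def parse_size_response(xml_bytes):
    """Extract the main fields of a genome_size_response document."""
    root = ET.fromstring(xml_bytes)
    return {
        "input_length": _optional_int(root, "input/length"),
        "species_taxid": _optional_int(root, "species_taxid"),
        "organism_name": root.findtext("organism_name"),
        "size_source": root.findtext("size_source"),
        "expected": _optional_int(root, "expected_ungapped_length"),
        "minimum": _optional_int(root, "minimum_ungapped_length"),
        "maximum": _optional_int(root, "maximum_ungapped_length"),
        "length_status": root.findtext("length_status"),
    }
```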

Errors

Two errors return HTTP status 400 "Bad Request" with a bare string rather than XML:

  • "Given taxid XXXXXX is not a known taxid."
  • "Given taxid XXXXX is above species."
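
Because these errors arrive as plain text rather than XML, a client may want to recognize them before attempting to parse a response. The Python sketch below matches the two messages listed above. The function name classify_error and the returned labels are hypothetical, and whether the body includes the surrounding quotation marks is not stated on this page, so both forms are accepted.

```python
import re

_UNKNOWN_TAXID = re.compile(r"^Given taxid (\S+) is not a known taxid\.$")
_ABOVE_SPECIES = re.compile(r"^Given taxid (\S+) is above species\.$")


def classify_error(body):
    """Return (label, taxid) for a known error string, else ("other", None)."""
    # Assumption: the body may or may not be wrapped in double quotes.
    text = body.strip().strip('"')
    match = _UNKNOWN_TAXID.match(text)
    if match:
        return ("unknown_taxid", match.group(1))
    match = _ABOVE_SPECIES.match(text)
    if match:
        return ("above_species", match.group(1))
    return ("other", None)
```

With urllib, a 400 response raises urllib.error.HTTPError, and the error body can be obtained by calling read() on the exception and decoding it before passing it to classify_error.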

Last updated: 2024-01-29T18:44:29Z