Glossary

Glossary of terms used in Datasets

Alternate locus—A sequence that represents a different version of a particular locus present in the haploid assembly (also known as the primary assembly). It provides an alternative representation of the genetic information at that location in the genome.

Alternate locus group—A set of alternate loci grouped together for annotation purposes. This may be because they are from the same haplotype or strain or for annotation convenience.

Assembly—The set of chromosomes, unlocalized and unplaced (sometimes called “random”) sequences, and alternate sequences used to represent an organism’s genome. The NCBI data model defines assemblies as comprising one or more assembly units.

Assembly anomaly—See atypical genomes (below).

Assembly level—The highest assembly level for any object in the assembly. The values are as follows:

  • Chromosome—There is a sequence for one or more chromosomes. This may be a completely sequenced chromosome without gaps or a chromosome containing scaffolds or contigs with gaps between them. There may also be unplaced or unlocalized scaffolds.

  • Complete genome—All chromosomes are gapless and contain only runs of nine or fewer ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e., the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly, but if they are present, their sequences are gapless.

  • Contig—Nothing is assembled beyond the level of sequence contigs.

  • Scaffold—Some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.
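The four levels above can be read as a decision sequence, from the most to the least assembled state. The following Python sketch is illustrative only: `AssemblySummary`, its fields, and `assembly_level` are hypothetical names rather than part of any NCBI tool, and the sketch assumes the caller already knows these facts about the assembly. It does not model the additional requirement that any plasmids or organelles present be gapless.

```python
from dataclasses import dataclass

MAX_N_RUN = 9  # "Complete genome" allows runs of at most nine ambiguous bases


@dataclass
class AssemblySummary:
    """Hypothetical summary of an assembly; field names are illustrative."""
    has_chromosome_sequence: bool      # a sequence exists for one or more chromosomes
    has_scaffolds: bool                # some contigs are connected across gaps
    chromosomes_gapless: bool          # every chromosome is free of gaps
    longest_n_run: int                 # longest run of ambiguous bases (Ns)
    has_unplaced_or_unlocalized: bool  # any unplaced or unlocalized scaffolds
    all_expected_chromosomes: bool     # not noted as partial genome representation


def assembly_level(a: AssemblySummary) -> str:
    """Return the highest assembly level that the summary supports."""
    if a.has_chromosome_sequence:
        complete = (
            a.chromosomes_gapless
            and a.longest_n_run <= MAX_N_RUN
            and not a.has_unplaced_or_unlocalized
            and a.all_expected_chromosomes
        )
        return "Complete genome" if complete else "Chromosome"
    if a.has_scaffolds:
        return "Scaffold"
    return "Contig"
```

For example, an assembly with chromosome sequences that still contain gaps is reported as "Chromosome", while one with only gap-connected contigs and no chromosome assignment is reported as "Scaffold".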

Assembly name—The submitter’s name for the assembly when one is provided; otherwise, a default name assigned by NCBI.

Assembly type:

  • Diploid assembly—A genome assembly for which a chromosome assembly is available for both sets of an individual’s chromosomes. A diploid genome assembly is expected to represent the genome of an individual. Therefore, alternate loci are not expected to be defined for this assembly, though it is possible that unlocalized or unplaced sequences may be part of the assembly.

  • Haploid assembly (default assembly type)—The collection of chromosome assemblies, unlocalized and unplaced sequences representing an organism’s genome. Any locus may be represented zero or one time, and entire chromosomes are only represented zero or one time.

  • Haploid-with-alt-loci—The collection of chromosome assemblies, unlocalized and unplaced sequences, and alternate loci representing an organism’s genome. Any locus may be represented zero, one, or more than one time, but entire chromosomes are only represented zero or one time.

  • Linked pseudohaplotype assemblies—A genome assembly from a diploid in which many of the haplotypic sequences have been resolved and phased, and the two haplotypes have been separated. With the current state of technology, most assemblers produce phased blocks separated by blocks where the haplotypes cannot be distinguished. The typical result is that the “principal pseudohaplotype” assembly is a mosaic of haplotype blocks linked by unresolved segments, and the “alternate pseudohaplotype” assembly is the other haplotype wherever the haplotypes can be distinguished. A pair of pseudohaplotype assemblies derived from the same diploid individual can be linked with a cross-reference.

  • Unresolved diploid—A genome assembly from a diploid in which many haplotypic sequences have been resolved, but the two haplotypes have not been separated. Consequently, the assembly will be much larger than the expected haploid genome size, and two copies of many genes will be present.

Atypical genomes—Genomes with one or more problems identified by NCBI relating to quality, unusual size, or other flaws in the genome assembly.

Average Nucleotide Identity (ANI)—A measure of genomic similarity at the nucleotide level between two genomes. NCBI uses ANI to evaluate the taxonomic identity of genome assemblies submitted to GenBank (Konstantinidis and Tiedje 2005; Ciufo et al. 2018).

Full Genome—The data used to generate the assembly was obtained from the whole genome, e.g., Whole Genome Shotgun (WGS) assemblies. The assembly may still contain gaps.

GenBank assembly accession—The accession and version for the GenBank assembly (“accession.version”).

Genome patches—Sequence updates released outside the major assembly cycle. These are instantiated as independent scaffolds aligned to the primary assembly to provide chromosome context. There are two types of patches:

  • Fix patches—These patches are made in a region where the Tiling Path File (TPF) changes; the patch scaffolds are withdrawn at the next major assembly update. The accessions will be made secondary to the chromosome, and the sequence will be incorporated into the primary assembly TPF.

  • Novel patches—These represent new alternate loci. These sequences will be moved to the appropriate assembly unit at the next major assembly update, and the accession will remain stable.

Linked assembly—The “accession.version” and designation (principal or alternate pseudohaplotype) of a paired genome assembly derived from the same diploid individual (see “Assembly type” definitions above).

Modifier—Infraspecific or subspecies name or description.

Partial Genome—The data used to generate the assembly came from only part of the genome. The reasons genome representation is set to partial include:

  • The assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome.

  • The chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes, and the small sex chromosome (Y for mammals, W for birds).

  • The genome coverage in a WGS assembly is less than one.

  • The ungapped sequence length of the assembly is less than half the average for other assemblies from the same species.
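The criteria above can be expressed as a simple checklist. The sketch below is a hypothetical helper, not NCBI's implementation: the function name and all parameter names are assumptions. The chromosome counts are expected to already exclude plasmids, organelle chromosomes, and the small sex chromosome, as described above. Because the glossary says the reasons "include" these criteria, an empty result does not prove that an assembly has full genome representation.

```python
from typing import List, Optional


def partial_genome_reasons(
    targeted_to_subset: bool,
    chromosome_count: int,
    expected_chromosome_count: int,
    wgs_coverage: Optional[float],
    ungapped_length: int,
    species_mean_ungapped_length: Optional[float],
) -> List[str]:
    """Collect the listed reasons an assembly would be flagged as partial."""
    reasons = []
    if targeted_to_subset:
        reasons.append("targeted to a single chromosome or a subset of the genome")
    if chromosome_count < expected_chromosome_count:
        reasons.append("fewer chromosomes than the expected complement")
    # Coverage is only meaningful for WGS assemblies; None means "not applicable".
    if wgs_coverage is not None and wgs_coverage < 1:
        reasons.append("WGS genome coverage is less than one")
    # None means no other assemblies from the species are available to compare.
    if (
        species_mean_ungapped_length is not None
        and ungapped_length < 0.5 * species_mean_ungapped_length
    ):
        reasons.append("ungapped length is less than half the species average")
    return reasons
```

Returning a list of reasons rather than a single boolean keeps the result explainable when more than one criterion applies.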

RefSeq assembly accession—The accession and version for the RefSeq version of the assembly (“accession.version”).

Note: this is not always present because only certain assemblies are selected for RefSeq.
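Both assembly accession fields use the “accession.version” form, so a small parser can split the two parts. The sketch below is illustrative: `parse_assembly_accession` and `AssemblyAccession` are hypothetical names, and the pattern relies on an assumption not stated in this glossary, namely that GenBank assembly accessions start with `GCA_` and RefSeq assembly accessions start with `GCF_`.

```python
import re
from typing import NamedTuple

# Assumption: "GCA_" marks GenBank and "GCF_" marks RefSeq assembly accessions.
_ACCESSION_VERSION = re.compile(r"^(GC[AF]_\d+)\.(\d+)$")


class AssemblyAccession(NamedTuple):
    accession: str  # part before the dot, e.g. "GCA_000001405"
    version: int    # part after the dot
    source: str     # "GenBank" or "RefSeq", inferred from the prefix


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an "accession.version" string into its parts."""
    match = _ACCESSION_VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession.version: {text!r}")
    accession = match.group(1)
    version = int(match.group(2))
    source = "GenBank" if accession.startswith("GCA_") else "RefSeq"
    return AssemblyAccession(accession, version, source)
```

A string without a version suffix is rejected rather than defaulted, since the glossary defines these fields as always carrying both the accession and the version.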

RefSeq category—NCBI classification of a reference or representative genome:

  • Reference genome—A manually selected high-quality genome assembly that NCBI and the community have identified as important as a standard against which other data are compared.

  • Representative genome—A genome computationally or manually selected as a representative from among the best genomes available for a species or clade that does not have a designated reference genome.

  • Taxonomic considerations:

    • Eukaryotes have no more than one reference or representative genome per species. If there are no assemblies in RefSeq for a particular eukaryotic species, the best available GenBank assembly for that species is selected and designated the representative genome.

    • Prokaryotes may have more than one reference or representative genome per species. For more information, see the Prokaryotic RefSeq Genomes page.

    • Viruses may have one or more reference genomes per species. The representative genome designation is not applied to viruses and viroids.

Regions—Locations on the primary assembly (typically on the chromosome sequences) for which alternate representations or genome patches exist.

Relation to type material—Shown if the sequences in the genome assembly were derived from type material, synonym type material, or other type material (for more information, see What is type material? and Federhen 2015):

  • Assembly designated as neotype—The sequences in the genome assembly were derived from neotype material.

  • Assembly designated as reftype—The sequences in the genome assembly were derived from reftype material.

  • Assembly from pathotype material—The sequences in the genome assembly were derived from pathovar type material.

  • Assembly from synonym type material—The sequences in the genome assembly were derived from synonym type material.

  • Assembly from type material—The sequences in the genome assembly were derived from type material.

  • ICTV additional isolate—The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as an additional isolate for the virus species.

  • ICTV species exemplar—The International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species.

Release type—Indicates whether this version of the genome assembly is a major, minor, or patch release:

  • Major—Changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases.

  • Minor—Changes from the previous assembly version are limited to the following changes, none of which result in a significant change to the coordinate system of the primary assembly unit:

    • Adding, removing, or changing a non-nuclear assembly unit.

    • Dropping unplaced or unlocalized scaffolds.

    • Adding up to 50 unplaced or unlocalized scaffolds shorter than the current scaffold-N50 value.

    • Replacing a component with a gap of the same length.

  • Patch—The only change from the previous assembly version is the addition or modification of a patch assembly unit (relevant for assemblies maintained by the Genome Reference Consortium).
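The release-type rules above can be sketched as a small classifier. This is a hypothetical illustration, not NCBI's procedure: `AssemblyChanges`, its fields, and `release_type` are invented names. The sketch also makes one assumption beyond the glossary text, which is that adding scaffolds outside the stated limits (more than 50, or any at least as long as the scaffold-N50) makes the release major.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_ADDED_SCAFFOLDS = 50  # limit stated for a minor release


@dataclass
class AssemblyChanges:
    """Hypothetical description of changes since the previous version."""
    is_first_version: bool = False
    only_patch_unit_changed: bool = False
    changes_primary_coordinates: bool = False
    added_scaffold_lengths: Tuple[int, ...] = ()  # unplaced/unlocalized scaffolds added
    scaffold_n50: int = 0                         # current scaffold-N50 value


def release_type(c: AssemblyChanges) -> str:
    """Classify a release as "Major", "Minor", or "Patch"."""
    if c.is_first_version:
        return "Major"  # the first version is always a major release
    if c.only_patch_unit_changed:
        return "Patch"
    if c.changes_primary_coordinates:
        return "Major"
    too_many = len(c.added_scaffold_lengths) > MAX_ADDED_SCAFFOLDS
    too_long = any(n >= c.scaffold_n50 for n in c.added_scaffold_lengths)
    if too_many or too_long:
        return "Major"  # assumption: outside the minor limits means major
    return "Minor"
```

The checks run from the most specific rule to the least, so a first version is reported as major even if its only content is a patch unit.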

Status—The current status for the GenBank and/or RefSeq assembly accession.version is shown. The possible values are “latest,” “replaced,” or “suppressed.”

Unlocalized sequence—A sequence found in an assembly associated with a specific chromosome that cannot be ordered or oriented on that chromosome. The location of these sequences cannot be expressed in chromosome coordinates.

Unplaced sequence—A sequence found in an assembly not associated with any chromosome. These sequences cannot be expressed in chromosome coordinates.
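The difference between unlocalized and unplaced sequences comes down to two questions: is the sequence associated with a chromosome, and can it be ordered and oriented on it? The sketch below is a hypothetical helper; the function name, its parameters, and the label "placed" for fully positioned sequences are assumptions made for illustration.

```python
from typing import Optional


def sequence_placement(chromosome: Optional[str], ordered_and_oriented: bool) -> str:
    """Label a sequence by how much is known about its chromosomal position."""
    if chromosome is None:
        # Not associated with any chromosome.
        return "unplaced"
    if not ordered_and_oriented:
        # Chromosome is known, but not the order or orientation on it.
        return "unlocalized"
    # "placed" is an illustrative label, not a glossary term.
    return "placed"
```

Only sequences in the last category can be expressed in chromosome coordinates.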

Generated April 19, 2024