This page was last updated in July, 2024
NCBI Virus Help Documentation
Welcome to the NCBI Virus Help Documentation. Here you will find information on how to use the NCBI Virus resource to search, view, analyze and download viral sequence data, as well as background on our data model.
Table of Contents
- Introduction to NCBI Virus
- Recent Changes to NCBI Virus Interface
- Data Access
- Exploring Results
- Refining Results via Filters
- Downloading Data
- Using Visual Dashboards
- Accessing SARS-CoV-2 Data
- Citing NCBI Virus
- Submitting Data
- FAQs
- Glossary
Introduction to NCBI Virus
What is NCBI Virus?
NCBI Virus is an integrative, value-added resource designed to support retrieval, display and analysis of a curated collection of virus sequences and large sequence datasets. Our goal is to increase the usability of viral sequence data archived in GenBank and other NCBI repositories.
Key features of NCBI Virus include:
- Finding sequences of interest using metadata-based filtering
- Creating custom data reports
- Exporting data in various formats for use outside NCBI Virus
- Accessing high-quality reference sequences with standardized metadata
NCBI Virus is a community resource, and we welcome your feedback! Please use the Feedback button on the site to send comments and suggestions. Alternatively, you can contact us using this contact form.
Click on "Feedback Button" and tell us what you think!
Data Model, Type of Data and Dataflow
NCBI Virus uses manual and machine curation to validate viral sequence data from the International Nucleotide Sequence Database Collaboration (INSDC) and normalize sequence and sample attributes (metadata). This data is then made available through a custom search interface that supports selection of data based on a variety of properties.
Currently, the NCBI Virus database includes data from the following sequence groups:
GenBank records
GenBank records - all records submitted to GenBank and other INSDC databases, including sequences submitted through the Sequence Read Archive (SRA). Nucleotide and protein sequence accessions can be found on the Nucleotide and Protein tabs of NCBI Virus. If sequences were submitted through SRA, the corresponding SRA accessions are also provided.
Note: In September 2023, we removed Protein Data Bank (PDB) nucleotide records from NCBI Virus search results. PDB records are typically very short and accompany three-dimensional protein structures that are available in the NCBI Structure database. PDB records are still searchable through other resources such as RCSB PDB and the NCBI Nucleotide database.
RefSeq records
RefSeq records – reference sequence records from one or more complete genome sequences for each viral species. Where available, RefSeqs are created based on "exemplar" isolates for each recognized species identified by the International Committee on Taxonomy of Viruses (ICTV). The majority of RefSeqs are complete genomes, but because some ICTV "exemplars" are not complete sequences, there are also incomplete RefSeqs in the database. A separate RefSeq record is created for each segment in segmented viral genomes (read more about virus segmented groups below).
Accession numbers unique to RefSeq records are assigned to the nucleotide (NC_XXXXXX) and protein (NP_XXXXXX, YP_XXXXXX, or YP_XXXXXXXXX) sequences. Nucleotide RefSeqs can be found on the Nucleotide tab of the NCBI Virus Results table, and protein RefSeqs can be viewed on the Protein tab of the NCBI Virus Results table.
RefSeq sequences are specially labeled and, by default, appear in the top rows of the Results Table.
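As a rough illustration of the accession shapes listed above, the following Python sketch sorts an accession string into RefSeq nucleotide, RefSeq protein, or other. The function name `classify_accession` and the optional version suffix (for example ".2") are assumptions made for this example. Real RefSeq accessions may use prefixes or digit counts that are not described here.

```python
import re

# Shapes taken from the text above: NC_XXXXXX for nucleotide records and
# NP_XXXXXX, YP_XXXXXX or YP_XXXXXXXXX for protein records.
# The optional ".<version>" suffix is an assumption for this sketch.
REFSEQ_NUCLEOTIDE = re.compile(r"^NC_\d{6}(?:\.\d+)?$")
REFSEQ_PROTEIN = re.compile(r"^(?:NP_\d{6}|YP_\d{6}|YP_\d{9})(?:\.\d+)?$")


def classify_accession(accession: str) -> str:
    """Return 'refseq_nucleotide', 'refseq_protein' or 'other'."""
    acc = accession.strip()
    if REFSEQ_NUCLEOTIDE.match(acc):
        return "refseq_nucleotide"
    if REFSEQ_PROTEIN.match(acc):
        return "refseq_protein"
    return "other"
```

Anything that does not match these patterns, such as a GenBank accession, falls into the "other" bucket.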
Assembly records
NCBI Virus provides access to two types of genome assembly records: RefSeq genome assemblies and genome assemblies for segmented viruses. Both types of assembly records can be found in the NCBI Virus Assembly tab of the NCBI Virus Results Table.
RefSeq genome assemblies: These assemblies represent complete or nearly complete genomes for viruses with RefSeq records. They include the genomic sequence, annotations, and metadata. Users can explore the connections between nucleotide and protein records within a RefSeq genome assembly. For RefSeq genome assemblies, users can access additional information, such as the GC content, assembly method, and sequencing technology, by navigating to the corresponding Genome Assembly Datasets page from the NCBI Virus Assembly Details panel. On the Datasets page, users can download the assembly in various formats, including annotation features in GTF, GFF, and GBFF formats, as well as the entire package as a ZIP file. The Datasets page also provides options for programmatic access to the RefSeq genome assemblies through the command line. Users can also access RefSeq genome assemblies through the NCBI FTP site.
Segmented virus genome assemblies: For viruses with segmented genomes, NCBI Virus groups together nucleotide sequence genome segments, derived from the same biological sample, based on identical user-submitted metadata fields. These grouped segments form a genome assembly for the segmented virus genome. These segmented virus genome groups (GCA accessions) are represented by assembly records with a Grouping method text "NCBI Virus Segmented Genome Grouping Pipeline" and can be accessed via the NCBI Virus Assembly tab or via Datasets. Read more about virus segmented groups below.
Virus Segment Grouping
Virus Segment Grouping refers to a custom process developed by NCBI Virus for segmented genomes, where nucleotide sequence genome segments that have been derived from the same biological sample are grouped together based on identical user-submitted metadata fields. Groups are then represented by assembly records (GCA accessions) with a Grouping method text "NCBI Virus Segmented Genome Grouping Pipeline". These assemblies can be accessed via the NCBI Virus Assembly tab of NCBI Virus or via Datasets.
The Virus Segment Grouping process runs periodically on all submitted nucleotide sequences. Results become available as soon as they are properly recorded and indexed. Submitters can support the creation of accurate groups by ensuring that the following metadata fields match exactly among all the segments submitted from the same sample: Species, Collection Date, Collection Location, Isolate name, and Host name.
Currently, the grouping is restricted only to Alphainfluenzavirus influenzae, Betainfluenzavirus influenzae, and Gammainfluenzavirus influenzae. Only complete genome groupings are reported, with 8 segments required for Alpha or Beta, and 7 segments for Gamma. Completeness status of the individual nucleotide segments and segment names are provided by the submitter. These groups can be found by Grouping method of "NCBI Virus Segmented Genome Grouping Pipeline VG-AUTO-v1.0".
NCBI plans to expand the scope to gradually include other segmented viruses and also allow for partial or extended groupings (for mixed infections).
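The grouping rule described above can be pictured with a small Python sketch. It is not the NCBI pipeline: the record layout (plain dictionaries), the field keys, and the function name `group_segments` are assumptions chosen for illustration. Only the five matching fields and the required segment counts come from the text.

```python
from collections import defaultdict

# Required segment counts per species, as stated above.
REQUIRED_SEGMENTS = {
    "Alphainfluenzavirus influenzae": 8,
    "Betainfluenzavirus influenzae": 8,
    "Gammainfluenzavirus influenzae": 7,
}

# Metadata fields that must match exactly (dictionary keys are hypothetical).
GROUPING_FIELDS = (
    "species", "collection_date", "collection_location", "isolate", "host",
)


def group_segments(records):
    """Group segment records by identical metadata; keep complete groups only."""
    groups = defaultdict(list)
    for rec in records:
        if rec["species"] not in REQUIRED_SEGMENTS:
            continue  # grouping is currently restricted to influenza species
        key = tuple(rec[field] for field in GROUPING_FIELDS)
        groups[key].append(rec)

    complete = {}
    for key, members in groups.items():
        distinct_segments = {m["segment"] for m in members}
        expected = REQUIRED_SEGMENTS[key[0]]
        # A group is reported only if it has exactly the expected number
        # of distinct segments.
        if len(members) == expected and len(distinct_segments) == expected:
            complete[key] = sorted(m["accession"] for m in members)
    return complete
```

Because the key is built from exact string values, a single mismatch such as a differently spelled isolate name would place a segment in a different group, which is why matching metadata matters for submitters.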
Metadata Parsing
NCBI Virus relies on the information provided by sequence submitters to NCBI GenBank (including SRA) and other INSDC databases. The metadata associated with each sequence record is parsed and standardized to facilitate efficient searching and filtering of the data. The following metadata fields are parsed:
- Nucleotide / Genome Completeness
- Sample Collection date
- Sequence Release date to GenBank and other INSDC databases
- Isolation Host
- Tissue, specimen or other isolation source, a part of the host organism, where the sample was obtained
- Provirus, integrated into the genome of another host organism, as indicated by the sequence submitter
- Environmental Source
- Lab passaged and cultured viruses info
- Vaccine Strain
- SARS-CoV-2 Pango lineage
- SARS-CoV-2 Surveillance Sampling info
- Isolate
- Protein names
- Geographic Region
- Submitters' info
- Genome Molecule Type
- Vaccine Strain info
- Segment names
For more information on how these metadata fields are parsed and standardized, please refer to the Refining Results via Filters and Glossary sections.
Taxonomy Validation
NCBI Virus relies on the NCBI Taxonomy group to validate and standardize the taxonomic information associated with virus sequences. The Taxonomy group follows the International Committee on Taxonomy of Viruses (ICTV) classification system, with some nuances:
- Sequences are placed in the corresponding unclassified bin for every Virus Metadata Resource (VMR) taxon name above species, based on submitter-provided classification.
- Phages are classified based on BLAST results, while RNA viruses without classification are placed in the unclassified Riboviria.
- If submitters claim new taxon names during the submission process, the sequences are placed in the unclassified bin for the next ICTV taxon one level up from the claimed new taxon.
- The Taxonomy group focuses on placing ICTV accessions at the correct species level or under a virus name within the right species. However, not all accessions can be updated due to various issues, such as overly broad or non-unique virus names.
- The timing of taxonomy updates in NCBI Virus may be delayed or compromised due to the manual nature of the process and the workload of the Taxonomy group. While efforts are made to provide accurate and up-to-date taxonomic information, there may be discrepancies between the current VMR and the taxonomy displayed in NCBI Virus.
Recent Changes to NCBI Virus Interface
In July 2024, we made several updates to improve your experience with the NCBI Virus database interface. Here we outline the key changes and how they may affect your workflow.
Summary of Changes
- Introduction of Genome Groups for Influenza Viruses
- New NCBI Virus Assembly Tab
- Filter Updates and Renaming
- Column Additions and Renaming
- Improved Filter Descriptions
Detailed Changes
Genome Groups for Influenza Viruses
- NCBI Virus has developed a process to group segments into genomes from the same sample based on matching metadata fields for species, isolate name, host organism, collection date & location.
- Newly released GenBank records are processed daily using an automated process.
- Currently, genome groups are only being built for records under the species Influenza A, B, and C, and only if the expected number of segments is identified.
- Future plans include expanding to other segmented viruses and partial genomes. Please reach out if you would use genome groups for a particular virus species.
- Access genomes at: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome
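The link above carries its search setting inside the URL fragment (the part after "#"). The following Python sketch uses only the standard library to pull that setting apart. The helper name `fragment_filters` is an assumption for this example, and no parameter other than `SeqType_s=Genome`, which appears in the link itself, is implied to exist.

```python
from urllib.parse import urlsplit, parse_qs

GENOME_URL = "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome"


def fragment_filters(url):
    """Split the URL fragment into its route and its key/value settings."""
    fragment = urlsplit(url).fragment          # text after '#'
    route, _, query = fragment.partition("?")  # route before '?', settings after
    return route, parse_qs(query)


route, filters = fragment_filters(GENOME_URL)
```

Here `route` is the in-page path and `filters` maps each setting name to a list of values.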
NCBI Virus Assembly Tab
- The "RefSeq Genome" tab above the Results Table has been replaced with the "NCBI Virus Assembly" tab.
- This new tab provides access to genome groups stored in the Assembly database.
- More information about the functionality of this tab can be found in the "NCBI Virus Assembly Tab" section of this help document.
Filter Updates
Old Filter Name | New Filter Name |
---|---|
Virus | Virus/Taxonomy |
Sequence Type | GenBank/RefSeq |
RefSeq Genome Completeness | Assembly Completeness |
Random Sampling | Surveillance Sampling |
Isolate | Isolate/Strain Name |
Proteins | Has Protein |
Provirus | Provirus/Integrated |
Isolation Source | Tissue/Specimen/Source |
Lab Host | Lab Passaged |
(New Filter) | Segment |
(New Filter) | Genotype |
Note: While some filters have been renamed, their functionality remains the same. Read more about filter functionality in the “Refining Results via Filters” section of this help document.
Improved Filter Descriptions
When you open a filter by clicking on it, you'll now find improved descriptions for definitions of the filters and how to use them.
Column Updates
Old Name | New Name |
---|---|
Sequence Type | GenBank/RefSeq |
Isolation Source | Tissue/Specimen/Source |
(New Column) | Assembly |
(New Column) | Segment |
Find more about columns in the “Results Table Columns” section of this help document.
Additional Interface Updates
- A new Segment filter has been added (works best after first selecting a virus group using the Virus/Taxonomy filter).
- A new Genotype filter has been added.
- A new Accession column was added to the table displayed on the Nucleotide and Protein tabs to ensure connection between different types of records.
- Records can be downloaded both via NCBI Virus and in bulk using the Datasets command-line API. More information on how to download virus sequence information can be found in the "Downloading Data" section of this document.
We value your feedback and are continually working to improve NCBI Virus. If you have any questions or suggestions, please don't hesitate to contact us.
Data Access
NCBI Virus provides several ways to access viral sequence data:
- Search by sequence
- Search by virus name or taxonomy
- Access preconfigured sets of data
Search by Sequence
Use the BLAST™ tool to find virus nucleotide or protein sequences similar to your query sequence.
- Click “Search by Sequence” on the NCBI Virus page or select it from the “Find Data” menu.
- Paste your sequence in the FASTA format into the text box or upload a FASTA file.
- Specify whether it is a nucleotide or protein sequence.
- Click “Start Search” to run BLAST™ against NCBI Virus sequences.
- View the results in the results table.
- Alternatively, click “Search up-to-date Betacoronavirus DB” to search against sequences from the Betacoronavirus BLAST database. This database was created to accommodate the SARS-CoV-2 outbreak and includes all sequences from the genus Betacoronavirus.
- If desired, adjust the view and refine search results in the Results Table by using filters and adding or removing columns.
Click on the "Search by sequence" button, and enter a sequence in one of these formats: plain text, FASTA, or NCBI sequence accession.
Tips:
- For nucleotide searches, NCBI Virus uses the BLAST™ nucleotide tool and databases (BLASTn).
- For protein searches, NCBI Virus uses the BLAST™ protein tool and databases (BLASTp).
- Currently, only one sequence can be searched at a time. There is no option to perform multiple sequence searches simultaneously.
- Read more about BLAST™ searches in the NCBI BLAST Guide.
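Because only one sequence can be searched at a time, it can help to check a FASTA text before pasting it. The Python sketch below is one way to do that locally. The function name `read_single_fasta` is an assumption for this example, and the check covers only the basic FASTA layout (one ">" header line followed by sequence lines), not everything the search form may accept.

```python
def read_single_fasta(text):
    """Return (header, sequence) if text holds exactly one FASTA record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    # Exactly one header is allowed, and it must be the first line.
    if len(headers) != 1 or not lines[0].startswith(">"):
        raise ValueError("expected exactly one FASTA record")
    header = lines[0][1:]
    sequence = "".join(lines[1:])
    return header, sequence
```

A text with two header lines raises `ValueError`, signalling that it should be split into separate searches.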
Search by Virus Name or Taxonomy Group
Find nucleotide or protein sequences by virus name or NCBI Taxonomy database identifier.
- Click “Search by Virus” on the NCBI Virus page or select it from the “Find Data” menu.
- Begin typing the virus name or Taxonomy identifier. An autocomplete list will appear to aid your selection.
- Select the desired virus from the list and click “Search”.
- The search results will appear in a table view that can be refined further by using filters.
- To perform an additional search by virus name or taxonomy on the results table, use the “Virus/Taxonomy” filter (see details about Virus/Taxonomy filter below).
Click on the "Search by Virus" button and start typing a virus/viroid name or NCBI taxid, then select from the dropdown menu.
Tips:
- Find a list of all virus taxonomy terms on the NCBI Taxonomy pages.
- Use the sunburst chart on NCBI Virus to explore various taxonomy groups and their connections.
View Pre-configured Data Sets
Quick Links
Quickly access sequence data for commonly accessed taxonomic and functional groupings via the “Find Data” menu or “Search by Virus” quick links on the home page.
- All viruses
- Human viruses
- H5N* Influenza A virus
- New sequences released in the past month
- SARS-CoV-2 sequences
Select any of the popular searches to view the Results Table for the selected search.
Select any of the popular searches from the "Find Data" dropdown list.
Results are presented in a table view that can be further refined.
Popular Searches Panel
The NCBI Virus Results Table page has a Popular Searches panel above the Results Table. This panel provides links to the Results Table for the following virus groups:
- Influenza virus
- Rotavirus
- Dengue virus
- West Nile virus
- Zika virus
- MERS coronavirus
- Ebolavirus
- SARS-CoV-2 coronavirus
Click on any link in the 'Popular searches' panel to view the updated Results Table.
Exploring Search Results
Results Table Features
After searching, your results will appear in a table.
From here you can:
- Sort any column in ascending or descending order by clicking on the column header sort icons.
- Click on an accession number to display more detailed information about the sequence record.
Click on an Accession link in the Results Table to view that record's details in the fly-out Details panel.
Details panel: Clicking on the Accession number link will open the record's NCBI Entrez page (if Nucleotide or Protein tabs are selected), or the NCBI Genome Assembly page (if NCBI Virus Assembly tab is selected).
- Customize which columns are displayed using "Select columns".
Click the "Select Columns" button to add or remove columns.
- Refine results using filters (more details in Refining Results).
- Navigate between results pages using the page number selector.
- View results in different tabs (Nucleotide, Protein, NCBI Virus Assembly).
Results Table Columns
The columns available in the results table may vary depending on the search type you used. Not all columns are displayed by default, but you can customize the visible columns using the "Select columns" menu.
The following columns are available in the results table for both "Search by Sequence" and "Search by Virus" options:
- Accession: A unique identifier (Nucleotide, Protein or Assembly Accession, depending on the selected tab) assigned to the sequence record in GenBank.
- Organism Name: GenBank organism name, a taxonomic name at the species level or below.
- Assembly: NCBI Virus Assembly accession number. For more information about NCBI Virus assembly, see Data Model, Type of Data and Dataflow section.
- Submitters: Authors who submitted the sequences.
- Organization: Organization or institution the sequence submitters are affiliated with, as well as the country or location of the organization.
- Release Date: The date when the sequence was released in GenBank.
- Isolate: Isolate or strain name from the "/isolate" field of GenBank record. For more information, see Glossary: Isolate.
- Species: Species name, as defined by NCBI Taxonomy.
- Length: Sequence length, the number of nucleotides or amino acids in the sequence.
- Nuc Completeness: Indicates whether the nucleotide sequence is complete or partial. Nucleotide sequences are considered complete if they were submitted as such to GenBank or other INSDC databases. For more information, see Glossary: Nucleotide Completeness.
- Asm completeness: Assembly completeness. Indicates whether the NCBI Virus assembly is complete or partial. For more information about NCBI Virus assemblies, see the Data Model, Type of Data and Dataflow section.
- Geo Location: The geographic location where the virus was isolated. NCBI Virus relies solely on the information provided by submitters, which may sometimes mistakenly identify regions where sequences were processed instead of the actual collection locations. The granularity of the information displayed in the column depends on the submitter. For example, it can be "Australia: Heron Island, Great Barrier Reef", or just "China". The country and other geographic location names are based on the INSDC Geographic Location list.
- USA: The name of the US state, if the sample was collected in the USA.
- Host: The host organism from which the virus was isolated. This is the submitter-provided host organism, not the full host range known for the virus. For more information, see Glossary: Host.
- Tissue/Specimen/Source: Isolation source, the part of the host organism where the sample was obtained. For more information, see Glossary: Tissue/Specimen/Source.
- Collection Date: The date when the virus sample was collected.
- SRA accession: NCBI Sequence Read Archive (SRA) accession number.
- Genus: Genus name, as identified by NCBI Taxonomy.
- Family: Family name, as identified by NCBI Taxonomy.
- Molecule Type: Genome molecule type, type of viral nucleic acid, as provided by ICTV. For more information, see Glossary: Genome Molecule Type.
- GenBank/RefSeq: Specifies if the sequence is from GenBank or RefSeq database.
- Genotype: Genotype or subtype of a viral sequence. For more information, see Glossary: Genotype.
- Segment: Segment name or number representing a genome segment. For more information, see Glossary: Segment.
- Publications: Number of publications in PubMed associated with the sequence.
- Country: The country of specimen collection.
- BioSample: The BioSample accession number associated with the sequence.
- BioProject: The BioProject accession number associated with the sequence.
- GenBank Title: The title of the GenBank record associated with the sequence.
Additional columns specific only to the "Search by Sequence" option:
- Coverage: Query coverage, the percentage of the query sequence that aligns with the target sequence in GenBank.
- Identity: The percentage of identical matches between the query and target sequences in GenBank.
- Score: BLAST score, the total alignment score (total score) from all alignment segments.
Each column can be sorted in ascending or descending order by clicking on the column header. Clicking on an accession number will display more detailed information about the sequence record.
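To make the Coverage and Identity columns more concrete, the Python sketch below shows how such percentages are conventionally derived from alignment data. The function names and the input layout (1-based, inclusive query ranges) are assumptions for this example, and it is not confirmed here that NCBI Virus computes the columns in exactly this way.

```python
def query_coverage(query_length, aligned_ranges):
    """Percentage of query positions covered by at least one aligned range.

    aligned_ranges holds (start, end) pairs on the query, 1-based inclusive.
    Overlapping ranges are counted once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(aligned_ranges):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_length


def percent_identity(identical_positions, alignment_length):
    """Percentage of alignment columns where query and target match."""
    return 100.0 * identical_positions / alignment_length
```

For a 100-residue query aligned over positions 1-50 and 41-80, the overlap is counted once, so 80 positions are covered.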
Results Table Tabs
The NCBI Virus results table provides separate tabs for accessing virus sequence records in different contexts:
- Nucleotide,
- Protein,
- NCBI Virus Assembly.
Nucleotide Tab
The Nucleotide tab displays both GenBank and RefSeq nucleotide sequence records for viruses.
These records represent the genetic material of viruses and may include complete genomes, partial sequences, or specific genomic regions. Users can filter and sort the nucleotide records based on various criteria, such as virus species, host, geographic location, and sequence length.
Users can also see the connection of nucleotide accessions to the NCBI Virus Genome assembly accessions through the Assembly column.
Protein Tab
The Protein tab displays both GenBank and RefSeq protein sequence records for viruses.
These records represent the amino acid sequences of viral proteins, which are encoded by the viral genome. Users can filter and sort the protein records based on criteria similar to those available for nucleotide records, as well as additional protein-specific fields, such as protein name and function.
It is important to note that the Nucleotide and Protein tabs are not directly connected. When switching between these tabs, the records displayed may not have a one-to-one correspondence with the records from the previous tab. The order and organization of records in the Protein tab will be different from those in the Nucleotide tab, as they represent distinct types of sequences with their own sets of metadata and annotations. However, users can see the connection with nucleotide records through the Nucleotide column, which contains nucleotide accessions corresponding to the protein accessions in the table.
Users can also see the connection of protein accessions to the NCBI Virus Genome assembly accessions through the Assembly column.
NCBI Virus Assembly Tab
The NCBI Virus Assembly tab provides access to genome assembly records for viruses, including RefSeq genome assemblies and genome assemblies for segmented viruses.
These assembly records represent complete or nearly complete viral genomes and provide additional information and annotations beyond individual nucleotide or protein sequences.
For more information on the types of sequence records and assembly records available in NCBI Virus, please refer to the Data Model, Type of Data and Dataflow section.
Multiple Sequence Alignment (Accessed via "Align" button)
Generate a multiple sequence alignment from selected search results:
- Select the desired sequences using the checkboxes next to each row.
- Click the "Align" button above the results table.
- View the alignment in a new window.
- Refine the alignment using the toolbox options (e.g. use different coloring schemes, select sequences, use zooming options).
- Download the alignment in FASTA or PDF/SVG formats.
Multiple sequence alignments are calculated using the MUSCLE (Multiple Sequence Comparison by Log-Expectation) algorithm.
Important:
This alignment tool is designed for quick visualization and preliminary analysis. For publication-quality alignments, we recommend using dedicated alignment software and manually reviewing the results.
Read more about how to use the alignment viewer in the NCBI Multiple Sequence Alignment Viewer documentation.
Select the desired sequences and click the "Align" button to build a Multiple Sequence Alignment.
BLAST Similarity-Score Based Distance Tree (Accessed via "Build Phylogenetic Tree" button)
Generate a quick distance tree based on BLAST similarity scores from selected search results:
- Select the desired sequences using the checkboxes next to each row.
- Click the “Build Phylogenetic Tree” button above the results table.
- The tree will appear in a new window.
- Interact with the tree by zooming, collapsing/expanding branches, etc.
- Search for particular nodes.
- Download the tree in ASN, Newick, Nexus or PDF format.
Select the desired sequences and click the "Build Phylogenetic Tree" button to build a BLAST Similarity-Score Based Distance Tree.
Limitations:
- This tree is suitable for quick visualization and preliminary analysis of sequence relationships.
- It should not be used for publication or as definitive evidence of evolutionary relationships.
- For publication-quality phylogenetic analyses, we strongly recommend using dedicated phylogenetic software, selecting appropriate evolutionary models, and critically evaluating the results.
Important
This tree is not a true phylogenetic tree but a BLAST similarity-score based distance tree. It is generated using BLAST comparisons, with BLAST scores used as tree distance parameters. The NCBI Tree Viewer displays this data without applying additional phylogenetic algorithms.
For more information about the Tree Viewer and how to use it, please refer to the NCBI Tree Viewer documentation.
Refining Results via Filters
The left sidebar of the NCBI Virus results table page provides filters for refining your results.
Filters can be used in any combination. When multiple filters are used, they are connected with the AND logical operator, so the results include only sequences that match all the provided criteria. Filters are applied and highlighted dynamically as you select them.
Active filters appear above the results table. Remove a filter by clicking the “X” next to it.
Clear all filters using the “Reset All” button.
Refine results by applying various filters from the left-side panel.
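The AND behavior of combined filters can be sketched in a few lines of Python. The record layout, the predicate functions, and the name `apply_filters` are assumptions for this example. The sketch only illustrates the logic and does not reflect how the site implements it.

```python
def apply_filters(records, filters):
    """Keep only records for which every filter predicate returns True."""
    return [rec for rec in records if all(pred(rec) for pred in filters)]


# Hypothetical records and two example predicates.
records = [
    {"accession": "A1", "host": "Homo sapiens", "length": 29903},
    {"accession": "A2", "host": "Homo sapiens", "length": 500},
    {"accession": "A3", "host": "Sus scrofa", "length": 30000},
]
is_human_host = lambda rec: rec["host"] == "Homo sapiens"
is_long = lambda rec: rec["length"] >= 1000

matching = apply_filters(records, [is_human_host, is_long])
```

Only the record that satisfies both predicates remains, mirroring how each added filter narrows the results.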
Virus/Taxonomy
Virus or viroid name, taxonomy group, synonyms or taxids.
- Start typing a virus or viroid name, NCBI Taxonomy acronym, synonym (NCBI Taxonomy equivalent) or taxid (NCBI Taxonomy database ID) in the text box, e.g. "Influenza A virus", "FLUAV", "Influenza virus type A", "11320".
- Select the desired taxid from the top five suggestions.
- One can filter/search for any level of hierarchy in NCBI Taxonomy (organism name, family name, genus name, etc.).
- Select “Exclude SARS-CoV-2” to exclude SARS-CoV-2 sequences from your search results.
- The filtered results are displayed in the Results Table.
- For more information, see Taxonomy Validation section.
Accession
NCBI Accession number(s) from GenBank, BioProject, BioSample, SRA and/or Assembly accession.
- Enter the accession number(s) in the search box.
- Results Table will display only the sequences matching the provided accession number(s).
Sequence Length
Min and/or Max sequence length. The range is applied to nucleotide sequences independently from protein sequences.
- Enter the desired length range in the input fields.
- The sequence length filter is disabled for the NCBI Virus Assembly tab.
Ambiguous Characters
Maximum number of ambiguous characters (N’s in nucleotide or X’s in protein) allowed in each sequence.
- Enter the desired maximum number of ambiguous characters in the input field.
- The number of characters is applied to nucleotide sequences independently from protein sequences and is disabled for the NCBI Virus Assembly tab.
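As an illustration of what this filter counts, the Python sketch below tallies N characters in a nucleotide sequence or X characters in a protein sequence. The function names and the case-insensitive counting are assumptions for this example.

```python
def count_ambiguous(sequence, molecule):
    """Count N's (nucleotide) or X's (protein) in a sequence string."""
    if molecule not in ("nucleotide", "protein"):
        raise ValueError("molecule must be 'nucleotide' or 'protein'")
    symbol = "N" if molecule == "nucleotide" else "X"
    return sequence.upper().count(symbol)


def passes_ambiguity_limit(sequence, molecule, max_ambiguous):
    """True if the sequence has no more ambiguous characters than allowed."""
    return count_ambiguous(sequence, molecule) <= max_ambiguous
```

A sequence with three N's would pass a limit of 3 but be excluded by a limit of 2.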
GenBank/RefSeq
GenBank and/or RefSeq. GenBank and RefSeq sequence records are mutually exclusive.
- Select the desired sequence type(s) from the options.
- GenBank includes assembled sequence data, while RefSeq includes curated reference sequences.
- RefSeq records may be partial genomes if they are exemplars from the International Committee on Taxonomy of Viruses (ICTV).
- For more information about GenBank and RefSeq records, see Data Model, Type of Data and Dataflow section.
Assembly Completeness
Complete and/or partial NCBI Virus assembly records.
- Select the desired completeness option(s).
- Partial NCBI Virus assemblies are created for partial RefSeqs. RefSeq records that are partial genomes and are exemplars from the International Committee on Taxonomy of Viruses (ICTV) compose partial NCBI Virus assemblies.
- For segmented virus genome assemblies, currently only complete assemblies are available. We do not create partial assemblies from incomplete segment sets at this time.
- To learn more about NCBI Virus assemblies, see Data Model, Type of Data and Dataflow section.
Nucleotide Completeness
Complete and/or partial nucleotide records. Nucleotide sequences are considered complete if they were submitted as such to GenBank or other INSDC databases.
- Select “complete” to include only complete nucleotide sequences, “partial” to include only partial nucleotide sequences, or both options to include all sequences.
- For more information, see Glossary: Nucleotide Completeness.
Pango Lineage
Name of SARS-CoV-2 Pango lineage assigned to sequence record using Pangolin with UShER.
- Enter the desired Pangolin lineage in the search box. All SARS-CoV-2 GenBank records are reprocessed nightly by the Pangolin pipeline using UShER.
- The field will be empty if the sequence was released after the pipeline run for that day.
- Pangolin version information may be downloaded by requesting to include the Pangolin column in a download file.
- To see relationships between Pango lineages and WHO labels, please visit WHO: Tracking SARS-CoV-2 variants.
- In the downloaded file, the PangoVersions column includes versions of tools used in the format: pangolin/pangolin-data/constellations/scorpio, for example, 4.0.6/1.8/v0.1.8/0.3.17.
- This filter is available only on the SARS-CoV-2 Data Hub.
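The PangoVersions value described above can be split into its four tool versions with a short Python sketch. The function name `parse_pango_versions` is an assumption for this example. The field order follows the format given in the text.

```python
# Field order as given in the text: pangolin/pangolin-data/constellations/scorpio
PANGO_TOOLS = ("pangolin", "pangolin-data", "constellations", "scorpio")


def parse_pango_versions(value):
    """Split a PangoVersions string into a tool-name to version mapping."""
    parts = value.strip().split("/")
    if len(parts) != len(PANGO_TOOLS):
        raise ValueError("expected four slash-separated versions")
    return dict(zip(PANGO_TOOLS, parts))
```

Applied to the example value from the text, the sketch yields one version string per tool.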
Surveillance Sampling
SARS-CoV-2 sequences collected randomly in the population, for the purpose of baseline surveillance - not including samples collected for vaccine breakthrough or localized outbreak investigations.
- This filter can help determine which lineages are increasing in frequency or provide a rough estimate of the infection rate in geographical regions where infection rate data are not yet available. Indicators like “baseline surveillance” or “random sampling” come from the “KEYWORDS” or “/notes” fields in the GenBank record, or the “purpose of sequencing” field in the BioSample record.
- Select “Include” to display randomly sampled sequences along with other sequences in the results.
- Select “Exclude” to remove randomly sampled sequences from the results.
- Select “Only” to display only randomly sampled sequences in the results.
- This filter is available only on the SARS-CoV-2 Data Hub.
Isolate/Strain Name
Isolate or strain name from the "/isolate" field of GenBank record.
- Enter the desired isolate name in the search box.
- The isolate or strain name is parsed from the “/isolate” field of the GenBank record.
- SARS-CoV-2 sequence isolate names are formatted according to the definitions provided by the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV).
Has Proteins
Sequence records which contain given protein name(s).
- Start typing the protein name in the search box and select from the offered suggestions.
- Protein names are parsed from the “/product” field of GenBank nucleotide and protein records.
- In the Nucleotide tab, the results may contain entire genomes.
- Protein names in records are not normalized and are presented as submitted.
- Multiple proteins can be searched and added to the search results.
- The Results Table will display sequences containing at least one of the specified proteins.
- Nucleotide records may contain other proteins in addition to those specified in the search.
Provirus/Integrated
Provirus is a sequence obtained from a virus, or a phage, that is integrated into the genome of another host organism, as indicated by the sequence submitter.
- Proviral sequences are identified by the presence of the “/proviral” source qualifier in the GenBank record.
- Select “Include” to include proviral sequences in the results along with other sequences.
- Select “Exclude” to remove proviral sequences from the results.
- Select “Only” to display only proviral sequences.
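Several filters on this page share the same Include / Exclude / Only choice. The sketch below shows how the three modes behave on already downloaded metadata. The record layout (a list of dictionaries with a boolean `proviral` key) is an assumption made for illustration, not the NCBI Virus data model.

```python
from typing import Dict, Iterable, List


def apply_flag_filter(records: Iterable[Dict], flag: str, mode: str) -> List[Dict]:
    """Mimic the Include / Exclude / Only choice for a boolean attribute."""
    records = list(records)
    if mode == "Include":
        # Flagged records are shown together with all other records.
        return records
    if mode == "Exclude":
        return [r for r in records if not r.get(flag, False)]
    if mode == "Only":
        return [r for r in records if r.get(flag, False)]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical records used only to demonstrate the three modes.
rows = [
    {"accession": "A1", "proviral": True},
    {"accession": "A2", "proviral": False},
    {"accession": "A3"},
]
print([r["accession"] for r in apply_flag_filter(rows, "proviral", "Only")])
```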
Geographic Region
Global geographic areas from which the sequences were collected. When a region is chosen, all countries within that region will be automatically selected as well.
- NCBI Virus relies solely on the information provided by submitters, which may sometimes mistakenly identify regions where sequences were processed instead of the actual collection locations. The country and other geographic location names are based on the INSDC Geographic Location list.
- Type the desired country or geographic region in the search box or select continents and countries from the expandable menu.
- Selecting a continent automatically selects all countries within that continent.
- Multiple selections are allowed.
Host
Submitter-provided host organism, not the known host range of a virus. Start typing a name or taxid to select from a list of top 10 suggestions. Optionally, select Human or non-Human.
- Enter the host name or taxid in the search box.
- Select the desired host from the suggestions.
- Optionally, select the “Human” or “non-Human” checkboxes to filter sequences accordingly.
- Multiple host selections are allowed.
- For more information, see Glossary: Host.
Submitters
Submitter/Author names, affiliated institution or affiliation country/location of persons who submitted the sequences. Must use quotes to search with a phrase. Case-sensitive.
- Submitters Names: Enter the last names of the sequence submission authors with or without initials.
- Submitters Affiliation Institution: Enter the submitters’ affiliation institution, which can be an official abbreviation.
- Submitters Affiliation Country/Location: Enter the submitters’ affiliation country or location name according to the INSDC Geographic Location list.
- If multiple entries are given, the filter applies the AND logical operator.
You can use any combination of these search windows to filter the sequences. When multiple search windows are used, the filter applies an AND logic, meaning that the results will include sequences that match all the provided criteria.
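The AND logic across the three search windows can be pictured as follows. This is a sketch only: the field names are hypothetical, and it assumes simple case-sensitive substring matching, which may differ from how the site matches terms internally.

```python
from typing import Mapping, Optional


def matches_submitters(
    record: Mapping[str, str],
    names: Optional[str] = None,
    institution: Optional[str] = None,
    country: Optional[str] = None,
) -> bool:
    """Return True only if every non-empty search term is found in its field."""
    criteria = {
        "submitter_names": names,
        "affiliation_institution": institution,
        "affiliation_country": country,
    }
    # Search windows left empty are skipped; all remaining terms must match (AND).
    # The `in` operator is case-sensitive, in line with the filter description.
    return all(
        term in record.get(field, "")
        for field, term in criteria.items()
        if term
    )


# Hypothetical record used only for demonstration.
record = {
    "submitter_names": "Smith,J.; Lee,K.",
    "affiliation_institution": "CDC",
    "affiliation_country": "USA",
}
print(matches_submitters(record, names="Smith", country="USA"))
```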
Tissue/Specimen/Source
Part of the host organism, where the sample was obtained.
- Select one or more isolation source terms from the available options.
- GenBank isolation source is mapped to standardized terms and can be mapped to more than one such term. Filter allows multiple selections.
- For more information, see Glossary: Tissue / Specimen / Source.
Collection Date
Filter sequences by the sample collection date range.
- Enter the desired date range in the "From" and "To" fields using the format mm/dd/yyyy or yyyy.
- Alternatively, use the calendar picker to select the dates.
- Both the "From" and "To" fields must be filled to apply the filter.
- Searching by Collection Date may return records with incomplete dates (where day or month is missing). Results will be included if the year matches your search criteria, even if the month or day fields are empty.
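The handling of incomplete dates can be illustrated with a small sketch. It assumes dates are stored as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` strings (an assumption about the downloaded metadata) and treats any incomplete date as matching when its year falls within the years of the requested range.

```python
from datetime import date


def in_date_range(value: str, start: date, end: date) -> bool:
    """Check a possibly incomplete date string against a From/To range."""
    parts = value.split("-")
    year = int(parts[0])
    if len(parts) == 3:
        # Complete date: compare day by day.
        return start <= date(year, int(parts[1]), int(parts[2])) <= end
    # Incomplete date (day or month missing): compare by year only.
    return start.year <= year <= end.year


start, end = date(2020, 3, 1), date(2020, 6, 30)
for value in ("2020-04-10", "2020-01-15", "2020", "2019"):
    print(value, in_date_range(value, start, end))
```

With this rule, a year-only record from 2020 is returned for a March to June 2020 search, which is why result sets may contain records whose exact collection day is unknown.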
Release Date
Date range when samples were released through GenBank.
- Enter the desired date range in the "From" and "To" fields using the format mm/dd/yyyy or yyyy.
- Alternatively, use the calendar picker to select the dates.
- Both the "From" and "To" fields must be filled to apply the filter.
- Searching by Release Date may return records with incomplete dates (where day or month is missing). Results will be included if the year matches your search criteria, even if the month or day fields are empty.
Genome Molecule Type
Type of viral nucleic acid, as provided by ICTV.
- Molecule type (e.g., DNA, RNA), as provided by the International Committee on Taxonomy of Viruses (ICTV) in the Master Species List.
- Select the desired genome type(s) from the available options.
- RefSeq records with "Unknown" molecule type belong to taxonomic groups not yet recognized by ICTV.
Environmental Source
Environmental source of the sample when the sample is not associated with a host organism.
- Typically, environmental isolates are determined by conducting searches based on specific keywords, such as "sewage" or "ocean water," within the "/isolation_source" and "note" fields of GenBank records in cases where the "host" field is empty.
- The list of environmental terms selected for environmental source mapping is manually maintained by the NCBI Virus group.
- Select "Include" to display sequences with an environmental source along with other sequences in the results.
- Select "Exclude" to remove sequences with an environmental source from the results.
- Select "Only" to display only sequences with an environmental source in the results.
Lab Passaged
Indicator that sequences came from samples from a laboratory setting, and not from a wild host.
- This filter identifies sequences that come from samples grown in a laboratory setting, rather than from a wild host.
- Lab passaged is identified by searching within the "lab_host" field of the GenBank record. In the case of bacteriophages, if both the "host" and "lab_host" fields are empty in GenBank, the lab host can be extracted from the bacteriophage organism name in the GenBank record.
- Select "Include" to display sequences with a laboratory host along with other sequences in the results.
- Select "Exclude" to remove sequences with a laboratory host from the results.
- Select "Only" to display only sequences with a laboratory host in the results.
Vaccine Strain
Strain used for generating vaccines.
- Vaccine strains are identified by searching the "/isolation_source", "/note", "/host", and definition line of the GenBank record.
- Select "Include" to include vaccine strains in the results along with other sequences.
- Select "Exclude" to remove vaccine strains from the results.
- Select "Only" to display only vaccine strains.
Segment
Name or number representing a genome segment. Note: segment name is provided by submitter.
- Select the desired segment name(s) or number(s) from the provided options.
- Segment name is provided by the submitter; no additional check or curation was done, so it may contain errors.
- For more information, see Glossary: Segment.
Genotype
Genotype (subtype) of a viral sequence.
- Enter a genotype (subtype) or serotype name, then click "Submit".
- For more information, see Glossary: Genotype.
Downloading Data
Downloading Sequences from NCBI Virus
Step 1: Select Data Type
- Click the Download button located on the upper left side of the NCBI Virus Results Table page.
Click on the "Download" button to open the Download Results menu.
- Choose the type of data you want to download:
- Nucleotide, Protein, or Coding Region Sequence (CDS) in FASTA format (Note: Randomized subsets are not available for CDS FASTA files).
- Accession List for nucleotide, protein, or assembly records (Note: Randomized subsets are not available for CDS accession lists).
- Results Table contents in CSV (Comma Separated Values) or XML format, which includes metadata.
Selected data type: FASTA Format (Nucleotide, Protein, or CDS)
Download results in FASTA format, Step 1: Select Data Type.
Step 2: Select Records
- Decide which records to download:
- Only selected records using checkboxes in the results table.
- All records in the results table.
- Randomized subset of up to 2000 records in the Results Table (Note: Not available for CDS FASTA files).
Download results in FASTA format, Step 2: Select which records to download.
- For a randomized subset:
- Choose either a fully randomized subset or a stratified subset.
- Enter the total number of records you want to download and, if stratified, select the category for stratification: Country, Collection Year, Release Year or Host.
- When downloading a stratified randomized subset, the file name will include the date of download and the randomization seed used. Download file format example:
sequences_[MMDDYYYY]_[seed].fasta
Download results in FASTA format, Step 2: Download randomized subset.
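When organizing many downloaded subsets, the date and seed can be recovered from the file name. The sketch below assumes the seed consists of letters, digits, or underscores, which is not stated in this documentation, and uses a made-up file name for demonstration.

```python
import re
from datetime import date, datetime
from typing import Tuple

# Pattern for sequences_[MMDDYYYY]_[seed].<extension>; the seed alphabet is assumed.
SUBSET_NAME = re.compile(r"^sequences_(\d{8})_(\w+)\.(fasta|acc|csv|xml)$")


def parse_subset_filename(name: str) -> Tuple[date, str, str]:
    """Return (download date, seed, extension) for a randomized subset file."""
    match = SUBSET_NAME.match(name)
    if match is None:
        raise ValueError(f"not a randomized subset file name: {name!r}")
    downloaded = datetime.strptime(match.group(1), "%m%d%Y").date()
    return downloaded, match.group(2), match.group(3)


# Hypothetical file name following the documented pattern.
print(parse_subset_filename("sequences_07152024_12345.fasta"))
```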
Step 3: Customize Sequence Titles (Optional)
Customize the FASTA definition line:
- Nucleotide/Protein Sequence Data default format:
(accession) | (GenBank title)
>AAO17794 | VP4 spike protein [Human rotavirus A]
- Coding Region Data default format:
(nucleotide accession):(cds coordinates) | (GenBank title)
>NC_045425.1:319..1659 | replication endonuclease [Thermus phage phiOH3]
- Use the Build custom sequence title option to select from various columns like Assembly, SRA accession, Submitters, Release date, and more.
- Click Next and follow the prompts to start your download.
Download Results in FASTA format, Step 3: Build a custom FASTA defline.
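The default definition lines can be split back into their parts after download. This sketch is based only on the two example lines shown above. It assumes the separator is ` | ` and that coding region coordinates follow the accession after a colon. Custom titles built with the Build custom sequence title option will need different parsing.

```python
from typing import Dict, Optional


def parse_default_defline(defline: str) -> Dict[str, Optional[str]]:
    """Split a default NCBI Virus FASTA defline into accession, coordinates, title."""
    text = defline.lstrip(">").strip()
    identifier, _, title = text.partition(" | ")
    # Coding region deflines carry coordinates after a colon, e.g. 319..1659.
    accession, sep, coords = identifier.partition(":")
    return {
        "accession": accession,
        "cds_coordinates": coords if sep else None,
        "title": title,
    }


print(parse_default_defline(">AAO17794 | VP4 spike protein [Human rotavirus A]"))
print(parse_default_defline(">NC_045425.1:319..1659 | replication endonuclease [Thermus phage phiOH3]"))
```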
Selected data type: Accession List
Download Accession List, Step 1: Select an accession type.
Step 2: Select Records
- Decide which records to download:
- Only selected records using checkboxes in the results table.
- All records in the results table.
- Randomized subset of up to 2000 records in the Results Table (Note: Not available for CDS accession lists).
- For a randomized subset:
- Choose either a fully randomized subset or a stratified subset.
- Enter the total number of records you want to download and, if stratified, select the category for stratification: Country, Collection Year, Release Year or Host.
- When downloading a stratified randomized subset, the file name will include the date of download and the randomization seed used. Download file format example:
sequences_[MMDDYYYY]_[seed].acc
- Click Next and follow the prompts to start your download.
Selected data type: Results Table (CSV or XML)
Download Results Table, Step 1: Select a format for the Results Table file.
Step 2: Select Records
- Decide which records to download:
- Only selected records using checkboxes in the results table.
- All records in the results table.
- Randomized subset of up to 2000 records in the Results Table.
- For a randomized subset:
- Choose either a fully randomized subset or a stratified subset.
- Enter the total number of records you want to download and, if stratified, select the category for stratification: Country, Collection Year, Release Year or Host.
- When downloading a stratified randomized subset, the file name will include the date of download and the randomization seed used. Download file format examples:
- Results Table (CSV):
sequences_[MMDDYYYY]_[seed].csv
- Results Table (XML):
sequences_[MMDDYYYY]_[seed].xml
- Click Next and follow the prompts to start your download.
Additional Information
Filters and Randomization
- Apply filters before randomization.
- Select the appropriate data type tab (Nucleotide, Protein, NCBI Virus Assembly) before opening the download menu.
- If you picked the “Nucleotide” tab, you will only be able to download randomized sequence data in FASTA Nucleotide, Nucleotide Accession list, XML, and CSV formats.
- If you picked the “Protein” tab, you will only be able to download randomized sequence data in FASTA Protein, Protein Accession List, XML, and CSV formats.
- If you picked the “NCBI Virus Assembly” tab, you will only be able to download randomized sequence data in Assembly Accession List, XML, and CSV formats.
For large datasets:
Consider downloading data in smaller batches, especially when dealing with long sequences or including associated metadata.
If you are experiencing difficulties with very large downloads, you may want to explore alternative methods such as FTP access or NCBI Datasets, which are optimized for bulk retrieval.
Disclaimers
- Our current platform does not support repeatable randomized searches. We understand the importance of repeatability in the scientific community and are working to include this feature in future updates.
- Downloading randomized subsets is currently available for nucleotide, protein, and assembly records. We are working to make them available for coding region records in the future.
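If repeatability matters for an analysis, one possible workaround is to download the full Results Table and draw the subset locally with a fixed seed. The sketch below is not how NCBI Virus performs its randomization. It assumes records are dictionaries and allocates an equal number of records to each stratum, which is only one of several reasonable stratification schemes.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(records: List[Dict], key: str, total: int, seed: int) -> List[Dict]:
    """Draw a repeatable sample with roughly equal counts per value of `key`."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    if not groups:
        return []
    per_group = max(1, total // len(groups))
    sample: List[Dict] = []
    # Visit strata in sorted order so the result does not depend on input order of groups.
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample[:total]


# Hypothetical metadata rows used only for demonstration.
rows = [{"acc": f"A{i}", "country": c}
        for i, c in enumerate(["USA"] * 10 + ["Kenya"] * 5 + ["Japan"] * 3)]
subset = stratified_sample(rows, "country", total=6, seed=42)
print(len(subset), sorted({r["country"] for r in subset}))
```

Running the function again with the same rows and seed returns the same subset, so the seed can be recorded alongside the analysis.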
Alternative methods for downloading virus sequences
While the NCBI Virus user interface is the primary access point for searching and downloading virus sequences, there are alternative options for programmatic access and bulk download of the data.
NCBI FTP Site
Virus data are a part of the NCBI FTP site, which provides access to a wide range of sequence data.
To access virus nucleotide sequences and/or associated metadata, navigate to the Viruses directory on the NCBI FTP site https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ to find:
- Full metadata table based on the Nucleotide tab in the NCBI Virus, in CSV format:
- Complete list of nucleotide records, in the FASTA format:
To access only RefSeq viral genomes go to https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/ to find:
- individual virus folders that contain files with various levels of information about RefSeq assemblies
- Complete release of virus RefSeqs in https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
NCBI Datasets
NCBI Datasets provides separate tools for downloading genome sequence data and metadata.
For tutorials on programmatic access and command-line tools available in the NCBI Datasets resource, visit https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/
Using Visual Dashboards
NCBI Virus includes two interactive visual dashboards for exploring viral sequence data: Home Page Dashboard and Visual Filters Dashboard.
NCBI Virus Home Page Dashboard
The home page dashboard provides several ways to explore the available data:
- Statistics buttons with counts of nucleotide, protein, genome, and CDS sequences.
- RefSeq Nucleotides: all viral nucleotide reference sequences available at NCBI (find more about reference sequences here).
- All Proteins: all NCBI viral protein sequences, including RefSeq proteins.
- All Nucleotides: all viral nucleotide records available at NCBI, including RefSeqs.
- RefSeq Proteins: all viral protein reference sequences available at NCBI.
- Complete Nucleotides: all NCBI viral nucleotide sequences, where GenBank ASN.1 format contains the following descriptors:
descr/molinfo/completeness=complete
or "complete genome" is present in the record’s definition line (defline). It also includes complete reference records (RefSeqs). For more information about nucleotide completeness, refer to Glossary: Nucleotide Completeness.
- Sunburst chart to explore the virus taxonomy hierarchy.
- Host distribution bar chart to view sequences by host species.
Home Page Dashboard: To view the results selected on this dashboard in a tabular format, click any of the statistics buttons.
Interact with the dashboard:
- Click a statistic button to view matching records in the Results Table
- Clicking on each button will show a results table with the corresponding sequences.
- Results can be further refined by using filters for various sequence attributes (metadata) located on the left side of the Results Table page (learn more here).
- Explore virus taxonomy using the sunburst chart.
- The default view represents the classification for all available NCBI virus and viroid taxa.
- The inner layer (ring) represents non-taxonomic groups of viruses, including RNA viruses, DNA viruses, and Unclassified viruses.
- Only 4 levels of the whole hierarchy are visible on the plot at a given time.
- Click on any slice (section) of any layer to zoom into the selected taxa and display additional subtaxa.
- Hover over a slice to view the taxon name and breadcrumbs.
- Breadcrumbs above the chart show the location of the taxa in the hierarchy; clicking a breadcrumb will refocus the plot on the selected taxa.
- Clicking the center of the chart will return to the parent taxon.
- Select a host species from the bar chart
- Each bar is proportional to the number of virus sequences isolated from that host.
- Click a bar or host name to highlight the selected host and associated taxa in the sunburst chart.
- Only one host can be selected at a time.
- Click the selected host again or use the “Reset” button to deselect.
- Use the scroll bar or “CTRL+F” to search for a specific host.
- After selecting a host or taxon, the statistics buttons in the top row will update.
- Click a highlighted taxon in the sunburst chart to focus on taxa containing sequences from the selected host.
- The lower layers will highlight taxa with sequences from the selected host.
- Taxa that do not include sequences from the selected host will not be highlighted.
Visual Filters Dashboard
The "Visual Filters for GenBank Sequences" dashboard allows you to interactively filter search results by collection location and date.
Important:
The Visual Filters Dashboard is specifically designed to provide detailed insights for individual virus taxa or the entire virus database. It is not available if multiple virus taxids are selected in the Virus/Taxonomy filter on the Results Table page. If you need to analyze data from multiple specific taxa, consider examining each taxon individually using the dashboard.
To access the Visual Filters Dashboard:
- Search for a virus from the home page or Results Table (or leave all viruses selected by default).
- Optional: modify the Results Table using other filters.
- Click the “Visual Filters for GenBank Sequences” tab above the Results Table
- Alternatively, append the virus taxonomy ID to the URL:
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/dashboard?taxid=<taxid>
Click on the "Visual Filters for GenBank Sequences" tab to access the Visual Filters Dashboard.
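The URL pattern above can also be filled in programmatically, for example when preparing links for several taxa. A minimal sketch follows. The helper name is illustrative, and the taxid used in the example (2697049, SARS-CoV-2) is chosen only for demonstration.

```python
from urllib.parse import urlencode

DASHBOARD_BASE = "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/dashboard"


def dashboard_url(taxid: int) -> str:
    """Build a Visual Filters Dashboard link for a single virus taxonomy ID."""
    if taxid <= 0:
        raise ValueError("taxid must be a positive integer")
    return f"{DASHBOARD_BASE}?{urlencode({'taxid': taxid})}"


print(dashboard_url(2697049))
```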
Important note on switching between Results Table and Visual Filters Dashboard:
When transitioning from the Results Table to the Visual Filters Dashboard, please be aware of the following:
- Preserved Filters:
- Virus/Taxonomy (only for single taxon selections)
- Collection Date
- Release Date
- Geographic Region (partially - see details below)
- Geographic Region Filter Behavior:
- Country and US state selections are preserved
- Continent selections are reset (not available in Visual Filters Dashboard)
- All other filters are reset when switching to the Visual Filters Dashboard
- When switching back to the Results Table from the Visual Filters Dashboard:
- Filters applied in the Visual Filters Dashboard (including country and US state selections) will be reflected in the Results Table.
- Previously applied filters that were reset will not be automatically restored. You will need to reapply these filters if needed.
Filter behavior when switching between Results Table and Visual Filters Dashboard:
Filter Type | Behavior When Switching to Visual Filters Dashboard |
---|---|
Virus/Taxonomy | Preserved (single taxon only) |
Collection Date | Preserved |
Release Date | Preserved |
Geographic Region - Countries | Preserved |
Geographic Region - US States | Preserved |
Geographic Region - Continents | Reset (not available in Visual Filters Dashboard) |
All Other Filters | Reset |
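The table above can be read as a simple rule set. The sketch below models it on a hypothetical dictionary of active filters. The key names and nesting are assumptions made for illustration and do not reflect how the site stores filter state. When more than one taxon is selected the dashboard is not available at all, so the sketch simply drops the taxonomy selection in that case.

```python
from typing import Any, Dict


def filters_after_switch(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Return the filters that survive a switch to the Visual Filters Dashboard."""
    kept: Dict[str, Any] = {}
    taxa = filters.get("Virus/Taxonomy", [])
    if len(taxa) == 1:
        # Preserved for single taxon selections only.
        kept["Virus/Taxonomy"] = list(taxa)
    for name in ("Collection Date", "Release Date"):
        if name in filters:
            kept[name] = filters[name]
    region = filters.get("Geographic Region", {})
    # Countries and US states are preserved; continents are reset.
    kept_region = {k: region[k] for k in ("countries", "us_states") if region.get(k)}
    if kept_region:
        kept["Geographic Region"] = kept_region
    # Every other filter (for example Host) is reset and therefore omitted.
    return kept


active = {
    "Virus/Taxonomy": [2697049],
    "Collection Date": ("2020", "2021"),
    "Host": ["Homo sapiens"],
    "Geographic Region": {"countries": ["Kenya"], "us_states": [], "continents": ["Africa"]},
}
print(filters_after_switch(active))
```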
To filter results using Visual Filters Dashboard:
- Explore the Geographic Distribution:
- View the world map displaying the distribution of sequences based on their collection locations.
- Note: color shades represent nucleotide record numbers; darker shades indicate higher numbers.
- Click on countries or states to select them.
- Use the International/USA toggle to switch between world and US views.
- Select multiple locations by clicking on them.
- To choose a single region, type its name and select from the dropdown.
- Remember: changing between International and USA views will reset your selections.
- Use Timeline Sliders for Date Filtering:
- Adjust the Collection Time slider (earliest to current collection year).
- Adjust the Release Time slider (first release to current year).
- Drag slider handles or date bars on the chart to set specific ranges.
- Select weekly, monthly, or yearly intervals.
- Use bi-yearly intervals from the dropdown selector to zoom into specific time periods.
- Note: For incomplete collection dates, records are shown as follows:
- Year only: displayed as January 1 of that year.
- Year and month only: displayed as the first day of that month.
- Apply Filters:
- Click a bar on the timeline or select a time interval with the sliders.
- The dashboard will automatically apply your selected filters.
- Review Results:
- Check the panel above the dashboard to see updated sequence counts matching your filters.
- Each filtering feature on the dashboard is interactive and connected, so when a filter is applied in one feature, it is also reflected in the other features.
- The top summary section automatically updates to reflect the number of records in the NCBI RefSeq, Nucleotide, and Protein sets that match the combined search conditions.
Click on the Geographic and Time Distribution visuals to apply the filters.
- Return to Results Table:
- Click the "Advanced Filters for GenBank Sequences" tab or the "View the Results Table and download" button to view your filtered results in a tabular format.
- From here, you can apply additional filters or download sequences and metadata.
- Filters applied in the Visual Filters dashboard will persist in the Results Table view.
- You can navigate back to the Visual Filters dashboard from the Results Table through the "Visual Filters for GenBank Sequences" tab.
View the Results Table with applied Visual Filters by clicking either the "View the Results Table and download" button, the "Advanced Filters for GenBank Sequences" tab, or any of the record statistics links.
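The timeline placement rule for incomplete collection dates, described under the timeline sliders above, can be sketched as follows. The input format (`YYYY`, `YYYY-MM`, or `YYYY-MM-DD` strings) is an assumption about how dates appear in downloaded metadata.

```python
from datetime import date


def timeline_date(collection_date: str) -> date:
    """Place a possibly incomplete collection date on the timeline."""
    parts = [int(p) for p in collection_date.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1  # year only: January
    day = parts[2] if len(parts) > 2 else 1    # day missing: first of the month
    return date(year, month, day)


for value in ("2021", "2021-06", "2021-06-15"):
    print(value, "->", timeline_date(value).isoformat())
```

Because incomplete dates are pinned to the start of the year or month, weekly and monthly views may show a spike at the first interval of a period.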
Accessing SARS-CoV-2 Data
SARS-CoV-2 Data Hub
NCBI Virus provides a dedicated SARS-CoV-2 data hub to easily access sequences and metadata for this virus.
- Access it through:
- the home page “Search by Virus” shortcuts,
- the popular searches panel located above the Results Table,
- the navigation menu: "Find Data" dropdown list,
- the announcement banner located on the NCBI Virus home page.
- Includes all sequences in GenBank (including SRA) and RefSeq.
- View nucleotide, protein or genome records.
- Filter by additional attributes like Surveillance Sampling and Pangolin lineage.
- Visualize prevalence and trends using the SARS-CoV-2 dashboard.
- Access the SARS-CoV-2 Variants Overview resource, an interactive dashboard for displaying aggregated and analyzed SARS-CoV-2 lineage and mutation data from both GenBank and SRA sources. Learn more about the SARS-CoV-2 Variants Overview in the resource's help documentation.
- View the SARS-CoV-2 Lineage Frequency and Location dashboard by clicking on "Lineage Frequency and Location of GenBank + SRA Data" tab.
- View the SARS-CoV-2 Mutation Data by clicking on "Search GenBank + SRA Data by Mutation" tab.
SARS-CoV-2 Data Hub.
Sequence records follow the standard NCBI Virus table format and can be refined, analyzed and downloaded as described in previous sections.
NCBI Virus also provides the following SARS-CoV-2 specific resources:
- Betacoronavirus BLAST
- CDC outbreak information
- SARS-CoV-2 articles in PubMed
- Data Sets command line
- SRA data
SARS-CoV-2 Variants Overview
Explore lineage geo-temporal and mutation data using the interactive SARS-CoV-2 Variants Overview dashboard.
- Access it through the announcement banner located on NCBI Virus home page.
Access SARS-CoV-2 Variants Overview Resource through the announcement banner.
- View the SARS-CoV-2 Lineage Frequency and Location dashboard by clicking on "Lineage Frequency and Location of GenBank + SRA Data" tab.
Access the SARS-CoV-2 Variants Overview Lineage Frequency and Location page.
- View the SARS-CoV-2 Mutation Data by clicking on "Search GenBank + SRA Data by Mutation" tab on the SARS-CoV-2 Data Hub page.
Access the SARS-CoV-2 Variants Overview Mutation Data page.
Learn more in the SARS-CoV-2 Variants Overview help center.
Citing NCBI Virus
To cite the NCBI Virus resource in your own work, please include this URL https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ in the website citation formatted according to your publisher's recommendations.
Example:
NCBI Virus [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [cited YYYY MM DD]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/.
For more information on citing other NCBI resources, please see How do I cite NCBI services and databases?
Other related publications by the NCBI Virus team can be found on Our Publications Page.
FAQs
- Why can't I find my sequence in NCBI Virus resource?
- Some viral sequences may be missing from NCBI Virus, and the number of viral sequences in GenBank may differ from the number in NCBI Virus, because sequences recently released in GenBank or another INSDC database have not yet been processed by NCBI Virus.
Submitting Data
Please refer to the Submit Sequences page for an overview of how to submit virus sequences.
Glossary
Isolate
The isolate or strain name provided by the submitter, parsed from the "/isolate" field of the GenBank record. It identifies the individual isolate from which the sequence was obtained, typically an alphanumeric sample ID. SARS-CoV-2 sequence isolate names are formatted according to the definitions of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV).
Host
In the NCBI Virus database, the term “Host” refers to the host organism from which a virus was isolated, as provided by the sequence submitter. This information is displayed in the Host column of the Results Table and can also be used as a filter to refine search results.
The host terms are parsed from the “/host” field in a sequence’s GenBank record. These parsed terms are then mapped to a standardized vocabulary created by curators who aggregated the various terms found in GenBank files. This mapping strategy also accounts for common misspellings. For example, “Accipter cooperii” is mapped to “Accipiter cooperii”.
If the isolation host is unknown (/host field of the GenBank record is empty), but the laboratory host is present (as indicated in the /lab_host field of the GenBank record), the laboratory host will be displayed in the Host column of the Results Table. If both the isolation host and laboratory host can be mapped, only the isolation host will be presented.
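The rule for choosing which host to display can be expressed compactly. This is a sketch of the logic described above, not NCBI code. The synonym table contains only the single misspelling example given in this glossary entry, and the lab host value in the usage example is hypothetical.

```python
from typing import Optional

# Illustrative mapping with the one example from the text; the real vocabulary is curated by NCBI.
HOST_SYNONYMS = {"Accipter cooperii": "Accipiter cooperii"}


def normalize_host(term: Optional[str]) -> Optional[str]:
    """Map a submitted host term to its standardized form, or None if empty."""
    if term is None or not term.strip():
        return None
    term = term.strip()
    return HOST_SYNONYMS.get(term, term)


def displayed_host(host: Optional[str], lab_host: Optional[str]) -> Optional[str]:
    """Prefer the isolation host; fall back to the laboratory host if it is missing."""
    return normalize_host(host) or normalize_host(lab_host)


print(displayed_host("Accipter cooperii", "Vero cells"))
print(displayed_host("", "Vero cells"))
```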
When using the Host filter, users can enter a host name or NCBI taxonomy ID (taxid) into the designated text box, and a list of suggested host terms will be displayed (limited to the top 10 names / taxids). Users can then choose the desired host term to filter the search results. The Host filter also allows users to filter for multiple hosts simultaneously by adding additional host terms. Additionally, users can select either human or non-human hosts using the corresponding checkboxes.
It is important to note that the Host information in NCBI Virus represents the specific host from which the viral sequence was isolated, based on the submitter-provided data. It does not necessarily encompass the entire range of hosts that the virus is known to infect. For a more comprehensive understanding of a virus’s host range, users should consult additional sources and scientific literature.
Tissue / Specimen / Source
In the NCBI Virus database, the term “Tissue / Specimen / Source” refers to the specific part of the host organism from which a virus was isolated, as provided by the sequence submitter. This information is displayed in the Tissue / Specimen / Source column of the Results Table and can also be used as a filter to refine search results.
The Tissue / Specimen / Source terms are parsed from the “/isolation_source” field in a sequence’s GenBank record. These parsed terms are then mapped to a standardized vocabulary created by curators who aggregated the various terms found in GenBank files. This mapping strategy also accounts for common misspellings and regional spelling differences. For example, “serum” and “plasma” are both mapped to the standardized term “blood”.
When using the Tissue / Specimen / Source filter, users can select one or more isolation source terms from the provided list. The filter allows for multiple selections, enabling users to refine their search results based on specific isolation sources of interest.
It is important to note that the Tissue / Specimen / Source information in NCBI Virus represents the specific part of the host organism from which the viral sequence was isolated, based on the submitter-provided data. The level of detail and specificity of the Tissue / Specimen / Source may vary depending on the information provided by the submitter.
For a more comprehensive understanding of the Tissue / Specimen / Source for a particular viral sequence, users should refer to the full GenBank record and any associated publications or metadata. In some cases, additional information about the Tissue / Specimen / Source may be available in the “/note” or other fields of the GenBank record.
Nucleotide Completeness
Nucleotide Completeness indicates whether a nucleotide sequence is complete or partial. Nucleotide sequences are considered complete if they were submitted as such to GenBank or other INSDC databases. In the majority of cases, the submitter is responsible for determining the completeness of the sequence.
For manual submissions through BankIt or tbl2asn, templates or questions are provided to the submitter to confirm the completeness of the sequence(s). In BankIt, for example, there is a prompt asking the submitter if their sequence(s) is a complete viral genome/segment. If the submitter confirms completeness, the sequence moves forward. However, GenBank curators may mark sequences as partial on a case-by-case basis, even if the submitter asserted completeness. Examples:
- The record has a partial coding region or other annotation (either at the ends or due to a series of Ns in the internal sequence).
- The sequence contains a string of more than 100 Ns.
- BLAST results show other sequences from the same virus that are longer.
For automatic submissions, such as for Influenza virus A, B, or C, Norovirus, and Dengue virus, completeness is automatically checked using the processes outlined in the VADR (Viral Annotation DefineR) system documentation and related publication.
NCBI Virus determines nucleotide completeness from the ASN.1 format of GenBank records. A record in NCBI Virus is marked as complete if it contains "complete genome" in the definition or if molinfo/completeness is set to "complete" in the ASN.1. To access ASN.1 data for a submission, select the ASN.1 format from the "Format" dropdown menu in the upper-left corner of the Entrez page of the record.
Click on the GenBank dropdown menu, and select "ASN.1" format from the list.
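The completeness rule described above can be sketched in Python as follows. This is a minimal illustration, not the NCBI Virus implementation: the function name is hypothetical, the two values are assumed to have already been extracted from the record as plain strings, and the case-insensitive, whole-word matching is an assumption.

```python
import re
from typing import Optional


def is_marked_complete(definition: str, molinfo_completeness: Optional[str]) -> bool:
    """Apply the two documented conditions to values taken from a record.

    definition: the record's definition text.
    molinfo_completeness: the molinfo/completeness value, or None if absent.
    """
    # Condition 1: the definition contains "complete genome".
    # The word boundary keeps "incomplete genome" from matching (assumption).
    if re.search(r"\bcomplete genome\b", definition, re.IGNORECASE):
        return True
    # Condition 2: molinfo/completeness is set to "complete".
    return (molinfo_completeness or "").strip().lower() == "complete"
```

Either condition alone is enough for the record to be treated as complete in this sketch, mirroring the "or" in the description.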
Tips:
- A "complete" nucleotide sequence does not necessarily mean that the 3' and 5' ends are complete or that there are no gaps in the sequence. It simply reflects what is currently considered complete based on the submission process, including any gaps.
- Methods for assessing the quality of virus sequences in terms of gaps and ambiguous characters vary depending on the case. For further questions regarding virus sequence GenBank submission and validations, please contact gb-admin@ncbi.nlm.nih.gov.
Genome Molecule Type
Genome Molecule Type is the molecule type provided by the International Committee on Taxonomy of Viruses (ICTV) in the Master Species List. RefSeqs that have an "Unknown" molecule type belong to taxonomic groups that are not yet recognized by ICTV.
Genotype
In the NCBI Virus database, the term "Genotype" refers to the genotype or subtype of a viral sequence, as provided by the sequence submitter. This information is displayed in the Genotype column of the Results Table and can also be used as a filter to refine search results.
The genotype information is parsed from the "/serotype" field of the GenBank record and is shown exactly as it was submitted by the sequence submitter. The presence, accuracy, and consistency of genotype data may vary between records, as it is not a mandatory field in the GenBank submission process.
Segment
In the NCBI Virus database, the term "Segment" refers to the segment name or number representing a genome segment of a segmented virus, as provided by the sequence submitter. This information is displayed in the Segment column of the Results Table and can also be used as a filter to refine search results.
The segment information is parsed from the "/segment" field of the GenBank record and is displayed exactly as it was submitted by the sequence submitter. No additional processing or normalization is performed on the segment data.
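To see where the Genotype and Segment values come from, the sketch below pulls a qualifier value out of GenBank flat-file text with a regular expression. It is an illustration only: the function name and the sample feature text are hypothetical, and the pattern assumes a simple quoted value on a single line.

```python
import re
from typing import Optional

# Hypothetical excerpt of a source feature, for illustration only.
SAMPLE_FEATURES = '''     source          1..2341
                     /organism="Influenza A virus"
                     /serotype="H3N2"
                     /segment="1"
'''


def first_qualifier_value(flatfile_text: str, name: str) -> Optional[str]:
    """Return the value of the first /name="..." qualifier, unchanged."""
    pattern = re.compile(r'^\s*/' + re.escape(name) + r'="([^"]*)"', re.MULTILINE)
    match = pattern.search(flatfile_text)
    return match.group(1) if match else None


segment = first_qualifier_value(SAMPLE_FEATURES, "segment")
genotype = first_qualifier_value(SAMPLE_FEATURES, "serotype")
```

The value is returned as written in the record, with no normalization, so "1", "segment 1" and "PB2" would all be reported as distinct values. A missing qualifier yields `None`, which matches the fact that these fields are not always present.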
Randomized Sequence Subset
A randomized subset of sequences (also referred to as 'downsampling') lets a user work with a smaller set of sequences selected at random from a larger dataset, as an approximation of the full dataset.
A smaller, representative sequence set could make downstream analysis faster and less computationally intensive, and still allow for interpretation of the larger collection. When downloading a randomized subset, the file name will include the date of download and the randomization seed used.
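The role of the randomization seed can be illustrated with a short Python sketch. This is not the algorithm NCBI Virus uses; the function name and the accession strings are hypothetical. It shows only that fixing a seed makes a random selection reproducible.

```python
import random
from typing import List, Sequence


def random_subset(accessions: Sequence[str], size: int, seed: int) -> List[str]:
    """Select `size` accessions at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draw
    size = min(size, len(accessions))  # never ask for more items than exist
    return rng.sample(list(accessions), size)


# Hypothetical accession identifiers, for illustration only.
all_accessions = [f"ACC{i:05d}" for i in range(1000)]
subset_a = random_subset(all_accessions, 50, seed=12345)
subset_b = random_subset(all_accessions, 50, seed=12345)
```

Here `subset_a` and `subset_b` are identical because they share a seed, which is why recording the seed alongside the download date is useful when documenting an analysis.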
Stratified Randomized Sequence Set
Randomized subsets of sequences can be stratified, meaning equally distributed across the categories of a selected field (also referred to as 'stratified downsampling'). This enables a user to work with a subset that contains equal numbers of sequences from each category of that field, as an approximation of the full dataset. The fields currently available for stratification are Country, Collection Year, Release Year and Host.
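A minimal Python sketch of stratified downsampling is shown below. It is an illustration under stated assumptions rather than the NCBI Virus implementation: the function name and record layout are hypothetical, and the sketch assumes that a category with fewer sequences than requested contributes all of its sequences.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_subset(
    records: List[Dict[str, str]], field: str, per_category: int, seed: int
) -> List[Dict[str, str]]:
    """Pick up to `per_category` records at random from each value of `field`."""
    rng = random.Random(seed)
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    subset: List[Dict[str, str]] = []
    for category in sorted(groups):  # fixed order keeps the result reproducible
        members = groups[category]
        # Assumption: small categories contribute everything they have.
        subset.extend(rng.sample(members, min(per_category, len(members))))
    return subset


# Hypothetical records, for illustration only.
records = (
    [{"accession": f"US{i}", "Country": "USA"} for i in range(5)]
    + [{"accession": f"KE{i}", "Country": "Kenya"} for i in range(3)]
    + [{"accession": "PE0", "Country": "Peru"}]
)
picked = stratified_subset(records, "Country", per_category=2, seed=7)
```

With two sequences requested per country, the example selects two from USA, two from Kenya and the single available sequence from Peru, so heavily sampled categories no longer dominate the subset.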