This page was last updated in July, 2024

NCBI Virus Help Documentation

Welcome to the NCBI Virus Help Documentation. Here you will find information on how to use the NCBI Virus resource to search, view, analyze and download viral sequence data, as well as background on our data model.

Table of Contents

  1. Introduction to NCBI Virus
  2. Recent Changes to NCBI Virus Interface
  3. Data Access
  4. Exploring Results
  5. Refining Results via Filters
  6. Downloading Data
  7. Using Visual Dashboards
  8. Accessing SARS-CoV-2 Data
  9. Citing NCBI Virus
  10. Submitting Data
  11. FAQs
  12. Glossary

Introduction to NCBI Virus

What is NCBI Virus?

NCBI Virus is an integrative, value-added resource designed to support retrieval, display and analysis of a curated collection of virus sequences and large sequence datasets. Our goal is to increase the usability of viral sequence data archived in GenBank and other NCBI repositories.

Key features of NCBI Virus include:

NCBI Virus is a community resource, and we welcome your feedback! Please use the Feedback button on the site to send comments and suggestions. Alternatively, you can contact us using this contact form.

Data Model, Type of Data and Dataflow

NCBI Virus uses manual and machine curation to validate viral sequence data from the International Nucleotide Sequence Database Collaboration (INSDC) and normalize sequence and sample attributes (metadata). This data is then made available through a custom search interface that supports selection of data based on a variety of properties.

Currently, the NCBI Virus database includes data from the following sequence groups:

GenBank records

GenBank records: all records submitted to GenBank and other INSDC databases, including sequences submitted through the Sequence Read Archive (SRA). Nucleotide and protein sequence accessions can be found on the Nucleotide and Protein tabs of NCBI Virus. If sequences were submitted through SRA, the corresponding SRA accessions are also provided.

Note: In September 2023, we removed Protein Data Bank (PDB) nucleotide records from NCBI Virus search results. PDB records are typically very short and accompany three-dimensional protein structures that are available in the NCBI Structure database. PDB records are still searchable through other resources such as RCSB PDB and the NCBI Nucleotide database.

RefSeq records

RefSeq records – reference sequence records from one or more complete genome sequences for each viral species. Where available, RefSeqs are created based on "exemplar" isolates for each recognized species identified by the International Committee on Taxonomy of Viruses (ICTV). The majority of RefSeqs are complete genomes, but because some ICTV "exemplars" are not complete sequences, there are also incomplete RefSeqs in the database. A separate RefSeq record is created for each segment in segmented viral genomes (read more about virus segmented groups below).

Accession numbers unique to RefSeq records are assigned to the nucleotide (NC_XXXXXX) and protein (NP_XXXXXX, YP_XXXXXX, or YP_XXXXXXXXX) sequences. Nucleotide RefSeqs can be found on the Nucleotide tab of the NCBI Virus Results table, and protein RefSeqs can be viewed on the Protein tab of the NCBI Virus Results table.
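
The accession prefixes listed above can be checked with a small helper. This is a minimal sketch, assuming an accession is one of the listed prefixes followed by digits and an optional version suffix (for example ".1"); the function name and the exact pattern are illustrative assumptions, not an NCBI definition.

```python
import re
from typing import Optional

# Prefixes taken from the text above; the labels are illustrative.
REFSEQ_PREFIXES = {
    "NC_": "nucleotide",
    "NP_": "protein",
    "YP_": "protein",
}

# Assumption: prefix, then digits, then an optional ".version" suffix.
_ACCESSION_RE = re.compile(r"^(NC_|NP_|YP_)\d+(\.\d+)?$")


def classify_refseq_accession(accession: str) -> Optional[str]:
    """Return 'nucleotide' or 'protein' for a RefSeq-style accession, else None."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES[match.group(1)]
```

For example, `classify_refseq_accession("NC_045425.1")` returns `"nucleotide"`, while an accession without one of these prefixes returns `None`.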

RefSeq sequences are specially labeled and located in the top rows of the Results Table by default settings.

Assembly records

NCBI Virus provides access to two types of genome assembly records: RefSeq genome assemblies and genome assemblies for segmented viruses. Both types of assembly records can be found in the NCBI Virus Assembly tab of the NCBI Virus Results Table.

RefSeq genome assemblies: These assemblies represent complete or nearly complete genomes for viruses with RefSeq records. They include the genomic sequence, annotations, and metadata. Users can explore the connections between nucleotide and protein records within a RefSeq genome assembly. For RefSeq genome assemblies, users can access additional information, such as the GC content, assembly method, and sequencing technology, by navigating to the corresponding Genome Assembly Datasets page from the NCBI Virus Assembly Details panel. On the Datasets page, users can download the assembly in various formats, including annotation features in GTF, GFF, and GBFF formats, as well as the entire package in a Zip file. The Datasets page also provides options for programmatic access to the RefSeq genome assemblies through the command line. Furthermore, users can access RefSeq genome assemblies through the NCBI FTP site.

Segmented virus genome assemblies: For viruses with segmented genomes, NCBI Virus groups together nucleotide sequence genome segments, derived from the same biological sample, based on identical user-submitted metadata fields. These grouped segments form a genome assembly for the segmented virus genome. These segmented virus genome groups (GCA accessions) are represented by assembly records with a Grouping method text "NCBI Virus Segmented Genome Grouping Pipeline" and can be accessed via the NCBI Virus Assembly tab or via Datasets. Read more about virus segmented groups below.

Virus Segment Grouping

Virus Segment Grouping refers to a custom process developed by NCBI Virus for segmented genomes, where nucleotide sequence genome segments that have been derived from the same biological sample are grouped together based on identical user-submitted metadata fields. Groups are then represented by assembly records (GCA accessions) with a Grouping method text "NCBI Virus Segmented Genome Grouping Pipeline". These assemblies can be accessed via the NCBI Virus Assembly tab of NCBI Virus or via Datasets.

The Virus Segment Grouping process runs periodically on all submitted nucleotide sequences. Results become available as soon as they are properly recorded and indexed. Submitters can support the creation of accurate groups by ensuring that the appropriate metadata fields match exactly among all the segments submitted from the same sample: Species, Collection Date, Collection Location, Isolate name, and Host name.

Currently, the grouping is restricted only to Alphainfluenzavirus influenzae, Betainfluenzavirus influenzae, and Gammainfluenzavirus influenzae. Only complete genome groupings are reported, with 8 segments required for Alpha or Beta, and 7 segments for Gamma. Completeness status of the individual nucleotide segments and segment names are provided by the submitter. These groups can be found by Grouping method of "NCBI Virus Segmented Genome Grouping Pipeline VG-AUTO-v1.0".
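
The grouping rule described above can be illustrated with a short sketch. It is not the NCBI pipeline itself: the record layout (a dict per segment with the keys shown) is a hypothetical assumption, and the sketch ignores the per-segment completeness status that the real process also takes from the submitter.

```python
from collections import defaultdict

# Metadata fields that must match exactly, as listed in the text above.
GROUP_FIELDS = ("species", "collection_date", "collection_location", "isolate", "host")

# Number of segments required for a complete group, per the text above.
REQUIRED_SEGMENTS = {
    "Alphainfluenzavirus influenzae": 8,
    "Betainfluenzavirus influenzae": 8,
    "Gammainfluenzavirus influenzae": 7,
}


def group_segments(records):
    """Group segment records by identical metadata and keep only complete groups."""
    groups = defaultdict(dict)
    for rec in records:
        key = tuple(rec[field] for field in GROUP_FIELDS)
        # Simplification: one accession per segment name; duplicates overwrite.
        groups[key][rec["segment"]] = rec["accession"]

    complete = {}
    for key, segments in groups.items():
        required = REQUIRED_SEGMENTS.get(key[0])  # key[0] is the species
        if required is not None and len(segments) == required:
            complete[key] = segments
    return complete
```

Because the key is built from exact string values, a single differing character in any of the five fields (for example, two spellings of the isolate name) places segments in different groups, which is why exact matching of submitted metadata matters.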

NCBI plans to expand the scope to gradually include other segmented viruses and also allow for partial or extended groupings (for mixed infections).

Metadata Parsing

NCBI Virus relies on the information provided by sequence submitters to NCBI GenBank (including SRA) and other INSDC databases. The metadata associated with each sequence record is parsed and standardized to facilitate efficient searching and filtering of the data. The following metadata fields are parsed:

For more information on how these metadata fields are parsed and standardized, please refer to the Refining results via Filters and Glossary.

Taxonomy Validation

NCBI Virus relies on the NCBI Taxonomy group to validate and standardize the taxonomic information associated with virus sequences. The Taxonomy group follows the International Committee on Taxonomy of Viruses (ICTV) classification system, with some nuances:

Recent Changes to NCBI Virus Interface

In July 2024, we made several updates to improve your experience with the NCBI Virus database interface. Here we outline the key changes and how they may affect your workflow.

Summary of Changes

  1. Introduction of Genome Groups for Influenza Viruses
  2. New NCBI Virus Assembly Tab
  3. Filter Updates and Renaming
  4. Column Additions and Renaming
  5. Improved Filter Descriptions

Detailed Changes

Genome Groups for Influenza Viruses

NCBI Virus Assembly Tab

Filter Updates

Old Filter Name             | New Filter Name
Virus                       | Virus/Taxonomy
Sequence Type               | GenBank/RefSeq
RefSeq Genome Completeness  | Assembly Completeness
Random Sampling             | Surveillance Sampling
Isolate                     | Isolate/Strain Name
Proteins                    | Has Protein
Provirus                    | Provirus/Integrated
Isolation Source            | Tissue/Specimen/Source
Lab Host                    | Lab Passaged
(New Filter)                | Segment
(New Filter)                | Genotype

Note: While some filters have been renamed, their functionality remains the same. Read more about filter functionality in the “Refining Results via Filters” section of this help document.

Improved Filter Descriptions

When you open a filter by clicking on it, you'll now find an improved description that defines the filter and explains how to use it.

Column Updates

Old Name          | New Name
Sequence Type     | GenBank/RefSeq
Isolation Source  | Tissue/Specimen/Source
(New Column)      | Assembly
(New Column)      | Segment

Find more about columns in the “Results Table Columns” section of this help document.

Additional Interface Updates

We value your feedback and are continually working to improve NCBI Virus. If you have any questions or suggestions, please don't hesitate to contact us.

Data Access

NCBI Virus provides several ways to access viral sequence data.

Search by Sequence

Use the BLAST™ tool to find virus nucleotide or protein sequences similar to your query sequence.

Click on the "Search by sequence" button, and enter a sequence in one of these formats: plain text, FASTA, or NCBI sequence accession.

Search by Sequence

Tips:

Search by Virus Name or Taxonomy Group

Find nucleotide or protein sequences by virus name or NCBI Taxonomy database identifier.

Click on the "Search by Virus" button and start typing a virus/viroid name or NCBI taxid, then select from the dropdown menu.

Search by Virus

Tips:

View Pre-configured Data Sets

Quickly access sequence data for commonly accessed taxonomic and functional groupings via the “Find Data” menu or “Search by Virus” quick links on the home page.

Select any of the popular searches to view the Results Table for the selected search.

Search by Virus Quick Links

Alternatively, select any of the popular searches from the "Find Data" dropdown list.

Results are presented in a table view that can be further refined.

The NCBI Virus Results Table page has a Popular Searches panel above the Results Table. This panel provides links to the Results Table for the following virus groups:

Click on any link in the 'Popular searches' panel to view the updated Results Table.

Exploring Search Results

Results Table Features

After searching, your results will appear in a table.

From here one can:

Results Table Columns

The columns available in the results table may vary depending on the search type you used. Not all columns are displayed by default, but one can customize the visible columns using the "Select columns" menu.

The following columns are available in the results table for both "Search by Sequence" and "Search by Virus" options:

Additional columns specific only to the "Search by Sequence" option:

Each column can be sorted in ascending or descending order by clicking on the column header. Clicking on an accession number will display more detailed information about the sequence record.

Results Table Tabs

The NCBI Virus results table provides separate tabs for accessing virus sequence records in different contexts:

Nucleotide Tab

The Nucleotide tab displays both GenBank and RefSeq nucleotide sequence records for viruses.

These records represent the genetic material of viruses and may include complete genomes, partial sequences, or specific genomic regions. Users can filter and sort the nucleotide records based on various criteria, such as virus species, host, geographic location, and sequence length.

Users can also see the connection of nucleotide accessions to the NCBI Virus Genome assembly accessions through the Assembly column.

Protein Tab

The Protein tab displays both GenBank and RefSeq protein sequence records for viruses.

These records represent the amino acid sequences of viral proteins, which are encoded by the viral genome. Users can filter and sort the protein records based on criteria similar to those available for nucleotide records, as well as additional protein-specific fields, such as protein name and function.

It is important to note that the Nucleotide and Protein tabs are not directly connected. When switching between these tabs, the records displayed may not have a one-to-one correspondence with the records from the previous tab. The order and organization of records in the Protein tab will be different from those in the Nucleotide tab, as they represent distinct types of sequences with their own sets of metadata and annotations. However, users can see the connection with nucleotide records through the Nucleotide column, which contains nucleotide accessions corresponding to the protein accessions in the table.

Users can also see the connection of protein accessions to the NCBI Virus Genome assembly accessions through the Assembly column.

NCBI Virus Assembly Tab

The NCBI Virus Assembly tab provides access to genome assembly records for viruses, including RefSeq genome assemblies and genome assemblies for segmented viruses.

These assembly records represent complete or nearly complete viral genomes and provide additional information and annotations beyond individual nucleotide or protein sequences.

For more information on the types of sequence records and assembly records available in NCBI Virus, please refer to the Data Model, Type of Data and Dataflow section.

Multiple Sequence Alignment (Accessed via "Align" button)

Generate a multiple sequence alignment from selected search results:

Multiple sequence alignments are calculated using the MUSCLE (Multiple Sequence Comparison by Log-Expectation) algorithm.

Important:

This alignment tool is designed for quick visualization and preliminary analysis. For publication-quality alignments, we recommend using dedicated alignment software and manually reviewing the results.

Read more about how to use the alignment viewer in the NCBI Multiple Sequence Alignment Viewer documentation.

Select the desired sequences and click the "Align" button to build a Multiple Sequence Alignment.

Multiple Sequence Alignment

BLAST Similarity-Score Based Distance Tree (Accessed via "Build Phylogenetic Tree" button)

Generate a quick distance tree based on BLAST similarity scores from selected search results:

Select the desired sequences and click the "Build Phylogenetic Tree" button to build a BLAST Similarity-Score Based Distance Tree.

BLAST Similarity-Score Based Distance Tree

Limitations:

Important

This tree is not a true phylogenetic tree but a BLAST similarity-score based distance tree. It is generated using BLAST comparisons, with BLAST scores used as tree distance parameters. The NCBI Tree Viewer displays this data without applying additional phylogenetic algorithms.

For more information about the Tree Viewer and how to use it, please refer to the NCBI Tree Viewer documentation.

Refining Results via Filters

The left sidebar of the NCBI Virus results table page provides various filters to apply to further refine your results.

Use any combination of these search filters to apply to the records. When multiple filters are used, they will be connected with the AND logical operator to include sequences that match all the provided criteria. Filters are applied and highlighted dynamically as you select them.

Active filters appear above the results table. Remove a filter by clicking the “X” next to it.

Clear all filters using the “Reset All” button.
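
The AND logic used to combine filters can be sketched as follows. This is only an illustration of the behavior, not the site's implementation: the record layout, the thresholds, and the helper names are hypothetical, and the two example filters loosely mirror the Sequence Length and Ambiguous Characters filters described in this section.

```python
def count_ambiguous(sequence: str, is_protein: bool = False) -> int:
    """Count N's (nucleotide) or X's (protein), ignoring case."""
    symbol = "X" if is_protein else "N"
    return sequence.upper().count(symbol)


def matches_all(record: dict, filters: list) -> bool:
    """AND logic: a record is kept only if every filter accepts it."""
    return all(f(record) for f in filters)


# Hypothetical filters: a length range and a maximum number of N's.
filters = [
    lambda r: 10 <= len(r["sequence"]) <= 30000,
    lambda r: count_ambiguous(r["sequence"]) <= 2,
]

records = [
    {"accession": "A1", "sequence": "ACGTACGTACGTNN"},    # passes both filters
    {"accession": "A2", "sequence": "ACGTNNNNACGTACGT"},  # too many N's
    {"accession": "A3", "sequence": "ACGT"},              # too short
]

kept = [r["accession"] for r in records if matches_all(r, filters)]
print(kept)
```

Adding another filter to the list can only narrow the result set, which matches the behavior described above: each record must satisfy every active filter.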

Virus/Taxonomy

Virus or viroid name, taxonomy group, synonyms or taxids.

Accession

NCBI Accession number(s) from GenBank, BioProject, BioSample, SRA and/or Assembly accession.

Sequence Length

Min and/or Max sequence length. The range is applied to nucleotide sequences independently from protein sequences.

Ambiguous Characters

Maximum number of ambiguous characters (N’s in nucleotide or X’s in protein) allowed in each sequence.

GenBank/RefSeq

GenBank and/or RefSeq. GenBank and RefSeq sequence records are mutually exclusive.

Assembly Completeness

Complete and/or partial NCBI Virus assembly records.

Nucleotide Completeness

Complete and/or partial nucleotide records. Nucleotide sequences are considered complete if they were submitted as such to GenBank or other INSDC databases.

Pango Lineage

Name of SARS-CoV-2 Pango lineage assigned to sequence record using Pangolin with UShER.

Surveillance Sampling

SARS-CoV-2 sequences collected randomly in the population, for the purpose of baseline surveillance - not including samples collected for vaccine breakthrough or localized outbreak investigations.

Isolate/Strain Name

Isolate or strain name from the "/isolate" field of GenBank record.

Has Proteins

Sequence records which contain given protein name(s).

Provirus/Integrated

Provirus is a sequence obtained from a virus, or a phage, that is integrated into the genome of another host organism, as indicated by the sequence submitter.

Geographic Region

Global geographic areas from which the sequences were collected. When a region is chosen, all countries within that region will be automatically selected as well.

Host

Submitter-provided host organism, not the known host range of a virus. Start typing a name or taxid to select from a list of top 10 suggestions. Optionally, select Human or non-Human.

Submitters

Submitter/Author names, affiliated institution or affiliation country/location of persons who submitted the sequences. Must use quotes to search with a phrase. Case-sensitive.

You can use any combination of these search windows to filter the sequences. When multiple search windows are used, the filter applies an AND logic, meaning that the results will include sequences that match all the provided criteria.

Tissue/Specimen/Source

Part of the host organism, where the sample was obtained.

Collection Date

Filter sequences by the sample collection date range.

Release Date

Date range when samples were released through GenBank.

Genome Molecule Type

Type of viral nucleic acid, as provided by ICTV.

Environmental Source

Environmental source of the samples when sample is not associated with a host organism.

Lab Passaged

Indicator that sequences came from samples from a laboratory setting, and not from a wild host.

Vaccine Strain

Strain used for generating vaccines.

Segment

Name or number representing a genome segment. Note: segment name is provided by submitter.

Genotype

Genotype (subtype) of a viral sequence.

Downloading Data

Downloading Sequences from NCBI Virus

Step 1: Select Data Type

  1. Click the Download button located on the upper left side of the NCBI Virus Results Table page.

  2. Choose the type of data you want to download:
    • Nucleotide, Protein, or Coding Region Sequence (CDS) in FASTA format (Note: Randomized subsets are not available for CDS FASTA files).
    • Accession List for nucleotide, protein, or assembly records (Note: Randomized subsets are not available for CDS accession lists).
    • Results Table contents in CSV (Comma Separated Values) or XML format, which includes metadata.

Selected data type: FASTA Format (Nucleotide, Protein, or CDS)

Download results in FASTA format, Step 1: Select Data Type.

Step 2: Select Records

Step 3: Customize Sequence Titles (Optional)

Customize the FASTA definition line:

>AAO17794 | VP4 spike protein [Human rotavirus A]

>NC_045425.1:319..1659 | replication endonuclease [Thermus phage phiOH3]
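
Downstream scripts often need to split such definition lines back into fields. The following is a minimal sketch that assumes the layout of the two examples above (accession, optional coordinate range, a " | " separator, a title, and an organism in square brackets). A custom defline built with different fields will not match, and the function name is illustrative.

```python
import re

# Assumed layout, inferred from the two example deflines above:
#   >ACCESSION[:START..END] | TITLE [ORGANISM]
_DEFLINE_RE = re.compile(
    r"^>(?P<accession>[^\s:|]+)"
    r"(?::(?P<start>\d+)\.\.(?P<end>\d+))?"
    r"\s*\|\s*(?P<title>.*?)\s*"
    r"\[(?P<organism>[^\]]+)\]\s*$"
)


def parse_defline(line: str) -> dict:
    """Split a defline of the assumed layout into its fields."""
    match = _DEFLINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized defline: {line!r}")
    fields = match.groupdict()
    # Coordinates are present only for coding-region style deflines.
    for key in ("start", "end"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields
```

Applied to the second example above, the sketch yields the accession `NC_045425.1`, the coordinates 319 and 1659, the title `replication endonuclease`, and the organism `Thermus phage phiOH3`.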

Download Results in FASTA format, Step 3: Build a custom FASTA defline.

Selected data type: Accession List

Download Accession List, Step 1: Select an accession type.

Step 2: Select Records

Selected data type: Results Table (CSV or XML)

Download Results Table, Step 1: Select a format for the Results Table file.

Step 2: Select Records

Additional Information

Filters and Randomization
For large datasets:

Consider downloading data in smaller batches, especially when dealing with long sequences or including associated metadata.

If you are experiencing difficulties with very large downloads, you may want to explore alternative methods such as FTP access or NCBI Datasets, which are optimized for bulk retrieval.
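
One simple way to work in smaller batches is to split a previously downloaded accession list into fixed-size chunks and process each chunk separately. The sketch below shows only the chunking step; the batch size and the function name are arbitrary assumptions, and how each batch is then retrieved is left open.

```python
from typing import Iterator, List


def batched(accessions: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size accessions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(accessions), batch_size):
        yield accessions[start:start + batch_size]
```

For instance, `list(batched(["A", "B", "C", "D", "E"], 2))` produces three batches, the last one holding the single remaining accession.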

Disclaimers

  • Our current platform does not support repeatable randomized searches. We understand the importance of repeatability in the scientific community and are working to include this feature in future updates.
  • Downloading randomized subsets is currently available for nucleotide, protein, and assembly records. We are working to make them available for coding region records in the future.

Alternative methods for downloading virus sequences

While the NCBI Virus user interface is the primary access point for searching and downloading virus sequences, there are also alternative options for programmatic access and bulk download of the data.

NCBI FTP Site

Virus data are a part of the NCBI FTP site, which provides access to a wide range of sequence data.

To access virus nucleotide sequences and/or associated metadata, navigate to the Viruses directory on the NCBI FTP site https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/ to find:

To access only RefSeq viral genomes go to https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/ to find:

NCBI Datasets

NCBI Datasets provides separate tools for downloading genome sequence data and metadata.

For tutorials on programmatic access and command-line tools available in the NCBI Datasets resource, visit https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/

Using Visual Dashboards

NCBI Virus includes two interactive visual dashboards for exploring viral sequence data: Home Page Dashboard and Visual Filters Dashboard.

NCBI Virus Home Page Dashboard

The home page dashboard provides several ways to explore the available data:

Home Page Dashboard: To view the results selected on this dashboard in a tabular format, click any of the statistics buttons.

Interact with the dashboard:

  1. Click a statistic button to view matching records in the Results Table
    • Clicking on each button will show a results table with the corresponding sequences.
    • Results can be further refined by using filters for various sequence attributes (metadata) located on the left side of the Results Table page (learn more here).
  2. Explore virus taxonomy using the sunburst chart.
    • The default view represents the classification for all available NCBI virus and viroid taxa.
    • The inner layer (ring) represents four non-taxonomic groups of viruses: RNA viruses, DNA viruses, and Unclassified viruses.
    • Only 4 levels of the whole hierarchy are visible on the plot at a given time.
    • Click on any slice (section) of any layer to zoom into the selected taxa and display additional subtaxa.
    • Hover over a slice to view the taxon name and breadcrumbs.
    • Breadcrumbs above the chart show the location of the taxa in the hierarchy; clicking a breadcrumb will refocus the plot on the selected taxa.
    • Clicking the center of the chart will return to the parent taxon.
  3. Select a host species from the bar chart
    • Each bar is proportional to the number of virus sequences isolated from that host.
    • Click a bar or host name to highlight the selected host and associated taxa in the sunburst chart.
    • Only one host can be selected at a time.
    • Click the selected host again or use the “Reset” button to deselect.
    • Use the scroll bar or “CTRL+F” to search for a specific host.
  4. After selecting a host or taxon, the statistics buttons in the top row will update.
  5. Click a highlighted taxon in the sunburst chart to focus on taxa containing sequences from the selected host.
    • The lower layers will highlight taxa with sequences from the selected host.
    • Not all taxa will be highlighted if they do not include sequences from the selected host.

Visual Filters Dashboard

The "Visual Filters for GenBank Sequences" dashboard allows you to interactively filter search results by collection location and date.

Important:

The Visual Filters Dashboard is specifically designed to provide detailed insights for individual virus taxa or the entire virus database. It is not available if multiple virus taxids are selected in the Virus/Taxonomy filter on the Results Table page. If you need to analyze data from multiple specific taxa, consider examining each taxon individually using the dashboard.

To access the Visual Filters Dashboard:

  1. Search for a virus from the home page or Results Table (or leave all viruses selected by default).
  2. Optional: modify the Results Table using other filters.
  3. Click the “Visual Filters for GenBank Sequences” tab above the Results Table
  4. Alternatively, append the virus taxonomy ID to the URL: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/dashboard?taxid=<taxid>
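
The URL pattern in step 4 can be filled in programmatically. This is a small sketch that only performs the string substitution shown above; the function name is illustrative, and no check is made that the taxid exists in the NCBI Taxonomy database.

```python
# Base URL copied from step 4 above.
DASHBOARD_URL = "https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/dashboard"


def visual_filters_url(taxid: int) -> str:
    """Build the Visual Filters Dashboard URL for a single taxonomy ID."""
    if taxid <= 0:
        raise ValueError("taxid must be a positive integer")
    return f"{DASHBOARD_URL}?taxid={taxid}"
```

Because the dashboard works with a single taxon, the helper takes exactly one taxid.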

Click on the "Visual Filters for GenBank Sequences" tab to access the Visual Filters Dashboard.

Important note on switching between Results Table and Visual Filters Dashboard:

When transitioning from the Results Table to the Visual Filters Dashboard, please be aware of the following:

  1. Preserved Filters:
    • Virus/Taxonomy (only for single taxon selections)
    • Collection Date
    • Release Date
    • Geographic Region (partially - see details below)
  2. Geographic Region Filter Behavior:
    • Country and US state selections are preserved
    • Continent selections are reset (not available in Visual Filters Dashboard)
  3. All other filters are reset when switching to the Visual Filters Dashboard
  4. When switching back to the Results Table from the Visual Filters Dashboard:
    • Filters applied in the Visual Filters Dashboard (including country and US state selections) will be reflected in the Results Table.
    • Previously applied filters that were reset will not be automatically restored. You will need to reapply these filters if needed.

Filter behavior when switching between Results Table and Visual Filters Dashboard:

Filter Type                     | Behavior When Switching to Visual Filters Dashboard
Virus/Taxonomy                  | Preserved (single taxon only)
Collection Date                 | Preserved
Release Date                    | Preserved
Geographic Region - Countries   | Preserved
Geographic Region - US States   | Preserved
Geographic Region - Continents  | Reset (not available in Visual Filters Dashboard)
All Other Filters               | Reset
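
The preservation rules in the table above can be expressed as a small function. This is only a sketch of the documented behavior: the way active filters are represented here (a dict keyed by filter name, with locations tagged by a "level") is a hypothetical assumption.

```python
# Filters kept unchanged when switching to the dashboard, per the table above.
PRESERVED = {"Collection Date", "Release Date"}


def filters_after_switch(active: dict) -> dict:
    """Return the filters that survive a switch to the Visual Filters Dashboard."""
    kept = {}
    for name, value in active.items():
        if name in PRESERVED:
            kept[name] = value
        elif name == "Virus/Taxonomy":
            # Preserved only for a single-taxon selection.
            if len(value) == 1:
                kept[name] = value
        elif name == "Geographic Region":
            # Countries and US states survive; continent selections are reset.
            locations = [loc for loc in value if loc["level"] in ("country", "us_state")]
            if locations:
                kept[name] = locations
        # All other filters are reset and therefore dropped.
    return kept
```

Comparing the input and output dicts shows which filters would need to be reapplied after returning to the Results Table.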

To filter results using Visual Filters Dashboard:

  1. Explore the Geographic Distribution:

    • View the world map displaying the distribution of sequences based on their collection locations.
    • Note: color shades represent nucleotide record numbers; darker shades indicate higher numbers.
    • Click on countries or states to select them.
    • Use the International/USA toggle to switch between world and US views.
    • Select multiple locations by clicking on them.
    • To choose a single region, type its name and select from the dropdown.
    • Remember: changing between International and USA views will reset your selections.
  2. Use Timeline Sliders for Date Filtering:

    • Adjust the Collection Time slider (earliest to current collection year).
    • Adjust the Release Time slider (first release to current year).
    • Drag slider handles or date bars on the chart to set specific ranges.
    • Select weekly, monthly, or yearly intervals.
    • Use bi-yearly intervals from the dropdown selector to zoom into specific time periods.
    • Note: For incomplete collection dates, records are shown as follows:
    • Year only: displayed as January 1 of that year.
    • Year and month only: displayed as the first day of that month.
  3. Apply Filters:

    • Click a bar on the timeline or select a time interval with the sliders.
    • The dashboard will automatically apply your selected filters.
  4. Review Results:

  5. Return to Results Table:


    • Click the "Advanced Filters for GenBank Sequences" tab, the "View the Results Table and download" button, or any of the record statistics links to view your filtered results in a tabular format.
      • From here, you can apply additional filters or download sequences and metadata.
      • Filters applied in the Visual Filters Dashboard will persist in the Results Table view.
      • You can navigate back to the Visual Filters Dashboard from the Results Table through the "Visual Filters for GenBank Sequences" tab.
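
The note on incomplete collection dates in step 2 above (year only shown as January 1, year and month shown as the first day of the month) can be sketched as follows. The input format ("YYYY", "YYYY-MM", or "YYYY-MM-DD") and the function name are assumptions made for illustration.

```python
from datetime import date


def display_date(collection_date: str) -> date:
    """Map a full or partial collection date to the date used for display."""
    parts = [int(p) for p in collection_date.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unexpected date format: {collection_date!r}")
    year = parts[0]
    month = parts[1] if len(parts) >= 2 else 1  # year only: January
    day = parts[2] if len(parts) == 3 else 1    # no day given: first of the month
    return date(year, month, day)
```

A consequence of this convention is that timeline bars for January 1 and for the first day of each month can include records whose true collection day is unknown.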


Accessing SARS-CoV-2 Data

SARS-CoV-2 Data Hub

NCBI Virus provides a dedicated SARS-CoV-2 data hub to easily access sequences and metadata for this virus.

SARS-CoV-2 Data Hub

Sequence records follow the standard NCBI Virus table format and can be refined, analyzed and downloaded as described in previous sections.

NCBI Virus also provides the following SARS-CoV-2 specific resources:

SARS-CoV-2 Variants Overview

Explore lineage geo-temporal and mutation data using the interactive SARS-CoV-2 Variants Overview dashboard.

Learn more in the SARS-CoV-2 Variants Overview help center.

Citing NCBI Virus

To cite the NCBI Virus resource in your own work, please include this URL https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ in the website citation formatted according to your publisher's recommendations.

Example:

NCBI Virus [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [cited YYYY MM DD]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/.

For more information on citing other NCBI resources, please see How do I cite NCBI services and databases?

Related publications by the NCBI Virus team can be found on Our Publications Page.

FAQs

Submitting Data

Please refer to the Submit Sequences page for an overview of how to submit virus sequences.

Glossary

Isolate

The isolate or strain name provided by the submitter. Parsed from the "/isolate" field of GenBank record. Shows the individual isolate from which the sequence was obtained, typically an alphanumeric sample ID. SARS-CoV-2 sequence isolate name is formatted according to the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV) definitions.

Host

In the NCBI Virus database, the term “Host” refers to the host organism from which a virus was isolated, as provided by the sequence submitter. This information is displayed in the Host column of the Results Table and can also be used as a filter to refine search results.

The host terms are parsed from the “/host” field in a sequence’s GenBank record. These parsed terms are then mapped to a standardized vocabulary created by curators who aggregated the various terms found in GenBank files. This mapping strategy also accounts for common misspellings. For example, “Accipter cooperii” is mapped to “Accipiter cooperii”.

If the isolation host is unknown (/host field of the GenBank record is empty), but the laboratory host is present (as indicated in the /lab_host field of the GenBank record), the laboratory host will be displayed in the Host column of the Results Table. If both the isolation host and laboratory host can be mapped, only the isolation host will be presented.
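The host mapping and fallback rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the NCBI Virus implementation: the vocabulary dictionary and the function names `map_host` and `displayed_host` are hypothetical, and only the "Accipter cooperii" entry comes from this documentation. The sketch also assumes that an isolation host that cannot be mapped is treated like an empty one.

```python
from typing import Optional

# Hypothetical excerpt of a curated vocabulary: raw submitter term -> standardized term.
# Only the "Accipter cooperii" misspelling is taken from the documentation.
HOST_VOCABULARY = {
    "accipiter cooperii": "Accipiter cooperii",
    "accipter cooperii": "Accipiter cooperii",  # common misspelling
    "homo sapiens": "Homo sapiens",
}


def map_host(raw: Optional[str]) -> Optional[str]:
    """Return the standardized host term, or None if the term is empty or unmapped."""
    if not raw or not raw.strip():
        return None
    return HOST_VOCABULARY.get(raw.strip().lower())


def displayed_host(host: Optional[str], lab_host: Optional[str]) -> Optional[str]:
    """Prefer the mapped isolation host (/host); otherwise use the mapped /lab_host."""
    return map_host(host) or map_host(lab_host)


print(displayed_host("Accipter cooperii", None))
print(displayed_host("", "Homo sapiens"))
```

In this sketch, a record with both fields populated shows only the isolation host, which matches the rule stated above.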

When using the Host filter, users can enter a host name or NCBI taxonomy ID (taxid) into the designated text box, and a list of suggested host terms will be displayed (limited to the top 10 names / taxids). Users can then choose the desired host term to filter the search results. The Host filter also allows users to filter for multiple hosts simultaneously by adding additional host terms. Additionally, users can select either human or non-human hosts using the corresponding checkboxes.

It is important to note that the Host information in NCBI Virus represents the specific host from which the viral sequence was isolated, based on the submitter-provided data. It does not necessarily encompass the entire range of hosts that the virus is known to infect. For a more comprehensive understanding of a virus’s host range, users should consult additional sources and scientific literature.

Tissue / Specimen / Source

In the NCBI Virus database, the term “Tissue / Specimen / Source” refers to the specific part of the host organism from which a virus was isolated, as provided by the sequence submitter. This information is displayed in the Tissue / Specimen / Source column of the Results Table and can also be used as a filter to refine search results.

The Tissue / Specimen / Source terms are parsed from the “/isolation_source” field in a sequence’s GenBank record. These parsed terms are then mapped to a standardized vocabulary created by curators who aggregated the various terms found in GenBank files. This mapping strategy accounts for common misspellings and regional spelling differences, and it also groups closely related terms under a single standardized term. For example, “serum” and “plasma” are both mapped to the standardized term “blood”.

When using the Tissue / Specimen / Source filter, users can select one or more isolation source terms from the provided list. The filter allows for multiple selections, enabling users to refine their search results based on specific isolation sources of interest.

It is important to note that the Tissue / Specimen / Source information in NCBI Virus represents the specific part of the host organism from which the viral sequence was isolated, based on the submitter-provided data. The level of detail and specificity of the Tissue / Specimen / Source may vary depending on the information provided by the submitter.

For a more comprehensive understanding of the Tissue / Specimen / Source for a particular viral sequence, users should refer to the full GenBank record and any associated publications or metadata. In some cases, additional information about the Tissue / Specimen / Source may be available in the “/note” or other fields of the GenBank record.

Nucleotide Completeness

Nucleotide Completeness indicates whether a nucleotide sequence is complete or partial. Nucleotide sequences are considered complete if they were submitted as such to GenBank or other INSDC databases. In the majority of cases, the submitter is responsible for determining the completeness of the sequence.

For manual submissions through BankIt or tbl2asn, templates or questions are provided to the submitter to confirm the completeness of the sequence(s). In BankIt, for example, a prompt asks the submitter whether each sequence is a complete viral genome or segment. If the submitter confirms completeness, the sequence moves forward. However, GenBank curators may mark sequences as partial on a case-by-case basis, even if the submitter asserted completeness. Examples:

For automatic submissions, such as for Influenza virus A, B, or C, Norovirus, and Dengue virus, completeness is automatically checked using the processes outlined in the VADR (Viral Annotation DefineR) system documentation and related publication.

NCBI Virus determines nucleotide completeness from the ASN.1 format of GenBank records. A record in NCBI Virus is marked as complete if it contains "complete genome" in the definition or if molinfo/completeness is set to "complete" in the ASN.1. To access the ASN.1 data for a record, select the ASN.1 format from the "Format" dropdown menu in the upper-left corner of the record's Entrez page.

Click on the GenBank dropdown menu, and select "ASN.1" format from the list.

How to view a record in ASN.1 format
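The completeness rule stated above (the phrase "complete genome" in the definition, or molinfo/completeness set to "complete") can be expressed as a small predicate. The following is a minimal sketch under stated assumptions, not the code NCBI Virus runs: the function name `is_marked_complete` is hypothetical, the two values are assumed to have already been extracted from the ASN.1 record, and the case-insensitive comparison is an assumption.

```python
from typing import Optional


def is_marked_complete(definition: str, molinfo_completeness: Optional[str]) -> bool:
    """Apply the two-part rule: definition phrase OR molinfo completeness value."""
    # Assumption: matching is case-insensitive; the documentation does not specify this.
    has_phrase = "complete genome" in definition.lower()
    has_flag = (molinfo_completeness or "").strip().lower() == "complete"
    return has_phrase or has_flag


print(is_marked_complete("Example virus isolate X1, complete genome", None))
print(is_marked_complete("Example virus isolate X2 polymerase gene, partial cds", "partial"))
```

Note that a plain substring test like this one is deliberately naive and is shown only to illustrate the either/or structure of the rule.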

Tips:

Genome Molecule Type

The molecule type provided by the International Committee on Taxonomy of Viruses (ICTV) in the Master Species List. RefSeqs with an "Unknown" molecule type belong to taxonomic groups that have not yet been recognized by ICTV.

Genotype

In the NCBI Virus database, the term "Genotype" refers to the genotype or subtype of a viral sequence, as provided by the sequence submitter. This information is displayed in the Genotype column of the Results Table and can also be used as a filter to refine search results.

The genotype information is parsed from the "/serotype" field of the GenBank record and is shown exactly as it was submitted by the sequence submitter. The presence, accuracy, and consistency of genotype data may vary between records, as it is not a mandatory field in the GenBank submission process.

Segment

In the NCBI Virus database, the term "Segment" refers to the segment name or number representing a genome segment of a segmented virus, as provided by the sequence submitter. This information is displayed in the Segment column of the Results Table and can also be used as a filter to refine search results.

The segment information is parsed from the "/segment" field of the GenBank record and is displayed exactly as it was submitted by the user. No additional processing or normalization is performed on the segment data.

Randomized Sequence Subset

A randomized subset of sequences (also referred to as 'downsampling') lets a user work with a smaller set of sequences selected at random from a larger dataset, as an approximation of the full dataset.

A smaller, representative sequence set could make downstream analysis faster and less computationally intensive, and still allow for interpretation of the larger collection. When downloading a randomized subset, the file name will include the date of download and the randomization seed used.
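To illustrate why recording the randomization seed matters, the sketch below draws a reproducible random subset with Python's standard library. It is an assumption-laden illustration, not the method used by NCBI Virus: the function names and the file name pattern are hypothetical, and the documentation does not specify the actual file name format.

```python
import random
from datetime import date
from typing import List, Sequence


def random_subset(accessions: Sequence[str], size: int, seed: int) -> List[str]:
    """Draw `size` accessions at random without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # dedicated generator, so the global random state is untouched
    return rng.sample(list(accessions), min(size, len(accessions)))


def subset_file_name(seed: int, day: date) -> str:
    """Hypothetical naming pattern that records the download date and the seed."""
    return f"sequences_{day.isoformat()}_seed{seed}.fasta"


accessions = [f"ACC{i:05d}" for i in range(1000)]  # made-up identifiers
subset = random_subset(accessions, 50, seed=42)
print(len(subset), subset_file_name(42, date.today()))
```

Running `random_subset` again with the same input and seed returns the same subset, which is what makes a seed recorded in the file name useful for reproducing an analysis.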

Stratified Randomized Sequence Set

Randomized subsets of sequences can be stratified, meaning that sequences are drawn in equal numbers from each category of a selected field (also referred to as 'stratified downsampling'). This lets a user approximate a larger sequence collection with a smaller subset in which every category is equally represented. The fields currently available for stratification are Country, Collection Year, Release Year and Host.
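Stratified downsampling can be sketched as "group by the chosen field, then sample the same number from each group". The code below is a minimal illustration under assumptions, not the NCBI Virus implementation: the function name `stratified_subset`, the record layout, and the choice to take every record from a category that has fewer than the requested number are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_subset(
    records: List[Dict[str, str]], field: str, per_category: int, seed: int
) -> List[Dict[str, str]]:
    """Sample up to `per_category` records from each distinct value of `field`."""
    rng = random.Random(seed)
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for record in records:
        groups[record.get(field, "")].append(record)

    subset: List[Dict[str, str]] = []
    for category in sorted(groups):  # fixed order keeps the result reproducible
        members = groups[category]
        # Assumption: small categories contribute all of their records.
        subset.extend(rng.sample(members, min(per_category, len(members))))
    return subset


# Made-up records for illustration only.
records = (
    [{"Accession": f"A{i}", "Country": "USA"} for i in range(5)]
    + [{"Accession": f"B{i}", "Country": "Kenya"} for i in range(2)]
)
picked = stratified_subset(records, "Country", per_category=2, seed=7)
print([r["Country"] for r in picked])
```

Here each country contributes two records, even though the input contains more sequences from one country than the other.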
