Virus report
Virus record metadata
The downloaded virus package contains a virus data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the virus data report file is a hierarchical JSON object that represents a single virus record. The schema of the virus record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields).
The outermost structure of the report is VirusAssembly.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
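Because the report is JSON Lines, it can be processed one record at a time without loading the whole file. The following is a minimal sketch, not an official client: the helper name `iter_virus_records` is hypothetical, and the single sample line is a shortened record built from values in the sample report below.

```python
import json
from typing import Iterable, Iterator


def iter_virus_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed virus record per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines such as a trailing newline
            yield json.loads(line)


# A shortened record using values from the sample report, followed by a blank line.
sample_lines = [
    '{"accession": "NC_045512.2", "completeness": "COMPLETE", "length": 29903}\n',
    "\n",
]

for record in iter_virus_records(sample_lines):
    print(record["accession"], record["length"])
```

An open file handle is also an iterable of lines, so the same helper could be called as `iter_virus_records(open("ncbi_dataset/data/data_report.jsonl", encoding="utf-8"))` on a downloaded package.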
Sample report
{
"accession": "NC_045512.2",
"bioprojects": [
"PRJNA485481"
],
"completeness": "COMPLETE",
"geneCount": 11,
"host": {
"lineage": [
{
"name": "cellular organisms",
"taxId": 131567
},
{
"name": "Eukaryota",
"taxId": 2759
},
{
"name": "Opisthokonta",
"taxId": 33154
},
{
"name": "Metazoa",
"taxId": 33208
},
{
"name": "Eumetazoa",
"taxId": 6072
},
{
"name": "Bilateria",
"taxId": 33213
},
{
"name": "Deuterostomia",
"taxId": 33511
},
{
"name": "Chordata",
"taxId": 7711
},
{
"name": "Craniata",
"taxId": 89593
},
{
"name": "Vertebrata",
"taxId": 7742
},
{
"name": "Gnathostomata",
"taxId": 7776
},
{
"name": "Teleostomi",
"taxId": 117570
},
{
"name": "Euteleostomi",
"taxId": 117571
},
{
"name": "Sarcopterygii",
"taxId": 8287
},
{
"name": "Dipnotetrapodomorpha",
"taxId": 1338369
},
{
"name": "Tetrapoda",
"taxId": 32523
},
{
"name": "Amniota",
"taxId": 32524
},
{
"name": "Mammalia",
"taxId": 40674
},
{
"name": "Theria",
"taxId": 32525
},
{
"name": "Eutheria",
"taxId": 9347
},
{
"name": "Boreoeutheria",
"taxId": 1437010
},
{
"name": "Euarchontoglires",
"taxId": 314146
},
{
"name": "Primates",
"taxId": 9443
},
{
"name": "Haplorrhini",
"taxId": 376913
},
{
"name": "Simiiformes",
"taxId": 314293
},
{
"name": "Catarrhini",
"taxId": 9526
},
{
"name": "Hominoidea",
"taxId": 314295
},
{
"name": "Hominidae",
"taxId": 9604
},
{
"name": "Homininae",
"taxId": 207598
},
{
"name": "Homo",
"taxId": 9605
},
{
"name": "Homo sapiens",
"taxId": 9606
}
],
"organismName": "Homo sapiens",
"taxId": 9606
},
"isAnnotated": true,
"isolate": {
"collectionDate": "2019-12",
"name": "Wuhan-Hu-1"
},
"length": 29903,
"location": {
"geographicLocation": "China",
"geographicRegion": "Asia"
},
"maturePeptideCount": 26,
"nucleotide": {
"sequenceHash": "A926D55E"
},
"proteinCount": 12,
"releaseDate": "2020-01-13T00:00:00Z",
"sourceDatabase": "RefSeq",
"submitter": {
"affiliation": "National Center for Biotechnology Information, NIH",
"country": "USA",
"names": [
"Wu,F.",
"Zhao,S.",
"Yu,B.",
"Chen,Y.M.",
"Wang,W.",
"Song,Z.G.",
"Hu,Y.",
"Tao,Z.W.",
"Tian,J.H.",
"Pei,Y.Y.",
"Yuan,M.L.",
"Zhang,Y.L.",
"Dai,F.H.",
"Liu,Y.",
"Wang,Q.M.",
"Zheng,J.J.",
"Xu,L.",
"Holmes,E.C.",
"Zhang,Y.Z.",
"Baranov,P.V.",
"Henderson,C.M.",
"Anderson,C.B.",
"Gesteland,R.F.",
"Atkins,J.F.",
"Howard,M.T.",
"Robertson,M.P.",
"Igel,H.",
"Baertsch,R.",
"Haussler,D.",
"Ares,M. Jr.",
"Scott,W.G.",
"Williams,G.D.",
"Chang,R.Y.",
"Brian,D.A.",
"Chen,Y.-M.",
"Song,Z.-G.",
"Tao,Z.-W.",
"Tian,J.-H.",
"Pei,Y.-Y.",
"Zhang,Y.-L.",
"Dai,F.-H.",
"Wang,Q.-M.",
"Zheng,J.-J.",
"Zhang,Y.-Z."
]
},
"updateDate": "2020-07-18T00:00:00Z",
"virus": {
"lineage": [
{
"name": "Viruses",
"taxId": 10239
},
{
"name": "Riboviria",
"taxId": 2559587
},
{
"name": "Orthornavirae",
"taxId": 2732396
},
{
"name": "Pisuviricota",
"taxId": 2732408
},
{
"name": "Pisoniviricetes",
"taxId": 2732506
},
{
"name": "Nidovirales",
"taxId": 76804
},
{
"name": "Cornidovirineae",
"taxId": 2499399
},
{
"name": "Coronaviridae",
"taxId": 11118
},
{
"name": "Orthocoronavirinae",
"taxId": 2501931
},
{
"name": "Betacoronavirus",
"taxId": 694002
},
{
"name": "Sarbecovirus",
"taxId": 2509511
},
{
"name": "Severe acute respiratory syndrome-related coronavirus",
"taxId": 694009
},
{
"name": "Severe acute respiratory syndrome coronavirus 2",
"taxId": 2697049
}
],
"organismName": "Severe acute respiratory syndrome coronavirus 2",
"pangolinClassification": "B",
"taxId": 2697049
}
}
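Since each record is nested and some sub-structures may be absent from a given record (the sample above has no biosample or usaState, for example), it can help to read fields defensively. The sketch below uses a trimmed copy of the sample record; the helper `get_path` is a hypothetical convenience, not part of any NCBI library.

```python
from typing import Any

# Trimmed copy of the sample report above (lineages shortened to two entries).
record = {
    "accession": "NC_045512.2",
    "host": {
        "lineage": [
            {"name": "Homo", "taxId": 9605},
            {"name": "Homo sapiens", "taxId": 9606},
        ],
        "organismName": "Homo sapiens",
        "taxId": 9606,
    },
    "location": {"geographicLocation": "China", "geographicRegion": "Asia"},
    "virus": {
        "lineage": [
            {"name": "Sarbecovirus", "taxId": 2509511},
            {"name": "Severe acute respiratory syndrome coronavirus 2", "taxId": 2697049},
        ],
        "organismName": "Severe acute respiratory syndrome coronavirus 2",
        "pangolinClassification": "B",
        "taxId": 2697049,
    },
}


def get_path(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


host_name = get_path(record, "host", "organismName")
virus_lineage = [node["name"] for node in get_path(record, "virus", "lineage", default=[])]
usa_state = get_path(record, "location", "usaState")  # not present in this record

print(host_name, virus_lineage[-1], usa_state)
```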
VirusAssembly Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | The accession.version of the viral nucleotide sequence. Includes both GenBank and RefSeq accessions | NC_045512.2 |
isAnnotated | is-annotated | Is Annotated | bool | The viral genome has been annotated by either the submitter (GenBank) or by NCBI (RefSeq) | |
isolate | isolate- | Isolate | Isolate | ||
sourceDatabase | sourcedb | Source database | string | Indicates if the source of the viral nucleotide record is from a GenBank submitter or from NCBI-derived curation (RefSeq) | RefSeq GenBank |
proteinCount | protein-count | Protein count | uint32 | The total count of annotated proteins including both proteins and polyproteins but not processed mature peptides | |
host | host- | Host | Organism | Taxon from which the virus sample was isolated | |
virus | virus- | Virus | Organism | Viral taxon | |
bioprojects repeated | bioprojects | BioProjects | string | Associated BioProject accessions, when available | PRJNA485481 |
location | geo- | Geographic | VirusAssembly.CollectionLocation | ||
updateDate | update-date | Update date | string | Date the viral nucleotide accession was last updated in NCBI Virus | |
releaseDate | release-date | Release date | string | Date the viral nucleotide accession was first released in NCBI Virus | |
completeness | completeness | Completeness | VirusAssembly.Completeness | ||
length | length | Length | uint32 | Length of the viral nucleotide sequence | |
geneCount | gene-count | Gene count | uint32 | Total count of genes annotated on the viral nucleotide sequence | |
maturePeptideCount | matpeptide-count | Mature peptide count | uint32 | Total count of processed mature peptides annotated on the viral nucleotide sequence | |
biosample | biosample-acc | BioSample accession | string | Associated BioSample accessions | SAMN15394129 |
molType | mol-type | Molecule type | string | ICTV (International Committee on Taxonomy of Viruses) viral classification based on nucleic acid composition, strandedness and method of replication | |
nucleotide | | | SeqRangeSetFasta | The whole genomic nucleotide record of the CDS feature. | |
purposeOfSampling | purpose-of-sampling | Purpose of Sampling | PurposeOfSampling | ||
sraAccessions repeated | sra-accs | SRA Accessions | string | SRA accessions linked to the GenBank genome | |
submitter | submitter- | Submitter | VirusAssembly.SubmitterInfo | Name, affiliation, and country of the submitter(s) | |
labHost | lab-host | Lab Host | string | This sequence is from viruses passaged in this host | |
isLabHost | is-lab-host | Is Lab Host | bool | If true, this sequence is from viruses passaged in a laboratory | |
isVaccineStrain | is-vaccine-strain | Is Vaccine Strain | bool | If true, this sequence is derived from a virus used as a vaccine or potential vaccine |
InfraspecificNames Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford boxer |
cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680 Pmale09 |
sex | sex | Sex | string | Male or female | female |
strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |
Isolate Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
name | lineage | Lineage | string | BioSample harmonized attribute names https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ | |
source | lineage-source | Lineage source | string | Source material from which the viral specimen was isolated | blood feces lung |
collectionDate | collection-date | Collection date | string | The collection date for the sample from which the viral nucleotide sequence was derived |
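Date fields are typed as strings, and the sample report shows two shapes: a full timestamp for releaseDate and updateDate ("2020-01-13T00:00:00Z") and a partial date for collectionDate ("2019-12"). The sketch below parses both. It assumes collection dates appear only as YYYY, YYYY-MM, or YYYY-MM-DD, which the schema does not state; both function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp shaped like the sample's releaseDate/updateDate values."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)  # trailing "Z" denotes UTC


def parse_collection_date(value: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Split a possibly partial date into (year, month, day).

    Assumption: the value is YYYY, YYYY-MM, or YYYY-MM-DD.
    Missing components are returned as None rather than guessed.
    """
    parts = [int(part) for part in value.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected collection date: {value!r}")
    year, month, day = (parts + [None, None])[:3]
    return year, month, day


released = parse_timestamp("2020-01-13T00:00:00Z")
collected = parse_collection_date("2019-12")
print(released.date(), collected)
```

Keeping the missing day as None avoids silently turning "2019-12" into the first of the month.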
LineageOrganism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
name | coming soon | coming soon | string | Scientific name | Coronaviridae |
Organism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606 2697049 |
organismName | name | Name | string | Scientific name | Homo sapiens Severe acute respiratory syndrome coronavirus 2 |
commonName | common-name | Common Name | string | Common name | human pangolin MERS SARS2 |
lineage repeated | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
infraspecificNames | infraspecific- | Infraspecific Names | InfraspecificNames |
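Mnemonics in the VirusAssembly table that end in a hyphen (host-, virus-) appear to act as prefixes for the mnemonics of the nested structure, so the Organism field taxId under host would be addressed as host-tax-id. The sketch below spells out that composition rule. It is an interpretation of the tables on this page, not a statement of how dataformat is implemented, and the names `ORGANISM_MNEMONICS`, `PARENT_PREFIXES`, and `nested_mnemonics` are hypothetical.

```python
# Mnemonics copied from the Organism Structure table (scalar fields only).
ORGANISM_MNEMONICS = {
    "taxId": "tax-id",
    "organismName": "name",
    "commonName": "common-name",
    "pangolinClassification": "pangolin",
}

# Prefix mnemonics copied from the VirusAssembly Structure table.
PARENT_PREFIXES = {"host": "host-", "virus": "virus-"}


def nested_mnemonics(parent_field: str) -> dict:
    """Map each Organism JSON field to its assumed full mnemonic.

    Assumption: a full mnemonic is the parent's prefix followed by the
    nested field's mnemonic.
    """
    prefix = PARENT_PREFIXES[parent_field]
    return {field: prefix + mnemonic for field, mnemonic in ORGANISM_MNEMONICS.items()}


for field, mnemonic in nested_mnemonics("virus").items():
    print(f"virus.{field} -> {mnemonic}")
```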
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
begin | start | Start | uint64 | ||
end | stop | Stop | uint64 | ||
orientation | orientation | Orientation | Orientation | ||
order | order | Order | uint32 | ||
ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. |
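Because Range coordinates are 1-based, they need shifting before they are used as Python slice indices. The sketch below assumes that end is inclusive and that a minus orientation means the reverse complement of the plus-strand text; neither is stated in the table. The function name `extract_range` and the toy sequence are hypothetical.

```python
# Complement table for an unambiguous DNA alphabet (illustrative only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_range(sequence: str, begin: int, end: int, orientation: str = "plus") -> str:
    """Return the subsequence for a 1-based range.

    Assumptions: `end` is inclusive, and orientation uses the names from
    the Orientation enumeration ("none", "plus", "minus").
    """
    if begin < 1 or end < begin or end > len(sequence):
        raise ValueError(f"range {begin}..{end} is outside 1..{len(sequence)}")
    segment = sequence[begin - 1:end]  # 1-based inclusive -> 0-based half-open
    if orientation == "minus":
        segment = segment.translate(_COMPLEMENT)[::-1]  # reverse complement
    return segment


toy_sequence = "ACGTAC"
print(extract_range(toy_sequence, 2, 4))           # positions 2, 3 and 4
print(extract_range(toy_sequence, 2, 4, "minus"))  # same span, opposite strand
```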
SeqRangeSetFasta Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
seqId | seq-id | Sequence ID | string | Seq_id may include location info in addition to a sequence accession | |
accessionVersion | accession | Accession | string | Accession and version of the viral nucleotide sequence | |
title | title | Title | string | ||
sequenceHash | hash | Hash | string | Unique identifier for identical sequences | |
range repeated | range- | Range | Range | Series of intervals on above accession_version |
VirusAssembly.CollectionLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geographicLocation | location | Location | string | Country of virus specimen collection | USA France |
geographicRegion | region | Region | string | Region of virus specimen collection | Asia North America |
usaState | state | State | string | Two-letter abbreviation of the state of the virus specimen collection (if United States) | NY VA |
VirusAssembly.SubmitterInfo Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
names repeated | names | Names | string | List of submitters or authors of the virus assembly | Jane D John S |
affiliation | affiliation | Affiliation | string | The submitter’s organization and/or institution | Centers for Disease Control and Prevention, Respiratory Viruses Branch, Division of Viral Diseases Public Health Directorate, Communicable Disease Laboratory |
country | country | Country | string | The country representing the submitter’s affiliation | USA China |
Orientation Enumeration
Name | Number | Description |
---|---|---|
none | 0 | |
plus | 1 | |
minus | 2 |
PurposeOfSampling Enumeration
Name | Number | Description |
---|---|---|
PURPOSE_OF_SAMPLING_UNKNOWN | 0 | |
PURPOSE_OF_SAMPLING_BASELINE_SURVEILLANCE | 1 |
VirusAssembly.Completeness Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
COMPLETE | 1 | |
PARTIAL | 2 |
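In the sample report the completeness value appears by name ("COMPLETE") rather than by number. A small enumeration mirroring the table above makes filtering explicit. This sketch treats a missing completeness field as UNKNOWN, which is an assumption, and the names `Completeness`, `parse_completeness`, and `complete_only` are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, Iterator


class Completeness(IntEnum):
    """Mirror of the VirusAssembly.Completeness table."""
    UNKNOWN = 0
    COMPLETE = 1
    PARTIAL = 2


def parse_completeness(record: dict) -> Completeness:
    """Look up the enum member by name; a missing field is assumed UNKNOWN."""
    return Completeness[record.get("completeness", "UNKNOWN")]


def complete_only(records: Iterable[dict]) -> Iterator[dict]:
    """Keep only records whose completeness is COMPLETE."""
    for record in records:
        if parse_completeness(record) is Completeness.COMPLETE:
            yield record


# The first record uses values from the sample report; the second is a stub
# with no completeness field to exercise the assumed default.
records = [
    {"accession": "NC_045512.2", "completeness": "COMPLETE"},
    {"accession": "hypothetical-record-without-completeness"},
]
print([r["accession"] for r in complete_only(records)])
```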
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |