Adding a Structured Comment to GenBank Submissions

Introduction

GenBank records consist primarily of nucleotide sequence data, source organism information, and sequence features. The organism and feature description are based on a controlled list of organism modifiers (such as isolate, strain, clone, and specimen voucher) and features (such as CDS, rRNA, and gene).

However, many sequence submitters also have additional organism metadata that cannot easily fit into the controlled list but that is significant for the complete description of a sequence's source and allows for comparisons of sequences isolated from similar locations.

To collect and display such additional metadata in sequence records, GenBank has developed a Structured Comment. The comment consists of tag-value pairs that are contained within START and END tags that function as delimiters for easy parsing. These comments can be incorporated from a tab-delimited table into submission files using table2asn (the replacement for the older tbl2asn). An example of a GenBank record that includes a structured comment is GU949562.
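As a rough illustration of why the delimiters make parsing easy, the sketch below pulls the tag-value pairs out of a COMMENT block. It assumes the flatfile layout in which the delimiters appear as ##Prefix-START## and ##Prefix-END## and each pair is written as "tag :: value". The function name and the sample text are illustrative only and are not part of any GenBank tool.

```python
import re

# Assumed delimiter layout: ##<prefix>-START## ... ##<prefix>-END##
START = re.compile(r"^##(.+)-START##$")
END = re.compile(r"^##(.+)-END##$")


def parse_structured_comment(comment_text):
    """Return (prefix, {tag: value}) for the first structured comment found."""
    prefix = None
    pairs = {}
    inside = False
    for raw in comment_text.splitlines():
        line = raw.strip()
        # Drop the flatfile keyword if it shares a line with the start tag.
        if line.startswith("COMMENT"):
            line = line[len("COMMENT"):].strip()
        match = START.match(line)
        if match:
            prefix, inside = match.group(1), True
            continue
        if END.match(line):
            break
        if inside and "::" in line:
            tag, _, value = line.partition("::")
            pairs[tag.strip()] = value.strip()
    return prefix, pairs


# Made-up sample text, using field names that appear later in this guide.
SAMPLE = """\
COMMENT     ##Assembly-Data-START##
            Assembly Method :: Newbler v. 2.3
            Coverage        :: 12x
            ##Assembly-Data-END##
"""

print(parse_structured_comment(SAMPLE))
```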

This guide explains how to include structured comments with your sequence submission. However, note that several GenBank submission tools prompt submitters to provide the metadata required to create certain structured comments for particular types of data, as explained below.

If you do not understand any of the instructions presented here or you have questions, please contact GenBank User Services at info@ncbi.nlm.nih.gov prior to creating your submission.

Including Structured Comments Within GenBank Submissions
- Adding the same structured comment to all sequences in your submission
- Adding a unique structured comment to each sequence in your submission
Specialized Structured Comments
Retrieval in Entrez

Including Structured Comments Within GenBank Submissions

In order to include unique metadata within the structured comment, you need to create a tab-delimited table in one of two ways depending on how the data should be applied to the sequences in your submission. Any scientific unit of measurement (e.g., deg C or km) should be included with the value.

[1] Adding the same structured comment to all sequences in your submission

This requires a single tab-delimited table that includes the tag-value pairs that are to be applied to all of the sequences in your submission, for example:

oxygen_content	32 ppm
habitat	Black Lake
temperature	27 deg C
sample size	150 mL
depth	10 m

Once the metadata table is created and saved as plain text, the structured comment can be included using table2asn.

table2asn: The tab-delimited table needs to be saved as a .cmt file and included in the same directory as your fasta file. If the .cmt file name has the same basename as your fasta file (for example, fasta1.fsa and fasta1.cmt), it will be automatically recognized and the structured comment will be included for all the sequences in your fasta file. Alternatively, you can use any file name for the structured comment file and pass it with the -w argument on the table2asn command line.
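The steps above can be sketched in Python as follows. This is a minimal sketch, not an official workflow: the helper names are hypothetical, and the -i argument for the input fasta file is an assumption that should be checked against the help output of your table2asn version (only -w is named in this guide). The command is built but not run.

```python
from pathlib import Path


def format_cmt(pairs):
    """Return tag-value pairs as tab-delimited text, one pair per line."""
    lines = []
    for tag, value in pairs:
        if "\t" in tag or "\t" in value:
            raise ValueError(f"tab character inside field: {tag!r}")
        lines.append(f"{tag}\t{value}")
    return "\n".join(lines) + "\n"


def table2asn_args(fasta_path, cmt_path):
    """Build (but do not run) a table2asn argument list.

    -i for the input file is an assumption; -w is the structured
    comment argument described in this guide.
    """
    args = ["table2asn", "-i", str(fasta_path)]
    # A .cmt file that shares the fasta basename and directory is picked
    # up automatically, so -w is only added when the names differ.
    if Path(cmt_path).with_suffix("") != Path(fasta_path).with_suffix(""):
        args += ["-w", str(cmt_path)]
    return args


pairs = [
    ("oxygen_content", "32 ppm"),
    ("habitat", "Black Lake"),
    ("temperature", "27 deg C"),
]
cmt_text = format_cmt(pairs)
# Path("fasta1.cmt").write_text(cmt_text)  # save next to fasta1.fsa
print(cmt_text, end="")
print(table2asn_args("fasta1.fsa", "metadata.cmt"))
```

Keeping the .cmt basename identical to the fasta basename avoids the extra argument entirely.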

[2] Adding a unique structured comment to each sequence in your submission

The format for this type of table is a tab-delimited, multi-column table, where the first column must be the Sequence Identifier used in the .fsa files. The first row in each column is the metadata tag that appears on the left side of the structured comment, for example:

SeqID	investigation_type	project_name	collection_date	depth
A	metagenome	aquatic study	2007-03-04	10 m
B	metagenome	aquatic study		5 m
C	eukaryote	Analysis of fish	2008-08-09	25 m

Each sequence in this submission will include a structured comment with unique tag-value pairs. Once the metadata table is created and saved as plain text, the tag-value pairs can be included using table2asn.

See the HIV example below for instructions on the .cmt file format to include a specific prefix for the structured comment.

table2asn: The tab-delimited table needs to be saved as a .cmt file and included in the same directory as your fasta (and optional .tbl) file. If the .cmt file name has the same basename as your fasta file (for example, fasta1.fsa and fasta1.cmt), the .cmt file will be automatically included so that each sequence in column 1 has the tag-value pairs of that row of the file.
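Because each row is matched to a sequence through column 1, it can help to confirm that the identifiers in the table and in the fasta file agree before running table2asn. The following is a minimal sketch with hypothetical helper names. It assumes the Sequence Identifier is the first whitespace-delimited token of each fasta definition line, and it treats an empty cell as "no value for that tag", which is an assumption about how blank cells are handled.

```python
def fasta_ids(fasta_text):
    """Return the first token of every fasta definition line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def read_comment_table(table_text):
    """Parse a multi-column table into {SeqID: {tag: value}}."""
    rows = [line.split("\t") for line in table_text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    table = {}
    for row in records:
        # Pad short rows so missing trailing cells become empty strings.
        row = row + [""] * (len(header) - len(row))
        # Empty cells are skipped (assumption: blank means "no value").
        table[row[0]] = {tag: val for tag, val in zip(header[1:], row[1:]) if val}
    return table


def unmatched_ids(fasta_text, table_text):
    """Return (IDs only in the fasta file, IDs only in the table)."""
    in_fasta = set(fasta_ids(fasta_text))
    in_table = set(read_comment_table(table_text))
    return sorted(in_fasta - in_table), sorted(in_table - in_fasta)


TABLE = (
    "SeqID\tinvestigation_type\tproject_name\tcollection_date\tdepth\n"
    "A\tmetagenome\taquatic study\t2007-03-04\t10 m\n"
    "B\tmetagenome\taquatic study\t\t5 m\n"
    "C\teukaryote\tAnalysis of fish\t2008-08-09\t25 m\n"
)
FASTA = ">A sample one\nACGT\n>B\nACGT\n>C\nACGT\n"

print(unmatched_ids(FASTA, TABLE))
print(read_comment_table(TABLE)["B"])
```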

Specialized Structured Comments

[1] MIGS/MIMS/MIMARKS

Minimum information checklists have been developed by the Genomic Standards Consortium (GSC) as a means of reporting core descriptive information about the environment from which one or more organisms were collected. Core descriptors include information about the origins of the nucleic acid sequence (genome), its environment (e.g., latitude and longitude, date and time of sampling, habitat) and sequence processing (sequencing and assembly methods).

Different lists have been developed to describe genomic, metagenomic, and marker sequence metadata:

MIGS - Minimum Information About a Genome Sequence
MIMS - Minimum Information About a Metagenome Sequence
MIMARKS - Minimum Information About a Marker Sequence
MIMAG - Minimum Information About a Metagenome-Assembled Genome
MISAG - Minimum Information About a Single Amplified Genome
MIUVIG - Minimum Information About an Uncultivated Virus Genome

The tag-value pairs that are included for each submission type can be validated for compliance with the GSC recommended list. The recommended lists of core descriptors for each of these sequence types are published by the GSC.

Validation tools will report whether structured comments include all of the GSC-recommended core descriptors. Submissions that include all of the compliant tags will have a keyword included within the GenBank flatfile, for example:

KEYWORD GSC:MIMARKS:5.0

Structured comments that are not compliant with the GSC guidelines can still be included within GenBank submissions; they simply will not receive the keyword.

In order for this validation to occur, you will need to include within the first column in your table a tag that defines the prefix and suffix for the start and end tags within the structured comment, for example:

StructuredCommentPrefix	[one of the following - MIGS:3.0-Data / MIMS:3.0-Data / MIMARKS:3.0-Data]
investigation_type	[value determined by organism type as defined within GSC spreadsheet]
project_name	Analysis of soil bacteria
collection_date	2008-08-09
lat_lon	35.64N 56E
geo_loc_name	France
biome	grassland
feature	field
material	soil
env_package	[env_package types are listed within the GSC spreadsheet] - can include the term "missing"
num_replicons	14
ref_biomaterial	PMID
biotic_relationship	free living
trophic_level	autotroph
rel_to_oxygen	aerobe
isol_growth_condt	PMID
seq_meth	pyrosequencing
assembly	Velvet; error rate 1/45
finishing_strategy	complete; 4X coverage; 2500 contigs

An example of a sequence that includes a structured comment that meets GSC compliance is CP051461.
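Before submitting, a table like the one above can be given a quick local check. The sketch below only verifies that the first row defines StructuredCommentPrefix and that a caller-supplied set of tags has values. The tag list used here is a placeholder for illustration; the authoritative core descriptors come from the GSC lists, and this check is not a substitute for GenBank validation.

```python
def parse_two_column(table_text):
    """Parse a two-column, tab-delimited table into {tag: value}."""
    pairs = {}
    for line in table_text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition("\t")
        pairs[tag.strip()] = value.strip()
    return pairs


def check_table(table_text, required_tags):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("StructuredCommentPrefix\t"):
        problems.append("first row is not StructuredCommentPrefix")
    pairs = parse_two_column(table_text)
    for tag in required_tags:
        if not pairs.get(tag):
            problems.append(f"missing value for {tag}")
    return problems


# Placeholder list for illustration only; consult the GSC lists for real use.
EXAMPLE_TAGS = ("project_name", "collection_date", "lat_lon")

TABLE = (
    "StructuredCommentPrefix\tMIMARKS:3.0-Data\n"
    "project_name\tAnalysis of soil bacteria\n"
    "collection_date\t2008-08-09\n"
)

print(check_table(TABLE, EXAMPLE_TAGS))
```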

[2] Genome Submissions

Prokaryotic and eukaryotic genome submissions require assembly information in a Genome-Assembly-Data structured comment. This structured comment includes the following required fields:

Assembly Method (with version or date the program was run): e.g., Newbler v. 2.3 OR Celera Assembly v. May 2010
Genome Coverage : e.g., 121x
Sequencing Technology : e.g., ABI 3730; Illumina GAIIx; Nanopore

Assembly Name may be added for eukaryotic assemblies, but is optional.

Assembly Name : a short name suitable for display e.g., LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0

Note that Assembly Method requires 'v. ' between the algorithm name and its version (or the month and year it was run). If more than one sequencing technology was used, they are separated with a semicolon, e.g., "PacBio; Illumina GAIIx".
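The two formatting rules in this note can be checked locally before submission. The sketch below is a loose approximation written for illustration, with hypothetical function names. It is not the rule set that GenBank applies and only tests for a program name, the literal 'v. ', and something after it.

```python
import re

# Program name, then the literal " v. ", then a version or run date.
ASSEMBLY_METHOD = re.compile(r"^\S.*? v\. \S.*$")


def valid_assembly_method(value):
    """True if the value contains 'v. ' between a name and a version/date."""
    return bool(ASSEMBLY_METHOD.match(value))


def split_technologies(value):
    """Split a semicolon-separated Sequencing Technology value."""
    return [tech.strip() for tech in value.split(";") if tech.strip()]


for method in ("Newbler v. 2.3", "Celera Assembly v. May 2010", "Newbler 2.3"):
    print(method, "->", valid_assembly_method(method))

print(split_technologies("PacBio; Illumina GAIIx"))
```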

You will be prompted for this information when you submit your prokaryotic or eukaryotic genome via the Genome Submission Portal, which is the easiest way to provide the information.

If you are creating a .sqn file with table2asn, you can create a Genome-Assembly-Data file and include it as described above, if you wish. However, this is not necessary because you will be prompted for the information when you submit the genome in the Submission Portal.

The prefix and suffix for the start and end tags are:

StructuredCommentPrefix	Genome-Assembly-Data
StructuredCommentSuffix	Genome-Assembly-Data

An example of a genome with the required structured comment is AMVS01000000.
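For submitters who do create the file themselves, the following sketch assembles the fields listed above into a two-column, tab-delimited table. The function name is hypothetical, and the row order (prefix first, suffix last) is an assumption modelled on the MIGS/MIMS/MIMARKS example earlier in this guide.

```python
def genome_assembly_cmt(method, coverage, technologies, assembly_name=None):
    """Return Genome-Assembly-Data tag-value rows as tab-delimited text."""
    rows = [("StructuredCommentPrefix", "Genome-Assembly-Data")]
    if assembly_name:
        # Assembly Name is optional.
        rows.append(("Assembly Name", assembly_name))
    rows += [
        ("Assembly Method", method),
        ("Genome Coverage", coverage),
        # Multiple technologies are joined with a semicolon.
        ("Sequencing Technology", "; ".join(technologies)),
        ("StructuredCommentSuffix", "Genome-Assembly-Data"),
    ]
    return "".join(f"{tag}\t{value}\n" for tag, value in rows)


print(genome_assembly_cmt("Newbler v. 2.3", "121x", ["PacBio", "Illumina GAIIx"]), end="")
```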

[3] Transcriptome Shotgun Assembly Submissions

An Assembly-Data structured comment is required for Transcriptome Shotgun Assembly (TSA) sequences. Users will be prompted for this information when using the TSA Submission Wizard. If submitting using table2asn, this file can be made using the Structured Comment template (non-genomes) page or as described above. However, this is not necessary because you will be prompted for the information when you submit the transcriptome in the Submission Portal.

The TSA structured comment includes the following required values:

Assembly Method (with version or date the program was run): e.g., Velvet v.1.1.05, Oases v.0.1.22, Trinity r2012-01-25
Sequencing Technology : e.g., ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx

Coverage and Assembly Name may be added but these are optional.

Assembly Name : a short name suitable for display e.g., LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0
Coverage : e.g., 12x

The prefix and suffix for the start and end tags to include within this structured comment are:

StructuredCommentPrefix	Assembly-Data
StructuredCommentSuffix	Assembly-Data

An example of a TSA submission with the required structured comment is JU497302.

[4] GenBank Assembly-Data

Submissions to GenBank can include an Assembly-Data structured comment that is displayed within the GenBank flatfile and provides users with information regarding the sequencing and assembly details.

This structured comment includes the following values:

Assembly Method (with version or date the program was run): e.g., Newbler v. 2.3 OR Celera Assembly v. May 2010
Coverage : e.g., 12x
Sequencing Technology : e.g., ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx (required)

The prefix and suffix for the start and end tags to include within this structured comment are:

StructuredCommentPrefix	Assembly-Data
StructuredCommentSuffix	Assembly-Data

An example of a GenBank record with an Assembly-Data structured comment is JQ307843.

[5] HIV

A specialized structured comment can be included with HIV sequence submissions to describe additional metadata that cannot be easily included within the source descriptor. This includes specific tags that provide more information regarding the source of the virus.

For HIV-specific structured comments, you need to include two additional columns in your table that define the prefix and suffix for the start and end tags on either side of the structured comment:

StructuredCommentPrefix	HIVDataBaseData
StructuredCommentSuffix	HIVDataBaseData

Example Table

SeqID	sequence name	Patient cohort	Sample tissue	viral load	StructuredCommentPrefix	StructuredCommentSuffix
SeqA	mysample_1	CHAVI001	plasma	3565728	HIVDataBaseData	HIVDataBaseData
SeqB	mysample_2	CHAVI002	plasma	3565730	HIVDataBaseData	HIVDataBaseData
SeqC	mysample_3	CHAVI003	plasma	3565755	HIVDataBaseData	HIVDataBaseData

An example record that includes a properly formatted HIV structured comment is EU579019.
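If a per-sequence metadata table already exists, the two extra columns can be appended programmatically rather than by hand. The sketch below is illustrative only. The function name is hypothetical, and the prefix and suffix value is passed in by the caller so that it can be matched to whatever value your submission requires.

```python
def add_prefix_suffix(table_text, name):
    """Append StructuredCommentPrefix/Suffix columns to a tab-delimited table."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("table is empty")
    # Extend the header row, then repeat the same value on every data row.
    out = [lines[0] + "\tStructuredCommentPrefix\tStructuredCommentSuffix"]
    for line in lines[1:]:
        out.append(f"{line}\t{name}\t{name}")
    return "\n".join(out) + "\n"


TABLE = (
    "SeqID\tsequence name\tPatient cohort\tSample tissue\tviral load\n"
    "SeqA\tmysample_1\tCHAVI001\tplasma\t3565728\n"
    "SeqB\tmysample_2\tCHAVI002\tplasma\t3565730\n"
)

print(add_prefix_suffix(TABLE, "HIVDataBaseData"), end="")
```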

Retrieval in Entrez

Sequences with structured comments can be retrieved in Entrez by specifying the tag-value pair in double quotes, e.g. "investigation_type bacteria_archaea". This search in Entrez retrieves GenBank records with this tag-value pair in the structured comment. You can also search for each tag as a property in Entrez (e.g., depth[prop]) in order to retrieve all records that have this indexed within the structured comment.
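The two search forms described above can be composed programmatically. The sketch below only builds query strings and a search URL and performs no network request. The use of the E-utilities esearch endpoint and the nuccore database name are assumptions not covered by this guide, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag_value_query(tag, value):
    """Quote a tag-value pair as a single phrase."""
    return f'"{tag} {value}"'


def property_query(tag):
    """Search for records that have the tag indexed as a property."""
    return f"{tag}[prop]"


def esearch_url(term, db="nuccore"):
    """Build (but do not fetch) a search URL for the given term."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


print(tag_value_query("investigation_type", "bacteria_archaea"))
print(property_query("depth"))
print(esearch_url(property_query("depth")))
```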
