Assembly Submission Guidelines

An assembly is a collection of genomic sequences that are used to represent the genome of an organism. Below are some instructions for submitting a genome assembly.

Assembly Types
Data File Definitions
Examples

Assembly Types

Genome assemblies can be described in several ways. We've defined two basic categories; within each category, specific assembly types can be defined.

Simple: Assemblies with no instructions for building higher order molecules.
- Complete replicon: All molecules are represented by a gapless sequence. Typically bacterial, but can be seen in other taxa.
- WGS contigs only: An assembly where only read overlap contigs have been produced and no scaffolds have been made. See this page for an overview of WGS Projects.
Complex: Assemblies consisting of some component sequence (like a WGS contig or an HTG sequence) and higher order structures such as scaffolds and/or chromosomes.
- Haploid only: a collection of unplaced scaffolds, or a collection of chromosomes plus unlocalized and unplaced scaffolds, that represents an organism's consensus haploid sequence. No locus may intentionally be represented more than once, with the exception of the pseudo-autosomal region (PAR) in mammalian genomes. A genome assembly may be haploid either because the genetic material sequenced was haploid, or because the genetic material sequenced was diploid or polyploid but the assembly process collapsed the sequence into a composite haploid representation of the genome (unless sophisticated assembly algorithms are used, any alternate sequences typically get placed in the unlocalized or unplaced bins). This is the most common situation with current sequencing and assembly technologies.
- Haploid + Alts: a collection of chromosome assemblies, unlocalized and unplaced sequences, plus alternate loci that represents an organism's genome. Any locus may be represented 0, 1 or >1 times, but entire chromosomes are only represented 0 or 1 times. An example of this type of assembly is the GRCh37 assembly for Homo sapiens.
- Diploid/Polyploid: a genome assembly for which a chromosome assembly is available for both/all sets of an individual's chromosomes. A diploid/polyploid genome assembly is expected to represent the full genome of an individual, so alternate loci are not expected to be defined for such an assembly, although unlocalized or unplaced sequences are likely to be part of it.

Data File Definitions

Below are the file types, and their formats, that may be required for your submission.

Meta data file: Information about the project, identifiers, DNA source, sequencing technology and assembly information.

A new web form is under development. Information about the submitter and publication is provided in a template file created here: /WebSub/template.cgi. The assembly metadata that we need are:

Project ID
An assembly name: should be short and suitable for display.
An assembly description: brief text to display about the assembly
Sequencing technologies used to generate reads.
Sequence coverage
Assembly program and version used
Base level quality description: brief description of how the assembly program determines base level quality. The minimum value for a 'good' base would also be useful.
Linkage quality: If submitting AGP files, describe how linkage quality was determined. For example, if linkage is based on mate pairs, state the minimum number of mates required to determine linkage.
Information on the material that was sequenced:
- organism name
- strain/breed/isolate/cultivar
- sex
- any other sample specific information that could be useful.

Contig Fasta Files: A sequence file for the WGS contigs in an assembly. There should be no gaps represented, although Ns can be used to represent sequence ambiguities. There should be no more than 10,000 sequences per file. It is often convenient to group sequences by molecule type (e.g. chromosome) or sequence status (e.g. unplaced or unlocalized).

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by beginning with a greater-than (">") symbol in the first column, followed by the contig name (SeqId). It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA:

>contig0001
AAAACCTTCCCGTTGGCCTTCACCGTCTACTTAACGAGCCACGCCCCTCCTAGGACACCGCAAGAGAAAT
GCTGGGGTCACCCCTGGCCGAGGCCTCCCTCCTGCTGGCCACACGTAAGAAAGGACTTCACAAGGGAGAC
CTCCGTGGCTGCCACACACATTCACCCCAAATGCTTCCTGGAGAAAGCACCTGCCCTCACACTGTGAGCT
CGTGAGTTTGCCAAAAAGGAGATGCAGGAGCCTGAGATCACCTCCTGTCTTGCTGCTAAAATATCCCAGC
CGTGGAAAAGCAAGGCTGGCCTCAAATTGGGGAATCTGGTCTTGCCAGCCCAGCTGTGCTCCAGGGACTC
CGGTTTGCATTGGGAATGAGAGAGTGTTGGCCGGGTAAGATGGCAAGACAGACACAGTCCTCCTACAGAC
TTGTAGAAGGGCTTTCTGCCCGCCCCCACCCAGGGCAGAAAGAGGAGGCACAGGGGAAAACAACAAGAGC
CCTGGCCAAGAATGAGCCCTCTGCCGTCGTCCTGTGTGTGGCCTTGTGGCCCAGCACCAGCGCTGGGGGG
NCACTTTGCCCTGCCTGACAGGAGGAAGGGATGCCCTAGTGAGGTGGGAAACAGAGGGAGAGGTTGAGAC
CACCTTGGACAAGAAGGGCCAGGGAAGGCCCTNCCNTCACCTGTCACTACAGCCCGACACTTAGAAGGTA

Typically, files will end with an .fsa extension (e.g. chr1.fsa, chr2.fsa, unknown.fsa).
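The rules above (a ">" description line carrying the SeqId, lines shorter than 80 characters, at most 10,000 sequences per file, Ns for ambiguities only) can be checked before submission. The following is a minimal sketch, not an official validator: the function names are hypothetical, and the duplicate-SeqId and terminal-N checks are assumptions about what a clean contig file should look like.

```python
from typing import Dict, Iterable, List

MAX_SEQS_PER_FILE = 10_000  # limit stated in these guidelines
MAX_LINE_LENGTH = 80        # lines are recommended to be shorter than this


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {SeqId: sequence}; the SeqId is the first word after '>'."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("description line without a SeqId")
            current = words[0]
            if current in parts:  # assumption: SeqIds must be unique
                raise ValueError(f"duplicate SeqId: {current}")
            parts[current] = []
        elif current is None:
            raise ValueError("sequence data before the first description line")
        else:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def check_contig_fasta(lines: List[str]) -> List[str]:
    """Collect readable problems instead of stopping at the first one."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if len(raw.rstrip("\n")) >= MAX_LINE_LENGTH:
            problems.append(f"line {number}: {MAX_LINE_LENGTH} or more characters")
    records = read_fasta(lines)
    if len(records) > MAX_SEQS_PER_FILE:
        problems.append(f"{len(records)} sequences in one file")
    for name, seq in records.items():
        if not seq:
            problems.append(f"{name}: no sequence data")
        elif seq[0] in "Nn" or seq[-1] in "Nn":
            # Internal Ns are allowed as ambiguities; terminal Ns are not.
            problems.append(f"{name}: starts or ends with N")
    return problems
```

Calling check_contig_fasta on the lines of a .fsa file returns an empty list when none of these checks fail.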

Base level quality files: A file similar to FASTA but with numbers to represent the quality for each base in an equivalent FASTA file.

A base level quality score file looks very similar to a FASTA file:

>contig0001
51 63 70 82 82 82 90 90 90 90 86 86 86 86 86 86 90 90 90 90 90 86 86 78...

It is important that the defline in the quality file match the defline in the fasta file. Use the same basename as the fasta file, but use the extension .qvl for the base quality score file. For example, if your fasta file is called 'mycontigs.fsa', your quality file should be 'mycontigs.qvl'.
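The pairing between a fasta file and its quality file can be checked mechanically. This sketch assumes one integer score per base and whitespace-separated scores; the function names are hypothetical, and the caller is assumed to supply the sequence lengths keyed by defline.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def expected_quality_name(fasta_name: str) -> str:
    """Same basename as the fasta file, with the .qvl extension."""
    return str(Path(fasta_name).with_suffix(".qvl"))


def read_scores(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Return {defline: [score, ...]} from quality-file lines."""
    scores: Dict[str, List[int]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            scores[current] = []
        elif current is None:
            raise ValueError("scores before the first defline")
        else:
            scores[current].extend(int(token) for token in line.split())
    return scores


def compare_to_fasta(seq_lengths: Dict[str, int],
                     scores: Dict[str, List[int]]) -> List[str]:
    """seq_lengths maps each fasta defline to its sequence length."""
    problems = []
    for defline in sorted(seq_lengths):
        if defline not in scores:
            problems.append(f"{defline}: no scores found")
        elif len(scores[defline]) != seq_lengths[defline]:
            problems.append(
                f"{defline}: {seq_lengths[defline]} bases but "
                f"{len(scores[defline])} scores"
            )
    for defline in sorted(set(scores) - set(seq_lengths)):
        problems.append(f"{defline}: scores without a matching sequence")
    return problems
```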

AGP File: A file describing how higher level objects (scaffolds and/or chromosomes) are assembled using component sequences. Components are single sequences that are submitted to GenBank/EMBL/DDBJ. These are typically BAC/fosmid clones or WGS contigs, but may include PCR products, or other genomic sequences.

An AGP can be thought of as the instructions for building a pseudo-molecule. In the context of a genome assembly, this is typically a scaffold sequence or a chromosome sequence. A full description of the AGP format can be found here: AGP Specification. More extensive examples of AGP files can be found at the previous link, but here is an example for a eukaryote:

# ORGANISM: Pan troglodytes
# TAX_ID: 9598
# ASSEMBLY NAME: EG2
# ASSEMBLY DATE: 06-September-2006
# GENOME CENTER: NCBI
# DESCRIPTION: Example AGP specifying the assembly of chromosome 37 from BAC clone sequence contigs
# COMMENTS:
# This is an AGP for a fictitious chromosome.
chr37  1        8000000  1   N  8000000     short_arm        no
chr37  8000001  8050000  2   N  50000       heterochromatin  no
chr37  8050001  9050000  3   N  1000000     centromere       no
chr37  9050001  9119019  4   F  AC147148.2  82352            151370  -
chr37  9119020  9306028  5   F  AC147343.2  1                187009  +
chr37  9306029  9468766  6   F  AC146245.2  1                162738  +
chr37  9468767  9674992  7   F  AC146175.1  21960            228185  -
chr37  9674993  9773308  8   F  AC145782.1  1                98316   +
chr37  9773309  9840873  9   F  AC147670.4  1                67565   +
chr37  9840874  9973728  10  F  AC151848.4  1                132855  +
....

and an example for a prokaryote:

# ORGANISM: Mycobacterium tuberculosis C
# TAX_ID: 348776
# ASSEMBLY NAME: MycParC_1.0
# ASSEMBLY DATE: 06-September-2006
# GENOME CENTER: NCBI
# DESCRIPTION: Example AGP specifying the assembly of bacterial scaffolds from wgs-contigs being submitted
# COMMENTS:
# This is an AGP for a fictitious assembly.
SCAFFOLD1  1       1145353  1  W  Contig1   1         1145353  +
SCAFFOLD2  1       490249   1  W  Contig2   1         490249   +
SCAFFOLD2  490250  490749   2  N  500       fragment  yes
SCAFFOLD2  490750  586060   3  W  Contig3   1         95311    +
SCAFFOLD3  1       41479    1  W  Contig4   1         41479    +
SCAFFOLD4  1       2266     1  W  Contig5   1         2266     +
SCAFFOLD4  2267    2326     2  N  60        fragment  yes
SCAFFOLD4  2327    5525     3  W  Contig6   1         3199     +
SCAFFOLD5  1       1788     1  W  Contig7   1         1788     +
SCAFFOLD6  1       202169   1  W  Contig8   1         202169   +
SCAFFOLD7  1       8835     1  W  Contig9   1         8835     +
SCAFFOLD8  1       20646    1  W  Contig10  1         20646    +
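The coordinate arithmetic in an AGP file is easy to get wrong, so a consistency check can help before submission. The sketch below is an assumption-laden illustration rather than a replacement for the AGP Specification or an official validator: it splits columns on any whitespace (real AGP files are tab-delimited), treats component types N and U as gaps, and checks only that each line's span on the object equals the component or gap length and that parts are contiguous and numbered from 1. The names AgpRow, parse_agp and check_agp are hypothetical.

```python
from typing import List, NamedTuple

GAP_TYPES = {"N", "U"}  # gap lines carry a length instead of a component span


class AgpRow(NamedTuple):
    obj: str          # object (scaffold or chromosome) being built
    obj_beg: int
    obj_end: int
    part: int         # part number within the object
    comp_type: str
    rest: List[str]   # remaining columns, which differ for gaps and components


def parse_agp(text: str) -> List[AgpRow]:
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split()
        rows.append(AgpRow(cols[0], int(cols[1]), int(cols[2]),
                           int(cols[3]), cols[4], cols[5:]))
    return rows


def check_agp(rows: List[AgpRow]) -> List[str]:
    problems = []
    prev = None
    for row in rows:
        span = row.obj_end - row.obj_beg + 1
        if row.comp_type in GAP_TYPES:
            length = int(row.rest[0])                         # gap_length
        else:
            length = int(row.rest[2]) - int(row.rest[1]) + 1  # component span
        if span != length:
            problems.append(f"{row.obj} part {row.part}: object span {span} "
                            f"!= component/gap length {length}")
        if prev is not None and prev.obj == row.obj:
            if row.obj_beg != prev.obj_end + 1 or row.part != prev.part + 1:
                problems.append(f"{row.obj} part {row.part}: not contiguous "
                                f"with the previous part")
        elif row.obj_beg != 1 or row.part != 1:
            problems.append(f"{row.obj}: first part must start at 1")
        prev = row
    return problems
```

For instance, the SCAFFOLD2 lines of the prokaryote example pass: 490750 to 586060 spans 95,311 bases, matching Contig3 1 to 95311.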

PAR Definition File: File describing the location of the Pseudo-Autosomal Region in mammals.

If the pseudo-autosomal region (PAR) is known, it is useful to define the region to facilitate annotation. The file is called PAR-regions. This is a tab delimited file with the following definition:

chr-name: name of the chromosome (usually X or Y)
par-name: name of the PAR region (e.g. PAR#1, PARq)
start: where the PAR starts on the chromosome
stop: where the PAR ends on the chromosome

and here is an example file:

#Chr  PAR-name  start      end
X     PAR#1     60001      2699520
X     PAR#2     154931044  155260560
Y     PAR#1     10001      2649520
Y     PAR#2     59034050   59363566
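Reading a PAR-regions file takes only a few lines. The sketch below assumes that a line starting with "#" is a header or comment and that each chromosome/PAR-name pair appears once; neither rule is stated in the definition above, and the names ParRegion and parse_par_regions are hypothetical.

```python
from typing import List, NamedTuple


class ParRegion(NamedTuple):
    chrom: str   # chr-name
    name: str    # par-name
    start: int
    stop: int


def parse_par_regions(text: str) -> List[ParRegion]:
    regions = []
    seen = set()
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue  # assumption: leading '#' marks the header line
        chrom, name, start, stop = raw.split("\t")
        region = ParRegion(chrom, name, int(start), int(stop))
        if region.start > region.stop:
            raise ValueError(f"{chrom} {name}: start is after stop")
        if (chrom, name) in seen:
            raise ValueError(f"{chrom} {name}: listed more than once")
        seen.add((chrom, name))
        regions.append(region)
    return regions
```

Only a leading "#" is treated as a comment marker, so PAR names such as PAR#1 are read normally.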

Alternate locus placement file: File describing the chromosome context of alternate locus scaffolds.

For some assemblies, highly divergent regions can be captured as separate paths in the assembly. Typically one of these is incorporated into the chromosome and the other can be put into chromosome context. This file defines the chromosome context for alternate locus scaffolds. This is a tab delimited file with the following columns:

alt_asm_name: name of the assembly-unit that includes the alternate scaffold.
prim_asm_name: name of the assembly-unit on which the alternate scaffold is being placed. Expected to be 'Primary Assembly' in most cases.
alt_scaf_name: name of the alternate scaffold being placed
parent_type: type of object on which the alternate scaffold is being placed, either CHROMOSOME or SCAFFOLD
parent_name: name of the object on which the alternate scaffold is being placed (can be either a chromosome or a scaffold)
ori: orientation of the alignment, '+', '-', 'b' (mixed)
alt_scaf_start: start of the placement on the alternate scaffold (in 1 base coordinates)
alt_scaf_stop: end of the placement on the alternate scaffold (in 1 base coordinates)
parent_start: start of the placement on the parent sequence (in 1 base coordinates)
parent_stop: end of the placement on the parent sequence (in 1 base coordinates)
alt_start_tail: number of bases at the start of the alternate scaffold not involved in the placement
alt_stop_tail: number of bases at the end of the alternate scaffold not involved in the placement

It is expected that every alternate locus scaffold associated with the assembly will be listed in this file as a data integrity check. Any alternate scaffold that has no placement would have "na" in columns 4 to 12. Any alternate scaffold that has a chromosome assignment, but no placement, would have the chromosome name in column 5 and "na" in columns 6 to 12.
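The three kinds of row described above (placed, assigned to a chromosome without a placement, and unplaced) can be told apart by where "na" appears. This is a sketch under stated assumptions: the function name and the three labels are hypothetical, and the checks on placed rows (coordinates at least 1, tails non-negative) follow from the column definitions rather than from a published validator.

```python
from typing import List

COLUMNS: List[str] = [
    "alt_asm_name", "prim_asm_name", "alt_scaf_name", "parent_type",
    "parent_name", "ori", "alt_scaf_start", "alt_scaf_stop",
    "parent_start", "parent_stop", "alt_start_tail", "alt_stop_tail",
]


def classify_placement(line: str) -> str:
    """Return 'placed', 'assigned' or 'unplaced' for one tab-delimited row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(cols)}")
    row = dict(zip(COLUMNS, cols))
    if all(value == "na" for value in cols[3:]):
        return "unplaced"   # "na" in columns 4 to 12
    if row["parent_name"] != "na" and all(v == "na" for v in cols[5:]):
        return "assigned"   # chromosome in column 5, "na" in columns 6 to 12
    if row["parent_type"] not in {"CHROMOSOME", "SCAFFOLD"}:
        raise ValueError(f"bad parent_type: {row['parent_type']}")
    if row["ori"] not in {"+", "-", "b"}:
        raise ValueError(f"bad ori: {row['ori']}")
    for name in ("alt_scaf_start", "alt_scaf_stop", "parent_start", "parent_stop"):
        if int(row[name]) < 1:  # coordinates are 1-based
            raise ValueError(f"{name} must be at least 1")
    for name in ("alt_start_tail", "alt_stop_tail"):
        if int(row[name]) < 0:
            raise ValueError(f"{name} cannot be negative")
    return "placed"
```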

Alternate locus alignment file: File with the alignment of alternate locus scaffolds to a chromosome sequence.

Currently accepted formats:

Genomic Region Definition file: If an assembly has alternate locus scaffolds that have been put into a chromosome context, it is often convenient to define genomic regions, as many people will gain access to the alternate loci via their alignment to the chromosome. This file and "Alternate locus assignments to Genomic Region" are optional. If you do not provide a genomic region definition file, we will create Genomic Regions based on the placements provided. This is a tab delimited file.

region name: The name of the region. This should be no more than 64 characters and unique within the assembly.

chromosome: The chromosome on which the region is defined. This may be the name (e.g. chr1) or the sequence identifier (e.g. accession.version), but all records must be of the same type.

start: First coordinate of the region, in 1-base coordinates.

stop: Last coordinate of the region, in 1-base coordinates.

Alternate locus assignment to Genomic Region: A defined genomic region can contain more than one alternate locus scaffold. This file associates each alternate locus scaffold with a specific genomic region. This file is optional: if you do not provide a region assignment file, we will assign alternate locus scaffolds to regions based on the placements provided. This is a tab delimited file.

region name: This should have a corresponding entry in the Genomic Region definition file.

alt-locus scaffold name: This should correspond to the scaffold name used in the fasta, agp, etc.
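If both optional files are supplied, every region name in the assignment file should have an entry in the region definition file. The sketch below checks that, along with the 64-character and uniqueness rules for region names. The function names are hypothetical, and skipping lines that start with "#" is an assumption.

```python
from typing import Dict, List, Tuple

MAX_REGION_NAME = 64  # region names should be no longer than this


def read_regions(text: str) -> Dict[str, Tuple[str, int, int]]:
    """Return {region name: (chromosome, start, stop)}."""
    regions: Dict[str, Tuple[str, int, int]] = {}
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue  # assumption: '#' lines are comments
        name, chrom, start, stop = raw.split("\t")
        if len(name) > MAX_REGION_NAME:
            raise ValueError(f"region name too long: {name}")
        if name in regions:
            raise ValueError(f"region name not unique: {name}")
        regions[name] = (chrom, int(start), int(stop))
    return regions


def check_assignments(text: str,
                      regions: Dict[str, Tuple[str, int, int]]) -> List[str]:
    """List scaffolds assigned to a region that was never defined."""
    problems = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue
        region_name, scaffold = raw.split("\t")
        if region_name not in regions:
            problems.append(f"{scaffold}: unknown region {region_name}")
    return problems
```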

Annotation File: File with annotation. (Prokaryotic annotation guidelines) (Eukaryotic annotation guidelines)

Examples

Submitting a haploid assembly: submitting WGS contigs only

1. If you don't have a project ID, get one here: /genomes/mpfsubmission.cgi
2. Fill out the submission template form. This will be used later.
3. Generate your contig fasta files for the WGS contigs. If there is no chromosome mapping information, you can split the contigs arbitrarily, with no more than 10,000 sequences per file. Remember, the only Ns in these sequences are ambiguities, not gaps. In addition, remove any terminal Ns from the WGS contigs.
   Note: if a contig is known to be from a specific source (e.g. a plasmid or organelle) then include that information in the defline of the .fsa file. If the plasmid name is unknown, then use "unnamed".
   >contig_seqid1 [organelle=mitochondrion]
   >contig_seqid100 [plasmid=unnamed]
   >contig_seqid200 [plasmid=pBB1]
4. Generate your base level quality score files. While this information is not strictly required, it is strongly recommended.
5. Add relevant information (called 'source qualifiers') about the DNA that was sequenced. The relevant information is:
   1. Strain/breed/isolate/cultivar: this is the intra-species qualifier. There should be a type and ID.
   2. tissue-type
   3. developmental stage
   4. sex
   Add this information using a program called tbl2asn. (download) Here is an example, where template_file is the template from step 2:
   tbl2asn -p 'path_to_files' -t template_file -M n -j "[organism=Genus species] [breed=111] [sex=male]" -Z discrep
   Don't forget to check the .val files for validation errors and the discrep file for inconsistencies.
6. If you have annotation, prepare the relevant annotation files.
7. Submit! Upload the .sqn files (generated from the tbl2asn process) and any annotation files to GenomesMacroSend.
8. After you submit, send an email to genomes@ncbi.nlm.nih.gov. The email should contain the project ID, the GDSub number from GenomesMacroSend and the information in the meta data file for the assembly.
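The contig preparation step in this procedure (remove terminal Ns, then split into files of no more than 10,000 sequences) can be sketched as follows. The helper names are hypothetical, and the 70-character line width is an assumption chosen to stay under the recommended 80.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (SeqId, sequence)


def strip_terminal_ns(seq: str) -> str:
    """Remove leading and trailing Ns; internal ambiguity Ns are kept."""
    return seq.strip("Nn")


def format_record(name: str, seq: str, width: int = 70) -> str:
    """Return one FASTA record with sequence lines of at most `width` bases."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"


def batches(records: Iterable[Record], size: int = 10_000) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most `size` records, one group per file."""
    items = list(records)
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each batch can then be written to its own .fsa file by concatenating format_record(name, strip_terminal_ns(seq)) for every record in the batch.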

Submitting a haploid assembly: submitting WGS contigs and pseudo-molecules (chromosomes and/or scaffolds)

1. If you don't have a project ID, get one here: /genomes/mpfsubmission.cgi
2. Fill out the submission template form. This will be used later.
3. Generate your contig fasta files for the WGS contigs. If there is no chromosome mapping information, you can split the contigs arbitrarily, with no more than 10,000 sequences per file. Remember, the only Ns in these sequences are ambiguities, not gaps. In addition, remove any terminal Ns from the WGS contigs.
   Note: if a contig is known to be from a specific source (e.g. a plasmid or organelle) then include that information in the defline of the .fsa file. If the plasmid name is unknown, then use "unnamed".
   >contig_seqid1 [organelle=mitochondrion]
   >contig_seqid100 [plasmid=unnamed]
   >contig_seqid200 [plasmid=pBB1]
4. Generate your base level quality score files. While this information is not strictly required, it is strongly recommended.
5. Generate your AGP file. Remember to include as components in the AGP file all of the wgs-contigs that are considered to be part of the genome assembly and are >200bp. Therefore, some scaffolds may be singletons (have only a single component). If some of the wgs-contigs are not included in the AGP file, include a comment line in the AGP file to indicate why they are not included, e.g.:
   # x number of wgs-contigs are not in the AGP file because they are duplicates and are not considered part of this assembly.
6. Add relevant information (called 'source qualifiers') about the DNA that was sequenced. The relevant information is:
   1. Strain/breed/isolate/cultivar: this is the intra-species qualifier. There should be a type and ID.
   2. tissue-type
   3. developmental stage
   4. sex
   Add this information using a program called tbl2asn. (download) Here is an example, where template_file is the template from step 2:
   tbl2asn -p 'path_to_files' -t template_file -M n -j "[organism=Genus species] [breed=111] [sex=male]" -Z discrep
   Don't forget to check the .val files for validation errors and the discrep file for inconsistencies.
7. Generate the PAR definition file if the PAR region is known and applicable; if not, skip this step.
8. If you have annotation, prepare the relevant annotation files.
9. Submit! Upload the .sqn files (generated from the tbl2asn process) and any annotation files to GenomesMacroSend.
10. After you submit, send an email to genomes@ncbi.nlm.nih.gov. The email should contain the project ID, the GDSub number from GenomesMacroSend and the information in the meta data file for the assembly.
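The AGP step in this procedure asks that every wgs-contig longer than 200 bp appear as a component. A minimal sketch of that check follows. It assumes whitespace-separated AGP columns and that component types N and U mark gaps; the function names are hypothetical, and contig lengths are assumed to be supplied by the caller.

```python
from typing import Dict, List, Set

MIN_LENGTH = 200        # contigs longer than this should appear in the AGP
GAP_TYPES = {"N", "U"}  # component types that describe gaps


def agp_components(agp_text: str) -> Set[str]:
    """Collect the component identifiers (column 6) of all non-gap lines."""
    ids: Set[str] = set()
    for raw in agp_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) >= 6 and cols[4] not in GAP_TYPES:
            ids.add(cols[5])
    return ids


def missing_from_agp(contig_lengths: Dict[str, int], agp_text: str) -> List[str]:
    """Return contigs longer than MIN_LENGTH that no AGP line uses."""
    used = agp_components(agp_text)
    return sorted(
        name for name, length in contig_lengths.items()
        if length > MIN_LENGTH and name not in used
    )
```

Any contig this reports should either be added to the AGP file or be explained in an AGP comment line.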

Submitting a haploid assembly: pseudo-molecules (chromosomes and/or scaffolds) based on GenBank accessions

1. If you don't have a project ID, get one here: /genomes/mpfsubmission.cgi
2. Fill out the submission template form. This will be used later.
3. Generate your AGP file using the GenBank accessions as the component identifiers.
4. Generate the PAR Definition file if the PAR region is known; if not, skip this step.
5. If you have annotation, prepare the relevant annotation files.
6. Submit! Upload the .sqn files (generated from the tbl2asn process), AGP file and annotation files (if available) to GenomesMacroSend.
7. After you submit, send an email to genomes@ncbi.nlm.nih.gov. The email should contain the project ID, the GDSub number from GenomesMacroSend and the information in the meta data file for the assembly.

Submit a haploid+alts assembly: WGS contigs + pseudo-molecules (chromosomes and/or scaffolds)

1. If you don't have a project ID, get one here: /genomes/mpfsubmission.cgi
2. Fill out the submission template form. This will be used later.
3. Generate your contig fasta files for the WGS contigs. If there is no chromosome mapping information, you can split the contigs arbitrarily, with no more than 10,000 sequences per file. Remember, the only Ns in these sequences are ambiguities, not gaps. In addition, remove any terminal Ns from the WGS contigs.
   Note: if a contig is known to be from a specific source (e.g. a plasmid or organelle) then include that information in the defline of the .fsa file. If the plasmid name is unknown, then use "unnamed".
   >contig_seqid1 [organelle=mitochondrion]
   >contig_seqid100 [plasmid=unnamed]
   >contig_seqid200 [plasmid=pBB1]
4. Generate your base level quality score files. While this information is not strictly required, it is strongly recommended.
5. Add relevant information (called 'source qualifiers') about the DNA that was sequenced. The relevant information is:
   1. Strain/breed/isolate/cultivar: this is the intra-species qualifier. There should be a type and ID.
   2. tissue-type
   3. developmental stage
   4. sex
   Add this information using a program called tbl2asn. (download) Here is an example, where template_file is the template from step 2:
   tbl2asn -p 'path_to_files' -t template_file -M n -j "[organism=Genus species] [breed=111] [sex=male]" -Z discrep
   Don't forget to check the .val files for validation errors and the discrep file for inconsistencies.
6. Generate the PAR Definition file if the PAR region is known and this is applicable; if not, skip this step.
7. Generate the alternate locus placement files.
8. Generate the alternate locus alignment files (optional).
9. Generate the genomic region definition file (optional).
10. Generate the alternate locus assignment to genomic region file (optional).
11. If you have annotation, prepare the relevant annotation files.
12. Submit! Upload the .sqn files (generated from the tbl2asn process), AGP file and alt_assembly_placements.txt files (and PAR definition, region definition, alternate locus to genomic region file and annotation files if present) to GenomesMacroSend.
13. After you submit, send an email to genomes@ncbi.nlm.nih.gov. The email should contain the project ID, the GDSub number from GenomesMacroSend and the information in the meta data file for the assembly.

Submit a haploid+alts assembly: pseudo-molecules (chromosomes and/or scaffolds) based on GenBank accessions

1. If you don't have a project ID, get one here: /genomes/mpfsubmission.cgi
2. Fill out the submission template form. This will be used later.
3. Generate your AGP file.
4. Generate the PAR Definition file if the PAR region is known and this is applicable; if not, skip this step.
5. Generate the alternate locus placement files.
6. Generate the alternate locus alignment files (optional).
7. Generate the genomic region definition file (optional).
8. Generate the alternate locus assignment to genomic region file (optional).
9. If you have annotation, prepare the relevant annotation files.
10. Submit! Upload the .sqn files (generated from the tbl2asn process), AGP file and alt_assembly_placements.txt files (and PAR definition, region definition, alternate locus to genomic region file and annotation files if present) to GenomesMacroSend.
11. After you submit, send an email to genomes@ncbi.nlm.nih.gov. The email should contain the project ID, the GDSub number from GenomesMacroSend and the information in the meta data file for the assembly.

Submit a complete replicon

Instructions for submitting a complete replicon can be found here: genomesubmit.html.
