Sequin Help Documentation

Table of Contents


Introduction

Sequin is a program designed to aid in the submission of sequences to the GenBank, EMBL, and DDBJ sequence databases. It was written at the National Center for Biotechnology Information, part of the National Library of Medicine at the National Institutes of Health. This section of the help document provides a basic overview of how to submit sequences using the Sequin forms. Subsequent sections provide detailed instructions for entering information on each form.

The Help Documentation

The Sequin help documentation is available in both on-line and World Wide Web (http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html) formats. The text of the on-line version scrolls as you progress through the Sequin forms. Specific words or phrases can be identified with the "find" command at the top of the window. The on-line document can also be saved as a text file, or printed directly to a printer. Click on the window that contains the help documentation. Under the Sequin File menu, choose Export Help... to save the documentation as a text file. To print the documentation without saving it first, click on the help window, and choose Print from the Sequin File menu.

Organization of Forms

Information is entered into Sequin on a number of different forms. Each form is made up of pages, which are indicated by folder tabs at the top of the form. You can move to the desired page by clicking on the appropriate folder tab. You can also move between pages of a form by clicking on the "Next page" or "Prev page" buttons at the bottom of the screen. You can move to the previous form or the next form by clicking on the "Prev form" or "Next form" buttons on the first or last pages of a form, respectively.

There are numerous ways to enter information onto a page of a form, including text fields, radio buttons, check boxes, scrolling boxes, pop-up menus and spreadsheets.

You may also use tables to import annotation of source information. The formatting of these tables will be discussed below.

Overview of Sequin

If you are using Sequin for the first time, you will be prompted to fill out four forms: the Welcome to Sequin form, the Submitting Authors Form, the Sequence Format form, and the Organism and Sequences Form. After you have filled out these forms, a window will appear that contains the Sequin record viewer. This viewer allows you to access many other forms in which you can edit fields filled out in the initial forms, as well as add additional information. Detailed instructions on how to fill out the forms and use the record viewer are presented below.

Welcome to Sequin Form

First, indicate with one of the three radio buttons whether you are submitting the sequence to the GenBank, EMBL, or DDBJ database. If you are working on a sequence submission for the first time, click on "Start New Submission". If you are modifying an existing submission record, click on "Read Existing Record". If you would like to quit from Sequin, click on "Quit Program".

You can also "Read Existing Record" to read in a FASTA-formatted sequence file for analysis purposes. The sequence will be displayed in Sequin and can be analyzed with tools such as CDD Search, but it should not be submitted because it does not have the appropriate annotations.

If you are running Sequin in its network-aware mode, you will see another button labeled "Download from Entrez". This option allows you to update an existing database record using Sequin. The record will be downloaded from GenBank into Sequin using NCBI's Entrez retrieval system. The contents of the record will appear in Sequin, and you can edit them by updating the sequence or the annotations, as necessary. If you do not see the button labeled "Download from Entrez" on the Welcome to Sequin form, you are not running Sequin in its network-aware mode. To make Sequin network-aware, see the instructions later in the help documentation.

You can update only those records that you have submitted, not those submitted by others. To update an existing record, first select which of the databases you will be sending the update to. This should be the database to which the original record was submitted. If you do not know which database to use, send the record to GenBank and the NCBI staff will forward it to the appropriate database. Next, click on the button "Download from Entrez". Enter the nucleotide Accession number or GI of the sequence on the first form. Then enter "yes" if you are planning to submit the record as an update to one of the databases. Fill out the Submitting Authors form. Instructions for this form are found in the Sequin help documentation under "Edit Submitter Info" under the Sequin File menu. The record will then open in the record viewer. Explanations of how to add annotations or update sequences are presented in the documentation entitled "Editing the record" and Sequence Editor respectively. You will not see the Submitting Authors Form, the Sequence Format Form, or the Organism and Sequences Form. Note that updates, as well as new records, must be emailed to the appropriate database. Sequin does not support direct submission of records over the Internet.

Additional configuration options are available under the Misc menu. You can toggle between the stand-alone and network-aware modes of Sequin. The default mode of Sequin, which is sufficient for most sequence submissions, is stand-alone. In its network-aware mode, Sequin can exchange data with NCBI and, for example, retrieve sequences from Entrez and perform Taxonomy searches. The network-aware mode of Sequin is described in detail in the Net Configure section below. You can also start the NCBI DeskTop, which is for advanced Sequin users only.

Submitting Authors Form

Information from this form will be used as a citation for the sequence entry itself. It can contain the same information found in citations associated with the formal publication of the sequence.

On the bottom of each form are two buttons. Click "Prev form" (first page in a form) or "Prev page" (subsequent pages in a form) to go to the previous form or page. Click "Next Form" (last page on a form) or "Next Page" (earlier pages on a form) to move to the next form or page.

Form pages can also be saved individually by using the "Export" function under the File menu. If you are processing multiple submissions, you can use the "Import" function under the File menu to paste previously entered information directly on the page.

The Contact, Authors, and Affiliation pages can be saved as a block so that you can use this information for your next submission. For your first Sequin submission, fill in the requested information on the Submitting Authors form and proceed with the preparation of the submission. Choose Export Submitter Info under the File menu to export this to a file. You can then import this information in subsequent submissions using Import Submitter Info in the File menu. You will still need to fill in the manuscript title for each submission, however.

Submission Page

When May We Release Your Sequence Record?

Please select one of the two radio buttons. If you select "Immediately After Processing", the entry will be released to the public after the database staff has added it to the database. If you select "Release Date", fields will appear in which you can indicate the date on which the sequences should be released to the public. The submission will then be held back until formal publication of the sequence or GenBank Accession number, or until the release date, whichever comes first. The maximum hold time is five years.

Tentative Title for Manuscript

Please enter a title that appropriately describes the sequence entry. Later in the submission process, you will have the opportunity to change this information and add details for published or in press references.

Contact Page

Please enter the name, telephone and fax numbers, and email address of the person who is submitting the sequence. This is the person who will be contacted regarding the sequence submission. The phone, fax, and email address will not be visible in the database record, but are essential for contact by the database staff.

Authors Page

Please enter the names of the people who should receive scientific credit for the generation of sequences in this entry. The person on the Contact page is automatically listed as the first author. This information can be changed if necessary. The author names should be entered in the order first name, middle initial, surname. You can add as many authors to this page as you wish. After you type in the name of the third author, the box becomes a spreadsheet, and you can scroll down to the next line by using the space bar. The consortium box should only be used for consortium names, not institute or department names.

Affiliation Page

Please enter information about the principal institution where the sequencing was performed. This is not necessarily the same as the workplace of the person described on the Contact page. This information will show up in the reference section of the record, with the title Direct Submission.

Sequence Format Form

Use this form to indicate the type, format and category of sequence you are submitting.

Sequin can process single nucleotide sequences, gapped sequences and sets of related sequences. If the sequences are related in terms of coming from the same publication, or the same organism, they may be candidates for a Batch submission. Biologically related sequences may be classified as environmental samples, population, phylogenetic, mutation, or segmented sets as appropriate. Segmented sets consist of a collection of non-overlapping sequences covering a specific genetic region. In all cases, although the sequences are handled as a single submission, each sequence in a set will receive its own database Accession number and can be annotated independently.

Sequin can display the alignments of sequences that are submitted as part of an aligned phylogenetic, population, mutation, or environmental samples set. Such sequences can be submitted in FASTA, Contiguous (FASTA+GAP, NEXUS, MACAW), or Interleaved (PHYLIP, NEXUS) formats. If the sequences are in FASTA format, Sequin can generate an alignment. If the sequences have already been aligned in FASTA+GAP, PHYLIP, MACAW, or NEXUS, Sequin will not change the alignment. If one of the sequences in your alignment is already present in the GenBank/EMBL/DDBJ database, you must mark that sequence so that it does not receive a new Accession number. Instead of supplying that sequence with a new Sequence Identifier, give it the identifier accU12345, where U12345 is the Accession number of the sequence.

Single sequences, gapped sequences, segmented sequences, and batch submissions must be submitted in FASTA format.

Submission Type

Use the radio buttons to indicate which of the following types of submissions you are creating:

Sequence Data Format

If you are submitting a single, gapped, or segmented sequence, or a batch submission, your sequence must be in FASTA format, described below. If you are submitting a set of sequences as part of a population, phylogenetic, or mutation study, you have a choice of sequence formats. You may submit the set as individual sequences in FASTA format. Alternatively, you can submit the sequences as part of an alignment. Sequin currently accepts the alignment formats FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and NEXUS Contiguous.

Submission Category

Use the radio buttons to indicate whether your sequence corresponds to an original submission or a third-party annotation submission. If you have directly sequenced the nucleotide sequence in your laboratory, your submission would be considered an original submission.

If you have downloaded the sequence from GenBank and added your own annotations to it, your entry may be eligible for submission to the Third-Party Annotation Database (TPA).

In order to be released into the TPA database, the sequence must appear in a peer-reviewed publication in a biological journal. If you select this option, a pop-up box will appear upon the completion of the Sequence Format form. You must provide some description of the biological experiments used as evidence for the annotation of your TPA submission in this box.

You will be asked later in the submission process to provide the GenBank Accession number(s) of the primary sequence(s) from which your TPA submission was derived.

Organism and Sequences Form

This form is made up of four pages. If your sequences are imported as properly formatted FASTA files, minimal input will be necessary on these pages.

FASTA Format for Nucleotide Sequences

In FASTA format, the line before the nucleotide sequence, called the FASTA definition line, must begin with a greater-than symbol (">"), followed by a unique SeqID (sequence identifier). The SeqID must be unique for each nucleotide sequence and should not contain any spaces. Use of brackets ("[]") in the SeqID is also prohibited. The identifier will be replaced with an Accession number by the database staff when your submission is processed.

Information about the source organism from which the sequence was obtained follows the SeqID and must be in the format [modifier=text]. Do not put spaces around the "=". At minimum, the scientific name of the organism should be included. Optional modifiers can be added to provide additional information. A complete list of available source modifiers and their format is available.

The final optional component of the FASTA definition line is the sequence title, which will be used as the DEFINITION field in the final flatfile. The title should contain a brief description of the sequence. There is a preferred format for nucleotide and protein titles and Sequin can generate them automatically using the Generate Definition Line function under the Annotate menu in the record viewer.

Note that in all cases, the FASTA definition line must not contain any hard returns. All information must be on a single line of text. If you have trouble importing your FASTA sequences, please double check that no returns were added to the FASTA definition line by your editing software.

Examples of properly formatted FASTA definition lines for nucleotide sequences are:

>Seq1 [organism=Mus musculus] [strain=C57BL/6] Mus musculus neuropilin 1 (Nrp1) mRNA, complete cds.
>ABCD [organism=Plasmodium falciparum] [isolate=ABCD] Plasmodium falciparum isolate ABCD merozoite surface protein 2 (msp2) gene, partial cds.
>DNA.new [organism=Homo sapiens] [chromosome=17] [map=17q21] [moltype=mRNA] Homo sapiens breast and ovarian cancer susceptibility protein (BRCA1) mRNA, complete cds.
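
As an illustration only, the following Python sketch shows one way to check a definition line against the rules above before importing a file. It is not part of Sequin; the function name parse_defline and the regular expression are assumptions made for this example, and Sequin itself may be stricter or more lenient.

```python
import re

# One bracketed modifier such as [organism=Mus musculus] (assumed pattern).
MODIFIER_RE = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (seq_id, modifiers, title)."""
    line = line.rstrip("\r\n")
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    parts = line[1:].split(None, 1)
    # A line that starts directly with a modifier has no SeqID.
    if not parts or parts[0].startswith("["):
        raise ValueError("missing SeqID")
    seq_id = parts[0]
    if "[" in seq_id or "]" in seq_id:
        raise ValueError("SeqID must not contain brackets")
    rest = parts[1] if len(parts) > 1 else ""
    modifiers = {name.strip(): value.strip()
                 for name, value in MODIFIER_RE.findall(rest)}
    # Whatever remains after removing the modifiers is the optional title.
    title = MODIFIER_RE.sub("", rest).strip()
    return seq_id, modifiers, title
```

Calling parse_defline on the first example above returns the SeqID Seq1, the organism and strain modifiers, and the title text that follows them.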

The line after the FASTA definition line begins the nucleotide sequence. Unlike the FASTA definition line, the nucleotide sequence itself can contain returns. It is recommended that each line of sequence be no longer than 80 characters. Please only use IUPAC symbols within the nucleotide sequence. For sequences that are not contained within an alignment, do not use "?" or "-" characters. These will be stripped from the sequence. Use the IUPAC approved symbol "N" for ambiguous characters instead.
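
The sequence rules above can also be checked before import. The sketch below is a hypothetical helper, not a Sequin feature: it assumes the standard IUPAC nucleotide codes (including U) as the allowed set, reports any other characters such as "?" or "-", and rewraps a sequence into lines of at most 80 characters.

```python
# Standard IUPAC nucleotide codes; assumed here to be the accepted set.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def find_invalid_symbols(sequence):
    """Return a sorted list of characters that are not IUPAC nucleotide codes."""
    residues = "".join(sequence.split()).upper()
    return sorted(set(residues) - IUPAC_NUCLEOTIDES)


def wrap_sequence(sequence, width=80):
    """Remove existing whitespace and split the sequence into fixed-width lines."""
    residues = "".join(sequence.split())
    return [residues[i:i + width] for i in range(0, len(residues), width)]
```

If find_invalid_symbols returns a non-empty list for a sequence that is not part of an alignment, replace those characters with "N" as recommended above.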

A single file containing multiple FASTA sequences can be imported into Sequin in order to create a Batch Submission. Make sure that the FASTA definition line for each sequence is formatted as above.

If the FASTA definition line is not properly formatted, a pop-up box will appear upon importing the nucleotide FASTA. The top box in this pop-up will list any errors in the FASTA definition lines, including missing SeqIDs, duplicate SeqIDs for different sequences, or improperly formatted modifiers. You can add or edit this information in the spreadsheet provided. The toggle at the bottom of the pop-up allows you to select whether all sequences or only those with errors are listed in the spreadsheet above. After making changes, click on Refresh Error List to ensure that all errors have been corrected. You must correct any errors involving the SeqID in order to proceed with your submission.
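
To catch SeqID problems in a batch file before Sequin reports them, a check along the following lines may help. This is a minimal sketch under the assumption that the SeqID is the first whitespace-delimited token after ">"; the function name check_seq_ids and the message wording are invented for this example and do not reproduce Sequin's own error list.

```python
from collections import Counter


def check_seq_ids(deflines):
    """Return sorted (line_number, message) tuples for missing or duplicate SeqIDs."""
    problems = []
    ids = []
    for number, line in enumerate(deflines, start=1):
        body = line[1:].strip() if line.startswith(">") else line.strip()
        first = body.split(None, 1)[0] if body else ""
        # A definition line that starts with a modifier has no SeqID.
        if not first or first.startswith("["):
            problems.append((number, "missing SeqID"))
            ids.append(None)
        else:
            ids.append(first)
    counts = Counter(i for i in ids if i is not None)
    for number, seq_id in enumerate(ids, start=1):
        if seq_id is not None and counts[seq_id] > 1:
            problems.append((number, "duplicate SeqID " + seq_id))
    return sorted(problems)
```

An empty result means that every definition line has a SeqID and that no SeqID is used twice.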

FASTA Format for Segmented Sequence

Each segment of a segmented sequence must have its own SeqID, but the organism name and other modifiers are only indicated in the FASTA definition line of the first segment. Square brackets are used to delimit the members of the segmented set. For example,

[
>A-0V-1-Apart1 [organism=Gallus gallus] [clone=C]
TCACTCTTTGGCAAC
>A-0V-1-Apart2
GACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT 
]
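
The bracket-delimited layout can be read with a few lines of code. The following sketch is an assumption-laden illustration rather than Sequin's parser: it expects "[" and "]" on lines of their own, as in the example above, and the name parse_segmented_sets is hypothetical.

```python
def parse_segmented_sets(text):
    """Return a list of sets; each set is a list of (seq_id, sequence) segments."""
    sets = []
    current = None  # segments of the set being read, or None outside brackets
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "[":
            if current is not None:
                raise ValueError("nested '[' is not expected")
            current = []
        elif line == "]":
            if current is None:
                raise ValueError("']' without a matching '['")
            sets.append([tuple(segment) for segment in current])
            current = None
        elif current is None:
            raise ValueError("content found outside brackets")
        elif line.startswith(">"):
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("segment is missing its SeqID")
            current.append([fields[0], ""])
        else:
            if not current:
                raise ValueError("sequence data before a definition line")
            current[-1][1] += line
    if current is not None:
        raise ValueError("missing closing ']'")
    return sets
```

Each returned set corresponds to one segmented sequence, with its segments in file order.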

FASTA Format for Gapped Sequence

The FASTA definition line for a gapped sequence follows the same format as above. To indicate a gap within the sequence, enter a hard return within the sequence at the point of the gap, then insert an extra line starting with a greater-than symbol (">") and a question mark ("?"). If the gap size is unknown, enter "unk100" after the question mark. If the gap size is known, enter the length of the gap after the question mark. For example,

>Dobi [organism=Canis familiaris] [breed=Doberman pinscher]
AAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTA
GGAAAAAGGGCTGTTG
>?unk100
TGGATGACAGAAACCTTGTTGGTCCAAAATGCAAACCCAGATKGTAAGACCATTTTAAAAGCATTGGGTC
TTAGAAATAGGGCAACACAGAACAAAAAT
>?234
AAAAATAAAAGCATTAGTAGAAATTTGTACAGAACTGGAAAAGGAAGGAAAAATTTCAAAAATTGGGCCT
GAAAACCCATACAATACTCCGGG

will generate a sequence containing two gaps. The first gap is of unknown length, the second is 234 nucleotides long.
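
To make the gap notation concrete, the sketch below splits a gapped FASTA file into its sequence pieces and gaps. It is only an illustration: the function name parse_gapped_fasta is hypothetical, and an unknown-length gap is represented as None, which is a choice made for this example.

```python
def parse_gapped_fasta(text):
    """Return (defline, parts); parts holds ("seq", residues) and ("gap", length) tuples.

    A gap of unknown size (">?unk100") is reported with length None.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">") or lines[0].startswith(">?"):
        raise ValueError("file must begin with a FASTA definition line")
    defline = lines[0]
    parts = []
    current = []  # sequence lines collected since the last gap
    for line in lines[1:]:
        if line.startswith(">?"):
            if current:
                parts.append(("seq", "".join(current)))
                current = []
            size = line[2:]
            if size.lower().startswith("unk"):
                parts.append(("gap", None))
            else:
                parts.append(("gap", int(size)))
        elif line.startswith(">"):
            raise ValueError("only one definition line is expected")
        else:
            current.append(line)
    if current:
        parts.append(("seq", "".join(current)))
    return defline, parts
```

Applied to the example above, this yields three sequence pieces separated by one gap of unknown length and one gap of 234 nucleotides.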

FASTA+GAP Format for Aligned Nucleotide Sequences

A number of programs output sets of aligned sequences in FASTA format. Frequently, to align these sequences, gaps must be inserted. The default alignment settings should correctly interpret gap and ambiguous characters in most cases. If Sequin cannot read your alignment, you may need to change these settings using the Optional Alignment Settings button on the Nucleotide Page form. Each sequence, including gaps, must be the same length. The gaps will only show up in the alignment, not in the individual sequence in the database.

Sequences in FASTA+GAP format resemble FASTA sequences. The previous section on FASTA Format for Nucleotide Sequences has instructions for formatting FASTA sequences. If one of the sequences in your alignment is already present in the GenBank/EMBL/DDBJ database, you must mark that sequence so that it does not receive a new Accession number. To do this, use a SeqID in the format accU12345, where U12345 is the Accession number of the pre-existing sequence. All sequences in FASTA+GAP format should be in the same file.

The following is an example of FASTA+GAP format:

>A-0V-1-A [organism=Gallus gallus] [clone=C]
TCACTCTTTGGCAACGACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

>A-0V-2-A [organism=Drosophila melanogaster] [strain=D]
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

>A-0V-3-A [organism=Caenorhabditis elegans] [strain=E]
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

>A-0V-4-A [organism=Rattus norvegicus] [strain=F]
TCACTCTTTGGCAACGACCCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

>A-0V-7-A [organism=Aspergillus nidulans] [strain=G]
TCACTCTTTGGCAACGACCAGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT
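
Because every sequence in a FASTA+GAP file must have the same length once gaps are counted, it can be useful to verify this before import. The following sketch is a hypothetical check, not Sequin's own reader; it assumes "-" is the gap character and that the SeqID is the first token on each definition line.

```python
def read_fasta_gap(text):
    """Return a dict mapping each SeqID to its aligned sequence (gaps kept)."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("definition line is missing its SeqID")
            seq_id = fields[0]
            records[seq_id] = ""
        elif seq_id is None:
            raise ValueError("sequence data before first definition line")
        else:
            records[seq_id] += line
    return records


def alignment_lengths_match(records):
    """True when all aligned sequences, gaps included, have the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


def ungapped(sequence, gap_chars="-"):
    """Return the sequence with gap characters removed."""
    return "".join(ch for ch in sequence if ch not in gap_chars)
```

The ungapped helper mirrors the statement above that gaps appear only in the alignment and not in the individual sequences.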

PHYLIP Format for Aligned Nucleotide Sequences

A number of programs output sets of aligned sequences in PHYLIP format.

The following is an example of PHYLIP format.

     5    100
A-0V-1-A   TCACTCTTTG GCAACGACCC GTCGTCATAA TAAAGATAGA GGGGCAACTA
A-0V-2-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-3-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-4-A   TCACTCTTTG GCAACGACCC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-7-A   TCACTCTTTG GCAACGACCA GTCGTCACAA TAAAGATAGA GGGGCAACTA


           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT

In this example, the first line indicates that there are 5 sequences, each with 100 nt of sequence. The following five lines contain the Sequence IDs, followed by the sequences. Specifically, the sequence identifier for the first sequence is A-0V-1-A. Note that subsequent blocks of sequence do not contain the Sequence ID. If one of the sequences in your alignment is already present in the GenBank/EMBL/DDBJ database, you must mark that sequence so that it does not receive a new Accession number. To do this, use a SeqID in the format accU12345, where U12345 is the Accession number of the pre-existing sequence.
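
The layout described above can be read with a short script. This sketch assumes a relaxed PHYLIP layout like the example (Sequence IDs without spaces, separated from the sequence by whitespace, and later blocks without IDs); the function name parse_phylip_interleaved is hypothetical and strict fixed-width PHYLIP names are not handled.

```python
def parse_phylip_interleaved(text):
    """Return a dict of SeqID -> aligned sequence from interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: number of sequences, then number of characters per sequence.
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    ids = []
    seqs = []
    # First block: each line starts with the Sequence ID.
    for line in lines[1:1 + ntax]:
        fields = line.split()
        ids.append(fields[0])
        seqs.append("".join(fields[1:]))
    # Later blocks: no IDs, so lines map to sequences by position.
    for index, line in enumerate(lines[1 + ntax:]):
        seqs[index % ntax] += "".join(line.split())
    alignment = dict(zip(ids, seqs))
    for seq_id, seq in alignment.items():
        if len(seq) != nchar:
            raise ValueError("%s has %d characters, expected %d"
                             % (seq_id, len(seq), nchar))
    return alignment
```

The length check at the end compares each assembled sequence with the count given on the first line of the file.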

The default alignment settings should correctly interpret gap and ambiguous characters in most cases. If Sequin cannot read your alignment, you may need to change these settings using the Optional Alignment Settings button on the Nucleotide Page form.

You can modify the PHYLIP format so that Sequin can determine the correct organism and any other modifiers for each sequence. An example of such modifications is below in the section on Source Modifiers for PHYLIP and NEXUS.

Alternatively, you can leave your sequence alignment in standard PHYLIP format and enter the organism, strain, chromosome, etc. information on the following Source Modifiers form.

NEXUS Format for Aligned Nucleotide Sequences

A number of programs output sets of aligned sequences in one of two NEXUS formats, NEXUS Interleaved and NEXUS Contiguous.

NEXUS files can contain ? for "missing" at the 5' and 3' ends of sequences, as long as this parameter is properly defined within the header of the NEXUS file.

The following is an example of NEXUS Interleaved format.

#NEXUS

begin data;
   dimensions ntax=5 nchar=100;
   format datatype=dna missing=? gap=- interleave;
   matrix

A-0V-1-A   TCACTCTTTG GCAACGACCC GTCGTCATAA TAAAGATAGA GGGGCAACTA
A-0V-2-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-3-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-4-A   TCACTCTTTG GCAACGACCC GTCGTCACAA T????ATAGA GGGGCAACTA
A-0V-7-A   TCACTCTTTG GCAACGACCA GTCGTCACAA TAAAGATAGA GGGGCAACTA


A-0V-1-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-2-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-3-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-4-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-7-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT

In this example, the first few lines provide information about the data in the sequence alignment. The following five lines contain the Sequence IDs, followed by the sequences. Specifically, the sequence identifier for the first sequence is A-0V-1-A. Note that subsequent blocks of sequence also contain the Sequence ID. If one of the sequences in your alignment is already present in the GenBank/EMBL/DDBJ database, you must mark that sequence so that it does not receive a new Accession number. To do this, use a SeqID in the format accU12345, where U12345 is the Accession number of the pre-existing sequence. Also, Sequin will replace the "?" characters in the sequences with "N"s since they are defined as "missing" data in the header. The default alignment settings should correctly interpret gap and ambiguous characters in most cases. If Sequin cannot read your alignment, you may need to change these settings using the Optional Alignment Settings button on the Nucleotide Page form.
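
The role of the "missing" and "gap" declarations in the NEXUS header can be sketched as follows. This is an illustrative reading of the format line only, not a full NEXUS parser and not Sequin's implementation; the function names nexus_symbols and replace_missing are hypothetical.

```python
import re


def nexus_symbols(header_text):
    """Return (missing, gap) symbols declared in the header; None if undeclared."""
    found = {}
    for key in ("missing", "gap"):
        match = re.search(key + r"\s*=\s*(\S)", header_text, flags=re.IGNORECASE)
        found[key] = match.group(1) if match else None
    return found["missing"], found["gap"]


def replace_missing(sequence, missing):
    """Replace the declared missing-data symbol with N; leave the rest unchanged."""
    return sequence.replace(missing, "N") if missing else sequence
```

With the header above, the declared missing symbol is "?", so a fragment such as T????ATAGA would be read as TNNNNATAGA.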

You can modify either NEXUS format so that Sequin can determine the correct organism and any other modifiers for each sequence. An example of such modifications is below in the section on Source Modifiers for PHYLIP and NEXUS.

Alternatively, you can leave your sequence alignment in standard NEXUS format and enter the organism, strain, chromosome, etc. information on the following Source Modifiers form.

The following is an example of NEXUS Contiguous format.

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=5 NCHAR=100;
FORMAT MISSING=? GAP=- DATATYPE=DNA ;
MATRIX

A-0V-1-A
TCACTCTTTGGCAACGACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

A-0V-2-A
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

A-0V-3-A
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

A-0V-4-A
TCACTCTTTGGCAACGACCCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

A-0V-7-A
TCACTCTTTGGCAACGACCAGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

In this example, the first few lines provide information about the data in the sequence alignment. Each Sequence ID then appears on its own line, followed by the sequence for that ID. Specifically, the sequence identifier for the first sequence is A-0V-1-A. Unlike the Interleaved format, each sequence is given in full after its Sequence ID rather than split across blocks. If one of the sequences in your alignment is already present in the GenBank/EMBL/DDBJ database, you must mark that sequence so that it does not receive a new Accession number. To do this, use a SeqID in the format accU12345, where U12345 is the Accession number of the pre-existing sequence.

You can modify either NEXUS format so that Sequin can determine the correct organism and any other modifiers for each sequence. An example of such modifications is below in the section on Source Modifiers for PHYLIP and NEXUS.

Alternatively, you can leave your sequence alignment in standard NEXUS format and enter the organism, strain, chromosome, etc. information on the following Source Modifiers form.

Source Modifiers for PHYLIP and NEXUS

You can modify the PHYLIP or NEXUS formats so that Sequin can determine the correct organism and any other modifiers for each sequence by adding lines at the end of the file. The first line applies to the first sequence, the second line to the second sequence, and so on. You must have one line for each sequence. These inserted lines contain modifiers formatted as in the FASTA definition line, but do not begin with a SeqID. Instead, the SeqID is present at the beginning of the sequence lines as shown above.

Each of the initial lines starts with the character ">". The scientific organism name follows in brackets. Optional modifiers also follow in brackets. For further information on the data that can go in the lines preceding the sequences, see the instructions entitled "FASTA Format for Nucleotide Sequences", above.

The following lines indicating the organisms and strain of each sequence would follow immediately after the sequence in the PHYLIP and NEXUS examples, above.

;
END;

begin ncbi;
sequin
>[organism=Gallus gallus] [clone=C]
>[organism=Drosophila melanogaster] [strain=D]
>[organism=Caenorhabditis elegans] [strain=E]
>[organism=Rattus norvegicus] [strain=F]
>[organism=Aspergillus nidulans] [strain=G]
;
end;

The number of lines of source information must exactly match the number of sequences provided. Complete examples can be found in the Alignment Formats section of the Sequin Quick Guide.
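
If you keep source information in a script or spreadsheet export, the block above can be generated rather than typed. The sketch below is one possible way to do so; the function name source_modifier_block and its arguments are assumptions for this example, and the output simply mirrors the layout shown above.

```python
def source_modifier_block(sources, n_sequences):
    """Build the 'begin ncbi' block from one dict of modifiers per sequence.

    sources must be in the same order as the sequences in the alignment.
    """
    if len(sources) != n_sequences:
        raise ValueError("need exactly one line of source information per sequence")
    lines = ["begin ncbi;", "sequin"]
    for modifiers in sources:
        if "organism" not in modifiers:
            raise ValueError("each line needs at least the organism name")
        # Put the organism first, then any optional modifiers.
        ordered = [("organism", modifiers["organism"])]
        ordered += [(k, v) for k, v in modifiers.items() if k != "organism"]
        lines.append(">" + " ".join("[%s=%s]" % pair for pair in ordered))
    lines += [";", "end;"]
    return "\n".join(lines)
```

The n_sequences argument exists only to enforce the rule that the number of source lines matches the number of sequences.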

Alternatively, you can leave your sequence alignment in standard NEXUS or PHYLIP format and enter the organism, strain, chromosome, etc. information on the following Organism Page.

Importing Aligned Sets of Segmented Sequences

Sequin can also read segmented sets that are part of an alignment if the sequences are in FASTA or FASTA+GAP format. Each segment should have its own Sequence ID, but organism name and source modifiers should only be indicated for the first segment from each sequence. Square brackets are used to delimit the members of a set. For example,

[
>A-0V-1-Apart1 [organism=Gallus gallus] [strain=C]
TCACTCTTTGGCAAC
>A-0V-1-Apart2
GACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT
]
[
>A-0V-2-Apart1 [organism=Drosophila melanogaster] [strain=D]
TCACTCTTTGGCAAC
>A-0V-2-Apart2
GAAGCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT
]

Nucleotide Page

The options on this page will vary depending on the Submission Type and Sequence Data Format selected earlier. Segmented sets and gapped sequences must be imported as properly formatted FASTA files. Details about importing alignment files are below.

Nucleotide Page for FASTA Data Format

Create Alignment

If you have selected a Population study, Phylogenetic study, Mutation study, or Environmental samples set as a Submission Type, a check box will appear at the top of the Nucleotide Page. If you check 'Create Alignment', Sequin will attempt to generate an alignment of the sequences within your submission.

Import Nucleotide FASTA

Use this button to import your properly formatted FASTA file. You will see a window containing information about the imported sequence(s). Please check the number of sequences, Sequence IDs (SeqIDs) and length of each sequence to make sure they are correct. If you have included source information within the FASTA definition line, this will also be listed.

Add/Modify Sequences

This option allows you to add or modify sequences without using a previously formatted FASTA file, but is not available if you have selected a Segmented sequence or Gapped sequence as a Submission Type. On the Specify Sequences box you can either import a nucleotide FASTA or add a new sequence. If you choose Add New Sequence, a new box will pop up where you can either import an existing sequence file or directly paste or type the nucleotide sequence.

If you add a sequence where the FASTA definition line is not properly formatted, a pop-up box will appear. The top box in this pop-up will list any errors in the FASTA definition lines, including missing SeqIDs, duplicate SeqIDs for different sequences, or improperly formatted modifiers. You can add or edit this information in the spreadsheet provided. The toggle at the bottom of the pop-up allows you to select whether all sequences or only those with errors are listed in the spreadsheet above. After making changes, click on Refresh Error List to ensure that all errors have been corrected. You must correct any errors involving the SeqID in order to proceed with your submission. Click on Accept to save your sequences and return to the Specify Sequences box.

In the Specify Sequences box, you can choose to add another sequence or select a sequence from the list and choose to edit or delete it. You can also delete all sequences at this point. You will need to click on Done to save your sequences and return to the Nucleotide Page.

Clear Sequences

This option will remove all imported nucleotide sequences.

Specify Molecule

A database sequence can represent one of several different molecule types. The default molecule is genomic DNA. If the sequence was not derived from genomic DNA, you can edit that information here. If you are submitting multiple sequences you can apply one molecule type to all sequences or apply the molecule type to each sequence individually. Enter in the Molecule pop-up menu the type of molecule that was sequenced.

Specify Topology

Most sequences have a Linear topology and this is the default. You should change this setting to Circular only if the sequence is complete and it has a circular topology. For example, a complete plasmid or a complete mitochondrial genome would have a Circular topology, but a single gene from a plasmid or mitochondrion would have a Linear topology. If you are submitting multiple sequences you can apply one topology to all sequences or set the topology for each sequence individually.

Nucleotide Page for Aligned Data Formats

Sequence Characters

If you are submitting a set of aligned sequences, you can specify sequence characters used in your alignment here. Sequin requires that you define any non-IUPAC nucleotide characters in your alignment file. The five types of variable characters are listed under Sequence Characters.

Every sequence within an alignment file must contain the same number of characters (nucleotides + gaps). Gap characters are used to represent the spaces between contiguous nucleotides in an alignment. Gaps that appear at the beginning or end of a sequence are treated differently than gaps that appear between nucleotides and each must be defined. GenBank prefers to use a hyphen (-) to represent gaps. If you use a different character to represent a gap, you will need to add this character to the list in the Beginning Gap, Middle Gap, or End Gap boxes.

Ambiguous characters represent nucleotides that are known to exist, but whose identity has not been experimentally validated. GenBank prefers to use 'n' to represent any ambiguous nucleotides. If you are using a different character to represent an ambiguous base, you will need to add this character to the list in the Ambiguous/Unknown box. Sequin will convert these characters to 'n's when your file is imported.

Match characters denote nucleotides that are identical in every member of an alignment. GenBank prefers the use of a colon (:) to represent match characters. If you are using a different character to represent a match character, you will need to add this character to the list in the Match box.
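
The requirements above can be checked before you import an alignment. The following is a minimal Python sketch, not part of Sequin: the function names and the sample alignment are hypothetical, and '?' is assumed as the non-standard ambiguity character. It reports rows whose length differs from the first row and replaces the ambiguity character with 'n', similar in spirit to the conversion Sequin performs on import.

```python
def check_alignment_lengths(rows):
    """Return the SeqIDs whose row length differs from the first row.

    rows: dict mapping a SeqID to its aligned sequence (nucleotides + gaps).
    """
    if not rows:
        return []
    lengths = {seq_id: len(seq) for seq_id, seq in rows.items()}
    expected = next(iter(lengths.values()))
    return [seq_id for seq_id, n in lengths.items() if n != expected]


def normalize_ambiguous(seq, ambiguous_chars="?"):
    """Replace each character listed in ambiguous_chars with 'n'."""
    return "".join("n" if ch in ambiguous_chars else ch for ch in seq)


# Hypothetical three-row alignment using '-' for gaps and '?' for unknown bases.
alignment = {
    "Seq1": "--acgt?acg-",
    "Seq2": "ttacgtaacgt",
    "Seq3": "ttacgta-cg",  # one character short of the other rows
}

bad_rows = check_alignment_lengths(alignment)
cleaned = {sid: normalize_ambiguous(s) for sid, s in alignment.items()}
print(bad_rows)
print(cleaned["Seq1"])
```

Any SeqID reported in bad_rows would need to be padded or corrected in the alignment file before import.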

Import Nucleotide Alignment

Once you have imported the alignment using the Import Nucleotide Alignment button, you can edit the molecule information using the Specify Molecule and Specify Topology buttons explained above. Note that you cannot access the Add/Modify Sequences dialog for submissions of aligned sequences.

Organism Page

Information about the organism from which the sequence was derived should be entered or edited on this page. If there are any potential problems with the organism information previously provided in either the FASTA definition line or entered in the Add/Modify Sequences dialog, a window listing these problems will appear at the top of the form. Please review these problems and edit using the Add Source Modifiers button as necessary. At minimum, you must supply the scientific name of the organism from which the sequence was obtained in order to proceed with your submission.

The second window is a summary of the organism information provided so far. Double clicking on a line of text within this window will launch a modifier-specific editing window. In each of these windows, you can edit the available information for the specific modifier. In most cases, you have the choice to edit the modifier for each sequence separately, or to enter text and select Apply above value to all sequences. These changes will be reflected in the windows of the Organism page immediately upon closing the modifier-specific editor.

Add Organisms, Locations, and Genetic Codes

If you have not added organism information using either the FASTA definition line or the Add/Modify Sequences dialog, you can use the Add Organisms, Locations, and Genetic Codes button to do so at this point. This button will launch the Multiple Organism Editor pop-up where you may add or edit existing information concerning the Organism name, Location, and Genetic Code. The SeqID of each sequence is listed at the left of the spreadsheet. You can change the information in the spreadsheet individually or globally for all sequences.

Organism

The scrollable list at the top of the pop-up contains the scientific names of many organisms. To reach a name on the list, type the first few letters of the scientific name into the box above the list or the appropriate box in the spreadsheet. The list will scroll to the names beginning with those letters, and you can select the organism within the list itself. You can then use the arrow button to copy this name into the appropriate box in the spreadsheet.

To apply the same scientific name to all sequences in the submission, click on the Organism button in the spreadsheet column header. A separate pop-up box will appear with the same organism list. You can select a name from this list and choose Accept to apply this name to all sequences.

If you have any questions about the scientific name of an organism, see the NCBI Taxonomy Browser.

If the name of the organism is not on the list, type it in directly. If you do not know the scientific name, please be as specific as you can and include a unique identifier, such as a clone, isolate, strain or voucher number, or cultivar name, e.g., Nostoc ATCC29106, uncultured spirochete Im403, Lauraceae sp. Vásquez 25230 (MO), Rosa hybrid cultivar 'Kazanlik'. Also, if applicable, indicate if the name is unpublished as of the time of submission. Additional information such as strain, isolate, or serotype can be entered later in the submission process.

Location

The default Location for all sequences is "Genomic". If the sequence is not genomic, select the alternative location (i.e., organelle) from the pull-down list. You can change the location of all sequences globally by clicking on the Location button in the spreadsheet header. The following is a brief description of the choices in this list:

Genetic Code

If you selected a scientific organism name from the scrollable list described above, this field will be filled out automatically. However, if the organism is not on the list, this field will default to the "Standard" genetic code. If this is incorrect, you can select the correct genetic code from the pull-down list. To globally change the genetic code for all sequences which are not automatically filled out, click on the Genetic Code button in the spreadsheet header.

For more information regarding the genetic codes available, see the NCBI Taxonomy page.

Import Source Modifiers

Using this button allows you to import a tab-delimited table of source modifiers. The first column in the table must contain the Sequence Identifiers (SeqIDs) used earlier in the submission and each subsequent column must contain a different source modifier. The first row in the table must contain the labels for each column. The label for the Sequence Identifiers column should be in the format "Seq_ID". A list of modifiers in the format to be used in the column headers is available.
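
As an illustration of the table layout described above, here is a minimal Python sketch that writes a tab-delimited source modifier table with a header row. It is not part of Sequin. The helper name is hypothetical, and the column labels "strain" and "isolate" are assumptions chosen for the example; consult the list of modifiers for the labels Sequin actually accepts.

```python
import csv
import io


def build_modifier_table(rows, modifier_columns):
    """Return a tab-delimited table: a header row, then one row per sequence.

    rows: list of dicts, each with a "Seq_ID" key plus modifier values.
    modifier_columns: column labels for the modifiers (one modifier per column).
    """
    header = ["Seq_ID"] + list(modifier_columns)
    if len(set(header)) != len(header):
        raise ValueError("each column must hold a different source modifier")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)  # first row: the column labels
    for row in rows:
        # Missing values are written as empty cells.
        writer.writerow([row.get(col, "") for col in header])
    return out.getvalue()


# Hypothetical values; the labels "strain" and "isolate" are assumed here.
table = build_modifier_table(
    [
        {"Seq_ID": "Seq1", "strain": "ATCC29106"},
        {"Seq_ID": "ABCD", "isolate": "Im403"},
    ],
    ["strain", "isolate"],
)
print(table)
```

The SeqIDs in the first column must match those used earlier in the submission; the sketch does not check this.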

Add Source Modifiers

Using this button will launch the Specify Source Modifiers pop-up box where you can add or edit any source modifier. You can also import a source modifier table or export the existing source modifiers in table format from this page.

The Select Modifier dialog allows you to select a modifier from the pull-down list and edit the value of this modifier for each sequence or globally add a value to all sequences.

The two windows in this pop-up provide information about the current source modifiers for the sequences in your submission. The top window provides a summary of these modifiers and the lower window lists the values of each modifier for each sequence. If any sequences have missing organism names or have source information that is identical to another sequence, the SeqIDs will be shown in red in this window. Double-clicking on a modifier value in this window will launch a pop-up where you can edit this value. Double-clicking on the modifier name used in the header will launch a modifier-specific pop-up where you can globally edit the modifier value for all sequences or change the value for individual sequences.

Clear All Source Modifiers

This button will clear all modifiers previously entered in either the FASTA definition lines or the submission dialogs. This includes the organism name which is required for submission.

Protein Page

This page allows you to provide the protein sequence translated from the nucleotide sequence that you just entered. If the nucleotide sequence is alternatively spliced or contains multiple open reading frames, enter all of the protein sequences on this page. Each protein sequence will appear in the database record as a coding sequence (CDS) feature. Sequin will automatically determine which nucleotide sequences code for the protein and indicate the nucleotide sequence interval on the database record. Sequin also provides tools that allow you to view a graphical representation of all the open reading frames in your nucleotide sequence and to convert these reading frames into CDS features. These tools are described later in the help documentation under the ORF Finder.

Conceptual Translation Confirmed by Peptide Sequencing

Most protein entries are computer-generated conceptual translations of a nucleic acid sequence. If you have confirmed this translation by direct sequencing either of the entire protein or of peptides derived from the protein, please check this box.

Incomplete at NH3 end/Incomplete at COOH end

If the sequence is lacking amino acids at the amino- or carboxy-terminal end of the protein, please check the appropriate box.

Create Initial mRNA with CDS Intervals

If you check this box, Sequin will make an mRNA feature with the same initial intervals (i.e., range of sequence) as the CDS feature. After the record has been assembled, you should edit the mRNA feature location to add the 5' UTR and 3' UTR intervals. This may be done either in the mRNA editor or in the sequence editor.

Import Protein FASTA

You can import one or more protein sequences contained within a previously generated protein FASTA file.

FASTA Format for Protein Sequences

The basic FASTA format is the same as that used for nucleotide sequences, with a FASTA definition line followed by the sequence itself.

In order to match the protein sequence to the correct nucleotide sequence, you must use the same Sequence Identifier (SeqID) that you used to identify the nucleotide sequence. Thus in cases of alternatively spliced genes, a single protein FASTA file can contain two unique sequences that have the same SeqID. Both coding regions will be added to the same nucleotide sequence.

The modifiers available for a protein FASTA definition line differ from those for a nucleotide FASTA definition line. They are limited to information about the protein or gene itself and are shown in the examples below. The format remains [modifier=text].

Note that in all cases, the FASTA definition line must not contain any hard returns. All information must be on a single line of text.

Examples of properly formatted protein FASTA definition lines are:

>Seq1 [protein=neuropilin 1] [gene=Nrp1]
>ABCD [protein=merozoite surface protein 2] [gene=msp2] [protein_desc=MSP2]
>DNA.new [protein=breast and ovarian cancer susceptibility protein] [gene=BRCA1] [note=breast cancer 1, early onset]

The protein name should be included in the entry; all other fields are optional.
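
To make the structure of these definition lines concrete, the following Python sketch splits a line into its SeqID and its [modifier=text] pairs. It is an illustration only, not the parser Sequin uses; the function name and the regular expression are assumptions based on the examples above.

```python
import re

# Matches one [modifier=text] pair; assumes ']' never appears inside the text.
MODIFIER_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (seq_id, modifiers dict)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must begin with '>'")
    if "\n" in line.rstrip("\n"):
        raise ValueError("the definition line must be a single line of text")
    body = line[1:].strip()
    first = body.split(None, 1)[0] if body else ""
    # If the line starts directly with a modifier, no SeqID was supplied.
    seq_id = "" if first.startswith("[") else first
    modifiers = {k.strip(): v.strip() for k, v in MODIFIER_RE.findall(body)}
    return seq_id, modifiers


seq_id, mods = parse_defline(
    ">ABCD [protein=merozoite surface protein 2] [gene=msp2] [protein_desc=MSP2]"
)
print(seq_id, mods)
if "protein" not in mods:
    print("warning: the protein name should be included")
```

A check like the last two lines can flag entries that omit the protein name before the file is imported.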

The line after the FASTA definition line begins the amino acid sequence. It is recommended that each line of sequence be no longer than 80 characters. Please only use IUPAC symbols within the amino acid sequence. Non-IUPAC amino acid symbols will be stripped from the sequence.

After you import your sequence, a window will appear with information about the sequence. The first line will describe the number of protein sequences imported and the total length in amino acids of all sequences. Each sequence is numbered, and its length, unique identifier (SeqID), Gene symbol, Protein name, and title (Definition line) as supplied in the FASTA definition line are listed.

Annotation Page

Note: This page will not be available if you have selected a segmented or gapped sequence as the Submission Type.

On this page, you can add a gene, ribosomal RNA, or CDS feature across the entire span of each sequence you are submitting. You cannot specify locations within each sequence using this page. More options are available under the Annotate Menu in the record viewer.

If the feature should be partial at one or both ends, check the appropriate box and then fill in the text boxes for the relevant feature.

You may add a title to all sequences if this was not included in the FASTA definition line. This will be used as the DEFINITION field in the final flatfile. The title should contain a brief description of the sequence. There is a preferred format for nucleotide and protein titles and Sequin can generate them automatically using the Generate Definition Line function under the Annotate menu in the record viewer.

Assembly Tracking

You will only see this form if you had previously indicated that the entry is a Third-Party Annotation submission. You must provide the GenBank Accession number(s) of the primary sequence used to assemble your TPA sequence. We cannot accept primary sequences corresponding to Reference Sequences or those from proprietary databases. More information about this can be found on the TPA home page.

If a proper GenBank Accession is entered in the first column of the Assembly Tracking form, the GenBank staff can map the coordinates for you. You do not need to fill out the 'from' and 'to' columns. Note that multiple accessions may be entered to provide full coverage of the assembled sequence.

If the accession entered is not recognized as a GenBank Accession number, a pop-up box is generated requesting that you edit the numbers listed. Sequences from the trace archive can be used as primary sequence data for TPA records but must be entered in the format "TI123456789".

You may also generate an Assembly Tracking form in the record viewer under the Annotate menu. Select Descriptors and TPA Assembly from the pull-down menu in order to generate the Assembly Tracking form.

Editing the Record

Overview

After you finish the Organism and Sequences Form, Sequin will process your entry based on the information you have entered. The window you see now is called the record viewer. This is also the window you will see if you are submitting an update to an existing record. The instructions after this point are the same whether you are submitting a new record or an update.

In the default window of the record viewer, you will see your entry approximately as it would appear in the database. Most of the information that you entered earlier in the submission process is present in the viewer; other information, such as the contact, is still present in the record but will not be visible in the database entry. If you have provided a conceptual translation of the nucleotide sequence, the translation will be listed as a CDS Feature. Sequin automatically determines which nucleotides encode the protein, and lists them, even if the nucleotide sequence contains introns and exons.

You can save the entry to a file by selecting Save or Save As under the File menu. This is not the same as saving the entry for submission to the database. It is a good idea to save the file at this point so that if you make any unwanted changes during the editing process you can revert to the original copy. If you wish to edit the entry later, click on "Read Existing Record" on the Welcome to Sequin form and choose the file.

It is likely that the entry could be processed now for submission to the database. However, you may wish to add information to the entry. This information may be in the form of Descriptors or Features. Descriptors are annotations that apply to an entire sequence, or an entire set of sequences, and Features are annotations that apply to a specific sequence interval. For example, you may want to change the Reference Descriptor to add a published manuscript, or to annotate the sequence by adding features such as a signal peptide or polyA signal.

Information in the record viewer can be edited in different ways. One way to modify information is to double click within the block of information you wish to edit. Many blocks, such as "Definition", "Source", "Reference", or "Features" can be edited.

To add information, create a new descriptor or feature by selecting the appropriate form from the Misc or Features menus. These options are described later in this help document.

Finally, you may need to edit the sequence itself. Instructions for working with the sequence are presented in the documentation for the Sequence Editor.

Submitting the Finished Record to the Database

Once you are satisfied that you have added all the appropriate information, you must process your entry for submission to the database. Select "Validate" under the Search menu. This function detects discrepancies between the format of your submission and that required by the database selected for entry.

If Sequin detects problems with the format of your record, you will see a screen listing the validation errors as well as suggestions for how to fix the discrepancies. Single clicking on an error message scrolls the record viewer to the feature that is causing the error. Double clicking on the error message launches the relevant feature editor on which you can correct the problem. If you are annotating a set of multiple sequences, shift-click to scroll to the target sequence and feature. When you think you have corrected all the problems, click on "Revalidate". You can submit files with errors, but it is strongly recommended that you correct as many errors as possible prior to submission.

Message: Select Verbose, Normal, Terse, or Table. Verbose gives a more detailed explanation of the problem.

Filter: Select the error messages you wish to see. You can select ALL, SEQ_INST (errors regarding the sequence itself, its type, or length), SEQ_DESCR (descriptor errors), SEQ_FEAT (feature errors), or errors specific to your record.

Severity: Select the types of error messages you wish to see. You will see the type of message selected, as well as any messages warning of more serious problems.

There are four types of error messages, Info, Warning, Error, and Reject. Info is the least severe, and Reject is the most severe. You may submit the record even if it does contain errors. However, we encourage you to fix as many problems as possible. Note that some messages may be merely suggestions, not discrepancies. A possible Warning message is that a splice site does not match the consensus. This may be a legitimate result, but you may wish to recheck the sequence. A possible Error message is that the conceptual translation of the sequence that you supplied does not encode an open reading frame. In this case, you should check that you translated the sequence in the correct reading frame. A possible Reject message is that you neglected to include the name of the organism from which the sequence was derived. The name of the organism is absolutely required for a database entry.
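
The Severity setting behaves like a threshold over the four ordered message types. The following Python sketch illustrates that idea only; it is not Sequin's implementation, and the class, function, and message texts are hypothetical paraphrases of the examples above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four message types, ordered from least to most severe."""
    INFO = 0
    WARNING = 1
    ERROR = 2
    REJECT = 3


def filter_messages(messages, selected):
    """Keep messages at the selected severity or any more serious level."""
    return [(sev, text) for sev, text in messages if sev >= selected]


# Hypothetical messages paraphrasing the examples in the text.
messages = [
    (Severity.WARNING, "splice site does not match the consensus"),
    (Severity.ERROR, "conceptual translation does not encode an open reading frame"),
    (Severity.REJECT, "organism name is missing"),
]

shown = filter_messages(messages, Severity.ERROR)
for sev, text in shown:
    print(sev.name, text)
```

Selecting Error in this sketch hides the Warning message but still shows the Reject message, because Reject is more serious than the level selected.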

If Sequin does not detect any problems with the format of your record, you will see a message that "Validation test succeeded".

To prepare the submission, click the "Done" button on the record viewer, or select "Prepare Submission" under the File menu. You will be prompted to save the file. Email this file to the database at the address shown. You MUST email the file; Sequin does not submit the file automatically over the network. The email addresses for the databases are:

  • GenBank: gb-sub@ncbi.nlm.nih.gov

  • EMBL: datasubs@ebi.ac.uk

  • DDBJ: ddbjsub@ddbj.nig.ac.jp

After your entry is complete, close the record viewer. You will be returned to the Welcome to Sequin form and can begin another entry.

The Record Viewer

Target Sequence

This pop-up menu shows a list of SeqIDs of all nucleotide and protein sequences associated with the Sequin entry. Use the menu to select the sequences displayed in the record viewer, as well as the sequences you want to "target", that is, the sequences to which you want to apply a descriptor (see Descriptors in the Sequin help documentation). You may select either an individual sequence by name or a set of sequences, such as All Sequences, or SEG_dna if you have a segmented nucleotide set. You may change the selection at any time.

Display Format

You may change the display format of the record viewer to any of the formats described below. Editing a field in one display format will change that field in all formats. Subsequent pop-up menus will appear depending on which format is selected.

GenBank

This display format allows you to see the submission as it would appear as a GenBank or DDBJ entry. It is the default format.

The Mode pop-up default setting is Sequin. Release mode shows certain qualifiers and db_xrefs in RefSeq entries which are non-collaborative. Entrez mode is used for web display and can show new elements that have not yet finished their four-month quarantine period. Dump mode requires that the accession slot be populated. In most cases, there is no need to change from the default Sequin mode.

The Style pop-up allows different views of segmented records. The default is Normal. Segment style is the traditional representation of segmented sequences, while Contig style displays a CONTIG line with a join of accessions instead of raw sequence. Master style shows features mapped to the segmented sequence coordinates instead of the coordinates of the individual parts.

Graphic

This display format shows the entry in a graphical view. The top bar represents the nucleotide sequence. Lower arrows or bars represent different features on the sequence. Double click on an arrow or bar to launch the appropriate editing window. Any sequence highlighted in the Sequence Editor will be boxed on the graphical view of the sequence. To see a graphical representation of a segmented set (see Submission type, above), the Target Sequence must be set to SEG_dna.

The Style pop-up menu allows you to see the display in different styles and colors.

The Scale pop-up menu allows you to see the display in different sizes. The smaller the number, the larger the display.

Sequence

This display format shows the nucleotide sequence in the record along with any annotated features (such as CDS or mRNA). You can only view a single sequence at a time with this option. You can use the Features pop-up menu to change the display of the features. With the numbering pop-up menu, select where you want the sequence numbers to be indicated: at the side of the window, at the top of each sequence line, or not at all.

Alignment

This display format shows sets of aligned sequences, such as those imported as part of a population, phylogenetic, mutation, or environmental samples set. When toggled to All Sequences in the Target Sequence pop-up, the alignment of all entries will be displayed. To more closely analyze similarities, you can select a single entry in the Target Sequence pop-up. The complete sequence of the entry selected will be displayed. Any nucleotides in the other sequences that differ from that selected will be displayed, while identical nucleotides will be displayed as a period. You can also display features annotated on the selected target sequence or all sequences using the Feature display toggle. To launch the alignment editor, select Alignment Assistant from the record viewer Edit menu.
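
The target-relative display described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sequin's code: the function name and sample rows are hypothetical, all rows are assumed to be the same length, and a gap is treated like any other character when compared with the target.

```python
def render_against_target(rows, target_id, match_char="."):
    """Show the target row in full; in other rows, mask identical positions.

    rows: dict mapping a SeqID to its aligned sequence (equal lengths assumed).
    """
    target = rows[target_id]
    rendered = {}
    for seq_id, seq in rows.items():
        if seq_id == target_id:
            rendered[seq_id] = seq
        else:
            # Keep a character only where it differs from the target.
            rendered[seq_id] = "".join(
                match_char if a == b else a for a, b in zip(seq, target)
            )
    return rendered


# Hypothetical aligned rows; Seq1 is chosen as the target sequence.
rows = {"Seq1": "acgtacgt", "Seq2": "acgaacgt", "Seq3": "ac-tacgg"}
view = render_against_target(rows, "Seq1")
for seq_id, line in view.items():
    print(seq_id, line)
```

Only the positions that differ from the selected entry remain visible, which makes substitutions and gaps easy to spot.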

EMBL

This display format allows you to see the submission as it would appear as an EMBL entry.

Table

This display format shows the annotation in a five-column, tab-delimited table format. This format can be imported to add annotation to a record that has none.

FASTA

This display shows the sequence and Definition line only, without any annotations, in a format called the FASTA format. This is a format used by many molecular biology analysis programs. You cannot edit in this display mode.

Quality

This display format shows quality score data if it has been included in the submission.

ASN.1

This display shows the entry in Abstract Syntax Notation 1, a data description language used by the NCBI. You cannot edit in this display mode.

XML

This display format shows the entry in XML language, sometimes used by various databases. You cannot edit in this display mode.

INSDSeq

This display format shows the entry in the XML format used by the INSD. You cannot edit in this display mode.

Desktop

The NCBI DeskTop displays the internal structure of the record being viewed in Sequin. The DeskTop is explained under the Misc menu.

Done

This button allows you to validate the entry when you are finished with the submission. See Submitting the Finished Record to the Database in the Sequin help documentation.

Controls for Downloaded Entries

If you have downloaded a sequence from Entrez, you will see an additional button labeled PubMed. This button will launch a web browser containing the target sequence as it appears in Entrez. From here, you can access any Entrez-supported Links, including related sequences and associated references in PubMed.

Descriptors

Overview

Descriptors are annotations that apply to an entire sequence, or an entire set of sequences, in a given entry. They do not have a specific location on a sequence, as they apply to the entire sequence. They can be contrasted to Features, which apply to a specific interval of the sequence.

You may edit descriptors in one of two ways.

(1) In the record viewer, double click within the text of the descriptor to bring up a form on which information can be added.

(2) Choose the option Descriptors from the Annotate menu.

Annotate Menu

This menu allows you either to create new descriptors or to modify existing ones. Select the descriptor that you wish to modify.

When you first select a descriptor, you will see a window called "Descriptor Target Control". Using the target control pop-up menu, select the sequences you wish this descriptor to cover. The name(s) listed correspond to the SeqID(s) given to the nucleotide or amino acid sequences when they were imported into Sequin. The default selection for this menu is set in the Target Sequence pop-up menu on the record viewer. You may choose to have the descriptor cover just one sequence, or a set of sequences in your entry. If you are creating a new descriptor, select "Create New". If you wish to modify a previous descriptor, select "Edit Old".

The following is a list of some of the descriptors that can be added. Two additional descriptors, those for Publications and Biological Source, are described in other sections.

TPA Assembly

If you indicated that your sequence is a TPA submission, a TPA Assembly was created from the information regarding primary accession numbers. This Assembly information can be edited here. Note that it is not necessary to enter nucleotide location in the "from" and "to" columns.

Update Date

This is for database staff use only. Please do not modify the date.

Create Date

This is for database staff use only. Please do not modify the date.

Region

This descriptor provides general information about the genetic context of the sequence. For example, if your nucleotide sequence is cloned from the region surrounding the Huntington's Disease gene, you could enter that information here. Providing information for this descriptor is optional.

Name

This descriptor is an alternative place for a descriptive name for the sequence. This information will not appear in the flatfile view, but will be maintained in the ASN.1.

Comment

This descriptor is used to list any additional information that you wish to provide about the sequence. Use of this descriptor is optional. Most information can be better annotated using the appropriate features and qualifiers rather than a generic comment descriptor.

Title

This descriptor contains the information that will go on the Definition line of the database entry. If you supplied a title for your nucleotide sequence when you imported it into Sequin, that information is here. If you wish to change the Definition line, or if you did not supply a title when you submitted the sequence, edit this Descriptor. For more information on creating proper Definition lines, please see the Sequin help documentation for the Nucleotide Definition Line (Title).

Molecule Description

This descriptor indicates the characteristics of the molecule from which the sequence was derived. The information that you have already entered can be edited here. In most cases, the molecule and class are the only choices which should be edited from the default values.

Molecule

A GenBank sequence can represent one of several different molecule types. Enter in the Molecule pop-up menu the type of molecule that was sequenced. A brief description of the choices in this pop-up menu was given previously.

Completedness

Choose the appropriate option from the pop-up menu.

  • Complete: Use this designation when a complete molecule, such as a complete mitochondrial genome, is being submitted.

  • Partial: Use this designation when an incomplete unit, such as the partial coding sequence of a gene, is being submitted.

  • No left: Use this designation when an incomplete unit, such as the partial coding sequence of a gene, or a partial protein sequence, is being submitted. The sequence has no left if it is incomplete on the 5', or amino-terminal, end.

  • No right: Use this designation when an incomplete unit, such as the partial coding sequence of a gene, or a partial protein sequence, is being submitted. The sequence has no right if it is incomplete on the 3', or carboxy-terminal, end.

  • No ends: Use this designation when an incomplete unit, such as the partial coding sequence of a gene, or a partial protein sequence, is being submitted. The sequence has no ends if it is incomplete at both the 5' and 3', or amino- and carboxy-terminal, ends.

  • Other: Use this designation when none of the above descriptions apply.

Technique

From the pop-up menu, select the technique that was used to generate the sequence.

  • Standard: standard sequencing technique.

  • EST: Expressed Sequence Tag: single-pass, low-quality mRNA sequences derived from cDNAs. These sequences will appear in the EST division.

  • STS: Sequence Tagged Site: short sequences that are operationally unique in a genome and that define a specific position on the physical map. These sequences will appear in the STS division.

  • Survey: single-pass genomic sequence. These sequences will appear in the Genome Survey Sequence (GSS) division.

  • Genetic Map: Genetic map information, for example, in the Genomes division.

  • Physical Map: Physical map information, for example, in the Genomes division.

  • Derived: A sequence assembled into a contig from shorter sequences.

  • Concept-trans: A protein translation generated with the appropriate genetic code.

  • Seq-pept: Protein sequence was generated by direct sequencing of a peptide.

  • Both: Protein sequence was generated by conceptual translation and confirmed by peptide sequencing.

  • Seq-pept-Overlap: Protein sequence was generated by sequencing multiple peptides, and the order of peptides was determined by overlap in their sequences.

  • Seq-pept-Homol: Protein sequence was generated by sequencing multiple peptides, and the order of peptides was determined by homology with another protein.

  • Concept-Trans-A: Conceptual translation of the nucleotide sequence provided by the author of the entry.

  • HTGS 0: High Throughput Genome Sequence, Phase 0. These sequences are produced by high-throughput sequencing projects and will be in the HTG division.

  • HTGS 1: High Throughput Genome Sequence, Phase 1. These sequences are produced by high-throughput sequencing projects and will be in the HTG division.

  • HTGS 2: High Throughput Genome Sequence, Phase 2. These sequences are produced by high-throughput sequencing projects and will be in the HTG division.

  • HTGS 3: High Throughput Genome Sequence, Phase 3. These sequences are produced by high-throughput sequencing projects and will be in the HTG division.

  • FLI_cDNA: Full Length Insert cDNA. Sequence corresponds to entire cDNA but not necessarily entire transcript. These sequences are produced by large sequencing projects.

  • HTC: High Throughput cDNA. These sequences are produced by large sequencing projects.

  • WGS: Whole Genome Shotgun. These sequences are produced by large sequencing projects and follow a separate submission process.

  • Barcode: Nucleotide sequence is part of Barcodes of Life project. This selection should only be used by members of the Consortium for the Barcodes of Life.

  • Other: Do not use this designation.
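
The division assignments stated in the list above can be summarized as a small lookup table. The following Python sketch is illustrative only: the names TECHNIQUE_DIVISION and division_for are hypothetical and are not part of Sequin, and techniques for which the list does not name a single division are deliberately omitted.

```python
# Hypothetical lookup built only from the divisions named in the list above.
TECHNIQUE_DIVISION = {
    "EST": "EST",
    "STS": "STS",
    "Survey": "GSS",
    "HTGS 0": "HTG",
    "HTGS 1": "HTG",
    "HTGS 2": "HTG",
    "HTGS 3": "HTG",
}


def division_for(technique):
    """Return the division named in the help text, or None if none is stated."""
    return TECHNIQUE_DIVISION.get(technique)


print(division_for("Survey"))    # GSS
print(division_for("Standard"))  # None: no division is stated for this technique
```

A return value of None here means only that the list does not state a division, not that the record has none.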

Class

From the pop-up menu, select the type of molecule that was sequenced.

  • DNA: DNA

  • RNA: RNA

  • Protein: Protein

  • Nucleotide: Do not select this item.

  • Other: Do not select this item.

Topology

From the pop-up menu, select the topology of the sequenced molecule.

  • Linear: Linear molecule (most sequences).

  • Circular: Circular molecule (such as a complete plasmid or mitochondrion).

  • Tandem: Do not select this item.

  • Other: Do not select this item.

Strand

From the pop-up menu, select whether the sequence was derived from an organism with a single- or double-stranded genome. This is used primarily for viral submissions.

  • Single: The organism contains only a single-stranded genome, for example, ssRNA viruses.

  • Double: The organism contains only a double-stranded genome, for example, dsDNA viruses.

  • Mixed: Do not select this item.

  • Mixed Rev: Do not select this item.

  • Other: Do not select this item.

Biological Source

The Biological Source descriptor is described in more detail below.

Features

Overview

Features are annotations that apply to one or more intervals on a sequence. They can be contrasted with Descriptors, which apply to an entire sequence or an entire set of sequences. Features will be added to the Target Sequence selected in the record viewer pop-up menu.

You may add or modify features in one of three ways.

(1) In the record viewer, double click on the text of an existing feature to bring up a form on which information can be added or edited.

(2) Choose the feature from the Annotate menu to add a new feature.

(3) Choose the feature from the Sequence Editor Features menu to add a new feature.

The features listed in the Annotate menu and the Sequence Editor Features menu are identical, and the instructions for adding them are the same, with one exception. If you annotate them in the Annotate menu, you must provide the nucleotide sequence location of the feature. However, if you add features from the Sequence Editor, you can highlight the sequence that the feature covers, and the location of the sequence will be automatically entered in the feature location box.

Annotate Menu

This menu allows you to add or modify features on the sequence selected in the Target Sequence pop-up menu of the record viewer. Features are grouped into six categories. Select the feature that you would like to mark on your sequence. A new form will appear.

Feature forms share a common design. The first page is specific to the particular feature, e.g., Coding Region or Gene. The second page lists Properties of the Feature. The third page describes the Location of the feature. Details about the common second and third pages are provided below.

Properties Page

General Subpage

Enter general comments about the feature here.

Select any of the flags if necessary. If this sequence contains only a partial representation of the feature you are describing, check the "Partial" box. Check the "Exception" box if the feature annotates a post-transcriptional modification of the nucleotide sequence, such as ribosomal slippage or RNA editing. This is generally used only on CDS features. The evidence dialogs will only be editable if information has been entered in the Evidence subpage.

If a gene feature overlaps the feature you are editing, the gene symbol will appear in the pull-down menu. If you want to add the name of a new gene, select new, and enter its name and optional description. By default, mapping between the feature and the gene is done by overlap, that is, the gene associated with the feature is the gene whose location overlaps with the location of the feature. Under some circumstances, for example, if the sequences of two genes overlap, you may wish the feature to apply to a different gene. In this case, select cross-reference, and select the name of the new gene in the pop-up menu. If you do not want the feature to map to any existing gene, select suppress. You may also edit information on the Gene feature form by clicking on Edit Gene Feature.
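
The default mapping by overlap described above can be pictured with a simple interval test. The Python sketch below is an assumption-laden illustration, not Sequin's actual algorithm: the function names and the gene symbols abcA and abcB are hypothetical, intervals are inclusive (start, stop) pairs on one strand, and any shared position counts as an overlap.

```python
def overlaps(a, b):
    """True if two inclusive (start, stop) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def genes_overlapping(feature, genes):
    """Return the symbols of all genes whose interval overlaps the feature."""
    return [symbol for symbol, span in genes.items() if overlaps(feature, span)]


# Hypothetical gene intervals; the two genes overlap between 850 and 900.
genes = {"abcA": (100, 900), "abcB": (850, 1500)}

print(genes_overlapping((200, 400), genes))  # ['abcA']
print(genes_overlapping((860, 890), genes))  # ['abcA', 'abcB']
```

The second call shows the situation mentioned above in which two genes overlap: the feature falls inside both, so overlap alone cannot identify the intended gene and a cross-reference is the appropriate choice.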

Comment Subpage

Add any comments about the feature here, especially if you checked the "Exception" box on the General Subpage.

Citations Subpage

This page is used to list any citations that specifically apply to the feature you are annotating. The citation must have already been entered into the record (see Publications in the Sequin help documentation). Click on Edit Citations, and place a check mark in the box next to the publication you want to cite. However, we discourage the use of citations on features.

Cross-Refs Subpage

This is a read-only page used to cross-reference this entry to entries in external databases (databases other than GenBank, EMBL/EBI, and DDBJ), such as dbEST or FLYBASE. For more information on this topic, see the International Nucleotide Sequence Database Collaboration page.

Evidence Subpage

This page is primarily used by large sequencing centers to explain annotation prediction methods, and its use is optional. More details about these qualifiers can be found in the genome submission guidelines. The two choices of evidence are Experiment or Inference.

Wet-bench, experimental evidence can be entered as free text in the Experiment section. Please be as brief as possible.

The Inference section allows for information to be added in cases where the feature is annotated based solely on sequence similarity or prediction software. In order to fill in text, you must select one of the options from the Category pull-down menu. Different pull-down and text boxes will appear depending on the selection you choose from the Category menu. If you select one of the 'similar to' categories, you must include the name of the database and the corresponding accession number of the sequence used as the basis for the annotation. If you choose one of the prediction categories, you must include the name and version of the prediction program used as the basis for the annotation.

For example, if your annotation of a coding region was based on similarity to the sequence and annotation in GenBank Accession number AY411252, you would select "similar to DNA sequence" from the pull-down menu and then select "INSD" in the Database pull-down. You would then type "AY411252.1" in the Accession text box. If the annotation is based on the Genscan prediction algorithm, you would select "ab initio prediction" from the pull-down menu, select "Genscan" in the Program pull-down and enter 2.0 in the Program Version text box. If the database or program used is not listed in the appropriate pull-down list, select Other from the list. A new text box will appear where you can enter the name of the database or program used. You still must include the appropriate accession number or version in the subsequent text box.
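
The two examples above each combine three entries: a category, a database or program name, and an accession or version. The Python sketch below shows one way to assemble and check such a triple. It assumes the colon-separated layout used for the /inference qualifier in the feature table; the function name inference_string is hypothetical and is not something Sequin exposes.

```python
def inference_string(category, source, identifier):
    """Join the three Inference entries with colons, requiring all three."""
    for part in (category, source, identifier):
        if not part:
            raise ValueError(
                "category, database or program, and accession or version are all required"
            )
    return f"{category}:{source}:{identifier}"


print(inference_string("similar to DNA sequence", "INSD", "AY411252.1"))
print(inference_string("ab initio prediction", "Genscan", "2.0"))
```

The check mirrors the rule stated above that the accession number or program version must still be supplied even when Other is chosen for the database or program.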

Identifiers Subpage

This is a read-only page used by the database staff for tracking features within the record.

Location Page

This page allows you to select the location of the feature you are citing. Each feature must have a sequence interval associated with it. In most cases, Sequin will limit the option to the nucleic acid or protein sequence as appropriate.

Check the 5' Partial or 3' Partial box if the feature in your nucleic acid sequence is missing residues at the 5' or 3' ends, respectively. Check the NH2 Partial or COOH Partial box if the feature in your amino acid sequence is missing residues at the amino- or carboxy-terminal ends, respectively. If you checked "Partial" on the Properties page, you must check the 5' Partial box, the 3' Partial box, or both.

Enter the sequence range of the feature. The numbers should correspond to the nucleotide sequence interval if the SeqID is set to a nucleotide sequence, and to an amino acid sequence interval if the SeqID is set to a protein sequence. If the feature spans multiple, non-continuous intervals on the sequence, indicate the beginning and end points of each interval. If each interval is separate, and should not be joined with the others to describe the feature, check the Intersperse intervals with gaps box (for example, when annotating multiple primer binding sites). If the feature is composed of several intervals that should all be joined together, do not check the box (for example, when annotating mRNA on a genomic DNA sequence).

For nucleic acid Features only: From the pop-up menu, select the strand on which the feature is found.

  • Plus: Plus strand, or coding strand.

  • Minus: Minus strand, or non-coding strand.

  • Both: Both strands.

  • Reverse: Do not select this item.

  • Other: Do not select this item.
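
The interval and strand choices described above correspond to the way locations are written in a feature table. The Python sketch below is a hedged illustration: it assumes that joined intervals are written with join(), that checking Intersperse intervals with gaps corresponds to order(), and that the minus strand is written with complement(). The function name location_string is hypothetical, and partial ends are not handled.

```python
def location_string(intervals, strand="plus", intersperse_gaps=False):
    """Format inclusive (start, stop) intervals in feature-table style."""
    if not intervals:
        raise ValueError("each feature must have at least one interval")
    spans = [f"{start}..{stop}" for start, stop in intervals]
    if len(spans) == 1:
        text = spans[0]
    else:
        # Assumed mapping: checked box -> order(), unchecked -> join().
        operator = "order" if intersperse_gaps else "join"
        text = f"{operator}({','.join(spans)})"
    if strand == "minus":
        text = f"complement({text})"
    return text


# mRNA on genomic DNA: intervals are joined.
print(location_string([(1, 50), (120, 300)]))
# Multiple primer binding sites: intervals are kept separate.
print(location_string([(1, 50), (120, 300)], intersperse_gaps=True))
# Same joined feature on the minus strand.
print(location_string([(1, 50), (120, 300)], strand="minus"))
```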

Use the pop-up menu to select the SeqID of the sequence you are describing by the location. Clicking on the X button to the left will clear location spans, strand, and SeqID from that row.

If you are working on a set of sequences which contain an alignment, you will see a toggle at the bottom of the Location Page where you can select to add or view the location of the feature using the Sequence Coordinates of the target sequence or the Alignment Coordinates. In either case, the feature will only be added to the target sequence. If you want to add features to all members of the set using the alignment coordinates, you must use the Alignment Assistant.

A brief description of the available features follows. A detailed explanation of how to use the coding region (CDS) feature is included. The DDBJ/EMBL/GenBank feature table definition page provides detailed information about other features.

attenuator

1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons; 2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription.

C_region

Constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains. Includes one or more exons, depending on the particular chain.

CAAT_signal

CAAT box; part of a conserved sequence located about 75 bp upstream of the start point of eukaryotic transcription units that may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT.

CDS

coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon). Feature includes amino acid conceptual translation.

Coding Region Page

Most users add a coding region to their sequence when they fill out the Organism and Sequences form. However, you may need to edit the coding region, or add additional ones. To add a coding region, choose CDS under the Coding Regions and Transcripts submenu of the Features menu. To edit an existing CDS, double click on it in the record viewer. If you appended the partial sequence of a coding region to the Organism and Sequences form, you will probably need to edit the Coding Region feature to avoid validation error messages about the location of the coding region.

General (Product) Subpage

Choose the genetic code that should be used to translate the nucleotide sequence. For more information, and for the translation tables themselves, see the NCBI Taxonomy page. If the genetic code is already populated from the taxonomy database, do not change this selection.

Choose the reading frame in which to translate the sequence. Do not fill in the Protein Product or SeqID selections.

Supply additional information about the protein by clicking on Edit Protein Information to launch the Protein feature forms. The protein name must have already been filled out on the Protein subpage.

Checking retranslate on accept will translate the nucleotide sequence according to the interval(s) indicated on the Locations page when you click on Accept to exit the editor. This new translation will replace any earlier translations you have supplied. This should not be a problem if the interval was indicated appropriately.

If the coding sequence that you supply is a partial sequence and you have checked a Partial box on the Location subpage, it is a good idea to check the Synchronize Partials box. In this case, Sequin will ensure that all other appropriate features (such as protein) are also marked as partial.

When editing existing CDS features, choose the sequence you want to view by selecting its name under the Product pop-up menu. You may also import a new protein sequence by selecting Import Protein FASTA under the File menu. The sequence should be formatted as described above on the Organism and Sequences form.

After you have imported a protein sequence, click on Predict Interval. This function will predict the interval on the nucleotide sequence to which the coding region applies. If you do not select this function, the interval will likely be wrong, and you will get an error message when you attempt to validate the record. If your sequence is a 5' or 3' partial, you must first indicate this manually on the Location Page.

You may also have Sequin generate the protein sequence from the nucleotide sequence by clicking on Translate Product. However, you must first indicate the location and partialness of the coding region on the Location page in order to obtain the correct translation.

The Edit Protein Sequence button will launch an amino acid Sequence Editor as discussed below.

The Adjust for Stop Codon button will truncate a displayed translation at the first stop codon. If no stop codon is present in the current translation, this function will extend the translation to the first stop codon or to the end of the sequence. In both cases, the spans of the coding region will be automatically updated on the Location Page to reflect the new translation.
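
The effect of choosing a reading frame and then adjusting for the stop codon can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sequin's implementation: it uses the standard genetic code, a single interval on the plus strand, and 1-based coordinates, and the function names translate and adjust_for_stop are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nucleotides, frame=1):
    """Translate from reading frame 1, 2, or 3; unrecognized codons become 'X'."""
    seq = nucleotides.upper()[frame - 1:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def adjust_for_stop(nucleotides, frame=1):
    """Return (protein, start, stop) with the translation cut at the first stop.

    The 1-based nucleotide span includes the stop codon when one is found;
    otherwise it runs to the last complete codon in the sequence.
    """
    protein = translate(nucleotides, frame)
    stop_index = protein.find("*")
    if stop_index == -1:
        n_codons = len(protein)
    else:
        n_codons = stop_index + 1  # count the stop codon in the span
        protein = protein[:stop_index]
    return protein, frame, frame - 1 + 3 * n_codons


print(translate("ATGGCCTAAGGG"))        # MA*G
print(adjust_for_stop("ATGGCCTAAGGG"))  # ('MA', 1, 9)
```

In the example, the translation is truncated at the stop codon and the span shrinks from the full 12 bases to positions 1 through 9, which is the kind of update the Location Page would then show.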

Protein Subpage

Use this page to enter or edit a name or description of the protein product. For a new sequence, enter information directly into the boxes. You can edit descriptions of an existing sequence by clicking on Edit Protein Feature, which will bring up the Protein feature form. The Launch Product Viewer displays the flatfile view of the protein record generated from the information in the CDS feature.

Exceptions Subpage

Exceptions describe places where there is a posttranslational modification. Enter the amino acid position at which the modification occurs, and select the amino acid that is actually represented in the protein from the pop-up list. Sequin will change the amino acid number to a nucleotide interval. Please provide some explanation for the exception in a comment.
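
The conversion from an amino acid number to a nucleotide interval is simple arithmetic when the coding region is a single interval on the plus strand. The Python sketch below makes that assumption explicit; the function name codon_interval is hypothetical, and coding regions with multiple intervals or on the minus strand would need more bookkeeping than is shown here.

```python
def codon_interval(cds_start, amino_acid_position, frame=1):
    """Return the 1-based (start, stop) nucleotide span of one codon.

    Assumes a single-interval, plus-strand coding region that begins at
    cds_start and is translated in the given reading frame (1, 2, or 3).
    """
    if amino_acid_position < 1:
        raise ValueError("amino acid positions are 1-based")
    start = cds_start + (frame - 1) + 3 * (amino_acid_position - 1)
    return start, start + 2


print(codon_interval(101, 1))   # (101, 103)
print(codon_interval(101, 10))  # (128, 130)
```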

conflict

Independent determinations of the "same" sequence differ at this site or region.

D-loop

Displacement loop; a region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the original partner DNA strand in this region; also used to describe the displacement of a region of one strand of duplex DNA by a single stranded invader in the reaction catalyzed by RecA protein.

D_segment

Diversity segment of immunoglobulin heavy chain, and T-cell receptor beta chain.

enhancer

A cis-acting sequence that increases the utilization of (some) eukaryotic promoters and can function in either orientation and in any location (upstream or downstream) relative to the promoter.

exon

Region of genome that codes for portion of spliced mRNA; may contain 5' UTR, all CDSs, and 3' UTR.

gap

Gap in the sequence, only applied to gaps of unknown length. The location span of the gap feature is 100 base pairs, indicated by 100 "n"s in the sequence. The qualifier /estimated_length=unknown is mandatory.
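
Because a gap of unknown length is represented by exactly 100 "n"s, candidate locations for this feature can be found by scanning the sequence for such runs. The following Python sketch is illustrative only; the function name unknown_length_gaps is hypothetical, and coordinates are reported as 1-based inclusive spans.

```python
import re


def unknown_length_gaps(sequence, run_length=100):
    """Return 1-based (start, stop) spans of runs of exactly run_length 'n's."""
    spans = []
    for match in re.finditer(r"n+", sequence, flags=re.IGNORECASE):
        if match.end() - match.start() == run_length:
            spans.append((match.start() + 1, match.end()))
    return spans


sequence = "ACGT" + "n" * 100 + "ACGT" + "n" * 20
print(unknown_length_gaps(sequence))  # [(5, 104)]
```

The shorter run of 20 "n"s is not reported, since only runs of exactly 100 match the convention described above.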

GC_signal

GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units that may occur in multiple copies or in either orientation; consensus=GGGCGG.

gene

Region of biological interest identified as a gene and for which a name has been assigned.

iDNA

Intervening DNA; DNA which is eliminated through any of several kinds of recombination.

intron

A segment of DNA that is transcribed, but removed from within the transcript, by splicing together the sequences (exons) on either side of it.

J_segment

Joining segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains.

LTR

Long terminal repeat, a sequence directly repeated at both ends of a defined sequence, of the sort typically found in retroviruses.

mat_peptide

Mature peptide or protein coding sequence; coding sequence for the mature or final peptide or protein product following post-translational modification. The location does not include the stop codon (unlike the corresponding CDS).

misc_binding

Site in nucleic acid that covalently or non-covalently binds another moiety that cannot be described by any other Binding key (primer_bind or protein_bind).

misc_difference

Feature sequence is different from that presented in the entry and cannot be described by any other Difference key (conflict, unsure, mutation, variation, allele, or modified_base).

misc_feature

Region of biological interest which cannot be described by any other feature key.

misc_recomb

Site of any generalized, site-specific, or replicative recombination event where there is a breakage and reunion of duplex DNA that cannot be described by other recombination keys (iDNA and virion) or qualifiers of source key (/proviral).

misc_RNA

Any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5'UTR, 3'UTR, exon, transit_peptide, polyA_site, rRNA, tRNA, and ncRNA).

misc_signal

Any region containing a signal controlling or altering gene function or expression that cannot be described by other Signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).

misc_structure

Any secondary or tertiary structure or conformation that cannot be described by other Structure keys (stem_loop and D-loop).

modified_base

The indicated nucleotide is a modified nucleotide and should be substituted for by the indicated molecule (given in the mod_base qualifier value).

mRNA

messenger RNA; includes 5' untranslated region (5' UTR), coding sequences (CDS, exon) and 3' untranslated region (3' UTR).

ncRNA

non-coding RNA; a non-protein-coding transcript other than ribosomal RNA and transfer RNA, including antisense RNA, guide RNA, scRNA, siRNA, miRNA, piRNA, snoRNA, and snRNA. The specific type of ncRNA must be specified in the /ncRNA_class qualifier.

N_region

Extra nucleotides inserted between rearranged immunoglobulin segments.

operon

Region containing polycistronic transcript under the control of the same regulatory sequences.

oriT

Origin of transfer; region of DNA where transfer is initiated during the process of conjugation or mobilization.

polyA_signal

Recognition region necessary for endonuclease cleavage of an RNA transcript that is followed by polyadenylation; consensus=AATAAA.

polyA_site

Site on an RNA transcript to which will be added adenine residues by post-transcriptional polyadenylation.

precursor_RNA

Any RNA species that is not yet the mature RNA product; may include 5' clipped region (5' clip), 5' untranslated region (5' UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3' UTR), and 3' clipped region (3' clip).

prim_transcript

Primary (initial, unprocessed) transcript; includes 5' clipped region (5' clip), 5' untranslated region (5' UTR), coding sequences (CDS, exon), intervening sequences (intron), 3' untranslated region (3' UTR), and 3' clipped region (3' clip).

primer_bind

Non-covalent primer binding site for initiation of replication, transcription, or reverse transcription. Includes site(s) for synthetic elements, e.g., PCR primers.

promoter

Region on a DNA molecule involved in RNA polymerase binding to initiate transcription.

protein_bind

Non-covalent protein binding site on nucleic acid.

RBS

Ribosome binding site.

repeat_region

Region of genome containing repeating units. Some qualifiers such as rpt_type and mobile_element have controlled vocabularies. These qualifiers have check boxes or pull-down menus to ensure that the correct format is used.

repeat_unit

Single repeat element.

rep_origin

Origin of replication; starting site for duplication of nucleic acid to give two identical copies.

rRNA

Mature ribosomal RNA; the RNA component of the ribonucleoprotein particle (ribosome) that assembles amino acids into proteins.

S_region

Switch region of immunoglobulin heavy chains. Involved in the rearrangement of heavy chain DNA leading to the expression of a different immunoglobulin class from the same B-cell.

satellite

Many tandem repeats (identical or related) of a short basic repeating unit; many have a base composition or other property different from the genome average that allows them to be separated from the bulk (main band) genomic DNA.

sig_peptide

Signal peptide coding sequence; coding sequence for an N-terminal domain of a secreted protein; this domain is involved in attaching nascent polypeptide to the membrane; leader sequence.

source

Identifies the biological source of the specified span of the sequence. This key is mandatory. Every entry will have, as a minimum, a single source key spanning the entire sequence. More than one source key per sequence is permitted.

stem_loop

Hairpin; a double-helical region formed by base-pairing between adjacent (inverted) complementary sequences in a single strand of RNA or DNA.

STS

Sequence Tagged Site. Short, single-copy DNA sequence that characterizes a mapping landmark on the genome and can be detected by PCR. A region of the genome can be mapped by determining the order of a series of STSs.

TATA_signal

TATA box; Goldberg-Hogness box; a conserved AT-rich heptamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit that may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T).
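
Several of the signal features in this list quote a consensus sequence. As a rough aid when deciding where such a feature might lie, the consensus strings can be written as regular expressions and searched for on the given strand. The Python sketch below is a minimal illustration: the names CONSENSUS and find_consensus are hypothetical, only exact matches are reported, and a match is merely a candidate, since real signals often deviate from the consensus.

```python
import re

# Consensus strings as quoted in the feature descriptions, written as regular
# expressions; "(C or T)" becomes the character class [CT].
CONSENSUS = {
    "CAAT_signal": "GG[CT]CAATCT",
    "GC_signal": "GGGCGG",
    "polyA_signal": "AATAAA",
    "TATA_signal": "TATA[AT]A[AT]",
}


def find_consensus(feature_key, sequence):
    """Return 1-based (start, stop) spans of exact, non-overlapping matches."""
    pattern = re.compile(CONSENSUS[feature_key], flags=re.IGNORECASE)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


print(find_consensus("TATA_signal", "GGCTATAAATAGG"))  # [(4, 10)]
print(find_consensus("polyA_signal", "CCAATAAAGG"))    # [(3, 8)]
```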

terminator

Sequence of DNA located either at the end of the transcript or adjacent to a promoter region that causes RNA polymerase to terminate transcription; may also be site of binding of repressor protein.

tmRNA

Transfer messenger RNA; acts as a tRNA first, then an mRNA that encodes a peptide tag.

transit_peptide

Transit peptide coding sequence; coding sequence for an N-terminal domain of a nuclear-encoded organellar protein; this domain is involved in post-translational import of the protein into the organelle.

tRNA

Mature transfer RNA, a small RNA molecule (75-85 bases long) that mediates the translation of a nucleic acid sequence into an amino acid sequence.

unsure

Author is unsure of exact sequence in this region.

V_region

Variable region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains. Codes for the variable amino terminal portion. Can be made up from V_segments, D_segments, N_regions, and J_segments.

V_segment

Variable segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains. Codes for most of the variable region (V_region) and the last few amino acids of the leader peptide.

variation

A related strain contains stable mutations from the same gene (e.g., RFLPs, polymorphisms, etc.) that differ from the presented sequence at this location (and possibly others).

3'UTR

Region near or at the 3' end of a mature transcript (usually following the stop codon) that is not translated into a protein; trailer.

5'UTR

Region near or at the 5' end of a mature transcript (usually preceding the initiation codon) that is not translated into a protein; leader.

-10_signal

Pribnow box; a conserved region about 10 bp upstream of the start point of bacterial transcription units that may be involved in binding RNA polymerase; consensus=TAtAaT.

-35_signal

A conserved hexamer about 35 bp upstream of the start point of bacterial transcription units; consensus = TTGACa or TGTTGACA.

Biological Source Descriptor or Feature

This annotation is very important, as an entry cannot be processed by the databases unless it includes some basic information about the organism from which the sequence was derived. This basic information was entered previously in the submission, in the Organism and Sequences Form. The more detailed Organism Information form allows you to alter or add to the data you entered earlier.

Overview: Descriptor or Feature?

Sequin allows two types of biological source information to be entered, Biological Source Descriptors and Biological Source Features. Biological Source Descriptors, like other descriptors, provide organism information about an entire sequence, or an entire set of sequences, in an entry. Biological Source Features, like other features, provide organism information about a specific interval on a given sequence.

In most cases, you will want to use a Biological Source Descriptor, because all the sequences in the entry will derive from the same source. However, if you have sequenced a transgenic molecule, for example, one that is part plant and part bacterial, you would use Biological Source Features to annotate which sequence was derived from plant and which from bacteria.

To add a Biological Source Descriptor, select Biological Source under the Descriptor section of the Annotate menu. To add a Biological Source Feature, select Biological Source under the Bibliographic and Comments section of the Annotate menu.

Annotating a Biological Source Descriptor or Feature is similar to annotating any descriptor or feature. For help in creating descriptors and features, see the appropriate section of the help documentation. The following are instructions for filling out Biological Source-specific forms.

Organism Page

Names Subpage

The scrollable list contains the scientific names of many organisms. To reach a name on the list, either type the first few letters of the scientific name, or use the thumb bar. Click on a name from the list to fill out the scientific name field. If there is a common name for the organism, that field will be filled out automatically. You may also directly type in the scientific name. If you have any questions about the scientific or common name of an organism, see the NCBI taxonomy browser.

Location Subpage

Location of Sequence

From the selection list, please enter the location of the genome that contains your sequence. Most entries will have a "Genomic" location. Brief descriptions of the choices in this pop-up menu were given previously.

Origin of Sequence

This menu is for the use of database personnel. Please leave this field empty. The Biological focus box should be checked in rare cases where multiple source features are annotated.

Genetic Codes Subpage

Please use these fields to select the nuclear and mitochondrial genetic code that should be used to translate the nucleic acid sequence. The genetic code for a eukaryotic orga