16S Ribosomal RNA Reference Sequence Similarity Search Beta Release

The 16S Ribosomal RNA Reference Sequence Similarity Search tool allows visualization of BLAST hits from a single query sequence against a precalculated phylogenetic tree constructed from an alignment of 16S rRNA sequences for prokaryotic type strains. The set of sequences is updated semi-regularly and currently contains type strains derived from comparison of a number of contributing databases. More information on this project can be found here.

NOTE: This tool is a beta release and is expected to change.

Clicking on the home link at the top of the results page at any time will reload the submission page in order to do another search.

The interface is described in the following sections.
  1. Query
  2. Results
  3. Tree

1. Query

Currently the search tool only allows a single sequence as a query. The sequence should be provided either in FASTA format with a header or as an INSD Accession number or GI. Start and stop positions can be delimited in the boxes to the right. Searches are currently limited to between 500 and 2000 bases only.

The sample sequence provided is one of the 16S ribosomal RNAs from the Ureaplasma parvum genome.
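
The query constraints above (a single FASTA record, optional start and stop positions, and a length between 500 and 2000 bases) can be checked locally before submission. The following is a minimal sketch, not the tool's own validation code. The function names are hypothetical, and it assumes that the length limits are inclusive and that start and stop are 1-based, inclusive positions.

```python
MIN_LEN = 500   # lower length limit stated above (assumed inclusive)
MAX_LEN = 2000  # upper length limit stated above (assumed inclusive)


def parse_single_fasta(text):
    """Return (header, sequence) for FASTA text holding exactly one record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("only a single query sequence is allowed")
    return lines[0][1:], "".join(lines[1:])


def check_query(text, start=None, stop=None):
    """Validate a query and return (header, subsequence to be searched).

    start and stop are treated as 1-based, inclusive positions (assumption).
    """
    header, seq = parse_single_fasta(text)
    first = 1 if start is None else start
    last = len(seq) if stop is None else stop
    if not (1 <= first <= last <= len(seq)):
        raise ValueError("start/stop positions fall outside the sequence")
    length = last - first + 1
    if not (MIN_LEN <= length <= MAX_LEN):
        raise ValueError(
            f"query region is {length} bases; expected {MIN_LEN} to {MAX_LEN}"
        )
    return header, seq[first - 1:last]
```

Calling `check_query(fasta_text, 101, 700)` on an 800-base record would return the header and the 600-base region, while a 4-base record raises `ValueError`.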


2. Results

BLAST Results are initially displayed in the following manner:

  1. A summary of the top ten hits at the top of the page.
  2. The same top ten hits are displayed on the tree.
  3. A full table of results is available from the Table button.

Results: Summary

Hits are sorted by score in the full set of results. However, the summary results (the top ten hits to either bacterial or archaeal sequences) and the tree are sorted and ranked by percent identity (hit rank). As a result, a hit may not have the same rank in the full table as it has in the summary table and tree. The count of hits in each of three percent identity intervals is displayed. Color-coding in the summary table and in the tree is based on:

Hits with greater than 99% identity.
Hits with identity between 95 and 99%.
Hits with less than 95% identity.
Summary results show rank, percent identity, coverage, and the organism name. The summary can be expanded or collapsed using the arrow to the left side of the "Top 10..." line. The identity intervals were chosen as biologists often assume that sequences that show identity greater than 99% are the same species and those sequences with greater than 95% identity are the same genus (although this is NOT always the case). Coverage is the percentage of the query that is aligned over the total length of the query sequence. Sequences that have low coverage (even with a high scoring hit) are reported with a red flag.
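
The difference between the two orderings, and the three identity intervals, can be illustrated with a short sketch. This is not the tool's implementation. The dictionary keys (`name`, `score`, `identity`) are hypothetical, the treatment of hits at exactly 95% or 99% is an assumption, and the interval counts are assumed to cover all hits rather than only the top ten.

```python
from collections import Counter


def identity_bin(pct_identity):
    """Assign a hit to one of the three percent identity intervals."""
    if pct_identity > 99.0:
        return "above_99"
    if pct_identity >= 95.0:  # boundary handling is an assumption
        return "95_to_99"
    return "below_95"


def full_table(hits):
    """Order hits as in the full table: by score, best first."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)


def summarize(hits, top_n=10):
    """Rank hits by percent identity and count hits per interval."""
    by_identity = sorted(hits, key=lambda h: h["identity"], reverse=True)
    summary = [
        {"rank": rank, "name": h["name"], "bin": identity_bin(h["identity"])}
        for rank, h in enumerate(by_identity[:top_n], start=1)
    ]
    counts = Counter(identity_bin(h["identity"]) for h in hits)
    return summary, counts


# Made-up hits for illustration only; not real search output.
example_hits = [
    {"name": "hit_A", "score": 1500, "identity": 98.5},
    {"name": "hit_B", "score": 1400, "identity": 99.6},
    {"name": "hit_C", "score": 900, "identity": 93.0},
]
```

With these made-up values, `hit_A` comes first in `full_table(example_hits)` because it has the highest score, but `hit_B` takes rank 1 in the summary because it has the highest percent identity.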

Results: Tree

The color-coding based on percent identity is also displayed in the tree, and the rank of the hit is shown by the position of the color-coded box (leftmost = rank 1; rightmost = rank 10). Resolution of the tree (number of leaves displayed - see below) may compress or expand the box.

Results: BLAST table

A full table of results is available from the "Table" button. The full table shows all BLAST results for a given query. The summary is displayed at the top. The percent identity for the given interval and links to further results (pagination) are available. A link back to the tree ("Show tree" button) and a link to run a normal BLAST query ("BLAST" button) are above the table. The full table shows score (linked to a BLAST 2 Sequences comparison of query vs. subject), percent identity, GI and RefSeq accession, GenBank accession from which the RefSeq was derived, length of hit, e-value, and defline.

Results: Red Alert

Coverage is calculated for the query as the aligned length (from the start and stop positions on the query) divided by the total length of the query. Query sequences that have a low ratio (currently < 0.7 or 70%) are flagged in red. Note that this coverage is calculated only for the best hit and does not include all possible hits to a subject sequence.
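
The coverage ratio and the red flag described above can be sketched as follows. This is an illustration only; the function names are hypothetical, and the alignment start and stop are assumed to be 1-based, inclusive positions on the query.

```python
LOW_COVERAGE = 0.7  # threshold stated above; ratios below it are flagged


def query_coverage(q_start, q_stop, query_len):
    """Aligned length on the query divided by the total query length."""
    if query_len <= 0:
        raise ValueError("query length must be positive")
    # Inclusive coordinates (assumption); abs() tolerates reversed positions.
    aligned = abs(q_stop - q_start) + 1
    return aligned / query_len


def is_low_coverage(q_start, q_stop, query_len):
    """True when the best hit would be reported with a red flag."""
    return query_coverage(q_start, q_stop, query_len) < LOW_COVERAGE
```

For example, a best hit aligned over positions 1 to 600 of a 1000-base query gives a ratio of 0.6 and would be flagged, while one aligned over positions 1 to 800 gives 0.8 and would not.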

No Results

If no sequences are found, or if a significant error is found with the query (a sequence that has hits to both archaea and bacteria) then an error is reported with no results:

alert("ERROR: thrown when getting list of alignments from the result: most likely no results");
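
The two failure conditions above (no hits at all, or hits to both archaea and bacteria) could be checked along these lines. This is a hedged sketch rather than the tool's own logic; the `SearchError` class and the `domain` key are hypothetical names introduced for illustration.

```python
class SearchError(Exception):
    """Hypothetical error type for a search that yields no usable results."""


def check_hits(hits):
    """Return the single domain shared by all hits, or raise SearchError."""
    if not hits:
        raise SearchError("no results")
    domains = {h["domain"] for h in hits}
    if len(domains) > 1:
        # A query hitting both archaea and bacteria is treated as an error.
        raise SearchError("query has hits to more than one domain: "
                          + ", ".join(sorted(domains)))
    return domains.pop()
```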


3. Tree

Sets of Reference Sequences representing bacterial and archaeal sequences were separately aligned using a secondary structure covariance model and INFERNAL version 1.0. A maximum likelihood tree was constructed using raxmlHPC v 7.2.2 and a secondary structure file for Thermus thermophilus 16S rRNA. Sequences used in tree construction are the same as those in the BLAST database.

The tree is arbitrarily rooted for visualization and the graphical view is similar to that developed for the NCBI Influenza project. Search results are visualized as color-coded boxes representing the top ten hits (see above) and the entire branch where the top ten hits appear is drawn with a shaded blue background.

The number of sequences in each branch is represented both by the size of the gray circle, and by the first number in each row. The total number of sequences with the same taxonomic assignment is shown next, and if there are any sequences that do not share a common parent with the majority at up to two nodes in the taxonomic hierarchy, then the total number of sequences is noted as 'other'. Yellow highlights are used to aid in subtree visualization. The buttons to the left and the resolution controller change how the tree is visualized or take one to different tools. The buttons (marked with an asterisk) and additional controls and functionalities are listed below:

Additionally there is a resolution controller for controlling the number of leaves resolved in the tree.

Select subtree

This button selects the subtree containing the top ten hits.

Table

The table button displays full BLAST results.

Search

The search button displays a pop-up window where metadata for the set of Reference Sequences can be searched in a field-specific manner: organism (including taxonomic lineage), strain, culture collection, minimum length, maximum length, and accession. Successful search results are tagged by gray boxes drawn on the tree with the number in the box corresponding to the total search results in the row/branch. Existing tags can be removed, and the form cleared for a new search term.

Reset

The reset button resets the tree to the way that it appeared after the initial search. This will reset any rearrangements, resolution changes, and tagging. It does not reset the form for another search. To start over with a new search, use the title, which is linked back to the original submission form.

Unselect

The unselect button removes any branch selections in the tree.

Align

The align button aligns the query sequence against the top ten hits and displays the resulting alignment in a multiple alignment viewer that is similar to that for the SARS alignment.

Resolution controller

The number of leaves displayed is set via the resolution controller. A maximum of 500 leaves can be shown for each branch. Tree selection can either be done with the select subtree button (which selects the branch where hits occurred) or done directly on the tree itself (click on the root of a branch). Once a selection occurs, the total number of leaves in the branch and the number of leaves that are resolved are displayed above the resolution controller. Moving the red marker line from minimum (one) to maximum (up to 500) for a given branch changes the number of leaves resolved in the selected subtree.
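
The limits on the resolution controller can be expressed as a small calculation. The sketch below assumes a linear mapping from marker position to leaf count, which the description above does not specify; the function name and the 0-to-1 position scale are hypothetical.

```python
MAX_LEAVES = 500  # maximum number of leaves shown per branch, as stated above


def resolved_leaves(branch_leaves, position):
    """Map a marker position in [0, 1] to the number of leaves resolved.

    The result runs from 1 (minimum) up to the branch size, capped at
    MAX_LEAVES. The linear mapping is an assumption made for illustration.
    """
    if branch_leaves < 1:
        raise ValueError("a branch must contain at least one leaf")
    position = min(max(position, 0.0), 1.0)  # clamp out-of-range positions
    upper = min(branch_leaves, MAX_LEAVES)
    return 1 + round(position * (upper - 1))
```

Under these assumptions, a branch with 1200 leaves resolves at most 500 of them, and a branch with 40 leaves resolves at most 40.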

Accession

Clicking on a RefSeq Accession Number in the tree will open the flatfile view of that record in Entrez Nucleotide.


Last updated: Dec 29, 2009.