To learn more about AceView, you
may enjoy these documents:
Danielle Thierry-Mieg and Jean
Thierry-Mieg, AceView:
a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology
2006, 7(Suppl 1):S12
About AceView: Overview, News (most recent improvements; release statistics; special notes on Worm), About history of Acembly/AceView, development
corner, acknowledgements
To get help, please see these:
Frequently asked
questions (updated Sept 13, 2009)
Help on
the three ways to query AceView
Help with texts, tables, displays
the “Gene on Genome” page (updated July, 2007)
the “Annotated mRNA” page
(updated May 2006)
How to link to AceView, cite us, Download, Register on AceView list
Please send us wishes, comments, and bug reports.
Summary:
Our molecular
annotations of the transcripts (not of the proteins, which still today are
largely predicted) are fully supported by experimental cDNA sequence data, but
the biological and functional annotations should be considered as hints: please
refer to the PubMed articles cited as evidence for each statement about
associated diseases or pathways, processes, molecular function, localization
and interactions, and if you can, please help us by sending your feedback.
We align
cooperatively the millions of cDNA sequences available from the public
databases on the genome sequence, then quality-filter the cDNA clones for
rearrangements or defects, then cluster the 90% good cDNA sequences into mRNA
models and genes. By construction, the redundant GenBank, dbEST and Trace cDNA
sequences are merged, and the resulting sequence is fixed to match the
excellent quality of the genome sequence, while the eventual SNPs (shared
across multiple mRNAs) remain apparent on the graphical views of the
reconstructed mRNA.
Our program
performs semi-automatic hand-annotation: at each step, we use appropriate heuristics,
search genome-wide for counterexamples, and tune the algorithms until the
results meet manual curation criteria. We construct the minimal number of
alternative variants necessary and sufficient to represent all cDNA clones, and
each clone is associated with a single variant. In this way, our annotation is
both comprehensive and non-redundant.
On the
molecular side, AceView genes and their alternative variants are analyzed in
terms of expression, intron-exon structure, alternative features and regulation,
neighbor relationships, such as antisense and operon-like arrangements;
participating cDNA clones/accessions are annotated; the putative protein
products are analyzed for completeness, their best covering cDNA clones are
identified, the proteins are searched for motifs, membership to a protein
family, conservation in evolution, closest homologues in other species and
signals for subcellular localization; non coding
genes are identified. We believe that our annotation of mRNAs and putative
protein products, expression or regulation is quite reliable and comprehensive,
as we carefully annotate the caveats, one of which is spurious concatenation of
two rare transcripts into a single mRNA (in about ¼ of cases in human), the other
more general is that the proteins annotated are predicted from the mRNA
sequence rather than experimental (as is also the case for SwissProt or RefSeq). But overall, it appears that AceView transcripts actually capture
well the bulk of the complexity of the transcription apparent in GenBank and
dbEST, and can be used as a framework to design experiments.
Rationale:
Millions
of mRNAs and ESTs have been deposited in the public databases, even long before the genome sequence was known. They represent a snapshot of
the mRNA content of the cells and are the most primary data reflecting gene
activity. Now that the genome sequence is almost complete, it is essential to
integrate this wealth of data, however noisy and complex, to trace all mRNAs
back to the region of the genome from which they were transcribed and to try to
reconstruct, by assembling the pieces of the puzzle, the most likely image of
the original transcripts. To this effect, we have developed the AceView program
(initially called Acembly), which allows us to see, without preconceived ideas,
what the pattern of transcription looks like in vivo, and which mRNAs and
proteins are likely made in the cell. Our basic hypothesis is that the cDNAs
submitted to the public databases give a faithful image of the transcriptome.
While
analyzing the data, we identify unexpected results, some of which may be real
discoveries, some experimental artifacts or incomplete data, but in all cases,
our hope is to stimulate further investigations toward a better understanding
of the transcriptome.
AceView
regularly downloads from the public databases the whole set of cDNA sequences,
mRNAs and ESTs, aligns them on the current genome available at NCBI, and
clusters them into reference transcripts, including alternative variants. At
least three quarters of AceView reconstructed transcripts are unambiguously
supported in GenBank, but the others could represent an inappropriate
concatenation of two or more partial variants found separately in the cell. To
identify these cases clearly, we rate explicitly the support of each
transcript, hoping to encourage further sequencing. The consensus sequence of
the reconstructed transcript is calculated from the cDNA sequences and from the
underlying genome, and the cDNA clones and proteins are annotated. Inferences
about the gene and its proteins are described. Biological annotations, usually
selected from other sources (linked), are added, and summaries are presented in
AceView, as a structured text document per gene. Text queries in AceView bring
lists of genes related to the question asked. Useful data are also available
for download in bulk, on our ftp site.
We summarize
here the types of results from AceView analysis that are currently displayed
for a gene or a transcript/protein, and we outline our annotation procedures.
AceView mRNA-to-genome co-alignment and clustering results:
analysis and classification of properties of clones and genes
Annotation of the proteins deduced from
the reconstructed mRNA sequences
Biological annotation is collected to
allow meaningful queries by content
Presentation of the data on the Web
Additional
information may be found in the Frequently Asked Questions, in the AceView News, in the “Gene on genome” page help, in the mRNA page help, or in the development
corner.
AceView
co-aligns all publicly available mRNAs and ESTs onto the
genome sequence, by using an original cDNA-to-genome coalignment program, which we initially called Acembly. The program was developed for the
worm transcriptome data, and the fact that we had the sequencing traces allowed
us to analyze in detail the sources of noise, in particular the base-call
errors, and to tune the co-alignment strategy to find the best alignments of
each cDNA on the genome even in the presence of noise in the sequences. The
code is big and its behavior in special circumstances depends on various
heuristics; it is not easy to summarize in a few lines (so it is not yet
published). It is, however, very fast and efficient, and unique in its
performance.
reconstructs transcripts by clustering the aligned cDNAs into
minimal sets, as the maximal pacific components of the friends and foes cDNA
clones connection graph
groups overlapping transcripts into genes, provided they share
at least one intron boundary (this last statement was added from Aug 2005 on)
names the genes by following the official
nomenclature when available or by creating new names that are then tracked from
release to release.
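As an illustration of the gene-grouping rule above (transcripts belong to the same gene when they share at least one intron boundary), here is a minimal Python sketch. It is not the AceView code. The input layout, the function name and the choice to leave unspliced transcripts as singletons are assumptions made for the example.

```python
from collections import defaultdict


def group_transcripts_into_genes(transcripts):
    """Group transcripts that share at least one intron boundary.

    `transcripts` maps a transcript name to a list of introns, each given as
    a (strand, donor_position, acceptor_position) tuple (hypothetical layout).
    Returns a list of sets of transcript names, one set per gene.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first transcript seen at each boundary; any later
    # transcript using the same boundary is merged with it.
    first_seen = {}
    for name, introns in transcripts.items():
        for strand, donor, acceptor in introns:
            for boundary in ((strand, "donor", donor),
                             (strand, "acceptor", acceptor)):
                if boundary in first_seen:
                    union(first_seen[boundary], name)
                else:
                    first_seen[boundary] = name

    genes = defaultdict(set)
    for name in transcripts:
        genes[find(name)].add(name)
    return list(genes.values())
```

In this sketch, a transcript with introns (100, 200) and (300, 400) and another with intron (100, 250) on the same strand fall in one gene because they share the donor at 100, while an unspliced transcript overlapping them stays on its own.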
The program
then
describes for each gene the intron-exon structure and its support in a
dedicated table
classifies the genes by level of expression relative to all main
genes genome wide.
generates two sequences for each reconstructed transcript, one
derived from the underlying genome sequence and one derived as the consensus of
the cDNAs that best matches the genome sequence (the AceView mRNA reference sequence .AM)
for each
protein, a “golden path” of clones is selected; the minimal number of clones required to structurally support each
protein is given: if =1, the variant is of RefSeq quality; if
=2 or more, the reconstructed transcript should be tested by resequencing clones or by RT-PCR to straighten
combinatorial artifacts in the reconstruction.
identifies best representative cDNA clones for each transcript,
as well as anomalies such as partial deletions or rearrangements of the clone
insert, internal priming or mosaicism
displays
various tables of supporting cDNA clones/accessions: one for all clones supporting the gene (accessible
from the “Gene” page caption), one for clones supporting each variant (in each
“annotated mRNA” page) and one for the “main clones”, defined as the set of clones
necessary and sufficient to reconstruct the consensus sequences that best match
the genome for all transcripts. In each table of clones are indicated the
tissue and stage information, the sequence accessions (linked to the GenBank
record), and the quality of the match to the genome (length aligned, number of
single base differences and %).
evaluates completeness of the sequence-derived
protein, and compares the alternative proteins with one another
compares alternative transcripts and describes the
presence of alternative promoters, alternative last exons, cassette exons,
overlapping exons, unspliced introns, and how that would impact the alternate
protein isoforms
characterizes and reports the existence of what we consider
reliable antisense genes and other regulatory features that can be evidenced
from the mRNA to genome alignment such as close neighbors, complex loci, RNA
editing, suspected translational frameshift or leaky stops, or internal
ribosome entry sites. Recognition of common sequences in promoters may also be proposed in the distant future.
The above
analyses are displayed on the “Gene on Genome” page, together with a
supporting color-coded zoomable diagram that shows
the gene aligned on the genome, the AceView reconstructed transcripts with
indication of standard and non-standard introns and delineation of the main
open reading frame, and the gene’s neighbors.
AceView then
submits the proteins predicted from the reconstructed transcripts sequences to
an analysis pipeline using a set of programs selected after comparative
evaluation among those publicly available. Annotations include NCBI BlastP against the nr database, TaxBlast for ancestry relationships and identification of
the closest homolog in the most studied species, Psort 2 by Kenta Nakai for protein motifs and predicted localization in
cellular compartments, Pfam, searched using HMMER and keeping only the highly
significant matches, for classification into protein families. As a bonus of
Pfam, we gain the InterPro descriptions and GO classifications. Primers and temperature conditions to
amplify the predicted CDS are calculated by Osp. (Hillier L, Green P. PCR Methods Appl. 1991 Nov;1(2):124-8).
These data
are displayed in the “Annotated mRNA” page, where AceView
transcripts can be viewed as spliced mRNA variants, decorated with BlastP
homologies, Pfam and Psort motifs, Stops and Met(AUG)
in the three frames, and, by clicking on the “clones” button in the graph, all
supporting mRNAs and ESTs, with color-coded indications of differences from the
genome sequence as well as labeled anomalies. For the worm, we have indicated
the stage and/or tissue at the bottom of each cDNA (3’ end) and the presence
and type of trans-spliced leader at the top (5’ end).
Finally, all
these data are used by AceView to generate a descriptive title for each
transcript and protein: if BlastP gave good hits to meaningful names, its
results are used in an intricate (and not yet fully bug free) way, allowing the
protein family Pfam name to influence the name as well; in the absence of good
hits with meaningful descriptions, Psort and TaxBlast are used to annotate the proteins with basic properties such as presence of
motifs, high-probability compartment localization, and sequence conservation in
speciation.
The schema we use in Acedb is object oriented,
so it offers all the desired features of a model tailored to our needs, and it
is modifiable and extensible at will. This is why we are able to construct the
descriptive texts for each gene and mRNA. Yet users can also browse lists of
all genes in the genome with a given Pfam or Psort motif, or a GO or phenotype
term.
We evolved
toward an NCBI AceDB version, which now differs appreciably from the Sanger
version. It compiles under UNIX and Windows, has a different client server
implementation, a very powerful “tablemaker” query
system, and a nice C programmer’s interface AceC. We would be willing to distribute it if there were enough interest.
A very rich
functional and phenotypic annotation is highly desirable. Queries in AceView interrogate over all
data in the database, including the set of abstracts of publications available
from PubMed (and worm meetings) that we index to this effect. Our main source
of information is PubMed. One gets a good feel for the
biology from it, yet it is limited and somewhat ambiguous, because not all
papers are attached to the specific gene(s) they describe, and some are
associated to too many genes. We filter the latter category out by manually
reviewing the papers associated to more than 100 genes and not indexing them if
they do not seem biologically relevant.
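The first half of this filter, finding the papers attached to more than 100 genes so that they can be reviewed by hand, could be sketched as follows. The data layout and the function name are hypothetical; the decision on biological relevance remains manual.

```python
from collections import defaultdict


def papers_needing_review(gene_to_papers, max_genes=100):
    """Return the paper identifiers linked to more than `max_genes` genes.

    `gene_to_papers` maps a gene name to an iterable of paper identifiers
    (for example PubMed ids). The result is sorted for reproducibility.
    """
    genes_per_paper = defaultdict(set)
    for gene, papers in gene_to_papers.items():
        for paper in papers:
            genes_per_paper[paper].add(gene)
    return sorted(paper for paper, genes in genes_per_paper.items()
                  if len(genes) > max_genes)
```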
For
additional annotation, our current procedure depends on the species:
In the nematode C. elegans, we perform annotation
ourselves in a well structured database, WormGenes. We use the Worm Transcriptome
Project of Kohara as a main data source and exploit
many resources, published or available on the Web, including WormBase, the CGC, RNAiDB, the Worm ORFeome site, and of course PubMed.
WormGenes was the direct source of RefSeq for worm at NCBI from 2002 to early
2005, when the task was pre-empted by WormBase.
For human, we use extensively the
NCBI resources, and copy fields from LocusLink /Entrez Gene: names and aliases,
literature and its annotation by NLM, RefSeq annotation, GO. We provide links to the wonderful OMIM texts of McKusick and collaborators but have not yet indexed them. We also provide links to the
excellent GeneCards resource, which is available
free of charge to all academics, but may be fee/access-limited for
profit-oriented users.
We are open to any suggestions to make biological annotations
better.
AceView
offers, through a versatile query system, lists and individual genes
annotated with text descriptions, tables, graphical representations and
external links. Three main pages are available:
Gene on Genome
Annotated mRNA(s) page (where the spliced
variants reconstructed by AceView are described and depicted)
External links, also provided in the first paragraph of the “gene on
genome” page.
As
biologists, we are used to reading text descriptions accompanied by tables of
results and graphs or diagrams, and a database dump does not have the same
appeal. That idea guides the development of our viewer: in AceView, we try to
automatically build summaries and sentences in English from the data in the
various fields in the database. We believe we gain a dimension in the depth and
accuracy of the description, because we can modulate each piece of text as a
function of multiple fields in the database. A drawback of automatic text generation
is a “metallic” accent, that we hope people can
accommodate. If you feel otherwise, please tell us!
But users in quest of a solid and thorough
integration of the cDNA and genome data should find in AceView the answers to
their questions, at all levels of detail. AceView may look a bit scary, but
that is because the data are intrinsically complex, and we do not have to
oversimplify since AceView at NCBI can be viewed as a
complement to the RefSeq project at NCBI, which provides a
simple and very secure view of the best known genes.
updated May 16, 2009
· Are AceView transcripts reliable?
· Why don’t I hit the AceView transcripts when I do a Blast
search on the human genome?
· Could you add references for the statement you make about
alternative promoter, antisense, etc.?
· Where do AceView gene names come from? Can I use them in
publications? or: I found no reference to the gene “jawker” in PubMed, is that expected?
· Why do you have composite genes, and names such as GeneAandGeneB, GeneA+GeneB, GeneNameC or GeneNameCo?
· What are “main genes”, “putative genes” and “cloud genes” in AceView?
· How is the 'best' protein in each mRNA selected and the protein score
computed in AceView?
· Why is the AceView protein longer than the protein from a very similar transcript in GenBank?
· Can you help me design primers to amplify specific transcripts?
Or other similar questions...
· I am hunting for a gene responsible for a
phenotype in a region of a chromosome. Can you help?
· I found a bug, or would like to submit a wish…
We carefully
align each mRNA or EST onto the genome. Then we compare all possible alignments
and keep only the best position on the genome and the most compact alignment.
We reject the low-quality hits. In the August 05
release, we use 91% of all mRNAs currently in GenBank to reconstruct AceView
transcripts; they match the genome at 99.74% identity
over 98.8% of their length on average. We also use 76% of all ESTs in dbEST, and they align at 98.16% identity over 93.3% of
their length. Therefore our individual alignments are highly reliable. Of
course, we align 99.7% of the RefSeqs at 99.98%
identity over 99.9% of their length (but they are made to fit, so the numbers
are meaningless!).
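To make the selection step concrete, here is a small Python sketch of "keep only the best position and the most compact alignment, reject the low-quality hits". It is a simplification, not the AceView aligner: the `Alignment` record, the ranking key and the two thresholds (`min_identity`, `min_coverage`) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    cdna: str         # accession of the mRNA or EST
    chrom: str
    start: int        # genomic start, inclusive
    end: int          # genomic end, inclusive
    aligned: int      # number of cDNA bases aligned
    mismatches: int   # single base differences in the aligned part
    cdna_length: int  # full length of the cDNA sequence

    @property
    def identity(self):
        return 1 - self.mismatches / self.aligned

    @property
    def coverage(self):
        return self.aligned / self.cdna_length

    @property
    def span(self):
        return self.end - self.start + 1


def best_alignments(alignments, min_identity=0.95, min_coverage=0.80):
    """Keep, for each cDNA, its best acceptable alignment.

    Hits below the (hypothetical) thresholds are rejected. Among the rest,
    the one with the most matching bases wins; ties go to the most compact
    alignment, i.e. the smallest genomic span.
    """
    best = {}
    for aln in alignments:
        if (aln.aligned == 0 or aln.identity < min_identity
                or aln.coverage < min_coverage):
            continue
        key = (aln.aligned - aln.mismatches, -aln.span)
        if aln.cdna not in best or key > best[aln.cdna][0]:
            best[aln.cdna] = (key, aln)
    return {cdna: aln for cdna, (key, aln) in best.items()}
```

The sketch keeps a single position per cDNA and so does not model the few sequences that have two or more equivalent positions on the genome.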
Few cDNA
sequences have an ambiguous map: only 1.3% of the mRNAs and 2.2% of the ESTs
have two or more equivalent niches on the genome: they participate in the
reconstruction of multiple, sometimes exactly repeated, AceView genes.
The introns,
and exons bounded by introns, are numerous but also highly reliable. By our
criteria, a well-defined intron has at least eight bases on both sides of the
intron boundary that are identical in at least one cDNA sequence from
GenBank/dbEST and the current prototype of the human genome. 96.6% of the
introns in AceView meet this stringent criterion.
Graphically, the well defined introns are drawn as a broken line (> or >)
to distinguish them from the less well defined introns, the “fuzzy”, drawn with
a straight colored line (| or |). The 4.5 million ESTs and mRNAs used in the
Aug05 release identify 317,417 different well defined introns, a very large
number indeed: the 35,991 genes with standard introns in this release (some
partial) have on average 9.1 good introns per gene. 65% of the introns are
alternative (i.e. not found in all transcripts of the gene). The table below
shows the frequencies of intron boundaries: 98.5% are standard, and we list
only 265 cases of putative U12-type spliceosome involvement. In this new release, we have aggressively annotated and ignored
anomalous splice sites (as explained here), because
most do not recheck by direct RT-PCR in the Havana Gencode project. There are
now only 1.4% left; these most probably correspond to structural defects in the
cDNA, but we could not ignore this cDNA altogether because it was also bringing
us a novel standard intron, which we trust, because defects in cDNA, such as
partial insert deletions, are local.
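The "well-defined intron" criterion (at least eight bases identical between a cDNA and the genome on both sides of the intron boundary) can be written down as a short check. This is one reading of the criterion, given as a sketch: the 0-based coordinate conventions and the function names are assumptions, and real alignments would also have to deal with strand and gaps.

```python
def cdna_supports_intron(genome, cdna, donor, acceptor, junction, flank=8):
    """Check one cDNA against one intron.

    genome   : genomic sequence (string)
    cdna     : cDNA sequence (string), same strand as `genome`
    donor    : 0-based genomic index of the first intron base
    acceptor : 0-based genomic index of the first exon base after the intron
    junction : 0-based cDNA index of the first base of the downstream exon
    Returns True if `flank` bases on each side of the junction are identical
    in the cDNA and in the flanking genomic exons.
    """
    if donor < flank or junction < flank:
        return False
    if acceptor + flank > len(genome) or junction + flank > len(cdna):
        return False
    left_ok = genome[donor - flank:donor] == cdna[junction - flank:junction]
    right_ok = genome[acceptor:acceptor + flank] == cdna[junction:junction + flank]
    return left_ok and right_ok


def intron_is_well_defined(genome, donor, acceptor, supporting, flank=8):
    """An intron is well defined if at least one cDNA supports it.

    `supporting` is a list of (cdna_sequence, junction_index) pairs.
    """
    return any(cdna_supports_intron(genome, cdna, donor, acceptor, junction, flank)
               for cdna, junction in supporting)
```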
Introns are
especially reliable in AceView, because there is a lot of redundancy in the
cDNA data, and our co-alignment technique allows us to ascertain the boundaries
without a shadow of doubt. Each intron is described with its support in the
intron-exon table. Similarly, almost all exons are fully reliable: less than 1%
of exons required concatenation of 2 cDNA sequences. Only the terminal exons,
especially at the 3’ end, suffer from our choice to mask the diversity in
polyadenylation sites. But overall, merging the different UTR lengths reduces
the number of variants and still seems preferable to most of our users. Notice
that the mRNA graphical view allows you to “see” the alternative polyadenylation
sites when they occur, but beware that a substantial number actually correspond
to internal priming of the cDNAs in A-rich regions, which are naturally
enriched in the A/T rich UTRs, and we have not labeled these explicitly yet.
A less
straightforward question is the reliability of the whole transcript, i.e. the
reliability of the chaining of exons and introns. In the August 2005 release,
69% of the AceView proteins above 100 and above 300 amino acids are fully
encoded within a single identified clone in GenBank/dbEST (91,738 different proteins
> 100 aa; 24,367 distinct proteins > 300 aa all have their coding part
fully covered by a known cDNA). The remaining 31% of AceView proteins could still
result from an inappropriate concatenation of two clones (26%) or more (5%),
because, if no conflicts arise, we merge partial transcripts.
The picture
on the gene page displays explicitly the reliability of the CDS encoded by each
transcript: a reconstructed transcript variant (pink) holds its name (a, b,
c...) underneath it; the name is underlined if the CDS can be covered by a
single clone (a, c) hence the protein is secure. Names not underlined signal
provisional concatenations that will require
experimental evaluation. Indeed, there are usually multiple ways to concatenate
cDNAs, but in AceView we chose to present the minimal number of variants and to
avoid a combinatorial explosion, hoping that people at the bench will find evidence for
more variants, not fewer. Once compatible cDNAs are merged into long transcripts
containing GenBank mRNAs, the left-over fragments, each incompatible with the
main transcripts, may contact one another and get merged,
but we provide a
direct clue in the “proteins” table. For each variant, we count the minimal
number of clones needed to support the whole CDS or the whole mRNA,
structurally. We report the identity of the clones/sequence in the “Protein”
table, under “minimal set of supporting clones”.
- If a variant needs only one clone to be covered, the structure of the putative
protein is reliable. Provided it extends to the Stop (second column of the same
table) and is not included in another, or even better is complete (first column
of the same table), which for us means there is also a Stop upstream of the Met,
and if in addition the transcript has a reasonable coding potential, this
variant is of RefSeq quality. You may look up the first table in “Main table of
supporting clones” to see if an NM supports this variant, but even if the
variant is not yet a RefSeq at NCBI, it should become
one in the future.
- If a variant needs two or more clones to cover the CDS, it is more
questionable, because we might be cutting and pasting two existing alternative
forms that may not be compatible, and the fusion molecule may not exist as such
in the cell, but the two parts would be found separately in different
molecules. If each of the two clones were sequenced in full, they would
have a rather high probability of belonging to non-compatible transcripts,
because we already see on average five alternative variants per human gene with
confirmed introns.
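Counting the minimal number of clones needed to cover a CDS is, in its simplest form, an interval-covering problem. The sketch below uses the classical greedy solution. It is only an approximation of what is described above: clones are reduced to (name, start, end) intervals in mRNA coordinates, successive clones are required to overlap by at least one base, and the function name is hypothetical.

```python
def minimal_supporting_clones(clones, cds_start, cds_end):
    """Return a smallest list of clone names covering [cds_start, cds_end].

    `clones` is a list of (name, start, end) tuples, inclusive coordinates
    on the reconstructed mRNA. Returns None if the CDS cannot be covered by
    a chain of overlapping clones.
    """
    remaining = sorted(clones, key=lambda clone: clone[1])
    chosen = []
    reach = cds_start - 1  # last covered position; nothing covered yet
    i = 0
    while reach < cds_end:
        # The first clone must start at or before the CDS start; later
        # clones must overlap the region already covered.
        limit = cds_start if not chosen else reach
        best = None
        while i < len(remaining) and remaining[i][1] <= limit:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= reach:
            return None  # gap: no clone extends the covered region
        chosen.append(best[0])
        reach = best[2]
    return chosen
```

A result of length one corresponds to the "single clone" case; a longer list corresponds to a reconstructed CDS that would deserve experimental confirmation.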
To fix ideas,
in human build 33 release, 74,114 (64%) of the spliced transcripts are
supported by a single identified clone covering the entire CDS; the remaining
could represent an inappropriate concatenation of two clones (28%) or more
(8%).
Finally we
may discuss the reliability of the protein. Apart from the possibility of
artificial mosaics described above, when more than one clone is needed to cover
the structure of the CDS, there are five kinds of possible problems.
1) Truncated
protein: The transcript may be incomplete (indicated in the “protein” table).
2) An error
in the genome sequence may lead to a truncated or mutated protein, because we
annotate the protein derived from the genome sequence. This is rare, but we still
see a few. Note that we also provide the sequence of the protein derived from
the consensus of the cDNAs that best matches the genome (called .AM for AceView
reference mRNA; the sequence is accessible in the transcript table or the fasta
sequence file).
3) We may
occasionally have chosen the wrong coding frame.
4) We
annotate one protein per transcript, so we lose the second protein in the 26%
of cases where more than one protein could be made from different regions of the
molecule (polycistronic-like transcripts, or uORFs, or smaller ORFs in 3’UTR).
We also have bugs (or bad features) in the program and may have artificially
fused two sequences that overlap, but encode different proteins, when it would
have been preferable not to fuse them.
5) The decision
on which putative protein is worth annotating, i.e. which RNA sequence has been
seen by the live ribosome as worth scanning and translating, is often extremely
hard: it supposes we know the answer before we have the data. Using
informatics, it is easy to define conservative criteria, usually based on
comparison to other species, and to choose one good CDS from about 20000 human
genes, but in general, given a standard transcript sequence, selecting an
initiator Met and a proper CDS remains a (wild) guess, in the absence of large
datasets, for example mass spectrometry data, on real eukaryotic proteins. We
expect surprises when the proteins are known rather than inferred (see for
example the good work by the Sugano group): for now, we
conservatively annotate multiple CDS per mRNA and pick the best for display to
the public; if a CDS is complete (bounded by an in frame stop or there is an
accumulation of 5’ ends there) we pick as initiator Met either the first ATG,
or the first in frame NTG if the protein gains at least 30 residues upstream.
Most of the annotation, except for the signal peptide, can be truncated by eye,
so we leave it to the researcher to look at the shorter protein, starting at the
ATG of their choice (i.e. the green line - by the side of the protein drawing).
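The initiator rule for a complete CDS (first ATG, or the first in-frame NTG if it adds at least 30 residues upstream) can be sketched in a few lines of Python. This is one possible reading of the rule, not the AceView implementation: the function name, the `orf_start` convention and the behavior when no ATG exists (returning None) are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def choose_initiator(mrna, orf_start, min_gain=30):
    """Pick the start codon of an open reading frame.

    mrna      : mRNA sequence written as DNA (A, C, G, T)
    orf_start : 0-based index of the first codon of the open frame, for
                example the codon just after an upstream in-frame stop
    Returns the 0-based index of the chosen initiator codon, or None if
    the frame contains no ATG before its stop (a case the text above does
    not specify).
    """
    first_atg = None
    first_ntg = None
    for pos in range(orf_start, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].upper()
        if codon in STOP_CODONS:
            break
        if first_ntg is None and codon[1:] == "TG":
            first_ntg = pos  # ATG, CTG, GTG or TTG
        if codon == "ATG":
            first_atg = pos
            break
    if first_atg is None:
        return None
    # Number of residues gained by starting at the NTG instead of the ATG.
    gain = (first_atg - first_ntg) // 3
    return first_ntg if gain >= min_gain else first_atg
```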
Yes, this is
an important word of caution. Although AceView is being developed at NCBI, it
is not visible from NCBI Entrez. AceView is still considered a research project
at NCBI, and the AceView human reference transcripts and genes are not yet
visible from Entrez or from any of the general query tools at NCBI, including
Blast. This is why we offer a specific Blast page to search over AceView
transcripts, and we are trying to improve Web accessibility. We also make the
sequences available for download on our ftp site, and they are displayed on the
UCSC Genome Browser, as one of the gene tracks.
All
annotations in the fields “Expression” and in the entire section on variants
and mRNAs are the results of our AceView analysis, not a result taken from the
literature (see details). Sometimes, there may be a paper
in the literature that describes the same kind of observations, and it will
hopefully be referred to in the “litterature” field
at the bottom of the gene page. But our annotations are independent; they are
only based on aligning all cDNA sequences from the public databases onto the
genome sequence. Please cite us for the AceView analysis.
The
Bibliography, in the “Gene on Genome” page, is taken from Entrez Gene, itself
largely helped by OMIM. Each paper points to its summary in PubMed, and you can
use the “Related papers” function of PubMed to get a more complete
bibliography. Authors who think their paper provides a significant new result
and is missing from the list may go to their gene in Entrez Gene (using, for
example, the link we provide as the gene title); they can add a pointer to
their paper with a note of its main contribution using the RIF (Reference Into Function) option.
Yes. We are
developing a new query by feature that will allow you to look at lists of genes
ordered by position in a given area, but in the meantime, it is easy for us to
send you an html table of the genes lying in between two markers, or in a given
region that you would define by coordinates. The list will have interesting
data and each gene will be clickable and connect to our main AceView server.
Just send us a mail.
Concerning
bugs, we really hate them, track them, and want them all reported! Thanks!
Concerning
wishes or suggestions, please do not hesitate to send them: many do not take
long to implement or to fix, and then it makes a big difference to all users!
You can look at an outline of our plans for future development, but it is always
essential for us to get feedback from you. We do this project because it is
very challenging and therefore a lot of fun, but above all we do this work to
try to help you with your experiments at the bench, in the hope that we all
will learn more about the wondrous world of genes.
For human, we
do not recommend that you use AceView gene names in publications: as is customary,
you should try to obtain an official gene name by consulting with the
nomenclature committee (HUGO). That being said, as we explain below, we think
we now track correctly most of the gene names from one release to the next. But
we do not track the variant identifiers: we just call
variants a-b-c… in decreasing order of size of the protein, so that variant
aDec03 will possibly differ from variant aMar04. Please never quote a variant
without saying which release you refer to.
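A small sketch shows why a variant letter only makes sense together with its release: letters are reassigned at each release from the current protein sizes. The function name, the tie-break on transcript identifier and the limit of 26 variants are assumptions made for the example.

```python
import string


def name_variants(protein_lengths, release):
    """Label the variants of one gene a, b, c... by decreasing protein size.

    `protein_lengths` maps a transcript identifier to the length of its
    protein; `release` is a tag such as "Dec03". Returns a dictionary from
    transcript identifier to a label such as "aDec03".
    """
    letters = string.ascii_lowercase
    if len(protein_lengths) > len(letters):
        raise ValueError("this sketch handles at most 26 variants per gene")
    # Largest protein first; ties are broken by identifier for determinism.
    ordered = sorted(protein_lengths,
                     key=lambda tid: (-protein_lengths[tid], tid))
    return {tid: f"{letter}{release}" for tid, letter in zip(ordered, letters)}
```

If a longer protein is reconstructed in a later release, every smaller variant shifts by one letter, so a label such as "b" alone does not identify a transcript.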
Naming genes
is a difficult issue: genes are hard to identify and hard to recognize, yet we
need to be able to refer to them by unique, stable, callable names. In AceView
initially, we used unique but unfriendly positional identifiers that were
build-dependent (such as G_t1_Hs1_4478_30_0_2551). In August 2002, for the
human release 30, we switched to a new three-step naming system:
1) If an AceView gene matches an NCBI gene model on the genome, or if it is
supported by a GenBank mRNA or a RefSeq sequence carrying a HUGO official
symbol or a LocusLink provisional symbol, we adopt this symbol as the name of
our gene. In the few instances where more than one gene fights for the same
symbol, we call them by the symbol and add an ending .1, .2,… for the second, third, and so on. This happens in 3 to 5% of the genes in
build 33, usually in unfinished regions of the chromosome where artificial
duplications are frequent, or if the gene was too long or too complex to be
handled correctly by AceView, which split it into pieces. In build 34 from
Aug05, it happens only in 1% of the genes, the truly repeated ones, because all
the above problems were solved, but we made progress in identifying repeated
genes!
2) If strategy 1 fails, we look for the most significant protein family Pfam hit,
and we name the gene accordingly, for example kinesin_1, _2,
... _27. Of course, we forbid reuse of official names in this class.
3) If strategy 2 fails, we number the genes sequentially by position as before and
transform the number into a pronounceable pseudo-word, each digit in the
appropriate number base corresponding to a phoneme, with the slowest moving
digit to the right. We compose either a pseudo-Japanese-sounding name, such as sayuri, kimu, or nowara, to label genes where one of the principal clones is
Japanese; else we use a pseudo-English-sounding name, such as jawker or sneery.
The resulting
names have two, three or four phonemes and are composed exclusively of
lowercase letters; they are easy to pronounce and to remember, and genes of the
same area of the genome often share the same suffix. We do not expect that you
would find these names or the Pfam-derived names in PubMed.
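The number-to-pseudo-word step of strategy 3 can be illustrated as follows. The syllable table below is invented for the example and is not the AceView phoneme table; only the principle (one phoneme per digit, slowest moving digit to the right) is taken from the description above.

```python
# Hypothetical phoneme table: one syllable per digit value (base 10 here).
SYLLABLES = ["ka", "ki", "ku", "sa", "yu", "ri", "mu", "no", "wa", "ra"]


def pseudo_word(number, syllables=SYLLABLES, min_phonemes=2):
    """Turn a non-negative gene number into a pronounceable pseudo-word.

    The number is written in base len(syllables). The fastest moving
    (least significant) digit comes first and the slowest moving digit
    last, so that consecutive numbers share a suffix.
    """
    if number < 0:
        raise ValueError("number must be non-negative")
    base = len(syllables)
    digits = []
    while True:
        number, digit = divmod(number, base)
        digits.append(digit)  # least significant digit first
        if number == 0:
            break
    while len(digits) < min_phonemes:
        digits.append(0)      # pad short names on the right
    return "".join(syllables[d] for d in digits)
```

With this table, genes numbered 120 and 121 become "kakuki" and "kikuki": neighbors differ at the start of the name and share the ending, which matches the observation that genes of the same area often share a suffix.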
Since
we started this strategy in August 2002, we have tried to maintain the same
names, but it is not a trivial undertaking. What we do is attach the name to
the best clone of the gene, see where this clone remaps, and let it bring the
old name. At each release, we may lose a few names and create new ones.
However, if a gene name is reused, it is most likely the same gene or its best
approximation in the new genome. Of course, as time goes on, more and more
genes have a LocusLink/geneID name, which supersedes the AceView gene names.
Many of our
users are wondering why some AceView “genes” appear to correspond to two genes,
known to encode two completely different proteins. This difficulty springs from the definition
of what a gene is. Operationally, until August 2005, AceView defined genes
molecularly by the clustering and contiguity of the footprint of the mRNA
sequences on the genome. So, whenever two known genes actually overlapped in
sequence, because some cDNA clone bridged the two, both genes were considered
part of the same AceView “gene”. This
created ambiguity for naming, especially when the two genes have an official
name and a known function. To solve the problem in these rather frequent
instances, we have introduced the notation GeneAandGeneB.
However, it is usually possible to understand, from the description of the
transcripts or from the graph, which type of protein a particular transcript
produces.
In some
instances, the overlap is only through the untranslated regions: gene A may
produce various isoforms of protein A, but one of its 3’ UTRs contacts the 5’ end of
the gene immediately downstream on the same strand, which itself produces
various isoforms of an unrelated protein B. These are annoying cases. We solved
this problem in August 2005 by redefining the gene (see the news). We now demand that transcripts
from a single gene share not only some transcribed sequence, but at least one
intron boundary. This way, we shrug off the loose attachments of that kind of
complex, and secondarily of unspliced transcript variants: we allow genes to
overlap through their 3’ and 5’ intronless regions (usually UTRs). Note that the
mere sharing of transcribed areas may be interesting biologically, since it
indicates that transcription of one gene may leak through the next gene in cis,
suggesting that the two genes may belong to some kind of co-regulated
operon-like unit.
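The August 2005 gene definition can be pictured as a clustering problem: transcripts that share at least one intron boundary end up in the same gene. The following is only a minimal sketch under stated assumptions, not the AceView code: the input layout (a strand plus a list of donor/acceptor coordinates per transcript), the function name and the use of a union-find structure are all choices made for this example.

```python
from collections import defaultdict


def cluster_by_intron_boundary(transcripts):
    """Group transcripts that share at least one intron boundary.

    `transcripts` maps a transcript name to (strand, introns), where introns
    is a list of (donor, acceptor) genome coordinates (hypothetical layout).
    Returns a list of sets of names. Sharing is transitive, and unspliced
    transcripts stay alone in this sketch.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # (kind, strand, position) -> first transcript using it
    for name, (strand, introns) in transcripts.items():
        for donor, acceptor in introns:
            for key in (("donor", strand, donor), ("acceptor", strand, acceptor)):
                if key in first_seen:
                    parent[find(name)] = find(first_seen[key])
                else:
                    first_seen[key] = name

    genes = defaultdict(set)
    for name in transcripts:
        genes[find(name)].add(name)
    return sorted(genes.values(), key=sorted)


example = {
    "A.a": ("+", [(100, 200), (300, 400)]),
    "A.b": ("+", [(100, 250)]),      # shares the donor at 100 with A.a
    "B.a": ("+", [(900, 1000)]),     # overlaps nothing by intron boundary
    "u": ("+", []),                  # unspliced transcript
}
print(cluster_by_intron_boundary(example))
```

In this toy input, A.a and A.b form one gene, while B.a and the unspliced transcript each remain separate, even if their exons happened to overlap those of gene A.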
In other more
involved cases, the overlap in sequence is through specific transcripts that do
share exons and introns, hence usually coding regions, with both the A and B
genes: the gene is a complex locus, it does include both known genes A and B,
but may in addition produce proteins of type AB. Finally, the current 1079 instances
of GeneAandGeneB remaining in the Aug05 build are
largely due to some AceView models contacting two predicted NCBI models with
two GeneIDs, which may become merged in the NCBI gene
database in a future release.
In nematode,
we do manual editing and annotation of all the genes, to complement and enhance
the WormBase view. By eye, it is easy to distinguish the case “complex locus”
and the case “concatenated genes contacting through their 3’UTR/5’UTR”. Complex
loci were initially defined on the basis of complementation tests in phage or
Drosophila. In such a gene, two complete alternative variants may not share a
base, although a third variant “bridges” them and overlaps both. By convention,
we denote this property by adding the letter C for complex behind the name
(e.g., 1C777C) or by a + sign if the gene had acquired two independent official
gene names (e.g., mai-1+gpd-2+gpd-3). This convention follows
that chosen by Ed Lewis for the Drosophila complex genes (e.g. BXC for bithorax complex locus, of which bx, Ubx, Cbx and pbx for example are alleles). In
the case of nearby genes expressed from the same strand but with overlap
between the 3’ end of the first gene and the 5’ end of the second, we use the
suffix Co (for sequence in COmmon) appended after the
gene name (e.g., cul-1Co or 5K225Co). We may also use the sign AND if both
genes have a name (e.g., mev-1ANDced-9), but as of August 2005, the
latter type should just vanish since the two genes can be shed and separated
from one another under our new gene definition.
As of
November 2004, we have classified the genes in three classes of decreasing
interest (or increasing questionability):
The main genes include the protein coding
genes (defined here by CDS > 100 amino acids) and all genes with at least
one well-defined standard intron, i.e. an intron with a gt-ag or gc-ag
boundary, supported by at least one clone matching exactly, and with no
uncalled base, the 16 bp bordering the intron (identical to the genome over the
last and first 8 bp of the successive exons). We added in this category some
genes for which the CDS is smaller than 100 amino acids,
provided they have a NCBI RefSeq sequence (NM_#) or an OMIM, or they encode a
protein with BlastP homology (expect < 10^-3) to a “real” nematode AceView
protein. When we introduced this distinction, in November 2004 (build 35c),
there were about 51,000 main human genes, 40,245 encoding CDS of more than 100
amino acids, and 11,046 encoding shorter peptide if any, but with at least one
standard intron, which validates their RNA nature.
The main
genes are displayed on the “neighborhood icon” on top of each gene page.
Clicking on any gene in this icon brings you to the gene.
The putative
genes have no standard intron and do not encode CDS of more than 100
amino acids, yet they belong to a category that may be useful not to disregard
completely. They may be of two types: either they are supported by more than 6
cDNA clones, or they encode a putative protein with an “interesting
annotation”, for example a PFAM motif, a BlastP hit to another species than
itself with expect < 10^-3, a transmembrane domain or
other rare and meaningful domains identified by Psort2, or a highly probable
localization in a cell compartment (excluding cytoplasm and nucleus). We
disregard genes in this category if their extent on the genome is greater than
10 kb.
In November
2004 (build 35c), there were 60,089 human genes in this gray zone. Note that
some may represent pseudogenes that have not yet diverged fully.
The cloud genes include all other genes. Although
they are supported by sequence data submitted to GenBank as cDNA, they are less
confirmed than the other AceView genes: they are expressed at low level (they
include fewer than 6 overlapping cDNA clones), they do not have standard
introns, are not clearly coding for proteins, and of course they do not have an
associated RefSeq. The transcripts are usually shorter than 1 kb, and we
discard those extending beyond 2 kb on the genome, because they usually contain
dubious structural features.
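The three classes described above can be summarized as a small decision function. This is a sketch under assumptions, not the production rules: the Gene record and its field names are hypothetical, and we assume that a gene failing the size limit of its class is simply dropped (returned as None here).

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical summary record for one gene (field names are invented)."""
    cds_aa: int                           # best predicted CDS, in amino acids
    standard_introns: int                 # validated gt-ag or gc-ag introns
    cdna_clones: int                      # supporting cDNA clones
    has_refseq_or_omim: bool = False
    has_nematode_homolog: bool = False    # BlastP hit with expect < 1e-3
    interesting_annotation: bool = False  # Pfam, BlastP, Psort domain, localization
    extent_bp: int = 0                    # extent on the genome


def classify(gene):
    """Return 'main', 'putative', 'cloud', or None when the gene is dropped."""
    if gene.cds_aa > 100 or gene.standard_introns >= 1:
        return "main"
    if gene.has_refseq_or_omim or gene.has_nematode_homolog:
        return "main"  # short CDS rescued by RefSeq, OMIM or homology
    if (gene.cdna_clones > 6 or gene.interesting_annotation) and gene.extent_bp <= 10_000:
        return "putative"
    if gene.extent_bp <= 2_000:
        return "cloud"
    return None  # assumption: oversized putative or cloud genes are discarded


print(classify(Gene(cds_aa=50, standard_introns=0, cdna_clones=10, extent_bp=1500)))
```

The order of the tests matters: a gene is only considered for the putative class once it has failed every criterion of the main class, and for the cloud only after failing both.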
The
biological relevance of the cloud genes remains to be established. It might be
tempting to assume that they are DNA contaminations in the RNA preparations,
but instead our analysis of their biased locations indicates that most
represent intermediates in the transcription process: they localize
significantly more frequently “under” the main genes, on the same strand and in
introns, but they constitute separate genes because they do not physically
contact any other gene (which is our criterion for
gene definition).
We used to
filter this category before build 34, to keep only the genes supported by at
least six clones, or encoding proteins or with standard introns. But people were
sometimes searching AceView to locate the best alignment of a clone, and
failing to find it because of the filter. We call this class of “genes” the
cloud, and indicate this in the gene title.
We witness important progress in peptide identification, and large
proteome datasets have started to become available. But we should keep in mind
that most annotated protein sequences are predicted from the mRNA
sequence (or the genome): even the current Swissprot/Uniprot is mostly composed of predicted rather
than experimentally validated proteins.
As of 2007, we have devised an empirical new score system to choose the
most likely product from each mRNA, and to help decide if a transcript is
potentially protein-coding or not.
1) We score the length of the predicted CDS:
a) If the CDS is above 100 amino acids, every 100
amino acid stretch scores 1 point,
b) CDS between 60 and 100 amino acids score 0 points, CDS below 60 amino acids score -1 point. In addition,
if the initiator Met codon is atypical (non-ATG codon fitting Kozak’s rule) and the CDS has fewer than 80 amino acids,
a penalty of -0.85 is applied.
2) We count introns within the CDS and introns outside
the CDS. “Good looking CDSs” have all or almost all intron scars
within the CDS.
a) If all introns are within the CDS, each intron
scores 1.
b) If 1 intron is outside the CDS, each intron
inside the CDS scores 1 point, except if there are only two introns total, 1 in
1 out, then the score is 0.
c) If a transcript has a unique intron outside the CDS,
the score is -1.
d) If 2 or more introns are outside the CDS, we score +1
per intron inside the CDS and -1 per intron outside the CDS. However, some rare
transcripts have an operon-like structure (e.g. BAGE4andTPTE.aNov06):
unlike pre-messengers, they may have many introns, yet potentially encode multiple CDSs
that are located in succession on the transcript and appear molecularly distinct (i.e.
not clearly belonging to a unique protein and encoded by a partly unspliced
mRNA). To avoid losing these significant CDSs, which could
all end up having negative scores and be all dismissed, we score a maximum of 4
introns outside the CDS (maximum penalty of 4 for introns outside the coding
region). Note that in some instances, nonsense-mediated decay (NMD) may lead to rapid degradation of
mRNAs whose stop codon lies more than ~55 bp upstream of the last intron scar,
but the large number of cDNA sequences from scores of cDNA libraries that share
this property indicates that NMD does not act on all genes, all transcripts, in
all tissues or at all times, or else that it is quite inefficient.
3) We score the interest or credibility or ‘conservative
quality’ of the protein in three ways, for a total of 1 point maximum.
Properties that will score are conserved sequence, or presence of conserved
motifs (PFAM or some PSORT), of interesting and rare predicted localization.
a) We examine BlastP homologies with expect less than 10^-3 and
run TaxBlast. Existence of at least one BlastP
hit to a species other than self scores a ‘conservation’ point.
b) We consider the Pfam significant hits, with
thresholds as recommended by Sean Eddy. We exclude hits to frequent retrotransposons and retroposons, so as not to
rescue these products too actively: DDE, gag_*, GP36, rve, rvp, rvt_1, transposase_* and
ribosomal_*. Existence of any other Pfam hit scores 1
conservation point, unless the protein already had 1 conservation point
from BlastP/TaxBlast.
c) We examine the motifs defined by Kenta Nakai’s Psort2
collection of programs and exploit the predicted cellular localization. We
score a maximum of 1 point, again only if there was no point from BlastP and
Pfam, if any of the following domains is
found:
i) transmembrane domain
ii) coiled-coil region
iii) ER retention domain
iv) Golgi transport domain
v) N-myristoylation domain
vi) Prenylation domain
vii) With high probability (>=50%, except for Golgi,
>40%), the NH2 and COOH complete CDS of more than 70 aminoacids is predicted
to be secreted or extracellular, or localized either in the plasma
membrane, the mitochondria, the endoplasmic reticulum, the Golgi, the
cytoskeleton, peroxisomes, lysosomes, secretory
vesicles.
4) It is not rare that an RNA transcript sequence has
the potential to encode multiple proteins. To decide between very closely rated
CDS which is ‘the best’, we score the position of the predicted protein
relative to the transcript. The protein most 5’ will be the ribosome’s first
encounter, so if a CDS is NH2 and COOH complete and already has at least one
point (i.e. is already encoding a ‘good’ protein), it scores an extra 0.8
points. Finally, we compare the two CDS lengths and add 0.001 point per amino acid, so that the longest CDS wins over a shorter one
with equal annotation score.
5) Final classification:
a) A CDS with 1 to 5 points is ‘good’, a CDS with 5
points or more is ‘very good’. For each mRNA, we select the CDS with the
highest score and do not display annotation for all other CDSs unless
they are of ‘very good’ grade. The integral part of the score is given in the
text.
b) AceView genes have at least one of the following
properties: either
i) a ‘good’ protein, with score above 1 as defined above
ii) or at least one intron with standard boundaries
(gt-ag or gc-ag),
iii) or more than 5 cDNA supporting clones,
iv) or an Entrez
Gene ID or an OMIM annotation, or finally they include a RefSeq NM or NR.
c) Other genes with none of the above properties become
‘cloud genes’; they are supported by 1 to 4 cDNAs, they align with no intron
and do not visibly encode a protein. Note that the average length of cloud
genes is 500 bp (± 200), leaving open the possibility that a fraction
of those genes represent 5’ or 3’UTR of partial transcripts of previously known
or new genes. Others may correspond to artefacts, such as genomic DNA
contaminations in RNA libraries. Deep RNA sequencing will teach us more about
the properties of these genes and bring some evidence as to their real or artefactual nature.
A CDS with a negative score, or with a score below 1, corresponds to either a partial
product, or a non-coding RNA.
d) Promoter sequence: To facilitate the search for
promoter elements, we currently provide the 2 kb sequence upstream of each
mRNA. Please tell us if you would prefer a longer
sequence upstream or downstream.
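To make the scoring rules 1) to 5a) above easier to follow, here is a minimal sketch of how the points could be added up. It is an illustration under assumptions, not the AceView implementation: the function and parameter names are invented, the length score is taken as one point per complete stretch of 100 amino acids, the three conservation tests are collapsed into a single flag worth at most 1 point, and the 0.8 bonus is given only when the caller says the CDS is the most 5’ complete one.

```python
def length_score(cds_aa, atg_start=True):
    """Rule 1: points for CDS length, with a penalty for short non-ATG starts."""
    if cds_aa > 100:
        score = cds_aa // 100  # assumption: one point per complete 100 aa
    elif cds_aa >= 60:
        score = 0
    else:
        score = -1
    if not atg_start and cds_aa < 80:
        score -= 0.85
    return score


def intron_score(n_in, n_out):
    """Rule 2: introns inside the CDS count for it, introns outside against it."""
    if n_out == 0:
        return n_in                      # 2a: all introns within the CDS
    if n_out == 1:
        if n_in == 0:
            return -1                    # 2c: unique intron, outside the CDS
        return 0 if n_in == 1 else n_in  # 2b: one in, one out scores 0
    return n_in - min(n_out, 4)          # 2d: penalty capped at 4


def cds_score(cds_aa, n_in, n_out, atg_start=True,
              conserved=False, most_5prime_complete=False):
    """Combine rules 1 to 4 into one score for a candidate CDS."""
    score = length_score(cds_aa, atg_start) + intron_score(n_in, n_out)
    score += 1 if conserved else 0       # rule 3: at most 1 point in total
    if most_5prime_complete and score >= 1:
        score += 0.8                     # rule 4: bonus for the 5'-most CDS
    return score + 0.001 * cds_aa        # rule 4: tie-break on length


def grade(score):
    """Rule 5a: final classification of a CDS."""
    if score >= 5:
        return "very good"
    if score >= 1:
        return "good"
    return "partial or non-coding"


s = cds_score(350, n_in=4, n_out=0, conserved=True, most_5prime_complete=True)
print(round(s, 3), grade(s))  # 9.15 very good
```

For each mRNA one would then keep the candidate CDS with the highest score, as described in 5a).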
There are two
kinds of possible reasons, but please remember that all of the proteins we annotate
are predicted from the mRNA sequence: experimental validation would be useful,
and we would happily feed into our database all the peptide sequences that were
actually observed (please contact us if you have such data).
1.
Choice of the initiator Met: When working with mRNA and genome sequences, we do
not have access to real protein sequences: we just predict them. Translation
usually starts at an ATG (Met), but three other codons: GTG, TTG, or CTG are
candidates to be used as initiators in most species (see the codon usage table maintained by the Taxonomy
group at NCBI). Confronted with the choice of annotating a protein possibly too
long or possibly too short, we decided to use as Start any of the possible Met
codons: we pick whichever codon gives us the longest predicted protein,
provided a non-standard initiator codon adds at least 30 amino acids N-terminal
to the ATG. Using this rule, we end up annotating close to one third of the
complete proteins (bounded by a Stop on both sides) from one of the “rare”
initiator sites instead of an ATG. The same proportion is seen in very long and
very short open reading frames. Note that we have not yet fixed the code to
reflect the fact that the protein sequence would start with a Met rather than
Leu or Val, if these codons were really used as initiator.
But please
remember we annotate longer rather than shorter proteins because it is easier
to mentally truncate the annotation than to extend it. The position of the
start specifically influences annotation for a signal peptide, and of course
molecular weight and pI. And one can just look at the mRNA diagram where PFAM
(yellow), blastP (blue) Psort motifs (red) and ATG Met (-) are
displayed.
On the other
hand, if users would prefer proteins annotated from ATGs rather than NTG (with 30 residues gained), they should simply let us know.
2.
AceView allows itself to extend open reading frames,
including those of RefSeqs, counting on users to test and
validate the proposed proteins. We take into account all the data available in
GenBank and align each EST and mRNA carefully in the place of the genome from
where it originated. In a good number of cases, we find an EST aligning just
upstream of an mRNA, itself open from its first base,
and the two clones share some sequence: we merge them into a longer transcript
candidate, potentially encoding a longer protein than available in a single
GenBank entry. Because our transcript is then a composite minimally supported by more than one clone, in some cases the extension of the
protein will not be legitimate. But in any event, considering the average 5’UTR
size in human, we expect that a vast majority of complete transcripts should
contain a stop in the 5’UTR; therefore a protein not bounded by a stop on each
side would also be suspicious.
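The initiator rule described under reason 1 above can be sketched as follows. This is a simplified illustration, not the AceView code: the function name and the input convention (an in-frame DNA string with no internal stop, for instance the stretch between two stops) are assumptions, and so is the choice to fall back on a non-ATG start when the frame contains no ATG at all, a case the text does not cover.

```python
ALT_STARTS = {"GTG", "TTG", "CTG"}  # non-ATG candidate initiators
MIN_GAIN = 30  # amino acids a non-ATG start must add upstream of the ATG


def choose_start(orf):
    """Return the 0-based codon index of the chosen start, or None.

    `orf` is an in-frame DNA string without internal stop codons.
    """
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3].upper() for i in range(0, usable, 3)]
    first_atg = next((i for i, c in enumerate(codons) if c == "ATG"), None)
    first_alt = next((i for i, c in enumerate(codons) if c in ALT_STARTS), None)
    if first_alt is None:
        return first_atg
    if first_atg is None:
        return first_alt  # assumption: not specified in the text
    if first_atg - first_alt >= MIN_GAIN:
        return first_alt  # the rare initiator adds at least 30 residues
    return first_atg


long_gain = "CTG" + "GCC" * 30 + "ATG" + "GCC" * 5
short_gain = "CTG" + "GCC" * 10 + "ATG" + "GCC" * 5
print(choose_start(long_gain), choose_start(short_gain))  # 0 11
```

In the first example the CTG lies 31 codons upstream of the ATG and is kept as Start; in the second it would add only 11 residues, so the ATG wins.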
Right now, we
only provide some useful hints to do that: since January 2004, we provide the
sequence of each transcript using color and letter case to indicate exons and
coding region respectively: we use alternating colors for exons; we use upper
case for coding region and lower case for UTRs (or introns).
Such sequences are available by clicking
from the transcript table on the “gene on genome” page, or individually at the
bottom of each mRNA page. For computer-oriented people, the complete list of
transcripts coded this way is also available (in an easily parsable format) from the link to all sequences (gene page).
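For those who wish to process such sequences, the letter case alone is enough to recover the coding region. The sketch below assumes that a transcript has already been extracted as a plain string with a single uppercase run flanked by lowercase UTRs; the exact layout of the downloadable file is not described here, so reading it is left out, and the function name is ours.

```python
import re


def split_transcript(seq):
    """Split a case-coded transcript into (5' UTR, coding region, 3' UTR)."""
    seq = "".join(seq.split())  # drop spaces and line breaks
    match = re.fullmatch(r"([a-z]*)([A-Z]*)([a-z]*)", seq)
    if match is None:
        raise ValueError("expected lowercase UTRs around one uppercase coding run")
    return match.groups()


utr5, cds, utr3 = split_transcript("ggcacc ATGGCCGCT tttaaa")
print(len(utr5), cds, len(utr3))  # 6 ATGGCCGCT 6
```

A sequence with several uppercase runs, or with characters other than letters, is rejected rather than guessed at.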
To design
primers specific to one form, you may use the “table of introns and exons” in
the (alternative) mRNAs page to select exons that would provide a specific
product: each exon knows all the AceView transcripts it belongs to, so you can
examine the specificity from this table; strictly specific exons and introns are
printed in red (since 2007). But remember that some variants are better
supported than others, and we provide some evidence about this too, in one of
the columns. If you are interested in the protein, the information you need is
in the last column of the protein table on the gene page: there, the number of
clones required to support the structure of the variant over the coding region
is given. If one suffices, you can feel comfortable that the structure of your
protein-coding region is fully validated. (You may want to check at the same
time, in the first column of the protein table, if your protein is complete and
if it is identical to that from another transcript (that would only differ in
the UTR), or if it is included in another protein.) If the names of two or more
clones appear in the support last column, then the transcript is derived from
concatenation of partial or partially sequenced clones (ESTs) and your RT-PCR
experiment will be instrumental in confirming the existence of this transcript.
Then please do not forget to send your sequences to GenBank so that everybody
can benefit from your work, and we can make AceView even more solid and useful!
Heartfelt thanks!
History of Acembly and AceView; the
AceView database schema
Development corner
How to link to us, or cite us
Acknowledgments
The AceView
program (initially called Acembly) was inspired by Yuji Kohara and Tsodas Shin-i from the National Institute of Genetics, Mishima, Japan and was developed from 1995 to 2000 at CNRS, Montpellier, France and NIG, Japan
by Danielle and Jean Thierry-Mieg and Michel Potdevin to treat the nematode C. elegans transcriptome data. The program is written in
C on top of the Acedb object-oriented database manager;
and the schema of the database is truly gene-centered.
All objects including papers, phenotypes, molecules, experiments,
regulation, and interactions and all our annotations ultimately point to the
genes.
Since 2000,
development of Acembly/AceView has been actively pursued at the National Center
for Biotechnology Information (NCBI) at the National Library of Medicine
(NLM) and the program is now applied to
other species including Homo sapiens. We have adapted the program to
larger projects and developed a Web viewer and a new C interface to Acedb, called
AceC, to construct the AceView text descriptions, fill up our data tables, and
run the Web site; a new indexing and “grepping”
system was written to support the gene-centered “Query” functionality.
Here are a
few ideas for future development; we do not have a schedule. Our wonderful
collaborator, Mark Sienkiewicz, left us in June 2003, and our current team
consists of the two of us at NCBI.
· Finish analysis of expression pattern. We have created a specific Acedb
database to classify the tissue/stage/library fields of the GenBank records
(mRNAs and ESTs) into an anatomic/developmental/pathological schema and have
finished classifying about 3 million entries (1 million more to go). Then, we
will use this classification in conjunction with the main human AceView and use
the percentage of each tag aligned to describe the pattern of expression for
each gene: when and where is it expressed, is it overexpressed or underexpressed in tumors, or in other conditions.
We hope this will help map the still unmapped human diseases, in addition to
bringing useful information on promoters and chromatin domains. In November
2003, we discovered that two groups have developed tools that could ease the
process: the SANBI project (J. Kelso et al.) has
developed an ontology adopted by ensembl, while the GeneCards group (D. Lancet, Marilyn Safran
and Maxim Shklar) has performed microarray
experiments, and developed tools to compare their results to expression data
derived from EST/mRNAs assignments. We plan to collaborate with these groups to
offer our users a validated expression profile, hopefully for release 35.
· Design a new Query by feature, which would allow the user to select the genes
by acting on multiple fields at a time, using a combination of toggles and
simple windows where you would type numbers. You could select simultaneously on
map position, level and pattern of expression, presence of alternative variants
and the various alternative features we calculate in the database, presence of
a given protein motif, predicted cellular compartment, taxonomy (are there
close genes in bacteria, viruses, invertebrates...), presence of a known
phenotype, presence of an antisense gene, and also length of UTRs, of CDS,
molecular weight, pI, number of exons, of introns, types of introns (e.g.,
at-ac). All of these queries are structured, in that they apply to a
single field in the database at a time. It should be easy to then provide a
list with exactly the genes people need to consider, for example those in a
given region of a chromosome that have a membrane domain and are overexpressed in tumors; or those expressed in brain and
antisense to other genes. A second query, either of the standard AceView type
(looking for any words in the gene’s records) or an iterative query by feature
should allow researchers to sort sets of genes with very interesting properties
indeed, and to hopefully help our knowledge of the genes to progress faster.
· Design a new graphical display where we would show all the protein isoforms
from one gene coaligned. The Pfam, Psort, and BlastP homologies will be
displayed along the entire combined length to allow quick visual assessment of
what the differences between the various isoforms affect and what the
functional consequences might be.
· Design pairs of primers to amplify all exons evidenced in AceView (at the
request of a user). We have done the first and maybe the most difficult step of
identifying the redundancy level of each 18-25 nucleotide-long primer genome
wide. We will look for primers that would amplify the exon plus at least 50 bp
on each side (to allow reuse of the same primer for PCR amplification and
sequencing and to still get good sequences), with a unique PCR product of
maximum size 1.8 kb, optimally 0.7 kb (for a standard exon, of less than 600
bp), so that the sequences read from the two PCR primers actually read the
entire two strands at high quality (on recent sequencers), yielding a non-ambiguous
sequence. This should simplify allele sequencing, hence the identification of
genes associated with diseases.
A related question is the design of primers to amplify specific
transcripts. We could do a bit more integration there.
Another issue
of interest is the identification of exons that could be used in the design of
microarrays or any other type of array. Those could aim at identifying the most
representative exon(s) in each gene, to get a panel of all the reliable human
genes. One could limit by level of expression, or stage or tissue or coding /
non-coding potential, or any property such as antisense gene. Alternatively one
may wish to identify the sequences most diagnostic of specific alternative transcripts.
We could do that relatively easily in AceView, because all the data is in
acedb, hence easily amenable to very complex requests and the DNA from any set
of objects can be dumped from the database at the end of any query.
We are of
course also very interested in defining and analyzing putative promoters, but apart from identifying hopefully reliable
first exons, we have not done any other analysis in that direction yet.
We would also
like to set up downloadable acedb databases per chromosome (or maybe the
single database we use, but it may be a bit big), so that people can get the
whole lot of data and do the queries themselves, rather than being limited to
painfully getting what we can show on the (somewhat frightening) web pages or
what can be retrieved from the ftp site (very minimal and simplistic). We will
do that if there is some demand, and people should be aware that we would only
be able to offer extremely minimal user support.
Provide a
mouse and rat AceView, as many users have asked. Make Arabidopsis better, and
work on fly in collaboration with FlyBase. We could
then offer some meta-query that would allow seamless passage from one gene in
the species of interest to the others, guided by the availability of biological
data.
To link to AceView:
AceView tries
to answer any query by returning a gene or a list of genes. The query may be a
gene name or alias, a cDNA clone or sequence identifier, a NCBI GeneID or
UniGene ID, or more generally a meaningful identifier, word or group of words.
To create URL
links to AceView, please use the following syntax:
https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=SPECIES&q=QUERY
where SPECIES is one of human (for Homo sapiens), mouse (for
Mus musculus), worm (for Caenorhabditis elegans), or ara (for Arabidopsis thaliana),
and QUERY is the identifier or words (the query is not case
sensitive; spaces between words are replaced by +).
Examples:
Access gene
PTEN in human: https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&q=pten
Access GenBank accession from a mouse cDNA: av.cgi?db=mouse&q=AK046795
Access cDNA clone IMAGE:4838723 in human: av.cgi?db=human&q=image:4838723
Access genes related to mitotic spindle in worm: av.cgi?db=worm&q=mitotic+spindle
Access homolog of the worm gene smg-5 in human: av.cgi?db=human&q=smg-5
These URLs will automatically lead to the most recent version of the data
and will find the genes even if their names changed.
One may be more specific and add a parameter c=CLASS where CLASS is a
class in our AceDB database server (not case sensitive). This supposes
knowledge of the current schema and is not usually recommended. It is however
useful when accessing specifically the transcript view rather than the gene
view (but the complete transcript name should then be given), calling for a
gene name with too few characters (e.g. genes a and A in Drosophila…) or small GeneIDs or other numeric identifiers.
Examples:
Access mRNA variant c in gene STAT5A in human:
https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&c=mRNA&q=STAT5A.cApr07
Access GeneID 3 in human: https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&c=geneid&q=3
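If you generate many links from a program, the URL syntax above can be wrapped in a small helper. This is a sketch using only the Python standard library; the function name aceview_url and its cls argument are ours, while the db, q and c parameters and the four species codes are those documented above.

```python
from urllib.parse import urlencode

BASE = "https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi"
SPECIES = {"human", "mouse", "worm", "ara"}


def aceview_url(species, query, cls=None):
    """Build an AceView link for a species code, a query and an optional class."""
    if species not in SPECIES:
        raise ValueError(f"unknown species code: {species!r}")
    params = {"db": species}
    if cls:
        params["c"] = cls  # optional class, e.g. mRNA or geneid
    params["q"] = query
    # Spaces become '+'; ':' is left as is so clone names stay readable.
    return BASE + "?" + urlencode(params, safe=":")


print(aceview_url("worm", "mitotic spindle"))
print(aceview_url("human", "STAT5A.cApr07", cls="mRNA"))
```

The two calls reproduce the worm and STAT5A examples given above.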
To cite us: Please use Danielle Thierry-Mieg and Jean Thierry-Mieg, AceView: a comprehensive cDNA-supported gene and transcripts
annotation, Genome Biology 2006, 7(Suppl 1):S12
The web site
can be cited as: www.ncbi.nlm.nih.gov/IEB/Research/Acembly AceView: integrative annotation of cDNA-supported genes in human, mouse, rat,
worm and Arabidopsis.
You may also look at the Publications page if you are
interested in the ~150 articles reporting use and confirmation of AceView gene
annotation, or our other articles on genes or alternative transcripts
annotation, on microarrays, on worm genetics, or on Yang-Mills theory and
particle physics.
We thank our
friend Yuji Kohara for giving us access to his beautiful data: the
collaboration with his team has inspired the entire view we have of the genes
and has led to the development of AceView. We are grateful to our previous
collaborators, Mark Sienkiewicz, Vahan Simonyan, Adam Lowe, and Yann Thierry-Mieg,
who contributed to this effort. We are indebted to Kenta Nakai and Sean Eddy
for the nice tools they provide. We thank David Lipman for his interest and
stimulating ideas, Donna Maglott for sharing the love of genes, and all our
friends at NCBI, in particular the systems, Blast and taxonomy groups, for
their help, encouragement, and support.
Feel free to contact us by email