AceView Help

To learn more about AceView, you may enjoy these documents:

Danielle Thierry-Mieg and Jean Thierry-Mieg, AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12

About AceView: Overview, News (most recent improvements; release statistics; special notes on Worm), About history of Acembly/AceView, development corner, acknowledgements

To get help, please see these:

Frequently asked questions (updated Sept 13, 2009)

Help on the three ways to query AceView

Help with texts, tables, displays

the “Gene on Genome” page (updated July, 2007)

the “Annotated mRNA” page (updated May 2006)

How to link to AceView, cite us, Download, Register on AceView list

Please send us wishes, comments, and bug reports.

Overview on AceView

Summary:

Our molecular annotations of the transcripts (not of the proteins, which still today are largely predicted) are fully supported by experimental cDNA sequence data, but the biological and functional annotations should be considered as hints: please refer to the PubMed articles cited as evidence for each statement about associated diseases or pathways, processes, molecular function, localization and interactions, and if you can, please help us by sending your feedback.

We co-align the millions of cDNA sequences available from the public databases on the genome sequence, then quality-filter the cDNA clones for rearrangements or defects, then cluster the 90% good cDNA sequences into mRNA models and genes. By construction, the redundant GenBank, dbEST and Trace cDNA sequences are merged, and the resulting sequence is corrected to match the excellent quality of the genome sequence, while any SNPs (shared across multiple mRNAs) remain apparent on the graphical views of the reconstructed mRNA.

Our program performs semi-automatic hand-annotation: at each step, we use appropriate heuristics, search genome-wide for counterexamples, and tune the algorithms until the results meet manual curation criteria. We construct the minimal number of alternative variants necessary and sufficient to represent all cDNA clones, and each clone is associated with a single variant. In this way, our annotation is both comprehensive and non-redundant.

On the molecular side, AceView genes and their alternative variants are analyzed in terms of expression, intron-exon structure, alternative features and regulation, and neighbor relationships such as antisense and operon-like arrangements; participating cDNA clones/accessions are annotated; the putative protein products are analyzed for completeness, their best covering cDNA clones are identified, and the proteins are searched for motifs, membership in a protein family, conservation in evolution, closest homologues in other species and signals for subcellular localization; non-coding genes are identified. We believe that our annotation of mRNAs and putative protein products, expression or regulation is quite reliable and comprehensive, as we carefully annotate the caveats. One caveat is the spurious concatenation of two rare transcripts into a single mRNA (in about a quarter of cases in human); the other, more general, is that the annotated proteins are predicted from the mRNA sequence rather than determined experimentally (as is also the case for SwissProt or RefSeq). Overall, it appears that AceView transcripts capture well the bulk of the complexity of the transcription apparent in GenBank and dbEST, and can be used as a framework to design experiments.

Rationale:

Millions of mRNAs and ESTs have been deposited in the public databases, even long before the genome sequence was known. They represent a snapshot of the mRNA content of the cells and are the most primary data reflecting gene activity. Now that the genome sequence is almost complete, it is essential to integrate this wealth of data, however noisy and complex, to trace all mRNAs back to the region of the genome from which they were transcribed and to try to reconstruct, by assembling the pieces of the puzzle, the most likely image of the original transcripts. To this effect, we have developed the AceView program (initially called Acembly), which allows us to see, without preconceived ideas, what the pattern of transcription looks like in vivo, and which mRNAs and proteins are likely made in the cell. Our basic hypothesis is that the cDNAs submitted to the public databases give a faithful image of the transcriptome.

While analyzing the data, we identify unexpected results, some of which may be real discoveries, some experimental artifacts or incomplete data, but in all cases, our hope is to stimulate further investigations toward a better understanding of the transcriptome.

AceView regularly downloads from the public databases the whole set of cDNA sequences, mRNAs and ESTs, aligns them on the current genome available at NCBI, and clusters them into reference transcripts, including alternative variants. At least three quarters of AceView reconstructed transcripts are unambiguously supported in GenBank, but the others could represent an inappropriate concatenation of two or more partial variants found separately in the cell. To identify these cases clearly, we rate explicitly the support of each transcript, hoping to encourage further sequencing. The consensus sequence of the reconstructed transcript is calculated from the cDNA sequences and from the underlying genome, and the cDNA clones and proteins are annotated. Inferences about the gene and its proteins are described. Biological annotations, usually selected from other sources (linked), are added, and summaries are presented in AceView, as a structured text document per gene. Text queries in AceView bring lists of genes related to the question asked. Useful data are also available for download in bulk, on our ftp site.

We summarize here the types of results from AceView analysis that are currently displayed for a gene or a transcript/protein, and we outline our annotation procedures.

AceView mRNA-to-genome co-alignment and clustering results: analysis and classification of properties of clones and genes

Annotation of the proteins deduced from the reconstructed mRNA sequences

Biological annotation is collected to allow meaningful queries by content

Presentation of the data on the Web

Additional information may be found in the Frequently Asked Questions, in the AceView News, in the “Gene on genome” page help, in the mRNA page help, or in the development corner.

AceView results: reconstruction of the transcripts and characterization of the genes and cDNA clones

AceView

co-aligns all publicly available mRNAs and ESTs onto the genome sequence, by using an original cDNA-to-genome co-alignment program, which we initially called Acembly. The program was developed for the worm transcriptome data, and the fact that we had the sequencing traces allowed us to analyze in detail the sources of noise, in particular the base-call errors, and to tune the co-alignment strategy to find the best alignment of each cDNA on the genome even in the presence of noise in the sequences. The code is big and its behavior in special circumstances depends on various heuristics; it is not easy to summarize in a few lines (so it is not yet published). It is however very fast and efficient, and quite unique in its performance.

reconstructs transcripts by clustering the aligned cDNAs into minimal sets, as the maximal pacific components of the friends and foes cDNA clones connection graph

groups overlapping transcripts into genes, provided they share at least one intron boundary (this last statement was added from Aug 2005 on)

names the genes by following the official nomenclature when available or by creating new names that are then tracked from release to release.
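The gene-grouping rule above (transcripts belong to the same gene when they share at least one intron boundary) can be illustrated with a small union-find sketch. This is not the AceView code: the transcript representation (a strand plus a list of donor/acceptor coordinates) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def group_transcripts_into_genes(transcripts):
    """Group transcripts that share at least one intron boundary.

    transcripts: dict mapping a transcript name to (strand, introns), where
    introns is a list of (donor, acceptor) genome coordinates.
    This layout is a hypothetical one chosen for the illustration.
    Returns a sorted list of sorted name lists, one list per gene.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Remember the first transcript seen for each (strand, kind, position).
    first_seen = {}
    for name, (strand, introns) in transcripts.items():
        for donor, acceptor in introns:
            for key in ((strand, "donor", donor), (strand, "acceptor", acceptor)):
                if key in first_seen:
                    union(first_seen[key], name)
                else:
                    first_seen[key] = name

    genes = defaultdict(list)
    for name in transcripts:
        genes[find(name)].append(name)
    return sorted(sorted(members) for members in genes.values())
```

In this sketch an unspliced transcript has no boundary to share, so it always stays in a gene of its own, and transcripts on opposite strands are never merged.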

The program then

describes for each gene the intron-exon structure and its support in a dedicated table

classifies the genes by level of expression relative to all main genes genome wide.

generates two sequences for each reconstructed transcript, one derived from the underlying genome sequence and one derived as the consensus of the cDNAs that best matches the genome sequence (the AceView mRNA reference sequence .AM)

for each protein, a “golden path” of clones is selected; the minimal number of clones required to structurally support each protein is given: if =1, the variant is of RefSeq quality; if =2 or more, the reconstructed transcript should be tested by resequencing clones or by RT-PCR to straighten combinatorial artifacts in the reconstruction.

identifies best representative cDNA clones for each transcript, as well as anomalies such as partial deletions or rearrangements of the clone insert, internal priming or mosaicism

displays various tables of supporting cDNA clones/accessions: one for all clones supporting the gene (accessible from the “Gene” page caption), one for clones supporting each variant (in each “annotated mRNA” page) and one for the “main clones”, defined as the set of clones necessary and sufficient to reconstruct the consensus sequences that best match the genome for all transcripts. In each table of clones are indicated the tissue and stage information, the sequence accessions (linked to the GenBank record), and the quality of the match to the genome (length aligned, number of single base differences and %).

evaluates completeness of the sequence-derived protein, and compares the alternative proteins among them

compares alternative transcripts and describes the presence of alternative promoters, alternative last exons, cassette exons, overlapping exons, unspliced introns, and how that would impact the alternate protein isoforms

characterizes and reports the existence of what we consider reliable antisense genes and other regulatory features that can be evidenced from the mRNA to genome alignment such as close neighbors, complex loci, RNA editing, suspected translational frameshift or leaky stops, or internal ribosome entry sites. Recognition of common sequences in promoters may also be proposed in a distant future.

The above analyses are displayed on the “Gene on Genome” page, together with a supporting color-coded zoomable diagram that shows the gene aligned on the genome, the AceView reconstructed transcripts with indication of standard and non-standard introns and delineation of the main open reading frame, and the gene’s neighbors.

Protein annotation and classification

AceView then submits the proteins predicted from the reconstructed transcript sequences to an analysis pipeline using a set of programs selected after comparative evaluation among those publicly available. Annotations include NCBI BlastP against the nr database; TaxBlast for ancestry relationships and identification of the closest homolog in the most studied species; Psort 2 by Kenta Nakai for protein motifs and predicted localization in cellular compartments; and Pfam, searched using HMMER and keeping only the highly significant matches, for classification into protein families. As a bonus of Pfam, we gain the InterPro descriptions and GO classifications. Primers and temperature conditions to amplify the predicted CDS are calculated by Osp (Hillier L, Green P. PCR Methods Appl. 1991 Nov;1(2):124-8).

These data are displayed in the “Annotated mRNA” page, where AceView transcripts can be viewed as spliced mRNA variants, decorated with BlastP homologies, Pfam and Psort motifs, Stops and Met(AUG) in the three frames, and, by clicking on the “clones” button in the graph, all supporting mRNAs and ESTs, with color-coded indications of differences from the genome sequence as well as labeled anomalies. For the worm, we have indicated the stage and/or tissue at the bottom of each cDNA (3’ end) and the presence and type of trans-spliced leader at the top (5’ end).

Finally, all these data are used by AceView to generate a descriptive title for each transcript and protein: if BlastP gave good hits to meaningful names, its results are used in an intricate (and not yet fully bug free) way, allowing the protein family Pfam name to influence the name as well; in the absence of good hits with meaningful descriptions, Psort and TaxBlast are used to annotate the proteins with basic properties such as presence of motifs, high-probability compartment localization, and sequence conservation in speciation.

The schema we use in Acedb is object oriented, so it offers all the desired features of a model tailored to our needs, and it is modifiable and extensible at will. This is why we are able to construct the descriptive texts for each gene and mRNA. Yet users can also browse lists of all genes in the genome with a given Pfam or Psort motif, or a GO or phenotype term.

We evolved toward an NCBI AceDB version, which now differs appreciably from the Sanger version. It compiles under UNIX and Windows, has a different client server implementation, a very powerful “tablemaker” query system, and a nice C programmer’s interface, AceC. We would be willing to distribute it if there were enough interest.

Biological annotation and querying by content

A very rich functional and phenotypic annotation is highly desirable. Queries in AceView interrogate over all data in the database, including the set of abstracts of publications available from PubMed (and worm meetings) that we index to this effect. Our main source of information is PubMed. One gets a good feel for the biology from it, yet it is limited and somewhat ambiguous, because not all papers are attached to the specific gene(s) they describe, and some are associated to too many genes. We filter the latter category out by manually reviewing the papers associated to more than 100 genes and not indexing them if they do not seem biologically relevant.

For additional annotation, our current procedure depends on the species:

In the nematode C. elegans, we perform annotation ourselves in a well structured database, WormGenes. We use the Worm Transcriptome Project of Kohara as a main data source and exploit many resources, published or available on the Web, including WormBase, the CGC, RNAiDB, the Worm ORFeome site, and of course PubMed. WormGenes was the direct source of RefSeq for worm at NCBI from 2002 to early 2005, when the task was pre-empted by WormBase.

For human, we make extensive use of the NCBI resources, and copy fields from LocusLink/Entrez Gene: names and aliases, literature and its annotation by NLM, RefSeq annotation, GO. We provide links to the wonderful OMIM texts of McKusick and collaborators but have not yet indexed them. We also provide links to the excellent GeneCards resource, which is available free of charge to all academics, but may be fee/access-limited for profit-oriented users.

We are open to any suggestions to make biological annotations better.

The AceView Web display

Through a versatile query system, AceView offers lists of genes and individual genes annotated with text descriptions, tables, graphical representations and external links. Three main pages are available:

Gene on Genome

Annotated mRNA(s) page (where the spliced variants reconstructed by AceView are described and depicted)

External links, also provided in the first paragraph of the “gene on genome” page.

As biologists, we are used to reading text descriptions accompanied by tables of results and graphs or diagrams, and a database dump does not have the same appeal. That idea guides the development of our viewer: in AceView, we try to automatically build summaries and sentences in English from the data in the various fields in the database. We believe we gain a dimension in the depth and accuracy of the description, because we can modulate each piece of text as a function of multiple fields in the database. A drawback of automatic text generation is a “metallic” accent, that we hope people can accommodate. If you feel otherwise, please tell us!

But users in quest of a solid and thorough integration of the cDNA and genome data should find in AceView the answers to their questions, at all levels of detail. AceView may look a bit scary, but that is because the data are intrinsically complex, and we do not have to oversimplify since AceView at NCBI can be viewed as a complement to the RefSeq project at NCBI, which provides a simple and very secure view of the best known genes.

Current releases

Frequently asked questions

updated May 16, 2009

Are AceView transcripts reliable?

We carefully align each mRNA or EST onto the genome. Then we compare all possible alignments and keep only the best position on the genome and the most compact alignment. We reject the low-quality hits. In the August 05 release, we use 91% of all mRNAs currently in GenBank to reconstruct AceView transcripts; they match the genome at 99.74% identity over 98.8% of their length on average. We also use 76% of all ESTs in dbEST, and they align at 98.16% identity over 93.3% of their length. Therefore our individual alignments are highly reliable. Of course, we align 99.7% of the RefSeqs at 99.98% identity over 99.9% of their length (but they are made to fit, so the numbers are meaningless!).

Few cDNA sequences have an ambiguous map: only 1.3% of the mRNAs and 2.2% of the ESTs have two or more equivalent niches on the genome: they participate in the reconstruction of multiple, sometimes exactly repeated, AceView genes.
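As an illustration of the selection described above (reject low-quality hits, keep the best position, prefer the most compact alignment, and allow for equivalent positions), here is a minimal sketch. The `Alignment` record, the scoring by number of matching bases, and the two thresholds are assumptions made for the example; they are not the criteria actually used by AceView.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One candidate placement of a cDNA on the genome (hypothetical record)."""
    chrom: str
    genome_start: int
    genome_end: int
    aligned_bases: int   # cDNA bases included in the alignment
    mismatches: int      # single-base differences from the genome

    @property
    def matches(self):
        return self.aligned_bases - self.mismatches

    @property
    def span(self):
        # Genomic footprint; a smaller span means a more compact alignment.
        return self.genome_end - self.genome_start + 1


def percent_identity(a):
    return 100.0 * a.matches / a.aligned_bases if a.aligned_bases else 0.0


def percent_coverage(a, cdna_length):
    return 100.0 * a.aligned_bases / cdna_length


def pick_best(alignments, cdna_length, min_identity=95.0, min_coverage=80.0):
    """Return the best alignment(s); thresholds are illustrative assumptions.

    Several alignments are returned only when they are equivalent
    (same number of matches and same genomic span).
    """
    good = [a for a in alignments
            if percent_identity(a) >= min_identity
            and percent_coverage(a, cdna_length) >= min_coverage]
    if not good:
        return []
    best_key = max((a.matches, -a.span) for a in good)
    return [a for a in good if (a.matches, -a.span) == best_key]
```

Returning a list rather than a single hit leaves room for the few sequences that have two or more equivalent positions on the genome.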

The introns, and exons bounded by introns, are numerous but also highly reliable. By our criteria, a well-defined intron has at least eight bases on both sides of the intron boundary that are identical in at least one cDNA sequence from GenBank/dbEST and the current prototype of the human genome. 96.6% of the introns in AceView satisfy this stringent criterion. Graphically, the well-defined introns are drawn as a broken line (>) to distinguish them from the less well-defined, “fuzzy” introns, drawn with a straight colored line (|). The 4.5 million ESTs and mRNAs used in the Aug05 release identify 317,417 different well-defined introns, a very large number indeed: the 35,991 genes with standard introns in this release (some partial) have on average 9.1 good introns per gene. 65% of the introns are alternative (i.e. not found in all transcripts of the gene). The table below shows the frequencies of intron boundaries: 98.5% are standard, and we list only 265 cases of putative U12-type spliceosome involvement. In this new release, we have aggressively annotated and ignored anomalous splice sites (as explained here), because most are not confirmed by direct RT-PCR in the Havana Gencode project. There are now only 1.4% left; these most probably correspond to structural defects in the cDNA, but we could not ignore the cDNA altogether because it also brought us a novel standard intron, which we trust, because defects in cDNA, such as partial insert deletions, are local.

Total number of well-defined introns (Aug05)

Introns are especially reliable in AceView, because there is a lot of redundancy in the cDNA data, and our co-alignment technique allows us to ascertain the boundaries without a shadow of doubt. Each intron is described with its support in the intron-exon table. Similarly, almost all exons are fully reliable: fewer than 1% of exons required concatenation of 2 cDNA sequences. Only the terminal exons, especially at the 3’ end, suffer from our choice to mask the diversity in polyadenylation sites. But overall, merging the different UTR lengths reduces the number of variants and still seems preferable to most of our users. Notice that the mRNA graphical view allows you to “see” the alternative polyadenylation sites when they occur, but beware that a substantial number actually correspond to internal priming of the cDNAs in A-rich regions, which are naturally enriched in the A/T-rich UTRs, and we have not labeled these explicitly yet.
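The eight-base criterion for a well-defined intron can be written down as a short check. The sketch below assumes 0-based, half-open coordinates, a same-strand comparison, and a cDNA described by the position of its splice junction; these conventions and the function names are choices made for the example, not a description of the AceView implementation.

```python
FLANK = 8  # bases required to match on each side of the intron


def cdna_supports_intron(genome, donor, acceptor, cdna, junction, flank=FLANK):
    """Check one cDNA against one intron.

    genome[donor:acceptor] is the intron (0-based, half-open).
    junction is the index in cdna of the first base after the splice.
    Returns True when the flank bases on both sides are identical
    in the cDNA and in the genome.
    """
    if junction < flank or junction + flank > len(cdna):
        return False  # the cDNA does not extend far enough on one side
    if donor < flank or acceptor + flank > len(genome):
        return False  # the genome sequence is too short to compare
    left_ok = cdna[junction - flank:junction] == genome[donor - flank:donor]
    right_ok = cdna[junction:junction + flank] == genome[acceptor:acceptor + flank]
    return left_ok and right_ok


def intron_is_well_defined(genome, donor, acceptor, supports):
    """supports: iterable of (cdna_sequence, junction_index) pairs.

    One supporting cDNA is enough, as in the criterion stated above.
    """
    return any(cdna_supports_intron(genome, donor, acceptor, cdna, j)
               for cdna, j in supports)
```

An intron that fails this test for every supporting cDNA would fall in the “fuzzy” category.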

A less straightforward question is the reliability of the whole transcript, i.e. the reliability of the chaining of exons and introns. In the August 2005 release, 69% of the AceView proteins above 100 and above 300 amino acids are fully encoded within a single identified clone in GenBank/dbEST (91,738 different proteins > 100 aa; 24,367 distinct proteins > 300 aa all have their coding part fully covered by a known cDNA). The remaining 31% of AceView proteins could still result from an inappropriate concatenation of two clones (26%) or more (5%), because, if no conflicts arise, we merge partial transcripts.

The picture on the gene page displays explicitly the reliability of the CDS encoded by each transcript: a reconstructed transcript variant (pink) holds its name (a, b, c...) underneath it; the name is underlined if the CDS can be covered by a single clone (a, c), hence the protein is secure. Names not underlined signal provisional concatenations that will require experimental evaluation. Indeed, there are usually multiple ways to concatenate cDNAs, but in AceView we chose to present the minimal number of variants and to avoid the combinatorial explosion, hoping for people at the bench to find evidence for more variants, not fewer. Once compatible cDNAs are merged into long transcripts containing GenBank mRNAs, the left-over fragments, each incompatible with the main transcripts, may contact one another and get merged.

We also provide a direct clue in the “proteins” table. For each variant, we count the minimal number of clones needed to support the whole CDS or the whole mRNA, structurally. We report the identity of the clones/sequences in the “Protein” table, under “minimal set of supporting clones”.

- If a variant needs only one clone to be covered, the structure of the putative protein is reliable. Provided it extends to the Stop (second column of the same table) and is not included in another, or even better is complete (first column of the same table), which for us means there is also a Stop upstream of the Met, and if in addition the transcript has a reasonable coding potential, this variant is of RefSeq quality. You may look up the first table in “Main table of supporting clones” to see if an NM supports this variant, but even if the variant is not yet a RefSeq at NCBI, it should become one in the future.

- If a variant needs two or more clones to cover the CDS, it is more questionable, because we might be cutting and pasting two existing alternative forms that may not be compatible, and the fusion molecule may not exist as such in the cell, but the two parts would be found separately in different molecules. If each of the two clones were sequenced in full, they would have a rather high probability of belonging to non-compatible transcripts, because we already see on average five alternative variants per human gene with confirmed introns.
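Counting the minimal number of clones needed to cover a CDS can be sketched as a classic greedy interval cover. The sketch assumes that each clone has already been found compatible with the variant and is reduced to a (name, start, end) interval in mRNA coordinates, and that consecutive clones must overlap by at least one base. Both are simplifying assumptions for the illustration; the structural test used by AceView is not described in enough detail here to reproduce.

```python
def minimal_supporting_clones(clones, cds_start, cds_end):
    """Return the names of a smallest chain of clones covering the CDS.

    clones: list of (name, start, end), inclusive mRNA coordinates
    (hypothetical representation).
    Returns None when the CDS cannot be covered by overlapping clones.
    """
    chosen = []
    covered_end = cds_start - 1   # nothing covered yet
    reach = cds_start             # next clone must start at or before this
    while covered_end < cds_end:
        best = None
        for clone in clones:
            _, start, end = clone
            # Candidate clones touch the covered part and extend it.
            if start <= reach and end > covered_end:
                if best is None or end > best[2]:
                    best = clone
        if best is None:
            return None  # gap: no clone bridges this position
        chosen.append(best[0])
        covered_end = best[2]
        reach = covered_end  # require at least one base of overlap
    return chosen
```

A result of length 1 corresponds to the first case above (a single clone covers the CDS); a longer chain corresponds to a provisional concatenation.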

To fix ideas, in human build 33 release, 74,114 (64%) of the spliced transcripts are supported by a single identified clone covering the entire CDS; the remaining could represent an inappropriate concatenation of two clones (28%) or more (8%).

Finally we may discuss the reliability of the protein. Apart from the possibility of artificial mosaics described above, when more than one clone is needed to cover the structure of the CDS, there are five kinds of possible problems.

1) Truncated protein: The transcript may be incomplete (indicated in the “protein” table).

2) An error in the genome sequence may lead to a truncated or mutated protein, because we annotate the protein derived from the genome sequence. This is rare, but we still see a few. Note that we also provide the sequence of the protein derived from the consensus of the cDNAs that best matches the genome (called .AM for AceView reference mRNA; the sequence is accessible in the transcript table or the fasta sequence file).

3) We may occasionally have chosen the wrong coding frame.

4) We annotate one protein per transcript, so we lose the second protein in the 26% of cases where more than one protein could be made from different regions of the molecule (polycistronic-like transcripts, or uORFs, or smaller ORFs in 3’UTR). We also have bugs (or bad features) in the program and may have artificially fused two sequences that overlap, but encode different proteins, when it would have been preferable not to fuse them.

5) The decision on which putative protein is worth annotating, i.e. which RNA sequence has been seen by the live ribosome as worth scanning and translating, is often extremely hard: it supposes we know the answer before we have the data. Using informatics, it is easy to define conservative criteria, usually based on comparison to other species, and to choose one good CDS from about 20,000 human genes, but in general, given a standard transcript sequence, selecting an initiator Met and a proper CDS remains a (wild) guess, in the absence of large datasets, for example mass spectrometry data, on real eukaryotic proteins. We expect surprises when the proteins are known rather than inferred (see for example the good work by the Sugano group): for now, we conservatively annotate multiple CDS per mRNA and pick the best for display to the public; if a CDS is complete (bounded by an in-frame stop, or there is an accumulation of 5’ ends there) we pick as initiator Met either the first ATG, or the first in-frame NTG if the protein gains at least 30 residues upstream. Most of the annotation, except for the signal peptide, can be truncated by eye, so we leave it to the researcher to look at the shorter protein, starting at the ATG of their choice (i.e. the green line by the side of the protein drawing).
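One possible reading of the initiator rule in point 5 is sketched below: within a complete open reading frame, take the first ATG, unless an in-frame NTG codon lies at least 30 codons upstream of it. The input convention (a nucleotide string that starts just after the upstream in-frame stop) and the handling of an ORF without any ATG are assumptions of the example, since the text does not specify them.

```python
def pick_initiator(orf, min_gain=30):
    """Return the 0-based codon index of the chosen initiator, or None.

    orf: in-frame nucleotide sequence starting just after the upstream
    stop (assumed convention). NTG means any codon ending in TG.
    """
    codons = [orf[i:i + 3].upper() for i in range(0, len(orf) - 2, 3)]
    first_atg = next((i for i, c in enumerate(codons) if c == "ATG"), None)
    first_ntg = next((i for i, c in enumerate(codons) if c[1:] == "TG"), None)
    if first_atg is None:
        # Case not covered by the rule as stated; this sketch declines to choose.
        return None
    # ATG is itself an NTG codon, so first_ntg is never after first_atg.
    if first_atg - first_ntg >= min_gain:
        return first_ntg
    return first_atg
```

With this reading, a CTG six codons upstream of the first ATG is ignored, while a CTG 36 codons upstream is preferred.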

Why don’t I hit the AceView transcripts when I do a Blast search on the human genome?

Yes, this is an important word of caution. Although AceView is being developed at NCBI, it is not visible from NCBI Entrez. AceView is still considered a research project at NCBI, and the AceView human reference transcripts and genes are not yet visible from Entrez or from any of the general query tools at NCBI, including Blast. This is why we offer a specific Blast page to search over AceView transcripts, and we are trying to improve Web accessibility. We also make the sequences available for download on our ftp site, and they are displayed on the UCSC Genome Browser, as one of the gene tracks.

Could you add a reference for the statement you make (about alternative promoter, antisense, etc.)?

All annotations in the fields “Expression” and in the entire section on variants and mRNAs are the results of our AceView analysis, not a result taken from the literature (see details). Sometimes, there may be a paper in the literature that describes the same kind of observations, and it will hopefully be referred to in the “literature” field at the bottom of the gene page. But our annotations are independent; they are only based on aligning all cDNA sequences from the public databases onto the genome sequence. Please cite us for the AceView analysis.

The Bibliography, in the “Gene on Genome” page, is taken from Entrez Gene, itself largely helped by OMIM. Each paper points to its summary in PubMed, and you can use the “Related papers” function of PubMed to get a more complete bibliography. Authors who think their paper provides a significant new result and is missing from the list may go to their gene in Entrez Gene (using, for example, the link we provide as the gene title); they can add a pointer to their paper with a note of its main contribution using the RIF (Reference Into Function) option.

I am hunting for a gene responsible for a phenotype in a region of a chromosome. Can you help?

Yes. We are developing a new query by feature that will allow you to look at lists of genes ordered by position in a given area, but in the meantime, it is easy for us to send you an html table of the genes lying in between two markers, or in a given region that you would define by coordinates. The list will have interesting data and each gene will be clickable and connect to our main AceView server. Just send us a mail.

I found a bug, or would like to submit a wish…

Concerning bugs, we really hate them, track them, and want them all reported! Thanks!

Concerning wishes or suggestions, please do not hesitate to send them: many do not take long to implement or to fix, and then it makes a big difference to all users! You can look at an outline of our plans for future development , but it is always essential for us to get feedback from you. We do this project because it is very challenging and therefore a lot of fun, but above all we do this work to try to help you with your experiments at the bench, in the hope that we all will learn more about the wondrous world of genes.

Where do AceView gene names come from? Can I use them in publications? or: I found no reference to the gene “jawker” in PubMed. Can you help?

For human, we do not recommend that you use AceView gene names in publications: as is customary, you should try to obtain an official gene name by consulting with the nomenclature committee (HUGO). That being said, as we explain below, we think we now track correctly most of the gene names from one release to the next. But we do not track the variant identifiers, we just call variants a-b-c… in decreasing order of size of the protein, so that variant aDec03 will possibly differ from variant aMar04. Please never quote a variant without saying which release you refer to.

Naming genes is a difficult issue: genes are hard to identify and hard to recognize, yet we need to be able to refer to them by unique, stable, callable names. In AceView initially, we used unique but unfriendly positional identifiers that were build-dependent (such as G_t1_Hs1_4478_30_0_2551). In August 2002, for the human release 30, we switched to a new three-step naming system:

1 If an AceView gene matches an NCBI gene model on the genome, or if it is supported by a GenBank mRNA or a RefSeq sequence carrying a HUGO official symbol or a LocusLink provisional symbol, we adopt this symbol as the name of our gene. In the few instances where more than one gene competes for the same symbol, we call them by the symbol and add an ending .1, .2,… for the second, third, and so on. This happens in 3 to 5% of the genes in build 33, usually in unfinished regions of the chromosome where artificial duplications are frequent, or if the gene was too long or too complex to be handled correctly by AceView, which split it into pieces. In build 34 from Aug05, it happens only in 1% of the genes, the truly repeated ones, because all the above problems were solved and we made progress in identifying repeated genes!

2 If strategy 1 fails, we look for the most significant protein family Pfam hit, and we name the gene accordingly, for example kinesin_1, _2, ... _27. Of course, we forbid reuse of official names in this class.

3 If strategy 2 fails, we number the genes sequentially by position as before and transform the number into a pronounceable pseudo-word, each digit in the appropriate number base corresponding to a phoneme, with the slowest moving digit to the right. We compose either a pseudo-Japanese-sounding name, such as sayuri, kimu, or nowara, to label genes where one of the principal clones is Japanese; else we use a pseudo-English-sounding name, such as jawker or sneery.

The resulting names have two, three or four phonemes and are composed exclusively of lowercase letters. They are easy to pronounce and to remember, and genes of the same area of the genome often share the same suffix. We do not expect that you would find these names or the Pfam-derived names in PubMed.
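The third naming step (turn a positional number into a pronounceable pseudo-word, slowest moving digit to the right) can be illustrated as follows. The phoneme list below is invented for the example and uses a single base of 10; the actual AceView phoneme tables and number bases are not given here, so the names produced are not real AceView gene names.

```python
# Hypothetical phoneme inventory: one phoneme per digit value, base 10.
PHONEMES = ["ba", "de", "fi", "go", "ku", "la", "me", "ni", "po", "ru"]


def number_to_name(n, phonemes=PHONEMES, min_len=2):
    """Encode a non-negative integer as a lowercase pseudo-word.

    Digits are written least significant first, so the most significant
    (slowest moving) digit ends up on the right and consecutive numbers
    share the same suffix.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(phonemes)
    digits = []
    while True:
        n, d = divmod(n, base)
        digits.append(d)          # least significant digit first
        if n == 0:
            break
    while len(digits) < min_len:  # pad so every name has at least two phonemes
        digits.append(0)
    return "".join(phonemes[d] for d in digits)
```

For instance, with this invented table the neighboring numbers 123 and 124 give names that differ only at the start and share their ending, which mirrors the remark that genes of the same area often share the same suffix.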

Since we started this strategy in August 2002, we have tried to maintain the same names, but it is not a trivial undertaking. What we do is attach the name to the best clone of the gene, see where this clone remaps, and let it bring the old name. At each release, we may lose a few names and create new ones. However, if a gene name is reused, it is most likely the same gene or its best approximation in the new genome, but of course as time goes on, more and more genes have a LocusLink/GeneID name, which supersedes the AceView gene names.

Why do you have composite gene names such as GeneAandGeneB, GeneA+GeneB, GeneNameC or GeneNameCo?

Many of our users are wondering why some AceView “genes” appear to correspond to two genes, known to encode two completely different proteins. This difficulty springs from the definition of what a gene is. Operationally, until August 2005, AceView defined genes molecularly by the clustering and contiguity of the footprint of the mRNA sequences on the genome. So, whenever two known genes actually overlapped in sequence, because some cDNA clone bridged the two, both genes were considered part of the same AceView “gene”. This created ambiguity for naming, especially when the two genes have an official name and a known function. To solve the problem in these rather frequent instances, we have introduced the notation GeneAandGeneB. However, it is usually possible to understand, from the description of the transcripts or from the graph, which type of protein a particular transcript produces.

In some instances, the overlap is only through the untranslated regions: gene A may produce various isoforms of protein A, but one of its 3’ UTRs contacts the 5’ end of the gene immediately downstream on the same strand, which itself produces various isoforms of an unrelated protein B. These are annoying cases. We solved this problem in August 2005 by redefining the gene (see the news). We now demand that transcripts from a single gene share not only some transcribed sequence, but at least one intron boundary. This way, we shrug off the loose attachments of that kind of complex, and secondarily of unspliced transcript variants: we allow genes to overlap through their 3’ and 5’ intronless regions (usually UTRs). Note that the mere sharing of transcribed areas may be interesting biologically, since it indicates that transcription of one gene may leak through the next gene in cis, suggesting that the two genes may belong to some kind of co-regulated operon-like unit.
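As a rough illustration of the August 2005 definition, transcripts can be grouped into genes whenever they share at least one intron boundary. The sketch below is not the AceView code: the function name, the input layout (a dict mapping transcript names to lists of (donor, acceptor) coordinates, all on one strand of one chromosome) and the choice to leave unspliced transcripts as singletons are assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_shared_intron_boundary(transcripts):
    """Group transcripts that share at least one donor or acceptor site.

    transcripts: dict of name -> list of (donor, acceptor) coordinates.
    Returns a sorted list of sorted name lists, one list per gene.
    """
    parent = {name: name for name in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # boundary -> first transcript that used it
    for name, introns in transcripts.items():
        for donor, acceptor in introns:
            for boundary in (("donor", donor), ("acceptor", acceptor)):
                if boundary in first_seen:
                    parent[find(name)] = find(first_seen[boundary])
                else:
                    first_seen[boundary] = name

    genes = defaultdict(set)
    for name in transcripts:
        genes[find(name)].add(name)
    return sorted(sorted(members) for members in genes.values())


if __name__ == "__main__":
    example = {
        "A": [(100, 200), (300, 400)],
        "B": [(100, 250)],   # shares the donor at 100 with A
        "C": [(500, 600)],   # overlaps nothing by intron boundary
        "D": [],             # unspliced transcript
    }
    print(cluster_by_shared_intron_boundary(example))
```

In this toy input, A and B fall into one gene because they share a donor site, while C and the unspliced D stay apart even if their exons overlap A or B.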

In other more involved cases, the overlap in sequence is through specific transcripts that do share exons and introns, hence usually coding regions, with both the A and B genes: the gene is a complex locus, it does include both known genes A and B, but may in addition produce proteins of type AB. Finally, the current 1079 instances of GeneAandGeneB remaining in the Aug05 build are largely due to some AceView models contacting two predicted NCBI models with two GeneIDs, which may be merged in the NCBI gene database in a future release.

In nematode, we do manual editing and annotation of all the genes, to complement and enhance the WormBase view. By eye, it is easy to distinguish the case “complex locus” and the case “concatenated genes contacting through their 3’UTR/5’UTR”. Complex loci were initially defined on the basis of complementation tests in phage or Drosophila. In such a gene, two complete alternative variants may not share a base, although a third variant “bridges” them and overlaps both. By convention, we denote this property by adding the letter C for complex behind the name (e.g., 1C777C) or by a + sign if the gene had acquired two independent official gene names (e.g., mai-1+gpd-2+gpd-3). This convention follows that chosen by Ed Lewis for the Drosophila complex genes (e.g. BXC for bithorax complex locus, of which bx, Ubx, Cbx and pbx for example are alleles). In the case of nearby genes expressed from the same strand but with overlap between the 3’ end of the first gene and the 5’ end of the second, we use the suffix Co (for sequence in COmmon) appended after the gene name (e.g., cul-1Co or 5K225Co). We may also use the sign AND if both genes have a name (e.g., mev-1ANDced-9), but as of August 2005, the latter type should just vanish since the two genes can be shed and separated from one another under our new gene definition.

What are main genes, putative genes and cloud genes?

As of November 2004, we have classified the genes in three classes of decreasing interest (or increasing questionability):

The main genes include the protein coding genes (defined here by CDS > 100 amino acids) and all genes with at least one well-defined standard intron, i.e. an intron with a gt-ag or gc-ag boundary, supported by at least one clone matching exactly, and with no uncalled base, the 16 bp bordering the intron (identical to the genome over the last and first 8 bp of the successive exons). We added in this category some genes for which the CDS is smaller than 100 amino acids, provided they have an NCBI RefSeq sequence (NM_#) or an OMIM entry, or they encode a protein with BlastP homology (expect < 10^-3) to a “real” nematode AceView protein. When we introduced this distinction, in November 2004 (build 35c), there were about 51,000 main human genes, 40,245 encoding a CDS of more than 100 amino acids, and 11,046 encoding a shorter peptide, if any, but with at least one standard intron, which validates their RNA nature.
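The "well-defined standard intron" test can be sketched in a few lines. The function names and argument layout below are hypothetical; only the gt-ag / gc-ag rule and the exact, N-free match over the 8 + 8 bp flanking the intron come from the text above.

```python
def is_standard_intron(intron_seq):
    """True if the intron starts with gt or gc and ends with ag."""
    s = intron_seq.lower()
    return len(s) >= 4 and s[:2] in ("gt", "gc") and s[-2:] == "ag"


def clone_supports_intron(genome_left8, genome_right8, clone16):
    """True if the clone matches the genome exactly over the last 8 bp of the
    upstream exon and the first 8 bp of the downstream exon, with no
    uncalled base (n) in those 16 bp."""
    clone = clone16.lower()
    expected = (genome_left8 + genome_right8).lower()
    return len(expected) == 16 and "n" not in clone and clone == expected


if __name__ == "__main__":
    print(is_standard_intron("GTAAGTCTCAG"))
    print(clone_supports_intron("ACGTACGT", "TTGGCCAA", "acgtacgtttggccaa"))
```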

The main genes are displayed on the “neighborhood icon” on top of each gene page. Clicking on any gene in this icon brings you to the gene.

The putative genes have no standard intron and do not encode a CDS of more than 100 amino acids, yet they belong to a category that may be useful not to disregard completely. They may be of two types: either they are supported by more than 6 cDNA clones, or they encode a putative protein with an “interesting annotation”, for example a PFAM motif, a BlastP hit to a species other than itself with e < 10^-3, a transmembrane domain or other rare and meaningful domains identified by Psort2, or a highly probable localization in a cell compartment (excluding cytoplasm and nucleus). We disregard genes in this category if their extent on the genome is greater than 10 kb.

In November 2004 (build 35c), there were 60,089 human genes in this gray zone. Note that some may represent pseudogenes that have not yet diverged fully.

The cloud genes include all other genes. Although they are supported by sequence data submitted to GenBank as cDNA, they are less confirmed than the other AceView genes: they are expressed at a low level (they include fewer than 6 overlapping cDNA clones), they do not have standard introns, are not clearly coding for proteins, and of course they do not have an associated RefSeq. The transcripts are usually shorter than 1 kb, and we discard those extending beyond 2 kb on the genome, because they usually contain dubious structural features.

The biological relevance of the cloud genes remains to be established. It might be tempting to assume that they are DNA contaminations in the RNA preparations, but instead our analysis of their biased locations indicates that most represent intermediates in the transcription process: they localize significantly more frequently “under” the main genes, on the same strand and in introns, but they constitute separate genes because they do not physically contact any other gene (which is our criterion for gene definition).

We used to filter this category before build 34, to keep only the genes supported by at least six clones, or encoding proteins or with standard introns. But people were sometimes searching AceView to locate the best alignment of a clone, and failing to find it because of the filter. We call this class of “genes” the cloud, and indicate this in the gene title.
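The three classes described above can be summarised as a small decision function. This is only a sketch of the November 2004 rules: the `Gene` fields and the `classify` name are invented for the example, and the clone threshold is an assumption (the text gives slightly different wordings for the boundary at six clones; "at least six" is used here).

```python
from dataclasses import dataclass


@dataclass
class Gene:
    cds_aa: int                      # length of the best CDS, in amino acids
    standard_introns: int            # number of well-defined gt-ag / gc-ag introns
    clones: int                      # supporting cDNA clones
    extent_bp: int                   # extent on the genome
    has_refseq_or_omim: bool = False
    has_blastp_homolog: bool = False        # homology to a "real" AceView protein
    interesting_annotation: bool = False    # Pfam, BlastP, Psort2 domain, localization


def classify(gene):
    """Return 'main', 'putative', 'cloud' or 'discarded'."""
    if gene.cds_aa > 100 or gene.standard_introns >= 1:
        return "main"
    if gene.has_refseq_or_omim or gene.has_blastp_homolog:
        return "main"
    if gene.clones >= 6 or gene.interesting_annotation:   # assumed threshold
        return "putative" if gene.extent_bp <= 10_000 else "discarded"
    return "cloud" if gene.extent_bp <= 2_000 else "discarded"


if __name__ == "__main__":
    print(classify(Gene(cds_aa=250, standard_introns=0, clones=2, extent_bp=5000)))
    print(classify(Gene(cds_aa=40, standard_introns=0, clones=8, extent_bp=3000)))
    print(classify(Gene(cds_aa=40, standard_introns=0, clones=2, extent_bp=800)))
```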

How is the open reading frame selected? How is the protein score computed?

We witness important progress in peptide identification, and large proteome datasets have started to become available. But we should keep in mind that most annotated protein sequences are predicted in view of the mRNA sequence (or the genome): even the current Swissprot/Uniprot is in majority composed of predicted rather than experimentally validated proteins.

As of 2007, we have devised an empirical new score system to choose the most likely product from each mRNA, and to help decide if a transcript is potentially protein-coding or not.

1) We score the length of the predicted CDS:

a) If the CDS is above 100 amino acids, every stretch of 100 amino acids scores 1 point.

b) A CDS between 60 and 100 amino acids scores 0 points; a CDS below 60 amino acids scores -1 point. In addition, if the initiator Met codon is atypical (non-ATG codon fitting Kozak’s rule) and the CDS has fewer than 80 amino acids, a penalty of -0.85 is applied.

2) We count introns within the CDS and introns outside the CDS. “Good looking CDSs” have all or almost all intron scars within the CDS.

a) If all introns are within the CDS, each intron scores 1.

b) If 1 intron is outside the CDS, each intron inside the CDS scores 1 point, except if there are only two introns total, 1 in 1 out, then the score is 0.

c) If a transcript has a unique intron outside the CDS, the score is -1.

d) If 2 or more introns are outside the CDS, we score +1 per intron inside the CDS and -1 per intron outside the CDS. However, some rare transcripts have an operon-like structure (e.g. BAGE4andTPTE.aNov06): unlike pre-messengers, they may have lots of introns, but potentially encode multiple CDSs, located in succession on the transcript and appear molecularly distinct (i.e. not clearly belonging to a unique protein and encoded by a partly unspliced mRNA). To avoid losing these significant CDSs that could all end up having negative scores and be all dismissed, we score a maximum of 4 introns outside the CDS (maximum penalty of 4 for introns outside the coding region). Note that in some instances, NMD may lead to rapid degradation of mRNAs with introns located downstream of ~55 bp up from the last intron scar, but the large number of cDNA sequences from scores of cDNA libraries that share this property indicates that NMD does not act on all genes, all transcripts, in all tissues or at all times, or else that it is quite inefficient.

3) We score the interest or credibility or ‘conservative quality’ of the protein in three ways, for a total of 1 point maximum. Properties that score are a conserved sequence, the presence of conserved motifs (PFAM or some PSORT), or an interesting and rare predicted localization.

a) We examine BlastP homologies with expect less than 10^-3 and run TaxBlast. Existence of at least one BlastP hit to a species other than self scores a ‘conservation’ point.

b) We consider the Pfam significant hits, with thresholds as recommended by Sean Eddy. We exclude hits to frequent retrotransposons and retroposons, so as not to rescue these products too actively: DDE, gag_*, GP36, rve, rvp, rvt_1, transposase_* and ribosomal_*. Existence of any other Pfam hit scores 1 conservation point, unless the protein already had 1 conservation point from BlastP/TaxBlast.

c) We examine the motifs defined by Kenta Nakai’s Psort2 collection of programs and exploit the predicted cellular localization. We score a maximum of 1 point, again only if there was no point from BlastP and Pfam, if any of the following domains is found:

i) transmembrane domain

ii) coiled-coil region

iii) ER retention domain

iv) Golgi transport domain

v) N-myristoylation domain

vi) Prenylation domain

vii) With high probability (>=50%, except for Golgi, >40%), the NH2 and COOH complete CDS of more than 70 amino acids is predicted to be secreted or extracellular, or localized either in the plasma membrane, the mitochondria, the endoplasmic reticulum, the Golgi, the cytoskeleton, peroxisomes, lysosomes, or secretory vesicles.

4) It is not rare that an RNA transcript sequence has the potential to encode multiple proteins. To decide which of several very closely rated CDSs is ‘the best’, we score the position of the predicted protein relative to the transcript. The protein most 5’ will be the ribosome’s first encounter, so if a CDS is NH2 and COOH complete and already has at least one point (i.e. is already encoding a ‘good’ protein), it scores an extra 0.8 points. Finally, we compare the CDS lengths and add 0.001 point per amino acid, so that the longest CDS wins over a shorter one with equal annotation score.

5) Final classification:

a) A CDS with 1 to 5 points is ‘good’, a CDS with 5 points or more is ‘very good’. For each mRNA, we select the CDS with the highest score and do not display annotation for all other CDSs unless they are of ‘very good’ grade. The integral part of the score is given in the text.

b) AceView genes have at least one of the following properties: either

i) a ‘good’ protein, with score above 1 as defined above

ii) or at least one intron with standard boundaries (gt-ag or gc-ag),

iii) or more than 5 cDNA supporting clones,

iv) or an Entrez Gene ID or an OMIM annotation, or finally they include a RefSeq NM or NR.

c) Other genes with none of the above properties become ‘cloud genes’; they are supported by 1 to 4 cDNAs, they align with no intron and do not visibly encode a protein. Note that the average length of cloud genes is 500 bp (+-200), leaving open the possibility that a fraction of those genes represent 5’ or 3’UTR of partial transcripts of previously known or new genes. Others may correspond to artefacts, such as genomic DNA contaminations in RNA libraries. Deep RNA sequencing will teach us more about the properties of these genes and bring some evidence as to their real or artefactual nature. A CDS with a negative score, or with a score below 1, corresponds to either a partial product or a non-coding RNA.

d) Promoter sequence: To facilitate the search for promoter elements, we currently provide the 2 kb sequence upstream of each mRNA. Please tell us if you would prefer a longer sequence upstream or downstream.
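Steps 1 to 5a of the scoring scheme above can be put together in a short sketch. This is one reading of the rules, not the AceView implementation: the `Cds` fields and function names are invented, the Psort2 criteria i) to vii) are collapsed into a single boolean, exactly 100 amino acids is treated as scoring 0, and the 0.001 point per amino acid tie-breaker is assumed to be included in the graded total.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Pfam families that do not earn the conservation point (list from the text).
EXCLUDED_PFAM = ["dde", "gag_*", "gp36", "rve", "rvp", "rvt_1",
                 "transposase_*", "ribosomal_*"]


@dataclass
class Cds:
    length_aa: int
    atg_start: bool = True              # False for an atypical (non-ATG) initiator
    introns_in: int = 0                 # introns inside the CDS
    introns_out: int = 0                # introns outside the CDS
    blastp_other_species: bool = False  # BlastP hit (expect < 10^-3) to another species
    pfam_hits: tuple = ()
    psort_feature: bool = False         # any of the Psort2 criteria i) to vii)
    complete: bool = False              # NH2 and COOH complete
    most_5prime: bool = False           # first CDS met by the ribosome


def length_score(c):
    if c.length_aa > 100:
        score = float(c.length_aa // 100)
    elif c.length_aa >= 60:
        score = 0.0
    else:
        score = -1.0
    if not c.atg_start and c.length_aa < 80:
        score -= 0.85
    return score


def intron_score(c):
    if c.introns_out == 0:
        return c.introns_in                      # rule a
    if c.introns_out == 1:
        if c.introns_in == 0:
            return -1                            # rule c
        return 0 if c.introns_in == 1 else c.introns_in   # rule b
    return c.introns_in - min(c.introns_out, 4)  # rule d, penalty capped at 4


def conservation_score(c):
    pfam_ok = any(
        not any(fnmatchcase(hit.lower(), pattern) for pattern in EXCLUDED_PFAM)
        for hit in c.pfam_hits
    )
    return 1 if (c.blastp_other_species or pfam_ok or c.psort_feature) else 0


def total_score(c):
    base = length_score(c) + intron_score(c) + conservation_score(c)
    bonus = 0.8 if (c.most_5prime and c.complete and base >= 1) else 0.0
    return base + bonus + 0.001 * c.length_aa


def grade(score):
    if score >= 5:
        return "very good"
    if score >= 1:
        return "good"
    return "partial or non-coding"


if __name__ == "__main__":
    cds = Cds(length_aa=250, introns_in=3, pfam_hits=("Pkinase",),
              complete=True, most_5prime=True)
    print(round(total_score(cds), 3), grade(total_score(cds)))
```

For each mRNA, the CDS with the highest `total_score` would be the one displayed; the others would be shown only if they also grade as ‘very good’.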

Why is the AceView protein longer than the protein from a very similar transcript in GenBank?

There are two kinds of possible reasons, but please remember that all of the proteins we annotate are predicted from the mRNA sequence: experimental validation would be useful, and we would happily feed in our database all the peptide sequences that were actually observed (please contact us if you have such data).

1. Choice of the initiator Met: When working with mRNA and genome sequences, we do not have access to real protein sequences: we just predict them. Translation usually starts at an ATG (Met), but three other codons (GTG, TTG, or CTG) are candidates to be used as initiators in most species (see the codon usage table maintained by the Taxonomy group at NCBI). Faced with the choice of annotating a protein possibly too long or possibly too short, we decided to use as Start any of the possible Met codons: we pick whichever codon gives us the longest predicted protein, provided a non-standard initiator codon adds at least 30 amino acids N-terminal to the ATG. Using this rule, we end up annotating close to one third of the complete proteins (bounded by a Stop on both sides) from one of the “rare” initiator sites instead of an ATG. The same proportion is seen in very long and very short open reading frames. Note that we have not yet fixed the code to reflect the fact that the protein sequence would start with a Met rather than Leu or Val, if these codons were really used as initiator.

But please remember that we annotate longer rather than shorter proteins because it is easier to mentally truncate the annotation than to extend it. The position of the start specifically influences annotation for a signal peptide, and of course molecular weight and pI. And one can just look at the mRNA diagram where PFAM (yellow), BlastP (blue), Psort motifs (red) and ATG Met (-) are displayed.

On the other hand, if users would prefer proteins annotated from ATGs rather than NTG (with 30 residues gained), they should simply let us know.

2. AceView allows itself to extend open reading frames, including those of RefSeqs, counting on users to test and validate the proposed proteins. We take into account all the data available in GenBank and align each EST and mRNA carefully at the place in the genome where it originated. In a good number of cases, we find an EST aligning just upstream of an mRNA, itself open from its first base, and the two clones share some sequence: we merge them into a longer transcript candidate, potentially encoding a longer protein than available in a single GenBank entry. Because our transcript is then a composite minimally supported by more than one clone, in some cases the extension of the protein will not be legitimate. But in any event, considering the average 5’UTR size in human, we expect that a vast majority of complete transcripts should contain a stop in the 5’UTR; therefore a protein not bounded by a stop on each side would also be suspicious.
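The initiator rule from the first point above (prefer the longest protein, but accept a GTG, TTG or CTG start only if it adds at least 30 amino acids upstream of the first ATG) can be sketched as follows. The function name and input layout are assumptions, Kozak context is ignored, and the behaviour when the frame contains no ATG at all (the non-ATG start is used) is a guess, since the text does not cover that case.

```python
ALTERNATIVE_STARTS = {"GTG", "TTG", "CTG"}


def choose_start(orf_codons, min_gain=30):
    """Return the index of the chosen initiator codon, or None.

    orf_codons: in-frame codons (upper case) from just after the upstream
    stop up to, but not including, the terminating stop.
    """
    first_atg = next((i for i, c in enumerate(orf_codons) if c == "ATG"), None)
    first_alt = next(
        (i for i, c in enumerate(orf_codons) if c in ALTERNATIVE_STARTS), None
    )
    if first_alt is not None and (
        first_atg is None or first_atg - first_alt >= min_gain
    ):
        return first_alt
    return first_atg


if __name__ == "__main__":
    long_gain = ["GCT"] * 2 + ["CTG"] + ["GCT"] * 30 + ["ATG"] + ["GCT"] * 10
    short_gain = ["GCT"] * 2 + ["CTG"] + ["GCT"] * 10 + ["ATG"] + ["GCT"] * 10
    print(choose_start(long_gain), choose_start(short_gain))
```

In the first toy frame the CTG adds 31 residues and is chosen; in the second it adds only 11, so the ATG is kept.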

Can you help me design primers to amplify specific transcripts?

Right now, we only provide some useful hints to do that: since January 2004, we provide the sequence of each transcript using color and letter case to indicate exons and coding region respectively: we use alternating colors for exons; we use upper case for coding region and lower case for UTRs (or introns).

Such sequences are available by clicking from the transcript table on the “gene on genome” page, or individually at the bottom of each mRNA page. For computer oriented people, the complete list of transcripts coded this way is also available (in an easily parsable format) from the link to all sequences (gene page).
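For readers who parse these sequences, the letter-case convention makes it straightforward to separate the UTRs from the coding region. The sketch below assumes the transcript is available as a plain string with a single contiguous upper-case run; the exact layout of the downloadable file is not described here, so the function name and input format are hypothetical.

```python
import re


def split_transcript(sequence):
    """Split a case-coded transcript into (5' UTR, coding region, 3' UTR).

    Lower case marks UTR, upper case marks the coding region. If no
    upper-case run is found, the whole sequence is returned as 5' UTR.
    """
    match = re.search(r"[A-Z]+", sequence)
    if match is None:
        return sequence, "", ""
    return sequence[:match.start()], match.group(0), sequence[match.end():]


if __name__ == "__main__":
    print(split_transcript("acgtATGAAATAAggc"))
```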

To design primers specific to one form, you may use the “table of introns and exons” in the (alternative) mRNAs page to select exons that would provide a specific product: each exon knows all the AceView transcripts it belongs to, so you can examine the specificity from this table; strictly specific exons and introns are printed in red (since 2007). But remember that some variants are better supported than others, and we provide some evidence about this too, in one of the columns. If you are interested in the protein, the information you need is in the last column of the protein table on the gene page: there, the number of clones required to support the structure of the variant over the coding region is given. If one suffices, you can feel comfortable that the structure of your protein-coding region is fully validated. (You may want to check at the same time, in the first column of the protein table, if your protein is complete and if it is identical to that from another transcript (that would only differ in the UTR), or if it is included in another protein.) If the names of two or more clones appear in the last (support) column, then the transcript is derived from concatenation of partial or partially sequenced clones (ESTs) and your RT-PCR experiment will be instrumental in confirming the existence of this transcript. Then please do not forget to send your sequences to GenBank so that everybody can benefit from your work, and we can make AceView even more solid and useful! Heartfelt thanks!
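If you copy the “table of introns and exons” into a script, finding the exons that are strictly specific to one variant is a simple lookup. The data layout (a dict from exon label to the set of transcripts containing it) and the exon and transcript labels are made up for the example.

```python
def specific_exons(exon_to_transcripts, transcript):
    """Return the exons found in the given transcript and in no other."""
    return sorted(
        exon
        for exon, transcripts in exon_to_transcripts.items()
        if transcripts == {transcript}
    )


if __name__ == "__main__":
    table = {
        "exon1": {"a", "b", "c"},
        "exon2": {"b"},
        "exon3": {"a", "c"},
        "exon4": {"c"},
    }
    print(specific_exons(table, "b"))
```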

The Acembly/AceView program, credits, how to cite us

History of Acembly and AceView; the AceView database schema

Development corner

How to link to us, or cite us

Acknowledgments

History of AceView

The AceView program (initially called Acembly) was inspired by Yuji Kohara and Tsodas Shin-i from the National Institute of Genetics, Mishima, Japan and was developed from 1995 to 2000 at CNRS, Montpellier, France and NIG, Japan by Danielle and Jean Thierry-Mieg and Michel Potdevin to treat the nematode C. elegans transcriptome data. The program is written in C on top of the Acedb object-oriented database manager; and the schema of the database is truly gene centered. All objects including papers, phenotypes, molecules, experiments, regulation, and interactions and all our annotations ultimately point to the genes.

Since 2000, development of Acembly/AceView has been actively pursued at the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) and the program is now applied to other species including Homo sapiens. We have adapted the program to larger projects, developed a Web viewer and a new C interface to Acedb, called AceC, to construct the AceView text descriptions, fill up our data tables, and run the Web, whereas a new indexing and “grepping” system was written to support the gene centered “Query” functionality.

Development corner (September 2003)

Here are a few ideas for future development; we do not have a schedule. Our wonderful collaborator, Mark Sienkiewicz, left us in June 03, and our current team consists of the two of us at NCBI.

· Finish analysis of expression pattern. We have created a specific Acedb database to classify the tissue/stage/library fields of the GenBank records (mRNAs and ESTs) into an anatomic/developmental/pathological schema and have finished classifying about 3 million entries (1 million more to go). Then, we will use this classification in conjunction with the main human AceView and use the percentage of each tag aligned to describe the pattern of expression for each gene: when and where is it expressed, is it overexpressed or underexpressed in tumors, or in other conditions. We hope this will help map the still unmapped human diseases, in addition to bringing useful information on promoters and chromatin domains. In November 2003, we discovered that two groups have developed tools that could ease the process: the SANBI project (J. Kelso et al.) has developed an ontology adopted by Ensembl, while the GeneCards group (D. Lancet, Marilyn Safran and Maxim Shklar) has performed microarray experiments, and developed tools to compare their results to expression data derived from EST/mRNA assignments. We plan to collaborate with these groups to offer our users a validated expression profile, hopefully for release 35.

· Design a new Query by feature, which would allow the user to select the genes by acting on multiple fields at a time, using a combination of toggles and simple windows where you would type numbers. You could select simultaneously on map position, level and pattern of expression, presence of alternative variants and the various alternative features we calculate in the database, presence of a given protein motif, predicted cellular compartment, taxonomy (are there close genes in bacteria, viruses, invertebrates...), presence of a known phenotype, presence of an antisense gene, and also length of UTRs, of CDS, molecular weight, pI, number of exons, of introns, and types of introns (e.g., at-ac). All of these queries are structured, in that each applies to a single field in the database at a time. It should be easy to then provide a list with exactly the genes people need to consider, for example those in a given region of a chromosome that have a membrane domain and are overexpressed in tumors; or those expressed in brain and antisense to other genes. A second query, either of the standard AceView type (looking for any words in the gene’s records) or an iterative query by feature, should allow researchers to sort sets of genes with very interesting properties indeed, and hopefully help our knowledge of the genes to progress faster.

· Design a new graphical display where we would show all the protein isoforms from one gene coaligned. The Pfam, Psort, and BlastP homologies will be displayed along the entire combined length to allow quick visual assessment of what the differences between the various isoforms affect and what the functional consequences might be.

· Design couples of primers to amplify all exons evidenced in AceView (at the request of a user). We have done the first and maybe the most difficult step of identifying the redundancy level of each 18-25 nucleotide-long primer genome wide. We will look for primers that would amplify the exon plus at least 50 bp on each side (to allow reuse of the same primer for PCR amplification and sequencing and to still get good sequences), with a unique PCR product of maximum size 1.8 kb, optimally 0.7 kb (for a standard exon, of less than 600 bp), so that the sequences read from the two PCR primers actually read the entire two strands at high quality (on recent sequencers), yielding a non-ambiguous sequence. This should simplify allele sequencing, hence the identification of genes associated with diseases.

A related question is the design of primers to amplify specific transcripts. We could do a bit more integration there.

Another issue of interest is the identification of exons that could be used in the design of microarrays or any other type of array. Those could aim at identifying the most representative exon(s) in each gene, to get a panel of all the reliable human genes. One could limit by level of expression, or stage or tissue or coding / non-coding potential, or any property such as antisense gene. Alternatively one may wish to identify the sequences most diagnostic of specific alternative transcripts. We could do that relatively easily in AceView, because all the data is in acedb, hence easily amenable to very complex requests and the DNA from any set of objects can be dumped from the database at the end of any query.

We are of course also very interested in defining and analyzing putative promoters, but apart from identifying hopefully reliable first exons, we have not done any other analysis in that direction yet.

We would also like to set up downloadable acedb databases per chromosome (or maybe the single database we use, but it may be a bit big), so that people can get the whole lot of data and do the queries themselves, rather than being limited to painfully getting what we can show on the (somewhat frightening) web pages or what can be retrieved from the ftp site (very minimal and simplistic). We will do that if there is some demand, and people should be aware that we would only be able to offer extremely minimal user support.

Provide a mouse and rat AceView, as many users have asked. Make Arabidopsis better, and work on fly in collaboration with FlyBase. We could then offer some meta-query that would allow seamless passage from one gene in the species of interest to the others, guided by the availability of biological data.

To link to us, or to cite us

To Link to AceView:

AceView tries to answer any query by returning a gene or a list of genes. The query may be a gene name or alias, a cDNA clone or sequence identifier, a NCBI GeneID or UniGene ID, or more generally a meaningful identifier, word or group of words.

To create URL links to AceView, please use the following syntax:

https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=SPECIES&q=QUERY

where SPECIES is either human (for Homo sapiens), mouse (for Mus musculus), worm (for Caenorhabditis elegans), or ara (for Arabidopsis thaliana)

and QUERY is the identifier or words (query is not case sensitive; spaces between words are replaced by +)

Examples:

Access gene PTEN in human: https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&q=pten

Access GenBank accession from a mouse cDNA: av.cgi?db=mouse&q=AK046795

Access cDNA clone IMAGE:4838723 in human: av.cgi?db=human&q=image:4838723

Access genes related to mitotic spindle in worm: av.cgi?db=worm&q=mitotic+spindle

Access homolog of the worm gene smg-5 in human: av.cgi?db=human&q=smg-5

These URLs will automatically lead to the most recent version of the data and will find the genes even if their names changed.

One may be more specific and add a parameter c=CLASS where CLASS is a class in our AceDB database server (not case sensitive). This supposes knowledge of the current schema and is not usually recommended. It is however useful when accessing specifically the transcript view rather than the gene view (but the complete transcript name should then be given), calling for a gene name with too few characters (e.g. genes a and A in Drosophila…) or small GeneIDs or other numeric identifiers.

Examples:

Access mRNA variant c in gene STAT5A in human:

https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&c=mRNA&q=STAT5A.cApr07

Access GeneID 3 in human: https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=human&c=geneid&q=3
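If you generate such links from a script, the syntax above can be wrapped in a small helper. The function name `aceview_url` is hypothetical; the base URL, the db, c and q parameters and the species codes are those listed above, and the encoding relies on Python's standard `urllib.parse.urlencode`, which replaces spaces by + (the colon is kept as-is so that identifiers such as image:4838723 look like the examples).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi"
SPECIES = ("human", "mouse", "worm", "ara")


def aceview_url(species, query, cls=None):
    """Build an AceView link for a species and a query.

    cls is the optional c=CLASS parameter (for example "mRNA" or "geneid").
    """
    if species not in SPECIES:
        raise ValueError(f"unknown species code: {species!r}")
    params = {"db": species}
    if cls is not None:
        params["c"] = cls
    params["q"] = query
    return BASE_URL + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(aceview_url("human", "pten"))
    print(aceview_url("worm", "mitotic spindle"))
    print(aceview_url("human", "STAT5A.cApr07", cls="mRNA"))
```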

To cite us: Please use Danielle Thierry-Mieg and Jean Thierry-Mieg, AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12

The web site can be cited as: www.ncbi.nlm.nih.gov/IEB/Research/Acembly AceView: integrative annotation of cDNA-supported genes in human, mouse, rat, worm and Arabidopsis.

You may also look at the Publications page if you are interested in the ~150 articles reporting use and confirmation of AceView gene annotation, or our other articles on genes or alternative transcripts annotation, on microarrays, on worm genetics, or on Yang-Mills theory and particle physics.

Acknowledgments:

We thank our friend Yuji Kohara for giving us access to his beautiful data: the collaboration with his team has inspired the entire view we have of the genes and has led to the development of AceView. We are grateful to our previous collaborators, Mark Sienkiewicz, Vahan Simonyan, Adam Lowe, and Yann Thierry-Mieg, who contributed to this effort. We are indebted to Kenta Nakai and Sean Eddy for the nice tools they provide. We thank David Lipman for his interest and stimulating ideas, Donna Maglott for sharing the love of genes, and all our friends at NCBI, in particular the systems, Blast and taxonomy groups, for their help, encouragement, and support.

Feel free to contact us by email
