AceView downloads
Last updated October 8, 2012
This page provides access to the data that we generated over the years by using our AceView software with some manual guidance to annotate the genes from Human, Mouse, Rat, Arabidopsis and C. elegans. Most of the software we develop is available, although not necessarily fully documented.
Conditions of use:
Comprehensive integration of the public cDNA sequences into fully annotated genes is not an easy task, and we would appreciate some recognition! If you use the AceView data in your research or applications, please acknowledge the AceView site and quote our publication: Danielle Thierry-Mieg and Jean Thierry-Mieg, AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12.
If you wish to perform a large scale analysis of the AceView data, do not hesitate to contact us: we may extract richer data from our database in a consistent way and are always open to collaborations.
If you want to receive an announcement when we update the data or the code on the site, please mail us to be added to either of our mailing lists.
- Human 2011
- Mouse 2007
- Rat 2008
- Worm 2010
- Arabidopsis 2007
- Help and File Format
- Archive
- Software
Last updated November 12, 2011. New download now available for the function pages!
The human AceView 2010 release used the 9.2 million cDNA sequences available in GenBank/dbEST/Trace as of August 4, 2010 and aligned them on the Human NCBI genome 37/hg19 (Feb 2009) to generate a comprehensive, non-redundant, curated representation of all data submitted as cDNA sequences.
The RNA sequences define 37,463 spliced genes and 23,744 single exon putatively coding genes, in addition to partial or non coding single exon genes plus the "cloud". The 37,463 spliced genes group 205,676 spliced transcripts which include a total of 382,279 distinct intron scars, or exon-exon junctions.
The files available are:
Genes
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for all non-cloud genes (30 MB) (see the definition of clouds in the FAQ).
- An alternative file giving the structure/intron/exon/coding/UTR/support/other properties (34 MB) in a format suggested by a user and different from GFF.
- The plain coordinates of the AceView genes on the chromosomes (4.5 MB), and the connection between mRNAs, AceView genes, GeneID when available and eventual RefSeq (3.3 MB).
mRNA sequences, introns...
- The mRNA sequence models (coding or non-coding, including UTR parts) for the main AceView genes, annotated or not in Entrez Gene, in fasta format (89 MB). This list excludes the clouds (see the definition of clouds in the FAQ). The connection between RNA model names, AceView gene names, eventual GeneID (for genes annotated in Entrez Gene) and RefSeq ID is here (3.3 MB). Note that many variants for which there is not complete support from cap to polyA remain partial in AceView. Next generation sequence data will hopefully allow us to extend them to completion.
- All AceView transcript models, with no restriction, in fasta format (139 MB). This is in our opinion an excellent 'nr' database which may be advantageously used instead of GenBank+dbEST in most Blast applications. It indeed represents all cDNA sequence collections at NCBI, Nucleotide/ESTs/Trace and soon SRA, yet it is limited to RNA sequences from the species itself (no conservation-based models); it is non-redundant because RNAs were clustered by genomic alignment, and finally the sequences were rationalized to match the genome reference sequence. Again, the connection between RNA model names, AceView gene names, eventual Entrez GeneID and RefSeq ID is here (3.3 MB).
- The bordering sequences and support for 407,455 introns (some from RNA-seq)(33 MB)
Proteins, motifs...
- The amino acid sequences of the 191,507 'best' proteins from protein-coding transcripts (16 MB), and the corresponding coding DNA sequences (20 MB), both in fasta format. This file is restricted to proteins that score as ‘good proteins’ (see the FAQ), i.e. score > 1. Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products; most have not been observed experimentally.
- All peptides and proteins (735,052 peptides), with no restriction, for decoding mass spectrometry data, in fasta peptide format (45 MB). The corresponding DNA file is here (56 MB).
- The highly significant PFAM hits: position on the genome of the corresponding gene, title and accession of the PFAM, and coordinates of the domain in the proteins (3.5 MB)
cDNA accessions, evidence
- cDNA accession evidence and interesting tissue or stage annotation from GenBank is given in this file for the AceView genes (82 MB), and here for the cloud (12 MB). Help is here.
- Detailed support of the alignments supporting the genes (155 MB): this huge file gives, for each supporting cDNA accession, the coordinates in the accession and the matching coordinates in the exons of the gene model, as well as the number of mismatches per exon. Again, all AceView models are supported by cDNA accessions.
Microarrays
- We provide mapping of probes from whole genome expression arrays or other microarrays on the AceView genes (see this help). Ask us if your favorite arrays are missing. The AceView mapping should significantly improve the information you get from your arrays.
Disease related genes and other functional annotations
- These were provided at the request of users for the previous release. Please feel free to ask for what you would like in this release as well.
See also the interesting GOLD and MAQC projects and their resulting data.
If you need other kinds of files or analyses, please email us. We are open to collaborations and dedicated to a better understanding of the biology of the organisms.
Files for download are usually in tar.gz format. Help on some formats is available where indicated. Most formats are self-documented in the legend, which is the first line of the file. Please tell us if this is not the case or if additional explanations are needed.
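As a minimal sketch of how to check the legend of each downloaded file, the Python snippet below opens a tar.gz archive and returns the first line of every member. It uses only the standard library; the function name `read_legends` is hypothetical, and the assumption that the legend is plain text on the first line comes from the note above.

```python
import tarfile


def read_legends(archive):
    """Map each member file name to its first line (the legend).

    `archive` is either a path (str) or a binary file object
    holding a tar.gz archive.
    """
    if isinstance(archive, str):
        kwargs = {"name": archive}
    else:
        kwargs = {"fileobj": archive}
    legends = {}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            handle = tar.extractfile(member)
            first = handle.readline().decode("utf-8", errors="replace")
            legends[member.name] = first.rstrip("\r\n")
    return legends
```

Calling `read_legends` on the path of a downloaded archive returns a dictionary keyed by member file name, which makes it easy to see which files carry a legend.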
Last edited October 7, 2012 (added or updated files)
The Mouse September 2007 AceView release aligns 4.8 million cDNA sequences (available from GenBank/dbEST August 26, 2007) into a total of 70,239 genes, including 32,249 spliced genes, of which we annotate 3,667 as spliced non coding. We annotate 119,128 spliced transcripts on the Mus musculus NCBI genome 37/mm9 (July 2007).
Only 19,502 of the 32,249 evidence-supported spliced genes are annotated in Entrez Gene 37.2 (those genes include 21,838 GeneIDs because cDNAs bridge across some Entrez genes). On the other hand, 13,551 genes in Entrez Gene 37.2 are not supported by any cDNA sequences as of today.
The files posted here are:
Genes
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (20.0 MB).
- An alternative file giving the structure/intron/exon/coding/UTR/support/other properties (22.5 MB) in a format suggested by a user and different from GFF.
- The plain coordinates of the AceView genes on the chromosomes (2.5 MB), and the connection between mRNAs, AceView genes, eventual Entrez GeneID and RefSeq when available (2 MB).
mRNA sequences, introns...
- The mRNA sequence models (tar.gz or gz) (coding or non-coding, including UTR parts) for the main AceView genes, annotated or not in Entrez Gene, in fasta format (67 MB). This list excludes the clouds (see the definition of clouds in the FAQ). The connection between RNA model names, AceView gene names, eventual Entrez GeneID (for genes annotated in Entrez Gene) and RefSeq ID is here (2 MB). Note that variants not completely supported, from cap to polyA, remain partial in AceView. Next generation sequence data will hopefully allow us to extend them to completion.
- All 313,971 AceView transcript models, with no restriction, in fasta format (91.5 MB). This is in our opinion an excellent 'nr' database which may be advantageously used instead of GenBank+dbEST in most Blast applications. It indeed represents all cDNA sequence collections at NCBI, Nucleotide/ESTs/Trace and soon SRA, yet it is limited to RNA sequences from the species itself (no conservation-based models); it is non-redundant because RNAs were clustered by genomic alignment, and finally the sequences were rationalized to match the genome reference sequence. Again, the connection between RNA model names, AceView gene names, eventual Entrez GeneID and RefSeq ID is here (2 MB).
- The bordering sequences and support for the 281,484 introns annotated in AceView 2007 (23 MB)
Proteins, motifs...
- The amino acid sequences of the 190,260 'best' good proteins from protein-coding transcripts (15.1 MB), and the corresponding coding DNA sequences (20.8 MB), both in fasta format. This file is restricted to proteins that score as ‘good proteins’ (see the FAQ), i.e. score > 1. Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products; most have not been observed experimentally.
- All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta peptide format (31.6 MB). The corresponding DNA file is here (40.3 MB).
- The highly significant PFAM hits: position on the genome of the corresponding gene, title and accession of the PFAM, and coordinates of the domain in the proteins (3.0 MB)
cDNA accessions, evidence
- cDNA accession evidence and tissue or stage annotation from GenBank is given in this file for the AceView genes (62 MB), and here for the cloud (5.2 MB). Help is here.
- Details on the alignments supporting the genes (106 MB): this huge file gives, for each supporting cDNA accession, the coordinates in the accession and the matching coordinates in the exons of the gene model, as well as the number of mismatches per exon.
Microarrays
- Updated March 30, 2009. Mapping of mouse expression microarray probes from Affymetrix 430-2 (44.7 MB) and Illumina MouseWG-6_V1_1_R3 (5.3 MB), WG-6 V2 (6 MB) or MEEBO (10 MB) on June07 AceView transcripts, on RefSeq (NM, NR, XM, XR from March 2009) and on the genome build 37. See these explanations and the help on format
- Archive: First posted April 24, 2008, using the RefSeq from April 24, 2008, and the same genome NCBI37/mm9 and AceView models: link to Affy 430-2 2008 and Illumina WG-6 V1 (5.2 MB) and V2 (5.9 MB) from April 2008.
Affymetrix expression array Mouse 430-2 contains 496,468 probe sequences. 467,032 (94.1%) map on the current mouse genome, 466,730 (94.0%) map onto AceView transcripts, and 301,873 (only 60.8%) map onto RefSeq. Using the AceView mapping should significantly improve the information you get from your arrays.
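The percentages quoted above can be rechecked with a few lines of Python. The counts are those given in the text; the variable names are illustrative only.

```python
# Probe counts for the Affymetrix Mouse 430-2 array, as quoted in the text.
total_probes = 496_468
mapped = {
    "genome": 467_032,
    "AceView transcripts": 466_730,
    "RefSeq": 301_873,
}

# Percentage of probes mapping to each target, rounded to one decimal.
percent = {
    target: round(100 * count / total_probes, 1)
    for target, count in mapped.items()
}

# Probes that map onto AceView transcripts but are not counted for RefSeq.
gain_over_refseq = mapped["AceView transcripts"] - mapped["RefSeq"]
```

The difference between the AceView and RefSeq counts is the number of additional probes that become interpretable when the AceView mapping is used.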
October 2008
This first AceView release for the rat Rattus norvegicus aligns 733,932 cDNA sequences (out of the 956,279 sequences available September 13, 2008 in GenBank or the Trace repository) into 23,128 spliced genes (45,126 alternatively spliced variants) and a total of 102,551 main and cloud genes. Altogether, 35,340 genes are annotated as protein coding, but because there are still (surprisingly) few cDNA sequences for the rat, many of the genes are actually gene fragments, and should become merged in the future. Alignments were done on the current Rat reference genome (build 4).
This first rat AceView build is public in our gene browser, but please mail us if you would be interested in any of the usual files we provide.
Some users expressed interest in the mapping to microarrays, with a view to a comparative analysis of different platforms. Mapping of the probes on the genes is provided for Affymetrix expression array Rat 230.2 and Illumina array Rat Ref12V1.0. If you are interested in comparative analyses across microarray platforms, you may wish to look at these files and help document.
January 26, 2010
By user demand, we are posting today the usual files for the rat AceView 2008 genes on our ftp site, along with a word of caution: the NCBI build 4 genome of the rat is missing 3 or 4% of the transcribed areas/genes, and is not of top quality. There are only 734,000 cDNA sequences in GenBank or dbEST that mapped cleanly on this genome, as of October 2008. Consequently, AceView genes in this species give only an incomplete, shallow picture of the transcriptome: many of the genes are actually fragmented, and some will likely merge in the future.
Genes
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for all non-cloud genes (8 MB) (see the definition of clouds in the FAQ).
- An alternative file giving the structure/intron/exon/coding/UTR/support/other properties (6.9 MB) in a format suggested by a user and different from GFF (see the Help and File format tab).
- The plain coordinates of the AceView genes on the chromosomes (1.2 MB), and the connection between mRNAs, AceView genes, GeneID when available and eventual RefSeq (1 MB).
mRNA sequences, introns...
- The mRNA sequence models (coding or non-coding, including UTR parts) for the main AceView genes, annotated or not in Entrez Gene, in fasta format (22.5 MB). This list excludes the clouds (see the definition of clouds in the FAQ). The connection between RNA model names, AceView gene names, eventual GeneID (for genes annotated in Entrez Gene) and RefSeq ID is here (1 MB). Note that many variants for which there is not complete support from cap to polyA remain partial in AceView. Next generation sequence data will hopefully allow us to extend them to completion.
- All AceView transcript models, with no restriction, in fasta format (31.4 MB). This is in our opinion a good 'nr' database which may be advantageously used instead of GenBank+dbEST in most Blast applications. It indeed represents all cDNA sequence collections at NCBI, Nucleotide/ESTs/Trace and soon SRA, yet it is limited to RNA sequences from the species itself (no conservation-based models); it is non-redundant because RNAs were clustered by genomic alignment, and finally the sequences were rationalized to match the genome reference sequence. Again, the connection between RNA model names, AceView gene names, eventual Entrez GeneID and RefSeq ID is here (1 MB).
- The bordering sequences and cDNA support for 163,600 introns supported by AceView alignments (8.5 MB)
Proteins, motifs...
- The amino acid sequences of the 57,842 good 'best' proteins from protein-coding transcripts (7.1 MB), and the corresponding coding DNA sequences (10.1 MB), both in fasta format. This file is restricted to proteins that score as ‘good proteins’ (see the FAQ), i.e. score > 1. Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products; most have not been observed experimentally.
- All peptides and proteins (240,258 peptides), with no restriction, for decoding mass spectrometry data, in fasta peptide format (13.4 MB). The corresponding DNA file is here (17.8 MB).
- The highly significant PFAM hits: position on the genome of the corresponding gene, title and accession of the PFAM, and coordinates of the domain in the proteins (1.3 MB)
cDNA accessions, evidence
- cDNA accession evidence and interesting tissue or stage annotation from GenBank is given in this file for the AceView genes (11 MB), and here for the cloud (2.2 MB). Help is here.
- Detailed support of the alignments supporting the genes (20.3 MB): this huge file gives, for each supporting cDNA accession, the coordinates in the accession and the matching coordinates in the exons of the gene model, as well as the number of mismatches per exon. Again, all AceView models are supported by cDNA accessions.
Microarrays
See above. Microarrays were mapped and compared in October 2008, and this analysis is still current.
Last edited Feb 22, 2011
The Arabidopsis thaliana September 2007 AceView release aligns 1,188,694 cDNA sequences (available from GenBank/dbEST September 15, 2007) into 22,177 spliced genes or a total of 32,925 genes when we add the single exon genes. We annotate 33,787 spliced transcripts on the Arabidopsis NCBI genome 7.0 (August 2007).
The files available are:
Genes
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (6.1 MB).
- An alternative file giving the structure/intron/exon/coding/UTR/support/other properties (5.9 MB) in a format suggested by a user and different from GFF is also provided.
- The plain coordinates of the AceView genes on the chromosomes (0.4 MB), and the connection between mRNAs, AceView genes, eventual Entrez GeneID and RefSeq when available (0.5 MB).
mRNA sequences, introns...
- The mRNA sequence models (coding or non-coding, including UTR parts) for the main AceView genes, annotated or not in Entrez Gene, in fasta format (16.4 MB). This list excludes the clouds (see the definition of clouds in the FAQ). The connection between RNA model names, AceView gene names, eventual Entrez GeneID (for genes annotated in Entrez Gene) and RefSeq ID is here (0.5 MB). Note that variants for which there is not complete support, from cap to polyA, remain partial in AceView. Next generation sequence data will hopefully allow us to extend them to completion.
- All AceView transcript models, with no restriction, in fasta format (16.7 MB). This is in our opinion an excellent 'nr' database which may be advantageously used instead of GenBank+dbEST in most Blast applications. It indeed represents all cDNA sequence collections at NCBI, Nucleotide/ESTs/Trace and soon SRA, yet it is limited to RNA sequences from the species itself (no conservation-based models); it is non-redundant because RNAs were clustered by genomic alignment, and finally the sequences were rationalized to match the genome reference sequence. Again, the connection between RNA model names, AceView gene names, eventual Entrez GeneID and RefSeq ID is here (0.5 MB).
- The bordering sequences and support for the introns (7.4 MB)
Proteins, motifs...
- The amino acid sequences of the 'best' good proteins from protein-coding transcripts (7.3 MB), and the corresponding coding DNA sequences (10.6 MB), both in fasta format. This file is restricted to proteins that score as ‘good proteins’ (see the FAQ), i.e. score > 1. Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products; most have not been observed experimentally.
- All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta peptide format (8.5 MB). The corresponding DNA file is here (11.8 MB).
- The highly significant PFAM hits: position on the genome of the corresponding gene, title and accession of the PFAM, and coordinates of the domain in the proteins (1.5 MB)
cDNA accessions, evidence
- cDNA accession evidence and interesting tissue or stage annotation from GenBank is given in this file for the AceView genes (12.1 MB), and here for the cloud (0.1 MB). Help is here.
- Detailed support of the alignments supporting the genes (22.2 MB): this huge file gives, for each supporting cDNA accession, the coordinates in the accession and the matching coordinates in the exons of the gene model, as well as the number of mismatches per exon.
Last updated February 28, 2011
Files for download are usually in tar.gz format. The format of simple files is embedded in the first line of the file. Files for which the content cannot be fully guessed from the legend in the file are documented here:
The .gff file provides the structure of the genes, excluding the 'cloud', i.e. the coordinates of the gene elements. The tar.gz contains one file per chromosome; MT is the mitochondrial genome.
The gff format is nicely defined by Jim Kent on the UCSC genome browser.
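As a rough illustration of reading such a file, the Python sketch below splits one GFF line into named fields. The layout follows the general nine-column GFF convention (seqname, source, feature, start, end, score, strand, frame, group/attributes); whether every line of the AceView files fills all nine columns is an assumption, and the names `GffRecord` and `parse_gff_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one tab-delimited GFF line; return None for blanks and comments."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t", 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-delimited fields: {line!r}")
    if len(fields) == 8:
        fields.append("")  # the ninth (group/attributes) field may be absent
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, attrs)
```

Applied to each line of a per-chromosome file, this yields records whose `feature` field can be used to select exons, introns, CDS or UTR entries.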
Aim of this file:
To allow genome-wide resources depicting genes to represent nicely the properties of the alternatively spliced variants seen by AceView upon mapping and clustering of the millions of public mRNA sequences on the genome. This table provides, for each mapped gene, the GeneID, the coordinates of the gene on the genome, and a description of all its cDNA-supported introns. It then provides, for each mRNA variant, the coordinates in the gene, the intron signature and information on completeness at both the 5’ and 3’ ends: putative promoters and confirmed alternative polyadenylation sites are annotated. For putative protein-coding mRNAs, the table gives information on the protein, its location in the transcript, its completeness, which start codon is possibly used, and which cDNA clone(s)/accessions encode a protein identical to the predicted protein.
Format:
To avoid redundancy and to keep the table simple, we let the content of each column depend on the TYPE of data, given in the first column of each line. There are currently 5 types, which appear in the following order in the table: GENE, INTRON (in the above gene), mRNA (in the gene above; there are as many mRNA lines under a gene as there are alternative variants), INTRONS (in the mRNA above it), PROTEIN (in the mRNA above, if applicable).
This format is versatile and will allow any desirable extensions.
The table is tab-delimited; multi-valued columns are semicolon-delimited (';').
Example:
GENE AMELY 266 Y 6785429 6777320 8110
INTRON 1 gt_ag 57;1407 5 BC069138;M86933;NM_001143;BC074976;BC074977
INTRON 2 gt_ag 1474;3974 7 AY487421;BC069138;M86933;NM_001143;BC074976;BC074977;BG198114
INTRON 3 gt_ag 4023;5118 7 AY487421;BC069138;M86933;NM_001143;BC074976;BC074977;BG198114
INTRON 4 gt_ag 4023;5251 7 AY487421;BC069138;M86933;NM_001143;BC074976;BC074977;BG198114
INTRON 5 gt_ag 5161;5251 1 AY487421
INTRON 6 gt_ag 5297;5565 7 AY487421;BC069138;M86933;NM_001143;BC074976;BC074977;BG198114
INTRON 7 gt_ag 5992;7949 7 AY487421;BC069138;M86933;NM_001143;BC074976;BC074977;BG198114
mRNA AMELY.aAug05 266 1420 7955 - - AY487421
INTRONS 5 2;3;5;6;7 1474;3974;7 4023;5118;7 5161;5251;1 5297;5565;7 5992;7949;7
PROTEIN 4 1420 7955 NH2_partial Stop
mRNA AMELY.bAug05 266 1 8110 1 7983;8110 M86933
INTRONS 5 1;2;4;6;7 57;1407;5 1474;3974;7 4023;5251;7 5297;5565;7 5992;7949;7
PROTEIN 4 1420 7955 ATG Stop BC069138;M86933;NM_001143;BC074976;BC074977
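A minimal Python sketch of how this typed table could be read is given below. It groups INTRON lines under their GENE, and INTRONS and PROTEIN lines under the preceding mRNA, as described in the Format paragraph. It assumes the real file is tab-delimited (the example above is shown with spaces); the function names and the dictionary layout are hypothetical, and fields are kept as raw strings rather than interpreted.

```python
def split_multi(value):
    """Split a semicolon-delimited multi-valued column into a list."""
    return value.split(";")


def parse_aceview_table(lines):
    """Group the typed lines into one dictionary per GENE block."""
    genes = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        kind, *fields = line.split("\t")
        if kind == "GENE":
            genes.append({"fields": fields, "introns": [], "mrnas": []})
        elif not genes:
            continue  # legend or anything else before the first GENE line
        elif kind == "INTRON":
            genes[-1]["introns"].append(fields)
        elif kind == "mRNA":
            genes[-1]["mrnas"].append(
                {"fields": fields, "introns": None, "protein": None}
            )
        elif kind in ("INTRONS", "PROTEIN") and genes[-1]["mrnas"]:
            key = "introns" if kind == "INTRONS" else "protein"
            genes[-1]["mrnas"][-1][key] = fields
    return genes
```

Keeping the fields as strings leaves the interpretation of each column to the caller, which is convenient because the meaning of a column depends on the TYPE of the line.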
Content of the columns, per TYPE:
GENE
- Column 1: the name of the gene in AceView.
If there is an official name, the AceView name is the official gene name, except in the infrequent cases where there is evidence that at least one transcript shares introns with transcripts from two adjacent Entrez genes with different official names, say GENEa and GENEb. In this case, the gene is more complex than previously thought, and we name it GENEaANDGENEb, concatenating the official gene names. Each transcript then knows if its best contact is to GENEa or to GENEb, and bears that GeneID. If there is no official gene name, we use a Pfam-derived name.dot.number, else a purely computer generated AceView name. We try to track all gene names from release to release.
- Column 2: the Entrez GeneID, uniquely inherited, in this order, by contact with an NCBI gene model aligned by NCBI on the reference genome, otherwise through a RefSeq model aligned in the AceView gene, otherwise through one of the 'reference mRNAs' in Entrez Gene. A number of AceView genes do not have a corresponding Entrez GeneID yet, and, as explained above, some AceView genes merge multiple genes with different GeneIDs, hence this column may be Null (-) or multi-valued.
- Column 3: The chromosome number (1 to 22, X and Y; NT_* denotes unmapped contigs and follows the NCBI nomenclature)
- Column 4: The coordinate of the 5’ end of the gene on the chromosome, in bp, as currently supported by cDNAs in the public domain (the gene may be partial, and nearby genes on the same strand may become merged by new cDNA sequences in a future release.)
- Column 5: The coordinate of the 3’ end of the gene on the chromosome, in bp. Note that comparing the values in column 4 and 5 gives the strand of the gene: if the value in column 5 is greater than the value in column 4, the gene is encoded on the forward strand; otherwise it is on the reverse strand.
- Column 6: the extent of the gene on the genome, in bp. Cumulatively, that is the extent of the genome transcribed in pre-messengers that will mature into mRNAs from this gene.
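The strand rule and the gene-relative coordinate system can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and both the extent formula and the conversion assume 1-based inclusive coordinates with base 1 at the 5’ end of the gene. That assumption is consistent with the AMELY example above (extent 8110) but is not otherwise confirmed here.

```python
def gene_strand(five_prime: int, three_prime: int) -> str:
    """Forward strand if the 3' coordinate (column 5) exceeds the 5' one (column 4)."""
    return "+" if three_prime > five_prime else "-"


def gene_extent(five_prime: int, three_prime: int) -> int:
    """Extent of the gene in bp, counting both end bases (assumed inclusive)."""
    return abs(three_prime - five_prime) + 1


def gene_to_chromosome(pos_in_gene: int, five_prime: int, three_prime: int) -> int:
    """Convert a gene-relative position (base 1 = 5' end) to a chromosome coordinate."""
    offset = pos_in_gene - 1
    if gene_strand(five_prime, three_prime) == "+":
        return five_prime + offset
    return five_prime - offset
```

With the AMELY values (5’ end 6785429, 3’ end 6777320), the gene is on the reverse strand, so gene-relative positions count downward along the chromosome.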
INTRON:
“Introns” are here taken in a broad sense to denote any discontinuity in the cDNA alignments. The coordinates of exons can be deduced from the table and are not given explicitly. Because the ends of the terminal 5' and 3' exons are biologically variable and often partial, they are called alternative exons and appear artificially more numerous than alternative introns. More than 99% of the objects under INTRON are introns defined by perfect alignments of cDNAs over at least 8 bases on each side of the intron; 98% of all are standard introns with [gt-ag] or [gc-ag] boundaries, 1.7% are non-standard or fuzzy (imperfectly defined), and the remaining 0.3% correspond to alignment gaps.
Following our analysis in GOLD, we now annotate all human introns below 60 bp as structural size polymorphisms and note they most often occur in the 3’UTRs. The relatively low level of non-standard introns we report reflects the fact that we recognize and flag cDNA clone anomalies such as partial insert deletions or rearrangements carefully (in about 5% of the cDNAs). To represent a variant with a non-standard intron, we demand either that three or more cDNAs support the non-standard intron, or that the cDNA harboring this non-standard intron also brings a unique standard alternative intron or exon.
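The rules stated in this section can be written down as simple predicates. The Python sketch below is only a reading of the text, not the AceView implementation: the function names are hypothetical, the intron length is assumed to be counted inclusively from its first to its last base, and the flag `brings_unique_standard_feature` stands for "the cDNA also brings a unique standard alternative intron or exon".

```python
STANDARD_BOUNDARIES = {"gt_ag", "gc_ag"}


def is_standard_intron(boundaries: str) -> bool:
    """True for gt-ag or gc-ag boundaries (written gt_ag / gc_ag in the table)."""
    return boundaries.lower().replace("-", "_") in STANDARD_BOUNDARIES


def is_size_polymorphism(start: int, end: int, min_intron_length: int = 60) -> bool:
    """True if the intron is below 60 bp (human rule quoted in the text)."""
    length = end - start + 1  # assumption: inclusive coordinates
    return length < min_intron_length


def keep_nonstandard_variant(support_count: int,
                             brings_unique_standard_feature: bool) -> bool:
    """A variant with a non-standard intron needs 3+ supporting cDNAs,
    or a cDNA that also brings a unique standard alternative intron or exon."""
    return support_count >= 3 or brings_unique_standard_feature
```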
AceView currently annotates close to twice the number of standard introns in RefSeq or EBI, yet all AceView introns in the table are supported by cDNA sequence alignments that are locally perfect (over 16 bp), hence unambiguous. The number of introns continues to grow almost linearly with the number of cDNA sequences deposited in the public databases, a sure sign that we are still far from saturation and that the transcriptome is truly more complex than what we describe today, even in AceView!
- Column 1: ordinal number of the intron
- Column 2: intron boundaries, either standard gt-ag or gc-ag, or any other sequence, fuzzy or gap
- Column 3: coordinates of the first and last base of the intron, relative to the gene (base 1 is the 5’ end of the gene).
- Column 4: number of sequences that exactly support this intron, with 8 exactly matching bases on each side.
- Column 5: list of the accessions in the gene that exactly support the intron.
- First the number of introns (standard, with gt-ag or gc-ag intron boundaries); then the number of other non-standard introns; then the number of alignment or sequencing gaps if any. The three numbers are semicolon-delimited.
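Since exon coordinates are not given explicitly, here is a minimal Python sketch of how they could be deduced from the introns of one transcript. The function name is hypothetical. The usage values are taken from the AMELY.bAug05 lines of the example, assuming that this variant spans bases 1 to 8110 of the gene, as its mRNA line suggests.

```python
def exons_from_introns(first_base, last_base, introns):
    """Return exon (start, end) pairs, in gene coordinates, for one transcript.

    `introns` is a list of (first_base, last_base) pairs for the introns
    of the transcript; exons are the stretches between and around them.
    """
    exons = []
    exon_start = first_base
    for intron_start, intron_end in sorted(introns):
        exons.append((exon_start, intron_start - 1))
        exon_start = intron_end + 1
    exons.append((exon_start, last_base))
    return exons


# Introns 1, 2, 4, 6 and 7 of the AMELY example (variant AMELY.bAug05).
amely_b_introns = [(57, 1407), (1474, 3974), (4023, 5251),
                   (5297, 5565), (5992, 7949)]
amely_b_exons = exons_from_introns(1, 8110, amely_b_introns)
```

A transcript with n introns yields n + 1 exons; the first and last exons are bounded by the transcript ends, which may be partial.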
The intron-exon table provides the elements of the multiexon transcripts (those for which the first number in column 11 above is not 0), also in the coordinates of the gene.
• Column 1: ordinal number of the intron in the transcript
• Column 2: ordinal number of the intron in the gene, to help the drawing at GeneCards; alternatively, this may simply be the intron signature (concatenated intron numbers)
• Columns 3 and 4: coordinates of the first and last base of this intron in the gene
• Column 5: intron boundaries. The first two and last two bases of the intron, almost always gt-ag, some gc-ag, and a few at-ac thought to be U12-dependent.
• Column 6: number of clones supporting each intron in the entire gene
• Column 7: example clones allocated to this variant (semicolon-delimited)
The position of each AceView transcript, and whether or not each variant is complete on the 5’ side (putative promoter) and 3’ side. On the 3’ side, we now annotate carefully the multiple alternative polyadenylation sites.
A second table details the “intron exon structure” of the spliced variants and provides the degree of support (number of clones, total or specific) for each intron, together with the well supported starts and ends of the variants.
- All other coordinates in this table are given relative to the gene
- Column 6: the variant name. Note: the variant encoding the largest predicted protein is called a, then b and so on. Alternative variants may change from release to release as new cDNAs are submitted, but the variant name includes the date of the release aDate, bDate…, which makes it unambiguously attached to a single sequence.
- Columns 7 and 8: The coordinates in the gene of the first and last base of the transcript (base 1 is the first base of the gene, i.e. the first base of the 5’-most transcript of the gene)
- Column 9: A well supported 5’ end, indicating a promoter. When there is evidence that the transcript is likely complete on the 5’ side, the coordinate in the gene of the validated transcript start is given. There is at most one 5’ end per variant. Elements to decide on 5’ completion include close agglomeration of many 5’ ends of cDNAs, and the presence of a clone from a cap-selected or full-length enriched library within the first 50 bases of the transcript (because the actual start of transcription may biologically span up to 61 bp, according to Suzuki et al., EMBO Rep. 2001 May;2(5):388-93). If no such evidence exists, we also consider as putatively complete on the 5’ side a transcript encoding a protein of good quality, bounded on the NH2 side by an in-frame Stop; in such a case, the real end of the transcript may be somewhat upstream of the current 5’ end, but introns in 5’UTR are infrequent. Any transcript for which there is no evidence for completion on the 5’ end has NULL in this column.
- Column 10: the coordinates in the gene of all well supported (alternative) polyadenylation sites. In AceView, in order to reduce the apparent number of transcripts, we merge transcript variants that share the same introns and only differ in the length of the last exon, i.e. in the position of their polyadenylation sites. We now indicate where the well supported alternative ends are. Determining the likely 3’ ends involves clustering the known 3’ ends, either from cDNAs or ESTs where the polyA is visible in the sequence, or from groups of reverse-aligned (usually 3’) ESTs whose 3’ ends cluster in a short area, or from extremities of the mRNAs (more precisely, reference mRNAs quoted in Entrez Gene) that have been submitted to GenBank with the mention ‘Complete CDS’. The ‘clusters of 3’ ends’ are then analyzed for the presence of A-rich stretches just downstream of each cluster in the genome, which would suggest that the clones, usually prepared using oligo dT aiming at the polyA, actually hybridized to a region rich in A in the mRNA and/or the genome. When such an A-rich region is found, all clones in the cluster are labeled as ‘internal priming in A rich region’, and the 3’ end is not confirmed. Finally, a search for AATAAA or any of its single letter variants in the correct position, less than 40 bases before the cluster of 3’ ends, allows us to define the most probable ‘true’ 3’ ends. Contrary to UniGene, we let single base variants of the main AATAAA signal contribute because we found that altogether they account for just about half of the signals in unambiguously complete transcripts.
Some transcripts have multiple alternative polyadenylation sites (multiple confirmed 3’ end sites; this column may be multi-valued), while some are incomplete on the 3’ end and have no such site annotated.
- Column 11: the number of introns in the transcript: first the number of standard introns (with gt-ag or gc-ag intron boundaries), then the number of other, non-standard introns, then the number of alignment or sequencing gaps, if any. The three numbers are semicolon delimited.
- Column 12: the protein-coding score. Variants with scores greater than 1 are likely protein coding; variants with 0 or negative scores are probably not. The magnitude of the score depends on the size, conservation and annotation of the protein as well as on the position of the CDS relative to the intron scars. Scores are also used to select the ‘best predicted protein’, whose annotation is displayed on the web (more details here).
- Columns 13 and 14: the coordinates of the ‘best predicted protein’ in the gene. Some proteins are partial. Many are complete and start at ATG/Met. Some of the complete proteins gain at least 60 amino acids relative to the standard ATG if they start at a single letter variant of ATG, following the Kozak rule: those are hypothesized to encode the longer variant, possibly in addition to a shorter isoform starting at the first Met. Only the longest variant is annotated in AceView, but since most of the annotation is local, it is easy to deduce the annotation of the shorter form by examining the graphical representation of the variant.
- Column 15: the number of standard introns outside the best protein. Unambiguously coding transcripts most often have no intron outside the CDS; some have one in the 5’UTR. It has been shown that some transcripts with introns downstream of the CDS might be targeted for one round of translation, followed by mRNA degradation by the nonsense mediated decay (NMD) pathway, but the generality of the phenomenon remains to be demonstrated, and many exceptions have been documented at the functional and molecular level.
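The polyadenylation signal search described for Column 10 can be illustrated with a minimal Python sketch. This is not the AceView code: the function names are hypothetical, and we assume the caller supplies the transcript-strand sequence ending at the candidate 3’ end. Only the 40-base window and the acceptance of single letter variants of AATAAA come from the description above.

```python
CANONICAL_SIGNAL = "AATAAA"

def signal_variants(signal=CANONICAL_SIGNAL):
    """Return the signal plus every single-letter substitution of it."""
    variants = {signal}
    for i in range(len(signal)):
        for base in "ACGT":
            variants.add(signal[:i] + base + signal[i + 1:])
    return variants

def find_polya_signal(upstream, window=40):
    """Look for a signal in the last `window` bases before a 3' end.

    `upstream` is assumed to be the transcript-strand sequence that ends
    at the candidate 3' end. Returns (distance_from_3prime_end, hexamer)
    for the hit closest to the 3' end, or None if there is no hit.
    """
    region = upstream.upper()[-window:]
    variants = signal_variants()
    # Scan from the 3' end backwards so the closest hit is reported first.
    for start in range(len(region) - 6, -1, -1):
        hexamer = region[start:start + 6]
        if hexamer in variants:
            return len(region) - start, hexamer
    return None
```

A 3’ end cluster for which `find_polya_signal` returns `None` would, in this simplified view, lack signal support; the real annotation also weighs A-rich internal priming, which this sketch does not model.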
The first line of the table reads:
Gene Entrez_GeneID chromosome 5’end_of_gene 3’end_of_gene variant_name First_base_in_gene Last_base_in_gene defined_5’end(promoter) defined_3’ends(polyA) #introns(standard_other_gap) protein_coding_score best_protein_begins best_protein_ends #standardIntrons_outsideCDS
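As an illustration of how columns 9, 11 and 12 might be decoded once a line of this table has been split into fields, here is a short Python sketch. The helper names are ours, and the label returned for scores between 0 and 1 is an assumption, since the text only characterizes scores above 1 and scores of 0 or below.

```python
def parse_five_prime_end(field):
    """Column 9: validated transcript start, or None when the column is NULL."""
    return None if field == "NULL" else int(field)

def parse_intron_counts(field):
    """Column 11: 'standard;other;gaps' -> tuple of three integers."""
    standard, other, gaps = (int(part) for part in field.split(";"))
    return standard, other, gaps

def coding_call(score):
    """Column 12: apply the thresholds quoted in the column description."""
    if score > 1:
        return "likely protein coding"
    if score <= 0:
        return "probably not coding"
    # Assumption: the text does not say how to read scores in (0, 1].
    return "undetermined"
```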
Variant_name First_intron_begins First_intron_ends Intron_boundaries totalNumberOfSupportClones
example clones allocated to this variant supporting this intron (semicolon delimited? Or one line per accession, repeating the rest of the line?)
mRNA
The columns are:
- mRNA name;
- GeneID (generally single-valued here);
- coordinate of the first base in the gene;
- coordinate of the last base in the gene;
- coordinate of the true 5’ end, if the mRNA is complete (indicates a promoter) / - if there is no evidence that the mRNA is complete;
- coordinates of the confirmed polyadenylation site(s), possibly multi-valued / - if there is no confirmed polyadenylation site;
- examples of clones supporting the mRNA (chosen because they cover the mRNA, or at least the CDS, and have the sequence closest to that of the underlying genome)
INTRONS (in the mRNA above)
The columns are:
- number of introns in the mRNA of the line above;
- intron signature of the mRNA (the numbers of the introns described in INTRON, concatenated);
- each of the following columns contains, for each successive intron, a triplet of numbers: coordinate in the gene of the first base of the intron, last base of the intron, and number of accessions supporting this intron; the three numbers are separated by semicolons. This should be useful for your drawings.
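A minimal Python sketch of how the per-intron triplets of an INTRONS line could be decoded. The names `Intron` and `parse_intron_triplets` are hypothetical, and the input is assumed to be the list of triplet columns already separated from the two leading columns.

```python
from typing import NamedTuple

class Intron(NamedTuple):
    first_base: int   # coordinate in the gene of the first base of the intron
    last_base: int    # coordinate in the gene of the last base of the intron
    support: int      # number of accessions supporting the intron

def parse_intron_triplets(columns):
    """Turn ['230;3358;2', ...] into a list of Intron records."""
    introns = []
    for column in columns:
        first, last, support = (int(value) for value in column.split(";"))
        introns.append(Intron(first, last, support))
    return introns
```

For example, `parse_intron_triplets(["230;3358;2"])[0].support` gives the number of accessions for the first intron.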
PROTEIN (best predicted protein in the mRNA above, if the mRNA appears to be coding)
The columns are:
- the grade of the protein (a protein quality grading system used to choose the best protein encoded by the transcript; I will describe this in English for the next release. An mRNA whose best protein has a grade below 1 is treated as non-coding or too incomplete and has no associated PROTEIN line);
- coordinate in the gene of the start of the protein (bp);
- coordinate in the gene of the end of the protein, including the Stop;
- nature of the protein start (either NH2_partial or, if the product is complete, the initiator codon: generally ATG, sometimes another codon that can serve as initiator (according to Kozak and other examples) and that gains at least 60 amino acids relative to using the first ATG);
- nature of the protein end, Stop or COOH_partial;
- list of the clones whose sequence encodes exactly this protein, without any amino acid substitution
The file gives the relationship between the mRNA variant in the fasta files on our downloads site, the AceView gene, the Entrez gene ID and the RefSeq accession when applicable.
- Col 1: mRNA variant, as in the fasta mRNA file
- Col 2: gene name in AceView. It is the official name at the date of the release whenever there is an official name, else a gene name derived from a significant Pfam protein domain homology, else a computer generated name specific to AceView and maintained from release to release, all in lower case. If your gene does not have an official gene name, please ask for one before you publish on this gene, as AceView gene names have no official status.
- Col 3: GeneID (if any). Often a gene is annotated in AceView because it is supported by cDNA sequences, but it is not yet an official gene in Entrez Gene, hence it has no GeneID. Also, in some cases, AceView sees one gene where Entrez sees two, so in those cases there will be two GeneIDs associated with that single AceView gene.
- Col 4: NM included in this variant, if any. Note that we allow extensions of NMs, so the correspondence means inclusion, not necessarily identity.
New, February 8, 2009
Introns are discovered by alignment to the reference genome and annotated in AceView. Our criterion for a cDNA to “support a junction” is that the cDNA sequence exactly matches 16 bp centered on the junction (8 bases in each exon bordering the intron).
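The support criterion can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name is ours, the two exon sequences and the cDNA are given on the same strand, and exons shorter than 8 bases are treated as untestable, a case the text does not address.

```python
def supports_junction(cdna, exon1, exon2, flank=8):
    """True if `cdna` contains the exact 2*flank bases centered on the junction.

    `exon1` is the exon upstream of the intron and `exon2` the exon downstream,
    both on the same strand as `cdna` (assumption of this sketch).
    """
    if len(exon1) < flank or len(exon2) < flank:
        return False  # assumption: too-short exons cannot be tested this way
    junction = (exon1[-flank:] + exon2[:flank]).upper()
    return junction in cdna.upper()
```

A single mismatch anywhere in the 16 bp window makes the function return `False`, which mirrors the "exactly matches" wording of the criterion.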
This file documents the exon-exon junctions (or intron scars) seen in cDNA sequence accessions from any of the NCBI sequence databases (GenBank, dbEST, Trace, SRA, GEO). It provides, in tab delimited format, the gene, GeneID, intron boundaries, coordinates on the genome of the intron, the ~98 bp exonic sequence around the junction (49 bp of exon 1 followed by 49 bp of exon 2, when available) and finally the RefSeq and AceView transcript models each intron belongs to. The supporting accessions from GenBank/dbEST or Trace are given in an additional file.
Format and details:
- file AceViewRefSeqExonsJunctions.txt.gz
The title line reads:
- #Junction
- Gene
- GeneId
- Intron boundary
- Chromosome
- First base of intron
- Last base of intron
- Strand
- Junction position
- Sequence
- In transcript
Col 1: Junction name: chromosome__coordinate_coordinate_build#
We propose that the intron name includes the reference genome build for the organism (e.g. 36) in addition to the chromosomal coordinates. This nomenclature for introns can easily be understood.
Col 2: gene name.
Entrez/HGNC gene names available at AceView release date are systematically used in AceView. Genes absent from the Entrez gene annotation are given Pfam names or all lower case names (easily distinguished from all upper case official human gene names); AceView names are stably maintained across builds. Note that a few AceView genes join two official genes because some cDNA clone bridges them and either shares an intron boundary with both genes and/or has substantial sequence overlap with both: when this occurs, the AceView gene name is a concatenation of the official meaningful gene names (not the LOC#) joined with 'and'.
Col 3: Entrez Gene ID, when available.
Many spliced genes, especially non-protein-coding, are not yet in Entrez gene, so they do not have a GeneID at this time, although they have a stable AceView gene name.
Conversely, some genes in AceView overlap the models of two genes in Entrez/RefSeq, and some junctions cannot be attributed unambiguously to only one ID. In such a case, both GeneIDs are listed for that junction.
Col 4: intron boundaries: generally gt-ag, some gc-ag. Rare at-ac occur in a few U12 dependent cases, and a large variety of other intron boundaries are used in rare instances.
Col 5: chromosome
The file includes the entire genome, except mitochondria (chromosomes 1 to 22, X and Y). The unattached contigs are given in the NCBI official nomenclature: their name starts with the chromosome name. Note that most of the ribosomal genes and other high copy number sequences are absent or under-represented in the reference genome.
Col 6: coordinate on the chromosome of the first base of the intron (i.e. base immediately after the last base of the first exon)
The first base of the chromosome is called 1.
Col 7: coordinate on the chromosome of the last base of the intron (i.e. base immediately before the first base of the second exon)
Note that comparing the values in col 6 and col 7 gives the strand (col 6 < col 7: mRNA is on the top strand; col 6 > col 7: mRNA is on the bottom strand).
Col 8: strand, + or –
Col 9: length of DNA before the exon junction in the sequence given in column 10.
As we aim at providing 98 bp junction sequences, the junction usually lies at 49, except when the first exon is a first 5' exon in a transcript and is shorter than 49 bases.
Col 10: sequence of the exons junction
Usually 98 bases, but shorter when the first or second exon is terminal in the transcript and its length is less than 49 bp.
Note that AceView transcripts are not extended by concatenation: we prefer partial transcripts to complete ones that are not uniquely supported, since there are usually many alternative ways to complete any partial transcript by re-using data from the other transcripts in the gene.
Col 11: list of variants in which the junction is seen, semicolon delimited.
The list includes AceView variants from the April 2007 build and RefSeq entries NM/NR/XM/XR from November 22, 2008.
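To make the eleven columns concrete, here is a hedged Python sketch that reads one tab delimited line of AceViewRefSeqExonsJunctions.txt into a record and splits the junction sequence at the position given in col 9. The class and function names are hypothetical, the GeneID field is kept as raw text because the delimiter used for multiple GeneIDs is not specified here, and header lines are assumed to be skipped by the caller.

```python
from dataclasses import dataclass

@dataclass
class Junction:
    name: str           # col 1: chromosome__coordinate_coordinate_build
    gene: str           # col 2
    gene_id: str        # col 3, kept as text (may be empty or multi-valued)
    boundary: str       # col 4, e.g. gt-ag
    chromosome: str     # col 5
    intron_first: int   # col 6
    intron_last: int    # col 7
    strand: str         # col 8
    junction_pos: int   # col 9: number of bases before the junction
    sequence: str       # col 10
    transcripts: list   # col 11, semicolon delimited

    def exon_parts(self):
        """Return (bases from the first exon, bases from the second exon)."""
        return (self.sequence[:self.junction_pos],
                self.sequence[self.junction_pos:])

def parse_junction_line(line):
    f = line.rstrip("\n").split("\t")
    transcripts = [t.strip() for t in f[10].split(";") if t.strip()]
    return Junction(f[0], f[1], f[2], f[3], f[4], int(f[5]), int(f[6]),
                    f[7], int(f[8]), f[9], transcripts)
```

With a real line, `exon_parts()` would normally return two 49-base strings, or a shorter first or second part when the corresponding exon is terminal and short.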
- File AceViewExonsJunctionsSupport.txt gives the actual clone and accession support
Col 1: exons junction name, as defined above: chromosome__coordinate_coordinate_GenomeBuildNumber
Col 2: supporting cDNA clone
Col 3: corresponding accession(s). Some clones have both a 5' and a 3' read. Usually the junction is seen in only one of the accessions.
File: ncbi_*.introns2dna2support is an 8-column file:
- # Gene name, in AceView: Entrez/HGNC gene names available at AceView release date are systematically used in AceView. Genes absent from the Entrez gene annotation are given Pfam names or all lower case names (easily distinguished from all upper case official human gene names); AceView names are stably maintained across builds. Note that a few AceView genes join two official genes because some cDNA clone bridges them and either shares an intron boundary with both genes and/or has substantial sequence overlap with both: when this occurs, the AceView gene name is a concatenation of the official meaningful gene names (not the LOC#) joined with 'and'.
- Intron boundaries: generally gt-ag, some gc-ag. A large variety of other intron boundaries are used in rare instances, for instance at-ac is found in a few cases thought to be U12 dependent.
- Coordinate in the gene of the first base of the intron. First base of the gene is called base 1.
- Coordinate in the gene of the last base of the intron
- Number of exactly supporting clones: Introns are discovered by alignment to the reference genome and annotated in AceView. Our criterion for a cDNA to “support a junction” is that the cDNA sequence exactly matches 16 bp centered on the junction (8 bases in each exon bordering the intron).
- last 40 bp from the 5' exon
- first 40 bp from the 3' exon
- Supporting clones in NCBI Genbank, dbEST or TraceDB
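The eight columns listed above can be read with a short Python sketch, applied here to the example line quoted for this file. The function name is ours. We assume that fields are separated by whitespace, that text fields may be double-quoted, and that the clone list is delimited by `;` (written `\;` in the example); `shlex.split` is used only because it handles the quoting conveniently.

```python
import re
import shlex

EXAMPLE = (r'"AADACL3" "gt_ag" 230 3358 2 '
           r'gagagtcctccattgcatcttccagctgctgttgacatgg '
           r'gggatgatatttgagaagctcagaatctgttctatgcccc '
           r'"IMAGE:4803719\; NM_001103170"')

def parse_support_line(line):
    """Split one introns2dna2support line into a dictionary of named fields."""
    f = shlex.split(line)
    # The clone list uses ';' (possibly backslash-escaped) as delimiter.
    clones = [c.strip() for c in re.split(r"\\?;", f[7]) if c.strip()]
    return {
        "gene": f[0],
        "boundary": f[1],
        "intron_first": int(f[2]),   # coordinate in the gene, first base is 1
        "intron_last": int(f[3]),
        "support": int(f[4]),        # number of exactly supporting clones
        "exon5_tail": f[5],          # last 40 bp of the 5' exon
        "exon3_head": f[6],          # first 40 bp of the 3' exon
        "clones": clones,
    }

record = parse_support_line(EXAMPLE)
intron_length = record["intron_last"] - record["intron_first"] + 1
```

Because both coordinates are inclusive and 1-based, the intron length is the difference of the two coordinates plus one.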
Example line:
"AADACL3" "gt_ag" 230 3358 2 gagagtcctccattgcatcttccagctgctgttgacatgg gggatgatatttgagaagctcagaatctgttctatgcccc "IMAGE:4803719\; NM_001103170"
These files ...genes/AceView.ncbi_*.gene2accession2tissue.main_genes.txt.gz and ...genes/AceView.ncbi_*.gene2accession2tissue.cloud_genes.txt.gz give the relation between all aligned ESTs/mRNAs, the genes, the GeneIDs and the alternative variants, with indication of quality of match, tissue of origin (from GenBank/dbEST), type of AceView gene, for all genes (main and cloud separately)
The file describes the association of each mRNA/EST accession satisfactorily aligned in AceView to the corresponding AceView gene and transcript.
It is ordered by map position, and includes all genes, even the "cloud" genes. We indicate the official gene name and the LocusID when available. We show, for each accession, the alternative variant to which the cDNA contributes, but please be aware that we try to minimize the number of variants to which a given cDNA belongs, to avoid combinatorics in number of AceView transcripts: even when a sequence could equally well participate in two or more variants, we try to assign it to only one. Most alternative variants, except those with structural defects that we filter out, are included.
The file was generated at the request of various users among which the GeneCards team; it has been made available since build 34 from the “downloads” page. For build 35, we have added the type of AceView gene (main, putative or cloud), tissue for each accession, characteristics and quality of the AceView EST/mRNA to genome match.
For build 34, the file has 5,101,247 lines and 5 columns; for build 35, the file has 6,397,928 lines and 12 columns (we added col 4, 7, 8, 9, 10, 11, 12).
The current content and format is described below:
column 1: chromosome position (Position)
This is a volatile name that changes with each build, not a stable identifier: it starts with the chromosome name, followed by the coordinate on that chromosome, in base pairs (e.g. 1_1287: chromosome 1, base 1287). It is handy because it allows us to export the table in genome order, and it can be helpful when you parse the data.
column 2: AceView gene name (AceViewID).
This should be stable from release to release, although it evolves as new official gene names are generated. See how we generate the names in AceView: basically, we use official names if they exist, else PFAM derived names, else invented AceView names.
column 3: set of LocusID or GeneID from Entrez Gene (previously LocusLink LocusID)
Examples:
- 1234 LocusID 1234
- NULL: there is no LocusID yet, because this gene has not yet been modeled in LocusLink/Gene at NCBI.
- 1234;9876: this gene has 2 LocusIDs, 1234 and 9876. This actually occurs for 1373 human genes in build 34 out of 19,413 AceView genes with a LocusID. In build 35, 1498 human genes out of 24520 AceView genes with a GeneID have more than one GeneID (23,356 GeneIDs have one or more corresponding AceView gene(s)). Some AceView genes with more than one GeneID correspond to a single gene that was split in the RefSeq/LocusLink model, some to genes producing two different types of proteins but fused into a single gene in AceView because there is at least one cDNA which bridges the two, even if only through the UTRs. AceView would actually consider this a single (possibly complex) gene, and name it by concatenating the LocusLink/official gene names (see this help and example: PEX19andWDR42A, where clone AGENCOURT_14368013 NIH_MGC_181 Homo sapiens cDNA clone IMAGE:30398500, accession CD518985 creates a nice bridge). Finally, a few correspond to repeated genes in a close tandem arrangement that we may have unintentionally merged.
column 4: AceView gene type (Type)
Since build 35, the AceView genes have been categorized as Main gene, Putative gene or Cloud gene, as defined here.
column 5: AceView transcript variant name (Variant)
This is very often identical to the gene name, except when there is evidence for alternative splicing, then this column shows how we sub-cluster the mRNAs. The relation accession/ alternative variant to which the cDNA contributes is indicated in the next column, but please be aware that we try to minimize the number of variants to which a given cDNA belongs, to avoid combinatorics in number of AceView transcripts: even when a sequence could equally well participate in two or more variants, we try to assign it to only one. On the other hand, it may happen that an accession belongs to several different genes or different variants.
column 6: mRNA or EST GenBank accession aligned in this gene (Accession)
We give the accession without a version, but for each new AceView release, we provide explicitly the date at which the mRNA/EST data were downloaded from GenBank.
column 7: Number of basepairs of the accession to be aligned (Length)
This number is derived from the length in base pairs, as found in GenBank, minus the polyA and any vector sequence that AceView recognized and clipped. This is the length in base pairs that we think has to be aligned on the genome.
column 8: Number of unaligned bases, on the 5’ side (Start)
The value in this column is 0 if the alignment starts at the first base that needs to be aligned (base 1 of the GenBank accession if the vector was correctly clipped by the submitter, or the first base after the vector if some unclipped vector sequence was recognized in the GenBank accession). For example, the value in this column is 20 if we failed to align the first 20 bases, with no justification: the cDNA might have some structural anomaly or the reference genome might be missing a piece (or, rarely, our alignment procedure might have a bug).
column 9: Number of bases of the mRNA accession actually aligned on the genome in AceView (Ali)
column 10: number of base differences in the aligned region between the cDNA and the genome (Err). A single base transition, transversion, addition or deletion counts as 1.
column 11: Quality of the AceView alignment (Quality), scored from 1 (very best alignments) to 9. The quality factor reflects both the % length aligned and the % differences from the genome.
Scores also depend on whether the sequence is supposed to be high quality (mRNA) or single pass (EST). For mRNAs, the score is measured over the entire length to be aligned (column 7); for ESTs, it is scored on a maximum of 600 bp. Scores follow the empirical chart below:
| % bp differences \ % length aligned | > 98% mRNA | > 98% EST | 90-98% mRNA | 90-98% EST | 80-90% mRNA | 80-90% EST | 50-80% or 800 bp mRNA | 50-80% or 800 bp EST | Less than 50% or 800 bp mRNA | Less than 50% or 800 bp EST |
|---|---|---|---|---|---|---|---|---|---|---|
| < 0.1% | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 7 | 9 |
| 0.1 to 1% | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 6 | 8 | (10) |
| 1 to 2% | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 7 | (10) | (11) |
| 2 to 3% | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 9 | (11) | (11) |
| 3 to 4% | 5 | 5 | 6 | 6 | 7 | 7 | 9 | (10) | (11) | (11) |
| 4 to 5% | 7 | 6 | 8 | 7 | 9 | 8 | (10) | (11) | (11) | (11) |
| > 5% | 9 | 7 | (10) | 8 | (10) | 9 | (11) | (11) | (11) | (11) |
Note: Qualities are initially scored from 1 to 11, but if no structural rearrangement is involved, we ultimately only retain qualities from 1 to 9. This scoring system has been tuned first on the nematode then on unfinished human genome in an empirical way, but it is useful, since qualities are used internally in AceView as a critical step to keep only the best matches during the initial clean-up phase. These numbers may also be used as a quick indicator in the case of other projects, such as GeneCards/GeneTide, to explain some discrepancies between various clustering methods.
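The empirical chart can be transcribed as a small lookup, sketched below in Python. This is our reading of the chart, not AceView's implementation: the function names are hypothetical, the "or 800 bp" alternative in the two lowest length bands is ignored, and the handling of values that fall exactly on a band boundary is an assumption.

```python
# Rows: % bp differences (< 0.1, 0.1-1, 1-2, 2-3, 3-4, 4-5, > 5).
# Columns: % length aligned (> 98, 90-98, 80-90, 50-80, < 50).
QUALITY = {
    "mRNA": [[1, 2, 3, 4, 7], [2, 3, 4, 5, 8], [3, 4, 5, 6, 10],
             [4, 5, 6, 7, 11], [5, 6, 7, 9, 11], [7, 8, 9, 10, 11],
             [9, 10, 10, 11, 11]],
    "EST": [[1, 2, 3, 5, 9], [2, 3, 4, 6, 10], [3, 4, 5, 7, 11],
            [4, 5, 6, 9, 11], [5, 6, 7, 10, 11], [6, 7, 8, 11, 11],
            [7, 8, 9, 11, 11]],
}

def aligned_band(pct_aligned):
    """Column index for a % length aligned (boundary handling is assumed)."""
    if pct_aligned > 98:
        return 0
    for index, lower in enumerate((90, 80, 50), start=1):
        if pct_aligned >= lower:
            return index
    return 4

def diff_band(pct_diff):
    """Row index for a % bp differences (boundary handling is assumed)."""
    if pct_diff < 0.1:
        return 0
    for index, upper in enumerate((1, 2, 3, 4, 5), start=1):
        if pct_diff <= upper:
            return index
    return 6

def quality(kind, pct_aligned, pct_diff):
    """Look up the quality (1 to 11) for kind 'mRNA' or 'EST'."""
    return QUALITY[kind][diff_band(pct_diff)][aligned_band(pct_aligned)]
```

Values of 10 and 11 correspond to the parenthesized entries of the chart.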
To give an idea of the statistics, in build 35 the numbers of accessions aligned at each quality are:
| Aligned at quality | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | (10) | (11) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #accessions | 1,254,893 | 2,502,199 | 1,014,606 | 573,650 | 387,665 | 254,449 | 212,953 | 114,645 | 82,866 | 29,844 | 559 |
column 12: Tissue, as copied from the GenBank submission (not yet systematic). We would prefer to use a standardized vocabulary, such as TissueInfo developed by Lucy Skrabanek and Fabien Campagne, but this is not yet done.
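A few of the columns described above lend themselves to small parsing helpers, sketched here in Python. The helper names are ours, and the sketch assumes the fields have already been split from a line. It also ignores the 600 bp cap applied to ESTs when the quality is scored.

```python
def parse_position(position):
    """Column 1: '1_1287' -> ('1', 1287). Splits on the last underscore."""
    chromosome, coordinate = position.rsplit("_", 1)
    return chromosome, int(coordinate)

def parse_gene_ids(field):
    """Column 3: 'NULL' -> [], '1234' -> [1234], '1234;9876' -> [1234, 9876]."""
    if field in ("", "NULL"):
        return []
    return [int(value) for value in field.split(";")]

def alignment_stats(length, aligned, errors):
    """Columns 7, 9 and 10 -> (% length aligned, % bp differences)."""
    pct_aligned = 100.0 * aligned / length
    pct_diff = 100.0 * errors / aligned if aligned else 0.0
    return pct_aligned, pct_diff
```

For instance, an accession with 500 bases to align, 450 aligned and 9 differences is 90% aligned with 2% differences.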
The AceView mapping should improve significantly the information you get from your arrays. We provide mapping of probes from whole genome expression arrays or other microarrays on the AceView genes. Ask us if your favorite arrays are missing.
For human, we routinely remap the Affymetrix HG-U133 Plus 2.0 GeneChip, Agilent Whole Human Genome Oligo Microarray, G4112A and Illumina Human-6 BeadChip, 48K v1.0 platforms.
For mouse, we currently remap Affymetrix 430-2 (44.7 MB) and Illumina MouseWG-6_V1_1_R3 (5.3 MB), WG-6 V2 (6 MB) or MEEBO.
Mapping array probes to genes in a systematic fashion is essential for comparing expression results obtained on different platforms, as well as to interpret results from a single platform (for example see the MAQC project). We mapped probes to AceView transcripts and the genome for some of the most used commercial microarray platforms: in the first release (March 6, 2008), we provide alignments for the Illumina Human-6 BeadChip (48k v1.0), Agilent WHG (G4112A0) and Affymetrix HG-U133 Plus 2.0 GeneChip expression arrays, used in MAQC.
The mapping is performed on the current human genome (build 36) as well as on three annotated transcripts sets, two from NCBI and one from EBI. We used all AceView transcripts (current release April 2007), all RefSeq (NM and XM, but not including the few non coding NR/XR) (downloaded January 2008), or all Ensembl models (downloaded January 2008).
We initially allowed up to 2 mismatches (base difference or indel) for Agilent 60-mer probes and Illumina 50-mer, and up to 1 mismatch for Affymetrix 25-mer probes.
You may ask for more data of that kind as we are willing to similarly map any platform which provides the sequence of their probes to the public.
The format is defined below, all files are ordered on the name of the probe (column 4):
Column 1: name of target mRNA (from AceView, RefSeq or Ensembl), or of chromosome if mapping is to reference genome
Column 2: (t1) coordinate of the first base of the match on the target mRNA
Column 3: (t2) coordinate of the last base of the match on the target. (t1 and t2 also give the strand)
Column 4: name of the arrayed probe
Column 5: (p1) coordinate of the first base of the match on the arrayed probe
Column 6: (p2) coordinate of the last base of the match on the probe
Column 7: number of uncalled bases (N) in probe [rare; in some control probes]
Column 8: number of mismatches between probe and target. One mismatch is a difference, an insertion or deletion affecting a single nucleotide
Column 9: coordinate in the probe of the start of the longest exact match
Column 10: coordinate of the end of the longest exact match
Column 11: length of the longest exact match
Column 12: position of the first mismatch
Column 13: type of the first mismatch: may be base_in_transcript > base_in_probe (e.g. a>g) OR +base for extra base in probe (insertion) OR -base for base missing in probe (deletion)
Column 14: only in Affymetrix, indicates the probeset
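As a sketch of how these columns might be used, the Python below parses one mapping line and applies a user-chosen mismatch threshold. Several points are assumptions: the file is taken to be tab delimited, t1 > t2 is read as a hit on the antisense strand, and the mismatch fraction is computed over the matched part of the probe (p1 to p2) because the full probe length is not a column of the file. The function names are hypothetical.

```python
def parse_probe_hit(line):
    """Read the columns needed for filtering from one mapping line."""
    f = line.rstrip("\n").split("\t")
    t1, t2, p1, p2 = int(f[1]), int(f[2]), int(f[4]), int(f[5])
    return {
        "target": f[0],
        "probe": f[3],
        "t1": t1, "t2": t2, "p1": p1, "p2": p2,
        "strand": "+" if t1 <= t2 else "-",   # assumption: t1 > t2 is antisense
        "uncalled": int(f[6]),
        "mismatches": int(f[7]),
        "matched_length": abs(p2 - p1) + 1,
        "probeset": f[13] if len(f) > 13 else None,  # Affymetrix only
    }

def keep_hit(hit, max_mismatch_fraction=0.0):
    """Keep hits whose mismatch count is within the chosen fraction."""
    return hit["mismatches"] <= max_mismatch_fraction * hit["matched_length"]
```

Calling `keep_hit(hit)` with the default keeps only perfect matches, while `keep_hit(hit, 0.1)` accepts up to 10% mismatches.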
-----------
from March 2009:
We are happy to remap probes from any (expression) array of interest to the three main transcriptome annotations, NCBI RefSeq, Ensembl, AceView, and to the reference genome. Depending on the protocol used to prepare the cDNA sample, one may keep the strand information or lose it (for instance because there was a PCR step involved), so we provide the mapping for both situations, as we indicate hits on the mRNA sense or antisense strand. In our mapping, we allow for 10% mismatches (indels or single base variations), which may seem high. Yet the number of mismatches for each probe mapped is indicated in the table, so that users can apply the threshold best suited to their analyses.
- For RefSeq, we provide two flavors of mapping, either limited to the more curated NM/NR set or to all RefSeq, including XM/XR models.
- For EBI Ensembl, we use all models, including the ab initio models.
- For AceView, which summarizes the alignable cDNA sequences in the public repositories, we use our latest public version. Unlike Ensembl, AceView does not use any ab initio or predictive elements. Unlike RefSeq, AceView tries to represent all good quality cDNA sequences, even if many alternative mRNA forms become evident when this principle is applied. As a result, AceView represents the transcriptome in much greater depth than the other gene annotations, and it also houses a greater percentage of the probes designed by the microarray designers.
Please email us if one array you like is not in our current list and the probe sequences are available: we will remap it to the main annotations and place the results on our public ftp “downloads” site.
The files that go with the successive AceView gene reconstructions are organized here by organism and by date, from latest to earliest. Our first AceView human genes were available at NCBI in 2001.
All the files posted on this ftp site since 2001 are still available.
Also see the GOLD and MAQC works.
April 2007
The Human Apr07 release aligns more than 7 million cDNA sequences (available March 26, 2007 in GenBank/ dbEST/ RefSeq) into genes on the human genome assembly NCBI_36/hg18 of March 2006.
The files available are:
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (26.6 MB)
- The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (79.3 MB) (see the definition of clouds in the FAQ)
- The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (15.6 MB) Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.
- All AceView mRNAs, with no restriction: a comprehensive non redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (119.3 MB).
- All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (44.9 MB)
- The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (3.4 MB)
- Added March 6, 2008: Mapping of human expression microarray probes Affymetrix HG-U133 Plus 2.0 GeneChip, Agilent Whole Human Genome Oligo Microarray, G4112A and Illumina Human-6 BeadChip, 48K v1.0 platforms is provided (53 MB, see this help). The AceView mapping should improve significantly the information you get from your arrays. Mapping array probes to genes in a systematic fashion is essential for comparing expression results obtained on different platforms, as well as to interpret results from a single platform (for example see the MAQC project). At the request of users, we make public the mapping of probes to AceView transcripts and the genome for some of the most used commercial microarray platforms: in the first release (March 6, 2008), we provide alignments for the Illumina Human-6 BeadChip (48k v1.0), Agilent WHG (G4112A0) and Affymetrix HG-U133 Plus 2.0 GeneChip expression arrays, used in MAQC.
The mapping is performed on the current human genome (build 36) as well as on three annotated transcripts sets, two from NCBI and one from EBI. We used all AceView transcripts (current release April 2007), all RefSeq (NM and XM, but not including the few non coding NR/XR) (downloaded January 2008), or all Ensembl models (downloaded January 2008).
We use our program ProbeAlign, set to allow up to 2 mismatches (base difference or indel) for Agilent 60-mer probes and Illumina 50-mer, and up to 1 mismatch for Affymetrix 25-mer probes. For help on formats, see here. Also see the MAQC interesting outcome here, and the explanations here.
- Added February 8, 2009: The properties, sequence and support of all exon-exon junctions seen in mRNA sequences from any of the NCBI sequence databases (GenBank, dbEST, Trace), as summarized in AceView. For help on content and format, see the 'File Format' tab.
The list made public on February 8 2009 includes 368,512 exon junctions found in AceView, from mapping all cDNA sequences present in the NCBI database in April 2007. The list includes in particular the 199,952 introns in RefSeq NM/NR/XM/XR from November 22, 2008.
The intron list will grow as more junctions are discovered, in particular in the deep sequencing data. Please tell us if you wish to be notified by mail when we update the list.
As of today, there is support, in the small set of deep transcriptome data from SRA that we analyzed, for 173,000 junctions from RefSeq (86%) and 259,000 from AceView (68%). This zip includes two files:
1- The AceViewRefSeqExonsJunctions.txt file provides the sequence of the junctions and their main properties (such as associated gene and transcripts, coordinates on the genome and boundaries of the intervening sequence).
2- The complementary AceViewExonsJunctionsSupport.txt file provides the clones and accessions in GenBank/dbEST/Trace (but not SRA!) exactly supporting the junctions. Our criterion for a cDNA to “support a junction” is that the cDNA sequence exactly matches 16 bp centered on the junction (8 bases in each exon bordering the intron).
There are currently 2,284,821 clones in the public databases supporting 367,750 junctions. Our aim is to annotate the deep transcriptome sequencing results on the web in our next release. We welcome wholeheartedly any new transcriptome data that you might want to see integrated: just send it to GEO or SRA!
- Added at the request of a user: disease related genes, according to OMIM, GAD and hints from our own MeSH/PubMed searches
MAQC Microarray quality control project
From our analyses for the MAQC study, led by Leming Shi from FDA, we extracted these interesting data:
- A list of the 276,184 MAQC validated probes from all MAQC platforms, according to our unpublished analysis (File: ConfirmedTitratingProbes.txt). These probes map uniquely to a single gene and do not cross-hybridize. They have two desirable properties when hybridized to mixes of RNA samples A (Universal RNA Stratagene) and B (Ambion brain). First, they are “titrating”: the signals for the 4 samples containing mixes of the two RNAs, A (A100:B0), C (A75:B25), D (A25:B75) and B (A0:B100), are in proper order, either monotonically increasing or decreasing, when we average the normalized signals over the 15 replicates (minus a few outlier arrays). Second, the direction of the differential expression for the gene measured agrees with at least 2 other platforms: A/C/D/B signals vary in the same direction in a majority of platforms assaying the gene. In this sense, all probes in these files coherently and sensitively measure the validated differential expression A>B or B>A for the genes. Probes are ordered alphabetically: ABI, AFX (Affymetrix), AGL (Agilent), EPN (Eppendorf), GEH (General Electric Healthcare), GEX, ILM (Illumina), NCI (Operon), QGN (QuantiGene) and TAQ (TaqMan). Note that this is a minimal list of validated probes, as some probes on the array might not have been testable with the two RNA samples selected for MAQC; but at least these probes have been proven to be good, sensitive and reliable.
- A file giving a comparable measure of the melting temperature for all probes (files ending in .Tm).
- The mapping in AceView of all the MAQC probes to their desired (and undesired) target genes, including the identification of the specific alternative transcripts targeted, is available as a zip or a tar.gz compressed file. This file was generated as part of the Micro Array Quality Control project (MAQC study), we used ProbeAlign to map all probes sequences to the human genes (from the previous version of AceView, April 2005), including to their putatively cross-hybridizing targets. The total number of genes tested with gene specific probes by each genome wide array platform participating to the project, i.e. 29,040 genes for Affymetrix, 22,106 for Agilent, 21,943 for Applied Biosystems, 35,392 for Codelink General Electric, 27,463 for Illumina and 19,025 for the NCI/Operon array, are detailed in the Supplementary data, page 14, on the Nature Biotechnology website.
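The “titrating” property used to select the validated probes can be expressed as a simple check, sketched below in Python. This is an illustration only: the function name and the input layout (a dictionary of replicate signals per sample) are assumptions, and the sketch covers neither the removal of outlier arrays nor the cross-platform agreement test.

```python
from statistics import mean

def titration_direction(signals):
    """Classify a probe from its normalized replicate signals.

    `signals` maps each sample name ('A', 'C', 'D', 'B') to a list of
    replicate values. Returns 'A>B' or 'B>A' when the averaged signals
    are strictly monotonic along A, C, D, B, and None otherwise.
    """
    a, c, d, b = (mean(signals[sample]) for sample in "ACDB")
    if a > c > d > b:
        return "A>B"
    if a < c < d < b:
        return "B>A"
    return None
```

A probe is titrating in this sense when the function returns a direction rather than `None`.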
We hope this information may be useful; do not hesitate to send us feedback on your needs or questions.
GOLD: Genomewide Optimization of Locus Description
The GOLD in-depth analysis we made in 2005 is still very relevant: it includes a comparison between various cDNA aligners, including gnomon/splign and the EBI aligner, and the discovery of a multitude of interesting facts about the human genes. Results can be found in our (unpublished) article 'GOLD', which comes with thorough supplementary material linking to complete and detailed annotations. We are still very proud of this work, which investigated the reliability of cDNA to genome aligners (and was probably resented as too critical by the referee), but also summarizes the properties of transcription in human.
August 2005
The Human August 2005 AceView annotations were performed on the human genome NCBI_35/hg17 of July 2004.
This release aligns the ESTs, mRNAs and RefSeqs available in GenBank or dbEST on September 24 on the human genome build NCBI_35/hg17 of July 2004.
Two files were archived in July 07; the files still available are:
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (26.8 MB)
The amino acid sequences of the best and good CDSs or ORFs in non-cloud mRNA, in fasta format (21.9 MB)
All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (33.9 MB)
The mRNA sequences (including UTR parts) for all non-cloud genes, in fasta format (77.8 MB), and the cloud
The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (3.2 MB)
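As an illustration of how the gff files listed above can be read, here is a minimal Python sketch that tallies the features by type. It assumes the usual tab-delimited GFF layout (sequence name, source, feature, start, end, score, strand, frame, attributes); the sample lines, the feature names and the attribute text are invented for the example and are not copied from an AceView file.

```python
from collections import Counter

def parse_gff_line(line):
    """Split one tab-delimited GFF line into a dict.

    Returns None for blank lines, comment lines and lines with fewer
    than 8 columns.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        return None
    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": fields[5],
        "strand": fields[6],
        "frame": fields[7],
        "attributes": fields[8] if len(fields) > 8 else "",
    }

def count_features(lines):
    """Count the records of each feature type in an iterable of GFF lines."""
    counts = Counter()
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            counts[record["feature"]] += 1
    return counts

# Hypothetical sample lines, for illustration only
sample = [
    "# invented example\n",
    "1\tAceView\texon\t100\t200\t.\t+\t.\tgene_id hypotheticalGene\n",
    "1\tAceView\tintron\t201\t499\t.\t+\t.\tgene_id hypotheticalGene\n",
    "1\tAceView\texon\t500\t650\t.\t+\t.\tgene_id hypotheticalGene\n",
    "1\tAceView\tCDS\t120\t200\t.\t+\t0\tgene_id hypotheticalGene\n",
]
print(count_features(sample))
```

For a downloaded file, the same function can be called on an open file handle, after decompression if needed.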
December 2003
The Human December 2003 (Dec03) AceView annotations were performed on the human genome NCBI_34/hg16 of July 2003.
The amino acid sequences of each CDS or ORF in fasta format (16.3 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (18.5 MB)
The mRNA sequences (including UTR parts) in fasta format (77.9 MB)
The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (730 kB)
May 2003
The human AceView annotations were performed on the previous human genome build, NCBI_33 genome build of May 2003.
The amino acid sequences of each CDS or ORF in fasta format (14.5 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (20.5 MB)
The mRNA sequences (including UTR parts) in fasta format (74.6 MB)
The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (604 kB)
December 2002
The human AceView annotations were performed on the previous human genome build, NCBI_31 genome build of December 2002.
The amino acid sequences of each CDS or ORF in fasta format (14.6 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (10.5 MB)
The mRNA sequences (including UTR parts) in fasta format (65.6 MB)
The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (582 kB)
August 2002
The human AceView annotations were performed on the human genome build, NCBI_30 genome build of August 25 2002.
The amino acid sequences of each CDS or ORF in fasta format (11.5 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (12.9 MB).
The mRNA sequences (including UTR parts) in fasta format (46.8 MB)
Please quote us as D., J. and Y. Thierry-Mieg, M. Potdevin, M. Sienkiewicz, V. Simonyan, www.humangenes.org: Construction and automatic annotation of cDNA-supported genes using Acembly, unpublished.
May 2002
The human AceView annotations were performed on the human genome build, NCBI_29 of May 25, 2002:
The amino acid sequences of each CDS or ORF in fasta format (12.0 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (15.3 MB)
Please quote us as D., J. and Y. Thierry-Mieg, M. Potdevin, V. Simonyan, www.humangenes.org: Construction and automatic annotation of cDNA-supported genes using Acembly, unpublished.
January 2002
The human AceView annotations were performed on the human genome build, NCBI_28 of January 2002:
The amino acid sequences of each CDS or ORF in fasta format (9.5 MB)
The mRNA sequences (including UTR parts) in fasta format (40.8 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (9.5 MB)
November 2001
The human AceView annotations were performed on the human genome build, NCBI_28 of January 2002:
The amino acid sequences of each CDS or ORF in fasta format (10 MB)
The mRNA sequences (including UTR parts) in fasta format (42.8 MB)
The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (10 MB)
Mouse June 2007
The Mouse June 2007 AceView release aligns more than 3 million cDNA sequences (available June 22, 2007) into genes and transcripts on the Mus musculus NCBI genome 37 (June 2007). The files available are:
- The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (17.4 MB)
- The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (60.5 MB) (see the definition of clouds in the FAQ)
- The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (12.2 MB) Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.
- All AceView mRNAs, with no restriction: a comprehensive non redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (81.3 MB).
- All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (28.9 MB)
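The fasta files listed above can be read with a few lines of Python. The sketch below is a generic FASTA reader and makes no assumption about the content of the AceView header lines; the sample records and their names are invented for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without its leading '>' and the sequence
    lines of each record are joined into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical records, for illustration only
sample = [
    ">hypotheticalTranscript.a\n",
    "ATGGCC\n",
    "TTGA\n",
    ">hypotheticalTranscript.b\n",
    "ATGTAA\n",
]
for name, sequence in read_fasta(sample):
    print(name, len(sequence))
```

Because the function is a generator, it can be applied to an open file handle without loading a whole file into memory, which matters for the larger mRNA files.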
Last edited October 8, 2012
The source code of most of the programs developed by Jean and Danielle Thierry-Mieg at NCBI is made available from this page, under GNU Public License. The AceView code is under current and daily development, and despite our efforts, the code is probably not bug-free. We appreciate bug reports, and always welcome questions, comments and suggestions, yet we cannot commit to full support. We update the code once in a while; send us a mail if you want to be told when we change the code release on the ftp site.
We naturally expect you to cite your source if you use parts or all of our code...
If you find AceView useful, you could also consider citing this reference: AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12.
NCBI Magic
Here is the Magic code and documentation, for integrated analysis of next generation sequencing, including RNA-seq, DNA-seq or ChIP-seq. This code is under continuous development; the documentation is not necessarily up to date with the code, but should be for the most part.
NCBI AceView
This package contains the complete source code of the AceView software, including the NCBI version of the acedb database manager which has been used in a number of genome projects in many laboratories worldwide.
We have incorporated in the AceView NCBI version a very powerful cDNA alignment code: AceView can align long or short DNA sequences to a reference set of long DNA sequences, the target, even in the presence of noise. For instance, we use it to align GenBank ESTs or Roche 454, or equally RNA-Seq data from Helicos, Life SOLiD or Illumina, or microarray probes (say the 600,000 AFX probes) to the human genome or transcriptome. We have added many optimizations and new graphics to support sequencing trace editing, genome assembly, mRNA to genome alignment and biological annotation of the genes. The tabular query interface "TableMaker" was expanded to enable the selection of sets of genes with complex combinations of properties, including sequence constraints. Also, in 2004 we developed with Mark Sienkiewicz a C language programmers’ interface, AceC, which is part of the present distribution.
The human server with its 30 million objects, and the mouse, rat, worm and Arabidopsis servers all currently share a CentOS Linux box with 16 Gigabytes of RAM and four dual-core Intel processors. The AceView web site is supported by AceC.
The AceView code is stand-alone and distributed under GNU Public License. It compiles and runs on all the Unix/Linux platforms we have ever tested. You may download the source code here. The README file in the same directory contains the instructions on how to compile and test the code. This README file is also included in the source code tar.gz file. An AceView demo, human chromosome Y as of October 2005, is available here. Please follow the instructions in the README file.
AceDB at NCBI
AceView uses the AceDB object oriented database manager, and the AceView web site is supported by the AceDB@NCBI server. The source code for the NCBI version of the AceDB object oriented system, developed by Jean Thierry-Mieg, is available here. All Unix/Linux 32- or 64-bit platforms should be recognized, including IBM, Sun, Intel, AMD, alpha ... MacX, and Windows/Cygnus. Some documentation is available here, thanks to Sam Cartinhour.
The first version of AceDB was written in the early 90s by Richard Durbin, now at Sanger, and Jean Thierry-Mieg, author of AceView, then in Montpellier France and now at NCBI. However, over the years, the two code bases have evolved to suit the needs of the two main Acedb authors and their users. As a database engine, the NCBI version is compatible with the Sanger Center version: the data files can be freely exchanged between the two systems, they can even run from the same disc and they both support AcePerl. But note that the NCBI version is used, supported and developed, unlike the Sanger version, which for instance has contained since 2005 a bug that potentially loses data (http://www.acedb.org/Software/Downloads/). This bug has never affected our NCBI AceDB version, and we advise current AceDB users to try the NCBI version.
UCSCtrackCompare
This package can be used to compare the genome annotation tracks available from the magnificent UCSC genome browser maintained by Jim Kent's group. The script UCSCdownload.csh can be used to download the tracks discussed in our paper 'AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12'. The chromConfig.txt and trackConfig.txt configuration files describe which tracks you wish to compare over which regions. Finally, the executable UCSCtrackCompare performs the comparison. Called without parameters, it prints its own documentation.
You may either download the executable, or recompile. The source code is provided in this directory for your convenience if you wish to read it. But since it links against the acedb libraries, the easiest way to recompile the code is to download our AceView package, to install it, and then to go to the wacext subdirectory of the AceView package and issue the command 'make UCSCtrackCompare'. The executable will appear in ../bin.$ACEDB_MACHINE
Here are the user's guide, the source code, the track download script, the chromosome configuration file, the track configuration file, the models.wrm (principal schema file) of our AceView database, our full schema, and some executables for Solaris, Mac and Linux.
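To give an idea of what comparing two annotation tracks over a region involves, here is a minimal Python sketch that counts the bases covered by both tracks. It is not the UCSCtrackCompare code and does not claim to reproduce its method or its output; the function names, the half-open (start, end) interval convention and the coordinates are assumptions made for the example.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def shared_bases(track_a, track_b):
    """Return the number of bases covered by both tracks."""
    a, b = merge_intervals(track_a), merge_intervals(track_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if low < high:
            total += high - low
        # Advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Invented exon coordinates on one chromosome, for illustration only
track_a = [(100, 200), (150, 300)]
track_b = [(250, 400), (500, 600)]
print(shared_bases(track_a, track_b))
```

Merging each track first keeps overlapping exons of alternative transcripts from being counted twice.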
SWFC/Flash
The Flash diagrams are generated using the open source software SWFC. The acedb graphic package includes several drivers, allowing acedb images to be exported in X11, PostScript, GIF... and now .sc, which is the input format of the swfc compiler, which in turn generates .swf (Adobe Flash) files.