NCBI Oreochromis niloticus Annotation Release 102

The RefSeq genome records for Oreochromis niloticus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Oreochromis niloticus Annotation Release 102.

Annotation release ID: 102
Date of Entrez queries for transcripts and proteins: Jul 28 2015
Date of submission of annotation to the public databases: Jul 30 2015
Software version: 6.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Orenil1.1	GCF_000188235.2	Broad Institute	02-08-2012	Reference	23 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Orenil1.1
Genes and pseudogenes	30,174
  protein-coding	26,329
  non-coding	3,508
  pseudogenes	337
  genes with variants	10,341
mRNAs	47,700
  fully-supported	45,245
  with > 5% ab initio	1,327
  partial	3,050
  with filled gap(s)	2,480
  known RefSeq (NM_)	145
  model RefSeq (XM_)	47,555
Other RNAs	5,694
  fully-supported	5,071
  with > 5% ab initio	0
  partial	10
  with filled gap(s)	10
  known RefSeq (NR_)	0
  model RefSeq (XR_)	5,071
CDSs	47,892
  fully-supported	45,245
  with > 5% ab initio	1,444
  partial	2,467
  with major correction(s)	817
  known RefSeq (NP_)	145
  model RefSeq (XP_)	47,555
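
As a quick sanity check on the table above, the sketch below re-adds some of the sub-counts and compares them with the reported totals. It assumes that the protein-coding, non-coding and pseudogene rows partition the "Genes and pseudogenes" total and that the known and model RefSeq rows partition the mRNA total; the report does not state this, and other sub-rows (for example "partial" and "fully-supported") are not expected to add up this way. The dictionary keys and function names are illustrative only.

```python
# Counts copied from the Feature counts table (Orenil1.1).
counts = {
    "genes_and_pseudogenes": 30174,
    "protein_coding": 26329,
    "non_coding": 3508,
    "pseudogenes": 337,
    "mrnas": 47700,
    "mrna_known_refseq": 145,
    "mrna_model_refseq": 47555,
}


def gene_total(c):
    """Sum the three gene categories assumed to partition the gene total."""
    return c["protein_coding"] + c["non_coding"] + c["pseudogenes"]


def mrna_total(c):
    """Sum known (NM_) and model (XM_) mRNAs."""
    return c["mrna_known_refseq"] + c["mrna_model_refseq"]


print(gene_total(counts) == counts["genes_and_pseudogenes"])
print(mrna_total(counts) == counts["mrnas"])
```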

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	29,837	17,211	7,548	70	1,113,693
All transcripts	53,394	3,150	2,585	70	92,220
  mRNA	47,700	3,394	2,812	210	92,220
  misc_RNA	922	2,777	2,456	102	12,119
  tRNA	623	74	73	70	86
  lncRNA	4,149	889	658	92	8,907
Single-exon transcripts	1,074	1,710	1,423	210	10,538
  coding transcripts (NM_/XM_)	1,074	1,710	1,423	210	10,538
CDSs	47,700	1,968	1,467	96	90,948
Exons	304,415	289	137	1	17,286
  in coding transcripts (NM_/XM_)	291,077	289	137	1	17,286
  in non-coding transcripts (NR_/XR_)	19,388	263	129	2	7,998
Introns	271,532	1,872	392	26	891,804
  in coding transcripts (NM_/XM_)	262,436	1,859	391	26	891,804
  in non-coding transcripts (NR_/XR_)	14,999	2,149	427	34	133,145

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.78	1	1	25
Number of exons per transcript	11.68	9	1	238

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 26,137 coding genes, 23,381 genes had a protein with an alignment covering 50% or more of the query and 10,797 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
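
The two definitions above can be illustrated with a minimal sketch. It assumes a single alignment described by 1-based inclusive start and end positions on each sequence; how the pipeline combines several alignment segments for one protein is not stated in this report. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence included in one alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical hit: residues 11-410 of a 500-residue annotated protein (query)
# aligned to residues 1-400 of a 420-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 410, 500)
target_cov = coverage_percent(1, 400, 420)
print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the same alignment covers a smaller share of the query than of the target, which is why the two measures are reported separately.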

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
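
A cumulative count of the kind plotted in the graph can be sketched as follows. The per-gene coverage values are invented for illustration, and taking the best coverage per gene is an assumption; only the totals (26,137 coding genes, 23,381 at 50% or more, 10,797 at 95% or more) come from this report.

```python
def genes_at_or_above(best_coverage_by_gene, threshold):
    """Count genes whose best alignment coverage meets the threshold."""
    return sum(1 for cov in best_coverage_by_gene.values() if cov >= threshold)


# Hypothetical per-gene best query coverage, in percent.
best_query_coverage = {"geneA": 99.2, "geneB": 71.0, "geneC": 48.5, "geneD": 95.0}
curve = {t: genes_at_or_above(best_query_coverage, t) for t in (50, 95)}
print(curve)

# Shares of coding genes implied by the counts reported above.
total_genes = 26137
share_50 = round(100 * 23381 / total_genes, 2)
share_95 = round(100 * 10797 / total_genes, 2)
print(share_50, share_95)
```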

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Orenil1.1	GCF_000188235.2	3.13%	21.91%
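
A masked percentage of this kind could be estimated from a soft-masked FASTA file, as in the sketch below. It assumes the common convention that masked bases are written in lowercase, and it counts every base, including N, in the denominator. The report does not say how its percentages were computed, so this is an illustration rather than the pipeline's method.

```python
import io


def masked_percent(fasta_handle):
    """Return the percentage of lowercase (soft-masked) bases in a FASTA stream."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return 100.0 * masked / total if total else 0.0


# Tiny made-up sequence: 6 of 20 bases are lowercase.
example = io.StringIO(">chr_demo\nACGTacgtAC\nGGnnNNTTAA\n")
print(masked_percent(example))
```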

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	149	149 (100.00%)	144 (96.64%)	99.00%	98.17%
Same-species GenBank	718	713 (99.30%)	610 (84.96%)	98.85%	97.57%
Same-species EST	120,986	111,320 (92.01%)	105,356 (87.08%)	99.23%	98.96%
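
The percentages in the "aligned by Splign" and "passed to Gnomon" columns both appear to be computed relative to the number of sequences retrieved from Entrez. The sketch below recomputes them from the counts in the table; the function and variable names are illustrative.

```python
def percent_of_retrieved(count, retrieved):
    """Express a count as a percentage of the retrieved sequences, to 2 decimals."""
    return round(100.0 * count / retrieved, 2)


# (retrieved, aligned by Splign, passed to Gnomon), copied from the table above.
rows = {
    "Same-species known RefSeq (NM_/NR_)": (149, 149, 144),
    "Same-species GenBank": (718, 713, 610),
    "Same-species EST": (120986, 111320, 105356),
}

for source, (retrieved, aligned, passed) in rows.items():
    print(
        source,
        percent_of_retrieved(aligned, retrieved),
        percent_of_retrieved(passed, retrieved),
    )
```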

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,319,429,488	81%	23%	348,607
SAMEA2482241	ANF (Oreochromis niloticus, SAMEA2482241)	21,114,889	84%	27%	173,223
SAMEA2482242	ANS (Oreochromis niloticus, SAMEA2482242)	17,443,589	84%	28%	176,967
SAMEA2482243	PNF (Oreochromis niloticus, SAMEA2482243)	19,575,859	83%	27%	181,318
SAMEA2482244	PNS (Oreochromis niloticus, SAMEA2482244)	18,583,158	82%	27%	181,205
SAMN00715285	embryo (Oreochromis niloticus, SAMN00715285)	251,595	25%	6%	20,020
SAMN00767853	kidney (Oreochromis niloticus, Female, SAMN00767853)	87,556,570	77%	21%	211,135
SAMN00767854	heart (Oreochromis niloticus, Male, SAMN00767854)	60,263,332	77%	22%	193,033
SAMN00767855	skin (Oreochromis niloticus, Female, SAMN00767855)	45,307,090	80%	23%	192,669
SAMN00767856	eye (Oreochromis niloticus, Male, SAMN00767856)	56,997,428	80%	17%	209,238
SAMN00767857	blood (Oreochromis niloticus, Male, SAMN00767857)	77,932,704	80%	20%	156,552
SAMN00767858	ovary (Oreochromis niloticus, Female, SAMN00767858)	72,105,534	90%	31%	210,782
SAMN00767859	liver (Oreochromis niloticus, Female, SAMN00767859)	65,913,292	80%	26%	145,974
SAMN00767860	testis (Oreochromis niloticus, Male, SAMN00767860)	61,827,174	83%	24%	267,410
SAMN00767861	brain (Oreochromis niloticus, Female, SAMN00767861)	61,938,566	83%	16%	217,080
SAMN00767862	muscle (Oreochromis niloticus, Female, SAMN00767862)	56,207,368	87%	31%	168,066
SAMN00767863	embryo (Oreochromis niloticus, Male, SAMN00767863)	53,967,792	85%	25%	234,998
SAMN01086111	90 dah XX gonad (Oreochromis niloticus, SAMN01086111)	26,466,666	87%	26%	182,575
SAMN01087769	180 dah XY gonad (Oreochromis niloticus, SAMN01087769)	50,258,478	84%	23%	238,418
SAMN01087779	90 dah XY gonad (Oreochromis niloticus, SAMN01087779)	26,292,052	81%	22%	220,256
SAMN01091651	180 dah XX gonad (Oreochromis niloticus, SAMN01091651)	51,485,734	91%	27%	208,138
SAMN01091907	5 dah XX gonad (Oreochromis niloticus, SAMN01091907)	53,333,334	47%	12%	77,802
SAMN01091966	5 dah XY gonad (Oreochromis niloticus, SAMN01091966)	51,111,112	48%	4%	47,843
SAMN01093674	30 dah XX gonad (Oreochromis niloticus, SAMN01093674)	53,140,336	88%	30%	225,883
SAMN01093676	30 dah XY gonad (Oreochromis niloticus, SAMN01093676)	52,579,466	85%	23%	227,295
SAMN01985096	spleen and kidney (Oreochromis niloticus, SAMN01985096)	50,409,546	80%	22%	195,735
SAMN02212662	spleen (Oreochromis niloticus, 60 days, SAMN02212662)	26,622,238	84%	26%	180,735
SAMN02374859	General Sample for Oreochromis niloticus (Oreochromis niloticus, SAMN02374859)	100,744,586	89%	26%	240,483

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
ERR490212	ERX455666	ERP005655	SAMEA2482241	21,114,889	84%	27%
ERR490213	ERX455667	ERP005655	SAMEA2482242	17,443,589	84%	28%
ERR490214	ERX455668	ERP005655	SAMEA2482243	19,575,859	83%	27%
ERR490215	ERX455669	ERP005655	SAMEA2482244	18,583,158	82%	27%
SRR341913	SRX095989	SRP008027	SAMN00715285	251,595	25%	6%
SRR391680	SRX112566	SRP009911	SAMN00767853	28,435,226	76%	20%
SRR391684	SRX112566	SRP009911	SAMN00767853	29,765,322	78%	21%
SRR391689	SRX112566	SRP009911	SAMN00767853	29,356,022	78%	21%
SRR391681	SRX112567	SRP009911	SAMN00767854	20,499,202	78%	22%
SRR391686	SRX112567	SRP009911	SAMN00767854	20,167,908	78%	22%
SRR391696	SRX112567	SRP009911	SAMN00767854	19,596,222	76%	22%
SRR391682	SRX112568	SRP009911	SAMN00767855	14,763,778	78%	22%
SRR391694	SRX112568	SRP009911	SAMN00767855	15,386,980	81%	23%
SRR391710	SRX112568	SRP009911	SAMN00767855	15,156,332	80%	23%
SRR391683	SRX112569	SRP009911	SAMN00767856	19,090,574	81%	17%
SRR391700	SRX112569	SRP009911	SAMN00767856	19,401,466	81%	17%
SRR391712	SRX112569	SRP009911	SAMN00767856	18,505,388	79%	16%
SRR391685	SRX112570	SRP009911	SAMN00767857	26,102,090	80%	20%
SRR391692	SRX112570	SRP009911	SAMN00767857	25,315,020	78%	20%
SRR391703	SRX112570	SRP009911	SAMN00767857	26,515,594	81%	20%
SRR391687	SRX112571	SRP009911	SAMN00767858	23,389,304	88%	30%
SRR391691	SRX112571	SRP009911	SAMN00767858	24,544,706	91%	31%
SRR391693	SRX112571	SRP009911	SAMN00767858	24,171,524	90%	31%
SRR391688	SRX112572	SRP009911	SAMN00767859	22,059,186	80%	26%
SRR391698	SRX112572	SRP009911	SAMN00767859	21,468,202	78%	26%
SRR391708	SRX112572	SRP009911	SAMN00767859	22,385,904	81%	27%
SRR391690	SRX112573	SRP009911	SAMN00767860	21,025,146	84%	24%
SRR391695	SRX112573	SRP009911	SAMN00767860	20,695,416	84%	24%
SRR391701	SRX112573	SRP009911	SAMN00767860	20,106,612	81%	24%
SRR391697	SRX112574	SRP009911	SAMN00767861	20,763,590	83%	16%
SRR391699	SRX112574	SRP009911	SAMN00767861	20,087,900	82%	16%
SRR391709	SRX112574	SRP009911	SAMN00767861	21,087,076	84%	16%
SRR391702	SRX112575	SRP009911	SAMN00767862	19,126,854	88%	31%
SRR391704	SRX112575	SRP009911	SAMN00767862	18,235,448	85%	30%
SRR391706	SRX112575	SRP009911	SAMN00767862	18,845,066	87%	31%
SRR391705	SRX112576	SRP009911	SAMN00767863	17,513,910	83%	24%
SRR391707	SRX112576	SRP009911	SAMN00767863	18,376,206	85%	25%
SRR391711	SRX112576	SRP009911	SAMN00767863	18,077,676	85%	25%
SRR519096	SRX158097	SRP014017	SAMN01086111	26,466,666	87%	26%
SRR521273	SRX159747	SRP014017	SAMN01087769	50,258,478	84%	23%
SRR521274	SRX159748	SRP014017	SAMN01087779	26,292,052	81%	22%
SRR524807	SRX160791	SRP014017	SAMN01091651	51,485,734	91%	27%
SRR525179	SRX160996	SRP014017	SAMN01091907	53,333,334	47%	12%
SRR525236	SRX170311	SRP014017	SAMN01091966	51,111,112	48%	4%
SRR526901	SRX170662	SRP014017	SAMN01093674	53,140,336	88%	30%
SRR526903	SRX170664	SRP014017	SAMN01093676	52,579,466	85%	23%
SRR1011283	SRX364007	SRP014017	SAMN02374859	52,172,864	89%	26%
SRR1011284	SRX364009	SRP014017	SAMN02374859	48,571,722	89%	26%
SRR797490	SRX254258	SRP019938	SAMN01985096	50,409,546	80%	22%
SRR1291960	SRX320099	SRP026706	SAMN02212662	26,622,238	84%	26%
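
The per-sample read counts are consistent with summing the reads of all runs that belong to the same sample. The sketch below shows this for two samples, using run-level counts copied from the table above; the function name and tuple layout are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR391680", "SAMN00767853", 28435226),
    ("SRR391684", "SAMN00767853", 29765322),
    ("SRR391689", "SAMN00767853", 29356022),
    ("SRR1011283", "SAMN02374859", 52172864),
    ("SRR1011284", "SAMN02374859", 48571722),
]


def reads_per_sample(run_rows):
    """Sum read counts over all runs of each sample."""
    totals = defaultdict(int)
    for _run, sample, reads in run_rows:
        totals[sample] += reads
    return dict(totals)


totals = reads_per_sample(runs)
for sample, reads in totals.items():
    print(sample, f"{reads:,}")
```

The two totals match the kidney (SAMN00767853) and "General Sample" (SAMN02374859) rows of the by-sample table.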

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	73,028	68,288 (93.51%)	68,288 (93.51%)	68.20%	77.77%
Actinopterygii known RefSeq (NP_)	23,637	22,746 (96.23%)	22,746 (96.23%)	68.19%	76.27%
Homo sapiens known RefSeq (NP_)	39,226	33,117 (84.43%)	33,117 (84.43%)	66.20%	66.55%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
