NCBI Haplochromis burtoni Annotation Release 101

The RefSeq genome records for Haplochromis burtoni were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Haplochromis burtoni Annotation Release 101.

Annotation release ID: 101
Date of Entrez queries for transcripts and proteins: Sep 23 2015
Date of submission of annotation to the public databases: Sep 29 2015
Software version: 6.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
AstBur1.0	GCF_000239415.1	Broad Institute	12-22-2011	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	AstBur1.0
Genes and pseudogenes	26,461
protein-coding	24,094
non-coding	2,105
pseudogenes	262
genes with variants	9,504
mRNAs	44,653
fully-supported	43,087
with > 5% ab initio	678
partial	4,361
with filled gap(s)	3,852
known RefSeq (NM_)	50
model RefSeq (XM_)	44,603
Other RNAs	3,656
fully-supported	3,154
with > 5% ab initio	0
partial	30
with filled gap(s)	30
known RefSeq (NR_)	0
model RefSeq (XR_)	3,154
CDSs	44,653
fully-supported	43,087
with > 5% ab initio	758
partial	3,861
with major correction(s)	1,576
known RefSeq (NP_)	50
model RefSeq (XP_)	44,603
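
The subtotals in this table can be cross-checked arithmetically. The following is a minimal sketch in Python using figures copied from the table above; the dictionary keys and the helper name `check_total` are illustrative assumptions and are not part of the NCBI pipeline.

```python
# Figures copied from the feature-count table for AstBur1.0.
counts = {
    "genes_and_pseudogenes": 26461,
    "protein_coding": 24094,
    "non_coding": 2105,
    "pseudogenes": 262,
    "mrnas": 44653,
    "mrna_known_nm": 50,
    "mrna_model_xm": 44603,
}


def check_total(total_key, part_keys, table=counts):
    """Return True when the listed parts add up to the reported total."""
    return table[total_key] == sum(table[k] for k in part_keys)


# Genes and pseudogenes = protein-coding + non-coding + pseudogenes.
genes_ok = check_total(
    "genes_and_pseudogenes", ["protein_coding", "non_coding", "pseudogenes"]
)
# mRNAs = known RefSeq (NM_) + model RefSeq (XM_).
mrnas_ok = check_total("mrnas", ["mrna_known_nm", "mrna_model_xm"])
print(genes_ok, mrnas_ok)
```

The same helper could be pointed at any other total and its subcategories, provided the subcategories are mutually exclusive; rows such as "partial" or "with filled gap(s)" overlap with other rows and are not expected to sum to the total.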

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	26,199	18,192	8,091	71	686,144
All transcripts	48,309	3,446	2,800	50	94,671
mRNA	44,653	3,623	2,951	199	94,671
misc_RNA	621	3,143	2,564	175	13,599
tRNA	502	74	73	71	86
lncRNA	2,533	1,079	711	50	15,220
Single-exon transcripts	901	1,851	1,507	199	11,953
coding transcripts (NM_/XM_ )	901	1,851	1,507	199	11,953
CDSs	44,653	2,129	1,530	96	93,396
Exons	281,103	293	138	2	17,286
in coding transcripts (NM_/XM_ )	272,952	291	138	2	17,286
in non-coding transcripts (NR_/XR_ )	12,516	291	132	2	12,371
Introns	251,426	1,870	390	26	373,549
in coding transcripts (NM_/XM_ )	245,826	1,839	387	26	373,549
in non-coding transcripts (NR_/XR_ )	9,856	2,668	501	30	231,774

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.83	1	1	33
Number of exons per transcript	12.67	9	1	249

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 24,002 coding genes, 22,043 genes had a protein with an alignment covering 50% or more of the query and 10,479 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
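
As a worked illustration of these two definitions, here is a minimal Python sketch. The function name `coverage_percent`, the 1-based inclusive coordinate convention, and the example alignment are assumptions made for illustration; the sketch treats the alignment as a single ungapped span and does not reproduce how the pipeline itself computes coverage.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence that falls inside one alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-310 of a 400-residue annotated protein
# (the query) aligned to residues 1-300 of a 300-residue curated protein
# (the target).
query_cov = coverage_percent(11, 310, 400)
target_cov = coverage_percent(1, 300, 300)
print(query_cov, target_cov)
```

In this hypothetical case the target is fully covered while a quarter of the query lies outside the alignment, which is why the two coverage values are reported separately.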

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
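
The cumulative counts behind such a graph can be sketched as follows. This is an illustrative Python example only: the function name `genes_at_or_above` is hypothetical and the coverage values are made up for the example, not taken from this annotation.

```python
def genes_at_or_above(coverages, thresholds):
    """For each threshold, count genes whose best coverage is >= it."""
    return {t: sum(1 for c in coverages if c >= t) for t in thresholds}


# Made-up best query coverage (%) for six hypothetical genes.
best_query_coverage = [100.0, 97.5, 82.0, 50.0, 49.9, 12.0]

# The thresholds 50 and 95 mirror the "50% or more" and "95% or more"
# cut-offs quoted in the text.
cumulative = genes_at_or_above(best_query_coverage, [50, 95])
print(cumulative)
```

Because the comparison is inclusive, a gene with exactly 50% coverage is counted at the 50% threshold, matching the "50% or more" wording used above.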

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
AstBur1.0	GCF_000239415.1	2.70%	19.79%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with short reads and reported in the Short read transcript alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	50	50 (100.00%)	50 (100.00%)	99.24%	99.52%
Same-species Genbank	124	124 (100.00%)	119 (95.97%)	99.43%	98.37%
Same-species EST	10,312	8,300 (80.49%)	7,450 (72.25%)	99.09%	97.63%
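
The percentages in this table can be recomputed from the counts. The sketch below is a small Python illustration using the Same-species EST row; the helper name `percent_of` is an assumption. It suggests that both percentages are expressed relative to the number of sequences retrieved from Entrez, not relative to the number aligned.

```python
def percent_of(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Counts copied from the Same-species EST row.
est_retrieved, est_aligned, est_passed = 10312, 8300, 7450

aligned_pct = percent_of(est_aligned, est_retrieved)
passed_pct = percent_of(est_passed, est_retrieved)
# For contrast: the share of aligned sequences that reached Gnomon.
passed_of_aligned_pct = percent_of(est_passed, est_aligned)
print(aligned_pct, passed_pct, passed_of_aligned_pct)
```

The first two values reproduce the 80.49% and 72.25% shown in the table, while the third (passed as a share of aligned) is a different quantity that the table does not report.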

Short read transcript alignments

The following short reads (RNA-Seq) from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	3,658,783,392	70%	13%	308,853
SAMN00628004	mixed (Haplochromis burtoni, SAMN00628004)	638,345	53%	20%	82,533
SAMN00761590	heart (Haplochromis burtoni, SAMN00761590)	100,458,400	74%	24%	202,477
SAMN00761591	liver (Haplochromis burtoni, Male, SAMN00761591)	97,791,114	74%	24%	158,673
SAMN00761592	sample 253, T (Haplochromis burtoni, Male, SAMN00761592)	476,889,110	67%	5%	129,877
SAMN00761593	sample 255, T (Haplochromis burtoni, Male, SAMN00761593)	507,233,735	73%	18%	242,867
SAMN00761594	eye (Haplochromis burtoni, Male, SAMN00761594)	98,066,928	65%	12%	218,427
SAMN00761595	sample 291, NT (Haplochromis burtoni, Male, SAMN00761595)	524,448,499	70%	5%	130,927
SAMN00761596	sample 247, NT (Haplochromis burtoni, Male, SAMN00761596)	563,178,930	66%	15%	238,304
SAMN00761597	testis (Haplochromis burtoni, Male, SAMN00761597)	92,789,416	73%	22%	232,787
SAMN00761598	pooled embryo (Haplochromis burtoni, Female, SAMN00761598)	84,295,202	73%	19%	239,065
SAMN00761599	brain (Haplochromis burtoni, Male, SAMN00761599)	96,125,308	69%	13%	214,379
SAMN00765505	pooled ovary (Haplochromis burtoni, Female, SAMN00765505)	66,933,332	85%	25%	208,324
SAMN00771450	pooled muscle (Haplochromis burtoni, SAMN00771450)	78,288,782	72%	18%	144,258
SAMN00771451	pooled skin (Haplochromis burtoni, SAMN00771451)	179,190,656	74%	14%	222,836
SAMN00771452	pooled kidney (Haplochromis burtoni, SAMN00771452)	90,637,086	71%	12%	194,552
SAMN00771453	pooled blood (Haplochromis burtoni, SAMN00771453)	130,490,550	75%	12%	148,301
SAMN02796151	Gonad (Haplochromis burtoni, adult49, M, SAMN02796151)	17,966,560	80%	10%	153,102
SAMN02796152	Brain (Haplochromis burtoni, adult50, M, SAMN02796152)	16,840,835	74%	5%	143,395
SAMN02796153	Gonad (Haplochromis burtoni, adult51, M, SAMN02796153)	23,585,610	78%	10%	170,871
SAMN02796154	Brain (Haplochromis burtoni, adult52, M, SAMN02796154)	19,869,183	73%	5%	146,368
SAMN02796155	Gonad (Haplochromis burtoni, adult53, M, SAMN02796155)	14,470,381	74%	9%	145,697
SAMN02796156	Brain (Haplochromis burtoni, adult54, M, SAMN02796156)	23,757,973	79%	6%	159,294
SAMN02796157	Gonad (Haplochromis burtoni, adult55, F, SAMN02796157)	21,243,669	80%	10%	135,529
SAMN02796158	Brain (Haplochromis burtoni, adult56, F, SAMN02796158)	19,097,310	79%	6%	153,336
SAMN02796159	Gonad (Haplochromis burtoni, adult57, F, SAMN02796159)	24,449,861	85%	11%	145,671
SAMN02796160	Brain (Haplochromis burtoni, adult58, F, SAMN02796160)	17,817,706	74%	5%	144,697
SAMN02796161	Gonad (Haplochromis burtoni, adult59, F, SAMN02796161)	13,637,936	83%	9%	126,162
SAMN02796162	Brain (Haplochromis burtoni, adult60, F, SAMN02796162)	25,143,956	79%	6%	160,389
SAMN02911621	Anal fin (Haplochromis burtoni, Juvenile, male, SAMN02911621)	117,280,624	58%	13%	193,724
SAMN02911622	Anal fin (Haplochromis burtoni, Juvenile, female, SAMN02911622)	116,166,395	43%	10%	184,887

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR287662	SRX078333	SRP007198	SAMN00628004	638,345	53%	20%
SRR387450	SRX110174	SRP009543	SAMN00761590	33,175,930	74%	24%
SRR387455	SRX110174	SRP009543	SAMN00761590	33,912,786	74%	24%
SRR387462	SRX110174	SRP009543	SAMN00761590	33,369,684	74%	24%
SRR387451	SRX110175	SRP009543	SAMN00761591	32,284,624	73%	24%
SRR387458	SRX110175	SRP009543	SAMN00761591	33,000,270	74%	24%
SRR387470	SRX110175	SRP009543	SAMN00761591	32,506,220	74%	24%
SRR387452	SRX110176	SRP009543	SAMN00761592	316,672,782	65%	5%
SRR387457	SRX110180	SRP009543	SAMN00761592	160,216,328	71%	5%
SRR387453	SRX110177	SRP009543	SAMN00761593	331,516,312	71%	17%
SRR387464	SRX110184	SRP009543	SAMN00761593	175,717,423	77%	19%
SRR387454	SRX110178	SRP009543	SAMN00761594	32,564,836	65%	12%
SRR387461	SRX110178	SRP009543	SAMN00761594	33,113,226	65%	12%
SRR387471	SRX110178	SRP009543	SAMN00761594	32,388,866	65%	11%
SRR387456	SRX110179	SRP009543	SAMN00761595	346,044,950	68%	5%
SRR387466	SRX110186	SRP009543	SAMN00761595	178,403,549	74%	5%
SRR387459	SRX110181	SRP009543	SAMN00761596	368,753,766	64%	14%
SRR387460	SRX110182	SRP009543	SAMN00761596	194,425,164	69%	16%
SRR387463	SRX110183	SRP009543	SAMN00761597	31,305,652	73%	22%
SRR387467	SRX110183	SRP009543	SAMN00761597	30,627,080	73%	22%
SRR387474	SRX110183	SRP009543	SAMN00761597	30,856,684	73%	22%
SRR387465	SRX110185	SRP009543	SAMN00761598	27,816,558	73%	19%
SRR387473	SRX110185	SRP009543	SAMN00761598	28,009,692	73%	19%
SRR387475	SRX110185	SRP009543	SAMN00761598	28,468,952	73%	19%
SRR387468	SRX110187	SRP009543	SAMN00761599	31,727,626	69%	13%
SRR387469	SRX110187	SRP009543	SAMN00761599	32,481,598	69%	13%
SRR387472	SRX110187	SRP009543	SAMN00761599	31,916,084	69%	13%
SRR527863	SRX111459	SRP009543	SAMN00765505	33,687,426	85%	25%
SRR527864	SRX111459	SRP009543	SAMN00765505	33,245,906	85%	25%
SRR452434	SRX115266	SRP009543	SAMN00771450	2,570,566	77%	17%
SRR452438	SRX115266	SRP009543	SAMN00771450	2,602,232	74%	16%
SRR495188	SRX146549	SRP009543	SAMN00771450	2,155,550	39%	7%
SRR496381	SRX146549	SRP009543	SAMN00771450	35,039,026	73%	18%
SRR496383	SRX146549	SRP009543	SAMN00771450	35,921,408	73%	18%
SRR452432	SRX115267	SRP009543	SAMN00771451	6,921,460	78%	14%
SRR452433	SRX115267	SRP009543	SAMN00771451	6,831,996	80%	14%
SRR495189	SRX146550	SRP009543	SAMN00771451	12,339,652	39%	6%
SRR496382	SRX146550	SRP009543	SAMN00771451	76,061,466	77%	15%
SRR496384	SRX146550	SRP009543	SAMN00771451	77,036,082	77%	15%
SRR452435	SRX115268	SRP009543	SAMN00771452	2,584,208	75%	11%
SRR452436	SRX115268	SRP009543	SAMN00771452	2,582,336	73%	10%
SRR495247	SRX146877	SRP009543	SAMN00771452	40,629,722	73%	12%
SRR495249	SRX146877	SRP009543	SAMN00771452	39,983,476	73%	12%
SRR495252	SRX146877	SRP009543	SAMN00771452	4,857,344	37%	5%
SRR527860	SRX115269	SRP009543	SAMN00771453	3,026,002	77%	13%
SRR527861	SRX115269	SRP009543	SAMN00771453	2,990,570	79%	13%
SRR527862	SRX146878	SRP009543	SAMN00771453	59,408,204	77%	13%
SRR527865	SRX146878	SRP009543	SAMN00771453	6,790,510	39%	6%
SRR527866	SRX146878	SRP009543	SAMN00771453	58,275,264	77%	13%
SRR1555481	SRX684589	SRP042144	SAMN02796151	17,966,560	80%	10%
SRR1555482	SRX684590	SRP042144	SAMN02796152	16,840,835	74%	5%
SRR1555483	SRX684591	SRP042144	SAMN02796153	23,585,610	78%	10%
SRR1555484	SRX684592	SRP042144	SAMN02796154	19,869,183	73%	5%
SRR1555485	SRX684593	SRP042144	SAMN02796155	14,470,381	74%	9%
SRR1555486	SRX684594	SRP042144	SAMN02796156	23,757,973	79%	6%
SRR1555487	SRX684595	SRP042144	SAMN02796157	21,243,669	80%	10%
SRR1555488	SRX684596	SRP042144	SAMN02796158	19,097,310	79%	6%
SRR1555489	SRX684597	SRP042144	SAMN02796159	24,449,861	85%	11%
SRR1555490	SRX684598	SRP042144	SAMN02796160	17,817,706	74%	5%
SRR1555491	SRX684599	SRP042144	SAMN02796161	13,637,936	83%	9%
SRR1555492	SRX684600	SRP042144	SAMN02796162	25,143,956	79%	6%
SRR1514653	SRX652229	SRP045292	SAMN02911621	39,953,491	53%	13%
SRR1514670	SRX652229	SRP045292	SAMN02911621	40,242,845	57%	13%
SRR1514671	SRX652229	SRP045292	SAMN02911621	37,084,288	63%	13%
SRR1514688	SRX652253	SRP045292	SAMN02911622	37,136,547	22%	5%
SRR1514689	SRX652253	SRP045292	SAMN02911622	38,239,151	52%	12%
SRR1514690	SRX652253	SRP045292	SAMN02911622	40,790,697	53%	12%
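
The read counts in the by-sample table appear to be the sums of the corresponding rows in the by-run table. A minimal Python sketch of that aggregation, using a subset of rows copied from the table above (the variable names are illustrative):

```python
from collections import defaultdict

# (run, sample, number of reads) copied from the by-run table; subset only.
runs = [
    ("SRR387450", "SAMN00761590", 33175930),
    ("SRR387455", "SAMN00761590", 33912786),
    ("SRR387462", "SAMN00761590", 33369684),
    ("SRR527863", "SAMN00765505", 33687426),
    ("SRR527864", "SAMN00765505", 33245906),
]

# Sum the reads of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in sorted(reads_per_sample.items()):
    print(sample, f"{total:,}")
```

For these two samples the totals match the by-sample rows for heart (SAMN00761590) and pooled ovary (SAMN00765505). The percentage columns cannot be summed in the same way; a per-sample percentage would need to be weighted by the read count of each run.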

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	73,602	68,467 (93.02%)	68,467 (93.02%)	68.71%	76.92%
Actinopterygii known RefSeq (NP_)	23,704	22,732 (95.90%)	22,732 (95.90%)	68.39%	75.73%
Homo sapiens known RefSeq (NP_)	39,314	33,094 (84.18%)	33,094 (84.18%)	66.11%	65.76%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22:134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
