NCBI Etheostoma spectabile Annotation Release 100

The RefSeq genome records for Etheostoma spectabile were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Etheostoma spectabile Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Feb 11 2020
Date of submission of annotation to the public databases: Feb 21 2020
Software version: 8.3

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
UIUC_Espe_1.0	GCF_008692095.1	University of Illinois at Urbana-Champaign	10-07-2019	Reference	25 assembled chromosomes; unplaced scaffolds
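The assembly accession in the table above follows the usual NCBI layout: a GCF_ (RefSeq) or GCA_ (GenBank) prefix, a nine-digit identifier, and a version after the dot. A minimal sketch of splitting it into parts is shown here; the function name and the regular expression are illustrative assumptions, not part of any NCBI tool.

```python
import re

# Assumed pattern for an assembly accession: GCF_ (RefSeq) or GCA_ (GenBank),
# nine digits, then a numeric version after the dot.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession into (prefix, number, version)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


# Accession of the assembly annotated in this release.
prefix, number, version = parse_assembly_accession("GCF_008692095.1")
print(prefix, number, version)
```

Keeping the identifier as a string preserves its leading zeros, which matter when the accession is rebuilt for a lookup.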

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	UIUC_Espe_1.0
Genes and pseudogenes	38,154
  protein-coding	22,341
  non-coding	14,091
  transcribed pseudogenes	0
  non-transcribed pseudogenes	1,630
  genes with variants	9,441
  immunoglobulin/T-cell receptor gene segments	92
  other	0
mRNAs	45,673
  fully-supported	43,981
  with > 5% ab initio	728
  partial	1,646
  with filled gap(s)	1,280
  known RefSeq (NM_)	0
  model RefSeq (XM_)	45,673
non-coding RNAs	15,595
  fully-supported	7,432
  with > 5% ab initio	0
  partial	10
  with filled gap(s)	10
  known RefSeq (NR_)	0
  model RefSeq (XR_)	9,762
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	45,778
  fully-supported	43,981
  with > 5% ab initio	833
  partial	1,004
  with major correction(s)	1,597
  known RefSeq (NP_)	0
  model RefSeq (XP_)	45,686
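As a quick consistency check on the gene rows of the table above, the per-category counts can be summed and compared with the reported "Genes and pseudogenes" total. This sketch assumes that "genes with variants" describes a property of genes rather than a separate category, so it is left out of the sum; that reading is an assumption, not something the report states.

```python
# Gene counts copied from the Feature counts table for UIUC_Espe_1.0.
# "genes with variants" is omitted on the assumption that it is not a
# mutually exclusive category.
gene_categories = {
    "protein-coding": 22_341,
    "non-coding": 14_091,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 1_630,
    "immunoglobulin/T-cell receptor gene segments": 92,
    "other": 0,
}
reported_total = 38_154  # "Genes and pseudogenes" row

category_sum = sum(gene_categories.values())
print(category_sum, category_sum == reported_total)
```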

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	36,432	15,401	6,366	57	1,100,153
All transcripts	61,268	2,822	2,221	49	91,951
  mRNA	45,673	3,621	2,935	117	91,951
  misc_RNA	1,193	3,327	2,706	125	27,087
  tRNA	5,831	74	73	66	87
  lncRNA	6,239	427	190	49	9,939
  snoRNA	420	166	213	60	332
  snRNA	1,238	144	141	57	195
  guide_RNA	6	249	284	130	388
  rRNA	668	302	119	117	3,917
Single-exon transcripts	729	1,761	1,527	276	9,869
  coding transcripts (NM_/XM_)	729	1,761	1,527	276	9,869
CDSs	45,686	2,305	1,626	96	90,588
Exons	288,439	270	136	1	17,346
  in coding transcripts (NM_/XM_)	267,740	278	139	1	17,346
  in non-coding transcripts (NR_/XR_)	31,717	175	98	2	10,017
Introns	254,882	2,237	571	27	1,040,062
  in coding transcripts (NM_/XM_)	241,110	2,038	539	27	1,040,062
  in non-coding transcripts (NR_/XR_)	24,716	4,067	1,127	30	865,495

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.81	1	1	50
Number of exons per transcript	12.63	9	1	234

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 22,328 coding genes, 20,638 had a protein with an alignment covering 50% or more of the query, and 9,741 had an alignment covering 95% or more of the query.
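To put these counts in proportion, the share of coding genes at each threshold can be worked out directly. The sketch below only restates the three numbers from the paragraph above; the helper name is illustrative.

```python
# Counts taken from the paragraph above.
coding_genes = 22_328
genes_cov_50 = 20_638  # alignment covers >= 50% of the query
genes_cov_95 = 9_741   # alignment covers >= 95% of the query


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(genes_cov_50, coding_genes))  # 92.4
print(percent(genes_cov_95, coding_genes))  # 43.6
```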

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
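The two definitions above can be sketched as one small function applied to either side of an alignment. The lengths used in the example are hypothetical values chosen for illustration; they do not come from this annotation.

```python
def coverage_percent(aligned_length, sequence_length):
    """Percentage of a sequence's length that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: a 400-residue annotated protein (query) aligned to
# a 500-residue curated protein (target).
query_length, target_length = 400, 500
aligned_on_query, aligned_on_target = 380, 390

query_coverage = coverage_percent(aligned_on_query, query_length)
target_coverage = coverage_percent(aligned_on_target, target_length)
print(query_coverage, target_coverage)  # 95.0 78.0
```

In this made-up case the alignment would count toward the 95% query-coverage threshold even though it covers only 78% of the target.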

[Figure: cumulative graph of the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline are included. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
UIUC_Espe_1.0	GCF_008692095.1	3.74%	30.27%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

No transcript evidence was used in this annotation.

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	4,966,155,790	71%	29%	376,659
SAMN02944951	27189481	Ovary (Perca fluviatilis, 9 months, female, SAMN02944951)	34,137,838	66%	33%	162,961
SAMN02944952	27189481	Brain (Perca fluviatilis, 9 months, female, SAMN02944952)	47,628,916	65%	22%	225,455
SAMN02944953	27189481	Gills (Perca fluviatilis, 9 months, female, SAMN02944953)	75,441,436	67%	28%	228,470
SAMN02944954	27189481	Heart (Perca fluviatilis, 9 months, female, SAMN02944954)	64,750,202	68%	31%	195,414
SAMN02944955	27189481	Muscle (Perca fluviatilis, 9 months, female, SAMN02944955)	78,366,688	80%	39%	141,368
SAMN02944956	27189481	Liver (Perca fluviatilis, 9 months, female, SAMN02944956)	69,260,740	67%	32%	164,546
SAMN02944957	27189481	Head kidney (Perca fluviatilis, 9 months, female, SAMN02944957)	73,787,758	70%	32%	201,289
SAMN02944958	27189481	Bones (Perca fluviatilis, 9 months, female, SAMN02944958)	62,236,324	71%	33%	230,288
SAMN02944959	27189481	Intestine (Perca fluviatilis, 9 months, female, SAMN02944959)	55,637,122	67%	30%	207,182
SAMN02944960	27189481	Testis (Perca fluviatilis, 9 months, male, SAMN02944960)	62,858,528	67%	31%	237,610
SAMN02944961	27189481	Embryos (Perca fluviatilis, male and female, SAMN02944961)	72,593,576	74%	25%	219,190
SAMN03381148	NA	adult, gill, brain and heart (Gymnocephalus cernua, SAMN03381148)	87,169,850	69%	21%	220,694
SAMN03944739	NA	muscle, fin, brain, spleen, liver, bone, kidney, heart and gonad (Sander lucioperca, pooled male and female, SAMN03944739)	100,018,272	76%	35%	211,331
SAMN04908979	NA	gonad (Perca flavescens, 2, female, SAMN04908979)	50,038,026	74%	33%	180,615
SAMN04964641	NA	Muscle (Perca flavescens, 2 years, female, SAMN04964641)	90,988,070	81%	44%	152,709
SAMN04990485	NA	gonad (Perca flavescens, 2, male, SAMN04990485)	114,305,962	71%	29%	246,606
SAMN04990486	NA	muscle (Perca flavescens, 2, male, SAMN04990486)	114,324,326	80%	40%	202,994
SAMN04990487	NA	gonad (Perca flavescens, 2, SAMN04990487)	98,627,630	72%	29%	233,945
SAMN04990488	NA	muscle (Perca flavescens, 2, SAMN04990488)	103,790,110	81%	44%	179,950
SAMN05957801	NA	fresh brain, heart, gill, liver, muscle, kidney, and pancreas (Perca fluviatilis, one year old, pooled male and female, SAMN05957801)	204,393,140	63%	47%	268,094
SAMN08954934	30355765	eye ball (Perca fluviatilis, female, SAMN08954934)	105,243,036	73%	19%	229,033
SAMN09430511	NA	Gill (Sander vitreus, >1 year, female, SAMN09430511)	653,699,234	69%	28%	239,968
SAMN10390368	NA	brain/liver/heart/fin/gonad (Perca fluviatilis, mixed, SAMN10390368)	125,889,900	70%	40%	219,207
SAMN10390369	NA	brain/liver/heart/fin/gonad (Gymnocephalus cernua, mixed, SAMN10390369)	61,556,566	69%	34%	200,560
SAMN10390370	NA	brain/liver/heart/fin/gonad (Sander lucioperca, mixed, SAMN10390370)	130,294,572	71%	37%	202,710
SAMN10473287	NA	adult, retina (Perca fluviatilis, SAMN10473287)	30,595,942	56%	25%	183,807
SAMN10473288	NA	adult, retina (Perca fluviatilis, SAMN10473288)	36,171,406	45%	7%	106,206
SAMN10473289	NA	adult, retina (Perca fluviatilis, SAMN10473289)	33,106,668	59%	24%	184,716
SAMN11280411	NA	liver (Perca fluviatilis, 4, female, SAMN11280411)	24,671,660	68%	44%	132,728
SAMN11280412	NA	liver (Perca fluviatilis, 4, female, SAMN11280412)	24,418,684	65%	40%	124,196
SAMN11280413	NA	liver (Perca fluviatilis, 4, female, SAMN11280413)	22,434,936	68%	44%	127,568
SAMN11280414	NA	liver (Perca fluviatilis, 4, female, SAMN11280414)	22,073,690	68%	42%	127,990
SAMN11280415	NA	liver (Perca fluviatilis, 4, female, SAMN11280415)	24,703,636	67%	42%	124,117
SAMN11280416	NA	liver (Perca fluviatilis, 4, female, SAMN11280416)	22,490,872	73%	43%	127,358
SAMN11280417	NA	liver (Perca fluviatilis, 4, female, SAMN11280417)	24,102,410	73%	42%	128,490
SAMN11280418	NA	liver (Perca fluviatilis, 3, female, SAMN11280418)	24,407,688	72%	42%	126,033
SAMN11280419	NA	liver (Perca fluviatilis, 5, female, SAMN11280419)	23,866,650	69%	42%	131,421
SAMN11280420	NA	liver (Perca fluviatilis, 4, female, SAMN11280420)	22,252,306	68%	40%	109,816
SAMN12391855	NA	muscle, gonads, eyes, brain, liver, and fin from adults, and whole fry (Etheostoma spectabile, male and female, SAMN12391855)	127,045,138	87%	54%	256,049
SAMN13282760	NA	eye (Perca fluviatilis, female, SAMN13282760)	117,202,084	72%	19%	232,440
SAMN13282761	NA	eye (Perca fluviatilis, female, SAMN13282761)	115,296,280	72%	18%	223,262
SAMN13282762	NA	eye (Perca fluviatilis, female, SAMN13282762)	126,237,210	70%	18%	231,571
SAMN13282763	NA	eye (Perca fluviatilis, female, SAMN13282763)	110,428,432	71%	18%	227,495
SAMN13282764	NA	eye (Perca fluviatilis, female, SAMN13282764)	130,528,952	72%	18%	229,771
SAMN13282765	NA	eye (Perca fluviatilis, female, SAMN13282765)	128,676,954	72%	19%	237,039
SAMN13282766	NA	eye (Perca fluviatilis, female, SAMN13282766)	106,079,030	72%	18%	229,761
SAMN13282767	NA	eye (Perca fluviatilis, female, SAMN13282767)	106,041,636	73%	19%	225,822
SAMN13282768	NA	eye (Perca fluviatilis, female, SAMN13282768)	235,524,834	72%	18%	250,545
SAMN13282769	NA	eye (Perca fluviatilis, female, SAMN13282769)	103,937,648	69%	18%	229,265
SAMN13282770	NA	eye (Perca fluviatilis, female, SAMN13282770)	123,344,112	71%	18%	234,687
SAMN13282771	NA	eye (Perca fluviatilis, female, SAMN13282771)	119,528,348	73%	19%	234,934
SAMN13282772	NA	eye (Perca fluviatilis, female, SAMN13282772)	243,950,762	73%	19%	253,502

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR1533685	SRX667280	SRP045144	SAMN02944951	34,137,838	66%	33%
SRR1533686	SRX667281	SRP045144	SAMN02944952	47,628,916	65%	22%
SRR1533687	SRX667282	SRP045144	SAMN02944953	75,441,436	67%	28%
SRR1533688	SRX667283	SRP045144	SAMN02944954	64,750,202	68%	31%
SRR1533689	SRX667284	SRP045144	SAMN02944955	78,366,688	80%	39%
SRR1533690	SRX667285	SRP045144	SAMN02944956	69,260,740	67%	32%
SRR1533691	SRX667286	SRP045144	SAMN02944957	73,787,758	70%	32%
SRR1533692	SRX667287	SRP045144	SAMN02944958	62,236,324	71%	33%
SRR1533693	SRX667288	SRP045144	SAMN02944959	55,637,122	67%	30%
SRR1533694	SRX667289	SRP045144	SAMN02944960	62,858,528	67%	31%
SRR1533695	SRX667290	SRP045144	SAMN02944961	72,593,576	74%	25%
SRR1822420	SRX893908	SRP055703	SAMN03381148	87,169,850	69%	21%
SRR2871497	SRX1328344	SRP064697	SAMN03944739	50,009,136	76%	35%
SRR3183224	SRX1385650	SRP064697	SAMN03944739	50,009,136	76%	35%
SRR3480061	SRX1745358	SRP074479	SAMN04908979	50,038,026	74%	33%
SRR3498542	SRX1757083	SRP074479	SAMN04964641	90,988,070	81%	44%
SRR3498594	SRX1757084	SRP074479	SAMN04990485	114,305,962	71%	29%
SRR3498596	SRX1757138	SRP074479	SAMN04990486	114,324,326	80%	40%
SRR3498815	SRX1757357	SRP074479	SAMN04990487	98,627,630	72%	29%
SRR3498823	SRX1757358	SRP074479	SAMN04990488	103,790,110	81%	44%
SRR4787434	SRX2317605	SRP092417	SAMN05957801	204,393,140	63%	47%
SRR8242429	SRX5060693	SRP126129	SAMN10473287	30,595,942	56%	25%
SRR8242432	SRX5060690	SRP126129	SAMN10473288	36,171,406	45%	7%
SRR8242431	SRX5060691	SRP126129	SAMN10473289	33,106,668	59%	24%
SRR7091762	SRX4020655	SRP144291	SAMN08954934	105,243,036	73%	19%
SRR7348420	SRX4221801	SRP150633	SAMN09430511	653,699,234	69%	28%
SRR8168424	SRX4989076	SRP167999	SAMN10390368	125,889,900	70%	40%
SRR8168423	SRX4989077	SRP167999	SAMN10390369	61,556,566	69%	34%
SRR8168422	SRX4989078	SRP167999	SAMN10390370	130,294,572	71%	37%
SRR8798620	SRX5587469	SRP189727	SAMN11280411	24,671,660	68%	44%
SRR8798619	SRX5587470	SRP189727	SAMN11280412	24,418,684	65%	40%
SRR8798622	SRX5587467	SRP189727	SAMN11280413	22,434,936	68%	44%
SRR8798621	SRX5587468	SRP189727	SAMN11280414	22,073,690	68%	42%
SRR8798624	SRX5587465	SRP189727	SAMN11280415	24,703,636	67%	42%
SRR8798623	SRX5587466	SRP189727	SAMN11280416	22,490,872	73%	43%
SRR8798626	SRX5587463	SRP189727	SAMN11280417	24,102,410	73%	42%
SRR8798625	SRX5587464	SRP189727	SAMN11280418	24,407,688	72%	42%
SRR8798618	SRX5587471	SRP189727	SAMN11280419	23,866,650	69%	42%
SRR8798617	SRX5587472	SRP189727	SAMN11280420	22,252,306	68%	40%
SRR9855987	SRX6609989	SRP216744	SAMN12391855	127,045,138	87%	54%
SRR10441602	SRX7136336	SRP229767	SAMN13282760	117,202,084	72%	19%
SRR10441601	SRX7136337	SRP229767	SAMN13282761	115,296,280	72%	18%
SRR10441597	SRX7136341	SRP229767	SAMN13282762	126,237,210	70%	18%
SRR10441596	SRX7136342	SRP229767	SAMN13282763	110,428,432	71%	18%
SRR10441595	SRX7136343	SRP229767	SAMN13282764	130,528,952	72%	18%
SRR10441594	SRX7136344	SRP229767	SAMN13282765	128,676,954	72%	19%
SRR10441593	SRX7136345	SRP229767	SAMN13282766	106,079,030	72%	18%
SRR10441592	SRX7136346	SRP229767	SAMN13282767	106,041,636	73%	19%
SRR10441591	SRX7136347	SRP229767	SAMN13282768	235,524,834	72%	18%
SRR10441590	SRX7136348	SRP229767	SAMN13282769	103,937,648	69%	18%
SRR10441600	SRX7136338	SRP229767	SAMN13282770	123,344,112	71%	18%
SRR10441599	SRX7136339	SRP229767	SAMN13282771	119,528,348	73%	19%
SRR10441598	SRX7136340	SRP229767	SAMN13282772	243,950,762	73%	19%
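The by-run and by-sample tables can be related by grouping runs on their sample ID. In the rows shown, the read count reported for a sample appears to be the sum over its runs: SAMN03944739 has two runs of 50,009,136 reads each and is listed with 100,018,272 reads. The sketch below illustrates that grouping on three rows copied from the table above; it is not the pipeline's own code.

```python
from collections import defaultdict

# (run, sample, number of reads) for three rows copied from the by-run table.
runs = [
    ("SRR2871497", "SAMN03944739", 50_009_136),
    ("SRR3183224", "SAMN03944739", 50_009_136),
    ("SRR1822420", "SAMN03381148", 87_169_850),
]

# Sum the read counts of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, reads in sorted(reads_per_sample.items()):
    print(f"{sample}\t{reads:,}")
```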

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Larimichthys crocea high-quality model RefSeq (XP_)	18,161	17,473 (96.21%)	17,473 (96.21%)	72.58%	81.53%
Actinopterygii GenBank	85,370	51,105 (59.86%)	51,105 (59.86%)	68.91%	80.35%
Actinopterygii known RefSeq (NP_)	25,009	22,967 (91.83%)	22,967 (91.83%)	68.57%	78.47%
Danio rerio high-quality model RefSeq (XP_)	7,935	7,251 (91.38%)	7,251 (91.38%)	65.30%	72.95%
Perca flavescens high-quality model RefSeq (XP_)	16,027	15,389 (96.02%)	15,389 (96.02%)	75.84%	83.53%
Homo sapiens known RefSeq (NP_)	56,585	36,071 (63.75%)	36,071 (63.75%)	66.93%	69.67%
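The percentages in the table above are the aligned count divided by the number of sequences retrieved from Entrez. A short sketch that recomputes them from the counts, using values copied from the table (the function name is illustrative):

```python
# (source, sequences retrieved from Entrez, sequences aligned by ProSplign),
# copied from the Protein alignments table.
protein_sources = [
    ("Larimichthys crocea high-quality model RefSeq (XP_)", 18_161, 17_473),
    ("Actinopterygii GenBank", 85_370, 51_105),
    ("Actinopterygii known RefSeq (NP_)", 25_009, 22_967),
    ("Danio rerio high-quality model RefSeq (XP_)", 7_935, 7_251),
    ("Perca flavescens high-quality model RefSeq (XP_)", 16_027, 15_389),
    ("Homo sapiens known RefSeq (NP_)", 56_585, 36_071),
]


def percent_aligned(retrieved, aligned):
    """Format aligned/retrieved as a percentage with two decimals."""
    return f"{100 * aligned / retrieved:.2f}%"


for source, retrieved, aligned in protein_sources:
    print(f"{source}\t{percent_aligned(retrieved, aligned)}")
```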

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22:134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
