NCBI Esox lucius Annotation Release 104

The RefSeq genome records for Esox lucius were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Similarity of current and previous assembly: The similarity of the current and previous assembly
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Esox lucius Annotation Release 104.

Annotation release ID: 104
Date of Entrez queries for transcripts and proteins: May 1 2020
Date of submission of annotation to the public databases: May 11 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
fEsoLuc1.pri	GCF_011004845.1	Vertebrate Genomes Project	03-05-2020	Reference	26 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and length of annotated features are provided below for each assembly.

Feature counts

Feature	fEsoLuc1.pri
Genes and pseudogenes	30,105
protein-coding	24,647
non-coding	5,018
transcribed pseudogenes	1
non-transcribed pseudogenes	358
genes with variants	12,944
immunoglobulin/T-cell receptor gene segments	81
other	0
mRNAs	58,098
fully-supported	56,606
with > 5% ab initio	713
partial	149
with filled gap(s)	3
known RefSeq (NM_)	852
model RefSeq (XM_)	57,246
non-coding RNAs	7,180
fully-supported	4,621
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	5,621
pseudo transcripts	1
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1
CDSs	58,111
fully-supported	56,606
with > 5% ab initio	803
partial	146
with major correction(s)	361
known RefSeq (NP_)	865
model RefSeq (XP_)	57,246

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	29,665	19,065	7,968	55	1,152,720
All transcripts	65,278	3,520	2,898	55	87,642
mRNA	58,098	3,804	3,144	111	87,642
misc_RNA	1,337	3,186	2,671	126	16,504
tRNA	1,557	75	73	67	84
lncRNA	3,288	1,282	854	92	15,749
snoRNA	255	116	119	60	306
snRNA	287	141	141	55	199
guide_RNA	11	207	162	128	376
rRNA	445	137	119	119	3,928
Single-exon transcripts	691	1,734	1,332	324	8,136
coding transcripts (NM_/XM_ )	690	1,725	1,332	324	7,589
non-coding transcripts (NR_/XR_ )	1	8,136	8,136	8,136	8,136
CDSs	58,111	2,281	1,617	99	87,270
Exons	315,234	302	140	1	22,308
in coding transcripts (NM_/XM_ )	302,998	300	140	1	22,308
in non-coding transcripts (NR_/XR_ )	21,969	285	132	2	8,136
Introns	283,497	2,231	435	26	1,101,188
in coding transcripts (NM_/XM_ )	274,861	2,224	436	26	1,101,188
in non-coding transcripts (NR_/XR_ )	18,169	2,345	432	30	213,241

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.27	1	1	50
Number of exons per transcript	13.4	10	1	223

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 24,634 coding genes, 22,544 genes had a protein with an alignment covering 50% or more of the query and 10,544 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
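
The two definitions above reduce to the same calculation applied to different sequences. The sketch below is only an illustration: it assumes an alignment is summarized by 1-based inclusive start and end positions and ignores gaps inside the alignment. The function name and the example coordinates are hypothetical and are not taken from the pipeline.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence spanned by an alignment (1-based, inclusive)."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if aln_end < aln_start:
        raise ValueError("aln_end must not be smaller than aln_start")
    aligned = aln_end - aln_start + 1
    return 100.0 * aligned / seq_length


# Hypothetical alignment: residues 11-200 of a 200-residue annotated protein
# (the query) aligned to residues 1-190 of a 380-residue curated protein
# (the target).
query_cov = coverage_percent(11, 200, 200)
target_cov = coverage_percent(1, 190, 380)
print(f"query coverage: {query_cov:.1f}%, target coverage: {target_cov:.1f}%")
```

In this invented case the same alignment covers most of the query but only half of the target, which is why the two measures are reported separately.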

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
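
One point on such a cumulative curve is the number of genes whose best alignment reaches a given coverage threshold. A minimal sketch, assuming the best coverage per gene has already been collected into a mapping (the gene names and coverage values below are invented for illustration and do not come from this annotation):

```python
from typing import Mapping, Sequence


def cumulative_counts(best_coverage: Mapping[str, float],
                      thresholds: Sequence[float]) -> dict:
    """For each threshold, count genes whose best coverage is at or above it."""
    return {
        t: sum(1 for cov in best_coverage.values() if cov >= t)
        for t in thresholds
    }


# Hypothetical per-gene best query coverage, in percent.
toy_coverage = {"geneA": 100.0, "geneB": 97.5, "geneC": 62.0, "geneD": 31.0}
counts = cumulative_counts(toy_coverage, [50, 95])
print(counts)
```

Evaluated at the 50% and 95% thresholds on the real data, this kind of count corresponds to the two gene totals quoted at the start of this section.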

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
fEsoLuc1.pri	GCF_011004845.1	29.12%	31.61%
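
As a rough illustration of what a "% masked" figure measures, the sketch below computes the share of soft-masked bases in a sequence. It assumes the common convention that masked bases are written in lowercase; the report itself does not state how masking is encoded, and the function name and toy sequence are hypothetical.

```python
def percent_masked(sequence: str) -> float:
    """Percentage of bases that are soft-masked (lowercase) in a sequence."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence: 20 bases, 6 of them lowercase (assumed to mean "masked").
toy_sequence = "ACGTacgtacGTACGTACGT"
toy_masked = percent_masked(toy_sequence)
print(f"{toy_masked:.2f}% masked")
```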

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	863	861 (99.77%)	847 (98.15%)	99.50%	99.77%
Same-species GenBank	1,391	1,388 (99.78%)	1,351 (97.12%)	99.44%	98.84%
Same-species EST	32,833	31,968 (97.37%)	30,689 (93.47%)	99.16%	99.28%
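
Both percentage columns in the table above appear to use the number of sequences retrieved from Entrez as the denominator, since the reported values can be reproduced that way. The sketch below recomputes them from the counts in the table; the helper name is illustrative only.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# (retrieved from Entrez, aligned by Splign, passed to Gnomon),
# copied from the transcript alignment table above.
transcript_counts = {
    "Same-species known RefSeq (NM_/NR_)": (863, 861, 847),
    "Same-species GenBank": (1391, 1388, 1351),
    "Same-species EST": (32833, 31968, 30689),
}

for source, (retrieved, aligned, passed) in transcript_counts.items():
    print(f"{source}: {percent(aligned, retrieved)}% aligned, "
          f"{percent(passed, retrieved)}% passed to Gnomon")
```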

RefSeq transcript alignment quality report

The known RefSeq transcripts (NM_ and NR_ accessions) are a set of high-quality transcripts maintained by the RefSeq group at NCBI. Alignment statistics for this group of transcripts, such as the number and percentage of sequences not aligning at all, the percentage of best alignments split between multiple scaffolds, and the percentage of alignments not covering the full CDS, are indicative of the genome quality and are provided below.

	fEsoLuc1.pri Primary Assembly
Number of sequences retrieved from Entrez	863
Number (%) of sequences not aligning	2 (0.23%)
Number (%) of sequences with multiple best alignments (split genes)	0 (0.00%)
Number (%) of sequences with CDS coverage < 95%	1 (0.12%)

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,675,379,920	81%	28%	366,368
SAMN02724934	25069045,12742752	brain (Esox lucius, 1 year, male, SAMN02724934)	57,933,196	83%	21%	235,552
SAMN02724935	25069045,12742752	eye (Esox lucius, 1 year, male, SAMN02724935)	60,150,442	81%	27%	218,268
SAMN02724936	25069045,12742752	gill (Esox lucius, 1 year, male, SAMN02724936)	58,499,888	86%	31%	214,624
SAMN02724937	25069045,12742752	gut (Esox lucius, 1 year, male, SAMN02724937)	60,466,858	79%	23%	184,455
SAMN02724938	25069045,12742752	head kidney (Esox lucius, 1 year, male, SAMN02724938)	61,054,936	88%	30%	206,122
SAMN02724943	25069045,12742752	heart (Esox lucius, 1 year, male, SAMN02724943)	60,280,526	78%	30%	199,404
SAMN02724948	25069045,12742752	kidney (Esox lucius, 1 year, male, SAMN02724948)	60,694,314	85%	31%	221,614
SAMN02724950	25069045,12742752	liver (Esox lucius, 1 year, male, SAMN02724950)	60,306,770	89%	33%	147,070
SAMN02724952	25069045,12742752	muscle (Esox lucius, 1 year, male, SAMN02724952)	60,608,932	88%	38%	152,724
SAMN02724953	25069045,12742752	nose (Esox lucius, 1 year, male, SAMN02724953)	60,306,770	86%	25%	206,715
SAMN02724954	25069045,12742752	stomach (Esox lucius, 1 year, male, SAMN02724954)	59,331,610	83%	24%	142,156
SAMN02724955	25069045,12742752	spleen (Esox lucius, 1 year, male, SAMN02724955)	61,731,442	88%	25%	213,931
SAMN02724956	25069045,12742752	testis (Esox lucius, 1 year, male, SAMN02724956)	57,502,030	95%	34%	202,649
SAMN02944913	NA	Ovary (Esox lucius, female, SAMN02944913)	40,103,680	81%	32%	206,077
SAMN02944914	NA	Brain (Esox lucius, female, SAMN02944914)	90,316,846	83%	22%	257,241
SAMN02944915	NA	Gills (Esox lucius, female, SAMN02944915)	67,698,346	83%	27%	231,105
SAMN02944916	NA	Heart (Esox lucius, female, SAMN02944916)	119,526,094	64%	26%	229,302
SAMN02944917	NA	Muscle (Esox lucius, female, SAMN02944917)	60,086,582	89%	40%	131,976
SAMN02944918	NA	Liver (Esox lucius, female, SAMN02944918)	106,281,742	68%	29%	185,060
SAMN02944919	NA	Head kidney (Esox lucius, female, SAMN02944919)	59,582,004	84%	26%	214,376
SAMN02944920	NA	Bones (Esox lucius, female, SAMN02944920)	38,323,872	85%	32%	233,349
SAMN02944921	NA	Intestine (Esox lucius, female, SAMN02944921)	56,101,408	67%	26%	215,118
SAMN02944922	NA	Testis (Esox lucius, male, SAMN02944922)	69,201,512	66%	21%	263,448
SAMN02944923	NA	Embryos (Esox lucius, male and female, SAMN02944923)	98,644,258	85%	31%	263,325
SAMN10473268	NA	retina (Esox lucius, not determined, SAMN10473268)	29,438,714	72%	24%	203,555
SAMN10473269	NA	retina (Esox lucius, not determined, SAMN10473269)	61,207,148	81%	14%	226,882

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR1228710	SRX514235	SRP040114	SAMN02724934	57,933,196	83%	21%
SRR1228711	SRX514236	SRP040114	SAMN02724935	60,150,442	81%	27%
SRR1228712	SRX514237	SRP040114	SAMN02724936	58,499,888	86%	31%
SRR1228713	SRX514238	SRP040114	SAMN02724937	60,466,858	79%	23%
SRR1228714	SRX514240	SRP040114	SAMN02724938	61,054,936	88%	30%
SRR1228723	SRX514258	SRP040114	SAMN02724943	60,280,526	78%	30%
SRR1228724	SRX514263	SRP040114	SAMN02724948	60,694,314	85%	31%
SRR1228725	SRX514266	SRP040114	SAMN02724950	60,306,770	89%	33%
SRR1228726	SRX514267	SRP040114	SAMN02724952	60,608,932	88%	38%
SRR1228727	SRX514268	SRP040114	SAMN02724953	60,306,770	86%	25%
SRR1228729	SRX514269	SRP040114	SAMN02724954	59,331,610	83%	24%
SRR1228730	SRX514270	SRP040114	SAMN02724955	61,731,442	88%	25%
SRR1228731	SRX514271	SRP040114	SAMN02724956	57,502,030	95%	34%
SRR1533651	SRX667246	SRP045141	SAMN02944913	40,103,680	81%	32%
SRR1533652	SRX667247	SRP045141	SAMN02944914	90,316,846	83%	22%
SRR1533653	SRX667248	SRP045141	SAMN02944915	67,698,346	83%	27%
SRR1533654	SRX667249	SRP045141	SAMN02944916	119,526,094	64%	26%
SRR1533655	SRX667250	SRP045141	SAMN02944917	60,086,582	89%	40%
SRR1533656	SRX667251	SRP045141	SAMN02944918	106,281,742	68%	29%
SRR1533657	SRX667252	SRP045141	SAMN02944919	59,582,004	84%	26%
SRR1533658	SRX667253	SRP045141	SAMN02944920	38,323,872	85%	32%
SRR1533659	SRX667254	SRP045141	SAMN02944921	56,101,408	67%	26%
SRR1533661	SRX667256	SRP045141	SAMN02944922	69,201,512	66%	21%
SRR1533660	SRX667255	SRP045141	SAMN02944923	98,644,258	85%	31%
SRR8242440	SRX5060682	SRP126129	SAMN10473268	29,438,714	72%	24%
SRR8242439	SRX5060683	SRP126129	SAMN10473269	61,207,148	81%	14%
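
The read count in the 'All' row of the by-sample table can be cross-checked against the by-run table: summing the 26 runs listed above reproduces the aggregate of 1,675,379,920 reads. The sketch below performs that sum and also groups the reads by SRA project; the variable names are illustrative only.

```python
from collections import defaultdict

# (project, number of reads) for each run, copied from the by-run table above.
runs = [
    ("SRP040114", 57_933_196), ("SRP040114", 60_150_442),
    ("SRP040114", 58_499_888), ("SRP040114", 60_466_858),
    ("SRP040114", 61_054_936), ("SRP040114", 60_280_526),
    ("SRP040114", 60_694_314), ("SRP040114", 60_306_770),
    ("SRP040114", 60_608_932), ("SRP040114", 60_306_770),
    ("SRP040114", 59_331_610), ("SRP040114", 61_731_442),
    ("SRP040114", 57_502_030),
    ("SRP045141", 40_103_680), ("SRP045141", 90_316_846),
    ("SRP045141", 67_698_346), ("SRP045141", 119_526_094),
    ("SRP045141", 60_086_582), ("SRP045141", 106_281_742),
    ("SRP045141", 59_582_004), ("SRP045141", 38_323_872),
    ("SRP045141", 56_101_408), ("SRP045141", 69_201_512),
    ("SRP045141", 98_644_258),
    ("SRP126129", 29_438_714), ("SRP126129", 61_207_148),
]

# Accumulate reads per SRA project.
reads_per_project = defaultdict(int)
for project, n_reads in runs:
    reads_per_project[project] += n_reads

total_reads = sum(reads_per_project.values())
for project, n_reads in sorted(reads_per_project.items()):
    print(f"{project}\t{n_reads:,}")
print(f"All\t{total_reads:,}")
```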

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopteri GenBank	84,266	51,855 (61.54%)	51,855 (61.54%)	68.85%	80.87%
Actinopteri known RefSeq (NP_)	24,136	22,852 (94.68%)	22,852 (94.68%)	69.06%	80.21%
Same-species GenBank	1,385	493 (35.60%)	493 (35.60%)	80.90%	85.17%
Same-species known RefSeq (NP_)	863	856 (99.19%)	856 (99.19%)	79.92%	88.86%
Homo sapiens GenBank	144,551	70,463 (48.75%)	70,463 (48.75%)	59.59%	71.35%
Homo sapiens known RefSeq (NP_)	57,162	37,749 (66.04%)	37,749 (66.04%)	67.31%	71.50%

Assembly-assembly alignments of current to previous assembly

When the assembly changes between two rounds of annotation, genes in the current and the previous annotation are mapped to each other using the genomic alignments of the current assembly to the previous assembly so that gene identifiers can be preserved. The success of the remapping depends largely on how well the two assembly versions align to each other.

Below are the percent coverage of one assembly by the other and the average percent identity of the alignments. The 'First pass' alignments are reciprocal best hits, while the 'Total' alignments also include 'Second pass' or non-reciprocal best alignments. For more information about the assembly-assembly alignment process, please visit the NCBI Genome Remapping Service page.

	First Pass	Total
fEsoLuc1.pri (Current) Coverage	94.65%	95.57%
Eluc_v4 (Previous) Coverage	92.42%	94.86%
Percent Identity	98.99%	98.90%
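
The distinction between reciprocal and non-reciprocal best hits can be illustrated generically. The sketch below is not the NCBI alignment procedure, which this report does not describe in detail; it only shows the idea on invented region names and scores, with hypothetical function names.

```python
def best_hits(scores: dict) -> dict:
    """Map each region to its highest-scoring partner in the other assembly."""
    return {
        region: max(partners, key=partners.get)
        for region, partners in scores.items()
        if partners
    }


def reciprocal_best_hits(a_to_b: dict, b_to_a: dict) -> list:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_to_b)
    best_ba = best_hits(b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented alignment scores between regions of a current assembly (c*)
# and a previous assembly (p*).
current_to_previous = {
    "c1": {"p1": 90, "p2": 40},
    "c2": {"p2": 80},
    "c3": {"p2": 70},
}
previous_to_current = {
    "p1": {"c1": 90},
    "p2": {"c1": 40, "c2": 80, "c3": 70},
}

first_pass = reciprocal_best_hits(current_to_previous, previous_to_current)
print(first_pass)
```

In this toy case c3 has p2 as its best hit, but p2 prefers c2, so the c3 to p2 pair is a non-reciprocal best alignment of the kind that would only be counted in the 'Total' column.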

Comparison of the current and previous annotations

The annotation produced for this release (104) was compared to the annotation in the previous release (103) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	fEsoLuc1.pri (Current) to Eluc_v4 (Previous)
Identical	19%
Minor changes	64%
Major changes	5%
New	10%
Deprecated	37%
Other	2%
Download the report	tabular, Genome Workbench
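
A quick consistency check on the table above: every category except 'Deprecated' sums to 100%. This fits the reading that those categories partition the current gene set, while 'Deprecated' counts previous-release genes expressed relative to the current gene count. That reading is an inference from the numbers, not something the report states. A minimal sketch of the check:

```python
# Percentages copied from the comparison table above.
changes = {
    "Identical": 19,
    "Minor changes": 64,
    "Major changes": 5,
    "New": 10,
    "Deprecated": 37,
    "Other": 2,
}

# Assumption: only 'Deprecated' refers to genes absent from the current release.
current_release_total = sum(
    pct for category, pct in changes.items() if category != "Deprecated"
)
print(f"Categories other than Deprecated sum to {current_release_total}%")
```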

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
