NCBI Oncorhynchus kisutch Annotation Release 100

The RefSeq genome records for Oncorhynchus kisutch were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Oncorhynchus kisutch Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Mar 9 2017
Date of submission of annotation to the public databases: Mar 14 2017
Software version: 7.3

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Okis_V1	GCF_002021735.1	University of Victoria	03-06-2017	Reference	31 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Okis_V1
Genes and pseudogenes	46,096
protein-coding	36,425
non-coding	4,754
pseudogenes	4,917
genes with variants	11,142
mRNAs	57,579
fully-supported	50,348
with > 5% ab initio	3,600
partial	3,189
with filled gap(s)	1,373
known RefSeq (NM_)	0
model RefSeq (XM_)	57,579
Other RNAs	6,737
fully-supported	6,090
with > 5% ab initio	0
partial	4
with filled gap(s)	4
known RefSeq (NR_)	0
model RefSeq (XR_)	6,090
CDSs	57,655
fully-supported	50,348
with > 5% ab initio	4,133
partial	3,111
with major correction(s)	3,372
known RefSeq (NP_)	0
model RefSeq (XP_)	57,579
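
The gene rows in the table above can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that the three rows listed under "Genes and pseudogenes" partition that total and that "mRNAs" plus "Other RNAs" gives the full transcript count. The dictionary keys and function names are illustrative and are not part of the report.

```python
# Counts copied from the "Feature counts" table for Okis_V1.
counts = {
    "genes_and_pseudogenes": 46_096,
    "protein_coding": 36_425,
    "non_coding": 4_754,
    "pseudogenes": 4_917,
    "mRNAs": 57_579,
    "other_RNAs": 6_737,
}


def gene_total(c):
    """Sum the three gene categories listed under 'Genes and pseudogenes'."""
    return c["protein_coding"] + c["non_coding"] + c["pseudogenes"]


def transcript_total(c):
    """Sum coding and non-coding transcripts."""
    return c["mRNAs"] + c["other_RNAs"]


print(gene_total(counts) == counts["genes_and_pseudogenes"])  # True
print(transcript_total(counts))  # 64316
```

The second value matches the "All transcripts" count in the feature lengths table.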

Detailed reports

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	41,179	23,056	9,291	69	1,397,543
All transcripts	64,316	2,553	2,129	63	80,021
mRNA	57,579	2,743	2,283	105	80,021
misc_RNA	1,041	2,177	1,872	99	8,587
tRNA	647	74	73	69	84
lncRNA	5,049	782	557	63	6,542
Single-exon transcripts	1,278	1,541	1,329	277	11,717
coding transcripts (NM_/XM_ )	1,278	1,541	1,329	277	11,717
CDSs	57,579	1,738	1,311	105	79,818
Exons	391,824	258	136	1	17,508
in coding transcripts (NM_/XM_ )	375,557	259	136	1	17,508
in non-coding transcripts (NR_/XR_ )	21,520	226	117	2	5,227
Introns	348,341	2,689	412	30	1,115,377
in coding transcripts (NM_/XM_ )	336,943	2,693	412	30	1,115,377
in non-coding transcripts (NR_/XR_ )	16,395	2,367	390	30	120,630

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.57	1	1	36
Number of exons per transcript	10.21	8	1	203

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 36,349 coding genes, 33,905 genes had a protein with an alignment covering 50% or more of the query and 15,450 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
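
The two coverage definitions above reduce to the same calculation applied to different sequences. The sketch below assumes 1-based inclusive coordinates and treats the aligned region as a single contiguous span. How the pipeline handles gaps or multiple high-scoring segments is not stated in this report. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percent of a sequence spanned by an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned = aligned_end - aligned_start + 1
    return 100.0 * aligned / sequence_length


# Hypothetical alignment: residues 1-380 of a 400-residue annotated protein
# (query) aligned to residues 21-400 of a 500-residue curated protein (target).
query_cov = coverage_percent(1, 380, 400)
target_cov = coverage_percent(21, 400, 500)
print(query_cov, target_cov)  # 95.0 76.0
```

In this made-up case the annotated protein would count toward the "95% or more of the query" group even though only 76% of the target is covered.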

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
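
A cumulative count of the kind plotted in the graph can be sketched as follows. This is an illustration only: the per-gene coverage values are invented, and the function name is hypothetical. It assumes one best coverage value per gene, expressed as a percentage.

```python
def genes_at_or_above(coverages, thresholds):
    """For each threshold, count genes whose best coverage is >= that threshold."""
    return {t: sum(1 for c in coverages if c >= t) for t in thresholds}


# Hypothetical best query coverage (percent) for six genes.
example = [100.0, 97.5, 80.0, 62.0, 49.9, 12.0]
print(genes_at_or_above(example, [50, 95]))  # {50: 4, 95: 2}
```

Applied to the real per-gene coverages at thresholds of 50 and 95, the same counting would yield the two figures quoted in the text above.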

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Okis_V1	GCF_002021735.1	4.97%	44.82%
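
A masked percentage like those in the table can be estimated from a sequence file. The sketch below assumes soft-masking, where masked bases are written in lowercase. The report does not say how masking is represented or whether gap characters are excluded from the denominator, so both points are assumptions. The function name is hypothetical.

```python
def percent_masked(sequence):
    """Percent of bases that are lowercase (assumed soft-masked) in a sequence."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy sequence: 6 of 12 bases are lowercase.
print(percent_masked("ACGTacgtacGT"))  # 50.0
```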

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	88	88 (100.00%)	85 (96.59%)	99.29%	97.57%
Same-species EST	4,938	4,321 (87.51%)	3,701 (74.95%)	99.21%	98.77%
Oncorhynchus known RefSeq (NM_/NR_)	1,257	1,253 (99.68%)	891 (70.88%)	97.31%	96.77%
Oncorhynchus Genbank	5,099	4,957 (97.22%)	3,069 (60.19%)	97.04%	92.30%
Oncorhynchus EST	313,149	256,509 (81.91%)	217,462 (69.44%)	96.13%	97.69%
Salmo salar known RefSeq (NM_/NR_)	3,557	3,519 (98.93%)	2,081 (58.50%)	95.11%	95.56%
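
The "Number (%)" cells in the table combine a count with a percentage of the sequences retrieved from Entrez. The following sketch parses one such cell and recomputes the percentage, assuming the percentage is the count divided by the retrieved total, rounded to two decimals. The regular expression and function names are illustrative.

```python
import re

# Matches cells such as "4,321 (87.51%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell):
    """Split a 'count (percent%)' cell into an int and a float."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))


def recomputed_percent(part, total):
    """Percentage of total represented by part, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Same-species EST row: 4,938 retrieved, "4,321 (87.51%)" aligned by Splign.
retrieved = 4938
aligned, reported = parse_count_percent("4,321 (87.51%)")
print(aligned, reported, recomputed_percent(aligned, retrieved))  # 4321 87.51 87.51
```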

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	NA	Aggregate of all aligned samples	3,073,532,352	85%	29%	479,043
SAMEA3113919	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113919)	115,286,916	91%	37%	184,563
SAMEA3113920	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113920)	121,987,844	91%	38%	199,801
SAMEA3113921	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113921)	130,741,332	90%	37%	202,863
SAMEA3113922	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113922)	116,467,944	90%	38%	178,074
SAMEA3113923	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113923)	112,253,422	90%	37%	228,783
SAMEA3113924	NA	Food restricted wild-type coho salmon (Oncorhynchus kisutch, SAMEA3113924)	109,831,640	91%	38%	205,264
SAMEA3113925	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113925)	117,875,722	90%	38%	191,572
SAMEA3113926	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113926)	122,982,720	88%	36%	205,352
SAMEA3113927	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113927)	99,329,750	88%	34%	134,561
SAMEA3113928	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113928)	125,176,784	89%	35%	214,132
SAMEA3113929	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113929)	119,374,988	89%	36%	216,918
SAMEA3113930	NA	Food restricted transgenic coho salmon (Oncorhynchus kisutch, SAMEA3113930)	126,028,302	89%	36%	212,234
SAMN03733566	NA	kidney (Oncorhynchus kisutch, male, SAMN03733566)	198,535,462	84%	26%	320,348
SAMN03733567	NA	kidney (Oncorhynchus kisutch, female, SAMN03733567)	243,560,512	83%	25%	321,630
SAMN03733568	NA	spleen (Oncorhynchus kisutch, male, SAMN03733568)	247,350,372	86%	24%	311,490
SAMN03733569	NA	spleen (Oncorhynchus kisutch, female, SAMN03733569)	181,729,326	85%	24%	307,327
SAMN03733570	NA	liver (Oncorhynchus kisutch, male, SAMN03733570)	252,616,458	84%	22%	251,736
SAMN03733571	NA	liver (Oncorhynchus kisutch, female, SAMN03733571)	197,466,942	85%	24%	240,960
SAMN03983865	26614614	adipose (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983865)	37,824,258	74%	18%	260,594
SAMN03983866	26614614	brain (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983866)	38,591,268	59%	10%	230,499
SAMN03983867	26614614	gut+gill (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983867)	38,454,562	71%	16%	269,675
SAMN03983868	26614614	head_kidney+spleen (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983868)	36,567,820	74%	17%	276,013
SAMN03983869	26614614	hypothalmus+pituitary (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983869)	35,719,720	70%	15%	282,810
SAMN03983870	26614614	liver (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983870)	37,008,542	77%	19%	144,786
SAMN03983871	26614614	muscle+gill (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983871)	36,609,616	71%	19%	230,280
SAMN03983872	26614614	ovary (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983872)	37,238,800	77%	19%	256,268
SAMN03983873	26614614	testis (Oncorhynchus kisutch, post-smolt, mixed, SAMN03983873)	36,921,330	72%	16%	297,005

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
ERR673469	ERX628447	ERP008650	SAMEA3113919	115,286,916	91%	37%
ERR673470	ERX628448	ERP008650	SAMEA3113920	121,987,844	91%	38%
ERR673471	ERX628449	ERP008650	SAMEA3113921	130,741,332	90%	37%
ERR673472	ERX628450	ERP008650	SAMEA3113922	116,467,944	90%	38%
ERR673473	ERX628451	ERP008650	SAMEA3113923	112,253,422	90%	37%
ERR673474	ERX628452	ERP008650	SAMEA3113924	109,831,640	91%	38%
ERR673475	ERX628453	ERP008650	SAMEA3113925	117,875,722	90%	38%
ERR673476	ERX628454	ERP008650	SAMEA3113926	122,982,720	88%	36%
ERR673477	ERX628455	ERP008650	SAMEA3113927	99,329,750	88%	34%
ERR673478	ERX628456	ERP008650	SAMEA3113928	125,176,784	89%	35%
ERR673479	ERX628457	ERP008650	SAMEA3113929	119,374,988	89%	36%
ERR673480	ERX628458	ERP008650	SAMEA3113930	126,028,302	89%	36%
SRR2039712	SRX1037831	SRP058682	SAMN03733566	66,090,024	84%	26%
SRR2039713	SRX1037831	SRP058682	SAMN03733566	66,201,240	84%	26%
SRR2039714	SRX1037831	SRP058682	SAMN03733566	66,244,198	84%	26%
SRR2039715	SRX1037832	SRP058682	SAMN03733567	81,017,006	83%	25%
SRR2039743	SRX1037832	SRP058682	SAMN03733567	81,157,332	83%	25%
SRR2039744	SRX1037832	SRP058682	SAMN03733567	81,386,174	83%	25%
SRR2039803	SRX1038463	SRP058682	SAMN03733568	82,359,624	86%	24%
SRR2039804	SRX1038463	SRP058682	SAMN03733568	82,472,456	86%	24%
SRR2039805	SRX1038463	SRP058682	SAMN03733568	82,518,292	86%	24%
SRR2039806	SRX1038464	SRP058682	SAMN03733569	60,491,066	85%	24%
SRR2039807	SRX1038464	SRP058682	SAMN03733569	60,619,534	85%	24%
SRR2039808	SRX1038464	SRP058682	SAMN03733569	60,618,726	85%	24%
SRR2039745	SRX1037833	SRP058682	SAMN03733570	84,060,774	84%	22%
SRR2039746	SRX1037833	SRP058682	SAMN03733570	84,231,334	84%	22%
SRR2039747	SRX1037833	SRP058682	SAMN03733570	84,324,350	84%	22%
SRR2039748	SRX1038065	SRP058682	SAMN03733571	65,719,392	85%	24%
SRR2039749	SRX1038065	SRP058682	SAMN03733571	65,907,266	84%	24%
SRR2039750	SRX1038065	SRP058682	SAMN03733571	65,840,284	85%	24%
SRR2157178	SRX1143852	SRP062344	SAMN03983865	37,824,258	74%	18%
SRR2157182	SRX1143861	SRP062344	SAMN03983866	38,591,268	59%	10%
SRR2157180	SRX1143864	SRP062344	SAMN03983867	38,454,562	71%	16%
SRR2157183	SRX1143869	SRP062344	SAMN03983868	36,567,820	74%	17%
SRR2157184	SRX1143870	SRP062344	SAMN03983869	35,719,720	70%	15%
SRR2157185	SRX1143871	SRP062344	SAMN03983870	37,008,542	77%	19%
SRR2157186	SRX1143872	SRP062344	SAMN03983871	36,609,616	71%	19%
SRR2157187	SRX1143873	SRP062344	SAMN03983872	37,238,800	77%	19%
SRR2157188	SRX1143874	SRP062344	SAMN03983873	36,921,330	72%	16%
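
The by-sample read counts appear to be sums of the by-run counts for each sample. A minimal sketch of that aggregation follows, using four rows copied from the by-run table. The tuple layout and function name are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR2039712", "SAMN03733566", 66_090_024),
    ("SRR2039713", "SAMN03733566", 66_201_240),
    ("SRR2039714", "SAMN03733566", 66_244_198),
    ("ERR673469", "SAMEA3113919", 115_286_916),
]


def reads_per_sample(rows):
    """Sum read counts over all runs that belong to the same sample."""
    totals = defaultdict(int)
    for _run, sample, reads in rows:
        totals[sample] += reads
    return dict(totals)


totals = reads_per_sample(runs)
print(totals["SAMN03733566"])  # 198535462
print(totals["SAMEA3113919"])  # 115286916
```

Both totals match the corresponding rows in the by-sample table.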

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	77,108	74,388 (96.47%)	74,388 (96.47%)	71.88%	81.16%
Actinopterygii known RefSeq (NP_)	24,681	23,917 (96.90%)	23,917 (96.90%)	70.83%	79.30%
Same-species GenBank	85	85 (100.00%)	85 (100.00%)	77.31%	85.66%
Homo sapiens known RefSeq (NP_)	45,680	39,035 (85.45%)	39,035 (85.45%)	65.73%	68.57%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
