NCBI Cynoglossus semilaevis Annotation Release 102

The RefSeq genome records for Cynoglossus semilaevis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cynoglossus semilaevis Annotation Release 102.

Annotation release ID: 102
Date of Entrez queries for transcripts and proteins: May 2 2018
Date of submission of annotation to the public databases: May 16 2018
Software version: 8.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Cse_v1.0	GCF_000523025.1	Beijing Genomics Institute	01-28-2014	Reference	23 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Cse_v1.0
Genes and pseudogenes	25,771
protein-coding	21,395
non-coding	4,109
transcribed pseudogenes	3
non-transcribed pseudogenes	252
genes with variants	8,706
immunoglobulin/T-cell receptor gene segments	12
other	0
mRNAs	39,226
fully-supported	37,640
with > 5% ab initio	331
partial	641
with filled gap(s)	6
known RefSeq (NM_)	112
model RefSeq (XM_)	39,114
non-coding RNAs	6,808
fully-supported	5,657
with > 5% ab initio	0
partial	2
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	5,942
pseudo transcripts	3
fully-supported	2
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3
CDSs	39,251
fully-supported	37,640
with > 5% ab initio	462
partial	635
with major correction(s)	1,075
known RefSeq (NP_)	112
model RefSeq (XP_)	39,127
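
As a quick consistency check on the table above, the sketch below adds up the gene sub-categories and the known and model mRNA counts. The report does not state which rows are additive; the grouping used here is an assumption inferred from the numbers (for example, "genes with variants" is treated as a subset rather than a separate class), and the variable names are illustrative.

```python
# Gene classes copied from the Feature counts table (Cse_v1.0).
# Assumption: these rows partition "Genes and pseudogenes";
# "genes with variants" is left out because it overlaps other classes.
gene_classes = {
    "protein-coding": 21_395,
    "non-coding": 4_109,
    "transcribed pseudogenes": 3,
    "non-transcribed pseudogenes": 252,
    "immunoglobulin/T-cell receptor gene segments": 12,
    "other": 0,
}

total_genes_and_pseudogenes = sum(gene_classes.values())

# mRNAs split into known (NM_) and model (XM_) RefSeq records.
known_mrnas = 112
model_mrnas = 39_114
total_mrnas = known_mrnas + model_mrnas

print(f"{total_genes_and_pseudogenes:,}")  # 25,771
print(f"{total_mrnas:,}")                  # 39,226
```

Both sums agree with the totals reported in the "Genes and pseudogenes" and "mRNAs" rows.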

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,504	13,712	6,058	55	1,167,912
All transcripts	46,034	3,195	2,632	55	98,410
mRNA	39,226	3,438	2,852	232	98,410
misc_RNA	1,441	3,594	2,904	240	21,022
tRNA	864	74	73	68	84
lncRNA	4,216	1,651	1,217	81	18,354
snoRNA	179	114	93	64	297
snRNA	79	148	164	55	198
guide_RNA	8	214	272	96	379
rRNA	21	233	119	119	1,684
Single-exon transcripts	799	1,974	1,680	232	14,244
coding transcripts (NM_/XM_ )	799	1,974	1,680	232	14,244
CDSs	39,239	2,035	1,509	96	96,960
Exons	273,336	299	141	1	15,642
in coding transcripts (NM_/XM_ )	258,697	290	140	1	15,374
in non-coding transcripts (NR_/XR_ )	23,261	358	141	2	15,642
Introns	242,033	1,364	283	30	1,158,438
in coding transcripts (NM_/XM_ )	231,854	1,279	275	30	1,158,438
in non-coding transcripts (NR_/XR_ )	18,793	2,400	422	30	387,301

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.83	1	1	50
Number of exons per transcript	12.19	9	1	253

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 21,382 coding genes, 20,171 genes had a protein with an alignment covering 50% or more of the query and 9,557 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
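
To make the two definitions concrete, here is a minimal sketch that computes a coverage percentage from alignment coordinates. The function name, the 1-based inclusive coordinate convention, and the example alignment are assumptions made for illustration; they are not taken from the pipeline.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / seq_length


# Hypothetical alignment: residues 11-200 of a 200-residue annotated
# protein (query) aligned to residues 1-190 of a 380-residue
# Swiss-Prot protein (target).
query_cov = coverage_percent(11, 200, 200)
target_cov = coverage_percent(1, 190, 380)

print(query_cov)   # 95.0
print(target_cov)  # 50.0
```

In this made-up case the alignment would count toward the "95% or more of the query" threshold, even though it covers only half of the target.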

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Cse_v1.0	GCF_000523025.1	4.11%	23.94%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	113	113 (100.00%)	109 (96.46%)	99.01%	98.65%
Same-species Genbank	301	298 (99.00%)	262 (87.04%)	99.27%	91.18%
Same-species EST	10,128	9,833 (97.09%)	9,236 (91.19%)	99.37%	98.94%
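
The percentages in this table appear to be computed relative to the number of sequences retrieved from Entrez, not relative to the number aligned. The sketch below reproduces the Same-species Genbank row under that reading; the helper name is hypothetical.

```python
def percent_of_retrieved(count: int, retrieved: int) -> str:
    """Format a count as a percentage of the retrieved sequences."""
    return f"{100 * count / retrieved:.2f}%"


# Same-species Genbank row from the table above.
retrieved, aligned, passed_to_gnomon = 301, 298, 262

aligned_pct = percent_of_retrieved(aligned, retrieved)
passed_pct = percent_of_retrieved(passed_to_gnomon, retrieved)

print(aligned_pct)  # 99.00%
print(passed_pct)   # 87.04%
```

Dividing the 262 sequences passed to Gnomon by the 298 aligned sequences instead would give about 87.92%, which does not match the reported 87.04%.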

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,311,493,626	80%	43%	292,098
SAMN00752404	24487278,9023104	Cynoglossus semilaevis transcriptome reads (Cynoglossus semilaevis, SAMN00752404)	221,631,632	76%	26%	265,268
SAMN01766803	NA	Half-smooth tongue sole 454 transcriptome (Cynoglossus semilaevis, SAMN01766803)	749,954	22%	29%	32,388
SAMN03753709	NA	Brain (Cynoglossus semilaevis, two year old, female, SAMN03753709)	54,454,762	72%	27%	214,283
SAMN03753710	NA	Brain (Cynoglossus semilaevis, two year old, male, SAMN03753710)	38,920,486	71%	30%	212,198
SAMN03753711	NA	Brain (Cynoglossus semilaevis, one year old, male, SAMN03753711)	52,202,976	72%	26%	200,410
SAMN03753712	NA	Brain (Cynoglossus semilaevis, one year old, female, SAMN03753712)	56,243,848	71%	24%	199,284
SAMN04556680	NA	liver (Cynoglossus semilaevis, 10 months, female, SAMN04556680)	170,175,062	85%	44%	203,059
SAMN04556699	NA	liver (Cynoglossus semilaevis, 10 months, female, SAMN04556699)	172,186,760	84%	43%	194,422
SAMN06859058	20227047	Gonad, liver, spleen, heart, kidney, gill, muscle, and brain tissues (Cynoglossus semilaevis, adult fish, SAMN06859058)	109,680,782	88%	53%	243,322
SAMN07665584	NA	Gonad, liver, spleen, heart, kidney, gill, muscle, and brain tissues (Cynoglossus semilaevis, adult fish, SAMN07665584)	100,084,344	74%	19%	189,220
SAMN08193245	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193245)	91,101,546	76%	44%	188,261
SAMN08193246	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193246)	87,678,192	89%	51%	179,980
SAMN08193247	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193247)	99,489,508	90%	51%	182,635
SAMN08193248	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193248)	91,701,036	75%	42%	188,182
SAMN08193249	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193249)	97,266,618	75%	41%	181,652
SAMN08193250	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193250)	90,339,732	82%	47%	184,042
SAMN08193251	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193251)	95,811,190	84%	45%	191,453
SAMN08193252	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193252)	83,187,382	78%	46%	190,346
SAMN08193253	NA	juvenile, liver (Cynoglossus semilaevis, 1 year old, SAMN08193253)	90,326,790	87%	46%	189,374
SAMN08493827	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493827)	106,581,728	79%	47%	222,166
SAMN08493828	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493828)	105,878,506	79%	47%	222,386
SAMN08493829	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493829)	100,828,908	80%	47%	220,644
SAMN08493830	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493830)	85,660,884	80%	46%	219,109
SAMN08493831	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493831)	91,134,342	80%	48%	223,426
SAMN08493832	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493832)	91,226,452	79%	47%	214,878
SAMN08493833	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493833)	101,900,430	79%	46%	226,025
SAMN08493834	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493834)	104,216,082	79%	49%	221,943
SAMN08493835	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493835)	100,320,038	78%	42%	221,727
SAMN08493836	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493836)	93,383,974	78%	46%	224,516
SAMN08493837	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493837)	104,672,142	80%	52%	226,418
SAMN08493838	NA	liver spleen kidney (Cynoglossus semilaevis, 1 year, SAMN08493838)	80,095,718	79%	47%	216,850
SAMN08628276	NA	liver, high salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628276)	56,908,994	85%	43%	156,005
SAMN08628277	NA	liver, high salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628277)	58,772,456	86%	48%	171,658
SAMN08628278	NA	liver, high salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628278)	54,493,612	82%	41%	168,879
SAMN08628279	NA	liver, low salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628279)	60,234,302	83%	46%	153,054
SAMN08628280	NA	liver, low salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628280)	56,006,034	86%	44%	164,083
SAMN08628281	NA	liver, low salinity (Cynoglossus semilaevis, 10 months, female, SAMN08628281)	55,946,424	84%	40%	161,865

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR800048	SRX257138	SRP020479	SAMN01766803	749,954	22%	29%
SRR365036	SRX106096	SRP045760	SAMN00752404	17,695,662	69%	17%
SRR365037	SRX106097	SRP045760	SAMN00752404	17,794,118	85%	23%
SRR365038	SRX106098	SRP045760	SAMN00752404	13,343,700	73%	25%
SRR365039	SRX106099	SRP045760	SAMN00752404	19,654,402	71%	22%
SRR365040	SRX106100	SRP045760	SAMN00752404	23,966,712	71%	28%
SRR365041	SRX106101	SRP045760	SAMN00752404	23,978,776	69%	29%
SRR365042	SRX106102	SRP045760	SAMN00752404	29,135,610	70%	27%
SRR365043	SRX106103	SRP045760	SAMN00752404	27,314,150	81%	31%
SRR365044	SRX106104	SRP045760	SAMN00752404	26,074,956	85%	26%
SRR371553	SRX107056	SRP045760	SAMN00752404	22,673,546	81%	26%
SRR2046146	SRX1044489	SRP058898	SAMN03753709	54,454,762	72%	27%
SRR2046147	SRX1044497	SRP058898	SAMN03753710	38,920,486	71%	30%
SRR2046153	SRX1044498	SRP058898	SAMN03753711	52,202,976	72%	26%
SRR2046154	SRX1044499	SRP058898	SAMN03753712	56,243,848	71%	24%
SRR3233779	SRX1639480	SRP071827	SAMN04556680	54,493,612	82%	41%
SRR3233780	SRX1639481	SRP071827	SAMN04556680	56,908,994	85%	43%
SRR6705638	SRX3679747	SRP071827	SAMN04556680	58,772,456	86%	48%
SRR3233772	SRX1634904	SRP071827	SAMN04556699	56,006,034	86%	44%
SRR3233773	SRX1639475	SRP071827	SAMN04556699	60,234,302	83%	46%
SRR6705640	SRX3679749	SRP071827	SAMN04556699	55,946,424	84%	40%
SRR5494302	SRX2775256	SRP106013	SAMN06859058	109,680,782	88%	53%
SRR6049694	SRX3196694	SRP118060	SAMN07665584	100,084,344	74%	19%
SRR6407772	SRX3500948	SRP127310	SAMN08193245	91,101,546	76%	44%
SRR6407771	SRX3500949	SRP127310	SAMN08193246	87,678,192	89%	51%
SRR6407770	SRX3500950	SRP127310	SAMN08193247	99,489,508	90%	51%
SRR6407769	SRX3500951	SRP127310	SAMN08193248	91,701,036	75%	42%
SRR6407776	SRX3500944	SRP127310	SAMN08193249	97,266,618	75%	41%
SRR6407775	SRX3500945	SRP127310	SAMN08193250	90,339,732	82%	47%
SRR6407774	SRX3500946	SRP127310	SAMN08193251	95,811,190	84%	45%
SRR6407773	SRX3500947	SRP127310	SAMN08193252	83,187,382	78%	46%
SRR6407768	SRX3500952	SRP127310	SAMN08193253	90,326,790	87%	46%
SRR6706118	SRX3680230	SRP132661	SAMN08493827	106,581,728	79%	47%
SRR6706119	SRX3680229	SRP132661	SAMN08493828	105,878,506	79%	47%
SRR6706120	SRX3680228	SRP132661	SAMN08493829	100,828,908	80%	47%
SRR6706121	SRX3680227	SRP132661	SAMN08493830	85,660,884	80%	46%
SRR6706114	SRX3680234	SRP132661	SAMN08493831	91,134,342	80%	48%
SRR6706117	SRX3680231	SRP132661	SAMN08493832	91,226,452	79%	47%
SRR6706124	SRX3680224	SRP132661	SAMN08493833	101,900,430	79%	46%
SRR6706125	SRX3680223	SRP132661	SAMN08493834	104,216,082	79%	49%
SRR6706115	SRX3680233	SRP132661	SAMN08493835	100,320,038	78%	42%
SRR6706116	SRX3680232	SRP132661	SAMN08493836	93,383,974	78%	46%
SRR6706122	SRX3680226	SRP132661	SAMN08493837	104,672,142	80%	52%
SRR6706123	SRX3680225	SRP132661	SAMN08493838	80,095,718	79%	47%
SRR6795854	SRX3754871	SRP133777	SAMN08628276	56,908,994	85%	43%
SRR6795853	SRX3754870	SRP133777	SAMN08628277	58,772,456	86%	48%
SRR6795852	SRX3754869	SRP133777	SAMN08628278	54,493,612	82%	41%
SRR6795851	SRX3754868	SRP133777	SAMN08628279	60,234,302	83%	46%
SRR6795849	SRX3754867	SRP133777	SAMN08628280	56,006,034	86%	44%
SRR6795848	SRX3754866	SRP133777	SAMN08628281	55,946,424	84%	40%
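
The per-sample read counts can be recovered by summing the per-run counts that share a sample ID. The sketch below does this for two samples using values copied from the by-run table above; the data structure and variable names are illustrative choices.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR3233779", "SAMN04556680", 54_493_612),
    ("SRR3233780", "SAMN04556680", 56_908_994),
    ("SRR6705638", "SAMN04556680", 58_772_456),
    ("SRR3233772", "SAMN04556699", 56_006_034),
    ("SRR3233773", "SAMN04556699", 60_234_302),
    ("SRR6705640", "SAMN04556699", 55_946_424),
]

# Sum the reads of all runs belonging to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in sorted(reads_per_sample.items()):
    print(f"{sample}\t{total:,}")
# SAMN04556680	170,175,062
# SAMN04556699	172,186,760
```

These totals match the "Number of reads" values listed for the same two samples in the by-sample table.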

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Seriola dumerili high-quality model RefSeq (XP_)	14,554	14,316 (98.36%)	14,316 (98.36%)	71.84%	80.71%
Poecilia formosa high-quality model RefSeq (XP_)	18,503	18,097 (97.81%)	18,097 (97.81%)	69.31%	76.86%
Actinopterygii GenBank	80,209	74,821 (93.28%)	74,821 (93.28%)	69.68%	78.79%
Actinopterygii known RefSeq (NP_)	24,872	23,424 (94.18%)	23,424 (94.18%)	68.11%	76.37%
Danio rerio high-quality model RefSeq (XP_)	8,030	7,619 (94.88%)	7,619 (94.88%)	65.49%	68.73%
Oryzias latipes high-quality model RefSeq (XP_)	17,157	16,836 (98.13%)	16,836 (98.13%)	69.40%	76.87%
Homo sapiens known RefSeq (NP_)	50,173	42,187 (84.08%)	42,187 (84.08%)	65.40%	66.81%

Comparison of the current and previous annotations

The annotation produced for this release (102) was compared to the annotation in the previous release (101) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.
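
The report does not give the scoring formulas. Purely as an illustration of the two ingredients named above (overlap in exon sequence and matches in exon boundaries), the sketch below computes a simple overlap fraction and a count of shared boundaries for a pair of features. The functions, the overlap measure, and the example coordinates are all assumptions and should not be read as the pipeline's actual scoring.

```python
def exon_bases(exons):
    """Set of genomic positions covered by (start, end) exons, 1-based inclusive."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def exon_overlap_fraction(current, previous):
    """Shared exonic bases divided by the union of exonic bases."""
    cur, prev = exon_bases(current), exon_bases(previous)
    union = cur | prev
    return len(cur & prev) / len(union) if union else 0.0


def matching_boundary_count(current, previous):
    """Number of exon start/end positions present in both features."""
    def boundaries(exons):
        return {("start", s) for s, _ in exons} | {("end", e) for _, e in exons}
    return len(boundaries(current) & boundaries(previous))


# Hypothetical two-exon transcript whose last exon was shortened
# between the previous and the current release.
current = [(100, 200), (300, 400)]
previous = [(100, 200), (300, 450)]

overlap = exon_overlap_fraction(current, previous)
shared = matching_boundary_count(current, previous)

print(round(overlap, 3))  # 0.802
print(shared)             # 3
```

A pair like this one, with high but incomplete overlap and one changed boundary, is the kind of case a comparison might place in a "changed" category; the thresholds that separate minor from major changes are not specified in this report.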

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Cse_v1.0 (Current) to Cse_v1.0 (Previous)
Identical	5%
Minor changes	75%
Major changes	10%
New	9%
Deprecated	5%
Other	<1%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
