NCBI Cyprinus carpio Annotation Release 100

The RefSeq genome records for Cyprinus carpio were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Cyprinus carpio Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Nov 3 2016
Date of submission of annotation to the public databases: Nov 10 2016
Software version: 7.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
common carp genome	GCF_000951615.1	CHINESE ACADEMY OF FISHERY SCIENCE	11-03-2014	Reference	51 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	common carp genome
Genes and pseudogenes	69,052
  protein-coding	49,579
  non-coding	10,547
  pseudogenes	8,926
  genes with variants	9,472
mRNAs	63,915
  fully-supported	46,363
  with > 5% ab initio	6,705
  partial	2,809
  with filled gap(s)	47
  known RefSeq (NM_)	0
  model RefSeq (XM_)	63,915
Other RNAs	13,014
  fully-supported	11,437
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	11,438
CDSs	64,230
  fully-supported	46,363
  with > 5% ab initio	8,235
  partial	2,837
  with major correction(s)	5,814
  known RefSeq (NP_)	0
  model RefSeq (XP_)	63,915
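
As a quick sanity check on the gene rows above, the sketch below sums the protein-coding, non-coding and pseudogene counts and compares the result with the "Genes and pseudogenes" total. It assumes these three categories are disjoint and that "genes with variants" is an overlapping subset rather than a fourth category; the report does not state this explicitly. The helper name `parse_count` is illustrative only.

```python
# Counts copied from the "Feature counts" table in this report.
gene_counts = {
    "Genes and pseudogenes": "69,052",
    "protein-coding": "49,579",
    "non-coding": "10,547",
    "pseudogenes": "8,926",
}


def parse_count(text: str) -> int:
    """Convert a report-style count such as '69,052' to an int."""
    return int(text.replace(",", ""))


parsed = {name: parse_count(value) for name, value in gene_counts.items()}

# Assumption: the three categories below partition the top-level total.
subtotal = parsed["protein-coding"] + parsed["non-coding"] + parsed["pseudogenes"]
is_consistent = subtotal == parsed["Genes and pseudogenes"]

print(f"subtotal={subtotal:,} consistent={is_consistent}")
```

The same pattern can be applied to other rows, but only where the sub-rows are known to be mutually exclusive.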

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	60,126	11,803	4,605	70	1,186,868
All transcripts	76,929	1,677	1,362	51	84,784
  mRNA	63,915	1,875	1,544	135	84,784
  misc_RNA	1,151	1,674	1,492	92	7,247
  tRNA	1,576	74	73	70	84
  lncRNA	10,287	690	515	51	8,397
Single-exon transcripts	1,788	1,393	1,151	253	8,082
  coding transcripts (NM_/XM_)	1,788	1,393	1,151	253	8,082
CDSs	63,915	1,273	972	99	84,555
Exons	437,972	233	134	1	10,056
  in coding transcripts (NM_/XM_)	407,104	233	135	1	10,056
  in non-coding transcripts (NR_/XR_)	35,018	224	119	2	7,609
Introns	374,220	1,763	361	30	1,131,913
  in coding transcripts (NM_/XM_)	353,565	1,777	365	30	1,131,913
  in non-coding transcripts (NR_/XR_)	24,605	1,596	315	30	376,875

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.29	1	1	50
Number of exons per transcript	7.55	5	1	221

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 49,264 coding genes, 45,139 had a protein with an alignment covering 50% or more of the query, and 18,937 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
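
The two coverage definitions can be illustrated with a minimal sketch. It assumes a single contiguous alignment with 1-based, inclusive coordinates; how the pipeline combines multiple alignment segments is not described in this report, and the function name and example coordinates are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percent of a sequence included in one alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-400 of a 420-residue annotated protein
# (the query) aligned to residues 1-390 of a 500-residue curated protein
# (the target).
query_cov = coverage_percent(11, 400, 420)
target_cov = coverage_percent(1, 390, 500)

print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the query coverage is high while the target coverage is lower, which is the pattern expected when an annotated protein is a truncated version of its curated counterpart.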

[Figure: cumulative graph of the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline are included. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are used only for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
common carp genome	GCF_000951615.1	5.03%	36.70%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	1,674	1,613 (96.36%)	1,197 (71.51%)	98.69%	87.90%
Same-species EST	49,606	36,059 (72.69%)	28,735 (57.93%)	98.86%	97.24%
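
The percentages in parentheses appear to be taken relative to the number of sequences retrieved from Entrez. The sketch below parses the `count (percent%)` cells of the table above and recomputes each percentage under that assumption; the regular expression and the helper `parse_cell` are illustrative, not part of any NCBI tool.

```python
import re

# Matches cells such as "1,613 (96.36%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_cell(cell: str) -> tuple[int, float]:
    """Split a 'count (percent%)' cell into an int count and a float percent."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# Values copied from the "Transcript alignments" table:
# source -> (retrieved, aligned by Splign, passed to Gnomon)
rows = {
    "Same-species GenBank": ("1,674", "1,613 (96.36%)", "1,197 (71.51%)"),
    "Same-species EST": ("49,606", "36,059 (72.69%)", "28,735 (57.93%)"),
}

checks = []
for source, (retrieved, aligned, passed) in rows.items():
    total = int(retrieved.replace(",", ""))
    for label, cell in (("aligned", aligned), ("passed to Gnomon", passed)):
        count, reported = parse_cell(cell)
        # Assumption: the denominator is the number of sequences retrieved.
        recomputed = round(100.0 * count / total, 2)
        checks.append((source, label, reported, recomputed))
        print(f"{source:22s} {label:17s} reported={reported:6.2f} recomputed={recomputed:6.2f}")
```

If the recomputed and reported values agree to two decimal places, the assumed denominator is consistent with the table.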

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	2,779,880,620	72%	22%	651,611
SAMN00003891	Brain, muscle and liver tissue (Cyprinus carpio, SAMN00003891)	242,263	57%	40%	20,727
SAMN00780321	liver, 2-year-old (Cyprinus carpio, SAMN00780321)	774,965	53%	44%	18,644
SAMN00794395	pooled sample with twelve tissues (Cyprinus carpio, SAMN00794395)	2,119,118	58%	23%	176,450
SAMN01110264	Transcriptome sequence of common carp by using Illumina GAII (Cyprinus carpio carpio, SAMN01110264)	63,702,692	65%	18%	269,197
SAMN02183446	Oujiang color common carp: Dahua and Fenyu (Cyprinus carpio, SAMN02183446)	737,525	48%	23%	65,784
SAMN02900717	Brain, Yellow River (Cyprinus carpio, SAMN02900717)	36,058,166	75%	6%	323,325
SAMN02900719	Gill, Yellow River (Cyprinus carpio, SAMN02900719)	35,941,220	75%	8%	309,484
SAMN02900720	Blood, Yellow River (Cyprinus carpio, SAMN02900720)	35,114,692	80%	9%	207,351
SAMN02900721	Head kidney, Yellow River (Cyprinus carpio, SAMN02900721)	30,353,832	76%	9%	236,599
SAMN02900722	Muscle, Yellow River (Cyprinus carpio, SAMN02900722)	36,496,782	69%	10%	197,863
SAMN02900723	Brain, Xingguo red (Cyprinus carpio, SAMN02900723)	27,463,132	76%	6%	298,994
SAMN02900724	Skin, Xingguo red (Cyprinus carpio, SAMN02900724)	31,912,414	70%	6%	243,490
SAMN02900725	Gill, Xingguo red (Cyprinus carpio, SAMN02900725)	35,395,640	75%	8%	291,103
SAMN02900726	Blood, Xingguo red (Cyprinus carpio, SAMN02900726)	46,263,852	78%	8%	221,219
SAMN02900727	Head kidney, Xingguo red (Cyprinus carpio, SAMN02900727)	32,929,374	77%	9%	261,159
SAMN02900728	Muscle, Xingguo red (Cyprinus carpio, SAMN02900728)	50,520,326	65%	9%	234,750
SAMN02950730	liver (Cyprinus carpio, 2 year old, male, SAMN02950730)	130,919,072	75%	29%	238,799
SAMN02951857	skin (Cyprinus carpio, adult, not collected, SAMN02951857)	41,338,884	71%	18%	350,034
SAMN03024559	testis (Cyprinus carpio carpio, three years, male, SAMN03024559)	156,548,832	77%	28%	492,007
SAMN03263074	liver (Cyprinus carpio haematopterus, male, SAMN03263074)	121,648,848	76%	25%	322,556
SAMN03263075	muscle (Cyprinus carpio haematopterus, male, SAMN03263075)	144,067,590	53%	19%	343,400
SAMN03263076	brain (Cyprinus carpio haematopterus, male, SAMN03263076)	134,104,682	66%	14%	436,585
SAMN03263077	spleen (Cyprinus carpio haematopterus, male, SAMN03263077)	153,488,214	76%	20%	382,889
SAMN03263078	kidney (Cyprinus carpio haematopterus, male, SAMN03263078)	144,279,860	72%	20%	424,076
SAMN03338735	spleen (Cyprinus carpio haematopterus, one-year-old, female, SAMN03338735)	62,288,002	77%	30%	329,169
SAMN03338736	spleen (Cyprinus carpio haematopterus, three-year-old, female, SAMN03338736)	77,455,950	80%	27%	362,310
SAMN04046234	liver after saline injection (Cyprinus carpio, 6-month-old, pooled male and female, SAMN04046234)	49,730,178	66%	36%	218,467
SAMN04046236	liver after growth hormone injection (Cyprinus carpio, 6-month-old, pooled male and female, SAMN04046236)	42,303,842	65%	37%	241,815
SAMN04046237	liver after insulin injection (Cyprinus carpio, 6-month-old, pooled male and female, SAMN04046237)	45,983,588	63%	34%	229,414
SAMN04046239	liver after glucose injection (Cyprinus carpio, 6-month-old, pooled male and female, SAMN04046239)	73,541,752	66%	35%	244,554
SAMN04537393	Kidney (Cyprinus carpio, not determined, SAMN04537393)	28,645,158	77%	23%	279,630
SAMN04537394	Kidney (Cyprinus carpio, not determined, SAMN04537394)	31,309,358	78%	25%	273,478
SAMN04537395	Kidney (Cyprinus carpio, not determined, SAMN04537395)	33,802,985	78%	24%	302,803
SAMN04537396	Kidney (Cyprinus carpio, not determined, SAMN04537396)	33,014,462	78%	23%	282,964
SAMN04537397	Kidney (Cyprinus carpio, not determined, SAMN04537397)	32,699,016	76%	21%	273,898
SAMN04537398	Kidney (Cyprinus carpio, not determined, SAMN04537398)	31,940,121	77%	22%	300,965
SAMN04537399	Kidney (Cyprinus carpio, not determined, SAMN04537399)	33,243,586	77%	23%	291,406
SAMN04537400	Kidney (Cyprinus carpio, not determined, SAMN04537400)	33,933,720	78%	24%	292,471
SAMN04537401	Kidney (Cyprinus carpio, not determined, SAMN04537401)	32,050,049	76%	23%	275,069
SAMN04537402	Kidney (Cyprinus carpio, not determined, SAMN04537402)	32,591,406	36%	14%	228,818
SAMN04537403	Kidney (Cyprinus carpio, not determined, SAMN04537403)	31,401,721	74%	19%	246,618
SAMN04537404	Kidney (Cyprinus carpio, not determined, SAMN04537404)	36,074,985	76%	22%	340,697
SAMN04549285	Spleen (Cyprinus carpio, six-month, pooled male and female, SAMN04549285)	545,448,766	73%	28%	464,050

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR022537	SRX007427	SRP001067	SAMN00003891	229,840	57%	40%
SRR022538	SRX007427	SRP001067	SAMN00003891	4,316	56%	39%
SRR022539	SRX007427	SRP001067	SAMN00003891	4,937	59%	40%
SRR022540	SRX007427	SRP001067	SAMN00003891	2,413	54%	38%
SRR022541	SRX007427	SRP001067	SAMN00003891	757	56%	37%
SRR404232	SRX118437	SRP010735	SAMN00780321	72,993	55%	47%
SRR404233	SRX118438	SRP010735	SAMN00780321	69,940	49%	37%
SRR404234	SRX118439	SRP010735	SAMN00780321	259,291	52%	44%
SRR404235	SRX118440	SRP010735	SAMN00780321	372,741	54%	45%
SRR424227	SRX124558	SRP011162	SAMN00794395	2,119,118	58%	23%
SRR536781	SRX175397	SRP014777	SAMN01110264	63,702,692	65%	18%
SRR875341	SRX291286	SRP023984	SAMN02183446	737,525	48%	23%
SRR1509519	SRX648297	SRP044033	SAMN02900717	36,058,166	75%	6%
SRR1509524	SRX648317	SRP044033	SAMN02900719	35,941,220	75%	8%
SRR1509525	SRX648318	SRP044033	SAMN02900720	35,114,692	80%	9%
SRR1509527	SRX648319	SRP044033	SAMN02900721	30,353,832	76%	9%
SRR1509528	SRX648320	SRP044033	SAMN02900722	36,496,782	69%	10%
SRR1509529	SRX648321	SRP044033	SAMN02900723	27,463,132	76%	6%
SRR1509530	SRX648322	SRP044033	SAMN02900724	31,912,414	70%	6%
SRR1509531	SRX648323	SRP044033	SAMN02900725	35,395,640	75%	8%
SRR1509532	SRX648324	SRP044033	SAMN02900726	46,263,852	78%	8%
SRR1509533	SRX648325	SRP044033	SAMN02900727	32,929,374	77%	9%
SRR1509534	SRX648326	SRP044033	SAMN02900728	50,520,326	65%	9%
SRR1535030	SRX668435	SRP045209	SAMN02950730	1,549,930	26%	22%
SRR1535031	SRX668436	SRP045209	SAMN02950730	129,369,142	76%	29%
SRR1536803	SRX669880	SRP045286	SAMN02951857	20,669,442	72%	18%
SRR1536804	SRX669889	SRP045288	SAMN02951857	20,669,442	71%	18%
SRR1573264	SRX699280	SRP047016	SAMN03024559	53,332,260	79%	23%
SRR1781596	SRX857397	SRP047016	SAMN03024559	103,216,572	77%	31%
SRR1707401	SRX806815	SRP051517	SAMN03263074	121,648,848	76%	25%
SRR1707354	SRX806930	SRP051517	SAMN03263075	144,067,590	53%	19%
SRR1707404	SRX806950	SRP051517	SAMN03263076	134,104,682	66%	14%
SRR1707407	SRX806952	SRP051517	SAMN03263077	153,488,214	76%	20%
SRR1707408	SRX806954	SRP051517	SAMN03263078	144,279,860	72%	20%
SRR1799752	SRX873880	SRP054250	SAMN03338735	62,288,002	77%	30%
SRR1799756	SRX873884	SRP054250	SAMN03338736	77,455,950	80%	27%
SRR2357568	SRX1227290	SRP063635	SAMN04046234	24,865,089	67%	37%
SRR2357588	SRX1227291	SRP063635	SAMN04046234	24,865,089	64%	35%
SRR2357589	SRX1227292	SRP063635	SAMN04046236	21,151,921	67%	37%
SRR2357620	SRX1227308	SRP063635	SAMN04046236	21,151,921	64%	36%
SRR2357621	SRX1227338	SRP063635	SAMN04046237	22,991,794	64%	34%
SRR2357622	SRX1227339	SRP063635	SAMN04046237	22,991,794	62%	33%
SRR2357623	SRX1227340	SRP063635	SAMN04046239	36,770,876	67%	36%
SRR2357650	SRX1227354	SRP063635	SAMN04046239	36,770,876	65%	35%
SRR3214073	SRX1621485	SRP071353	SAMN04537393	28,645,158	77%	23%
SRR3214074	SRX1621486	SRP071353	SAMN04537394	31,309,358	78%	25%
SRR3214077	SRX1621489	SRP071353	SAMN04537395	33,802,985	78%	24%
SRR3214078	SRX1621490	SRP071353	SAMN04537396	33,014,462	78%	23%
SRR3214079	SRX1621491	SRP071353	SAMN04537397	32,699,016	76%	21%
SRR3214080	SRX1621492	SRP071353	SAMN04537398	31,940,121	77%	22%
SRR3214081	SRX1621493	SRP071353	SAMN04537399	33,243,586	77%	23%
SRR3214082	SRX1621494	SRP071353	SAMN04537400	33,933,720	78%	24%
SRR3214083	SRX1621495	SRP071353	SAMN04537401	32,050,049	76%	23%
SRR3214084	SRX1621496	SRP071353	SAMN04537402	32,591,406	36%	14%
SRR3214075	SRX1621487	SRP071353	SAMN04537403	31,401,721	74%	19%
SRR3214076	SRX1621488	SRP071353	SAMN04537404	36,074,985	76%	22%
SRR3239506	SRX1654895	SRP072018	SAMN04549285	27,495,684	76%	28%
SRR3239530	SRX1654896	SRP072018	SAMN04549285	33,190,074	76%	32%
SRR3239519	SRX1654897	SRP072018	SAMN04549285	26,272,420	71%	26%
SRR3239522	SRX1654898	SRP072018	SAMN04549285	33,581,016	70%	27%
SRR3239523	SRX1654899	SRP072018	SAMN04549285	24,830,742	72%	27%
SRR3239525	SRX1654900	SRP072018	SAMN04549285	28,767,564	71%	28%
SRR3239527	SRX1654901	SRP072018	SAMN04549285	31,123,186	68%	21%
SRR3239531	SRX1654902	SRP072018	SAMN04549285	31,591,512	77%	34%
SRR3239535	SRX1654903	SRP072018	SAMN04549285	26,801,282	67%	26%
SRR3239537	SRX1654904	SRP072018	SAMN04549285	31,109,444	72%	30%
SRR3239544	SRX1654905	SRP072018	SAMN04549285	48,250,346	73%	24%
SRR3239547	SRX1654906	SRP072018	SAMN04549285	38,597,534	76%	28%
SRR3239549	SRX1654907	SRP072018	SAMN04549285	23,769,762	75%	26%
SRR3239507	SRX1654908	SRP072018	SAMN04549285	29,167,164	75%	29%
SRR3239510	SRX1654909	SRP072018	SAMN04549285	26,707,840	71%	25%
SRR3239512	SRX1654910	SRP072018	SAMN04549285	26,526,312	78%	30%
SRR3239515	SRX1654911	SRP072018	SAMN04549285	25,216,156	78%	29%
SRR3239516	SRX1654912	SRP072018	SAMN04549285	32,450,728	70%	27%
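
The by-run rows can be rolled up to the by-sample rows. The sketch below does this for two samples, summing read counts and taking a read-weighted mean of the percent aligned. The weighting scheme is an assumption about how the per-sample figure is derived, and because the per-run percentages are already rounded, the recomputed value is only approximate.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned reads), copied from the
# by-run table for two of the samples.
runs = [
    ("SRR022537", "SAMN00003891", 229_840, 57),
    ("SRR022538", "SAMN00003891", 4_316, 56),
    ("SRR022539", "SAMN00003891", 4_937, 59),
    ("SRR022540", "SAMN00003891", 2_413, 54),
    ("SRR022541", "SAMN00003891", 757, 56),
    ("SRR1535030", "SAMN02950730", 1_549_930, 26),
    ("SRR1535031", "SAMN02950730", 129_369_142, 76),
]

total_reads = defaultdict(int)
weighted_pct = defaultdict(int)
for _run, sample, reads, pct_aligned in runs:
    total_reads[sample] += reads
    # Weight each run's percentage by its read count (assumed scheme).
    weighted_pct[sample] += reads * pct_aligned

# sample -> (total reads, read-weighted percent aligned rounded to an integer)
summary = {
    sample: (total_reads[sample], round(weighted_pct[sample] / total_reads[sample]))
    for sample in total_reads
}

for sample, (reads, pct) in summary.items():
    print(f"{sample}\t{reads:,}\t{pct}%")
```

For SAMN02950730 the small run with 26% aligned reads barely moves the sample-level figure, because it contributes only about 1% of the reads.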

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Cynoglossus semilaevis high-quality model RefSeq (XP_)	13,609	13,145 (96.59%)	13,145 (96.59%)	69.96%	70.48%
Poecilia formosa high-quality model RefSeq (XP_)	18,503	17,820 (96.31%)	17,820 (96.31%)	69.12%	69.08%
Actinopterygii GenBank	76,418	72,984 (95.51%)	72,984 (95.51%)	72.54%	77.02%
Actinopterygii known RefSeq (NP_)	24,661	23,915 (96.97%)	23,915 (96.97%)	73.16%	76.30%
Danio rerio high-quality model RefSeq (XP_)	7,662	7,573 (98.84%)	7,573 (98.84%)	72.89%	74.69%
Astyanax mexicanus high-quality model RefSeq (XP_)	13,209	12,942 (97.98%)	12,942 (97.98%)	71.19%	73.56%
Esox lucius high-quality model RefSeq (XP_)	15,544	15,081 (97.02%)	15,081 (97.02%)	70.11%	70.69%
Homo sapiens known RefSeq (NP_)	44,808	37,637 (84.00%)	37,637 (84.00%)	67.75%	64.81%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
