NCBI Penaeus chinensis Annotation Release 100

The RefSeq genome records for Penaeus chinensis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Penaeus chinensis Annotation Release 100

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Apr 5 2022
Date of submission of annotation to the public databases: Apr 8 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM1920278v2	GCF_019202785.1	Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences	07-13-2021	Reference	44 assembled chromosomes; unplaced scaffolds
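
The assembly name and accession in the table above are enough to locate the annotation files on the FTP site. The sketch below assumes the usual layout of the NCBI genomes/all tree, in which the nine digits of the accession are split into three directories; this layout is not stated in the report, and the function name is hypothetical.

```python
def refseq_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the assumed FTP directory URL for an assembly.

    Assumes the genomes/all layout: <prefix>/<3 digits>/<3 digits>/<3 digits>/
    followed by a directory named <accession>_<assembly_name>.
    """
    prefix, rest = accession.split("_", 1)   # e.g. "GCF", "019202785.1"
    digits = rest.split(".", 1)[0]           # e.g. "019202785"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{prefix}/{triplets}/{accession}_{assembly_name}/"
    )


url = refseq_ftp_dir("GCF_019202785.1", "ASM1920278v2")
print(url)
```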

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM1920278v2
Genes and pseudogenes	23,929
  protein-coding	20,076
  non-coding	3,507
  Transcribed pseudogenes	2
  Non-transcribed pseudogenes	343
  genes with variants	5,790
  Immunoglobulin/T-cell receptor gene segments	0
  other	1
mRNAs	34,817
  fully-supported	29,432
  with > 5% ab initio	4,049
  partial	849
  with filled gap(s)	502
  known RefSeq (NM_)	0
  model RefSeq (XM_)	34,817
non-coding RNAs	4,265
  fully-supported	1,662
  with > 5% ab initio	0
  partial	3
  with filled gap(s)	3
  known RefSeq (NR_)	0
  model RefSeq (XR_)	2,057
pseudo transcripts	2
  fully-supported	2
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	2
CDSs	34,830
  fully-supported	29,432
  with > 5% ab initio	4,346
  partial	797
  with major correction(s)	521
  known RefSeq (NP_)	0
  model RefSeq (XP_)	34,830
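
As a consistency check on the counts above, the gene categories can be summed and compared with the reported total. This is a minimal sketch that assumes the five categories listed in the dictionary partition the "Genes and pseudogenes" total, with "genes with variants" treated as an overlapping attribute rather than a separate category; that reading is inferred from the arithmetic, not stated in the report.

```python
# Gene-level rows from the feature counts table (assembly ASM1920278v2).
# Assumption: these categories are mutually exclusive.
gene_categories = {
    "protein-coding": 20076,
    "non-coding": 3507,
    "Transcribed pseudogenes": 2,
    "Non-transcribed pseudogenes": 343,
    "other": 1,
}
genes_and_pseudogenes = 23929  # reported total

category_sum = sum(gene_categories.values())
print(category_sum, category_sum == genes_and_pseudogenes)  # 23929 True

# mRNAs plus non-coding RNAs gives the "All transcripts" count
# reported in the feature lengths table.
mrnas, noncoding_rnas = 34817, 4265
print(mrnas + noncoding_rnas)  # 39082
```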

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	23,584	20,353	7,567	63	1,410,248
All transcripts	39,082	2,786	2,064	63	40,813
  mRNA	34,817	3,045	2,272	93	40,813
  misc_RNA	593	2,647	1,974	190	20,109
  tRNA	2,206	74	73	65	87
  lncRNA	1,069	998	626	80	14,225
  snoRNA	80	163	201	63	212
  snRNA	179	149	161	102	195
  rRNA	137	133	119	117	1,367
Single-exon transcripts	966	1,213	834	249	8,262
  coding transcripts (NM_/XM_)	966	1,213	834	249	8,262
CDSs	34,830	1,904	1,353	93	39,648
Exons	170,032	320	157	2	20,194
  in coding transcripts (NM_/XM_)	166,132	319	157	2	20,194
  in non-coding transcripts (NR_/XR_)	6,634	306	146	3	13,691
Introns	148,516	3,770	655	30	588,825
  in coding transcripts (NM_/XM_)	145,823	3,726	652	30	588,825
  in non-coding transcripts (NR_/XR_)	5,256	4,505	688	31	556,155

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.72	1	1	50
Number of exons per transcript	9.27	7	1	146

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein per gene and the arthropoda_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation.
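
BUSCO notation is a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.., giving the percentages of complete (single-copy and duplicated), fragmented and missing BUSCOs and the number of BUSCOs searched. The sketch below parses such a line into numbers. The sample string uses placeholder values chosen for illustration only; they are not the results for this annotation, and the function name is hypothetical.

```python
import re

# Pattern for the one-line BUSCO summary notation.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict:
    """Return the percentages (as floats) and n (as int) from a BUSCO summary."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not in BUSCO notation: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


# Placeholder values for illustration, NOT the results of this annotation.
example = "C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000"
result = parse_busco(example)
print(result)
```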

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,063 coding genes, 13,324 genes had a protein with an alignment covering 50% or more of the query and 2,652 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
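
The two definitions can be sketched as a single function applied once to the query and once to the target. This is a simplified illustration that assumes one contiguous alignment span with 1-based inclusive coordinates; how the pipeline combines multiple aligned segments for one protein pair is not described here. The lengths and coordinates in the example are hypothetical.

```python
def coverage_pct(aln_start: int, aln_end: int, seq_len: int) -> float:
    """Percentage of a sequence included in the alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if not (1 <= aln_start <= aln_end <= seq_len):
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aln_end - aln_start + 1) / seq_len


# Hypothetical hit: a 400-aa annotated protein (query) aligned over
# residues 11-390 to residues 1-380 of a 500-aa curated protein (target).
query_cov = coverage_pct(11, 390, 400)
target_cov = coverage_pct(1, 380, 500)
print(query_cov, target_cov)  # 95.0 76.0
```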

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
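
The gene counts quoted in this section (genes at or above 50% and 95% query coverage) are two points on such a cumulative curve. A minimal sketch of the counting step follows, assuming each gene has already been reduced to the best coverage among its proteins; the gene identifiers and coverage values are made up for illustration.

```python
from typing import Dict, Iterable


def genes_at_or_above(
    best_coverage_by_gene: Dict[str, float], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Count genes whose best coverage is >= each threshold."""
    return {
        t: sum(1 for cov in best_coverage_by_gene.values() if cov >= t)
        for t in thresholds
    }


# Hypothetical per-gene best query coverage (percent).
toy = {"geneA": 99.0, "geneB": 72.5, "geneC": 50.0, "geneD": 31.0}
counts = genes_at_or_above(toy, [50, 95])
print(counts)  # {50: 3, 95: 1}
```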

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
ASM1920278v2	GCF_019202785.1	60.48%
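
A masked percentage of this kind can be approximated from a soft-masked genomic FASTA file. The sketch below assumes that masked bases are written in lowercase and that every base, including N, counts toward the total; the pipeline's exact denominator is not stated, so the result may not reproduce the reported figure exactly. The sequence in the example is a toy.

```python
from typing import Iterable


def masked_fraction(fasta_lines: Iterable[str]) -> float:
    """Fraction of bases written in lowercase in a soft-masked FASTA."""
    masked = total = 0
    for line in fasta_lines:
        if line.startswith(">"):      # skip sequence headers
            continue
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy record: 6 lowercase bases out of 16.
toy_fasta = [">chr_toy", "ACGTacgtAC", "ggNNAC"]
fraction = masked_fraction(toy_fasta)
print(f"{100 * fraction:.2f}%")  # 37.50%
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.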

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	247	237 (95.95%)	230 (93.12%)	99.33%	96.79%
Same-species EST	10,446	8,748 (83.74%)	7,689 (73.61%)	98.83%	98.79%
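
The percentages in parentheses are relative to the number of sequences retrieved from Entrez, which can be checked from the counts in the table. The short sketch below recomputes them; the helper name is hypothetical.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


# (retrieved from Entrez, aligned by Splign, passed to Gnomon)
transcript_sources = {
    "Same-species GenBank": (247, 237, 230),
    "Same-species EST": (10446, 8748, 7689),
}

recomputed = {
    source: (pct(aligned, retrieved), pct(passed, retrieved))
    for source, (retrieved, aligned, passed) in transcript_sources.items()
}
print(recomputed)
# {'Same-species GenBank': (95.95, 93.12), 'Same-species EST': (83.74, 73.61)}
```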

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	2,326,859,490	76%	32%	163,791
SAMN12429760	muscle (Penaeus chinensis, female, SAMN12429760)	47,494,284	86%	39%	73,918
SAMN12429761	muscle (Penaeus chinensis, female, SAMN12429761)	52,649,828	89%	35%	84,541
SAMN12429762	muscle (Penaeus chinensis, female, SAMN12429762)	51,933,274	88%	40%	82,649
SAMN12429763	gonad (Penaeus chinensis, female, SAMN12429763)	45,224,490	75%	30%	83,064
SAMN12429764	gonad (Penaeus chinensis, female, SAMN12429764)	49,993,496	82%	37%	95,460
SAMN12429765	gonad (Penaeus chinensis, female, SAMN12429765)	49,253,468	78%	37%	98,631
SAMN12429766	muscle (Penaeus chinensis, male, SAMN12429766)	51,864,504	89%	31%	80,560
SAMN12429767	muscle (Penaeus chinensis, male, SAMN12429767)	46,877,802	90%	32%	71,421
SAMN12429768	muscle (Penaeus chinensis, male, SAMN12429768)	49,514,746	89%	31%	73,015
SAMN12429769	gonad (Penaeus chinensis, male, SAMN12429769)	51,220,074	81%	34%	101,134
SAMN12429770	gonad (Penaeus chinensis, male, SAMN12429770)	52,052,302	75%	37%	103,267
SAMN12429771	gonad (Penaeus chinensis, male, SAMN12429771)	50,370,966	83%	38%	109,539
SAMN16789034	hepatopancreas (Penaeus chinensis, 5 month, SAMN16789034)	42,119,260	81%	43%	95,765
SAMN16789035	hepatopancreas (Penaeus chinensis, 5 month, SAMN16789035)	39,807,956	77%	42%	89,038
SAMN17035586	MIGS Eukaryotic sample (Penaeus chinensis, SAMN17035586)	521,389,792	72%	18%	141,762
SAMN19522615	gill (Penaeus chinensis, 4 month, female, SAMN19522615)	40,540,446	79%	39%	110,524
SAMN19522616	gill (Penaeus chinensis, 4 month, female, SAMN19522616)	44,076,814	77%	37%	111,028
SAMN19522617	gill (Penaeus chinensis, 4 month, female, SAMN19522617)	40,897,112	74%	37%	107,537
SAMN19522618	gill (Penaeus chinensis, 4 month, female, SAMN19522618)	43,080,336	72%	39%	107,940
SAMN19522619	gill (Penaeus chinensis, 4 month, female, SAMN19522619)	42,731,386	74%	40%	104,245
SAMN19522620	gill (Penaeus chinensis, 4 month, female, SAMN19522620)	44,471,972	75%	39%	108,803
SAMN19522621	gill (Penaeus chinensis, 4 month, female, SAMN19522621)	38,932,846	76%	38%	107,783
SAMN19522622	gill (Penaeus chinensis, 4 month, female, SAMN19522622)	41,038,342	77%	38%	107,836
SAMN19522623	gill (Penaeus chinensis, 4 month, female, SAMN19522623)	38,861,912	76%	39%	109,207
SAMN19522624	gill (Penaeus chinensis, 4 month, female, SAMN19522624)	44,091,760	74%	39%	107,117
SAMN19522625	gill (Penaeus chinensis, 4 month, female, SAMN19522625)	40,391,480	74%	41%	107,051
SAMN19522626	gill (Penaeus chinensis, 4 month, female, SAMN19522626)	39,280,686	76%	40%	106,783
SAMN25734233	gill (Penaeus chinensis, 4 month, female, SAMN25734233)	55,132,738	73%	34%	110,502
SAMN25734234	gill (Penaeus chinensis, 4 month, female, SAMN25734234)	46,814,918	78%	35%	109,068
SAMN25734235	gill (Penaeus chinensis, 4 month, female, SAMN25734235)	53,615,834	76%	35%	110,708
SAMN25734236	gill (Penaeus chinensis, 4 month, female, SAMN25734236)	32,826,150	67%	21%	92,783
SAMN25734237	gill (Penaeus chinensis, 4 month, female, SAMN25734237)	58,451,784	76%	28%	108,653
SAMN25734238	gill (Penaeus chinensis, 4 month, female, SAMN25734238)	43,853,520	77%	36%	108,720
SAMN25734239	gill (Penaeus chinensis, 4 month, female, SAMN25734239)	50,666,898	74%	30%	104,087
SAMN25734240	gill (Penaeus chinensis, 4 month, female, SAMN25734240)	47,678,006	77%	35%	108,466
SAMN25734241	gill (Penaeus chinensis, 4 month, female, SAMN25734241)	58,147,068	77%	33%	108,487
SAMN25734242	gill (Penaeus chinensis, 4 month, female, SAMN25734242)	65,210,848	75%	30%	102,988
SAMN25734243	gill (Penaeus chinensis, 4 month, female, SAMN25734243)	65,923,040	66%	22%	99,777
SAMN25734244	gill (Penaeus chinensis, 4 month, female, SAMN25734244)	48,377,352	69%	30%	108,192

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR9894259	SRX6646672	SRP217208	SAMN12429760	47,494,284	86%	39%
SRR9894258	SRX6646673	SRP217208	SAMN12429761	52,649,828	89%	35%
SRR9894257	SRX6646674	SRP217208	SAMN12429762	51,933,274	88%	40%
SRR9894256	SRX6646675	SRP217208	SAMN12429763	45,224,490	75%	30%
SRR9894255	SRX6646676	SRP217208	SAMN12429764	49,993,496	82%	37%
SRR9894254	SRX6646677	SRP217208	SAMN12429765	49,253,468	78%	37%
SRR9894253	SRX6646678	SRP217208	SAMN12429766	51,864,504	89%	31%
SRR9894252	SRX6646679	SRP217208	SAMN12429767	46,877,802	90%	32%
SRR9894251	SRX6646680	SRP217208	SAMN12429768	49,514,746	89%	31%
SRR9894262	SRX6646669	SRP217208	SAMN12429769	51,220,074	81%	34%
SRR9894261	SRX6646670	SRP217208	SAMN12429770	52,052,302	75%	37%
SRR9894263	SRX6646668	SRP217208	SAMN12429771	50,370,966	83%	38%
SRR13052508	SRX9501905	SRP292453	SAMN16789034	42,119,260	81%	43%
SRR13052507	SRX9501906	SRP292453	SAMN16789035	39,807,956	77%	42%
SRR13304120	SRX9732967	SRP299293	SAMN17035586	39,224,708	70%	16%
SRR13304119	SRX9732968	SRP299293	SAMN17035586	43,309,308	75%	20%
SRR13304118	SRX9732969	SRP299293	SAMN17035586	46,215,510	71%	16%
SRR13304117	SRX9732970	SRP299293	SAMN17035586	42,680,974	59%	12%
SRR13304116	SRX9732971	SRP299293	SAMN17035586	44,770,916	75%	22%
SRR13304115	SRX9732972	SRP299293	SAMN17035586	46,057,900	72%	21%
SRR13304114	SRX9732973	SRP299293	SAMN17035586	42,480,740	80%	19%
SRR13304113	SRX9732974	SRP299293	SAMN17035586	43,620,388	75%	18%
SRR13304112	SRX9732975	SRP299293	SAMN17035586	42,721,136	75%	19%
SRR13304111	SRX9732976	SRP299293	SAMN17035586	43,018,894	76%	20%
SRR13304110	SRX9732977	SRP299293	SAMN17035586	44,718,118	71%	20%
SRR13304109	SRX9732978	SRP299293	SAMN17035586	42,571,200	60%	15%
SRR14718720	SRX11055743	SRP322492	SAMN19522615	40,540,446	79%	39%
SRR14718719	SRX11055744	SRP322492	SAMN19522616	44,076,814	77%	37%
SRR14718716	SRX11055747	SRP322492	SAMN19522617	40,897,112	74%	37%
SRR14718715	SRX11055748	SRP322492	SAMN19522618	43,080,336	72%	39%
SRR14718714	SRX11055749	SRP322492	SAMN19522619	42,731,386	74%	40%
SRR14718713	SRX11055750	SRP322492	SAMN19522620	44,471,972	75%	39%
SRR14718712	SRX11055751	SRP322492	SAMN19522621	38,932,846	76%	38%
SRR14718711	SRX11055752	SRP322492	SAMN19522622	41,038,342	77%	38%
SRR14718710	SRX11055753	SRP322492	SAMN19522623	38,861,912	76%	39%
SRR14718709	SRX11055754	SRP322492	SAMN19522624	44,091,760	74%	39%
SRR14718718	SRX11055745	SRP322492	SAMN19522625	40,391,480	74%	41%
SRR14718717	SRX11055746	SRP322492	SAMN19522626	39,280,686	76%	40%
SRR17931580	SRX14089999	SRP322492	SAMN25734233	55,132,738	73%	34%
SRR17931579	SRX14090000	SRP322492	SAMN25734234	46,814,918	78%	35%
SRR17931576	SRX14090003	SRP322492	SAMN25734235	53,615,834	76%	35%
SRR17931575	SRX14090004	SRP322492	SAMN25734236	32,826,150	67%	21%
SRR17931574	SRX14090005	SRP322492	SAMN25734237	58,451,784	76%	28%
SRR17931573	SRX14090006	SRP322492	SAMN25734238	43,853,520	77%	36%
SRR17931572	SRX14090007	SRP322492	SAMN25734239	50,666,898	74%	30%
SRR17931571	SRX14090008	SRP322492	SAMN25734240	47,678,006	77%	35%
SRR17931570	SRX14090009	SRP322492	SAMN25734241	58,147,068	77%	33%
SRR17931569	SRX14090010	SRP322492	SAMN25734242	65,210,848	75%	30%
SRR17931578	SRX14090001	SRP322492	SAMN25734243	65,923,040	66%	22%
SRR17931577	SRX14090002	SRP322492	SAMN25734244	48,377,352	69%	30%
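
The per-sample read counts are the sums of the per-run counts for each sample. Most samples here have a single run, but SAMN17035586 has twelve. The sketch below groups runs by sample and sums the reads, using the twelve SRP299293 runs from the table above.

```python
from collections import defaultdict

# (run, sample, number of reads) for the SRP299293 runs listed above.
runs = [
    ("SRR13304120", "SAMN17035586", 39224708),
    ("SRR13304119", "SAMN17035586", 43309308),
    ("SRR13304118", "SAMN17035586", 46215510),
    ("SRR13304117", "SAMN17035586", 42680974),
    ("SRR13304116", "SAMN17035586", 44770916),
    ("SRR13304115", "SAMN17035586", 46057900),
    ("SRR13304114", "SAMN17035586", 42480740),
    ("SRR13304113", "SAMN17035586", 43620388),
    ("SRR13304112", "SAMN17035586", 42721136),
    ("SRR13304111", "SAMN17035586", 43018894),
    ("SRR13304110", "SAMN17035586", 44718118),
    ("SRR13304109", "SAMN17035586", 42571200),
]

# Sum the read counts of all runs belonging to the same sample.
reads_by_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_by_sample[sample] += reads

print(f"{reads_by_sample['SAMN17035586']:,}")  # 521,389,792
```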

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	236	234 (99.15%)	234 (99.15%)	83.08%	89.88%
Hyalella azteca high-quality model RefSeq (XP_)	9,395	6,692 (71.23%)	6,692 (71.23%)	64.85%	59.39%
Crustacea GenBank	45,999	40,513 (88.07%)	40,513 (88.07%)	71.56%	79.08%
Daphnia pulex Other	58,982	31,592 (53.56%)	31,592 (53.56%)	65.34%	58.30%
Tribolium castaneum GenBank	673	566 (84.10%)	566 (84.10%)	67.49%	63.10%
Tribolium castaneum high-quality model RefSeq (XP_)	11,487	7,678 (66.84%)	7,678 (66.84%)	62.26%	53.27%
Tribolium castaneum known RefSeq (NP_)	627	510 (81.34%)	510 (81.34%)	66.55%	59.83%
Drosophila melanogaster known RefSeq (NP_)	30,704	19,163 (62.41%)	19,163 (62.41%)	65.73%	55.34%
Eurytemora affinis high-quality model RefSeq (XP_)	14,540	7,683 (52.84%)	7,683 (52.84%)	61.36%	48.14%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
