NCBI Mauremys mutica Annotation Release 100

The RefSeq genome records for Mauremys mutica were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Mauremys mutica Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Nov 3 2021
Date of submission of annotation to the public databases: Nov 6 2021
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM2049712v1	GCF_020497125.1	Institute of Zoology, Guangdong Academy of Science	10-14-2021	Reference	27 assembled chromosomes; unplaced scaffolds
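
As a rough sketch of how the assembly name and accession in the table above could be turned into a download location: directories on the NCBI genomes FTP site are commonly laid out by splitting the nine accession digits into three groups. That layout, the base URL, and the helper name `assembly_ftp_dir` are assumptions made for illustration and are not stated in this report, so the resulting path should be verified before use.

```python
def assembly_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build a directory URL for an assembly under genomes/all.

    Assumption: the <prefix>/<3 digits>/<3 digits>/<3 digits>/ layout is the
    usual convention on the FTP site; this report does not describe it.
    """
    prefix, rest = accession.split("_", 1)   # "GCF", "020497125.1"
    digits = rest.split(".", 1)[0]           # "020497125"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession}")
    # Split the nine digits into three groups of three.
    groups = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all"  # assumed base URL
    return f"{base}/{prefix}/{groups}/{accession}_{assembly_name}/"


print(assembly_ftp_dir("GCF_020497125.1", "ASM2049712v1"))
```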

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM2049712v1
Genes and pseudogenes	35,641
protein-coding	23,281
non-coding	10,781
Transcribed pseudogenes	0
Non-transcribed pseudogenes	1,165
genes with variants	12,461
Immunoglobulin/T-cell receptor gene segments	414
other	0
mRNAs	54,907
fully-supported	52,244
with > 5% ab initio	1,228
partial	216
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	54,907
non-coding RNAs	14,882
fully-supported	9,751
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	10,619
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	55,334
fully-supported	52,244
with > 5% ab initio	1,422
partial	259
with major correction(s)	315
known RefSeq (NP_)	0
model RefSeq (XP_)	54,920
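
The gene counts in the table above can be cross-checked with a few lines of arithmetic. The sketch below assumes the gene categories other than "genes with variants" do not overlap; the report does not state this explicitly, and the variable names are illustrative only.

```python
# Gene counts copied from the feature table for ASM2049712v1.
gene_counts = {
    "protein-coding": 23_281,
    "non-coding": 10_781,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 1_165,
    "Ig/TCR gene segments": 414,
    "other": 0,
}
reported_total = 35_641  # "Genes and pseudogenes" row

category_sum = sum(gene_counts.values())
print(category_sum, category_sum == reported_total)

# The CDS total exceeds the model RefSeq (XP_) count by 414.
cds_total = 55_334
model_xp = 54_920
print(cds_total - model_xp)
```

The difference between the CDS total and the XP_ count equals the number of immunoglobulin/T-cell receptor gene segments, which suggests (though the report does not say so) that segment CDSs are counted without XP_ accessions.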

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	34,062	39,005	12,516	61	2,584,986
All transcripts	69,789	2,892	2,368	61	82,758
mRNA	54,907	3,406	2,851	114	82,758
misc_RNA	1,833	2,750	2,290	171	12,711
tRNA	4,261	75	73	66	88
lncRNA	7,918	1,175	770	100	16,947
snoRNA	365	135	131	63	312
snRNA	438	155	164	61	200
guide_RNA	16	191	143	86	420
rRNA	51	544	119	116	3,941
Single-exon transcripts	2,112	1,235	948	201	14,671
coding transcripts (NM_/XM_ )	2,112	1,235	948	201	14,671
CDSs	54,920	1,855	1,395	96	81,795
Exons	293,885	305	135	1	24,530
in coding transcripts (NM_/XM_ )	268,113	305	135	1	24,530
in non-coding transcripts (NR_/XR_ )	36,071	273	133	2	9,913
Introns	260,642	6,386	1,645	30	1,189,664
in coding transcripts (NM_/XM_ )	241,655	6,228	1,611	30	1,189,664
in non-coding transcripts (NR_/XR_ )	28,899	7,389	2,014	30	507,927

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.2	1	1	50
Number of exons per transcript	10.8	8	1	190

BUSCO analysis of gene annotation

BUSCO v4.1.4 (Simão et al. 2015, PMID: 26059717) was run in "protein" mode with the sauropsida_odb10 lineage dataset on the annotated gene set, using the longest protein of each gene. Results are reported for the gene set from the primary assembly unit and presented in BUSCO notation (C: complete [S: single-copy, D: duplicated], F: fragmented, M: missing, n: number of genes used).
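
A results line in the notation described above could be parsed as sketched below. The example string is made-up placeholder data, not the BUSCO result for this annotation, and `parse_busco` is a hypothetical helper rather than part of BUSCO.

```python
import re

# Pattern for the compact notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict:
    """Return percentages (floats) and n (int) from a BUSCO summary string."""
    match = BUSCO_RE.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not in BUSCO notation: {summary!r}")
    result = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    result["n"] = int(match.group("n"))
    return result


# Placeholder values for illustration only.
example = "C:90.0%[S:88.0%,D:2.0%],F:4.0%,M:6.0%,n:1000"
scores = parse_busco(example)
print(scores)
```

In this notation the complete fraction is the sum of the single-copy and duplicated fractions, which gives a simple sanity check on a parsed line.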

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,268 coding genes, 22,255 genes had a protein with an alignment covering 50% or more of the query and 14,193 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
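
The two coverage definitions can be illustrated with a minimal sketch. It assumes a single contiguous alignment with 1-based inclusive coordinates; how the pipeline combines multiple alignment segments is not described here, and the coordinates below are hypothetical.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percent of a sequence included in an alignment (1-based, inclusive)."""
    if not (1 <= aln_start <= aln_end <= seq_length):
        raise ValueError("alignment coordinates fall outside the sequence")
    return 100.0 * (aln_end - aln_start + 1) / seq_length


# Hypothetical alignment: residues 11-460 of a 500-residue annotated protein
# (query) aligned to residues 1-450 of a 600-residue curated protein (target).
query_cov = coverage_percent(11, 460, 500)
target_cov = coverage_percent(1, 450, 600)
print(query_cov, target_cov)  # 90.0 75.0
```

Under the thresholds used above, this hypothetical gene would count toward the "50% or more" group but not the "95% or more" group.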

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
ASM2049712v1	GCF_020497125.1	33.20%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	14	14 (100.00%)	14 (100.00%)	98.85%	99.79%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,032,384,828	65%	30%	418,248
SAMN15066209	liver (Mauremys mutica, age 2, SAMN15066209)	43,188,306	67%	36%	144,941
SAMN15066210	liver (Mauremys mutica, age 2, SAMN15066210)	39,477,288	68%	33%	142,418
SAMN15066211	liver (Mauremys mutica, age 2, SAMN15066211)	43,391,576	66%	39%	145,444
SAMN15066212	liver (Mauremys mutica, age 2, SAMN15066212)	42,733,470	45%	11%	84,236
SAMN15066213	liver (Mauremys mutica, age 2, SAMN15066213)	42,384,266	70%	33%	136,598
SAMN15066214	liver (Mauremys mutica, age 2, SAMN15066214)	38,516,128	59%	31%	114,580
SAMN15066215	liver (Mauremys mutica, age 2, SAMN15066215)	43,280,466	36%	17%	109,162
SAMN15066216	liver (Mauremys mutica, age 2, SAMN15066216)	44,323,306	57%	29%	110,232
SAMN15066217	liver (Mauremys mutica, age 2, SAMN15066217)	42,516,666	63%	31%	126,975
SAMN15066218	liver (Mauremys mutica, age 2, SAMN15066218)	41,318,648	46%	26%	101,748
SAMN15066219	liver (Mauremys mutica, age 2, SAMN15066219)	41,108,034	41%	27%	112,947
SAMN15066220	liver (Mauremys mutica, age 2, SAMN15066220)	40,689,454	56%	30%	98,516
SAMN15066221	liver (Mauremys mutica, age 2, SAMN15066221)	41,822,302	51%	26%	95,758
SAMN15066222	liver (Mauremys mutica, age 2, SAMN15066222)	42,296,624	57%	26%	121,922
SAMN15066223	liver (Mauremys mutica, age 2, SAMN15066223)	41,187,478	50%	27%	92,906
SAMN17131976	gonad (Mauremys mutica, male, SAMN17131976)	66,952,102	81%	34%	265,968
SAMN17131977	gonad (Mauremys mutica, male, SAMN17131977)	66,018,416	74%	27%	219,995
SAMN17131978	gonad (Mauremys mutica, male, SAMN17131978)	66,790,594	87%	44%	234,494
SAMN17131979	gonad (Mauremys mutica, female, SAMN17131979)	69,649,798	77%	23%	209,137
SAMN17131980	gonad (Mauremys mutica, female, SAMN17131980)	66,268,362	80%	35%	310,386
SAMN17131981	gonad (Mauremys mutica, female, SAMN17131981)	68,471,544	76%	25%	202,944
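
The "All" row can be related to the per-sample rows with a short calculation. The sketch below sums the read counts and computes a read-weighted mean of the aligned percentages; that the aggregate percentage is derived this way is an assumption, and because the per-sample percentages are rounded, the weighted mean is only approximate.

```python
# (number of reads, percent aligned reads) copied from the table above.
samples = [
    # liver, SAMN15066209 to SAMN15066223
    (43_188_306, 67), (39_477_288, 68), (43_391_576, 66), (42_733_470, 45),
    (42_384_266, 70), (38_516_128, 59), (43_280_466, 36), (44_323_306, 57),
    (42_516_666, 63), (41_318_648, 46), (41_108_034, 41), (40_689_454, 56),
    (41_822_302, 51), (42_296_624, 57), (41_187_478, 50),
    # gonad, SAMN17131976 to SAMN17131981
    (66_952_102, 81), (66_018_416, 74), (66_790_594, 87), (69_649_798, 77),
    (66_268_362, 80), (68_471_544, 76),
]

total_reads = sum(reads for reads, _ in samples)
# Weight each sample's aligned percentage by its read count.
weighted_pct = sum(reads * pct for reads, pct in samples) / total_reads

print(f"{total_reads:,}")       # 1,032,384,828, matching the "All" row
print(round(weighted_pct, 1))   # close to the reported 65%
```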

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR11884994	SRX8432505	SRP265393	SAMN15066209	43,188,306	67%	36%
SRR11884993	SRX8432506	SRP265393	SAMN15066210	39,477,288	68%	33%
SRR11884987	SRX8432512	SRP265393	SAMN15066211	43,391,576	66%	39%
SRR11884986	SRX8432513	SRP265393	SAMN15066212	42,733,470	45%	11%
SRR11884985	SRX8432514	SRP265393	SAMN15066213	42,384,266	70%	33%
SRR11884984	SRX8432515	SRP265393	SAMN15066214	38,516,128	59%	31%
SRR11884983	SRX8432516	SRP265393	SAMN15066215	43,280,466	36%	17%
SRR11884982	SRX8432517	SRP265393	SAMN15066216	44,323,306	57%	29%
SRR11884981	SRX8432518	SRP265393	SAMN15066217	42,516,666	63%	31%
SRR11884980	SRX8432519	SRP265393	SAMN15066218	41,318,648	46%	26%
SRR11884992	SRX8432507	SRP265393	SAMN15066219	41,108,034	41%	27%
SRR11884991	SRX8432508	SRP265393	SAMN15066220	40,689,454	56%	30%
SRR11884990	SRX8432509	SRP265393	SAMN15066221	41,822,302	51%	26%
SRR11884989	SRX8432510	SRP265393	SAMN15066222	42,296,624	57%	26%
SRR11884988	SRX8432511	SRP265393	SAMN15066223	41,187,478	50%	27%
SRR13283011	SRX9712393	SRP298783	SAMN17131976	66,952,102	81%	34%
SRR13283012	SRX9712392	SRP298783	SAMN17131977	66,018,416	74%	27%
SRR13283007	SRX9712397	SRP298783	SAMN17131978	66,790,594	87%	44%
SRR13283008	SRX9712396	SRP298783	SAMN17131979	69,649,798	77%	23%
SRR13283010	SRX9712394	SRP298783	SAMN17131980	66,268,362	80%	35%
SRR13283009	SRX9712395	SRP298783	SAMN17131981	68,471,544	76%	25%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pelodiscus sinensis high-quality model RefSeq (XP_)	10,355	9,957 (96.16%)	9,957 (96.16%)	74.59%	85.55%
Same-species GenBank	14	12 (85.71%)	12 (85.71%)	81.06%	91.15%
Xenopus GenBank	31,765	9,432 (29.69%)	9,432 (29.69%)	68.01%	75.76%
Xenopus known RefSeq (NP_)	19,183	18,357 (95.69%)	18,357 (95.69%)	69.17%	79.57%
Sauropsida GenBank	30,014	18,888 (62.93%)	18,888 (62.93%)	68.06%	77.05%
Sauropsida known RefSeq (NP_)	9,241	8,416 (91.07%)	8,416 (91.07%)	72.38%	81.59%
Chrysemys picta high-quality model RefSeq (XP_)	14,825	14,382 (97.01%)	14,382 (97.01%)	77.67%	87.52%
Homo sapiens GenBank	148,831	81,015 (54.43%)	81,015 (54.43%)	64.69%	78.88%
Homo sapiens known RefSeq (NP_)	62,816	43,610 (69.42%)	43,610 (69.42%)	69.36%	77.20%
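
The percentages in the table above are the aligned count divided by the retrieved count. A minimal sketch that recomputes a few of them follows; the helper name `percent_aligned` is illustrative.

```python
def percent_aligned(aligned: int, retrieved: int) -> float:
    """Aligned sequences as a percentage of retrieved ones, to two decimals."""
    return round(100.0 * aligned / retrieved, 2)


# (retrieved, aligned by ProSplign) copied from selected rows of the table.
rows = {
    "Pelodiscus sinensis high-quality model RefSeq (XP_)": (10_355, 9_957),
    "Same-species GenBank": (14, 12),
    "Xenopus GenBank": (31_765, 9_432),
    "Homo sapiens GenBank": (148_831, 81_015),
}

for source, (retrieved, aligned) in rows.items():
    print(f"{source}: {percent_aligned(aligned, retrieved):.2f}%")
```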

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
