NCBI Diceros bicornis minor Annotation Release GCF_020826845.1-RS_2023_07

The genome sequence records for Diceros bicornis minor RefSeq assembly GCF_020826845.1 (mDicBic1.mat.cur) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_020826845.1-RS_2023_07".

Date of Entrez queries for transcripts and proteins: Jul 26 2023
Date of submission of annotation to the public databases: Jul 31 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
mDicBic1.mat.cur	GCF_020826845.1	Vertebrate Genomes Project	11-10-2021	Reference	43 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	mDicBic1.mat.cur
Genes and pseudogenes	30,368
  protein-coding	21,434
  non-coding	5,394
  Transcribed pseudogenes	1
  Non-transcribed pseudogenes	3,350
  genes with variants	10,166
  Immunoglobulin/T-cell receptor gene segments	166
  other	23
mRNAs	47,774
  fully-supported	45,607
  with > 5% ab initio	1,042
  partial	98
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	47,774
non-coding RNAs	9,061
  fully-supported	7,064
  with > 5% ab initio	0
  partial	1
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	8,480
pseudo transcripts	1
  fully-supported	1
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	1
CDSs	47,953
  fully-supported	45,607
  with > 5% ab initio	1,170
  partial	99
  with major correction(s)	1,059
  known RefSeq (NP_)	0
  model RefSeq (XP_)	47,787
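
As a quick consistency check on the feature counts, the gene categories can be summed and compared with the reported "Genes and pseudogenes" total. The sketch below is illustrative only: the values are copied from the table, and leaving out "genes with variants" (treated here as a subset of the other categories rather than a category of its own) is an assumption, not something the report states.

```python
# Gene categories copied from the Feature counts table.
# "genes with variants" is left out on the assumption that it overlaps
# the other categories instead of forming a separate one.
gene_categories = {
    "protein-coding": 21_434,
    "non-coding": 5_394,
    "Transcribed pseudogenes": 1,
    "Non-transcribed pseudogenes": 3_350,
    "Immunoglobulin/T-cell receptor gene segments": 166,
    "other": 23,
}
reported_total = 30_368  # "Genes and pseudogenes"

category_sum = sum(gene_categories.values())
print(category_sum, category_sum == reported_total)  # 30368 True
```

Under that assumption the six categories account for the full total.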

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	26,851	41,635	12,738	53	1,867,373
All transcripts	56,835	3,025	2,405	53	103,482
  mRNA	47,774	3,338	2,733	141	103,482
  misc_RNA	1,599	2,327	1,815	119	14,016
  tRNA	579	74	73	59	87
  lncRNA	5,465	1,487	884	76	12,696
  snoRNA	537	104	93	53	327
  snRNA	487	114	107	59	194
  rRNA	371	1,209	153	115	4,716
Single-exon transcripts	2,510	1,266	953	141	15,149
  coding transcripts (NM_/XM_)	2,510	1,266	953	141	15,149
CDSs	47,787	2,019	1,482	99	103,113
Exons	252,219	298	135	1	20,267
  in coding transcripts (NM_/XM_)	235,263	293	134	1	20,267
  in non-coding transcripts (NR_/XR_)	23,143	324	132	10	11,695
Introns	225,395	6,033	1,461	30	905,198
  in coding transcripts (NM_/XM_)	212,776	5,853	1,441	30	905,198
  in non-coding transcripts (NR_/XR_)	18,475	7,799	1,731	30	387,450
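
The totals in the Feature lengths table can be related back to the Feature counts table with simple arithmetic. The sketch below uses only numbers from the two tables. The interpretation in the comments (which gene categories make up the "Genes" row, and that the unlisted non-coding transcripts belong to RNA classes not broken out in the table) is inferred from the sums and is not stated in the report.

```python
# Counts copied from the Feature counts and Feature lengths tables.
protein_coding, non_coding, other = 21_434, 5_394, 23
genes_in_lengths_table = 26_851

mrnas, noncoding_rnas = 47_774, 9_061
all_transcripts = 56_835

# Non-coding transcript classes listed under "All transcripts".
listed_noncoding = {
    "misc_RNA": 1_599,
    "tRNA": 579,
    "lncRNA": 5_465,
    "snoRNA": 537,
    "snRNA": 487,
    "rRNA": 371,
}

# Inferred: the "Genes" row matches protein-coding + non-coding + other,
# i.e. without pseudogenes and without Ig/TCR gene segments.
genes_sum = protein_coding + non_coding + other
transcripts_sum = mrnas + noncoding_rnas
unlisted_noncoding = noncoding_rnas - sum(listed_noncoding.values())

print(genes_sum == genes_in_lengths_table)   # True
print(transcripts_sum == all_transcripts)    # True
print(unlisted_noncoding)                    # 23
```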

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.14	1	1	50
Number of exons per transcript	11.33	8	1	314

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the single longest protein of each gene and the laurasiatheria_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
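
The step of picking one longest protein per gene can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the `Protein` record, the function name, and the identifiers in the example are hypothetical, and keeping the first protein seen when lengths tie is an assumption.

```python
from typing import Dict, Iterable, NamedTuple


class Protein(NamedTuple):
    gene_id: str
    protein_id: str
    length: int  # protein length in amino acids


def longest_protein_per_gene(proteins: Iterable[Protein]) -> Dict[str, Protein]:
    """Keep one protein per gene: the longest, first seen wins on ties."""
    best: Dict[str, Protein] = {}
    for protein in proteins:
        current = best.get(protein.gene_id)
        if current is None or protein.length > current.length:
            best[protein.gene_id] = protein
    return best


# Hypothetical identifiers, for illustration only.
example = [
    Protein("geneA", "XP_hypothetical_1", 320),
    Protein("geneA", "XP_hypothetical_2", 455),
    Protein("geneB", "XP_hypothetical_3", 210),
]
selected = longest_protein_per_gene(example)
for gene_id, protein in selected.items():
    print(gene_id, protein.protein_id, protein.length)
```

Reducing each gene to a single protein keeps genes with many isoforms from being counted as duplicated BUSCOs.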

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 21,421 coding genes, 21,179 genes had a protein with an alignment covering 50% or more of the query and 18,015 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
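
The two coverage definitions, and the share of coding genes that clear each query-coverage threshold, can be worked through in a few lines. This is a sketch under stated assumptions: the function name and the example alignment lengths are hypothetical, and the aligned span is taken to be the same on query and target, which ignores gaps.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Percentage of a sequence's length that is included in the alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: 380 residues aligned between a 400-residue
# annotated protein (query) and a 500-residue Swiss-Prot protein (target).
query_cov = coverage_percent(380, 400)
target_cov = coverage_percent(380, 500)
print(query_cov, target_cov)  # 95.0 76.0

# Gene counts reported above.
coding_genes = 21_421
genes_cov50 = 21_179  # query coverage of 50% or more
genes_cov95 = 18_015  # query coverage of 95% or more

share50 = 100.0 * genes_cov50 / coding_genes
share95 = 100.0 * genes_cov95 / coding_genes
print(f"{share50:.2f}% {share95:.2f}%")  # 98.87% 84.10%
```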

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
mDicBic1.mat.cur	GCF_020826845.1	39.53%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	3	3 (100.00%)	3 (100.00%)	99.47%	99.88%

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,472,147,904	83%	16%	283,819
SAMN07187888	NA	adult, blood (Ceratotherium simum, SAMN07187888)	81,814,796	79%	21%	161,381
SAMN16722525	35260583	Rino-P4: NWR iPSCs primed - rep3 (Ceratotherium simum cottoni, SAMN16722525)	201,092,240	69%	14%	210,826
SAMN16722526	35260583	Rhino-lib13: NWR iPSCs primed - rep2 (Ceratotherium simum cottoni, SAMN16722526)	155,679,508	84%	12%	211,262
SAMN16722527	35260583	Rhino-lib4: NWR iPSCs primed - rep1 (Ceratotherium simum cottoni, SAMN16722527)	180,746,576	86%	13%	210,162
SAMN16722528	35260583		163,940,123	75%	12%	215,387
SAMN16722529	35260583		121,117,509	79%	13%	203,587
SAMN16722530	35260583		109,819,035	81%	10%	200,859
SAMN16722531	35260583		136,423,125	82%	10%	208,706
SAMN16722532	35260583		373,491,737	87%	17%	232,917
SAMN16722533	35260583	Rino-P5: NWR iPSCs primed - rep4 (Ceratotherium simum cottoni, SAMN16722533)	169,476,513	82%	14%	218,525
SAMN24694883	36490332	primordial germ cell like cells, (Ceratotherium simum cottoni, SAMN24694883)	39,895,301	91%	20%	162,788
SAMN24694884	36490332	primordial germ cell like cells, (Ceratotherium simum cottoni, SAMN24694884)	38,912,506	91%	20%	167,068
SAMN24694885	36490332	pre-induced cells, (Ceratotherium simum cottoni, SAMN24694885)	43,961,763	92%	20%	173,269
SAMN24694886	36490332	pre-induced cells, (Ceratotherium simum cottoni, SAMN24694886)	44,005,077	92%	20%	171,549
SAMN24694887	36490332	embryonic stem cells, (Ceratotherium simum cottoni, SAMN24694887)	46,711,028	92%	19%	174,064
SAMN24694888	36490332	embryonic stem cells, (Ceratotherium simum cottoni, SAMN24694888)	41,425,928	91%	19%	166,549
SAMN24694889	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694889)	28,034,375	89%	19%	162,042
SAMN24694890	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694890)	27,734,457	89%	20%	162,736
SAMN24694891	36490332	pre-induced cells, (Ceratotherium simum simum, SAMN24694891)	30,303,565	89%	19%	159,615
SAMN24694892	36490332	pre-induced cells, (Ceratotherium simum simum, SAMN24694892)	31,170,210	88%	18%	161,076
SAMN24694893	36490332	embryonic stem cells, (Ceratotherium simum simum, SAMN24694893)	29,986,447	90%	19%	165,516
SAMN24694894	36490332	embryonic stem cells, (Ceratotherium simum simum, SAMN24694894)	30,457,946	90%	19%	165,475
SAMN24694895	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694895)	27,153,577	90%	19%	164,958
SAMN24694896	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694896)	27,400,991	89%	19%	165,663
SAMN24694897	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694897)	17,499,508	91%	19%	153,180
SAMN24694898	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN24694898)	17,284,690	90%	19%	152,301
SAMN24694899	36490332	pre-induced cells, (Ceratotherium simum simum, SAMN24694899)	17,831,443	91%	19%	149,820
SAMN24694900	36490332	pre-induced cells, (Ceratotherium simum simum, SAMN24694900)	15,588,228	91%	19%	145,361
SAMN24694901	36490332	embryonic stem cells, (Ceratotherium simum simum, SAMN24694901)	14,185,645	92%	18%	141,579
SAMN24694902	36490332	embryonic stem cells, (Ceratotherium simum simum, SAMN24694902)	59,654,876	91%	17%	181,043
SAMN30052224	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052224)	18,549,821	90%	19%	154,295
SAMN30052225	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052225)	18,446,021	91%	20%	157,554
SAMN30052226	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052226)	20,671,440	91%	20%	158,518
SAMN30052227	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052227)	19,222,337	91%	20%	155,161
SAMN30052228	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052228)	15,981,439	89%	19%	145,413
SAMN30052229	36490332	primordial germ cell like cells, (Ceratotherium simum simum, SAMN30052229)	18,495,427	90%	19%	149,983
SAMN31353918	NA	Granulosa Cell, (Ceratotherium simum simum, SAMN31353918)	1,492,212	83%	17%	41,587
SAMN31353919	NA	Granulosa Cell, (Ceratotherium simum simum, SAMN31353919)	2,560,221	62%	18%	54,558
SAMN31353920	NA	Granulosa Cell, (Ceratotherium simum simum, SAMN31353920)	2,188,766	83%	16%	64,312
SAMN31353921	NA	Granulosa Cell, (Ceratotherium simum simum, SAMN31353921)	22,671,204	76%	18%	129,239

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5647986	SRX2885270	SRP108483	SAMN07187888	81,814,796	79%	21%
SRR13017768	SRX9468703	SRP291986	SAMN16722525	201,092,240	69%	14%
SRR13017767	SRX9468702	SRP291986	SAMN16722526	155,679,508	84%	12%
SRR13017766	SRX9468701	SRP291986	SAMN16722527	180,746,576	86%	13%
SRR13017774	SRX9468709	SRP291986	SAMN16722528	163,940,123	75%	12%
SRR13017773	SRX9468708	SRP291986	SAMN16722529	121,117,509	79%	13%
SRR13017772	SRX9468707	SRP291986	SAMN16722530	109,819,035	81%	10%
SRR13017771	SRX9468706	SRP291986	SAMN16722531	136,423,125	82%	10%
SRR13017770	SRX9468705	SRP291986	SAMN16722532	373,491,737	87%	17%
SRR13017769	SRX9468704	SRP291986	SAMN16722533	169,476,513	82%	14%
SRR17468379	SRX13639536	SRP353822	SAMN24694883	39,895,301	91%	20%
SRR17468380	SRX13639535	SRP353822	SAMN24694884	38,912,506	91%	20%
SRR17468381	SRX13639534	SRP353822	SAMN24694885	43,961,763	92%	20%
SRR17468382	SRX13639533	SRP353822	SAMN24694886	44,005,077	92%	20%
SRR17468383	SRX13639532	SRP353822	SAMN24694887	46,711,028	92%	19%
SRR17468384	SRX13639531	SRP353822	SAMN24694888	41,425,928	91%	19%
SRR17468385	SRX13639530	SRP353822	SAMN24694889	28,034,375	89%	19%
SRR17468386	SRX13639529	SRP353822	SAMN24694890	27,734,457	89%	20%
SRR17468387	SRX13639528	SRP353822	SAMN24694891	30,303,565	89%	19%
SRR17468388	SRX13639527	SRP353822	SAMN24694892	31,170,210	88%	18%
SRR17468389	SRX13639526	SRP353822	SAMN24694893	29,986,447	90%	19%
SRR17468390	SRX13639525	SRP353822	SAMN24694894	30,457,946	90%	19%
SRR17468391	SRX13639524	SRP353822	SAMN24694895	27,153,577	90%	19%
SRR17468392	SRX13639523	SRP353822	SAMN24694896	27,400,991	89%	19%
SRR17468393	SRX13639522	SRP353822	SAMN24694897	17,499,508	91%	19%
SRR17468394	SRX13639521	SRP353822	SAMN24694898	17,284,690	90%	19%
SRR17468395	SRX13639520	SRP353822	SAMN24694899	17,831,443	91%	19%
SRR17468396	SRX13639519	SRP353822	SAMN24694900	15,588,228	91%	19%
SRR17468397	SRX13639518	SRP353822	SAMN24694901	14,185,645	92%	18%
SRR17468398	SRX13639517	SRP353822	SAMN24694902	59,654,876	91%	17%
SRR20711243	SRX16732163	SRP353822	SAMN30052224	18,549,821	90%	19%
SRR20711244	SRX16732162	SRP353822	SAMN30052225	18,446,021	91%	20%
SRR20711245	SRX16732161	SRP353822	SAMN30052226	20,671,440	91%	20%
SRR20711246	SRX16732160	SRP353822	SAMN30052227	19,222,337	91%	20%
SRR20711247	SRX16732159	SRP353822	SAMN30052228	15,981,439	89%	19%
SRR20711248	SRX16732158	SRP353822	SAMN30052229	18,495,427	90%	19%
SRR21954062	SRX17937694	SRP403267	SAMN31353918	1,492,212	83%	17%
SRR21954063	SRX17937693	SRP403267	SAMN31353919	2,560,221	62%	18%
SRR21954064	SRX17937692	SRP403267	SAMN31353920	2,188,766	83%	16%
SRR21954065	SRX17937691	SRP403267	SAMN31353921	22,671,204	76%	18%
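
For readers who want to work with the per-run statistics programmatically, the rows can be read as tab-separated text. The sketch below embeds four rows copied from the table and flags runs whose alignment rate falls below a cutoff. The 70% cutoff and the variable names are illustrative assumptions, not part of the report.

```python
import csv
import io

# Header and four rows copied from the per-run table (tab-separated).
HEADER = ["Run", "Experiment", "Project", "Sample", "Number of reads",
          "Percent aligned reads", "Percent of aligned reads with introns"]
ROWS = [
    ["SRR5647986", "SRX2885270", "SRP108483", "SAMN07187888", "81,814,796", "79%", "21%"],
    ["SRR13017768", "SRX9468703", "SRP291986", "SAMN16722525", "201,092,240", "69%", "14%"],
    ["SRR17468379", "SRX13639536", "SRP353822", "SAMN24694883", "39,895,301", "91%", "20%"],
    ["SRR21954063", "SRX17937693", "SRP403267", "SAMN31353919", "2,560,221", "62%", "18%"],
]
table_text = "\n".join("\t".join(cells) for cells in [HEADER] + ROWS)

MIN_PERCENT_ALIGNED = 70.0  # hypothetical cutoff, chosen for illustration

low_alignment_runs = []
total_reads = 0
for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
    reads = int(row["Number of reads"].replace(",", ""))
    percent_aligned = float(row["Percent aligned reads"].rstrip("%"))
    total_reads += reads
    if percent_aligned < MIN_PERCENT_ALIGNED:
        low_alignment_runs.append(row["Run"])

print(low_alignment_runs)  # ['SRR13017768', 'SRR21954063']
print(total_reads)
```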

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Homo sapiens known RefSeq (NP_)	67,111	64,516 (96.13%)	64,516 (96.13%)	79.10%	87.04%
Perissodactyla GenBank	1,431	1,410 (98.53%)	1,410 (98.53%)	74.73%	84.63%
Perissodactyla known RefSeq (NP_)	1,946	1,919 (98.61%)	1,919 (98.61%)	73.21%	94.24%
Equus caballus high-quality model RefSeq (XP_)	15,163	14,808 (97.66%)	14,808 (97.66%)	80.53%	89.31%
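
The percentages in the table follow directly from the sequence counts. As a worked check, the sketch below recomputes the share of retrieved sequences aligned by ProSplign for each source and compares it with the reported value. The counts and percentages are copied from the table; the variable names are illustrative.

```python
# (retrieved from Entrez, aligned by ProSplign, reported percentage)
protein_sources = {
    "Homo sapiens known RefSeq (NP_)": (67_111, 64_516, "96.13%"),
    "Perissodactyla GenBank": (1_431, 1_410, "98.53%"),
    "Perissodactyla known RefSeq (NP_)": (1_946, 1_919, "98.61%"),
    "Equus caballus high-quality model RefSeq (XP_)": (15_163, 14_808, "97.66%"),
}

matches = {}
for source, (retrieved, aligned, reported) in protein_sources.items():
    computed = f"{100.0 * aligned / retrieved:.2f}%"
    matches[source] = (computed == reported)
    print(f"{source}: computed {computed}, reported {reported}")
```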

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
