NCBI Vigna unguiculata Annotation Release 100

The RefSeq genome records for Vigna unguiculata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Vigna unguiculata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Feb 6 2019
Date of submission of annotation to the public databases: Feb 11 2019
Software version: 8.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM411807v1	GCF_004118075.1	University of California, Riverside	01-30-2019	Reference	12 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM411807v1
Genes and pseudogenes	34,383
protein-coding	28,314
non-coding	4,855
transcribed pseudogenes	26
non-transcribed pseudogenes	1,188
genes with variants	7,436
immunoglobulin/T-cell receptor gene segments	0
other	0
mRNAs	41,089
fully-supported	36,843
with > 5% ab initio	3,636
partial	120
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	41,089
non-coding RNAs	8,638
fully-supported	5,524
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	7,495
pseudo transcripts	26
fully-supported	26
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	26
CDSs	41,173
fully-supported	36,843
with > 5% ab initio	3,695
partial	120
with major correction(s)	341
known RefSeq (NP_)	0
model RefSeq (XP_)	41,173
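
The gene category rows in the table above can be cross-checked against the "Genes and pseudogenes" total. The following is a minimal sketch in Python, with the counts copied by hand from the table. It assumes that "genes with variants" is a subset of the other categories rather than a separate category, so it is left out of the sum; that reading is an assumption, not something the report states.

```python
# Gene counts copied from the Feature counts table for ASM411807v1.
# "genes with variants" is deliberately omitted: it is assumed here to be
# a subset of the categories below, not an additional category.
gene_counts = {
    "protein-coding": 28_314,
    "non-coding": 4_855,
    "transcribed pseudogenes": 26,
    "non-transcribed pseudogenes": 1_188,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 0,
}
reported_total = 34_383  # "Genes and pseudogenes" row


def is_pseudogene(category: str) -> bool:
    """Return True for the two pseudogene categories."""
    return "pseudogene" in category


# Sum of all categories, and the sum with pseudogenes excluded.
total = sum(gene_counts.values())
non_pseudo = sum(n for c, n in gene_counts.items() if not is_pseudogene(c))

print("categories sum to reported total:", total == reported_total)
print("genes excluding pseudogenes:", non_pseudo)
```

The pseudogene-free sum is the figure to compare with the "Genes" row of the Feature lengths table, which excludes pseudogenes.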

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	33,169	3,699	2,653	59	104,097
All transcripts	49,727	1,830	1,608	59	16,728
mRNA	41,089	1,933	1,678	180	16,728
misc_RNA	2,440	2,450	2,161	256	12,152
tRNA	1,135	74	73	59	88
lncRNA	3,084	1,470	1,062	92	13,114
snoRNA	518	105	107	64	226
snRNA	175	155	160	102	198
rRNA	1,286	714	119	104	3,463
Single-exon transcripts	4,754	1,359	1,158	180	10,907
coding transcripts (NM_/XM_ )	4,753	1,359	1,158	180	10,907
non-coding transcripts (NR_/XR_ )	1	568	568	568	568
CDSs	41,173	1,437	1,188	90	16,278
Exons	186,783	324	171	1	11,295
in coding transcripts (NM_/XM_ )	175,668	322	170	1	10,907
in non-coding transcripts (NR_/XR_ )	18,479	301	150	2	11,295
Introns	150,254	538	189	30	87,770
in coding transcripts (NM_/XM_ )	142,482	523	185	30	87,770
in non-coding transcripts (NR_/XR_ )	14,861	700	244	30	46,407

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.52	1	1	50
Number of exons per transcript	6.33	5	1	78

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 28,230 coding genes, 24,895 genes had a protein with an alignment covering 50% or more of the query and 11,842 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
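
The two coverage definitions can be illustrated with a short Python sketch. The Hit class, its field names and the three example alignments are hypothetical and are not pipeline output. Each alignment is simplified to one contiguous interval on the query and one on the target, whereas a real BLASTP hit can consist of several segments. The last lines turn the gene counts quoted above into percentages.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One protein alignment (hypothetical structure).

    Interval ends are 1-based and inclusive.
    """
    query_len: int
    target_len: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def query_coverage(hit: Hit) -> float:
    """Percentage of the annotated protein included in the alignment."""
    return 100.0 * (hit.q_end - hit.q_start + 1) / hit.query_len


def target_coverage(hit: Hit) -> float:
    """Percentage of the target protein included in the alignment."""
    return 100.0 * (hit.t_end - hit.t_start + 1) / hit.target_len


# Made-up alignments, for illustration only.
hits = [
    Hit(400, 420, 1, 390, 5, 400),
    Hit(300, 900, 50, 220, 100, 275),
    Hit(250, 240, 1, 100, 1, 98),
]

# Count hits at the two query-coverage thresholds used in the report.
n_50 = sum(query_coverage(h) >= 50 for h in hits)
n_95 = sum(query_coverage(h) >= 95 for h in hits)
print(n_50, n_95)

# Gene counts quoted in the report, expressed as percentages.
coding_genes = 28_230
pct_50 = 100.0 * 24_895 / coding_genes
pct_95 = 100.0 * 11_842 / coding_genes
print(f"{pct_50:.1f}% at >=50% coverage, {pct_95:.1f}% at >=95% coverage")
```

Query and target coverage can differ widely for the same alignment when the two proteins differ in length, which is why they are reported separately.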

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ASM411807v1	GCF_004118075.1	4.19%	40.35%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	227	222 (97.80%)	210 (92.51%)	99.40%	98.03%
Same-species EST	187,486	183,232 (97.73%)	176,744 (94.27%)	99.56%	99.57%
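
The percentages in the table above follow from the counts in the same row. The sketch below is an illustrative Python check, not part of the pipeline. It parses the two rows as tab-separated text and recomputes each percentage relative to the number of sequences retrieved from Entrez. The helper names and the 0.01 percentage-point tolerance are assumptions made for this example.

```python
import re

# Rows copied from the Transcript alignments table (tab-separated).
rows = [
    "Same-species Genbank\t227\t222 (97.80%)\t210 (92.51%)\t99.40%\t98.03%",
    "Same-species EST\t187,486\t183,232 (97.73%)\t176,744 (94.27%)\t99.56%\t99.57%",
]

# Matches cells such as "183,232 (97.73%)".
CELL = re.compile(r"([\d,]+) \((\d+\.\d+)%\)")


def to_int(text: str) -> int:
    """Parse an integer written with thousands separators."""
    return int(text.replace(",", ""))


def check_row(line: str, tol: float = 0.01):
    """Recompute the 'aligned' and 'passed' percentages for one row.

    Returns the source name and, per column, whether the recomputed
    percentage agrees with the printed one within `tol` points.
    """
    source, retrieved, aligned, passed = line.split("\t")[:4]
    total = to_int(retrieved)
    results = {}
    for label, cell in (("aligned", aligned), ("passed", passed)):
        match = CELL.fullmatch(cell)
        if match is None:
            raise ValueError(f"unexpected cell format: {cell!r}")
        count, printed_pct = to_int(match.group(1)), float(match.group(2))
        recomputed = 100.0 * count / total
        results[label] = abs(recomputed - printed_pct) <= tol
    return source, results


for row in rows:
    print(check_row(row))
```

The same check applies to the Protein alignments table, which uses the same cell format.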

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	3,212,380,672	87%	25%	175,060
SAMN03284650	NA	root, stem, leaf, flower, tender pod (Vigna unguiculata, SAMN03284650)	54,417,074	90%	21%	132,451
SAMN03923016	26954786	leaf (Vigna unguiculata subsp. sesquipedalis, two-weeks, SAMN03923016)	88,668,904	86%	24%	124,222
SAMN03944580	26954786	leaf (Vigna unguiculata subsp. sesquipedalis, two-weeks, SAMN03944580)	82,379,258	87%	26%	125,255
SAMN07160186	NA	roots (Vigna unguiculata, SAMN07160186)	83,610,190	90%	24%	142,342
SAMN07160187	NA	roots (Vigna unguiculata, SAMN07160187)	65,037,704	90%	25%	139,874
SAMN07160188	NA	roots (Vigna unguiculata, SAMN07160188)	52,932,274	89%	24%	136,743
SAMN07160189	NA	roots (Vigna unguiculata, SAMN07160189)	57,455,300	87%	25%	138,910
SAMN07160190	NA	roots (Vigna unguiculata, SAMN07160190)	63,374,060	85%	24%	137,563
SAMN07160191	NA	roots (Vigna unguiculata, SAMN07160191)	82,306,692	90%	25%	139,694
SAMN07160192	NA	roots (Vigna unguiculata, SAMN07160192)	61,907,874	85%	24%	137,071
SAMN07160193	NA	roots (Vigna unguiculata, SAMN07160193)	60,386,940	87%	24%	136,640
SAMN07160194	NA	roots (Vigna unguiculata, SAMN07160194)	73,102,600	88%	24%	139,686
SAMN07160195	NA	roots (Vigna unguiculata, SAMN07160195)	76,009,330	91%	25%	141,036
SAMN07160196	NA	roots (Vigna unguiculata, SAMN07160196)	86,267,696	89%	24%	142,325
SAMN07160197	NA	roots (Vigna unguiculata, SAMN07160197)	72,758,410	89%	24%	142,094
SAMN07160198	NA	roots (Vigna unguiculata, SAMN07160198)	58,910,798	88%	24%	136,901
SAMN07160199	NA	roots (Vigna unguiculata, SAMN07160199)	74,785,482	89%	25%	139,104
SAMN07160200	NA	roots (Vigna unguiculata, SAMN07160200)	57,154,786	86%	24%	137,144
SAMN07160201	NA	roots (Vigna unguiculata, SAMN07160201)	56,772,856	87%	24%	138,786
SAMN07194302	NA	roots (Vigna unguiculata, SAMN07194302)	68,155,908	89%	24%	138,281
SAMN07194303	NA	roots (Vigna unguiculata, SAMN07194303)	68,182,452	91%	24%	139,298
SAMN07194304	NA	roots (Vigna unguiculata, SAMN07194304)	70,838,118	76%	24%	138,289
SAMN07194305	NA	roots (Vigna unguiculata, SAMN07194305)	66,037,212	89%	24%	139,132
SAMN07194306	NA	roots (Vigna unguiculata, SAMN07194306)	61,306,316	88%	24%	137,650
SAMN07194307	NA	roots (Vigna unguiculata, SAMN07194307)	57,857,824	89%	24%	136,617
SAMN07194308	NA	roots (Vigna unguiculata, SAMN07194308)	60,621,962	88%	24%	137,904
SAMN07194309	NA	roots (Vigna unguiculata, SAMN07194309)	63,492,974	88%	24%	139,077
SAMN07194882	27448251	flower (Vigna unguiculata, SAMN07194882)	55,412,844	84%	26%	132,645
SAMN07194883	27448251	flower (Vigna unguiculata, SAMN07194883)	30,677,798	79%	32%	128,552
SAMN07194884	27448251	flower (Vigna unguiculata, SAMN07194884)	35,651,002	81%	26%	126,995
SAMN07194885	27448251	leaf (Vigna unguiculata, SAMN07194885)	52,176,786	89%	24%	123,477
SAMN07194886	27448251	leaf (Vigna unguiculata, SAMN07194886)	13,018,072	64%	41%	100,524
SAMN07194887	27448251	leaf (Vigna unguiculata, SAMN07194887)	50,494,816	85%	26%	127,044
SAMN07194888	27448251	Mix tissues (Vigna unguiculata, SAMN07194888)	19,904,076	83%	42%	135,148
SAMN07194889	27448251	Pod (Vigna unguiculata, SAMN07194889)	77,365,556	89%	23%	135,126
SAMN07194890	27448251	Pod (Vigna unguiculata, SAMN07194890)	63,709,902	91%	22%	134,471
SAMN07194891	27448251	Pod (Vigna unguiculata, SAMN07194891)	71,985,516	91%	23%	130,993
SAMN07194892	27448251	Root (Vigna unguiculata, SAMN07194892)	78,892,548	89%	21%	137,411
SAMN07194893	27448251	Root (Vigna unguiculata, SAMN07194893)	55,737,098	82%	38%	139,946
SAMN07194894	27448251	Root (Vigna unguiculata, SAMN07194894)	50,012,362	87%	23%	134,725
SAMN07194895	27448251	Seed (Vigna unguiculata, SAMN07194895)	63,000,362	90%	27%	135,676
SAMN07194896	27448251	Seed (Vigna unguiculata, SAMN07194896)	41,364,578	83%	20%	129,874
SAMN07194897	27448251	Seed (Vigna unguiculata, SAMN07194897)	44,684,622	95%	31%	133,900
SAMN07194898	27448251	Seed (Vigna unguiculata, SAMN07194898)	72,883,196	89%	26%	129,043
SAMN07194899	27448251	Seed (Vigna unguiculata, SAMN07194899)	48,686,638	93%	27%	127,516
SAMN07194900	27448251	Seed (Vigna unguiculata, SAMN07194900)	51,584,984	94%	26%	125,215
SAMN07194901	27448251	Seed (Vigna unguiculata, SAMN07194901)	80,500,710	89%	24%	120,274
SAMN07194902	27448251	Seed (Vigna unguiculata, SAMN07194902)	38,709,344	92%	23%	112,191
SAMN07194903	27448251	Seed (Vigna unguiculata, SAMN07194903)	50,135,170	86%	24%	118,314
SAMN07194904	27448251	Seed (Vigna unguiculata, SAMN07194904)	54,139,128	72%	20%	101,458
SAMN07194905	27448251	Seed (Vigna unguiculata, SAMN07194905)	39,096,012	89%	22%	99,157
SAMN07194906	27448251	Seed (Vigna unguiculata, SAMN07194906)	31,547,096	81%	15%	68,742
SAMN07194907	27448251	Stem (Vigna unguiculata, SAMN07194907)	51,800,482	89%	24%	129,153
SAMN07194908	27448251	Stem (Vigna unguiculata, SAMN07194908)	33,550,898	79%	28%	126,376
SAMN07194909	27448251	Stem (Vigna unguiculata, SAMN07194909)	30,928,082	80%	26%	124,727

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR2140755	SRX1129675	SRP061809	SAMN03923016	47,524,396	88%	25%
SRR2140756	SRX1129676	SRP061809	SAMN03923016	41,144,508	83%	23%
SRR2140598	SRX1124686	SRP061809	SAMN03944580	46,040,414	88%	27%
SRR2135042	SRX1124982	SRP061809	SAMN03944580	36,338,844	86%	25%
SRR4897234	SRX2323067	SRP092517	SAMN03284650	54,417,074	90%	21%
SRR5645475	SRX2883569	SRP108619	SAMN07160186	83,610,190	90%	24%
SRR5645476	SRX2883568	SRP108619	SAMN07160187	65,037,704	90%	25%
SRR5645473	SRX2883571	SRP108619	SAMN07160188	52,932,274	89%	24%
SRR5645474	SRX2883570	SRP108619	SAMN07160189	57,455,300	87%	25%
SRR5645471	SRX2883573	SRP108619	SAMN07160190	63,374,060	85%	24%
SRR5645472	SRX2883572	SRP108619	SAMN07160191	82,306,692	90%	25%
SRR5645469	SRX2883575	SRP108619	SAMN07160192	61,907,874	85%	24%
SRR5645470	SRX2883574	SRP108619	SAMN07160193	60,386,940	87%	24%
SRR5645467	SRX2883577	SRP108619	SAMN07160194	73,102,600	88%	24%
SRR5645468	SRX2883576	SRP108619	SAMN07160195	76,009,330	91%	25%
SRR5645481	SRX2883563	SRP108619	SAMN07160196	86,267,696	89%	24%
SRR5645482	SRX2883562	SRP108619	SAMN07160197	72,758,410	89%	24%
SRR5645479	SRX2883565	SRP108619	SAMN07160198	58,910,798	88%	24%
SRR5645480	SRX2883564	SRP108619	SAMN07160199	74,785,482	89%	25%
SRR5645477	SRX2883567	SRP108619	SAMN07160200	57,154,786	86%	24%
SRR5645478	SRX2883566	SRP108619	SAMN07160201	56,772,856	87%	24%
SRR5645588	SRX2883680	SRP108619	SAMN07194302	68,155,908	89%	24%
SRR5645587	SRX2883681	SRP108619	SAMN07194303	68,182,452	91%	24%
SRR5645590	SRX2883678	SRP108619	SAMN07194304	70,838,118	76%	24%
SRR5645589	SRX2883679	SRP108619	SAMN07194305	66,037,212	89%	24%
SRR5645584	SRX2883684	SRP108619	SAMN07194306	61,306,316	88%	24%
SRR5645583	SRX2883685	SRP108619	SAMN07194307	57,857,824	89%	24%
SRR5645586	SRX2883682	SRP108619	SAMN07194308	60,621,962	88%	24%
SRR5645585	SRX2883683	SRP108619	SAMN07194309	63,492,974	88%	24%
SRR5648395	SRX2885737	SRP108689	SAMN07194882	55,412,844	84%	26%
SRR5648394	SRX2885738	SRP108689	SAMN07194883	30,677,798	79%	32%
SRR5648393	SRX2885739	SRP108689	SAMN07194884	35,651,002	81%	26%
SRR5648392	SRX2885740	SRP108689	SAMN07194885	52,176,786	89%	24%
SRR5648399	SRX2885733	SRP108689	SAMN07194886	13,018,072	64%	41%
SRR5648398	SRX2885734	SRP108689	SAMN07194887	50,494,816	85%	26%
SRR5648397	SRX2885735	SRP108689	SAMN07194888	19,904,076	83%	42%
SRR5648396	SRX2885736	SRP108689	SAMN07194889	77,365,556	89%	23%
SRR5648391	SRX2885741	SRP108689	SAMN07194890	63,709,902	91%	22%
SRR5648390	SRX2885742	SRP108689	SAMN07194891	71,985,516	91%	23%
SRR5648382	SRX2885750	SRP108689	SAMN07194892	78,892,548	89%	21%
SRR5648383	SRX2885748	SRP108689	SAMN07194893	55,737,098	82%	38%
SRR5648385	SRX2885747	SRP108689	SAMN07194894	50,012,362	87%	23%
SRR5648384	SRX2885749	SRP108689	SAMN07194895	63,000,362	90%	27%
SRR5648387	SRX2885745	SRP108689	SAMN07194896	41,364,578	83%	20%
SRR5648386	SRX2885746	SRP108689	SAMN07194897	44,684,622	95%	31%
SRR5648389	SRX2885743	SRP108689	SAMN07194898	72,883,196	89%	26%
SRR5648388	SRX2885744	SRP108689	SAMN07194899	48,686,638	93%	27%
SRR5648381	SRX2885751	SRP108689	SAMN07194900	51,584,984	94%	26%
SRR5648380	SRX2885752	SRP108689	SAMN07194901	80,500,710	89%	24%
SRR5648376	SRX2885756	SRP108689	SAMN07194902	38,709,344	92%	23%
SRR5648377	SRX2885755	SRP108689	SAMN07194903	50,135,170	86%	24%
SRR5648378	SRX2885754	SRP108689	SAMN07194904	54,139,128	72%	20%
SRR5648379	SRX2885753	SRP108689	SAMN07194905	39,096,012	89%	22%
SRR5648372	SRX2885760	SRP108689	SAMN07194906	31,547,096	81%	15%
SRR5648373	SRX2885759	SRP108689	SAMN07194907	51,800,482	89%	24%
SRR5648374	SRX2885758	SRP108689	SAMN07194908	33,550,898	79%	28%
SRR5648375	SRX2885757	SRP108689	SAMN07194909	30,928,082	80%	26%
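
The by-run and by-sample tables describe the same reads at two levels of grouping, so read counts summed over the runs of a sample should match that sample's row. The Python sketch below checks this for a few rows copied by hand from the two tables. It is only an illustration: extending it to all rows would require loading the full tables, and the variable names are chosen for this example.

```python
from collections import defaultdict

# (run, sample, number of reads) for a subset of the by-run table.
runs = [
    ("SRR2140755", "SAMN03923016", 47_524_396),
    ("SRR2140756", "SAMN03923016", 41_144_508),
    ("SRR2140598", "SAMN03944580", 46_040_414),
    ("SRR2135042", "SAMN03944580", 36_338_844),
    ("SRR4897234", "SAMN03284650", 54_417_074),
]

# Number of reads for the same samples in the by-sample table.
sample_reads = {
    "SAMN03923016": 88_668_904,
    "SAMN03944580": 82_379_258,
    "SAMN03284650": 54_417_074,
}

# Sum reads over all runs that belong to each sample.
totals = defaultdict(int)
for _run, sample, reads in runs:
    totals[sample] += reads

# Collect samples whose summed run counts disagree with the sample row.
mismatches = {
    sample: (totals[sample], expected)
    for sample, expected in sample_reads.items()
    if totals[sample] != expected
}
print("mismatches:", mismatches)
```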

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	48,148	42,005 (87.24%)	42,005 (87.24%)	66.41%	70.95%
Fabaceae GenBank	41,640	38,754 (93.07%)	38,754 (93.07%)	73.90%	85.93%
Fabaceae known RefSeq (NP_)	8,063	7,939 (98.46%)	7,939 (98.46%)	74.70%	85.66%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
