NCBI Juglans regia Annotation Release 100

The RefSeq genome records for Juglans regia were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Juglans regia Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Oct 28 2016
Date of submission of annotation to the public databases: Nov 4 2016
Software version: 7.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
wgs.5d	GCF_001411555.1	Johns Hopkins University	10-22-2015	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	wgs.5d
Genes and pseudogenes	43,323
protein-coding	36,861
non-coding	4,327
pseudogenes	2,135
genes with variants	10,507
mRNAs	55,627
fully-supported	48,276
with > 5% ab initio	5,891
partial	2,915
with filled gap(s)	1
known RefSeq (NM_)	0
model RefSeq (XM_)	55,627
Other RNAs	7,622
fully-supported	7,050
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	7,050
CDSs	55,627
fully-supported	48,276
with > 5% ab initio	6,031
partial	2,915
with major correction(s)	155
known RefSeq (NP_)	0
model RefSeq (XP_)	55,627
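
As a quick consistency check on the table above, the three gene categories should add up to the "Genes and pseudogenes" total, and every mRNA should have one CDS. The sketch below uses only numbers copied from the table; the variable and function names are illustrative and are not part of the pipeline.

```python
# Counts copied from the "Feature counts" table for assembly wgs.5d.
gene_counts = {
    "protein-coding": 36_861,
    "non-coding": 4_327,
    "pseudogenes": 2_135,
}
total_genes_and_pseudogenes = 43_323
mrna_count = 55_627
cds_count = 55_627


def categories_sum_to_total(parts, total):
    """Return True when the category counts add up to the reported total."""
    return sum(parts.values()) == total


print(categories_sum_to_total(gene_counts, total_genes_and_pseudogenes))  # True
print(mrna_count == cds_count)  # True
```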


Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	41,188	4,641	2,742	68	149,019
All transcripts	63,249	1,706	1,477	68	17,222
mRNA	55,627	1,785	1,548	120	17,222
misc_RNA	2,527	2,053	1,761	170	8,666
tRNA	572	74	73	68	84
lncRNA	4,523	752	544	74	9,280
Single-exon transcripts	6,749	1,127	919	120	5,055
coding transcripts (NM_/XM_ )	6,749	1,127	919	120	5,055
CDSs	55,627	1,336	1,095	120	16,896
Exons	230,261	314	168	1	8,422
in coding transcripts (NM_/XM_ )	214,624	317	169	1	7,992
in non-coding transcripts (NR_/XR_ )	23,942	253	136	2	8,422
Introns	181,419	835	258	30	97,509
in coding transcripts (NM_/XM_ )	171,113	810	257	30	97,509
in non-coding transcripts (NR_/XR_ )	18,174	1,047	275	30	63,084

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.54	1	1	23
Number of exons per transcript	6	4	1	79
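
The mean number of transcripts per gene can be reproduced from the feature lengths table, assuming it is the "All transcripts" count divided by the "Genes" count (which excludes pseudogenes). That reading is an assumption that happens to match the reported value, not a documented formula.

```python
# Counts copied from the "Feature lengths" table.
all_transcripts = 63_249  # "All transcripts" row
genes = 41_188            # "Genes" row (protein-coding + non-coding)

mean_transcripts_per_gene = all_transcripts / genes
print(f"{mean_transcripts_per_gene:.2f}")  # 1.54
```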

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 36,861 coding genes, 33,584 genes had a protein with an alignment covering 50% or more of the query and 15,371 had an alignment covering 95% or more of the query.
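
Expressed as shares of all coding genes, the counts in this paragraph work out as follows. This is plain arithmetic on the quoted numbers; the helper name `share` is illustrative.

```python
# Gene counts quoted in the paragraph above.
coding_genes = 36_861
genes_cov_50 = 33_584  # query coverage of 50% or more
genes_cov_95 = 15_371  # query coverage of 95% or more


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(share(genes_cov_50, coding_genes))  # 91.1
print(share(genes_cov_95, coding_genes))  # 41.7
```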

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
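
The two definitions above are the same ratio applied to different sequences. A minimal sketch follows, assuming the aligned span on each sequence is already known. The lengths used here are made up for illustration and do not come from this annotation run.

```python
def coverage_percent(aligned_length, sequence_length):
    """Percentage of a sequence's length that is included in the alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: a 400-residue annotated protein (query) aligned to a
# 500-residue Arabidopsis protein (target).
query_coverage = coverage_percent(aligned_length=380, sequence_length=400)
target_coverage = coverage_percent(aligned_length=390, sequence_length=500)

print(query_coverage)   # 95.0
print(target_coverage)  # 78.0
```

In this made-up case the gene would count toward both the 50% and the 95% query coverage thresholds, while its target coverage stays below 95%.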

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

[Figure: cumulative number of genes by coverage threshold. Query: annotated proteins. Target: Arabidopsis thaliana known RefSeq proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
wgs.5d	GCF_001411555.1	3.92%	31.70%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	51	50 (98.04%)	44 (86.27%)	99.25%	96.90%
Same-species EST	5,253	4,982 (94.84%)	4,822 (91.80%)	99.34%	98.63%
Fagales GenBank	1,678	1,142 (68.06%)	498 (29.68%)	90.97%	88.28%
Fagales TSA	308,533	132,617 (42.98%)	199 (0.06%)	95.67%	92.15%
Fagales EST	287,600	67,069 (23.32%)	38,148 (13.26%)	91.28%	97.04%
Arabidopsis thaliana known RefSeq (NM_/NR_)	53,793	8,117 (15.09%)	71 (0.13%)	87.94%	70.46%
Arabidopsis thaliana GenBank	145,111	13,033 (8.98%)	149 (0.10%)	86.81%	71.06%
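
The percentages in the table above appear to be relative to the number of sequences retrieved from Entrez. The sketch below parses two cells from the same-species EST row and recomputes both percentages. The function name `parse_count_cell` is hypothetical, and the cell format is inferred from this table only.

```python
import re


def parse_count_cell(cell):
    """Split a cell such as '4,982 (94.84%)' into (count, percent)."""
    match = re.fullmatch(r"([\d,]+) \(([\d.]+)%\)", cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    percent = float(match.group(2))
    return count, percent


# Same-species EST row of the transcript alignments table.
retrieved = 5_253
aligned, aligned_pct = parse_count_cell("4,982 (94.84%)")
passed, passed_pct = parse_count_cell("4,822 (91.80%)")

print(f"{100 * aligned / retrieved:.2f}")  # 94.84
print(f"{100 * passed / retrieved:.2f}")   # 91.80
```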

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,923,037,223	86%	19%	279,758
SAMN02484535	vegetative bud (Juglans regia, SAMN02484535)	41,466,970	74%	15%	155,563
SAMN02484536	leaf (Juglans regia, SAMN02484536)	43,732,392	93%	18%	159,670
SAMN02484537	root (Juglans regia, SAMN02484537)	39,839,020	90%	15%	149,349
SAMN02484538	callus interior (Juglans regia, SAMN02484538)	60,902,108	93%	18%	157,120
SAMN02484539	callus exterior (Juglans regia, SAMN02484539)	30,577,642	84%	23%	149,804
SAMN02484541	catkins (Juglans regia, SAMN02484541)	57,807,019	90%	18%	162,983
SAMN02484542	somatic embryo (Juglans regia, SAMN02484542)	28,357,550	92%	18%	151,288
SAMN02484543	leaf (Juglans regia, SAMN02484543)	53,333,298	87%	15%	146,392
SAMN02484544	leaf (Juglans regia, SAMN02484544)	61,822,738	94%	19%	165,039
SAMN02484545	fruit (Juglans regia, SAMN02484545)	58,683,826	88%	18%	162,354
SAMN02484546	hull (Juglans regia, SAMN02484546)	57,673,738	85%	16%	152,947
SAMN02484547	packing tissue (Juglans regia, SAMN02484547)	64,903,726	94%	19%	164,168
SAMN02484548	hull peel (Juglans regia, SAMN02484548)	44,029,546	88%	17%	147,541
SAMN02484549	hull cortex (Juglans regia, SAMN02484549)	65,446,528	87%	19%	143,970
SAMN02484550	packing tissue (Juglans regia, SAMN02484550)	59,283,694	88%	16%	154,536
SAMN02484551	pellicle (Juglans regia, SAMN02484551)	43,392,812	65%	12%	146,048
SAMN02484552	embryo (Juglans regia, SAMN02484552)	37,247,600	83%	11%	123,639
SAMN02484553	hull (Juglans regia, SAMN02484553)	61,885,858	83%	16%	145,141
SAMN02484554	wood transition zone (Juglans nigra, SAMN02484554)	49,304,672	90%	16%	137,889
SAMN02604460	mature leaf from tree, wild-type control (Juglans regia, SAMN02604460)	39,610,730	92%	23%	142,750
SAMN02604461	mature leaf from tree, wild-type control (Juglans regia, SAMN02604461)	31,182,572	84%	19%	137,294
SAMN02604462	mature leaf from tree, wild-type control (Juglans regia, SAMN02604462)	39,057,634	95%	23%	133,973
SAMN02604463	mature leaf from tree, transgenic, PPO-silenced (Juglans regia, SAMN02604463)	49,165,068	92%	21%	156,971
SAMN02604464	mature leaf from tree, transgenic, PPO-silenced (Juglans regia, SAMN02604464)	37,183,968	94%	22%	139,197
SAMN02604465	mature leaf from tree, transgenic, PPO-silenced (Juglans regia, SAMN02604465)	39,192,690	95%	23%	144,788
SAMN03288031	Leaf tissue, 5 years (Juglans regia, SAMN03288031)	60,634,104	93%	29%	161,894
SAMN03289065	leaf (Juglans nigra, 2 years old, SAMN03289065)	24,467,258	78%	19%	153,208
SAMN03289066	leaf (Juglans nigra, 2 years old, SAMN03289066)	20,630,474	70%	16%	147,537
SAMN03289067	leaf (Juglans nigra, 2 years old, SAMN03289067)	25,948,114	84%	21%	160,729
SAMN03289068	leaf (Juglans nigra, 2 years old, SAMN03289068)	20,958,148	86%	21%	150,149
SAMN03289069	leaf (Juglans nigra, 2 years old, SAMN03289069)	497,990	90%	35%	61,906
SAMN03289070	leaf (Juglans nigra, 2 years old, SAMN03289070)	383,622	84%	33%	49,828
SAMN03289071	leaf (Juglans nigra, 2 years old, SAMN03289071)	477,680	90%	36%	58,567
SAMN03289072	leaf (Juglans nigra, 2 years old, SAMN03289072)	528,592	89%	34%	66,012
SAMN03289073	leaf (Juglans nigra, 2 years old, SAMN03289073)	547,984	88%	35%	64,248
SAMN03289074	leaf (Juglans nigra, 2 years old, SAMN03289074)	557,356	89%	35%	65,842
SAMN03289075	leaf (Juglans nigra, 2 years old, SAMN03289075)	452,432	89%	33%	63,101
SAMN03289076	leaf (Juglans nigra, 2 years old, SAMN03289076)	457,592	80%	31%	59,650
SAMN03289077	leaf (Juglans nigra, 2 years old, SAMN03289077)	415,176	88%	34%	55,042
SAMN03289078	leaf (Juglans nigra, 2 years old, SAMN03289078)	450,004	87%	34%	58,338
SAMN03289079	leaf (Juglans nigra, 2 years old, SAMN03289079)	440,366	88%	35%	56,148
SAMN03289080	leaf (Juglans nigra, 2 years old, SAMN03289080)	445,258	90%	35%	57,465
SAMN03289081	dormant twigs (Juglans nigra, adult, SAMN03289081)	22,988,476	80%	19%	146,842
SAMN03289082	undamaged leaves (Juglans nigra, adult, SAMN03289082)	24,268,850	78%	19%	151,293
SAMN03289083	damaged leaves (Juglans nigra, adult, SAMN03289083)	24,648,664	91%	23%	160,066
SAMN03289084	undamaged twigs (Juglans nigra, adult, SAMN03289084)	24,103,806	79%	21%	155,779
SAMN03289085	damaged twigs (Juglans nigra, adult, SAMN03289085)	23,671,254	76%	19%	152,366
SAMN03289086	green twigs/buds-this year growth (Juglans nigra, adult, SAMN03289086)	23,677,624	82%	21%	165,563
SAMN03289087	catkins (male flowers- very mature) (Juglans nigra, adult, SAMN03289087)	23,998,160	61%	12%	131,000
SAMN03289088	dormant twigs (Juglans nigra, adult, SAMN03289088)	21,830,962	76%	18%	142,761
SAMN03289089	undamaged leaves (Juglans nigra, adult, SAMN03289089)	26,813,586	81%	19%	151,880
SAMN03289090	damaged leaves (Juglans nigra, adult, SAMN03289090)	14,933,690	79%	18%	134,879
SAMN03289091	undamaged twigs (Juglans nigra, adult, SAMN03289091)	24,031,272	54%	13%	140,949
SAMN03289092	damaged twigs (Juglans nigra, adult, SAMN03289092)	21,512,558	82%	21%	156,968
SAMN03289093	green twigs/buds-this year growth (Juglans nigra, adult, SAMN03289093)	32,473,390	83%	21%	170,932
SAMN03289094	catkins (male flowers-immature/not shed pollen yet) (Juglans nigra, adult, SAMN03289094)	26,794,996	78%	20%	156,762
SAMN03289095	female flowers (Juglans nigra, adult, SAMN03289095)	26,710,522	89%	24%	168,981
SAMN03294216	leaf, bud, female flower, and male flowers (Juglans mandshurica, SAMN03294216)	100,819,504	91%	23%	178,720
SAMN03329630	leaves (Juglans sigillata, 3 years old, SAMN03329630)	45,236,680	91%	27%	168,501
SAMN03329632	Leaf tissue (Juglans cathayensis, 15 years old, not applicable, SAMN03329632)	61,147,680	91%	27%	167,631

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR1067931	SRX403874	SRP034866	SAMN02484535	41,466,970	74%	15%
SRR1067933	SRX403877	SRP034866	SAMN02484536	43,732,392	93%	18%
SRR1067935	SRX404094	SRP034866	SAMN02484537	39,839,020	90%	15%
SRR1067938	SRX404099	SRP034866	SAMN02484538	60,902,108	93%	18%
SRR1067941	SRX404102	SRP034866	SAMN02484539	30,577,642	84%	23%
SRR1068161	SRX404315	SRP034866	SAMN02484541	39,254,844	92%	16%
SRR1106640	SRX426484	SRP034866	SAMN02484541	18,552,175	86%	22%
SRR1068163	SRX404317	SRP034866	SAMN02484542	28,357,550	92%	18%
SRR1068165	SRX404319	SRP034866	SAMN02484543	53,333,298	87%	15%
SRR1068166	SRX404320	SRP034866	SAMN02484544	61,822,738	94%	19%
SRR1068167	SRX404321	SRP034866	SAMN02484545	58,683,826	88%	18%
SRR1068168	SRX404322	SRP034866	SAMN02484546	57,673,738	85%	16%
SRR1068169	SRX404323	SRP034866	SAMN02484547	64,903,726	94%	19%
SRR1068170	SRX404324	SRP034866	SAMN02484548	44,029,546	88%	17%
SRR1068171	SRX404325	SRP034866	SAMN02484549	65,446,528	87%	19%
SRR1068172	SRX404326	SRP034866	SAMN02484550	59,283,694	88%	16%
SRR1068173	SRX404327	SRP034866	SAMN02484551	43,392,812	65%	12%
SRR1068174	SRX404328	SRP034866	SAMN02484552	37,247,600	83%	11%
SRR1068175	SRX404329	SRP034866	SAMN02484553	61,885,858	83%	16%
SRR1068178	SRX404331	SRP034866	SAMN02484554	49,304,672	90%	16%
SRR1151448	SRX456597	SRP036081	SAMN02604460	39,610,730	92%	23%
SRR1151610	SRX456599	SRP036081	SAMN02604461	31,182,572	84%	19%
SRR1151611	SRX456600	SRP036081	SAMN02604462	39,057,634	95%	23%
SRR1151613	SRX456601	SRP036081	SAMN02604463	49,165,068	92%	21%
SRR1151614	SRX456602	SRP036081	SAMN02604464	37,183,968	94%	22%
SRR1151615	SRX456603	SRP036081	SAMN02604465	39,192,690	95%	23%
SRR1767236	SRX894150	SRP052610	SAMN03288031	60,634,104	93%	29%
SRR1767234	SRX894152	SRP052610	SAMN03329630	45,236,680	91%	27%
SRR1767237	SRX894151	SRP052610	SAMN03329632	61,147,680	91%	27%
SRR1779028	SRX858651	SRP052966	SAMN03289065	1,397,450	72%	23%
SRR1779059	SRX858682	SRP052966	SAMN03289065	23,069,808	79%	18%
SRR1779029	SRX858652	SRP052966	SAMN03289066	1,398,564	65%	20%
SRR1779060	SRX858683	SRP052966	SAMN03289066	19,231,910	70%	16%
SRR1779030	SRX858653	SRP052966	SAMN03289067	1,432,400	77%	26%
SRR1779061	SRX858684	SRP052966	SAMN03289067	24,515,714	85%	21%
SRR1779031	SRX858654	SRP052966	SAMN03289068	1,421,978	79%	25%
SRR1779062	SRX858685	SRP052966	SAMN03289068	19,536,170	86%	20%
SRR1779032	SRX858655	SRP052966	SAMN03289069	497,990	90%	35%
SRR1779033	SRX858656	SRP052966	SAMN03289070	383,622	84%	33%
SRR1779034	SRX858657	SRP052966	SAMN03289071	477,680	90%	36%
SRR1779035	SRX858658	SRP052966	SAMN03289072	528,592	89%	34%
SRR1779036	SRX858659	SRP052966	SAMN03289073	547,984	88%	35%
SRR1779037	SRX858660	SRP052966	SAMN03289074	557,356	89%	35%
SRR1779038	SRX858661	SRP052966	SAMN03289075	452,432	89%	33%
SRR1779039	SRX858662	SRP052966	SAMN03289076	457,592	80%	31%
SRR1779040	SRX858663	SRP052966	SAMN03289077	415,176	88%	34%
SRR1779041	SRX858664	SRP052966	SAMN03289078	450,004	87%	34%
SRR1779042	SRX858665	SRP052966	SAMN03289079	440,366	88%	35%
SRR1779043	SRX858666	SRP052966	SAMN03289080	445,258	90%	35%
SRR1779044	SRX858667	SRP052966	SAMN03289081	1,518,662	73%	23%
SRR1779063	SRX858686	SRP052966	SAMN03289081	21,469,814	81%	19%
SRR1779045	SRX858668	SRP052966	SAMN03289082	1,338,154	72%	23%
SRR1779064	SRX858687	SRP052966	SAMN03289082	22,930,696	78%	18%
SRR1779046	SRX858669	SRP052966	SAMN03289083	1,061,060	85%	29%
SRR1779065	SRX858688	SRP052966	SAMN03289083	23,587,604	91%	23%
SRR1779047	SRX858670	SRP052966	SAMN03289084	1,472,200	73%	26%
SRR1779066	SRX858689	SRP052966	SAMN03289084	22,631,606	79%	21%
SRR1779048	SRX858671	SRP052966	SAMN03289085	1,299,978	70%	23%
SRR1779067	SRX858690	SRP052966	SAMN03289085	22,371,276	76%	19%
SRR1779049	SRX858672	SRP052966	SAMN03289086	1,480,982	77%	26%
SRR1779068	SRX858691	SRP052966	SAMN03289086	22,196,642	82%	21%
SRR1779050	SRX858673	SRP052966	SAMN03289087	1,365,356	55%	15%
SRR1779069	SRX858692	SRP052966	SAMN03289087	22,632,804	62%	12%
SRR1779051	SRX858674	SRP052966	SAMN03289088	1,396,126	70%	22%
SRR1779070	SRX858693	SRP052966	SAMN03289088	20,434,836	77%	18%
SRR1779052	SRX858675	SRP052966	SAMN03289089	1,476,030	72%	23%
SRR1779071	SRX858694	SRP052966	SAMN03289089	25,337,556	81%	19%
SRR1779053	SRX858676	SRP052966	SAMN03289090	737,584	67%	21%
SRR1779072	SRX858695	SRP052966	SAMN03289090	14,196,106	80%	18%
SRR1779054	SRX858677	SRP052966	SAMN03289091	1,350,252	51%	16%
SRR1779073	SRX858696	SRP052966	SAMN03289091	22,681,020	54%	12%
SRR1779055	SRX858678	SRP052966	SAMN03289092	1,351,342	75%	26%
SRR1779074	SRX858697	SRP052966	SAMN03289092	20,161,216	82%	21%
SRR1779056	SRX858679	SRP052966	SAMN03289093	1,760,058	77%	26%
SRR1779075	SRX858698	SRP052966	SAMN03289093	30,713,332	83%	21%
SRR1779057	SRX858680	SRP052966	SAMN03289094	1,439,790	73%	24%
SRR1779076	SRX858699	SRP052966	SAMN03289094	25,355,206	79%	19%
SRR1779058	SRX858681	SRP052966	SAMN03289095	1,443,134	81%	30%
SRR1779077	SRX858700	SRP052966	SAMN03289095	25,267,388	89%	24%
SRR2537331	SRX1295882	SRP064333	SAMN03294216	100,819,504	91%	23%
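
Several samples are covered by more than one run, and the per-sample read counts are the sums of their runs. The sketch below shows this for two samples, using rows copied from the by-run table. The read-weighted alignment percentage is only an estimate, because the per-run percentages are already rounded and can differ by a point from the value in the per-sample table. All names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned), copied from the by-run table.
runs = [
    ("SRR1068161", "SAMN02484541", 39_254_844, 92),
    ("SRR1106640", "SAMN02484541", 18_552_175, 86),
    ("SRR1779028", "SAMN03289065", 1_397_450, 72),
    ("SRR1779059", "SAMN03289065", 23_069_808, 79),
]


def aggregate_by_sample(rows):
    """Sum reads per sample and estimate a read-weighted percent aligned."""
    totals = defaultdict(lambda: [0, 0.0])  # sample -> [reads, aligned reads]
    for _run, sample, reads, pct_aligned in rows:
        totals[sample][0] += reads
        totals[sample][1] += reads * pct_aligned / 100
    return {
        sample: (reads, round(100 * aligned / reads))
        for sample, (reads, aligned) in totals.items()
    }


per_sample = aggregate_by_sample(runs)
print(per_sample["SAMN02484541"][0])  # 57807019
print(per_sample["SAMN03289065"][0])  # 24467258
```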

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Prunus mume high-quality model RefSeq (XP_)	12,164	11,831 (97.26%)	11,831 (97.26%)	70.34%	78.62%
Cucumis sativus high-quality model RefSeq (XP_)	11,600	11,399 (98.27%)	11,399 (98.27%)	70.05%	77.25%
Arabidopsis thaliana GenBank	53,420	49,407 (92.49%)	49,407 (92.49%)	69.80%	75.09%
Arabidopsis thaliana known RefSeq (NP_)	48,113	42,386 (88.10%)	42,386 (88.10%)	67.76%	70.94%
Same-species GenBank	49	49 (100.00%)	49 (100.00%)	78.00%	86.65%
Fragaria vesca high-quality model RefSeq (XP_)	13,116	12,767 (97.34%)	12,767 (97.34%)	69.46%	77.24%
Populus euphratica high-quality model RefSeq (XP_)	18,422	18,024 (97.84%)	18,024 (97.84%)	70.35%	78.33%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20