NCBI Nilaparvata lugens Annotation Release 100

The RefSeq genome records for Nilaparvata lugens were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Nilaparvata lugens Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Aug 15 2017
Date of submission of annotation to the public databases: Aug 17 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
NilLug1.0	GCF_000757685.1	Nilaparvata lugens Genome Consortium	09-24-2014	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	NilLug1.0
Genes and pseudogenes	21,442
  protein-coding	19,806
  non-coding	1,386
  pseudogenes	250
  genes with variants	2,734
mRNAs	24,306
  fully-supported	17,011
  with > 5% ab initio	4,187
  partial	5,427
  with filled gap(s)	2,855
  known RefSeq (NM_)	0
  model RefSeq (XM_)	24,306
Other RNAs	1,773
  fully-supported	1,327
  with > 5% ab initio	0
  partial	6
  with filled gap(s)	6
  known RefSeq (NR_)	0
  model RefSeq (XR_)	1,328
CDSs	24,306
  fully-supported	17,011
  with > 5% ab initio	4,628
  partial	5,205
  with major correction(s)	1,284
  known RefSeq (NP_)	0
  model RefSeq (XP_)	24,306
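
The counts in the table above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, with the values copied by hand from the table; the variable names are illustrative and not part of any NCBI tool.

```python
# Values transcribed by hand from the Feature counts table for NilLug1.0.
gene_counts = {
    "protein-coding": 19_806,
    "non-coding": 1_386,
    "pseudogenes": 250,
}
total_genes_and_pseudogenes = 21_442

mrna_count = 24_306
other_rna_count = 1_773

# The three gene categories should add up to the reported total.
category_sum = sum(gene_counts.values())
print(category_sum == total_genes_and_pseudogenes)  # True

# mRNAs plus other RNAs give the total number of annotated transcripts.
all_transcripts = mrna_count + other_rna_count
print(all_transcripts)  # 26079
```

The transcript total of 26,079 agrees with the "All transcripts" row of the Feature lengths table.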

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	21,192	22,015	11,023	71	563,633
All transcripts	26,079	1,972	1,516	71	47,093
  mRNA	24,306	2,065	1,600	129	47,093
  misc_RNA	253	1,917	1,467	166	8,578
  tRNA	445	74	73	71	133
  lncRNA	1,075	668	464	84	8,311
Single-exon transcripts	2,179	1,233	932	176	9,192
  coding transcripts (NM_/XM_)	2,177	1,233	932	176	9,192
  non-coding transcripts (NR_/XR_)	2	1,236	1,991	480	1,991
CDSs	24,306	1,440	1,071	108	47,039
Exons	136,369	289	169	1	11,186
  in coding transcripts (NM_/XM_)	133,024	290	169	1	11,186
  in non-coding transcripts (NR_/XR_)	4,277	235	138	2	8,025
Introns	114,292	3,958	1,649	28	375,599
  in coding transcripts (NM_/XM_)	112,004	3,939	1,647	28	375,599
  in non-coding transcripts (NR_/XR_)	3,129	4,639	1,741	31	197,430

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.24	1	1	32
Number of exons per transcript	6.87	5	1	138

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Drosophila melanogaster known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,806 coding genes, 13,101 genes had a protein with an alignment covering 50% or more of the query and 3,468 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
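
The two coverage definitions can be illustrated with a short Python sketch. The function name and the example alignment lengths are hypothetical, chosen only for illustration, and are not taken from the pipeline. The last lines express the gene counts reported above as shares of the 19,806 coding genes.

```python
def coverage_percent(aligned_length: int, sequence_length: int) -> float:
    """Percentage of a sequence's length that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: 300 residues of a 400-residue annotated protein
# (the query) aligned to 300 residues of a 500-residue target protein.
query_cov = coverage_percent(300, 400)
target_cov = coverage_percent(300, 500)
print(query_cov, target_cov)  # 75.0 60.0

# Shares of coding genes passing each query-coverage threshold,
# using the counts reported in this section.
coding_genes = 19_806
share_50 = round(100 * 13_101 / coding_genes, 1)
share_95 = round(100 * 3_468 / coding_genes, 1)
print(share_50, share_95)  # 66.1 17.5
```

In this example the same alignment gives a higher query coverage than target coverage because the query is the shorter sequence.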

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Drosophila melanogaster known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
NilLug1.0	GCF_000757685.1	3.26%	37.74%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	1,777	1,399 (78.73%)	1,309 (73.66%)	99.01%	92.52%
Same-species EST	118,054	77,653 (65.78%)	68,567 (58.08%)	98.97%	99.19%
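
The percentages in the table appear to be relative to the number of sequences retrieved from Entrez. The sketch below, in Python, parses one "count (percent)" cell and recomputes the percentage under that assumption; the helper name `parse_count_cell` is hypothetical.

```python
import re


def parse_count_cell(cell: str) -> tuple[int, float]:
    """Split a cell such as '1,399 (78.73%)' into (count, percent)."""
    match = re.fullmatch(r"\s*([\d,]+)\s*\(([\d.]+)%\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    percent = float(match.group(2))
    return count, percent


# Same-species GenBank row: 1,777 retrieved, 1,399 aligned by Splign.
retrieved = 1_777
aligned, reported_pct = parse_count_cell("1,399 (78.73%)")

# Assumption: the denominator is the number of sequences retrieved.
recomputed_pct = round(100 * aligned / retrieved, 2)
print(aligned, reported_pct, recomputed_pct)  # 1399 78.73 78.73
```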

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	860,612,460	73%	30%	143,601
SAMD00012191	NA	C89i inbred strain of Nilaparvata lugens (Nilaparvata lugens, SAMD00012191)	35,609,384	78%	21%	113,780
SAMD00012192	NA	I87i inbred strain of Nilaparvata lugens (Nilaparvata lugens, SAMD00012192)	59,791,824	79%	19%	120,093
SAMN00010597	NA	whole body (Nilaparvata lugens, SAMN00010597)	157,918	25%	28%	3,513
SAMN00672640	NA	the midgut of Nilaparvata lugens (Nilaparvata lugens, SAMN00672640)	323,957	8%	30%	2,235
SAMN02046809	24244529	Salivary glands (Nilaparvata lugens, Adult, Female, SAMN02046809)	38,487,746	77%	21%	108,205
SAMN02046810	24244529	Salivary glands (Nilaparvata lugens, Adult, Female, SAMN02046810)	40,350,780	76%	23%	110,035
SAMN02047185	28127058,25378627	Generic sample from Nilaparvata lugens (Nilaparvata lugens, SAMN02047185)	21,700,114	78%	35%	98,509
SAMN02189702	150322	antenna (Nilaparvata lugens, SAMN02189702)	19,599,936	73%	22%	105,704
SAMN02260306	NA	General Sample for Nilaparvata lugens (Nilaparvata lugens, SAMN02260306)	51,243,310	38%	19%	81,973
SAMN02363647	NA	Fat Bodies from Two Populations (Nilaparvata lugens, SAMN02363647)	78,148,658	76%	19%	116,157
SAMN02725018	NA	salivary glands (Nilaparvata lugens, since 2009, female, SAMN02725018)	20,000,000	140%	46%	66,382
SAMN03018653	NA	wing pads (Nilaparvata lugens, third-instar nymphs, not determined, SAMN03018653)	116,744,580	74%	18%	120,475
SAMN05559867	NA	whole body (Nilaparvata lugens, SAMN05559867)	7,346,939	74%	6%	50,977
SAMN05559893	NA	whole body (Nilaparvata lugens, SAMN05559893)	7,374,301	75%	7%	50,946
SAMN05559896	NA	whole body (Nilaparvata lugens, SAMN05559896)	7,371,897	77%	7%	53,397
SAMN07191730	NA	salivary glands (Nilaparvata lugens, Adult, female, SAMN07191730)	106,460,024	76%	44%	86,694
SAMN07191769	NA	salivary glands (Nilaparvata lugens, Adult, female, SAMN07191769)	106,404,968	74%	47%	74,937
SAMN07191770	NA	salivary glands (Nilaparvata lugens, Adult, female, SAMN07191770)	40,000,000	70%	43%	86,779
SAMN07191771	NA	salivary glands (Nilaparvata lugens, Adult, female, SAMN07191771)	103,496,124	59%	40%	77,626

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR016172	DRX014541	DRP001792	SAMD00012191	35,609,384	78%	21%
DRR016171	DRX014540	DRP001792	SAMD00012192	59,791,824	79%	19%
SRR066630	SRX018177	SRP002162	SAMN00010597	157,918	25%	28%
SRR315362	SRX084684	SRP007536	SAMN00672640	7,041	7%	24%
SRR315369	SRX084684	SRP007536	SAMN00672640	155,034	7%	28%
SRR315370	SRX084684	SRP007536	SAMN00672640	161,882	9%	32%
SRR921622	SRX314876	SRP017801	SAMN02047185	21,700,114	78%	35%
SRR1537484	SRX276865	SRP022256	SAMN02046809	38,487,746	77%	21%
SRR1537486	SRX276866	SRP022256	SAMN02046810	40,350,780	76%	23%
SRR1187936	SRX326774	SRP028104	SAMN02260306	51,243,310	38%	19%
SRR1003049	SRX360412	SRP030465	SAMN02363647	39,764,938	77%	18%
SRR1002947	SRX360414	SRP030465	SAMN02363647	38,383,720	75%	20%
SRR871556	SRX290503	SRP035299	SAMN02189702	19,599,936	73%	22%
SRR1269581	SRX532392	SRP041639	SAMN02725018	20,000,000	140%	46%
SRR1573316	SRX698355	SRP046763	SAMN03018653	116,744,580	74%	18%
SRR4021819	SRX2012503	SRP081257	SAMN05559867	7,346,939	74%	6%
SRR4018985	SRX2012514	SRP081257	SAMN05559893	7,374,301	75%	7%
SRR4018986	SRX2012515	SRP081257	SAMN05559896	7,371,897	77%	7%
SRR5644065	SRX2882231	SRP108568	SAMN07191730	106,460,024	76%	44%
SRR5644109	SRX2882276	SRP108568	SAMN07191769	106,404,968	74%	47%
SRR5644110	SRX2882277	SRP108568	SAMN07191770	40,000,000	70%	43%
SRR5644112	SRX2882278	SRP108568	SAMN07191771	103,496,124	59%	40%
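
The by-sample read counts are the sums of the by-run counts for the same sample. As a minimal sketch in Python, the rows below are copied from the by-run table for the two samples that have more than one run; the variable names are illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table for the
# two samples that were sequenced in more than one run.
runs = [
    ("SRR315362", "SAMN00672640", 7_041),
    ("SRR315369", "SAMN00672640", 155_034),
    ("SRR315370", "SAMN00672640", 161_882),
    ("SRR1003049", "SAMN02363647", 39_764_938),
    ("SRR1002947", "SAMN02363647", 38_383_720),
]

# Sum the read counts of all runs that belong to the same sample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, reads in sorted(reads_per_sample.items()):
    print(sample, f"{reads:,}")
# SAMN00672640 323,957
# SAMN02363647 78,148,658
```

Both totals match the corresponding rows of the by-sample table.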

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	1,346	1,317 (97.85%)	1,317 (97.85%)	86.97%	84.69%
Insecta GenBank	89,537	63,262 (70.65%)	63,262 (70.65%)	67.85%	64.56%
Insecta known RefSeq (NP_)	38,451	25,867 (67.27%)	25,867 (67.27%)	66.42%	55.23%
Acyrthosiphon pisum high-quality model RefSeq (XP_)	11,047	7,498 (67.87%)	7,498 (67.87%)	62.47%	52.33%
Bemisia tabaci high-quality model RefSeq (XP_)	11,628	8,379 (72.06%)	8,379 (72.06%)	64.08%	54.71%
Cimex lectularius high-quality model RefSeq (XP_)	10,771	8,215 (76.27%)	8,215 (76.27%)	65.66%	59.00%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
