NCBI Hyalella azteca Annotation Release 100

The RefSeq genome records for Hyalella azteca were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Hyalella azteca Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Sep 13 2016
Date of submission of annotation to the public databases: Sep 16 2016
Software version: 7.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Hazt_2.0	GCF_000764305.1	Baylor College of Medicine	07-20-2016	Reference	unplaced scaffolds
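
As an illustration of how the assembly accession in the table above is structured, the following minimal sketch splits it into prefix, numeric identifier and version. The function name `parse_assembly_accession` and the returned dictionary layout are hypothetical and are not part of any NCBI tool; the only facts assumed are that `GCF_` denotes a RefSeq assembly and `GCA_` a GenBank assembly.

```python
import re

# Hypothetical helper, not an NCBI API: prefix, nine digits, a dot, then a version.
ACCESSION_RE = re.compile(r"^(GCF|GCA)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an assembly accession such as GCF_000764305.1 into its parts."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    # GCF_ marks a RefSeq assembly, GCA_ a GenBank assembly.
    source = "RefSeq" if prefix == "GCF" else "GenBank"
    return {"source": source, "number": number, "version": int(version)}


info = parse_assembly_accession("GCF_000764305.1")
print(info)
```

Keeping the numeric part as a string preserves its leading zeros, which matter when the accession is reassembled.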

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Hazt_2.0
Genes and pseudogenes	20,022
protein-coding	18,608
non-coding	1,328
pseudogenes	86
genes with variants	2,516
mRNAs	22,749
fully-supported	14,525
with > 5% ab initio	5,618
partial	2,004
with filled gap(s)	859
known RefSeq (NM_)	0
model RefSeq (XM_)	22,749
Other RNAs	1,618
fully-supported	1,258
with > 5% ab initio	0
partial	1
with filled gap(s)	1
known RefSeq (NR_)	0
model RefSeq (XR_)	1,258
CDSs	22,749
fully-supported	14,525
with > 5% ab initio	6,009
partial	1,883
with major correction(s)	279
known RefSeq (NP_)	0
model RefSeq (XP_)	22,749
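
The counts above can be cross-checked against each other. The sketch below is only an illustration: the dictionary keys and the function `check_feature_counts` are hypothetical names, and the relationships tested (gene categories summing to the total, known plus model records summing to the feature count, one CDS per mRNA) are assumptions inferred from the table layout rather than rules stated in this report.

```python
# Counts transcribed from the Hazt_2.0 column of the feature counts table.
counts = {
    "genes_and_pseudogenes": 20_022,
    "protein_coding": 18_608,
    "non_coding": 1_328,
    "pseudogenes": 86,
    "mrnas": 22_749,
    "known_nm": 0,
    "model_xm": 22_749,
    "cdss": 22_749,
    "known_np": 0,
    "model_xp": 22_749,
}


def check_feature_counts(c):
    """Return the names of the consistency checks that fail (empty if none)."""
    checks = {
        "gene categories sum to the gene total":
            c["protein_coding"] + c["non_coding"] + c["pseudogenes"]
            == c["genes_and_pseudogenes"],
        "NM_ plus XM_ equals the mRNA count":
            c["known_nm"] + c["model_xm"] == c["mrnas"],
        "NP_ plus XP_ equals the CDS count":
            c["known_np"] + c["model_xp"] == c["cdss"],
        "one CDS per mRNA":
            c["cdss"] == c["mrnas"],
    }
    return [name for name, ok in checks.items() if not ok]


failures = check_feature_counts(counts)
print(failures or "all checks passed")
```

The same kind of check does not apply directly to the Other RNAs rows: the NR_ and XR_ records add up to 1,258 rather than 1,618, and the difference equals the 360 tRNAs listed in the feature lengths table.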

Detailed reports

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	19,936	10,627	6,048	70	244,044
All transcripts	24,367	2,538	1,887	70	35,861
mRNA	22,749	2,648	1,989	177	35,861
misc_RNA	217	3,130	2,236	190	11,168
tRNA	360	74	73	70	88
lncRNA	1,041	873	493	73	9,330
Single-exon transcripts	2,527	1,128	921	206	9,059
coding transcripts (NM_/XM_)	2,527	1,128	921	206	9,059
CDSs	22,749	1,704	1,215	177	33,705
Exons	133,606	359	159	2	17,702
in coding transcripts (NM_/XM_)	130,316	360	159	2	17,702
in non-coding transcripts (NR_/XR_)	4,172	327	125	2	9,265
Introns	113,429	1,612	584	30	98,918
in coding transcripts (NM_/XM_)	111,250	1,612	584	30	98,918
in non-coding transcripts (NR_/XR_)	3,017	1,558	576	30	56,934

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.23	1	1	42
Number of exons per transcript	7.2	5	1	105

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 18,608 coding genes, 10,882 genes had a protein with an alignment covering 50% or more of the query and 1,911 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
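
The two coverage definitions can be written out as a short sketch. The function `coverage_percent`, the 1-based inclusive coordinates and the example hit are all hypothetical, and the sketch assumes coverage is measured over a single aligned span; the report does not say how gaps or multiple alignment segments are handled.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence covered by one aligned span.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical hit: a 400-residue annotated protein (query) aligned over
# residues 21..380 to residues 1..360 of a 600-residue curated protein (target).
query_coverage = coverage_percent(21, 380, 400)
target_coverage = coverage_percent(1, 360, 600)
print(query_coverage, target_coverage)
```

In this made-up example the query coverage is 90% and the target coverage is 60%, so the hit would count toward the 50% query-coverage tally but not the 95% one.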

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Hazt_2.0	GCF_000764305.1	1.71%	27.54%
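
A masked percentage of this kind can be approximated from a soft-masked sequence. The sketch below assumes that masked bases are written in lowercase and that every character counts toward the total; the report does not state how the pipeline computes its percentages (for example, whether gap bases are excluded), so the function `masked_percent` is illustrative only.

```python
def masked_percent(sequence):
    """Percentage of characters in a soft-masked sequence that are lowercase."""
    total = len(sequence)
    if total == 0:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / total


# Toy sequence: 4 of its 20 bases are soft-masked.
example = "ACGTacgtACGTACGTACGT"
print(masked_percent(example))
```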

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	1	1 (100.00%)	1 (100.00%)	99.00%	99.00%
Same-species EST	47	26 (55.32%)	16 (34.04%)	98.25%	95.81%
Crustacea GenBank	19,770	1,379 (6.98%)	70 (0.35%)	86.98%	90.80%
Crustacea EST	919,780	7,491 (0.81%)	1,694 (0.18%)	87.61%	94.25%
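
The "Number (%)" cells in the table above combine a count and a percentage of the sequences retrieved from Entrez. The following sketch parses such a cell and recomputes the percentage from the raw counts. The names `parse_count_percent` and `recomputed_percent` are hypothetical, and the rounding to two decimals is an assumption based on how the values are displayed.

```python
import re

# Matches cells such as "1,379 (6.98%)".
CELL_RE = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell):
    """Split a 'count (percent%)' cell into an int and a float."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))


def recomputed_percent(count, retrieved):
    """Percentage of retrieved sequences, rounded to two decimals."""
    return round(100.0 * count / retrieved, 2)


# (sequences retrieved, "aligned by Splign" cell) for each row of the table.
rows = [
    (1, "1 (100.00%)"),
    (47, "26 (55.32%)"),
    (19_770, "1,379 (6.98%)"),
    (919_780, "7,491 (0.81%)"),
]
for retrieved, cell in rows:
    count, reported = parse_count_percent(cell)
    print(count, reported, recomputed_percent(count, retrieved))
```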

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	715,140,764	46%	5%	134,396
SAMN01914920	Hyalella azteca Control (Hyalella azteca, SAMN01914920)	447,025	35%	9%	20,011
SAMN01915237	Hyalella azteca Experimental (Hyalella azteca, SAMN01915237)	626,623	29%	9%	23,344
SAMN04325043	Whole organism (Hyalella azteca, Mixed, pooled male and female, SAMN04325043)	121,641,360	19%	1%	92,314
SAMN04325046	Whole organism (Hyalella azteca, Mixed, pooled male and female, SAMN04325046)	132,134,120	17%	1%	92,896
SAMN04537355	whole animal (Hyalella azteca, 17-20 days, SAMN04537355)	32,106,848	59%	7%	92,475
SAMN04537356	whole animal (Hyalella azteca, 17-20 days, SAMN04537356)	37,112,084	62%	7%	97,305
SAMN04537357	whole animal (Hyalella azteca, 17-20 days, SAMN04537357)	32,899,270	62%	7%	95,433
SAMN04537358	whole animal (Hyalella azteca, 17-20 days, SAMN04537358)	35,185,080	60%	6%	96,978
SAMN04537359	whole animal (Hyalella azteca, 17-20 days, SAMN04537359)	39,124,476	53%	5%	92,146
SAMN04537360	whole animal (Hyalella azteca, 17-20 days, SAMN04537360)	32,072,250	61%	7%	95,549
SAMN04537361	whole animal (Hyalella azteca, 17-20 days, SAMN04537361)	41,772,668	67%	7%	100,180
SAMN04537362	whole animal (Hyalella azteca, 17-20 days, SAMN04537362)	37,265,384	69%	6%	98,692
SAMN04537363	whole animal (Hyalella azteca, 17-20 days, SAMN04537363)	32,089,492	67%	8%	101,467
SAMN04537364	whole animal (Hyalella azteca, 17-20 days, SAMN04537364)	32,784,720	62%	7%	89,006
SAMN04537365	whole animal (Hyalella azteca, 17-20 days, SAMN04537365)	33,238,560	62%	6%	90,363
SAMN04537366	whole animal (Hyalella azteca, 17-20 days, SAMN04537366)	32,840,460	57%	6%	89,261
SAMN04537367	whole animal (Hyalella azteca, mixed ages, SAMN04537367)	41,800,344	63%	7%	106,499
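
The "All" row can be related to the per-sample rows with a little arithmetic. The sketch below sums the read counts and computes a read-weighted mean of the aligned percentages. The weighted mean is an assumption about how an aggregate percentage could be derived, not a documented pipeline step, and because the per-sample percentages are already rounded it is only approximate.

```python
# (number of reads, percent aligned reads) per sample, in table order.
samples = [
    (447_025, 35), (626_623, 29),
    (121_641_360, 19), (132_134_120, 17),
    (32_106_848, 59), (37_112_084, 62), (32_899_270, 62),
    (35_185_080, 60), (39_124_476, 53), (32_072_250, 61),
    (41_772_668, 67), (37_265_384, 69), (32_089_492, 67),
    (32_784_720, 62), (33_238_560, 62), (32_840_460, 57),
    (41_800_344, 63),
]

total_reads = sum(reads for reads, _ in samples)

# Weight each sample's aligned percentage by its share of the reads.
weighted_aligned = sum(reads * pct for reads, pct in samples) / total_reads

print(total_reads)
print(round(weighted_aligned))
```

The two large SAMN043250xx samples align at under 20%, which is why the aggregate sits well below the roughly 60% seen for most of the other samples.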

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR687301	SRX229108	SRP018517	SAMN01914920	447,025	35%	9%
SRR687323	SRX229167	SRP018517	SAMN01915237	626,623	29%	9%
SRR2921204	SRX1470506	SRP066301	SAMN04325043	121,641,360	19%	1%
SRR2921281	SRX1470499	SRP066301	SAMN04325046	132,134,120	17%	1%
SRR3532637	SRX1618336	SRP071250	SAMN04537355	32,106,848	59%	7%
SRR3532638	SRX1618337	SRP071250	SAMN04537356	37,112,084	62%	7%
SRR3532639	SRX1618343	SRP071250	SAMN04537357	32,899,270	62%	7%
SRR3532643	SRX1618344	SRP071250	SAMN04537358	35,185,080	60%	6%
SRR3532644	SRX1618345	SRP071250	SAMN04537359	39,124,476	53%	5%
SRR3532645	SRX1618346	SRP071250	SAMN04537360	32,072,250	61%	7%
SRR3532640	SRX1618347	SRP071250	SAMN04537361	41,772,668	67%	7%
SRR3532641	SRX1618348	SRP071250	SAMN04537362	37,265,384	69%	6%
SRR3532642	SRX1618349	SRP071250	SAMN04537363	32,089,492	67%	8%
SRR3532634	SRX1618350	SRP071250	SAMN04537364	32,784,720	62%	7%
SRR3532635	SRX1618340	SRP071250	SAMN04537365	33,238,560	62%	6%
SRR3532636	SRX1618341	SRP071250	SAMN04537366	32,840,460	57%	6%
SRR3532633	SRX1618342	SRP071250	SAMN04537367	41,800,344	63%	7%
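
The accession prefixes named in the table headings indicate which archive issued a record. The mapping below is a minimal sketch: the function `archive_of` and the short archive labels are hypothetical, and only the prefixes for runs and BioSample-style samples are covered.

```python
# Prefix-to-archive mapping for run and sample accessions (illustrative labels).
ARCHIVE_BY_PREFIX = {
    "SRR": "NCBI",
    "ERR": "EBI",
    "DRR": "DDBJ",
    "SAMN": "NCBI",
    "SAME": "EBI",
    "SAMD": "DDBJ",
}


def archive_of(accession):
    """Return the issuing archive for an accession, or None if unrecognized."""
    # Try longer prefixes first so a four-letter prefix is never shadowed.
    for prefix in sorted(ARCHIVE_BY_PREFIX, key=len, reverse=True):
        if accession.startswith(prefix):
            return ARCHIVE_BY_PREFIX[prefix]
    return None


for accession in ("SRR687301", "SAMN01914920"):
    print(accession, archive_of(accession))
```

All runs and samples used in this annotation carry SRR and SAMN prefixes.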

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Nicrophorus vespilloides high-quality model RefSeq (XP_)	10,013	6,589 (65.80%)	6,589 (65.80%)	59.31%	42.42%
Same-species GenBank	1	1 (100.00%)	1 (100.00%)	98.00%	100.00%
Caenorhabditis elegans known RefSeq (NP_)	28,093	8,884 (31.62%)	8,884 (31.62%)	57.61%	35.64%
Crustacea GenBank	15,895	13,650 (85.88%)	13,650 (85.88%)	68.42%	68.54%
Daphnia pulex Other	30,973	12,169 (39.29%)	12,169 (39.29%)	60.94%	46.89%
Tribolium castaneum GenBank	598	480 (80.27%)	480 (80.27%)	65.45%	53.49%
Tribolium castaneum high-quality model RefSeq (XP_)	11,490	7,154 (62.26%)	7,154 (62.26%)	58.70%	40.46%
Tribolium castaneum known RefSeq (NP_)	621	488 (78.58%)	488 (78.58%)	63.52%	47.43%
Drosophila melanogaster known RefSeq (NP_)	30,430	17,725 (58.25%)	17,725 (58.25%)	60.85%	42.84%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
