NCBI Hyalella azteca Annotation Release 101

The RefSeq genome records for Hyalella azteca were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Similarity of current and previous assembly: The similarity of the current and previous assembly
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Hyalella azteca Annotation Release 101

Annotation release ID: 101
Date of Entrez queries for transcripts and proteins: Apr 28 2022
Date of submission of annotation to the public databases: Apr 29 2022
Software version: 9.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Hazt_2.0.2	GCF_000764305.2	Baylor College of Medicine	09-09-2019	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Hazt_2.0.2
Genes and pseudogenes	18,714
protein-coding	17,163
non-coding	1,488
Transcribed pseudogenes	0
Non-transcribed pseudogenes	62
genes with variants	3,155
Immunoglobulin/T-cell receptor gene segments	0
other	1
mRNAs	22,601
fully-supported	16,868
with > 5% ab initio	2,930
partial	2,041
with filled gap(s)	867
known RefSeq (NM_)	0
model RefSeq (XM_)	22,601
non-coding RNAs	1,804
fully-supported	1,433
with > 5% ab initio	0
partial	2
with filled gap(s)	2
known RefSeq (NR_)	0
model RefSeq (XR_)	1,504
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	22,614
fully-supported	16,868
with > 5% ab initio	3,294
partial	1,942
with major correction(s)	363
known RefSeq (NP_)	0
model RefSeq (XP_)	22,614

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	18,652	12,217	7,064	53	444,699
All transcripts	24,405	2,698	2,098	53	60,718
mRNA	22,601	2,838	2,220	189	60,718
misc_RNA	211	2,897	2,164	121	11,207
tRNA	298	73	73	53	87
lncRNA	1,224	859	515	74	16,328
snoRNA	13	136	136	66	214
snRNA	48	126	116	102	193
rRNA	9	841	153	119	3,398
Single-exon transcripts	733	1,472	1,074	267	12,422
coding transcripts (NM_/XM_)	733	1,472	1,074	267	12,422
CDSs	22,614	1,819	1,314	156	58,620
Exons	139,717	347	158	1	17,704
in coding transcripts (NM_/XM_)	136,064	347	159	1	17,704
in non-coding transcripts (NR_/XR_)	4,372	321	130	2	8,493
Introns	120,620	1,706	593	30	241,947
in coding transcripts (NM_/XM_)	118,216	1,708	593	30	241,947
in non-coding transcripts (NR_/XR_)	3,083	1,605	551	30	102,108

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.31	1	1	50
Number of exons per transcript	7.81	6	1	141
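
The mean number of transcripts per gene can be re-derived from the counts reported above. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative and not part of the report, and the values are copied from the Feature counts and Feature lengths tables.

```python
# Counts copied from the tables in this report.
protein_coding = 17_163
non_coding = 1_488
other = 1
pseudogenes = 62              # non-transcribed pseudogenes
genes_and_pseudogenes = 18_714
all_transcripts = 24_405      # "All transcripts" row, pseudogenes excluded

# The gene count without pseudogenes can be reached two ways.
genes = protein_coding + non_coding + other
assert genes == genes_and_pseudogenes - pseudogenes

mean_transcripts_per_gene = round(all_transcripts / genes, 2)
print(genes, mean_transcripts_per_gene)  # 18652 1.31
```

The mean number of exons per transcript cannot be reproduced the same way from the Exons row, presumably because that row counts each distinct exon once even when several transcripts share it.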

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the single longest protein per gene and the arthropoda_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
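
The selection of one longest protein per gene can be sketched as follows. This is an illustration only, not the pipeline's code: the function name, the record layout (gene ID, protein ID, sequence) and the tie-breaking rule (keep the first protein seen) are assumptions, and the identifiers in the example are made up.

```python
from typing import Dict, Iterable, Tuple


def longest_protein_per_gene(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Return gene_id -> (protein_id, sequence) for the longest protein of each gene.

    On ties in length, the first protein encountered is kept (an assumption).
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, protein_id, sequence in records:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Hypothetical identifiers and sequences, for illustration only.
example = [
    ("gene1", "protA", "MKT"),
    ("gene1", "protB", "MKTAA"),
    ("gene2", "protC", "M"),
]
selected = longest_protein_per_gene(example)
print(selected["gene1"][0])  # protB
```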

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 17,150 coding genes, 9,993 genes had a protein with an alignment covering 50% or more of the query and 1,738 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
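
The two coverage definitions use the same formula applied to different sequences. A minimal Python sketch, assuming a single contiguous alignment described by 1-based inclusive coordinates (the function name and the example coordinates are hypothetical; the report does not say how multiple alignment segments are combined):

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in one alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 1-300 of a 400-residue annotated protein (query)
# aligned to residues 101-400 of a 600-residue Swiss-Prot protein (target).
query_coverage = coverage_percent(1, 300, 400)
target_coverage = coverage_percent(101, 400, 600)
print(query_coverage, target_coverage)  # 75.0 50.0
```

In this example the annotated protein would count toward the "50% or more of the query" total but not toward the "95% or more" total.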

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
Hazt_2.0.2	GCF_000764305.2	27.36%
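
A masked percentage of this kind can be estimated from a soft-masked FASTA file. The sketch below assumes that masked bases are written in lowercase and that every sequence character, including N, counts toward the total; both are assumptions of this example, and the report does not state how its figure was computed.

```python
from typing import Iterable


def percent_masked(fasta_lines: Iterable[str]) -> float:
    """Percentage of lowercase (soft-masked) characters among all sequence characters."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence data found")
    return 100.0 * masked / total


# Tiny made-up example: 4 masked bases out of 12.
toy_fasta = [">scaffold_1", "ACGTacgt", "AAAA"]
print(round(percent_masked(toy_fasta), 2))  # 33.33
```

For a real assembly, the lines could come from an open file handle, for example `percent_masked(open(path))`, where `path` points to a locally downloaded soft-masked genome FASTA.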

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign, minimap2, or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	1	1 (100.00%)	1 (100.00%)	99.19%	99.93%
Same-species EST	47	24 (51.06%)	16 (34.04%)	98.29%	95.83%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,412,895,110	52%	11%	149,811
SAMN04537355	whole animal (Hyalella azteca, 17-20 days, SAMN04537355)	32,106,848	59%	11%	94,250
SAMN04537356	whole animal (Hyalella azteca, 17-20 days, SAMN04537356)	37,112,084	62%	11%	99,441
SAMN04537357	whole animal (Hyalella azteca, 17-20 days, SAMN04537357)	32,899,270	62%	11%	97,299
SAMN04537358	whole animal (Hyalella azteca, 17-20 days, SAMN04537358)	35,185,080	60%	11%	98,925
SAMN04537359	whole animal (Hyalella azteca, 17-20 days, SAMN04537359)	39,124,476	53%	10%	93,822
SAMN04537360	whole animal (Hyalella azteca, 17-20 days, SAMN04537360)	32,072,250	61%	12%	97,424
SAMN04537361	whole animal (Hyalella azteca, 17-20 days, SAMN04537361)	41,772,668	67%	10%	102,060
SAMN04537362	whole animal (Hyalella azteca, 17-20 days, SAMN04537362)	37,265,384	69%	10%	100,515
SAMN04537363	whole animal (Hyalella azteca, 17-20 days, SAMN04537363)	32,089,492	67%	11%	103,304
SAMN04537364	whole animal (Hyalella azteca, 17-20 days, SAMN04537364)	32,784,720	62%	11%	90,587
SAMN04537365	whole animal (Hyalella azteca, 17-20 days, SAMN04537365)	33,238,560	62%	10%	91,971
SAMN04537366	whole animal (Hyalella azteca, 17-20 days, SAMN04537366)	32,840,460	57%	11%	90,852
SAMN04537367	whole animal (Hyalella azteca, mixed ages, SAMN04537367)	41,800,344	63%	11%	109,043
SAMN20062602	whole (Hyalella azteca, 28d, female, SAMN20062602)	56,718,642	46%	10%	103,967
SAMN20062603	whole (Hyalella azteca, 28d, female, SAMN20062603)	48,642,652	50%	12%	96,257
SAMN20062604	whole (Hyalella azteca, 28d, female, SAMN20062604)	79,017,840	56%	12%	115,875
SAMN20062605	whole (Hyalella azteca, 28d, female, SAMN20062605)	41,262,002	36%	10%	88,261
SAMN20062606	whole (Hyalella azteca, 28d, female, SAMN20062606)	69,806,290	58%	14%	111,424
SAMN20062607	whole (Hyalella azteca, 28d, female, SAMN20062607)	29,581,154	54%	13%	93,376
SAMN20062608	whole (Hyalella azteca, 28d, female, SAMN20062608)	54,921,176	45%	13%	96,429
SAMN20062609	whole (Hyalella azteca, 28d, female, SAMN20062609)	42,118,156	56%	14%	106,051
SAMN20062610	whole (Hyalella azteca, 28d, female, SAMN20062610)	56,672,512	52%	12%	104,475
SAMN20062611	whole (Hyalella azteca, 28d, female, SAMN20062611)	43,653,862	50%	11%	97,102
SAMN20062612	whole (Hyalella azteca, 28d, female, SAMN20062612)	32,765,950	48%	12%	90,441
SAMN20062613	whole (Hyalella azteca, 28d, female, SAMN20062613)	47,450,754	33%	8%	94,697
SAMN20062614	whole (Hyalella azteca, 28d, female, SAMN20062614)	39,854,812	46%	12%	86,265
SAMN20062615	whole (Hyalella azteca, 28d, female, SAMN20062615)	45,252,672	41%	11%	93,346
SAMN20062616	whole (Hyalella azteca, 28d, female, SAMN20062616)	36,506,182	24%	9%	73,757
SAMN20062617	whole (Hyalella azteca, 28d, female, SAMN20062617)	62,687,312	46%	9%	96,371
SAMN20062618	whole (Hyalella azteca, 28d, female, SAMN20062618)	33,833,540	57%	14%	100,251
SAMN20062619	whole (Hyalella azteca, 28d, female, SAMN20062619)	53,849,162	40%	11%	86,990
SAMN20062620	whole (Hyalella azteca, 28d, female, SAMN20062620)	47,657,322	55%	14%	102,936
SAMN20062621	whole (Hyalella azteca, 28d, female, SAMN20062621)	30,351,482	53%	12%	95,395
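
One plausible way to combine per-sample alignment percentages into a single figure is a read-weighted mean. The report does not state how the "All" row is computed, so the sketch below is an assumption, and because the per-sample percentages are rounded it would not be expected to reproduce the aggregate exactly. The function names are illustrative.

```python
from typing import Iterable, Tuple


def parse_count(text: str) -> int:
    """Parse a count formatted with thousands separators, e.g. '32,106,848'."""
    return int(text.replace(",", ""))


def parse_percent(text: str) -> float:
    """Parse a percentage such as '59%' into the number 59.0."""
    return float(text.rstrip("%"))


def read_weighted_percent(rows: Iterable[Tuple[str, str]]) -> float:
    """Mean of the percentages, weighted by the number of reads in each row."""
    total_reads = 0
    weighted_sum = 0.0
    for reads_text, percent_text in rows:
        reads = parse_count(reads_text)
        total_reads += reads
        weighted_sum += reads * parse_percent(percent_text)
    if total_reads == 0:
        raise ValueError("no reads in input")
    return weighted_sum / total_reads


# First two sample rows of the table above: (number of reads, percent aligned reads).
rows = [("32,106,848", "59%"), ("37,112,084", "62%")]
print(round(read_weighted_percent(rows), 1))
```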

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR3532637	SRX1618336	SRP071250	SAMN04537355	32,106,848	59%	11%
SRR3532638	SRX1618337	SRP071250	SAMN04537356	37,112,084	62%	11%
SRR3532639	SRX1618343	SRP071250	SAMN04537357	32,899,270	62%	11%
SRR3532643	SRX1618344	SRP071250	SAMN04537358	35,185,080	60%	11%
SRR3532644	SRX1618345	SRP071250	SAMN04537359	39,124,476	53%	10%
SRR3532645	SRX1618346	SRP071250	SAMN04537360	32,072,250	61%	12%
SRR3532640	SRX1618347	SRP071250	SAMN04537361	41,772,668	67%	10%
SRR3532641	SRX1618348	SRP071250	SAMN04537362	37,265,384	69%	10%
SRR3532642	SRX1618349	SRP071250	SAMN04537363	32,089,492	67%	11%
SRR3532634	SRX1618350	SRP071250	SAMN04537364	32,784,720	62%	11%
SRR3532635	SRX1618340	SRP071250	SAMN04537365	33,238,560	62%	10%
SRR3532636	SRX1618341	SRP071250	SAMN04537366	32,840,460	57%	11%
SRR3532633	SRX1618342	SRP071250	SAMN04537367	41,800,344	63%	11%
SRR15043440	SRX11354067	SRP326989	SAMN20062602	56,718,642	46%	10%
SRR15043439	SRX11354068	SRP326989	SAMN20062603	48,642,652	50%	12%
SRR15043428	SRX11354079	SRP326989	SAMN20062604	79,017,840	56%	12%
SRR15043427	SRX11354080	SRP326989	SAMN20062605	41,262,002	36%	10%
SRR15043426	SRX11354081	SRP326989	SAMN20062606	69,806,290	58%	14%
SRR15043425	SRX11354082	SRP326989	SAMN20062607	29,581,154	54%	13%
SRR15043424	SRX11354083	SRP326989	SAMN20062608	54,921,176	45%	13%
SRR15043423	SRX11354084	SRP326989	SAMN20062609	42,118,156	56%	14%
SRR15043422	SRX11354085	SRP326989	SAMN20062610	56,672,512	52%	12%
SRR15043421	SRX11354086	SRP326989	SAMN20062611	43,653,862	50%	11%
SRR15043438	SRX11354069	SRP326989	SAMN20062612	32,765,950	48%	12%
SRR15043437	SRX11354070	SRP326989	SAMN20062613	47,450,754	33%	8%
SRR15043436	SRX11354071	SRP326989	SAMN20062614	39,854,812	46%	12%
SRR15043435	SRX11354072	SRP326989	SAMN20062615	45,252,672	41%	11%
SRR15043434	SRX11354073	SRP326989	SAMN20062616	36,506,182	24%	9%
SRR15043433	SRX11354074	SRP326989	SAMN20062617	62,687,312	46%	9%
SRR15043432	SRX11354075	SRP326989	SAMN20062618	33,833,540	57%	14%
SRR15043431	SRX11354076	SRP326989	SAMN20062619	53,849,162	40%	11%
SRR15043430	SRX11354077	SRP326989	SAMN20062620	47,657,322	55%	14%
SRR15043429	SRX11354078	SRP326989	SAMN20062621	30,351,482	53%	12%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Nicrophorus vespilloides high-quality model RefSeq (XP_)	10,013	6,510 (65.02%)	6,510 (65.02%)	58.95%	44.18%
Penaeus japonicus high-quality model RefSeq (XP_)	15,395	11,044 (71.74%)	11,044 (71.74%)	62.61%	49.54%
Same-species GenBank	1	1 (100.00%)	1 (100.00%)	98.00%	100.00%
Caenorhabditis elegans known RefSeq (NP_)	28,534	8,891 (31.16%)	8,891 (31.16%)	56.88%	36.43%
Crustacea GenBank	46,243	38,313 (82.85%)	38,313 (82.85%)	65.90%	66.10%
Daphnia pulex high-quality model RefSeq (XP_)	14,091	8,346 (59.23%)	8,346 (59.23%)	59.99%	41.56%
Procambarus clarkii high-quality model RefSeq (XP_)	13,976	10,084 (72.15%)	10,084 (72.15%)	63.37%	51.83%
Tribolium castaneum GenBank	673	537 (79.79%)	537 (79.79%)	65.36%	55.60%
Tribolium castaneum high-quality model RefSeq (XP_)	11,487	7,068 (61.53%)	7,068 (61.53%)	58.36%	42.05%
Tribolium castaneum known RefSeq (NP_)	627	489 (77.99%)	489 (77.99%)	62.86%	47.35%
Drosophila melanogaster known RefSeq (NP_)	30,704	17,700 (57.65%)	17,700 (57.65%)	60.56%	44.19%

Assembly-assembly alignments of current to previous assembly

When the assembly changes between two rounds of annotation, genes in the current and the previous annotation are mapped to each other using the genomic alignments of the current assembly to the previous assembly so that gene identifiers can be preserved. The success of the remapping depends largely on how well the two assembly versions align to each other.

Below are the percent coverage of one assembly by the other and the average percent identity of the alignments. The 'First pass' alignments are reciprocal best hits, while the 'Total' alignments also include 'Second pass' or non-reciprocal best alignments. For more information about the assembly-assembly alignment process, please visit the NCBI Genome Remapping Service page.

	First Pass	Total
Hazt_2.0.2 (Current) Coverage	99.99%	99.99%
Hazt_2.0 (Previous) Coverage	98.44%	98.44%
Percent Identity	99.14%	99.14%
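
The general idea of a reciprocal best hit, which the 'First pass' alignments are described as, can be illustrated with a small Python sketch. This is the textbook notion applied to made-up scores, not the NCBI remapping implementation; the function name, the score dictionaries and the region labels are all hypothetical.

```python
from typing import Dict, Set, Tuple

Scores = Dict[str, Dict[str, float]]


def reciprocal_best_hits(a_to_b: Scores, b_to_a: Scores) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best-scoring hit and a is b's best-scoring hit."""
    best_ab = {a: max(hits, key=hits.get) for a, hits in a_to_b.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in b_to_a.items() if hits}
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Made-up alignment scores between regions of a current (c*) and a previous (p*) assembly.
current_to_previous = {"c1": {"p1": 90, "p2": 50}, "c2": {"p1": 80, "p2": 70}}
previous_to_current = {"p1": {"c1": 90, "c2": 80}, "p2": {"c2": 70, "c1": 50}}

first_pass = reciprocal_best_hits(current_to_previous, previous_to_current)
print(first_pass)  # {('c1', 'p1')}
```

Here c2's best hit is p1, but p1's best hit is c1, so the pair (c2, p1) is not reciprocal; a non-reciprocal best alignment of this kind is what the 'Second pass' category describes.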

Comparison of the current and previous annotations

The annotation produced for this release (101) was compared to the annotation in the previous release (100) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Hazt_2.0.2 (Current) to Hazt_2.0 (Previous)
Identical	13%
Minor changes	64%
Major changes	11%
New	11%
Deprecated	18%
Other	<1%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
