NCBI Hydractinia symbiolongicarpus Annotation Release GCF_029227915.1-RS_2023_06

The genome sequence records for Hydractinia symbiolongicarpus RefSeq assembly GCF_029227915.1 (HSymV2.1) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_029227915.1-RS_2023_06".

Date of Entrez queries for transcripts and proteins: Jun 8 2023
Date of submission of annotation to the public databases: Jun 14 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
HSymV2.1	GCF_029227915.1	University of Vienna	06-05-2023	Reference	15 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	HSymV2.1
Genes and pseudogenes	50,843
protein-coding	20,492
non-coding	29,782
Transcribed pseudogenes	0
Non-transcribed pseudogenes	568
genes with variants	4,904
Immunoglobulin/T-cell receptor gene segments	0
other	1
mRNAs	28,497
fully-supported	25,085
with > 5% ab initio	3,030
partial	57
with filled gap(s)	5
known RefSeq (NM_)	0
model RefSeq (XM_)	28,497
non-coding RNAs	30,598
fully-supported	2,024
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	13,744
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	28,497
fully-supported	25,085
with > 5% ab initio	3,136
partial	56
with major correction(s)	54
known RefSeq (NP_)	0
model RefSeq (XP_)	28,497
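
As a quick consistency check on the counts above, the gene categories can be summed and compared with the reported total. The sketch below hard-codes the numbers from this report; the variable names are illustrative only and are not part of any NCBI tool.

```python
# Counts copied from the "Feature counts" table for HSymV2.1.
gene_categories = {
    "protein-coding": 20_492,
    "non-coding": 29_782,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 568,
    "immunoglobulin/T-cell receptor gene segments": 0,
    "other": 1,
}
reported_total = 50_843  # "Genes and pseudogenes"

category_sum = sum(gene_categories.values())
print(category_sum == reported_total)

# Removing pseudogenes gives the "Genes" count used in the feature-length table.
pseudogenes = (gene_categories["transcribed pseudogenes"]
               + gene_categories["non-transcribed pseudogenes"])
genes_without_pseudogenes = reported_total - pseudogenes
print(genes_without_pseudogenes)

# mRNAs plus non-coding RNAs give the "All transcripts" count.
mrnas, noncoding_rnas = 28_497, 30_598
all_transcripts = mrnas + noncoding_rnas
print(all_transcripts)
```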

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	50,275	3,845	165	63	255,466
All transcripts	59,095	1,524	790	63	76,089
mRNA	28,497	2,747	2,133	258	76,089
misc_RNA	639	3,150	2,528	206	19,401
tRNA	16,854	75	73	70	89
lncRNA	1,385	1,937	1,193	101	28,438
snoRNA	103	165	217	63	223
snRNA	4,020	147	139	105	199
rRNA	7,596	688	119	115	3,674
Single-exon transcripts	2,594	1,598	1,278	258	12,925
coding transcripts (NM_/XM_ )	2,594	1,598	1,278	258	12,925
CDSs	28,497	1,814	1,278	159	75,486
Exons	171,381	346	141	3	59,068
in coding transcripts (NM_/XM_ )	166,730	336	140	3	59,068
in non-coding transcripts (NR_/XR_ )	8,587	476	155	4	28,380
Introns	146,982	1,041	359	30	198,511
in coding transcripts (NM_/XM_ )	144,149	1,025	358	30	198,511
in non-coding transcripts (NR_/XR_ )	6,699	1,440	409	30	75,270

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.26	1	1	50
Number of exons per transcript	6.69	3	1	180

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode with the metazoa_odb10 lineage dataset on the annotated gene set, using the single longest protein per gene. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
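
The "one longest protein per gene" selection can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name, the input tuple layout, and the tie-breaking rule (first protein seen wins) are assumptions, and the identifiers and sequences are made up.

```python
from typing import Dict, Iterable, Tuple


def longest_protein_per_gene(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Keep one (protein_id, sequence) per gene: the longest sequence.

    `records` yields (gene_id, protein_id, sequence) tuples. On a tie,
    the first protein seen for the gene is kept (an assumption).
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, protein_id, sequence in records:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Hypothetical identifiers and sequences, for illustration only.
example = [
    ("geneA", "protA.1", "MKT"),
    ("geneA", "protA.2", "MKTLLV"),
    ("geneB", "protB.1", "MSS"),
]
selected = longest_protein_per_gene(example)
print(selected["geneA"][0])  # protA.2
```

The resulting protein set would then be the input to BUSCO in protein mode.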

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 20,492 coding genes, 11,789 genes had a protein with an alignment covering 50% or more of the query and 2,530 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
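
The two coverage definitions can be illustrated with a short sketch. It treats each alignment as a single contiguous span with 1-based inclusive coordinates, which is an assumption; the pipeline may handle gaps or multiple alignment segments differently. The function name and the example coordinates are hypothetical.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence that is included in an alignment.

    Coordinates are 1-based and inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    aligned = aln_end - aln_start + 1
    return 100.0 * aligned / seq_length


# Hypothetical alignment: residues 11-210 of a 400-residue annotated protein
# (query) aligned to residues 1-200 of a 250-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 210, 400)
target_cov = coverage_percent(1, 200, 250)
print(query_cov, target_cov)  # 50.0 80.0
```

In this example the gene would count toward the "50% or more of the query" total but not toward the 95% total.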

[Figure: cumulative number of genes with alignments above a given query or target coverage threshold, alongside the corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline. Query: annotated proteins. Target: UniProtKB/Swiss-Prot curated proteins.]

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
HSymV2.1	GCF_029227915.1	52.37%
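
A masked percentage of this kind can be estimated from a soft-masked sequence file. The sketch below assumes that masked bases are written in lowercase (soft-masking) and that every base, including any N, counts toward the total; the report does not state how its figure was computed, so treat this as an illustration only. The function name and the toy sequences are hypothetical.

```python
def percent_soft_masked(sequences):
    """Return the percentage of bases that are lowercase (soft-masked)."""
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data")
    return 100.0 * masked / total


# Toy sequences: 8 of the 20 bases are lowercase.
toy = ["ACGTacgtAC", "ggggCCCCTT"]
toy_percent = percent_soft_masked(toy)
print(f"{toy_percent:.2f}%")  # 40.00%
```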

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	446	445 (99.78%)	78 (17.49%)	97.51%	99.27%
Same-species TSA	186,563	178,074 (95.45%)	138,191 (74.07%)	99.17%	98.73%
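
The percentages in the table above are consistent with each count being divided by the number of sequences retrieved from Entrez. The sketch below parses the "count (percent)" cells and recomputes the percentages; the helper name and the regular expression are illustrative, not part of the pipeline.

```python
import re

# Matches cells such as "445 (99.78%)" or "178,074 (95.45%)".
CELL = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_count_percent(cell: str):
    """Split a cell such as '445 (99.78%)' into (445, 99.78)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# (retrieved, aligned by Splign, passed to Gnomon), copied from the table.
rows = {
    "Same-species GenBank": (446, "445 (99.78%)", "78 (17.49%)"),
    "Same-species TSA": (186_563, "178,074 (95.45%)", "138,191 (74.07%)"),
}

for source, (retrieved, aligned_cell, passed_cell) in rows.items():
    for cell in (aligned_cell, passed_cell):
        count, reported = parse_count_percent(cell)
        recomputed = round(100.0 * count / retrieved, 2)
        print(source, count, reported, recomputed)
```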

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	6,726,015,054	35%	32%	162,113
SAMN12098753	NA	adult polyp (Hydractinia symbiolongicarpus, female, SAMN12098753)	148,140,222	16%	32%	5,211
SAMN12098754	NA	adult polyp (Hydractinia symbiolongicarpus, female, SAMN12098754)	182,237,200	18%	32%	27,437
SAMN12098755	NA	adult polyp (Hydractinia symbiolongicarpus, female, SAMN12098755)	159,153,624	20%	37%	13,540
SAMN12098759	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098759)	207,710,622	25%	20%	30,220
SAMN12098760	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098760)	136,061,916	29%	21%	26,503
SAMN12098763	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098763)	185,656,444	17%	40%	19,060
SAMN12098764	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098764)	173,690,968	12%	8%	4,527
SAMN12098767	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098767)	174,732,160	31%	37%	93,484
SAMN12098768	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN12098768)	186,630,512	30%	23%	19,373
SAMN18789452	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN18789452)	178,613,632	10%	6%	7,549
SAMN18789456	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN18789456)	155,230,706	22%	24%	32,044
SAMN18789457	NA	adult polyp (Hydractinia symbiolongicarpus, male, SAMN18789457)	165,499,760	25%	19%	12,525
SAMN18789459	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789459)	169,665,754	27%	11%	48,171
SAMN18789460	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789460)	266,643,598	16%	30%	28,838
SAMN18789461	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789461)	183,097,840	20%	10%	30,322
SAMN18789462	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789462)	250,108,664	11%	22%	12,535
SAMN18789467	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789467)	176,695,290	17%	20%	70,909
SAMN18789468	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789468)	151,404,262	21%	27%	90,014
SAMN18789469	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789469)	161,616,838	17%	28%	94,403
SAMN18789470	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789470)	216,934,362	23%	26%	94,261
SAMN18789471	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789471)	224,604,126	18%	24%	78,977
SAMN18789472	NA	adult polyp (Hydractinia symbiolongicarpus, SAMN18789472)	162,998,954	24%	27%	93,651
SAMN26175503	35287376	40 feeding polyps + 10 mature female reproductive polyps (Hydractinia symbiolongicarpus, female, SAMN26175503)	267,344,638	78%	28%	141,363
SAMN27479252	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479252)	194,973,848	48%	41%	138,738
SAMN27479253	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479253)	242,878,648	58%	39%	146,214
SAMN27479255	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479255)	174,671,742	39%	32%	138,777
SAMN27479256	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479256)	247,473,722	57%	38%	139,440
SAMN27479258	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479258)	250,025,970	55%	37%	143,735
SAMN27479259	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479259)	206,341,964	54%	41%	139,487
SAMN27479260	NA	hypostome (Hydractinia symbiolongicarpus, centuries, male, SAMN27479260)	275,733,374	56%	41%	141,511
SAMN35011025	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011025)	51,543,326	63%	31%	115,329
SAMN35011027	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011027)	54,448,772	71%	33%	111,445
SAMN35011029	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011029)	58,644,316	61%	33%	111,036
SAMN35011030	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011030)	40,093,656	58%	33%	108,293
SAMN35011031	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011031)	11,844,306	55%	32%	99,088
SAMN35011034	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011034)	45,865,530	68%	33%	110,070
SAMN35011050	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011050)	78,364,776	71%	32%	118,434
SAMN35011052	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011052)	66,005,520	69%	34%	111,443
SAMN35011054	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011054)	65,514,608	66%	34%	111,356
SAMN35011055	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011055)	82,646,060	73%	34%	113,484
SAMN35011056	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011056)	73,102,290	71%	34%	112,249
SAMN35011059	NA	Whole embryo (Hydractinia symbiolongicarpus, SAMN35011059)	77,144,482	73%	34%	112,930

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR9331394	SRX6097993	SRP201979	SAMN12098753	148,140,222	16%	32%
SRR9331395	SRX6097992	SRP201979	SAMN12098754	182,237,200	18%	32%
SRR9331396	SRX6097991	SRP201979	SAMN12098755	159,153,624	20%	37%
SRR9331400	SRX6097987	SRP201979	SAMN12098759	207,710,622	25%	20%
SRR9331401	SRX6097986	SRP201979	SAMN12098760	136,061,916	29%	21%
SRR9331390	SRX6097997	SRP201979	SAMN12098763	185,656,444	17%	40%
SRR9331391	SRX6097996	SRP201979	SAMN12098764	173,690,968	12%	8%
SRR9331388	SRX6097999	SRP201979	SAMN12098767	174,732,160	31%	37%
SRR9331389	SRX6097998	SRP201979	SAMN12098768	186,630,512	30%	23%
SRR14265608	SRX10627431	SRP201979	SAMN18789452	178,613,632	10%	6%
SRR14265614	SRX10627425	SRP201979	SAMN18789456	155,230,706	22%	24%
SRR14265613	SRX10627426	SRP201979	SAMN18789457	165,499,760	25%	19%
SRR14265611	SRX10627428	SRP201979	SAMN18789459	169,665,754	27%	11%
SRR14265610	SRX10627429	SRP201979	SAMN18789460	266,643,598	16%	30%
SRR14265609	SRX10627430	SRP201979	SAMN18789461	183,097,840	20%	10%
SRR14265606	SRX10627433	SRP201979	SAMN18789462	250,108,664	11%	22%
SRR14265622	SRX10627417	SRP201979	SAMN18789467	176,695,290	17%	20%
SRR14265621	SRX10627418	SRP201979	SAMN18789468	151,404,262	21%	27%
SRR14265620	SRX10627419	SRP201979	SAMN18789469	161,616,838	17%	28%
SRR14265619	SRX10627420	SRP201979	SAMN18789470	216,934,362	23%	26%
SRR14265618	SRX10627421	SRP201979	SAMN18789471	224,604,126	18%	24%
SRR14265616	SRX10627423	SRP201979	SAMN18789472	162,998,954	24%	27%
SRR18100457	SRX14251343	SRP360956	SAMN26175503	120,867,778	86%	25%
SRR18100456	SRX14251344	SRP360956	SAMN26175503	146,476,860	71%	32%
SRR18686545	SRX14787587	SRP368295	SAMN27479252	194,973,848	48%	41%
SRR18686544	SRX14787588	SRP368295	SAMN27479253	242,878,648	58%	39%
SRR18686542	SRX14787590	SRP368295	SAMN27479255	174,671,742	39%	32%
SRR18686541	SRX14787591	SRP368295	SAMN27479256	247,473,722	57%	38%
SRR18686539	SRX14787593	SRP368295	SAMN27479258	250,025,970	55%	37%
SRR18686538	SRX14787594	SRP368295	SAMN27479259	206,341,964	54%	41%
SRR18686546	SRX14787586	SRP368295	SAMN27479260	275,733,374	56%	41%
SRR24482136	SRX20267759	SRP436676	SAMN35011025	51,543,326	63%	31%
SRR24482138	SRX20267757	SRP436676	SAMN35011027	54,448,772	71%	33%
SRR24482140	SRX20267755	SRP436676	SAMN35011029	58,644,316	61%	33%
SRR24482141	SRX20267754	SRP436676	SAMN35011030	40,093,656	58%	33%
SRR24482142	SRX20267753	SRP436676	SAMN35011031	11,844,306	55%	32%
SRR24482145	SRX20267750	SRP436676	SAMN35011034	45,865,530	68%	33%
SRR24482166	SRX20267729	SRP436676	SAMN35011050	78,364,776	71%	32%
SRR24482168	SRX20267727	SRP436676	SAMN35011052	66,005,520	69%	34%
SRR24482170	SRX20267725	SRP436676	SAMN35011054	65,514,608	66%	34%
SRR24482171	SRX20267724	SRP436676	SAMN35011055	82,646,060	73%	34%
SRR24482172	SRX20267723	SRP436676	SAMN35011056	73,102,290	71%	34%
SRR24482175	SRX20267720	SRP436676	SAMN35011059	77,144,482	73%	34%
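
Most samples have a single run, so their by-run and by-sample rows are identical. SAMN26175503 has two runs, and its by-sample row appears to be a read-weighted combination of them. The sketch below checks this under that assumption; because the run-level percentages are already rounded, the result is only approximate.

```python
# Run-level rows for sample SAMN26175503, copied from the by-run table:
# (run accession, number of reads, percent aligned reads)
runs = [
    ("SRR18100457", 120_867_778, 86),
    ("SRR18100456", 146_476_860, 71),
]

total_reads = sum(reads for _, reads, _ in runs)

# Weight each run's aligned percentage by its read count.
aligned_reads = sum(reads * pct / 100 for _, reads, pct in runs)
sample_pct = 100 * aligned_reads / total_reads

print(total_reads)        # 267344638, the by-sample read count
print(round(sample_pct))  # 78, the by-sample percent aligned
```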

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	342	342 (100.00%)	342 (100.00%)	76.85%	84.60%
Dendronephthya gigantea high-quality model RefSeq (XP_)	15,619	9,691 (62.05%)	9,691 (62.05%)	57.42%	43.74%
Nematostella vectensis GenBank	394	328 (83.25%)	328 (83.25%)	63.03%	45.12%
Nematostella vectensis model RefSeq (XP_)	32,272	20,136 (62.39%)	20,136 (62.39%)	57.11%	39.10%
Saccharomyces cerevisiae known RefSeq (NP_)	5,998	1,714 (28.58%)	1,714 (28.58%)	56.62%	45.55%
Caenorhabditis elegans known RefSeq (NP_)	28,561	8,470 (29.66%)	8,470 (29.66%)	56.54%	37.44%
Drosophila melanogaster known RefSeq (NP_)	30,786	13,715 (44.55%)	13,715 (44.55%)	57.91%	43.48%
Strongylocentrotus purpuratus high-quality model RefSeq (XP_)	19,173	9,954 (51.92%)	9,954 (51.92%)	58.16%	44.01%
Strongylocentrotus purpuratus known RefSeq (NP_)	425	274 (64.47%)	274 (64.47%)	65.16%	56.59%
Homo sapiens known RefSeq (NP_)	67,077	35,614 (53.09%)	35,614 (53.09%)	57.73%	43.11%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
