NCBI Fundulus heteroclitus Annotation Release 101

The RefSeq genome records for Fundulus heteroclitus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Fundulus heteroclitus Annotation Release 101.

Annotation release ID: 101
Date of Entrez queries for transcripts and proteins: May 29 2017
Date of submission of annotation to the public databases: Jun 1 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Fundulus_heteroclitus-3.0.2	GCF_000826765.1	The Genome Institute at Washington University School of Medicine	01-21-2015	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Fundulus_heteroclitus-3.0.2
Genes and pseudogenes	26,290
protein-coding	23,561
non-coding	2,476
pseudogenes	253
genes with variants	7,233
mRNAs	38,221
fully-supported	36,440
with > 5% ab initio	600
partial	6,743
with filled gap(s)	6,054
known RefSeq (NM_)	79
model RefSeq (XM_)	38,142
Other RNAs	3,334
fully-supported	2,942
with > 5% ab initio	0
partial	44
with filled gap(s)	44
known RefSeq (NR_)	0
model RefSeq (XR_)	2,942
CDSs	38,221
fully-supported	36,440
with > 5% ab initio	725
partial	6,199
with major correction(s)	2,960
known RefSeq (NP_)	79
model RefSeq (XP_)	38,142
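
The subcategory counts in the table above can be cross-checked against their reported totals. The sketch below is only an illustration: the dictionary keys and the helper name `check_total` are hypothetical, and the figures are copied from the table.

```python
# Counts copied from the feature table for Fundulus_heteroclitus-3.0.2.
genes = {"protein-coding": 23_561, "non-coding": 2_476, "pseudogenes": 253}
mrnas = {"known RefSeq (NM_)": 79, "model RefSeq (XM_)": 38_142}


def check_total(parts: dict, reported_total: int) -> bool:
    """Return True when the subcategory counts add up to the reported total."""
    return sum(parts.values()) == reported_total


print(check_total(genes, 26_290))  # Genes and pseudogenes
print(check_total(mrnas, 38_221))  # mRNAs
```

Only mutually exclusive subcategories are summed here. Attribute rows such as "partial" or "with filled gap(s)" can apply to the same feature, so they are not expected to add up to the total.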

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	26,037	22,132	9,773	68	841,817
All transcripts	41,555	3,134	2,488	68	90,562
mRNA	38,221	3,308	2,637	150	90,562
misc_RNA	546	3,273	2,537	116	13,846
tRNA	392	74	73	68	86
lncRNA	2,396	838	563	86	12,389
Single-exon transcripts	1,023	1,758	1,512	156	8,888
coding transcripts (NM_/XM_ )	1,023	1,758	1,512	156	8,888
CDSs	38,221	2,024	1,473	96	90,321
Exons	262,430	278	136	1	17,307
in coding transcripts (NM_/XM_ )	254,777	278	136	1	17,307
in non-coding transcripts (NR_/XR_ )	11,807	260	127	2	8,828
Introns	231,224	2,375	530	30	443,673
in coding transcripts (NM_/XM_ )	226,065	2,358	534	30	443,673
in non-coding transcripts (NR_/XR_ )	9,194	2,846	461	30	141,384

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.61	1	1	44
Number of exons per transcript	11.73	8	1	253

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,561 coding genes, 21,707 genes had a protein with an alignment covering 50% or more of the query and 9,737 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
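
As a minimal sketch of these two definitions, the Python below computes a coverage percentage from alignment coordinates. The function name `coverage_pct`, the 1-based inclusive coordinates, and the example lengths are assumptions made for illustration; they are not taken from the pipeline, and the sketch counts the aligned span without accounting for gaps inside the alignment.

```python
def coverage_pct(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence that is included in an alignment.

    Coordinates are assumed to be 1-based and inclusive (hypothetical convention).
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not 1 <= aln_start <= aln_end <= seq_length:
        raise ValueError("alignment must lie within the sequence")
    aligned = aln_end - aln_start + 1
    return 100.0 * aligned / seq_length


# Hypothetical alignment: residues 11-200 of a 200-aa annotated protein (query)
# aligned to residues 1-190 of a 400-aa Swiss-Prot protein (target).
query_cov = coverage_pct(11, 200, 200)   # 95.0
target_cov = coverage_pct(1, 190, 400)   # 47.5
```

In this made-up case the annotated protein would count toward the "95% or more of the query" group, while the target coverage stays below 50%.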

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Fundulus_heteroclitus-3.0.2	GCF_000826765.1	2.01%	25.55%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	81	81 (100.00%)	73 (90.12%)	98.45%	96.64%
Same-species GenBank	250	243 (97.20%)	229 (91.60%)	98.85%	97.02%
Same-species EST	89,849	52,847 (58.82%)	42,737 (47.57%)	98.32%	96.20%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,663,250,801	78%	34%	283,757
SAMN02116571	25158801	Fundulus heteroclitus Transcriptome Assembly (Fundulus heteroclitus, SAMN02116571)	427,093,658	81%	32%	255,907
SAMN03775336	NA	Gonad (Fundulus heteroclitus heteroclitus, male, SAMN03775336)	44,202,566	76%	35%	180,451
SAMN05752007	21609454	embryos, PCB-126 (Fundulus heteroclitus, 10 dpf, SAMN05752007)	163,819	55%	38%	5,928
SAMN05752008	21609454	embryos, DMSO (Fundulus heteroclitus, 10 dpf, SAMN05752008)	129,399	55%	39%	4,851
SAMN05752019	21609454	embryos, PCB-126 (Fundulus heteroclitus, 10 dpf, SAMN05752019)	177,193	56%	35%	5,793
SAMN05752044	21609454	embryos, untreated (Fundulus heteroclitus, 1-15 dpf, SAMN05752044)	780,686	56%	56%	120,092
SAMN05752058	21609454	embryos, untreated (Fundulus heteroclitus, 1-15 dpf, SAMN05752058)	756,261	56%	53%	118,798
SAMN05752656	21609454	embryos, DMSO (Fundulus heteroclitus, 10 dpf, SAMN05752656)	164,128	56%	38%	5,615
SAMN05890460	NA	day 7 gonad control (Fundulus heteroclitus, male, SAMN05890460)	50,796,398	76%	35%	182,701
SAMN05890461	NA	day 7 gonad exposure (Fundulus heteroclitus, male, SAMN05890461)	64,506,958	74%	33%	128,767
SAMN05890462	NA	non sexually active gonad (Fundulus heteroclitus, male, SAMN05890462)	94,745,344	77%	36%	142,628
SAMN05890463	NA	day 7 gonad control (Fundulus heteroclitus, female, SAMN05890463)	68,979,536	82%	39%	175,422
SAMN05890464	NA	day 7 gonad control (Fundulus heteroclitus, female, SAMN05890464)	20,601,062	84%	38%	142,815
SAMN05890465	NA	day 7 gonad exposure (Fundulus heteroclitus, female, SAMN05890465)	53,281,243	83%	38%	146,494
SAMN05890466	NA	non sexually active gonad (Fundulus heteroclitus, female, SAMN05890466)	48,869,426	85%	39%	162,460
SAMN05919683	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919683)	65,086,998	80%	37%	156,269
SAMN05919684	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919684)	47,283,556	78%	37%	141,530
SAMN05919685	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919685)	31,364,398	75%	35%	143,135
SAMN05919686	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919686)	44,862,270	76%	35%	147,406
SAMN05919687	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919687)	37,283,422	76%	35%	145,583
SAMN05919688	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919688)	41,748,732	79%	36%	150,103
SAMN05919689	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919689)	61,147,936	75%	35%	149,486
SAMN05919690	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919690)	48,065,494	74%	34%	144,911
SAMN05919691	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919691)	57,160,418	68%	32%	155,566
SAMN05919692	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919692)	56,804,944	78%	32%	154,969
SAMN05919693	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919693)	48,593,998	77%	34%	156,357
SAMN05919694	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919694)	49,021,044	77%	33%	147,110
SAMN05919695	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919695)	47,408,296	76%	33%	156,267
SAMN05919696	NA	skeletal muscle (Fundulus heteroclitus, female, SAMN05919696)	52,535,624	78%	34%	154,522
SAMN05919697	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919697)	48,320,502	72%	33%	145,921
SAMN05919698	NA	skeletal muscle (Fundulus heteroclitus, male, SAMN05919698)	51,315,492	76%	32%	175,002

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR866251	SRX278446	SRP022771	SAMN02116571	1,309,254	29%	43%
SRR1692373	SRX795206	SRP050504	SAMN02116571	212,807,968	78%	33%
SRR1692374	SRX795206	SRP050504	SAMN02116571	212,976,436	84%	30%
SRR2062963	SRX1058750	SRP059504	SAMN03775336	44,202,566	76%	35%
SRR4204746	SRX2148777	SRP087635	SAMN05752007	163,819	55%	38%
SRR4204745	SRX2148776	SRP087635	SAMN05752008	129,399	55%	39%
SRR4204748	SRX2148779	SRP087635	SAMN05752019	177,193	56%	35%
SRR4204750	SRX2148781	SRP087635	SAMN05752044	780,686	56%	56%
SRR4204749	SRX2148780	SRP087635	SAMN05752058	756,261	56%	53%
SRR4204747	SRX2148778	SRP087635	SAMN05752656	164,128	56%	38%
SRR4384819	SRX2231675	SRP091059	SAMN05890460	50,796,398	76%	35%
SRR4384820	SRX2231676	SRP091059	SAMN05890461	64,506,958	74%	33%
SRR4384821	SRX2231677	SRP091059	SAMN05890462	94,745,344	77%	36%
SRR4384822	SRX2231678	SRP091059	SAMN05890463	68,979,536	82%	39%
SRR4384823	SRX2231679	SRP091059	SAMN05890464	20,601,062	84%	38%
SRR4384824	SRX2231680	SRP091059	SAMN05890465	53,281,243	83%	38%
SRR4384825	SRX2231681	SRP091059	SAMN05890466	48,869,426	85%	39%
SRR4431231	SRX2251139	SRP091735	SAMN05919683	65,086,998	80%	37%
SRR4431233	SRX2251141	SRP091735	SAMN05919684	47,283,556	78%	37%
SRR4431239	SRX2251147	SRP091735	SAMN05919685	31,364,398	75%	35%
SRR4431238	SRX2251146	SRP091735	SAMN05919686	44,862,270	76%	35%
SRR4431230	SRX2251138	SRP091735	SAMN05919687	37,283,422	76%	35%
SRR4431237	SRX2251145	SRP091735	SAMN05919688	41,748,732	79%	36%
SRR4431234	SRX2251142	SRP091735	SAMN05919689	61,147,936	75%	35%
SRR4431232	SRX2251140	SRP091735	SAMN05919690	48,065,494	74%	34%
SRR4431236	SRX2251144	SRP091735	SAMN05919691	57,160,418	68%	32%
SRR4431235	SRX2251143	SRP091735	SAMN05919692	56,804,944	78%	32%
SRR4431245	SRX2251153	SRP091735	SAMN05919693	48,593,998	77%	34%
SRR4431244	SRX2251152	SRP091735	SAMN05919694	49,021,044	77%	33%
SRR4431243	SRX2251151	SRP091735	SAMN05919695	47,408,296	76%	33%
SRR4431242	SRX2251150	SRP091735	SAMN05919696	52,535,624	78%	34%
SRR4431241	SRX2251149	SRP091735	SAMN05919697	48,320,502	72%	33%
SRR4431240	SRX2251148	SRP091735	SAMN05919698	51,315,492	76%	32%
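
The by-run and by-sample tables can be related by summing reads over the runs that share a sample. The sketch below does this for SAMN02116571, the only sample listed with more than one run. The function name `aggregate_by_sample` is hypothetical, and treating the sample-level percentage as a read-weighted mean of the rounded run-level percentages is an assumption, so the result is approximate.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned reads), copied from the by-run table.
runs = [
    ("SRR866251", "SAMN02116571", 1_309_254, 29),
    ("SRR1692373", "SAMN02116571", 212_807_968, 78),
    ("SRR1692374", "SAMN02116571", 212_976_436, 84),
]


def aggregate_by_sample(rows):
    """Sum reads per sample and estimate the read-weighted percent aligned."""
    total_reads = defaultdict(int)
    aligned_reads = defaultdict(float)
    for _run, sample, reads, pct_aligned in rows:
        total_reads[sample] += reads
        aligned_reads[sample] += reads * pct_aligned / 100
    return {
        sample: (total_reads[sample], 100 * aligned_reads[sample] / total_reads[sample])
        for sample in total_reads
    }


summary = aggregate_by_sample(runs)
reads, pct = summary["SAMN02116571"]
print(f"{reads:,} reads, about {pct:.0f}% aligned")
```

The read total, 427,093,658, and the rounded percentage, 81%, agree with the SAMN02116571 row of the by-sample table.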

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Actinopterygii GenBank	77,471	73,982 (95.50%)	73,982 (95.50%)	68.78%	77.73%
Actinopterygii known RefSeq (NP_)	24,632	23,473 (95.29%)	23,473 (95.29%)	68.02%	75.39%
Danio rerio high-quality model RefSeq (XP_)	7,632	7,272 (95.28%)	7,272 (95.28%)	65.34%	69.02%
Same-species GenBank	236	230 (97.46%)	230 (97.46%)	76.15%	83.59%
Same-species known RefSeq (NP_)	81	81 (100.00%)	81 (100.00%)	72.84%	80.66%
Poecilia reticulata high-quality model RefSeq (XP_)	16,791	16,674 (99.30%)	16,674 (99.30%)	71.93%	79.33%
Oryzias latipes high-quality model RefSeq (XP_)	13,692	13,519 (98.74%)	13,519 (98.74%)	69.94%	77.40%
Homo sapiens GenBank	128,779	101,027 (78.45%)	101,027 (78.45%)	64.41%	66.98%
Homo sapiens known RefSeq (NP_)	47,564	40,139 (84.39%)	40,139 (84.39%)	65.33%	65.41%
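
The percentages in parentheses are relative to the number of sequences retrieved from Entrez. The short sketch below recomputes a few of them from the counts in the table; the helper name `pct_of_retrieved` is hypothetical.

```python
def pct_of_retrieved(count: int, retrieved: int) -> str:
    """Format a count as a percentage of the retrieved sequences, to two decimals."""
    return f"{100 * count / retrieved:.2f}%"


# (source, retrieved from Entrez, aligned by ProSplign), copied from the table above.
rows = [
    ("Same-species GenBank", 236, 230),
    ("Poecilia reticulata high-quality model RefSeq (XP_)", 16_791, 16_674),
    ("Homo sapiens known RefSeq (NP_)", 47_564, 40_139),
]

for source, retrieved, aligned in rows:
    print(source, pct_of_retrieved(aligned, retrieved))
```

The same calculation applies to the "passed to Gnomon" column, which in the protein table equals the aligned column for every source.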

Comparison of the current and previous annotations

The annotation produced for this release (101) was compared to the annotation in the previous release (100) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Fundulus_heteroclitus-3.0.2 (Current) to Fundulus_heteroclitus-3.0.2 (Previous)
Identical	1%
Minor changes	73%
Major changes	13%
New	10%
Deprecated	8%
Other	2%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
