NCBI Notechis scutatus Annotation Release 100

The RefSeq genome records for Notechis scutatus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Notechis scutatus Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Oct 3 2018
Date of submission of annotation to the public databases: Oct 6 2018
Software version: 8.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
TS10Xv2-PRI	GCF_900518725.1	UNIVERSITY OF NEW SOUTH WALES	09-24-2018	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	TS10Xv2-PRI
Genes and pseudogenes	23,178
protein-coding	19,770
non-coding	2,523
transcribed pseudogenes	0
non-transcribed pseudogenes	826
genes with variants	5,682
immunoglobulin/T-cell receptor gene segments	59
other	0
mRNAs	31,232
fully-supported	26,924
with > 5% ab initio	1,572
partial	2,404
with filled gap(s)	9
known RefSeq (NM_)	0
model RefSeq (XM_)	31,232
non-coding RNAs	3,257
fully-supported	2,665
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	3,019
pseudo transcripts	0
fully-supported	0
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	0
CDSs	31,291
fully-supported	26,924
with > 5% ab initio	1,841
partial	2,423
with major correction(s)	331
known RefSeq (NP_)	0
model RefSeq (XP_)	31,232

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	22,293	30,898	13,438	48	1,202,654
All transcripts	34,489	2,735	2,043	48	100,377
mRNA	31,232	2,925	2,209	108	100,377
misc_RNA	510	2,881	2,446	109	12,105
tRNA	238	74	73	71	84
lncRNA	2,155	667	387	50	8,219
snoRNA	184	112	98	48	320
snRNA	141	117	107	61	199
guide_RNA	14	206	168	88	388
rRNA	15	350	119	119	1,821
Single-exon transcripts	1,315	1,244	962	174	12,459
coding transcripts (NM_/XM_)	1,315	1,244	962	174	12,459
CDSs	31,232	1,916	1,290	96	99,150
Exons	207,737	259	134	1	17,100
in coding transcripts (NM_/XM_)	200,924	259	134	1	17,100
in non-coding transcripts (NR_/XR_)	10,795	231	115	2	9,282
Introns	184,166	3,930	1,378	30	914,701
in coding transcripts (NM_/XM_)	179,673	3,897	1,372	30	914,701
in non-coding transcripts (NR_/XR_)	8,401	4,373	1,487	30	657,181

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.55	1	1	50
Number of exons per transcript	10.77	8	1	310
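
As a quick consistency check, the mean number of transcripts per gene can be recomputed from the Feature lengths table. This sketch assumes (the report does not state it) that the mean is the count of all transcripts divided by the count of genes, both excluding pseudogenes. The function name is illustrative only.

```python
# Counts copied from the "Feature lengths" table (pseudogenes excluded).
GENES = 22_293
ALL_TRANSCRIPTS = 34_489


def mean_transcripts_per_gene(transcripts: int, genes: int) -> float:
    """Assumed definition: total transcripts divided by total genes."""
    if genes <= 0:
        raise ValueError("genes must be positive")
    return transcripts / genes


if __name__ == "__main__":
    # Print with two decimals, the precision used in the report.
    print(f"{mean_transcripts_per_gene(ALL_TRANSCRIPTS, GENES):.2f}")
```

Under that assumption the value rounds to 1.55, in line with the mean reported in the table.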

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 19,770 coding genes, 19,220 genes had a protein with an alignment covering 50% or more of the query and 11,732 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
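
The two coverage definitions can be illustrated with a short Python sketch. It assumes 1-based, inclusive alignment coordinates and measures coverage as the span between the first and last aligned residue. Whether the pipeline counts the span or only the aligned residues is not stated in this report, and the function name and example numbers are hypothetical.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence spanned by an alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_span = aligned_end - aligned_start + 1
    return 100.0 * aligned_span / sequence_length


if __name__ == "__main__":
    # Hypothetical hit: a 400-aa annotated protein (query) aligned over
    # residues 1-380 to residues 21-400 of a 500-aa Swiss-Prot protein (target).
    query_cov = coverage_percent(1, 380, 400)
    target_cov = coverage_percent(21, 400, 500)
    print(f"query coverage:  {query_cov:.1f}%")
    print(f"target coverage: {target_cov:.1f}%")
```

In this made-up example the hit would count toward the 95% query coverage threshold (95.0%) even though it covers only 76.0% of the target.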

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

(Graph not reproduced. Query: annotated proteins; target: UniProtKB/Swiss-Prot curated proteins.)

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
TS10Xv2-PRI	GCF_900518725.1	5.72%	39.62%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	40	39 (97.50%)	23 (57.50%)	98.69%	79.04%
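
The "Number (%)" cells in the alignment tables combine a count with a percentage of the sequences retrieved from Entrez. The Python sketch below parses such a cell and recomputes the percentage, assuming the denominator is the "retrieved from Entrez" column. The helper names are illustrative and not part of any NCBI tool.

```python
import re
from typing import Tuple

# Matches cells such as "39 (97.50%)" or "13,186 (96.02%)".
CELL = re.compile(r"^\s*([\d,]+)\s*\(([\d.]+)%\)\s*$")


def parse_count_cell(cell: str) -> Tuple[int, float]:
    """Split a 'count (percent%)' cell into an int count and a float percent."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))


def percent_of_retrieved(count: int, retrieved: int) -> float:
    """Recompute the percentage, rounded to two decimals as in the report."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return round(100.0 * count / retrieved, 2)


if __name__ == "__main__":
    retrieved = 40  # same-species transcripts retrieved from Entrez
    for cell in ("39 (97.50%)", "23 (57.50%)"):
        count, reported = parse_count_cell(cell)
        print(count, reported, percent_of_retrieved(count, retrieved))
```

For the same-species row, 39 of 40 and 23 of 40 recompute to 97.5% and 57.5%, agreeing with the reported values.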

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	1,192,597,895	51%	23%	255,008
SAMD00076719	NA	venom gland (Micrurus surinamensis, SAMD00076719)	118,266,608	37%	18%	101,709
SAMD00076720	NA	venom gland (Micrurus lemniscatus, SAMD00076720)	77,203,222	60%	28%	141,631
SAMD00076721	NA	venom gland (Micrurus corallinus, SAMD00076721)	77,771,906	50%	32%	125,775
SAMD00076722	NA	venom gland (Micrurus spixii, SAMD00076722)	110,859,044	46%	27%	126,495
SAMD00076723	NA	venom gland (Micrurus paraensis, SAMD00076723)	104,274,402	36%	19%	95,539
SAMD00076724	NA	venom gland (Micrurus lemniscatus carvalhoi, SAMD00076724)	91,913,022	38%	29%	94,828
SAMN01086130	NA	Notechis scutatus (Tiger Snake) liver transcriptome from a male individual (Notechis scutatus, SAMN01086130)	51,253,528	61%	19%	101,698
SAMN01086234	NA	Notechis scutatus (Tiger Snake) liver transcriptome from a female individual (Notechis scutatus, SAMN01086234)	54,044,366	80%	32%	94,664
SAMN01830718	25727380,23025625,23758969,25476704,26358130,29329419,29570631	Venom-gland transcriptome of the eastern coral snake (Micrurus fulvius) (Micrurus fulvius, SAMN01830718)	159,146,096	49%	22%	155,326
SAMN01922383	24351719	General Sample for Acanthophis wellsi (Acanthophis wellsi, SAMN01922383)	55,372	65%	18%	7,013
SAMN01922384	24351719	General Sample for Cacophis squamulosus (Cacophis squamulosus, SAMN01922384)	38,985	53%	54%	9,135
SAMN01922385	24351719	General Sample for Denisonia devisi (Denisonia devisi, SAMN01922385)	28,275	62%	23%	4,648
SAMN01922387	24351719	General Sample for Furina ornata (Furina ornata, SAMN01922387)	46,190	65%	22%	7,838
SAMN01922388	24351719	General Sample for Hemiaspis signata (Hemiaspis signata, SAMN01922388)	48,904	55%	14%	4,184
SAMN01922389	24351719	General Sample for Hoplocephalus bungaroides (Hoplocephalus bungaroides, SAMN01922389)	60,586	63%	37%	10,715
SAMN01922390	24351719	General Sample for Pseudonaja modesta (Pseudonaja modesta, SAMN01922390)	51,984	62%	22%	8,903
SAMN01922391	24351719	General Sample for Suta fasciata (Suta fasciata, SAMN01922391)	90,963	64%	33%	14,147
SAMN01924508	24351719	General Sample for Vermicella annulata (Vermicella annulata, SAMN01924508)	61,196	71%	15%	7,644
SAMN01924509	24351719	General Sample for Brachyurophis roperi (Brachyurophis roperi, SAMN01924509)	55,694	70%	28%	12,416
SAMN02370759	NA	Accessory gland (Ophiophagus hannah, SAMN02370759)	11,209,677	70%	9%	83,700
SAMN02370760	NA	Venom gland (Ophiophagus hannah, SAMN02370760)	15,166,590	56%	6%	57,015
SAMN02370761	NA	pooled organs (Ophiophagus hannah, SAMN02370761)	17,858,289	67%	9%	99,147
SAMN03329627	26079951	venom gland (Pseudonaja textilis, male, SAMN03329627)	104,878	35%	50%	24,706
SAMN03378984	26358635	Adult, venom gland (Ophiophagus hannah, SAMN03378984)	52,280,572	57%	16%	130,544
SAMN03658783	25727380,23025625,23758969,25476704,26358130,29329419,29570631	Venom gland (Micrurus tener, male, SAMN03658783)	122,072,966	50%	22%	144,223
SAMN04270323	NA	adult, venom gland (Naja kaouthia, SAMN04270323)	53,663,062	70%	19%	143,887
SAMN04270326	NA	adult, venom gland (Naja kaouthia, SAMN04270326)	51,430,686	61%	17%	127,081
SAMN06330126	NA	venom gland (Dendroaspis polylepis, male, SAMN06330126)	6,009,916	75%	34%	92,235
SAMN06330253	NA	venom gland (Dendroaspis angusticeps, male, SAMN06330253)	4,326,295	66%	33%	77,218
SAMN06330254	NA	Adult, venom gland (Dendroaspis viridis, SAMN06330254)	4,127,806	63%	31%	75,963
SAMN06330255	NA	venom gland (Dendroaspis jamesoni kaimosae, male, SAMN06330255)	4,651,718	68%	28%	73,543
SAMN06330256	NA	Adult, venom gland (Dendroaspis jamesoni jamesoni, SAMN06330256)	4,425,097	62%	29%	76,581
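
Per-sample percentages can be combined into a single figure by weighting each sample by its read count. The sketch below does this for the two same-species (Notechis scutatus) liver samples. It assumes a read-weighted mean is a sensible way to aggregate. The report does not say how its "All" row is computed, and the input percentages are already rounded, so the result is approximate.

```python
from typing import Iterable, Tuple


def weighted_percent_aligned(samples: Iterable[Tuple[int, float]]) -> float:
    """Read-weighted mean of per-sample 'percent aligned reads' values.

    Each item is (number_of_reads, percent_aligned).
    """
    total_reads = 0
    aligned_reads = 0.0
    for reads, percent in samples:
        total_reads += reads
        aligned_reads += reads * percent / 100.0
    if total_reads == 0:
        raise ValueError("no reads supplied")
    return 100.0 * aligned_reads / total_reads


# Rows SAMN01086130 and SAMN01086234 from the by-sample table.
NOTECHIS_LIVER = [(51_253_528, 61.0), (54_044_366, 80.0)]

if __name__ == "__main__":
    print(f"{weighted_percent_aligned(NOTECHIS_LIVER):.1f}%")
```

With these two rows the weighted value is about 70.8%, compared with 51% for the aggregate of all samples listed above.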

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
DRR089660	DRX083388	DRP003603	SAMD00076719	118,266,608	37%	18%
DRR089661	DRX083389	DRP003603	SAMD00076720	77,203,222	60%	28%
DRR089662	DRX083390	DRP003603	SAMD00076721	77,771,906	50%	32%
DRR089663	DRX083391	DRP003603	SAMD00076722	110,859,044	46%	27%
DRR089664	DRX083392	DRP003603	SAMD00076723	104,274,402	36%	19%
DRR089665	DRX083393	DRP003603	SAMD00076724	91,913,022	38%	29%
SRR630454	SRX209497	SRP011323	SAMN01830718	159,146,096	49%	22%
SRR2028245	SRX1029424	SRP011323	SAMN03658783	122,072,966	50%	22%
SRR519122	SRX158115	SRP014151	SAMN01086130	51,253,528	61%	19%
SRR519464	SRX158353	SRP014151	SAMN01086234	54,044,366	80%	32%
SRR768900	SRX247250	SRP018999	SAMN01922383	55,372	65%	18%
SRR768902	SRX247251	SRP019000	SAMN01924509	55,694	70%	28%
SRR768909	SRX247254	SRP019002	SAMN01922384	38,985	53%	54%
SRR768910	SRX247255	SRP019003	SAMN01922385	28,275	62%	23%
SRR768912	SRX247257	SRP019005	SAMN01922387	46,190	65%	22%
SRR768913	SRX247258	SRP019006	SAMN01922388	48,904	55%	14%
SRR768914	SRX247259	SRP019007	SAMN01922389	60,586	63%	37%
SRR768915	SRX247260	SRP019008	SAMN01922390	51,984	62%	22%
SRR768916	SRX247261	SRP019009	SAMN01922391	90,963	64%	33%
SRR768917	SRX247262	SRP019010	SAMN01924508	61,196	71%	15%
SRR1012886	SRX365142	SRP031481	SAMN02370759	11,209,677	70%	9%
SRR1012887	SRX365143	SRP031481	SAMN02370760	15,166,590	56%	6%
SRR1012888	SRX365144	SRP031481	SAMN02370761	17,858,289	67%	9%
SRR1791676	SRX866935	SRP053239	SAMN03329627	104,878	35%	50%
SRR1821260	SRX892825	SRP055563	SAMN03378984	52,280,572	57%	16%
SRR2917658	SRX1432812	SRP066203	SAMN04270323	53,663,062	70%	19%
SRR2917657	SRX1432814	SRP066203	SAMN04270326	51,430,686	61%	17%
SRR5485228	SRX2768426	SRP105399	SAMN06330126	6,009,916	75%	34%
SRR5485229	SRX2768427	SRP105399	SAMN06330253	4,326,295	66%	33%
SRR5485230	SRX2768428	SRP105399	SAMN06330254	4,127,806	63%	31%
SRR5485231	SRX2768429	SRP105399	SAMN06330255	4,651,718	68%	28%
SRR5485232	SRX2768430	SRP105399	SAMN06330256	4,425,097	62%	29%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pogona vitticeps high-quality model RefSeq (XP_)	13,733	13,186 (96.02%)	13,186 (96.02%)	69.83%	77.97%
Protobothrops mucrosquamatus high-quality model RefSeq (XP_)	6,944	6,879 (99.06%)	6,879 (99.06%)	77.75%	85.98%
Python bivittatus high-quality model RefSeq (XP_)	12,616	12,396 (98.26%)	12,396 (98.26%)	74.40%	81.52%
Anolis carolinensis high-quality model RefSeq (XP_)	13,146	12,444 (94.66%)	12,444 (94.66%)	67.55%	77.52%
Xenopus GenBank	31,795	29,666 (93.30%)	29,666 (93.30%)	68.66%	74.79%
Xenopus known RefSeq (NP_)	19,630	18,496 (94.22%)	18,496 (94.22%)	68.74%	75.31%
Sauropsida GenBank	22,102	19,601 (88.68%)	19,601 (88.68%)	67.95%	72.54%
Sauropsida known RefSeq (NP_)	8,123	7,658 (94.28%)	7,658 (94.28%)	71.60%	78.10%
Same-species GenBank	40	38 (95.00%)	38 (95.00%)	76.15%	73.81%
Homo sapiens GenBank	130,118	106,401 (81.77%)	106,401 (81.77%)	66.34%	72.08%
Homo sapiens known RefSeq (NP_)	51,663	45,675 (88.41%)	45,675 (88.41%)	66.32%	71.35%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
