NCBI Falco peregrinus Annotation Release 102

The RefSeq genome records for Falco peregrinus were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Falco peregrinus Annotation Release 102.

Annotation release ID: 102
Date of Entrez queries for transcripts and proteins: Jan 18 2019
Date of submission of annotation to the public databases: Jan 24 2019
Software version: 8.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
F_peregrinus_v1.0	GCF_000337955.1	BGI	02-05-2013	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	F_peregrinus_v1.0
Genes and pseudogenes	18,971
  protein-coding	15,230
  non-coding	3,632
  transcribed pseudogenes	6
  non-transcribed pseudogenes	91
  genes with variants	6,260
  immunoglobulin/T-cell receptor gene segments	12
  other	0
mRNAs	31,127
  fully-supported	28,354
  with > 5% ab initio	855
  partial	1,354
  with filled gap(s)	0
  known RefSeq (NM_)	11
  model RefSeq (XM_)	31,116
non-coding RNAs	6,594
  fully-supported	6,130
  with > 5% ab initio	0
  partial	2
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	6,373
pseudo transcripts	6
  fully-supported	4
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	6
CDSs	31,152
  fully-supported	28,354
  with > 5% ab initio	1,310
  partial	1,355
  with major correction(s)	1,220
  known RefSeq (NP_)	24
  model RefSeq (XP_)	31,116
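
The gene and mRNA rows of the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the table. Leaving "genes with variants" out of the sum is an assumption (it appears to overlap the other categories rather than form its own), and the variable names are illustrative only.

```python
# Counts copied from the "Feature counts" table.
gene_categories = {
    "protein-coding": 15_230,
    "non-coding": 3_632,
    "transcribed pseudogenes": 6,
    "non-transcribed pseudogenes": 91,
    "immunoglobulin/T-cell receptor gene segments": 12,
    "other": 0,
}
total_genes_and_pseudogenes = 18_971

# Assumption: "genes with variants" (6,260) is a property of genes already
# counted above, so it is not added to the sum.
category_sum = sum(gene_categories.values())
print("gene categories add up:", category_sum == total_genes_and_pseudogenes)

# mRNAs split into known (NM_) and model (XM_) RefSeq records.
mrna_total = 31_127
mrna_known_plus_model = 11 + 31_116
print("mRNA rows add up:", mrna_known_plus_model == mrna_total)
```

Both checks hold for the counts reported here.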

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	18,862	31,019	12,811	62	1,204,310
All transcripts	37,721	3,739	3,002	39	95,289
  mRNA	31,127	3,888	3,186	185	95,289
  misc_RNA	1,247	3,948	3,148	181	19,396
  tRNA	219	74	73	66	84
  lncRNA	4,883	3,080	1,822	39	26,084
  snoRNA	194	110	93	62	321
  snRNA	34	144	149	106	190
  guide_RNA	15	180	139	129	306
  rRNA	2	1,289	1,599	979	1,599
Single-exon transcripts	543	1,919	1,333	185	11,107
  coding transcripts (NM_/XM_)	543	1,919	1,333	185	11,107
CDSs	31,140	2,049	1,497	96	94,305
Exons	202,028	349	135	1	24,156
  in coding transcripts (NM_/XM_)	187,384	316	133	1	22,154
  in non-coding transcripts (NR_/XR_)	21,954	586	150	2	24,156
Introns	181,608	3,546	1,025	28	895,868
  in coding transcripts (NM_/XM_)	171,142	3,432	1,004	28	895,868
  in non-coding transcripts (NR_/XR_)	17,593	4,424	1,306	30	432,685
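
The summary columns of the feature lengths table can be reproduced from a list of lengths. As a minimal sketch, the rRNA row is convenient because it has only two features, so its minimum and maximum are the two lengths. The report does not say which median convention it uses. The rRNA row is consistent with the upper of the two middle values, which is what `statistics.median_high` returns, but that reading is an assumption.

```python
import statistics

# With a count of 2, the min and max of the rRNA row are the two lengths.
rrna_lengths = [979, 1599]

mean_length = statistics.mean(rrna_lengths)           # arithmetic mean
median_upper = statistics.median_high(rrna_lengths)   # upper middle value
median_midpoint = statistics.median(rrna_lengths)     # midpoint of the two

print("mean:", mean_length)
print("upper median:", median_upper)
print("midpoint median:", median_midpoint)
print("min:", min(rrna_lengths), "max:", max(rrna_lengths))
```

The mean (1,289) and the upper median (1,599) match the rRNA row. The midpoint median would instead equal the mean.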

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.00	1	1	50
Number of exons per transcript	12.18	9	1	255

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 15,217 coding genes, 14,828 genes had a protein with an alignment covering 50% or more of the query and 9,487 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
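
The two coverage definitions reduce to the same calculation applied to different sequences. The sketch below assumes 1-based, inclusive alignment coordinates and a single ungapped aligned span per sequence. The function name and the example lengths are hypothetical and are not taken from the pipeline.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percent of a sequence included in an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical example: a 400-residue annotated protein (query) aligned over
# residues 11-390 to residues 1-380 of a 500-residue Swiss-Prot protein (target).
query_coverage = coverage_percent(11, 390, 400)
target_coverage = coverage_percent(1, 380, 500)

print(f"query coverage:  {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")
```

In this example the query would count toward the 95% query-coverage threshold even though only 76% of the target is covered.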

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
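
Each point on a cumulative graph of this kind is a count of genes whose best alignment reaches a given coverage threshold. A minimal sketch of that counting step follows. The gene names and coverage values are made up for illustration, and keeping one best coverage per gene is an assumption about how the input is prepared.

```python
def genes_at_or_above(best_coverage_by_gene: dict, threshold: float) -> int:
    """Count genes whose best alignment coverage is at least `threshold` percent."""
    return sum(1 for coverage in best_coverage_by_gene.values() if coverage >= threshold)


# Hypothetical best query coverage (in percent) for four genes.
toy_coverages = {"geneA": 99.0, "geneB": 72.5, "geneC": 50.0, "geneD": 31.0}

# Evaluate the two thresholds quoted in the text: 50% and 95%.
cumulative_counts = {t: genes_at_or_above(toy_coverages, t) for t in (50, 95)}
print(cumulative_counts)
```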

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
F_peregrinus_v1.0	GCF_000337955.1	5.23%	17.91%
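
A masked percentage like those in the table is the share of bases flagged as repeats. The sketch below assumes soft-masked sequence, where masked bases are lowercase. The report does not state how the pipeline represents masking internally, and the toy sequences are invented.

```python
def masked_percent(sequences) -> float:
    """Percent of bases that are soft-masked (lowercase) across all sequences."""
    masked = sum(sum(1 for base in seq if base.islower()) for seq in sequences)
    total = sum(len(seq) for seq in sequences)
    return 100.0 * masked / total if total else 0.0


# Two invented scaffolds: 8 of 20 bases are lowercase.
toy_scaffolds = ["ACGTacgtACGT", "ttttGGGG"]
toy_masked = masked_percent(toy_scaffolds)
print(f"{toy_masked:.2f}% masked")
```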

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign (transcripts) or ProSplign (proteins), and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	11	11 (100.00%)	11 (100.00%)	99.63%	96.81%
Same-species Genbank	15	15 (100.00%)	11 (73.33%)	99.45%	96.81%
Aves known RefSeq (NM_/NR_)	8,951	7,608 (85.00%)	3,520 (39.33%)	91.21%	85.82%
Aves Genbank	42,012	28,604 (68.09%)	12,249 (29.16%)	90.94%	87.31%
Aves TSA	375,956	212,352 (56.48%)	9,416 (2.50%)	96.33%	95.36%
Aves EST	756,345	243,145 (32.15%)	151,468 (20.03%)	91.17%	96.56%
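
Both percentage columns in the table above use the number of sequences retrieved from Entrez as the denominator. A short sketch that recomputes them for two rows follows. The counts are copied from the table, while the helper name is illustrative.

```python
def percent_of(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as displayed in the table."""
    return round(100.0 * part / whole, 2)


# (retrieved from Entrez, aligned by Splign, passed to Gnomon)
transcript_rows = {
    "Aves known RefSeq (NM_/NR_)": (8_951, 7_608, 3_520),
    "Aves EST": (756_345, 243_145, 151_468),
}

recomputed = {
    source: (percent_of(aligned, retrieved), percent_of(passed, retrieved))
    for source, (retrieved, aligned, passed) in transcript_rows.items()
}
for source, (pct_aligned, pct_passed) in recomputed.items():
    print(f"{source}: {pct_aligned:.2f}% aligned, {pct_passed:.2f}% passed to Gnomon")
```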

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,477,227,523	83%	21%	229,872
SAMN01055120	9724766,12078637,23525076,9023104	Falco peregrinus reads sequenced by BGI (Falco, SAMN01055120)	79,571,270	77%	27%	107,586
SAMN01055121	9023104,23525076	Falco cherrug reads sequenced by BGI (Falco, SAMN01055121)	76,863,908	82%	25%	109,910
SAMN04525554	NA	liver (Falco sparverius, male, SAMN04525554)	17,456,911	82%	11%	62,880
SAMN04525555	NA	liver (Falco sparverius, female, SAMN04525555)	278,044,640	78%	10%	118,649
SAMN04525556	NA	liver (Falco sparverius, female, SAMN04525556)	39,385,559	83%	11%	89,442
SAMN04525557	NA	liver (Falco sparverius, male, SAMN04525557)	34,611,542	79%	9%	80,886
SAMN04525558	NA	liver (Falco sparverius, female, SAMN04525558)	38,265,091	80%	10%	78,194
SAMN04525559	NA	liver (Falco sparverius, female, SAMN04525559)	31,591,785	83%	10%	91,046
SAMN04525560	NA	liver (Falco sparverius, female, SAMN04525560)	39,434,485	84%	10%	88,128
SAMN04525561	NA	liver (Falco sparverius, female, SAMN04525561)	28,290,295	78%	10%	81,499
SAMN04525562	NA	liver (Falco sparverius, female, SAMN04525562)	34,294,777	79%	10%	83,507
SAMN04531380	NA	retina and cochlea (Falco tinnunculus, SAMN04531380)	100,655,654	81%	24%	180,180
SAMN04531384	NA	retina and cochlea (Falco subbuteo, SAMN04531384)	100,188,960	80%	25%	183,409
SAMN05831928	NA	Liver (Falco sparverius, 24 Hours, male, SAMN05831928)	78,042,186	84%	24%	154,068
SAMN06101813	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101813)	80,936,692	85%	25%	151,835
SAMN06101814	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101814)	87,855,668	84%	24%	158,560
SAMN06101815	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101815)	115,745,142	88%	24%	115,369
SAMN06101816	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101816)	84,219,458	86%	23%	156,411
SAMN06101817	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101817)	84,760,702	83%	22%	154,444
SAMN06101818	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101818)	84,100,882	84%	25%	157,440
SAMN06101819	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101819)	86,763,472	84%	24%	157,849
SAMN06101820	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101820)	84,265,484	85%	23%	155,727
SAMN06101821	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101821)	87,216,286	84%	23%	162,396
SAMN06101822	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101822)	100,556,822	84%	24%	164,292
SAMN06101823	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101823)	86,700,708	84%	24%	158,387
SAMN06101824	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101824)	95,780,238	81%	22%	161,064
SAMN06101825	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101825)	83,002,084	84%	24%	158,552
SAMN06101826	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101826)	89,253,326	84%	22%	158,364
SAMN06101827	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101827)	87,793,098	86%	24%	163,538
SAMN06101828	NA	Liver (Falco sparverius, 24 Hours, male, SAMN06101828)	86,581,814	85%	23%	161,181
SAMN06101829	NA	Liver (Falco sparverius, 24 Hours, female, SAMN06101829)	74,998,584	85%	25%	158,924
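
The "All" row aggregates the individual samples. As a quick consistency check, the per-sample read counts can be summed and compared with the aggregate. The numbers below are copied from the table in row order, and the variable names are illustrative.

```python
# Number of reads per sample, in the order the samples appear in the table.
reads_per_sample = [
    79_571_270, 76_863_908, 17_456_911, 278_044_640, 39_385_559,
    34_611_542, 38_265_091, 31_591_785, 39_434_485, 28_290_295,
    34_294_777, 100_655_654, 100_188_960, 78_042_186, 80_936_692,
    87_855_668, 115_745_142, 84_219_458, 84_760_702, 84_100_882,
    86_763_472, 84_265_484, 87_216_286, 100_556_822, 86_700_708,
    95_780_238, 83_002_084, 89_253_326, 87_793_098, 86_581_814,
    74_998_584,
]
reported_aggregate = 2_477_227_523  # "All" row

total_reads = sum(reads_per_sample)
print("samples:", len(reads_per_sample))
print("sum matches aggregate:", total_reads == reported_aggregate)
```

The 31 samples sum to the aggregate read count reported in the "All" row.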

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR522906	SRX160529	SRP013939	SAMN01055120	79,571,270	77%	27%
SRR522907	SRX160530	SRP018394	SAMN01055121	76,863,908	82%	25%
SRR3203231	SRX1612734	SRP071126	SAMN04531380	100,655,654	81%	24%
SRR3203238	SRX1612742	SRP071126	SAMN04531384	100,188,960	80%	25%
SRR3217264	SRX1624460	SRP071583	SAMN04525554	17,456,911	82%	11%
SRR3217266	SRX1624462	SRP071583	SAMN04525555	278,044,640	78%	10%
SRR3217265	SRX1624461	SRP071583	SAMN04525556	39,385,559	83%	11%
SRR3217261	SRX1624457	SRP071583	SAMN04525557	34,611,542	79%	9%
SRR3217262	SRX1624458	SRP071583	SAMN04525558	38,265,091	80%	10%
SRR3217263	SRX1624459	SRP071583	SAMN04525559	31,591,785	83%	10%
SRR3217258	SRX1624454	SRP071583	SAMN04525560	39,434,485	84%	10%
SRR3217259	SRX1624455	SRP071583	SAMN04525561	28,290,295	78%	10%
SRR3217260	SRX1624456	SRP071583	SAMN04525562	34,294,777	79%	10%
SRR5070564	SRX2390435	SRP094478	SAMN05831928	78,042,186	84%	24%
SRR5270429	SRX2574481	SRP094478	SAMN06101813	80,936,692	85%	25%
SRR5270428	SRX2574480	SRP094478	SAMN06101814	87,855,668	84%	24%
SRR5270427	SRX2574479	SRP094478	SAMN06101815	115,745,142	88%	24%
SRR5270426	SRX2574478	SRP094478	SAMN06101816	84,219,458	86%	23%
SRR5270425	SRX2574477	SRP094478	SAMN06101817	84,760,702	83%	22%
SRR5270424	SRX2574476	SRP094478	SAMN06101818	84,100,882	84%	25%
SRR5270423	SRX2574475	SRP094478	SAMN06101819	86,763,472	84%	24%
SRR5270422	SRX2574474	SRP094478	SAMN06101820	84,265,484	85%	23%
SRR5270421	SRX2574473	SRP094478	SAMN06101821	87,216,286	84%	23%
SRR5270420	SRX2574472	SRP094478	SAMN06101822	100,556,822	84%	24%
SRR5270419	SRX2574471	SRP094478	SAMN06101823	86,700,708	84%	24%
SRR5270418	SRX2574470	SRP094478	SAMN06101824	95,780,238	81%	22%
SRR5270417	SRX2574469	SRP094478	SAMN06101825	83,002,084	84%	24%
SRR5270416	SRX2574468	SRP094478	SAMN06101826	89,253,326	84%	22%
SRR5270415	SRX2574467	SRP094478	SAMN06101827	87,793,098	86%	24%
SRR5270414	SRX2574466	SRP094478	SAMN06101828	86,581,814	85%	23%
SRR5270413	SRX2574465	SRP094478	SAMN06101829	74,998,584	85%	25%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Xenopus known RefSeq (NP_)	19,651	18,026 (91.73%)	18,026 (91.73%)	69.87%	74.93%
Aves GenBank	14,838	13,997 (94.33%)	13,997 (94.33%)	74.25%	82.61%
Aves known RefSeq (NP_)	7,919	7,720 (97.49%)	7,720 (97.49%)	77.71%	81.61%
Columba livia high-quality model RefSeq (XP_)	8,292	8,232 (99.28%)	8,232 (99.28%)	78.44%	84.32%
Gallus gallus high-quality model RefSeq (XP_)	9,468	9,228 (97.47%)	9,228 (97.47%)	77.00%	78.63%
Parus major high-quality model RefSeq (XP_)	12,103	11,930 (98.57%)	11,930 (98.57%)	77.86%	81.35%
Homo sapiens known RefSeq (NP_)	52,492	45,763 (87.18%)	45,763 (87.18%)	69.14%	71.81%

Comparison of the current and previous annotations

The annotation produced for this release (102) was compared to the annotation in the previous release (101) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.
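
To make the idea of scoring by exon overlap and exon boundary matches concrete, here is a minimal sketch. It is not the pipeline's actual scoring formula, which this report does not give. The functions, the Jaccard-style ratio, and the coordinates are all hypothetical, and exons are assumed to be 1-based inclusive intervals that do not overlap within a feature.

```python
def exon_overlap_bases(exons_a, exons_b) -> int:
    """Bases shared by two exon sets given as (start, end) inclusive pairs."""
    shared = 0
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap > 0:
                shared += overlap
    return shared


def shared_boundaries(exons_a, exons_b) -> int:
    """Number of exon starts plus exon ends that are identical in both sets."""
    starts = {s for s, _ in exons_a} & {s for s, _ in exons_b}
    ends = {e for _, e in exons_a} & {e for _, e in exons_b}
    return len(starts) + len(ends)


def exon_length(exons) -> int:
    return sum(end - start + 1 for start, end in exons)


# Hypothetical current and previous models of the same gene.
current = [(100, 200), (300, 400), (500, 650)]
previous = [(100, 200), (300, 380), (500, 600)]

overlap = exon_overlap_bases(current, previous)
union = exon_length(current) + exon_length(previous) - overlap
print("shared exon bases:", overlap)
print("overlap / union:", round(overlap / union, 3))
print("matching boundaries:", shared_boundaries(current, previous))
```

In this toy case the previous model lies entirely within the current one, so the overlap equals the previous model's exon length, and four of the six boundaries agree.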

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	F_peregrinus_v1.0 (Current) to F_peregrinus_v1.0 (Previous)
Identical	2%
Minor changes	58%
Major changes	18%
New	20%
Deprecated	2%
Other	2%
Download the report	tabular, Genome Workbench
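
The percentages in the table can be checked with simple arithmetic. In the sketch below, treating Deprecated separately from the other categories is an assumption, based on the observation that the remaining five categories already account for 100% of the current gene set.

```python
# Percentages copied from the comparison table.
changes = {
    "Identical": 2,
    "Minor changes": 58,
    "Major changes": 18,
    "New": 20,
    "Deprecated": 2,
    "Other": 2,
}

# Assumption: Deprecated is reported on top of the categories that
# partition the current gene set.
current_total = sum(pct for name, pct in changes.items() if name != "Deprecated")
print("categories excluding Deprecated sum to", current_total, "%")
```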

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
