NCBI Papio anubis Annotation Release 103

The RefSeq genome records for Papio anubis were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Similarity of current and previous assembly: The similarity of the current and previous assembly
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Papio anubis Annotation Release 103.

Annotation release ID: 103
Date of Entrez queries for transcripts and proteins: Jul 15 2017
Date of submission of annotation to the public databases: Jul 26 2017
Software version: 7.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Panu_3.0	GCF_000264685.3	Human Genome Sequencing Center	04-20-2017	Reference	22 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	Panu_3.0
Genes and pseudogenes	35,731
protein-coding	21,300
non-coding	8,433
pseudogenes	5,998
genes with variants	13,693
mRNAs	66,646
fully-supported	65,519
with > 5% ab initio	422
partial	331
with filled gap(s)	7
known RefSeq (NM_)	488
model RefSeq (XM_)	66,158
Other RNAs	16,521
fully-supported	16,105
with > 5% ab initio	0
partial	1
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	16,105
CDSs	66,857
fully-supported	65,519
with > 5% ab initio	581
partial	337
with major correction(s)	1,634
known RefSeq (NP_)	488
model RefSeq (XP_)	66,158
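
The sub-rows of the table above can be cross-checked against their parent rows. The following is a minimal sketch, assuming that the three gene categories are meant to sum to "Genes and pseudogenes" and that known plus model mRNAs sum to "mRNAs"; that reading is inferred from the table layout, and the dictionary keys are illustrative names, not pipeline identifiers.

```python
# Counts copied from the Feature counts table for Panu_3.0.
counts = {
    "genes_and_pseudogenes": 35_731,
    "protein_coding": 21_300,
    "non_coding": 8_433,
    "pseudogenes": 5_998,
    "mrnas": 66_646,
    "mrna_known": 488,    # known RefSeq (NM_)
    "mrna_model": 66_158, # model RefSeq (XM_)
}

def check_feature_counts(c):
    """Return named consistency checks; True means the sub-rows add up to the parent row."""
    gene_sum = c["protein_coding"] + c["non_coding"] + c["pseudogenes"]
    mrna_sum = c["mrna_known"] + c["mrna_model"]
    return {
        "gene_total": gene_sum == c["genes_and_pseudogenes"],
        "mrna_total": mrna_sum == c["mrnas"],
    }

results = check_feature_counts(counts)
print(results)
```

Both checks hold for the figures in this release.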

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	29,733	46,988	15,543	70	2,500,808
All transcripts	83,167	3,712	2,995	38	104,371
mRNA	66,646	3,895	3,174	213	104,371
misc_RNA	4,252	3,401	2,755	138	21,026
tRNA	416	74	73	70	85
lncRNA	11,853	2,919	2,040	38	22,653
Single-exon transcripts	2,120	2,653	2,127	213	21,071
coding transcripts (NM_/XM_ )	2,116	2,654	2,127	213	21,071
non-coding transcripts (NR_/XR_ )	4	2,039	2,942	876	2,976
CDSs	66,646	2,026	1,494	75	103,125
Exons	290,121	448	149	1	23,286
in coding transcripts (NM_/XM_ )	254,101	392	144	1	23,286
in non-coding transcripts (NR_/XR_ )	59,448	596	163	2	19,286
Introns	250,990	7,240	1,829	25	1,178,500
in coding transcripts (NM_/XM_ )	226,106	7,040	1,769	25	1,178,500
in non-coding transcripts (NR_/XR_ )	47,556	7,332	2,015	30	661,873

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.84	1	1	50
Number of exons per transcript	11.74	9	1	314

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 21,089 coding genes, 20,362 genes had a protein with an alignment covering 50% or more of the query and 17,510 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
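
The two coverage definitions can be illustrated with a short sketch. It assumes a single ungapped alignment described by 1-based inclusive coordinates; the function name and the example coordinates are hypothetical, and a real BLASTP hit with gaps or multiple segments would need its aligned residues counted rather than its span.

```python
def coverage(aligned_start, aligned_end, seq_length):
    """Percent of a sequence included in an alignment (1-based, inclusive coordinates)."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    aligned = aligned_end - aligned_start + 1
    return 100.0 * aligned / seq_length

# Hypothetical hit: residues 11-200 of a 200-residue annotated protein (query)
# aligned to residues 1-190 of a 400-residue Swiss-Prot protein (target).
query_cov = coverage(11, 200, 200)   # share of the annotated protein in the alignment
target_cov = coverage(1, 190, 400)   # share of the Swiss-Prot protein in the alignment
print(query_cov, target_cov)
```

In this made-up case the query is almost fully covered while the target is covered by less than half, which is why the two measures are reported separately.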

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with RepeatMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Panu_3.0	GCF_000264685.3	52.39%	38.13%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	505	502 (99.41%)	486 (96.24%)	99.35%	98.95%
Same-species Genbank	519	519 (100.00%)	379 (73.03%)	98.45%	97.81%
Same-species EST	145,582	139,989 (96.16%)	133,809 (91.91%)	99.49%	99.62%
Homo sapiens known RefSeq (NM_/NR_)	63,799	62,661 (98.22%)	39,371 (61.71%)	95.39%	98.09%
Homo sapiens Genbank	287,825	240,103 (83.42%)	145,976 (50.72%)	94.52%	93.20%
Homo sapiens EST	8,647,226	7,273,785 (84.12%)	6,414,928 (74.18%)	94.60%	96.05%
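
The percentages in the table above appear to be computed relative to the number of sequences retrieved from Entrez. The sketch below reproduces them for the first two rows under that assumption; `pct` is an illustrative helper, not part of the pipeline.

```python
def pct(part, whole):
    """Format part/whole as a percentage with two decimals, as in the table."""
    return f"{100 * part / whole:.2f}%"

# (source, retrieved from Entrez, aligned by Splign, passed to Gnomon)
rows = [
    ("Same-species known RefSeq (NM_/NR_)", 505, 502, 486),
    ("Same-species Genbank", 519, 519, 379),
]

report = {
    name: (pct(aligned, retrieved), pct(passed, retrieved))
    for name, retrieved, aligned, passed in rows
}
for name, (aligned_pct, passed_pct) in report.items():
    print(name, aligned_pct, passed_pct)
```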

RefSeq transcript alignment quality report

The known RefSeq transcripts (NM_ and NR_ accessions) are a set of high-quality transcripts maintained by the RefSeq group at NCBI. Alignment statistics for this group of transcripts, such as the number and percent of sequences not aligning at all, the percent of best alignments split between multiple scaffolds, and the percent of alignments not covering the full CDS, are indicative of the genome quality and are provided below.

	Panu_3.0 Primary Assembly
Number of sequences retrieved from Entrez	505
Number (%) of sequences not aligning	3 (0.59%)
Number (%) of sequences with multiple best alignments (split genes)	3 (0.60%)
Number (%) of sequences with CDS coverage < 95%	22 (4.38%)
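
The percentages in this table are consistent with two different denominators: all retrieved sequences for the "not aligning" row, and only the aligned sequences for the other two rows. This is an inference from the numbers, not something the report states; the sketch below shows the arithmetic under that assumption.

```python
retrieved = 505
not_aligning = 3
aligned = retrieved - not_aligning  # 502, the Splign-aligned count in the transcript table

def pct(part, whole):
    """Format part/whole as a percentage with two decimals."""
    return f"{100 * part / whole:.2f}%"

not_aligning_pct = pct(not_aligning, retrieved)  # relative to all retrieved sequences
split_pct = pct(3, aligned)                      # split genes, relative to aligned sequences
low_cds_pct = pct(22, aligned)                   # CDS coverage < 95%, relative to aligned sequences
print(not_aligning_pct, split_pct, low_cds_pct)
```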

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	12,151,875,970	80%	14%	343,061
SAMN00000303	NA	placental tissue (Papio anubis, SAMN00000303)	1,039,224	56%	36%	40,737
SAMN02045697	23203872	Generic sample from Baboon (Papio anubis, SAMN02045697)	1,982,625,205	82%	19%	300,714
SAMN02045698	23203872	Generic sample from Baboon (Papio anubis, SAMN02045698)	1,905,235,297	87%	18%	314,529
SAMN02045699	23203872	Generic sample from Baboon (Papio anubis, SAMN02045699)	151,524,634	74%	9%	189,145
SAMN02401340	NA	Spleen (Papio anubis, 6 years 2 months, female, SAMN02401340)	154,998,440	80%	9%	198,981
SAMN02401341	NA	Liver (Papio anubis, 6 years 2 months, female, SAMN02401341)	113,370,678	83%	16%	170,447
SAMN02401342	NA	Lymph node (Papio anubis, 6 years 2 months, female, SAMN02401342)	108,724,870	80%	8%	183,808
SAMN02401343	NA	Colon (Papio anubis, 6 years 2 months, female, SAMN02401343)	113,051,498	76%	7%	179,966
SAMN02401344	NA	Heart (Papio anubis, 6 years 2 months, female, SAMN02401344)	137,880,750	81%	13%	176,053
SAMN02401345	NA	Kidney (Papio anubis, 6 years 2 months, female, SAMN02401345)	108,692,774	81%	11%	184,705
SAMN02401346	NA	Bone Marrow (Papio anubis, 6 years 2 months, female, SAMN02401346)	96,352,712	69%	8%	163,784
SAMN02401347	NA	Brain Cerebellum (Papio anubis, 6 years 2 months, female, SAMN02401347)	180,557,678	78%	7%	196,104
SAMN02401348	NA	Brain Frontal Cortex (Papio anubis, 6 years 2 months, female, SAMN02401348)	177,250,086	79%	9%	206,136
SAMN02427293	NA	Brain Temporal Lobe (Papio anubis, 6 years 2 months, female, SAMN02427293)	206,459,182	80%	9%	211,507
SAMN02427294	NA	Brain Pituitary (Papio anubis, 6 years 2 months, female, SAMN02427294)	112,305,812	78%	8%	194,475
SAMN02427295	NA	Lung (Papio anubis, 6 years 2 months, female, SAMN02427295)	161,548,190	78%	8%	196,798
SAMN02427296	NA	Skeletal Muscle (Papio anubis, 6 years 2 months, female, SAMN02427296)	195,288,700	87%	21%	171,797
SAMN02427297	NA	Thymus (Papio anubis, 6 years 2 months, female, SAMN02427297)	454,474,272	80%	7%	229,253
SAMN03085068	25392405	Whole blood (Papio anubis, male and female, SAMN03085068)	175,734,602	79%	3%	117,552
SAMN03282292	25392405	Bone Marrow (Papio anubis, male and female, SAMN03282292)	96,352,712	69%	8%	163,784
SAMN03282293	25392405	Brain Cerebellum (Papio anubis, male and female, SAMN03282293)	180,557,678	78%	7%	196,104
SAMN03282294	25392405	Brain Frontal Cortex (Papio anubis, male and female, SAMN03282294)	177,250,086	79%	9%	206,136
SAMN03282295	25392405	Brain Pituitary (Papio anubis, male and female, SAMN03282295)	112,305,812	78%	8%	194,475
SAMN03282296	25392405	Brain Temporal Lobe (Papio anubis, female, SAMN03282296)	206,459,182	80%	9%	211,507
SAMN03282297	25392405	Colon (Papio anubis, male and female, SAMN03282297)	113,051,498	76%	7%	179,966
SAMN03282298	25392405	Heart (Papio anubis, male and female, SAMN03282298)	137,880,750	81%	13%	176,053
SAMN03282299	25392405	Kidney (Papio anubis, male and female, SAMN03282299)	108,692,774	81%	11%	184,705
SAMN03282300	25392405	Liver (Papio anubis, male and female, SAMN03282300)	113,370,678	83%	16%	170,447
SAMN03282301	25392405	Lung (Papio anubis, male and female, SAMN03282301)	161,548,190	78%	8%	196,798
SAMN03282302	25392405	Lymph Node (Papio anubis, male and female, SAMN03282302)	108,724,870	80%	8%	183,808
SAMN03282303	25392405	Skeletal Muscle (Papio anubis, male and female, SAMN03282303)	195,288,700	87%	21%	171,797
SAMN03282304	25392405	Spleen (Papio anubis, male and female, SAMN03282304)	154,998,440	80%	9%	198,981
SAMN03282305	25392405	Thymus (Papio anubis, male and female, SAMN03282305)	454,474,272	80%	7%	229,253
SAMN05440900	NA	Peripheral blood mononuclear cells (Papio anubis, 2-6 years old, pooled male and female, SAMN05440900)	872,659,480	71%	11%	208,999
SAMN05440901	NA	Peripheral blood mononuclear cells (Papio anubis, 2-6 years old, pooled male and female, SAMN05440901)	856,073,226	79%	13%	211,403
SAMN05440902	NA	Mesenteric lymph nodes (Papio anubis, 2-6 years old, pooled male and female, SAMN05440902)	478,598,574	80%	19%	224,922
SAMN05440903	NA	Mesenteric lymph nodes (Papio anubis, 2-6 years old, pooled male and female, SAMN05440903)	473,351,240	80%	17%	220,415
SAMN05440922	NA	Spleen (Papio anubis, 2-6 years old, pooled male and female, SAMN05440922)	125,347,268	71%	8%	162,660
SAMN05440923	NA	Spleen (Papio anubis, 2-6 years old, pooled male and female, SAMN05440923)	487,775,936	70%	9%	204,954

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR001693	SRX000433	SRP000221	SAMN00000303	315,491	59%	34%
SRR001694	SRX000433	SRP000221	SAMN00000303	341,165	56%	37%
SRR001695	SRX000433	SRP000221	SAMN00000303	382,568	54%	36%
SRR832903	SRX270611	SRP021223	SAMN02045697	1,911,147,598	81%	19%
SRR832906	SRX270615	SRP021223	SAMN02045697	71,477,607	83%	19%
SRR832905	SRX270614	SRP021223	SAMN02045698	1,837,471,794	87%	19%
SRR832907	SRX270616	SRP021223	SAMN02045698	67,763,503	89%	17%
SRR832912	SRX270620	SRP021223	SAMN02045699	151,524,634	74%	9%
SRR1041117	SRX385516	SRP033402	SAMN02401340	154,998,440	80%	9%
SRR1041115	SRX385514	SRP033402	SAMN02401341	113,370,678	83%	16%
SRR1041116	SRX385515	SRP033402	SAMN02401342	108,724,870	80%	8%
SRR1041112	SRX385511	SRP033402	SAMN02401343	113,051,498	76%	7%
SRR1041113	SRX385512	SRP033402	SAMN02401344	137,880,750	81%	13%
SRR1041114	SRX385513	SRP033402	SAMN02401345	108,692,774	81%	11%
SRR1041109	SRX385508	SRP033402	SAMN02401346	96,352,712	69%	8%
SRR1041110	SRX385509	SRP033402	SAMN02401347	180,557,678	78%	7%
SRR1041111	SRX385510	SRP033402	SAMN02401348	177,250,086	79%	9%
SRR1045085	SRX388834	SRP033402	SAMN02427293	206,459,182	80%	9%
SRR1045084	SRX388833	SRP033402	SAMN02427294	112,305,812	78%	8%
SRR1045086	SRX388835	SRP033402	SAMN02427295	161,548,190	78%	8%
SRR1045087	SRX388836	SRP033402	SAMN02427296	117,041,732	88%	21%
SRR1045088	SRX388836	SRP033402	SAMN02427296	78,246,968	85%	20%
SRR1045089	SRX388837	SRP033402	SAMN02427297	454,474,272	80%	7%
SRR1602575	SRX724890	SRP048678	SAMN03085068	175,734,602	79%	3%
SRR1758900	SRX843131	SRP051959	SAMN03282292	96,352,712	69%	8%
SRR1758901	SRX843132	SRP051959	SAMN03282293	180,557,678	78%	7%
SRR1758902	SRX843133	SRP051959	SAMN03282294	177,250,086	79%	9%
SRR1758903	SRX843134	SRP051959	SAMN03282295	112,305,812	78%	8%
SRR1758904	SRX843135	SRP051959	SAMN03282296	206,459,182	80%	9%
SRR1758905	SRX843136	SRP051959	SAMN03282297	113,051,498	76%	7%
SRR1758906	SRX843137	SRP051959	SAMN03282298	137,880,750	81%	13%
SRR1758907	SRX843138	SRP051959	SAMN03282299	108,692,774	81%	11%
SRR1758908	SRX843139	SRP051959	SAMN03282300	113,370,678	83%	16%
SRR1758909	SRX843140	SRP051959	SAMN03282301	161,548,190	78%	8%
SRR1758910	SRX843141	SRP051959	SAMN03282302	108,724,870	80%	8%
SRR1758911	SRX843142	SRP051959	SAMN03282303	117,041,732	88%	21%
SRR1758912	SRX843143	SRP051959	SAMN03282303	78,246,968	85%	20%
SRR1758913	SRX843144	SRP051959	SAMN03282304	154,998,440	80%	9%
SRR1758914	SRX843145	SRP051959	SAMN03282305	454,474,272	80%	7%
SRR4015317	SRX2009919	SRP081155	SAMN05440900	217,133,070	79%	14%
SRR4015318	SRX2009920	SRP081155	SAMN05440900	188,862,094	67%	8%
SRR4015326	SRX2009928	SRP081155	SAMN05440900	218,734,066	78%	12%
SRR4015330	SRX2009932	SRP081155	SAMN05440900	247,930,250	62%	11%
SRR4015319	SRX2009921	SRP081155	SAMN05440901	203,616,318	78%	12%
SRR4015321	SRX2009923	SRP081155	SAMN05440901	218,626,584	78%	11%
SRR4015327	SRX2009929	SRP081155	SAMN05440901	257,754,718	77%	12%
SRR4015329	SRX2009931	SRP081155	SAMN05440901	176,075,606	81%	16%
SRR4015333	SRX2009935	SRP081155	SAMN05440902	124,474,128	86%	22%
SRR4015340	SRX2009942	SRP081155	SAMN05440902	126,363,808	85%	21%
SRR4015341	SRX2009943	SRP081155	SAMN05440902	121,897,372	85%	21%
SRR4015342	SRX2009944	SRP081155	SAMN05440902	105,863,266	63%	7%
SRR4015331	SRX2009933	SRP081155	SAMN05440903	109,720,982	85%	23%
SRR4015332	SRX2009934	SRP081155	SAMN05440903	131,474,672	80%	14%
SRR4015344	SRX2009946	SRP081155	SAMN05440903	115,698,860	72%	6%
SRR4015345	SRX2009947	SRP081155	SAMN05440903	116,456,726	85%	24%
SRR4015349	SRX2009951	SRP081155	SAMN05440922	125,347,268	71%	8%
SRR4015343	SRX2009945	SRP081155	SAMN05440923	120,831,196	69%	3%
SRR4015346	SRX2009948	SRP081155	SAMN05440923	82,860,448	48%	5%
SRR4015347	SRX2009949	SRP081155	SAMN05440923	170,319,832	80%	15%
SRR4015348	SRX2009950	SRP081155	SAMN05440923	113,764,460	74%	5%
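
The by-run and by-sample tables can be reconciled by summing read counts over the runs that belong to each sample. The sketch below does this for two samples, using rows copied from the by-run table; the function name is illustrative.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("SRR001693", "SAMN00000303", 315_491),
    ("SRR001694", "SAMN00000303", 341_165),
    ("SRR001695", "SAMN00000303", 382_568),
    ("SRR832903", "SAMN02045697", 1_911_147_598),
    ("SRR832906", "SAMN02045697", 71_477_607),
]

def reads_per_sample(rows):
    """Sum the number of reads over all runs of each sample."""
    totals = defaultdict(int)
    for _run, sample, n_reads in rows:
        totals[sample] += n_reads
    return dict(totals)

totals = reads_per_sample(runs)
print(totals)
```

The sums match the "Number of reads" reported for these two samples in the by-sample table.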

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Primates GenBank	19,532	18,937 (96.95%)	18,937 (96.95%)	83.74%	92.80%
Primates known RefSeq (NP_)	13,947	13,772 (98.75%)	13,772 (98.75%)	87.26%	93.19%
Same-species GenBank	404	401 (99.26%)	401 (99.26%)	83.01%	96.17%
Same-species known RefSeq (NP_)	505	499 (98.81%)	499 (98.81%)	86.98%	91.57%
Homo sapiens GenBank	128,820	121,841 (94.58%)	121,841 (94.58%)	84.70%	88.84%
Homo sapiens known RefSeq (NP_)	48,880	48,235 (98.68%)	48,235 (98.68%)	87.23%	91.74%

Assembly-assembly alignments of current to previous assembly

When the assembly changes between two rounds of annotation, genes in the current and the previous annotation are mapped to each other using the genomic alignments of the current assembly to the previous assembly so that gene identifiers can be preserved. The success of the remapping depends largely on how well the two assembly versions align to each other.

Below are the percent coverage of one assembly by the other and the average percent identity of the alignments. The 'First pass' alignments are reciprocal best hits, while the 'Total' alignments also include 'Second pass' or non-reciprocal best alignments. For more information about the assembly-assembly alignment process, please visit the NCBI Genome Remapping Service page.

	First Pass	Total
Panu_3.0 (Current) Coverage	96.01%	96.13%
Panu_2.0 (Previous) Coverage	97.48%	97.49%
Percent Identity	99.96%	99.95%

Comparison of the current and previous annotations

The annotation produced for this release (103) was compared to the annotation in the previous release (102) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.
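
The report does not give the scoring formulas. Purely as an illustration of the two ingredients named above, the sketch below computes a Jaccard-style overlap of exon sequence and the fraction of shared exon boundaries for one pair of features. Both functions, their names, and the example coordinates are assumptions made for this example and should not be read as the pipeline's actual method.

```python
def exon_overlap_score(exons_a, exons_b):
    """Jaccard-style overlap of two exon sets given as (start, end) inclusive intervals."""
    bases_a = {p for s, e in exons_a for p in range(s, e + 1)}
    bases_b = {p for s, e in exons_b for p in range(s, e + 1)}
    union = bases_a | bases_b
    return len(bases_a & bases_b) / len(union) if union else 0.0

def shared_boundaries(exons_a, exons_b):
    """Fraction of exon boundary positions in A that also occur in B."""
    bounds_a = {b for exon in exons_a for b in exon}
    bounds_b = {b for exon in exons_b for b in exon}
    return len(bounds_a & bounds_b) / len(bounds_a) if bounds_a else 0.0

# Hypothetical current and previous transcripts on the same coordinate system.
current = [(100, 200), (300, 400)]
previous = [(100, 200), (300, 380)]

overlap = exon_overlap_score(current, previous)
boundaries = shared_boundaries(current, previous)
print(round(overlap, 3), boundaries)
```

A pair like this one, with high sequence overlap but one shifted exon end, is the kind of case that a boundary-aware score can separate from an exact match.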

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	Panu_3.0 (Current) to Panu_2.0 (Previous)
Identical	12%
Minor changes	57%
Major changes	20%
New	10%
Deprecated	21%
Other	1%
Download the report	tabular, Genome Workbench
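
One way to read the table, offered as an interpretation rather than a statement from the report: deprecated genes exist only in the previous annotation, so the remaining five categories should account for all genes in the current annotation. The sketch below checks that the percentages behave this way.

```python
# Percent of current genes per category, copied from the table above.
changes = {
    "Identical": 12,
    "Minor changes": 57,
    "Major changes": 20,
    "New": 10,
    "Deprecated": 21,
    "Other": 1,
}

# Assumption: every category except "Deprecated" describes a gene present in the current release.
current_total = sum(v for k, v in changes.items() if k != "Deprecated")
print(current_total)
```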

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
