NCBI Lotus japonicus Annotation Release GCF_012489685.1-RS_2023_06

The genome sequence records for Lotus japonicus RefSeq assembly GCF_012489685.1 (LjGifu_v1.2) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_012489685.1-RS_2023_06".

Date of Entrez queries for transcripts and proteins: Jun 21 2023
Date of submission of annotation to the public databases: Jun 27 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
LjGifu_v1.2	GCF_012489685.1	Aarhus University	11-15-2022	Reference	6 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	LjGifu_v1.2
Genes and pseudogenes	40,429
  protein-coding	32,752
  non-coding	5,726
  Transcribed pseudogenes	64
  Non-transcribed pseudogenes	1,887
  genes with variants	7,931
  Immunoglobulin/T-cell receptor gene segments	0
  other	0
mRNAs	44,420
  fully-supported	37,126
  with > 5% ab initio	6,589
  partial	147
  with filled gap(s)	1
  known RefSeq (NM_)	0
  model RefSeq (XM_)	44,420
non-coding RNAs	13,568
  fully-supported	11,421
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	13,055
pseudo transcripts	64
  fully-supported	59
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	64
CDSs	44,420
  fully-supported	37,126
  with > 5% ab initio	6,701
  partial	147
  with major correction(s)	192
  known RefSeq (NP_)	0
  model RefSeq (XP_)	44,420
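
As a quick consistency check on the gene counts above, the sketch below sums the four gene-level categories and compares the result with the reported total. Treating these four rows as a partition of "Genes and pseudogenes" is an assumption made here for illustration; the report does not state it explicitly, although the numbers agree.

```python
# Gene-level counts copied from the Feature counts table.
gene_counts = {
    "protein-coding": 32_752,
    "non-coding": 5_726,
    "transcribed pseudogenes": 64,
    "non-transcribed pseudogenes": 1_887,
}
reported_total = 40_429  # "Genes and pseudogenes" row

# Assumption: the four categories are mutually exclusive and exhaustive.
category_sum = sum(gene_counts.values())
print(f"sum of categories: {category_sum:,}")
print(f"matches reported total: {category_sum == reported_total}")
```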

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	38,478	3,719	2,700	64	107,251
All transcripts	57,988	1,959	1,644	64	21,457
  mRNA	44,420	1,959	1,658	147	18,702
  misc_RNA	3,050	2,727	2,121	253	19,736
  tRNA	513	74	73	70	84
  lncRNA	8,371	2,122	1,704	144	21,457
  snoRNA	863	106	107	64	228
  snRNA	129	144	145	98	201
  rRNA	642	588	119	117	3,410
Single-exon transcripts	6,491	1,254	1,070	147	8,689
  coding transcripts (NM_/XM_)	6,490	1,254	1,070	147	8,689
  non-coding transcripts (NR_/XR_)	1	610	610	610	610
CDSs	44,420	1,384	1,149	102	16,260
Exons	209,596	359	181	1	18,413
  in coding transcripts (NM_/XM_)	184,814	346	175	1	16,896
  in non-coding transcripts (NR_/XR_)	31,232	406	203	10	18,413
Introns	163,986	563	219	30	86,094
  in coding transcripts (NM_/XM_)	147,157	545	212	30	86,094
  in non-coding transcripts (NR_/XR_)	23,133	664	260	30	71,897

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.51	1	1	50
Number of exons per transcript	6	5	1	78
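
The mean number of transcripts per gene can be reproduced from the Feature lengths table, as in the minimal sketch below. It assumes the mean is simply the "All transcripts" count divided by the "Genes" count (both exclude pseudogenes); the pipeline's exact computation is not described in this report.

```python
# Counts copied from the Feature lengths table (pseudogenes excluded).
gene_count = 38_478
transcript_count = 57_988

# Assumption: mean transcripts per gene = all transcripts / genes.
mean_transcripts_per_gene = transcript_count / gene_count
print(round(mean_transcripts_per_gene, 2))
```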

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein per gene and the fabales_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation.

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 32,752 coding genes, 27,280 genes had a protein with an alignment covering 50% or more of the query and 11,492 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
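
The two coverage definitions can be illustrated with a small sketch. The function name, the coordinates, and the protein lengths below are hypothetical, and the calculation assumes a single contiguous aligned span with 1-based inclusive coordinates; the pipeline may handle gaps or multiple alignment segments differently.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: a 400-residue annotated protein (query) aligned
# over residues 21-380 to residues 1-360 of a 500-residue target protein.
query_coverage = coverage_percent(21, 380, 400)
target_coverage = coverage_percent(1, 360, 500)
print(f"query coverage: {query_coverage:.1f}%")
print(f"target coverage: {target_coverage:.1f}%")
```

In this made-up case the alignment would count toward the 50% query-coverage threshold but not the 95% one.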

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
LjGifu_v1.2	GCF_012489685.1	40.30%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	12,225	12,028 (98.39%)	11,058 (90.45%)	99.69%	99.19%
Same-species EST	242,403	234,066 (96.56%)	223,808 (92.33%)	99.51%	99.40%
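
The percentages in the table above appear to be taken relative to the number of sequences retrieved from Entrez. The sketch below parses the "count (percent)" cells and recomputes each percentage under that assumption; the helper name parse_count_cell is introduced here for illustration only.

```python
import re


def parse_count_cell(cell: str) -> tuple[int, float]:
    """Split a cell such as '12,028 (98.39%)' into (count, reported_percent)."""
    match = re.fullmatch(r"([\d,]+) \(([\d.]+)%\)", cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# Rows copied from the transcript alignment table:
# (retrieved, aligned by Splign, passed to Gnomon)
rows = {
    "Same-species GenBank": ("12,225", "12,028 (98.39%)", "11,058 (90.45%)"),
    "Same-species EST": ("242,403", "234,066 (96.56%)", "223,808 (92.33%)"),
}

# Maps (source, stage) to (recomputed percent, reported percent).
recomputed = {}
for source, (retrieved_cell, aligned_cell, passed_cell) in rows.items():
    retrieved = int(retrieved_cell.replace(",", ""))
    for stage, cell in (("aligned", aligned_cell), ("passed", passed_cell)):
        count, reported = parse_count_cell(cell)
        # Assumption: the denominator is the number of retrieved sequences.
        recomputed[(source, stage)] = (round(100 * count / retrieved, 2), reported)

for key, (calculated, reported) in recomputed.items():
    print(key, calculated, reported)
```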

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	5,529,236,872	79%	27%	188,753
SAMN07455372	NA	petal (Lotus japonicus, SAMN07455372)	917,203,062	82%	29%	172,138
SAMN14527160	NA	root, shoot (Lotus japonicus, 3 weeks, SAMN14527160)	156,722,002	65%	23%	143,512
SAMN14527161	NA	root, shoot (Lotus japonicus, 3 weeks, SAMN14527161)	170,485,345	63%	24%	148,891
SAMN15376480	NA	Root (Lotus japonicus, 14 days, SAMN15376480)	126,917,514	79%	25%	136,493
SAMN15376481	NA	Root (Lotus japonicus, 14 days, SAMN15376481)	124,842,320	57%	25%	134,220
SAMN15376482	NA	Root (Lotus japonicus, 14 days, SAMN15376482)	128,634,778	87%	28%	146,883
SAMN15376483	NA	Root (Lotus japonicus, 14 days, SAMN15376483)	134,210,608	85%	27%	147,416
SAMN15376484	NA	Root (Lotus japonicus, 14 days, SAMN15376484)	127,107,670	84%	29%	136,137
SAMN15376485	NA	Root (Lotus japonicus, 14 days, SAMN15376485)	132,169,568	76%	25%	132,997
SAMN15376486	NA	Root (Lotus japonicus, 14 days, SAMN15376486)	133,308,028	85%	31%	146,006
SAMN20510973	NA	root (Lotus japonicus, SAMN20510973)	136,919,374	82%	25%	153,743
SAMN20510974	NA	root (Lotus japonicus, SAMN20510974)	124,518,650	85%	24%	151,221
SAMN20510975	NA	root (Lotus japonicus, SAMN20510975)	117,084,920	87%	23%	151,278
SAMN20510976	NA	root (Lotus japonicus, SAMN20510976)	115,967,720	76%	24%	156,087
SAMN20510977	NA	root (Lotus japonicus, SAMN20510977)	107,712,430	81%	24%	148,779
SAMN20510978	NA	root (Lotus japonicus, SAMN20510978)	113,400,600	81%	25%	153,738
SAMN20510979	NA	root (Lotus japonicus, SAMN20510979)	122,720,944	78%	24%	151,498
SAMN20510980	NA	root (Lotus japonicus, SAMN20510980)	115,035,768	78%	25%	150,960
SAMN20510981	NA	root (Lotus japonicus, SAMN20510981)	190,967,036	76%	25%	160,326
SAMN24563794	NA	root (Lotus japonicus, SAMN24563794)	37,544,650	93%	35%	142,405
SAMN24563796	NA	root (Lotus japonicus, SAMN24563796)	43,557,340	92%	34%	141,780
SAMN24563798	NA	root (Lotus japonicus, SAMN24563798)	54,659,060	94%	35%	143,018
SAMN24563800	NA	root (Lotus japonicus, SAMN24563800)	45,272,440	91%	34%	141,156
SAMN28688648	NA	roots stem leaf flower seed pod (Lotus japonicus, SAMN28688648)	47,425,965	77%	18%	143,352
SAMN28689148	NA	root stem leaf flower seed pod (Lotus japonicus, SAMN28689148)	108,492,524	92%	21%	158,690
SAMN31014967	NA	seedling (Lotus japonicus, 3-week old, SAMN31014967)	1,106,251,684	61%	21%	170,323
SAMN33272609	36945518	root (Lotus japonicus, SAMN33272609)	60,413,838	96%	36%	150,939
SAMN33272610	36945518	root (Lotus japonicus, SAMN33272610)	60,237,260	95%	36%	151,277
SAMN33272611	36945518	root (Lotus japonicus, SAMN33272611)	60,230,248	95%	35%	150,791
SAMN33272612	36945518	root (Lotus japonicus, SAMN33272612)	60,433,400	96%	36%	151,608
SAMN33272613	36945518	root (Lotus japonicus, SAMN33272613)	60,460,778	95%	36%	150,746
SAMN33272614	36945518	root (Lotus japonicus, SAMN33272614)	60,119,664	95%	36%	152,674
SAMN33272615	36945518	root (Lotus japonicus, SAMN33272615)	60,505,266	95%	35%	152,138
SAMN33272616	36945518	root (Lotus japonicus, SAMN33272616)	60,312,596	96%	36%	150,256
SAMN33272617	36945518	root (Lotus japonicus, SAMN33272617)	60,193,966	95%	36%	150,933
SAMN33272618	36945518	root (Lotus japonicus, SAMN33272618)	60,418,538	95%	35%	151,083
SAMN33272619	36945518	root (Lotus japonicus, SAMN33272619)	60,157,692	95%	35%	151,160
SAMN33272620	36945518	root (Lotus japonicus, SAMN33272620)	60,302,118	95%	36%	151,983
SAMN33272621	36945518	root (Lotus japonicus, SAMN33272621)	60,343,360	94%	35%	151,816
SAMN33272622	36945518	root (Lotus japonicus, SAMN33272622)	60,222,410	95%	36%	151,147

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR5950194	SRX3108763	SRP115872	SAMN07455372	52,368,258	84%	30%
SRR5950193	SRX3108764	SRP115872	SAMN07455372	44,872,562	84%	30%
SRR5950192	SRX3108765	SRP115872	SAMN07455372	44,492,592	83%	30%
SRR5950191	SRX3108766	SRP115872	SAMN07455372	87,275,806	83%	30%
SRR5950190	SRX3108767	SRP115872	SAMN07455372	48,592,184	84%	30%
SRR5950189	SRX3108768	SRP115872	SAMN07455372	67,730,648	79%	30%
SRR5950188	SRX3108769	SRP115872	SAMN07455372	65,229,978	83%	29%
SRR5950187	SRX3108770	SRP115872	SAMN07455372	58,836,486	74%	27%
SRR5950186	SRX3108771	SRP115872	SAMN07455372	65,164,292	83%	28%
SRR5950185	SRX3108772	SRP115872	SAMN07455372	67,049,760	83%	31%
SRR5950184	SRX3108773	SRP115872	SAMN07455372	55,224,624	78%	29%
SRR5950183	SRX3108774	SRP115872	SAMN07455372	60,431,276	82%	28%
SRR5950182	SRX3108775	SRP115872	SAMN07455372	58,788,550	80%	28%
SRR5950181	SRX3108776	SRP115872	SAMN07455372	89,866,944	83%	28%
SRR5950180	SRX3108777	SRP115872	SAMN07455372	51,279,102	83%	28%
SRR11472034	SRX8048314	SRP255050	SAMN14527160	61,668,374	84%	23%
SRR11472033	SRX8048315	SRP255050	SAMN14527160	57,619,994	80%	23%
SRR11472032	SRX8048316	SRP255050	SAMN14527160	18,830,547	13%	22%
SRR11472031	SRX8048317	SRP255050	SAMN14527160	18,603,087	12%	22%
SRR11472030	SRX8048318	SRP255050	SAMN14527161	68,183,454	80%	24%
SRR11472029	SRX8048319	SRP255050	SAMN14527161	60,927,388	75%	23%
SRR11472028	SRX8048320	SRP255050	SAMN14527161	21,945,541	17%	22%
SRR11472027	SRX8048321	SRP255050	SAMN14527161	19,428,962	17%	22%
SRR12097590	SRX8623010	SRP268920	SAMN15376480	40,623,682	86%	28%
SRR12097589	SRX8623011	SRP268920	SAMN15376480	40,108,044	75%	23%
SRR12097577	SRX8623022	SRP268920	SAMN15376480	46,185,788	77%	24%
SRR12097578	SRX8623025	SRP268920	SAMN15376481	41,988,016	86%	26%
SRR12097574	SRX8623026	SRP268920	SAMN15376481	42,713,878	84%	25%
SRR12097573	SRX8623027	SRP268920	SAMN15376482	41,221,872	88%	26%
SRR12097572	SRX8623028	SRP268920	SAMN15376482	43,450,542	88%	30%
SRR12097571	SRX8623029	SRP268920	SAMN15376482	43,962,364	86%	28%
SRR12097588	SRX8623012	SRP268920	SAMN15376483	42,811,372	87%	26%
SRR12097587	SRX8623013	SRP268920	SAMN15376483	42,699,074	82%	25%
SRR12097570	SRX8623030	SRP268920	SAMN15376483	48,700,162	84%	29%
SRR12097585	SRX8623014	SRP268920	SAMN15376484	42,181,554	80%	27%
SRR12097584	SRX8623015	SRP268920	SAMN15376484	41,334,048	87%	27%
SRR12097583	SRX8623016	SRP268920	SAMN15376484	43,592,068	85%	31%
SRR12097586	SRX8623017	SRP268920	SAMN15376485	43,779,578	75%	24%
SRR12097582	SRX8623018	SRP268920	SAMN15376485	44,105,642	70%	22%
SRR12097581	SRX8623019	SRP268920	SAMN15376485	44,284,348	82%	29%
SRR12097580	SRX8623020	SRP268920	SAMN15376486	42,887,828	86%	31%
SRR12097579	SRX8623021	SRP268920	SAMN15376486	44,185,134	84%	30%
SRR12097576	SRX8623023	SRP268920	SAMN15376486	46,235,066	84%	31%
SRR15311779	SRX11616382	SRP330661	SAMN20510973	136,919,374	82%	25%
SRR15311778	SRX11616383	SRP330661	SAMN20510974	124,518,650	85%	24%
SRR15311777	SRX11616384	SRP330661	SAMN20510975	117,084,920	87%	23%
SRR15311776	SRX11616385	SRP330661	SAMN20510976	115,967,720	76%	24%
SRR15311775	SRX11616386	SRP330661	SAMN20510977	107,712,430	81%	24%
SRR15311774	SRX11616387	SRP330661	SAMN20510978	113,400,600	81%	25%
SRR15311773	SRX11616388	SRP330661	SAMN20510979	122,720,944	78%	24%
SRR15311772	SRX11616389	SRP330661	SAMN20510980	115,035,768	78%	25%
SRR15311771	SRX11616390	SRP330661	SAMN20510981	190,967,036	76%	25%
SRR17464527	SRX13635687	SRP353770	SAMN24563794	37,544,650	93%	35%
SRR17464526	SRX13635688	SRP353770	SAMN24563796	43,557,340	92%	34%
SRR17464525	SRX13635689	SRP353770	SAMN24563798	54,659,060	94%	35%
SRR17464524	SRX13635690	SRP353770	SAMN24563800	45,272,440	91%	34%
SRR19428215	SRX15482230	SRP377357	SAMN28689148	108,492,524	92%	21%
SRR19428639	SRX15482654	SRP377372	SAMN28688648	47,425,965	77%	18%
SRR21717365	SRX17714456	SRP399738	SAMN31014967	176,330,034	92%	22%
SRR21717366	SRX17714457	SRP399738	SAMN31014967	170,857,962	93%	22%
SRR21721662	SRX17718737	SRP399738	SAMN31014967	201,771,452	92%	20%
SRR21730521	SRX17726288	SRP399738	SAMN31014967	186,383,030	90%	20%
SRR23445670	SRX19355094	SRP422296	SAMN33272609	60,413,838	96%	36%
SRR23445671	SRX19355093	SRP422296	SAMN33272610	60,237,260	95%	36%
SRR23445672	SRX19355092	SRP422296	SAMN33272611	60,230,248	95%	35%
SRR23445673	SRX19355091	SRP422296	SAMN33272612	60,433,400	96%	36%
SRR23445674	SRX19355090	SRP422296	SAMN33272613	60,460,778	95%	36%
SRR23445675	SRX19355089	SRP422296	SAMN33272614	60,119,664	95%	36%
SRR23445676	SRX19355088	SRP422296	SAMN33272615	60,505,266	95%	35%
SRR23445677	SRX19355087	SRP422296	SAMN33272616	60,312,596	96%	36%
SRR23445678	SRX19355086	SRP422296	SAMN33272617	60,193,966	95%	36%
SRR23445679	SRX19355085	SRP422296	SAMN33272618	60,418,538	95%	35%
SRR23445680	SRX19355084	SRP422296	SAMN33272619	60,157,692	95%	35%
SRR23445681	SRX19355083	SRP422296	SAMN33272620	60,302,118	95%	36%
SRR23445682	SRX19355082	SRP422296	SAMN33272621	60,343,360	94%	35%
SRR23445683	SRX19355081	SRP422296	SAMN33272622	60,222,410	95%	36%
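
The per-run and per-sample tables can be related by summing read counts over the runs that belong to a sample. The sketch below does this for a small excerpt of the runs listed above; that the per-sample totals are plain sums of per-run read counts is an assumption, which holds for the two samples shown here.

```python
from collections import defaultdict

# Excerpt of (run, sample, number of reads) copied from the per-run table.
runs = [
    ("SRR11472034", "SAMN14527160", 61_668_374),
    ("SRR11472033", "SAMN14527160", 57_619_994),
    ("SRR11472032", "SAMN14527160", 18_830_547),
    ("SRR11472031", "SAMN14527160", 18_603_087),
    ("SRR17464527", "SAMN24563794", 37_544_650),
]

# Sum the read counts of all runs belonging to the same BioSample.
reads_per_sample = defaultdict(int)
for _run, sample, reads in runs:
    reads_per_sample[sample] += reads

for sample, total in sorted(reads_per_sample.items()):
    print(f"{sample}\t{total:,}")
```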

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	7,254	7,097 (97.84%)	7,097 (97.84%)	80.38%	88.56%
Theobroma cacao high-quality model RefSeq (XP_)	13,536	13,041 (96.34%)	13,041 (96.34%)	69.05%	78.98%
Cucurbita maxima high-quality model RefSeq (XP_)	18,981	18,549 (97.72%)	18,549 (97.72%)	69.33%	78.12%
Arabidopsis thaliana GenBank	45,186	41,578 (92.02%)	41,578 (92.02%)	69.02%	76.43%
Arabidopsis thaliana known RefSeq (NP_)	48,147	42,062 (87.36%)	42,062 (87.36%)	66.93%	72.30%
Fabaceae GenBank	31,193	28,947 (92.80%)	28,947 (92.80%)	73.90%	85.82%
Fabaceae known RefSeq (NP_)	8,625	8,432 (97.76%)	8,432 (97.76%)	73.18%	84.52%
Nelumbo nucifera high-quality model RefSeq (XP_)	14,296	13,842 (96.82%)	13,842 (96.82%)	69.23%	78.63%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
