Get one protein per gene from a set of orthologs

Use datasets and dataformat to get one representative (the longest) protein sequence per gene from a set of orthologs

For a given set of gene orthologs, there are often many protein sequences per gene. This tutorial will show you how to download a single protein sequence per gene.

This can be done in three steps:

  1. Get transcript and protein metadata for the ortholog set
  2. Extract the accessions of the longest protein and corresponding transcript from this metadata
  3. Download the set of longest protein and corresponding transcript sequences, one per gene

Get transcript and protein metadata for the ortholog set

Two data reports describe gene metadata: the gene data report and the product report. The gene data report is always included with the gene data package and is the default output of the datasets summary gene command. The product report is obtained with the flag --report product and contains metadata about the transcript and protein products.

We will use datasets summary gene symbol brca1 with the flags --report product, --ortholog 'homo sapiens,mus musculus,mustela putorius furo' and --as-json-lines to get the product report for human BRCA1 and its orthologs in mouse and ferret, and save it to a file, brca1_orthologs.jsonl. Then we will use dataformat to create a TSV file from the product report with the following fields: gene ID, taxonomic name, gene symbol, transcript accession, transcript length, protein accession and protein length.

Use datasets to get metadata for human BRCA1 and orthologs in mouse and ferret:

datasets summary gene symbol brca1 \
--ortholog 'homo sapiens,mus musculus,mustela putorius furo' \
--report product --as-json-lines > brca1_orthologs.jsonl

Use dataformat to generate a tsv with selected fields:

dataformat tsv gene-product \
--inputfile brca1_orthologs.jsonl \
--fields gene-id,tax-name,symbol,transcript-accession,transcript-length,transcript-protein-accession,transcript-protein-length > transcript_protein.tsv

Show the first 5 lines of the resulting table:

head -5 transcript_protein.tsv
NCBI GeneID	Taxonomic Name	Symbol	Transcript Accession	Transcript Transcript Length	Transcript Protein Accession	Transcript Protein Length
672	Homo sapiens	BRCA1	NM_001408458.1	3785	NP_001395387.1	712
672	Homo sapiens	BRCA1	NM_001407967.1	6390	NP_001394896.1	1566
672	Homo sapiens	BRCA1	NM_001407959.1	6851	NP_001394888.1	1736
672	Homo sapiens	BRCA1	NM_001407931.1	6803	NP_001394860.1	1774

Note that there are many transcript splice variants and protein isoforms for human BRCA1.
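
To see how many transcript and protein products each gene has, the rows of the table can be tallied per gene. This is a minimal sketch, assuming a file with the column layout shown above; count_products is an illustrative helper name, not part of datasets or dataformat.

```shell
# Count rows (transcript/protein pairs) per gene in a TSV with the layout above.
# count_products is a hypothetical helper name, not part of the NCBI tools.
count_products() {
    # Skip the header, keep gene ID and taxonomic name, then tally identical pairs.
    tail -n +2 "$1" | cut -f1,2 | sort | uniq -c
}

# Usage, once transcript_protein.tsv exists:
# count_products transcript_protein.tsv
```

Each output line gives a count followed by the gene ID and taxonomic name, so genes with many isoforms stand out immediately.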

Extract the accessions of the longest protein and corresponding transcript

In order to pick a single protein sequence for each gene, we’ll use a loop to identify the longest protein for each gene, and save the accession for this longest protein and the corresponding transcript accession to a new file, longest.list.

First, we’ll use tail, cut and sort to get the Gene IDs from the first column of the transcript_protein.tsv file, then we’ll sort by amino acid length (column #7 in transcript_protein.tsv) to identify the longest protein for each gene.

Next, for each gene, we’ll save the transcript accession and the accession of the longest protein (columns #4 and #6, in that order) to the file, longest.list.

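# List the unique gene IDs (column 1), skipping the header row
tail -n +2 transcript_protein.tsv | \
cut -f1 | \
sort -u | \
while read -r GENE_ID; do
	# Sort the rows for this gene by protein length (column 7), longest first,
	# and keep the transcript and protein accessions (columns 4 and 6)
	LONGEST=$(grep -w "^$GENE_ID" transcript_protein.tsv | \
	sort -t$'\t' -nr -k7,7 | \
	head -n1 | \
	cut -f4,6 | \
	tr '\t' '\n')
	printf '%s\n' "$LONGEST"
done > longest.list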
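
The same selection can also be sketched as a single pass with awk, which avoids re-reading the table once per gene. This is an alternative under stated assumptions rather than the method used in this tutorial: it assumes the seven-column layout shown earlier, and longest_per_gene is a hypothetical helper name.

```shell
# One-pass alternative to the loop: keep, per gene ID (column 1), the row with
# the largest protein length (column 7), then print columns 4 and 6 on separate lines.
# longest_per_gene is a hypothetical helper name.
longest_per_gene() {
    awk -F'\t' '
        NR == 1 { next }                      # skip the header row
        ($7 + 0) > (max[$1] + 0) {            # strictly longer protein seen for this gene
            max[$1] = $7
            best[$1] = $4 "\n" $6             # transcript accession, then protein accession
        }
        END { for (gene in best) print best[gene] }
    ' "$1"
}

# Usage, once transcript_protein.tsv exists:
# longest_per_gene transcript_protein.tsv > longest.list
```

Two differences from the loop are worth noting: the order in which genes are printed is not guaranteed, and when two proteins tie for longest, this sketch keeps the first one encountered in the file. Rows without a protein length are never selected.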

The resulting file contains the longest protein accession and the corresponding transcript accession for each of 3 genes:

cat longest.list 
XM_004772608.3
XP_004772665.1
XM_030245495.2
XP_030101355.1
NM_001407582.1
NP_001394511.1

Download the longest protein and corresponding transcript sequences, one per gene

Now that we have a list of transcript and protein accessions, we can use datasets to download the sequences.
Use the --fasta-filter-file flag to download sequences only for the transcript and protein accessions listed in the file, longest.list.

datasets download gene accession \
--inputfile longest.list \
--fasta-filter-file longest.list \
--filename longest.zip

The downloaded file, longest.zip, will contain two sequence files, protein.faa and rna.fna. The first file, protein.faa, will contain one protein sequence per gene, for human BRCA1 and the ferret and mouse gene orthologs. The second file, rna.fna, will contain the corresponding transcript sequences.
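
As a quick sanity check, the number of sequences in each file can be counted from the FASTA header lines. This is a minimal sketch: count_fasta_records is a hypothetical helper name, and the paths in the usage comments assume that the sequence files sit under ncbi_dataset/data/ inside the package, which should be confirmed against the unzipped contents.

```shell
# Count sequence records in a FASTA file by counting header lines (those starting with ">").
# count_fasta_records is a hypothetical helper name.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage, assuming the sequence files are under ncbi_dataset/data/ in the package:
# unzip longest.zip -d longest
# count_fasta_records longest/ncbi_dataset/data/protein.faa
# count_fasta_records longest/ncbi_dataset/data/rna.fna
```

With the three genes in this tutorial, each count would be expected to be 3, one sequence per gene.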

Generated May 1, 2024