Get one protein per gene from a set of orthologs
Use datasets and dataformat to get one representative (the longest) protein sequence per gene from a set of orthologs
For a given set of gene orthologs, there are often many protein sequences per gene. This tutorial will show you how to download a single protein sequence per gene.
This can be done in three steps:
- Get transcript and protein metadata for the ortholog set
- Extract the accessions of the longest protein and corresponding transcript from this metadata
- Download the set of longest protein and corresponding transcript sequences, one per gene
Get transcript and protein metadata for the ortholog set
There are two data reports that describe gene metadata: the gene data report and the product report. The gene data report is always included with the gene data package and is the default output of the datasets summary gene command. The product report is obtained by using the flag --report product and contains metadata about the transcript and protein products. We will use the command datasets summary gene with the flags --report product, --ortholog 'homo sapiens,mus musculus,mustela putorius furo' and --as-json-lines to get the product report for human BRCA1 and its orthologs in mouse and ferret, and save it to a file, brca1_orthologs.jsonl. Then, we will use dataformat to create a TSV file from the product report with the following fields: gene ID, taxonomic name, gene symbol, transcript accession, transcript length, protein accession and protein length.
Use datasets to get metadata for human BRCA1 and orthologs in mouse and ferret:
datasets summary gene symbol brca1 \
--ortholog 'homo sapiens,mus musculus,mustela putorius furo' \
--report product --as-json-lines > brca1_orthologs.jsonl
Use dataformat to generate a tsv with selected fields:
dataformat tsv gene-product \
--inputfile brca1_orthologs.jsonl \
--fields gene-id,tax-name,symbol,transcript-accession,transcript-length,transcript-protein-accession,transcript-protein-length > transcript_protein.tsv
Show the first 5 lines of the resulting table:
head -5 transcript_protein.tsv
NCBI GeneID Taxonomic Name Symbol Transcript Accession Transcript Transcript Length Transcript Protein Accession Transcript Protein Length
672 Homo sapiens BRCA1 NM_001408458.1 3785 NP_001395387.1 712
672 Homo sapiens BRCA1 NM_001407967.1 6390 NP_001394896.1 1566
672 Homo sapiens BRCA1 NM_001407959.1 6851 NP_001394888.1 1736
672 Homo sapiens BRCA1 NM_001407931.1 6803 NP_001394860.1 1774
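Because the later steps refer to columns by position, it can help to print the header together with its column numbers. The following is a minimal sketch, assuming a tab-separated file with a single header row; the function name number_columns is only illustrative and is not part of datasets or dataformat.

```shell
# Print each column name of a TSV header next to its 1-based position.
# Assumes the first line of the file is a tab-separated header.
number_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR "\t" $0 }'
}
```

Calling number_columns transcript_protein.tsv should list the transcript accession as column 4, the protein accession as column 6 and the protein length as column 7, provided the fields were requested in the order shown above.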
Note that there are many transcript splice variants and protein isoforms for human BRCA1.
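To see how many transcript/protein rows each gene contributes, the rows of the table can be counted per gene. This is a rough sketch, assuming the gene ID is in column 1 and the taxonomic name in column 2, as in transcript_protein.tsv; count_isoforms is a hypothetical helper name.

```shell
# Count the rows (transcript/protein products) per gene in a TSV with a header row.
# Assumed layout: column 1 = gene ID, column 2 = taxonomic name.
# Output: gene ID, taxonomic name and row count, tab-separated, largest count first.
count_isoforms() {
    awk -F'\t' 'NR > 1 { n[$1 FS $2]++ } END { for (k in n) print k FS n[k] }' "$1" |
        sort -t "$(printf '\t')" -k3,3nr -k1,1
}
```

For example, count_isoforms transcript_protein.tsv prints one line per gene, which makes it easy to see which orthologs have many annotated products.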
Extract the accessions of the longest protein and corresponding transcript
To pick a single protein sequence for each gene, we’ll use a loop to identify the longest protein for each gene, and save the accession of this protein and of the corresponding transcript to a new file, longest.list.
First, we’ll use tail, cut and sort to get the unique gene IDs from the first column of the transcript_protein.tsv file. Then, for each gene, we’ll sort its rows by protein length in amino acids (column #7 in transcript_protein.tsv) to identify the longest protein.
Finally, for each gene, we’ll save the transcript accession and the accession of the longest protein (columns #4 and #6) to the file longest.list, one accession per line.
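# Start with an empty output file so that re-running does not append duplicates
: > longest.list
tail -n +2 transcript_protein.tsv | \
cut -f1 | \
sort -u | \
while read -r GENE_ID; do
# Keep only the rows whose first column is exactly this gene ID,
# sort them by protein length (column 7, longest first) and
# keep the transcript and protein accessions of the top row
LONGEST=$(awk -F'\t' -v id="$GENE_ID" '$1 == id' transcript_protein.tsv | \
sort -t$'\t' -k7,7nr | \
head -n1 | \
cut -f4,6 | \
tr '\t' '\n')
printf '%s\n' "$LONGEST" >> longest.list
done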
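The same selection can also be sketched as a single awk pass over the table, which avoids re-reading the file once per gene. This is an alternative under stated assumptions, not the method used above: it assumes the column layout of transcript_protein.tsv (1 = gene ID, 4 = transcript accession, 6 = protein accession, 7 = protein length), skips rows without a protein accession, keeps the first row on ties, and prints genes in order of first appearance rather than sorted by gene ID. The function name longest_per_gene is hypothetical.

```shell
# Print the transcript and protein accession of the longest protein
# for every gene, one accession per line.
# Assumed columns: 1 = gene ID, 4 = transcript accession,
# 6 = protein accession, 7 = protein length.
longest_per_gene() {
    awk -F'\t' '
        NR == 1  { next }    # skip the header row
        $6 == "" { next }    # skip rows without a protein product
        !($1 in best) || $7 + 0 > best[$1] {
            # first row seen for this gene, or a strictly longer protein
            if (!($1 in best)) order[++n] = $1
            best[$1] = $7 + 0
            acc[$1] = $4 "\n" $6
        }
        END { for (i = 1; i <= n; i++) print acc[order[i]] }
    ' "$1"
}
```

It could be used as longest_per_gene transcript_protein.tsv > longest.list.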
The resulting file contains, for each of the three genes, the accession of the transcript that encodes the longest protein followed by the accession of that protein:
cat longest.list
XM_004772608.3
XP_004772665.1
XM_030245495.2
XP_030101355.1
NM_001407582.1
NP_001394511.1
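A quick sanity check before downloading is to confirm that the list holds exactly two accessions per gene; a larger number usually means the list was appended to more than once. The sketch below assumes the gene ID is in column 1 of the table and that the list has one accession per line; check_pairs is an illustrative name.

```shell
# Succeed (exit status 0) when the list holds exactly two non-empty lines per gene.
# $1: TSV with a header row and gene IDs in column 1
# $2: accession list, one accession per line
check_pairs() {
    genes=$(tail -n +2 "$1" | cut -f1 | sort -u | awk 'END { print NR }')
    lines=$(awk 'NF { n++ } END { print n + 0 }' "$2")
    [ "$lines" -eq $((genes * 2)) ]
}
```

For example: check_pairs transcript_protein.tsv longest.list && echo "longest.list looks complete".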
Download the longest protein and corresponding transcript sequences, one per gene
Now that we have a list of transcript and protein accessions, we can use datasets to download the sequences.
Use the --fasta-filter-file flag to get sequences only for the specific transcript and protein accessions listed in the file longest.list.
datasets download gene accession \
--inputfile longest.list \
--fasta-filter-file longest.list \
--filename longest.zip
The downloaded file, longest.zip, will contain two sequence files, protein.faa and rna.fna. The first file, protein.faa, will contain one protein sequence per gene for human BRCA1 and its ferret and mouse orthologs. The second file, rna.fna, will contain the corresponding transcript sequences.
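To confirm that each file holds one sequence per gene, the FASTA header lines can be counted after unzipping the package. This is a minimal sketch; count_fasta_records is a hypothetical helper, and the path used in the usage note assumes the usual package layout, with sequence files under ncbi_dataset/data/.

```shell
# Count the sequences in a FASTA file by counting header lines,
# that is, lines starting with ">".
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```

After extracting with unzip longest.zip -d longest, running count_fasta_records longest/ncbi_dataset/data/protein.faa should report one sequence for each gene in longest.list, assuming the files are extracted to that location.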