Rename downloaded files

Rename downloaded files to use descriptive file names

For most files included in data packages downloaded from NCBI Datasets, the default filenames are generic.
For example, protein files included in the NCBI Datasets genome data package are all named protein.faa.

Use the simple script below to rename the protein.faa files included in the NCBI Datasets genome data package to use descriptive file names that include the genome assembly accession.

Before running the file renaming script

# Note that all protein sequence files share the same generic name, protein.faa

ls ncbi_dataset/data/GC*/*protein.faa | head -3
ncbi_dataset/data/GCA_000774145.1/protein.faa
ncbi_dataset/data/GCA_009818265.1/protein.faa
ncbi_dataset/data/GCA_016735085.1/protein.faa

After running the file renaming script

# Note that each protein sequence file has been renamed to include the genome assembly accession

ls ncbi_dataset/data/GC*/*protein.faa | head -3
ncbi_dataset/data/GCA_000774145.1/GCA_000774145.1_protein.faa
ncbi_dataset/data/GCA_009818265.1/GCA_009818265.1_protein.faa
ncbi_dataset/data/GCA_016735085.1/GCA_016735085.1_protein.faa

How to run the file renaming script

First, create a file called rename.sh and open it in the nano text editor by running the following.

nano rename.sh

Next, copy and paste the following script into nano.

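#!/bin/bash

# Prefix each protein.faa with the accession taken from its parent directory name
for file in ncbi_dataset/data/*/protein.faa
do
    # Skip the unexpanded pattern if no protein.faa files are found
    [ -e "$file" ] || continue
    directory_name=$(dirname "$file")
    accession=$(basename "$directory_name")
    mv "$file" "${directory_name}/${accession}_$(basename "$file")"
done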

Press Ctrl+X, then Y and Enter when prompted, to save the file and exit nano.
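
Before renaming anything, it may help to preview what the script would do. The following is a minimal sketch of a dry run, assuming the same ncbi_dataset/data layout shown above. The function name preview_renames is hypothetical and is not part of NCBI Datasets. It prints the planned mv commands without moving any files.

```shell
#!/bin/bash

# Dry run: print the planned renames without moving any files.
# preview_renames is a hypothetical helper name used for illustration.
preview_renames() {
    local file directory_name accession
    for file in ncbi_dataset/data/*/protein.faa
    do
        # Skip the unexpanded pattern if no protein.faa files are found
        [ -e "$file" ] || continue
        directory_name=$(dirname "$file")
        accession=$(basename "$directory_name")
        echo "mv ${file} ${directory_name}/${accession}_protein.faa"
    done
}

preview_renames
```

If the printed commands look right, continue with the renaming script itself.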

Finally, run the script from the directory that contains the ncbi_dataset folder of the extracted NCBI Datasets data package.

bash rename.sh
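
Because most files in a data package have generic names, the same approach could be extended to other files. The sketch below is one possible generalization, not an official NCBI script. The function name rename_with_accession is hypothetical, and it assumes the other files sit in the same per-accession directories as protein.faa. The generic file name is passed as an argument and defaults to protein.faa.

```shell
#!/bin/bash

# Sketch: prefix any generically named file with its assembly accession.
# rename_with_accession is a hypothetical helper name used for illustration.
rename_with_accession() {
    local generic_name="${1:-protein.faa}"
    local file directory_name accession
    for file in ncbi_dataset/data/*/"${generic_name}"
    do
        # Skip the unexpanded pattern if no matching files are found
        [ -e "$file" ] || continue
        directory_name=$(dirname "$file")
        accession=$(basename "$directory_name")
        mv "$file" "${directory_name}/${accession}_${generic_name}"
    done
}

# With no argument this renames protein.faa files, like rename.sh
rename_with_accession "$@"
```

For example, saving this as a script and passing genomic.gff as the argument would rename annotation files, if your data package includes them. Running it a second time does nothing, because the renamed files no longer match the generic name.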

Generated May 1, 2024