Download SRA sequence data using Amazon Web Services (AWS)

SRA Data in the AWS Registry of Open Data
Accessing SRA Data in AWS
Introduction for First Time Users
Creating an AWS Instance
The SRA Toolkit in AWS
- Installing The SRA Toolkit in your instance
- Using the SRA Toolkit in AWS
Youtube Video Tutorial - Setting up AWS - demo
Engage

SRA Data in the AWS Registry of Open Data

Amazon Web Services (AWS) publicly hosts SRA data through the Registry of Open Data. SRA has several datasets in the AWS Registry of Open Data, all of which can be accessed free of charge through either an HTTPS or S3 URL. One dataset contains public SRA data in the originally submitted format from select high-value and newly released studies. A second dataset acts as a centralized repository of SARS-CoV-2-related sequences submitted to NCBI. It includes both the original files submitted by the principal investigator and SRA-processed sequences (including normalized sequence files and SRA aligned read format files) that require the SRA Toolkit for analysis. This dataset also includes metadata searchable in AWS Athena by BLAST result, taxonomic analysis, and more, to allow rapid discovery of the data most relevant to your research.

Coronaviridae Datasets

- The Runs directory contains normalized sequence data, accessible in multiple formats (fastq, sam, fasta) via the SRA Toolkit and organized by Run accession.
- The sra-src directory contains the submitted sequence files in their original format, organized by Run accession.
- The VCF directory contains SRA-generated VCF files, organized by Run accession.

AWS CLI Access (No AWS account required):

aws s3 ls s3://sra-pub-sars-cov2/ --no-sign-request

Public data

Contains all public SRA Runs organized by Run accession.

AWS CLI Access (No AWS account required):

aws s3 ls s3://sra-pub-run-odp/ --no-sign-request
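
To inspect a single Run rather than the whole bucket, you can list under that Run's prefix. The sketch below assumes that objects are stored under sra/<accession>/ inside the bucket; that layout is not stated on this page, so confirm it with the top-level listing first. SRR000001 is only a placeholder accession and run_prefix is a hypothetical helper. The command is printed rather than executed so that you can review it before running it.

```shell
# Assumption: Runs are stored under sra/<accession>/ in this bucket.
# Verify the layout with the top-level listing before relying on it.
BUCKET="sra-pub-run-odp"

# Print the S3 prefix for one Run accession (hypothetical helper).
run_prefix() {
  printf 's3://%s/sra/%s/\n' "$BUCKET" "$1"
}

# SRR000001 is a placeholder accession used for illustration.
prefix=$(run_prefix SRR000001)

# Show the listing command; paste it into your shell to run it.
echo "aws s3 ls $prefix --no-sign-request"
```

If the listing comes back empty, the prefix assumption is probably wrong for your accession; browse upward from the bucket root instead.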

Public user-submitted files

Contains submitted sequence files in their original format, organized by Run accession.

AWS CLI Access (No AWS account required):

aws s3 ls s3://sra-pub-src-2/ --no-sign-request

Accessing SRA Data in AWS

If you know your Run accessions of interest, you can access the data in several ways. To download files from the AWS Console using a browser, visit the HTTPS URL for the Coronaviridae dataset, Public SRA data, or Public user-submitted files, respectively:

https://s3.console.aws.amazon.com/s3/buckets/sra-pub-sars-cov2/
https://s3.console.aws.amazon.com/s3/buckets/sra-pub-run-odp/
https://s3.console.aws.amazon.com/s3/buckets/sra-pub-src-2/

From there, you can navigate the directory structure using the graphical interface and search a directory for your accession of interest using the search box near the top of the page. Once you have navigated to a specific file of interest, you can click the Object URL link, or use the Object actions button to copy the file to your own S3 bucket or download a copy to local storage.

To access files from within AWS, e.g. from an EC2 instance, you can use the AWS CLI to perform an S3 copy or sync, using a command like this:

aws s3 cp s3://sra-pub-sars-cov2/README.txt "$HOME/README.txt" --no-sign-request
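
A sync works the same way as a copy but mirrors a whole prefix into a directory. The following is a minimal sketch, not an official recipe: the source prefix assumes an sra/<accession>/ layout in the sra-pub-run-odp bucket, SRR000001 is a placeholder accession, and the destination directory is hypothetical. The small run helper is also an illustration; it prints each command by default and only executes when DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumed layout and placeholder accession; check the bucket listing first.
SRC="s3://sra-pub-run-odp/sra/SRR000001/"
# Hypothetical local destination directory.
DEST="$HOME/sra-data/SRR000001"

run mkdir -p "$DEST"
# --no-sign-request allows anonymous access to the public bucket.
run aws s3 sync "$SRC" "$DEST" --no-sign-request
```

Running the sync again later transfers only objects that are new or changed, which makes it convenient for resuming an interrupted download.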

These data can also be accessed using various other tools and libraries. Access to files in the AWS Registry of Open Data is free, whether you use the HTTPS or S3 URL. For S3 URLs, the transfer is free even if it crosses an AWS region boundary; there is no inter-regional data transfer fee.

If you don't know the Run accessions you are interested in, you can start by searching in the SRA Run Selector, AWS Athena, or SRA Entrez.
A full list of Coronaviridae-containing SRA runs as detected with NCBI's kmer analysis tool is available here: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/AccList/ .

Introduction for First Time Users

Amazon Elastic Compute Cloud (EC2) is the Amazon Web Service you use to create and run virtual machines in the cloud. AWS calls these virtual machines 'instances'. You will need to install your bioinformatics tools for data analysis and the SRA Toolkit for accessing the SRA data.

Creating an AWS Instance

Note: Users will need to handle account setup on their own.
Please work with your organization for credential and billing questions. If using a personal account, this guide attempts to stay within AWS Free Tier for users who are still eligible.

Note: Users of this guide are expected to have experience using a Unix command-line interface.

Sign in with your AWS account at the Amazon AWS Console.

Create an AWS Instance

Please follow this Amazon step-by-step guide, which will help you launch a Linux virtual machine on Amazon EC2 within the AWS Free Tier.
Please make sure to create your EC2 instance in the US East (N. Virginia) region (us-east-1).

Connect to the Instance

Use either a Unix/macOS terminal or your preferred SSH application to connect, as described in the Amazon tutorial linked above. The username for this AMI is ec2-user.
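
From a terminal, the connection typically looks like the sketch below. The key file path and host name are hypothetical; substitute the key pair file you downloaded when launching the instance and the public DNS name shown for it in the EC2 console. Only the ec2-user username comes from this guide. The run helper is an illustration that prints each command unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical values: replace with your own key file and instance address.
KEY_FILE="$HOME/.ssh/my-ec2-key.pem"
EC2_HOST="ec2-203-0-113-25.compute-1.amazonaws.com"

# ssh rejects private keys that are readable by other users.
run chmod 400 "$KEY_FILE"
# Log in as ec2-user, the username for this AMI.
run ssh -i "$KEY_FILE" "ec2-user@$EC2_HOST"
```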

Terminate the Instance

Remember to terminate the EC2 instance from the AWS console when you have finished using it. If you do not terminate the instance, charges can accrue on your account even when no users are connected.
Data stored on the EC2 instance will be deleted when the instance is terminated. Users will likely want stable S3 storage for the results of their work.
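
One way to save results before shutting down is to sync them to a bucket you own and then terminate the instance from the command line instead of the console. This is a sketch under several assumptions: the results directory, bucket name, and instance ID are all hypothetical, and both commands need AWS credentials with permission to write to that bucket and to terminate the instance. The run helper is an illustration that prints each command unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical values: replace with your own directory, bucket, and instance ID.
RESULTS_DIR="$HOME/results"
MY_BUCKET="s3://my-results-bucket/sra-analysis/"
INSTANCE_ID="i-0123456789abcdef0"

# Copy results to durable S3 storage first (requires your own credentials).
run aws s3 sync "$RESULTS_DIR" "$MY_BUCKET"
# Then terminate the instance; data on it is deleted at this point.
run aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --region us-east-1
```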

The SRA Toolkit in AWS

Installing The SRA Toolkit in your instance

Once you are connected, you will be able to work in a Unix-like command-line environment where you can install and configure the SRA Toolkit.

Installing the SRA Toolkit

Using the SRA Toolkit in AWS

To download public SRA data from our cloud buckets to your cloud storage, you can use the SRA Toolkit utilities as described in the SRA Download Guide.
To download dbGaP data from our cloud buckets to your cloud storage, you need to use a jwt.cart file as described in Downloading dbGaP data with JWT.
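
For public data, a typical Toolkit session fetches a Run with prefetch and then converts it to FASTQ with fasterq-dump. The sketch below is illustrative rather than a summary of the Download Guide: SRR000001 is a placeholder accession, the output directory is hypothetical, and it assumes the SRA Toolkit is already installed and on your PATH. The run helper is an illustration that prints each command unless DRY_RUN=0 is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholder accession and hypothetical output directory.
ACC="SRR000001"
OUT_DIR="$HOME/fastq"

# Download the Run in SRA format.
run prefetch "$ACC"
# Convert the downloaded Run to FASTQ files in OUT_DIR.
run fasterq-dump "$ACC" --outdir "$OUT_DIR"
```

dbGaP downloads follow a different path and need the jwt.cart file mentioned above, so this sketch does not apply to them.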

Note: Don't forget to STOP your instance after you have finished your work!

Youtube Video Tutorial - Setting up AWS - demo

Setting up AWS demo

Engage

NCBI wants your feedback on SRA in the Cloud. Contact sra@ncbi.nlm.nih.gov with questions or if you would like to provide input on new functionality.
