SRA Data Formats
SRA Data is Available with Simplified Quality Scores
SRA data are available either with full base quality scores (SRA Normalized Format) or with simplified quality scores (SRA Lite),
depending on user preference. Both formats can be streamed on demand to the same file types (fastq, sam, etc.), so both are compatible with existing
workflows and applications that expect quality scores. However, SRA Lite files are much smaller, which reduces the storage footprint and data transfer times
and allows dumps to complete faster. The SRA Toolkit defaults to the SRA Normalized Format, which includes full per-base quality scores, but users can opt to use simplified quality
scores in their analysis by requesting the SRA Lite version to save time on data transfers.
To request SRA Lite data when using the SRA Toolkit, set the "Prefer SRA Lite files with simplified base quality scores" option on
the main page of the toolkit configuration. This instructs the tools to use the SRA Lite format preferentially when it is available
(toolkit version 2.11.2 or later is required for this feature). The quality scores generated from SRA Lite files are the same for every base within a given read
(quality = 30 or 3, depending on whether the Read_Filter flag is set to pass or reject). Data in the SRA Normalized Format
continue to have a .sra file extension, while SRA Lite files have a .sralite file extension.
SRA Data Going Forward
The SRA Normalized Format was created to support FAIR (Findable, Accessible, Interoperable, Reusable) principles,
and newer, efficiently sized SRA formats continue this support, making it easier to manipulate and analyze large
datasets while also reducing file size and bandwidth requirements. Full base quality scores are not needed for many
bioinformatic use cases and workflows, and data formats with simplified scores reduce the typical SRA file footprint
by ~60%, with commensurate reductions in transfer times when accessing the data. SRA Lite and SRA Normalized Format files are both fully
accessible and streamable using the SRA Toolkit.
SRA Normalized Format - original format with full base quality scores
This is the format provided since the inception of the SRA. It contains base calls, full base quality scores, and alignments.
This format has a .sra file extension and is available from cloud providers and via the SRA Toolkit.
SRA Lite - smaller format with simplified quality scores
This new format contains base calls, simplified quality scores, and alignments.
This format has a .sralite file extension and is available from cloud providers and NCBI via the SRA Toolkit.
Output files derived from this format contain simplified quality scores.
SRA Lite files are produced from SRA Normalized Format by assessing overall
read quality and setting a per-read quality flag (Read_Filter). In the resulting files, all reads have a Read_Filter
flag with value "pass" or "reject".
Importantly, it is still possible to produce fastq-formatted files from SRA Lite format using the SRA Toolkit. In this case,
each read will have a constant quality score, set to 30 for reads with a Read_Filter
value of "pass" or 3 for reads with a value of "reject".
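As a rough illustration of what a constant quality score means in a fastq record, the sketch below builds the quality line for a read from its Read_Filter value. It assumes the common Phred+33 ASCII encoding; the function and constant names are hypothetical and are not part of the SRA Toolkit, and the toolkit's actual record layout may differ.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger-style) fastq quality encoding

# Constant per-base scores described above for output derived from SRA Lite.
LITE_QUALITY = {"pass": 30, "reject": 3}


def lite_quality_string(read_length: int, read_filter: str) -> str:
    """Return a quality line in which every base carries the same score."""
    score = LITE_QUALITY[read_filter]
    return chr(PHRED_OFFSET + score) * read_length


def lite_fastq_record(name: str, bases: str, read_filter: str) -> str:
    """Format a four-line fastq record with a constant quality line."""
    quality = lite_quality_string(len(bases), read_filter)
    return f"@{name}\n{bases}\n+\n{quality}\n"
```

Under the Phred+33 assumption, a score of 30 is written as "?" and a score of 3 as "$", so a passing four-base read would carry the quality line "????".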
Illumina fastq and sam/bam specifications support a quality bit that is set by the sequencing instrument, and SRA Lite stores this as a "pass"/"reject" Read_Filter value.
If this bit is set in the submitted fastq or bam file, the value is retained. If it is not, SRA will set a pass/reject value based on the quality score distribution
within each read. Reads in which more than half of the quality score values are <20 are flagged "reject". Reads that begin or end with a run of more than 10 quality scores <20
are also flagged "reject". Reads that pass these quality checks are flagged "pass".
When dumping data using the fastq-dump, fasterq-dump, or sam-dump utilities in the SRA Toolkit,
all reads are included by default. However, the fastq-dump tool has an option to include only passed or only rejected reads:
fastq-dump --read-filter <[pass|reject]>
In order to interact with these files and set your preference for SRA Lite files, please use SRA Toolkit version 2.11.2 or later.
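The pass/reject rule described above can be sketched in a few lines of Python. This is a minimal illustration of the stated criteria only, not NCBI's implementation; the function and constant names are hypothetical, and the input is assumed to be a read's per-base quality scores as integers.

```python
from typing import Sequence

LOW_QUALITY_THRESHOLD = 20  # scores below this value count as low quality
MAX_TERMINAL_RUN = 10       # a longer low-quality run at either end rejects the read


def leading_low_run(quals: Sequence[int]) -> int:
    """Length of the uninterrupted run of low-quality scores at the start."""
    run = 0
    for q in quals:
        if q >= LOW_QUALITY_THRESHOLD:
            break
        run += 1
    return run


def classify_read(quals: Sequence[int]) -> str:
    """Return "pass" or "reject" for one read's quality scores."""
    low_count = sum(1 for q in quals if q < LOW_QUALITY_THRESHOLD)
    # Rule 1: more than half of the scores are below the threshold.
    if low_count * 2 > len(quals):
        return "reject"
    # Rule 2: the read begins or ends with a run of more than 10 low scores.
    if leading_low_run(quals) > MAX_TERMINAL_RUN:
        return "reject"
    if leading_low_run(list(reversed(quals))) > MAX_TERMINAL_RUN:
        return "reject"
    return "pass"
```

For example, a 50-base read whose first 11 scores are 10 and whose remaining scores are 30 would be rejected by the second rule, while the same read with only 10 leading low scores would pass. How an empty read or an exact tie is handled is a choice made in this sketch, not something the text above specifies.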
Original Submitted Files - files as submitted to SRA
All original submitted files are available via our Cloud Data Delivery Service. From the SRA Run Selector users can request that the submitted files for a run be delivered to their own AWS or GCP cloud bucket. This service is provided for both public and authorized access (dbGaP) data with no charge for delivery.
Frequently Asked Questions
Answers to Frequently Asked Questions: Data Format FAQ
Engage
NCBI wants your feedback on SRA in the Cloud. Contact sra@ncbi.nlm.nih.gov with questions or if you would like to provide input on new functionality.