GEO Accession viewer

Sample GSM1242325

Status

Public on Feb 28, 2014

Title

HeLa

Sample type

SRA

Source name

HeLa

Organism

Homo sapiens

Characteristics

cell line: HeLa
cell type: human cervical cancer cell line

Extracted molecule

total RNA

Extraction protocol

Total RNA was extracted from NIH3T3 or HeLa cells with TRIzol reagent (Invitrogen, 15596-018) and purified on an RNeasy MinElute column (Qiagen, 74204) according to the manufacturers' instructions.
The RNA was depleted of rRNA using the Ribo-zero kit (Epicentre, MRZH11124), ligated to the 3′ adapter and partially digested with RNase T1 (Ambion, AM2283). The fragmented RNAs were pulled down with streptavidin beads (Invitrogen, 11206D), phosphorylated in a PNK reaction (Takara, 2021B) and gel purified in the range of 500 – 1000 nucleotides. The purified RNAs were ligated to the 5′ adapter, reverse-transcribed (Invitrogen, 18080085) and amplified by PCR using Phusion DNA polymerase (Thermo, F-530L). PCR products were gel purified again. All adapters and primers were synthesized by IDT.

Library strategy

OTHER

Library source

transcriptomic

Library selection

other

Instrument model

Illumina HiSeq 2500

Description

A TAIL-seq sample from HeLa

Data processing

Library strategy: TAIL-seq
The base calls and signal intensities were acquired from the HiSeq 2500 after processing by Illumina RTA 1.17.21.3. The read 1 sequences were aligned to a set of common contaminants, composed of rDNA repeat units (GenBank accession BK000964.1 for NIH3T3 and U13369.1 for HeLa), the PhiX genome (GenBank accession J02482.1), Illumina TruSeq primer sequences, and all sequences for 5S and 5.8S rRNAs of the respective species (retrieved from Rfam 11.0 of the Wellcome Trust Sanger Institute), using GSNAP 2013-03-31 with a maximum of 5% mismatches allowed. Clusters with any match to the contaminants were removed from subsequent analyses. Sequences with completely identical nucleotides in the 21st to 35th cycles of read 1 (representative region of the insert) and the 1st to 15th cycles of read 2 (degenerate bases in the 3′ adapter) were deduplicated by keeping only the cluster with the maximum PHRED quality sum in read 1. The degenerate and fixed delimiter sequences of the 3′ adapter were clipped from read 2 by searching for a perfect match to the delimiter sequence (‘GTCAG’ in the direction of read 2) between the 14th and 16th cycles of read 2. Clusters missing a delimiter sequence or having low diversity in the degenerate region (at least two occurrences for all of A, C, G and T) were removed from further analyses.
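The deduplication and delimiter-clipping steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and the tuple layout are hypothetical, and the sketch assumes that "between the 14th and 16th cycles" means the delimiter may start at cycle 14, 15 or 16 of read 2.

```python
DELIMITER = "GTCAG"  # fixed delimiter, written in the direction of read 2 (from the text)


def dedup_key(read1, read2):
    """Cycles 21-35 of read 1 plus cycles 1-15 of read 2 (1-based, inclusive)."""
    return read1[20:35], read2[:15]


def deduplicate(clusters):
    """Keep, per key, the cluster with the largest read 1 PHRED quality sum.

    clusters: iterable of (read1, read2, read1_quals) tuples (hypothetical layout).
    """
    best = {}
    for read1, read2, quals in clusters:
        key = dedup_key(read1, read2)
        if key not in best or sum(quals) > sum(best[key][2]):
            best[key] = (read1, read2, quals)
    return list(best.values())


def clip_delimiter(read2):
    """Return (degenerate_bases, remainder) or None when no delimiter is found.

    Assumption: the delimiter may start at cycle 14, 15 or 16 (0-based 13..15).
    """
    for start in (13, 14, 15):
        if read2[start:start + len(DELIMITER)] == DELIMITER:
            return read2[:start], read2[start + len(DELIMITER):]
    return None
```

Clusters for which `clip_delimiter` returns `None` correspond to those described above as missing a delimiter sequence.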
The fluorescence signal intensities were processed into “Normalized T signal” as described in fig. S1A. The signals from a spike-in sample were cleaned with an outlier filter based on robust Mahalanobis distance (mvoutlier package 1.9.9; quan=0.5, alpha=0.025). For each spike-in, 500 randomly chosen clusters were used to calculate the parameters of a Gaussian mixture hidden Markov model (GMHMM). We trained the model using the Baum-Welch algorithm implemented in the GHMM library (http://ghmm.org) with the topology and initial parameters shown in fig. S1A and tables S7 and S8 (1,000 iterations). The procedure was iterated to maximize likelihood, without using any property of the spike-ins (e.g., the designed length of the poly(A) tail). Normalized T signals outside the range [-5, 5] were clipped into that range for both training and later calculations.
The lengths of poly(A) tails were first measured with the base call-based “Strategy II” described in fig. S1A. For clusters whose measured length was shorter than 8 nt, that length was taken as the final poly(A) tail length. For the others, normalized T signals starting from the first position of the T-stretch detected by Strategy II were analyzed with the GMHMM. The hidden states were decoded with the standard Viterbi algorithm implemented in the GHMM library. The number of cycles in state 1 or 2 was taken as the length of the poly(A) tail. To estimate performance, we applied the process to all spike-in samples, excluding the clusters used for fitting the model parameters.
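The two-stage length call described above can be sketched as follows. This is an illustration only: `decode_states` is a hypothetical callable standing in for Viterbi decoding with the trained GMHMM (the GHMM library itself is not called here), and the function and parameter names are assumptions.

```python
def clip_signal(signal, lo=-5.0, hi=5.0):
    """Clip normalized T signals into [-5, 5], as stated in the text."""
    return [min(max(x, lo), hi) for x in signal]


def call_tail_length(strategy2_length, signal, decode_states,
                     min_hmm_length=8, tail_states=(1, 2)):
    """Return the final poly(A) length for one cluster.

    strategy2_length: base call-based length from "Strategy II".
    signal: normalized T signals starting at the first T-stretch position.
    decode_states: hypothetical stand-in for Viterbi decoding; takes a list
        of signals and returns one hidden state per cycle.
    """
    if strategy2_length < min_hmm_length:
        # Short tails keep the base call-based measurement.
        return strategy2_length
    path = decode_states(clip_signal(signal))
    # The tail length is the number of cycles decoded as state 1 or 2.
    return sum(1 for state in path if state in tail_states)
```

Passing the decoder in as an argument keeps the decision rule separate from the model, so the same rule can be exercised with a stub decoder.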
The reads remaining after the contaminant filter and the first duplicate filter were then aligned to the genome sequences (UCSC mm10 for NIH3T3 and UCSC hg19 for HeLa; splice junction positions were taken from the UCSC Genome Browser database, version of Jan 24, 2013) using GSNAP 2013-03-31. Three different sets of genome alignments were used in this study. (1) R1 alignment: using only the full read 1 sequences, which are 51 nt long. This was used for identification of a cluster. (2) R2 short alignment: using only the 40 nt immediately next to the 3′ adapter in read 2. This was used in searching for poly(A)-free 3′ hydroxyl ends. (3) Paired alignment: using the full read 1 sequences and the part of read 2 remaining after trimming the degenerate bases and delimiter. We used this alignment set to filter out poly(A) stretches encoded in the genome. All alignments were performed with a maximum of 5% mismatches and a minimum mapping quality of 3. All multi-mapped reads were removed. Remaining PCR artifacts with a few mismatches were then removed using the R1 alignment together with the 15 degenerate bases inside the 3′ adapter region. To detect such artifacts, we clustered the R1 alignments with a maximum distance of 10 bp between mapped positions; within each of these clusters, reads were then clustered again by the degenerate bases from their read 2 using CD-HIT-EST 4.5.4 (word size=6, sequence identity=0.85). For each set of detected duplicates, we retained the read with the maximum sum of PHRED quality in read 1.
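The first, position-based clustering stage can be sketched as follows. The text does not state the linkage rule, so this sketch assumes single linkage over sorted positions: a read joins the current cluster when it maps within 10 bp of the previous read on the same chromosome and strand. The tuple layout and function name are hypothetical, and the second stage (CD-HIT-EST on the degenerate bases) is not reproduced.

```python
from itertools import groupby


def cluster_by_position(alignments, max_distance=10):
    """Group R1 alignments whose neighbouring mapped positions are close.

    alignments: list of (chrom, strand, pos, read_id) tuples (hypothetical layout).
    Returns a list of clusters, each a list of read_ids.
    """
    clusters = []
    ordered = sorted(alignments, key=lambda a: (a[0], a[1], a[2]))
    for _, group in groupby(ordered, key=lambda a: (a[0], a[1])):
        current, last_pos = [], None
        for _chrom, _strand, pos, read_id in group:
            # Start a new cluster when the gap to the previous read is too wide.
            if last_pos is not None and pos - last_pos > max_distance:
                clusters.append(current)
                current = []
            current.append(read_id)
            last_pos = pos
        clusters.append(current)
    return clusters
```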
For classification and transcript-level analyses, we compiled reference annotations for human and mouse using the NCBI RefSeq, RepeatMasker, gtRNAdb, Rfam and miRBase databases (the first three were downloaded from the UCSC Genome Browser on Apr 25, 2013; Rfam version 11; miRBase version 19). The R1 alignments were annotated by intersecting them with the compiled annotations using BEDTools (Quinlan, 2010). When multiple annotations overlapped an alignment, we chose a single class for statistics that require exclusive assignment of a genomic source type, using the following priority: miRNA, rRNA, tRNA, Mt-tRNA, snoRNA, scRNA, srpRNA, snRNA, lncRNA, RNA, ncRNA, misc_RNA, Cis-reg, ribozyme, RC, IRES, frameshift_element, LINE, SINE, Simple_repeat, Low_complexity, Satellite, DNA, LTR, CDS, 3′ UTR, 5′ UTR, intron, Other, Unknown (higher priority first). The transcript-level analyses were performed using our custom non-redundant RefSeq (nrRefSeq) transcript set, which is a reduced set retaining only the longest isoform or transcript when regions overlap with each other. The positions of read 1 in nrRefSeq transcripts were determined by BEDTools intersection between the genome alignments and the nrRefSeq annotation set, and then translated to transcript-level coordinates with in-house software.
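The priority rule for exclusive class assignment can be sketched as follows. The priority order is copied from the text; the function name is hypothetical, and the sketch assumes that every alignment arrives with at least one overlapping class label.

```python
# Priority order from the text, highest priority first.
PRIORITY = [
    "miRNA", "rRNA", "tRNA", "Mt-tRNA", "snoRNA", "scRNA", "srpRNA", "snRNA",
    "lncRNA", "RNA", "ncRNA", "misc_RNA", "Cis-reg", "ribozyme", "RC", "IRES",
    "frameshift_element", "LINE", "SINE", "Simple_repeat", "Low_complexity",
    "Satellite", "DNA", "LTR", "CDS", "3′ UTR", "5′ UTR", "intron", "Other",
    "Unknown",
]
RANK = {name: rank for rank, name in enumerate(PRIORITY)}


def assign_class(overlapping_classes):
    """Pick the highest-priority class among those overlapping one alignment.

    Raises ValueError for an empty input and KeyError for an unlisted class.
    """
    return min(overlapping_classes, key=lambda name: RANK[name])
```

For example, an alignment overlapping both an intron and a snoRNA annotation is counted as snoRNA.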
Because poly(A) tails were initially detected with the constraint that they must begin within the first 30 cycles, the maximum detectable 3′ end modification of poly(A) tails was limited to the last 30 nucleotides of the insert. To exclude A stretches obviously encoded in the genomic sequence (with or without 3′ end modifications), we masked detected poly(A) tail ranges with read 2 alignments so that the 3′-most position of the alignable (not clipped) region was eliminated from the poly(A) tail or its 3′ end modifications. All statistics regarding transcript-level modification rates were calculated for transcripts having more than 200 tags with poly(A) tails longer than 8 nt.
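The transcript inclusion rule in the last sentence can be sketched as follows. The input layout and names are hypothetical, and the sketch reads "more than 200" and "longer than 8 nt" as strict inequalities.

```python
from collections import Counter


def eligible_transcripts(tags, min_tail=8, min_tags=200):
    """Return transcripts with more than min_tags tags whose tail exceeds min_tail nt.

    tags: iterable of (transcript_id, tail_length) pairs (hypothetical layout).
    """
    counts = Counter(tx for tx, tail_length in tags if tail_length > min_tail)
    return {tx for tx, n in counts.items() if n > min_tags}
```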
Genome_build: hg19 for HeLa, mm10 for NIH3T3
Supplementary_files_format_and_content: The .csv files contain the poly(A) tail length distribution and 3' end modification frequencies next to poly(A) tails for all detected transcripts with >= 30 poly(A)+ tags

Submission date

Sep 30, 2013

Last update date

May 15, 2019

Contact name

Hyeshik Chang

E-mail(s)

hyeshik@snu.ac.kr

Organization name

Seoul National University

Department

School of Biological Sciences

Lab

Hyeshik Chang Lab

Street address

Building 203 Room 525, School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu

City

Seoul

State/province

South Korea

ZIP/Postal code

08826

Country

South Korea

Platform ID

GPL16791

Series (1)

GSE51299

The 3′ extremity of mRNA unveiled by TAIL-seq

Relations

BioSample

SAMN02370175

SRA

SRX361907

Supplementary file	Size	Download	File type/resource
GSM1242325_HeLa-polya.csv.gz	179.3 Kb	(ftp)(http)	CSV
SRA Run Selector
Raw data are available in SRA
Processed data provided as supplementary file