Sample GSM1242325
Status Public on Feb 28, 2014
Title HeLa
Sample type SRA
Source name HeLa
Organism Homo sapiens
Characteristics cell line: HeLa
cell type: human cervical cancer cell line
Extracted molecule total RNA
Extraction protocol Total RNA was extracted from NIH3T3 or HeLa cells with TRIzol reagent (Invitrogen, 15596-018) and purified on an RNeasy MinElute column (Qiagen, 74204) according to the manufacturers' instructions.
The RNA was depleted of rRNA using the Ribo-zero kit (Epicentre, MRZH11124), ligated to the 3′ adapter, and partially digested with RNase T1 (Ambion, AM2283). The fragmented RNAs were pulled down with streptavidin beads (Invitrogen, 11206D), phosphorylated in a PNK reaction (Takara, 2021B), and gel purified in the range of 500 to 1000 nucleotides. The purified RNAs were ligated to the 5′ adapter, reverse transcribed (Invitrogen, 18080085), and amplified by PCR using Phusion DNA polymerase (Thermo, F-530L). The PCR products were gel purified again. All adapters and primers were synthesized by IDT.
Library strategy OTHER
Library source transcriptomic
Library selection other
Instrument model Illumina HiSeq 2500
Description A TAIL-seq sample from HeLa
Data processing Library strategy: TAIL-seq
The base calls and signal intensities were acquired from the HiSeq 2500 after processing by Illumina RTA. The read 1 sequences were aligned to a common contaminant set composed of rDNA repeat units (GenBank accession BK000964.1 for NIH3T3 and U13369.1 for HeLa), the PhiX genome (GenBank accession J02482.1), Illumina TruSeq primer sequences, and all 5S and 5.8S rRNA sequences of the respective species (retrieved from Rfam 11.0 of the Wellcome Trust Sanger Institute), using GSNAP 2013-03-31 with a maximum of 5% mismatches allowed. Clusters with any match to the contaminants were removed from subsequent analyses. Clusters with completely identical nucleotides in the 21st to 35th cycles of read 1 (a representative region of the insert) and in the 1st to 15th cycles of read 2 (the degenerate bases in the 3′ adapter) were deduplicated by keeping only the cluster with the maximum PHRED quality sum in read 1. The degenerate and fixed delimiter sequences of the 3′ adapter were clipped from read 2 by searching for a perfect match to the delimiter sequence (‘GTCAG’ in the direction of read 2) between the 14th and 16th cycles of read 2. Clusters that lacked a delimiter sequence or had low diversity in the degenerate region (the criterion being at least two occurrences of each of A, C, G and T) were removed from further analyses.
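The delimiter clipping and diversity check described above can be sketched roughly as follows. This is a minimal illustration, not the submitters' code: the function names are hypothetical, and it assumes that "between the 14th and 16th cycles" means the first base of the delimiter falls on 1-based cycle 14, 15 or 16, and that a cluster passes the diversity check when each of A, C, G and T occurs at least twice in the clipped degenerate region.

```python
DELIMITER = "GTCAG"  # delimiter as read in the direction of read 2 (from the text)


def clip_read2(read2, delimiter=DELIMITER, first_cycle=14, last_cycle=16):
    """Split read 2 into (degenerate_bases, remainder) at the delimiter.

    Assumption: the delimiter must start, as a perfect match, at a 1-based
    cycle between first_cycle and last_cycle. Returns None if it is absent.
    """
    for cycle in range(first_cycle, last_cycle + 1):
        start = cycle - 1  # convert 1-based cycle to 0-based index
        if read2[start:start + len(delimiter)] == delimiter:
            return read2[:start], read2[start + len(delimiter):]
    return None


def has_enough_diversity(degenerate, min_count=2):
    """Assumed reading of the criterion: every base occurs at least min_count times."""
    return all(degenerate.count(base) >= min_count for base in "ACGT")


if __name__ == "__main__":
    read2 = "ACGTACGTACGTACG" + "GTCAG" + "TTTTTTTT"
    print(clip_read2(read2))
    print(has_enough_diversity("ACGTACGTACGTACG"))
```

A cluster would be discarded when `clip_read2` returns `None` or when `has_enough_diversity` is false for its degenerate bases.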
The fluorescence signal intensities were processed into a “Normalized T signal” as described in fig. S1A. The signals from a spike-in sample were filtered for outliers based on robust Mahalanobis distance (mvoutlier package 1.9.9; quan=0.5, alpha=0.025). For each spike-in, 500 randomly chosen clusters were used to estimate the parameters of a Gaussian mixture hidden Markov model (GMHMM). We trained the model with the Baum-Welch algorithm implemented in the GHMM library, using the topology and initial parameters shown in fig. S1A and tables S7 and S8 (1,000 iterations). The procedure was iterated to maximize the likelihood without using any property of the spike-ins (e.g., the designed length of the poly(A) tail). Normalized T signals outside the range [-5, 5] were clipped into that range for both training and later calculations.
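As a small illustration of the final clipping step only, the sketch below limits a list of normalized T signal values to [-5, 5]. The function name is hypothetical, and the computation of the normalized T signal itself (fig. S1A) is not reproduced here.

```python
def clip_signal(values, low=-5.0, high=5.0):
    """Clip each normalized T signal value into the closed range [low, high]."""
    return [min(max(v, low), high) for v in values]


if __name__ == "__main__":
    print(clip_signal([-7.2, -5, 0.3, 5, 9.1]))
```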
The lengths of poly(A) tails were first measured with the base call-based “Strategy II” described in fig. S1A. For clusters whose measured length was shorter than 8 nt, that length was taken as the final poly(A) tail length. For the others, normalized T signals starting from the first position of the T-stretch detected by Strategy II were analyzed with the GMHMM. The hidden states were decoded with the standard Viterbi algorithm implemented in the GHMM library. The number of cycles assigned to state 1 or 2 was taken as the poly(A) tail length. To estimate performance, we applied the process to all spike-in samples, excluding the clusters used for fitting the model parameters.
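The two-branch length call described above can be sketched as follows. This is only an illustration under stated assumptions: `call_polya_length` is a hypothetical name, `decoded_states` stands for a Viterbi path that has already been obtained from the GMHMM (one integer state per cycle, starting at the first T-stretch position), and no GHMM library calls are shown.

```python
def call_polya_length(strategy2_length, decoded_states, short_cutoff=8,
                      polya_states=(1, 2)):
    """Return the final poly(A) tail length for one cluster.

    strategy2_length: length measured from base calls ("Strategy II").
    decoded_states:   hypothetical list of decoded hidden states per cycle.
    """
    # Short tails: the base call-based measurement is used directly.
    if strategy2_length < short_cutoff:
        return strategy2_length
    # Otherwise: count the cycles decoded as state 1 or state 2.
    return sum(1 for state in decoded_states if state in polya_states)


if __name__ == "__main__":
    print(call_polya_length(5, []))
    print(call_polya_length(30, [1, 1, 2, 2, 2, 3, 3, 4]))
```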
The reads remaining after the contaminant filter and the first duplication filter were then aligned to the genome sequences (UCSC mm10 for NIH3T3 and UCSC hg19 for HeLa; positions of splice junctions were taken from the Jan 24, 2013 version of the UCSC Genome Browser database) using GSNAP 2013-03-31. Three different alignments to the genome were used in this study. (1) R1 alignment: uses only the full read 1 sequences, which are 51 nt long. This was used for identification of a cluster. (2) R2 short alignment: uses only the 40 nt immediately adjacent to the 3′ adapter in read 2. This was used in searching for poly(A)-free 3′ hydroxyl ends. (3) Paired alignment: uses the full read 1 sequences and the part of the read 2 sequences left after trimming the degenerate bases and delimiter. We used this alignment set to filter out poly(A) stretches encoded in the genome. All alignments were performed with a maximum of 5% mismatches and a minimum mapping quality of 3. All multi-mapped reads were removed. The remaining PCR artifacts with few mismatches were removed using the R1 alignment together with the 15 degenerate bases inside the 3′ adapter region. To detect such artifacts, we clustered the R1 alignments with a maximum distance of 10 bp between mapped positions; within each of these clusters, reads were then clustered again by the degenerate bases from their respective read 2 using CD-HIT-EST 4.5.4 (word size=6, sequence identity=0.85). For each set of detected duplicates, we kept the read with the maximum PHRED quality sum in read 1.
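The position-based grouping and the choice of a representative read can be sketched as below. This is a simplified illustration, not the pipeline itself: it assumes that "maximum distance between mapped positions of 10 bp" means neighbouring positions in a sorted list are at most 10 bp apart, it assumes Phred+33 quality strings, it omits the CD-HIT-EST sub-clustering of degenerate bases, and all function names are hypothetical.

```python
def cluster_by_position(positions, max_gap=10):
    """Group mapped positions so that sorted neighbours are <= max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # extend the current cluster
        else:
            clusters.append([pos])     # start a new cluster
    return clusters


def phred_sum(quality, offset=33):
    """Sum of PHRED scores of a quality string (Phred+33 is an assumption)."""
    return sum(ord(ch) - offset for ch in quality)


def pick_representative(reads):
    """reads: list of (read_id, read1_quality). Keep the highest quality sum."""
    return max(reads, key=lambda read: phred_sum(read[1]))[0]


if __name__ == "__main__":
    print(cluster_by_position([100, 105, 130, 112, 300]))
    print(pick_representative([("a", "II"), ("b", "I#"), ("c", "!!")]))
```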
For classification and transcript-level analyses, we compiled reference annotations for human and mouse from the NCBI RefSeq, RepeatMasker, gtRNAdb, Rfam and miRBase databases (the first three were downloaded from the UCSC Genome Browser on Apr 25, 2013; Rfam version 11; miRBase version 19). The R1 alignments were annotated by intersecting them with the compiled annotations using BEDTools (Quinlan, 2010). When multiple annotations overlapped an alignment, and the statistics required exclusive assignment of a genomic source type, we chose a class by the following priority: miRNA, rRNA, tRNA, Mt-tRNA, snoRNA, scRNA, srpRNA, snRNA, lncRNA, RNA, ncRNA, misc_RNA, Cis-reg, ribozyme, RC, IRES, frameshift_element, LINE, SINE, Simple_repeat, Low_complexity, Satellite, DNA, LTR, CDS, 3′ UTR, 5′ UTR, intron, Other, Unknown (higher priority first). The transcript-level analyses were performed using our custom non-redundant RefSeq (nrRefSeq) transcript set, a reduced set that retains only the longest isoform or transcript when regions overlap each other. The positions of read 1 in nrRefSeq transcripts were located by BEDTools intersection between the genome alignments and the nrRefSeq annotation set, and then translated to transcript-level coordinates with in-house software.
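The exclusive class assignment amounts to picking the overlapping label that comes first in the priority list. A minimal sketch, assuming the overlapping annotation labels for one alignment are already available as a collection of strings (the function name `choose_class` is hypothetical; the priority order is copied from the text):

```python
PRIORITY = [
    "miRNA", "rRNA", "tRNA", "Mt-tRNA", "snoRNA", "scRNA", "srpRNA", "snRNA",
    "lncRNA", "RNA", "ncRNA", "misc_RNA", "Cis-reg", "ribozyme", "RC", "IRES",
    "frameshift_element", "LINE", "SINE", "Simple_repeat", "Low_complexity",
    "Satellite", "DNA", "LTR", "CDS", "3′ UTR", "5′ UTR", "intron", "Other",
    "Unknown",
]
RANK = {label: rank for rank, label in enumerate(PRIORITY)}


def choose_class(labels):
    """Return the highest-priority label among the overlapping annotations.

    Raises KeyError for a label outside the priority list and ValueError
    for an empty collection; how such cases were handled is not stated
    in the source.
    """
    return min(labels, key=lambda label: RANK[label])


if __name__ == "__main__":
    print(choose_class({"intron", "SINE", "CDS"}))
```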
Because poly(A) tails were initially detected under the constraint that they must begin within the first 30 cycles, the maximum detectable 3′ end modification of poly(A) tails was limited to the last 30 nucleotides of the insert. To exclude A stretches clearly encoded in the genomic sequence (with or without 3′ end modifications), we masked the detected poly(A) tail ranges with the read 2 alignments so that the 3′-most alignable (not clipped) position was excluded from the poly(A) tail or its 3′ end modifications. All statistics on transcript-level modification rates were calculated for transcripts having more than 200 tags with poly(A) tails longer than 8 nt.
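The transcript inclusion threshold in the last sentence can be sketched as a simple count-and-filter step. This is an illustration only: the input layout (pairs of transcript ID and poly(A) tail length, one per tag) and the function name are assumptions, and both thresholds are read as strict inequalities ("more than 200", "longer than 8 nt").

```python
from collections import Counter


def eligible_transcripts(tags, min_tags=200, min_tail=8):
    """Return sorted transcript IDs with more than min_tags tags whose
    poly(A) tail is longer than min_tail nt.

    tags: iterable of (transcript_id, polya_length) pairs (hypothetical layout).
    """
    counts = Counter(tid for tid, length in tags if length > min_tail)
    return sorted(tid for tid, n in counts.items() if n > min_tags)


if __name__ == "__main__":
    tags = [("A", 20)] * 201 + [("B", 20)] * 200 + [("C", 8)] * 500
    print(eligible_transcripts(tags))
```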
Genome_build: hg19 for HeLa, mm10 for NIH3T3
Supplementary_files_format_and_content: The .csv files contain the poly(A) tail length distribution and 3' end modification frequencies next to poly(A) tails for all detected transcripts with >= 30 poly(A)+ tags
Submission date Sep 30, 2013
Last update date May 15, 2019
Contact name Hyeshik Chang
Organization name Seoul National University
Department School of Biological Sciences
Lab Hyeshik Chang Lab
Street address Building 203 Room 525, School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu
City Seoul
State/province South Korea
ZIP/Postal code 08826
Country South Korea
Platform ID GPL16791
Series (1)
GSE51299 The 3′ extremity of mRNA unveiled by TAIL-seq
BioSample SAMN02370175
SRA SRX361907

Supplementary file Size Download File type/resource
GSM1242325_HeLa-polya.csv.gz 179.3 Kb (ftp)(http) CSV
Raw data are available in SRA
Processed data provided as supplementary file
