

Mattick J, Amaral P. RNA, the Epicenter of Genetic Information: A new understanding of molecular biology. Abingdon (UK): CRC Press; 2022 Sep 20. doi: 10.1201/9781003109242-13




Chapter 13. Large RNAs with Many Functions

High-throughput RNA sequencing projects after the turn of the century revealed the existence of large numbers of low abundance long, often multi-exonic, RNAs in animals and plants that have no protein-coding potential, termed ‘lncRNAs’ (long non-coding RNAs), expressed intronically, ‘intergenically’ and antisense to protein-coding genes, as well as from thousands of pseudogenes and the 3′UTRs of mRNAs. The data also showed that most of the genome in eukaryotes is transcribed in highly complex interlacing and overlapping patterns, substantially from both DNA strands, challenging the conception of genes as discrete entities. Although initially suspected to be noise, lncRNAs were found to be dynamically expressed during differentiation and development, mostly in highly cell type-specific patterns - far more so than protein-coding RNAs - and to be associated with subnuclear and cytoplasmic organelles, chromatin-modifying proteins and/or chromatin domains. The genetic signatures of sequence variation in lncRNAs are subtler than those of protein-coding genes, but many have been found to be involved in the etiology of cancer and developmental, autoimmune, neurodegenerative and neuropsychiatric disorders. Large numbers of lncRNAs have been shown to have biological functions in, for example, DNA damage repair, cell fate determination and reprogramming, mesoderm and endoderm differentiation, retinal, skeletal, muscle and brain development, memory and behavior, neuronal differentiation, hematopoietic and immunological differentiation, inflammation and hormone production, among many others.

Following fast on the heels of the genome projects came high-throughput RNA sequencing projects, which, notwithstanding the excitement around the RNA interference pathway and small RNA control of gene expression, marked the turning point from regarding non-coding RNAs as ancillary contributors to mainstream players in cell and developmental biology. 1 , 2

The technique of cloning end fragments of RNAs as ‘Expressed Sequence Tags’ (ESTs) was developed and popularized in the 1990s to identify protein-coding genes a and their spliced isoforms. 7 , 8 Because mRNAs typically comprise only ~3% of the total RNA in cells (rRNAs comprise ~95%) and it was assumed that polyadenylated RNAs are mRNAs, b oligo(dT) hybridization was used to purify these RNAs and to prime reverse transcription from their 3′ ends.

Unexpectedly, the large-scale cDNA cloning efforts yielded many sequences that lacked protein-coding capacity, 13 , 14 which were initially suspected to be degradation products or DNA contamination during library preparation. This led to alternative methods to ‘filter’ for protein-coding genes, such as by evolutionary conservation. 15 The existence of long 3′UTRs in mRNAs also presented difficulties, as the cloned ESTs often did not extend into the upstream protein-coding sequences. c

To circumvent the latter problem, strategies were developed to increase the representation of internal exons of transcripts by priming reverse transcription with random oligonucleotides. 18 This approach was used in the Human Cancer Genome Project to generate nearly one million ‘Open Reading Frame’ ESTs (‘ORESTES’) from cancers and normal tissues. 18–21 Many novel RNAs were identified, leading to the suggestion in 2000 that 36,000 human genes was likely a “significant underestimate”, even after sequences that did not correspond to predicted (protein-coding) genes were excluded. 19 It was later shown that approximately half of the ORESTES were differentially expressed intronic or intergenic RNAs. 13

The confounding problem was the wide dynamic range of gene expression, both within and between cells in heterogeneous tissues. 22 Of the ~500,000 polyadenylated RNA molecules estimated to exist in a human cell, at least half are accounted for by a small number of highly expressed mRNAs. 23 Over 95% of the RNA species are expressed at low levels: in a tissue such as rat liver, for example, ~10 species are present at ~10,000 copies, 500 at ~200 copies, and 15,000 at ~10 copies or less per cell. 24 The expression of the small fraction (3%) of ‘housekeeping’ genes is ~30-fold higher than all other transcripts, 25 so the latter can only be reliably observed at high sequencing depth or with enrichment strategies. 22
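The rat liver abundance classes quoted above lend themselves to a quick back-of-the-envelope calculation. The sketch below uses only those approximate figures, treating "~10 copies or less" as exactly 10 (an upper-bound assumption). It illustrates how the low-abundance class can make up nearly all RNA species yet fewer than half of the molecules. The function names, and the simplification that sequencing reads sample molecules uniformly (ignoring transcript length and library biases), are illustrative assumptions and not part of the original studies.

```python
# Approximate abundance classes for rat liver, as quoted in the text:
# (label, number of RNA species, copies per cell).
# "~10 copies or less" is treated as 10, so the low class is an upper bound.
ABUNDANCE_CLASSES = [
    ("high", 10, 10_000),
    ("medium", 500, 200),
    ("low", 15_000, 10),
]


def summarize(classes):
    """Return per-class species and molecule fractions plus the totals."""
    total_species = sum(n for _, n, _ in classes)
    total_molecules = sum(n * c for _, n, c in classes)
    summary = {}
    for label, n, c in classes:
        summary[label] = {
            "species_fraction": n / total_species,
            "molecule_fraction": n * c / total_molecules,
        }
    return summary, total_species, total_molecules


def expected_reads(copies_per_cell, total_molecules, depth):
    """Expected reads for one species, assuming uniform sampling of molecules."""
    return depth * copies_per_cell / total_molecules


summary, total_species, total_molecules = summarize(ABUNDANCE_CLASSES)

for label, stats in summary.items():
    print(f"{label:>6}: {stats['species_fraction']:.1%} of species, "
          f"{stats['molecule_fraction']:.1%} of molecules")

# Hypothetical depth of one million reads (an assumption for illustration).
print(expected_reads(10, total_molecules, 1_000_000))      # a low-class species
print(expected_reads(10_000, total_molecules, 1_000_000))  # a high-class species
```

Under these assumptions a low-class species receives a thousandfold fewer reads than a high-class one, which is why rare transcripts appear fragmentary unless depth is increased or abundant RNAs are depleted.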

Consequently, studies aiming to comprehensively survey the transcriptome used hybridization subtraction methods to deplete abundant RNAs and improve the representation of cell- and tissue-specific transcripts. 26 , 27 To reduce interpretation problems, a method termed ‘Cap Analysis of Gene Expression’ (CAGE) was developed by Piero Carninci, Yoshihide Hayashizaki and colleagues using biotinylation of terminal nucleotides to capture ‘full-length’ RNAP II transcripts d containing both the 5′ end and the 3′ polyA tail. 14 , 27 , 29–32

Pervasive Transcription

In the early 2000s, Hayashizaki, Carninci and colleagues established the influential FANTOM e projects, 33 , 34 which undertook large-scale sequencing of normalized full-length cDNA libraries constructed from a wide variety of mouse cell types, tissues and developmental stages, with the objective of characterizing the proteome. However, the unexpected and ultimately headline result was the identification of over 34,000 long “mRNA-like” (5′ capped and 3′ polyadenylated) transcripts often emanating from ‘intergenic’ regions, many of which were spliced and differentially expressed, but did not appear to code for proteins. 14 , 34–37 At least 70% of protein-coding loci were also found to express overlapping antisense RNAs, some of which were shown to have regulatory function 38 , 39 and to be conserved over large evolutionary distances in the vertebrates. 40 Widespread tissue-specific sense-antisense and intronic transcription was also observed across the spectrum from yeast to humans. 41–52

Similar findings were reported using the orthogonal method of high-density genome tiling arrays f independently by Tom Gingeras, Mike Snyder and colleagues, who showed that transcribed sequences covered far more of the human genome g than predicted from protein-coding gene annotations, 28 , 54–57 with pervasive transcription of coding and non-coding regions in embryonic stem cells, becoming more restricted as differentiation proceeds. 58 They also showed that almost half of the transcripts in human cells are not polyadenylated, 28 confirming largely forgotten reports from the 1970s and 1980s. 10–12 The non-polyadenylated RNAs are derived from repeats and introns, 28 , 59 often transcribed at high levels by RNA polymerase III h , 69–71 or from processing of RNAPII transcripts. 72–75 Moreover, the total mass of the non-ribosomal, non-protein-coding RNAs in human cells and brains was found to exceed that of the mRNAs. 76 , 77

While controversial at first, these findings were confirmed by other studies in animals, plants and fungi using a variety of techniques, including additional large-scale cloning and sequencing of cDNAs, 57 , 72 , 78–81 serial analysis of gene expression and massively parallel signature sequencing 82–85 and microarrays and genome-wide tiling arrays that probe the expression of non-coding regions. 13 , 37 , 46 , 52 , 55 , 56 , 78 , 86–89 Over 85% of the Drosophila genome was found to be dynamically expressed during the first 24 h of embryonic development, 49 , 90–95 over 70% of the C. elegans genome to be transcribed in mixed-stage populations 88 and over 85% of the yeast genome to be expressed in rich media. 87 A subsequent intensive survey of 1% of the human genome by the ENCODE i Consortium showed that at least 93% of the nucleotides in the studied regions are transcribed in one or more of the 11 cell lines analyzed and that most of the unannotated transcripts are expressed in just one or a few. 96
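Figures such as "at least 93% of the nucleotides in the studied regions are transcribed in one or more of the 11 cell lines" rest on taking the union of transcribed intervals across samples. A minimal sketch of that kind of calculation follows, assuming half-open (start, end) coordinates and entirely made-up toy intervals. It is not the ENCODE pipeline, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval if this one overlaps or abuts it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def transcribed_fraction(per_sample_intervals, region_start, region_end):
    """Fraction of a region covered by a transcript in at least one sample."""
    pooled = [iv for sample in per_sample_intervals for iv in sample]
    covered = 0
    for start, end in merge_intervals(pooled):
        # Clip each merged interval to the region of interest.
        start, end = max(start, region_start), min(end, region_end)
        if end > start:
            covered += end - start
    return covered / (region_end - region_start)


# Toy data: three hypothetical cell lines over a 1,000-nt region.
samples = [
    [(0, 300), (700, 800)],
    [(250, 500)],
    [(780, 950)],
]
fraction = transcribed_fraction(samples, 0, 1000)
print(f"{fraction:.0%} of the region is transcribed in at least one sample")
```

Because the union can only grow as samples are added, the estimate depends on how many cell types and stages are surveyed, consistent with the observation that most unannotated transcripts are expressed in just one or a few cell lines.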

At the same time, we and others showed that the expression of unannotated mammalian long non-coding RNAs (lncRNAs) j is highly dynamic and tissue-specific (details below), being most extensive in brain and testis. 37 , 98 LncRNAs are also differentially expressed during development in other organisms. 99 Some of the non-coding transcripts were found to be huge, tens and sometimes hundreds of kilobases in length (‘macroRNAs’), 100–102 a well-characterized example being Air (108 kb) from the imprinted Igf2r locus (Chapter 9). 100 , 103 , 104 A more recent pan-transcriptome analysis reported the discovery of thousands of novel RNAs, including a previously poorly cataloged class of non-polyadenylated single-exon lncRNAs. 105

Widespread antisense transcription is also observed in bacteria 106 and viruses. 107 , 108 Indeed, global transcriptome analyses showed that the vast majority of all nucleotides in all genomes from viruses to humans are transcribed from one or both strands at some point in their life cycle. 109

The Amazing Complexity of the Transcriptome

Sequencing of expressed RNAs and RACE k -tiling arrays also revealed that most transcripts in mammals and insects have alternative transcription start and termination sites, the former often initiating hundreds of kilobases upstream of the previously annotated gene starting point(s) and spanning other genes in between. 57 , 91 , 110 , 111 Approximately 250,000 transcriptional start sites in mammals reside within transposon or retroviral l derived sequences, which account for up to 30% of the transcribed loci, produce 5′ capped RNAs that are generally tissue-specific, and frequently function as alternative promoters of protein-coding genes and/or express non-coding RNAs. 113

Extraordinary complexity of tissue- and lineage-specific alternative splicing was also observed, 3 , 114–118 particularly in lncRNAs. 4 , 119 LncRNAs appear to be less efficiently spliced than mRNAs, a feature that is correlated with heightened alternative over constitutive splicing, 120–122 possibly related to chromatin retention. 123 LncRNAs are enriched in transposon-derived elements and other repeat sequences, 99 , 124–127 which is likely related to their functional modularity (Chapter 16). Thousands of ‘pseudogenes’ are also transcribed, 128–131 as are developmental ‘enhancers’, whose numbers far outweigh those of protein-coding genes 132–142 (Chapter 14).

Transcriptome analyses progressively revealed the existence (and, in most cases, drastically expanded the repertoire) of other types of transcripts, 143 including various ‘classes’ of promoter-associated RNAs, 144–148 3′UTRs (Chapter 9) and regulatory RNAs originating from intergenic spacers between rRNA genes (in which promoters and transcripts had been known for decades – see references in 149–151), as well as other classes such as circular lncRNAs (circRNAs) 152–154 (see below) and intron-derived lncRNAs with “snoRNA-ends” (sno-lncRNAs). 155 Intron retention was also found to be common in plants and animals, where it is used to control cell differentiation. 46 , 59 , 156–163 Surprisingly, even RNAs modified with N-glycans and a range of other RNAs are displayed on cell surfaces, apparently cell type specifically. 164 , 165

The picture that emerged is that eukaryotic genomes express a semi-continuum of interlacing and overlapping coding and non-coding transcripts from both DNA strands, m especially in animals 14 , 45 , 57 , 171–175 (Figure 13.1). There are genes encoding proteins, snoRNAs, miRNAs and other small non-coding RNAs located within other genes encoding proteins and lncRNAs, with unclear boundaries, often in nested chains where three or more transcripts overlap in “complex loci”, and the landscape of expressed transcripts, as well as their promoters, splicing patterns and termination points, are different in different cells and tissues. 45 Indeed, the true extent of the repertoire of RNA expression is still unknown, given that most analyses to date have been carried out in cultured cells and do not capture the fine scale transcriptomes of diverse cells during the ontogeny of differentiation and development, although this is rapidly changing with the ubiquity of RNA sequencing, including increasingly powerful single-cell sequencing analyses of a variety of organisms, tissues and conditions (see below and following chapters).

Figure 13.1. Graphical representation of the complexity of the transcriptional landscape in mammals. (Reproduced from Morris and Mattick.)

Collectively, these observations were revolutionary in their implications. They challenged both the equivalence of genes with proteins n and the notion of ‘genes’ as discrete entities. 72 , 174 , 178 They also suggested the opposite of what had been long thought, i.e., that the genomes of humans and other complex organisms are information dense, not information sparse 179 (Chapter 7). The genome could no longer be envisaged as a linear array of protein-coding genes and associated cis-regulatory sequences, with some infrastructural and idiosyncratic non-coding RNAs. Rather, genome biology had to be reimagined as a highly dynamic continuum of coding and non-coding transcription, 174 , 178 , 180 the latter becoming more extensive as developmental complexity increases. 109 Moreover, almost every gene is overlapped by an exon or intron of another gene expressed in some cell type, and any given sequence can be intronic, exonic or intergenic, depending on the expression state of the cells.

Protein-Coding or Noise?

Unsurprisingly, these findings were initially met with skepticism. They indicated a massive hidden layer of RNAs of unknown function that had not been countenanced by the existing models of gene regulation, although the precedents had been there for decades, especially in the homeotic loci controlling organism development (Chapters 5 and 9). The protein-centric conception of genetic information and gene regulation could accommodate a few idiosyncratic regulatory RNAs, and post-transcriptional control of gene expression by miRNAs, but not tens of thousands of non-coding RNAs. No wonder there were reservations.

One difficulty was discriminating coding from non-coding transcripts, 181 leading to the suggestion that some or many of the newly identified transcripts might contain short open reading frames that had fallen below the radar of the genome annotations, which had generally used a minimum open reading frame of 100 codons. o Subsequent studies using ORF conservation, proteomic analyses and ribosomal profiling showed that, while there may be hundreds of unrecognized short proteins, including hormones and small peptides encoded in some RNAs annotated as non-coding, 182–197 the vast majority of lncRNAs exhibit no evidence of protein-coding capacity. 185 , 186 , 198
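The effect of a 100-codon annotation cutoff can be illustrated with a simple open reading frame scan. The sketch below is a deliberately naive illustration, not the method used by any genome annotation pipeline: it scans only the three forward frames, counts an ORF from the first ATG to the codon before an in-frame stop, and ignores ORFs that run off the end of the sequence. The function names and toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq):
    """Length in codons (ATG through the last sense codon, stop excluded)
    of the longest stop-terminated ORF in the three forward frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


def passes_orf_cutoff(seq, min_codons=100):
    """True if the longest forward ORF meets the minimum codon threshold."""
    return longest_orf_codons(seq) >= min_codons


# Toy transcripts: a 30-codon ORF (a 'micropeptide'-sized reading frame)
# and a 100-codon ORF, each followed by a stop codon.
short_orf = "ATG" + "GCT" * 29 + "TAA"
long_orf = "ATG" + "GCT" * 99 + "TAA"

print(longest_orf_codons(short_orf), passes_orf_cutoff(short_orf))
print(longest_orf_codons(long_orf), passes_orf_cutoff(long_orf))
```

A length threshold of this kind would label the 30-codon transcript as non-coding regardless of whether it is translated, which is why conservation, proteomic and ribosome profiling evidence was needed to settle the question.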

The dichotomy may itself be false: at least some RNAs have dual function as both coding and regulatory RNAs 133 , 181 , 189 , 194 , 199–210 and mRNAs appear to play a role in cellular organization. 211 , 212 Some protein-coding loci also generate miRNAs or lncRNAs p by alternative splicing, 209 , 213–221 and there is “a non-negligible fraction of protein-coding genes (where) the major transcript does not code a protein”. 222 Some regulatory lncRNAs have evolved from protein-coding ancestors and pseudogenes. 128 , 131 , 223–225 Some genes that express lncRNAs with enhancer activity also produce miRNAs 226 , 227 and micropeptides, 189 , 190 , 208 , 228 , 229 further examples of parallel outputs from complex loci. Moreover, lncRNAs have many alternative splice isoforms 4 (Chapter 16), some of which have been shown to have different functions. 230–235

Skepticism of the significance of non-coding transcription took many forms, including speculation that the unannotated RNAs are technical artifacts, genomic DNA contamination or have dispensable functions. 179 , 236 , 237 The most common reaction to these findings, however, was to assert that the bulk of the observed transcription, although dynamic and often cell- or tissue-specific, is ‘noise’: most non-coding transcripts were detected at low levels, many are or appeared to be comprised of just one exon, and many seemingly ‘random’ fragments abounded in initial RNA sequencing datasets, with differential expression sometimes attributed to variations in chromatin accessibility. 238–241

The concept of transcriptional noise was introduced in the 1990s by the observation of the cellular heterogeneity and stochastic fluctuations in the firing of known promoters in bacteria and yeast, not spurious transcription from illegitimate initiation sites. 242–246 Nonetheless it was seized upon, and conflated with ‘neutral evolution’, 247 leading to debates about the functionality or otherwise of the plethora of non-coding transcripts in eukaryotic cells, 239 , 248–251 reprising previous discussions about the functionality of introns, transposon-derived sequences and pseudogenes.

The Restricted Expression of Long Non-coding RNAs

It transpired that the low-level and fragmentary signals from lncRNAs in sequencing datasets are mainly a consequence of their highly developmental stage-specific expression, exacerbated by insufficient sequencing depth, q especially in complex tissues. 22 , 248 , 252 , 253

The expectation that high expression levels reflect functionality is based on the prevalence of protein-coding RNAs, which are, on average, more highly expressed than regulatory RNAs, although there are exceptions. 254 , 255 Indeed, mRNAs encoding regulatory proteins such as transcription factors are usually expressed at lower levels than those encoding structural or metabolic proteins, and have shorter half-lives. 256 , 257

Regulatory RNAs would likewise require relatively low average expression levels, i.e., more localized expression, and more dynamic control. 254 , 258 Examples, among many others, include functionally validated chromatin-associated lncRNAs detected on average in less than ten copies per cell in populations. 259–265 Single-molecule RNA FISH revealed that the localized TERT (telomerase reverse transcriptase) pre-mRNA occurs in 9–10 copies per cell and is only spliced during mitosis. 163 , 266 Similar low expression levels are also observed for a number of functionally well validated regulatory RNAs, 179 including XIST (Chapter 16).

Many smaller “cryptic unstable transcripts” expressed antisense from promoters and from intra- and intergenic regions are rapidly degraded by RNA turnover and surveillance r pathways, 47 , 286–288 which were thought to be a quality control mechanism to limit “inappropriate expression”, 286 but were subsequently found to regulate promoter and enhancer function (Chapter 16). 289–297

Similarly, other transcripts associated with promoters were detected only after depletion of components of RNA-degrading ‘exosomes’, including ‘promoter-associated RNAs’ (PASRs) of ~250–500 nt identified in yeast and human cells, 111 , 144 and ‘promoter upstream transcripts’ (PROMPTs) of ~0.5–2.5 kb, identified in human cells in both sense and antisense orientations upstream of the transcription start sites of expressed genes. 145 Exosomes have also been shown to control the levels of “a vast number” of lncRNAs with enhancer activity in B cells and pluripotent embryonic stem cells. 298

Analysis of in situ hybridization patterns of over 1,000 lncRNAs showed that a high proportion exhibit precise expression patterns in the brain/central nervous system, easily detected in highly specific and localized populations of cells in the striatum, retina, hippocampus, cerebellum, olfactory bulb and cortical layers, among others. 299–302 Complex region- or cell-specific and developmentally transient expression patterns of ‘intergenic’ and antisense lncRNAs have also been observed in fish, 303 Drosophila, 304 , 305 honey bees 306–308 and other multicellular and unicellular eukaryotes 99 (Figure 13.2). Such patterns have also been observed in globin 133 , 309–311 and interleukin loci, 312 , 313 and are now being extended to others by detailed examination of more tissues and developmental time points, aided by the advent of single-cell sequencing. 314–318

Figure 13.2. Reflective in situ hybridization patterns of the expression of the protein-coding gene ultrabithorax (Ubx) and its overlapping antisense RNA in embryos of the centipede Strigamia maritima at mid- and late-segmentation stages. (Reproduced from Brena et al.)

To get around the problem of low sequencing depth, John Rinn, Tim Mercer and colleagues developed a method called ‘RNA CaptureSeq’, s akin to exome sequencing, to enrich transcripts expressed from specific genomic locations. 252 , 253 , 326 This approach showed that, even in a relatively homogenous population of cultured fibroblasts, regions that appeared devoid of transcripts – sometimes referred to as ‘gene deserts’ – expressed lncRNAs in a subset of the cells. 252 It detected previously unknown isoforms of intensively studied protein-coding genes, such as TP53, and lncRNAs expressed from homeotic and other developmental loci. 252 , 253 , 326–328 It also revealed that most GWAS regions, including those associated with neuropsychiatric functions, are transcribed into lncRNAs. 329–331 Other studies showed that most intergenic lncRNAs originate from enhancers and are specifically expressed in cell types relevant to the associated GWAS trait 332 , 333 (see below and Chapters 14 and 16).

RNA CaptureSeq also revealed that many of the existing annotations of lncRNA (and some mRNA) structures were incomplete, t that many lncRNAs are multi-exonic, and that the internal exons of lncRNAs (but not mRNAs) are almost universally alternatively spliced. 4 These high resolution and recent single-cell RNA sequencing studies also indicate that the number of lncRNAs and lncRNA isoforms that are expressed is far greater than cataloged in current databases. 105 , 334

Thus, lncRNAs generally show more tissue-restricted and transient expression patterns than mRNAs, 46 , 185 , 254 , 300 , 318 , 335–337 helping to explain their low representation in RNA sequencing datasets (Figure 13.3). This in turn suggests that lncRNAs are more specific markers and regulators of cell state, including disease state, than proteins with generic functions in, e.g., muscle, bone or neuronal cells.

Figure 13.3. In situ hybridization of lncRNAs in mouse brain. (Original images from the Allen Brain Atlas, reproduced from Mercer et al.)

Other Indices of Functionality

LncRNAs are dynamically expressed during all aspects of animal differentiation and development u along developmental axes, 89 , 132 , 260 , 303 in embryonic stem cells, 58 , 124 , 341 , 342 neuronal cells, 258 , 343–345 muscle cells, 346 mammary gland, 347 hematopoietic and immune cells, 37 , 348 , 349 among many others. 99 , 350 They are also differentially expressed in neurological responses, for example, in the songbird zebrafinch where “40% of transcripts in the unstimulated auditory forebrain are non-coding and derive from intronic or intergenic loci... Among the RNAs that are rapidly suppressed in response to new vocal signals … two-thirds are ncRNAs”. 351 LncRNAs show altered expression in cancer and other diseases 352 (see below) and are also dynamically expressed during plant and fungal development. 51 , 353–355 LncRNAs are also trafficked to specific subcellular locations in the nucleus and cytoplasm, and specific domains within them (Chapter 16). 28 , 300 , 303 , 307 , 336 , 356–372

The half-lives of lncRNAs are broadly similar to those of mRNAs, over an equally wide range, many being highly stable. 257 , 373 , 374 In fact, the loci expressing lncRNAs exhibit most of the characteristics of bona fide genes: 375 their expression is regulated by conventional hormones, morphogens and transcription factors; 344 , 376–382 many have polyadenylation sites; 383 they show non-neutral mutational patterns; 384 their promoters and exons have chromatin marks similar to those of protein-coding genes 385 (Chapter 14); and their splice junctions and structures are conserved, 4 , 14 , 384–386 allowing the identification of orthologs in other species. 387–389

Surprisingly, the promoters of lncRNAs are, on average, more conserved than those of protein-coding genes, 14 , 384 , 390 suggesting higher cell specificity, consistent with their restricted expression and the conserved expression patterns of syntenic RNAs. 391–393 Although the proportion of lncRNAs with primary sequence similarity is low among vertebrates, 384 , 389 thousands of RNAs in mammals, Drosophila, plants and yeast have conserved secondary structures and sequence motifs, and a minimum of 20% of the mammalian genome has been shown to be under evolutionary selection at the level of predicted RNA structure. 386 , 394–401

Lack of conservation does not mean lack of function. 402 There are well-described examples of lncRNAs (including Drosophila roX RNAs) that evolve rapidly while maintaining functional interactions. 402–404 Xist shows only patchy primary sequence conservation among mammals, notably in its ‘repeat’ sequences, and its adjacent lncRNA, Jpx, which activates Xist, while sharing no obvious sequence or structural homology between human and mouse, is functionally interchangeable between them. 404 , 405

Many other well-studied lncRNAs involved in developmental processes, including Air, 406 DISC2, 407 NTT, 408 BORG 409 and UM 9(5), show only short stretches of conserved sequences. 410 In addition, many lncRNAs have conserved functions in vertebrate development despite rapid sequence divergence 411 and orthologous lncRNAs that are developmentally regulated in different species have been identified solely on the basis of the conservation of splice sites and associated introns. 387 This indicates that lncRNAs have greater orthology than is evident from conventional sequence comparisons, reflecting positive selection for phenotypic variation 412 and more plastic structure-function relationships 413 , 414 than protein-coding sequences. 91 , 392

Furthermore, bearing in mind the difficulty of identifying orthologs that are evolving to alter the fine control of developmental processes, many lncRNAs are clade-specific and, consequently, largely unstudied. One example is the lncRNA Sphinx, which regulates courtship behavior in Drosophila and is expressed from a chimeric gene that arose by the retrotransposition of a sequence from an ATP synthase gene and capture of an adjacent exon and intron, fixed by positive selection. 415 There are many other examples of species-specific lncRNAs, with evidence of recent birth and selective sweeps, controlling cell differentiation (e.g., 416 ) and brain functions (see below and Chapter 17).

Genetic Signatures

Another reservation was that few lncRNAs had, at the time, been identified in genetic screens, which intrinsically favored protein-coding mutations. 179 , 375

Protein-coding mutations are frequently disastrous, including those affecting enzymes, motor proteins, transporters, signaling proteins, etc., as well as transcription factors, epigenetic modifiers and other regulatory proteins, which cause system-wide malfunctions. The same holds for some highly expressed non-coding RNAs with generic functions, 417 such as RMRP (the RNA component of RNase MRP, Chapter 8), mutations in which cause a pleiotropic human disease, cartilage-hair hypoplasia, first identified by linkage analysis 418 , 419 and later shown also to produce miRNAs, 420 to perturb helper T-cell epigenetic regulation 421 and to inactivate the tumor suppressor P53. 422

A major blind spot was phenotypic bias: the severe and pleiotropic effects of damage to proteins or ‘housekeeping’ RNAs contrast with damage to regulatory sequences, which may only affect a part of the networks that control differentiation and development or environmental responses, with more subtle context- and/or cell type-specific consequences, often referred to as quantitative trait variation. Indeed, the use of the word ‘mutation’, as opposed to ‘variation’, reflects an inherent bias in the identification of genetic factors that affect phenotypes in animals and plants, with those exhibiting strong negative effects (being easier to identify and map) understandably having taken precedence over those that do not (Chapters 7 and 11).

The related blind spots were expectational, technical and interpretative bias: historically, most genetic screens used experimental and informatic approaches that prioritized protein-coding genes and exons. Many mutations in lncRNAs that are now known to be important were consequently missed by exome sequencing and chromosomal microarray analyses. 423 Mutations that could not be tracked to and shown to introduce stop codons or different amino acids in a protein-coding sequence were rarely pursued, given the large number of variations in non-coding sequences, which were mostly invisible and untraceable before the availability of genome and transcriptome sequences. v Even those that were confidently mapped outside of protein-coding sequences were routinely interpreted as affecting cis-acting protein-binding sites that regulate nearby coding genes. Put simply, it was assumed that most disease-causing mutations occur in protein-coding sequences, where it was easy to identify them, or in cis-regulatory sequences that bind regulatory proteins, with scant knowledge of regulatory RNAs that might be expressed from the locus, some of which are now being identified. 333 , 425–432

In Drosophila, where careful genetic analysis identified many enhancers and other regulatory regions affecting development, w the same interpretative bias occurred, despite the abundant evidence of differential expression of lncRNAs from these regions (Chapter 5), although there were exceptions, such as the roX RNAs identified by careful genetic and expression mapping (Chapter 9). 433 , 434

There were other exceptions, especially in farm animals, where controlled breeding permitted accurate dissection of the genetic causes of quantitative trait variation. The mutation underpinning the ‘callipyge’ polar overdominance phenotype in sheep (Chapter 5) was mapped to a non-coding RNA expressed from the complex Dlk1-Dio3 imprinted region, 435–438 as were others affecting quantitative trait variation. 439 Similar strong effects are observed for mutations in other lncRNAs (see below) and non-coding regulatory regions, as exemplified by the Crest mutation in chickens, which causes a spectacular phenotype in which the small feathers normally present on the head are replaced by much larger feathers normally present in dorsal skin. It was shown to be caused by a 197 bp duplication of an evolutionarily conserved sequence in the intron of HoxC10, which causes the ectopic expression of HoxC10 and other Hox genes, altering cell regional identity (Figure 13.4). 440

Figure 13.4. The ‘crest’ phenotype resulting from a 197 bp duplication in the intron of HoxC10, which alters cell identity. (Reproduced from Li et al., under Creative Commons 4.0 license.)

Indeed, many lncRNAs that are now known to be important went undetected or were overlooked in genetic screens. These included nearly all miRNAs, x many of which are not individually essential for viability or development; 442 a conserved lncRNA (‘yar’) that lies within the Drosophila Achaete-Scute complex locus studied by Muller and others many decades ago, which was recently discovered to regulate sleep behavior; 443 and the 3′UTR of the oskar gene, which was unexpectedly found to function as a lncRNA controlling Drosophila oogenesis (Chapter 9). 444 , 445

Moreover, other lncRNAs initially thought to be non-essential, many of which display little ‘conservation’ and are lineage-restricted (including human-specific RNAs), have been implicated in disease, 446–449 with more coming to light with the growing awareness of their relevance to complex traits. 423 , 450 These RNAs have been identified by chromosome breakpoints, fusions and translocations, deletions, copy number variations, point mutations, insertions and deletions, aberrant imprinting, other epigenetic defects and haploinsufficiency, among others. 450

Examples, some exhibiting Mendelian inheritance, but most involved in complex disorders, include DiGeorge Syndrome (a range of symptoms including congenital heart problems, unusual facial features, frequent infections, developmental delay, learning problems and cleft palate); 452 other neurodevelopmental and craniofacial disorders; 453–455 developmental defects, e.g., involving the lncRNA Chaserr, which regulates the expression of a chromatin remodeler implicated in neurological disease (Figure 13.5); 451 limb malformations, brachydactyly and other skeletal abnormalities (Figure 13.6); 430 , 456–458 Angelman y and Prader-Willi Syndromes; 460–463 schizophrenia; 407 , 464–466 Kallmann Syndrome (a subtype of gonadotropin-releasing hormone deficiency with a loss of smell); 467 pseudohypoparathyroidism; 468 alcohol use disorders; 469 nonalcoholic fatty liver disease; 470 diabetes; 471 multiple sclerosis; 472 autoimmune thyroid disease; 473 Sjögren syndrome; 474 celiac disease; 475 Hypereosinophilic Syndrome; 476 Kawasaki Disease; 261 psoriasis; 477 , 478 inflammatory bowel disease; 479 atherosclerosis; 231 , 429 , 480 , 481 cardiac hypertrophy; 482 Alzheimer’s Disease; 483 ataxias; 484 , 485 myocardial infarction; 486 and some types of thalassemia. 131 , 487 , 488

Figure 13.5. The severe phenotype resulting from haploinsufficiency of the lncRNA Chaserr. (Reproduced from Rom et al. under Creative Commons CC BY license.)

Figure 13.6. Skeletal malformations due to the loss of lncRNA Maenli. (Reproduced from Allou et al. with permission from Springer Nature.)

It has also been shown that phenylketonuria, one of the first documented human genetic disorders, which is mostly due to mutations in the enzyme phenylalanine hydroxylase, can also be caused by perturbations in a regulatory lncRNA and be modulated by administration of modified RNA mimics in mouse models. 489

As noted already, genomic regions associated with a wide variety of complex disorders and characteristics, including psychiatric traits and disorders and neurodegenerative diseases, are replete with lncRNAs, 329 , 330 , 332 , 333 which are therefore candidates for the mechanistic basis of the association. Sensibly, studies have started to focus not only on non-coding regions, but also on mutations/variations that affect RNA structure. 490

Many lncRNAs have also been associated with the etiology, progression, genomic instability and therapy resistance of cancers, 390 , 491–496 through altered expression, insertional mutagenesis and/or naturally occurring mutations, and through functional validation of previously known and novel RNAs that act as oncogenes z or tumor suppressors, many of which are enriched in repetitive elements. 390 In some cases (such as H19, PVT1, MIAT/Gomafu, OIP5-AS1/Cyrano, TUG1, aa HOTAIR, MEG3, XIST, TSIX, MALAT1 and NEAT1), perturbations in lncRNAs are associated with multiple cancers; 235 , 390 , 494 , 499–519 and in other cases with particular types of cancers, including leukemias and lymphomas, 135 , 520–525 melanoma, 526 , 527 osteosarcoma, 528 gastric, 529 lung, 530 breast, 531 , 532 prostate, 13 , 533–535 bladder 536 and many others, including bone metastasis, 537 with increasing understanding of the mechanisms involved. 505 , 537–539

A major problem is that the effects of many if not most mutations in lncRNAs are cell type- and context-dependent, 540 and not evident in fish tanks or mouse cages 296 unless the animals are subjected to specific challenges or behavioral assays. Indeed, in Drosophila and C. elegans, both intensively studied, less than a third of protein-coding genes have obvious phenotypes when mutated, 541 , 542 and many apparently disruptive mutations in human genes are common in the population, indicating more subtle interactions. 543 , 544 Targeted knockout of the rodent-specific and highly expressed brain lncRNA, BC1 (Chapter 8), yielded no obvious developmental phenotype, but caused behavioral changes and impaired experience-dependent plasticity and learning in mice. 545 , 546 Deletion of the highly expressed and relatively highly conserved lncRNAs, Neat1 and Malat1, similar to that observed with ultraconserved elements (Chapter 10), did not result in dramatic developmental deficiencies, 547 , 548 but later analyses showed changes in behavior and placental biology, as well as involvement in synapse formation, myogenesis, cancers and responses to pathogen infections. 549–553 The loss of another lncRNA, FosDT, which is highly expressed in the cerebral cortex and interacts with chromatin-modifying proteins associated with the neuronal transcription factor REST, causes no developmental anomalies but reduces brain damage from strokes. 554 , 555 The lncRNA Pnky regulates neuronal differentiation and its deletion affects postnatal cortical development, although the mice do not exhibit superficial defects. 556 , 557

On the other hand, deletion of other lncRNAs in mice by Rinn and colleagues, with visual markers that revealed exquisite expression patterns, resulted in a range of more obvious phenotypes including homeotic transformations, skeletal and neuronal abnormalities, heart and gastrointestinal defects, muscle wasting, abnormal lung morphology and aging. 558 , 559 Many more have since been reported using various in vivo approaches. 540

High-throughput siRNA and CRISPR reverse genetic screens combined with molecular phenotyping 560 are now increasing the search speed, identifying, for example, lncRNAs that are involved in chromatin interactions, 561 are required for heart development, 562 regulate nuclear factor trafficking, 563 activate resistance to BRAF inhibitors in melanoma, 564 respond to Wnt signaling, 565 sensitize glioma cells to radiation and are essential to cancer cell viability, 566 are involved in cell growth and migration, 560 , 567 spermatogenesis 568 and lung cancer, 569 or have various fitness effects. 570 A recent large-scale study using CRISPR mutagenesis of over 16,000 lncRNAs in seven cell lines identified almost 500 required for normal cellular proliferation, 89% of which were expressed in only one cell type. 571

A systematic high-throughput loss-of-function analysis of 248 protein-coding genes and 141 lncRNAs using the fission yeast S. pombe, assessing mutant growth and viability in “benign” and 145 variable conditions, showed that phenotypes are found much more frequently for the former compared to the latter (47.5% for the lncRNAs and 96% for the protein-coding genes). However, on more careful inspection (also evaluating the effect on cell-size and/or cell-cycle control), 59.6% of lncRNAs yielded phenotypes and, upon overexpression of the lncRNAs under 47 different conditions, 90.3% led to altered growth under certain conditions. These results reinforce the notion that most of the lncRNAs exert cellular functions in specific environmental or physiological contexts. 572
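The percentages quoted for this screen can be converted back into approximate numbers of mutants. The following is a minimal arithmetic sketch, assuming that each percentage is taken over the full set of mutants stated above (248 protein-coding genes and 141 lncRNAs); the function and variable names are illustrative only. The 90.3% overexpression figure is left out because the number of lncRNAs that were overexpressed is not stated here.

```python
# Totals as quoted in the text; applying every percentage to the full
# set of mutants is an assumption made for illustration.
SCREENED = {"protein-coding genes": 248, "lncRNAs": 141}


def approx_count(total: int, percent: float) -> int:
    """Nearest whole number of mutants corresponding to a percentage."""
    return round(total * percent / 100)


coding_with_phenotype = approx_count(SCREENED["protein-coding genes"], 96.0)
lncrna_growth_phenotype = approx_count(SCREENED["lncRNAs"], 47.5)
lncrna_any_phenotype = approx_count(SCREENED["lncRNAs"], 59.6)

# lncRNA mutants whose phenotype appeared only on closer inspection
# (cell size and/or cell-cycle control).
gained_on_inspection = lncrna_any_phenotype - lncrna_growth_phenotype

print("protein-coding mutants with phenotypes:", coding_with_phenotype)
print("lncRNA mutants with growth/viability phenotypes:", lncrna_growth_phenotype)
print("lncRNA mutants with any phenotype:", lncrna_any_phenotype)
print("additional lncRNA mutants found on closer inspection:", gained_on_inspection)
```

Under that assumption, the closer inspection adds roughly a quarter again to the number of lncRNA mutants with a detectable phenotype, which is the point the paragraph makes about context dependence.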

Intriguingly, it appears that some gene deletions may be masked by compensatory mechanisms, whereas acutely disturbing transcript levels may have more severe effects. 573 There is also evidence that ‘shadow’ enhancers (which express and likely operate through lncRNAs, Chapter 16) provide redundancy and robustness to developmental programs (Chapter 15), 574–578 which makes sense given the criticality of the process for survival and reproductive success. On the other hand, knockdown of lncRNA expression in culture often has visible effects, in terms of changes in cell shape, behavior and gene expression profiles, 375 , 560 which may have stronger manifestations in artificial in vivo settings such as xenograft models. 565

An Avalanche of Long Non-coding RNAs

With the popularization of unbiased transcriptomic studies, growing numbers of non-coding RNAs have been identified and studied in model organisms, human cell lines and disease systems, mainly ad hoc by the differential expression of intronic, intergenic, pseudogene-derived and antisense lncRNAs, but also increasingly by functional screens. The ENCODE project alone found over 850 pseudogenes that are “transcribed and associated with active chromatin”, 140 with many since shown to function as regulatory RNAs. 130 , 522 , 579–584

In addition to viroids that have small circular genomes (Chapter 8), circular RNAs (circRNAs) occur in plant and animal cells, one of the first discovered being a variant of the Sry mammalian male sex-determining gene transcript, 585 which, as ever, was considered an interesting oddity at the time. CircRNAs remained under the radar because traditional cloning and sequencing protocols and informatic methods militated against their detection. They are predominantly produced by back-splicing facilitated by reverse-complementary sequences (often recently acquired and transposon-derived) that promote pre-mRNA folding. 586 Since their rediscovery, circRNAs have become recognized as a bona fide class of functional RNAs, in many cases comprising the dominant transcript isoform. 152 , 154 , 587 , 588
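The notion of reverse-complementary flanking sequences can be made concrete with a small sketch. The code below is illustrative only and is not a method from the studies cited: it assumes strict Watson-Crick pairing (no G-U wobble), and the function names and the two short test sequences are hypothetical. It reports the longest stretch of an upstream sequence whose reverse complement also occurs in a downstream sequence, that is, the longest segment over which the two could in principle pair and bring the splice sites together.

```python
# Watson-Crick complements for RNA; G-U wobble pairs are ignored (assumption).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def longest_pairing_stretch(upstream: str, downstream: str) -> int:
    """Length of the longest stretch of `upstream` whose reverse
    complement occurs somewhere in `downstream`."""
    up, down = upstream.upper(), downstream.upper()
    best = 0
    for start in range(len(up)):
        # Only test stretches longer than the best one found so far.
        for end in range(start + best + 1, len(up) + 1):
            if reverse_complement(up[start:end]) in down:
                best = end - start
            else:
                # A longer stretch from the same start cannot pair either.
                break
    return best


# Hypothetical flanking sequences, invented for illustration.
upstream_flank = "AAGGCUGAGGCAGG"
downstream_flank = "CCUGCCUCAGCCAA"
print(longest_pairing_stretch(upstream_flank, downstream_flank))
```

A real analysis would work on intronic sequences of hundreds to thousands of nucleotides and would score imperfect duplexes, so this exact-match search is only a simplified stand-in for the underlying idea.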

CircRNAs are predominantly nuclear ab and act through several mechanisms, having been shown to regulate, inter alia, transcription, immune responses, behavior, neural cell function and pluripotency, 589–594 Drosophila lifespan 595 and centromeric chromatin organization in maize. 596 Many circRNAs have regulatory interactions with the cognate protein-coding gene, as illustrated by the neuronal-enriched psychiatric-disease associated circHomer1 RNA and its host gene Homer1. Based on in vitro and in vivo studies with mouse models, they were found to be functionally antagonistic in synapses of the orbitofrontal cortex, with this opposing interplay regulating synaptic gene expression, cognitive flexibility and behavioral performance, with potential relevance for brain function and psychiatric diseases. 594

Over the past decade, there have been ~50,000 publications with long non-coding RNA as a key term (Figure 13.7) and over 2,000 publications reporting validated long non-coding RNA functions. 597 These studies have been assisted by the systematic cataloging and annotation of lncRNAs by the GENCODE consortium, ac The Cancer Genome Atlas (TCGA) 516 and the extension of the FANTOM projects to transcriptomic atlases, associated Web resources and functional annotation of lncRNAs. 34

Figure 13.7. The increase in the number of publications that have the terms ‘long/large non(-)coding RNA’ or variations thereof (lncRNA or lincRNA) in their PubMed entry.

There are now hundreds of thousands of cataloged lncRNAs and dozens of databases (and databases of databases) with curated information. 600 Well over 100,000 human lncRNAs have been recorded, 601 many of which are specific to the primate lineage, 389 , 602 including retrovirus-derived lncRNAs, 603 but the catalog remains vastly incomplete owing to the still limited analysis of different cells at different developmental stages and under different physiological conditions.

A Plethora of Functions

LncRNAs have been shown to regulate many aspects of mammalian development, cell differentiation (Figure 13.8), physiology and brain function 605 , 606 (see Table and Chapter 17), as well as to have many other roles in other organisms, 354 , 607–610 including the translation of Doublesex in Drosophila, 175 female honeybee development, 308 plant vernalization (see below), strawberry fruit ripening, 611 DNA elimination and genome rearrangements in ciliate life cycle and reproduction, 612 the mitotic to meiotic switch 613 and meiotic chromosomal pairing in yeast, 614 , 615 and carotenoid biosynthesis in filamentous fungi. 616

Figure 13.8. Control of cell differentiation and self-renewal by lncRNAs. (Reproduced from Flynn and Chang with permission of Elsevier.)

Examples of Functions of lncRNAs in Mammalian Biology

Stem cell pluripotency, self-renewal, lineage commitment and reprogramming 262 , 315 , 617–629

Epithelial-mesenchymal transition 213 , 536 , 630 , 631

Mesoderm and endoderm differentiation 632–634

Cell cycle, 635 proliferation and migration 560 , 567 , 636–640

Cellular senescence 234 and apoptosis 370 , 476 , 641

Brain evolution, 557 , 602 , 642 neocortex, 643 forebrain, 644 , 645 and retinal 498 , 646 , 647 development

Maintenance of neural progenitor cells, 593 neuronal differentiation, 556 , 607 , 648–650 outgrowth and regeneration, 637 , 651 , 652 axon integrity, 653 myelination 654

Synaptic plasticity and function 655–660

Memory, 661 , 662 sex-specific depression 663 and social hierarchy 658

Mammary, 347 , 664 sclerotome, 665 heart, 562 , 666–669 lung, 670 skeletal, 457 , 458 limb 430 and intestinal 671 development

Muscle differentiation and function, 209 , 672–684 myogenesis and muscle fiber type switching 631 , 685–690

Liver regeneration 691 , 692

Cholesterol biosynthesis and homeostasis 470 , 693

Angiogenesis 694 , 695 and fibrogenesis 696

Formation of vascular endothelial cell junctions 697 , 698

Hematopoiesis, 699 granulocyte, 349 megakaryocyte, 416 T-cell 471 , 472 , 700 and keratinocyte 701–703 differentiation

Erythropoiesis and developmental regulation of globin gene expression 311 , 704–707

Innate and adaptive immune responses 708–715

Inhibition of viral replication 716

Microbial susceptibility, endotoxic shock and immunity 474 , 717–721

V(D)J and Ig class switch recombination 722 , 723

Inflammation and neuropathic pain 428 , 475 , 479 , 724–728

Growth hormone and prolactin production 729

Glucocorticoid resistance 730

Testis development and spermatogenesis 194 , 568 , 731–733

DNA damage repair 517 , 734 , 735

Thermogenic adipocyte regulation 736

Mitochondrial function 737

Mechanistic details have started to emerge, some serendipitously. As with the discovery of the 7SL RNA in signal recognition particles in the early 1980s (Chapter 8), biochemical assays used to identify protein interactions detected other regulatory RNAs, such as the identification by yeast two- and three-hybrid screens of mammalian SRA RNA as a transcriptional coactivator, Gas5 RNA as a repressor of glucocorticoid receptor activity, and the plant ENOD40 RNA as a regulator of the localization of an RNA-binding protein in cytoplasmic granules. 738–741 Other lncRNAs have been found to associate with cell membranes to alter their permeability and dynamics, thereby modulating signal transduction and transport pathways, 742–746 including the reprogramming of glucose metabolism. 747

Functionally characterized examples have established that lncRNAs participate in virtually all levels of genome organization and gene expression, via RNA-RNA, RNA-DNA and RNA-protein interactions, often involving repeat elements within them, including SINEs in 3′UTRs. 748 These encompass the regulation of transcription, chromatin architecture and the organization of subcellular domains (see Chapter 16), control of protein translation and localization, 361 , 563 , 661 , 748–750 splicing 381 , 655 , 751–762 and other forms of RNA processing, editing, localization and stability. 763–767

Some of the first characterized non-coding RNAs were found to regulate transcription by modulating RNA polymerase II activity, directly (such as RNAs from B2 SINE and Alu elements) or indirectly through interaction with transcription factors. 768 , 769 7SK, for instance, acts primarily by sequestering and inactivating the transcription elongation factor b (P-TEFb), a heterodimer composed of cyclin-dependent kinase 9 (Cdk9) and cyclin T1, which connects transcription to the cell cycle and chromatin architecture. 770–777 It controls stress-induced transcriptional reprogramming 778 and regulates several aspects of the expression not just of mRNAs, but also of snRNAs, bidirectional and enhancer RNAs, 777 , 779 and acts as a multi-functional RNA scaffold that regulates neuron homeostasis. 780

Genomically associated RNAs have been shown to regulate gene expression by other mechanisms. Transcripts spanning the cyclin D1 (CCND1) regulatory promoter sequences recruit and allosterically regulate the TLS RNA-binding protein and induce chromatin modification to repress CCND1 expression. 781 At the DHFR locus, a non-coding RNA initiated from an upstream minor promoter forms a stable RNA-DNA triplex within the major promoter to repress DHFR expression. 782 Likewise, among others, 783 the lncRNAs ANRIL (CDKN2B-AS1), Khps1 and CISAL regulate expression of the cyclin-dependent kinase inhibitor CDKN2B, the proto-oncogene SPHK1 and the tumor suppressor BRCA1, respectively, via triplex-mediated changes in chromatin structure. 481 , 784–786

Non-coding RNAs from intergenic spacers and promoter regions of rDNA genes in humans establish and maintain heterochromatin structure at specific rDNA promoters via the recognition of RNA secondary structures and formation of triplexes with target DNA sequences that recruit DNA methyltransferase DNMT3b and more. 148–151 Indeed, lncRNAs play central roles in the formation and function of heterochromatic domains, including telomeres ad and centromeres, in all eukaryotes.

Many lncRNAs associate with enzymes and complexes that impart histone modifications and DNA methylation. 99 , 796 , 797 Both small and long non-coding RNAs control the target specificity of and the interplay between repressive Polycomb group (PcG) and activating Trithorax group (TrxG) protein complexes (Chapter 16), chromatin modifiers that maintain silent and active expression states of genes during development 99 , 178 , 796 , 798–800 (Chapter 14).

For example, it has been shown that the mouse chromodomain-containing PRC1 component Cbx7 binds RNA and that its association with the inactive X chromosome depends on interaction with RNA. 801 Imprinting of loci by lncRNAs such as Air and Kcnq1ot1 (Chapter 9) and X-chromosome epigenetic silencing by Xist-locus derived RNAs involve recruitment of PcG and other chromatin-modifying complexes (Chapter 16). The ~3.8 kb lncRNA ANRIL, some of whose exons are primate-specific, 389 is transcribed from a GWAS region associated with autoimmune and other disorders 231 , 425 , 426 , 429 and recruits components of the Polycomb Repressive Complexes 1 and 2 (PRC1/2) to epigenetically silence the INK4B/ARF/INK4A tumor suppressor cluster. 802–804 Exon 8 of ANRIL, which is mainly composed of repeat elements, mediates ANRIL’s association with target loci to modulate their expression through H3K27me3 deposition. 805 The lncRNA Chaer controls hypertrophic heart growth by binding to and inhibiting the function of PRC2. 674

Expression of the genes in the Hox clusters is controlled by enhancer elements present in intergenic regions that bind regulatory proteins, which are thought to activate nearby protein-coding genes in cis but that are also co-linearly transcribed into non-coding RNAs during development. 806–812 Hundreds of lncRNAs associate with PRC2 complexes, including functionally validated lncRNAs such as TUG1, Meg3/Gtl2 and HOTAIR 813–815 (Chapter 16).

HOTAIR is a ~2.2 kb spliced RNA transcribed from the HOXC locus, antisense to the flanking genes HOXC11 and HOXC12, which was originally shown to direct heterochromatin formation in trans across a 40 kb domain of the HOXD cluster in human fibroblasts. 89 HOTAIR was later shown also to influence gene expression at other sites around the genome by recruitment of PRC2 and LSD1/CoREST/REST repressive chromatin-modifying complexes. 813 , 816 , 817 As with Xist, HOTAIR has different functional domains, with a 5′ domain that binds PRC2 and a 3′ domain that binds LSD1, and has been proposed to act as a scaffold for protein complexes, 816 , 818 likely a general function of lncRNAs 800 , 819 (Chapter 16).

Other intergenic or antisense lncRNAs transcribed from homeotic loci (Evx1, HoxA13 and HoxB5/B6 and many others – Chapter 16) have been shown to bind to TrxG; 260 , 341 the spliced lncRNA HOTTIP (~3.8 kb) is transcribed from a region immediately downstream of the human HOXA13 gene, interacts with TrxG MLL component WDR5 and directs the complex to activate HOXA13 and additional neighboring HOXA genes by a mechanism that involves chromosomal looping. 260

In plants, lncRNAs also control many aspects of development and environmental responses, 820 exemplified by the lncRNAs COOLAIR transcribed antisense to the major repressor of flowering, FLOWERING LOCUS C (FLC), and COLDAIR transcribed from the first intron of FLC, which mediate cold-induced epigenetic repression (‘vernalization’) of flowering time. 821 COOLAIR and COLDAIR contain conserved modular secondary structures and act by recruitment of PRC2 and other epigenetic regulators, RNA-DNA R-loop formation, chromatin looping and the formation of phase-separated condensates. 822–834 A distal COOLAIR variant sequesters TrxG into condensates away from the promoter 835 and unspliced COOLAIR forms “clouds” around the locus, 836 similar to Xist (see Chapter 16). COOLAIR is also differentially spliced in Arabidopsis variants adapted to different climes. 837 , 838 Another lncRNA, FLAIL, represses flowering time. 839

These are prominent examples among recurrent themes for lncRNAs, incorporating their functions as scaffolds, epigenetic guides, chromatin organizers and control devices, allosteric regulators and ribozymes, as well as decoys that sequester regulatory factors, 97 , 175 , 800 , 819 , 840 , 841 acting as ‘target mimics’, ‘miRNA sponges’ or ‘competing endogenous RNAs’. 611 , 620 , 842 , 843

The Wild West

In the ‘Insights of the Decade’ section of the special issue of Science magazine in 2010, less than 10 years after the publication of the draft human genome sequence, it was noted that

Many mysteries about the genome’s dark matter are still under investigation. Even so, the overall picture is clear: 10 years ago, genes had the spotlight all to themselves. Now they have to share it with a large, and growing, ensemble. 845

Indeed, it progressively became evident that lncRNAs are ubiquitously involved in differentiation and development processes in eukaryotes (Figure 13.9).

Figure 13.9. Depictions of eukaryotic RNA regulation in 1994 and less than 15 years later. (a) At a time when the understanding of genetic information was still largely based on bacterial studies (upper left), proteins were thought to perform all the functions ...

A recent review observed

“The prior widely held perception that they are predominantly junk [should] also [be] factored in” to the analysis of such experiments and that [although] “there have since been more than a thousand publications on the functions of these lncRNAs, both in cis and trans, many (molecular biologists) are still only aware of the earlier dismissive publications”. 846

Most lncRNAs still remain experimentally untouched or poorly characterized – such as LINC02476 (GenBank CB338058), composed of at least five exons spanning 288 kb, found in 2003 associated with autism, in patient breakpoints that disrupt this non-coding RNA transcript, 847 but only now being studied due to its differential expression in cancer. 848

The bottom line is that these highly important regulatory RNAs were present all along but, despite the cases documented in the closing decades of the 20th century and many more detected by transcriptomic, enhancer ‘traps’ and other biochemical and functional screens in the first decade of the 21st century, they were not generally, until recently, taken seriously. They have also been studied in all sorts of ways, with all sorts of specific and general hypotheses, dubbed “The Wild West” by Jeannie Lee 799 and “The Noncoding RNA Revolution” by Tom Cech and Joan Steitz. 849

As Stent noted in relation to the unexpected discovery that DNA is the genetic material, 850 the finding of dynamic and differential genome-wide transcription of intergenic, antisense and overlapping lncRNAs, like the related discovery of intervening sequences, was “premature” in the sense that it could not be readily incorporated into the existing conceptual fabric. To put the plethora of regulatory RNAs into full perspective and to integrate them into a contemporary framework for the genetic programming of complex organisms, we must first consider the epigenome and the amount of information required for multicellular development.

Further Reading

  1. Amaral P.P. and Mattick J.S. (2008) Noncoding RNA in development. Mammalian Genome 19: 454–492. [PubMed: 18839252]
  2. Andergassen D. and Rinn J.L. (2021) From genotype to phenotype: Genetics of mammalian long non-coding RNAs in vivo. Nature Reviews Genetics 23: 229–243. [PubMed: 34837040]
  3. Briggs J.A., Wolvetang E.J., Mattick J.S., Rinn J.L. and Barry G. (2015) Mechanisms of long non-coding RNAs in mammalian nervous system development, plasticity, disease, and evolution. Neuron 88: 861–877. [PubMed: 26637795]
  4. Clark M.B. and Mattick J.S. (2011) Long noncoding RNAs in cell biology. Seminars in Cell and Developmental Biology 22: 366–376. [PubMed: 21256239]
  5. Deveson I.W., Hardwick S.A., Mercer T.R. and Mattick J.S. (2017) The dimensions, dynamics, and relevance of the mammalian noncoding transcriptome. Trends in Genetics 33: 464–478. [PubMed: 28535931]
  6. Mattick J. (2010) Video Q&A: Non-coding RNAs and eukaryotic evolution - a personal view. BMC Biology 8: 67. [PMC free article: PMC2905358] [PubMed: 20646265]
  7. Mattick J.S. (2009) The genetic signatures of noncoding RNAs. PLOS Genetics 5: e1000459. [PMC free article: PMC2667263] [PubMed: 19390609]
  8. Mercer T.R., Dinger M.E. and Mattick J.S. (2009) Long noncoding RNAs: insights into function. Nature Reviews Genetics 10: 155–159. [PubMed: 19188922]
  9. Morris K.V. and Mattick J.S. (2014) The rise of regulatory RNA. Nature Reviews Genetics 15: 423–437. [PMC free article: PMC4314111] [PubMed: 24776770]
  10. Rinn J.L. and Chang H.Y. (2020) Long noncoding RNAs: Molecular modalities to organismal functions. Annual Review of Biochemistry 89: 283–308. [PubMed: 32569523]
  11. Unfried J.P. and Ulitsky I. (2022) Substoichiometric action of long noncoding RNAs. Nature Cell Biology 24: 608–615. [PubMed: 35562482]
  12. Winkle M., El-Daly S.M., Fabbri M. and Calin G.A. (2021) Noncoding RNA therapeutics — challenges and potential solutions. Nature Reviews Drug Discovery 20: 629–651. [PMC free article: PMC8212082] [PubMed: 34145432]

Footnotes

a

High-throughput mRNA sequencing provided the necessary information not only for better annotation of gene (exon-intron) structures and alternative splicing, 3 , 4 but also for ‘proteomics’, which matches amino acid sequences predicted by mass spectrometry of peptides generated (usually) by proteolytic digestion with those in mRNA open reading frames. Although having a high false-positive rate, and complicated by post-translational modifications, proteomics has proved useful for identifying the protein constituents of subcellular organelles and complexes. 5 , 6

b

This assumption ignored historical evidence that some mRNAs such as histone mRNAs are not polyadenylated 9 and that polyA- RNAs are abundant in human cells. 10–12

c

The ‘Mammalian Gene Collection’ initiative (also) sequenced cDNAs from their 5′ ends but discarded the clone if an AUG start codon and a following open reading frame were not identified, 16 , 17 thereby excluding noncoding transcripts.

d

A large fraction of the RNAs in human cells are not polyadenylated, although they are often capped. 10–12 , 28

e

FANTOM: Functional Annotation of the Mouse (later Mammalian) Genome. The successive FANTOM projects introduced many technical innovations and have produced a wealth of data, including well-annotated transcription start and termination atlases, full-length cDNA clones and other valuable resources for the international research community. 33 , 34

f

Hybridized to cDNAs randomly primed from polyA+ or polyA- RNAs.

g

Again described as the “dark matter” in the genome. 53

h

RNA polymerase III (RNAPIII) produces various types of regulatory RNAs from repeat sequences, some of which are clade-specific, such as B2 SINE RNAs in mice and Alu RNAs in humans, in both cases with modular structures that repress RNA polymerase II during stresses like heat shock at specific loci, a striking example of convergent evolution. 60–62 Large numbers of human- or primate-specific Alu-derived short RNAs transcribed by RNAPIII have been identified by bioinformatic and biochemical strategies, including a class of structured small (<120nt) RNAs (snaRs) complexed with the dsRNA-binding nuclear factor 90 family. These are mostly genomically clustered and differentially expressed in regions of the brain, other tissues and cancer cells. 63–65 Recently it has been shown that transcription of snaR-A, which produces an miRNA that targets a metastasis inhibitor, 66 is driven by an embryonic isoform of RNAPIII that is also upregulated in cancer cells, whereas the other isoform is expressed in specialized tissues. 67 Other TE-derived RNAs are involved in other aspects of genome expression and organization, discussed in Chapter 16. For example, bidirectional transcription by RNAPII and RNAPIII of the B2 SINE sequence in mice was found to restructure the growth hormone locus into nuclear compartments and define the heterochromatin-euchromatin boundary to regulate the expression of the gene during organogenesis, suggesting a role of these abundant elements in the topological organization of the genome. 68

i

ENCODE: Encyclopedia of DNA Elements.

j

LncRNAs are defined as non-protein-coding RNAs >200 nt, an arbitrary classification partly based on a size cutoff in biochemical/biophysical commercial RNA purification kits and protocols that exclude most infrastructural RNAs, such as tRNAs, snoRNAs and snRNAs, as well as miRNAs, siRNAs and piRNAs. 97

k

RACE: Rapid amplification of cDNA ends.

l

Later studies showed that the number of long noncoding RNAs expressed from endogenous retroviral promoters correlates with pluripotency or the degree of malignant transformation. 112

m

A similar albeit less complex genomic organization pertains in prokaryotes, where hundreds of transcriptional start sites are located within operons, as well as opposite to annotated genes, indicating that the complexity of gene expression is increased by uncoupling polycistronic linkages and the genome-wide use of antisense transcription. 166 Apart from the thousands of short regulatory RNAs (Chapter 9) there are other, often still mysterious, longer (>200nt) highly structured and conserved noncoding RNAs that have been discovered in bacteria. 167–170

n

This created problems for genome annotations, which had been traditionally organized around protein-coding genes, although they are still used as landmarks. 177

o

The initial annotation of the human genome generally used the presence of a conserved open reading frame of 100 codons in RNAs (exonic fragmentation of protein-coding sequences made this assessment impossible at genome sequence level) as an arbitrary cutoff on the basis that this was unlikely to occur by chance. The sequencing of large numbers of vertebrate genomes will allow more accurate assessment of conserved open reading frames, ultimately with near statistical certainty at the codon level.
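As a rough illustration of why a 100-codon open reading frame was considered unlikely to occur by chance, the sketch below computes the probability that a run of random codons contains no stop codon. It assumes independent, equally frequent nucleotides, so that 3 of the 64 possible codons are stops; real genomes deviate from this, and the function name `prob_open_run` is hypothetical, not part of any annotation pipeline.

```python
# Standard genetic code: 3 of the 64 codons (TAA, TAG, TGA) are stop codons.
STOP_CODONS = 3
TOTAL_CODONS = 64


def prob_open_run(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop.

    Assumption (illustrative only): nucleotides are independent and
    equally frequent, so every codon is a stop with probability 3/64.
    """
    if n_codons < 0:
        raise ValueError("n_codons must be non-negative")
    return (1 - STOP_CODONS / TOTAL_CODONS) ** n_codons


if __name__ == "__main__":
    # Compare several window lengths, including the 100-codon cutoff.
    for n in (25, 50, 100, 200):
        print(f"{n:>4} codons: P(no stop) = {prob_open_run(n):.2e}")
```

Under these assumptions the probability for a single 100-codon window is below 1%. This is a per-window figure, not a genome-wide false-positive rate, because a genome offers a very large number of possible starting positions.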

p

For example, the PNUTS gene encodes both PNUTS mRNA and lncRNA-PNUTS by alternative splicing of the primary transcript, each eliciting distinct biological functions; PNUTS mRNA is ubiquitously expressed, whereas the production of lncRNA-PNUTS is tightly regulated. 213

q

This problem also confounds the attempts to construct ‘gene networks’ from transcriptomic data, especially when different cells in a population are expressing different genes and responding differently to stimuli, and most data is derived from the 3′ end (UTRs) of transcripts, which may or may not be part of a corresponding protein-coding mRNA. 75

r

Via RNA-degrading ‘exosomes’ 267–269 (not to be confused with the extracellular vesicles that have the same name) and ‘nonsense-mediated RNA decay’ (NMD). NMD is known as a quality control mechanism to ensure only mRNAs with complete open reading frames are exported for translation, by degrading “aberrant RNAs” that have “exon junction complexes” 3′ to stop codons (because the stop codon of protein-coding genes is located primarily in the last exon). 270–276 However, recent evidence suggests that it also distinguishes short-lived regulatory RNAs from mRNAs and controls their steady-state levels in different contexts, 277–280 including stress responses. 281 The NMD pathway is developmentally regulated, 279,282 and required for embryonic stem cell fate determination 283,284 and neuronal architecture. 285 Loss of NMD components leads to developmental abnormalities and neurological disorders. 276,285

s

Along with sophisticated internal standards to properly measure the sensitivity of DNA and RNA sequencing analysis. 320–325

t

These studies also showed that there are far more regulatory 5′ exons in human than in mouse mRNAs, which are also highly alternatively spliced, suggesting that humans have evolved a more complex cis-regulatory architecture of mRNAs, 4 possibly related to brain function.

u

Including in sponge. 340

v

There are few promoters in the catalogs of mutations associated with genetic disorders, 424 although no one disputes their functionality.

w

An important subtlety, and a difference between genetic screens in Drosophila and mammals, is that many if not most naturally occurring mutations, and those experimentally induced in mammals (by the mutagen ENU), are single nucleotide mutations, which can have serious consequences for a protein-coding sequence but often only subtle consequences for regulatory sequences. By contrast, most experimental mutagenesis in Drosophila involved transposable element insertions or large deletions, which have more serious phenotypic effects on both coding and regulatory sequences. It is therefore no surprise that so many regulatory loci were unearthed in bithorax and other intensively studied gene regions in Drosophila.

x

It has been proposed that miRNAs operate in a hierarchical and canalized series of regulatory networks (see Chapter 15), a fraction of miRNAs acting at the top of this hierarchy, with their loss resulting in broad developmental defects, whereas most miRNAs are expressed with high cellular specificity and play roles at the periphery of development, affecting the terminal features of specialized cells. 441 It is likely that the same applies to lncRNAs.

y

Therapies are being developed for Angelman’s Syndrome by knocking down the regulatory Ube3a-ATS non-coding RNA with antisense oligonucleotides to restore expression of the normally silent (imprinted) paternal Ube3a allele in patients lacking the maternal allele. 459

z

The lncRNA Cherub is required for the transformation of stem cells into malignant cells. 497

aa

TUG1 is required for mouse retinal differentiation 498 and male fertility. 194

ab

Containing exons, 589 or a mixture of intronic and exonic sequences. 590 Some are derived entirely from introns, and have been shown to regulate their parent protein-coding genes. 591

ac

Initially formed as part of the pilot phase of the ENCODE project 598 but expanded to annotate “human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics”. 177,336,599

ad

Maintenance of telomeres, in addition to the telomerase RNA component TERC (Chapter 9), involves transcripts named TERRA (telomeric repeat-containing RNAs), transcribed from subtelomeric regions in a developmentally regulated fashion. TERRA RNAs contain repeat-rich sequences that form G-quartet structures and regulate local heterochromatin stability, telomerase activity and telomere length, as well as biological processes such as the induction and maintenance of pluripotency. 787–793 Although first discovered and characterized in mammalian cells, analogous RNAs have similar functions in different organisms, including fungi. 794,795

© 2023 John Mattick and Paulo Amaral.

Open Access: This content is Open Access under the Creative Commons license CC-BY-NC-ND.

Bookshelf ID: NBK595947. DOI: 10.1201/9781003109242-13
