mediated RNA trans-splicing

Spliceosome-mediated RNA trans-splicing (SMaRT) has been used previously to reprogram mutant endogenous CFTR and factor VIII mRNAs in human epithelial cell and tissue models and knockout mice, respectively. Those studies used 3′ exon replacement (3′ER), a process in which the distal portion of RNA is reprogrammed. Here, we also show that the 5′ end of mRNA can be completely rewritten by 5′ER. For proof-of-concept, and to test whether 5′ER could generate functional CFTR, we generated a mutant minigene target containing CFTR exons 10–24 (ΔF508) and a mini-intron 10, and a pre-trans-splicing molecule (PTM; targeted to intron 10) containing CFTR exons 1–10 (+F508), and tested these two constructs in 293T cells for anion efflux transport. Cells cotransfected with target and PTM showed a consistent increase in anion efflux, but there was no response in control cells that received PTM or target alone. Using a LacZ reporter system to accurately quantify trans-splicing efficiency, we tested several unique PTM designs. These studies provided two important findings as follows: (1) efficient trans-splicing can be achieved by binding the PTM to different locations in the target, and (2) relatively few changes in PTM design can have a profound impact on trans-splicing activity. Tethering the PTM close to the target 3′ splice site (as opposed to the donor site) and inserting an intron in the PTM coding sequence resulted in a 65-fold enhancement of LacZ activity. These studies demonstrate that (1) SMaRT can be used to reprogram the 5′ end of mRNA, and (2) efficiency can be improved substantially.

Regulation of transcription of the RNA splicing factor hSlu7 by Elk-1 and Sp1 affects alternative splicing

Alternative splicing plays a major role in transcriptome diversity and plasticity, but it is largely unknown how tissue-specific and embryogenesis-specific alternative splicing is regulated. The highly conserved splicing factor Slu7 is involved in 3′ splice site selection and also regulates alternative splicing. We show that Slu7 has a unique spatial pattern of expression among human and mouse embryonic and adult tissues. We identified several functional Ets binding sites and GC-boxes in the human Slu7 (hSlu7) promoter region. The Ets and GC-box binding transcription factors, Elk-1 and Sp1, respectively, exerted opposite effects on hSlu7 transcription: Sp1 protein enhances and Elk-1 protein represses transcription in a dose-dependent manner. Sp1 protein bound to the hSlu7 promoter in vivo, and depletion of Sp1 by RNA interference (RNAi) repressed hSlu7 expression. Elk-1 protein bound to the hSlu7 promoter in vivo, and depletion of Elk-1 by RNAi caused an increase in the endogenous level of hSlu7 mRNA. Further, depletion of either Sp1 or Elk-1 affected alternative splicing. Our results provide indications of a complex transcription regulation mechanism that controls the spatial and temporal expression of Slu7, presumably allowing regulation of tissue-specific alternative splicing events.

Flexibility in the site of exon junction complex deposition revealed by functional group and RNA secondary structure alterations in the splicing substrate

The exon junction complex (EJC) is critical for mammalian nonsense-mediated mRNA decay and translational regulation, but the mechanism of its stable deposition on mRNA is unknown. To examine requirements for EJC deposition, we created splicing substrates containing either DNA nucleotides or RNA secondary structure in the 5′ exon. Using RNase H protection, toeprinting, and coimmunoprecipitation assays, we found that EJC location shifts upstream when a stretch of DNA or RNA secondary structure appears at the canonical deposition site. These upstream shifts occur prior to exon ligation and are often accompanied by decreases in deposition efficiency. Although the EJC core protein eIF4AIII contacts four ribose 2′OH groups in crystal structures, we demonstrate that three 2′OH groups are sufficient for deposition. Thus, the site of EJC deposition is more flexible than previously appreciated and efficient deposition appears spatially limited.

Alternately spliced WT1 antisense transcripts interact with WT1 sense RNA and show epigenetic and splicing defects in cancer

Many mammalian genes contain overlapping antisense RNAs, but the functions and mechanisms of action of these transcripts are mostly unknown. WT1 is a well-characterized developmental gene that is mutated in Wilms’ tumor (WT) and acute myeloid leukaemia (AML) and has an antisense transcript (WT1-AS), which we have previously found to regulate WT1 protein levels. In this study, we show that WT1-AS is present in multiple spliceoforms that are usually expressed in parallel with WT1 RNA in human and mouse tissues. We demonstrate that the expression of WT1-AS correlates with methylation of the antisense regulatory region (ARR) in WT1 intron 1, displaying imprinted monoallelic expression in normal kidney and loss of imprinting in WT. However, we find no evidence for imprinting of mouse Wt1-as. WT1-AS transcripts are exported into the cytoplasm and form heteroduplexes with WT1 mRNA in the overlapping region in WT1 exon 1. In AML, there is often abnormal splicing of WT1-AS, which may play a role in the development of this malignancy. These results show that WT1 encodes conserved antisense RNAs that may have an important regulatory role in WT1 expression via RNA:RNA interactions, and which can become deregulated by a variety of mechanisms in cancer.

Analysis of the requirement for RNA polymerase II CTD heptapeptide repeats in pre-mRNA splicing and 3′-end cleavage

The carboxyl-terminal domain (CTD) of RNA polymerase II (pol II) plays an important role in coupling transcription with precursor messenger RNA (pre-mRNA) processing. Efficient capping, splicing, and 3′-end cleavage of pre-mRNA depend on the CTD. Moreover, specific processing factors are known to associate with this structure. The CTD is therefore thought to act as a platform that facilitates the assembly of complexes required for the processing of nascent transcripts. The mammalian CTD contains 52 tandemly repeated heptapeptides with the consensus sequence YSPTSPS. The C-terminal half of the mammalian CTD contains mostly repeats that diverge from this consensus sequence, whereas the N-terminal half contains mostly repeats that match the consensus sequence. Here, we demonstrate that 22 tandem repeats, from either the conserved or divergent halves of the CTD, are sufficient for approximate wild-type levels of transcription, splicing, and 3′-end cleavage of two different pre-mRNAs, one containing a constitutively spliced intron, and the other containing an intron that depends on an exon enhancer for efficient splicing. In contrast, each block of 22 repeats is not sufficient for efficient inclusion of an alternatively spliced exon in another pre-mRNA. In this case, a longer CTD is important for counteracting the negative effect of a splicing silencer element located within the alternative exon. Our results indicate that the length, rather than the composition of CTD repeats, can be the major determinant in efficient processing of different pre-mRNA substrates. However, the extent of this length requirement depends on specific sequence features within the pre-mRNA substrate.

Two reactions of Haloferax volcanii RNA splicing enzymes: Joining of exons and circularization of introns

Archaeal RNA splicing involves at least two protein enzymes, a specific endonuclease and a specific ligase. The endonuclease recognizes and cleaves within a characteristic bulge-helix-bulge (BHB) structure formed by pairing of the regions near the two exon–intron junctions, producing 2′,3′-cyclic phosphate and 5′-hydroxyl termini. The ligase joins the exons and converts the cyclic phosphate into junction phosphate. The ligated product contains a seven-base hairpin loop, in which the splice junction is in between the two 3′ terminal residues of the loop. Archaeal splicing endonucleases are also involved in rRNA processing, cutting within the BHB structures formed by pairing of the 5′ and 3′ flanking regions of the rRNAs. Large free introns derived from pre-rRNAs have been observed as stable and abundant circular RNAs in certain Crenarchaeota, a kingdom in the domain Archaea. In the present study, we show that the cells of Haloferax volcanii, a Euryarchaeote, contain circular RNAs formed by 3′,5′-phosphodiester linkage between the two termini of the introns derived from their pre-tRNAs. H. volcanii ligase, in vitro, can also circularize both endonuclease-cleaved introns, and non-endonuclease-produced substrates. Exon joining and intron circularization are mechanistically similar ligation reactions that can occur independently. The size of the ligated hairpin loop and position of the splice junction within this loop can be changed in in vitro ligation reactions. Overall, archaeal RNA splicing seems to involve two sets of two symmetric transesterification reactions each.

B-cell and plasma-cell splicing differences: A potential role in regulated immunoglobulin RNA processing

The immunoglobulin μ pre-mRNA is alternatively processed at its 3′ end by competing splice and cleavage-polyadenylation reactions to generate mRNAs encoding the membrane-associated or secreted forms of the IgM protein, respectively. The relative use of the competing processing pathways varies during B-lymphocyte development, and it has been established previously that cleavage-polyadenylation activity is higher in plasma cells, which secrete IgM, than in B cells, which produce membrane-associated IgM. To determine whether RNA-splicing activity varies during B-lymphocyte development to contribute to μ RNA-processing regulation, we first demonstrate that μ pre-mRNA processing is sensitive to artificial changes in the splice environment by coexpressing SR proteins with the μ gene. To explore differences between the splice environments of B cells and plasma cells, we analyzed the splicing patterns from two different chimeric non-Ig genes that can be alternatively spliced but have no competing cleavage-polyadenylation reaction. The ratio of intact exon splicing to cryptic splice site use from one chimeric gene differs between several B-cell and several plasma-cell lines. Also, the amount of spliced RNA is higher in B-cell than plasma-cell lines from a set of genes whose splicing is dependent on a functional exonic splice enhancer. Thus, there is a clear difference between the B-cell and plasma-cell splicing environments. We propose that both general cleavage-polyadenylation and general splice activities are modulated during B-lymphocyte development to ensure proper regulation of the alternative μ RNA processing pathways.

Conserved RNA secondary structures promote alternative splicing

Pre-mRNA splicing is carried out by the spliceosome, which identifies exons and removes intervening introns. Alternative splicing in higher eukaryotes results in the generation of multiple protein isoforms from gene transcripts. The extensive alternative splicing observed implies a flexibility of the spliceosome to identify exons within a given pre-mRNA. To reach this flexibility, splice-site selection in higher eukaryotes has evolved to depend on multiple parameters such as splice-site strength, splicing regulators, the exon/intron architecture, and the process of pre-mRNA synthesis itself. RNA secondary structures have also been proposed to influence alternative splicing as stable RNA secondary structures that mask splice sites are expected to interfere with splice-site recognition. Using structural and functional conservation, we identified RNA structure elements within the human genome that associate with alternative splice-site selection. Their frequent involvement with alternative splicing demonstrates that RNA structure formation is an important mechanism regulating gene expression and disease.

Efficient and specific repair of sickle β-globin RNA by trans-splicing ribozymes

Previously we demonstrated that a group I ribozyme can perform trans-splicing to repair sickle β-globin transcripts upon transfection of in vitro transcribed ribozyme into mammalian cells. Here, we sought to develop expression cassettes that would yield high levels of active ribozyme after gene transfer. Our initial expression constructs were designed to generate trans-splicing ribozymes identical to those used in our previous RNA transfection studies with ribozymes containing 6-nucleotide long internal guide sequences. The ribozymes expressed from these cassettes, however, were found to be unable to repair sickle β-globin RNAs. Further experiments revealed that two additional structural elements are important for ribozyme-mediated RNA repair: the P10 interaction formed between the 5′ end of the ribozyme and the beginning of the 3′ exon and an additional base-pairing interaction formed between an extended guide sequence and the substrate RNA. These optimized expression cassettes yield ribozymes that are able to amend 10%–50% of the sickle β-globin RNAs in transfected mammalian cells. Finally, a ribozyme with a 5-bp extended guide sequence preferentially reacts with sickle β-globin RNAs over wild-type β-globin RNAs, although the wild-type β-globin transcript forms only a single mismatch with the ribozyme. These results demonstrate that trans-splicing ribozyme expression cassettes can be generated to yield ribozymes that can repair a clinically relevant fraction of sickle β-globin RNAs in mammalian cells with greatly improved specificity.

Polyadenylation releases mRNA from RNA polymerase II in a process that is licensed by splicing

When transcription is coupled to pre-mRNA processing in HeLa nuclear extracts, nascent transcripts become attached to RNA polymerase II during assembly of the cleavage/polyadenylation apparatus (CPA), and are not released even after cleavage at the poly(A) site. Here we show that these cleaved transcripts are anchored to the polymerase at their 3′ ends by the CPA or, when introns are present, by the larger 3′-terminal exon definition complex (EDC), which consists of splicing factors complexed with the CPA. Poly(A) addition releases the RNA from the polymerase when the RNA is anchored only by the CPA. When anchored by the EDC, poly(A) addition remains a requirement, but it triggers release only after being licensed by splicing. The process by which RNA must first be attached to the polymerase by the EDC, and then can only be released following dual inputs from splicing and polyadenylation, provides an obvious opportunity for surveillance as the RNA enters the transport pathway.

Genetic identification of potential RNA-binding regions in a group II intron-encoded reverse transcriptase

Mobile group II introns encode a reverse transcriptase that binds the intron RNA to promote RNA splicing and intron mobility, the latter via reverse splicing of the excised intron into DNA sites, followed by reverse transcription. Previous work showed that the Lactococcus lactis Ll.LtrB intron reverse transcriptase, denoted LtrA protein, binds with high affinity to DIVa, a stem–loop structure at the beginning of the LtrA open reading frame and makes additional contacts with intron core regions that stabilize the active RNA structure for forward and reverse splicing. LtrA's binding to DIVa down-regulates its translation and is critical for initiation of reverse transcription. Here, by using high-throughput unigenic evolution analysis with a genetic assay in which LtrA binding to DIVa down-regulates translation of GFP, we identified regions at LtrA's N terminus that are required for DIVa binding. Then, by similar analysis with a reciprocal genetic assay, we confirmed that residual splicing of a mutant intron lacking DIVa does not require these N-terminal regions, but does require other reverse transcriptase (RT) and X/thumb domain regions that bind the intron core. We also show that N-terminal fragments of LtrA by themselves bind specifically to DIVa in vivo and in vitro. Our results suggest a model in which the N terminus of nascent LtrA binds DIVa of the intron RNA that encoded it and nucleates further interactions with core regions that promote RNP assembly for RNA splicing and intron mobility. Features of this model may be relevant to evolutionarily related non-long-terminal-repeat (non-LTR)-retrotransposon RTs.

Widespread RNA Editing of Embedded Alu Elements in the Human Transcriptome

More than one million copies of the ∼300-bp Alu element are interspersed throughout the human genome, with up to 75% of all known genes having Alu insertions within their introns and/or UTRs. Transcribed Alu sequences can alter splicing patterns by generating new exons, but other impacts of intragenic Alu elements on their host RNA are largely unexplored. Recently, repeat elements present in the introns or 3′-UTRs of 15 human brain RNAs have been shown to be targets for multiple adenosine to inosine (A-to-I) editing. Using a statistical approach, we find that editing of transcripts with embedded Alu sequences is a global phenomenon in the human transcriptome, observed in 2674 (∼2%) of all publicly available full-length human cDNAs (n = 128,406), from >250 libraries and >30 tissue sources. In the vast majority of edited RNAs, A-to-I substitutions are clustered within transcribed sense or antisense Alu sequences. Edited bases are primarily associated with retained introns, extended UTRs, or with transcripts that have no corresponding known gene. Therefore, Alu-associated RNA editing may be a mechanism for marking nonstandard transcripts, not destined for translation.

Novel noncoding RNA

The human Y chromosome, because it is enriched in repetitive DNA, has been very intractable to genetic and molecular analyses. There is no previous evidence for developmental stage- and testis-specific transcription from the male-specific region of the Y (MSY). Here, we present evidence for the first time for a developmental stage- and testis-specific transcription from MSY distal heterochromatic block. We isolated two novel RNAs, which localize to Yq12 in multiple copies, show testis-specific expression, and lack active X-homologs. Experimental evidence shows that one of the above Yq12 noncoding RNAs (ncRNAs) trans-splices with CDC2L2 mRNA from chromosome 1p36.3 locus to generate a testis-specific chimeric β sv13 isoform. This 67-nt 5′UTR provided by the Yq12 transcript contains within it a Y box protein-binding CCAAT motif, indicating translational regulation of the β sv13 isoform in testis. This is also the first report of trans-splicing between a Y chromosomal and an autosomal transcript.

Genomic localization of RNA binding proteins reveals links between pre-mRNA processing and transcription

Pre-mRNA processing often occurs in coordination with transcription thereby coupling these two key regulatory events. As such, many proteins involved in mRNA processing associate with the transcriptional machinery and are in proximity to DNA. This proximity allows for the mapping of the genomic associations of RNA binding proteins by chromatin immunoprecipitation (ChIP) as a way of determining their sites of action on the encoded mRNA. Here, we used ChIP combined with high-density microarrays to localize on the human genome three functionally distinct RNA binding proteins: the splicing factor polypyrimidine tract binding protein (PTBP1/hnRNP I), the mRNA export factor THO complex subunit 4 (ALY/THOC4), and the 3′ end cleavage stimulation factor 64 kDa (CSTF2). We observed interactions at promoters, internal exons, and 3′ ends of active genes. PTBP1 had biases toward promoters and often coincided with RNA polymerase II (RNA Pol II). The 3′ processing factor, CSTF2, had biases toward 3′ ends but was also observed at promoters. The mRNA processing and export factor, ALY, mapped to some exons but predominantly localized to introns and did not coincide with RNA Pol II. Because the RNA binding proteins did not consistently coincide with RNA Pol II, the data support a processing mechanism driven by reorganization of transcription complexes as opposed to a scanning mechanism. In sum, we present the mapping in mammalian cells of RNA binding proteins across a portion of the genome that provides insight into the transcriptional assembly of RNA–protein complexes.

A systematic analysis of intronic sequences downstream

To identify human intronic sequences associated with 5′ splice site recognition, we performed a systematic search for motifs enriched in introns downstream of both constitutive and alternative cassette exons. Significant enrichment was observed for U-rich motifs within 100 nucleotides downstream of 5′ splice sites of both classes of exons, with the highest enrichment between positions +6 and +30. Exons adjacent to U-rich intronic motifs contain lower frequencies of exonic splicing enhancers and higher frequencies of exonic splicing silencers, compared with exons not followed by U-rich intronic motifs. These findings motivated us to explore the possibility of a widespread role for U-rich motifs in promoting exon inclusion. Since cytotoxic granule-associated RNA binding protein (TIA1) and TIA1-like 1 (TIAL1; also known as TIAR) were previously shown in vitro to bind to U-rich motifs downstream of 5′ splice sites, and to facilitate 5′ splice site recognition in vitro and in vivo, we investigated whether these factors function more generally in the regulation of splicing of exons followed by U-rich intronic motifs. Simultaneous knockdown of TIA1 and TIAL1 resulted in increased skipping of 36/41 (88%) of alternatively spliced exons associated with U-rich motifs, but did not affect 32/33 (97%) alternatively spliced exons that are not associated with U-rich motifs. The increase in exon skipping correlated with the proximity of the first U-rich motif and the overall “U-richness” of the adjacent intronic region. The majority of the alternative splicing events regulated by TIA1/TIAL1 are conserved in mouse, and the corresponding genes are associated with diverse cellular functions. Based on our results, we estimate that ∼15% of alternative cassette exons are regulated by TIA1/TIAL1 via U-rich intronic elements.
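As a quick arithmetic check on the proportions quoted in the abstract above, the following minimal sketch recomputes the two percentages from the reported counts (36 of 41 and 32 of 33). The helper name `percent` is a hypothetical label introduced only for this illustration; it is not part of the original study's analysis.

```python
def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Counts taken directly from the abstract.
skipped_with_u_rich = percent(36, 41)      # exons with U-rich motifs showing increased skipping
unaffected_without_u_rich = percent(32, 33)  # exons without U-rich motifs that were unaffected

print(skipped_with_u_rich, unaffected_without_u_rich)
```

Both values agree with the rounded figures given in the text (88% and 97%).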

Alternative splicing of anciently exonized 5S rRNA

Identifying conserved alternative splicing (AS) events among evolutionarily distant species can prioritize AS events for functional characterization and help uncover relevant cis- and trans-regulatory factors. A genome-wide search for conserved cassette exon AS events in higher plants revealed the exonization of 5S ribosomal RNA (5S rRNA) within the gene of its own transcription regulator, TFIIIA (transcription factor for polymerase III A). The 5S rRNA-derived exon in TFIIIA gene exists in all representative land plant species but not in green algae and nonplant species, suggesting it is specific to land plants. TFIIIA is essential for RNA polymerase III-based transcription of 5S rRNA in eukaryotes. Integrating comparative genomics and molecular biology revealed that the conserved cassette exon derived from 5S rRNA is coupled with nonsense-mediated mRNA decay. Utilizing multiple independent Arabidopsis overexpressing TFIIIA transgenic lines under osmotic and salt stress, strong accordance between phenotypic and molecular evidence reveals the biological relevance of AS of the exonized 5S rRNA in quantitative autoregulation of TFIIIA homeostasis. Most significantly, this study provides the first evidence of ancient exaptation of 5S rRNA in plants, suggesting a novel gene regulation model mediated by the AS of an anciently exonized noncoding element.

Genome-wide mapping of alternative splicing in Arabidopsis thaliana

Alternative splicing can enhance transcriptome plasticity and proteome diversity. In plants, alternative splicing can be manifested at different developmental stages, and is frequently associated with specific tissue types or environmental conditions such as abiotic stress. We mapped the Arabidopsis transcriptome at single-base resolution using the Illumina platform for ultrahigh-throughput RNA sequencing (RNA-seq). Deep transcriptome sequencing confirmed a majority of annotated introns and identified thousands of novel alternatively spliced mRNA isoforms. Our analysis suggests that at least ∼42% of intron-containing genes in Arabidopsis are alternatively spliced; this is significantly higher than previous estimates based on cDNA/expressed sequence tag sequencing. Random validation confirmed that novel splice isoforms empirically predicted by RNA-seq can be detected in vivo. Novel introns detected by RNA-seq were substantially enriched in nonconsensus terminal dinucleotide splice signals. Alternative isoforms with premature termination codons (PTCs) comprised the majority of alternatively spliced transcripts. Using an example of an essential circadian clock gene, we show that intron retention can generate relatively abundant PTC+ isoforms and that this specific event is highly conserved among diverse plant species. Alternatively spliced PTC+ isoforms can be potentially targeted for degradation by the nonsense mediated mRNA decay (NMD) surveillance machinery or regulate the level of functional transcripts by the mechanism of regulated unproductive splicing and translation (RUST). We demonstrate that the relative ratios of the PTC+ and reference isoforms for several key regulatory genes can be considerably shifted under abiotic stress treatments. Taken together, our results suggest that like in animals, NMD and RUST may be widespread in plants and may play important roles in regulating gene expression.

specific alternative splicing in primates

Comparative studies of gene regulation suggest an important role for natural selection in shaping gene expression patterns within and between species. Most of these studies, however, estimated gene expression levels using microarray probes designed to hybridize to only a small proportion of each gene. Here, we used recently developed RNA sequencing protocols, which sidestep this limitation, to assess intra- and interspecies variation in gene regulatory processes in considerably more detail than was previously possible. Specifically, we used RNA-seq to study transcript levels in humans, chimpanzees, and rhesus macaques, using liver RNA samples from three males and three females from each species. Our approach allowed us to identify a large number of genes whose expression levels likely evolve under natural selection in primates. These include a subset of genes with conserved sexually dimorphic expression patterns across the three species, which we found to be enriched for genes involved in lipid metabolism. Our data also suggest that while alternative splicing is tightly regulated within and between species, sex-specific and lineage-specific changes in the expression of different splice forms are also frequent. Intriguingly, among genes in which a change in exon usage occurred exclusively in the human lineage, we found an enrichment of genes involved in anatomical structure and morphogenesis, raising the possibility that differences in the regulation of alternative splicing have been an important force in human evolution.

Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts

Metazoan genes are encrypted with at least two superimposed codes: the genetic code to specify the primary structure of proteins and the splicing code to expand their proteomic output via alternative splicing. Here, we define the specificity of a central regulator of pre-mRNA splicing, the conserved, essential splicing factor SFRS1. Cross-linking immunoprecipitation and high-throughput sequencing (CLIP-seq) identified 23,632 binding sites for SFRS1 in the transcriptome of cultured human embryonic kidney cells. SFRS1 was found to engage many different classes of functionally distinct transcripts including mRNA, miRNA, snoRNAs, ncRNAs, and conserved intergenic transcripts of unknown function. The majority of these diverse transcripts share a purine-rich consensus motif corresponding to the canonical SFRS1 binding site. The consensus site was not only enriched in exons cross-linked to SFRS1 in vivo, but was also enriched in close proximity to splice sites. mRNAs encoding RNA processing factors were significantly overrepresented, suggesting that SFRS1 may broadly influence the post-transcriptional control of gene expression in vivo. Finally, a search for the SFRS1 consensus motif within the Human Gene Mutation Database identified 181 mutations in 82 different genes that disrupt predicted SFRS1 binding sites. This comprehensive analysis substantially expands the known roles of human SR proteins in the regulation of a diverse array of RNA transcripts.

Identification and Functional Analysis of Mutations in the Hypocretin (Orexin) Genes of Narcoleptic Canines

Narcolepsy is a sleep disorder affecting animals and humans. Exon skipping mutations of the Hypocretin/Orexin-receptor-2 (Hcrtr2) gene were identified as the cause of narcolepsy in Dobermans and Labradors. Preprohypocretin (Hcrt) knockout mice have symptoms similar to human and canine narcolepsy. In this study, 11 sporadic cases of canine narcolepsy and two additional multiplex families were investigated for possible Hcrt and Hcrtr2 mutations. Sporadic cases have been shown to have more variable disease onset, increased disease severity, and undetectable Hypocretin-1 levels in cerebrospinal fluid. The canine Hcrt locus was isolated and characterized for this project. Only one novel mutation was identified in these two loci. This alteration results in a single amino acid substitution (E54K) in the N-terminal region of the Hcrtr2 receptor and autosomal recessive transmission in a Dachshund family. Functional analyses of previously described exon-skipping mutations and of the E54K substitution were also performed using HEK-293 cell lines transfected with wild-type and mutated constructs. Results indicate a truncated Hcrtr2 protein, an absence of proper membrane localization, and undetectable binding and signal transduction for exon-skipping mutated constructs. In contrast, the E54K abnormality was associated with proper membrane localization, loss of ligand binding, and dramatically diminished calcium mobilization on activation of the receptor. These results are consistent with a loss of function for all three mutations. The absence of mutation in sporadic cases also indicates genetic heterogeneity in canine narcolepsy, as reported previously in humans.

Nonradioactive multiplex PCR screening strategy for the simultaneous detection of multiple low-density lipoprotein receptor gene mutations.

We have developed a rapid, nonradioactive screening test enabling the simultaneous analysis of three low-density lipoprotein receptor (LDLR) gene mutations (D154N, D206E, and V408M), which together account for familial hypercholesterolemia (FH) in approximately 90% of the South African Afrikaner population. The assay is designed so that FH patients, negative for these founder-related mutations (found in descendants of European settlers), subsequently can be screened for unknown mutations in the mutation-rich exon 4 of the LDLR gene. Our screening assay consists of two steps: (1) multiplex allele-specific PCR amplification of exons 4 and 9, and (2) simultaneous analysis of single- and double-strand conformational polymorphisms in exon 4 by vertical electrophoresis on low cross-linked polyacrylamide gels. The simplicity, specificity, and versatility of the multiplex assay make it an ideal system for routine screening of FH mutations in large population samples.

A nonsense mutation in the cathepsin K gene observed in a family with pycnodysostosis.

Pycnodysostosis (MIM 265800) is a rare, autosomal recessive skeletal dysplasia characterized by short stature, wide cranial sutures, and increased bone density and fragility. Linkage analysis localized the disease gene to human chromosome 1q21, and subsequently the genetic interval was narrowed to between markers D1S2612 and D1S2345. Expressed sequence tagged markers corresponding to cathepsin K, a cysteine protease highly expressed in osteoclasts and thought to be important in bone resorption, were mapped previously in the candidate region. We have identified a cytosine to thymidine transition at nucleotide 862 (GenBank accession no. S79895) of the cathepsin K coding sequence in the DNA of an affected individual from a large, consanguineous Mexican family. This mutation results in an arginine to STOP alteration at amino acid 241, predicting premature termination of cathepsin K mRNA translation. All affected individuals in this family were homozygous for the mutation, suggesting that this alteration may lead to pycnodysostosis. Recognition of the role of cathepsin K in the etiology of pycnodysostosis should provide insights into the pathogenesis and treatment of other disorders of bone remodeling, including osteoporosis.

Rapid sequence analysis of gene trap integrations to generate a resource of insertional mutations in mice.

Gene trapping in murine embryonic stem cells is a proven method for the simultaneous identification and mutation of genes in the mouse. Gene trap vectors are designed to detect insertions within genes through the production of a fusion mRNA transcript, making the identification of the endogenous gene possible by 5' rapid amplification of cDNA ends (RACE). Although the amplification of specific cDNAs can be achieved rapidly, cloning and screening of informative-sized cDNAs has proven to be time consuming. To eliminate the need for cloning, we have developed a method for solid-phase sequencing of 5' RACE products. More than 150 independent gene trap cell lines were analyzed, and sequence information was obtained for every line successfully amplified by RACE. With the vector used in this study, 40% of the cell lines were found to contain properly spliced gene trap events. The remaining lines were either spliced inefficiently or contained deletions of the vector. These results highlight the advantage of sequencing gene trap integrations before further characterization. This work now paves the way for large-scale gene trap screens in mice and should greatly accelerate the functional analysis of the mammalian genome.

Nested Patch PCR enables highly multiplexed mutation discovery in candidate genes

Medical resequencing of candidate genes in individual patient samples is becoming increasingly important in the clinic and in clinical research. Medical resequencing requires the amplification and sequencing of many candidate genes in many patient samples. Here we introduce Nested Patch PCR, a novel method for highly multiplexed PCR that is very specific, can sensitively detect SNPs and mutations, and is easy to implement. This is the first method that couples multiplex PCR with sample-specific DNA barcodes and next-generation sequencing to enable highly multiplexed mutation discovery in candidate genes for multiple samples in parallel. In our pilot study, we amplified exons from colon cancer and matched normal human genomic DNA. From each sample, we successfully amplified 96% (90 of 94) of the targeted exons from across the genome, totaling 21.6 kbp of sequence. Ninety percent of all sequencing reads were from targeted exons, demonstrating that Nested Patch PCR is highly specific. We found that the abundance of reads per exon was reproducible across samples. We reliably detected germline SNPs and discovered a colon tumor-specific nonsense mutation in APC, a gene causally implicated in colorectal cancer. With Nested Patch PCR, candidate gene mutation discovery across multiple individual patient samples can now utilize the power of second-generation sequencing.

A tandem duplication within the fibrillin 1 gene is associated with the mouse tight skin mutation.

Mice carrying the Tight skin (Tsk) mutation have thickened skin and visceral fibrosis resulting from an accumulation of extracellular matrix molecules. These and other connective tissue abnormalities have made Tsk/+ mice models for scleroderma, hereditary emphysema, and myocardial hypertrophy. Previously we localized Tsk to mouse chromosome 2 in a region syntenic with human chromosome 15. The microfibrillar glycoprotein gene, fibrillin 1 (FBN1), on human chromosome 15q, provided a candidate for the Tsk mutation. We now demonstrate that the Tsk chromosome harbors a 30- to 40-kb genomic duplication within the Fbn1 gene that results in a larger than normal in-frame Fbn1 transcript. These findings provide hypotheses to explain some of the phenotypic characteristics of Tsk/+ mice and the lethality of Tsk/Tsk embryos.

Callipyge mutation affects gene expression in cis: A potential role for chromatin structure

Muscular hypertrophy in callipyge sheep results from a single nucleotide substitution located in the genomic interval between the imprinted Delta, Drosophila, Homolog-like 1 (DLK1) and Maternally Expressed Gene 3 (MEG3). The mechanism linking the mutation to muscle hypertrophy is unclear but involves DLK1 overexpression. The mutation is contained within CLPG1 transcripts produced from this region. Herein we show that CLPG1 is expressed prenatally in the hypertrophy-responsive longissimus dorsi muscle by all four possible genotypes, but postnatal expression is restricted to sheep carrying the mutation. Surprisingly, the mutation results in nonimprinted monoallelic transcription of CLPG1 from only the mutated allele in adult sheep, whereas it is expressed biallelically during prenatal development. We further demonstrate that local CpG methylation is altered by the presence of the mutation in longissimus dorsi of postnatal sheep. For 10 CpG sites flanking the mutation, methylation is similar prenatally across genotypes, but doubles postnatally in normal sheep. This normal postnatal increase in methylation is significantly repressed in sheep carrying one copy of the mutation, and repressed even further in sheep with two mutant alleles. The attenuation in methylation status in the callipyge sheep correlates with the onset of the phenotype, continued CLPG1 transcription, and high-level expression of DLK1. In contrast, normal sheep exhibit hypermethylation of this locus after birth and CLPG1 silencing, which coincides with DLK1 transcriptional repression. These data are consistent with the notion that the callipyge mutation inhibits perinatal nucleation of regional chromatin condensation resulting in continued elevated transcription of prenatal DLK1 levels in adult callipyge sheep. We propose a model incorporating these results that can also account for the enigmatic normal phenotype of homozygous mutant sheep.

A missense mutation in the bovine SLC35A3 gene, encoding a UDP-N-acetylglucosamine transporter, causes complex vertebral malformation

The extensive use of a limited number of elite bulls in cattle breeding can lead to rapid spread of recessively inherited disorders. A recent example is the globally distributed syndrome Complex Vertebral Malformation (CVM), which is characterized by misshapen and fused vertebrae around the cervico-thoracic junction. Here, we show that CVM is caused by a mutation in the Golgi-resident nucleotide-sugar transporter encoded by SLC35A3. Thus, the disease showed complete cosegregation with the mutation in a homozygous state, and proteome patterns indicated abnormal protein glycosylation in tissues of affected animals. In addition, a yeast mutant that is deficient in the transport of UDP-N-acetylglucosamine into its Golgi lumen can be rescued by the wild-type SLC35A3 gene, but not by the mutated gene. These results provide the first demonstration of a genetic disorder associated with a defective SLC35A3 gene, and reveal a new mechanism for malformation of the vertebral column caused by abnormal nucleotide-sugar transport into the Golgi apparatus.

The modifier of Min 2 (Mom2) locus: Embryonic lethality of a mutation in the Atp5a1 gene suggests a novel mechanism of polyp suppression

Inactivation of the APC gene is considered the initiating event in human colorectal cancer. Modifier genes that influence the penetrance of mutations in tumor-suppressor genes hold great potential for preventing the development of cancer. The mechanism by which modifier genes alter adenoma incidence can be readily studied in mice that inherit mutations in the Apc gene. We identified a new modifier locus of ApcMin-induced intestinal tumorigenesis called Modifier of Min 2 (Mom2). The polyp-resistant Mom2R phenotype resulted from a spontaneous mutation and linkage analysis localized Mom2 to distal chromosome 18. To obtain recombinant chromosomes for use in refining the Mom2 interval, we generated congenic DBA.B6 ApcMin/+, Mom2R/+ mice. An intercross revealed that Mom2R encodes a recessive embryonic lethal mutation. We devised an exclusion strategy for mapping the Mom2 locus using embryonic lethality as a method of selection. Expression and sequence analyses of candidate genes identified a duplication of four nucleotides within exon 3 of the α subunit of the ATP synthase (Atp5a1) gene. Tumor analyses revealed a novel mechanism of polyp suppression by Mom2R in Min mice. Furthermore, we show that more adenomas progress to carcinomas in Min mice that carry the Mom2R mutation. The absence of loss of heterozygosity (LOH) at the Apc locus, combined with the tendency of adenomas to progress to carcinomas, indicates that the sequence of events leading to tumors in ApcMin/+ Mom2R/+ mice is consistent with the features of human tumor initiation and progression.

Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines

We inferred the rate and properties of new spontaneous mutations in Drosophila melanogaster by carrying out whole-genome shotgun sequencing-by-synthesis of three mutation accumulation (MA) lines that had been maintained by close inbreeding for an average of 262 generations. We tested for the presence of new mutations by generating alignments of each MA line to the D. melanogaster reference genome sequence and then compared these alignments base by base. We determined empirically that at least five reads at a site within each line are required for accurate single nucleotide mutation calling. We mapped a total of 174 single-nucleotide mutations, giving a single nucleotide mutation rate of 3.5 × 10⁻⁹ per site per generation. There were no false positives in a random sample of 40 of these mutations checked by Sanger sequencing. Variation in the numbers of mutations among the MA lines was small and nonsignificant. Numbers of transition and transversion mutations were 86 and 88, respectively, implying that the transition mutation rate is close to 2× the transversion rate. We observed 1.5× as many G or C → A or T as A or T → G or C mutations, implying that the G or C → A or T mutation rate is close to 2× the A or T → G or C mutation rate. The base composition of the genome is therefore not at an equilibrium determined solely by mutation. The predicted G + C content at mutational equilibrium (33%) is similar to that observed in transposable element remnants. Nearest-neighbor mutational context dependencies are nonsignificant, suggesting that this is a weak phenomenon in Drosophila. We also saw nonsignificant differences in the mutation rate between transcribed and untranscribed regions, implying that any transcription-coupled repair process is weak. Of seven short indel mutations confirmed, six were deletions, consistent with the deletion bias that is thought to exist in Drosophila.
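
The arithmetic behind the rate ratios and the equilibrium G + C content quoted above can be checked with a short calculation. The sketch below is illustrative only and is not the authors' analysis: it assumes a simple two-state (G/C versus A/T) mutation model, and all function names are hypothetical.

```python
def per_path_rate_ratio(transitions, transversions):
    """Transition-to-transversion rate ratio per mutational path.

    Every base has one possible transition and two possible transversions,
    so the transversion count is spread over twice as many paths.
    """
    return (transitions / 1) / (transversions / 2)


def equilibrium_gc(gc_to_at_rate, at_to_gc_rate):
    """Equilibrium G+C fraction f under a two-state model (assumption).

    At equilibrium the fluxes balance: f * u = (1 - f) * v, so f = v / (u + v),
    where u is the G/C -> A/T rate and v is the A/T -> G/C rate.
    """
    return at_to_gc_rate / (gc_to_at_rate + at_to_gc_rate)


def implied_gc_content(count_ratio, rate_ratio):
    """G+C fraction g implied by an observed count ratio and a rate ratio.

    Under the same two-state model, count_ratio = rate_ratio * g / (1 - g),
    which rearranges to g = count_ratio / (count_ratio + rate_ratio).
    """
    return count_ratio / (count_ratio + rate_ratio)


# Counts reported in the abstract: 86 transitions and 88 transversions.
ts_tv_ratio = per_path_rate_ratio(86, 88)

# The abstract states the G/C -> A/T rate is close to 2x the reverse rate.
gc_at_equilibrium = equilibrium_gc(gc_to_at_rate=2.0, at_to_gc_rate=1.0)

# Back-calculation (assumption-laden): G+C fraction of the surveyed sites
# implied by a 1.5x count ratio and a 2x rate ratio.
gc_implied = implied_gc_content(count_ratio=1.5, rate_ratio=2.0)

print(f"transition/transversion rate ratio per path: {ts_tv_ratio:.2f}")
print(f"equilibrium G+C fraction: {gc_at_equilibrium:.2f}")
print(f"implied current G+C fraction: {gc_implied:.2f}")
```

Under these assumptions, equal transition and transversion counts translate into a roughly twofold per-path transition bias, and a twofold G/C → A/T bias gives an equilibrium G + C fraction of one third, in line with the 33% quoted in the abstract.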

DNA copy number aberrations across multiple array-CGH experiments

Regions of gain and loss of genomic DNA occur in many cancers and can drive the genesis and progression of disease. These copy number aberrations (CNAs) can be detected at high resolution by using microarray-based techniques. However, robust statistical approaches are needed to identify nonrandom gains and losses across multiple experiments/samples. We have developed a method called Significance Testing for Aberrant Copy number (STAC) to address this need. STAC utilizes two complementary statistics in combination with a novel search strategy. The significance of both statistics is assessed, and P-values are assigned to each location on the genome by using a multiple testing corrected permutation approach. We validate our method by using two published cancer data sets. STAC identifies genomic alterations known to be of clinical and biological significance and provides statistical support for 85% of previously reported regions. Moreover, STAC identifies numerous additional regions of significant gain/loss in these data that warrant further investigation. The P-values provided by STAC can be used to prioritize regions for follow-up study in an unbiased fashion. We conclude that STAC is a powerful tool for identifying nonrandom genomic amplifications and deletions across multiple experiments.
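
The idea of a multiple-testing-corrected permutation test across samples can be sketched as follows. This is not the STAC implementation: the abstract does not specify its two statistics or its search strategy, so a simple per-location frequency count stands in for them here, and the permutation scheme (shuffling aberrant intervals within each sample) and all names are assumptions made for illustration.

```python
import random


def frequency(profiles):
    """Number of samples carrying an aberration at each genomic location."""
    return [sum(column) for column in zip(*profiles)]


def permute_profile(profile, rng):
    """Shuffle one sample's aberrant intervals along the profile.

    Runs of 1s are kept intact as blocks and each 0 is its own block, so the
    number of aberrant locations in the sample is preserved. Shuffled runs
    may end up adjacent, which this sketch tolerates for simplicity.
    """
    blocks, run = [], 0
    for value in profile:
        if value:
            run += 1
        else:
            if run:
                blocks.append([1] * run)
                run = 0
            blocks.append([0])
    if run:
        blocks.append([1] * run)
    rng.shuffle(blocks)
    return [value for block in blocks for value in block]


def permutation_pvalues(profiles, n_perm=500, seed=0):
    """Per-location P-values corrected with the max-statistic approach.

    Each observed frequency is compared with the null distribution of the
    maximum frequency seen anywhere in the permuted data, which controls for
    testing many locations at once.
    """
    rng = random.Random(seed)
    observed = frequency(profiles)
    null_max = []
    for _ in range(n_perm):
        permuted = [permute_profile(p, rng) for p in profiles]
        null_max.append(max(frequency(permuted)))
    return [
        (1 + sum(m >= obs for m in null_max)) / (n_perm + 1)
        for obs in observed
    ]


# Toy data (hypothetical): six samples, 20 locations, shared gain at 8-10.
profiles = [[1 if 8 <= i <= 10 else 0 for i in range(20)] for _ in range(6)]
pvalues = permutation_pvalues(profiles)
print("P-value at shared location 9:", pvalues[9])
print("P-value at unaffected location 0:", pvalues[0])
```

In this toy example the location shared by all six samples receives a small corrected P-value, while locations with no aberrations receive a P-value of 1, which is the kind of ranking that could be used to prioritize regions for follow-up.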

A Generic System for Fast and Flexible Access to Biological Data

EnsMart is a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to 'non-Ensembl' data sets.

Customized Annotation of Genome Regions

GANESH is a software package for the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand-alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.


Genome-Wide Duplications at the Origin of Vertebrates Using an Amphioxus Gene Set and Completed Animal Genomes

The 2R hypothesis predicting two genome duplications at the origin of vertebrates is highly controversial. Studies published so far include limited sequence data from organisms close to the hypothesized genome duplications. Through the comparison of a gene catalog from amphioxus, the closest living invertebrate relative of vertebrates, to 3453 single-copy genes orthologous between Caenorhabditis elegans (C), Drosophila melanogaster (D), and Saccharomyces cerevisiae (Y), and to Ciona intestinalis ESTs, mouse, and human genes, we show with a large number of genes that the gene duplication activity is significantly higher after the separation of amphioxus and the vertebrate lineages, which we estimate at 650 million years (Myr). The majority of human orthologs of 195 CDY groups that could be dated by the molecular clock appear to be duplicated between 300 and 680 Myr with a mean at 488 million years ago (Mya). We detected 485 duplicated chromosomal segments in the human genome containing CDY orthologs, 331 of which are found duplicated in the mouse genome and within regions syntenic between human and mouse, indicating that these were generated earlier than the human–mouse split. Model-based calculations of the codon substitution rate of the human genes included in these segments agree with the molecular clock duplication time-scale prediction. Our results favor at least one large duplication event at the origin of vertebrates, followed by smaller-scale duplication closer to the bird–mammalian split.

Spidey: A Tool for mRNA-to-Genomic Alignments

We have developed a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment. Spidey can produce reliable alignments quickly, even when confronted with noise from alternative splicing, polymorphisms, sequencing errors, or evolutionary divergence. We show how Spidey was used to align reference sequences to known genomic sequences and then to the draft human genome, to align mRNAs to gene clusters, and to align mouse mRNAs to human genomic sequence. We compared Spidey to two other spliced alignment programs; Spidey generally performed quite well in a very reasonable amount of time.

DIAN: A Novel Algorithm for Genome Ontological Classification

With many genomes now completely sequenced, computational biology faces the challenge of interpreting the significance of these data sets. A multiplicity of data-related problems impedes this goal: Biological annotations associated with raw data are often not normalized, and the data themselves are often poorly interrelated and their interpretation unclear. All of these problems make interpretation of genomic databases increasingly difficult. With the current explosion of sequences now available from the human genome as well as from model organisms, sorting this vast amount of conceptually unstructured source data into a limited universe of genes, proteins, functions, structures, and pathways has become a bottleneck for the field. To address this problem, we have developed a method of interrelating data sources by applying a novel method of associating biological objects to ontologies. We have developed an intelligent knowledge-based algorithm, DIAN, to support biological knowledge mapping, and, in particular, to facilitate the interpretation of genomic data. In this respect, the method makes it possible to inventory genomes by collapsing multiple types of annotations and normalizing them to various ontologies. By relying on a conceptual view of the genome, researchers can now easily navigate the human genome in a biologically intuitive, scientifically accurate manner.

Sequencing the Complete Human Genome

A 30-fold redundant human bacterial artificial chromosome (BAC) library with a large average insert size (178 kb) has been constructed to provide the intermediate substrate for the international genome sequencing effort. The DNA was obtained from a single anonymous volunteer, whose identity was protected through a double-blind donor selection protocol. DNA fragments were generated by partial digestion with EcoRI (library segments 1–4: 24-fold) and MboI (segment 5: sixfold) and cloned into the pBACe3.6 and pTARBAC1 vectors, respectively. The quality of the library was assessed by extensive analysis of 169 clones for rearrangements and artifacts. Eighteen BACs (11%) revealed minor insert rearrangements, and none was chimeric. This BAC library, designated as “RPCI-11,” has been used widely as the central resource for insert-end sequencing, clone fingerprinting, high-throughput sequence analysis and as a source of mapped clones for diagnostic and functional studies.
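
The figures in this abstract are internally consistent, as a quick calculation shows. The sketch below also estimates how many clones a library of this depth would contain; that estimate rests on an assumed haploid genome size of about 3 × 10⁹ bp, which is not stated in the abstract, and the variable names are hypothetical.

```python
# Fold redundancy of each part of the library, as given in the abstract.
redundancy_by_enzyme = {
    "EcoRI (segments 1-4)": 24,
    "MboI (segment 5)": 6,
}
total_redundancy = sum(redundancy_by_enzyme.values())

# Quality check reported in the abstract: 18 of 169 analyzed clones
# showed minor insert rearrangements.
rearranged_fraction = 18 / 169

# ASSUMPTION: haploid human genome size, not taken from the abstract.
ASSUMED_GENOME_SIZE_BP = 3.0e9
AVERAGE_INSERT_BP = 178_000  # 178 kb average insert, from the abstract

# Redundancy = clones * insert size / genome size, solved for clones.
estimated_clones = total_redundancy * ASSUMED_GENOME_SIZE_BP / AVERAGE_INSERT_BP

print(f"total redundancy: {total_redundancy}-fold")
print(f"rearranged clones: {rearranged_fraction:.1%}")
print(f"estimated clones in library: {estimated_clones:,.0f}")
```

The clone count is only as good as the assumed genome size; the redundancy total and the rearrangement percentage follow directly from the numbers in the abstract.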

An improved method for the detection of hepatitis C virus RNA in plasma utilizing heminested primers and internal control RNA.

The majority of transfusion-associated, non-A, non-B hepatitis cases are caused by hepatitis C virus (HCV), a positive-stranded RNA virus. Although high titers of HCV in clinical specimens have been reported, in some cases extremely low titers of virus are not uncommon. Therefore, an extremely sensitive and reliable assay is required to determine viremia and replication of HCV accurately. We report here the systematic investigation of factors influencing the detection of HCV RNA by a reverse transcription-polymerase chain reaction (RT-PCR) assay utilizing "drop in-drop out" heminested primers derived from the conserved 5' noncoding region of the viral genome. A genetically engineered 5' noncoding region has been constructed and used as an internal control. Addition of the control RNA to each test not only allowed semiquantitation of positive reactions but also validated the performance of reverse transcription and PCR for every specimen. The optimized heminested PCR (HN-PCR) protocol is capable of amplifying one molecule of cloned HCV DNA or 10 molecules of in vitro-transcribed HCV RNA to levels detectable in ethidium bromide-stained agarose gels. We evaluated the improved method for the detection of HCV RNA on a human plasma sample containing the pedigreed strain H of HCV with a chimpanzee infectious dose of 10⁶/ml. Utilizing the internal control RNA, we calculated 2 × 10⁷ virions in 1 ml of the original human plasma. The HN-PCR achieves the sensitivity and specificity of the double-nested PCR (DN-PCR) in a simplified format that avoids the false-positive results associated with DN-PCR.

Selective RNA amplification: a novel method using dUMP-containing primers and uracil DNA glycosylase.

The application of PCR to a wide variety of biological problems and molecular techniques has gained wide acceptance. RNA-PCR, a technique in which first-strand cDNA synthesis is followed by PCR amplification, has enabled detection and characterization of rare transcripts. One problem confronting the researcher involves specific amplification of transcribed sequences in the presence of small amounts of genomic DNA of identical sequence. We describe a novel technique, selective RNA amplification, which will specifically amplify RNA sequences in a background of homologous DNA. The method involves first-strand cDNA synthesis from a specific dUMP-containing oligonucleotide that contains unique user-defined 5' sequence (adapter sequence) not found in the message of interest. RNA template is degraded using RNase H, which is specific for RNA/DNA hybrids. This is followed by second-strand synthesis using a gene-specific primer (GSP). The original adapter primer is digested with uracil DNA glycosylase (UDG) to prevent its participation in subsequent amplification. PCR is then performed using the GSP and a second primer corresponding to the unique adapter sequence. In this paper, we apply this method to the amplification of RNA derived from human papilloma virus sequences. Using Southern analysis, we demonstrate specific amplification of 10⁵ molecules of an in vitro-transcribed RNA. Denatured DNA of identical sequence and concentration was not amplified using the RNA-specific method. The method could eliminate the need for stringent purification of RNA and enables amplification of rare messages from RNA preparations containing homologous DNA of identical sequence and size.

