Repetitive Elements May Comprise Over Two-Thirds of the Human Genome
Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has previously been identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo “clouds”). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%–69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different-sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Because short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed “element-specific” P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using them we identified ∼100 Mbp of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
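The idea of oligo “clouds” described above can be illustrated with a toy sketch. This is not the authors' P-clouds implementation: the function names, the thresholds `core_min` and `outer_min`, and the choice of a single-substitution neighbourhood are all assumptions made here for illustration. The sketch counts overlapping oligos, seeds a cloud at each high-abundance oligo, absorbs moderately abundant oligos one substitution away, and then marks sequence positions covered by any cloud member.

```python
from collections import Counter


def count_oligos(seq, k):
    """Count every overlapping oligo of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(oligo):
    """Yield every oligo that differs from `oligo` by one substitution."""
    for i, base in enumerate(oligo):
        for alt in "ACGT":
            if alt != base:
                yield oligo[:i] + alt + oligo[i + 1:]


def build_clouds(counts, core_min, outer_min):
    """Greedy toy clustering (hypothetical thresholds, not the published ones).

    Oligos seen at least `core_min` times seed a cloud, most abundant first.
    Each seed absorbs unassigned one-substitution neighbours that were seen
    at least `outer_min` times.
    """
    assigned = {}  # oligo -> index of the cloud it belongs to
    clouds = []    # list of sets of oligos
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for oligo, n in ordered:
        if n < core_min:
            break  # remaining oligos are too rare to seed a cloud
        if oligo in assigned:
            continue
        cloud_id = len(clouds)
        members = {oligo}
        assigned[oligo] = cloud_id
        for nb in neighbors(oligo):
            if nb not in assigned and counts.get(nb, 0) >= outer_min:
                members.add(nb)
                assigned[nb] = cloud_id
        clouds.append(members)
    return clouds, assigned


def annotate(seq, k, assigned):
    """Return a per-base mask: True where a cloud oligo covers the base."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in assigned:
            for j in range(i, i + k):
                mask[j] = True
    return mask


if __name__ == "__main__":
    # Tiny made-up sequence: a repeated motif plus unrelated flanks.
    genome = "TTGACC" + "ACGTTGCA" * 6 + "GGATCC" + "ACGTTGCT" + "CCATGG"
    k = 8
    counts = count_oligos(genome, k)
    clouds, assigned = build_clouds(counts, core_min=3, outer_min=1)
    mask = annotate(genome, k, assigned)
    print(len(clouds), "clouds;", sum(mask), "of", len(genome), "bases masked")
```

Because the sketch works directly from oligo counts rather than from alignment to a consensus, it can flag short or diverged copies, which is also why short hits need the kind of probabilistic treatment the abstract mentions.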
Published in:
Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLoS Genet 7(12): e1002384. doi:10.1371/journal.pgen.1002384
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pgen.1002384
Tags
Genetics, Reproductive medicine
Article published in
PLOS Genetics
2011, Issue 12