Repetitive Elements May Comprise Over Two-Thirds of the Human Genome

Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has previously been identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo “clouds”). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, suggesting that 66%–69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect fragments of different sizes from the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Because short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed “element-specific” P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using them we identified ∼100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
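
The oligo “cloud” idea described in the abstract can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the published P-clouds implementation: the function names, the one-mismatch neighborhood, the single round of expansion, and the two count thresholds (`core_min`, `outer_min`) are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer (oligonucleotide of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(kmer, alphabet="ACGT"):
    """Yield all k-mers at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def build_cloud_set(counts, core_min, outer_min):
    """Return the set of k-mers that belong to any cloud.

    Assumption: a "core" k-mer occurs at least core_min times, and a
    one-mismatch neighbor of a core joins the cloud if it occurs at
    least outer_min times. Only one round of expansion is performed.
    """
    cloud = {kmer for kmer, n in counts.items() if n >= core_min}
    for core in list(cloud):  # snapshot, so only original cores are expanded
        for nb in neighbors(core):
            if counts.get(nb, 0) >= outer_min:
                cloud.add(nb)
    return cloud


def mask_repeats(seq, k, cloud):
    """Mark each base that is covered by at least one cloud k-mer."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in cloud:
            for j in range(i, i + k):
                covered[j] = True
    return covered


if __name__ == "__main__":
    genome = "ACGTACGTACGTACGAACGTTTGCA"  # made-up example sequence
    k = 4
    counts = count_kmers(genome, k)
    cloud = build_cloud_set(counts, core_min=3, outer_min=1)
    covered = mask_repeats(genome, k, cloud)
    print(f"{sum(covered)} of {len(genome)} bases flagged as repeat-like")
```

Using a lower threshold for neighbors than for cores is what lets diverged, lower-copy relatives of an abundant oligo be picked up. That is the intuition behind the abstract's sensitivity claim, and it is also why short matches carry a higher risk of being false positives.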

Published in: Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLoS Genet 7(12): e1002384. doi:10.1371/journal.pgen.1002384
Category: Research Article
DOI: https://doi.org/10.1371/journal.pgen.1002384
