
gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output


Authors: Juhana I. Kammonen aff001; Olli-Pekka Smolander aff001; Lars Paulin aff001; Pedro A. B. Pereira aff001; Pia Laine aff001; Patrik Koskinen aff001; Jukka Jernvall aff003; Petri Auvinen aff001
Author affiliations: DNA Sequencing and Genomics Laboratory, Institute of Biotechnology, University of Helsinki, Helsinki, Finland aff001; Department of Neurology, Helsinki University Hospital, Helsinki, Finland aff002; Evolutionary Phenomics Group, Institute of Biotechnology, University of Helsinki, Helsinki, Finland aff003
Published in: PLoS ONE 14(9)
Category: Research Article
DOI: https://doi.org/10.1371/journal.pone.0216885

Abstract

Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial, and while many computational tools partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps created during scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.
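
To make the notion of a gap concrete, the following minimal Python sketch counts gaps in scaffold sequences. It is not part of gapFinisher and does not reproduce its method. It assumes that gaps appear as runs of the character N in FASTA-formatted scaffolds, which is a common convention but an assumption here, and the function names `parse_fasta`, `find_gaps` and `gap_summary` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumption: a gap is any run of one or more N characters (either case).
GAP_PATTERN = re.compile(r"[Nn]+")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into a mapping of record name to sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return (start, end) positions of each gap, 0-based, end exclusive."""
    return [(m.start(), m.end()) for m in GAP_PATTERN.finditer(sequence)]


def gap_summary(records: Dict[str, str]) -> Tuple[int, int]:
    """Return the number of gaps and the total number of gap bases."""
    gaps = [gap for seq in records.values() for gap in find_gaps(seq)]
    return len(gaps), sum(end - start for start, end in gaps)


# Toy input: scaffold1 contains two gaps, scaffold2 contains none.
example = [">scaffold1", "ACGTNNNNNACGT", "ACNNGT", ">scaffold2", "ACGTACGT"]
records = parse_fasta(example)
gap_count, gap_bases = gap_summary(records)
print(f"gaps: {gap_count}, gap bases: {gap_bases}")
```

Running the same summary on a scaffold file before and after gap filling gives a quick view of how many gaps were closed, although it says nothing about whether the filled sequence is correct.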

Keywords:

Biology and life sciences – Genetics – Genomics – Genome analysis – Organisms – Eukaryota – Computational biology – Research and analysis methods – Sequence assembly tools – Database and informatics methods – Bioinformatics – Sequence analysis – Sequence alignment – Animals – Microbiology – Vertebrates – Amniotes – Mammals – Genomic libraries – Bacteriology – Microbial genomics – BLAST algorithm – Computational techniques – Computational pipelines – Bacterial genetics – Bacterial genomics – Microbial genetics – Genomics statistics




This article was published in PLOS One, 2019, Issue 9.