gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output
Authors:
Juhana I. Kammonen aff001; Olli-Pekka Smolander aff001; Lars Paulin aff001; Pedro A. B. Pereira aff001; Pia Laine aff001; Patrik Koskinen aff001; Jukka Jernvall aff003; Petri Auvinen aff001
Author affiliations:
aff001: DNA Sequencing and Genomics Laboratory, Institute of Biotechnology, University of Helsinki, Helsinki, Finland
aff002: Department of Neurology, Helsinki University Hospital, Helsinki, Finland
aff003: Evolutionary Phenomics Group, Institute of Biotechnology, University of Helsinki, Helsinki, Finland
Published in:
PLoS ONE 14(9)
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pone.0216885
Abstract
Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial, and while many computational tools partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps created during scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.
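To make the notion of a gap concrete, the following minimal Python sketch counts gaps in scaffold sequences, assuming (as is conventional for scaffolder output) that a gap is written as a run of N characters in a FASTA file. It is not part of gapFinisher; the function names read_fasta, find_gaps and gap_summary and the example sequences are hypothetical and only illustrate how the number of gaps and gap bases could be compared before and after gap filling.

```python
import re
from typing import Dict, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return (start, length) for each run of N or n; start is 0-based."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]


def gap_summary(fasta_text: str) -> Dict[str, int]:
    """Count gaps and gap bases over all sequences in the FASTA text."""
    gaps = 0
    gap_bases = 0
    for _name, seq in read_fasta(fasta_text):
        runs = find_gaps(seq)
        gaps += len(runs)
        gap_bases += sum(length for _start, length in runs)
    return {"gaps": gaps, "gap_bases": gap_bases}


# Hypothetical toy input: scaffold1 has two gaps, scaffold2 has none.
example = ">scaffold1\nACGTNNNNNACGT\nNNACGT\n>scaffold2\nACGTACGT\n"
print(gap_summary(example))
```

Running the same summary on a draft genome before and after gap filling gives a rough measure of how many gaps were closed, although it says nothing about whether the filled sequence is correct.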
Keywords:
Biology and life sciences – Genetics – Genomics – Genome analysis – Organisms – Eukaryota – Computational biology – Research and analysis methods – Sequence assembly tools – Database and informatics methods – Bioinformatics – Sequence analysis – Sequence alignment – Animals – Microbiology – Vertebrates – Amniotes – Mammals – Genomic libraries – Bacteriology – Microbial genomics – BLAST algorithm – Computational techniques – Computational pipelines – Bacterial genetics – Bacterial genomics – Microbial genetics – Genomics statistics
Article published in the journal
PLOS One
2019, Issue 9