Samtools get consensus sequences

9/9/2023

Long reads from nanopore sequencing platforms such as Oxford Nanopore Technologies (ONT) are widely used in the study of bacterial genomes. Compared with short reads from next-generation sequencing (NGS), long reads can span larger genomic repeats and complex genomic structures, thus facilitating downstream genome assembly and analysis. Many software tools and algorithms have been developed for bacterial genome assembly, such as Canu, Flye, and Wtdbg2. They have relative advantages and disadvantages as well as varying performance and assembly outcomes, but in terms of overall performance, Flye and Raven stand out as the best bacterial genome assemblers. Nanopore sequencing data are characterized by indels and non-random systematic errors, and the resulting assembly errors can span hundreds of bases, which may lead to inaccurate or incomplete assemblies.

Hybrid assembly, which uses both short and long reads from next- and third-generation sequencing platforms, is gaining popularity. Alternatively, the assemblies can be corrected using nanopore sequencing data and then polished with NGS data. Both approaches can mitigate some of these problems and improve the accuracy of the assemblies, but assembly errors cannot be completely avoided. Therefore, assembly, especially of bacterial genomes, is far from perfect, and there are many details to consider and substantial room for improvement. Since genome assembly is often the starting point of bioinformatics analysis in de novo sequencing of bacterial genomes, assembly errors may have critical implications for downstream analysis. Therefore, we developed MAECI, a pipeline for assembling nanopore long-read sequencing data of bacterial genomes. It takes nanopore sequencing data as input, uses multiple assembly algorithms to generate a single consensus sequence, and then uses the nanopore sequencing data to perform self-error correction. MAECI takes advantage of the fact that different assembly algorithms produce different assembly errors for the same data, and corrects them by self-correction to produce a single consensus sequence that has fewer assembly errors and is more accurate than any of the inputs.

The assembly results from the various software and/or algorithms were then sorted according to the number of scaffolds in descending order. Then, an error correction algorithm (by default Racon, version v1.4.20) was used for "2+3" rounds of self-correction. For example, if the number of scaffolds from Canu, Wtdbg2, and Flye was 3, 2, and 1, respectively, the sequencing data were aligned against the assembly from Canu in the first round of correction, with the assembly from Wtdbg2 used as the target genome to generate consensus sequence 1. In the second round of error correction, the sequencing data were aligned against consensus sequence 1, and the assembly from Flye was used as the target genome to generate consensus sequence 2. Finally, the sequencing data and consensus sequence 2 were used for three rounds of error correction to obtain the final consensus sequence. By default, BWA (version: 0.7.17-r1188) and Minimap2 (version: 2.21-r1071) were installed for alignment, while Sambamba (version: 0.8.0) and Samtools (version: 1.12) were installed for alignment processing. Furthermore, if polishing with NGS data was required, a polishing tool (by default Pilon, version 1.24) was used to polish the consensus sequence with default parameters. NanoPlot (version: 1.38.0) was used for quality control and QUAST (version: 5.0.2) was used for evaluation of the sequencing data and assemblies, with default parameters.

We downloaded the real-world sequencing data published by Lang et al. The Lang et al. data included both nanopore sequencing data and NGS sequencing data of Alcaligenes faecalis. For the Wick et al. data, the rapid sequencing and assembly results in full assemblies of three strains were selected for analysis. The gz file was not found for the Serratia marcescens 1 dataset. For Citrobacter koseri MINF 9D, Enterobacter kobei MSB1 1B, and Haemophilus M1C132 1, no suitable reference genomes were available in NCBI. Thus, these three strains were excluded from analysis and comparison. As the raw ONT data were not found on the above-mentioned website, we downloaded the genome sequences of Acinetobacter baumannii strain K09-14 (NZ_CP043953.1), Klebsiella oxytoca strain FDAARGOS_335 (NZ_CP027426.1), and Klebsiella variicola strain FH-1 (NZ_CP054254.1) and used them as reference genomes. ONT data simulation was performed using NanoSim-H software.
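The scaffold-based ordering and the first two correction rounds described in the text can be sketched in a few lines of Python. This is only an illustrative sketch, not MAECI's actual code: the function names `order_by_scaffolds` and `plan_consensus_rounds` and the `consensus_N` labels are hypothetical, and the scaffold counts are the ones from the worked example (Canu 3, Wtdbg2 2, Flye 1). It only plans which sequence the reads are aligned against and which assembly is the correction target. It does not run Racon or any aligner.

```python
from typing import Dict, List, Tuple


def order_by_scaffolds(scaffold_counts: Dict[str, int]) -> List[str]:
    """Return assembler names sorted by scaffold count, highest first."""
    return sorted(
        scaffold_counts,
        key=lambda name: scaffold_counts[name],
        reverse=True,
    )


def plan_consensus_rounds(ordered: List[str]) -> List[Tuple[str, str, str]]:
    """Plan the rounds that fold the assemblies into one consensus.

    Each tuple is (sequence the reads are aligned against,
    assembly used as target genome, name of the output).
    The names "consensus_1", "consensus_2", ... are hypothetical labels.
    """
    rounds = []
    reference = ordered[0]  # start from the most fragmented assembly
    for i, target in enumerate(ordered[1:], start=1):
        output = f"consensus_{i}"
        rounds.append((reference, target, output))
        reference = output  # the next round aligns against this output
    return rounds


# Scaffold counts taken from the worked example in the text.
counts = {"Canu": 3, "Wtdbg2": 2, "Flye": 1}
ordered = order_by_scaffolds(counts)
for reference, target, output in plan_consensus_rounds(ordered):
    print(f"align reads to {reference}, correct {target} -> {output}")
```

With three assemblies this gives two rounds, which matches the "2" in the "2+3" scheme. The remaining three rounds, which correct consensus sequence 2 against the sequencing data, are left out of the sketch.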
