Submission Details
Title
A User-Friendly Nextflow Pipeline for Mycobacterium tuberculosis Complex
Introduction
The Mycobacterium tuberculosis sequencing (MTBseq) pipeline was created to address the challenges of bioinformatics analysis in tuberculosis research using whole-genome sequencing data, and to facilitate reproducibility. It is the first and, to date, the only publicly available pipeline to perform a full analysis, from quality control through mapping and variant calling to lineage classification, drug resistance prediction and phylogenetic inference. Because MTBseq's default batch mode of analysis is not optimal for high-performance computing (HPC) or cloud environments, it needs to be optimized to use all available resources when analysing large datasets.
Objective(s)
To optimize MTBseq for parallel computation and user-friendliness using the Nextflow DSL workflow language.
Materials and Methods
For the implementation, we relied on the modular nature of the MTBseq TBfull analysis, which by default analyses all input raw FASTQ files in a linear manner. In the Nextflow wrapper we added a separate parallel analysis mode that calls the individual analysis steps available within the MTBseq tool, such as TBbwa and TBvariants. As a proof of concept, we used 71 M. tuberculosis genomes (NCBI BioProject accessions PRJNA494931 and PRJNA630228) for the benchmarking analysis in a server environment (16 vCPUs and 40 GB RAM).
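The difference between the linear batch mode and the per-sample parallel mode described above can be sketched in Python. This is only an illustrative sketch, not the MTBseq-nf implementation: `run_step`, `process_sample`, `batch_mode` and `parallel_mode` are hypothetical names, the step list contains only the two steps named in the text, and the placeholder function returns a label instead of launching MTBseq.

```python
from concurrent.futures import ThreadPoolExecutor

# Only the two per-sample steps named in the text; the real tool has more.
PER_SAMPLE_STEPS = ["TBbwa", "TBvariants"]


def run_step(step, sample):
    # Hypothetical placeholder: a real wrapper would launch the MTBseq step
    # for this one sample here. We only return a label for illustration.
    return f"{sample}:{step}"


def process_sample(sample):
    # Steps for a single sample stay sequential, since each depends on the previous one.
    return [run_step(step, sample) for step in PER_SAMPLE_STEPS]


def batch_mode(samples):
    # Linear mode: samples are processed one after another.
    return [process_sample(sample) for sample in samples]


def parallel_mode(samples, max_workers=16):
    # Parallel mode: independent samples are processed concurrently,
    # with up to max_workers samples in flight (16 mirrors the 16 vCPUs used).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, samples))


# 71 hypothetical sample names, matching the size of the benchmark dataset.
samples = [f"sample_{i:02d}" for i in range(1, 72)]
results = parallel_mode(samples)
```

Both modes produce the same per-sample results in the same order; only the scheduling differs, which is why the parallel mode can make use of additional cores when they are available.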
Results and Conclusion
We optimized the MTBseq software by creating a wrapper in the Nextflow language (MTBseq-nf) which (i) automatically sets up the conda environments and pulls the necessary Docker containers; (ii) adds a new parallel mode of execution on top of the base MTBseq tool, so that the analysis scales with dataset size and available hardware; and (iii) makes it easy to add new functionality to the pipeline, such as a custom MultiQC report. The parallel analysis mode of MTBseq-nf (11h 1m 52s) was approximately twice as fast as the batch mode (22h 22m 20s). MTBseq-nf facilitates reproducibility by using the conda package manager for platform independence and Docker containers, which enable pipeline execution in a cloud context. In summary, we propose MTBseq-nf as a user-friendly alternative to the original MTBseq, optimized for efficient hardware usage, scalability to larger datasets and improved reproducibility. The pipeline can be used to analyze large datasets in HPC or cloud computing contexts, the optimal infrastructure for big data and genomic surveillance.
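The speed-up implied by the two reported wall-clock times can be checked with a short calculation. The sketch below assumes the durations are formatted exactly as reported in the text; `to_seconds` is a hypothetical helper introduced only for this example.

```python
import re


def to_seconds(duration):
    # Parse a duration written like "22h 22m 20s" into total seconds.
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


batch = to_seconds("22h 22m 20s")     # batch mode, as reported
parallel = to_seconds("11h 1m 52s")   # parallel mode, as reported

speedup = batch / parallel
saved = batch - parallel

print(f"batch: {batch} s, parallel: {parallel} s")
print(f"speed-up: {speedup:.2f}x, time saved: {saved} s")
```

With the reported values this gives 80540 s against 39712 s, a speed-up of about 2.03x and a saving of 40828 s (11h 20m 28s).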
Keywords
Mycobacterium tuberculosis complex; whole-genome sequencing; Pipeline; Nextflow
Area
Axis 02 | Technology and Innovation in Health
Category
I do NOT wish to compete for the Young Researcher Award
Authors
Davi Josué Marcon, Johannes Loubser, Maria Cristina da Silva Lourenço, Robin Warren, Karla Valeria Batista Lima, Emilyn Costa Conceição, Abhinav Sharma