Large-scale WGS Analysis on the supercomputer Fugaku

Authors: Ito, Satoshi and Ono, Kenji and Miyano, Satoru

Conference: Proceedings of the 2024 14th International Conference on Bioscience, Biochemistry and Bioinformatics

DOI: 10.1145/3640900.3640903

URL: https://doi.org/10.1145/3640900.3640903

Peer Review: Refereed

Abstract

The remarkable improvement in the performance of next-generation sequencers has, in recent years, made it possible to analyze large numbers of samples in whole-genome studies. The computing capacity required for such analyses has also increased dramatically, and supercomputers and cloud services have become essential. However, few analyses have used TOP500-class large-scale parallel computers, because of differences between the computational characteristics of existing supercomputer software and those of bioinformatics software, and because of the associated supercomputer resource management methods. In this study, we carefully investigated the differences between bioinformatics pipelines and legacy supercomputer applications and found that the job management systems and distributed file systems of TOP500-class supercomputers pose problems for bioinformatics pipelines. Based on these results, we developed easy and efficient methods to overcome these problems, and then ported Genomon, a whole-genome analysis pipeline, to Fugaku and optimized it. We performed large-scale whole-genome sequencing (WGS) analysis on Fugaku and successfully analyzed more than 1,000 samples using 1,460 thousand node-hours, which is currently the largest such analysis in the world. Availability: The ported and optimized Genomon is available at https://bitbucket.org/genomon_wg/genomonpipeline/src/GPF_develop/