Large-scale WGS Analysis on the supercomputer Fugaku
Abstract
Remarkable improvements in the performance of next-generation sequencers have, in recent years, made it possible to analyze large numbers of samples in whole-genome studies. The computing capacity required for such analyses has also increased dramatically, and supercomputers and cloud services have become essential. However, few analyses have used TOP500-class large-scale parallel computers, because of differences in computational properties between existing supercomputer software and bioinformatics software, and in the associated supercomputer resource management methods. In this study, we carefully investigated the differences between bioinformatics pipelines and legacy supercomputer applications and found that the job management systems and distributed file systems of TOP500-class supercomputers pose problems for bioinformatics pipelines. Based on these results, we developed simple and efficient methods to overcome these problems, and then ported Genomon, a whole-genome analysis pipeline, to Fugaku and optimized it. We performed large-scale WGS analysis on Fugaku and successfully analyzed over 1,000 samples using 1,460 thousand node-hours, which is currently the largest such analysis in the world.
Availability: The ported and optimized Genomon is available at https://bitbucket.org/genomon_wg/genomonpipeline/src/GPF_develop/