This tutorial is done on Puhti, which requires that you have a CSC user account and that your account belongs to a project with access to Puhti.
💬 Snakemake is a popular scientific workflow manager, especially within the bioinformatics community. It enables scalable and reproducible scientific pipelines by chaining a series of rules in a fully specified software environment. Snakemake is available as a pre-installed module on Puhti.
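To make the idea of chaining rules concrete, here is a minimal, purely illustrative Snakefile written as a shell here-document; the file names and rule names are invented for this sketch and are not part of the tutorial material downloaded below.

# Illustrative sketch only: a tiny Snakefile with two chained rules.
# The file names (hello.txt, hello_upper.txt) are hypothetical.
cat > Snakefile.example <<'EOF'
rule all:
    input:
        "hello_upper.txt"

rule write_greeting:
    output:
        "hello.txt"
    shell:
        "echo hello > {output}"

rule uppercase:
    input:
        "hello.txt"
    output:
        "hello_upper.txt"
    shell:
        "tr a-z A-Z < {input} > {output}"
EOF
# Run it locally with a single core (no batch scheduler involved):
# snakemake -s Snakefile.example --cores 1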
💬 HPC-friendly containers like Singularity/Apptainer can be used as an alternative to native or Tykky-based installations for better portability and reproducibility. If you don't have a ready-made container image for your needs, you can build a Singularity/Apptainer image on Puhti using the --fakeroot option.
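As a rough sketch, building such an image could look like the following; the definition file name, image name, and Docker Hub base image are illustrative assumptions, not part of the tutorial material.

# Illustrative sketch: build an Apptainer image on Puhti with --fakeroot.
# The definition file and the base image below are assumptions.
cat > snakemake.def <<'EOF'
Bootstrap: docker
From: snakemake/snakemake:latest
EOF
apptainer build --fakeroot snakemake.sif snakemake.def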
☝🏻 For the purpose of this tutorial, a pre-built container image is provided later to run Snakemake workflows at scale.
‼️ If a workflow manager uses sbatch (or srun) for each process execution (i.e., a rule in Snakemake terminology), and the workflow has many short processes, it is advisable to use the HyperQueue executor to improve throughput and decrease the load on the Slurm batch job scheduler.
Load the HyperQueue and Snakemake modules on Puhti:
module load hyperqueue/0.16.0
module load snakemake/8.4.6
‼️ Note! If you are planning to use Snakemake on the LUMI supercomputer, you can use the CSC module installations as below:
module use /appl/local/csc/modulefiles/
module load hyperqueue/0.18.0
module load snakemake/8.4.6
The HyperQueue executor settings for a Snakemake workflow depend on the Snakemake version, as shown below:
# snakemake version 7.x.x
snakemake --cluster "hq submit ..."
# snakemake version 8.x.x
snakemake --executor cluster-generic --cluster-generic-submit-cmd "hq submit ..."
Create and enter a suitable scratch directory on Puhti (replace <project> with your CSC project, e.g. project_2001234):
mkdir -p /scratch/<project>/$USER/snakemake-ht
cd /scratch/<project>/$USER/snakemake-ht
Download the tutorial material, which has been adapted from the official Snakemake documentation, from Allas:
wget https://a3s.fi/snakemake_scale/snakemake_scaling.tar.gz
tar -xavf snakemake_scaling.tar.gz
The downloaded material includes the scripts and data needed to run a Snakemake pipeline. You can use the batch script snakemake_hq_puhti.sh, the contents of which are shown below:
#!/bin/bash
#SBATCH --job-name=snakemake
#SBATCH --account=<project> # replace <project> with your CSC project, e.g. project_2001234
#SBATCH --partition=small
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --mem-per-cpu=2G
module load hyperqueue/0.16.0
module load snakemake/8.4.6
# Specify a location for the HyperQueue server
export HQ_SERVER_DIR=${PWD}/hq-server-${SLURM_JOB_ID}
mkdir -p "${HQ_SERVER_DIR}"
# Start the server in the background (&) and wait until it has started
hq server start &
until hq job list &>/dev/null ; do sleep 1 ; done
# Start the workers in the background and wait for them to start
srun --exact --cpu-bind=none --mpi=none hq worker start --cpus=${SLURM_CPUS_PER_TASK} &
hq worker wait "${SLURM_NTASKS}"
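# Run the Snakemake workflow; each rule is submitted to HyperQueue as a job requesting 5 CPU cores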
snakemake -s Snakefile --jobs 1 --use-singularity --executor cluster-generic --cluster-generic-submit-cmd "hq submit --cpus 5"
# For Snakemake versions 7.x.x, use command:
# snakemake -s Snakefile --jobs 1 --use-singularity --cluster "hq submit --cpus 5"
# Wait for all jobs to finish, then shut down the workers and server
hq job wait all
hq worker stop all
hq server stop
☝🏻 The default script provided above is not optimized for throughput, as the Snakemake workflow manager just submits one job at a time to the HyperQueue meta-scheduler.
You can run multiple workflow tasks (i.e., rules) concurrently by submitting more jobs with the snakemake command:
snakemake -s Snakefile --jobs 8 --use-singularity --executor cluster-generic --cluster-generic-submit-cmd "hq submit --cpus 5"
Apply this modification to the snakemake_hq_puhti.sh batch script (and use your own project number) before submitting the Snakemake workflow job with:
sbatch snakemake_hq_puhti.sh
☝🏻 Note that just increasing the value of --jobs will not automatically make all those jobs run at the same time. This option of the snakemake command is only an upper limit on the number of concurrent jobs; jobs will run as resources become available. In this case, we run 8 concurrent jobs, each using 5 CPU cores, to match the 40 CPU cores (one Puhti node) reserved in the batch script. In practice, it is also a good idea to dedicate a few cores to the workflow manager itself, as in the sketch below.
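One possible layout (a sketch, not part of the tutorial material): with 40 reserved cores, allow at most 7 concurrent rules of 5 cores each (35 cores in total), leaving a few cores free for the Snakemake process and the HyperQueue server and worker.

# Sketch: 7 jobs x 5 cores = 35 cores for tasks, leaving ~5 cores of headroom
# for Snakemake itself and the HyperQueue server/worker.
snakemake -s Snakefile --jobs 7 --use-singularity --executor cluster-generic --cluster-generic-submit-cmd "hq submit --cpus 5"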
💡 It is also possible to use more than one node to achieve even higher throughput, as HyperQueue can make use of multi-node resource allocations; see the sketch below. Just remember that with HyperQueue the workflow tasks themselves should be sub-node (use one node at most), as MPI tasks are poorly supported.
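For example, the resource reservation in snakemake_hq_puhti.sh could be extended to two nodes roughly as follows (a sketch only); the srun line in the script then starts one HyperQueue worker per node, and --jobs should be scaled to the total number of reserved cores.

# Sketch: request two full Puhti nodes (80 cores in total)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
# The rest of the script stays the same: srun now launches one HyperQueue
# worker per node, and e.g. --jobs 16 with "hq submit --cpus 5" can fill both nodes.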
💡 You can already check the progress of your workflow by simply observing the current working directory, where lots of new task-specific folders are being created. However, there are also formal ways to check the progress of your jobs, as shown below.
Monitor the status of the submitted Slurm job:
squeue -j <slurmjobid>
# or
squeue --me
# or
squeue -u $USER
Monitor the progress of the individual sub-tasks using HyperQueue commands:
module load hyperqueue
export HQ_SERVER_DIR=$PWD/hq-server-<slurmjobid>
hq worker list
hq job list
hq job info <hqjobid>
hq job progress <hqjobid>
hq task list <hqjobid>
hq task info <hqjobid> <hqtaskid>
📝 HyperQueue creates task-specific folders (job-<n>) in the same directory from which you submitted the batch script. These are sometimes useful for debugging. However, if your code is working fine, the creation of many folders may be annoying, besides causing some load on the Lustre parallel file system. You can prevent the creation of such task-specific folders by setting the stdout and stderr HyperQueue flags to none, as shown below:
snakemake -s Snakefile -j 24 --use-singularity --executor cluster-generic --cluster-generic-submit-cmd "hq submit --stdout=none --stderr=none --cpus 5"