Run DeepVariant method to perform variant calling on WGS and WES data sets in Puhti supercomputing environment using Apptainer container. One needs to prepare DeepVariant Apptainer image, models and test data to run the analysis. Additionally, other input files for running DeepVariant method include 1) A reference genome in FASTA format and its corresponding index file (.fai). 2) An aligned reads file in BAM format and its corresponding index file (.bai). For the sake of this tutorial, test data is provided as a downloadable link in the later sections.
Upon completion of this tutorial you will learn to:
ssh yourcscusername@puhti.csc.fi
cd /scratch/project_xxxx/$USER # replace xxxx with your course (or own) project number
mkdir deepvariant && cd deepvariant
sinteractive -c 2 -m 4G -d 100
You have to choose project number of the course on the command prompt to start an interactive session.
Prepare Apptainer image from docker image for DeepVariant analysis. Here we use DeepVariant Docker image from DockerHub with a specific tag (i.e., 1.2.0). You can explore more about the image on the DokcerHub. It is advisable to use LOCAL_SCRATCH for Apptainer TMPDIR and CACHEDIR. Unsetting XDG_RUNTIME_DIR will silence some unnecessary warnings. We will learn more about these settings later in the course.
export APPTAINER_TMPDIR=$LOCAL_SCRATCH
export APPTAINER_CACHEDIR=$LOCAL_SCRATCH
unset XDG_RUNTIME_DIR
apptainer build deepvariant_cpu_1.2.0.sif docker://google/deepvariant:1.2.0
This image conversion process for DeepVariant takes sometime as it is a bigger image with several layers.
wget https://a3s.fi/containers-workflows/deepvariant_testdata.tar.gz
tar -xavf deepvariant_testdata.tar.gz
Prepare a batch script (e.g., deepvariant_puhti.sh) to run WGS analysis on Puhti. A batch script template with all necessary information is provided below. Please note that this batch script also uses a special CSC-specific singualrity_wrapper command to set appropriate options automatically for running Apptainer. You are free to use plain Apptainer command by taking care of bind mounts appropriately. You are required to use a valid project number in the script before submitting it to Puhti cluster.
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --partition=test # You can also choose partition : "small" for this toy example
#SBATCH --account=project_xxxx
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4000
export SING_IMAGE="$PWD/deepvariant_cpu_1.2.0.sif"
apptainer_wrapper exec \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS --ref=$PWD/testdata/ucsc.hg19.chr20.unittest.fasta \
--reads=$PWD/testdata/NA12878_S1.chr20.10_10p1mb.bam \
--regions "chr20:10,000,000-10,010,000" \
--output_vcf=$PWD/output.vcf.gz \
--output_gvcf=$PWD/output.g.vcf.gz
Submit your job to Puhti cluster
sbatch -J deepvariant deepvariant_puhti.sh
If the analysis is completed successfully (hint: check the status of submitted job using squeue -u $USER
command or using seff <jobid>
), you are able to see the vcf files as output in the current directory.