Running and managing bioinformatics workflows can be challenging because they are usually fragile ecosystems of several software tools and their dependencies. A workflow manager such as Nextflow helps to manage such scientific workflows. Nextflow is a Groovy-based language for expressing an entire workflow in a single script, and it provides several useful features that make workflows easier to work with.
Upon completion of this tutorial, you will be able to learn:
fastqc software
This hands-on tutorial uses the Puhti supercomputer to execute Nextflow scripts as interactive and batch jobs. You therefore need either a training account or a user account at CSC to access Puhti.
Log in to Puhti using the ssh command, then create a work directory named nextflow_tutorial on the scratch drive as shown below:
ssh <your_csc_username>@puhti.csc.fi
mkdir -p /scratch/project_xxxx/$USER/nextflow_tutorial && cd /scratch/project_xxxx/$USER/nextflow_tutorial
Launch an interactive session on Puhti as below:
sinteractive -c 2 -m 4G -d 250 -A project_xxxx # replace actual project number here
module load nextflow/22.10.1
In this Hello-world tutorial, you will learn how to run a Nextflow script as well as understand the default location of resulting output files.
Download the course material from CSC’s Allas object storage as shown below:
wget https://a3s.fi/nextflow/tutorial_demo.tar.gz
tar -xavf tutorial_demo.tar.gz && rm tutorial_demo.tar.gz
After unpacking the tutorial_demo.tar.gz file, you will see a hello_demo folder that contains the hello-world script (ending with .nf) for running this demo. Execute the script by entering the following commands in your interactive Puhti terminal:
cd hello_demo
nextflow run hello-world.nf
This script defines one process named sayHello. This process takes a set of greetings in different languages and writes each one to a separate file.
The resulting terminal output would look similar to the text shown below:
N E X T F L O W ~ version 20.07.1
Launching `hello-world.nf` [cheeky_shaw] - revision: 3ffdbdd5c7
executor > local (5)
[e2/9aa8c8] process > sayHello (5) [100%] 5 of 5 ✔
The hexadecimal number, such as e2/9aa8c8, identifies a unique process execution and is also the prefix of the directory where the sayHello process is executed. You can inspect the files produced by the above script by changing to the $PWD/work directory.
Execute the following command on your terminal:
ls -l work/*/*
You can see that a separate output file is created under each directory.
Hidden files are also present in each process directory; they are very handy when you want to debug a failed process. You can list the hidden files as shown below:
ls -la work/*/*
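The hidden files in a task directory include .command.sh (the script that was run), .command.log (its captured output) and .exitcode (the exit status). As a minimal sketch, the hypothetical helper below (show_task_status is not part of Nextflow) prints the recorded exit code for each task directory, assuming the usual work/<prefix>/<hash>/ layout:

```shell
# Hypothetical helper: print the exit code recorded for every task directory
# under a Nextflow work directory (default: ./work).
# Assumes the usual layout work/<2-char prefix>/<hash>/ and the hidden
# .exitcode file that Nextflow writes when a task finishes.
show_task_status() {
    workdir="${1:-work}"
    for task in "$workdir"/*/*/; do
        # Skip the unexpanded pattern when no task directory exists
        [ -d "$task" ] || continue
        if [ -f "${task}.exitcode" ]; then
            code=$(cat "${task}.exitcode")
        else
            code="unknown"
        fi
        printf '%s exit=%s\n' "${task%/}" "$code"
    done
}
```

Calling show_task_status work from the hello_demo folder lists one line per task; a non-zero exit code points to the directory whose .command.log is worth reading first.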
fastqc software
In this example, let’s work through a real-world case involving samples from sequencing experiments. We specifically learn:
The fastqc_demo folder has the necessary files for running this tutorial. You can run the Nextflow script for fastqc analysis in your interactive terminal by issuing the following commands:
cd fastqc_demo
nextflow run fastqc.nf
Tip: check the $PWD/work directory as shown in the previous example.
Nextflow parameters are declared inside a script by prepending the prefix params to a variable name, separated by a dot character (e.g., params.reads). Parameter values specified in the script are used by default. A parameter can also be specified on the command line by prefixing the parameter name with a double dash (e.g., --reads).
Here is an example to declare parameters (here, input files) to fastqc software inside Nextflow script (NOT for running the command on terminal):
params.reads = "$baseDir/data/*_{1,2}.fq.gz"
input_ch = Channel.fromFilePairs(params.reads)
One can also override parameter values (here, the files inside the $baseDir/data/ directory) in a Nextflow script by passing the parameters on the command line when executing the script, as shown below (the pattern is quoted so that the shell passes it to Nextflow unexpanded):
nextflow run fastqc.nf --reads 'data2/*_{1,2}_subset.fq.gz'
Please note that the data2 folder has different samples (i.e., lymphnode4a samples) from the ones (i.e., lung3e samples) in the data folder, which would have been used by default. You can see that the fastqc analysis has now been performed on a new set of samples, as shown below:
ls -l $PWD/work/*/*
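When a parameter value contains wildcards or braces, it is generally safer to quote it, because the shell may otherwise expand the pattern before Nextflow sees it. The sketch below illustrates this with a hypothetical helper, count_args, which simply reports how many arguments it received:

```shell
# Hypothetical helper: report how many arguments the shell actually passed.
count_args() {
    echo "$#"
}

# Quoted: the pattern reaches the command as a single argument.
count_args 'data2/*_{1,2}_subset.fq.gz'

# Unquoted: the shell may expand braces and wildcards first, so the
# number of arguments depends on the shell and on the files present.
count_args data2/*_{1,2}_subset.fq.gz
```

With the quoted form, Channel.fromFilePairs receives the whole pattern and can pair the reads itself.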
NB: a single dash (-) marks core Nextflow options (e.g., -resume). A double dash (--) marks user-defined, freely extensible parameters, which are used to populate params.
Checking resulting files from a workflow analysis as shown above is quite tedious especially when there are several processes inside a workflow. Nextflow provides an easy way to collect resulting files to a convenient place using a special directive, publishDir.
Open fastqc.nf script in any text editor and uncomment (= remove double slashes) the following line:
// publishDir params.outdir
and then run the pipeline again. This time, let’s use the -resume flag: the quality control analysis does not need to be repeated, so the actual analysis is skipped because Nextflow can track cached results from the previous run.
nextflow run fastqc.nf -resume
Once the script is run successfully, you can check the files:
ls -l results/
With the -resume flag, the resulting files from the previous analysis are simply copied to the results folder.
Channels and operators are core features of Nextflow. Please read about the different ways of creating channels and the operators that manipulate channel contents. Channels support different data types such as file, val and set.
Here are a few examples of how one can create channels in a Nextflow script:
Channel.create(); Channel.empty(); Channel.fromPath()
These default semantics can be changed using the channel operators that Nextflow provides, some of which are shown below:
split merge view
filter map/reduce group
In this tutorial, you will learn nextflow script that uses
Containerised applications are highly portable and reproducible, which makes them well suited to scientific work. Fortunately, Nextflow integrates smoothly with popular container technologies (e.g., Docker and Singularity) to provide a lightweight virtualisation layer for running software applications. You can either create your own Docker/Singularity image or download an existing one from a container registry. Please note that you can only work with Singularity containers on Puhti, as Docker containers require privileged access, which CSC users do not have on Puhti.
When working with Nextflow scripts using containers, pay attention to the following things:
Let’s download the material needed for this tutorial from GitHub as shown below:
cd /scratch/project_xxxx/$USER/nextflow_tutorial
git clone https://github.com/yetulaxman/nf_coverage_demo.git
cd nf_coverage_demo
git clone https://github.com/iarcbioinfo/data_test
Here is a simple example syntax (for an alternative approach, see profiles section below) to use docker/singularity containers:
## For Docker
nextflow run <nextflow_script> -with-docker <image_path> # e.g., image_path = biocontainers/fastqc:v0.11.9_cv7
## For Singularity
nextflow run <nextflow_script> -with-singularity <image_path> # e.g., image_path = docker://biocontainers/fastqc:v0.11.9_cv7
Because of the way Nextflow works with containers, you don’t need to have the software (e.g., fastqc) installed on your machine. Nextflow downloads the container image and uses fastqc from it.
We often need to set other attributes besides a container flag as mentioned above. This is accomplished using profiles. A profile is a set of configuration attributes that can be activated when launching a pipeline execution. When a workflow script is launched, Nextflow first looks for a file named nextflow.config in the current directory and in the workflow (or script base) directory (if different from the current directory). Finally, it checks for the file $HOME/.nextflow/config. Configuration files can contain the definition of one or more profiles.
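To see which of these configuration files actually exist before launching a run, something like the following sketch can help. The helper name list_nextflow_configs is hypothetical; it only checks the three locations named above and does not say anything about how Nextflow merges them:

```shell
# Hypothetical helper: list the Nextflow configuration files that exist
# in the locations described above.
# $1 = directory containing the workflow script (default: current directory)
list_nextflow_configs() {
    script_dir="${1:-$PWD}"
    # 1) nextflow.config in the current directory
    [ -f "$PWD/nextflow.config" ] && echo "$PWD/nextflow.config"
    # 2) nextflow.config in the workflow directory, if it is a different directory
    if [ "$script_dir" != "$PWD" ] && [ -f "$script_dir/nextflow.config" ]; then
        echo "$script_dir/nextflow.config"
    fi
    # 3) the user-level configuration file
    [ -f "$HOME/.nextflow/config" ] && echo "$HOME/.nextflow/config"
    return 0
}
```

Running list_nextflow_configs without arguments in the folder where you launch the pipeline prints one path per configuration file found.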
Example profiles are shown below:
profiles {
    docker {
        docker.enabled = true
        process.container = 'iarcbioinfo/nf_coverage_demo:v2.3'
        pullTimeout = "200 min"
    }
    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
        process.container = 'shub://IARCbioinfo/nf_coverage_demo:v2.3'
        pullTimeout = "200 min"
    }
}
Copy the above script and paste it into the nextflow.config file located in the current directory.
You can then launch the nf_coverage workflow (from the nf_coverage_demo folder) with the defined profiles as shown below:
nextflow run plot_coverage.nf \
-profile singularity \
--bam_folder data_test/BAM/BAM_multiple/ \
--bed data_test/BED/TP53_exon2_11.bed
Nextflow provides options for reporting on and visualising your pipeline using the following Nextflow flags:
-with-dag
-with-timeline
-with-report
-with-trace
You can either use the flags in commandline or add each feature to config file as discussed below:
dag
Either use the following flag (-with-dag) when launching script as below:
nextflow run <nextflow_script> -with-dag <file-name>.dot
or add the following block to the end of the nextflow.config file:
dag {
    enabled = true
    file = "dag.png"
}
timeline
Either use the following flag (-with-timeline) when launching script as below:
nextflow run <nextflow_script> -with-timeline <file-name>.html
or add the following block to the end of the nextflow.config file:
timeline {
    enabled = true
}
report
Either use the following flag (-with-report) when launching script as below:
nextflow run <nextflow_script> -with-report <file-name>.html
or add the following block to the end of the nextflow.config file:
report {
    enabled = true
}
trace
Either use the following flag (-with-trace) when launching script as below:
nextflow run <nextflow_script> -with-trace <file-name>.txt
or add the following block to the end of the nextflow.config file:
trace {
    enabled = true
}
For the convenience of this tutorial, configure all reporting and visualisation features (i.e., dag/timeline/report/trace) in the nextflow.config file.
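One possible way to add all four blocks in a single step is sketched below. The helper name append_reporting_config is hypothetical, and the settings are simply the ones shown in the individual examples above:

```shell
# Hypothetical helper: append the dag/timeline/report/trace blocks shown
# above to a Nextflow configuration file (default: ./nextflow.config).
# Note: running it twice appends the blocks twice, so run it only once.
append_reporting_config() {
    config_file="${1:-nextflow.config}"
    cat >> "$config_file" <<'EOF'

dag {
    enabled = true
    file = "dag.png"
}
timeline {
    enabled = true
}
report {
    enabled = true
}
trace {
    enabled = true
}
EOF
}
```

Call append_reporting_config from the nf_coverage_demo folder, then open nextflow.config in a text editor to confirm that the profiles section is still intact above the new blocks.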
Once you have configured profiles for singularity and enabled reporting/visualisation features in nextflow.config file, you can use the following batch script to submit on Puhti:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --partition=test
#SBATCH --account=project_xxxx
export TMPDIR=$PWD
export SINGULARITY_TMPDIR=$PWD
export SINGULARITY_CACHEDIR=$PWD
unset XDG_RUNTIME_DIR
# Activate Nextflow on Puhti
module load nextflow
# Nextflow command here
nextflow run plot_coverage.nf \
-profile singularity \
--bam_folder data_test/BAM/BAM_multiple/ \
--bed data_test/BED/TP53_exon2_11.bed
Copy and paste the above script into a file (nf_coverage.sh), replace the project number in the Slurm directives with the correct one, and finally submit the batch script to the Puhti cluster:
rm -fr work/ # remove previous analysis results
rm -f *.html *.png trace.txt # remove these visualisation files if any
sbatch nf_coverage.sh # start a fresh job
Copy all Nextflow report and visualisation files from the working directory (e.g., the .html, .png, .txt and .pdf files) to your home directory so that you can view them in your local browser.
mkdir -p $HOME/nextflow_output
cp *.html *.png *.txt *.pdf $HOME/nextflow_output
One has to open a port on Puhti login node to access files on your Puhti home directory from your local computer via browser. In this course, every participant should have a unique port number opened on Puhti login node. Open a new terminal on your local machine and replace $port value with some random number (e.g., a number between 5000 and 9000) before executing the following command:
ssh -L $port:localhost:$port <your_csc_username>@puhti.csc.fi # e.g., with port number: 7077
# ssh -L 7077:localhost:7077 <username>@puhti.csc.fi
and then run the following command on the login node (use the same port value that you selected before):
python3 -m http.server $port # with port number: 7077 -> python3 -m http.server 7077
Point the browser on your local machine to http://localhost:$port (remember to replace $port with your port number). You can now view all files available in your Puhti home directory.
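If you prefer not to choose the port number by hand, a small sketch like the one below can pick one for you. The helper name pick_port is hypothetical, the range is the 5000 to 9000 example given above, and the sketch does not check whether the port is already in use by someone else:

```shell
# Hypothetical helper: print a random port number between 5000 and 9000.
pick_port() {
    # rand() returns a value in [0, 1), so the result lies in 5000..9000
    awk 'BEGIN { srand(); print int(5000 + rand() * 4001) }'
}

port=$(pick_port)
# Show the commands to run with the chosen port
echo "On your local machine: ssh -L ${port}:localhost:${port} <your_csc_username>@puhti.csc.fi"
echo "On the login node:     python3 -m http.server ${port}"
```

If the ssh command or the web server reports that the port is already taken, simply run the sketch again to get another number.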
One of the advantages of Nextflow is that the functional logic of the pipeline is separated from the execution environment. The same script can therefore be executed in different environments by changing the execution configuration, without touching the pipeline code itself. Nextflow uses the executor setting to decide where a job should be run. Once the executor is configured, Nextflow submits each process to the specified job scheduler on your behalf (you don’t need to write an sbatch script; Nextflow writes it on the fly for you).
The default executor is local, where each process is run on the computer where Nextflow is launched. Other executors include:
To enable the SLURM executor on Puhti, simply set the process.executor property to slurm in the nextflow.config file as shown below:
profiles {
    standard {
        process.executor = 'local'
    }
    puhti {
        process.clusterOptions = '--account=project_xxxx --ntasks-per-node=1 --cpus-per-task=4 --ntasks=1 --time=00:00:05'
        process.executor = 'slurm'
        process.queue = 'small'
        process.memory = '10GB'
    }
}
In this case, you can run a nextflow script as below:
nextflow run <nextflow_script> -profile puhti
This will submit each process of your job to Puhti cluster.
nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. Here we use nf-core/atacseq as an example pipeline for ATAC-seq data.
Here is an example batch script to run the pipeline on Puhti:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --partition=small
#SBATCH --account=project_xxxx
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4000
export TMPDIR=$PWD
export SINGULARITY_TMPDIR=$PWD
export SINGULARITY_CACHEDIR=$PWD
unset XDG_RUNTIME_DIR
# Activate Nextflow on Puhti
module load nextflow
# Nextflow command here
nextflow run nf-core/atacseq -r 1.2.1 -profile test,singularity -resume
Copy and paste the above script into a file named atacseq.sh and replace project_xxxx in the Slurm directives with your project number.
Finally, submit your job:
sbatch atacseq.sh
We recommend using Singularity containers rather than conda environments for Nextflow pipelines on Puhti. Most bioconda packages can be converted into Singularity images. For example, take a look at the fastx_toolkit package on its bioconda page. To convert it into a Singularity image, you need to know the URL of the image from the docker pull command, which in this case appears as below:
docker pull quay.io/biocontainers/fastx_toolkit:<tag> # url: docker://quay.io/biocontainers/fastx_toolkit:<tag>
In addition to the URL of a Docker image, it is good practice to use a specific tag for reproducibility of workflows. Here, you can pick a tag (e.g., “0.0.14--he1b5a44_8”) from the list of tags associated with the fastx_toolkit software.
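The image URL follows a regular pattern, so it can be assembled from the package name and the tag. A minimal sketch, assuming the quay.io/biocontainers naming shown in the docker pull command above (the helper name biocontainer_url is hypothetical):

```shell
# Hypothetical helper: build the docker:// URL of a biocontainers image.
# $1 = package name, $2 = tag
biocontainer_url() {
    printf 'docker://quay.io/biocontainers/%s:%s\n' "$1" "$2"
}

# Example with the package and tag discussed above
biocontainer_url fastx_toolkit 0.0.14--he1b5a44_8
```

The printed URL is the value that singularity build expects as its source argument.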
You can then use the following script in a Puhti interactive terminal to prepare the fastx_toolkit Singularity image:
export SINGULARITY_TMPDIR=$LOCAL_SCRATCH
export SINGULARITY_CACHEDIR=$LOCAL_SCRATCH
unset XDG_RUNTIME_DIR
singularity build fastx_toolkit.sif docker://quay.io/biocontainers/fastx_toolkit:0.0.14--he1b5a44_8
Once the above script has executed successfully, there should be a Singularity image named fastx_toolkit.sif in the current folder.