BioMonth - CSC supercomputing and data management for bioscientists

Bioscience software in Puhti

This exercise requires that you have a user account at CSC and that the account is a member of a project with access to the Puhti service.

You will learn:

Let’s imagine that we have some sequencing data that we wish to align to a reference genome, and check the quality of the alignment.

  1. Let’s look at the list of applications and search for suitable aligners. Can you find, for example, the TopHat, STAR, Bowtie and BWA aligners in the list? Which module is needed to run these applications?

The biokit module sets up a set of commonly used bioinformatics tools.

  1. Not all software installed on CSC’s supercomputers has its own manual page in the application list (yet): it might be a new installation, or it might have been installed at the request of a single research group. Let’s check whether the HISAT2 aligner is also available:
module spider hisat

Is some version of HISAT2 available?

  1. Let’s load the biokit module and see what is included.
    module load biokit
    module list
    

    Was HISAT2 included in biokit?
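One way to answer this from the command line is to check which executables are on the PATH once the module is loaded. The sketch below is only illustrative: check_tools is a hypothetical helper, and the executable names (tophat, STAR, bowtie2, bwa, hisat2) are assumed from the tools' usual command names, not taken from the biokit listing.

```shell
# Report whether each named executable is found on the current PATH.
# check_tools is a hypothetical helper written for this exercise.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not found"
        fi
    done
}

# Assumed executable names; adjust them if the installed commands differ.
check_tools tophat STAR bowtie2 bwa hisat2
```

Running it once before and once after module load biokit shows which commands the module added.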

Bioconda environment

  1. After aligning, we might want to check the quality of the alignment with the RSeQC tool. As the output of the module list command above shows, it was not included in biokit. As we learned, you can look for it among the application manual pages and with the command module spider rseqc.

No luck? What next? Let’s take a look at the bioconda environment.

Some applications are installed and used as Conda environments in Puhti. You can also use CSC’s bioconda environment to easily install tools from the Bioconda repository.

Let’s check what is available with spider again, and load one of the modules:

module spider bioconda
module load bioconda/3

Take a look at the message you get. Note that some dependency modules were reloaded in the background. The message says that we first need to set the PROJAPPL environment variable. To do so, run the command below (you can check the names and numbers of your projects with the csc-workspaces command):

export PROJAPPL=/projappl/project_XXXXXXX
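A mistyped project number is easy to miss here, so it can help to confirm that the directory exists before exporting the variable. The following is a minimal sketch, not part of the CSC tooling: set_projappl is a hypothetical helper, and project_XXXXXXX remains the placeholder used above.

```shell
# Export PROJAPPL only if the given directory exists.
# Otherwise print a warning and return a non-zero status.
set_projappl() {
    if [ -d "$1" ]; then
        export PROJAPPL="$1"
        echo "PROJAPPL=$PROJAPPL"
    else
        echo "directory not found: $1" >&2
        return 1
    fi
}

# project_XXXXXXX is a placeholder; replace it with your own project.
set_projappl /projappl/project_XXXXXXX || true
```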

Check which applications are available in this bioconda environment:

conda env list

Do you see RSeQC there?
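If the list is long, it can be filtered instead of read by eye. The sketch below assumes that the environment name contains the string "rseqc" in some capitalisation, which the list itself has to confirm. list_matching_envs is a hypothetical helper.

```shell
# Print the non-comment lines from standard input that contain the
# given pattern, ignoring case. Returns non-zero if nothing matches.
list_matching_envs() {
    grep -v '^#' | grep -i -- "$1"
}

if command -v conda >/dev/null 2>&1; then
    conda env list | list_matching_envs rseqc || echo "no environment matching rseqc"
else
    echo "conda not found: load the bioconda module first"
fi
```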

Using modules in a batch script

  1. When we loaded the bioconda module, some dependency modules were reloaded in the background. This means that the environment changed, and software that was loaded earlier might no longer be available. If you are writing a batch script that uses applications from different modules, take care to load and unload the modules at the right points in the script.
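A batch script for the two steps in this exercise could be organised roughly as follows. This is a sketch under stated assumptions, not a tested CSC template: the partition name, time and memory values are illustrative, project_XXXXXXX is a placeholder, and the actual alignment and RSeQC commands are left as comments. The small stub at the top only lets the load order be dry-run on a machine without the module command.

```shell
#!/bin/bash
#SBATCH --job-name=align_then_qc
#SBATCH --account=project_XXXXXXX    # placeholder: your own project
#SBATCH --partition=small            # assumption: use a partition available to you
#SBATCH --time=00:15:00              # illustrative value
#SBATCH --mem=2G                     # illustrative value

# Outside a module-enabled system, substitute a stub that only prints
# the call, so the load order can be inspected anywhere.
if ! command -v module >/dev/null 2>&1; then
    module() { echo "[dry run] module $*"; }
fi

# Step 1: alignment with a tool from biokit.
module purge
module load biokit
# (alignment command goes here)

# Step 2: quality control with a tool from the bioconda environment.
# Clear the modules from step 1 first so the two setups do not mix.
module purge
export PROJAPPL=/projappl/project_XXXXXXX   # placeholder path
module load bioconda/3
# (environment activation and RSeQC command go here)
```

Clearing the loaded modules between the steps keeps each step's environment explicit, at the cost of reloading anything both steps share.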