Containers and Workflows in Bioinformatics

Getting familiar with Puhti

The main purpose of this tutorial is to get familiar with working on the CSC Puhti supercomputer.

Some exercises are done in interactive mode using the `sinteractive` command, and others as batch jobs.
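An interactive session on a compute node can be requested roughly as sketched below. The project name and resource values are placeholders; check `sinteractive --help` on Puhti for the exact options available.

```bash
# Request an interactive shell on a compute node.
# project_xxxx is a placeholder for your own billing project.
sinteractive --account project_xxxx --time 01:00:00 --mem 4G --cores 1
```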

Interactive use: Retrieving data from bio repositories

These exercises cover retrieving data from various commonly used bio data repositories.

1. Downloading data with curl

2. Downloading data with NCBI edirect
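As a rough sketch of the two approaches, the commands below download a protein sequence in FASTA format. The URL and accession numbers are only examples; substitute the records you actually need.

```bash
# Download a FASTA record over HTTPS with curl (example UniProt accession).
curl -o P12345.fasta "https://rest.uniprot.org/uniprotkb/P12345.fasta"

# Fetch a record from NCBI with the edirect efetch tool
# (on Puhti, edirect is available via the biokit module).
# The accession below is a placeholder example.
module load biokit
efetch -db protein -id NP_000509 -format fasta > example.fasta
```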

Batch jobs: Align the cellulose synthase 3 set with mafft

  1. Make a file called mafft.sh:
     module load nano   # The compute nodes do not have nano by default
     nano mafft.sh
    
  2. Copy the following contents into the file and change “project_xxxx” to the correct project name:
#!/bin/bash
#SBATCH --job-name=test           # Name of the job visible in the queue.
#SBATCH --account=project_xxxx    # Choose the billing project. Has to be defined!
#SBATCH --partition=test          # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --time=00:10:00           # Maximum duration of the job. Max: depends on the partition.
#SBATCH --mem=8G                  # How much RAM is reserved for job per node.
#SBATCH --ntasks=1                # Number of tasks. Max: depends on partition.
#SBATCH --cpus-per-task=1         # How many processors work on one task. Max: Number of CPUs per node.

# Load the bioinformatics tool collection and run the alignment
module load biokit
mafft cesy3.fasta > cesy3_aln.fasta
   
💬 In nano you can use `ctrl + o` to save and `ctrl + x` to exit.
  3. Submit the job to the queue with:

    sbatch mafft.sh
    
  4. Study the results:
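The commands below are one way to inspect the job and its output on Puhti; the job ID placeholder must be replaced with the number printed by `sbatch`.

```bash
# Check the state of your jobs (empty output once the job has finished).
squeue -u $USER

# Slurm writes the job's standard output to slurm-<jobid>.out
# in the submission directory.
cat slurm-*.out

# Look at the start of the alignment produced by mafft.
head cesy3_aln.fasta

# Resource-usage summary for a finished job (replace <jobid>).
seff <jobid>
```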