Containers and Workflows in Bioinformatics

Getting familiar with Puhti

The main purpose of this tutorial is to get familar wit working in CSC Puhti supercomputer.

Some exercises are done in interactive mode using the sinteractive command and some as batch jobs.

Interactive use: Retrieving data from bio repositories

These exercises cover retrieving data from various commonly used bio data repositories.

We will do these exercises using the sinteractive command (substitute xxxx with correct project number):
```
sinteractive --account project_xxxx
```
Move to /scratch directory area of the course project (unless there already):
```
cd /scratch/project_xxxx  # substitute xxxx with correct project number
```
Create a directory for yourself and move to it (unless there already):
```
mkdir $USER
cd $USER
```
💭 Everyone in the project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task, so you won’t accidentally delete or overwrite each others files.

🗯 In normal usage it may be a good idea to even use chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this in the course project, as it would make clean-up after course harder.
You can find more information about this in Disk areas page in the Docs.

1. Downloading data with curl

curl and wget are general tools to download data from an URL.
Download a dataset from internet using curl and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
```
  curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
  ls -ltr
  tar -zxvf pythium.tgz  
  ls -ltr
```

2. Downloading data with NCBI edirect

In this exercise we are using some specialized tools to download data. To use them we must first load the biokit module.
```
  module load biokit
```
Create directory cellulose_synthase and move to this new directory:
```
 mkdir cellulose_synthase
 cd cellulose_synthase
```
Next we use NCBI edirect tool to retrieve some data.
Check how many proteins are found the NCBI protein databanks for Pythium species (count row in the results):
```
  esearch -db protein -query "Pythium [ORGN]" 
```
Then check the number of proteins: cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 that are found for Pythium species.

For cellulose synthase 1 this can be done with:

  esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"

Do the same for the other proteins.

Retrive the cellulose synthase 3 sequenses in Fasta format

  esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta

Then run esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database?

Batch jobs: Align the cellulose synthase 3 set with mafft

Make a file called mafft.sh:

 module load nano   # The compute nodes do not have nano by default
 nano mafft.sh

Copy the following contents into the file and change “project_xxxx” to the correct project name:

#!/bin/bash
#SBATCH --job-name=test           # Name of the job visible in the queue.
#SBATCH --account=project_xxxx    # Choose the billing project. Has to be defined!
#SBATCH --partition=test          # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --time=00:10:00           # Maximum duration of the job. Max: depends of the partition. 
#SBATCH --mem=8G                  # How much RAM is reserved for job per node.
#SBATCH --ntasks=1                # Number of tasks. Max: depends on partition.
#SBATCH --cpus-per-task=1         # How many processors work on one task. Max: Number of CPUs per node.

# 
module load biokit
mafft cesy3.fasta > cesy3_aln.fasta
   

💬 In nano you can use `ctrl + o` to save and `ctrl + x` to exit.

Submit the job to the queue with:
```
sbatch test.sh
```
Study the results:

What files were created?

Study the alignment in more detail:

  infoalign cesy3_aln.fasta
  showalign cesy3_aln.fasta

This site is open source. Improve this page.