The main purpose of this tutorial is to get familiar with working on the CSC Puhti supercomputer.
Some exercises are done in interactive mode using the sinteractive command and some as batch jobs.
These exercises cover retrieving data from various commonly used bio data repositories.
We will do these exercises in an interactive session started with the sinteractive command (substitute xxxx with the correct project number):
sinteractive --account project_xxxx
Move to the /scratch directory of the course project (unless you are already there):
cd /scratch/project_xxxx # substitute xxxx with correct project number
Create a directory for yourself and move to it (unless there already):
mkdir $USER
cd $USER
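If you are unsure whether your directory already exists, `mkdir -p` creates it only when it is missing and does not complain otherwise. A minimal sketch, assuming a temporary directory as a stand-in for /scratch/project_xxxx so that it can be tried anywhere:

```shell
# Stand-in for /scratch/project_xxxx (assumption: a temporary directory, for illustration only)
project_dir=$(mktemp -d)
# Fall back to `id -un` in case $USER is not set in the current environment
user_name=${USER:-$(id -un)}

# -p: create the directory if it is missing, do nothing if it already exists
mkdir -p "$project_dir/$user_name"
mkdir -p "$project_dir/$user_name"   # running it a second time is harmless

cd "$project_dir/$user_name"
pwd
```

On Puhti you would replace `$project_dir` with the real /scratch/project_xxxx path.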
💡 Everyone in the project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task, so you won’t accidentally delete or overwrite each other's files.
🗯 In normal usage it may even be a good idea to use the chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this in the course project, as it would make clean-up after the course harder.
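For reference, restricting write access with chmod could look roughly like the sketch below. It works on a throwaway temporary directory (an assumption made for illustration; `my_subfolder` is a made-up name), so it does not touch the course project:

```shell
# Demonstration directory (assumption: a temporary directory, NOT the course project)
demo_dir=$(mktemp -d)
mkdir "$demo_dir/my_subfolder"

# Remove write permission from the group (g) and others (o);
# the owner's permissions are left unchanged
chmod go-w "$demo_dir/my_subfolder"

# Show the resulting permissions, e.g. drwxr-xr-x
ls -ld "$demo_dir/my_subfolder"
```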
You can find more information about this on the Disk areas page in the Docs.
curl and wget are general tools for downloading data from a URL.
Download a dataset from the internet using curl and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
ls -ltr
tar -zxvf pythium.tgz
ls -ltr
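Before unpacking an archive it can be useful to list its contents first with `tar -t`. The sketch below uses a small archive created on the spot (`demo.tgz` and its contents are made up for illustration), so it runs without a network connection; the same options apply to pythium.tgz:

```shell
# Work in a temporary directory (assumption: illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny example archive
mkdir demo
echo "ACGT" > demo/example.txt
tar -czf demo.tgz demo
rm -r demo

# -t: list contents, -z: gzip-compressed, -f: archive file name
tar -tzf demo.tgz

# -x: extract, -v: print each file name as it is unpacked
tar -zxvf demo.tgz
ls -ltr
```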
In this exercise we are using some specialized tools to download data. To use them we must first load the biokit module.
module load biokit
Create the directory cellulose_synthase and move to this new directory:
mkdir cellulose_synthase
cd cellulose_synthase
Next we use the NCBI EDirect tools to retrieve some data.
Check how many proteins are found in the NCBI protein database for Pythium species (see the count row in the results):
esearch -db protein -query "Pythium [ORGN]"
Then check how many cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 proteins are found for Pythium species.
For cellulose synthase 1 this can be done with:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"
Do the same for the other proteins.
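The three searches differ only in the protein number, so they can also be run in a loop. A minimal sketch, assuming the biokit module is loaded; `build_query` is a hypothetical helper introduced here, and if esearch is not available the commands are only printed:

```shell
# Hypothetical helper: build the Entrez query for cellulose synthase number $1
build_query() {
  printf 'Pythium [ORGN] AND cellulose synthase %s [PROT]' "$1"
}

for n in 1 2 3; do
  query=$(build_query "$n")
  if command -v esearch >/dev/null 2>&1; then
    echo "== cellulose synthase $n =="
    # Same command as above; look for the count row in the output
    esearch -db protein -query "$query"
  else
    echo "esearch not found; would run: esearch -db protein -query \"$query\""
  fi
done
```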
Retrieve the cellulose synthase 3 sequences in FASTA format:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta
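A quick sanity check is to count the sequences in the downloaded file and compare the number with the count reported by esearch. Each FASTA record starts with a header line beginning with `>`, so counting those lines gives the number of sequences. A small sketch (`count_fasta` is a helper name made up for this example):

```shell
# Count the sequences in a FASTA file: one '>' header line per record
count_fasta() {
  grep -c '^>' "$1"
}

# Only run the check if the file from the previous step exists
if [ -f cesy3.fasta ]; then
  count_fasta cesy3.fasta
fi
```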
Then run an esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database.
Next, create a batch job script called mafft.sh that aligns the retrieved sequences with MAFFT:
module load nano # The compute nodes do not have nano by default
nano mafft.sh
#!/bin/bash
#SBATCH --job-name=test # Name of the job visible in the queue.
#SBATCH --account=project_xxxx # Choose the billing project. Has to be defined!
#SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --time=00:10:00 # Maximum duration of the job. Max: depends on the partition.
#SBATCH --mem=8G # How much RAM is reserved for job per node.
#SBATCH --ntasks=1 # Number of tasks. Max: depends on partition.
#SBATCH --cpus-per-task=1 # How many processors work on one task. Max: Number of CPUs per node.
#
module load biokit
mafft cesy3.fasta > cesy3_aln.fasta
💬 In nano you can use `ctrl + o` to save and `ctrl + x` to exit.
Submit the job to the queue with:
sbatch mafft.sh
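To follow the job after submission it helps to capture its job ID. The sketch below uses the `--parsable` option of sbatch, which prints only the job ID (possibly followed by `;` and a cluster name), and then queries the queue for that job. `submit_job` is a hypothetical wrapper written for this example, and the commands are only run where Slurm is available:

```shell
# Hypothetical wrapper: submit a batch script and print only the numeric job ID
submit_job() {
  submit_out=$(sbatch --parsable "$1") || return 1
  # Strip an optional ";cluster" suffix from the --parsable output
  echo "${submit_out%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
  jobid=$(submit_job mafft.sh)
  echo "Submitted job $jobid"
  # Show the state of this job in the queue
  squeue -j "$jobid"
else
  echo "sbatch not found; run this on Puhti"
fi
```

Unless the script sets another output file, Slurm writes the job's standard output and error to a file named slurm-<jobid>.out in the directory where the job was submitted.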
Study the results:
What files were created?
Study the alignment in more detail:
infoalign cesy3_aln.fasta
showalign cesy3_aln.fasta