BioMonth - CSC supercomputing and data management for bioscientists

Imagine that you have a data file and a software binary (e.g., data.txt and softwareA_binary) in your home directory on Puhti, and that you have a shared project (e.g., project_1234) on both Puhti and Mahti. How would you safely share your files with other project members on the same supercomputer (i.e., on Puhti) as well as on Mahti (i.e., another supercomputer at CSC)?

Background: This exercise familiarises you with the main disk areas on the Puhti and Mahti supercomputers. Data files needed for computational analysis should be stored and shared in the project's scratch directory, while software compilations and binaries should be shared in the projappl directory. To find the actual directory paths, use commands such as csc-workspaces and csc-projects. Data transfer between two supercomputers can be done with many tools, including rsync. In this example, avoid using Allas for data transfer between the supercomputers.
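For orientation, a minimal sketch of how to locate these directories from a Puhti login shell, using only the two commands named above (no extra options assumed):

# show the disk areas (home, projappl, scratch) available to you, with their usage and quotas
csc-workspaces

# list your CSC projects and their numbers
csc-projects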

Solution:

  1. First, log in to the Puhti supercomputer using the ssh command as below:
ssh <username>@puhti.csc.fi

Authenticate using the password associated with your CSC user account. Once your login to Puhti is successful, a Linux terminal opens for command-line interaction in your home directory. Let’s assume that the file data.txt is intended for computational use and softwareA_binary is a software tool needed for the analysis. As you know the project name, you can share data.txt in the scratch directory and softwareA_binary in the projappl directory.

  2. Share your softwareA_binary file in the projappl directory:
cp softwareA_binary /projappl/project_1234
  3. Share the data.txt file in the scratch directory:
cp data.txt /scratch/project_1234
    

All new files and directories are fully accessible to other group members by default (including read, write and execute permissions). If you want to restrict access for your group members, you can reset the permissions with the chmod command.

Set read-only permissions for your group members for the file data.txt:

chmod g-w data.txt
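To confirm the change, you can check the permission bits; this quick verification step is an addition to the original exercise:

# the group write bit should now be cleared, e.g. -rw-r--r--
ls -l data.txt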
  4. To share files on the Mahti supercomputer, you can copy the data.txt file from Puhti to the scratch directory on Mahti as below:
rsync -P data.txt <username>@mahti.csc.fi:/scratch/project_1234

You can copy the softwareA_binary file from Puhti to the projappl directory on Mahti as below:

rsync -P softwareA_binary <username>@mahti.csc.fi:/projappl/project_1234
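If you later need to transfer an entire directory rather than single files, the same approach extends to a recursive copy; the directory name below is only an illustration:

# -a recurses into directories and preserves permissions and timestamps
rsync -aP results_dir <username>@mahti.csc.fi:/scratch/project_1234/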

What would be the ideal disk area for performing the following task, which requires high I/O operations?

The task description: The analysis is based on a big tar file containing around 52000 small files, each comprising one or more nucleotide sequences. Unpack the tar file and convert the nucleic acid sequences in each file to the corresponding protein sequences using the transeq software. Once the analysis is finished, pack all the files into a tar file again.

Background: The “normal” Lustre-based project-specific directories, scratch and projappl, can store large amounts of data and make it accessible to all the nodes of Puhti. However, these directories are not good for managing a large number of files. If you need to work with a huge number of smaller files, you should consider using the NVMe-based local temporary scratch directories, either through normal or interactive batch jobs.


Solution:

# launch an interactive session with the required resources (2 cores, 4 GB memory, 250 GB of local NVMe disk)
sinteractive -c 2 -m 4G -d 250

# change the working directory to the fast local scratch directory
cd $LOCAL_SCRATCH

# download the tar file and unpack it (placeholder URL)

wget https://a3s.fi/big_data.tar
tar xf big_data.tar

# perform the actual analysis: translate each fasta file to protein sequences with transeq

for ffile in $(find ./ | grep fasta$ )
do
     transeq $ffile ${ffile}.pep
done 

# pack all the resulting files
tar cvf big_data.pep.tar ./
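Because the local scratch area is temporary and is cleared once the session ends, copy the packed results back to the project's shared scratch directory before exiting; the project path below follows the earlier example:

# copy the results to the persistent scratch area before the interactive session ends
cp big_data.pep.tar /scratch/project_1234/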

Below is the execution time comparison for running the three steps above in LOCAL_SCRATCH and in normal scratch.

|                               | LOCAL_SCRATCH | scratch   |
|-------------------------------|---------------|-----------|
| Step 1. Opening tar file      | 2m 8s         | 4m 12s    |
| Step 2. Analysis              | 9m 42s        | 21m 58s   |
| Step 3. Creating new tar file | 2m 25s        | 42m 21s   |
| Total                         | 14m 15s       | 1h 8m 31s |

How do you make use of the local scratch drive on a compute node for faster computational tasks? Convert the following normal batch job into one that uses the local scratch drive.

Below is a normal batch job that pulls a Docker image from DockerHub and converts it into a Singularity image compatible with HPC environments such as the CSC Puhti and Mahti supercomputers. During the conversion process, several layers are retrieved, cached and then converted into a Singularity image file (.sif format).

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --partition=small
#SBATCH --account=project_xxx

export SINGULARITY_TMPDIR=/scratch/project_xxx/$USER
export SINGULARITY_CACHEDIR=/scratch/project_xxx/$USER
singularity pull --name trinity.simg  docker://trinityrnaseq/trinityrnaseq
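For reference, a job script like this is submitted and monitored with the standard Slurm commands; the script file name below is only an example:

sbatch convert_image.sh   # submit the batch job
squeue -u $USER           # check the state of your queued and running jobs
seff <jobid>              # after completion, inspect the resource usage of the job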

Hints:

#SBATCH --gres=nvme:<local_storage_space_per_node>  # e.g., to claim 200 GB of storage, use option --gres=nvme:200. 

Solution:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --partition=small
#SBATCH --account=project_xxx
#SBATCH  --gres=nvme:100

export SINGULARITY_TMPDIR=$LOCAL_SCRATCH
export SINGULARITY_CACHEDIR=$LOCAL_SCRATCH
unset XDG_RUNTIME_DIR

cd $LOCAL_SCRATCH
#pwd
#df -lh
singularity pull --name trinity.simg docker://trinityrnaseq/trinityrnaseq
mv trinity.simg /scratch/project_xxx/$USER/                                                            
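Once the job has finished, the converted image in /scratch can be inspected and used with singularity; the Trinity command name inside the container is an assumption here:

cd /scratch/project_xxx/$USER
singularity inspect trinity.simg                  # show the image metadata
singularity exec trinity.simg Trinity --version   # run a command from inside the container (assumed command name)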

Below is a comparison of the execution times for running the same job in LOCAL_SCRATCH vs. normal scratch.

|                 | LOCAL_SCRATCH | scratch |
|-----------------|---------------|---------|
| Wall-clock time | 22m 06s       | 50m 06s |