BioMonth - CSC supercomputing and data management for bioscientists

Disk areas in CSC supercomputing environment

CSC users working in the supercomputing environment have access to several disk areas (or directories) for managing their data on the supercomputers. It is therefore important to understand your own disk areas in order to manage personal and project-specific data.

Upon completion of this tutorial, you will know how to:

  1. Identify your personal and project-specific directories in Puhti and Mahti supercomputers
  2. Perform light-weight pre-processing of data files using fast local disks
  3. Move your pre-processed data to a project-specific scratch area before analysis

Identify your personal and project-specific directories in Puhti and Mahti supercomputers

Each user on a CSC supercomputer (Puhti or Mahti) has several disk areas (or directories), each with a specific purpose. You can get an overview of your directories by running the following command on a login node:

csc-workspaces 

The above command shows information about your directories and their current quotas. In brief, you have a personal home directory, while each of your projects has its own projappl directory (for software installations) and scratch directory (for processing data).
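As a quick orientation, the sketch below shows one way to inspect these standard locations; project_xxx is a placeholder for one of your own project numbers:

echo $HOME                    # your personal home directory
ls /projappl/project_xxx      # project-specific directory for software installations
ls /scratch/project_xxx       # project-specific directory for processing data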

Perform light-weight pre-processing of data files using fast local disks

We sometimes come across situations where we have to handle an unusually large number of small files, which can cause a heavy I/O load on the shared file systems of the supercomputing environment. To facilitate such operations, CSC provides fast local disk areas on the login and compute nodes.

To locate this directory on a login node, use the following command:

echo $TMPDIR

This local disk area on the login nodes is meant for light-weight preprocessing of data before you start the actual analysis on the scratch drive. In the toy example below, you download a tar file containing thousands of small files and then merge them into one big file using the local disk.

  1. Download the tar file from Allas object storage:

cd $TMPDIR
wget https://a3s.fi/CSC_training/Individual_files.tar.gz

  2. Unpack the downloaded tar file:

tar -xzvf Individual_files.tar.gz
cd Individual_files

  3. Merge all the small files into one file and remove the small files (a verification snippet follows this list):

find . -name 'individual.fasta*' | xargs cat >> Merged.fasta
find . -name 'individual.fasta*' | xargs rm
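To sanity-check the merge, you can count the FASTA records in the output, since each record starts with a '>' header line:

grep -c '^>' Merged.fasta    # number of sequences in the merged file
ls -lh Merged.fasta          # size of the merged file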

However, if you are going to perform heavy computing tasks on such a large number of small files, you should use the local storage areas on the compute nodes, which can be accessed either interactively or through batch jobs.

In an interactive job, use the following command to find the local storage area on that compute node:

echo $LOCAL_SCRATCH 
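If your interactive session does not yet include local storage, you can request it when launching the job. A minimal sketch, assuming Puhti's sinteractive wrapper and its --tmp option for requesting node-local disk (in GiB); project_xxx is a placeholder for your own project number:

sinteractive --account project_xxx --time 1:00:00 --tmp 32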

When using batch jobs, use the environment variable $LOCAL_SCRATCH in your batch job script to access the local storage on that node.
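To illustrate, below is a minimal batch script sketch, not a definitive recipe: it assumes the --gres=nvme:<size> option that CSC documents for requesting node-local NVMe disk, and project_xxx and output_file are placeholders for your own project number and result file:

#!/bin/bash
#SBATCH --account=project_xxx       # replace with your own project number
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --gres=nvme:10              # request 10 GB of fast local disk (assumed syntax from CSC docs)

# Copy input data from scratch to the node-local disk and work there
cp /scratch/project_xxx/$USER/Merged.fasta $LOCAL_SCRATCH/
cd $LOCAL_SCRATCH

# ... run your I/O-heavy analysis here ...

# Copy results back to scratch before the job ends; the local disk is cleaned up afterwards
cp output_file /scratch/project_xxx/$USER/  # output_file is a hypothetical result file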

Move your pre-processed data to a project-specific scratch area before analysis

All directories on the scratch drive are project-based, so you need to know your project number to find the actual path of your scratch directory. While csc-workspaces lists the scratch directories for all of your project numbers, it may not be immediately obvious how those numbers map to your projects' metadata. You can instead use the following command to see more details on your project(s):

csc-projects

Once you know your project number, which is of the form project_xxx, you can move the pre-processed data from the earlier step (i.e., the Merged.fasta file) to a project-specific directory on the scratch area as below:

mkdir -p /scratch/project_xxx/$USER
mv Merged.fasta /scratch/project_xxx/$USER
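You can verify that the file is in place before starting your analysis; project_xxx is again a placeholder for your own project number:

ls -lh /scratch/project_xxx/$USER/Merged.fasta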

You have now successfully moved your data to the scratch area and can start performing the actual analysis using batch job scripts, which you will learn about in depth in a different module.