These exercises cover retrieving data from several commonly used biological data repositories.
We will do these exercises in an interactive session started with the sinteractive command (substitute xxxxx with the correct project number):
sinteractive --account project_xxxxx
To use the applications in exercises 2 and 3, we need to load the biokit module:
module load biokit
Move to the /scratch directory of the course project (substitute xxxx with the correct project number):
cd /scratch/project_xxxx
Create a directory for yourself and move to it:
mkdir $USER
cd $USER
Everyone in the project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task, so you won’t accidentally delete or overwrite each other's files.
In normal usage it may even be a good idea to use the chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this in the course project, as it would make clean-up after the course harder.
You can find more information about this on the Disk areas page in the Docs.
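As an illustration of the access-rights idea above, the following minimal sketch removes write permission for group and others on a subfolder. It works on a throwaway directory created with mktemp, so nothing in the course project is changed; the folder name my_subfolder is a placeholder and not part of the exercise.

```shell
# Work in a temporary directory so the course project area is left untouched.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/my_subfolder"    # placeholder name for your own subfolder

# Owner: read, write, traverse. Group and others: read and traverse only.
chmod u=rwx,go=rx "$demo_dir/my_subfolder"

# The first ten characters of the long listing show the resulting mode.
perms=$(ls -ld "$demo_dir/my_subfolder" | cut -c1-10)
echo "$perms"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Other project members can still read the folder after this change, but only the owner can create, modify or delete files in it.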
curl and wget are general tools for downloading data from a URL.
Download a dataset from the internet using curl and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
ls -ltr
tar zxvf pythium.tgz
ls -ltr
tree pythium
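Before unpacking an archive it can be useful to list its contents first. The sketch below shows this with the t (list) option of tar. To stay runnable without network access, it builds a small stand-in archive, demo.tgz, from made-up files; with the course data you would run tar ztf pythium.tgz instead.

```shell
# Build a tiny stand-in archive (file and directory names are made up).
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir demo_data
echo "placeholder" > demo_data/genome.fasta
tar zcf demo.tgz demo_data

# t = list contents, z = gzip-compressed, f = archive file; nothing is extracted.
listing=$(tar ztf demo.tgz)
echo "$listing"

# Count the regular files (entries not ending in "/") in the archive.
n_files=$(echo "$listing" | grep -vc '/$')
echo "files in archive: $n_files"
```

Listing first shows whether the archive unpacks into its own directory or directly into the current one.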
Create the directory cellulose_synthase and move into it:
mkdir cellulose_synthase
cd cellulose_synthase
Next, we use the NCBI EDirect tools to retrieve some data.
Check how many proteins are found in the NCBI protein database for Pythium species (see the Count row in the results):
esearch -db protein -query "Pythium [ORGN]"
Then check how many cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 proteins are found for Pythium species.
For cellulose synthase 1 this can be done with:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"
Do the same for the other proteins.
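The three searches differ only in the protein number, so they can also be written as a loop. This is a minimal sketch: it prints each query and runs esearch only when the command is available (for example after module load biokit), so the loop itself can be tried anywhere.

```shell
for n in 1 2 3; do
    query="Pythium [ORGN] AND cellulose synthase $n [PROT]"
    echo "Query: $query"
    # Run the search only if EDirect's esearch is on the PATH.
    if command -v esearch >/dev/null 2>&1; then
        esearch -db protein -query "$query" || echo "esearch failed for: $query" >&2
    fi
done
```

Printing the query before each search makes it easy to match every result block to the protein it belongs to.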
Retrieve the cellulose synthase 3 sequences in FASTA format:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta
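A quick way to check how many sequences ended up in a FASTA file is to count its header lines, which start with >. The sketch below demonstrates this on a small made-up file, example.fasta (names and sequences are placeholders, not real data); for the exercise you would point grep at cesy3.fasta and compare the number with the count reported by esearch.

```shell
demo_dir=$(mktemp -d)
cat > "$demo_dir/example.fasta" <<'EOF'
>seq1 placeholder record
MKTAYIAKQR
QISFVKSHFS
>seq2 placeholder record
MKTAYIAKQ
EOF

# Every FASTA record starts with one header line beginning with ">".
n_seqs=$(grep -c '^>' "$demo_dir/example.fasta")
echo "sequences: $n_seqs"
```

The pattern is quoted so that the shell does not interpret > as output redirection.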
Then run an esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database.
Next, align the retrieved sequences with MAFFT:
mafft cesy3.fasta > cesy3_aln.fasta
Then examine the resulting alignment:
infoalign cesy3_aln.fasta
showalign cesy3_aln.fasta
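In a multiple sequence alignment every row has the same length, because gaps pad the shorter sequences. As a simple sanity check alongside infoalign, the sketch below sums the length of each record with awk. It uses a tiny made-up alignment (placeholder names and residues); with the exercise data the input would be cesy3_aln.fasta.

```shell
demo_dir=$(mktemp -d)
cat > "$demo_dir/example_aln.fasta" <<'EOF'
>seqA placeholder
MKT-YIAK
QR--
>seqB placeholder
MKTAYI-K
QRSF
EOF

# Print "<first word of header> <aligned length>" for each record.
lengths=$(awk '
    /^>/ { if (name != "") print name, len
           name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name, len }
' "$demo_dir/example_aln.fasta")
echo "$lengths"

# A valid alignment has exactly one distinct length.
n_distinct=$(echo "$lengths" | awk '{ print $2 }' | sort -u | wc -l | tr -d ' ')
echo "distinct lengths: $n_distinct"
```

If more than one distinct length is reported, the file is probably not an alignment, for example because the unaligned input file was used by mistake.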
Check the options of enaDataGet with the command:
enaDataGet -h
Download a file (the Pythium iwayamai genome assembly):
enaDataGet AKYA02000000 -f fasta
gunzip AKYA02.fasta.gz
ls -ltr
head -20 AKYA02.fasta
tail AKYA02.fasta
infoseq_summary AKYA02.fasta
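A compressed file can also be checked and previewed without unpacking it on disk. The sketch below uses gzip -t to test integrity and gzip -dc to stream the contents to head. It runs on a small made-up file, demo.fasta.gz (placeholder content); in the exercise the same commands would apply to the downloaded .gz file before running gunzip.

```shell
demo_dir=$(mktemp -d)
printf '>contig1 placeholder\nACGTACGT\n>contig2 placeholder\nTTGACA\n' > "$demo_dir/demo.fasta"
gzip "$demo_dir/demo.fasta"          # replaces demo.fasta with demo.fasta.gz

# -t checks the compressed data for errors; exit status 0 means the file is intact.
gzip -t "$demo_dir/demo.fasta.gz" && status="ok" || status="corrupt"
echo "integrity: $status"

# -d decompress, -c write to standard output; the .gz file itself stays as it is.
first_line=$(gzip -dc "$demo_dir/demo.fasta.gz" | head -n 1)
echo "$first_line"
```

Streaming is handy for large genome files, since it avoids keeping both the compressed and the uncompressed copy on disk.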
Then compare the cellulose synthase 3 sequences against the genome using BLAST:
pb tblastn -query cesy3.fasta -dbnuc AKYA02.fasta -out blast_result.txt