=======================================================================

HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES

=======================================================================
0) I/O (tmpfs part)

a) Log in to ENTROPY

b) Allocate 1 CPU and log in to a random node using the srun command

c) Download https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/chm13v2.0.fa.tar.gz

- untar it in your home directory (measure the time of the operation)
time tar -zxvf chm13v2.0.fa.tar.gz

- tmpfs

run 'df -h' and look for:
tmpfs                         63G   60K   63G   1% /dev/shm

move the file to /dev/shm

cd /dev/shm; mv ~/chm13v2.0.fa.tar.gz .

Re-run the command:
time tar -zxvf chm13v2.0.fa.tar.gz

Clean:
rm  /dev/shm/chm13v2.0.fa*
rm ~/chm13v2.0.fa*

Read: https://en.wikipedia.org/wiki/Tmpfs
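The timing comparison above can also be scripted; a minimal Python sketch that times the extraction (the paths in the example are placeholders, adjust them to your own files):

```python
import tarfile
import time

def timed_extract(tar_path, dest_dir):
    """Extract a .tar.gz archive and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return time.perf_counter() - start

# Example (hypothetical paths):
# home_time = timed_extract("/home/you/chm13v2.0.fa.tar.gz", "/home/you")
# shm_time  = timed_extract("/dev/shm/chm13v2.0.fa.tar.gz", "/dev/shm")
# print(f"home: {home_time:.1f}s  tmpfs: {shm_time:.1f}s")
```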


1) Protein disorder (CPU PART)

Theory:
https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins

Cool animations:
https://iimcb.genesilico.pl/metadisorder/protein_disorder_intrinsically_unstructured_proteins_gallery_images.html

TASK 1: Predict protein disorder for E. coli, human and SwissProt.

E. coli     https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/UP000000625/UP000000625_83333.fasta.gz
H. sapiens  https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
SwissProt   https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

TODO:
- download (wget)
- extract (gunzip)
- check how many sequences you have in each file (grep)
- write a script that will normalize the sequence lines (join them into one line):
e.g. 
>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTN
EMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVI
RSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKD
VLDFIDTFSVRESL

becomes:
>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTNEMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVIRSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKDVLDFIDTFSVRESL

python3 normalize_fasta.py -i UP000000625_83333.fasta -o UP000000625_83333_flat.fasta
From now on you can work with those files with standard unix commands like head, tail, wc, etc.
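A minimal sketch of normalize_fasta.py (the -i/-o interface matches the call above; the parsing itself is one straightforward approach):

```python
import argparse
import sys

def normalize_fasta(lines):
    """Yield FASTA records with each sequence joined into a single line."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header
        yield "".join(seq)

if __name__ == "__main__" and len(sys.argv) > 1:
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", required=True, help="input multi-line FASTA")
    parser.add_argument("-o", required=True, help="output flattened FASTA")
    args = parser.parse_args()
    with open(args.i) as fin, open(args.o, "w") as fout:
        for out_line in normalize_fasta(fin):
            fout.write(out_line + "\n")
```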

- write a script that will split the fasta input file into multiple files, each containing at most
a fixed number of sequences, e.g. at most 10k sequences

For instance, UP000000625_83333.fasta contains 4403 sequences and we want to split it into
1k-sequence parts (thus 5 files should be created):

python3 split_fasta.py UP000000625_83333_flat.fasta 1 <-- the last parameter defines the number of
sequences in thousands

UP000000625_83333_flat_0k.fasta (sequences 1-1000)
UP000000625_83333_flat_1k.fasta (sequences 1001-2000)
UP000000625_83333_flat_2k.fasta (sequences 2001-3000)
UP000000625_83333_flat_3k.fasta (sequences 3001-4000)
UP000000625_83333_flat_4k.fasta (sequences 4001-4403)
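A sketch of split_fasta.py that produces the part names listed above (it assumes the input is already flattened, i.e. exactly one sequence line per header):

```python
import os
import sys

def split_fasta(flat_path, k_thousand):
    """Split a flattened FASTA file into parts of at most k_thousand*1000 sequences."""
    chunk = k_thousand * 1000
    base, ext = os.path.splitext(flat_path)
    out, n_seqs, part = None, 0, 0
    with open(flat_path) as fin:
        for line in fin:
            if line.startswith(">"):
                if n_seqs % chunk == 0:          # start a new part file
                    if out:
                        out.close()
                    out = open(f"{base}_{part}k{ext}", "w")
                    part += 1
                n_seqs += 1
            out.write(line)
    if out:
        out.close()

if __name__ == "__main__" and len(sys.argv) == 3:
    split_fasta(sys.argv[1], int(sys.argv[2]))
```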


Now, you have 5 files that can be run on different CPUs (separate jobs, threads, etc.). In theory, you could
get the same effect in one Python script by delegating the data splits to separate threads, but in practice
you never know on which node and when a given part will be calculated on the cluster; thus, if your workload
can be divided into independent parts, those should be run as separate jobs.

Note that you should benchmark the speed of the method you are running and adjust the size
of the parts accordingly. For instance, you will not benefit much from splitting SwissProt
into >500 parts of 1k sequences, as you do not have 500 CPUs/GPUs. On the other hand,
too large a part may make calculating on the cluster impossible (you will hit the walltime limit).

MAIN TASK:
1) Run the predictions on single CPU
2) Distribute the load into N CPUs using SLURM on the entropy cluster
   - divide the input file into 4 approx. equal parts,
   - write an SBATCH script that will allocate 1 CPU and run IUPred (see below) on a single part file,
   - submit SBATCH jobs for the 4 parts of the fasta file.

Note the runtime of (1) and (2).
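Generating and submitting the job scripts can itself be scripted; a sketch that writes one SBATCH script per part file and submits it (the job-name, walltime and the iupred_runner.py call are placeholders -- check the entropy documentation for the partition and limits you should use):

```python
import subprocess
import sys

def make_sbatch(part_file, time_limit="01:00:00"):
    """Return the text of an SBATCH script running the IUPred wrapper on one part file."""
    return (
        "#!/bin/bash\n"
        f"#SBATCH --job-name=iupred_{part_file}\n"
        "#SBATCH --cpus-per-task=1\n"
        f"#SBATCH --time={time_limit}\n"
        f"python3 iupred_runner.py {part_file}\n"
    )

if __name__ == "__main__" and len(sys.argv) > 1:
    for part in sys.argv[1:]:
        script = f"run_{part}.sh"
        with open(script, "w") as f:
            f.write(make_sbatch(part))
        subprocess.run(["sbatch", script], check=True)  # submit the job
```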
 
The program we will use is called IUPred (https://iupred2a.elte.hu/)
Paper: https://doi.org/10.1093/nar/gky384
wget https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc1/iupred2a.tar.gz


Specification
Input: fasta file
>header1
sequence1
>header2
sequence2
...

Output: pseudo-fasta file with predictions stored as follows (note the extra lines):
>header1
MEPEPEPEQEANKEEEKILSAAVRAKIERNRQRA...
DDDDDDDDDDD-----------------------...
8667556667843222222112333343322221...         
>header2
LACRPYPTGEGISTVKAPPKVISGGGFFIEEEEA...
DDDDDDDDDD------------------DDDDD-...
6668757656323233244323211113778953...
...

Note that the provided iupred2a.py script works only for a single sequence, thus you need to
write a simple wrapper that runs it repeatedly for all sequences from our multi-fasta file.


The original output of the method is:
# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS   RES     IUPRED2
1       M       0.5331
2       M       0.3847
3       K       0.2333
4       R       0.1117
5       K       0.0455
6       R       0.0167
...

In the last line of our format we store the prediction score, rounded to 10 bins (digits 0-9).
In the line above it, residues with score <0.5 are marked '-' and the rest 'D'.

Use the 'short' version.
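The two annotation lines can be derived from the raw scores with a small helper; a sketch of the mapping described above (binning here is done by truncation -- rounding is an equally valid convention, pick one and state it):

```python
def encode_scores(scores):
    """Map per-residue IUPred scores to the two annotation lines:
    'D' if score >= 0.5 else '-', and the score reduced to one of 10 bins (0-9)."""
    symbols = "".join("D" if s >= 0.5 else "-" for s in scores)
    bins = "".join(str(min(int(s * 10), 9)) for s in scores)
    return symbols, bins
```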

As the output format also differs, we need to write a parser.

Note that the wrapper and parser can be one script.

python3 iupred_runner.py UP000000625_83333_flat_0k.fasta
will produce
UP000000625_83333_flat_0k_iupred_short.fasta
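A sketch of the combined wrapper and parser. The exact iupred2a.py command line is an assumption -- check the README inside the tarball; the parsing keys off the POS/RES/score columns shown in the original output above, and the input is assumed to be a flattened (one sequence line per header) fasta file:

```python
import subprocess
import sys
import tempfile

def parse_iupred(output):
    """Extract per-residue scores from IUPred's original output (skips '#' comments)."""
    return [float(line.split()[2]) for line in output.splitlines()
            if line.strip() and not line.startswith("#")]

def run_iupred(header, seq, iupred="iupred2a.py", mode="short"):
    """Run IUPred on one sequence; return (symbols, bins) annotation lines."""
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
        tmp.write(f"{header}\n{seq}\n")
        path = tmp.name
    # Assumed invocation: python3 iupred2a.py <seqfile> <mode>
    out = subprocess.run(["python3", iupred, path, mode],
                         capture_output=True, text=True, check=True).stdout
    scores = parse_iupred(out)
    symbols = "".join("D" if s >= 0.5 else "-" for s in scores)
    bins = "".join(str(min(int(s * 10), 9)) for s in scores)
    return symbols, bins

if __name__ == "__main__" and len(sys.argv) > 1:
    flat = sys.argv[1]
    out_name = flat.replace(".fasta", "_iupred_short.fasta")
    with open(flat) as fin, open(out_name, "w") as fout:
        for header in fin:
            seq = next(fin).strip()
            symbols, bins = run_iupred(header.strip(), seq)
            fout.write(f"{header.strip()}\n{seq}\n{symbols}\n{bins}\n")
```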

Note that merging the parts can simply be done with the 'cat' command.

Finally: calculate the percent of disorder for both proteomes (separate script).
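A sketch of that statistics script, assuming the 4-line pseudo-fasta layout above (header, sequence, D/- line, bin line, repeated):

```python
import sys

def percent_disorder(lines):
    """Percent of residues marked 'D' in a 4-line-per-record pseudo-fasta."""
    disordered = total = 0
    records = [l.rstrip("\n") for l in lines if l.strip()]
    for i in range(2, len(records), 4):   # the D/- line is the 3rd of each record
        ann = records[i]
        disordered += ann.count("D")
        total += len(ann)
    return 100.0 * disordered / total if total else 0.0

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        print(f"{percent_disorder(f):.2f}% disordered")
```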

======================================================================================================


2) Secondary structure (GPU PART)

Theory:
https://en.wikipedia.org/wiki/Protein_secondary_structure

TASK 2: Predict protein secondary structure for two proteomes (use the same as in TASK 1).

1) Run the predictions on CPU
2) Run the predictions on GPU

To switch between CPU and GPU change the 'device' argument of the pipeline:
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=0)  # GPU (first CUDA device)
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=-1) # CPU

Use srun (interactive mode) - Just reserve some time

Note the runtime of (1) and (2). For SwissProt use only GPU.

 
The program we will use is called ProtTrans (https://github.com/agemagician/ProtTrans/tree/master)

To be exact we will use Rostlab/prot_bert_bfd_ss3 model. Read:
https://github.com/agemagician/ProtTrans/blob/master/Prediction/ProtBert-BFD-Predict-SS3.ipynb

Paper: https://doi.org/10.1109/TPAMI.2021.3095381

===================================================================
INSTALLATION:

python3 -m venv venv_ProtBert
source venv_ProtBert/bin/activate
pip install torch
pip install transformers
pip install sentencepiece
deactivate

===================================================================
python3
===================================================================
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import torch

# on the master node we do not use GPUs, so this falls back to CPU
device_type = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

#JUST ONCE TO DOWNLOAD THE MODEL FILE
pipeline_bfd = TokenClassificationPipeline( model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"), tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=device_type)
===================================================================

TODO: write the wrapper & parser that will store the predictions in pseudo-fasta format:

>header1
MVIHFSNKPAKYTPNTTVAFLALVD...
-EEEE-----------EEEEEEEE-...
8648775778844488767777776...
>header2
QGVDLVAAFEAHRTQIEAVARVKLP...
----HHHHHHHHHHHHHHHHEEE--...
3787887767787677533344454...
...
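The pipeline returns one dict per residue (with 'entity' and 'score' keys, among others); a sketch of turning that into the pseudo-fasta record above. The ss3 label set ('H', 'E', 'C') and the mapping of coil 'C' to '-' are assumptions here -- verify them against the notebook linked above:

```python
def to_pseudo_fasta(header, seq, predictions):
    """Format one sequence's per-residue predictions (list of dicts with
    'entity' and 'score' keys, as returned by TokenClassificationPipeline)
    into the 4-line pseudo-fasta record."""
    labels = "".join("-" if p["entity"] == "C" else p["entity"] for p in predictions)
    bins = "".join(str(min(int(p["score"] * 10), 9)) for p in predictions)
    return f"{header}\n{seq}\n{labels}\n{bins}"

# Usage with the pipeline (note that ProtBert expects space-separated residues):
# preds = pipeline_bfd(" ".join(sequence))
# print(to_pseudo_fasta(header, sequence, preds))
```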

Finally: calculate the percent of secondary structure elements for both proteomes and the SwissProt file
(separate script).

=====================================================================

Useful links:
https://entropy.mimuw.edu.pl/grafana/ - for monitoring ENTROPY load
https://entropy-doc.mimuw.edu.pl - the documentation