=======================================================================
HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES
=======================================================================

0) I/O (tmpfs part)

a) Log in to ENTROPY
b) Allocate 1 CPU and log in to a random node using the srun command
c) Download https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/chm13v2.0.fa.tar.gz

- untar it in your home directory (measure the time of the operation)

time tar -zxvf chm13v2.0.fa.tar.gz

- tmpfs

run 'df -h' and look for:
tmpfs            63G   60K   63G   1% /dev/shm

move the file to /dev/shm

cd /dev/shm; mv ~/chm13v2.0.fa.tar.gz .

Re-run the command:

time tar -zxvf chm13v2.0.fa.tar.gz

Clean:
rm /dev/shm/chm13v2.0.fa*
rm ~/chm13v2.0.fa*

Read: https://en.wikipedia.org/wiki/Tmpfs

1) Protein disorder (CPU PART)

Theory: https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins
Cool animations: https://iimcb.genesilico.pl/metadisorder/protein_disorder_intrinsically_unstructured_proteins_gallery_images.html

TASK 1
Predict protein disorder for E. coli, human and SwissProt.

E. coli
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/UP000000625/UP000000625_83333.fasta.gz
H. sapiens
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
SwissProt
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

TODO:
- download (wget)
- extract (gunzip)
- check how many sequences you have in each file (grep)
- write a script that will normalize the sequence lines (join them into one line):

e.g.
>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTN
EMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVI
RSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKD
VLDFIDTFSVRESL

becomes:

>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTNEMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVIRSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKDVLDFIDTFSVRESL

python3 normalize_fasta.py -i UP000000625_83333.fasta -o UP000000625_83333_flat.fasta

From now on you can work with those files using standard unix commands like head, tail, wc, etc.

- write a script that will split the fasta input file into multiple files, each holding a fixed, arbitrary number of sequences, e.g. at most 10k sequences.

For instance, UP000000625_83333.fasta contains 4403 sequences and we want to split it into 1k-sequence parts (thus 5 files should be created):

python3 split_fasta.py UP000000625_83333_flat.fasta 1   <-- this last parameter defines the number of sequences in thousands

UP000000625_83333_flat_0k.fasta (sequences 1-1000)
UP000000625_83333_flat_1k.fasta (sequences 1001-2000)
UP000000625_83333_flat_2k.fasta (sequences 2001-3000)
UP000000625_83333_flat_3k.fasta (sequences 3001-4000)
UP000000625_83333_flat_4k.fasta (sequences 4001-4403)

Now, you have 5 files that can be run on different CPUs (separate jobs, threads, etc.).
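A minimal sketch of normalize_fasta.py is shown below (one possible implementation; only the -i/-o flags from the example call above are taken from the task description):
_____________________________________________________________________
#!/usr/bin/env python3
# normalize_fasta.py - join multi-line FASTA sequences into single lines.
import argparse

def normalize(in_path, out_path):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        seq_parts = []
        for line in fin:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # flush the previous sequence before writing the new header
                if seq_parts:
                    fout.write(''.join(seq_parts) + '\n')
                    seq_parts = []
                fout.write(line + '\n')
            elif line:
                seq_parts.append(line)
        if seq_parts:
            fout.write(''.join(seq_parts) + '\n')

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Flatten FASTA sequences to one line each')
    parser.add_argument('-i', required=True, help='input FASTA file')
    parser.add_argument('-o', required=True, help='output (flattened) FASTA file')
    args = parser.parse_args()
    normalize(args.i, args.o)
_____________________________________________________________________

And a sketch of split_fasta.py following the same naming scheme (_0k, _1k, ...); it assumes the input was already flattened, i.e. one header line followed by exactly one sequence line:
_____________________________________________________________________
#!/usr/bin/env python3
# split_fasta.py - split a flattened FASTA file into parts of <k>*1000 sequences each.
import os
import sys

def split_fasta(path, k):
    per_part = k * 1000
    base, ext = os.path.splitext(path)
    out, n_seq, part = None, 0, 0
    with open(path) as fin:
        for line in fin:
            if line.startswith('>'):
                if n_seq % per_part == 0:
                    # start a new part file every <per_part> sequences
                    if out:
                        out.close()
                    out = open(f'{base}_{part * k}k{ext}', 'w')
                    part += 1
                n_seq += 1
            out.write(line)
    if out:
        out.close()

if __name__ == '__main__':
    split_fasta(sys.argv[1], int(sys.argv[2]))
_____________________________________________________________________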
In theory, you could get the same effect with a single python script that delegates the data splits to separate threads, but in practice you never know on which node and when a given part will be calculated on the cluster; thus, if your workload can be divided into independent parts, those parts should be run as separate jobs. Note that you should benchmark the speed of the method you are running and adjust the size of the parts accordingly. For instance, you will not benefit much from splitting SwissProt into >500 parts of 1k sequences, as you do not have 500 CPUs/GPUs. On the other hand, if a file is too big, calculating it on the cluster may become impossible (you will hit the walltime limit).

MAIN TASK:
1) Run the predictions on a single CPU
2) Distribute the load onto N CPUs using SLURM on the entropy cluster
   - divide the input file into 4 approx. equal parts,
   - write an SBATCH script that will allocate 1 CPU and run IUPred (see below) on a single part file,
   - submit SBATCH jobs for the 4 different parts of the fasta file.
Note the runtime of (1) and (2).

The program we will use is called IUPred (https://iupred2a.elte.hu/)
Paper: https://doi.org/10.1093/nar/gky384

wget https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc1/iupred2a.tar.gz

Specification

Input: fasta file
>header1
sequence1
>header2
sequence2
...

Output: pseudo-fasta file with predictions stored as follows (note the extra lines):
>header1
MEPEPEPEQEANKEEEKILSAAVRAKIERNRQRA...
DDDDDDDDDDD-----------------------...
8667556667843222222112333343322221...
>header2
LACRPYPTGEGISTVKAPPKVISGGGFFIEEEEA...
DDDDDDDDDD------------------DDDDD-...
6668757656323233244323211113778953...
...

Note that the provided iupred2a.py script works only for a single sequence, thus you need to write a simple wrapper to run it multiple times for all sequences from our multi-fasta file.

The original output of the method is:

# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS   RES     IUPRED2
1       M       0.5331
2       M       0.3847
3       K       0.2333
4       R       0.1117
5       K       0.0455
6       R       0.0167
...

In the last line of our format we store the prediction score (rounded to 10 bins). Scores <0.5 are '-' and the rest is 'D'. Use the 'short' version. As the format is also different, we need to write a parser. Note that the wrapper and parser can be one script (a minimal sketch is given below, after the TASK 2 introduction).

python3 iupred_runner.py UP000000625_83333_flat_0k.fasta
will produce
UP000000625_83333_flat_0k_iupred_short.fasta

Notice that merging the parts can be done simply with the 'cat' command.

Finally: calculate the percent of disorder for both proteomes (separate script).

======================================================================================================

2) Secondary structure (GPU PART)

Theory: https://en.wikipedia.org/wiki/Protein_secondary_structure

TASK 2:
Predict protein secondary structure for two proteomes (use the same as in TASK 1).

1) Run the predictions on CPU
2) Run the predictions on GPU

To switch between CPU and GPU change:
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=0)    #GPU
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=-1)   #CPU

Use srun (interactive mode) - just reserve some time

Note the runtime of (1) and (2). For SwissProt use only GPU.
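Coming back to TASK 1: a minimal sketch of the iupred_runner.py wrapper & parser described above. It assumes that iupred2a.py is invoked as 'python3 iupred2a.py <single-sequence fasta> short' and prints the per-residue table shown above (run it from the directory containing iupred2a.py and its data files, or adjust IUPRED_CMD and paths to your copy):
_____________________________________________________________________
#!/usr/bin/env python3
# iupred_runner.py - wrapper & parser around iupred2a.py for a multi-FASTA file.
# Expects the flattened FASTA produced by normalize_fasta.py (header + one sequence line).
import os
import subprocess
import sys
import tempfile

IUPRED_CMD = ['python3', 'iupred2a.py']  # path to the IUPred2A script (assumption)

def read_flat_fasta(path):
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    return list(zip(lines[0::2], lines[1::2]))  # (header, sequence) pairs

def run_iupred(seq):
    # iupred2a.py works on a single sequence, so write it to a temporary FASTA file
    with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
        tmp.write('>tmp\n' + seq + '\n')
        tmp_path = tmp.name
    try:
        out = subprocess.run(IUPRED_CMD + [tmp_path, 'short'],
                             capture_output=True, text=True, check=True).stdout
    finally:
        os.remove(tmp_path)
    # keep only the data lines: POS RES IUPRED2
    return [float(line.split()[2]) for line in out.splitlines()
            if line and not line.startswith('#')]

def main(in_path):
    out_path = in_path.replace('.fasta', '_iupred_short.fasta')
    with open(out_path, 'w') as fout:
        for header, seq in read_flat_fasta(in_path):
            scores = run_iupred(seq)
            disorder = ''.join('D' if s >= 0.5 else '-' for s in scores)
            bins = ''.join(str(min(int(s * 10), 9)) for s in scores)  # 10 bins: 0-9
            fout.write('\n'.join([header, seq, disorder, bins]) + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
_____________________________________________________________________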
The program we will use is ProtTrans (https://github.com/agemagician/ProtTrans/tree/master). To be exact, we will use the Rostlab/prot_bert_bfd_ss3 model.

Read: https://github.com/agemagician/ProtTrans/blob/master/Prediction/ProtBert-BFD-Predict-SS3.ipynb
Paper: https://doi.org/10.1109/TPAMI.2021.3095381

===================================================================
INSTALLATION:

python3 -m venv venv_ProtBert
source /home/lukaskoz/venv_ProtBert/bin/activate
pip install torch
pip install transformers
pip install sentencepiece
exit()
===================================================================
python3
===================================================================
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import torch

# on the master node we do not use GPUs, thus we switch to CPU
device_type = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# JUST ONCE, TO DOWNLOAD THE MODEL FILE
pipeline_bfd = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=device_type)
===================================================================

TODO: write the wrapper & parser that will store the predictions in a pseudo-fasta file like this (a minimal sketch is given below, after the OOM note in the EXTRA section):

>header1
MVIHFSNKPAKYTPNTTVAFLALVD...
-EEEE-----------EEEEEEEE-...
8648775778844488767777776...
>header2
QGVDLVAAFEAHRTQIEAVARVKLP...
----HHHHHHHHHHHHHHHHEEE--...
3787887767787677533344454...
...

Finally: calculate the percent of secondary structure elements for both proteomes and the SwissProt file (separate script).

=====================================================================
Useful links:
https://entropy.mimuw.edu.pl/grafana/ - for monitoring ENTROPY load
https://entropy-doc.mimuw.edu.pl - the documentation

======================================================================================================
EXTRA:

If some node does not work well (or we do not want to run jobs there for some reason, e.g. not enough RAM), we can easily exclude this node in sbatch or srun. For instance:

#SBATCH --exclude=asusgpu2

OOM error: entropy contains mainly GPUs with 12GB vRAM. ProtBert's vRAM requirement is bound to the sequence length, thus if you run it on big proteins it will crash. In our case, it is safe up to 2.5k aa, so for any sequence that is longer you can artificially cut it

seq = seq[:2500]*

and provide the results only for such N-terminus.
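A minimal sketch of the protbert_runner.py wrapper & parser (the script name matches the SBATCH example below; the '_ss3' output suffix, the 2500 aa truncation and the label mapping are assumptions to adjust to your setup - check the model's config.id2label if the predicted labels do not look like H/E/C):
_____________________________________________________________________
#!/usr/bin/env python3
# protbert_runner.py - secondary-structure wrapper & parser producing the
# pseudo-fasta format shown above. Minimal sketch; assumes a flattened FASTA
# input and that the pipeline returns one dict per residue with an 'entity'
# label in {'H', 'E', 'C'} and a 'score'.
import re
import sys
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          TokenClassificationPipeline)

MAX_LEN = 2500                           # truncate long sequences (see OOM note above)
SS_MAP = {'H': 'H', 'E': 'E', 'C': '-'}  # coil is written as '-' in our format

def read_flat_fasta(path):
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    return list(zip(lines[0::2], lines[1::2]))

def main(in_path):
    device = 0 if torch.cuda.is_available() else -1  # 0 = first GPU, -1 = CPU
    pipeline_bfd = TokenClassificationPipeline(
        model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
        tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3",
                                                skip_special_tokens=True),
        device=device)
    out_path = in_path.replace('.fasta', '_ss3.fasta')  # output name is an assumption
    with open(out_path, 'w') as fout:
        for header, seq in read_flat_fasta(in_path):
            seq = seq[:MAX_LEN]
            # ProtBert expects space-separated residues; rare amino acids become X
            prepared = ' '.join(re.sub(r'[UZOB]', 'X', seq))
            tokens = pipeline_bfd(prepared)
            ss = ''.join(SS_MAP.get(t['entity'], '-') for t in tokens)
            bins = ''.join(str(min(int(t['score'] * 10), 9)) for t in tokens)
            fout.write('\n'.join([header, seq, ss, bins]) + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
_____________________________________________________________________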
* An alternative solution would be to chop a long sequence into parts that fit into vRAM, with some overlap, e.g. 2500aa parts with a 400aa overlap, and merge them:

len(seq)=5512
part1 = 0-2500
part2 = 2300-4800
part3 = 4600-5512

Merging parts: [0-2500, 2300-4800, 4600-5512]
Half of the overlap (200aa) should be enough to preserve the secondary structure.

=====================================================================
EXAMPLE SBATCH SCRIPT (see also the lecture):
_____________________________________________________________________
#!/bin/bash -l
#SBATCH --job-name=ec_0_ss       # Some descriptive name for the job
#SBATCH --qos=32gpu14d           # Queue you belong to
#SBATCH --partition=common       # Most likely you do not need to change it, all students belong to 'common'
#SBATCH --gres=gpu:1             # Only if you need a GPU
#SBATCH --mem=12000              # This is RAM (not vRAM)
#SBATCH --time=0-16:00:10        # 0 days, 16 hours and 10 seconds (needs to be lower than your wall-time)
#SBATCH --output="/home/lukaskoz/logs/8502_8829.out"   # it is useful to capture the stdout
#SBATCH --error="/home/lukaskoz/logs/8502_8829.err"    # it is even more useful to capture the stderr

cd /home/lukaskoz/
source /home/lukaskoz/venv_ProtBert/bin/activate
srun python3 protbert_runner.py UP000000625_83333_flat_0k.fasta
deactivate
_____________________________________________________________________

SRUN EXAMPLES

#1 task, 10 CPUs on sylvester (common node) for one day (CPU only)
srun --partition=common --nodelist sylvester --qos=lukaskoz_common --time=1-0 --ntasks 1 --cpus-per-task 10 --pty $SHELL

#1 task, CPU only, on a common node for five days
srun --partition=common --nodelist asusgpu1 --qos=lukaskoz_common --time 5-0 --pty /bin/bash -l

#CPU-only task on the t5-15 node (topola, ICM) for 10 min
srun -p topola -A g91-1438 --pty --time=0:10:00 --nodelist t5-15 /bin/bash -l

#1 CPU with 160GB RAM for 20h10min (Athena)
srun -p plgrid -A plghpd2024-cpu -n 1 --mem 160GB --time=20:10:00 --pty /bin/bash -l

#CPU-only job on a100 (entropy) with just 15GB RAM, but 180 CPUs, for 3 days
srun --partition=a100 --qos=8gpu14d --mem=15000 --ntasks 1 --cpus-per-task 180 --time 3-0 --pty /bin/bash

#-J avoids 'bash' as the job name in squeue
srun -J ec_0_ss --gres=gpu:1 --partition=common --qos=lukaskoz_common --time 3:00:00 --mem 12000 --pty $SHELL

Then, after allocating the resources, you can test it:

lukaskoz@asusgpu6:~/batch_scripts$ nvidia-smi
Mon Apr 29 18:16:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:88:00.0 Off |                  N/A |
| 27%   27C    P8              10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

#without GPU (only CPU)
srun -J ec_0_ss --partition=common --qos=lukaskoz_common --time 3:00:00 --mem 12000 --pty $SHELL

lukaskoz@asusgpu3:~/batch_scripts$ nvidia-smi
No devices were found

USEFUL COMMANDS:
srun --overlap --pty --jobid=<job_id> $SHELL
_____________________________________________________________________

Other useful settings:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --account=project_465000991   # not used on entropy
#SBATCH -N 4                          # NUMBER OF NODES TO BE USED, rather not on entropy
#SBATCH -J jobname                    # the same as --job-name=

To check the configuration of a running job:
scontrol show jobid -dd <job_id>

=====================================================================
HOMEWORK

Prepare a report file summarizing the results. It should contain:

1) Table for the speed tests
_____________________________________
Method          CPU         GPU
_____________________________________
IUPred          45.1        NA
ProtBert        3887.1      42.2
_____________________________________
*  provided time is for 1k predictions in seconds - (be)"head" the first 1k sequences from SwissProt
** for ProtBert, do not include the time of loading the model into memory

2) Tables for the percentages of disorder/order and of the secondary structure predictions:
________________________________________________________________________________________
Method      | Disorder | Helix   Strand   Coil | Total Time     Total Time
            |          |                       | (dis on CPU)   (ss on GPU)
________________________________________________________________________________________
E. coli          8.8      42.5    15.7    41.8      3.7 min         3.3 min
H. sapiens      21.4      30.1    10.7    59.2     20.3 min        31.6 min
SwissProt       17.2      37.6    15.7    46.7    528.3 min       589.3 min
________________________________________________________________________________________

Note: the timings provided above are just exemplary, you will most likely get different ones (depending on the node used, wrapper details, etc.), but they should be of a similar order (thus if you see that your wrapper is, for instance, 10 times slower, something is wrong).

Additionally, add to the project folder:
a) the sbatch scripts (and srun commands) you used
b) all python scripts (normalizer, splitter, wrappers & parsers, filter)
c) the prediction files (both disorder and secondary structure) for E. coli and H. sapiens (do not include the SwissProt predictions)
d) for SwissProt, provide a file with the top 5000 "super-helical proteins" (length of at least 100 aa) sorted by the score (a minimal sketch of such a filter is given at the end of this document)

All files should be sent by the end of 27.04.2025 via email to lukaskoz@mimuw.edu.pl with the email subject 'ADP25_lab_hpc_hw_Surname_Name' (without Polish letters), with no text in the email body and with an 'ADP25_lab_hpc_hw_Surname_Name.7z'* attachment.

* you may consider using the ultra compression rate (-mx=9) to minimize the size
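For item (d), a minimal sketch of the filter that ranks SwissProt proteins by helix content from the pseudo-fasta secondary-structure predictions. Here the "score" is assumed to be the fraction of 'H' residues and the output file name is arbitrary; the same 4-line-per-record parsing loop can be reused for the scripts that compute the overall disorder and secondary-structure percentages:
_____________________________________________________________________
#!/usr/bin/env python3
# filter_superhelical.py - top 5000 "super-helical" proteins (>=100 aa) from the
# pseudo-fasta secondary-structure predictions. Minimal sketch.
import sys

def read_pseudo_fasta(path):
    # records of 4 lines: header, sequence, ss string, score bins
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    for i in range(0, len(lines), 4):
        yield lines[i], lines[i + 1], lines[i + 2]

def main(in_path, out_path='superhelical_top5000.txt'):
    ranked = []
    for header, seq, ss in read_pseudo_fasta(in_path):
        if len(seq) < 100:
            continue                      # keep only proteins of at least 100 aa
        helix_frac = ss.count('H') / len(ss)
        ranked.append((helix_frac, header))
    ranked.sort(reverse=True)             # highest helix fraction first
    with open(out_path, 'w') as fout:
        for score, header in ranked[:5000]:
            fout.write(f'{score:.3f}\t{header}\n')

if __name__ == '__main__':
    main(*sys.argv[1:])
_____________________________________________________________________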