=======================================================================
HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES
=======================================================================

0) I/O (tmpfs part)

a) Log in to ENTROPY
b) Allocate 1 CPU and log in to a random node using the srun command
c) Download https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/chm13v2.0.fa.tar.gz

- untar it in your home directory (measure the time of the operation)

time tar -zxvf chm13v2.0.fa.tar.gz

- tmpfs

run 'df -h' and look for:
tmpfs            63G   60K   63G   1% /dev/shm

move the file to /dev/shm

cd /dev/shm; mv ~/chm13v2.0.fa.tar.gz .

Re-run the command:

time tar -zxvf chm13v2.0.fa.tar.gz

Clean:
rm /dev/shm/chm13v2.0.fa*
rm ~/chm13v2.0.fa*

Read: https://en.wikipedia.org/wiki/Tmpfs

1) Protein disorder (CPU PART)

Theory: https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins
Cool animations: https://iimcb.genesilico.pl/metadisorder/protein_disorder_intrinsically_unstructured_proteins_gallery_images.html

TASK 1
Predict protein disorder for E. coli, human and SwissProt.

E. coli
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/UP000000625/UP000000625_83333.fasta.gz
H. sapiens
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
SwissProt
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

TODO:
- download (wget)
- extract (gunzip)
- check how many sequences you have in each file (grep)
- write a script that will normalize the sequence lines (join them into one line):

e.g.
>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTN
EMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVI
RSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKD
VLDFIDTFSVRESL

becomes:

>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTNEMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVIRSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKDVLDFIDTFSVRESL

python3 normalize_fasta.py -i UP000000625_83333.fasta -o UP000000625_83333_flat.fasta

From now on you can work with those files using standard unix commands like head, tail, wc, etc.

- write a script that will split the fasta input file into multiple files, each holding a fixed, arbitrary number of sequences, e.g. at most 10k sequences.

For instance, UP000000625_83333.fasta contains 4403 sequences and we want to split it into 1k-sequence parts (thus 5 files should be created):

python3 split_fasta.py UP000000625_83333_flat.fasta 1   <-- this last parameter defines the number of sequences in thousands

UP000000625_83333_flat_0k.fasta (sequences 1-1000)
UP000000625_83333_flat_1k.fasta (sequences 1001-2000)
UP000000625_83333_flat_2k.fasta (sequences 2001-3000)
UP000000625_83333_flat_3k.fasta (sequences 3001-4000)
UP000000625_83333_flat_4k.fasta (sequences 4001-4403)

Now, you have 5 files that can be run on different CPUs (separate jobs, threads, etc.).
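A minimal sketch of normalize_fasta.py is shown below (one possible implementation; only the -i/-o flags from the example call above are taken from the task description):
_____________________________________________________________________
#!/usr/bin/env python3
# normalize_fasta.py - join multi-line FASTA sequences into single lines.
import argparse

def normalize(in_path, out_path):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        seq_parts = []
        for line in fin:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # flush the previous sequence before writing the new header
                if seq_parts:
                    fout.write(''.join(seq_parts) + '\n')
                    seq_parts = []
                fout.write(line + '\n')
            elif line:
                seq_parts.append(line)
        if seq_parts:
            fout.write(''.join(seq_parts) + '\n')

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Flatten FASTA sequences to one line each')
    parser.add_argument('-i', required=True, help='input FASTA file')
    parser.add_argument('-o', required=True, help='output (flattened) FASTA file')
    args = parser.parse_args()
    normalize(args.i, args.o)
_____________________________________________________________________

And a sketch of split_fasta.py following the same naming scheme (_0k, _1k, ...); it assumes the input was already flattened, i.e. one header line followed by exactly one sequence line:
_____________________________________________________________________
#!/usr/bin/env python3
# split_fasta.py - split a flattened FASTA file into parts of <k>*1000 sequences each.
import os
import sys

def split_fasta(path, k):
    per_part = k * 1000
    base, ext = os.path.splitext(path)
    out, n_seq, part = None, 0, 0
    with open(path) as fin:
        for line in fin:
            if line.startswith('>'):
                if n_seq % per_part == 0:
                    # start a new part file every <per_part> sequences
                    if out:
                        out.close()
                    out = open(f'{base}_{part * k}k{ext}', 'w')
                    part += 1
                n_seq += 1
            out.write(line)
    if out:
        out.close()

if __name__ == '__main__':
    split_fasta(sys.argv[1], int(sys.argv[2]))
_____________________________________________________________________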
In theory, you could get the same effect with a single python script that delegates the data splits to separate threads, but in practice you never know on which node and when a given part will be calculated on the cluster; thus, if your workload can be divided into independent parts, those parts should be run as separate jobs. Note that you should benchmark the speed of the method you are running and adjust the size of the parts accordingly. For instance, you will not benefit much from splitting SwissProt into >500 parts of 1k sequences, as you do not have 500 CPUs/GPUs. On the other hand, if a file is too big, calculating it on the cluster may become impossible (you will hit the walltime limit).

MAIN TASK:
1) Run the predictions on a single CPU
2) Distribute the load onto N CPUs using SLURM on the entropy cluster
   - divide the input file into 4 approx. equal parts,
   - write an SBATCH script that will allocate 1 CPU and run IUPred (see below) on a single part file,
   - submit SBATCH jobs for the 4 different parts of the fasta file.
Note the runtime of (1) and (2).

The program we will use is called IUPred (https://iupred2a.elte.hu/)
Paper: https://doi.org/10.1093/nar/gky384

wget https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc1/iupred2a.tar.gz

Specification

Input: fasta file
>header1
sequence1
>header2
sequence2
...

Output: pseudo-fasta file with predictions stored as follows (note the extra lines):
>header1
MEPEPEPEQEANKEEEKILSAAVRAKIERNRQRA...
DDDDDDDDDDD-----------------------...
8667556667843222222112333343322221...
>header2
LACRPYPTGEGISTVKAPPKVISGGGFFIEEEEA...
DDDDDDDDDD------------------DDDDD-...
6668757656323233244323211113778953...
...

Note that the provided iupred2a.py script works only for a single sequence, thus you need to write a simple wrapper to run it multiple times for all sequences from our multi-fasta file.

The original output of the method is:

# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS   RES     IUPRED2
1       M       0.5331
2       M       0.3847
3       K       0.2333
4       R       0.1117
5       K       0.0455
6       R       0.0167
...

In the last line of our format we store the prediction score (rounded to 10 bins). Scores <0.5 are '-' and the rest is 'D'. Use the 'short' version. As the format is also different, we need to write a parser. Note that the wrapper and parser can be one script (a minimal sketch is given below, after the TASK 2 introduction).

python3 iupred_runner.py UP000000625_83333_flat_0k.fasta
will produce
UP000000625_83333_flat_0k_iupred_short.fasta

Notice that merging the parts can be done simply with the 'cat' command.

Finally: calculate the percent of disorder for both proteomes (separate script).

======================================================================================================

2) Secondary structure (GPU PART)

Theory: https://en.wikipedia.org/wiki/Protein_secondary_structure

TASK 2:
Predict protein secondary structure for two proteomes (use the same as in TASK 1).

1) Run the predictions on CPU
2) Run the predictions on GPU

To switch between CPU and GPU change:
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=0)    #GPU
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=-1)   #CPU

Use srun (interactive mode) - just reserve some time

Note the runtime of (1) and (2). For SwissProt use only GPU.
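Coming back to TASK 1: a minimal sketch of the iupred_runner.py wrapper & parser described above. It assumes that iupred2a.py is invoked as 'python3 iupred2a.py <single-sequence fasta> short' and prints the per-residue table shown above (run it from the directory containing iupred2a.py and its data files, or adjust IUPRED_CMD and paths to your copy):
_____________________________________________________________________
#!/usr/bin/env python3
# iupred_runner.py - wrapper & parser around iupred2a.py for a multi-FASTA file.
# Expects the flattened FASTA produced by normalize_fasta.py (header + one sequence line).
import os
import subprocess
import sys
import tempfile

IUPRED_CMD = ['python3', 'iupred2a.py']  # path to the IUPred2A script (assumption)

def read_flat_fasta(path):
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    return list(zip(lines[0::2], lines[1::2]))  # (header, sequence) pairs

def run_iupred(seq):
    # iupred2a.py works on a single sequence, so write it to a temporary FASTA file
    with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
        tmp.write('>tmp\n' + seq + '\n')
        tmp_path = tmp.name
    try:
        out = subprocess.run(IUPRED_CMD + [tmp_path, 'short'],
                             capture_output=True, text=True, check=True).stdout
    finally:
        os.remove(tmp_path)
    # keep only the data lines: POS RES IUPRED2
    return [float(line.split()[2]) for line in out.splitlines()
            if line and not line.startswith('#')]

def main(in_path):
    out_path = in_path.replace('.fasta', '_iupred_short.fasta')
    with open(out_path, 'w') as fout:
        for header, seq in read_flat_fasta(in_path):
            scores = run_iupred(seq)
            disorder = ''.join('D' if s >= 0.5 else '-' for s in scores)
            bins = ''.join(str(min(int(s * 10), 9)) for s in scores)  # 10 bins: 0-9
            fout.write('\n'.join([header, seq, disorder, bins]) + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
_____________________________________________________________________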
The program we will use is ProtTrans (https://github.com/agemagician/ProtTrans/tree/master). To be exact, we will use the Rostlab/prot_bert_bfd_ss3 model.

Read: https://github.com/agemagician/ProtTrans/blob/master/Prediction/ProtBert-BFD-Predict-SS3.ipynb
Paper: https://doi.org/10.1109/TPAMI.2021.3095381

===================================================================
INSTALLATION:

python3 -m venv venv_ProtBert
source /home/lukaskoz/venv_ProtBert/bin/activate
pip install torch
pip install transformers
pip install sentencepiece
exit()
===================================================================
python3
===================================================================
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import torch

# on the master node we do not use GPUs, thus we switch to CPU
device_type = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# JUST ONCE, TO DOWNLOAD THE MODEL FILE
pipeline_bfd = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=device_type)
===================================================================

TODO: write the wrapper & parser that will store the predictions in a pseudo-fasta file like this (a minimal sketch is given below, after the OOM note in the EXTRA section):

>header1
MVIHFSNKPAKYTPNTTVAFLALVD...
-EEEE-----------EEEEEEEE-...
8648775778844488767777776...
>header2
QGVDLVAAFEAHRTQIEAVARVKLP...
----HHHHHHHHHHHHHHHHEEE--...
3787887767787677533344454...
...

Finally: calculate the percent of secondary structure elements for both proteomes and the SwissProt file (separate script).

=====================================================================
Useful links:
https://entropy.mimuw.edu.pl/grafana/ - for monitoring ENTROPY load
https://entropy-doc.mimuw.edu.pl - the documentation

======================================================================================================
EXTRA:

If some node does not work well (or we do not want to run jobs there for some reason, e.g. not enough RAM), we can easily exclude this node in sbatch or srun. For instance:

#SBATCH --exclude=asusgpu2

OOM error: entropy contains mainly GPUs with 12GB vRAM. ProtBert's vRAM requirement is bound to the sequence length, thus if you run it on big proteins it will crash. In our case, it is safe up to 2.5k aa, so for any sequence that is longer you can artificially cut it

seq = seq[:2500]*

and provide the results only for such N-terminus.
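A minimal sketch of the protbert_runner.py wrapper & parser (the script name matches the SBATCH example below; the '_ss3' output suffix, the 2500 aa truncation and the label mapping are assumptions to adjust to your setup - check the model's config.id2label if the predicted labels do not look like H/E/C):
_____________________________________________________________________
#!/usr/bin/env python3
# protbert_runner.py - secondary-structure wrapper & parser producing the
# pseudo-fasta format shown above. Minimal sketch; assumes a flattened FASTA
# input and that the pipeline returns one dict per residue with an 'entity'
# label in {'H', 'E', 'C'} and a 'score'.
import re
import sys
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          TokenClassificationPipeline)

MAX_LEN = 2500                           # truncate long sequences (see OOM note above)
SS_MAP = {'H': 'H', 'E': 'E', 'C': '-'}  # coil is written as '-' in our format

def read_flat_fasta(path):
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    return list(zip(lines[0::2], lines[1::2]))

def main(in_path):
    device = 0 if torch.cuda.is_available() else -1  # 0 = first GPU, -1 = CPU
    pipeline_bfd = TokenClassificationPipeline(
        model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
        tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3",
                                                skip_special_tokens=True),
        device=device)
    out_path = in_path.replace('.fasta', '_ss3.fasta')  # output name is an assumption
    with open(out_path, 'w') as fout:
        for header, seq in read_flat_fasta(in_path):
            seq = seq[:MAX_LEN]
            # ProtBert expects space-separated residues; rare amino acids become X
            prepared = ' '.join(re.sub(r'[UZOB]', 'X', seq))
            tokens = pipeline_bfd(prepared)
            ss = ''.join(SS_MAP.get(t['entity'], '-') for t in tokens)
            bins = ''.join(str(min(int(t['score'] * 10), 9)) for t in tokens)
            fout.write('\n'.join([header, seq, ss, bins]) + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
_____________________________________________________________________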
* An alternative solution would be to chop a long sequence into parts that fit into vRAM, with some overlap, e.g. 2500aa parts with a 400aa overlap, and merge them:

len(seq)=5512
part1 = 0-2500
part2 = 2300-4800
part3 = 4600-5512

Merging parts: [0-2500, 2300-4800, 4600-5512]
Half of the overlap (200aa) should be enough to preserve the secondary structure.

=====================================================================
EXAMPLE SBATCH SCRIPT (see also the lecture):
_____________________________________________________________________
#!/bin/bash -l
#SBATCH --job-name=ec_0_ss       # Some descriptive name for the job
#SBATCH --qos=32gpu14d           # Queue you belong to
#SBATCH --partition=common       # Most likely you do not need to change it, all students belong to 'common'
#SBATCH --gres=gpu:1             # Only if you need a GPU
#SBATCH --mem=12000              # This is RAM (not vRAM)
#SBATCH --time=0-16:00:10        # 0 days, 16 hours and 10 seconds (needs to be lower than your wall-time)
#SBATCH --output="/home/lukaskoz/logs/8502_8829.out"   # it is useful to capture the stdout
#SBATCH --error="/home/lukaskoz/logs/8502_8829.err"    # it is even more useful to capture the stderr

cd /home/lukaskoz/
source /home/lukaskoz/venv_ProtBert/bin/activate
srun python3 protbert_runner.py UP000000625_83333_flat_0k.fasta
deactivate
_____________________________________________________________________

SRUN EXAMPLES

#1 task, 10 CPUs on sylvester (common node) for one day (CPU only)
srun --partition=common --nodelist sylvester --qos=lukaskoz_common --time=1-0 --ntasks 1 --cpus-per-task 10 --pty $SHELL

#1 task, CPU only, on a common node for five days
srun --partition=common --nodelist asusgpu1 --qos=lukaskoz_common --time 5-0 --pty /bin/bash -l

#CPU-only task on the t5-15 node (topola, ICM) for 10 min
srun -p topola -A g91-1438 --pty --time=0:10:00 --nodelist t5-15 /bin/bash -l

#1 CPU with 160GB RAM for 20h10min (Athena)
srun -p plgrid -A plghpd2024-cpu -n 1 --mem 160GB --time=20:10:00 --pty /bin/bash -l

#CPU-only job on a100 (entropy) with just 15GB RAM, but 180 CPUs, for 3 days
srun --partition=a100 --qos=8gpu14d --mem=15000 --ntasks 1 --cpus-per-task 180 --time 3-0 --pty /bin/bash

#-J avoids 'bash' as the job name in squeue
srun -J ec_0_ss --gres=gpu:1 --partition=common --qos=lukaskoz_common --time 3:00:00 --mem 12000 --pty $SHELL

Then, after allocating the resources, you can test it:

lukaskoz@asusgpu6:~/batch_scripts$ nvidia-smi
Mon Apr 29 18:16:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:88:00.0 Off |                  N/A |
| 27%   27C    P8              10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

#without GPU (only CPU)
srun -J ec_0_ss --partition=common --qos=lukaskoz_common --time 3:00:00 --mem 12000 --pty $SHELL

lukaskoz@asusgpu3:~/batch_scripts$ nvidia-smi
No devices were found

USEFUL COMMANDS:
srun --overlap --pty --jobid=<job_id> $SHELL
_____________________________________________________________________

Other useful settings:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --account=project_465000991   # not used on entropy
#SBATCH -N 4                          # NUMBER OF NODES TO BE USED, rather not on entropy
#SBATCH -J jobname                    # the same as --job-name=

To check the configuration of a running job:
scontrol show jobid -dd <job_id>

=====================================================================
HOMEWORK

Prepare a report file summarizing the results. It should contain:

1) Table for the speed tests
_____________________________________
Method          CPU         GPU
_____________________________________
IUPred          45.1        NA
ProtBert        3887.1      42.2
_____________________________________
*  provided time is for 1k predictions in seconds - (be)"head" the first 1k sequences from SwissProt
** for ProtBert, do not include the time of loading the model into memory

2) Tables for the percentages of disorder/order and of the secondary structure predictions:
________________________________________________________________________________________
Method      | Disorder | Helix   Strand   Coil | Total Time     Total Time
            |          |                       | (dis on CPU)   (ss on GPU)
________________________________________________________________________________________
E. coli          8.8      42.5    15.7    41.8      3.7 min         3.3 min
H. sapiens      21.4      30.1    10.7    59.2     20.3 min        31.6 min
SwissProt       17.2      37.6    15.7    46.7    528.3 min       589.3 min
________________________________________________________________________________________

Note: the timings provided above are just exemplary, you will most likely get different ones (depending on the node used, wrapper details, etc.), but they should be of a similar order (thus if you see that your wrapper is, for instance, 10 times slower, something is wrong).

Additionally, add to the project folder:
a) the sbatch scripts (and srun commands) you used
b) all python scripts (normalizer, splitter, wrappers & parsers, filter)
c) the prediction files (both disorder and secondary structure) for E. coli and H. sapiens (do not include the SwissProt predictions)
d) for SwissProt, provide a file with the top 5000 "super-helical proteins" (length of at least 100 aa) sorted by the score (a minimal sketch of such a filter is given at the end of this document)

All files should be sent by the end of 27.04.2025 via email to lukaskoz@mimuw.edu.pl with the email subject 'ADP25_lab_hpc_hw_Surname_Name' (without Polish letters), with no text in the email body and with an 'ADP25_lab_hpc_hw_Surname_Name.7z'* attachment.

* you may consider using the ultra compression rate (-mx=9) to minimize the size
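For item (d), a minimal sketch of the filter that ranks SwissProt proteins by helix content from the pseudo-fasta secondary-structure predictions. Here the "score" is assumed to be the fraction of 'H' residues and the output file name is arbitrary; the same 4-line-per-record parsing loop can be reused for the scripts that compute the overall disorder and secondary-structure percentages:
_____________________________________________________________________
#!/usr/bin/env python3
# filter_superhelical.py - top 5000 "super-helical" proteins (>=100 aa) from the
# pseudo-fasta secondary-structure predictions. Minimal sketch.
import sys

def read_pseudo_fasta(path):
    # records of 4 lines: header, sequence, ss string, score bins
    with open(path) as fh:
        lines = [l.rstrip('\n') for l in fh if l.strip()]
    for i in range(0, len(lines), 4):
        yield lines[i], lines[i + 1], lines[i + 2]

def main(in_path, out_path='superhelical_top5000.txt'):
    ranked = []
    for header, seq, ss in read_pseudo_fasta(in_path):
        if len(seq) < 100:
            continue                      # keep only proteins of at least 100 aa
        helix_frac = ss.count('H') / len(ss)
        ranked.append((helix_frac, header))
    ranked.sort(reverse=True)             # highest helix fraction first
    with open(out_path, 'w') as fout:
        for score, header in ranked[:5000]:
            fout.write(f'{score:.3f}\t{header}\n')

if __name__ == '__main__':
    main(*sys.argv[1:])
_____________________________________________________________________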