=======================================================================
HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES
=======================================================================

0) I/O (tmpfs part)

a) Log in to ENTROPY
b) Allocate 1 CPU and log in to a random node using the srun command
c) Download https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_human_genome/chm13v2.0.fa.tar.gz

- untar it in your home directory (measure the time of the operation):

time tar -zxvf chm13v2.0.fa.tar.gz

- tmpfs: run 'df -h' and look for:

tmpfs            63G   60K   63G   1% /dev/shm

Move the file to /dev/shm:

cd /dev/shm; mv ~/chm13v2.0.fa.tar.gz .

Re-run the command:

time tar -zxvf chm13v2.0.fa.tar.gz

Clean up:

rm /dev/shm/chm13v2.0.fa*
rm ~/chm13v2.0.fa*

Read: https://en.wikipedia.org/wiki/Tmpfs

1) Protein disorder (CPU PART)

Theory: https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins
Cool animations: https://iimcb.genesilico.pl/metadisorder/protein_disorder_intrinsically_unstructured_proteins_gallery_images.html

TASK 1
Predict protein disorder for E. coli, human and SwissProt.

E. coli    https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Bacteria/UP000000625/UP000000625_83333.fasta.gz
H. sapiens https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
SwissProt  https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

TODO:
- download (wget)
- extract (gunzip)
- check how many sequences you have in each file (grep)
- write a script that will normalize the sequence lines (join them into one line), e.g.

>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTN
EMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVI
RSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKD
VLDFIDTFSVRESL

becomes:

>sp|P04982|RBSD_ECOLI D-ribose pyranase OS=Escherichia coli (strain K12) OX=83333 GN=rbsD PE=1 SV=3
MKKGTVLNSDISSVISRLGHTDTLVVCDAGLPIPKSTTRIDMALTQGVPSFMQVLGVVTNEMQVEAAIIAEEIKHHNPQLHETLLTHLEQLQKHQGNTIEIRYTTHEQFKQQTAESQAVIRSGECSPYANIILCAGVTF
>sp|P0DPM9|YKGV_ECOLI Protein YkgV OS=Escherichia coli (strain K12) OX=83333 GN=ykgV PE=1 SV=1
MSAFKLPDTSQSQLISTAELAKIISYKSQTIRKWLCQDKLPEGLPRPKQINGRHYWLRKDVLDFIDTFSVRESL

python3 normalize_fasta.py -i UP000000625_83333.fasta -o UP000000625_83333_flat.fasta

From now on you can work with those files using standard Unix commands like head, tail, wc, etc.

- write a script that will split the FASTA input file into multiple files, each holding an arbitrary fixed number of sequences, e.g. at most 10k sequences.

For instance, UP000000625_83333.fasta contains 4403 sequences and we want to split it into 1k-sequence parts (thus 5 files should be created):

python3 split_fasta.py UP000000625_83333_flat.fasta 1   <-- the last parameter defines the number of sequences in thousands

UP000000625_83333_flat_0k.fasta (sequences 1-1000)
UP000000625_83333_flat_1k.fasta (sequences 1001-2000)
UP000000625_83333_flat_2k.fasta (sequences 2001-3000)
UP000000625_83333_flat_3k.fasta (sequences 3001-4000)
UP000000625_83333_flat_4k.fasta (sequences 4001-4403)

Now you have 5 files that can be run on different CPUs (separate jobs, threads, etc.); minimal sketches of both helper scripts are given below.
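For reference, here is a minimal sketch of normalize_fasta.py. Only the command-line interface (-i/-o) is taken from the example above; the internals are one possible implementation:

===================================================================
# normalize_fasta.py - join multi-line FASTA sequences into single lines
import argparse

def normalize(in_path, out_path):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        seq_parts = []
        for line in fin:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # flush the previous sequence before writing the new header
                if seq_parts:
                    fout.write(''.join(seq_parts) + '\n')
                    seq_parts = []
                fout.write(line + '\n')
            elif line:
                seq_parts.append(line)
        if seq_parts:
            fout.write(''.join(seq_parts) + '\n')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', required=True, help='input FASTA file')
    parser.add_argument('-o', required=True, help='output flattened FASTA file')
    args = parser.parse_args()
    normalize(args.i, args.o)
===================================================================

And a sketch of split_fasta.py, assuming the flattened (one line per sequence) input produced by normalize_fasta.py and the _0k/_1k/... naming shown above:

===================================================================
# split_fasta.py - split a flattened FASTA file into parts of N*1000 sequences
import sys

def split_fasta(path, thousands):
    chunk = thousands * 1000
    base = path.rsplit('.fasta', 1)[0]
    out, n_seq, part = None, 0, 0
    with open(path) as fin:
        for line in fin:
            if line.startswith('>'):
                if n_seq % chunk == 0:
                    # start a new part file, e.g. <base>_0k.fasta, <base>_1k.fasta, ...
                    if out:
                        out.close()
                    out = open(f'{base}_{part}k.fasta', 'w')
                    part += thousands
                n_seq += 1
            out.write(line)
    if out:
        out.close()

if __name__ == '__main__':
    split_fasta(sys.argv[1], int(sys.argv[2]))
===================================================================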
In theory, you could get the same effect with a single Python script that delegates the data splits to separate threads, but in practice you never know on which node, and when, a given part will be calculated on the cluster. Therefore, if your workload can be divided into independent parts, run those parts as separate jobs.

Note that you should benchmark the speed of the method you are running and adjust the size of the parts accordingly. For instance, you will not benefit much from splitting SwissProt into >500 parts of 1k sequences, as you do not have 500 CPUs/GPUs. On the other hand, parts that are too large may make the computation on the cluster impossible (you will hit the walltime limit).

MAIN TASK:
1) Run the predictions on a single CPU
2) Distribute the load onto N CPUs using SLURM on the ENTROPY cluster:
- divide the input file into 4 approximately equal parts,
- write an SBATCH script that will allocate 1 CPU and run IUPred (see below) on a single part file,
- submit SBATCH jobs for the 4 different parts of the FASTA file.

Note the runtime of (1) and (2).

The program we will use is called IUPred (https://iupred2a.elte.hu/)
Paper: https://doi.org/10.1093/nar/gky384

wget https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc1/iupred2a.tar.gz

Specification

Input: FASTA file
>header1
sequence1
>header2
sequence2
...

Output: pseudo-FASTA file with predictions stored as follows (note the extra lines):
>header1
MEPEPEPEQEANKEEEKILSAAVRAKIERNRQRA...
DDDDDDDDDDD-----------------------...
8667556667843222222112333343322221...
>header2
LACRPYPTGEGISTVKAPPKVISGGGFFIEEEEA...
DDDDDDDDDD------------------DDDDD-...
6668757656323233244323211113778953...
...

Note that the provided iupred2a.py script works only for a single sequence, so you need to write a simple wrapper that runs it repeatedly for all sequences from our multi-FASTA file.

The original output of the method is:

# IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding
# Balint Meszaros, Gabor Erdos, Zsuzsanna Dosztanyi
# Nucleic Acids Research 2018;46(W1):W329-W337.
#
# Prediction type: short
# Prediction output
# POS RES IUPRED2
1 M 0.5331
2 M 0.3847
3 K 0.2333
4 R 0.1117
5 K 0.0455
6 R 0.0167
...

In the last line of our format we store the prediction score rounded to 10 bins (one digit, 0-9). In the line above it, residues with scores <0.5 are marked '-' and the rest are 'D'. Use the 'short' prediction type.

As the format is also different, we need to write a parser. Note that the wrapper and the parser can be one script.

python3 iupred_runner.py UP000000625_83333_flat_0k.fasta

will produce

UP000000625_83333_flat_0k_iupred_short.fasta

Notice that merging the parts can be done simply with the 'cat' command.

Finally: calculate the percent of disorder for both proteomes (separate script; minimal sketches of the wrapper and the percentage script follow below).
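A minimal sketch of iupred_runner.py (wrapper + parser in one script). The exact iupred2a.py invocation used below ('python3 iupred2a.py <single-sequence fasta> short') is an assumption - check the usage message of your downloaded copy, and run the script from the directory that contains iupred2a.py and its data files:

===================================================================
# iupred_runner.py - run IUPred2A on every sequence of a flattened multi-FASTA
# file and write the pseudo-FASTA format described above.
import os
import subprocess
import sys
import tempfile

def run_iupred(header, sequence):
    # iupred2a.py handles one sequence at a time, so write a temporary FASTA file.
    with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
        tmp.write(f'{header}\n{sequence}\n')
        tmp_path = tmp.name
    try:
        # assumed invocation - adjust to the actual iupred2a.py usage
        out = subprocess.run(['python3', 'iupred2a.py', tmp_path, 'short'],
                             capture_output=True, text=True, check=True).stdout
    finally:
        os.unlink(tmp_path)
    disorder, bins = [], []
    for line in out.splitlines():
        if not line or line.startswith('#'):
            continue                                  # skip the comment header
        _, _, score = line.split()[:3]                # POS RES IUPRED2
        score = float(score)
        disorder.append('D' if score >= 0.5 else '-')
        bins.append(str(min(int(score * 10), 9)))     # score 0.0-1.0 -> digit 0-9
    return ''.join(disorder), ''.join(bins)

def main(fasta_path):
    out_path = fasta_path.replace('.fasta', '_iupred_short.fasta')
    with open(fasta_path) as fin, open(out_path, 'w') as fout:
        lines = [l.rstrip('\n') for l in fin if l.strip()]
        # flattened FASTA: alternating header / sequence lines
        for header, seq in zip(lines[0::2], lines[1::2]):
            disorder, bins = run_iupred(header, seq)
            fout.write(f'{header}\n{seq}\n{disorder}\n{bins}\n')

if __name__ == '__main__':
    main(sys.argv[1])
===================================================================

And a sketch of the disorder-percentage script, assuming the merged pseudo-FASTA produced above (4 lines per record):

===================================================================
# disorder_percent.py - percent of residues predicted as disordered ('D')
import sys

disordered = total = 0
with open(sys.argv[1]) as fin:
    lines = [l.rstrip('\n') for l in fin if l.strip()]
for disorder_line in lines[2::4]:      # header, sequence, D/- line, bins
    disordered += disorder_line.count('D')
    total += len(disorder_line)
print(f'disorder: {100.0 * disordered / total:.2f}%')
===================================================================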
======================================================================================================

2) Secondary structure (GPU PART)

Theory: https://en.wikipedia.org/wiki/Protein_secondary_structure

TASK 2:
Predict protein secondary structure for two proteomes (use the same ones as in TASK 1).

1) Run the predictions on CPU
2) Run the predictions on GPU

To switch between CPU and GPU, change the 'device' argument of the pipeline:

tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=0)   # GPU
tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True), device=-1)  # CPU

Use srun (interactive mode) - just reserve some time.

Note the runtime of (1) and (2).

For SwissProt use only the GPU.

The program we will use is called ProtTrans (https://github.com/agemagician/ProtTrans/tree/master).
To be exact, we will use the Rostlab/prot_bert_bfd_ss3 model.

Read: https://github.com/agemagician/ProtTrans/blob/master/Prediction/ProtBert-BFD-Predict-SS3.ipynb
Paper: https://doi.org/10.1109/TPAMI.2021.3095381

===================================================================
INSTALLATION:

python3 -m venv venv_ProtBert
source /home/lukaskoz/venv_ProtBert/bin/activate
pip install torch
pip install transformers
pip install sentencepiece
===================================================================
python3
===================================================================
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
import torch

# on the master node we do not use GPUs, thus we switch to CPU
device_type = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# JUST ONCE, TO DOWNLOAD THE MODEL FILES
pipeline_bfd = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=device_type)

exit()
===================================================================

TODO: write the wrapper & parser that will store the predictions in a file in the pseudo-FASTA format:

>header1
MVIHFSNKPAKYTPNTTVAFLALVD...
-EEEE-----------EEEEEEEE-...
8648775778844488767777776...
>header2
QGVDLVAAFEAHRTQIEAVARVKLP...
----HHHHHHHHHHHHHHHHEEE--...
3787887767787677533344454...
...

Finally: calculate the percent of secondary structure elements for both proteomes and the SwissProt file (separate script; a minimal wrapper sketch is given at the end of this document).

=====================================================================

Useful links:
https://entropy.mimuw.edu.pl/grafana/ - for monitoring ENTROPY load
https://entropy-doc.mimuw.edu.pl - the documentation
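Below is a minimal sketch of the TASK 2 wrapper & parser referenced above. The pipeline construction is copied from the installation section; the script name, the output suffix and the label handling are assumptions - in particular, check the actual label names against pipeline_bfd.model.config.id2label (here everything that is not 'H' or 'E' is mapped to '-'):

===================================================================
# ss3_runner.py - predict 3-state secondary structure for a flattened multi-FASTA
# file with the Rostlab/prot_bert_bfd_ss3 model and write the pseudo-FASTA format.
import sys
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

# device=0 -> first GPU, device=-1 -> CPU
device = 0 if torch.cuda.is_available() else -1
pipeline_bfd = TokenClassificationPipeline(
    model=AutoModelForTokenClassification.from_pretrained("Rostlab/prot_bert_bfd_ss3"),
    tokenizer=AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd_ss3", skip_special_tokens=True),
    device=device)

def predict_ss3(sequence):
    # ProtBert expects space-separated residues, e.g. 'M K K G T ...'
    predictions = pipeline_bfd(' '.join(sequence))
    ss_line, bin_line = [], []
    for pred in predictions:
        label = pred['entity']
        ss_line.append(label if label in ('H', 'E') else '-')   # coil / anything else -> '-'
        bin_line.append(str(min(int(pred['score'] * 10), 9)))   # confidence 0.0-1.0 -> digit 0-9
    return ''.join(ss_line), ''.join(bin_line)

def main(fasta_path):
    out_path = fasta_path.replace('.fasta', '_ss3.fasta')       # assumed output naming
    with open(fasta_path) as fin, open(out_path, 'w') as fout:
        lines = [l.rstrip('\n') for l in fin if l.strip()]
        # flattened FASTA: alternating header / sequence lines
        for header, seq in zip(lines[0::2], lines[1::2]):
            ss_line, bin_line = predict_ss3(seq)
            fout.write(f'{header}\n{seq}\n{ss_line}\n{bin_line}\n')

if __name__ == '__main__':
    main(sys.argv[1])
===================================================================

The script that calculates the percentages of H/E/- can follow the same counting pattern as the disorder-percentage sketch in TASK 1, counting symbols in the third line of each 4-line record.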