====================================================================== HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES (PART 3) ======================================================================= TASK 1: "SUPER-HELICAL PROTEINS" Now, having in hand secondary structure predictions let's do something usefull with it. Suppose, we are interested in helical proteins. Our the task is to find the most helical proteins in SwissProt. See also: https://www.cathdb.info/browse/tree Write a script that will: a) filter the proteins that are at least 40 aa long and contain not less than 60% of helix (H) and not more than 20% of strand (E). b) sort those proteins according to our super-helicity score SHS-score: SHS-score = [ ['H' * 2] + ['-' * -1] + ['E' * -2] ] / len(seq) As you can see the SHS-score rewards helical amino acids (H*2) and gives penalty for coil (-1) and strongly disfavours strands ('E'). Additionally, it is normalized by sequence length. Store the scores in the header by extending it e.g. '...PE=1 SV=1|SHS-score=3.21 For instance: python3 filter_ss.py uniprot_sprot_flat_protbert.fasta 5000 100 60 20 2 -1 -2 1 gives uniprot_sprot_flat_protbert_SHS_5000.fasta The script uses positional options: - sys.argv[1] input file - sys.argv[2] number of top, UNIQUE predictions (can be less)* - sys.argv[3] min length - sys.argv[4] min percent of 'H' - sys.argv[5] max percent of 'E' - sys.argv[6] wight for 'H' - sys.argv[7] wight for '-' - sys.argv[8] wight for 'E' - sys.argv[9] remove duplicates [1/0] You can also use argpars or sth similar if you want. *Note that uniprot_sprot is redundant database (thus add a flag to remove duplicates) ===================================================================== TASK 2: "INDUCED FIT DISORDER PROTEINS" a) First, let's compare our results with experimentally validated disorder. DisProt, the database of intrinsically disordered proteins Go to https://disprot.org As we already predicted disorder for human proteins from UniProt we can check how good those predictions are. For your conveninance you can downoload pre-processed data from DisProt from here: https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc3/DisProt_human_fragments_flat.csv The format looks like: acc,fragments P49913,[134-170] P27695,[1-43] P0DMM9,[64-77][216-261] P13569,[654-838] P0DN86,[132-165] P38936,[1-164] Note, that one protein can have more than one disorder fragment (here: P0DMM9) IUPred can predict disorder with 2 options: 'short' and 'long' Predict both versions of disorder with IUPred (you should already have one ready): UP000005640_9606_flat_iupred_short.fasta UP000005640_9606_flat_iupred_long.fasta As in all files we have uniprot uids, let's use them to filter out the preteins that are present both in *.csv and *.fasta files. Calculate MCC* score for our prediction (TP is the real disorder predicted as a disorder). python3 calculate_mcc.py UP000005640_9606_flat_iupred_short.fasta DisProt_human_fragments_flat.csv MCC=0.23 (or sth else) *MCC definition: https://en.wikipedia.org/wiki/Phi_coefficient from sklearn.metrics import matthews_corrcoef https://www.statology.org/matthews-correlation-coefficient-python/ Which version of IUPred (long or short) has better MCC? b) We have also plenty of other disorder predictions. The disordered fragments of proteins can perform many functions. One of it is so called induced fit disorder where upon binding to other protein/molecule the disordered fragment is no longer disordered and it forms standard secondary structure elements. In predictions this can be seen as from one site the fragment is predicted as disordered, but from other site helices (H) and strands (E) are abundant in secondary structure predictions. Now, we want to filter the predictions of disorder that are long enough and have a lot of 'H' and 'E': python3 filter_induced_disorder.py uniprot_sprot_flat_iupred_long.fasta uniprot_sprot_flat_protbert.fasta 5000 40 20 1 gives uniprot_sprot_flat_iupred_long_top5000_40_20_1.fasta The script uses positional options: - sys.argv[1] input file (disorder predictions from iupred) - sys.argv[2] input file (secondary structure predictions from protbert) - sys.argv[3] number of top, UNIQUE predictions (can be less) - sys.argv[4] min length of the disorder fragment (we skip shorter) - sys.argv[5] max percent of '-' in secondary structure prediction in disorder fragment (40% is good starting point) - sys.argv[6] remove duplicates [1/0] Note that above sys.argv[4] and sys.argv[5] does not give you any scoring how good the hit is, thus as previously let's define simple score of disorder fragment: IndDis-score = (2 * 'H') + (2 *'E') - (4 * '-') Now we can use this score for sorting our hits (the bigger the better). In the output file enrich the header with the information about the hits and add lines for disorder and secondary structure (you can skip/ignore the score lines in fasta files). For instance: >header1|disorder[26-75]:H:25,E:16,-:9,InducedDIS-score=46, disorder[110-179]:H:20,E:25,-:15,InducedDIS-score=30 MEPEPEPEQEANKEEEKILSAAVRAAKIERNRQRA... DDDDDDDDDDD--------------DDDDDDDDDD... -EEEE---------------------EEEEEEEE-... ... The hit is described by its localization and the number of H, E, and '-' letters. There can be more than one hit in the sequence (then we take into account the highest for the sorting) H:25,E:16,-:9 states that the disorderedfragment is 50aa long, contains 25 'H', 16 'E' and 9 '-' thus the score is (25*2) + (16*2) - (9*4) = 46 ===================================================================== HOMEWORK Prepare the report file sumarizing the results. It should contain: 1) Table for the speed tests _____________________________________ Method CPU GPU _____________________________________ IUPred 45.1 NA ProtBert 3887.1 42.2 _____________________________________ * provided time for 1k predictions in seconds - (be)"head" first 1k sequences from SwissProt ** ProtBert -do not include time of loading the model into memory 2) Tables for the percentages of disorder/order and secondary sturcture predictions: ________________________________________________________________________________________ Method | Disorder | Helix Strand Coil Total Time Total Time (dis on CPU) (ss on GPU) ________________________________________________________________________________________ E. coli 8.8 42.5 15.7 41.8 3.7 min 3.3 min H. sapiens 21.4 30.1 10.7 59.2 20.3 min 31.6 min SwissProT 17.2 37.6 15.7 46.7 528.3 min 589.3 min ________________________________________________________________________________________ Note: povided above timings are just exemplary, you will most likely get different ones (depending on the node used, wrapper details, etc.), but should be in similar order (thus if you see that you have for instance 10 times slower wrapper, it means that something is wrong). Additionally, add to the project folder: a) the sbatch scripts (and srun commands) you used b) all python scripts (normalizer, splitter, wrappers & parsers, filter, etc.) c) the predictions files (both disorder and secondary structures for E. coli and H. sapiens (do not include SwissProt predictions) d) for SwissProt provide file with top 5000 "super-helical proteins" sorted by the score with length of at least 100 aa e) for Disprot analysis: - raport MCC for long and short versions for human proteins - files for top 5000 "induced disorder proteins" for short and long version All files should be sent until the end of 11.05.2025 (next Sunday) via email to lukaskoz@mimuw.edu.pl with the email subject: 'ADP25_lab_hpc_hw_Surname_Name' without the email text body and with 'ADP25_lab_hpc_hw_Surname_Name.7z'* (without Polish letters) attachment. * you may consider using ultra compression rate (-mx=9) to minimize te size