======================================================================

	     HIGH-THROUGHPUT PREDICTION OF PROTEIN FEATURES
                             (PART 3)
                                   
=======================================================================


TASK 1: "SUPER-HELICAL PROTEINS" 
Now, having in hand secondary structure predictions let's do something usefull with it.
Suppose, we are interested in helical proteins. Our the task is to find the most helical
proteins in SwissProt.

See also: https://www.cathdb.info/browse/tree

Write a script that will:
a) filter the proteins that are at least 40 aa long and contain
  not less than 60% of helix (H) and not more than 20% of strand (E).
b) sort those proteins according to our super-helicity score SHS-score:

      SHS-score = [ ['H' * 2] + ['-' * -1] + ['E' * -2] ] / len(seq)

As you can see the SHS-score rewards helical amino acids (H*2) and gives
penalty for coil (-1) and strongly disfavours strands ('E'). Additionally,
it is normalized by sequence length.

Store the scores in the header by extending it e.g. '...PE=1 SV=1|SHS-score=3.21

For instance:
python3 filter_ss.py uniprot_sprot_flat_protbert.fasta 5000 100 60 20 2 -1 -2 1

gives uniprot_sprot_flat_protbert_SHS_5000.fasta

The script uses positional options: 
- sys.argv[1] input file
- sys.argv[2] number of top, UNIQUE predictions (can be less)*
- sys.argv[3] min length
- sys.argv[4] min percent of 'H'
- sys.argv[5] max percent of 'E'
- sys.argv[6] wight for 'H'
- sys.argv[7] wight for '-'
- sys.argv[8] wight for 'E'
- sys.argv[9] remove duplicates [1/0]
You can also use argpars or sth similar if you want.

*Note that uniprot_sprot is redundant database (thus add a flag to remove duplicates)

=====================================================================

TASK 2: "INDUCED FIT DISORDER PROTEINS"

a) First, let's compare our results with experimentally validated disorder.

DisProt, the database of intrinsically disordered proteins
Go to https://disprot.org

As we already predicted disorder for human proteins from UniProt we can check how good
those predictions are.

For your conveninance you can downoload pre-processed data from DisProt from here:
https://www.mimuw.edu.pl/~lukaskoz/teaching/adp/labs/lab_hpc3/DisProt_human_fragments_flat.csv

The format looks like:
acc,fragments
P49913,[134-170]
P27695,[1-43]
P0DMM9,[64-77][216-261]
P13569,[654-838]
P0DN86,[132-165]
P38936,[1-164]
Note, that one protein can have more than one disorder fragment (here: P0DMM9)

IUPred can predict disorder with 2 options: 'short' and 'long'

Predict both versions of disorder with IUPred (you should already have one ready):
UP000005640_9606_flat_iupred_short.fasta
UP000005640_9606_flat_iupred_long.fasta 

As in all files we have uniprot uids, let's use them to filter out the preteins that are 
present both in *.csv and *.fasta files.

Calculate MCC* score for our prediction (TP is the real disorder predicted as a disorder).

python3 calculate_mcc.py UP000005640_9606_flat_iupred_short.fasta DisProt_human_fragments_flat.csv
MCC=0.23 (or sth else)

*MCC definition: https://en.wikipedia.org/wiki/Phi_coefficient
from sklearn.metrics import matthews_corrcoef
https://www.statology.org/matthews-correlation-coefficient-python/

Which version of IUPred (long or short) has better MCC?

b) We have also plenty of other disorder predictions. The disordered fragments of proteins 
can perform many functions. One of it is so called induced fit disorder where 
upon binding to other protein/molecule the disordered fragment is no longer disordered 
and it forms standard secondary structure elements.

In predictions this can be seen as from one site the fragment is predicted as disordered,
but from other site helices (H) and strands (E) are abundant in secondary structure predictions.

Now, we want to filter the predictions of disorder that are long enough and have a lot of 'H' and 'E':

python3 filter_induced_disorder.py  uniprot_sprot_flat_iupred_long.fasta  uniprot_sprot_flat_protbert.fasta 5000 40 20 1

gives uniprot_sprot_flat_iupred_long_top5000_40_20_1.fasta


The script uses positional options: 
- sys.argv[1] input file (disorder predictions from iupred)
- sys.argv[2] input file (secondary structure predictions from protbert)
- sys.argv[3] number of top, UNIQUE predictions (can be less)
- sys.argv[4] min length of the disorder fragment (we skip shorter)
- sys.argv[5] max percent of '-' in secondary structure prediction in disorder fragment 
              (40% is good starting point)
- sys.argv[6] remove duplicates [1/0]

Note that above sys.argv[4] and sys.argv[5] does not give you any scoring how good the hit is, thus
as previously let's define simple score of disorder fragment:

IndDis-score = (2 * 'H') +  (2 *'E') - (4 * '-')

Now we can use this score for sorting our hits (the bigger the better).

In the output file enrich the header with the information about the hits and add lines for disorder and secondary structure (you can skip/ignore the score lines in fasta files). 
For instance:
>header1|disorder[26-75]:H:25,E:16,-:9,InducedDIS-score=46, disorder[110-179]:H:20,E:25,-:15,InducedDIS-score=30
MEPEPEPEQEANKEEEKILSAAVRAAKIERNRQRA...
DDDDDDDDDDD--------------DDDDDDDDDD...
-EEEE---------------------EEEEEEEE-...
...

The hit is described by its localization and the number of H, E, and '-' letters. There can be more than one hit in the sequence (then we take into account the highest for the sorting)

H:25,E:16,-:9 states that the disorderedfragment is 50aa long, contains 25 'H', 16 'E' and 9 '-' thus
the score is (25*2) + (16*2) - (9*4) = 46

=====================================================================

HOMEWORK

Prepare the report file sumarizing the results. It should contain:

1) Table for the speed tests

_____________________________________

Method          CPU           GPU 
_____________________________________

IUPred         45.1            NA

ProtBert     3887.1           42.2
_____________________________________

*  provided time for 1k predictions in seconds - (be)"head" first 1k sequences from SwissProt
** ProtBert -do not include time of loading the model into memory

2) Tables for the percentages of disorder/order and secondary sturcture predictions:

________________________________________________________________________________________

Method     |  Disorder    |    Helix    Strand    Coil       Total Time   Total Time
                                                            (dis on CPU)  (ss on GPU)
________________________________________________________________________________________

E. coli          8.8            42.5     15.7     41.8          3.7 min     3.3 min

H. sapiens      21.4            30.1     10.7     59.2         20.3 min    31.6 min

SwissProT       17.2            37.6     15.7     46.7        528.3 min   589.3 min
________________________________________________________________________________________


Note: povided above timings are just exemplary, you will most likely get different ones
(depending on the node used, wrapper details, etc.), but should be in similar order
(thus if you see that you have for instance 10 times slower wrapper, it means that something
is wrong).

Additionally, add to the project folder:
a) the sbatch scripts (and srun commands) you used
b) all python scripts (normalizer, splitter, wrappers & parsers, filter, etc.)
c) the predictions files (both disorder and secondary structures for E. coli and H. sapiens 
   (do not include SwissProt predictions)
d) for SwissProt provide file with top 5000 "super-helical proteins" sorted by the score with length of at least 100 aa 
e) for Disprot analysis:
- raport MCC for long and short versions for human proteins
- files for top 5000 "induced disorder proteins" for short and long version

All files should be sent until the end of 11.05.2025 (next Sunday) via email to lukaskoz@mimuw.edu.pl 
with the email subject: 'ADP25_lab_hpc_hw_Surname_Name' without the email text body 
and with 'ADP25_lab_hpc_hw_Surname_Name.7z'* (without Polish letters) attachment.

* you may consider using ultra compression rate (-mx=9) to minimize te size