====================================================================
                    ISOELECTRIC POINT
====================================================================

General info: https://en.wikipedia.org/wiki/Isoelectric_point

The isoelectric point (pI) is the pH at which a molecule carries no net electrical charge, i.e. is electrically neutral on average.

TASK 1:
Is the protein acidic (pI below 5)?

Our task is to classify proteins/peptides into acidic (pI < 5) and non-acidic (pI >= 5).

PREPARATION OF THE DATASET
To build an ML model, we need a dataset. Let's use the datasets that were used to build IPC (Isoelectric Point Calculator):
http://isoelectric.org/datasets.html
More info: http://dx.doi.org/10.1186/s13062-016-0159-9

There are two main datasets:
- IPC_peptide (16,882 items)
- IPC_protein (2,324 items)

Download:
- http://isoelectric.org/datasets/Gauci_PHENYX_SEQUEST_0.99_duplicates_out.fasta
- http://isoelectric.org/datasets/pip_ch2d19_2_1st_isoform_outliers_3units_cleaned_0.99.fasta

These datasets were preprocessed and cleaned, but we will need to prepare them further.

Since we aim for binary classification, we will label data based on a threshold of pI = 5.0:
- pI < 5 --> acidic (label: 1)
- pI >= 5 --> non-acidic (label: 0)
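The labeling rule above can be written as a one-line helper. This is only a minimal sketch; the names ACIDIC_THRESHOLD and label_acidic are made up for illustration.

```python
ACIDIC_THRESHOLD = 5.0  # pI cut-off taken from the task definition


def label_acidic(pi: float, threshold: float = ACIDIC_THRESHOLD) -> int:
    """Return 1 for acidic (pI < threshold), otherwise 0."""
    return 1 if pi < threshold else 0


# pI exactly equal to 5.0 counts as non-acidic
print([label_acidic(p) for p in (4.82, 5.0, 9.71)])  # [1, 0, 0]
```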

a) First, analyze the dataset contents:
- Open both files and check their format.
- Create histograms based on average pI and sequence length.
- Calculate IQR (interquartile range) for both.
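A possible starting point for this analysis, using only the standard library. The helpers read_fasta and iqr are hypothetical names. The sketch deliberately does not parse pI from the headers, because the header layout has to be checked in the downloaded files first; it only shows sequence lengths on toy data.

```python
import statistics
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iqr(values) -> float:
    """Interquartile range (Q3 - Q1); needs at least two values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Toy input standing in for an open file handle
demo = [">rec1", "ACDE", "FG", ">rec2", "KR"]
records = list(read_fasta(demo))
lengths = [len(seq) for _, seq in records]
print(lengths)
```

For a real file, pass an open handle instead of the list, e.g. read_fasta(open(path)). The same iqr function can be applied to the pI values once they are extracted from the headers. Note that different quartile conventions give slightly different IQR values.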

See example plots:
- IPC_protein: pI and length
- IPC_peptide: pI and length

https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_protein_pI_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_protein_len_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_peptide_pI_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_peptide_len_histogram.png

Note: The datasets differ in size and sequence length distribution.

b) Count how many proteins/peptides have pI < 5.0.

Let's create a dataset of acidic proteins and peptides:
- From the protein dataset, take all proteins with pI < 5.0 (e.g. 550 items --> positives, label 1).
- Then sort the proteins by pI and take the same number (550) of the most basic ones as negatives (label 0).

For the larger peptide dataset:
- Extract all peptides with pI < 5.0 and those with pI > 10.
- Randomly select 550 from each group (550 positives and 550 negatives).
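The selection steps above could be sketched as follows. It is assumed here that records have already been parsed into (sequence, pI) tuples; the function names and the fixed seed are illustrative choices, not part of the task.

```python
import random


def most_basic(records, k):
    """Return the k records with the highest pI (used for proteins)."""
    return sorted(records, key=lambda r: r[1], reverse=True)[:k]


def sample_extremes(records, k, low=5.0, high=10.0, seed=0):
    """Randomly pick k records with pI < low and k with pI > high (peptides).

    random.sample raises ValueError if a group has fewer than k records.
    """
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    acidic = [r for r in records if r[1] < low]
    basic = [r for r in records if r[1] > high]
    return rng.sample(acidic, k), rng.sample(basic, k)


# Toy records with made-up pI values
toy = [("AAA", 4.1), ("DDD", 3.2), ("KKK", 10.9), ("RRR", 11.8), ("GGG", 6.0)]
print(most_basic(toy, 2))
positives, negatives = sample_extremes(toy, 2)
print(len(positives), len(negatives))
```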

Finally, merge the protein and peptide records into one dataset of 2200 items.

Format:
>UID|peptide/protein|pI|1/0
sequence

Use MD5 hash of the sequence as UID:

python
import hashlib

sequence = "YDNSLTVVSNASCTTNCLAPLAK"
# MD5 works on bytes, so the sequence is encoded first
uid = hashlib.md5(sequence.encode()).hexdigest()
print(uid)
# Output: ec22b6bb20f548f06b81a1a0760d78d6

Save as: IPC_classification_dataset_100.fasta
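One way to produce records in the format above. This is a sketch; format_record and write_fasta are hypothetical helpers, and the pI is written exactly as Python prints the float.

```python
import hashlib


def format_record(sequence, source, pi, label):
    """Build one FASTA record: >UID|peptide/protein|pI|1/0 plus the sequence."""
    uid = hashlib.md5(sequence.encode()).hexdigest()
    return f">{uid}|{source}|{pi}|{label}\n{sequence}\n"


def write_fasta(path, records):
    """Write (sequence, source, pI, label) tuples to a FASTA file."""
    with open(path, "w") as handle:
        for sequence, source, pi, label in records:
            handle.write(format_record(sequence, source, pi, label))


record = format_record("YDNSLTVVSNASCTTNCLAPLAK", "peptide", 4.82, 1)
print(record, end="")
```

Usage: write_fasta("IPC_classification_dataset_100.fasta", records), where records is the merged list of proteins and peptides.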

Now split the data into training, testing, and validation sets (60/20/20):

First iteration:
- Randomly split the data. Count how many items in each split come from proteins/peptides and how many are positives/negatives.

Due to the dataset size, the proportions may vary slightly.

Second iteration:
- To ensure balanced source types, split the proteins and the peptides separately (each randomly into 60/20/20), then merge the corresponding splits.

Compare the distributions between the two methods.
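Both splitting strategies can be compared with a few lines of code. The sketch below works on toy (source, label) pairs instead of real records; random_split and split_by_source are illustrative names, and the validation set simply receives whatever remains after the train and test cuts.

```python
import random
from collections import Counter


def random_split(items, fractions=(0.6, 0.2), seed=0):
    """Shuffle and cut into train/test/val; val gets the remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * fractions[0])
    n_test = round(len(items) * fractions[1])
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])


def split_by_source(items, seed=0):
    """Split each source (item[0]) separately, then merge matching splits."""
    merged = ([], [], [])
    for source in sorted({item[0] for item in items}):
        group = [item for item in items if item[0] == source]
        for target, part in zip(merged, random_split(group, seed=seed)):
            target.extend(part)
    return merged


# Toy stand-in for the 2200 records: 550 per (source, label) combination
toy = [(src, lab) for src in ("protein", "peptide")
       for lab in (0, 1) for _ in range(550)]

for method, parts in (("random", random_split(toy)),
                      ("by source", split_by_source(toy))):
    for name, part in zip(("train", "test", "val"), parts):
        # Counter shows how many items of each (source, label) pair landed here
        print(method, name, len(part), dict(Counter(part)))
```

With the second method the protein/peptide counts per split are fixed by construction, while the positive/negative counts can still fluctuate. The merged lists are grouped by source, so shuffle them again before training if the order matters.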

Save the splits:
    IPC_classification_dataset_60_train.fasta
    IPC_classification_dataset_20_test.fasta
    IPC_classification_dataset_20_val.fasta

Now, having these files, we can think about some features that can be engineered from the sequences for our task. Some obvious ones include:
- the protein/peptide sequence itself
- the number of amino acids of a given type
- overall charge at neutral pH

Additional useful features might be:
- sequence length
- amino acid composition (relative frequency of each residue)
- fraction of acidic residues (e.g., D, E)
- fraction of basic residues (e.g., K, R, H)
- hydrophobicity score (e.g., using Kyte-Doolittle scale)
- aromaticity (presence of F, Y, W)
- molecular weight
- isoelectric point (calculated from sequence)
- aliphatic index
- instability index
- presence of specific motifs (e.g., dipeptides or tripeptides)
- sequence entropy or complexity

Create CSV files for IPC_classification_dataset that include some of these features.

Recommended Features:
Feature Name                 Type        Description
seq_length                   Integer     Total number of amino acids
mw                           Float       Molecular weight (Da)
avg_hydrophobicity           Float       Mean hydrophobicity (e.g., Kyte-Doolittle)
fraction_A, ..., fraction_Y  Float       Fraction of each of the 20 amino acids
num_acidic                   Integer     Count of D and E residues
num_basic                    Integer     Count of K, R, H residues
charge_at_pH7                Float       Approx. net charge at pH 7
aromaticity                  Float       Proportion of F, Y, W residues
aliphatic_index              Float       Relative volume occupied by aliphatic side chains (A, V, I, L)
instability_index            Float       Predicted stability (lower = more stable)
pI                           Float       Calculated isoelectric point
label_acidic                 0/1         Classification label: 1 if pI < 5.0, else 0 (for binary classification task)
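The charge_at_pH7 and pI columns can be approximated with the Henderson-Hasselbalch equation. The sketch below is only one possible approach. The pKa values are illustrative assumptions (published scales differ, and this is not the pKa set used by IPC), and the function names are made up. The pI is found by bisection, which works because the net charge decreases monotonically with pH.

```python
# Illustrative pKa values (an assumption; other published scales differ)
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph=7.0):
    """Approximate net charge of a sequence at the given pH."""
    counts_pos = {"Nterm": 1, **{aa: sequence.count(aa) for aa in "KRH"}}
    counts_neg = {"Cterm": 1, **{aa: sequence.count(aa) for aa in "DECY"}}
    positive = sum(n / (1 + 10 ** (ph - PKA_POSITIVE[g]))
                   for g, n in counts_pos.items())
    negative = sum(n / (1 + 10 ** (PKA_NEGATIVE[g] - ph))
                   for g, n in counts_neg.items())
    return positive - negative


def estimate_pi(sequence, tol=1e-4):
    """Find the pH where the net charge crosses zero (bisection on 0..14)."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, pI is higher
        else:
            high = mid  # zero or negative charge, pI is lower
    return (low + high) / 2


seq = "YDNSLTVVSNASCTTNCLAPLAK"
print(round(net_charge(seq, 7.0), 2), round(estimate_pi(seq), 2))
```

Biopython's ProteinAnalysis class (Bio.SeqUtils.ProtParam) provides ready-made methods such as molecular_weight(), aromaticity(), instability_index(), gravy() and isoelectric_point(), which may be more convenient than hand-written code for the remaining columns.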

Output Format (CSV Example)
uid,seq_length,mw,avg_hydrophobicity,fraction_A,...,charge_at_pH7,pI,label_acidic
ec22b6bb,23,2400.1,0.12,0.13,...,-1.25,4.82,1
abcd1234,95,10840.3,-0.05,0.06,...,+0.75,9.71,0
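Writing such a file is straightforward with the csv module. The sketch below covers only a subset of the columns (uid, seq_length, the 20 fractions, pI, label_acidic) and writes into an in-memory buffer; feature_row and write_csv are hypothetical names, and the pI values in the toy input are made up.

```python
import csv
import hashlib
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def feature_row(sequence, pi):
    """Build one CSV row (as a dict) for a sequence with a known pI."""
    row = {
        "uid": hashlib.md5(sequence.encode()).hexdigest(),
        "seq_length": len(sequence),
    }
    for aa in AMINO_ACIDS:
        row[f"fraction_{aa}"] = round(sequence.count(aa) / len(sequence), 4)
    row["pI"] = pi
    row["label_acidic"] = 1 if pi < 5.0 else 0
    return row


def write_csv(rows, handle):
    """Write the rows with a header line taken from the first row's keys."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = [feature_row("YDNSLTVVSNASCTTNCLAPLAK", 4.82),
        feature_row("KKRRHHGA", 11.5)]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, replace the buffer with open("features.csv", "w", newline="") (the file name is just an example).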

See also:
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/features.txt