====================================================================
ISOELECTRIC POINT
====================================================================

General info: https://en.wikipedia.org/wiki/Isoelectric_point

Isoelectric point is the pH at which a molecule carries no net electrical charge or is electrically neutral on average.

TASK 1: Is the protein acidic (pI below 5)?

Our task is to classify proteins/peptides into acidic (pI < 5) and non-acidic (pI >= 5).

PREPARATION OF THE DATASET

To build an ML model, we need a dataset. Let's use the ones used in IPC construction:
http://isoelectric.org/datasets.html
More info: http://dx.doi.org/10.1186/s13062-016-0159-9

There are two main datasets:
- IPC_peptide (16,882 items)
- IPC_protein (2,324 items)

Download:
- http://isoelectric.org/datasets/Gauci_PHENYX_SEQUEST_0.99_duplicates_out.fasta
- http://isoelectric.org/datasets/pip_ch2d19_2_1st_isoform_outliers_3units_cleaned_0.99.fasta

These datasets were preprocessed and cleaned, but we will need to prepare them further.

Since we aim for binary classification, we will label data based on a threshold of pI = 5.0:
- pI < 5  --> acidic (label: 1)
- pI >= 5 --> non-acidic (label: 0)

a) First, analyze the dataset contents:
- Open both files and check their format.
- Create histograms based on average pI and sequence length.
- Calculate IQR (interquartile range) for both.

See example plots:
- IPC_protein: pI and length
- IPC_peptide: pI and length
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_protein_pI_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_protein_len_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_peptide_pI_histogram.png
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/IPC_peptide_len_histogram.png

Note: The datasets differ in size and sequence length distribution.

b) Count how many proteins/peptides have pI < 5.0.
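Steps a) and b) can be sketched with the standard library alone. This is a minimal sketch, not a reference solution. The header layout of the IPC files is not described here, so `parse_pi` is a hypothetical helper that ASSUMES the pI is the first `|`-separated header field that parses as a number; adjust it after inspecting the real files. The three toy records are made up for illustration only.

```python
import statistics
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def parse_pi(header):
    """ASSUMPTION: pI is the first '|'-separated field that parses as a float."""
    for field in header.split("|"):
        try:
            return float(field)
        except ValueError:
            continue
    raise ValueError(f"no numeric pI found in header: {header!r}")


def iqr(values):
    """Interquartile range (Q3 - Q1) from statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def histogram(values, bin_width):
    """Count values per bin; keys are the lower bin edges."""
    return Counter((v // bin_width) * bin_width for v in values)


# Made-up toy input; replace with open("<downloaded file>.fasta").
toy = """>4.2|toy_a
ACDE
>9.8|toy_b
KKRR
KR
>4.9|toy_c
DDEEG
""".splitlines()

records = [(parse_pi(h), s) for h, s in read_fasta(toy)]
pis = [pi for pi, _ in records]
lengths = [len(s) for _, s in records]
n_acidic = sum(pi < 5.0 for pi in pis)

print("IQR of pI:", iqr(pis))
print("IQR of length:", iqr(lengths))
print("pI histogram (0.5 bins):", sorted(histogram(pis, 0.5).items()))
print("items with pI < 5.0:", n_acidic)
```

The bin counts returned by `histogram` can be passed to any plotting library to reproduce plots like the examples linked above.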
Let's create a dataset of acidic proteins and peptides:
- From the protein dataset, take all proteins with pI < 5.0 (e.g. 550 items --> positives, label 1).
- Then take an equal number of the most basic proteins as negatives (label 0): sort the proteins by pI and take the 550 most basic ones.

For the larger peptide dataset:
- Extract all peptides with pI < 5.0 and those with pI > 10.
- Randomly select 550 from each group (550 positives and 550 negatives).

Finally, merge the protein and peptide records into one dataset of 2200 items.

Format:
>UID|peptide/protein|pI|1/0
sequence

Use the MD5 hash of the sequence as UID:

```python
import hashlib

sequence = "YDNSLTVVSNASCTTNCLAPLAK"
res = hashlib.md5(sequence.encode())
MD5_hash_uid = res.hexdigest()
print(MD5_hash_uid)
# Output: ec22b6bb20f548f06b81a1a0760d78d6
```

Save as:
IPC_classification_dataset_100.fasta

Now split the data into training, testing, and validation sets (60/20/20):

First iteration:
- Randomly split the whole dataset. Count how many items in each split come from proteins/peptides and how many are positives/negatives. Because the split is random, the proportions may vary slightly.

Second iteration:
- To balance the source types, split the proteins and the peptides separately (each randomly, 60/20/20), then merge the corresponding splits.

Compare the distributions between the two methods.

Save the splits:
IPC_classification_dataset_60_train.fasta
IPC_classification_dataset_20_test.fasta
IPC_classification_dataset_20_val.fasta

Now, having these files, we can think about some features that can be engineered from the sequences for our task.
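Before turning to features, the record format and the two split strategies described above can be sketched as follows. The function names (`make_record`, `split_60_20_20`, `split_by_source`, `composition`) are hypothetical helpers introduced for illustration, and the toy sequences and pI values are made up; only the header layout and the 60/20/20 proportions come from the task.

```python
import hashlib
import random
from collections import Counter


def make_record(sequence, source, pi):
    """Build a (header, sequence) pair in the >UID|source|pI|label format."""
    uid = hashlib.md5(sequence.encode()).hexdigest()
    label = 1 if pi < 5.0 else 0
    return f">{uid}|{source}|{pi}|{label}", sequence


def split_60_20_20(records, rng):
    """First iteration: shuffle a copy and cut it into train/test/val."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(0.6 * len(shuffled))
    n_test = round(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


def split_by_source(records, rng):
    """Second iteration: split each source separately, then merge part by part."""
    merged = ([], [], [])
    for source in ("protein", "peptide"):
        subset = [r for r in records if r[0].split("|")[1] == source]
        for part, chunk in zip(merged, split_60_20_20(subset, rng)):
            part.extend(chunk)
    return merged


def composition(part):
    """Count (source, label) combinations in one split."""
    return Counter(tuple(h.split("|")[1::2]) for h, _ in part)


# Made-up toy records: 10 "proteins" and 10 "peptides", half acidic each.
rng = random.Random(0)  # fixed seed so the split is reproducible
toy = [make_record("DE" * (i + 2), "protein", 4.0) for i in range(5)]
toy += [make_record("KR" * (i + 2), "protein", 10.5) for i in range(5)]
toy += [make_record("D" * (i + 3), "peptide", 4.5) for i in range(5)]
toy += [make_record("K" * (i + 3), "peptide", 11.0) for i in range(5)]

random_parts = split_60_20_20(toy, rng)
balanced_parts = split_by_source(toy, rng)
train, test, val = balanced_parts

for name, part in zip(("train", "test", "val"), balanced_parts):
    print(name, len(part), dict(composition(part)))
```

Note that `split_by_source` balances only the source type; the positive/negative counts can still drift between splits. The same grouping pattern extends to the label if those should match as well. Either way, the split is fixed before computing any features.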
Some obvious features include:
- the protein/peptide sequence itself
- the number of amino acids of a given type
- overall charge at neutral pH

Additional useful features might be:
- sequence length
- amino acid composition (relative frequency of each residue)
- fraction of acidic residues (e.g., D, E)
- fraction of basic residues (e.g., K, R, H)
- hydrophobicity score (e.g., using Kyte-Doolittle scale)
- aromaticity (presence of F, Y, W)
- molecular weight
- isoelectric point (calculated from sequence)
- aliphatic index
- instability index
- presence of specific motifs (e.g., dipeptides or tripeptides)
- sequence entropy or complexity

Make CSV files for IPC_classification_dataset that include some of those features.

Recommended Features:

Feature Name                 Type     Description
seq_length                   Integer  Total number of amino acids
mw                           Float    Molecular weight (Da)
avg_hydrophobicity           Float    Mean hydrophobicity (e.g., Kyte-Doolittle)
fraction_A, ..., fraction_Y  Float    Fraction of each of the 20 amino acids
num_acidic                   Integer  Count of D and E residues
num_basic                    Integer  Count of K, R, H residues
charge_at_pH7                Float    Approx. net charge at pH 7
aromaticity                  Float    Proportion of F, Y, W residues
aliphatic_index              Float    Relative volume occupied by aliphatic side chains (A, V, I, L)
instability_index            Float    Predicted stability (lower = more stable)
pI                           Float    Calculated isoelectric point
label_acidic                 0/1      Classification label: 1 if pI < 5.0, else 0 (for binary classification task)

Output Format (CSV Example)

uid,seq_length,mw,avg_hydrophobicity,fraction_A,...,charge_at_pH7,pI,label_acidic
ec22b6bb,23,2400.1,0.12,0.13,...,-1.25,4.82,1
abcd1234,95,10840.3,-0.05,0.06,...,+0.75,9.71,0

See also:
https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/features.txt
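A subset of the recommended features can be computed with plain Python, as in the sketch below. The hydrophobicity values are the Kyte-Doolittle scale. The pKa values are an ASSUMPTION (an illustrative textbook-style set, not necessarily the one used by IPC), and both the net charge and the pI estimate depend on that choice. The function names are hypothetical. Molecular weight, aliphatic index and instability index are left out to keep the example short.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# ASSUMPTION: illustrative pKa set; swap in the scale you prefer.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_NTERM, PKA_CTERM = 8.6, 3.6


def net_charge(seq, ph):
    """Approximate net charge at a given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def estimate_pi(seq, tol=1e-4):
    """Bisection on pH in [0, 14]; net charge falls monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def features(seq):
    """Return a dict of sequence-derived features for one record."""
    n = len(seq)
    row = {
        "seq_length": n,
        # Non-standard residues (e.g. X) contribute 0 to the mean here.
        "avg_hydrophobicity": sum(KYTE_DOOLITTLE.get(a, 0.0) for a in seq) / n,
        "num_acidic": sum(seq.count(a) for a in "DE"),
        "num_basic": sum(seq.count(a) for a in "KRH"),
        "aromaticity": sum(seq.count(a) for a in "FYW") / n,
        "charge_at_pH7": net_charge(seq, 7.0),
        "pI": estimate_pi(seq),
    }
    row.update({f"fraction_{a}": seq.count(a) / n for a in AMINO_ACIDS})
    return row


row = features("YDNSLTVVSNASCTTNCLAPLAK")

# Write one CSV row to an in-memory buffer; use open(..., "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

The `uid` and `label_acidic` columns would come from the FASTA header rather than from the sequence. If Biopython is available, its `ProteinAnalysis` class (in `Bio.SeqUtils.ProtParam`) provides `molecular_weight()` and `instability_index()`, which could fill two of the omitted columns.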