====================================================================
ISOELECTRIC POINT PART 2
====================================================================

Assume you have features like the following:

Feature Name                  Type     Description
seq_length                    Integer  Total number of amino acids
mw                            Float    Molecular weight (Da)
avg_hydrophobicity            Float    Mean hydrophobicity (e.g., Kyte-Doolittle)
fraction_A, ..., fraction_Y   Float    Fraction of each of the 20 amino acids
num_acidic                    Integer  Count of D and E residues
num_basic                     Integer  Count of K, R, and H residues
charge_at_pH7                 Float    Approximate net charge at pH 7
aromaticity                   Float    Proportion of F, Y, and W residues
aliphatic_index               Float    Relative volume occupied by aliphatic
                                       side chains (A, V, I, L)
instability_index             Float    Predicted stability (lower = more stable)
pI                            Float    Calculated isoelectric point
label_acidic                  0/1      Classification label: 1 if pI < 5.0,
                                       else 0 (for the binary classification task)

Output format (CSV example):

uid,seq_length,mw,avg_hydrophobicity,fraction_A,...,charge_at_pH7,pI,label_acidic
ec22b6bb,23,2400.1,0.12,0.13,...,-1.25,4.82,1
abcd1234,95,10840.3,-0.05,0.06,...,+0.75,9.71,0

See also: https://www.mimuw.edu.pl/~lukaskoz/teaching/dav/labs/lab11/features.txt (*.py script)

Task 2:

1) Explore the data:
   - make plots for each feature (e.g. scatter plots, histograms, box plots,
     heat maps; at least two plot types per feature)
   - decide which columns should be used for the prediction
   - calculate the correlation of each feature*

2) We will start with decision trees, as the most intuitive model:
   a) train a DecisionTreeClassifier (test the depth parameter)
      - calculate the accuracy for different tree structures
        (e.g. depth, number of features)**
      - visualise the trees (use meaningful labels)
   b) repeat the training with a RandomForestClassifier

3) Build further ML models:
   a) train a Nearest Neighbors classifier
   b) train a Support Vector Machine (RBF kernel, tuned with GridSearchCV)

In all cases, write scripts and summarise the results in tables (e.g. with
scores from the different setups).

*  Before building any ML model, once you have your features you should
   perform a proper feature-selection step to avoid non-informative and/or
   highly correlated features:
   - https://scikit-learn.org/stable/modules/feature_selection.html
   - https://scikit-learn.org/stable/modules/unsupervised_reduction.html

** You can use GridSearchCV here:
   https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
   https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html
   https://scikit-learn.org/stable/auto_examples/model_selection/plot_successive_halving_heatmap.html
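The whole workflow above (dropping highly correlated features, tuning each
model family with GridSearchCV, and summarising scores in one table) can be
sketched as follows. This is a minimal sketch, not the official lab solution:
the real features CSV is not available here, so a small synthetic stand-in
DataFrame with a few of the column names from the table above is generated
instead, and the parameter grids are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Synthetic stand-in for the real feature table (replace with
# pd.read_csv(...) on the generated features file).
X = pd.DataFrame({
    "seq_length": rng.integers(20, 500, 300),
    "mw": rng.uniform(2000, 60000, 300),
    "avg_hydrophobicity": rng.normal(0.0, 0.5, 300),
    "charge_at_pH7": rng.normal(0.0, 2.0, 300),
})
# Stand-in for label_acidic, loosely tied to net charge.
y = (X["charge_at_pH7"] + rng.normal(0.0, 1.0, 300) < 0).astype(int)

# Feature selection step: drop one of each pair of highly correlated columns.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# One GridSearchCV per model family; distance-based models get scaling.
setups = {
    "DecisionTree": (DecisionTreeClassifier(random_state=0),
                     {"max_depth": [2, 3, 5, 10, None]}),
    "RandomForest": (RandomForestClassifier(random_state=0),
                     {"n_estimators": [50, 200], "max_depth": [3, None]}),
    "kNN": (make_pipeline(StandardScaler(), KNeighborsClassifier()),
            {"kneighborsclassifier__n_neighbors": [3, 5, 11]}),
    "SVM-RBF": (make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}),
}
rows = []
for name, (est, grid) in setups.items():
    gs = GridSearchCV(est, grid, cv=5).fit(X_tr, y_tr)
    rows.append({"model": name,
                 "best_params": gs.best_params_,
                 "cv_accuracy": round(gs.best_score_, 3),
                 "test_accuracy": round(gs.score(X_te, y_te), 3)})

# Summary table with scores from the different setups.
print(pd.DataFrame(rows).to_string(index=False))
```

The printed DataFrame is exactly the kind of summary table the task asks
for; with the real data you would also add the plots and tree
visualisations (e.g. sklearn.tree.plot_tree with feature_names set to the
column names) around this skeleton.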