Task: You are asked to make phylogenetic analysis of the proteins belonging to hemoglobin family across the vertebrates
Rules: everything need to be done in R (you will be asked to provide R script doing the complete analysis, alongside with the intermediate, output files).
Input: hemoglobin.fasta (>2k protein sequences)
Create three subsets from the file:
Additional constrains:
among sequences in subsets A, B, and C you need to have a core of shared proteins (common in all three subsets): one sequence from Homo sapiens, 4 from other mammals, 3 from birds, 2 from amphibia, 2 from reptile, 1 from fish (thus for the subset C you are missing 7 proteins you pick randomly)
the sequences should be ~140 aa long, thus you may need to remove too short or too long sequences (e.g. calculate some statistics such as the mean, std and remove the outliers).
Cleaning the data: after checking fasta headers you will notice immediately that there is no taxonomy information in it. Therefore, your first non-trivial task is to extract this information from UniProt (and additionally reformat the headers into shorter, but more informative ones, see below).
The sequences in hemoglobin.fasta file look like that:
>P01941.1 RecName: Full=Hemoglobin subunit alpha; AltName: Full=Alpha-globin; AltName: Full=Hemoglobin alpha chain
VLSPGDKSNIKAAWGKIGGQAPQYGAEALERMFLSFPTTKTYFPHFDMSHGSAQIQAHGKKVADALSTAVGHLDDLPTAL
SALSDLHAHKLRVDPANFKLLSHCILVTLACHHPGDFTPEIHASLDKFLANVSTVLTSKYR
>P11755.1 RecName: Full=Hemoglobin subunit alpha-1; AltName: Full=Alpha-1-globin; AltName: Full=Hemoglobin alpha-1 chain
VLSPEDKNNVKAAWSKVGGQAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVGEALTTAVNHMDDLPGAL
STLSDLHAYKLRVDPVNFKLLSHCLLVTLACHNPGEFTPAVHASLDKFLASVSTVLTSKYR
….
The very first identifier come from UniProt database and from UniProt you can extract the information about the taxonomy.
The API for UniProt is described here: https://www.uniprot.org/help/api_retrieve_entries. At this stage you may want to parse XML files or use some R libraries which can retrieve the information about taxonowmy of the sequence given the Uniprot id. Additionally, while parsing the headers you are asked to extract species information. At the end your fasta headers should became:
>name-shortSpeciesName-Class-uniprotId
where the “name” is “Name” tag from UniProt, shortSpeciesName is the first letter of the genus and three letters are taken from the species e.g. Homo sapiens -> Hsap, Rat norveticus -> Rnor
Thus, for above example it would be:
>HBA_TUPGL-Tgli-Mammalia-P01941
VLSPGDKSNIKAAWGKIGGQAPQYGAEALERMFLSFPTTKTYFPHFDMSHGSAQIQAHGKKVADALSTAVGHLDDLPTAL
SALSDLHAHKLRVDPANFKLLSHCILVTLACHHPGDFTPEIHASLDKFLANVSTVLTSKYR
>HBA1_TADBR-Tbra-Mammalia-P11755
VLSPEDKNNVKAAWSKVGGQAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVGEALTTAVNHMDDLPGAL
STLSDLHAYKLRVDPVNFKLLSHCLLVTLACHNPGEFTPAVHASLDKFLASVSTVLTSKYR
….
Now, store those 3 subsets as subsetA.fasta, subsetB.fasta subsetC.fasta
do MSA (test clustal and one more algorithm e.g. MUSCLE, ClustalOmega) make html or pdf with the alignment, store the alignment in some common formats, try to optimize the alignment (tweak some parameters if needed)
make distance matrices (and store them in separate files)
test multiple methods/models (upgma, nj, parsimony, minimum evolution, ml e.g. modelTest function with a AIC, AICc and BIC calculated using phangorn)
General notes:
Store the output trees in newick and nexus formated files. Make plots (store them in in png, jpeg or pdf files).
Do the phylogenetic trees agrees with the class divisions. If not, can you explain why this is not the case for the given sequence(s) - mark the suspicious branches on the trees?
Make pdf which summarizes all important steps of your analysis and try to draw some conclusions (which model/methods worked the best, discuss the results e.g. the conservation of the hemoglobin, whether the results agree with the taxonomy of organims, etc. support your claims with the figures e.g. trees, alignments etc.).