DEADLINE: two weeks from today (27.11.2018)

Task: You are asked to make phylogenetic analysis of the proteins belonging to hemoglobin family across the vertebrates

Rules: everything need to be done in R (you will be asked to provide R script doing the complete analysis, alongside with the intermediate, output files).

Input: hemoglobin.fasta (>2k protein sequences)

Create three subsets from the file:

  1. subset A: 100 randomly selected sequences
  2. subset B: 50 randomly selected sequences from subset A
  3. subset C: 20 randomly selected sequences from subset B

Additional constrains:

Cleaning the data: after checking fasta headers you will notice immediately that there is no taxonomy information in it. Therefore, your first non-trivial task is to extract this information from UniProt (and additionally reformat the headers into shorter, but more informative ones, see below).

The sequences in hemoglobin.fasta file look like that:

>P01941.1 RecName: Full=Hemoglobin subunit alpha; AltName: Full=Alpha-globin; AltName: Full=Hemoglobin alpha chain
VLSPGDKSNIKAAWGKIGGQAPQYGAEALERMFLSFPTTKTYFPHFDMSHGSAQIQAHGKKVADALSTAVGHLDDLPTAL
SALSDLHAHKLRVDPANFKLLSHCILVTLACHHPGDFTPEIHASLDKFLANVSTVLTSKYR
>P11755.1 RecName: Full=Hemoglobin subunit alpha-1; AltName: Full=Alpha-1-globin; AltName: Full=Hemoglobin alpha-1 chain
VLSPEDKNNVKAAWSKVGGQAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVGEALTTAVNHMDDLPGAL
STLSDLHAYKLRVDPVNFKLLSHCLLVTLACHNPGEFTPAVHASLDKFLASVSTVLTSKYR
….

The very first identifier come from UniProt database and from UniProt you can extract the information about the taxonomy.

The API for UniProt is described here: https://www.uniprot.org/help/api_retrieve_entries. At this stage you may want to parse XML files or use some R libraries which can retrieve the information about taxonowmy of the sequence given the Uniprot id. Additionally, while parsing the headers you are asked to extract species information. At the end your fasta headers should became:

>name-shortSpeciesName-Class-uniprotId

where the “name” is “Name” tag from UniProt, shortSpeciesName is the first letter of the genus and three letters are taken from the species e.g. Homo sapiens -> Hsap, Rat norveticus -> Rnor

Thus, for above example it would be:

>HBA_TUPGL-Tgli-Mammalia-P01941
VLSPGDKSNIKAAWGKIGGQAPQYGAEALERMFLSFPTTKTYFPHFDMSHGSAQIQAHGKKVADALSTAVGHLDDLPTAL
SALSDLHAHKLRVDPANFKLLSHCILVTLACHHPGDFTPEIHASLDKFLANVSTVLTSKYR
>HBA1_TADBR-Tbra-Mammalia-P11755
VLSPEDKNNVKAAWSKVGGQAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVGEALTTAVNHMDDLPGAL
STLSDLHAYKLRVDPVNFKLLSHCLLVTLACHNPGEFTPAVHASLDKFLASVSTVLTSKYR
….

Now, store those 3 subsets as subsetA.fasta, subsetB.fasta subsetC.fasta

Make as much complete phylogenetic analysis of the subsets as you can:

  1. do MSA (test clustal and one more algorithm e.g. MUSCLE, ClustalOmega) make html or pdf with the alignment, store the alignment in some common formats, try to optimize the alignment (tweak some parameters if needed)

  2. make distance matrices (and store them in separate files)

  3. test multiple methods/models (upgma, nj, parsimony, minimum evolution, ml e.g. modelTest function with a AIC, AICc and BIC calculated using phangorn)

General notes:

Store the output trees in newick and nexus formated files. Make plots (store them in in png, jpeg or pdf files).

Do the phylogenetic trees agrees with the class divisions. If not, can you explain why this is not the case for the given sequence(s) - mark the suspicious branches on the trees?

Make pdf which summarizes all important steps of your analysis and try to draw some conclusions (which model/methods worked the best, discuss the results e.g. the conservation of the hemoglobin, whether the results agree with the taxonomy of organims, etc. support your claims with the figures e.g. trees, alignments etc.).

Send the results as NameSurname_SAD2_phylogenetics.tar.gz (containing R code and all suporting files) to lukaskoz@mimuw.edu.pl (the subject of the email should be: “SAD2 2018 Phylogenetics project NameSurname”).