Subject: Phylogenetics

1) Reading:
a) Wikipedia
- https://en.wikipedia.org/wiki/Computational_phylogenetics
- https://en.wikipedia.org/wiki/Phylogenetic_tree
- https://en.wikipedia.org/wiki/Tree_of_life_(biology)
- https://en.wikipedia.org/wiki/MEGA,_Molecular_Evolutionary_Genetics_Analysis 
(HW: install MEGA and try it on some data; keep in mind that it is something 
of a GUI black box, and many people do not consider it a valid way of doing 
phylogenetics, but ...)

b) Chapters 6-9 (pp. 87-178) from 
https://www.mimuw.edu.pl/~lukaskoz/teaching/sad2/books/Molecular-Evolution-and-Phylogenetics.pdf
c) Chapter 4 (pp. 83-145) from 
https://www.mimuw.edu.pl/~lukaskoz/teaching/sad2/books/Dirk_Husmeier.pdf
d) Chapter 7 (pp. 160-190) from 
https://www.mimuw.edu.pl/~lukaskoz/teaching/sad2/books/Durbin.pdf

e) Finally, the lecture: 05_Statistical_phylogenetics.pdf


==========================================================================

2) Exercise:
During today's session we will cover methods for tree building.

a) First we need to obtain some data:
Go to NCBI (https://www.ncbi.nlm.nih.gov) and, from the "Nucleotide" database, 
manually extract the complete mitochondrial DNA of some primates (Homo 
sapiens, Gorilla gorilla, Pan troglodytes, Pan paniscus, Pongo pygmaeus, 
Pongo pygmaeus abelii, Hylobates lar).

Hint: search for e.g. "Gorilla gorilla mitochondrial DNA, complete sequence" 
and then save the 16,364 bp circular DNA record in FASTA format. To narrow 
the query, use the "Mitochondrion" filter.

You want sequences of about 16 kbp (do not use fragments).

Store all sequences in one *.fasta file.
Rename the FASTA headers to short ones, e.g.:
">gi|1632801|emb|X99256.1|HLMITCSEQ Hylobates lar complete mitochondrial 
DNA sequence" -> ">Hlar"
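
The renaming can also be scripted. A minimal base-R sketch, assuming a small made-up FASTA file: the sequences below are toy strings, the second header is a hypothetical example, and the short label "Ggor" is only a suggestion (only "Hlar" comes from the example above).

```r
# Write a tiny stand-in FASTA file (toy sequences, NOT real mtDNA)
fasta_in <- tempfile(fileext = ".fasta")
writeLines(c(
  ">gi|1632801|emb|X99256.1|HLMITCSEQ Hylobates lar complete mitochondrial DNA sequence",
  "ACGTACGTACGTACGT",
  ">example_record Gorilla gorilla mitochondrial DNA, complete sequence",
  "ACGTACGAACGTACGT"
), fasta_in)

# Species name found in the long header -> short label (labels are our choice)
short_names <- c("Hylobates lar" = "Hlar", "Gorilla gorilla" = "Ggor")

lines <- readLines(fasta_in)
is_header <- startsWith(lines, ">")
for (species in names(short_names)) {
  # Replace every header line that mentions this species
  hit <- is_header & grepl(species, lines, fixed = TRUE)
  lines[hit] <- paste0(">", short_names[[species]])
}

fasta_out <- tempfile(fileext = ".fasta")
writeLines(lines, fasta_out)
print(readLines(fasta_out))
```

Sequence lines are left untouched; only lines starting with ">" are rewritten.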

b) Read the FASTA file(s) into R (use a package such as msa, ape, seqinr, 
Biostrings, phytools, phangorn)
c) Compute the MSA using the ClustalW method in R! 

Browse the alignment and write it to a PDF/HTML file and to an MSA format 
file (e.g. CLUSTAL, NEXUS, PHYLIP, PIR, GDE, MSF, FASTA).

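library(ape); library(phangorn)
fas <- "primates2.fasta"
# Option 1: read directly into a phyDat object (phangorn)
msa_fsa <- read.phyDat(fas, format = "fasta", type = "DNA")

# Option 2: read as DNAbin (ape), then convert to phyDat
msa_dna <- read.dna(fas, format = "fasta", as.character = FALSE)
align_primates <- phyDat(msa_dna, type = "DNA")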
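
One possible way to do step c) is sketched below with the Bioconductor package msa. The four sequences are made-up toy strings (an assumption, so that the example runs quickly); for the exercise you would read your own file with readDNAStringSet() instead. Output file names are arbitrary.

```r
library(msa)  # Bioconductor package; also attaches Biostrings

# Toy, made-up sequences standing in for the primate mtDNA
seqs <- DNAStringSet(c(
  seqA = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
  seqB = "ATGGCCATTGTTATGGGCCGCTGAAAGGGTGCCGATAG",
  seqC = "ATGGCGATTGTAATGGGCCGCTGTAAGGGTGCCCGATAG",
  seqD = "ATGCCATTGTAATGGGCCGCTGAAAGGTGCCCGATAG"
))

# Multiple sequence alignment with ClustalW
aln <- msa(seqs, method = "ClustalW")
print(aln, show = "complete")  # browse the full alignment in the console

# Convert to the classes used by ape and phangorn
aln_bin <- msaConvert(aln, type = "ape::DNAbin")
aln_phy <- msaConvert(aln, type = "phangorn::phyDat")

# Write the alignment in two MSA formats
out_fasta  <- file.path(tempdir(), "toy_alignment.fasta")
out_phylip <- file.path(tempdir(), "toy_alignment.phy")
phangorn::write.phyDat(aln_phy, file = out_fasta,  format = "fasta")
phangorn::write.phyDat(aln_phy, file = out_phylip, format = "phylip")
```

For a PDF rendering, msa also provides msaPrettyPrint(), which needs a working LaTeX installation, so it is left out of this sketch.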


d) Construct a distance matrix (phangorn: explore different distance 
measures, e.g. hamming, ml, etc.)
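
A short sketch of step d). It assumes the Laurasiatherian example alignment shipped with phangorn (restricted to 8 taxa) as a stand-in for your primate alignment; swap in your own phyDat object.

```r
library(phangorn)  # also attaches ape

# Stand-in data: example alignment from phangorn, first 8 taxa only
data(Laurasiatherian)
dat <- subset(Laurasiatherian, subset = 1:8)

d_ham <- dist.hamming(dat)                  # proportion of differing sites
d_jc  <- dist.ml(dat, model = "JC69")       # ML distance, Jukes-Cantor
d_f81 <- dist.ml(dat, model = "F81")        # ML distance, Felsenstein 81
d_k80 <- dist.dna(as.DNAbin(dat), model = "K80")  # ape: Kimura 2-parameter

# Compare the first few entries of two of the matrices
print(round(as.matrix(d_ham)[1:4, 1:4], 3))
print(round(as.matrix(d_k80)[1:4, 1:4], 3))
```

All four results are objects of class "dist", so they can be passed directly to the tree-building functions.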

e) Build trees using:
- NJ (https://en.wikipedia.org/wiki/Neighbor_joining)
- UPGMA (https://en.wikipedia.org/wiki/UPGMA)
- parsimony (https://en.wikipedia.org/wiki/Maximum_parsimony_(phylogenetics))
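
The three methods can be tried roughly as follows. This is a sketch only: the Laurasiatherian data from phangorn (8 taxa) again stands in for the primate alignment, and the parsimony search simply starts from the NJ tree.

```r
library(phangorn)

data(Laurasiatherian)
dat <- subset(Laurasiatherian, subset = 1:8)  # stand-in alignment
d <- dist.hamming(dat)

tree_nj    <- NJ(d)     # neighbor joining, unrooted
tree_upgma <- upgma(d)  # UPGMA, rooted and ultrametric

# Maximum parsimony: tree rearrangements starting from the NJ tree
tree_mp <- optim.parsimony(tree_nj, dat, trace = 0)
score_nj <- parsimony(tree_nj, dat)  # parsimony score = number of changes
score_mp <- parsimony(tree_mp, dat)
print(c(NJ = score_nj, MP = score_mp))

# Plot the three trees side by side
par(mfrow = c(1, 3))
plot(tree_nj, type = "unrooted", main = "NJ")
plot(tree_upgma, main = "UPGMA")
plot(tree_mp, type = "unrooted", main = "Parsimony")
```

A lower parsimony score means fewer implied substitutions; the optimized tree should never score worse than its starting tree.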

f) Construct an initial model using 'pml' with the NJ tree, the data, and 
the K80 model

For details about K80 see:
http://www.bioinf.man.ac.uk/resources/phase/manual/node67.html
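
A possible starting point for step f), assuming the phangorn example data as a stand-in. K80 has equal base frequencies and one rate for transitions and another for transversions; in pml the vector Q lists the rates in the order AC, AG, AT, CG, CT, GT, so the transitions are entries 2 and 5. The starting ratio of 2 is an arbitrary choice.

```r
library(phangorn)

data(Laurasiatherian)
dat <- subset(Laurasiatherian, subset = 1:8)  # stand-in alignment
tree_nj <- NJ(dist.hamming(dat))

# Initial likelihood model: equal base frequencies (pml default) and
# transitions (AG, CT) twice as fast as transversions (arbitrary start)
fit0 <- pml(tree_nj, data = dat, Q = c(1, 2, 1, 1, 2, 1))

print(fit0)          # summary of the model
print(logLik(fit0))  # log-likelihood before any optimization
```

Nothing is optimized yet at this point; fit0 only evaluates the likelihood of the NJ tree under the chosen parameters.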

Store the trees in different formats (e.g. Newick, NEXUS, PHYLIP).
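
Writing and re-reading trees with ape can look like the sketch below. The tree is a made-up toy example with placeholder tip labels, and the file names are arbitrary. PHYLIP programs read trees in Newick format, so write.tree covers that case as well.

```r
library(ape)

# Toy unrooted tree with placeholder labels (not a real result)
tr <- read.tree(text = "((t1:0.1,t2:0.1):0.05,t3:0.2,(t4:0.15,t5:0.3):0.1);")

newick_file <- file.path(tempdir(), "toy_tree.nwk")
nexus_file  <- file.path(tempdir(), "toy_tree.nex")

write.tree(tr, file = newick_file)   # Newick
write.nexus(tr, file = nexus_file)   # NEXUS

# Read both files back and check that the topology survived
tr_nwk <- read.tree(newick_file)
tr_nex <- read.nexus(nexus_file)
print(dist.topo(tr, tr_nwk))  # 0 means identical topology
print(dist.topo(tr, tr_nex))
```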

g) Optimize with respect to branch lengths (optim.pml)
h) Optimize with respect to the nucleotide substitution model parameters
i) Optimize simultaneously with respect to branch lengths and nucleotide 
substitution model parameters

Write the trees from g-h to a PDF file.
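
The optimization steps might be sketched like this, again with the phangorn example data as a stand-in and an arbitrary output file name. Passing model = "K80" to optim.pml lets it decide which substitution parameters are free, while optEdge switches branch-length optimization on or off.

```r
library(phangorn)

data(Laurasiatherian)
dat <- subset(Laurasiatherian, subset = 1:8)  # stand-in alignment
fit0 <- pml(NJ(dist.hamming(dat)), data = dat, Q = c(1, 2, 1, 1, 2, 1))
quiet <- pml.control(trace = 0)  # suppress progress output

# g) branch lengths only (substitution parameters stay fixed)
fit_g <- optim.pml(fit0, optEdge = TRUE, control = quiet)
# h) K80 rate parameter only (branch lengths stay fixed)
fit_h <- optim.pml(fit0, model = "K80", optEdge = FALSE, control = quiet)
# i) branch lengths and K80 rate parameter together
fit_i <- optim.pml(fit0, model = "K80", optEdge = TRUE, control = quiet)

fits <- list(start = fit0, g = fit_g, h = fit_h, i = fit_i)
print(sapply(fits, function(f) as.numeric(logLik(f))))

# Write the optimized trees to one PDF page
pdf_file <- file.path(tempdir(), "ml_trees.pdf")
pdf(pdf_file, width = 10, height = 4)
par(mfrow = c(1, 3))
plot(fit_g$tree, type = "unrooted", main = "g) branch lengths")
plot(fit_h$tree, type = "unrooted", main = "h) K80 parameters")
plot(fit_i$tree, type = "unrooted", main = "i) both")
dev.off()
```

Comparing the log-likelihoods shows how much each kind of parameter contributes to the fit.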

j) Bootstrap the trees with ape using boot.phylo and/or with phangorn 
using bootstrap.pml

Check support for individual branches.
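
Both bootstrap routes are sketched below under the same assumption (phangorn example data, 8 taxa). The replicate counts are kept small so that the example runs quickly; real analyses usually use many more.

```r
library(phangorn)

data(Laurasiatherian)
dat <- subset(Laurasiatherian, subset = 1:8)  # stand-in alignment
quiet <- pml.control(trace = 0)

# phangorn: bootstrap the ML analysis
fit <- optim.pml(pml(NJ(dist.hamming(dat)), data = dat),
                 model = "K80", control = quiet)
set.seed(1)
bs_trees <- bootstrap.pml(fit, bs = 20, optNni = TRUE, control = quiet)
plotBS(fit$tree, bs_trees, type = "phylogram")  # support drawn on the ML tree

# ape: bootstrap a distance-based NJ tree
x <- as.DNAbin(dat)  # sequences as a DNAbin matrix
nj_fun <- function(aln) nj(dist.dna(aln, model = "K80"))
tree_nj <- nj_fun(x)
support <- boot.phylo(tree_nj, x, FUN = nj_fun, B = 50, quiet = TRUE)
print(support)  # one count per internal node, out of B replicates
```

boot.phylo must be given the same tree-building function that produced the tree being tested, which is why nj_fun is used for both.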

k) Try to re-root the tree on arbitrarily chosen nodes.
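
Re-rooting with ape, as a minimal sketch on a made-up toy tree with placeholder tip labels: once by naming an outgroup tip, once by choosing an internal node.

```r
library(ape)

# Toy unrooted tree (basal trifurcation), not a real result
tr <- read.tree(text = "((t1:0.1,t2:0.1):0.05,t3:0.2,(t4:0.15,t5:0.3):0.1);")
print(is.rooted(tr))

# Root on an outgroup tip; resolve.root = TRUE makes the root bifurcating
tr_out <- root(tr, outgroup = "t5", resolve.root = TRUE)
print(is.rooted(tr_out))

# Re-root on an internal node, here the common ancestor of t1 and t2
node <- getMRCA(tr, c("t1", "t2"))
tr_node <- root(tr, node = node)

par(mfrow = c(1, 3))
plot(tr, main = "original")
plot(tr_out, main = "outgroup t5")
plot(tr_node, main = "re-rooted on node")
```

Re-rooting changes only how the tree is drawn and read, not the relationships among the tips.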


Based on the trees, can you say which organism is the most distant from the 
others and which one is the closest relative of humans?


If time allows:
a) go to page 156 (168 in pdf) 
https://en.wikipedia.org/wiki/MEGA,_Molecular_Evolutionary_Genetics_Analysis 
Try to reproduce the ML trees from Fig 8.3

==========================================================================

3) Additional material:
R specific:
https://www.mimuw.edu.pl/~lukaskoz/teaching/sad2/books/Analysis_of_Phylogenetics_and_Evolution_with_R.pdf
https://cran.r-project.org/web/packages/phangorn/phangorn.pdf
https://cran.r-project.org/web/packages/ape/ape.pdf