Genomic Duplication Problems for Unrooted Gene Trees

Jarosław Paszek, Paweł Górecki

Algorithms for the rooted variant of Episode Clustering Problem.
Based on the linear-time algorithm, presented in [2]. Implemented in groovy.

Installation:
1. Download jar file into your working directory.
2. Download lib and extract it in the same directory.

Simple usage:
    INPUT: set of rooted gene trees, a species tree 
OUTPUT: Episode Clustering cost
1. java -jar EC.jar -h displays help
2. -g gtrees.txt to select the input file with gene trees
3. -s strees.txt to select the input file with species trees
4. -i in order to solve Episode Clustering Problem
5. -e (optional) in order to print episodes on a species tree
6. -m 1 Model of valid mappings presented in [3]
-m 2
The model of valid mappings from [1]

[1] Paszek, J., Górecki, P.: Genomic Duplication Problems for Unrooted Gene Trees accepted to APBC 2016
[2] Luo, C.-W., Chen, M.-C., Chen, Y.-C., Yang, R.W.L., Liu, H.-F., Chao, K.-M.: Linear-time algorithms for the multiple gene duplication problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8(1), 260–265 (2011)
[3] Bansal, M.S., Eulenstein, O.: The multiple gene duplication problem revisited. Bioinformatics 24(13), 132–8 (2008)

Urec
Installation of urec is required in order to run the software. See [4].

Installation:
1. Download enhanced urec sources file into your working directory.
2. Extract urec.zip and compile sources (cd urec ; make -f Makefile).

    INPUT: file - first line: unrooted gene tree, second: a species tree 
OUTPUT: α edge rooting (or nothing) for -a 51; β edge rooting (or two rootings) for -a 52
Usage:
1. urec/urec -T double.txt -b -a 51 double.txt consist of two lines - first: input gene tree ; second: input species tree
-a 51 print the rooting on the empty edge if it is present in input gene tree, or nothing otherwise, see α - Algorithm 1.
2.
-a 52
print one double edge rooting if the double edge is present, or at most two rooting candidates for gene trees having two S2 stars, see β - Algorithm 1.
As a result we have rooted gene tree (trees). We can use them as an input to EC.jar

[4] Górecki, P., Tiuryn, J.: Inferring phylogeny from whole genomes. Bioinformatics 23(2), 116–122 (2007)

DATASETS
Guigó dataset consists of 53 rooted gene trees from 16 Eukaryotes from [5].
This dataset was evaluated with 71 species trees from [6], known to have the total minimal duplication cost.
"The most biologically reasonable" [7] tree from those 71 is first in s.txt and the only one in s1.txt
Download and extract Guigó dataset into your working directory.

Génolevures is a dataset of 4144 gene trees [8] from nine yeast genomes and two species trees:
a) one from [9] (s.txt) and
b) the second one (s1.txt) having the lowest duplication-loss cost computed by Fasturec [10].
Download and extract Génolevures dataset into your working directory.

The third dataset TreeFam, spanning 25 mostly animal species, consists of 1274 curated gene family trees from TreeFam v7.0 [11].
The species tree for TreeFam is based on NCBI taxonomy.
Download and extract TreeFam dataset into your working directory.

[5] Guigó, R., Muchnik, I.B., Smith, T.F.: Reconstruction of ancient molecular phylogeny. Molecular Phylogenetics and Evolution 6(2), 189–213 (1996).
[6] Chang, W., Górecki, P., Eulenstein, O.: Exact solutions for species tree inference from discordant gene trees. Journal of Bioinformatics and Computational Biology 11(5) (2013)
[7] Page, R.D.M., Charleston, M.A.: Reconciled trees and incongruent gene and species trees. In: DIMACS Series in Discrete Mathematics and Theoretical Computer Sciences, vol. 37 (1997)
[8] Górecki, P., Eulenstein, O.: Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinformatics 13(Suppl 10), 14 (2012)
[9] Dujon B. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 2006;22(7):375 – 387.
[10] Górecki, P., Eulenstein, O.: GTP supertrees from unrooted gene trees: linear time algorithms for nni based local searches. Lecture Notes in Computer Science 7292, 83–105 (2012)
[11] Ruan, J., Li, H., Chen, Z., Coghlan, A., Coin, L.J., Guo, Y., Hériché, J.-K., Hu, Y., Kristiansen, K., Li, R., Liu, T., Moses, A., Qin, J., Vang, S., Vilella, A.J., Ureta-Vidal, A., Bolund, L., Wang, J., Durbin, R.: TreeFam: 2008 Update. Nucleic Acids Research 36, 735–40 (2008)

ALGORITHM 2
    INPUT: set of unrooted gene trees, set of species trees, (optional) 1 for valid mappings model from [3], 2 for valid mappings model from our article [1], no parameter - same as 2; 
OUTPUT: three sets of rootings with EC costs for each species tree
Every input dataset has a gene tree with double edge.
This algorithm generates EC solutions for 3 variants of rootings for input species trees.
a) tmp/grootingsalphaplus.txt - alpha part, that means rootings on empty edges.
This set is extended - for gene trees with double edges we add rootings of that trees into candidate set.
Please observe that EC cost of this set of rootings is not minimal which follows from the fact that input set consist of a gene tree with a double edge implies existence of a top-cluster (see Thm 5 from [1]).
b) tmp/grootingsbetawithout2S2.txt - delta subset of candidate rooting set
- consists of top-cluster double edges rootings,
- consists of top-cluster rootings for empty edges when there is only one rooting (one S2 star)
- consists of an empty rooting when there is no top-cluster rooting
- when there are two S2 stars we omit such tree
c) tmp/grootingssamplebeta.txt - as above, but :
- when there are two S2 stars we choose one rooting of such tree
please note that only this one candidate set is generated

1. Do Installation from previous points.
2. Download datasets.
3. Download and "chmod u+x" Algorithm 2 script
4. Reproduce the results :

./algorithm2.sh guigo/g.txt guigo/s.txt
Note: for Guigo k = 0 so b) and c) are the same sets.
Note: Total minimal EC (here:4) - is the solution of UEC Problem

./algorithm2.sh guigo/g.txt guigo/s.txt 1
Note: 1 to change model. UEC solution equals 5.

java -jar EC.jar -g tmp/grootingssamplebeta.txt -s guigo/s1.txt -i -m 2 -e
To obtain duplication clusters marked on species tree.

./algorithm2.sh treefam/g.txt treefam/s.txt
For b) we have 45 (so for the whole input set it cannot be lower)
For c) sample all input rooting we obtained 45 - so it is optimal.