<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<title>Phylogenetic Tree HOWTO</title>
<firstname>Jason</firstname>
<surname>Stajich</surname>
<para>Bioperl Core Developer</para>
<orgname>Dept Molecular Genetics and Microbiology,
Duke University</orgname>
<address><email>jason AT bioperl.org</email></address>
<pubdate>2003-12-01</pubdate>
<revnumber>0.1</revnumber>
<date>2003-12-01</date>
<authorinitials>JES</authorinitials>
<revremark>First version</revremark>
<para>This document is copyright Jason Stajich, 2003. It can
be copied and distributed under the terms of the Perl
This HOWTO shows how to use the Bioperl Tree objects to
manipulate phylogenetic trees. It covers reading and writing
trees and querying them for information about specific nodes or
overall statistics. Advanced topics include
generating random trees and extending the basic structure
for integration with other modules in Bioperl.
<section id="introduction">
<title>Introduction</title>
Generating and manipulating phylogenetic trees is an important
part of modern systematics and molecular evolution research.
The construction of trees is the subject of a rich literature
and active research. This HOWTO and the modules described
within are focused on querying and manipulating trees once they
The data we intend to capture with these objects concerns
Trees and their Nodes. A Tree is made up of Nodes and the
relationships which connect these Nodes. The basic
representation of parent and child nodes is intended to
represent the directionality of evolution: it captures
the idea that some ancestral species gave rise, through speciation
events, to a number of child species. The trees
need not be strictly bifurcating (binary trees, to the
CS types), and a parent node can give rise to one or many child
In practice there are just a few
objects, or modules, you need to know about. The
Tree object <emphasis>Bio::Tree::Tree</emphasis> is the main
entry point to the data represented by a tree. A Node is
represented generically by <emphasis>Bio::Tree::Node</emphasis>;
however, there are subclasses of this object to handle particular
cases where we need a richer object (see
<emphasis>Bio::PopGen::Simulations::Coalescent</emphasis> and
the PopGen HOWTO for more information). The connections between
Nodes are described using a few simple concepts. Each Node holds
references that keep track of
who its parent is and who its children are. A Node can have only one
parent, and it can have one or many children (a leaf has none). In fact, all of the
information in a tree pertaining to the relationships between
Nodes, and specific data like bootstrap values and labels, is
stored in the Node objects, while the
<emphasis>Bio::Tree::Tree</emphasis> object is just a
container for some summary information about the tree and a
reference to the tree's root node.
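The parent and child links described above can be followed in both directions. The following is a minimal sketch, not a definitive recipe: the three-taxon Newick string and the helper name walk are made up for illustration, and the tree is read from an in-memory filehandle only so that the example needs no file on disk.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical three-taxon tree used only for illustration.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $fh, '<', \$newick or die "cannot open in-memory handle: $!";
my $tree = Bio::TreeIO->new(-fh => $fh, -format => 'newick')->next_tree;

# The Tree object hands us the root; everything else is reached via Nodes.
my $root = $tree->get_root_node;

# Walk down: print each node, indented by depth, with its child count.
# 'walk' is a helper defined here, not a Bioperl method.
sub walk {
    my ($node, $depth) = @_;
    my @children = $node->each_Descendent;
    my $id       = $node->id;
    my $label    = (defined $id && length $id) ? $id : '(internal)';
    printf "%s%s: %d child node(s)\n", '  ' x $depth, $label, scalar @children;
    walk($_, $depth + 1) for @children;
}
walk($root, 0);

# Walk up: follow the single parent link from a leaf back to the root.
my ($leaf) = $tree->find_node(-id => 'A');
my $steps = 0;
for (my $n = $leaf; defined $n->ancestor; $n = $n->ancestor) {
    $steps++;
}
print "A is $steps edge(s) below the root\n";
```

Only the root has no ancestor, which is what ends the upward loop.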
<section id="simple_usage">
<title>Simple Usage</title>
Trees are used to represent the history of a collection of taxa,
sequences, or populations.
<title>Reading and Writing Trees</title>
Using <emphasis>Bio::TreeIO</emphasis> one can read trees
from files or datastreams and
create <emphasis>Bio::Tree::Tree</emphasis> objects. This is
analogous to how we read sequences from sequence files with
<emphasis>Bio::SeqIO</emphasis> to create Bioperl sequence
objects which can be queried and manipulated. Similarly, we can
write <emphasis>Bio::Tree::Tree</emphasis> objects out to string
representations like the Newick (New Hampshire) format, which can
be printed to a file or a datastream, stored in a database, etc.
The main module for reading and writing trees is the
<emphasis>Bio::TreeIO</emphasis> factory module, which has
several driver modules which plug into it. These drivers
include <emphasis>Bio::TreeIO::newick</emphasis> for New
Hampshire or Newick format and
<emphasis>Bio::TreeIO::nhx</emphasis> for the New Hampshire eXtended
format from Sean Eddy and Christian Zmasek as part of their RIO
and ATV system [reference here]. The driver
<emphasis>Bio::TreeIO::nexus</emphasis> supports parsing tree
data from PAUP's Nexus format. However, this driver currently
only supports parsing, not writing, of Nexus format tree files.
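Because the driver is selected by the -format argument, converting between formats amounts to reading with one driver and writing with another. The sketch below assumes a made-up Newick string and uses in-memory filehandles (an assumption made only to keep the example self-contained); with real data you would pass -file instead.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical input; in practice this would come from a file.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $in_fh, '<', \$newick or die "cannot open input handle: $!";
my $in = Bio::TreeIO->new(-fh => $in_fh, -format => 'newick');

# Collect the NHX output in a scalar instead of a file.
my $nhx_text = '';
open my $out_fh, '>', \$nhx_text or die "cannot open output handle: $!";
my $out = Bio::TreeIO->new(-fh => $out_fh, -format => 'nhx');

# Read every tree with the newick driver, write it with the nhx driver.
my $count = 0;
while (my $tree = $in->next_tree) {
    $out->write_tree($tree);
    $count++;
}

print "converted $count tree(s)\n";
print $nhx_text;
```

The same loop works for any pair of drivers, as long as the output driver supports writing.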
<title>Example Code</title>
Here is some code which will read in a Tree from a file called
"tree.tre" and produce a Bio::Tree::Tree object which is stored in
the variable <emphasis>$tree</emphasis>.
Like most modules which do input/output you can also specify
the argument -fh in place of -file to provide a glob or filehandle in
place of the filename.
use Bio::TreeIO;
# parse in newick/new hampshire format
my $input = Bio::TreeIO->new(-file   => "tree.tre",
                             -format => "newick");
my $tree = $input->next_tree;
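As a hedged illustration of the -fh variant, the sketch below reads from a filehandle rather than a filename and keeps calling next_tree until the input is exhausted. The two Newick strings are invented for the example, and the in-memory filehandle stands in for any handle you might already have open (a pipe, a socket, STDIN).

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Two hypothetical trees in one stream, one per line.
my $data = "((A:1,B:2):1,C:3);\n((A:1,C:2):1,B:3);\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

# -fh takes the place of -file; everything else is unchanged.
my $input = Bio::TreeIO->new(-fh => $fh, -format => 'newick');

# next_tree returns one tree per call and a false value at end of input.
my @trees;
while (my $tree = $input->next_tree) {
    push @trees, $tree;
}
printf "read %d tree(s)\n", scalar @trees;
```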
Once you have a Tree object you can do a number of things with it.
These are all methods required in <emphasis>Bio::Tree::TreeI</emphasis>.
<title>Bio::Tree::TreeI methods</title>
Request the taxa (leaves of the tree).
<programlisting>my @taxa = $tree->get_leaf_nodes;</programlisting>
Get the root node.
<programlisting>my $root = $tree->get_root_node;</programlisting>
Get the total length of the tree (sum of all the branch lengths),
which is only useful if the nodes actually have the branch
length stored, of course.
<programlisting>my $total_length = $tree->total_branch_length;</programlisting>
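The three calls above can be tried together on a small tree. This is a sketch under stated assumptions: the Newick string is hypothetical, and it is read from an in-memory filehandle only so the example runs without a file.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical tree with a branch length on every non-root node.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $fh, '<', \$newick or die "cannot open in-memory handle: $!";
my $tree = Bio::TreeIO->new(-fh => $fh, -format => 'newick')->next_tree;

my @taxa         = $tree->get_leaf_nodes;        # the leaves (tips)
my $root         = $tree->get_root_node;         # the single root node
my $total_length = $tree->total_branch_length;   # sum over all branches

print "taxa: ", join(",", sort map { $_->id } @taxa), "\n";
print "children of root: ", scalar($root->each_Descendent), "\n";
print "total branch length: $total_length\n";
```

If the input carries no branch lengths, the total is not meaningful, as noted above.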
<section id="TreeFunctionsI">
<title>Bio::Tree::TreeFunctionsI</title>
An additional interface, <emphasis>Bio::Tree::TreeFunctionsI</emphasis>, implements
utility functions which are useful for manipulating a Tree.
Find a particular node, either by name or by some other field that is
stored in a Node. The field type should be the name of a method we
can call on all of the Nodes in the Tree.
# find all the nodes named 'node1' (there should be only one)
my @named = $tree->find_node(-id => 'node1');
# find all the nodes which have description 'BMP'
my @described = $tree->find_node(-description => 'BMP');
# find all the nodes with bootstrap value of 70
my @supported = $tree->find_node(-bootstrap => 70);
If you would like to do more sophisticated searches, like "find all
the nodes with bootstrap value better than 70", you can easily
implement this yourself.
# skip nodes that carry no bootstrap value at all
my @nodes = grep { defined $_->bootstrap && $_->bootstrap > 70 } $tree->get_nodes;
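The same grep pattern works for any Node method. The sketch below is illustrative only: the Newick string is hypothetical, and the bootstrap value is set by hand through the bootstrap accessor rather than parsed from the input, so that the example does not depend on how a given driver interprets internal node labels.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical tree; read from an in-memory handle to stay self-contained.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $fh, '<', \$newick or die "cannot open in-memory handle: $!";
my $tree = Bio::TreeIO->new(-fh => $fh, -format => 'newick')->next_tree;

# Pick the one internal node that is not the root (the parent of A and B)
# and give it an illustrative bootstrap value.
my ($ab_parent) = grep { !$_->is_Leaf && defined $_->ancestor } $tree->get_nodes;
$ab_parent->bootstrap(95);

# Nodes with bootstrap support above 70; undefined values are skipped.
my @well_supported = grep { defined $_->bootstrap && $_->bootstrap > 70 }
                     $tree->get_nodes;

# Leaves whose branch is longer than 1.5.
my @long_leaves = grep { defined $_->branch_length && $_->branch_length > 1.5 }
                  $tree->get_leaf_nodes;

printf "%d well-supported node(s)\n", scalar @well_supported;
print "long leaves: ", join(",", sort map { $_->id } @long_leaves), "\n";
```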
Remove a Node from the Tree and update the children/ancestor links
where the Node is an intervening one.
# provide the node object to remove from the Tree
$tree->remove_Node($node);
# or specify the node Name to remove
$tree->remove_Node('Node12');
Get the lowest common ancestor for a set of Nodes. This method is
used to find an internal Node of the Tree which can be traced,
through its children, to the requested set of Nodes. It is used in
the calculations of monophyly and paraphyly and in determining the
distance between two nodes.
# Provide a list of Nodes that are in this tree
my $lca = $tree->get_lca(-nodes => \@nodes);
Get the distance between two nodes by adding up the branch lengths
of all the connecting edges between the two nodes.
my $distance = $tree->distance(-nodes => [$node1,$node2]);
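To make the lowest common ancestor and distance calls concrete, here is a small sketch on a hypothetical tree. The Newick string and the variable names are assumptions for illustration; the path between two leaves runs up from each leaf to their lowest common ancestor, so the distance is the sum of the branch lengths along both halves.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical tree: A and B are sisters, C attaches at the root.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $fh, '<', \$newick or die "cannot open in-memory handle: $!";
my $tree = Bio::TreeIO->new(-fh => $fh, -format => 'newick')->next_tree;

# Look the leaves up by name.
my ($node_a) = $tree->find_node(-id => 'A');
my ($node_b) = $tree->find_node(-id => 'B');
my ($node_c) = $tree->find_node(-id => 'C');

# The LCA of A and B is their shared parent, not the root.
my $lca     = $tree->get_lca(-nodes => [$node_a, $node_b]);
my $is_root = $lca->internal_id == $tree->get_root_node->internal_id;

# A to B passes through their parent: 1 + 2.
my $dist_ab = $tree->distance(-nodes => [$node_a, $node_b]);
# A to C passes through the root: 1 + 1 + 3.
my $dist_ac = $tree->distance(-nodes => [$node_a, $node_c]);

print "LCA of A and B is ", ($is_root ? "the root" : "an internal node"), "\n";
print "distance A-B: $dist_ab\n";
print "distance A-C: $dist_ac\n";
```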
Perform a test of monophyly for a set of nodes and a given outgroup
node. This means the common ancestor for the members of the
internal_nodes group is more recent than the common ancestor that any of them
shares with the outgroup node.
if( $tree->is_monophyletic(-nodes    => \@internal_nodes,
                           -outgroup => $outgroup) ) {
    print "these nodes are monophyletic: ",
          join(",",map { $_->id } @internal_nodes ), "\n";
}
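The monophyly test can be checked by hand on a three-taxon tree. The sketch below is illustrative: the Newick string and the %leaf lookup hash are assumptions introduced for the example, not part of the Bioperl API.

```perl
use strict;
use warnings;
use Bio::TreeIO;

# Hypothetical tree: A and B are sisters, C branches off at the root.
my $newick = "((A:1,B:2):1,C:3);\n";
open my $fh, '<', \$newick or die "cannot open in-memory handle: $!";
my $tree = Bio::TreeIO->new(-fh => $fh, -format => 'newick')->next_tree;

# Index the leaves by name for convenient lookup.
my %leaf = map { $_->id => $_ } $tree->get_leaf_nodes;

# A and B share an ancestor that excludes the outgroup C.
my $ab_vs_c = $tree->is_monophyletic(-nodes    => [ @leaf{qw(A B)} ],
                                     -outgroup => $leaf{C});

# A and C only meet at the root, which is also an ancestor of B.
my $ac_vs_b = $tree->is_monophyletic(-nodes    => [ @leaf{qw(A C)} ],
                                     -outgroup => $leaf{B});

printf "A+B with outgroup C: %s\n",
    $ab_vs_c ? 'monophyletic' : 'not monophyletic';
printf "A+C with outgroup B: %s\n",
    $ac_vs_b ? 'monophyletic' : 'not monophyletic';
```

The choice of outgroup matters: the same pair of leaves can pass or fail depending on which node is used as the reference.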
Perform a test of paraphyly for a set of nodes and a given outgroup
node. This means that a common ancestor 'A' for the members of the
group is more recent than a common ancestor 'B' that they share with
the outgroup node <emphasis>and</emphasis> that there are no other
nodes in the tree which have 'A' as a common ancestor before 'B'.
if( $tree->is_paraphyletic(-nodes    => \@internal_nodes,
                           -outgroup => $outgroup) ) {
    print "these nodes are paraphyletic: ",
          join(",",map { $_->id } @internal_nodes ), "\n";
}
Reroot a tree, specifying a different node as the root (and a
different node as the outgroup).
# node can either be a Leaf node in which case it becomes the
# outgroup and its ancestor is the new root of the tree
# or it can be an internal node which will become the new
# root
$tree->reroot($node);
<section id="advanced_topics">
<title>Advanced Topics</title>
It is possible to generate random tree topologies with a Bioperl
object called <emphasis>Bio::Tree::RandomFactory</emphasis>. The
factory only requires the specification of the total number of taxa
in order to simulate a history. One can request different methods for
generating the random phylogeny. At present, however, only the
simple backward Yule method is implemented, and it is the default.
The trees can be generated with the following code. You can either
specify the names of taxa or just a count of the total number of taxa
use Bio::TreeIO;
use Bio::Tree::RandomFactory;
# initialize a TreeIO writer to output the trees as we create them
my $out = Bio::TreeIO->new(-format => 'newick',
                           -file   => ">randomtrees.tre");
my @listoftaxa = qw(A B C D E F G H);
my $factory = Bio::Tree::RandomFactory->new(-taxa => \@listoftaxa);
# generate 10 random trees
for( my $i = 0; $i < 10; $i++ ) {
    $out->write_tree($factory->next_tree);
}
# One can also just request a total number of taxa (8 here) and
# not provide labels for them
# In addition one can specify the total number of trees
# the object should return so we can call this in a while loop
$factory = Bio::Tree::RandomFactory->new(-num_taxa => 8,
while( my $tree = $factory->next_tree) {
    $out->write_tree($tree);
}
There are more sophisticated operations that you may wish to pursue
with these objects. We have tried to create a framework for this type
of data, but by no means should this be looked at as the final
product. If you have a particular statistic or function that
applies to trees that you would like to see included in the
toolkit we encourage you to send details to the Bioperl list.
<section id="References">
<title>References and More Reading</title>
For more reading and some references for the techniques above see