8
Phylogenetic Diversity (PD)
9
---------------------------
11
For each environment (i.e. sample), calculates the amount of branch length in a phylogenetic tree that lead to its sequences.
13
First we will load in a Newick formatted tree.
17
>>> from cogent.parse.tree import DndParser
18
>>> from cogent.maths.unifrac.fast_tree import UniFracTreeNode
19
>>> tree_in = open("data/Crump_example_tree_newick.txt")
20
>>> tree = DndParser(tree_in, UniFracTreeNode)
22
Next we will load information on which sequences in the tree come from which environment.
26
>>> from cogent.maths.unifrac.fast_unifrac import count_envs
27
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
28
>>> envs = count_envs(envs_in)
30
Finally, we can calculate the PD values for each environment in the tree
34
>>> from cogent.maths.unifrac.fast_unifrac import PD_whole_tree
35
>>> envs, PD_values = PD_whole_tree(tree, envs)
37
['E_FL', 'E_PA', 'O_FL', 'O_UN', 'R_FL', 'R_PA']
38
>>> print PD_values#doctest: +SKIP
39
[ 5.85389 7.60352 2.19215 2.81821 3.93728 3.7534 ]
41
``PD_vals`` is a ``numpy`` ``array`` with the values representing each environment in envs.
64
The Fast UniFrac implementation in PyCogent is the source code for the `Fast UniFrac web tool <http://bmf2.colorado.edu/fastunifrac>`_ and the `QIIME pipeline <http://qiime.sourceforge.net>`_ for Microbial community analysis.
66
Calculate a UniFrac Distance Matrix and apply PCoA and UPGMA
67
------------------------------------------------------------
69
The UniFrac analysis is run on open tree and environment file objects. The resulting dictionary has a distance matrix of pairwise UniFrac values ('distance_matrix'), a Newick string representing the results of performing UPGMA clustering on this distance matrix ('cluster_envs') and the results of running Principal Coordinates Analysis on the distance matrix ('pcoa'). One can specify weighted UniFrac with ``weighted=True``. Here we run an unweighted analysis.
73
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_file
74
>>> tree_in = open("data/Crump_example_tree_newick.txt")
75
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
76
>>> result = fast_unifrac_file(tree_in, envs_in, weighted=False)
77
>>> print result['cluster_envs']#doctest: +SKIP
78
((((('E_FL':0.339607103063,'R_FL':0.339607103063):0.0279991540511,'R_PA':0.367606257114):0.0103026524101,'E_PA':0.377908909524):0.0223322024492,'O_UN':0.400241111973):0.00976759866402,'O_FL':0.410008710637);
79
>>> print result['pcoa']#doctest: +SKIP
80
=================================================================================================
81
Type Label vec_num-0 vec_num-1 vec_num-2 vec_num-3 vec_num-4 vec_num-5
82
-------------------------------------------------------------------------------------------------
83
Eigenvectors E_FL 0.05 0.22 -0.09 -0.26 -0.29 -0.00
84
Eigenvectors E_PA -0.36 0.24 0.21 -0.08 0.18 -0.00
85
Eigenvectors O_FL 0.32 -0.26 0.30 -0.13 0.05 -0.00
86
Eigenvectors O_UN -0.28 -0.40 -0.24 -0.04 0.01 -0.00
87
Eigenvectors R_FL 0.29 0.18 -0.28 0.09 0.22 -0.00
88
Eigenvectors R_PA -0.02 0.02 0.11 0.42 -0.17 -0.00
89
Eigenvalues eigenvalues 0.40 0.36 0.29 0.27 0.19 -0.00
90
Eigenvalues var explained (%) 26.34 23.84 19.06 18.02 12.74 -0.00
91
-------------------------------------------------------------------------------------------------
93
Perform pairwise tests of whether samples are significantly different with UniFrac
94
----------------------------------------------------------------------------------
96
The analysis is run on open tree and environment file objects. In this example, we use unweighted unifrac (``weighted=False``), we permute the environment assignments on the tree 50 times (``num_iters=50``) and we perform UniFrac on all pairs of environments (``test_on="Pairwise"``). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
100
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
101
>>> tree_in = open("data/Crump_example_tree_newick.txt")
102
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
103
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=False, num_iters=50, test_on="Pairwise")
104
>>> print result[0]#doctest: +SKIP
105
('E_FL', 'E_PA', 0.17999999999999999, 1.0)
107
Perform a single UniFrac significance test on the whole tree
108
------------------------------------------------------------
110
The analysis is run on open tree and environment file objects. In this example, we use weighted unifrac (``weighted=True``), we permute the environment assignments on the tree 50 times (``num_iters=50``) and we perform a unifrac significance test on the whole tree (``test_on="Tree"``). The resulting list has only one item since a single test was performed. It is a 3 item tuple where the second and third values are the p-value.
114
>>> from cogent.maths.unifrac.fast_unifrac import fast_unifrac_permutations_file
115
>>> tree_in = open("data/Crump_example_tree_newick.txt")
116
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
117
>>> result = fast_unifrac_permutations_file(tree_in, envs_in, weighted=True, num_iters=50, test_on="Tree")
118
>>> print result#doctest: +SKIP
119
[('whole tree', 0.56000000000000005, 0.56000000000000005)]
124
Perform pairwise tests of whether samples are significantly different with the P-test (Martin, 2002)
125
----------------------------------------------------------------------------------------------------
127
The analysis is run on open tree and environment file objects. In this example, we permute the environment assignments on the tree 50 times (``num_iters=50``) and perform the p test for all pairs of environments (``test_on="Pairwise"``). A list is returned with a tuple for each pairwise comparison with items: 0 - the first environment, 1 - the second environment, 2- the uncorrected p-value and 3 - the p-value after correcting for multiple comparisons with the Bonferroni correction.
131
>>> from cogent.maths.unifrac.fast_unifrac import fast_p_test_file
132
>>> tree_in = open("data/Crump_example_tree_newick.txt")
133
>>> envs_in = open("data/Crump_et_al_example_env_file.txt")
134
>>> result = fast_p_test_file(tree_in, envs_in, num_iters=50, test_on="Pairwise")
135
>>> print result[0]#doctest: +SKIP
136
('E_FL', 'E_PA', 0.040000000000000001, 0.59999999999999998)
141
Computing a distance matrix between samples
142
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
144
PyCogent provides many different ways to compute pairwise distances between objects. ``cogent/maths/distance_transform.py`` provides a set of functions to calculate dissimilarities/distances between samples, given an abundance matrix. Here is one example:
148
>>> from cogent.maths.distance_transform import dist_euclidean
149
>>> from numpy import array
150
>>> abundance_data = array([[1, 3],
152
... [0.1, 22]],'float')
154
.. note:: see ``distance_transform.py`` for other metrics than euclidean
156
We now have 3 samples, and the abundance of each column (e.g.: species) in that sample. The first sample has 1 individual of species 1, 3 individuals of species 2. We now compute the relatedness between these samples, using euclidean distance between the rows:
160
>>> dists = dist_euclidean(abundance_data)
161
>>> print str(dists.round(2)) # doctest: +SKIP
164
[ 19.02, 20.59 , 0. ]]
166
this distance matrix can be visualized via multivariate reduction techniques such as :ref:`multivariate-analysis`.