.. _multivariate-analysis:

**************************
Multivariate data analysis
**************************

.. sectionauthor:: Justin Kuczynski, Catherine Lozupone, Andreas Wilm

Principal Coordinates Analysis
==============================

Principal Coordinates Analysis works on a matrix of pairwise distances. In this example we start by calculating the pairwise distances for a set of aligned sequences, though any distance matrix can be used with PCoA, relating any kind of objects, not only sequences.

>>> from cogent import LoadSeqs
>>> from cogent.phylo import distance
>>> from cogent.cluster.metric_scaling import PCoA

Import a substitution model (or create your own).

>>> from cogent.evolve.models import HKY85

Load the alignment.

>>> al = LoadSeqs("data/test.paml")

Create a pairwise distance calculator for the alignment, providing a substitution model instance.

>>> d = distance.EstimateDistances(al, submodel=HKY85())
>>> d.run(show_progress=False)

Now use this matrix to perform principal coordinates analysis.

>>> PCoA_result = PCoA(d.getPairwiseDistances())
>>> print PCoA_result # doctest: +SKIP
======================================================================================
        Type                Label   vec_num-0   vec_num-1   vec_num-2   vec_num-3   vec_num-4
--------------------------------------------------------------------------------------
Eigenvectors            NineBande       -0.02        0.01        0.04        0.01        0.00
Eigenvectors             DogFaced       -0.04       -0.06       -0.01        0.00        0.00
Eigenvectors            HowlerMon       -0.07        0.01        0.01       -0.02        0.00
Eigenvectors                Mouse        0.20        0.01       -0.01       -0.00        0.00
Eigenvectors                Human       -0.07        0.04       -0.03        0.01        0.00
 Eigenvalues          eigenvalues        0.05        0.01        0.00        0.00       -0.00
 Eigenvalues    var explained (%)       85.71        9.60        3.73        0.95       -0.00
--------------------------------------------------------------------------------------

We can save these results to a file in a delimited format (we'll use tab here) that can be opened in any data analysis program, such as R or Excel. The principal coordinates can then be plotted against each other for visualization.

>>> PCoA_result.writeToFile('PCoA_results.txt', sep='\t')
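
Under the hood, PCoA is classical (Torgerson) metric scaling: double-centre the matrix of squared distances and keep the top eigenvectors, each scaled by the square root of its eigenvalue. A minimal NumPy sketch, independent of PyCogent (the name ``classical_pcoa`` is ours):

```python
import numpy as np

def classical_pcoa(dist, dims=2):
    """Classical (Torgerson) scaling: the metric core of PCoA."""
    n = dist.shape[0]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering.dot(dist ** 2).dot(centering)
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1]  # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    # Coordinates are eigenvectors scaled by sqrt of the (clipped) eigenvalues
    coords = evecs[:, :dims] * np.sqrt(np.maximum(evals[:dims], 0.0))
    return coords, evals

# Distances computed from known 2-D points are reproduced exactly
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
coords, evals = classical_pcoa(dist, dims=2)
recovered = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
print(np.allclose(dist, recovered))  # True
```

For genuinely Euclidean input the configuration is recovered exactly (up to rotation and translation); for other distance measures, negative eigenvalues can appear and are clipped here.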
67
The eigendecomposition step in Principal Coordinates Analysis (PCoA)
68
doesn't scale very well. And for thousands of objects the computation
69
of all pairwise distance alone can get very slow, because it scales
70
quadratically. For a huge number of objects this might even pose a
71
memory problem. Fast-MDS methods approximate an MDS/PCoA solution and
72
do not suffer from these problems.

First, let's simulate a large data set by creating 1500 objects living in 10 dimensions, then compute their pairwise distances and perform principal coordinates analysis on the result. Note that the last two steps alone may take a couple of minutes.

>>> from cogent.maths.distance_transform import dist_euclidean
>>> from cogent.cluster.metric_scaling import principal_coordinates_analysis
>>> from numpy import random
>>> objs = random.random((1500, 10))
>>> distmtx = dist_euclidean(objs)
>>> full_pcoa = principal_coordinates_analysis(distmtx)

PyCogent implements two fast MDS approximations: Split-and-Combine MDS (SCMDS, still in development) and Nystrom (also known as Landmark-MDS). Both can easily handle many thousands of objects. One reason is that they do not require all distances to be computed: instead, you pass in the distance function and only the required distances are calculated.

Nystrom works on a so-called seed matrix, which contains (only) k by n distances, where n is the total number of objects and k << n. The bigger k, the more exact the approximation and the longer the computation. A further difference from normal Principal Coordinates Analysis is that no eigenvalues are returned, only approximate eigenvectors of length ``dims``.

>>> from cogent.cluster.approximate_mds import nystrom
>>> from random import sample
>>> from numpy import array
>>> n_seeds = 100
>>> dims = 3
>>> seeds = array(sample(distmtx, n_seeds))
>>> nystrom_3d = nystrom(seeds, dims)

A good rule of thumb for picking n_seeds is log(n), log(n)**2 or sqrt(n).
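
The landmark idea behind Nystrom can be sketched directly in NumPy: run classical MDS on the k landmarks alone, then place every object from its distances to those landmarks. This is an independent illustration under our own names (``landmark_mds``), not PyCogent's implementation:

```python
import numpy as np

def landmark_mds(seed_dists, dims=2):
    """Landmark (Nystrom) MDS sketch.

    seed_dists is a (k, n) slice of the full distance matrix whose
    first k columns form the landmark-landmark block, with k << n.
    """
    k, n = seed_dists.shape
    sq = seed_dists ** 2
    d_ll = sq[:, :k]                        # squared landmark-landmark distances
    # Classical MDS on the landmarks alone
    centering = np.eye(k) - np.ones((k, k)) / k
    b = -0.5 * centering.dot(d_ll).dot(centering)
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1][:dims]
    top_vals = np.maximum(evals[order], 0.0)
    # Pseudo-inverse transpose of the landmark coordinates
    pinv = evecs[:, order] / np.sqrt(np.where(top_vals > 0, top_vals, 1.0))
    # Place every object from its squared distances to the landmarks
    mean_sq = d_ll.mean(axis=0)
    coords = -0.5 * pinv.T.dot(sq - mean_sq[:, None])
    return coords.T                         # (n, dims) approximate eigenvectors

# With noise-free Euclidean input the embedding preserves all distances
rng = np.random.RandomState(0)
pts = rng.rand(50, 2)
full = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(axis=-1))
coords = landmark_mds(full[:5, :], dims=2)   # 5 seed rows instead of 50
recovered = np.sqrt(((coords[:, None] - coords[None]) ** 2).sum(axis=-1))
print(np.allclose(full, recovered))  # True
```

Only the 5 x 50 seed block is ever used, which is what makes the method cheap: the remaining 45 x 45 distances are never touched.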

SCMDS works by dividing the pairwise distance matrix into tiles of a certain size and overlap. MDS is performed on each tile individually and the resulting solutions are progressively joined. As with Nystrom, not all distances are computed, only those within the overlapping tiles. The size and overlap of the tiles determine the quality of the approximation as well as the run time.

>>> from cogent.cluster.approximate_mds import CombineMds, cmds_tzeng
>>> combine_mds = CombineMds()
>>> tile_overlap = 100
>>> dims = 3
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[0:500,0:500], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[400:900,400:900], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[800:1300,800:1300], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[1200:1500,1200:1500], dims)
>>> combine_mds.add(tile_eigvecs, tile_overlap)
>>> combine_mds_3d = combine_mds.getFinalMDS()

If you want to know how good the returned approximations are, perform principal_coordinates_analysis() on a smallish submatrix and run a goodness_of_fit analysis.
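
The joining step of SCMDS relies on the overlapping objects being embedded by both tiles: an orthogonal Procrustes fit on the shared points maps each new tile into the frame of the solution so far. A minimal sketch of that idea, under our own name ``procrustes_join`` (not PyCogent's ``CombineMds`` internals):

```python
import numpy as np

def procrustes_join(base, tile, overlap):
    """Join two partial MDS solutions that share `overlap` points.

    The last `overlap` rows of `base` and the first `overlap` rows of
    `tile` are assumed to embed the same objects; an orthogonal
    Procrustes fit maps `tile` into the coordinate frame of `base`.
    """
    shared_base = base[-overlap:]
    shared_tile = tile[:overlap]
    mu_b, mu_t = shared_base.mean(axis=0), shared_tile.mean(axis=0)
    # Best rotation/reflection mapping the tile overlap onto the base overlap
    u, _, vt = np.linalg.svd((shared_tile - mu_t).T.dot(shared_base - mu_b))
    rot = u.dot(vt)
    mapped = (tile - mu_t).dot(rot) + mu_b
    return np.vstack([base, mapped[overlap:]])

# Two tiles of the same 2-D configuration, the second rotated and shifted
rng = np.random.RandomState(1)
pts = rng.rand(100, 2)
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
base = pts[:60]
tile = pts[40:100].dot(rot) + np.array([5.0, -2.0])
joined = procrustes_join(base, tile, overlap=20)
print(np.allclose(joined, pts))  # True: the tile is folded into base's frame
```

With noise-free tiles the join is exact; with approximate tile solutions, the residual misfit on the overlap is one indicator of how large the overlap needs to be.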

NMDS
====

NMDS (Non-metric Multidimensional Scaling) works on a matrix of pairwise distances. In this example, we generate a matrix based on the Euclidean distances of an abundance matrix.

>>> from cogent.cluster.nmds import NMDS
>>> from cogent.maths.distance_transform import dist_euclidean
>>> from numpy import array

We start with an abundance matrix: samples (rows) by sequences/species (columns).

>>> abundance = array(
... [[7,1,0,0,0,0,0,0,0],
... [4,2,0,0,0,1,0,0,0],
... [2,4,0,0,0,1,0,0,0],
... [1,7,0,0,0,0,0,0,0],
... [0,8,0,0,0,0,0,0,0],
... [0,7,1,0,0,0,0,0,0],#idx 5
... [0,4,2,0,0,0,2,0,0],
... [0,2,4,0,0,0,1,0,0],
... [0,1,7,0,0,0,0,0,0],
... [0,0,8,0,0,0,0,0,0],
... [0,0,7,1,0,0,0,0,0],#idx 10
... [0,0,4,2,0,0,0,3,0],
... [0,0,2,4,0,0,0,1,0],
... [0,0,1,7,0,0,0,0,0],
... [0,0,0,8,0,0,0,0,0],
... [0,0,0,7,1,0,0,0,0],#idx 15
... [0,0,0,4,2,0,0,0,4],
... [0,0,0,2,4,0,0,0,1],
... [0,0,0,1,7,0,0,0,0]], 'float')

Then compute a distance matrix using Euclidean distance, and perform NMDS on that matrix.

>>> euc_distmtx = dist_euclidean(abundance)
>>> nm = NMDS(euc_distmtx, verbosity=0)

The NMDS object provides the fitted points, which can be plotted if desired, along with the final stress, a measure of how well the configuration fits the input distances.

>>> pts = nm.getPoints()
>>> stress = nm.getStress()

With matplotlib installed, we could then do ``plt.plot(pts[:,0], pts[:,1])``.
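
Stress quantifies the mismatch between the fitted configuration and the input distances; lower is better, zero is a perfect fit. A minimal sketch of Kruskal's stress-1 formula (note that full NMDS compares against monotonically transformed disparities rather than the raw distances used here, and ``stress1`` is our own name, not a PyCogent function):

```python
import numpy as np

def stress1(orig_dists, pts):
    """Kruskal stress-1 of a point configuration against target distances."""
    fitted = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(axis=-1))
    iu = np.triu_indices(len(pts), k=1)    # each pair counted once
    resid = ((orig_dists[iu] - fitted[iu]) ** 2).sum()
    return np.sqrt(resid / (orig_dists[iu] ** 2).sum())

rng = np.random.RandomState(2)
pts = rng.rand(10, 2)
dists = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(axis=-1))
print(stress1(dists, pts))        # 0.0: the configuration fits perfectly
print(stress1(dists, pts * 2.0))  # 1.0: doubling coordinates doubles every distance
```

Translating or rotating the points leaves the stress unchanged, which is why an MDS solution is only determined up to such rigid motions.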

Hierarchical clustering (UPGMA, NJ)
===================================

Hierarchical clustering techniques work on a matrix of pairwise distances. Here we reuse the distance matrix from the NMDS example, relating samples of species to one another, first using UPGMA (NJ follows below).

.. note:: UPGMA should not be used for phylogenetic reconstruction.

>>> from cogent.cluster.UPGMA import upgma

We start with the distance matrix and a list of sample names:

>>> sample_names = ['sample'+str(i) for i in range(len(euc_distmtx))]

The ``upgma`` function expects the distances as a dictionary keyed by pairs of sample names, so convert the matrix:

>>> euc_distdict = {}
>>> for i in range(len(sample_names)):
...     for j in range(len(sample_names)):
...         euc_distdict[(sample_names[i],sample_names[j])]=euc_distmtx[i,j]

e.g.: ``euc_distdict[('sample6', 'sample5')] == 3.7416573867739413``

Now use this dict to build a UPGMA cluster.

>>> mycluster = upgma(euc_distdict)
>>> print mycluster.asciiArt() # doctest: +SKIP

The output is a large ASCII-art dendrogram relating all of the samples (not reproduced here).
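
UPGMA itself is a simple average-linkage agglomeration: repeatedly merge the two closest clusters, and set the distance from the merged cluster to every other cluster to the size-weighted average of its parts. A self-contained sketch under our own name ``upgma_sketch`` (nested tuples stand in for a real tree object):

```python
def upgma_sketch(distdict, names):
    """Minimal UPGMA sketch: average-linkage agglomeration.

    `distdict` maps (name_a, name_b) tuples to distances, in the same
    shape as `euc_distdict` above; the tree comes back as nested tuples.
    """
    sizes = dict((n, 1) for n in names)
    dists = dict((frozenset(pair), d) for pair, d in distdict.items()
                 if pair[0] != pair[1])
    nodes = list(names)
    while len(nodes) > 1:
        # Merge the closest pair of clusters
        a, b = min(((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
                   key=lambda pair: dists[frozenset(pair)])
        merged = (a, b)
        for c in nodes:
            if c not in (a, b):
                # Size-weighted average distance to the new cluster
                dists[frozenset((merged, c))] = (
                    sizes[a] * dists[frozenset((a, c))]
                    + sizes[b] * dists[frozenset((b, c))]) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        nodes = [c for c in nodes if c not in (a, b)] + [merged]
    return nodes[0]

demo = {('A', 'B'): 2.0, ('A', 'C'): 8.0, ('B', 'C'): 8.0,
        ('B', 'A'): 2.0, ('C', 'A'): 8.0, ('C', 'B'): 8.0}
print(upgma_sketch(demo, ['A', 'B', 'C']))  # ('C', ('A', 'B'))
```

The size weighting is what distinguishes UPGMA from WPGMA, where merged clusters contribute equally regardless of how many members they contain.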

We demonstrate saving this UPGMA cluster to a file.

>>> mycluster.writeToFile('test_upgma.tree')

We don't actually want to keep that file now, so we import the ``os`` module to delete it.

>>> import os
>>> os.remove('test_upgma.tree')

We can use neighbor joining (NJ) instead of UPGMA:

>>> from cogent.phylo.nj import nj
>>> njtree = nj(euc_distdict)
>>> print njtree.asciiArt() # doctest: +SKIP

Again, the output is a large ASCII-art dendrogram (not reproduced here).
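
Unlike UPGMA, NJ does not assume a constant rate of change: it picks the pair to join by minimizing the Q criterion, which corrects each distance by the row sums and thereby recovers true neighbors on additive distances. A minimal sketch of just that criterion (``nj_q_matrix`` is our name, not a PyCogent function):

```python
import numpy as np

def nj_q_matrix(dists):
    """Q criterion of neighbor joining.

    The pair (i, j) minimizing Q[i, j] = (n - 2) * d[i, j] - r[i] - r[j]
    (with r the row sums of d) is joined first.
    """
    n = dists.shape[0]
    r = dists.sum(axis=1)
    q = (n - 2) * dists - r[:, None] - r[None, :]
    np.fill_diagonal(q, np.inf)  # never "join" a taxon with itself
    return q

# Additive distances from a four-taxon tree whose cherries are (a, b)
# and (c, d); both true cherries attain the minimal Q value
d = np.array([[0., 3., 8., 9.],
              [3., 0., 9., 10.],
              [8., 9., 0., 9.],
              [9., 10., 9., 0.]])
q = nj_q_matrix(d)
print(q[0, 1], q[2, 3], q.min())  # -36.0 -36.0 -36.0
```

Note that the plain distance d[0, 1] = 3 is smaller than d[2, 3] = 9, yet Q ranks both cherries equally: the row-sum correction is what keeps long branches from misleading the join order.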