~ubuntu-branches/ubuntu/natty/python-cogent/natty

« back to all changes in this revision

Viewing changes to doc/cookbook/multivariate_data_analysis.rst

  • Committer: Bazaar Package Importer
  • Author(s): Steffen Moeller
  • Date: 2010-12-04 22:30:35 UTC
  • mfrom: (1.1.1 upstream)
  • Revision ID: james.westby@ubuntu.com-20101204223035-j11kinhcrrdgg2p2
Tags: 1.5-1
* Bumped standard to 3.9.1, no changes required.
* New upstream version.
  - major additions to Cookbook
  - added AlleleFreqs attribute to ensembl Variation objects.
  - added getGeneByStableId method to genome objects.
  - added Introns attribute to Transcript objects and an Intron class.
  - added Mann-Whitney test and a Monte-Carlo version
  - exploratory and confirmatory period estimation techniques (suitable for
    symbolic and continuous data)
  - Information theoretic measures (AIC and BIC) added
  - drawing of trees with collapsed nodes
  - progress display indicator support for terminal and GUI apps
  - added parser for illumina HiSeq2000 and GAiix sequence files as 
    cogent.parse.illumina_sequence.MinimalIlluminaSequenceParser.
  - added parser to FASTQ files, one of the output options for illumina's
    workflow, also added cookbook demo.
  - added functionality for parsing of SFF files without the Roche tools in
    cogent.parse.binary_sff
  - thousand fold performance improvement to nmds
  - >10-fold performance improvements to some Table operations

Show diffs side-by-side

added added

removed removed

Lines of Context:
 
1
.. _multivariate-analysis:
 
2
 
1
3
**************************
2
4
Multivariate data analysis
3
5
**************************
4
6
 
5
 
PCoA
6
 
====
7
 
 
8
 
*To be written.*
 
7
.. sectionauthor Justin Kuczynski, Catherine Lozupone, Andreas Wilm
 
8
 
 
9
Principal Coordinates Analysis
 
10
==============================
 
11
 
 
12
Principal Coordinates Analysis works on a matrix of pairwise distances. In this example we start by calculating the pairwise distances for a set of aligned sequences, though any distance matrix can be used with PCoA, relating any objects, not only sequences.
 
13
 
 
14
.. doctest::
 
15
 
 
16
    >>> from cogent import LoadSeqs
 
17
    >>> from cogent.phylo import distance
 
18
    >>> from cogent.cluster.metric_scaling import PCoA
 
19
 
 
20
Import a substitution model (or create your own).
 
21
 
 
22
.. doctest::
 
23
 
 
24
    >>> from cogent.evolve.models import HKY85
 
25
 
 
26
Load the alignment.
 
27
 
 
28
.. doctest::
 
29
 
 
30
    >>> al = LoadSeqs("data/test.paml")
 
31
 
 
32
Create a pairwise distances object calculator for the alignment, providing a substitution model instance.
 
33
 
 
34
.. doctest::
 
35
 
 
36
    >>> d = distance.EstimateDistances(al, submodel= HKY85())
 
37
    >>> d.run(show_progress=False)
 
38
 
 
39
Now use this matrix to perform principal coordinates analysis.
 
40
 
 
41
.. doctest::
 
42
 
 
43
    >>> PCoA_result = PCoA(d.getPairwiseDistances())
 
44
    >>> print PCoA_result # doctest: +SKIP
 
45
    ======================================================================================
 
46
            Type              Label  vec_num-0  vec_num-1  vec_num-2  vec_num-3  vec_num-4
 
47
    --------------------------------------------------------------------------------------
 
48
    Eigenvectors          NineBande      -0.02       0.01       0.04       0.01       0.00
 
49
    Eigenvectors           DogFaced      -0.04      -0.06      -0.01       0.00       0.00
 
50
    Eigenvectors          HowlerMon      -0.07       0.01       0.01      -0.02       0.00
 
51
    Eigenvectors              Mouse       0.20       0.01      -0.01      -0.00       0.00
 
52
    Eigenvectors              Human      -0.07       0.04      -0.03       0.01       0.00
 
53
     Eigenvalues        eigenvalues       0.05       0.01       0.00       0.00      -0.00
 
54
     Eigenvalues  var explained (%)      85.71       9.60       3.73       0.95      -0.00
 
55
    --------------------------------------------------------------------------------------
 
56
 
 
57
We can save these results to a file in a delimited format (we'll use tab here) that can be opened up in any data analysis program, like R or Excel. Here the principal coordinates can be plotted against each other for visualization.
 
58
 
 
59
.. doctest::
 
60
 
 
61
    >>> PCoA_result.writeToFile('PCoA_results.txt',sep='\t')
 
62
 
 
63
 
 
64
Fast-MDS
 
65
========
 
66
 
 
67
The eigendecomposition step in Principal Coordinates Analysis (PCoA)
 
68
doesn't scale very well. And for thousands of objects the computation
 
69
of all pairwise distance alone can get very slow, because it scales
 
70
quadratically. For a huge number of objects this might even pose a
 
71
memory problem. Fast-MDS methods approximate an MDS/PCoA solution and
 
72
do not suffer from these problems.
 
73
 
 
74
First, let's simulate a big data sample by creating 1500 objects living
 
75
in 10 dimension. Then compute their pairwise distances and perform a
 
76
principal coordinates analysis on it. Note that the last two steps might take
 
77
already a couple of minutes.
 
78
 
 
79
.. doctest::
 
80
 
 
81
    >>> from cogent.maths.distance_transform import dist_euclidean
 
82
    >>> from cogent.cluster.metric_scaling import principal_coordinates_analysis
 
83
    >>> from numpy import random
 
84
    >>> objs = random.random((1500, 10))
 
85
    >>> distmtx = dist_euclidean(objs)
 
86
    >>> full_pcoa = principal_coordinates_analysis(distmtx)
 
87
 
 
88
 
 
89
PyCogent implements two fast MDS approximations called
 
90
Split-and-Combine MDS (SCMDS, still in development) and Nystrom (also known as
 
91
Landmark-MDS). Both can easily handle many thousands objects. One
 
92
reason is that they don't require all distances to be computed.
 
93
Instead you pass down the distance function and only required
 
94
distances are calculated.
 
95
 
 
96
Nystrom works by using a so called seed-matrix, which contains (only) k by
 
97
n distances, where n is the total number of objects and k<<n. The
 
98
bigger k, the more exact the approximation will be and the longer the
 
99
computation will take. One further difference to normal Principal
 
100
Coordinates Analysis is, that no eigenvalues, but only approximate
 
101
eigenvectors of length dim will be returned.
 
102
 
 
103
.. doctest::
 
104
 
 
105
   >>> from cogent.cluster.approximate_mds import nystrom
 
106
   >>> from random import sample
 
107
   >>> from numpy import array
 
108
   >>> n_seeds = 100
 
109
   >>> seeds = array(sample(distmtx,n_seeds))
 
110
   >>> dims = 3
 
111
   >>> nystrom_3d = nystrom(seeds, dims)
 
112
   
 
113
A good rule of thumb for picking n_seeds is log(n), log(n)**2 or
 
114
sqrt(n).
 
115
 
 
116
 
 
117
SCMDS works by dividing the pairwise distance matrix into chunks of
 
118
certain size and overlap. MDS is performed on each chunk individually
 
119
and the resulting solutions are progressively joined. As in the case
 
120
of Nystrom not all distances will be computed, but only those of the
 
121
overlapping tiles. The size and overlap of the tiles determine the
 
122
quality of the approximation as well as the run-time.
 
123
 
 
124
.. doctest::
 
125
 
 
126
   >>> from cogent.cluster.approximate_mds import CombineMds, cmds_tzeng
 
127
   >>> combine_mds = CombineMds()
 
128
   >>> tile_overlap = 100
 
129
   >>> dims = 3
 
130
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[0:500,0:500], dims)
 
131
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
 
132
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[400:900,400:900], dims)
 
133
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
 
134
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[800:1300,800:1300], dims)
 
135
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
 
136
   >>> tile_eigvecs, tile_eigvals = cmds_tzeng(distmtx[1200:1500,1200:1500], dims)
 
137
   >>> combine_mds.add(tile_eigvecs, tile_overlap)
 
138
   >>> combien_mds_3d = combine_mds.getFinalMDS()
 
139
 
 
140
If you want to know how good the returned approximations are, you will
 
141
have to perform principal_coordinates_analysis() on a smallish
 
142
submatrix and perform a goodness_of_fit analysis.
 
143
 
 
144
 
9
145
 
10
146
NMDS
11
147
====
12
148
 
13
 
*To be written.*
 
149
NMDS (Non-metric MultiDimensional Scaling) works on a matrix of pairwise distances. In this example, we generate a matrix based on the euclidean distances of an abundance matrix.
 
150
 
 
151
.. doctest::
 
152
 
 
153
    >>> from cogent.cluster.nmds import NMDS
 
154
    >>> from cogent.maths.distance_transform import dist_euclidean
 
155
    >>> from numpy import array
 
156
 
 
157
We start with an abundance matrix, samples (rows) by sequences/species (cols)
 
158
 
 
159
.. doctest::
 
160
 
 
161
    >>> abundance = array(
 
162
    ...        [[7,1,0,0,0,0,0,0,0],
 
163
    ...        [4,2,0,0,0,1,0,0,0],
 
164
    ...        [2,4,0,0,0,1,0,0,0],
 
165
    ...        [1,7,0,0,0,0,0,0,0],
 
166
    ...        [0,8,0,0,0,0,0,0,0],
 
167
    ...        [0,7,1,0,0,0,0,0,0],#idx 5
 
168
    ...        [0,4,2,0,0,0,2,0,0],
 
169
    ...        [0,2,4,0,0,0,1,0,0],
 
170
    ...        [0,1,7,0,0,0,0,0,0],
 
171
    ...        [0,0,8,0,0,0,0,0,0],
 
172
    ...        [0,0,7,1,0,0,0,0,0],#idx 10
 
173
    ...        [0,0,4,2,0,0,0,3,0],
 
174
    ...        [0,0,2,4,0,0,0,1,0],
 
175
    ...        [0,0,1,7,0,0,0,0,0],
 
176
    ...        [0,0,0,8,0,0,0,0,0],
 
177
    ...        [0,0,0,7,1,0,0,0,0],#idx 15
 
178
    ...        [0,0,0,4,2,0,0,0,4],
 
179
    ...        [0,0,0,2,4,0,0,0,1],
 
180
    ...        [0,0,0,1,7,0,0,0,0]], 'float')
 
181
 
 
182
Then compute a distance matrix using euclidean distance, and perform nmds on that matrix
 
183
 
 
184
.. doctest::
 
185
 
 
186
    >>> euc_distmtx = dist_euclidean(abundance)
 
187
    >>> nm = NMDS(euc_distmtx, verbosity=0)
 
188
 
 
189
The NMDS object provides a list of points, which can be plotted if desired
 
190
 
 
191
.. doctest::
 
192
 
 
193
    >>> pts = nm.getPoints()
 
194
    >>> stress = nm.getStress()
 
195
 
 
196
With matplotlib installed, we could then do ``plt.plot(pts[:,0], pts[:,1])``
14
197
 
15
198
Hierarchical clustering (UPGMA, NJ)
16
199
===================================
17
200
 
18
 
*To be written.*
19
 
 
20
 
k-means clustering
21
 
==================
22
 
 
23
 
*To be written.*
24
 
 
 
201
Hierarchical clustering techniques work on a matrix of pairwise distances. In this case, we use the distance matrix from the NMDS example, relating samples of species to one another using UPGMA (NJ below).
 
202
 
 
203
.. note:: UPGMA should not be used for phylogenetic reconstruction.
 
204
 
 
205
.. doctest::
 
206
 
 
207
    >>> from cogent.cluster.UPGMA import upgma
 
208
 
 
209
we start with the distance matrix and list of sample names:
 
210
 
 
211
.. doctest::
 
212
 
 
213
    >>> sample_names = ['sample'+str(i) for i in range(len(euc_distmtx))]
 
214
 
 
215
make 2d dict:
 
216
 
 
217
.. doctest::
 
218
 
 
219
    >>> euc_distdict = {}
 
220
    >>> for i in range(len(sample_names)):
 
221
    ...    for j in range(len(sample_names)):
 
222
    ...        euc_distdict[(sample_names[i],sample_names[j])]=euc_distmtx[i,j]
 
223
 
 
224
e.g.: ``euc_distdict[('sample6', 'sample5')] == 3.7416573867739413``
 
225
 
 
226
Now use this matrix to build a UPGMA cluster.
 
227
 
 
228
.. doctest::
 
229
 
 
230
    >>> mycluster = upgma(euc_distdict)
 
231
    >>> print mycluster.asciiArt()
 
232
                                                      /-sample10
 
233
                                            /edge.3--|
 
234
                                  /edge.2--|          \-sample8
 
235
                                 |         |
 
236
                                 |          \-sample9
 
237
                        /edge.1--|
 
238
                       |         |                    /-sample12
 
239
                       |         |          /edge.5--|
 
240
                       |         |         |          \-sample11
 
241
                       |          \edge.4--|
 
242
                       |                   |          /-sample6
 
243
                       |                    \edge.6--|
 
244
              /edge.0--|                              \-sample7
 
245
             |         |
 
246
             |         |                                        /-sample15
 
247
             |         |                              /edge.10-|
 
248
             |         |                    /edge.9--|          \-sample14
 
249
             |         |                   |         |
 
250
             |         |          /edge.8--|          \-sample13
 
251
             |         |         |         |
 
252
             |          \edge.7--|          \-sample16
 
253
    -root----|                   |
 
254
             |                   |          /-sample17
 
255
             |                    \edge.11-|
 
256
             |                              \-sample18
 
257
             |
 
258
             |                              /-sample5
 
259
             |                    /edge.14-|
 
260
             |          /edge.13-|          \-sample4
 
261
             |         |         |
 
262
             |         |          \-sample3
 
263
              \edge.12-|
 
264
                       |                    /-sample2
 
265
                       |          /edge.16-|
 
266
                        \edge.15-|          \-sample1
 
267
                                 |
 
268
                                  \-sample0
 
269
 
 
270
We demonstrate saving this UPGMA cluster to a file.
 
271
 
 
272
.. doctest::
 
273
 
 
274
    >>> mycluster.writeToFile('test_upgma.tree')
 
275
 
 
276
..
 
277
    We don't actually want to keep that file now, so I'm importing the ``os`` module to delete it.
 
278
 
 
279
.. doctest::
 
280
    :hide:
 
281
 
 
282
    >>> import os
 
283
    >>> os.remove('test_upgma.tree')
 
284
 
 
285
We can use neighbor joining (NJ) instead of UPGMA:
 
286
 
 
287
.. doctest::
 
288
 
 
289
    >>> from cogent.phylo.nj import nj
 
290
    >>> njtree = nj(euc_distdict)
 
291
    >>> print njtree.asciiArt()
 
292
              /-sample16
 
293
             |
 
294
             |                    /-sample12
 
295
             |          /edge.2--|
 
296
             |         |         |          /-sample13
 
297
             |         |          \edge.1--|
 
298
             |         |                   |          /-sample14
 
299
             |         |                    \edge.0--|
 
300
             |         |                              \-sample15
 
301
             |         |
 
302
             |         |                              /-sample7
 
303
             |-edge.14-|                    /edge.5--|
 
304
             |         |                   |         |          /-sample8
 
305
             |         |                   |          \edge.4--|
 
306
             |         |          /edge.6--|                   |          /-sample10
 
307
             |         |         |         |                    \edge.3--|
 
308
             |         |         |         |                              \-sample9
 
309
    -root----|         |         |         |
 
310
             |         |         |          \-sample11
 
311
             |         |         |
 
312
             |          \edge.13-|                    /-sample6
 
313
             |                   |                   |
 
314
             |                   |                   |                              /-sample4
 
315
             |                   |          /edge.10-|                    /edge.7--|
 
316
             |                   |         |         |          /edge.8--|          \-sample3
 
317
             |                   |         |         |         |         |
 
318
             |                   |         |          \edge.9--|          \-sample5
 
319
             |                    \edge.12-|                   |
 
320
             |                             |                    \-sample2
 
321
             |                             |
 
322
             |                             |          /-sample0
 
323
             |                              \edge.11-|
 
324
             |                                        \-sample1
 
325
             |
 
326
             |          /-sample18
 
327
              \edge.15-|
 
328
                        \-sample17