2
2
Analysis of sequence composition
3
3
********************************
5
.. sectionauthor:: Jesse Zaneveld
7
PyCogent provides several tools for analyzing the composition of DNA, RNA, or
13
Let us say that we wish to study the sequence composition of the *Y. pseudotuberculosis* PB1 DNA Polymerase III beta subunit.
15
First we input the sequence as a string.
20
... """ atgaaatttatcattgaacgtgagcatctgctaaaaccactgcaacaggtcagtagcccg
21
... ctgggtggacgccctacgttgcctattttgggtaacttgttgctgcaagtcacggaaggc
22
... tctttgcggctgaccggtaccgacttggagatggagatggtggcttgtgttgccttgtct
23
... cagtcccatgagccgggtgctaccacagtacccgcacggaagttttttgatatctggcgt
24
... ggtttacccgaaggggcggaaattacggtagcgttggatggtgatcgcctgctagtgcgc
25
... tctggtcgcagccgtttctcgctgtctaccttgcctgcgattgacttccctaatctggat
26
... gactggcagagtgaggttgaattcactttaccgcaggctacgttaaagcgtctgattgag
27
... tccactcagttttcgatggcccatcaggatgtccgttattatttgaacggcatgctgttt
28
... gagaccgaaggcgaagagttacgtactgtggcgaccgatgggcatcgcttggctgtatgc
29
... tcaatgcctattggccagacgttaccctcacattcggtgatcgtgccgcgtaaaggtgtg
30
... atggagctggttcggttgctggatggtggtgatacccccttgcggctgcaaattggcagt
31
... aataatattcgtgctcatgtgggcgattttattttcacatctaagctggttgatggccgt
32
... ttcccggattatcgccgcgtattgccgaagaatcctgataaaatgctggaagccggttgc
33
... gatttactgaaacaggcattttcgcgtgcggcaattctgtcaaatgagaagttccgtggt
34
... gttcggctctatgtcagccacaatcaactcaaaatcactgctaataatcctgaacaggaa
35
... gaagcagaagagatcctcgatgttagctacgaggggacagaaatggagatcggtttcaac
36
... gtcagctatgtgcttgatgtgctaaatgcactgaagtgcgaagatgtgcgcctgttattg
37
... actgactctgtatccagtgtgcagattgaagacagcgccagccaagctgcagcctatgtc
38
... gtcatgccaatgcgtttgtag"""
40
To check that our results are reasonable, we can also load a small example string.
44
>>> example_seq = "GCGTTT"
46
In order to calculate compositional statistics, we need to import one of the ``Usage`` objects from ``cogent.core.usage``, create an object from our string, and normalize the counts contained in the string into frequencies. ``Usage`` objects include ``BaseUsage``, ``PositionalBaseUsage``, ``CodonUsage``, and ``AminoAcidUsage``.
48
Let us start with the ``BaseUsage`` object. The first few steps will be the same for the other Usage objects, however (as we will see below).
56
GC content is one commonly used compositional statistic. To calculate the total GC content of our gene, we will need to initiate and normalize a ``BaseUsage`` object.
60
>>> from cogent.core.usage import BaseUsage
61
>>> example_bu = BaseUsage(example_seq)
62
>>> # Print raw counts
63
>>> print example_bu.content("GC")
65
>>> example_bu.normalize()
66
>>> print example_bu.content("GC")
69
We can now visually verify that the reported GC contents are correct, and use the same technique on our full sequence.
73
>>> y_pseudo_bu = BaseUsage(y_pseudo_seq)
74
>>> # Print raw counts
75
>>> y_pseudo_bu.content("GC")
77
>>> y_pseudo_bu.normalize()
78
>>> print y_pseudo_bu.content("GC")
81
Positional GC content of Codons
82
-------------------------------
84
When analyzing protein coding genes, it is often useful to subdivide the GC content by codon position. In particular, the 3rd codon position ``CodonUsage`` objects allow us to calculate the GC content at each codon position.
86
First, let us calculate the GC content for the codons in the example sequence as follows.
90
>>> # Import CodonUsage object
91
>>> from cogent.core.usage import CodonUsage
92
>>> # Initiate & normalize CodonUsage object
93
>>> example_seq_cu = CodonUsage(example_seq)
94
>>> example_seq_cu.normalize()
95
>>> GC,P1,P2,P3 = example_seq_cu.positionalGC()
97
Here, GC is the overall GC content for the sequence, while P1, P2, and P3 are the GC content at the first, second, and third codon positions, respectively.
99
Printing the results for the example gives the following results.
112
We can then do the same for our biological sequence.
116
>>> y_pseudo_cu = CodonUsage(y_pseudo_seq)
117
>>> y_pseudo_cu.normalize()
118
>>> y_pseudo_GC = y_pseudo_cu.positionalGC()
119
>>> print y_pseudo_GC
120
[0.51874999999999993, 0.58437499999999987, 0.47500000000000009, 0.49687499999999996]
122
These results could then be fed into downstream analyses.
124
One important note is that ``CodonUsage`` objects calculate the GC content of codons within nucleotide sequences, rather than the full GC content. Therefore, ``BaseUsage`` rather than ``CodonUsage`` objects should be used for calculating the GC content of non-coding sequences.
129
A more detailed view of composition incorporates the relative counts or frequencies of all bases. We can calculate total base usage as follows.
133
>>> from cogent.core.usage import BaseUsage
134
>>> example_bu = BaseUsage(example_seq)
135
>>> # Print raw counts
136
>>> for k in example_bu.RequiredKeys:
137
... print k, example_bu[k]
142
>>> example_bu.normalize()
143
>>> for k in example_bu.RequiredKeys:
144
... print k, example_bu[k]
153
The ``DinucUsage`` object allows us to calculate Dinucleotide usage for our sequence.
155
Dinucleotide usage can be calculated using overlapping, non-overlapping, or '3-1' dinucleotides.
157
Given the sequence "AATTAAGCC", each method will count dinucleotide usage differently. Overlapping dinucleotide usage will count "AA", "AT", "TT", "TA", "AA", "AG", "GC", "CC". Non-overlapping dinucleotide usage will count "AA", "TT", "AA", "GC" 3-1 dinucleotide usage will count "TT", "AC".
159
Calculating the GC content at the third and first codon positions ("3-1" usage) is useful for some applications, such as gene transfer detection, because changes at these positions tend to produce the most conservative amino acid substitutions, and thus are thought to better reflect mutational (rather than selective) pressure.
161
Overlapping dinucleotide content
162
--------------------------------
164
To calculate overlapping dinucleotide usage for our *Y. pseudotuberculosis* PB1 sequence.
168
>>> from cogent.core.usage import DinucUsage
169
>>> du = DinucUsage(y_pseudo_seq, Overlapping=True)
172
We can inspect individual dinucleotide usages and confirm that the results add to 100% as follows
177
>>> for k in du.RequiredKeys:
196
>>> print "Total:",total
199
Non-overlapping Dinucleotide Content
200
------------------------------------
202
To calculate non-overlapping dinucleotide usage we simply change the ``Overlapping`` parameter to ``False`` when initiating the ``DinucUsage`` object.
206
>>> from cogent.core.usage import DinucUsage
207
>>> du_no = DinucUsage(y_pseudo_seq, Overlapping=False)
208
>>> du_no.normalize()
210
>>> for k in du_no.RequiredKeys:
211
... print k, du_no[k]
212
... total += du_no[k]
229
>>> print "Total:",total
232
'3-1' Dinucleotide Content
233
--------------------------
235
To calculate dinucleotide usage considering only adjacent first and third codon positions, we set the Overlapping parameter to '3-1' when constructing our ``DinucUsage`` object
239
>>> from cogent.core.usage import DinucUsage
240
>>> du_3_1 = DinucUsage(y_pseudo_seq, Overlapping='3-1')
241
>>> du_3_1.normalize()
243
>>> for k in du_3_1.RequiredKeys:
244
... print k, du_3_1[k]
245
... total += du_3_1[k]
262
>>> print "Total:",total
265
Comparing dinucleotide usages
266
-----------------------------
268
Above, we noted that there are several ways to calculate dinucleotide usages on a single sequence, and that the choice of methods changes the reported frequencies somewhat. How could we quantify the effect this choice make on the result?
270
One way to test this is to calculate the Euclidean distance between the resulting frequencies. We can do this using the dinucleotide usage's
274
>>> du_vs_du_3_1_dist = du.distance(du_3_1)
276
As required of a true distance, the results are independent of the direction of the calculation.
280
>>> du_3_1_vs_du_dist = du_3_1.distance(du)
281
>>> print du_3_1_vs_du_dist == du_vs_du_3_1_dist
284
Caution regarding unnormalized distances
285
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
287
Note that in this case we have already called ``du.normalize()`` on each ``DinucUsage`` object. You MUST call ``du.normalize()`` before calculating distances. Otherwise the distance calculated will be for the dinucleotide counts, rather than frequencies. Distances of counts can be non-zero even for sequences with identical dinucleotide usage, if those sequences are of different lengths.
5
294
Codon usage analyses
6
295
====================
297
In addition to allowing a more detailed examination of GC content in coding sequences, ``CodonUsage`` objects (as the name implies) let us examine the codon usage of our sequence.
301
>>> from cogent.core.usage import CodonUsage
302
>>> y_pseudo_cu = CodonUsage(y_pseudo_seq)
303
>>> # Print raw counts
304
>>> for k in y_pseudo_cu.RequiredKeys:
305
... print k, y_pseudo_cu[k]
316
Note that before normalization the ``CodonUsage`` object holds raw counts of results. However, for most purposes, we will want frequencies, so we normalize the counts.
320
>>> y_pseudo_cu.normalize()
321
>>> # Print normalized frequencies
322
>>> for k in y_pseudo_cu.RequiredKeys:
323
... print k, y_pseudo_cu[k]
332
UAU 0.0225988700565...
334
Relative Synonymous Codon Usage
335
-------------------------------
337
The RSCU or relative synonymous codon usage metric divides the frequency of each codon by the total frequency of all codons encoding the same amino acid.
341
>>> y_pseudo_cu.normalize()
342
>>> y_pseudo_rscu = y_pseudo_cu.rscu()
343
>>> # Print rscu frequencies
344
>>> for k in y_pseudo_rscu.keys():
345
... print k, y_pseudo_rscu[k]
354
GUA 0.190476190476...