# $Id: SiteMatrixI.pm,v 1.16.4.1 2006/10/02 23:10:22 sendu Exp $
Bio::Matrix::PSM::SiteMatrixI - SiteMatrix interface, holds a position
scoring matrix (or position weight matrix)
Bio::Matrix::PSM::SiteMatrixI - SiteMatrixI implementation, holds a
position scoring matrix (or position weight matrix) and log-odds
use Bio::Matrix::PSM::SiteMatrix;

# Create from memory by supplying a probability matrix hash,
# with the vectors given either as strings or as arrays
my ($a,$c,$g,$t,$score,$ic,$mid)=@_;
# where $a,$c,$g and $t are either arrayrefs or strings, for example:
($a,$c,$g,$t,$score,$ic,$mid)=
   ('05a011','110550','400001','100104',0.001,19.2,'CRE1');
# where 'a' stands for all (frequency=1), see the explanation below
my %param=(-pA=>$a,-pC=>$c,-pG=>$g,-pT=>$t,-IC=>$ic,
           -e_val=>$score, -id=>$mid);
my $site=Bio::Matrix::PSM::SiteMatrix->new(%param);

# Or get it from a file:
use Bio::Matrix::PSM::IO;
my $psmIO=Bio::Matrix::PSM::IO->new(-file=>$file, -format=>'transfac');
while (my $psm=$psmIO->next_psm) {
    # Now we have a Bio::Matrix::PSM::Psm object, see
    # Bio::Matrix::PSM::PsmI for details
    my $matrix=$psm->matrix;
    # This is a Bio::Matrix::PSM::SiteMatrix object now
}

# Get a simple consensus, where the alphabet is {A,C,G,T,N}, choosing
# the highest probability or N if the probability is too low
my $consensus=$site->consensus;

# Getting/using the regular expression
my $regexp=$site->regexp;
my $count = () = $seq =~ /$regexp/g;   # number of non-overlapping matches
print "Motif $mid is present $count times in this sequence\n";

# You cannot use this module directly; see Bio::Matrix::PSM::SiteMatrix
# for an example implementation

SiteMatrix is designed to provide some basic methods when working with
position scoring (weight) matrices, such as transcription factor
binding sites. A DNA PSM consists of four vectors with frequencies
{A,C,G,T}. This is the minimum information you should provide to
construct a PSM object. The vectors can be provided as strings with
frequencies where the frequency is {0..a} and a=1. This is MEME's
compressed representation of a matrix and it is quite useful when
working with a relational DB. If arrays are provided as input
(references to arrays, actually) they can hold any numbers, real or
integer (frequency or count).

When creating the object the constructor will check for positions that
equal 0. If one is found it will increase the count for all positions
by one and recalculate the frequency. Potential bug: if you are using
frequencies and one of the positions is 0, the values will change
significantly. However, you should never have a frequency that equals
0.

Throws an exception if: you mix array and string input (for example
the A matrix is given as an array and C as a string); the position
vector is (0,0,0,0); one of the probability vectors is shorter than
the rest.

SiteMatrix is designed to provide some basic methods when working with position
scoring (weight) matrices, such as transcription factor binding sites. A DNA PSM
consists of four vectors with frequencies {A,C,G,T}. This is the minimum
information you should provide to construct a PSM object. The vectors can be
provided as strings holding the frequencies multiplied by 10 and rounded to an
integer, going from 0 to a, where 'a' represents the maximum (10). This is like
MEME's compressed representation of a matrix and it is quite useful when working
with a relational DB. If arrays are provided as input (references to arrays,
actually) they can hold any numbers, real or integer (frequency or count).
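
The string form described above can be pictured with a small decoder. This is only a sketch: decode_freq_string is a hypothetical helper, not a method of this module, and it assumes each character is a digit 0..9 or 'a' (10) standing for the frequency times 10.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): decode a compressed
# frequency string where each character is 0..9 or 'a' (=10),
# i.e. the frequency multiplied by 10 and rounded to an integer.
sub decode_freq_string {
    my ($str) = @_;
    return map {
        my $ch = lc $_;
        die "Unexpected character '$_' in frequency string\n"
            unless $ch =~ /^[0-9a]$/;
        ($ch eq 'a' ? 10 : $ch) / 10;
    } split //, $str;
}

my @pA = decode_freq_string('05a011');
print join(',', @pA), "\n";    # 0,0.5,1,0,0.1,0.1
```

One character per position keeps the whole vector in a single database column, at the cost of one decimal digit of precision.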

When creating the object you can ask the constructor to make a simple
pseudo-count correction by adding a number (typically 1) to all positions (with
the -correction option). The frequencies are calculated after the number has
been added. Only use the correction when you supply counts, not frequencies.
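
To illustrate what such a correction does to the numbers, here is a minimal sketch. counts_to_freqs is a hypothetical helper and the hash-of-arrays layout is an assumption made for the example; the module's own constructor may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: add a pseudo count to every cell of a count
# matrix, then convert each position (column) to frequencies.
sub counts_to_freqs {
    my ($counts, $correction) = @_;    # $counts: { A=>[...], C=>[...], G=>[...], T=>[...] }
    $correction = 0 unless defined $correction;
    my @bases = qw(A C G T);
    my $width = scalar @{ $counts->{A} };
    my %freq;
    for my $i (0 .. $width - 1) {
        my $total = 0;
        $total += $counts->{$_}[$i] + $correction for @bases;
        die "Position $i is (0,0,0,0)\n" if $total == 0;
        $freq{$_}[$i] = ($counts->{$_}[$i] + $correction) / $total for @bases;
    }
    return \%freq;
}

my $freq = counts_to_freqs(
    { A => [0, 8], C => [4, 0], G => [0, 0], T => [0, 0] }, 1);
printf "pA=%.3f pC=%.3f\n", $freq->{A}[0], $freq->{C}[0];   # pA=0.125 pC=0.625
```

With counts (0,4,0,0) and a correction of 1 the column becomes (1,5,1,1), so no frequency is 0. Adding 1 to values that are already frequencies would distort them badly, which is why the correction is meant for counts only.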

Throws an exception if: you mix array and string input (for example the A
matrix is given as an array and C as a string); the position vector is
(0,0,0,0); one of the probability vectors is shorter than the rest.

Summary of the methods I use most frequently (details below):

  IC - information content. Returns a real number.
  id - identifier. Returns a string.
  accession - accession number. Returns a string.
  next_pos - returns the sequence probability for each letter, the IUPAC
         symbol, the IUPAC probability and the simple sequence
         consensus letter for this position. Rewinds at the end. Returns a hash.
  pos - current position get/set. Returns an integer.
  regexp - constructs a regular expression based on the IUPAC consensus.
         For example AGWV will be [Aa][Gg][AaTt][AaCcGg]
  width - site width.
  get_string - gets the probability vector for a single base as a string.
  get_array - gets the probability vector for a single base as an array.
  get_logs_array - gets the log-odds vector for a single base as an array.
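
The regexp entry in the summary can be illustrated with a standalone sketch. iupac_to_regexp is a hypothetical helper written for this example; it simply expands each IUPAC symbol into a case-insensitive character class and is not the module's own implementation.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes and the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',  K => 'GT',  M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: expand each IUPAC symbol into a
# case-insensitive character class, e.g. W -> [AaTt].
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $re = '';
    for my $sym (split //, uc $consensus) {
        my $bases = $IUPAC{$sym} or die "Unknown IUPAC symbol '$sym'\n";
        $re .= '[' . join('', map { uc($_) . lc($_) } split //, $bases) . ']';
    }
    return $re;
}

my $re = iupac_to_regexp('AGWV');
print "$re\n";                          # [Aa][Gg][AaTt][AaCcGg]

my $seq   = 'ttAGACccagtg';
my $count = () = $seq =~ /$re/g;        # count non-overlapping matches
print "Motif found $count times\n";
```

The `() =` idiom forces list context, so the match returns every hit and the outer scalar assignment counts them.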

The new methods may be of interest to anyone who wants to store a PSM in a
relational database without creating an entry for each position: they compress
a PSM vector into a string, usually losing less than 1% of the data.
This can be done with:

   my $str=$matrix->get_compressed_freq('A');

or

   my $str=$matrix->get_compressed_logs('A');

Loading from a database should be done with new, but is not yet implemented.
However, you can still uncompress such a string with:

   my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1,1);    # for PSM

or

   my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1000,2); # for log odds
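
As a rough picture of the one-byte-per-position idea, the following sketch compresses an unsigned vector and restores it. compress_unsigned and uncompress_unsigned are hypothetical stand-ins, and the linear scaling to 0..255 is an assumption; the real internal helpers may scale or round differently.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the internal helpers; the scaling used
# here is an assumption, not the module's actual code.
sub compress_unsigned {
    my ($array, $max) = @_;
    return join '', map {
        my $v = $_ < 0 ? 0 : $_ > $max ? $max : $_;    # clamp to 0..$max
        chr(int($v * 255 / $max + 0.5));               # one byte per position
    } @$array;
}

sub uncompress_unsigned {
    my ($str, $max) = @_;
    return map { ord($_) * $max / 255 } split //, $str;
}

my @freq = (0.05, 0.5, 1, 0, 0.13);
my $str  = compress_unsigned(\@freq, 1);
my @back = uncompress_unsigned($str, 1);
printf "%.3f -> %.3f\n", $freq[$_], $back[$_] for 0 .. $#freq;
```

With 256 levels the round-trip error in this sketch stays below 0.2% of the maximum value, which is in line with the small loss mentioned above.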

package Bio::Matrix::PSM::SiteMatrixI;
use Bio::Root::RootI;
@ISA=qw(Bio::Root::RootI);

 Usage   : my $site=new Bio::Matrix::PSM::SiteMatrix
               (-pA=>$a,-pC=>$c,-pG=>$g,-pT=>$t,
                -IC=>$ic,-e_val=>$score, -id=>$mid);
 Function: Creates a new Bio::Matrix::PSM::SiteMatrix object from memory
 Throws  : If inconsistent data for all vectors (A,C,G and T) is provided,
           if you mix input types (string vs array) or if a position freq is 0.
 Returns : Bio::Matrix::PSM::SiteMatrix object

use base qw(Bio::Root::RootI);

 Usage   : $self->calc_weight({A=>0.2562,C=>0.2438,G=>0.2432,T=>0.2568});
 Function: Recalculates the PSM (or weights) based on the PFM (the frequency matrix)
           and user supplied background model.
 Throws  : if no model is supplied
 Args    : reference to a hash with background frequencies for A,C,G and T

my $self = shift;
$self->throw_not_implemented();
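
One common way to turn a frequency matrix and a background model into weights is the log2 ratio of frequency to background. The sketch below assumes that formula and a hash-of-arrays layout; log_odds is a hypothetical helper, and an implementing class is free to calculate its weights differently.

```perl
use strict;
use warnings;

# Hypothetical helper: log2(frequency / background) for each base and
# position. Frequencies must be non-zero, since log(0) is fatal in Perl.
sub log_odds {
    my ($freq, $bg) = @_;
    die "No background model supplied\n" unless ref $bg eq 'HASH';
    my %logs;
    for my $base (qw(A C G T)) {
        $logs{$base} =
            [ map { log($_ / $bg->{$base}) / log(2) } @{ $freq->{$base} } ];
    }
    return \%logs;
}

my $freq = {
    A => [0.5,   0.25], C => [0.25,  0.25],
    G => [0.125, 0.25], T => [0.125, 0.25],
};
my $bg   = { A => 0.25, C => 0.25, G => 0.25, T => 0.25 };
my $logs = log_odds($freq, $bg);
printf "lA=%.2f lG=%.2f\n", $logs->{A}[0], $logs->{G}[0];   # lA=1.00 lG=-1.00
```

A base twice as frequent as background scores +1, one half as frequent scores -1, and a base at background frequency scores 0.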

 Usage   : my %base=$site->next_pos;
 Function: Retrieves the next position features: frequencies and weights for
           A,C,G,T, the main letter (as in consensus) and the
           probability for this letter to occur at this position and
           the current position
 Returns : hash (pA,pC,pG,pT,lA,lC,lG,lT,base,prob,rel)

my $self = shift;
$self->throw_not_implemented();

=head2 _compress_array

 Title   : _compress_array
 Function: Will compress an array of real signed numbers to a string (i.e. a vector
           of bytes): -127 to +127 for bi-directional (signed) and 0..255 for unsigned
 Example : Internal stuff
 Args    : array reference, followed by a max value and direction
           (optional, default 1 = unsigned; any other value is signed)

=cut

sub _compress_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _uncompress_string

 Title   : _uncompress_string
 Function: Will uncompress a string (vector of bytes) to create an array of real
           signed numbers (the opposite of _compress_array)
 Example : Internal stuff
 Returns : array
 Args    : string, followed by a max value and direction
           (optional, default 1 = unsigned; any other value is signed)

=cut

sub _uncompress_string {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_freq

 Title   : get_compressed_freq
 Function: A method to provide a compressed frequency vector. It uses one byte to
           code the frequency for one of the probability vectors for one position.
           Useful for relational databases. Improvement on the previous 0..a coding.
 Example : my $strA=$self->get_compressed_freq('A');

=cut

sub get_compressed_freq {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_logs

 Title   : get_compressed_logs
 Function: A method to provide a compressed log-odds vector. It uses one byte to
           code the log value for one of the log-odds vectors for one position.
 Example : my $strA=$self->get_compressed_logs('A');

=cut

sub get_compressed_logs {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 sequence_match_weight

 Title   : sequence_match_weight
 Function: This method will calculate the score of a match, based on the PWM
           if one is associated with the matrix object. Returns undef if no
           PWM data is available.
 Throws  : if the length of the sequence is different from the matrix width
 Example : my $score=$matrix->sequence_match_weight('ACGGATAG');
 Returns : Floating point

=cut

sub sequence_match_weight {
    my $self = shift;
    $self->throw_not_implemented();
}
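
For readers implementing this method, a match score is typically the sum of the weights of the observed base at each position. The sketch below assumes that definition and a hash of log-odds arrays keyed by base; match_weight is a hypothetical helper, not the interface method itself.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the log-odds weight of the base seen at
# each position of $seq.
sub match_weight {
    my ($logs, $seq) = @_;    # $logs: { A=>[...], C=>[...], G=>[...], T=>[...] }
    my @bases = split //, uc $seq;
    my $width = scalar @{ $logs->{A} };
    die "Sequence length differs from matrix width\n" unless @bases == $width;
    my $score = 0;
    for my $i (0 .. $#bases) {
        my $col = $logs->{ $bases[$i] }
            or die "Unexpected base '$bases[$i]'\n";
        $score += $col->[$i];
    }
    return $score;
}

my $logs = { A => [1, -1], C => [-1, 0.5], G => [0, 0], T => [-2, -2] };
print match_weight($logs, 'AC'), "\n";    # 1 + 0.5 = 1.5
```

The length check mirrors the Throws entry above: a sequence shorter or longer than the matrix cannot be scored position by position.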

=head2 get_all_vectors

 Title   : get_all_vectors
 Function: returns all possible sequence vectors to satisfy the PFM under
           a given threshold
 Throws  : If the threshold is outside of 0..1 (no sense to do that)
 Example : my @vectors=$self->get_all_vectors(4);
 Returns : Array of strings
 Args    : (optional) floating

=cut

sub get_all_vectors {
    my $self = shift;
    $self->throw_not_implemented();
}