1
## $Id: HMM.pm 11480 2007-06-14 14:16:21Z sendu $
3
# BioPerl module for Bio::Tools::HMM
5
# Cared for by Yee Man Chan <ymc@yahoo.com>
7
# Copyright Yee Man Chan
9
# You may distribute this module under the same terms as perl itself
11
# POD documentation - main docs before the code
15
Bio::Tools::HMM - Perl extension to perform Hidden Markov Model calculations
21
use Bio::Matrix::Scoring;
24
# ACGT are the bases NC mean non-coding and coding
25
$hmm = Bio::Tools::HMM->new('-symbols' => "ACGT", '-states' => "NC");
27
# initialize some training observation sequences
28
$seq1 = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
29
$seq2 = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
30
@seqs = ($seq1, $seq2);
32
# train the HMM with the observation sequences
33
$hmm->baum_welch_training(\@seqs);
36
$init = $hmm->init_prob; # returns an array reference
37
$matrix1 = $hmm->transition_prob; # returns Bio::Matrix::Scoring
38
$matrix2 = $hmm->emission_prob; # returns Bio::Matrix::Scoring
40
# initialize training hidden state sequences
41
$hs1 = "NCNCNNNNNCCCCNNCCCNNNNC";
42
$hs2 = "NCNNCNNNNNNCNCNCNNNCNCN";
45
# train the HMM with both observation sequences and hidden state
47
$hmm->statistical_training(\@seqs, \@hss);
49
# with the newly calibrated HMM, we can use viterbi algorithm
50
# to obtain the hidden state sequence underlying an observation
52
$hss = $hmm->viterbi($seq); # returns a string of hidden states
56
Hidden Markov Model (HMM) was first introduced by Baum and his colleagues
57
in a series of classic papers in the late 1960s and 1970s. It was first
58
applied to the field of speech recognition with great success in the 1970s.
60
Explosion in the amount sequencing data in the 1990s opened the field
61
of Biological Sequence Analysis. Seeing HMM's effectiveness in detecing
62
signals in biological sequences, Krogh, Mian and Haussler used HMM to find
63
genes in E. coli DNA in a classical paper in 1994. Since then, there have
64
been extensive application of HMM to other area of Biology, for example,
65
multiple sequence alignment, CpG island detection and so on.
69
This package comes with the main bioperl distribution. You also need
70
to install the lastest bioperl-ext package which contains the XS code
71
that implements the algorithms. This package won't work if you haven't
72
compiled the bioperl-ext package.
81
Allow people to set and get the tolerance level in the EM algorithm.
85
Allow people to set and get the maximum number of iterations
86
to run in the EM algorithm.
90
A function to calculate the probability of an observation sequence
94
A function to do posterior decoding, ie to find the probabilty of
95
seeing a certain observation symbol at position i.
103
User feedback is an integral part of the evolution of this and other
104
Bioperl modules. Send your comments and suggestions preferably to one
105
of the Bioperl mailing lists. Your participation is much appreciated.
107
bioperl-l@bioperl.org - General discussion
108
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
110
=head2 Reporting Bugs
112
Report bugs to the Bioperl bug tracking system to help us keep track
113
the bugs and their resolution. Bug reports can be submitted via the
116
http://bugzilla.open-bio.org/
120
This implementation was written by Yee Man Chan (ymc@yahoo.com).
121
Copyright (c) 2005 Yee Man Chan. All rights reserved. This program
122
is free software; you can redistribute it and/or modify it under
123
the same terms as Perl itself. All the code are written by Yee
124
Man Chan without borrowing any code from anywhere.
128
package Bio::Tools::HMM;
130
use base qw(Bio::Root::Root);
134
require Bio::Ext::HMM;
137
die("\nThe C-compiled engine for Hidden Markov Model (HMM) has not been installed.\n Please read the install the bioperl-ext package\n\n");
143
my ($class, @args) = @_;
145
my $self = $class->SUPER::new(@args);
147
my ($symbols, $states, $init, $a_mat, $e_mat) = $self->_rearrange([qw(SYMBOLS
154
$self->throw("Observation Symbols are not defined!") unless defined $symbols;
155
$self->throw("Hidden States are not defined!") unless defined $states;
157
if (defined $symbols) {
158
if (scalar($symbols)) {
159
# check duplicate symbols
160
if ($self->symbols($symbols) < 0) {
161
$self->throw("Duplicate symbols!\n");
165
$self->throw("We don't support list of symbols in this version.\n");
169
if (defined $states) {
170
if (scalar($states)) {
171
# check duplicate states
172
if ($self->states($states) < 0) {
173
$self->throw("Duplicate states!\n");
177
$self->throw("We don't support list of states in this version.\n");
181
$self->{'hmm'} = Bio::Ext::HMM::HMM->new($symbols, $states);
188
Usage : $prob = $hmm->likelihood($seq)
189
Function: Calculate the probability of an observation sequence given an HMM
190
Returns : An floating point number that is the logarithm of the probability
191
of an observation sequence given an HMM
192
Args : The only argument is a string that is the observation sequence
193
you are interested in. Note that the sequence must not contain
194
any character that is not in the alphabet of observation symbols.
199
my ($self, $seq) = @_;
202
if( ! defined $seq) {
203
$self->warn("Cannot calculate without supply an observation sequence!");
206
my $s = $self->{'symbols'};
208
$valid_symbols = eval "tr/$s//;";
209
if ($valid_symbols != length($seq)) {
210
$self->throw("Observation Sequence contains characters that is not in the
211
alphabet of observation symbols!\n");
213
return Bio::Ext::HMM->HMM_likelihood($self->{'hmm'}, $seq);
216
=head2 statistical_training
218
Title : statistical_training
219
Usage : $hmm->statistical_training(\@seqs, \@hss)
220
Function: Estimate the parameters of an HMM given an array of observation
221
sequence and an array of the corresponding hidden state
223
Returns : Returns nothing. The parameters of the HMM will be set to the
225
Args : The first argument is a reference to an array of observation
226
sequences. The second argument is a reference to an array of
227
hidden state sequences. Note that the lengths of an observation
228
sequence and a hidden state sequence must be the same.
232
sub statistical_training {
233
my ($self, $seqs, $hss) = @_;
235
my $seq_cnt, $hs_cnt;
238
if( ! defined $seqs or ! defined $hss) {
239
$self->warn("Cannot calculate without supply an observation and a hidden state sequence!");
244
if ($seq_cnt != $hs_cnt) {
245
$self->throw("There must be the same number of observation sequences and
246
hidden state sequences!\n");
248
for ($i = 0; $i < $seq_cnt; ++$i) {
249
if (length(@{$seqs}[$i]) != length(@{$hss}[$i])) {
250
$self->throw("The corresponding observation sequences and hidden state sequences must be of the same length!\n");
253
foreach $seq (@{$seqs}) {
254
my $s = $self->{'symbols'};
256
$valid_symbols = eval "tr/$s//;";
257
if ($valid_symbols != length($seq)) {
258
$self->throw("Observation Sequence contains characters that is not in the
259
alphabet of observation symbols!\n");
262
foreach $seq (@{$hss}) {
263
my $s = $self->{'states'};
265
$valid_symbols = eval "tr/$s//;";
266
if ($valid_symbols != length($seq)) {
267
$self->throw("Hidden State Sequence contains characters that is not in the
268
alphabet of hidden states!\n");
271
Bio::Ext::HMM->HMM_statistical_training($self->{'hmm'}, $seqs, $hss);
274
=head2 baum_welch_training
276
Title : baum_welch_training
277
Usage : $hmm->baum_welch_training(\@seqs)
278
Function: Estimate the parameters of an HMM given an array of observation
280
Returns : Returns nothing. The parameters of the HMM will be set to the
282
Args : The only argument is a reference to an array of observation
287
sub baum_welch_training {
288
my ($self, $seqs) = @_;
291
if( ! defined $seqs) {
292
$self->warn("Cannot calculate without supply an observation sequence!");
295
foreach $seq (@{$seqs}) {
296
my $s = $self->{'symbols'};
298
$valid_symbols = eval "tr/$s//;";
299
if ($valid_symbols != length($seq)) {
300
$self->throw("Observation Sequence contains characters that is not in the
301
alphabet of observation symbols!\n");
304
Bio::Ext::HMM->HMM_baum_welch_training($self->{'hmm'}, $seqs);
310
Usage : $hss = $hmm->viterbi($seq)
311
Function: Find out the hidden state sequence that can maximize the
312
probability of seeing observation sequence $seq.
313
Returns : Returns a string that is the hidden state sequence that maximizes
314
the probability of seeing $seq.
315
Args : The only argument is an observation sequence.
320
my ($self, $seq) = @_;
323
if( ! defined $seq) {
324
$self->warn("Cannot calculate without supply an observation sequence!");
327
my $s = $self->{'symbols'};
329
$valid_symbols = eval "tr/$s//;";
330
if ($valid_symbols != length($seq)) {
331
$self->throw("Observation Sequence contains characters that is not in the
332
alphabet of observation symbols!\n");
334
return Bio::Ext::HMM->HMM_viterbi($self->{'hmm'}, $seq);
340
Usage : $symbols = $hmm->symbols() #get
341
: $hmm->symbols($value) #set
342
Function : the set get for the observation symbols
344
Returns : symbols string
345
Arguments : new value
350
my ($self,$val) = @_;
354
if ( defined $val ) {
357
for ($i = 0; $i < length($val); ++$i) {
358
$c = substr($val, $i, 1);
359
if (defined $alphabets{$c}) {
360
$self->throw("Can't have duplicate symbols!");
366
$self->{'symbols'} = $val;
368
return $self->{'symbols'};
375
Usage : $states = $hmm->states() #get
376
: $hmm->states($value) #set
377
Function : the set get for the hidden states
379
Returns : states string
380
Arguments : new value
385
my ($self,$val) = @_;
389
if ( defined $val ) {
392
for ($i = 0; $i < length($val); ++$i) {
393
$c = substr($val, $i, 1);
394
if (defined $alphabets{$c}) {
395
$self->throw("Can't have duplicate states!");
401
$self->{'states'} = $val;
403
return $self->{'states'};
409
Usage : $init = $hmm->init_prob() #get
410
: $hmm->transition_prob(\@init) #set
411
Function : the set get for the initial probability array
413
Returns : reference to double array
414
Arguments : new value
419
my ($self, $init) = @_;
431
$self->throw("The sum of initial probability array must be 1.0!\n");
433
if ($size != length($self->{'states'})) {
434
$self->throw("The size of init array $size is different from the number of HMM's hidden states!\n");
436
for ($i = 0; $i < $size; ++$i) {
437
Bio::Ext::HMM::HMM->set_init_entry($self->{'hmm'}, substr($self->{'states'}, $i, 1), @{$init}[$i]);
441
$self->throw("Initial Probability array must be a reference!\n");
445
for ($i = 0; $i < length($self->{'states'}); ++$i) {
446
$A[$i] = Bio::Ext::HMM::HMM->get_init_entry($self->{'hmm'}, substr($self->{'states'}, $i, 1));
452
=head2 transition_prob
454
Title : transition_prob
455
Usage : $transition_matrix = $hmm->transition_prob() #get
456
: $hmm->transition_prob($matrix) #set
457
Function : the set get for the transition probability mairix
459
Returns : Bio::Matrix::Scoring
460
Arguments : new value
464
sub transition_prob {
465
my ($self, $matrix) = @_;
469
if (defined $matrix) {
470
if ($matrix->isa('Bio::Matrix::Scoring')) {
471
my $row = join("", $matrix->row_names);
472
my $col = join("", $matrix->column_names);
473
if ($row ne $self->{'states'}) {
474
$self->throw("Names of the rows ($row) is different from the states of HMM " . $self->{'states'});
476
if ($col ne $self->{'states'}) {
477
$self->throw("Names of the columns ($col) is different from the states of HMM " . $self->{'states'});
479
for ($i = 0; $i < length($self->{'states'}); ++$i) {
481
my $a = substr($self->{'states'}, $i, 1);
482
for ($j = 0; $j < length($self->{'states'}); ++$j) {
483
my $b = substr($self->{'states'}, $j, 1);
484
$sum += $matrix->entry($a, $b);
487
$self->throw("Sum of probabilities for each from-state must be 1.0!\n");
490
for ($i = 0; $i < length($self->{'states'}); ++$i) {
491
my $a = substr($self->{'states'}, $i, 1);
492
for ($j = 0; $j < length($self->{'states'}); ++$j) {
493
my $b = substr($self->{'states'}, $j, 1);
494
Bio::Ext::HMM::HMM->set_a_entry($self->{'hmm'}, $a, $b, $matrix->entry($a, $b));
499
$self->throw("Transition Probability matrix must be of type Bio::Matrix::Scoring.\n");
503
for ($i = 0; $i < length($self->{'states'}); ++$i) {
504
for ($j = 0; $j < length($self->{'states'}); ++$j) {
505
$A[$i][$j] = Bio::Ext::HMM::HMM->get_a_entry($self->{'hmm'}, substr($self->{'states'}, $i, 1), substr($self->{'states'}, $j, 1));
508
my @rows = split(//, $self->{'states'});
509
return $matrix = Bio::Matrix::Scoring->new(-values => \@A, -rownames => \@rows, -colnames => \@rows);
515
Title : emission_prob
516
Usage : $emission_matrix = $hmm->emission_prob() #get
517
: $hmm->emission_prob($matrix) #set
518
Function : the set get for the emission probability mairix
520
Returns : Bio::Matrix::Scoring
521
Arguments : new value
526
my ($self, $matrix) = @_;
530
if (defined $matrix) {
531
if ($matrix->isa('Bio::Matrix::Scoring')) {
532
my $row = join("", $matrix->row_names);
533
my $col = join("", $matrix->column_names);
534
if ($row ne $self->{'states'}) {
535
$self->throw("Names of the rows ($row) is different from the states of HMM " . $self->{'states'});
537
if ($col ne $self->{'symbols'}) {
538
$self->throw("Names of the columns ($col) is different from the symbols of HMM " . $self->{'symbols'});
540
for ($i = 0; $i < length($self->{'states'}); ++$i) {
542
my $a = substr($self->{'states'}, $i, 1);
543
for ($j = 0; $j < length($self->{'symbols'}); ++$j) {
544
my $b = substr($self->{'symbols'}, $j, 1);
545
$sum += $matrix->entry($a, $b);
548
$self->throw("Sum of probabilities for each state must be 1.0!\n");
551
for ($i = 0; $i < length($self->{'states'}); ++$i) {
552
my $a = substr($self->{'states'}, $i, 1);
553
for ($j = 0; $j < length($self->{'symbols'}); ++$j) {
554
my $b = substr($self->{'symbols'}, $j, 1);
555
Bio::Ext::HMM::HMM->set_e_entry($self->{'hmm'}, $a, $b, $matrix->entry($a, $b));
560
$self->throw("Emission Probability matrix must be of type Bio::Matrix::Scoring.\n");
564
for ($i = 0; $i < length($self->{'states'}); ++$i) {
565
for ($j = 0; $j < length($self->{'symbols'}); ++$j) {
566
$A[$i][$j] = Bio::Ext::HMM::HMM->get_e_entry($self->{'hmm'}, substr($self->{'states'}, $i, 1), substr($self->{'symbols'}, $j, 1));
569
my @rows = split(//, $self->{'states'});
570
my @cols = split(//, $self->{'symbols'});
571
return $matrix = Bio::Matrix::Scoring->new(-values => \@A, -rownames => \@rows, -colnames => \@cols);