19
use Bio::Tools::dpAlign;
24
$seq1 = new Bio::SeqIO(-file => $ARGV[0], -format => 'fasta');
25
$seq2 = new Bio::SeqIO(-file => $ARGV[1], -format => 'fasta');
27
# create a dpAlign object
28
$factory = new dpAlign(-match => 3,
32
-alg => Bio::Tools::dpAlign::DPALIGN_LOCAL_MILLER_MYERS);
34
# actually do the alignment
35
$out = $factory->pairwise_alignment($seq1->next_seq, $seq2->next_seq);
36
$alnout = new Bio::AlignIO(-format => 'pfam', -fh => \*STDOUT);
37
$alnout->write_aln($out);
39
# To do protein alignment, set the sequence type to protein
40
# currently all protein alignments are using BLOSUM62 matrix
41
# the gap opening cost is 10 and gap extension is 2. These
42
# values are from ssearch. They won't be changed even though
43
# you set other values for now. Also DPALIGN_LOCAL_GREEN is not
44
# supported for protein in this version.
45
$seq1->alphabet('protein');
46
$seq2->alphabet('protein');
47
$out = $factory->pairwise_alignment($seq1->next_seq, $seq2->next_seq);
48
$alnout->write_aln($out);
50
# use the factory to make some output
52
$factory->align_and_show($seq1, $seq2, STDOUT);
19
use Bio::Tools::dpAlign;
25
$seq1 = new Bio::SeqIO(-file => $ARGV[0], -format => 'fasta');
26
$seq2 = new Bio::SeqIO(-file => $ARGV[1], -format => 'fasta');
28
# create a dpAlign object
29
# to do global alignment, specify DPALIGN_GLOBAL_MILLER_MYERS
30
# to do ends-free alignment, specify DPALIGN_ENDSFREE_MILLER_MYERS
31
$factory = new dpAlign(-match => 3,
35
-alg => Bio::Tools::dpAlign::DPALIGN_LOCAL_MILLER_MYERS);
37
# actually do the alignment
38
$out = $factory->pairwise_alignment($seq1->next_seq, $seq2->next_seq);
39
$alnout = new Bio::AlignIO(-format => 'pfam', -fh => \*STDOUT);
40
$alnout->write_aln($out);
42
# To do protein alignment, set the sequence type to protein
43
# By default all protein alignments are using BLOSUM62 matrix
44
# the gap opening cost is 7 and gap extension is 1. These
45
# values are from ssearch. To use your own custom substitution
46
# matrix, you can create a Bio::Matrix::MatrixI object.
48
$parser = new Bio::Matrix::IO(-format => 'scoring', -file => 'blosum50.mat');
49
$matrix = $parser->next_matrix;
50
$factory = new Bio::Tools::dpAlign(-matrix => $matrix, -alg => Bio::Tools::dpAlign::DPALIGN_LOCAL_MILLERMYERS);
51
$seq1->alphabet('protein');
52
$seq2->alphabet('protein');
53
$out = $factory->pairwise_alignment($seq1->next_seq, $seq2->next_seq);
54
$alnout->write_aln($out);
56
# use the factory to make some output
58
$factory->align_and_show($seq1, $seq2, STDOUT);
60
# use Phil Green's algorithm to calculate the optimal local
61
# alignment score between two sequences quickly. It is very
62
# useful when you are searching a query sequence in a database
63
# of sequences. Since finding a alignment is more costly
64
# than just calculating scores, you can save time if you only
65
# align sequences that have a high alignment score.
67
# To use this feature, first you call the sequence_profile function
68
# to obtain the profile of the query sequence.
69
$profile = $factory->sequence_profile($query);
72
# Then use a loop to run a database of sequences against the
73
# profile to obtain a table of alignment scores
74
$dbseq = Bio::SeqIO(-file => 'dbseq.fa', -format => 'fasta');
75
while (defined($seq = $dbseq->next_seq)) {
76
$scores{$seq->id} = $factory->pairwise_alignment_score($profile, $seq);
56
Dynamic Programming approach is considered to be the most
57
sensitive way to align two biological sequences. There are
58
currently three major types of dynamic programming algorithms:
59
Global Alignment, Local Alignment and Ends-free Alignment.
61
Global Alignment compares two sequences in their entirety.
62
By inserting gaps in the two sequences, it aligns two
63
sequences to minimize the edit distance as defined by the
64
gap cost function and the substitution matrix. Global Alignment
65
is generally applied to two sequences that are very similar
66
in length and content.
68
Local Alignment instead attempts to find out the subsequences
69
that has the minimal edit distance among all possible subsequences.
70
It is good for sequences that has a stretch of subsequences
71
that are similar to each other.
73
Ends-free Alignment is a special case of Global Alignment. There
74
are no gap penalty imposed for the gaps that extended from
75
the end points of two sequences. Therefore it will be a good
76
application when you think one sequence is contained by the
77
other or when you think two sequences overlap each other.
79
Dynamic Programming was first introduced by Needleman-Wunsch (1970)
80
to globally align two sequences. The idea of local alignment
81
was later introduced by Smith-Waterman (1981). Gotoh (1982)
82
improved both algorithms by introducing auxillary arrays that
83
reduced the time complexity of the algorithms to O(m*n).
84
Miller-Myers (1988) exploits the divide-and-conquer idea
85
introduced by Hirschberg (1975) to solve the affine gap cost
86
dynamic programming using only linear space. At the time of
87
this writing, it is accepted that Miller-Myers is the fastest
88
single CPU implementation and using the least memory that is
89
truly equivalent to original algorithm introduced by
90
Needleman-Wunsch. According to Aaron Mackey, Phil Green's SWAT
91
implemention introduced a heuristic that does not consider
92
paths throught the matrix where the score would be less than
93
the gap opening penalty, yielding a 1.5-2X speedup on most
94
comparisons. to skip the calculation of some cells. However,
95
his approach is only good for calculating the minimum edit
96
distance and find out the corresponding subsequences (aka
97
search phase). Bill Pearson's popular dynamic programming
98
alignment program SSEARCH uses Phil Green's algorithm to
99
find the subsequences and then Miller-Myers's algorithm to
100
find the actual alignment. (aka alignment phase)
102
The current implementation supports local alignment of
103
either DNA sequences or protein sequences. It allows you
104
to specify either the Phil Green (DPALIGN_LOCAL_GREEN)
105
or Miller-Myers (DPALIGN_LOCAL_MILLER_MYERS). For DNA
106
alignment, you can specify the scores for match, mismatch,
107
gap opening cost and gap extension cost. For protein
108
alignment, it is using BLOSUM62 by default. Currently the
109
substitution matrix is not configurable.
81
Dynamic Programming approach is considered to be the most sensitive
82
way to align two biological sequences. There are currently three major
83
types of dynamic programming algorithms: Global Alignment, Local
84
Alignment and Ends-free Alignment.
86
Global Alignment compares two sequences in their entirety. By
87
inserting gaps in the two sequences, it aligns two sequences to
88
minimize the edit distance as defined by the gap cost function and the
89
substitution matrix. Global Alignment is generally applied to two
90
sequences that are very similar in length and content.
92
Local Alignment instead attempts to find out the subsequences that has
93
the minimal edit distance among all possible subsequences. It is good
94
for sequences that has a stretch of subsequences that are similar to
97
Ends-free Alignment is a special case of Global Alignment. There are
98
no gap penalty imposed for the gaps that extended from the end points
99
of two sequences. Therefore it will be a good application when you
100
think one sequence is contained by the other or when you think two
101
sequences overlap each other.
103
Dynamic Programming was first introduced by Needleman-Wunsch (1970) to
104
globally align two sequences. The idea of local alignment was later
105
introduced by Smith-Waterman (1981). Gotoh (1982) improved both
106
algorithms by introducing auxillary arrays that reduced the time
107
complexity of the algorithms to O(m*n). Miller-Myers (1988) exploits
108
the divide-and-conquer idea introduced by Hirschberg (1975) to solve
109
the affine gap cost dynamic programming using only linear space. At
110
the time of this writing, it is accepted that Miller-Myers is the
111
fastest single CPU implementation and using the least memory that is
112
truly equivalent to original algorithm introduced by
113
Needleman-Wunsch. According to Aaron Mackey, Phil Green's SWAT
114
implemention introduced a heuristic that does not consider paths
115
throught the matrix where the score would be less than the gap opening
116
penalty, yielding a 1.5-2X speedup on most comparisons. to skip the
117
calculation of some cells. However, his approach is only good for
118
calculating the minimum edit distance and find out the corresponding
119
subsequences (aka search phase). Bill Pearson's popular dynamic
120
programming alignment program SSEARCH uses Phil Green's algorithm to
121
find the subsequences and then Miller-Myers's algorithm to find the
122
actual alignment. (aka alignment phase)
124
The current implementation supports local alignment of either DNA
125
sequences or protein sequences. It allows you to specify either the
126
Miller-Myers Global Alignment (DPALIGN_GLOBAL_MILLER_MYERS) or
127
Miller-Myers Local Alignment (DPALIGN_LOCAL_MILLER_MYERS). For DNA
128
alignment, you can specify the scores for match, mismatch, gap opening
129
cost and gap extension cost. For protein alignment, it is using
130
BLOSUM62 by default. Currently the substitution matrix is not
111
133
=head1 DEPENDENCIES
113
This package comes with the main bioperl distribution. You
114
also need to install the lastest bioperl-ext package which
115
contains the XS code that implements the algorithms. This
116
package won't work if you haven't compiled the bioperl-ext
135
This package comes with the main bioperl distribution. You also need
136
to install the lastest bioperl-ext package which contains the XS code
137
that implements the algorithms. This package won't work if you haven't
138
compiled the bioperl-ext package.
121
1) Allow custom substitution matrix.
123
2) Support Global Alignment.
125
3) Support Ends-free Alignment.
127
4) Include a score only calculation based on Phil Green's
128
algorithm. The code will be borrowed from do_work in
131
5) Support IUPAC code for DNA sequence
147
Support IUPAC code for DNA sequence
151
Allow custom substitution matrix for DNA. Note that for proteins, you
152
can now use your own subsitution matirx.
254
276
if (defined $alg) {
255
if ($alg == DPALIGN_LOCAL_MILLER_MYERS or $alg == DPALIGN_LOCAL_GREEN) {
277
if ($alg == DPALIGN_LOCAL_MILLER_MYERS or $alg == DPALIGN_GLOBAL_MILLER_MYERS or $alg == DPALIGN_ENDSFREE_MILLER_MYERS) {
256
278
$self->alg($alg);
259
$self->throw("Algorithm must be either 1 or 2");
281
$self->throw("Algorithm must be either 1, 2 or 3");
285
if (defined $matrix and $matrix->isa('Bio::Matrix::MatrixI')) {
286
$self->{'matrix'} = Bio::Ext::Align::ScoringMatrix->new(join("", $matrix->row_names), $self->gap, $self->ext);
287
foreach $rowname ($matrix->row_names) {
288
foreach $colname ($matrix->column_names) {
289
Bio::Ext::Align::ScoringMatrix->set_entry($self->{'matrix'}, $rowname, $colname, $matrix->entry($rowname, $colname));
294
$self->{'matrix'} = 0;
300
=head2 sequence_profile
302
Title : sequence_profile
303
Usage : $prof = $factory->sequence_profile($seq1)
304
Function: Makes a dpAlign_SequenceProfile object from one sequence
305
Returns : A dpAlign_SequenceProfile object
306
Args : The lone argument is a Bio::PrimarySeqI that we want to
307
build a profile for. Usually, this would be the Query sequence
311
sub sequence_profile {
312
my ($self, $seq1) = @_;
314
if( ! defined $seq1 || ! $seq1->isa('Bio::PrimarySeqI')) {
315
$self->warn("Cannot call sequence_profilewithout specifing one sequence (Bio::PrimarySeqI object)");
319
# fix Jitterbug #1044
320
if( $seq1->length() < 2) {
321
$self->warn("cannot create sequence profile with length less than 2");
324
# create engine objects
325
$seq1->display_id('seq1') unless ( defined $seq1->id() );
327
if ($seq1->alphabet eq 'dna') {
328
return Bio::Ext::Align::SequenceProfile->dna_new($seq1->seq, $self->{'match'}, $self->{'mismatch'}, $self->{'gap'}, $self->{'ext'});
330
elsif ($seq1->alphabet eq 'protein') {
331
return Bio::Ext::Align::SequenceProfile->protein_new($seq1->seq, $self->{'matrix'});
334
croak("There is currently no support for the types of sequences you want to align!\n");
339
=head2 pairwise_alignment_score
341
Title : pairwise_alignment_score
342
Usage : $score = $factory->pairwise_alignment_score($prof,$seq2)
343
Function: Makes a SimpleAlign object from two sequences
344
Returns : An integer that is the score of the optimal alignment.
345
Args : The first argument is the sequence profile obtained from a
346
call to the sequence_profile function. The second argument
347
is a Bio::PrimarySeqI object to be aligned. The second argument
348
is usually a sequence in the database sequence. Note
349
that this function only uses Phil Green's algorithm and
350
therefore theoretically may not always give you the optimal
355
sub pairwise_alignment_score {
356
my ($self, $prof, $seq2) = @_;
358
if( ! defined $prof || ! $prof->isa('Bio::Ext::Align::SequenceProfile') ||
359
! defined $seq2 || ! $seq2->isa('Bio::PrimarySeqI') ) {
360
$self->warn("Cannot call pairwise_alignment_score without specifing 2 sequences (Bio::PrimarySeqI objects)");
363
# fix Jitterbug #1044
364
if( $seq2->length() < 2) {
365
$self->warn("cannot align sequences with length less than 2");
368
$self->set_memory_and_report();
369
# create engine objects
370
$seq2->display_id('seq2') unless ( defined $seq2->id() );
372
if ($prof->alphabet eq 'dna' and $seq2->alphabet eq 'dna') {
373
return Bio::Ext::Align::Score_DNA_Sequences($prof, $seq2->seq);
375
elsif ($prof->alphabet eq 'protein' and $seq2->alphabet eq 'protein') {
376
return Bio::Ext::Align::Score_Protein_Sequences($prof, $seq2->seq);
379
croak("There is currently no support for the types of sequences you want to align!\n");
265
384
=head2 pairwise_alignment
267
386
Title : pairwise_alignment
268
387
Usage : $aln = $factory->pairwise_alignment($seq1,$seq2)
269
388
Function: Makes a SimpleAlign object from two sequences
270
Returns : A SimpleAlign object
389
Returns : A SimpleAlign object if there is an alignment with positive
390
score. Otherwise, return undef.
391
Args : The first and second arguments are both Bio::PrimarySeqI
392
objects that are to be aligned.
352
477
$aln = Bio::Ext::Align::Align_DNA_Sequences($seq1->seq, $seq2->seq, $self->{'match'}, $self->{'mismatch'}, $self->{'gap'}, $self->{'ext'}, $self->{'alg'});
354
479
elsif ($seq1->alphabet eq 'protein' and $seq2->alphabet eq 'protein') {
355
$aln = Bio::Ext::Align::Align_Protein_Sequences($seq1->seq, $seq2->seq, 'blosum62');
480
$aln = Bio::Ext::Align::Align_Protein_Sequences($seq1->seq, $seq2->seq, $self->{'matrix'}, $self->{'alg'});
358
483
croak("There is currently no support for the types of sequences you want to align!\n");
361
$out = Bio::SimpleAlign->new();
363
$out->add_seq(Bio::LocatableSeq->new(-seq => $aln->aln1,
364
-start => $aln->start1,
368
$out->add_seq(Bio::LocatableSeq->new(-seq => $aln->aln2,
369
-start => $aln->start2,
373
my $t1 = &Bio::Ext::Align::new_Sequence_from_strings($seq1->id, $seq1->seq);
374
my $t2 = &Bio::Ext::Align::new_Sequence_from_strings($seq2->id, $seq2->seq);
376
&Bio::Ext::Align::write_pretty_seq_align($out,$t1,$t2,12,50,$fh);
486
$out = Bio::Ext::Align::AlnBlock->new();
487
my $s1 = Bio::Ext::Align::AlnSequence->new();
488
my $s2 = Bio::Ext::Align::AlnSequence->new();
491
my $first_col = undef;
492
my $last_col = undef;
500
for (my $i = 0; $i < length($a1); ++$i) {
501
$col = Bio::Ext::Align::AlnColumn->new();
502
$alu1 = Bio::Ext::Align::AlnUnit->new();
503
$alu2 = Bio::Ext::Align::AlnUnit->new();
504
$first_col = $col unless defined $first_col;
505
Bio::Ext::Align::AlnColumn::set_next($last_col, $col) if defined $last_col;
507
if (substr($a1, $i, 1) eq "-") {
508
Bio::Ext::Align::AlnUnit::set_text_label($alu1, "INSERT");
509
Bio::Ext::Align::AlnUnit::set_text_label($alu2, "SEQUENCE");
512
elsif (substr($a2, $i, 1) eq "-") {
513
Bio::Ext::Align::AlnUnit::set_text_label($alu1, "SEQUENCE");
514
Bio::Ext::Align::AlnUnit::set_text_label($alu2, "INSERT");
518
Bio::Ext::Align::AlnUnit::set_text_label($alu1, "SEQUENCE");
519
Bio::Ext::Align::AlnUnit::set_text_label($alu2, "SEQUENCE");
522
Bio::Ext::Align::AlnUnit::set_start($alu1, $aln->start1+$i-$g1-2);
523
Bio::Ext::Align::AlnUnit::set_end($alu1, $aln->start1+$i-$g1-2);
524
Bio::Ext::Align::AlnUnit::set_start($alu2, $aln->start2+$i-$g2-2);
525
Bio::Ext::Align::AlnUnit::set_end($alu2, $aln->start2+$i-$g2-2);
526
Bio::Ext::Align::AlnColumn::add_alu($col, $alu1);
527
Bio::Ext::Align::AlnColumn::add_alu($col, $alu2);
530
Bio::Ext::Align::AlnBlock::set_start($out, $first_col);
531
$col = Bio::Ext::Align::AlnColumn->new();
532
$alu1 = Bio::Ext::Align::AlnUnit->new();
533
$alu2 = Bio::Ext::Align::AlnUnit->new();
534
Bio::Ext::Align::AlnUnit::set_start($alu1, $aln->end1);
535
Bio::Ext::Align::AlnUnit::set_end($alu1, $aln->end1);
536
Bio::Ext::Align::AlnUnit::set_text_label($alu1, "END");
537
Bio::Ext::Align::AlnUnit::set_start($alu2, $aln->end2);
538
Bio::Ext::Align::AlnUnit::set_end($alu2, $aln->end2);
539
Bio::Ext::Align::AlnUnit::set_text_label($alu2, "END");
540
Bio::Ext::Align::AlnColumn::add_alu($col, $alu1);
541
Bio::Ext::Align::AlnColumn::add_alu($col, $alu2);
542
Bio::Ext::Align::AlnColumn::set_next($last_col, $col);
544
&Bio::Ext::Align::write_pretty_str_align($out,$seq1->id,$seq1->seq,$seq2->id,$seq2->seq,12,50,$fh);