1
#------------------------------------------------------------------
3
# BioPerl module Bio::Restriction::Analysis
5
# Cared for by Rob Edwards <redwards@utmem.edu>
7
# You may distribute this module under the same terms as perl itself
8
#------------------------------------------------------------------
14
Bio::Restriction::Analysis - cutting sequences with restriction
19
# analyze a DNA sequence for restriction enzymes
20
use Bio::Restriction::Analysis;
24
# get a DNA sequence from somewhere
25
my $seq=new Bio::PrimarySeq
26
(-seq =>'AGCTTAATTCATTAGCTCTGACTGCAACGGGCAATATGTCTC'.
27
'TGTGTGGATTAAAAAAAGAGTGAGCTTCTGATAGCAGC',
28
-primary_id => 'synopsis',
31
# now start an analysis.
32
# this is using the default set of enzymes
33
my $ra=Bio::Restriction::Analysis->new(-seq=>$seq);
35
# find unique cutters. This returns a
36
# Bio::Restriction::EnzymeCollection object
37
my $enzymes=$ra->unique_cutters;
38
print "Unique cutters: ", join (', ',
39
map {$_->name} $enzymes->unique_cutters), "\n";
41
# AluI is one them. Where does it cut?
42
# This is will return an array of the sequence strings
45
my @frags=$ra->fragments($enz);
46
# how big are the fragments?
47
print "AluI fragment lengths: ", join(' & ', map {length $_} @frags), "\n";
49
# You can also bypass fragments and call sizes directly:
50
# to see all the fragment sizes
51
print "All sizes: ", join " ", $ra->sizes($enz), "\n";
52
# to see all the fragment sizes sorted by size like on a gel
53
print "All sizes, sorted ", join (" ", $ra->sizes($enz, 0, 1)), "\n";
55
# how many times does each enzyme cut
56
my $cuts=$ra->cuts_by_enzyme('BamHI');
57
print "BamHI cuts $cuts times\n";
59
# How many enzymes do not cut at all?
60
print "There are ", scalar $ra->zero_cutters->each_enzyme,
61
" enzymes that do not cut\n";
63
# what about enzymes that cut twice?
64
my $two_cutters=$ra->cutters(2);
65
print join (" ", map {$_->name} $two_cutters->each_enzyme),
66
" cut the sequence twice\n";
68
# what are all the enzymes that cut, and how often do they cut
69
printf "\n%-10s%s\n", 'Enzyme', 'Number of Cuts';
70
my $all_cutters=$ra->cutters;
72
printf "%-10s%s\n", $_->name, $ra->cuts_by_enzyme($_->name)
73
} $all_cutters->each_enzyme;
75
# Finally, we can interact the restriction enzyme object by
76
# retrieving it from the collection object see the docs for
77
# Bio::Restriction::Enzyme.pm
78
my $enzobj=$enzymes->get_enzyme($enz);
83
Bio::Restriction::Analysis describes the results of cutting a DNA
84
sequence with restriction enzymes.
86
To use this module you can pass a sequence object and optionally a
87
Bio::Restriction::EnzymeCollection that contains the enzyme(s) to cut the
88
sequences with. There is a default set of enzymes that will be loaded
89
if you do not pass in a Bio::Restriction::EnzymeCollection.
91
To cut a sequence, set up a Restriction::Analysis object with a sequence
94
use Bio::Restriction::Analysis;
95
my $ra=Bio::Restriction::Analysis->new(-seq=>$seqobj);
99
my $ra=Bio::Restriction::Analysis->new
100
(-seq=>$seqobj, -enzymes=>$enzs);
102
Then, to get the fragments for a particular enzyme use this:
104
@fragments=$ra->fragments('EcoRI');
106
Note that the naming of restriction enzymes is that the last numbers
107
are usually Roman numbers (I, II, III, etc). You may want to use
110
# get a reference to an array of unique (single) cutters
111
$singles = $re->unique_cutters;
112
foreach my $enz ($singles->each_enzyme) {
113
@fragments=$re->fragments($enz);
114
... do something here ...
117
Note that if your sequence is circular, the first and last fragment
118
will be joined so that they are the appropriate length and sequence
119
for further analysis. This fragment will also be checked for cuts
120
by the enzyme(s). However, this will change the start of the
123
There are two separate algorithms used depending on whether your
124
enzyme has ambiguity. The non-ambiguous algoritm is a lot faster,
125
and if you are using very large sequences you should try and use
126
this algorithm. If you have a large sequence (e.g. genome) and
127
want to use ambgiuous enzymes you may want to make seperate
128
Bio::Restriction::Enzyme objects for each of the possible
129
alternatives and make sure that you don't set is_ambiguous!
131
This version should correctly deal with overlapping cut sites
132
in both ambiguous and non-ambiguous enzymes.
134
I have tried to write this module with speed and memory in mind
135
so that it can be effectively used for large (e.g. genome sized)
136
sequence. This module only stores the cut positions internally,
137
and calculates everything else on an as-needed basis. Therefore
138
when you call fragment_maps (for example), there may be another
139
delay while these are generated.
145
User feedback is an integral part of the evolution of this and other
146
Bioperl modules. Send your comments and suggestions preferably to one
147
of the Bioperl mailing lists. Your participation is much appreciated.
149
bioperl-l@bioperl.org - General discussion
150
http://bioperl.org/MailList.shtml - About the mailing lists
152
=head2 Reporting Bugs
154
Report bugs to the Bioperl bug tracking system to help us keep track
155
the bugs and their resolution. Bug reports can be submitted via email
158
bioperl-bugs@bio.perl.org
159
http://bugzilla.bioperl.org/
163
Rob Edwards, redwards@utmem.edu,
164
Steve Chervitz, sac@bioperl.org
168
Heikki Lehvaslaiho, heikki@ebi.ac.uk
172
Copyright (c) 2003 Rob Edwards. Some of this work is Copyright (c)
173
1997-2002 Steve A. Chervitz. All Rights Reserved.
175
This module is free software; you can redistribute it and/or modify it
176
under the same terms as Perl itself.
180
L<Bio::Restriction::Enzyme>,
181
L<Bio::Restriction::EnzymeCollection>
185
Methods beginning with a leading underscore are considered private and
186
are intended for internal use by this module. They are not considered
187
part of the public interface and are described here for documentation
192
package Bio::Restriction::Analysis;
193
use Bio::Restriction::EnzymeCollection;
199
@ISA = qw(Bio::Root::Root);
204
Function : Initializes the restriction enzyme object
205
Returns : The Restriction::Analysis object
208
$re_anal->new(-seq=$seqobj,
209
-enzymes=>Restriction::EnzymeCollection object)
210
-seq requires a Bio::PrimarySeq object
211
-enzymes is optional.
212
If ommitted it will use the default set of enzymes
214
This is the place to start. Pass in a sequence, and you will be able
215
to get the fragments back out. Several other things are available
216
like the number of zero cutters or single cutters.
221
my($class, @args) = @_;
222
my $self = $class->SUPER::new(@args);
224
$self->_rearrange([qw(
229
$seq && $self->seq($seq);
231
$enzymes ? $self->enzymes($enzymes)
232
: ($self->{'_enzymes'} = Bio::Restriction::EnzymeCollection->new );
234
# keep track of status
237
# left these here because we want to reforce a _cut if someone
239
$self->{maximum_cuts} = 0;
241
$self->{'_number_of_cuts_by_enzyme'} = {};
242
$self->{'_number_of_cuts_by_cuts'} = {};
243
$self->{'_fragments'} = {};
244
$self->{'_cut_positions'} = {}; # cut position is the real position
245
$self->{'_frag_map_list'} = {};
251
=head1 Methods to set parameters
258
Usage : $ranalysis->seq($newval);
259
Function : get/set method for the sequence to be cut
260
Example : $re->seq($seq);
261
Returns : value of seq
262
Args : A Bio::PrimarySeqI dna object (optional)
270
$self->throw('Need a sequence object ['. ref $seq. ']')
271
unless $seq->isa('Bio::PrimarySeqI');
272
$self->throw('Need a DNA sequence object ['. $seq->alphabet. ']')
273
unless $seq->alphabet eq 'dna';
275
$self->{'_seq'} = $seq;
278
return $self->{'_seq'};
284
Usage : $re->enzymes($newval)
285
Function : gets/Set the restriction enzyme enzymes
286
Example : $re->enzymes('EcoRI')
287
Returns : reference to the collection
288
Args : an array of Bio::Restriction::EnzymeCollection and/or
289
Bio::Restriction::Enzyme objects
292
The default object for this method is
293
Bio::Restriction::EnzymeCollection. However, you can also pass it a
294
list of Bio::Restriction::Enzyme objects - even mixed with Collection
295
objects. They will all be stored into one collection.
302
$self->{'_enzymes'} = Bio::Restriction::EnzymeCollection->new (-empty => 1)
303
unless $self->{'_enzymes'};
304
$self->{'_enzymes'}->enzymes(@_);
307
return $self->{'_enzymes'};
311
=head1 Perform the analysis
319
Function : Cut the sequence with the enzymes
320
Example : $re->cut(); $re->cut('single'); or $re->cut('multiple', $enzymecollection);
322
Args : 'single' (optional), 'multiple' with enzyme collection.
324
An explicit cut method is needed to pass arguments to it.
326
There are two varieties of cut. Single is the default, and need
327
not be explicitly called. This cuts the sequence with each
330
Multiple cuts a sequence with more than one enzyme. You must pass
331
it a Bio::Restriction::EnzymeCollection object of the set
332
of enzymes that you want to use in the double digest. The results
333
will be stored as an enzyme named "multiple_digest", so you can
334
use all the retrieval methods to get the data.
336
If you want to use the default setting there is no need to call cut
337
directly. Every method in the class that needs output checks the
338
object's internal status and recalculates the cuts if needed.
340
Note: cut doesn't now re-initialize everything before figuring
341
out cuts. This is so that you can do multiple digests, or add more
342
data or whatever. You'll have to use new to reset everything.
344
See also the comments in above about ambiguous and non-ambiguous
350
my ($self, $opt, $ec) = @_;
352
# for the moment I have left this as a separate routine so
353
# the user calls cuts rather than _cuts. This also initializes
354
# some stuff we need to use.
356
$self->throw("A sequence must be supplied")
359
if (uc($opt) eq "MULTIPLE") {
360
$self->throw("You must supply a separate enzyme collection for multiple digests") unless $ec;
361
$self->_multiple_cuts($ec); # multiple digests
363
# reset some of the things that we save
364
$self->{maximum_cuts} = 0;
365
$self->{'_number_of_cuts_by_enzyme'} = {};
366
$self->{'_number_of_cuts_by_cuts'} = {};
367
$self->{'_fragments'} = {};
368
$self->{'_cut_positions'} = {}; # cut position is the real position
369
$self->{'_frag_map_list'} = {};
377
=head2 mulitple_digest
379
Title : multiple_digest
380
Function : perform a multiple digest on a sequence
381
Returns : $self so you can go and get any of the other methods
382
Arguments : An enzyme collection
384
Multiple digests can use 1 or more enzymes, and the data is stored
385
in as if it were an enzyme called multiple_digest. You can then
386
retrieve information about multiple digests from any of the other
389
You can use this method in place of $re->cut('multiple', $enz_coll);
393
sub multiple_digest {
395
return $self->cut('multiple', $ec);
398
=head1 Query the results of the analysis
405
Function : Retrieve the positions that an enzyme cuts at
406
Returns : An array of the positions that an enzyme cuts at
407
: or an empty array if the enzyme doesn't cut
408
Arguments: An enzyme name to retrieve the positions for
409
Comments : The cut occurs after the base specified.
414
my ($self, $enz) = @_;
415
$self->cut unless $self->{'_cut'};
416
$self->throw('no enzyme selected to get positions for')
419
return defined $self->{'_cut_positions'}->{$enz} ?
420
@{$self->{'_cut_positions'}->{$enz}} :
427
Function : Retrieve the fragments that we cut
428
Returns : An array of the fragments retrieved.
429
Arguments: An enzyme name to retrieve the fragments for
431
For example this code will retrieve the fragments for all enzymes that
434
my $all_cutters = $analysis->cutters;
435
foreach my $enz ($$all_cutters->each_enzyme}) {
436
@fragments=$analysis->fragments($enz);
442
my ($self, $enz) = @_;
443
$self->cut unless $self->{'_cut'};
444
$self->throw('no enzyme selected to get fragments for')
447
for ($self->fragment_maps($enz)) {push @fragments, $_->{seq}}
453
Title : fragment_maps
454
Function : Retrieves fragment sequences with start and end
455
points. Useful for feature construction.
457
Returns : An array containing a hash reference for each fragment,
458
containing the start point, end point and DNA
459
sequence. The hash keys are 'start', 'end' and
460
'seq'. Returns an empty array if not defined.
462
Arguments : An enzyme name, enzyme object,
463
or enzyme collection to retrieve the fragments for.
465
If passes an enzyme collection it will return the result of a multiple
466
digest. This : will also cause the special enzyme 'multiple_digest' to
467
be created so you can get : other information about this multiple
470
There is a minor problem with this and $self-E<gt>fragments that I
471
haven't got a good answer for (at the moment). If the sequence is not
472
cut, do we return undef, or the whole sequence?
474
For linear fragments it would be good to return the whole
475
sequence. For circular fragments I am not sure.
477
At the moment it returns the whole sequence with start of 1 and end of
478
length of the sequence. For example:
480
use Bio::Restriction::Analysis;
481
use Bio::Restriction::EnzymeCollection;
484
my $seq=new Bio::PrimarySeq
485
(-seq =>'AGCTTAATTCATTAGCTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATCCAAAAAAGAGTGAGCTTCTGAT',
486
-primary_id => 'synopsis',
489
my $ra=Bio::Restriction::Analysis->new(-seq=>$seq);
492
my @bam_maps = $ra->fragment_maps('BamHI');
493
foreach my $i (@bam_maps) {
494
my $start = $i->{start};
496
my $sequence = $i->{seq};
497
push @gel, "$start--$sequence--$end";
498
@gel = sort {length $b <=> length $a} @gel;
500
print join("\n", @gel) . "\n";
505
my ($self, $enz) = @_;
506
$self->cut unless $self->{'_cut'};
507
$self->throw('no enzyme selected to get fragment maps for')
510
# we are going to generate this on an as-needed basis rather than
511
# for every enzyme this should cut down on the amount of
512
# duplicated data we are trying to save in memory and make this
513
# faster and easier for large sequences, e.g. genome analysis
516
if (ref $enz eq '') {
517
@cut_positions=@{$self->{'_cut_positions'}->{$enz}};
518
} elsif ($enz->isa("Bio::Restriction::EnzymeI")) {
519
@cut_positions=@{$self->{'_cut_positions'}->{$enz->name}};
520
} elsif ($enz->isa("Bio::Restriction::EnzymeCollection")) {
521
$self->cuts('multiple', $enz);
522
@cut_positions=@{$self->{'_cut_positions'}->{'multiple_digest'}};
525
unless ($cut_positions[0]) {
527
# return the whole sequence
528
# this should probably have the is_circular command
531
'end' => $self->{'_seq'}->length,
532
'seq' => $self->{'_seq'}->seq
534
push (@{$self->{'_frag_map_list'}->{$enz}}, \%map);
535
return defined $self->{'_frag_map_list'}->{$enz} ?
536
@{$self->{'_frag_map_list'}->{$enz}} : ();
539
@cut_positions=sort {$a <=> $b} @cut_positions;
540
push my @cuts, $cut_positions[0];
541
foreach my $i (@cut_positions) {
542
push @cuts, $i if $i != $cuts[$#cuts];
545
my $start=1; my $stop; my %seq; my %stop;
546
foreach $stop (@cuts) {
547
$seq{$start}=$self->{'_seq'}->subseq($start, $stop);
551
$stop=$self->{'_seq'}->length;
552
$seq{$start}=$self->{'_seq'}->subseq($start, $stop);
555
if ($self->{'_seq'}->is_circular) {
556
# join the first and last fragments
557
$seq{$start}.=$seq{'1'};
559
$stop{$start}=$stop{'1'};
563
foreach my $start (sort {$a <=> $b} keys %seq) {
566
'end' => $stop{$start},
567
'seq' => $seq{$start}
569
push (@{$self->{'_frag_map_list'}->{$enz}}, \%map);
572
return defined $self->{'_frag_map_list'}->{$enz} ?
573
@{$self->{'_frag_map_list'}->{$enz}} : ();
580
Function : Retrieves an array with the sizes of the fragments
581
Returns : Array that has the sizes of the fragments ordered from
582
largest to smallest like they would appear in a gel.
583
Arguments: An enzyme name to retrieve the sizes for is required and
584
kilobases to the nearest 0.1 kb, else it will be in
585
bp. If the optional third entry is set the results will
588
This is designed to make it easy to see what fragments you should get
591
You should be able to do these:
593
# to see all the fragment sizes,
594
print join "\n", @{$re->sizes($enz)}, "\n";
595
# to see all the fragment sizes sorted
596
print join "\n", @{$re->sizes($enz, 0, 1)}, "\n";
597
# to see all the fragment sizes in kb sorted
598
print join "\n", @{$re->sizes($enz, 1, 1)}, "\n";
603
my ($self, $enz, $kb, $sort) = @_;
604
$self->throw('no enzyme selected to get fragments for')
606
$self->cut unless $self->{'_cut'};
607
my @frag; my $lastsite=0;
608
foreach my $site (@{$self->{'_cut_positions'}->{$enz}}) {
609
$kb ? push (@frag, (int($site-($lastsite))/100)/10)
610
: push (@frag, $site-($lastsite));
613
$kb ? push (@frag, (int($self->{'_seq'}->length-($lastsite))/100)/10)
614
: push (@frag, $self->{'_seq'}->length-($lastsite));
615
if ($self->{'_seq'}->is_circular) {
616
my $first=shift @frag;
618
push @frag, ($first+$last);
620
$sort ? @frag = sort {$b <=> $a} @frag : 1;
625
=head1 How many times does enzymes X cut?
629
=head2 cuts_by_enzyme
631
Title : cuts_by_enzyme
632
Function : Return the number of cuts for an enzyme
633
Returns : An integer with the number of times each enzyme cuts.
634
Returns 0 if doesn't cut or undef if not defined
635
Arguments : An enzyme name string
643
$self->throw("Need an enzyme name")
645
$self->cut unless $self->{'_cut'};
646
return $self->{'_number_of_cuts_by_enzyme'}->{$enz};
649
=head1 Which enzymes cut the sequence N times?
656
Function : Find enzymes that cut a given number of times
657
Returns : a Bio::Restriction::EnzymeCollection
658
Arguments : 1. exact time or lower limit,
659
non-negative integer, optional
660
2. upper limit, non-negative integer,
661
larger or equalthan first, optional
664
If no argumets are given, the method returns all enzymes that do cut
665
the sequence. The argument zero, '0', is same as method
666
zero_cutters(). The argument one, '1', corresponds to unique_cutters.
667
If either of the limits is larger than number of cuts any enzyme cuts the
668
sequence, the that limit is automagically lowered. The method max_cuts()
669
gives the largest number of cuts.
671
See Also : L<unique_cutters|unique_cutters>,
672
L<zero_cutters|zero_cutters>, L<max_cuts|max_cuts>
677
my ($self, $a, $z) = @_;
679
$self->cut unless $self->{'_cut'};
683
$self->throw("Need a non-zero integer [$a]")
684
unless $a =~ /^[+]?\d+$/;
689
$start = $self->{'maximum_cuts'} if $start > $self->{'maximum_cuts'};
692
$self->throw("Need a non-zero integer no smaller than start [0]")
693
unless $z =~ /^[+]?\d+$/ and $z >= $a;
699
$end = $self->{'maximum_cuts'};
701
$end = $self->{'maximum_cuts'} if $end > $self->{'maximum_cuts'};
702
my $set = new Bio::Restriction::EnzymeCollection(-empty => 1);
704
for (my $i=$start; $i<=$end; $i++) {
705
$set->enzymes( @{$self->{_number_of_cuts_by_cuts}->{$i}} )
706
if defined $self->{_number_of_cuts_by_cuts}->{$i};
713
=head2 unique_cutters
715
Title : unique_cutters
716
Function : A special case if cutters() where enzymes only cut once
717
Returns : a Bio::Restriction::EnzymeCollection
721
See also: L<cutters>, L<zero_cutters>
732
Function : A special case if cutters() where enzymes don't cut the sequence
733
Returns : a Bio::Restriction::EnzymeCollection
736
See also: L<cutters>, L<unique_cutters>
747
Function : Find the most number of cuts
748
Returns : The number of times the enzyme that cuts most cuts.
751
This is not a very practical method, but if you are curious...
755
sub max_cuts { return shift->{maximum_cuts} }
757
=head1 Internal methods
764
Function : Figures out which enzymes we know about and cuts the sequence.
767
Comments : An internal method. This will figure out where the sequence
768
should be cut, and provide the appropriate results.
775
my $target_seq=uc $self->{'_seq'}->seq; # I have been burned on this before :)
778
# first, find out all the enzymes that we have
779
foreach my $enz ($self->{'_enzymes'}->each_enzyme) {
781
my @others = $enz->others if $enz->can("others");
782
foreach my $enzyme ($enz, @others) {
783
my ($beforeseq, $afterseq)=$self->_enzyme_sites($enzyme);
786
# if the enzyme is ambiguous we need to use a regexp to find the cut site
787
# otherwise we can use index (much faster)
789
# All of these methods return references to arrays.
790
# All of the arrays are positions in the DNA where the sequence is cut
791
# We will push everything into @all_cuts, and then deconvolute it
792
# and figure everything else out from there.
795
if ($enzyme->is_ambiguous) {
796
$cut_positions= $self->_ambig_cuts($beforeseq, $afterseq, $target_seq, $enzyme);
798
$cut_positions= $self->_nonambig_cuts($beforeseq, $afterseq, $target_seq, $enzyme);
801
push @all_cuts, @$cut_positions;
803
# deal with is_circular sequences
804
if ($self->{'_seq'}->is_circular) {
805
$cut_positions=$self->_circular($beforeseq, $afterseq, $enzyme);
806
push @all_cuts, @$cut_positions;
809
# we need to deal with non-palindromic enzymes separately
810
unless ($enzyme->is_palindromic) {
811
$cut_positions=$self->_non_pal_enz($target_seq, $enzyme);
812
push @all_cuts, @$cut_positions;
816
if (defined $all_cuts[0]) {
817
# now just remove any duplicate cut sites
818
@all_cuts = sort {$a <=> $b} @all_cuts;
819
push @{$self->{'_cut_positions'}->{$enz->name}}, $all_cuts[0];
820
foreach my $i (@all_cuts) {
821
push @{$self->{'_cut_positions'}->{$enz->name}}, $i
822
if $i != ${$self->{'_cut_positions'}->{$enz->name}}[$#{$self->{'_cut_positions'}->{$enz->name}}];
825
# this just fixes an eror when @all_cuts is not defined!
826
@{$self->{'_cut_positions'}->{$enz->name}}=();
829
# note I have removed saving any other information except the
830
# cut_positions this should significantly decrease the amount
831
# of memory that is required for large sequences. It should
832
# also speed things up dramatically, because fragments and
833
# fragment maps are only calculated for those enzymes they are
836
# finally, save minimal information about each enzyme
837
my $number_of_cuts=scalar @{$self->{'_cut_positions'}->{$enz->name}};
838
# now just store the number of cuts
839
$self->{_number_of_cuts_by_enzyme}->{$enz->name}=$number_of_cuts;
840
push (@{$self->{_number_of_cuts_by_cuts}->{$number_of_cuts}}, $enz);
841
if ($number_of_cuts > $self->{maximum_cuts}) {
842
$self->{maximum_cuts}=$number_of_cuts;
850
Title : _enzyme_sites
851
Function : An internal method to figure out the two sides of an enzyme
852
Returns : The sequence before the cut and the sequence after the cut
853
Arguments : A Bio::Restriction::Enzyme object
860
# I have reworked this so that it uses $enz->cut to get the site
863
# split it into the two fragments for the sequence before and after.
864
$site=0 unless defined $site;
865
# shouldn't happen, but what if the cut site is outside of the sequence
866
if ($site < 0 || $site > length($enz->string)) {
867
$self->throw("This is (probably) not your fault.\nGot a cut site of $site and a sequence of ".$enz->string);
870
# the default values just stop an error from an undefined
871
# string. But they don't affect the split.
872
my ($beforeseq, $afterseq)= ('.', '.');
875
$afterseq=$enz->string;
877
elsif ($site == $enz->seq->length) {
878
$beforeseq=$enz->string;
881
$beforeseq=$enz->seq->subseq(1, $site);
882
$afterseq=$enz->seq->subseq($site+1, $enz->seq->length);
885
# if the enzyme is ambiguous we need to convert this into a perl string
886
if ($enz->is_ambiguous) {
887
$beforeseq=$self->_expanded_string($beforeseq);
888
$afterseq =$self->_expanded_string($afterseq);
891
return ($beforeseq, $afterseq);
898
Function : Analyses non_palindromic enzymes for cuts in both ways
899
Returns : A reference to an array of cut positions
900
Arguments: The sequence to check and the enzyme object
905
my ($self, $target_seq, $enz) =@_;
906
# add support for non-palindromic sequences
907
# the enzyme is not the same forwards and backwards
908
my $site=$enz->complementary_cut;
909
# we are going to rc the sequence, so complementary_cut becomes length-complementary_cut
911
my ($beforeseq, $afterseq)=('.', '.');
913
my $new_left_cut=$enz->seq->length-$site;
914
# there is a problem when this is actually zero
915
if ($new_left_cut == 0) {$afterseq=$enz->seq->revcom->seq}
916
elsif ($new_left_cut == $enz->seq->length) {$beforeseq=$enz->seq->revcom->seq}
918
$beforeseq=$enz->seq->revcom->subseq(1, ($enz->seq->length-$site));
919
$afterseq=$enz->seq->revcom->subseq(($enz->seq->length-$site), $enz->seq->length);
922
# complementary cut is the position on the forward strand
923
# correct for reverse strand - I think this is right
925
if ($enz->is_ambiguous) {
926
$results= $self->_ambig_cuts($beforeseq, $afterseq, $target_seq, $enz);
928
$results= $self->_nonambig_cuts($beforeseq, $afterseq, $target_seq, $enz);
931
# deal with is_circular
933
$more_results=$self->_circular($beforeseq, $afterseq, $enz)
934
if ($self->{'_seq'}->is_circular);
935
push my @all_cuts, (@$more_results, @$results);
942
Function : An internal method to localize the cuts in the sequence
943
Returns : A reference to an array of cut positions
944
Arguments : The separated enzyme site, the target sequence, and the enzyme object
945
Comments : This is a slow implementation but works for ambiguous sequences.
946
Whenever possible, _nonambig_cuts should be used as it is a lot faster.
951
my ($self, $beforeseq, $afterseq, $target_seq, $enz) = @_;
953
# cut the sequence. This is done with split so we can use
956
my @cuts = split /($beforeseq)($afterseq)/i, $target_seq;
957
# now the array has extra elements --- the before and after!
961
# element 2 5' end of next sequence
965
# we need to loop through the array and add the ends to the
966
# appropriate parts of the sequence
970
if ($#cuts) { # there is >1 element
973
# the first sequence is a special case
975
$joinedseq=$cuts[$i].$cuts[$i+1];
977
$joinedseq=$cuts[$i-1].$cuts[$i].$cuts[$i+1];
979
# now deal with overlapping sequences
980
# we can do this through a regular regexp as we only
981
# have a short fragment to look through
983
while ($joinedseq =~ /$beforeseq$afterseq/) {
984
$joinedseq =~ s/^(.*?$beforeseq)($afterseq)/$2/;
987
push @re_frags, $joinedseq;
991
# I don't think we want the last fragment in. It is messing up the _circular
992
# part of things. So I deleted this part of the code :)
995
# if we don't cut, leave the array empty
997
} # the sequence was not cut.
999
# now @re_frags has the fragments of all the sequences
1000
# but some people want to have this return the lengths
1003
# in theory the actual cut sites should be the length
1004
# of the fragments in @re_frags
1006
# note, that now this is the only data that we are saving. We
1007
# will have to go back add regenerate re_frags. The reason is
1008
# that we can use this in _circular easier
1010
my @cut_positions = map {length($_)} @re_frags;
1012
# the cut positions are right now the lengths of the sequence, but
1013
# we need to add them all onto each other
1015
for (my $i=1; $i<=$#cut_positions; $i++) {
1016
$cut_positions[$i]+=$cut_positions[$i-1];
1019
# in one of those oddities in life, 2 fragments mean an enzyme cut once
1020
# so $#re_frags is the number of cuts
1021
return \@cut_positions;
1025
=head2 _nonambig_cuts
1027
Title : _nonambig_cuts
1028
Function : Figures out which enzymes we know about and cuts the sequence.
1030
Arguments : The separated enzyme site, the target sequence, and the enzyme object
1032
An internal method. This will figure out where the sequence should be
1033
cut, and provide the appropriate results. This is a much faster
1034
implementation because it doesn't use a regexp, but it can not deal
1035
with ambiguous sequences
1039
sub _nonambig_cuts {
1040
my ($self, $beforeseq, $afterseq, $target_seq, $enz) = @_;
1042
if ($beforeseq eq ".") {$beforeseq = ''}
1043
if ($afterseq eq ".") {$afterseq = ''}
1044
my $index_posn=index($target_seq, $beforeseq.$afterseq);
1045
return [] if ($index_posn == -1); # there is no match to the sequence
1047
# there is at least one cut site
1049
while ($index_posn > -1) {
1050
push (@cuts, $index_posn+length($beforeseq));
1051
$index_posn=index($target_seq, $beforeseq.$afterseq, $index_posn+1);
1058
=head2 _mulitple_cuts
1060
Title : _multiple_cuts
1061
Function : Figures out multiple digests
1062
Returns : An array of the cut sites for multiply digested DNA
1063
Arguments : A Bio::Restriction::EnzymeCollection object
1064
Comments : Double digests is one subset of this, but you can use
1065
as many enzymes as you want.
1069
sub _multiple_cuts {
1071
$self->cut unless $self->{'_cut'};
1073
# now that we are using positions rather than fragments
1074
# this is really easy
1076
foreach my $enz ($ec->each_enzyme) {
1077
push @cuts, @{$self->{'_cut_positions'}->{$enz->name}}
1078
if defined $self->{'_cut_positions'}->{$enz->name};
1080
@{$self->{'_cut_positions'}->{'multiple_digest'}}=sort {$a <=> $b} @cuts;
1084
$number_of_cuts=scalar @{$self->{'_cut_positions'}->{'multiple_digest'}};
1085
$self->{_number_of_cuts_by_enzyme}->{'multiple_digest'}=$number_of_cuts;
1086
push (@{$self->{_number_of_cuts_by_cuts}->{$number_of_cuts}}, 'multiple_digest');
1087
if ($number_of_cuts > $self->{maximum_cuts}) {
1088
$self->{maximum_cuts}=$number_of_cuts;
1096
Function : Deals with circular sequences
1100
There are two problems with circular sequences.
1102
1. When you cut a sequence and rejoin fragments you could generate
1105
2. There could be a cut site at the end of the sequence.
1107
I think these may be the same problem, and so we're working on #2 first!
1112
my ($self, $beforeseq, $afterseq, $enz) = @_;
1113
my $target_seq=uc $self->{'_seq'}->seq; # I have been burned on this before :)
1115
# the approach I am taking is to find out the longest enzyme in the collection
1116
# (I'll have to add a new function in enzyme collection for this)
1117
# and then add more than that sequence from the end of the sequence to the start
1118
# of the sequence, and map the new cut sites for each of the enzymes.
1120
# The cut sites that we are interested in must be within the length of the
1121
# enzyme sequence from the start or the end.
1123
my $longest_enz=$self->{'_enzymes'}->longest_cutter;
1124
my $longest_cut=$longest_enz->recognition_length;
1125
# this is an error that I don't want to deal with at the moment
1126
$self->throw("Crap. The longest recognition site ($longest_cut) is longer than the".
1127
" length of the sequence") if ($longest_cut > $self->{'_seq'}->length);
1129
# newseq is just the last part of the sequence and the first part of the sequence
1130
# we don't want to go through and check the whole sequence again
1132
my ($first, $last) =
1133
(substr($target_seq, 0, $longest_cut),substr($target_seq, -$longest_cut));
1134
my $newseq=$last.$first;
1136
# now find the cut sites
1137
# if the enzyme is ambiguous we need to use a regexp to find the cut site
1138
# otherwise we can use index (much faster)
1140
if ($enz->is_ambiguous) {
1141
$cut_positions= $self->_ambig_cuts($beforeseq, $afterseq, $newseq, $enz);
1143
$cut_positions=$self->_nonambig_cuts($beforeseq, $afterseq, $newseq, $enz);
1146
# the enzyme doesn't cut in the new fragment - likely to be default
1147
return [] if (!$cut_positions);
1149
# now we are going to add things to _cut_positions
1150
# in this shema it doesn't matter if the site is there twice -
1151
# we will take care of that later. Because we are using position
1152
# rather than frag or anything else, we can just
1153
# remove duplicates.
1155
foreach my $cut (@$cut_positions) {
1156
if ($cut == length($last)) {
1157
# the cut is actually at position 0, but we're going to call this the
1158
# length of the sequence so we don't confuse no cuts with a 0 cut
1159
push (@circ_cuts, $self->{'_seq'}->length);
1161
elsif ($cut < length($last)) {
1162
# the cut is before the end of the sequence
1163
# there is VERY likely to be an off by one error here
1164
push (@circ_cuts, $self->{'_seq'}->length - (length($last) - $cut));
1167
# the cut is at the start of the sequence (position >=1)
1168
# there is VERY likely to be an off by one error here
1169
# note, we put this at the beginning of the array rather than the end!
1170
unshift (@circ_cuts, $cut-length($last));
1180
=head2 _expanded_string
1182
Title : _expanded_string
1183
Function : Expand nucleotide ambiguity codes to their representative letters
1184
Returns : The full length string
1185
Arguments : The string to be expanded.
1187
Stolen from the original RestrictionEnzyme.pm
1192
sub _expanded_string {
1193
my ($self, $str) = @_;
1196
$str =~ s/R/\[AG\]/g;
1197
$str =~ s/Y/\[CT\]/g;
1198
$str =~ s/S/\[GC\]/g;
1199
$str =~ s/W/\[AT\]/g;
1200
$str =~ s/M/\[AC\]/g;
1201
$str =~ s/K/\[TG\]/g;
1202
$str =~ s/B/\[CGT\]/g;
1203
$str =~ s/D/\[AGT\]/g;
1204
$str =~ s/H/\[ACT\]/g;
1205
$str =~ s/V/\[ACG\]/g;