16
Bio::Tools::IUPAC - Generates unique Seq objects from an ambiguous Seq object
16
Bio::Tools::IUPAC - Generates unique sequence objects or regular expressions from
17
an ambiguous IUPAC sequence
21
22
use Bio::Tools::IUPAC;
23
my $ambiseq = Bio::Seq->new(-seq => 'ARTCGUTGR', -alphabet => 'dna');
24
my $stream = Bio::Tools::IUPAC->new(-seq => $ambiseq);
26
while ($uniqueseq = $stream->next_seq()) {
27
# process the unique Seq object.
24
# Get the IUPAC code for proteins
25
my %iupac_prot = Bio::Tools::IUPAC->new->iupac_iup;
27
# Create a sequence with degenerate residues
28
my $ambiseq = Bio::PrimarySeq->new(-seq => 'ARTCGUTGN', -alphabet => 'dna');
30
# Create all possible non-degenerate sequences
31
my $iupac = Bio::Tools::IUPAC->new(-seq => $ambiseq);
32
while ($uniqueseq = $iupac->next_seq()) {
33
# process the unique Bio::Seq object.
36
# Get a regular expression that matches all possible sequences
37
my $regexp = $iupac->regexp();
32
IUPAC is a tool that produces a stream of unique, "strict"-satisfying Seq
33
objects from an ambiquous Seq object (containing non-standard characters given
34
the meaning shown below)
36
Extended DNA / RNA alphabet :
37
(includes symbols for nucleotide ambiguity)
38
------------------------------------------
39
Symbol Meaning Nucleic Acid
40
------------------------------------------
59
IUPAC-IUB SYMBOLS FOR NUCLEOTIDE NOMENCLATURE:
60
Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.
62
-----------------------------------
65
------------------------------------------
67
------------------------------------------
69
B Aspartic Acid, Asparagine
93
Z Glutamic Acid, Glutamine
97
IUPAC-IUP AMINO ACID SYMBOLS:
98
Biochem J. 1984 Apr 15; 219(2): 345-373
99
Eur J Biochem. 1993 Apr 1; 213(1): 2
41
Bio::Tools::IUPAC is a tool that manipulates sequences with ambiguous residues
42
following the IUPAC conventions. Non-standard characters have the meaning
45
IUPAC-IUB SYMBOLS FOR NUCLEOTIDE (DNA OR RNA) NOMENCLATURE:
46
Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030
48
------------------------------------------
49
Symbol Meaning Nucleic Acid
50
------------------------------------------
70
IUPAC-IUP AMINO ACID SYMBOLS:
71
Biochem J. 1984 Apr 15; 219(2): 345-373
72
Eur J Biochem. 1993 Apr 1; 213(1): 2
74
------------------------------------------
76
------------------------------------------
78
B Aspartic Acid, Asparagine
102
Z Glutamic Acid, Glutamine
105
There are a few things Bio::Tools::IUPAC can do for you:
111
report the IUPAC mapping between ambiguous and non-ambiguous residues
115
produce a stream of all possible corresponding unambiguous Bio::Seq objects given
116
an ambiguous sequence object
120
convert an ambiguous sequence object to a corresponding regular expression
143
# Let the code begin...
145
166
package Bio::Tools::IUPAC;
148
use vars qw(%IUP %IUB %REV_IUB $AUTOLOAD);
169
use base qw(Bio::Root::Root);
170
use vars qw(%IUB %IUB_AMB %REV_IUB %IUP %IUP_AMB $AUTOLOAD);
151
%IUB = ( A => [qw(A)],
169
%REV_IUB = (A => 'A',
188
%IUP = (A => [qw(A)],
173
# Ambiguous nucleic residues are matched to unambiguous residues
194
# Same as %IUB but ambiguous residues are matched to ambiguous residues only
206
N => [qw(B D H K M N R S V W Y)],
209
# The inverse of %IUB
230
# Same thing with proteins now
218
use base qw(Bio::Root::Root);
223
Usage : Bio::Tools::IUPAC->new( $seq)
224
Function: returns a new seq stream (akin to SeqIO)
225
Returns : a Bio::Tools::IUPAC stream object that will produce unique
226
Seq objects on demand.
227
Args : an ambiguously coded Seq.pm object that has a specified 'alphabet'
273
Usage : Bio::Tools::IUPAC->new($seq);
274
Function: Create a new IUPAC object, which acts as a sequence stream (akin to
276
Args : an ambiguously coded sequence object that has a specified 'alphabet'
277
Returns : a Bio::Tools::IUPAC object.
234
my($class,@args) = @_;
282
my ($class,@args) = @_;
235
283
my $self = $class->SUPER::new(@args);
237
284
my ($seq) = $self->_rearrange([qw(SEQ)],@args);
238
if((! defined($seq)) && @args && ref($args[0])) {
239
# parameter not passed as named parameter?
242
$seq->isa('Bio::Seq') or
243
$self->throw("Must supply a Seq.pm object to IUPAC!");
244
$self->{'_SeqObj'} = $seq;
245
if ($self->{'_SeqObj'}->alphabet() =~ m/^[dr]na$/i ) {
246
# nucleotide seq object
247
$self->{'_alpha'} = [ map { $IUB{uc($_)} }
248
split('', $self->{'_SeqObj'}->seq()) ];
249
} elsif ($self->{'_SeqObj'}->alphabet() =~ m/^protein$/i ) {
250
# amino acid seq object
251
$self->{'_alpha'} = [ map { $IUP{uc($_)} }
252
split('', $self->{'_SeqObj'}->seq()) ];
253
} else { # unknown type: we could make a guess, but let's not.
254
$self->throw("You must specify the 'type' of sequence provided to IUPAC");
256
$self->{'_string'} = [(0) x length($self->{'_SeqObj'}->seq())];
257
scalar @{$self->{'_string'}} or $self->throw("Sequence has zero-length!");
286
if ( (not defined $seq) && @args && ref($args[0]) ) {
287
# parameter not passed as named parameter?
292
if (not $seq->isa('Bio::PrimarySeqI')) {
293
$self->throw('Must supply a sequence object');
295
if (length $seq->seq == 0) {
296
$self->throw('Sequence had zero-length');
298
$self->{'_seq'} = $seq;
307
my %iupac = $self->iupac;
308
$self->{'_alpha'} = [ map { $iupac{uc $_} } split('', $self->{'_seq'}->seq) ];
309
$self->{'_string'} = [(0) x length($self->{'_seq'}->seq())];
258
310
$self->{'_string'}->[0] = -1;
265
Usage : $iupac->next_seq()
266
Function: returns the next unique Seq object
267
Returns : a Seq.pm object
317
Usage : $iupac->next_seq();
318
Function: returns the next unique sequence object
320
Returns : a Bio::Seq object
327
if (not exists $self->{'_string'}) {
328
$self->_initialize();
276
331
for my $i ( 0 .. $#{$self->{'_string'}} ) {
277
next unless $self->{'_string'}->[$i] || @{$self->{'_alpha'}->[$i]} > 1;
278
if ( $self->{'_string'}->[$i] == $#{$self->{'_alpha'}->[$i]} ) { # rollover
279
if ( $i == $#{$self->{'_string'}} ) { # end of possibilities
282
$self->{'_string'}->[$i] = 0;
286
$self->{'_string'}->[$i]++;
288
$self->{'_SeqObj'}->seq(join('', map { $j++; $self->{'_alpha'}->[$j]->[$_]; } @{$self->{'_string'}}));
289
my $desc = $self->{'_SeqObj'}->desc();
290
if ( !defined $desc ) { $desc = ""; }
293
1 while $self->{'_num'} =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
294
$desc =~ s/( \[Bio::Tools::IUPAC-generated\sunique sequence # [^\]]*\])|$/ \[Bio::Tools::IUPAC-generated unique sequence # $self->{'_num'}\]/;
295
$self->{'_SeqObj'}->desc($desc);
296
$self->{'_num'} =~ s/,//g;
297
return $self->{'_SeqObj'};
332
next unless $self->{'_string'}->[$i] || @{$self->{'_alpha'}->[$i]} > 1;
333
if ( $self->{'_string'}->[$i] == $#{$self->{'_alpha'}->[$i]} ) { # rollover
334
if ( $i == $#{$self->{'_string'}} ) { # end of possibilities
337
$self->{'_string'}->[$i] = 0;
341
$self->{'_string'}->[$i]++;
343
my $seqstr = join('', map { $j++; $self->{'_alpha'}->[$j]->[$_]; } @{$self->{'_string'}});
344
my $desc = $self->{'_seq'}->desc() || '';
346
1 while $self->{'_num'} =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
347
$desc =~ s/( \[Bio::Tools::IUPAC-generated\sunique sequence # [^\]]*\])|$/ \[Bio::Tools::IUPAC-generated unique sequence # $self->{'_num'}\]/;
348
$self->{'_num'} =~ s/,//g;
350
# Return a fresh sequence object
351
return Bio::PrimarySeq->new(-seq => $seqstr, -desc => $desc);
360
Usage : my %symbols = $iupac->iupac;
361
Function: Returns a hash of symbols -> symbol components of the right type
362
for the given sequence, i.e. it is the same as iupac_iup() if
363
Bio::Tools::IUPAC was given a proteic sequence, or iupac_iub() if the
364
sequence was nucleic. For example, the key 'M' has the value ['A', 'C'].
372
my $alphabet = lc( $self->{'_seq'}->alphabet() );
373
if ( ($alphabet eq 'dna') or ($alphabet eq 'rna') ) {
374
return %IUB; # nucleic
375
} elsif ( $alphabet eq 'protein' ) {
376
return %IUP; # proteic
378
$self->throw("The input sequence had the unknown alphabet '$alphabet'\n");
387
Usage : my %symbols = $iupac->iupac_amb;
388
Function: Same as iupac() but only contains a mapping between ambiguous residues
389
and the ambiguous residues they map to. For example, the key 'N' has
390
the value ['R', 'Y', 'K', 'M', 'S', 'W', 'B', 'D', 'H', 'V', 'N'],
391
i.e. it matches all other ambiguous residues.
399
my $alphabet = lc( $self->{'_seq'}->alphabet() );
400
if ( ($alphabet eq 'dna') or ($alphabet eq 'rna') ) {
401
return %IUB_AMB; # nucleic
402
} elsif ( $alphabet eq 'protein' ) {
403
return %IUP_AMB; # proteic
405
$self->throw("The input sequence had the unknown alphabet '$alphabet'\n");
304
412
Title : iupac_iup
305
Usage : my %aasymbols = $iupac->iupac_iup
306
Function: Returns a hash of PROTEIN symbols -> symbol components
413
Usage : my %aasymbols = $iupac->iupac_iup;
414
Function: Returns a hash of PROTEIN symbols -> non-ambiguous symbol components
427
Title : iupac_iup_amb
428
Usage : my %aasymbols = $iupac->iupac_iup_amb;
429
Function: Returns a hash of PROTEIN symbols -> ambiguous symbol components
319
442
Title : iupac_iub
320
Usage : my %dnasymbols = $iupac->iupac_iub
321
Function: Returns a hash of DNA symbols -> symbol components
443
Usage : my %dnasymbols = $iupac->iupac_iub;
444
Function: Returns a hash of DNA symbols -> non-ambiguous symbol components
457
Title : iupac_iub_amb
458
Usage : my %dnasymbols = $iupac->iupac_iub;
459
Function: Returns a hash of DNA symbols -> ambiguous symbol components
331
470
=head2 iupac_rev_iub
333
472
Title : iupac_rev_iub
334
Usage : my %dnasymbols = $iupac->iupac_rev_iub
473
Usage : my %dnasymbols = $iupac->iupac_rev_iub;
335
474
Function: Returns a hash of nucleotide combinations -> IUPAC code
336
475
(a reverse of the iupac_iub hash).
349
489
Usage : my $total = $iupac->count();
350
490
Function: Calculates the number of unique, unambiguous sequences that
351
491
this ambiguous sequence could generate
499
if (not exists $self->{'_string'}) {
500
$self->_initialize();
361
503
$count *= scalar(@$_) for (@{$self->{'_alpha'}});
511
Usage : my $re = $iupac->regexp();
512
Function: Converts the ambiguous sequence into a regular expression that
513
matches all of the corresponding ambiguous and non-ambiguous sequences.
514
You can further manipulate the resulting regular expression with the
515
Bio::Tools::SeqPattern module. After you are done building your
516
regular expression, you might want to compile it and make it case-
519
Args : 1 to match RNA: T and U characters will match interchangeably
520
Return : regular expression
525
my ($self, $match_rna) = @_;
527
my $seq = $self->{'_seq'}->seq;
528
my %iupac = $self->iupac;
529
my %iupac_amb = $self->iupac_amb;
530
for my $pos (0 .. length($seq)-1) {
531
my $res = substr $seq, $pos, 1;
532
my $iupacs = $iupac{$res};
533
my $iupacs_amb = $iupac_amb{$res} || [];
534
if (not defined $iupacs) {
535
$self->throw("Primer sequence '$seq' is not a valid IUPAC sequence.".
536
" Offending character was '$res'.\n");
538
my $part = join '', (@$iupacs, @$iupacs_amb);
540
$part =~ s/T/TU/i || $part =~ s/U/TU/i;
542
if (length $part > 1) {
543
$part = '['.$part.']';
368
552
my $self = shift @_;
369
553
my $method = $AUTOLOAD;
370
554
$method =~ s/.*:://;
371
return $self->{'_SeqObj'}->$method(@_)
372
unless $method eq 'DESTROY';
555
return $self->{'_seq'}->$method(@_)
556
unless $method eq 'DESTROY';