1
# $Id: PrimarySeqI.pm,v 1.38.2.2 2002/03/15 12:31:41 heikki Exp $
3
# BioPerl module for Bio::PrimarySeqI
5
# Cared for by Ewan Birney <birney@sanger.ac.uk>
7
# Copyright Ewan Birney
9
# You may distribute this module under the same terms as perl itself
11
# POD documentation - main docs before the code
15
Bio::PrimarySeqI - Interface definition for a Bio::PrimarySeq
19
# get a Bio::PrimarySeqI compliant object somehow
21
# to test this is a seq object
23
$obj->isa("Bio::PrimarySeqI") ||
24
$obj->throw("$obj does not implement the Bio::PrimarySeqI interface");
28
$string = $obj->seq();
29
$substring = $obj->subseq(12,50);
30
$display = $obj->display_id(); # for human display
31
$id = $obj->primary_id(); # unique id for this object,
32
# implementation defined
33
$unique_key= $obj->accession_number();
34
# unique biological id
39
$rev = $obj->revcom();
42
$obj->throw("Could not reverse complement. ".
43
"Probably not DNA. Actual exception\n$@\n");
46
$trunc = $obj->trunc(12,50);
48
# $rev and $trunc are Bio::PrimarySeqI compliant objects
53
This object defines an abstract interface to basic sequence
54
information. PrimarySeq is an object just for the sequence and its
55
name(s), nothing more. Seq is the larger object complete with
56
features. There is a pure perl implementation of this in
57
Bio::PrimarySeq. If you just want to use Bio::PrimarySeq objects, then
58
please read that module first. This module defines the interface, and
59
is of more interest to people who want to wrap their own Perl
60
Objects/RDBs/FileSystems etc in way that they "are" bioperl sequence
61
objects, even though it is not using Perl to store the sequence etc.
64
This interface defines what bioperl consideres necessary to "be" a
65
sequence, without providing an implementation of this. (An
66
implementation is provided in Bio::PrimarySeq). If you want to provide
67
a Bio::PrimarySeq 'compliant' object which in fact wraps another
68
object/database/out-of-perl experience, then this is the correct thing
69
to wrap, generally by providing a wrapper class which would inheriet
70
from your object and this Bio::PrimarySeqI interface. The wrapper class
71
then would have methods lists in the "Implementation Specific
72
Functions" which would provide these methods for your object.
79
User feedback is an integral part of the evolution of this and other
80
Bioperl modules. Send your comments and suggestions preferably to one
81
of the Bioperl mailing lists. Your participation is much appreciated.
83
bioperl-l@bioperl.org - General discussion
84
http://bio.perl.org/MailList.html - About the mailing lists
88
Report bugs to the Bioperl bug tracking system to help us keep track
89
the bugs and their resolution. Bug reports can be submitted via email
92
bioperl-bugs@bio.perl.org
93
http://bio.perl.org/bioperl-bugs/
95
=head1 AUTHOR - Ewan Birney
97
Email birney@sanger.ac.uk
101
The rest of the documentation details each of the object
102
methods. Internal methods are usually preceded with a _
107
# Let the code begin...
110
package Bio::PrimarySeqI;
113
use Bio::Root::RootI;
114
use Bio::Tools::CodonTable;
117
@ISA = qw(Bio::Root::RootI);
119
=head1 Implementation Specific Functions
121
These functions are the ones that a specific implementation must
127
Usage : $string = $obj->seq()
128
Function: Returns the sequence as a string of letters. The
129
case of the letters is left up to the implementer.
130
Suggested cases are upper case for proteins and lower case for
131
DNA sequence (IUPAC standard),
132
but implementations are suggested to keep an open mind about
133
case (some users... want mixed case!)
141
$self->throw_not_implemented();
147
Usage : $substring = $obj->subseq(10,40);
148
Function: returns the subseq from start to end, where the first base
149
is 1 and the number is inclusive, ie 1-2 are the first two
150
bases of the sequence
152
Start cannot be larger than end but can be equal
162
$self->throw_not_implemented();
168
Usage : $id_string = $obj->display_id();
169
Function: returns the display id, aka the common name of the Sequence object.
171
The semantics of this is that it is the most likely string
172
to be used as an identifier of the sequence, and likely to
173
have "human" readability. The id is equivalent to the ID
174
field of the GenBank/EMBL databanks and the id field of the
175
Swissprot/sptrembl database. In fasta format, the >(\S+) is
176
presumed to be the id, though some people overload the id
177
to embed other information. Bioperl does not use any
178
embedded information in the ID field, and people are
179
encouraged to use other mechanisms (accession field for
180
example, or extending the sequence object) to solve this.
182
Notice that $seq->id() maps to this function, mainly for
183
legacy/convience issues
193
$self->throw_not_implemented();
197
=head2 accession_number
199
Title : accession_number
200
Usage : $unique_biological_key = $obj->accession_number;
201
Function: Returns the unique biological id for a sequence, commonly
202
called the accession_number. For sequences from established
203
databases, the implementors should try to use the correct
204
accession number. Notice that primary_id() provides the
205
unique id for the implemetation, allowing multiple objects
206
to have the same accession number in a particular implementation.
208
For sequences with no accession number, this method should return
217
sub accession_number {
218
my ($self,@args) = @_;
219
$self->throw_not_implemented();
227
Usage : $unique_implementation_key = $obj->primary_id;
228
Function: Returns the unique id for this object in this
229
implementation. This allows implementations to manage their
230
own object ids in a way the implementaiton can control
231
clients can expect one id to map to one object.
233
For sequences with no accession number, this method should
234
return a stringified memory location.
244
my ($self,@args) = @_;
245
$self->throw_not_implemented();
252
Usage : if( $obj->can_call_new ) {
253
$newobj = $obj->new( %param );
255
Function: can_call_new returns 1 or 0 depending
256
on whether an implementation allows new
257
constructor to be called. If a new constructor
258
is allowed, then it should take the followed hashed
261
$myobject->new( -seq => $sequence_as_string,
263
-accession_number => $accession
274
my ($self,@args) = @_;
276
# we default to 0 here
284
Usage : if( $obj->alphabet eq 'dna' ) { /Do Something/ }
285
Function: Returns the type of sequence being one of
286
'dna', 'rna' or 'protein'. This is case sensitive.
288
This is not called <type> because this would cause
289
upgrade problems from the 0.5 and earlier Seq objects.
291
Returns : a string either 'dna','rna','protein'. NB - the object must
292
make a call of the type - if there is no type specified it
302
$self->throw_not_implemented();
306
my ($self,@args) = @_;
308
$self->warn("moltype: pre v1.0 method. Calling alphabet() instead...");
309
$self->alphabet(@args);
313
=head1 Optional Implementation Functions
315
The following functions rely on the above functions. An
316
implementing class does not need to provide these functions, as they
317
will be provided by this class, but is free to override these
320
All of revcom(), trunc(), and translate() create new sequence
321
objects. They will call new() on the class of the sequence object
322
instance passed as argument, unless can_call_new() returns FALSE. In
323
the latter case a Bio::PrimarySeq object will be created. Implementors
324
which really want to control how objects are created (eg, for object
325
persistence over a database, or objects in a CORBA framework), they
326
are encouraged to override these methods
331
Usage : $rev = $seq->revcom()
332
Function: Produces a new Bio::PrimarySeqI implementing object which
333
is the reversed complement of the sequence. For protein
334
sequences this throws an exception of "Sequence is a
335
protein. Cannot revcom"
337
The id is the same id as the original sequence, and the
338
accession number is also indentical. If someone wants to
339
track that this sequence has be reversed, it needs to
340
define its own extensions
342
To do an inplace edit of an object you can go:
344
$seq = $seq->revcom();
346
This of course, causes Perl to handle the garbage
347
collection of the old object, but it is roughly speaking as
348
efficient as an inplace edit.
350
Returns : A new (fresh) Bio::PrimarySeqI object
360
# check the type is good first.
361
my $t = $self->alphabet;
363
if( $t eq 'protein' ) {
364
$self->throw("Sequence is a protein. Cannot revcom");
367
if( $t ne 'dna' && $t ne 'rna' ) {
368
if( $self->can('warn') ) {
369
$self->warn("Sequence is not dna or rna, but [$t]. ".
370
"Attempting to revcom, but unsure if this is right");
372
warn("[$self] Sequence is not dna or rna, but [$t]. ".
373
"Attempting to revcom, but unsure if this is right");
377
# yank out the sequence string
379
my $str = $self->seq();
381
# if is RNA - map to DNA then map back
389
$str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
390
my $revseq = CORE::reverse $str;
393
$revseq =~ tr/tT/uU/;
397
if($self->can_call_new()) {
398
$seqclass = ref($self);
400
$seqclass = 'Bio::PrimarySeq';
401
$self->_attempt_to_load_Seq();
403
my $out = $seqclass->new( '-seq' => $revseq,
404
'-display_id' => $self->display_id,
405
'-accession_number' => $self->accession_number,
406
'-alphabet' => $self->alphabet,
407
'-desc' => $self->desc()
416
Usage : $subseq = $myseq->trunc(10,100);
417
Function: Provides a truncation of a sequence,
420
Returns : a fresh Bio::PrimarySeqI implementing object
421
Args : Two integers denoting first and last base of the sub-sequence.
427
my ($self,$start,$end) = @_;
430
if( defined $start && ref($start) &&
431
$start->isa('Bio::LocationI') ) {
432
$str = $self->subseq($start); # start is a location actually
434
$self->throw("trunc start,end");
435
} elsif( $end < $start ) {
436
my $msg = "start [$start] is greater than end [$end. \n".
437
"If you want to truncated and reverse complement, \n".
438
"you must call trunc followed by revcom. Sorry.";
441
$str = $self->subseq($start,$end);
445
if($self->can_call_new()) {
446
$seqclass = ref($self);
448
$seqclass = 'Bio::PrimarySeq';
449
$self->_attempt_to_load_Seq();
452
my $out = $seqclass->new( '-seq' => $str,
453
'-display_id' => $self->display_id,
454
'-accession_number' => $self->accession_number,
455
'-alphabet' => $self->alphabet,
456
'-desc' => $self->desc()
465
Usage : $protein_seq_obj = $dna_seq_obj->translate
466
#if full CDS expected:
467
$protein_seq_obj = $cds_seq_obj->translate(undef,undef,undef,undef,1);
470
Provides the translation of the DNA sequence using full
471
IUPAC ambiguities in DNA/RNA and amino acid codes.
473
The full CDS translation is identical to EMBL/TREMBL
474
database translation. Note that the trailing terminator
475
character is removed before returning the translation
478
Note: if you set $dna_seq_obj->verbose(1) you will get a
479
warning if the first codon is not a valid initiator.
482
Returns : A Bio::PrimarySeqI implementing object
483
Args : character for terminator (optional) defaults to '*'
484
character for unknown amino acid (optional) defaults to 'X'
485
frame (optional) valid values 0, 1, 2, defaults to 0
486
codon table id (optional) defaults to 1
487
complete coding sequence expected, defaults to 0 (false)
488
boolean, throw exception if not complete CDS (true) or defaults to warning (false)
494
my($stop, $unknown, $frame, $tableid, $fullCDS, $throw) = @_;
495
my($i, $len, $output) = (0,0,'');
499
## User can pass in symbol for stop and unknown codons
500
unless(defined($stop) and $stop ne '') { $stop = "*"; }
501
unless(defined($unknown) and $unknown ne '') { $unknown = "X"; }
502
unless(defined($frame) and $frame ne '') { $frame = 0; }
504
## the codon table ID
505
unless(defined($tableid) and $tableid ne '') { $tableid = 1; }
507
##Error if monomer is "Amino"
508
$self->throw("Can't translate an amino acid sequence.") if
509
($self->alphabet eq 'protein');
511
##Error if frame is not 0, 1 or 2
512
$self->throw("Valid values for frame are 0, 1, 2, not [$frame].") unless
513
($frame == 0 or $frame == 1 or $frame == 2);
515
#warns if ID is invalid
516
my $codonTable = Bio::Tools::CodonTable->new( -id => $tableid);
518
my ($seq) = $self->seq();
520
# deal with frame offset.
522
$seq = substr ($seq,$frame);
526
$output = $codonTable->translate($seq);
527
# Use user-input stop/unknown
528
$output =~ s/\*/$stop/g;
529
$output =~ s/X/$unknown/g;
531
# only if we are expecting to translate a complete coding region
533
my $id = $self->display_id;
534
#remove the stop character
535
if( substr($output,-1,1) eq $stop ) {
538
$throw && $self->throw("Seq [$id]: Not using a valid terminator codon!");
539
$self->warn("Seq [$id]: Not using a valid terminator codon!");
541
# test if there are terminator characters inside the protein sequence!
542
if ($output =~ /\*/) {
543
$throw && $self->throw("Seq [$id]: Terminator codon inside CDS!");
544
$self->warn("Seq [$id]: Terminator codon inside CDS!");
546
# if the initiator codon is not ATG, the amino acid needs to changed into M
547
if ( substr($output,0,1) ne 'M' ) {
548
if ($codonTable->is_start_codon(substr($seq, 0, 3)) ) {
549
$output = 'M'. substr($output,1);
552
$self->warn("Seq [$id]: Not using a valid initiator codon!");
554
$self->throw("Seq [$id]: Not using a valid initiator codon!");
560
if($self->can_call_new()) {
561
$seqclass = ref($self);
563
$seqclass = 'Bio::PrimarySeq';
564
$self->_attempt_to_load_Seq();
566
my $out = $seqclass->new( '-seq' => $output,
567
'-display_id' => $self->display_id,
568
'-accession_number' => $self->accession_number,
569
# is there anything wrong with retaining the
571
'-desc' => $self->desc(),
572
'-alphabet' => 'protein'
581
Usage : $id = $seq->id()
582
Function: ID of the sequence. This should normally be (and actually is in
583
the implementation provided here) just a synonym for display_id().
594
return $self->display_id();
601
Usage : $len = $seq->length()
604
Returns : integer representing the length of the sequence.
612
$self->throw_not_implemented();
618
Usage : $seq->desc($newval);
619
$description = $seq->desc();
620
Function: Get/set description text for a seq object
622
Returns : value of desc
623
Args : newvalue (optional)
629
my ($self,$value) = @_;
630
$self->warn_not_implemented();
638
Usage : if( $obj->is_circular) { /Do Something/ }
639
Function: Returns true if the molecule is circular
640
Returns : Boolean value
646
my ($self,$value) = @_;
647
if (defined $value) {
648
$self->{'_is_circular'} = 1 if $value;
650
return $self->{'_is_circular'};
653
=head1 Private functions
655
These are some private functions for the PrimarySeqI interface. You do not
656
need to implement these functions
658
=head2 _attempt_to_load_Seq
660
Title : _attempt_to_load_Seq
670
sub _attempt_to_load_Seq{
673
if( $main::{'Bio::PrimarySeq'} ) {
677
require Bio::PrimarySeq;
680
my $text = "Bio::PrimarySeq could not be loaded for [$self]\n".
681
"This indicates that you are using Bio::PrimarySeqI ".
682
"without Bio::PrimarySeq loaded or without providing a ".
683
"complete implementation.\nThe most likely problem is that there ".
684
"has been a misconfiguration of the bioperl environment\n".
685
"Actual exception:\n\n";
686
$self->throw("$text$@\n");