1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
5
>Web-based SimpleAnalysis HOWTO</TITLE
8
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"></HEAD
24
>Web-based SimpleAnalysis HOWTO</A
36
>Dept. Medical Genetics, University of Edinburgh<BR></SPAN
44
HREF="mailto:radams_at_staffmail.ed.ac.uk"
45
>radams_at_staffmail.ed.ac.uk</A
95
> This HOWTO tries to teach you to run a web based sequence
96
analysis program using basic SimpleAnalysis rules.
108
>Table of Contents</B
122
HREF="#RETRIEVING_RESULTS"
123
>Retrieving results</A
127
HREF="#METASEQUENCES"
133
>How to run the same analysis with varying parameters</A
143
>Interested in developing your own Analysis module?</A
147
HREF="#ACKNOWLEDGMENTS"
161
> There are several situations where it would be useful to run a web
162
based sequence analysis program via a Perl script rather than using
163
the web interface manually or downloading the program. Firstly, the
164
source code or binaries may be unavailable or unavailable on your
165
platform. Secondly, the analysis may depend on a regularly updated or
166
large database that you don't wish to store or keep updated
167
yourself. Thirdly, novel analyses are frequently available via a web
168
server before being available for download.
171
> The aim of the Bio::Tools::Analysis modules is to allow automatic
172
submission of sequences to prediction servers to facilitate sequence
173
annotation for low to medium numbers of sequences (perhaps tens to
174
hundreds). The modules both submit the sequences and parse the results
175
into a number of useful formats, usually including Bio::SeqFeature
176
objects, Bio::Seq::MetaI sequence objects and a basic Perl data
177
structure, as well as the raw output text.
180
> At present the following prediction servers are supported, mainly
181
reflecting my own research interests. Under current development
182
are modules wrapping remote analyses using HMMER, Pfam, the ELM
183
peptide motif database and SIFT for prediction of untolerated
192
>Table 1. Supported analyses</B
197
><COL><COL><COL><THEAD
212
>Protein domain boundaries</TD
214
>Bioinformatics 19, 673-674 (2003)</TD
220
>Protein phosphorylation sites</TD
222
>J Mol Biol 294, 1351-1362 (1999)</TD
228
>Protein Secondary structure</TD
230
>Meths. Enzymology 266, 540-553 (1996)</TD
236
>Protein Secondary structure</TD
238
>Bioinformatics 15,413-421 (1999)</TD
244
>Protein Secondary structure</TD
246
>Comput Appl Biosci 11, 681-684 (1995)</TD
252
>Mitochondrial cleavage site prediction</TD
254
>Eur J Biochem 241, 779-786 (1996)</TD
260
>Exonic splice site enhancers</TD
262
>NAR 31, 3568-3571 (2003)</TD
276
>2. Simple Examples</A
279
> The script below runs multiple analyses on a single sequence
280
and parses the results into standard BioPerl sequence feature
281
objects (Bio::SeqFeature::Generic objects to be precise).
283
CLASS="PROGRAMLISTING"
284
> # load up the modules
286
use Bio::Tools::Analysis::Protein::HNN;
287
use Bio::Tools::Analysis::Protein::Domcut;
288
use Bio::Tools::Analysis::Protein::MitoProt;
289
use Bio::Tools::Analysis::Protein::NetPhos;
292
our @ISA = qw(Bio::Tools::Analysis::SimpleAnalysisBase);
294
my $seq;# a Bio::Seq object
295
for my $method ( qw(Domcut MitoProt NetPhos HNN)) {
297
#analyses need a Bio::PrimarySeq, not a Bio::Seq;
298
my $tool = Bio::Tools::Analysis::Protein::$method->new(
299
-seq => $seq->primary_seq);
301
my @fts = $tool->result('Bio::SeqFeatureI');
302
$seq->add_SeqFeature(@fts);
308
> The above script runs several analyses using default parameters. All
309
analyses have such defaults and in general only a sequence of the
310
appropriate type is needed for the analysis to be submitted. A sequence
311
can be added either in the constructor, as shown above, or by the seq()
314
CLASS="PROGRAMLISTING"
315
> my $primary_seq = new Bio::PrimarySeq(-seq=>$seq_as_string);
316
my $tool = new Bio::Tools::Analysis::Protein::NetPhos();
317
$tool->seq($primary_seq);
320
Note that the only valid sequence format is a Bio::PrimarySeq object.
321
This is in order to support multiple methods of retrieving the
322
results. If you initially have a Bio::Seq object or Bio::RichSeq
323
object (e.g., from a GenBank file) you can call its primary_seq()
324
method to obtain a Bio::PrimarySeq object.
332
NAME="RETRIEVING_RESULTS"
333
>3. Retrieving results</A
336
> If the run() method executes successfully, the raw output (stripped of
337
HTML) is now stored in the Analysis object.
338
All modules should return the raw report by a call to result()
339
with no parameters. e.g.,
341
CLASS="PROGRAMLISTING"
342
> my $raw_report = $tool->result();
345
A second way is to retrieve a ready-parsed data structure:
347
CLASS="PROGRAMLISTING"
348
> my $parsed_report = $tool->result('parsed');
351
The data structure returned is described in the $RESULT_SPEC->{'raw'}
352
hash reference and should always be a native Perl data structure.
355
> A third way is to retrieve an array of sequence features:
357
CLASS="PROGRAMLISTING"
358
> my @fts = $tool->result('Bio::SeqFeatureI');
359
$seq->add_SeqFeature(@fts); # add features to sequence.
363
These are Bio::SequenceFeature::Generic features. Sometimes a
364
module might use some code to judge the significance of a
365
result prior to listing it as a feature, for example in the
366
secondary structure prediction modules. The rules governing this
367
are described in individual modules. For example, I have put in a rule
368
that only runs of a minimum of 4 consecutive residues predicted to be
369
beta sheet or alpha helix can be annotated as features - it makes no sense
370
for a single isolated residue to be annotated as being in a helix.
371
However you might want to filter the raw results yourself, in which
372
case retrieving the results as Bio::Seq::Meta::Array type objects
386
> Many analyses produce a list of scores, one for each residue
387
in a sequence. For example, the protein secondary structure
388
prediction program Sopma returns a list of probabilities for
389
each residue being in helix, sheet, turn or coil. These
390
modules make this data available in the form of meta
391
sequences. These are sequence objects derived from
392
Bio::PrimarySeq that have arrays of sequence associated data
393
contained in the object. The meta sequence names should be
394
listed in the individual module documentation. To retrieve
395
results like this supply the string 'meta' as a parameter
396
to the result() method.
397
Meta sequence objects can access all the PrimarySeq object
398
methods for accessing and manipulating the protein/DNA sequence
399
and also have specific methods for accessing the result data.
402
CLASS="PROGRAMLISTING"
403
> $meta_seq = $analysis->result('meta');
407
This returns a sequence object with the raw analysis data
408
accessible through methods e.g.,
411
CLASS="PROGRAMLISTING"
412
> my @scores1_20 = $meta_seq->named_sub_meta('Sopma_turn', 1,20);
416
returns an array of scores for the first 20 amino acids
419
CLASS="PROGRAMLISTING"
420
> my @allscores = $meta_seq->named_meta('Sopma_turn');
424
returns an array of scores for the whole sequence. The names of
425
individual meta sequence names are listed in the module
429
> Although a full description of how metasequence objects work isn't
430
really the purpose of this HOWTO there is excellent documentation in
431
Bio::Seq::MetaI and Bio::Seq::Meta::Array modules.
440
>5. How to run the same analysis with varying parameters</A
443
> You might want to run some analyses with varying parameters in
444
order to determine the effect on the prediction.
445
At present only the Sopma module takes alternative parameters i.e.
446
arguments other than just the sequence. Any parameter that is settable
447
on the web form should have a method of the same name to get/set its
448
values, or alternatively it can be set in the constructor.
451
CLASS="PROGRAMLISTING"
452
> my $sopma = Bio::Tools::Analysis::Protein::Sopma->new();
453
$sopma->seq(seqobj->primary_seq);
454
$sopma->window_width(21);
458
So, let's suppose we want to investigate how varying the window_width
459
parameter affects the secondary structure prediction for a sequence.
460
We can do this inthe following manner:
464
CLASS="PROGRAMLISTING"
465
> my $seq = Bio::Seq->new(-seq => 'ASFATFDATFATFSTFATSFATFSTAF');
466
my $tool = Bio::Tools::Analysis::Protein::Sopma->new
467
(-seq=>$seq->primary_seq);
469
for my $w_size(qw(11 13 15 17 19 21 23)) {
471
$tool->window_width($w_size); #set new parameter
472
$tool->run(); #default parameters
474
#2nd param is appended to metasequence name
475
$tool->result('meta', $w_size);
477
## add to sequence features
478
$seq->add_SeqFeature(
479
$tool->result('Bio::SeqFeatureI'));
480
$tool->clear(); #removes raw data from the previous analysis
482
# meta seq now has 28 arrays of metadats - 7 analyses and
483
# 4 prediction categories,
485
## now get meta_sequence
486
my $meta_seq = $tool->seq();
489
Only 3 new points are raised by this program. Firstly, each time
490
the analysis is run it needs to be parsed immediately. The raw
491
report data is removed from one analysis to another to reduce
492
memory usage. Secondly, in order to distinguish the meta data
493
each analysis needs to be given a specific identifier which is
494
appended to the default name. So in this case the meta data names
495
would be Sopma_meta|11, Sopma_meta|13, etc.
496
Thirdly, the clear() method should be called between
497
analyses of the same sequence to ensure the internal data fields
498
that hold the raw report are removed. So if you want to keep the raw
499
reports you need to store them after each analysis is returned.
502
> So, how are features obtained from multiple runs distinguished?
503
This information is contained in tags with the same name as the
504
parameters, when the settings aren't the default ones. In other words,
505
the features retain knowledge of the analysis method's parameters.
508
CLASS="PROGRAMLISTING"
510
## carrying on from previous listing.... ##
512
## get all secondary structure features
513
my @sec_fts = grep{$_->primary_tag eq '2ary'}
514
$seq->all_SeqFeatures;
515
my @ww11 = grep{($_->each_tag_value('method'))[0] eq 'Sopma' &&
516
($_->each_tag_value('window_width'))[0] == 11}
519
## now onto comparison..... ##
535
> The problem with these analyses is their speed of execution, which
536
obviously depends on your network speed, the complexity of the
537
analysis and the speed of the host. Moreover, it is usually polite to
538
leave a second or two between requests to avoid blocking the server,
539
which means that to annotate a single protein sequence with all of the
540
X methods available may take over a minute. Certainly these modules
541
are unsuitable for genome scale analysis and are designed for use with
542
smaller numbers of sequences. However, the code in the result()
543
methods should still be usable for parsing analyses run locally when
544
the local output and webpage output are the same format.
554
>7. Interested in developing your own Analysis module?</A
557
> Most of the hard work is done by Bio::WebAgent and
558
Bio::SimpleAnalysisBase. Any module must inherit from this latter
559
module as it contains methods common to all analyses.
562
> For a minimal prediction server which takes just sequence as a
563
parameter (e.g., Domcut), only 3 methods need to be explicitly
567
> 1. _run() which mimics the web form, submits the sequence
568
and extracts the raw text data of the result from the HTML.
571
> 2. result() which parses the raw data into whatever useful format you
572
wish. Usually these include SeqFeature objects, standard Perl data
573
structures and meta sequences if there is a result for each residue in
577
> 3. _init() which imports the analysis specifications into the
581
> As well as these methods, various file scoped lexical hashes need to
582
be defined which hold constant data about the analysis, and the
583
analysis result. These are useful for reference and are incorporated
584
into the analysis objects.
588
For more complicated prediction programs with analysis specific
589
parameters (e.g., Sopma), get/set methods need to be written. Also, if
590
any of these parameters need special error checking then a
591
check_parameters() needs to be written as well. The nature of these
592
parameters should be listed in a hash referenced by the $INPUT_SPEC
597
> Alternatively, mail me with your suggestion and I'll try to put one
598
together. It is my intent to keep the modules up to date with new
599
analysis programs as they become available which are not included in
600
the usual protein annotations and will be glad to hear of new
609
NAME="ACKNOWLEDGMENTS"
610
>8. Acknowledgments</A
613
> These modules depend on several recently developed modules and wouldn't
614
work at all without them.
615
Bio::Seq::Meta:: modules:
616
Chad Matsalla, Aaron Mackey and Heikki Lehvaslaiho
619
Bio::SimpleAnalysisI:
b'\\ No newline at end of file'