1
#----------------------------------------------------------------------------
2
# PACKAGE : Bio::Tools::Blast.pm
3
# PURPOSE : To encapsulate code for running, parsing, and analyzing
5
# AUTHOR : Steve Chervitz (sac@bioperl.org)
7
# REVISION: $Id: Blast.pm,v 1.24.2.1 2002/03/10 17:00:30 jason Exp $
10
# For the latest version and documentation, visit:
11
# http://bio.perl.org/Projects/Blast
13
# To generate documentation, run this module through pod2html
14
# (preferably from Perl v5.004 or better).
16
# Copyright (c) 1996-2000 Steve Chervitz. All Rights Reserved.
17
# This module is free software; you can redistribute it and/or
18
# modify it under the same terms as Perl itself.
19
#----------------------------------------------------------------------------
21
package Bio::Tools::Blast;
25
use Bio::Tools::SeqAnal;
26
use Bio::Root::Global qw(:std);
27
use Bio::Root::Utilities qw(:obj);
32
use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS
33
$ID $VERSION $Blast @Blast_programs $Revision $Newline);
35
@ISA = qw( Bio::Tools::SeqAnal Exporter);
37
@EXPORT_OK = qw($VERSION $Blast);
38
%EXPORT_TAGS = ( obj => [qw($Blast)],
41
$ID = 'Bio::Tools::Blast';
43
$Revision = '$Id: Blast.pm,v 1.24.2.1 2002/03/10 17:00:30 jason Exp $'; #'
45
## Static Blast object.
48
$Blast->{'_name'} = "Static Blast object";
50
@Blast_programs = qw(blastp blastn blastx tblastn tblastx);
52
use vars qw($DEFAULT_MATRIX $DEFAULT_SIGNIF);
53
my $DEFAULT_MATRIX = 'BLOSUM62';
54
my $DEFAULT_SIGNIF = 999;# Value used as significance cutoff if none supplied.
55
my $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
61
Bio::Tools::Blast.pm - Bioperl BLAST sequence analysis object
65
=head2 Parsing Blast reports
67
Parse an existing Blast report from file:
69
use Bio::Tools::Blast;
71
$blastObj = Bio::Tools::Blast->new( -file => '/tmp/blast.out',
76
Parse an existing Blast report from STDIN:
78
$blastObj = Bio::Tools::Blast->new( -parse => 1,
82
Then send a Blast report to your script via STDIN.
84
Full parameters for parsing Blast reports.
91
-filt_func => \&my_filter,
98
-exec_func => \&process_blast,
99
-save_array => \@blast_objs, # not used if -exce_func defined.
102
See L<parse()|parse> for a description of parameters and see L<USAGE | USAGE> for
103
more examples including how to parse streams containing multiple Blast
104
reports L<Using the Static $Blast Object>.
106
See L<Memory Usage Issues> for information about how to make Blast
107
parsing be more memory efficient.
109
=head2 Running Blast reports
111
Run a new Blast2 at NCBI and then parse it:
116
-database => 'swissprot',
117
-seqs => [ $seq ], # Bio::Seq.pm objects.
120
$blastObj = Bio::Tools::Blast->new( -run => \%runParam,
126
Full parameters for running Blasts at NCBI using Webblast.pm:
131
-version => 2, # BLAST2
132
-database =>'swissprot',
134
-seqs => [ $seqObject ], # Bio::Seq.pm object(s)
140
-email => undef, # don't send report via e-mail if parsing.
141
-filter => undef, # use default
142
-gap_c => undef, # use default
143
-gap_e => undef, # use default
144
-word => undef, # use default
145
-min_len => undef, # use default
148
See L<run()|run> and L<USAGE | USAGE> for more information about running Blasts.
150
=head2 HTML-formatting Blast reports
152
Print an HTML-formatted version of a Blast report:
154
use Bio::Tools::Blast qw(:obj);
156
$Blast->to_html($filename);
157
$Blast->to_html(-file => $filename,
158
-header => "<H1>Blast Results</H1>");
159
$Blast->to_html(-file => $filename,
160
-out => \@array); # store output
161
$Blast->to_html(); # use STDIN
163
Results are sent directly to STDOUT unless an C<-out =E<gt> array_ref>
164
parameter is supplied. See L<to_html()|to_html> for details.
168
This module is included with the central Bioperl distribution:
170
http://bio.perl.org/Core/Latest
171
ftp://bio.perl.org/pub/DIST
173
Follow the installation instructions included in the README file.
177
The Bio::Tools::Blast.pm module encapsulates data and methods for
178
running, parsing, and analyzing pre-existing BLAST reports. This
179
module defines an application programming interface (API) for working
180
with Blast reports. A Blast object is constructed from raw Blast
181
output and encapsulates the Blast results which can then be accessed
182
via the interface defined by the Blast object.
184
The ways in which researchers use Blast data are many and varied. This
185
module attempts to be general and flexible enough to accommodate
186
different uses. The Blast module API is still at an early stage of
187
evolution and I expect it to continue to evolve as new uses for Blast
188
data are developed. Your L<FEEDBACK | FEEDBACK> is welcome.
194
=item * Supports NCBI Blast1.x, Blast2.x, and WashU-Blast2.x, gapped
197
Can parse HTML-formatted as well as non-HTML-formatted reports.
199
=item * Launch new Blast analyses remotely or locally.
201
Blast objects can be constructed directly from the results of the
202
run. See L<run()|run>.
204
=item * Construct Blast objects from pre-existing files or from a new run.
206
Build a Blast object from a single file or build multiple Blast
207
objects from an input stream containing multiple reports. See
210
=item * Add hypertext links from a BLAST report.
212
See L<to_html()|to_html>.
214
=item * Generate sequence and sequence alignment objects from HSP
217
If you have Bio::Seq.pm and Bio::UnivAln.pm installed on your system,
218
they can be used for working with high-scoring segment pair (HSP)
219
sequences in the Blast alignment. (A new version of Bio::Seq.pm is
220
included in the distribution, see L<INSTALLATION | INSTALLATION>). For more
221
information about them, see:
223
http://bio.perl.org/Projects/Sequence/
224
http://bio.perl.org/Projects/SeqAlign/
228
A variety of different data can be extracted from the Blast report by
229
querying the Blast.pm object. Some basic examples are given in the
230
L<USAGE | USAGE> section. For some working scripts, see the links provided in
231
the L<the DEMO SCRIPTS section | DEMO> section.
233
As a part of the incipient Bioperl framework, the Bio::Tools::Blast.pm
234
module inherits from B<Bio::Tools::SeqAnal.pm>, which provides some
235
generic functionality for biological sequence analysis. See the
236
documentation for that module for details
237
(L<Links to related modules>).
239
=head2 The BLAST Program
241
BLAST (Basic Local Alignment Search Tool) is a widely used algorithm
242
for performing rapid sequence similarity searches between a single DNA
243
or protein sequence and a large dataset of sequences. BLAST analyses
244
are typically performed by dedicated remote servers, such as the ones
245
at the NCBI. Individual groups may also run the program on local
248
The Blast family includes 5 different programs:
251
------------ ----------
252
blastp -- protein protein
253
blastn -- nucleotide nucleotide
254
blastx -- nucleotide* protein
255
tblastn -- protein nucleotide*
256
tblastx -- nucleotide* nucleotide*
258
* = dynamically translated in all reading frames, both strands
260
See L<References & Information about the BLAST program>.
262
=head2 Versions Supported
264
BLAST reports generated by different application front ends are similar
265
but not exactly the same. Blast reports are not intended to be exchange formats,
266
making parsing software susceptible to obsolescence. This module aims to
267
support BLAST reports generated by different implementations:
269
Implementation Latest version tested
270
-------------- --------------------
271
NCBI Blast1 1.4.11 [24-Nov-97]
272
NCBI Blast2 2.0.8 [Jan-5-1999]
273
WashU-BLAST2 2.0a19MP [05-Feb-1998]
276
Support for both gapped and ungapped versions is included. Currently, there
277
is only rudimentary support for PSI-BLAST in that these reports can be parsed but
278
there is no special treatment of separate iteration rounds (they are all
281
=head2 References & Information about the BLAST program
285
http://www.ncbi.nlm.nih.gov/BLAST/ - Homepage at NCBI
286
http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html - Help manual
287
http://blast.wustl.edu/ - WashU-Blast2
289
B<PUBLICATIONS:> (with PubMed links)
291
Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. (1990).
292
"Basic local alignment search tool", J Mol Biol 215: 403-410.
294
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=2231712&form=6&db=m&Dopt=r
296
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
297
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997).
298
"Gapped BLAST and PSI-BLAST: a new generation of protein database
299
search programs", Nucleic Acids Res. 25:3389-3402.
301
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9254694&form=6&db=m&Dopt=r
303
Karlin, Samuel and Stephen F. Altschul (1990). Methods for
304
assessing the statistical significance of molecular sequence
305
features by using general scoring schemes. Proc. Natl. Acad.
308
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=2315319&form=6&db=m&Dopt=b
310
Karlin, Samuel and Stephen F. Altschul (1993). Applications
311
and statistics for multiple high-scoring segments in molecu-
312
lar sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.
314
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=8390686&form=6&db=m&Dopt=b
318
=head2 Creating Blast objects
320
A Blast object can be constructed from the contents of a Blast report
321
using a set of named parameters that specify significance criteria for
322
parsing. The report data can be read in from an existing file
323
specified with the C<-file =E<gt> 'filename'> parameter or from a
324
STDIN stream containing potentially multiple Blast reports. If the
325
C<-file> parameter does not contain a valid filename, STDIN will be
326
used. Separate Blast objects will be created for each report in the
329
To parse the report, you must include a C<-parse =E<gt> 1> parameter
330
in addition to any other parsing parameters
331
See L<parse()|parse> for a full description of parsing parameters.
332
To run a new report and then parse it, include a C<-run =E<gt> \%runParams>
333
parameter containing a reference to a hash
334
that hold the parameters required by the L<run()|run> method.
336
The constructor for Blast objects is inherited from Bio::Tools::SeqAnal.pm.
337
See the B<_initialize>() method of that package for general information
338
relevant to creating Blast objects. (The B<new>() method, inherited from
339
B<Bio::Root::Object.pm>, calls B<_initialize>(). See L<Links to related modules>).
341
The Blast object can read compressed (gzipped) Blast report
342
files. Compression/decompression uses the gzip or compress programs
343
that are standard on Unix systems and should not require special
344
configuration. If you can't or don't want to use gzip as the file
345
compression tool, either pre-uncompress your files before parsing with
346
this module or modify B<Bio::Root::Utilities.pm> to your liking.
348
Blast objects can be generated either by direct instantiation as in:
350
use Bio::Tools::Blast;
351
$blast = new Bio::Tools::Blast (%parameters);
353
=head2 Using the Static $Blast Object
355
use Bio::Tools::Blast qw(:obj);
357
This exports the static $Blast object into your namespace. "Static"
358
refers to the fact that it has class scope and there is one of these
359
created when you use this module. The static $Blast object is
360
basically an empty object that is provided for convenience and is also
361
used for various internal chores.
363
It is exported by this module and can be used for
364
parsing and running reports as well as HTML-formatting without having
365
to first create an empty Blast object.
367
Using the static $Blast object for parsing a STDIN stream of Blast reports:
369
use Bio::Tools::Blast qw(:obj);
372
my $blastObj = shift;
373
print $blastObj->table();
377
$Blast->parse( -parse => 1,
379
-exec_func => \&process_blast,
382
Then pipe a stream of Blast reports into your script via STDIN. For
383
each Blast report extracted from the input stream, the parser will
384
generate a new Blast object and pass it to the function specified by
385
C<-exec_func>. The destroy() call tells Perl to free the memory
386
associated with the object, important if you are crunching through
387
many reports. This method is inherited from B<Bio::Root::Object.pm>
388
(see L<Links to related modules>). See L<parse()|parse> for a full
389
description of parameters and L<the DEMO SCRIPTS section | DEMO> for
392
=head2 Running Blasts
394
To run a Blast, create a new Blast object with a C<-run =E<gt>
395
\%runParams> parameter. Remote Blasts are performed by including a
396
C<-method =E<gt> 'remote'> parameter; local Blasts are performed by
397
including a C<-method =E<gt> 'local'> parameter. See
398
L<Running Blast reports> as well as the
399
L<the DEMO SCRIPTS section | DEMO> for examples.
400
Note that running local Blasts is not yet supported, see below.
402
Note that the C<-seqs =E<gt> [ $seqs ]> run parameter must contain a
403
reference to an array of B<Bio::Seq.pm> objects
404
(L<Links to related modules>). Encapsulating the sequence in an
405
object makes sequence information much easier to handle as it can
406
be supplied in a variety of formats. Bio::Seq.pm is included with
407
this distribution (L<INSTALLATION | INSTALLATION>).
409
Remote Blasts are implemented using the
410
B<Bio::Tools::Blast::Run::Webblast.pm> module. Local Blasts require
411
that you customize the B<Bio::Tools::Blast::Run::LocalBlast.pm>
412
module. The version of LocalBlast.pm included with this distribution
413
provides the basic framework for running local Blasts.
414
See L<Links to related modules>.
416
=head2 Significance screening
418
A C<-signif> parameter can be used to screen out all hits with
419
P-values (or Expect values) above a certain cutoff. For example, to
420
exclude all hits with Expect values above 1.0e-10: C<-signif =E<gt>
421
1e-10>. Providing a C<-signif> cutoff can speed up processing
422
tremendously, since only a small fraction of the report need be
423
parsed. This is because the C<-signif> value is used to screen hits
424
based on the data in the "Description" section of the Blast report:
426
For NCBI BLAST2 reports:
429
Sequences producing significant alignments: (bits) Value
431
sp|P31376|YAB1_YEAST HYPOTHETICAL 74.1 KD PROTEIN IN CYS3-MDM10... 957 0.0
433
For BLAST1 or WashU-BLAST2 reports:
438
Sequences producing High-scoring Segment Pairs: Score P(N) N
440
PDB:3PRK_E Proteinase K complexed with inhibitor ........... 504 1.8e-50 1
442
Thus, the C<-signif> parameter will screen based on Expect values for
443
BLAST2 reports and based on P-values for BLAST1/WashU-BLAST2 reports.
445
To screen based on other criteria, you can supply a C<-filt_func>
446
parameter containing a function reference that takes a
447
B<Bio::Tools::Sbjct.pm> object as an argument and returns a boolean,
448
true if the hit is to be screened out. See example below for
449
L<Screening hits using arbitrary criteria>.
451
=head2 Get the best hit.
453
$hit = $blastObj->hit;
455
A "hit" is contained by a B<Bio::Tools::Blast::Sbjct.pm> object.
457
=head2 Get the P-value or Expect value of the most significant hit.
459
$p = $blastObj->lowest_p;
460
$e = $blastObj->lowest_expect;
464
$p = $blastObj->hit->p;
465
$e = $blastObj->hit->expect;
467
Note that P-values are not reported in NCBI Blast2 reports.
469
=head2 Iterate through all the hits
471
foreach $hit ($blastObj->hits) {
472
printf "%s\t %.1e\t %d\t %.2f\t %d\n",
473
$hit->name, $hit->expect, $hit->num_hsps,
474
$hit->frac_identical, $hit->gaps;
477
Refer to the documentation for B<Bio::Tools::Blast::Sbjct.pm>
478
for other ways to work with hit objects (L<Links to related modules>).
480
=head2 Screening hits using arbitrary criteria
482
sub filter { $hit=shift;
483
return ($hit->gaps == 0 and
484
$hit->frac_conserved > 0.5); }
486
$blastObj = Bio::Tools::Blast->new( -file => '/tmp/blast.out',
488
-filt_func => \&filter );
490
While the Blast object is parsing the report, each hit checked by calling
491
&filter($hit). All hits that generate false return values from &filter
492
are screened out and will not be added to the Blast object.
493
Note that the Blast object will normally stop parsing the report after
494
the first non-significant hit or the first hit that does not pass the
495
filter function. To force the Blast object to check all hits,
496
include a C<-check_all_hits =E<gt> 1> parameter.
497
Refer to the documentation for B<Bio::Tools::Blast::Sbjct.pm>
498
for other ways to work with hit objects.
502
=item Hit start, end coordinates.
504
print $sbjct->start('query');
505
print $sbjct->end('sbjct');
507
In array context, you can get information for both query and sbjct with one call:
509
($qstart, $sstart) = $sbjct->start();
510
($qend, $send) = $sbjct->end();
512
For important information regarding coordinate information, see
513
the L<HSP start, end, and strand> section below.
514
Also check out documentation for the start and end methods in B<Bio::Tools::Blast::Sbjct.pm>,
515
which explains what happens if there is more than one HSP.
519
=head2 Working with HSPs
523
=item Iterate through all the HSPs of every hit
525
foreach $hit ($blastObj->hits) {
526
foreach $hsp ($hit->hsps) {
527
printf "%.1e\t %d\t %.1f\t %.2f\t %.2f\t %d\t %d\n",
528
$hsp->expect, $hsp->score, $hsp->bits,
529
$hsp->frac_identical, $hsp->frac_conserved,
530
$hsp->gaps('query'), $hsp->gaps('sbjct');
533
Refer to the documentation for B<Bio::Tools::Blast::HSP.pm>
534
for other ways to work with hit objects (L<Links to related modules>).
540
=item Extract HSP sequence data as strings or sequence objects
542
Get the first HSP of the first hit and the sequences
543
of the query and sbjct as strings.
545
$hsp = $blast_obj->hit->hsp;
546
$query_seq = $hsp->seq_str('query');
547
$hsp_seq = $hsp->seq_str('sbjct');
549
Get the indices of identical and conserved positions in the HSP query seq.
551
@query_iden_indices = $hsp->seq_inds('query', 'identical');
552
@query_cons_indices = $hsp->seq_inds('query', 'conserved');
554
Similarly for the sbjct sequence.
556
@sbjct_iden_indices = $hsp->seq_inds('sbjct', 'identical');
557
@sbjct_cons_indices = $hsp->seq_inds('sbjct', 'conserved');
559
print "Query in Fasta format:\n", $hsp->seq('query')->layout('fasta');
560
print "Sbjct in Fasta format:\n", $hsp->seq('sbjct')->layout('fasta');
562
See the B<Bio::Seq.pm> package for more information about using these sequence objects
563
(L<Links to related modules>).
569
=item Create sequence alignment objects using HSP sequences
571
$aln = $hsp->get_aln;
572
print " consensus:\n", $aln->consensus();
573
print $hsp->get_aln->layout('fasta');
575
$ENV{READSEQ_DIR} = '/home/users/sac/bin/solaris';
576
$ENV{READSEQ} = 'readseq';
577
print $hsp->get_aln->layout('msf');
579
MSF formated layout requires Don Gilbert's ReadSeq program (not included).
580
See the B<Bio::UnivAln.pm> for more information about using these alignment objects
581
(L<Links to related modules>)'.
587
=item HSP start, end, and strand
589
To facilitate HSP processing, endpoint data for each HSP sequence are
590
normalized so that B<start is always less than end>. This affects TBLASTN
591
and TBLASTX HSPs on the reverse complement or "Minus" strand.
593
Some examples of obtaining start, end coordinates for HSP objects:
595
print $hsp->start('query');
596
print $hsp->end('sbjct');
597
($qstart, $sstart) = $hsp->start();
598
($qend, $send) = $hsp->end();
600
Strandedness of the HSP can be assessed using the strand() method
603
print $hsp->strand('query');
604
print $hsp->strand('sbjct');
606
These will return 'Minus' or 'Plus'.
607
Or, to get strand information for both query and sbjct with a single call:
609
($qstrand, $sstrand) = $hsp->strand();
613
=head2 Report Generation
617
=item Generate a tab-delimited table of all results.
619
print $blastObj->table;
620
print $blastObj->table(0); # don't include hit descriptions.
621
print $blastObj->table_tiled;
623
The L<table()|table> method returns data for each B<HSP> of each hit listed one per
624
line. The L<table_tiled()|table_tiled> method returns data for each B<hit, i.e., Sbjct>
625
listed one per line; data from multiple HSPs are combined after tiling to
626
reduce overlaps. See B<Bio::Tools::Blast::Sbjct.pm> for more information about
627
HSP tiling. These methods generate stereotypical, tab-delimited data for each
628
hit of the Blast report. The output is suitable for importation into
629
spreadsheets or database tables. Feel free to roll your own table function if
630
you need a custom table.
632
For either table method, descriptions of each hit can be included if a
633
single, true argument is supplied (e.g., $blastObj-E<gt>table(1)). The description
634
will be added as the last field. This will significantly increase the size of
635
the table. Labels for the table columns can be obtained with L<table_labels()|table_labels>
636
and L<table_labels_tiled()|table_labels_tiled>.
642
=item Print a summary of the Blast report
644
$blastObj->display();
645
$blastObj->display(-show=>'hits');
647
L<display()|display> prints various statistics extracted from the Blast report
648
such as database name, database size, matrix used, etc. The
649
C<display(-show=E<gt>'hits')> call prints a non-tab-delimited table
650
attempting to line the data up into more readable columns. The output
651
generated is similar to L<table_tiled()|table_tiled>.
657
=item HTML-format an existing report
659
use Bio::Tools::Blast qw(:obj);
661
# Going straight from a non HTML report file to HTML output using
662
# the static $Blast object exported by Bio::Tools::Blast.pm
663
$Blast->to_html(-file => '/usr/people/me/blast.output.txt',
664
-header => qq|<H1>BLASTP Results</H1><A HREF="home.html">Home</A>|
667
# You can also use a specific Blast object created previously.
670
L<to_html()|to_html> will send HTML output, line-by-line, directly to STDOUT
671
unless an C<-out =E<gt> array_ref> parameter is supplied (e.g., C<-out
672
=E<gt> \@array>), in which case the HTML will be stored in @array, one
673
line per array element. The direct outputting permits faster response
674
time since Blast reports can be huge. The -header tag can contain a
675
string containing any HTML that you want to appear at the top of the
682
Sample Scripts are included in the central bioperl distribution in the
683
'examples/blast/' directory (see L<INSTALLATION | INSTALLATION>). These are also
684
available at the following URLs (but it would be safer to use the
685
scripts included with the distribution).
687
=head2 Handy library for working with Bio::Tools::Blast.pm
689
http://bio.perl.org/Core/Examples/blast/blast_config.pl
691
=head2 Parsing Blast reports one at a time.
693
http://bio.perl.org/Core/Examples/blast/parse_blast.pl
694
http://bio.perl.org/Core/Examples/blast/parse_blast2.pl
695
http://bio.perl.org/Core/Examples/blast/parse_positions.pl
697
=head2 Parsing sets of Blast reports.
699
http://bio.perl.org/Core/Examples/blast/parse_blast.pl
700
http://bio.perl.org/Core/Examples/blast/parse_multi.pl
702
B<Warning:> See note about L<Memory Usage Issues>.
704
=head2 Running Blast analyses one at a time.
706
http://bio.perl.org/Core/Examples/blast/run_blast_remote.pl
708
=head2 Running Blast analyses given a set of sequences.
710
http://bio.perl.org/Core/Examples/blast/blast_seq.pl
712
=head2 HTML-formatting Blast reports.
714
http://bio.perl.org/Core/Examples/blast/html.pl
716
=head1 TECHNICAL DETAILS
720
A BLAST object may be created using one of three different modes as
721
defined by the B<Bio::Tools::SeqAnal.pm> package
722
(See L<Links to related modules>):
724
-- parse - Load a BLAST report and parse it, storing parsed data in
726
-- run - Run the BLAST program to generate a new report.
727
-- read - Load a BLAST report into the Blast object without parsing.
729
B<Run mode support has recently been added>. The module
730
B<Bio::Tools::Blast::Run::Webblast.pm> is an modularized adaptation of
731
the webblast script by Alex Dong Li:
733
http://www.genet.sickkids.on.ca/bioinfo_resources/software.html#webblast
735
for running remote Blast analyses and saving the results locally. Run
736
mode can be combined with a parse mode to generate a Blast report and
737
then build the Blast object from the parsed results of this report
738
(see L<run()|run> and L<SYNOPSIS | SYNOPSIS>).
740
In read mode, the BLAST report is read in by the Blast object but is
741
not parsed. This could be used to internalize a Blast report but not
742
parse it for results (e.g., generating HTML formatted output).
744
=head2 Significant Hits
746
This module permits the screening of hits on the basis of
747
user-specified criteria for significance. Currently, Blast reports can
748
be screened based on:
750
CRITERIA PARAMETER VALUE
751
---------------------------------- --------- ----------------
752
1) the best Expect (or P) value -signif float or sci-notation
753
2) the length of the query sequence -min_length integer
754
3) arbitrary criteria -filt_func function reference
756
The parameters are used for construction of the BLAST object or when
757
running the L<parse()|parse> method on the static $Blast object. The
758
-SIGNIF value represents the number listed in the description section
759
at the top of the Blast report. For Blast2, this is an Expect value,
760
for Blast1 and WashU-Blast2, this is a P-value. The idea behind the
761
C<-filt_func> parameter is that the hit has to pass through a filter
762
to be considered significant. Refer to the documentation for
763
B<Bio::Tools::Blast::Sbjct.pm> for ways to work with hit objects.
765
Using a C<-signif> parameter allows for the following:
769
=item Faster parsing.
771
Each hit can be screened by examination of the description line alone
772
without fully parsing the HSP alignment section.
776
The C<-signif> tag provides a more semantic-free way to specify the
777
value to be used as a basis for screening hits. Thus, C<-signif> can
778
be used for screening Blast1 or Blast2 reports. It is up to the user
779
to understand whether C<-signif> represents a P-value or an Expect
784
Any hit not meeting the significance criteria will not be added to the
785
"hit list" of the BLAST object. Also, a BLAST object without any hits
786
meeting the significance criteria will throw an exception during
787
object construction (a fatal event).
789
=head2 Statistical Parameters
791
There are numerous parameters which define the behavior of the BLAST
792
program and which are useful for interpreting the search
793
results. These parameters are extracted from the Blast report:
795
filter -- for masking out low-complexity sequences or short repeats
796
matrix -- name of the substitution scoring matrix (e.g., BLOSUM62)
797
E -- Expect filter (screens out frequent scores)
798
S -- Cutoff score for segment pairs
800
T -- Threshold score for word pairs
801
Lambda, -- Karlin-Altschul "sum" statistical parameters dependent on
802
K, H sequence composition.
803
G -- Gap creation penalty.
804
E -- Gap extension penalty.
806
These parameters are not always needed. Extraction may be turned off
807
explicitly by including a C<-stats =E<gt> 0> parameter during object
808
construction. Support for all statistical parameters is not complete.
810
For more about the meaning of parameters, check out the NCBI URLs given above.
812
=head2 Module Organization
814
The modules that comprise this Bioperl Blast distribution are location in the
815
Bio:: hierarchy as shown in the diagram below.
819
+--------------------------+
823
+----------------------+ Object.pm
825
SeqAnal.pm Blast.pm Blast/
827
+---------+---------+------------+
829
Sbjct.pm HSP.pm HTML.pm Run/
833
Webblast.pm LocalBlast.pm
835
Bio::Tools::Blast.pm is a concrete class that inherits from
836
B<Bio::Tools::SeqAnal.pm> and relies on other modules for parsing and
837
managing BLAST data. Worth mentioning about this hierarchy is the
838
lack of a "Parse.pm" module. Since parsing is considered central to
839
the purpose of the Bioperl Blast module (and Bioperl in general), it
840
seems somewhat unnatural to segregate out all parsing code. This
841
segregation could also lead to inefficiencies and harder to maintain
842
code. I consider this issue still open for debate.
844
Bio::Tools::Blast.pm, B<Bio::Tools::Blast::Sbjct.pm>, and
845
B<Bio::Tools::Blast::HSP.pm> are mostly dedicated to parsing and all
846
can be used to instantiate objects. Blast.pm is the main "command and
847
control" module, inheriting some basic behaviors from SeqAnal.pm
848
(things that are not specific to Blast I<per se>).
850
B<Bio::Tools::Blast::HTML.pm> contains functions dedicated to
851
generating HTML-formatted Blast reports and does not generate objects.
853
=head2 Running Blasts: Details
855
B<Bio::Tools::Blast::Run::Webblast.pm> contains a set of functions for
856
running Blast analyses at a remote server and also does not
857
instantiate objects. It uses a helper script called postclient.pl,
858
located in the Run directory. The proposed LocalBlast.pm module would
859
be used for running Blast reports on local machines and thus would be
860
customizable for different sites. It would operate in a parallel
861
fashion to Webblast.pm (i.e., being a collection of functions, taking
862
in sequence objects or files, returning result files).
864
The Run modules are considered experimental. In particular,
865
Webblast.pm catures an HTML-formatted version of the Blast report from
866
the NCBI server and strips out the HTML in preparation for parsing. A
867
more direct approach would be to capture the Blast results directly
868
from the server using an interface to the NCBI toolkit. This approach
869
was recently proposed on the Bioperl mailing list:
870
http://www.uni-bielefeld.de/mailinglists/BCD/vsns-bcd-perl/9805/0000.html
872
=head2 Memory Usage Issues
874
Parsing large numbers of Blast reports (a few thousand or so) with
875
Bio::Tools::Blast.pm may lead to unacceptable memory usage situations.
876
This is somewhat dependent of the size and complexity of the reports.
878
While this problem is under investigation, here are some workarounds
879
that fix the memory usage problem:
883
=item 1 Don't specify a -signif criterion when calling L<parse()|parse>.
885
The C<-signif> value is used for imposing a upper limit to the expect- or
886
P-value for Blast hits to be parsed. For reasons that are still under
887
investigation, specifying a value for C<-signif> in the L<parse()|parse>
888
method prevents Blast objects from being fully
889
garbage collected. When using the B<parse_blast.pl> or B<parse_multi.pl>
890
scripts in C<examples/blast/> of the bioperl distribution), don't supply
891
a C<-signif> command-line parameter.
893
=item 2 If you want to impose a -signif criterion, put it inside a
896
For the L<parse()|parse> method, a -signif =E<gt> 1e-5 parameter is equivalent
897
to using a filter function parameter of
899
-filt_func => sub { my $hit = shift; return $hit->signif <= 1e-5; }
901
Using the B<examples/blast/parse_multi.pl> script, you can supply a
902
command-line argument of
904
-filt_func '$hit->signif <= 1e-5'
906
For more information, see L<parse()|parse> and the section
907
L<Screening hits using arbitrary criteria>.
915
=item * Develop a functional, prototype Bio::Tools::Blast::Run::LocalBlast.pm module.
917
=item * Add support for PSI-BLAST and PHI-BLAST
919
=item * Parse histogram of expectations and retrieve gif image in
920
Blast report (if present).
922
=item * Further investigate memory leak that occurs when parsing Blast
923
streams whe supplying a -signif parameter to L<parse()|parse>.
925
=item * Access Blast results directly from the NCBI server using a
926
Perl interface to the NCBI toolkit or XML formated Blast reports (when
929
=item * Further exploit Bio::UnivAln.pm and multiple-sequence
930
alignment programs using HSP sequence data. Some of this may best go
931
into a separate, dedicated module or script as opposed to burdening
932
Blast.pm, Sbjct.pm, and HSP.pm with additional functionality that is
935
=item * Add an example script for parsing Blast reports containing
942
Bio::Tools::Blast.pm, 0.09
948
User feedback is an integral part of the evolution of this and other
949
Bioperl modules. Send your comments and suggestions preferably to one
950
of the Bioperl mailing lists. Your participation is much appreciated.
952
bioperl-l@bioperl.org - General discussion
953
http://bio.perl.org/MailList.html - About the mailing lists
955
=head2 Reporting Bugs
957
Report bugs to the Bioperl bug tracking system to help us keep track
958
the bugs and their resolution. Bug reports can be submitted via email
961
bioperl-bugs@bio.perl.org
962
http://bio.perl.org/bioperl-bugs/
966
Steve Chervitz, sac@bioperl.org
968
See the L<FEEDBACK | FEEDBACK> section for where to send bug reports and comments.
970
=head1 ACKNOWLEDGEMENTS
972
This module was developed under the auspices of the Saccharomyces Genome
974
http://genome-www.stanford.edu/Saccharomyces
976
Other contributors include: Alex Dong Li (webblast), Chris Dagdigian
977
(Seq.pm), Steve Brenner (Seq.pm), Georg Fuellen (Seq.pm, UnivAln.pm),
978
and untold others who have offered comments (noted in the
979
Bio/Tools/Blast/CHANGES file of the distribution).
983
Copyright (c) 1996-98 Steve Chervitz. All Rights Reserved. This
984
module is free software; you can redistribute it and/or modify it
985
under the same terms as Perl itself.
989
Bio::Tools::SeqAnal.pm - Sequence analysis object base class.
990
Bio::Tools::Blast::Sbjct.pm - Blast hit object.
991
Bio::Tools::Blast::HSP.pm - Blast HSP object.
992
Bio::Tools::Blast::HTML.pm - Blast HTML-formating utility class.
993
Bio::Tools::Blast::Run::Webblast.pm - Utility module for running Blasts remotely.
994
Bio::Tools::Blast::Run::LocalBlast.pm - Utility module for running Blasts locally.
995
Bio::Seq.pm - Biosequence object
996
Bio::UnivAln.pm - Biosequence alignment object.
997
Bio::Root::Object.pm - Proposed base class for all Bioperl objects.
999
=head2 Links to related modules
1001
Bio::Tools::SeqAnal.pm
1002
http://bio.perl.org/Core/POD/Bio/Tools/SeqAnal.html
1004
Bio::Tools::Blast::Sbjct.pm
1005
http://bio.perl.org/Core/POD/Bio/Tools/Blast/Sbjct.html
1007
Bio::Tools::Blast::HSP.pm
1008
http://bio.perl.org/Core/POD/Bio/Tools/Blast/HSP.html
1010
Bio::Tools::Blast::HTML.pm
1011
http://bio.perl.org/Core/POD/Bio/Tools/Blast/HTML.html
1013
Bio::Tools::Blast::Run::Webblast.pm
1014
http://bio.perl.org/Core/POD/Bio/Tools/Blast/Run/Webblast.html
1016
Bio::Tools::Blast::Run::LocalBlast.pm
1017
http://bio.perl.org/Core/POD/Bio/Tools/Blast/Run/LocalBlast.html
1020
http://bio.perl.org/Core/POD/Seq.html
1023
http://bio.perl.org/Projects/SeqAlign/
1024
Europe: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/#univaln
1026
Bio::Root::Object.pm
1027
http://bio.perl.org/Core/POD/Root/Object.html
1029
http://bio.perl.org/Projects/modules.html - Online module documentation
1030
http://bio.perl.org/Projects/Blast/ - Bioperl Blast Project
1031
http://bio.perl.org/ - Bioperl Project Homepage
1033
L<References & Information about the BLAST program>.
1037
There is a memory leak that occurs when parsing parsing streams
1038
containing large numbers of Blast reports (a few thousand or so) and
1039
specifying a -signif parameter to the L<parse()|parse> method. For a
1040
workaround, see L<Memory Usage Issues>.
1042
Not sharing statistical parameters between different Blast objects
1043
when parsing a multi-report stream has not been completely tested and
1044
may be a little buggy.
1046
Documentation inconsistencies or inaccuracies may exist since this
1047
module underwend a fair bit of re-working going from 0.75 to 0.80
1048
(corresponds to versions 0.04.4 to 0.05 of the bioperl distribution).
1055
#### END of main POD documentation.
1062
Methods beginning with a leading underscore are considered private and
1063
are intended for internal use by this module. They are B<not>
1064
considered part of the public interface and are described here for
1065
documentation purposes only.
1069
##############################################################################
1071
##############################################################################
1075
my ($class,@args) = @_;
1076
$class->warn("Bio::Tools::BLAST is deprecated, use Bio::SearchIO system or Bio::Tools::BPlite");
1077
return $class->SUPER::new(@args);
1080
## The Blast.pm object relies on the the superclass constructor:
1081
## Bio::Tools::SeqAnal::_initialize(). See that module for details.
1087
$DEBUG==2 && print STDERR "DESTROYING $self ${\$self->name}";
1088
if($self->{'_hits'}) {
1089
foreach($self->hits) {
1093
undef $self->{'_hits'};
1094
#$self->{'_hits'}->remove_all; ## When and if this member becomes a vector.
1097
$self->SUPER::destroy;
1100
#####################################################################################
1102
#####################################################################################
1106
Usage : $object->run( %named_parameters )
1107
Purpose : Run a local or remote Blast analysis on one or more sequences.
1108
Returns : String containing name of Blast output file if a single Blast
1111
: List of Blast objects if multiple Blasts are being run as a group.
1112
Argument : Named parameters: (PARAMETER TAGS CAN BE UPPER OR LOWER CASE).
1113
: -METHOD => 'local' or 'remote' (default = remote),
1114
: -PARSE => boolean, (true if the results are to be parsed after the run)
1115
: -STRICT => boolean, the strict mode to use for the resulting Blast objects.
1116
: ADDITIONAL PARAMETERS:
1117
: See methods _run_remote() and _run_local() for required
1118
: parameters necessary for running the blast report.
1119
Throws : Exception if no Blast output file was obtained.
1120
Comments : This method is called automatically during construction of a
1121
: Blast.pm object when run parameters are sent to the constructor:
1122
: $blastObj = new Bio::Tools::Blast (-RUN =>\%runParam,
1125
: The specific run methods (local or remote) called by run()
1126
: must return a list containing the file name(s) with the Blast output.
1128
: The run() method can perform single or multiple Blast runs
1129
: (analogous to the way parse() works) depending on how many
1130
: sequences are submitted. However, the running of multiple
1131
: Blasts is probably better handled at the script level. See notes in
1132
: the "TODO" section below.
1134
: As for what to do with the Blast result file, that decision is
1135
: left for the user who can direct the Blast object to delete, compress,
1136
: or leave it alone.
1138
: This method does not worry about load balancing, which
1139
: is probably best handled at the server level.
1141
TODO: : Support for running+parsing multiple Blast analyses with a
1142
: single run() call is incomplete. One can generate multiple
1143
: reports by placing more than one sequence object in the -seqs
1144
: reference parameter. This saves some overhead in the code
1145
: that executes the Blasts since all options are configured once.
1146
: (This is analogous to parsing using the static $Blast object
1147
: see parse() and _parse_stream()).
1149
: The trouble is that Blast objects for all runs are constructed,
1150
: parsed (if necessary), and then returned as a group
1151
: This can require lots of memory when run+parsing many Blasts
1152
: but should be fine if you just want to run a bunch Blasts.
1154
: For now, when running+parsing Blasts, stick to running one
1155
: Blast at a time, building the Blast object with the results
1156
: of that report, and processing as necessary.
1158
: Support for running PSI-Blast is not complete.
1160
See Also: L<_run_remote()|_run_remote>, L<_run_local()|_run_local>, L<parse()|parse>
1167
my ($self, %param) = @_;
1168
my($method, $parse, $strict) =
1169
$self->_rearrange([qw(METHOD PARSE STRICT)], %param);
1171
$strict = $self->strict($strict) if $strict;
1174
if($method =~ /loc/i) {
1175
@files = $self->_run_local(%param);
1178
@files = $self->_run_remote(%param);
1181
$self->throw("Run Blast failed: no Blast output created.") if !@files;
1183
if(scalar(@files) == 1) {
1184
# If there was just one Blast output file, prepare to incorporate it
1185
# into the current Blast object. run() is called before parse() in the
1186
# SeqAnal.pm constructor.
1187
if($files[0] ne 'email') {
1188
$self->file($files[0]);
1190
# Can't do anything with the report.
1191
$self->throw("Blast report to be sent via e-mail.");
1195
# If there are multiple report files, build individual Blast objects foreach.
1196
# In this situation, the static $Blast object is being used to run
1197
# a set of related Blasts, similar to the way parse() can be used.
1198
# This strategy is not optimal since all reports are generated first
1199
# before any are parsed.
1204
push @objs, new Bio::Tools::Blast(-FILE => $_,
1205
-PARSE => $parse || 0,
1215
Usage : n/a; internal method called by run()
1216
: $object->_run_remote( %named_parameters )
1217
Purpose : Run Blast on a remote server.
1218
Argument : Named parameters:
1219
: See documentation for function &blast_remote in
1220
: Bio::Tools::Blast::Run::Webblast.pm for description
1222
Comments : This method requires the Bio::Tools::Blast::Run::Webblast.pm
1223
: which conforms to this minimal API:
1224
: * export a method called &blast_remote that accepts a
1225
: Bio::Tools::Blast.pm object + named parameters
1226
: (specified in the Webblast.pm module).
1227
: * return a list of names of files containing the raw Blast reports.
1228
: (When building a Blast object, this list would contain a
1229
: single file from which the Blast object is to be constructed).
1231
See Also : L<run()|run>, L<_run_local()|_run_local>, B<Bio::Tools::Blast::Run::Webblast.pm::blast_remote()>, L<Links to related modules>
1238
my ($self, %param) = @_;
1240
require Bio::Tools::Blast::Run::Webblast;
1241
Bio::Tools::Blast::Run::Webblast->import(qw(&blast_remote));
1243
&blast_remote($self, %param);
1248
Usage : n/a; internal method called by run()
1249
: $object->_run_local(%named_parameters)
1250
Purpose : Run Blast on a local machine.
1251
Argument : Named parameters:
1252
: See documentation for function &blast_local in
1253
: Bio::Tools::Blast::Run::LocalBlast.pm for description
1255
Comments : This method requires the Bio::Tools::Blast::Run::LocalBlast.pm
1256
: module which should be customized for your site. This module would
1257
: contain all the commands, paths, environment variables, and other
1258
: data necessary to run Blast commands on a local machine, but should
1259
: not contain any semantics for specific query sequences.
1261
: LocalBlast.pm should also conform to this minimal API:
1262
: * export a method called &blast_local that accepts a
1263
: Bio::Tools::Blast.pm object + named parameters
1264
: (specified in the LocalBlast.pm module).
1265
: * return a list of names of files containing the raw Blast reports.
1266
: (When building a Blast object, this list would contain a
1267
: single file from which the Blast object is to be constructed).
1269
See Also : L<run()|run>, L<_run_remote()|_run_remote>, B<Bio::Tools::Blast::Run::LocalBlast::blast_local()>, L<Links to related modules>
1276
my ($self, %param) = @_;
1278
require Bio::Tools::Blast::Run::Webblast;
1279
Bio::Tools::Blast::Run::Webblast->import(qw(&blast_local));
1281
&blast_local($self, %param);
1286
Usage : @dbs = $Blast->db_remote( [seq_type] );
1287
Purpose : Get a list of available sequence databases for remote Blast analysis.
1288
Returns : Array of strings
1289
Argument : seq_type = 'p' or 'n'
1290
: 'p' = Gets databases for peptide searches (default)
1291
: 'n' = Gets databases for nucleotide searches
1293
Comments : Peptide databases are a subset of the nucleotide databases.
1294
: It is convenient to call this method on the static $Blast object
1295
: as shown in Usage.
1297
See Also : L<db_local()|db_local>
1304
my ($self, $type) = @_;
1307
require Bio::Tools::Blast::Run::Webblast;
1308
Bio::Tools::Blast::Run::Webblast->import(qw(@Blast_dbp_remote
1309
@Blast_dbn_remote));
1311
# We shouldn't have to fully qualify the Blast_dbX_remote arrays. Hm.
1314
if( $type =~ /^p|amino/i) {
1315
@dbs = @Bio::Tools::Blast::Run::Webblast::Blast_dbp_remote;
1317
@dbs = @Bio::Tools::Blast::Run::Webblast::Blast_dbn_remote;
1324
Usage : @dbs = $Blast->db_local( [seq_type] );
1325
Purpose : Get a list of available sequence databases for local Blast analysis.
1326
Returns : Array of strings
1327
Argument : seq_type = 'p' or 'n'
1328
: 'p' = Gets databases for peptide searches (default)
1329
: 'n' = Gets databases for nucleotide searches
1331
Comments : Peptide databases are a subset of the nucleotide databases.
1332
: It is convenient to call this method on the static $Blast object.
1335
See Also : L<db_remote()|db_remote>
1342
my ($self, $type) = @_;
1345
require Bio::Tools::Blast::Run::LocalBlast;
1346
Bio::Tools::Blast::Run::LocalBlast->import(qw(@Blast_dbp_local
1349
# We shouldn't have to fully qualify the Blast_dbX_local arrays. Hm.
1352
if( $type =~ /^p|amino/i) {
1353
@dbs = @Bio::Tools::Blast::Run::LocalBlast::Blast_dbp_local;
1355
@dbs = @Bio::Tools::Blast::Run::LocalBlast::Blast_dbn_local;
1362
Usage : $blast_object->parse( %named_parameters )
1363
Purpose : Parse a Blast report from a file or STDIN.
1364
: * Parses a raw BLAST data, populating Blast object with report data.
1365
: * Sets the significance cutoff.
1366
: * Extracts statistical parameters about the BLAST run.
1367
: * Handles both single files and streams containing multiple reports.
1368
Returns : integer (number of Blast reports parsed)
1369
Argument : <named parameters>: (PARAMETER TAGS CAN BE UPPER OR LOWER CASE).
1370
: -FILE => string (name of file containing raw Blast output.
1371
: Optional. If a valid file is not supplied,
1372
: STDIN will be used).
1373
: -SIGNIF => number (float or scientific notation number to be used
1374
: as a P- or Expect value cutoff;
1375
: default = $DEFAULT_SIGNIF (999)).
1376
: -FILT_FUNC => func_ref (reference to a function to be used for
1377
: filtering out hits based on arbitrary criteria.
1378
: This function should take a
1379
: Bio::Tools::Blast::Sbjct.pm object as its first
1380
: argument and return a boolean value,
1381
: true if the hit should be filtered out).
1382
: Sample filter function:
1383
: -FILT_FUNC => sub { $hit = shift;
1384
: $hit->gaps == 0; },
1385
: -CHECK_ALL_HITS => boolean (check all hits for significance against
1386
: significance criteria. Default = false.
1387
: If false, stops processing hits after the first
1388
: non-significant hit or the first hit that fails
1389
: the filt_func call. This speeds parsing,
1390
: taking advantage of the fact that the hits
1391
: are processed in the order they are ranked.)
1392
: -MIN_LEN => integer (to be used as a minimum query sequence length
1393
: sequences below this length will not be processed).
1394
: default = no minimum length).
1395
: -STATS => boolean (collect stats for report: matrix, filters, etc.
1397
: -BEST => boolean (only process the best hit of each report;
1399
: -OVERLAP => integer (the amount of overlap to permit between
1400
: adjacent HSPs when tiling HSPs,
1401
: Default = $MAX_HSP_OVERLAP (2))
1403
: PARAMETERS USED WHEN PARSING MULTI-REPORT STREAMS:
1404
: --------------------------------------------------
1405
: -SHARE => boolean (set this to true if all reports in stream
1406
: share the same stats. Default = true)
1407
: Must be set to false when parsing both Blast1 and
1408
: Blast2 reports in the same run or if you need
1409
: statistical params for each report, Lambda, K, H).
1410
: -STRICT => boolean (use strict mode for all Blast objects created.
1411
: Increases sensitivity to errors. For single
1412
: Blasts, this is parameter is sent to new().)
1413
: -EXEC_FUNC => func_ref (reference to a function for processing each
1414
: Blast object after it is parsed. Should accept a
1415
: Blast object as its sole argument. Return value
1416
: is ignored. If an -EXEC_FUNC parameter is supplied,
1417
: the -SAVE_ARRAY parameter will be ignored.)
1418
: -SAVE_ARRAY =>array_ref, (reference to an array for storing all
1419
: Blast objects as they are created.
1420
: Experimental. Not recommended.)
1421
: -SIGNIF_FMT => boolean String of 'exp' or 'parts'. Sets the format
1422
: for reporting P/Expect values. 'exp' reports
1423
: only the exponent portion. 'parts' reports
1424
: them as a 2 element list. See signif_fmt()..
1426
Throws : Exception if BLAST report contains a FATAL: error.
1427
: Propagates any exception thrown by read().
1428
: Propagates any exception thrown by called parsing methods.
1429
Comments : This method can be called either directly using the static $Blast object
1430
: or indirectly (by Bio::Tools::SeqAnal.pm) during constuction of an
1431
: individual Blast object.
1433
: HTML-formatted reports can be parsed as well. No special flag is required
1434
: since it is detected automatically. The presence of HTML-formatting
1435
: will result in slower performace, however, since it must be removed
1436
: prior to parsing. Parsing HTML-formatted reports is highly
1437
: error prone and is generally not recommended.
1439
: If one has an HTML report, do NOT remove the HTML from it by using the
1440
: "Save As" option of a web browser to save it as text. This renders the
1441
: report unparsable.
1442
: HTML-formatted reports can be parsed after running through the strip_html
1443
: function of Blast::HTML.pm as in:
1444
: require Bio::Tools::Blast::HTML;
1445
: Bio::Tools::Blast::HTML->import(&strip_html);
1446
: &strip_html(\$data);
1447
: # where data contains full contents of an HTML-formatted report.
1448
: TODO: write a demo script that does this.
1450
See Also : L<_init_parse_params()|_init_parse_params>, L<_parse_blast_stream()|_parse_blast_stream>, L<overlap()|overlap>, L<signif_fmt()|signif_fmt>, B<Bio::Root::Object::read()>, B<Bio::Tools::Blast::HTML.pm::strip_html()>, L<Links to related modules>
1457
# $self might be the static $Blast object.
1458
my ($self, @param) = @_;
1460
my($signif, $filt_func, $min_len, $check_all, $overlap, $stats,
1461
$share, $strict, $best, $signif_fmt, $no_aligns) =
1462
$self->_rearrange([qw(SIGNIF FILT_FUNC MIN_LEN CHECK_ALL_HITS
1463
OVERLAP STATS SHARE STRICT
1464
BEST EXPONENT NO_ALIGNS )], @param);
1466
## Initialize the static Blast object with parameters that
1467
## apply to all Blast objects within a parsing session.
1469
&_init_parse_params($share, $filt_func, $check_all,
1470
$signif, $min_len, $strict,
1471
$best, $signif_fmt, $stats, $no_aligns
1474
my $count = $self->_parse_blast_stream(@param);
1476
# print STDERR "\nDONE PARSING STREAM.\n";
1478
if($Blast->{'_blast_errs'}) {
1479
my @errs = @{$Blast->{'_blast_errs'}};
1480
printf STDERR "\n*** %d BLAST REPORTS HAD FATAL ERRORS:\n", scalar(@errs);
1481
foreach(@errs) { print STDERR "$_\n"; }
1482
@{$Blast->{'_blast_errs'}} = ();
1488
=head2 _init_parse_params
1490
Title : _init_parse_params
1491
Usage : n/a; called automatically by parse()
1492
Purpose : Initializes parameters used during parsing of Blast reports.
1493
: This is a static method used by the $Blast object.
1494
: Calls _set_signif().
1497
Args : Args extracted by parse().
1499
See Also: L<parse()|parse>, L<_set_signif()|_set_signif>
1503
#----------------------
1504
sub _init_parse_params {
1505
#----------------------
1506
my ($share, $filt_func, $check_all,
1507
$signif, $min_len, $strict,
1508
$best, $signif_fmt, $stats, $no_aligns) = @_;
1510
## Default is to share stats.
1511
$Blast->{'_share'} = defined($share) ? $share : 1;
1512
$Blast->{'_filt_func'} = $filt_func || 0;
1513
$Blast->{'_check_all'} = $check_all || 0;
1514
$Blast->{'_signif_fmt'} ||= $signif_fmt || '';
1515
$Blast->{'_no_aligns'} = $no_aligns || 0;
1517
&_set_signif($signif, $min_len, $filt_func);
1518
$Blast->strict($strict) if defined $strict;
1519
$Blast->best($best) if $best;
1520
$Blast->{'_blast_count'} = 0;
1522
## If $stats is false, miscellaneous statistical and other parameters
1523
## are NOT extracted from the Blast report (e.g., matrix name, filter used, etc.).
1524
## This can speed processing when crunching tons of Blast reports.
1525
## Default is to NOT get stats.
1526
$Blast->{'_get_stats'} = defined($stats) ? $stats : 0;
1528
# Clear any errors from previous parse.
1529
undef $Blast->{'_blast_errs'};
1534
Usage : n/a; called automatically by _init_parse_params()
1535
: This is now a "static" method used only by $Blast.
1536
: _set_signif($signif, $min_len, $filt_func);
1537
Purpose : Sets significance criteria for the BLAST object.
1538
Argument : Obligatory three arguments:
1539
: $signif = float or sci-notation number or undef
1540
: $min_len = integer or undef
1541
: $filt_func = function reference or undef
1543
: If $signif is undefined, a default value is set
1544
: (see $DEFAULT_SIGNIF; min_length = not set).
1545
Throws : Exception if significance value is defined but appears
1546
: out of range or invalid.
1547
: Exception if $filt_func if defined and is not a func ref.
1548
Comments : The significance of a BLAST report can be based on
1549
: the P (or Expect) value and/or the length of the query sequence.
1550
: P (or Expect) values GREATER than '_significance' are not significant.
1551
: Query sequence lengths LESS than '_min_length' are not significant.
1553
: Hits can also be screened using arbitrary significance criteria
1554
: as discussed in the parse() method.
1556
: If no $signif is defined, the '_significance' level is set to
1557
: $Bio::Tools::Blast::DEFAULT_SIGNIF (999).
1559
See Also : L<signif()|signif>, L<min_length()|min_length>, L<_init_parse_params()|_init_parse_params>, L<parse()|parse>
1566
my( $sig, $len, $func ) = @_;
1569
$Blast->{'_confirm_significance'} = 1;
1570
if( $sig =~ /[^\d.e-]/ or $sig <= 0) {
1571
$Blast->throw("Invalid significance value: $sig",
1572
"Must be greater than zero.");
1574
$Blast->{'_significance'} = $sig;
1576
$Blast->{'_significance'} = $DEFAULT_SIGNIF;
1577
$Blast->{'_check_all'} = 1 if not $Blast->{'_filt_func'};
1581
if($len =~ /\D/ or $len <= 0) {
1582
$Blast->warn("Invalid minimum length value: $len",
1583
"Value must be an integer > 0. Value not set.");
1585
$Blast->{'_min_length'} = $len;
1590
$Blast->{'_confirm_significance'} = 1;
1591
if($func and not ref $func eq 'CODE') {
1592
$Blast->throw("Not a function reference: $func",
1593
"The -filt_func parameter must be function reference.");
1598
=head2 _parse_blast_stream
1600
Usage : n/a. Internal method called by parse()
1601
Purpose : Obtains the function to be used during parsing and calls read().
1602
Returns : Integer (the number of blast reports read)
1603
Argument : Named parameters (forwarded from parse())
1604
Throws : Propagates any exception thrown by _get_parse_blast_func() and read().
1606
See Also : L<_get_parse_blast_func()|_get_parse_blast_func>, B<Bio::Root::Object::read()>
1610
#----------------------
1611
sub _parse_blast_stream {
1612
#----------------------
1613
my ($self, %param) = @_;
1615
my $func = $self->_get_parse_blast_func(%param);
1616
# my $func = sub { my $data = shift;
1617
# printf STDERR "Chunk length = %d\n", length($data);
1621
# Only setting the newline character once per session.
1622
$Newline ||= $Util->get_newline(-client => $self, %param);
1624
$self->read(-REC_SEP =>"$Newline>",
1628
return $Blast->{'_blast_count'};
1631
=head2 _get_parse_blast_func
1633
Usage : n/a; internal method used by _parse_blast_stream()
1634
: $func_ref = $blast_object->_get_parse_blast_func()
1635
Purpose : Generates a function ref to be used as a closure for parsing
1636
: raw data as it is being loaded by Bio::Root::IOManager::read().
1637
Returns : Function reference (closure).
1638
Comments : The the function reference contains a fair bit of logic
1639
: at present. It could perhaps be split up into separate
1640
: functions to make it more 'digestible'.
1642
See Also : L<_parse_blast_stream()|_parse_blast_stream>
1646
#--------------------------
1647
sub _get_parse_blast_func {
1648
#--------------------------
1649
my ($self, @param) = @_;
1651
my ($save_a, $exec_func) =
1652
$self->_rearrange([qw(SAVE_ARRAY EXEC_FUNC)], @param);
1654
# $MONITOR && print STDERR "\nParsing Blast stream (5/dot, 250/line)\n";
1656
my $strict = $self->strict();
1658
# Some parameter validation.
1659
# Remember, all Blast parsing will use this function now.
1660
# You won't need a exec-func or save_array when just creating a Blast object
1661
# as in: $blast = new Bio::Tools::Blast();
1662
if($exec_func and not ref($exec_func) eq 'CODE') {
1663
$self->throw("The -EXEC_FUNC parameter must be function reference.",
1664
"exec_func = $exec_func");
1666
} elsif($save_a and not ref($save_a) eq 'ARRAY') {
1667
$self->throw("The -SAVE_ARRAY parameter must supply an array reference".
1668
"when not using an -EXEC_FUNC parameter.");
1671
## Might consider breaking this closure up if possible.
1675
## $data should contain one of three possible fragment types
1676
## from a Blast report:
1677
## 1. Header with description section,
1678
## 2. An alignment section for a single hit, or
1679
## 3. The final alignment section plus the footer section.
1680
## (record separator = "Newline>").
1682
# print STDERR "\n(BLAST) DATA CHUNK: $data\n";
1684
my ($current_blast, $current_prog, $current_vers, $current_db);
1686
my $contains_translation = 0;
1688
### steve --- Wed Mar 15 02:48:07 2000
1689
### In the process of addressing bug PR#95. Tricky.
1690
### Using the $contains_translation to do so. Not complete
1691
### and possibly won't fix. We'll see.
1693
# Check for header section. Start a new Blast object and
1694
# parse the description section.
1695
# if ($data =~ /\sQuery\s?=/s || ($contains_translation && $data =~ /Database:/s)) {
1696
if ($data =~ /\sQuery\s?=/s) {
1697
$Blast->{'_blast_count'}++;
1698
print STDERR ".", $Blast->{'_blast_count'} % 50 ? '' : "\n" if $MONITOR;
1700
if($data =~ /$Newline\s+Translating/so) {
1701
print STDERR "\nCONTAINS TRANSLATION\n";
1702
$contains_translation = 1;
1705
# If we're parsing a stream containing multiple reports,
1706
# all subsequent header sections will contain the last hit of
1707
# the previous report which needs to be parsed and added to that
1708
# report if signifcant. It also contains the run parameters
1709
# at the bottom of the Blast report.
1710
# if($Blast->{'_blast_count'} > 1 || $contains_translation) {
1711
if($Blast->{'_blast_count'} > 1) {
1712
# print STDERR "\nMULTI-BLAST STREAM.\n";
1713
$Blast->{'_multi_stream'} = 1;
1715
if($data =~ /(.+?)$Newline(<\w+>)?(T?BLAST[NPX])\s+(.+?)$Newline(.+)/so) {
1716
($current_prog, $current_vers, $data) = ($3, $4, $5);
1717
# Final chunk containing last hit and last footer.
1718
$Blast->{'_current_blast'}->_parse_alignment($1);
1719
$prev_blast = $Blast->{'_current_blast'}; # finalized.
1720
# } elsif($contains_translation) {
1721
# $data =~ /(T?BLAST[NPX])\s+(.+?)$Newline(.+)/so;
1722
# ($current_prog, $current_vers, $data) = ($1, $2, $3);
1724
$Blast->throw("Can't determine program type from BLAST report.",
1725
"Checked for: @Blast_programs.");
1726
# This has important implications for how to handle interval
1727
# information for HSPs. TBLASTN uses nucleotides in query HSP
1728
# but amino acids in the sbjct HSP sequence.
1731
if($data =~ m/Database:\s+(.+?)$Newline/so ) {
1734
# In some reports, the Database is only listed at end.
1735
#$Blast->warn("Can't determine database name from BLAST report.");
1738
# Incyte_Fix: Nasty Invisible Bug.
1739
# Records in blast report are delimited by '>', but... when
1740
# there are no hits for a query, there won't be a '>'. That
1741
# causes several blast reports to run together in the data
1742
# passed to this routine. Need to get rid of non-hits in data
1743
if ($data =~ /.+(No hits? found.+Sequences.+)/so) {
1750
# Determine if we need to create a new Blast object
1751
# or use the $self object for this method.
1753
if($Blast->{'_multi_stream'} or $self->name eq 'Static Blast object') {
1754
# Strict mode is not object-specific but may be someday.
1755
# print STDERR "\nCreating new Blast object.\n";
1756
$current_blast = new Bio::Tools::Blast(-STRICT => $strict);
1758
$current_blast = $self;
1760
$Blast->{'_current_blast'} = $current_blast;
1762
# If we're not sharing stats, set data on current blast object.
1763
if(defined $current_prog and not $Blast->{'_share'}) {
1764
$current_blast->program($current_prog);
1765
$current_blast->program_version($current_vers);
1766
$current_blast->database($current_db);
1769
# print STDERR "CURRENT BLAST = ", $current_blast->name, "\n";
1770
$current_blast->_parse_header($data);
1772
# If there were any descriptions in the header,
1773
# we know if there are any significant hits.
1774
# No longer throwing exception if there were no significant hits
1775
# and a -signif parameter was specified. Doing so prevents the
1776
# construction of a Blast object, which could still be useful.
1777
# if($current_blast->{'_has_descriptions'} and $Blast->{'_confirm_significance'} and not $current_blast->is_signif) {
1778
# $current_blast->throw("No significant BLAST hits for ${\$current_blast->name}");
1782
} # Done parsing header/description section
1784
### For use with $contains_translation - not right - breaks regular report parsing.
1785
# elsif(ref $Blast->{'_current_blast'} && $data !~ /\s*\w*\s*/s) {
1786
elsif(ref $Blast->{'_current_blast'} ) {
1787
# Process an alignment section.
1788
$current_blast = $Blast->{'_current_blast'};
1789
# print STDERR "\nCONTINUING PROCESSING ALN WITH ", $current_blast->name, "\n";
1790
# print STDERR "DATA: $data\n";
1792
$current_blast->_parse_alignment($data);
1795
# push @{$self->{'_blast_errs'}}, $@;
1799
# If the current Blast object has been completely parsed
1800
# (occurs with a single Blast stream), or if there is a previous
1801
# Blast object (occurs with a multi Blast stream),
1802
# execute a supplied function on it or store it in a supplied array.
1804
if( defined $prev_blast or $current_blast->{'_found_params'}) {
1805
my $finished_blast = defined($prev_blast) ? $prev_blast : $current_blast;
1807
$finished_blast->_report_errors();
1808
# print STDERR "\nNEW BLAST OBJECT: ${\$finished_blast->name}\n";
1811
# print STDERR " RUNNING EXEC_FUNC...\n";
1812
&$exec_func($finished_blast); # ignoring any return value.
1813
# Report processed, no longer need object.
1814
$finished_blast->destroy;
1815
undef $finished_blast;
1817
# print STDERR " SAVING IN ARRAY...\n";
1818
# We've already verified that if there is no exec_func
1819
# then there must be a $save_array
1820
push @$save_a, $finished_blast;
1827
=head2 _report_errors
1829
Title : _report_errors
1830
Usage : n/a; Internal method called by _get_parse_blast_func().
1831
Purpose : Throw or warn about any errors encountered.
1834
Throws : If all hits generated exceptions, raise exception
1835
: (a fatal event for the Blast object.)
1836
: If some hits were okay but some were bad, generate a warning
1837
: (a few bad applies should not spoil the bunch).
1838
: This usually indicates a limiting B-value.
1839
: When the parsing code fails, it is either all or nothing.
1843
#-------------------
1844
sub _report_errors {
1845
#-------------------
1848
return unless ref($self->{'_blast_errs'});
1849
# ref($self->{'_blast_errs'}) || (print STDERR "\nNO ERRORS\n", return );
1851
my @errs = @{$self->{'_blast_errs'}};
1855
@{$self->{'_blast_errs'}} = (); # clear the errs on the object.
1856
# When there are many errors, in most of the cases, they are
1857
# caused by the same problem. Only need to see full data for
1859
if(scalar @errs > 2) {
1860
$str = "SHOWING FIRST EXCEPTION ONLY:\n$errs[0]";
1861
$self->clear_err(); # clearing the existing set of errors (conserve memory).
1862
# Not necessary, unless the -RECORD_ERR =>1
1863
# constructor option was used for Blast object.
1865
$str = join("\n",@errs);
1868
if(not $self->{'_num_hits_significant'}) {
1869
$self->throw(sprintf("Failed to parse any hit data (n=%d).", scalar(@errs)),
1870
"\n\nTRAPPED EXCEPTION(S):\n$str\nEND TRAPPED EXCEPTION(S)\n"
1873
$self->warn(sprintf("Some potential hits were not parsed (n=%d).", scalar(@errs)),
1874
@errs > 2 ? "This may be due to a limiting B value (max alignment listings)." : "",
1875
"\n\nTRAPPED EXCEPTION(S):\n$str\nEND TRAPPED EXCEPTION(S)\n"
1881
=head2 _parse_header
1883
Usage : n/a; called automatically by the _get_parse_blast_func().
1884
Purpose : Parses the header section of a BLAST report.
1885
Argument : String containing the header+description section of a BLAST report.
1886
Throws : Exception if description data cannot be parsed properly.
1887
: Exception if there is a 'FATAL' error in the Blast report.
1888
: Warning if there is a 'WARNING' in the Blast report.
1889
: Warning if there are no significant hits.
1890
Comments : Description section contains a single line for each hit listing
1891
: the seq id, description, score, Expect or P-value, etc.
1893
See Also : L<_get_parse_blast_func()|_get_parse_blast_func>
1897
#----------------------
1899
#----------------------
1900
my( $self, $data ) = @_;
1902
# print STDERR "\n$ID: PARSING HEADER\n"; #$data\n";
1904
$data =~ s/^\s+|\s+>?$//sg;
1906
if($data =~ /<HTML/i) {
1907
$self->throw("Can't parse HTML-formatted BLAST reports.",
1908
# "Such reports can be parsed with a special parsing \n".
1909
# "script included in the examples/blast directory \n".
1910
# "of the Bioperl distribution. (TODO)"
1912
# This was the old strategy, can't do it with new strategy
1913
# since we don't have the whole report in one chunk.
1914
# This could be the basis for the "special parsing script".
1915
# require Bio::Tools::Blast::HTML;
1916
# Bio::Tools::Blast::HTML->import(&strip_html);
1917
# &strip_html(\$data);
1920
$data =~ /WARNING: (.+?)$Newline$Newline/so and $self->warn("$1") if $self->strict;
1921
$data =~ /FATAL: (.+?)$Newline$Newline/so and $self->throw("FATAL BLAST ERROR = $1");
1922
# No longer throwing exception when no hits were found. Still reporting it.
1923
$data =~ /No hits? found/i and $self->warn("No hits were found.") if $self->strict;
1925
# If this is the first Blast, the program, version, and database info
1926
# pertain to it. Otherwise, they are for the previous report and have
1927
# already been parsed out.
1928
# Data is stored in the static Blast object. Data for subsequent reports
1929
# will be stored in separate objects if the -share parameter is not set.
1930
# See _get_parse_blast_func().
1932
if($Blast->{'_blast_count'} == 1) {
1933
if($data =~ /(<\w+>)?(T?BLAST[NPX])\s+(.+?)$Newline/so) {
1934
$Blast->program($2);
1935
$Blast->program_version($3);
1937
$self->throw("Can't determine program type from BLAST report.",
1938
"Checked for: @Blast_programs.");
1939
# This has important implications for how to handle interval
1940
# information for HSPs. TBLASTN uses nucleotides in query HSP
1941
# but amino acids in the sbjct HSP sequence.
1944
if($data =~ m/Database:\s+(.+?)$Newline/so ) {
1945
$Blast->database($1);
1947
# In some reports, the Database is only listed at end.
1948
#$self->warn("Can't determine database name from BLAST report (_parse_header)\n$data\n.");
1952
my ($header, $descriptions);
1954
## For efficiency reasons, we want to to avoid using $' and $`.
1955
## Therefore using single-line mode pattern matching.
1957
if($data =~ /(.+?)\nSequences producing.+?\n(.+)/s ) {
1958
($header, $descriptions) = ($1, $2);
1959
$self->{'_has_descriptions'} = 1;
1962
$self->{'_has_descriptions'} = 0;
1963
# Blast reports can legally lack description section. No need to warn.
1964
#push @{$self->{'_blast_errs'}}, "Can't parse description data.";
1967
$self->_set_query($header); # The name of the sequence will appear in error report.
1968
# print STDERR "\nQUERY = ", $Blast->{'_current_blast'}->query, "\n";
1970
$self->_set_date($header) if $Blast->{'_get_stats'};
1971
$self->_set_length($header);
1973
# not $Blast->{'_confirm_significance'} and print STDERR "\nNOT PARSING DESCRIPTIONS.\n";
1975
# Setting the absolute max and min significance levels.
1976
$self->{'_highestSignif'} = 0;
1977
$self->{'_lowestSignif'} = $DEFAULT_SIGNIF;
1979
if ($Blast->{'_confirm_significance'} || $Blast->{'_no_aligns'}) {
1980
$self->_parse_descriptions($descriptions) if $descriptions;
1982
$self->{'_is_significant'} = 1;
1986
#-----------------------
1987
sub _parse_descriptions {
1988
#-----------------------
1989
my ($self, $desc) = @_;
1991
# NOTE: This method will not be called if the report lacks
1992
# a description section.
1994
# print STDERR "\nPARSING DESCRIPTION DATA\n";
1996
my @descriptions = split( $Newline, $desc);
1999
# NOW step through each line parsing out the P/Expect value
2000
# All we really need to do is check the first one, if it doesn't
2001
# meet the significance requirement, we can skip the report.
2002
# BUT: we want to collect data for all hits anyway to get min/max signif.
2004
my $my_signif = $self->signif;
2005
my $layout_set = $Blast->{'_layout'} || 0;
2011
foreach $line (@descriptions) {
2013
last desc_loop if $line =~ / NONE |End of List/;
2014
next desc_loop if $line =~ /^\s*$/;
2015
next desc_loop if $line =~ /^\.\./;
2017
## Checking the significance value (P- or Expect value) of the hit
2018
## in the description line.
2020
# These regexps need testing on a variety of reports.
2021
if ( $line =~ /\d+\s{1,5}[\de.-]+\s*$/) {
2023
} elsif( $line =~ /\d+\s{1,5}[\de.-]+\s{1,}\d+\s*$/) {
2026
$self->warn("Can't parse significance data in description line $line");
2029
not $layout_set and ($self->_layout($layout), $layout_set = 1);
2031
$sig = &_parse_signif( $line, $layout );
2033
# print STDERR " Parsed signif ($layout) = $sig\n";
2035
last desc_loop if ($sig > $my_signif and not $Blast->{'_check_all'});
2036
$self->_process_significance($sig, $my_signif);
2039
# printf "\n%d SIGNIFICANT HITS.\nDONE PARSING DESCRIPTIONS.\n", $self->{'_num_hits_significant'};
2042
sub _process_significance {
2043
my($self, $sig, $my_signif) = @_;
2045
$self->{'_highestSignif'} = ($sig > $self->{'_highestSignif'})
2046
? $sig : $self->{'_highestSignif'};
2048
$self->{'_lowestSignif'} = ($sig < $self->{'_lowestSignif'})
2049
? $sig : $self->{'_lowestSignif'};
2051
# Significance value assessment.
2052
$sig <= $my_signif and $self->{'_num_hits_significant'}++;
2053
$self->{'_num_hits'}++;
2055
$self->{'_is_significant'} = 1 if $self->{'_num_hits_significant'};
2058
=head2 _parse_alignment
2060
Usage : n/a; called automatically by the _get_parse_blast_func().
2061
Purpose : Parses a single alignment section of a BLAST report.
2062
Argument : String containing the alignment section.
2063
Throws : n/a; All errors are trapped while parsing the hit data
2064
: and are processed as a group when the report is
2065
: completely processed (See _report_errors()).
2067
Comments : Alignment section contains all HSPs for a hit.
2068
: Requires Bio::Tools::Blast::Sbjct.pm.
2069
: Optionally calls a filter function to screen the hit on arbitrary
2070
: criteria. If the filter function returns true for a given hit,
2071
: that hit will be skipped.
2073
: If the Blast object was created with -check_all_hits set to true,
2074
: all hits will be checked for significance and processed if necessary.
2075
: If this field is false, the parsing will stop after the first
2076
: non-significant hit.
2077
: See parse() for description of parsing parameters.
2079
See Also : L<parse()|parse>, L<_get_parse_blast_func()|_get_parse_blast_func>, L<_report_errors()|_report_errors>, B<Bio::Tools::Blast::Sbjct()>, L<Links to related modules>
2083
#----------------------
2084
sub _parse_alignment {
2085
#----------------------
2086
# This method always needs to check detect if the $data argument
2087
# contains the footer of a Blast report, indicating the last chunk
2088
# of a single Blast stream.
2090
my( $self, $data ) = @_;
2092
# printf STDERR "\nPARSING ALIGNMENT DATA for %s $self.\n", $self->name;
2094
# NOTE: $self->{'_current_hit'} is an instance variable
2095
# The $Blast object will not have this member.
2097
# If all of the significant hits have been parsed,
2098
# return if we're not checking all or if we don't need to get
2099
# the Blast stats (parameters at footer of report).
2100
if(defined $self->{'_current_hit'} and
2101
defined $self->{'_num_hits_significant'}) {
2102
return if $self->{'_current_hit'} >= $self->{'_num_hits_significant'} and
2103
not ($Blast->{'_check_all'} or $Blast->{'_get_stats'});
2106
# Check for the presence of the Blast footer section.
2107
# _parse_footer returns the alignment section.
2108
$data = $self->_parse_footer($data);
2110
# Return if we're only interested in the best hit.
2111
# This has to occur after checking for the parameters section
2112
# in the footer (since we may still be interested in them).
2113
return if $Blast->best and ( defined $self->{'_current_hit'} and $self->{'_current_hit'} >=1);
2115
# print "RETURNED FROM _parse_footer (", $self->to_string, ")";
2116
# print "\n --> FOUND PARAMS.\n" if $self->{'_found_params'};
2117
# print "\n --> DID NOT FIND PARAMS.\n" unless $self->{'_found_params'};
2119
require Bio::Tools::Blast::Sbjct;
2121
$data =~ s/^\s+|\s+>?$//sg;
2122
$data =~ s/$Newline$Newline/$Newline/sog; # remove blank lines.
2123
my @data = split($Newline, $data);
2126
# print STDERR "\nALIGNMENT DATA:\n$data\n";
2128
my $prog = $self->program;
2129
my $check_all = $Blast->{'_check_all'};
2130
my $filt_func = $Blast->{'_filt_func'} || 0;
2131
my $signif_fmt = $Blast->{'_signif_fmt'};
2132
my $my_signif = $self->signif;
2135
# Now construct the Sbjct objects from the alignment section
2139
$self->{'_current_hit'}++;
2141
# If not confirming significance, _parse_descriptions will not have been run,
2142
# so we need to count the total number of hits here.
2143
if( not $Blast->{'_confirm_significance'}) {
2144
$self->{'_num_hits'}++;
2147
if($Blast->{'_no_aligns'}) {
2148
# printf STDERR "\nNOT PARSING ALIGNMENT DATA\n";
2152
my $hit; # Must be my'ed within hit_loop.
2154
$hit = new Bio::Tools::Blast::Sbjct (-DATA =>\@data,
2156
-NAME =>$self->{'_current_hit'},
2157
-RANK =>$self->{'_current_hit'},
2160
-SIGNIF_FMT=>$signif_fmt,
2161
-OVERLAP =>$Blast->{'_overlap'} || $MAX_HSP_OVERLAP,
2163
# printf STDERR "NEW HIT: %s, SIGNIFICANCE = %g\n", $hit->name, $hit->expect; <STDIN>;
2164
# The BLAST report may have not had a description section.
2165
if(not $self->{'_has_descriptions'}) {
2166
$self->_process_significance($hit->signif, $my_signif);
2171
# Throwing lots of errors can slow down the code substantially.
2172
# Error handling code is not that efficient.
2173
#print STDERR "\nERROR _parse_alignment: $@\n";
2174
push @{$self->{'_blast_errs'}}, $@;
2175
$hit->destroy if ref $hit;
2178
# Collect overall signif data if we don't already have it,
2179
# (as occurs if no -signif parameter is supplied).
2180
my $hit_signif = $hit->signif;
2182
if (not $Blast->{'_confirm_significance'} ) {
2183
$self->{'_highestSignif'} = ($hit_signif > $self->{'_highestSignif'})
2184
? $hit_signif : $self->{'_highestSignif'};
2186
$self->{'_lowestSignif'} = ($hit_signif < $self->{'_lowestSignif'})
2187
? $hit_signif : $self->{'_lowestSignif'};
2190
# Test significance using custom function (if supplied)
2192
if(&$filt_func($hit)) {
2193
push @{$self->{'_hits'}}, $hit;
2195
$hit->destroy; undef $hit;
2197
} elsif($hit_signif <= $my_signif) {
2198
push @{$self->{'_hits'}}, $hit;
2204
=head2 _parse_footer
2206
Usage : n/a; internal function. called by _parse_alignment()
2207
Purpose : Extracts statistical and other parameters from the BLAST report.
2208
: Sets various key elements such as the program and version,
2209
: gapping, and the layout for the report (blast1 or blast2).
2210
Argument : Data to be parsed.
2211
Returns : String containing an alignment section for processing by
2212
: _parse_alignment().
2213
Throws : Exception if cannot find the parameters section of report.
2214
: Warning if cannot determine if gapping was used.
2215
: Warning if cannot determine the scoring matrix used.
2216
Comments : This method must always get called, even if the -STATS
2217
: parse() parameter is false. The reason is that the layout
2218
: of the report and the presence of gapping must always be set.
2219
: The determination whether to set additional stats is made
2220
: by methods called by _parse_footer().
2222
See Also : L<parse()|parse>, L<_parse_alignment()|_parse_alignment>, L<_set_database()|_set_database>
2226
#---------------------
2228
#---------------------
2230
# 1. figure out if we're supposed to get the stats,
2231
# 2. figure out if the stats are to be shared. some, not all can be shared
2232
# (eg., db info and matrix can be shared, karlin altschul params cannot.
2233
# However, this method assumes they are all sharable.)
2234
# 3. Parse the stats.
2235
# 4. return the block before the parameters section if the supplied data
2236
# contains a footer parameters section.
2238
my ($self, $data) = @_;
2239
my ($client, $last_align, $params);
2241
# printf STDERR "\nPARSING PARAMETERS for %s $self.\n", $self->name;
2243
# Should the parameters be shared?
2244
# If so, set $self to be the static $Blast object and return if
2245
# the parameters were already set.
2246
# Before returning, we need to extract the last alignment section
2247
# from the parameter section, if any.
2249
if ($Blast->{'_share'}) {
2251
$self = $Blast if $Blast->{'_share'};
2254
my $get_stats = $Blast->{'_get_stats'};
2256
if( $data =~ /(.+?)${Newline}CPU time: (.*)/so) {
2257
# NCBI-Blast2 format (v2.04).
2258
($last_align, $params) = ($1, $2);
2259
return $last_align if $client->{'_found_params'};
2260
$self->_set_blast2_stats($params);
2262
} elsif( $data =~ /(.+?)${Newline}Parameters:(.*)/so) {
2263
# NCBI-Blast1 or WashU-Blast2 format.
2264
($last_align, $params) = ($1, $2);
2265
return $last_align if $client->{'_found_params'};
2266
$self->_set_blast1_stats($params);
2268
} elsif( $data =~ /(.+?)$Newline\s+Database:(.*)/so) {
2269
# Gotta watch out for confusion with the Database: line in the header
2270
# which will be present in the last hit of an internal Blast report
2271
# in a multi-report stream.
2273
# NCBI-Blast2 format (v2.05).
2274
($last_align, $params) = ($1, $2);
2275
return $last_align if $client->{'_found_params'};
2276
$self->_set_blast2_stats($params);
2278
} elsif( $data =~ /(.+?)$Newline\s*Searching/so) {
2279
# trying to detect a Searching at the end of a PSI-blast round.
2280
# Gotta watch out for confusion with the Searching line in the header
2281
# which will be present in the last hit of an internal Blast report
2282
# in a multi-report, non-PSI-blast stream.
2284
# PSI-Blast format (v2.08).
2285
($last_align) = ($1);
2286
return $last_align; # if $client->{'_found_params'};
2289
# If parameter section was found, set a boolean,
2290
# otherwise return original data.
2292
if (defined($params)) {
2293
$client->{'_found_params'} = 1;
2298
$self->_set_database($params) if $get_stats;
2300
# The {'_gapped'} member should be set in the _set_blast?_stats() call.
2301
# This is a last minute attempt to deduce it.
2303
if(!defined($self->{'_gapped'})) {
2304
if($self->program_version() =~ /^1/) {
2305
$self->{'_gapped'} = 0;
2307
if($self->strict > 0) {
2308
$self->warn("Can't determine if gapping was used. Assuming gapped.");
2310
$self->{'_gapped'} = 1;
2317
=head2 _set_blast2_stats
2319
Usage : n/a; internal function called by _parse_footer()
2320
Purpose : Extracts statistical and other parameters from BLAST2 report footer.
2321
: Stats collected: database release, gapping,
2322
: posted date, matrix used, filter used, Karlin-Altschul parameters,
2324
Throws : Exception if cannot get "Parameters" section of Blast report.
2326
See Also : L<parse()|parse>, L<_parse_footer()|_parse_footer>, L<_set_database()|_set_database>, B<Bio::Tools::SeqAnal::set_date()>,L<Links to related modules>
2330
#---------------------'
2331
sub _set_blast2_stats {
2332
#---------------------
2333
my ($self, $data) = (@_);
2335
if($data =~ /$Newline\s*Gapped/so) {
2336
$self->{'_gapped'} = 1;
2338
$self->{'_gapped'} = 0;
2341
# Other stats are not always essential.
2342
return unless $Blast->{'_get_stats'};
2344
# Blast2 Doesn't report what filter was used in the parameters section.
2345
# It just gives a warning that *some* filter was used in the header.
2346
# You just have to know the defaults (currently: protein = SEG, nucl = DUST).
2347
if($data =~ /\bfiltered\b/si) {
2348
$self->{'_filter'} = 'DEFAULT FILTER';
2350
$self->{'_filter'} = 'NONE';
2353
if($data =~ /Gapped$Newline\s*Lambda +K +H$Newline +(.+?)$Newline/so) {
2354
my ($l, $k, $h) = split(/\s+/, $1);
2355
$self->{'_lambda'} = $l || 'UNKNOWN';
2356
$self->{'_k'} = $k || 'UNKNOWN';
2357
$self->{'_h'} = $h || 'UNKNOWN';
2358
} elsif($data =~ /Lambda +K +H$Newline +(.+?)$Newline/so) {
2359
my ($l, $k, $h) = split(/\s+/, $1);
2360
$self->{'_lambda'} = $l || 'UNKNOWN';
2361
$self->{'_k'} = $k || 'UNKNOWN';
2362
$self->{'_h'} = $h || 'UNKNOWN';
2365
if($data =~ /$Newline\s*Matrix: (.+?)$Newline/so) {
2366
$self->{'_matrix'} = $1;
2368
$self->{'_matrix'} = $DEFAULT_MATRIX.'?';
2369
if($self->strict > 0) {
2370
$self->warn("Can't determine scoring matrix. Assuming $DEFAULT_MATRIX.");
2374
if($data =~ /$Newline\s*Gap Penalties: Existence: +(\d+), +Extension: (\d+)$Newline/so) {
2375
$self->{'_gapCreation'} = $1;
2376
$self->{'_gapExtension'} = $2;
2378
if($data =~ /sequences better than (\d+):/s) {
2379
$self->{'_expect'} = $1;
2382
if($data =~ /$Newline\s*T: (\d+)/o) { $self->{'_word_size'} = $1; }
2383
if($data =~ /$Newline\s*A: (\d+)/o) { $self->{'_a'} = $1; }
2384
if($data =~ /$Newline\s*S1: (\d+)/o) { $self->{'_s'} = $1; }
2385
if($data =~ /$Newline\s*S2: (\d+)/o) { $self->{'_s'} .= ", $1"; }
2386
if($data =~ /$Newline\s*X1: (\d+)/o) { $self->{'_x1'} = $1; }
2387
if($data =~ /$Newline\s*X2: (\d+)/o) { $self->{'_x2'} = $1; }
2390
=head2 _set_blast1_stats
2392
Usage : n/a; internal function called by _parse_footer()
2393
Purpose : Extracts statistical and other parameters from BLAST 1.x style eports.
2394
: Handles NCBI Blast1 and WashU-Blast2 formats.
2395
: Stats collected: database release, gapping,
2396
: posted date, matrix used, filter used, Karlin-Altschul parameters,
2399
See Also : L<parse()|parse>, L<_parse_footer()|_parse_footer>, L<_set_database()|_set_database>, B<Bio::Tools::SeqAnal::set_date()>,L<Links to related modules>
2403
#----------------------
2404
sub _set_blast1_stats {
2405
#----------------------
2406
my ($self, $data) = (@_);
2408
if(!$self->{'_gapped'} and $self->program_version() =~ /^2[\w\-\.]+WashU/) {
2409
$self->_set_gapping_wu($data);
2411
$self->{'_gapped'} = 0;
2414
# Other stats are not always essential.
2415
return unless $Blast->{'_get_stats'};
2417
if($data =~ /filter=(.+?)$Newline/so) {
2418
$self->{'_filter'} = $1;
2419
} elsif($data =~ /filter$Newline +(.+?)$Newline/so) {
2420
$self->{'_filter'} = $1;
2422
$self->{'_filter'} = 'NONE';
2425
if($data =~ /$Newline\s*E=(\d+)$Newline/so) { $self->{'_expect'} = $1; }
2427
if($data =~ /$Newline\s*M=(\w+)$Newline/so) { $self->{'_matrix'} = $1; }
2429
if($data =~ /\s*Frame MatID Matrix name .+?$Newline +(.+?)$Newline/so) {
2431
my ($fr, $mid, $mat, $lu, $ku, $hu, $lc, $kc, $hc) = split(/\s+/,$1);
2432
$self->{'_matrix'} = $mat || 'UNKNOWN';
2433
$self->{'_lambda'} = $lu || 'UNKNOWN';
2434
$self->{'_k'} = $ku || 'UNKNOWN';
2435
$self->{'_h'} = $hu || 'UNKNOWN';
2437
} elsif($data =~ /Lambda +K +H$Newline +(.+?)$Newline/so) {
2439
my ($l, $k, $h) = split(/\s+/, $1);
2440
$self->{'_lambda'} = $l || 'UNKNOWN';
2441
$self->{'_k'} = $k || 'UNKNOWN';
2442
$self->{'_h'} = $h || 'UNKNOWN';
2445
if($data =~ /E +S +W +T +X.+?$Newline +(.+?)$Newline/so) {
2447
my ($fr, $mid, $len, $elen, $e, $s, $w, $t, $x, $e2, $s2) = split(/\s+/,$1);
2448
$self->{'_expect'} ||= $e || 'UNKNOWN';
2449
$self->{'_s'} = $s || 'UNKNOWN';
2450
$self->{'_word_size'} = $w || 'UNKNOWN';
2451
$self->{'_t'} = $t || 'UNKNOWN';
2452
$self->{'_x'} = $x || 'UNKNOWN';
2454
} elsif($data =~ /E +S +T1 +T2 +X1 +X2 +W +Gap$Newline +(.+?)$Newline/so) {
2456
my ($e, $s, $t1, $t2, $x1, $x2, $w, $gap) = split(/\s+/,$1);
2457
$self->{'_expect'} ||= $e || 'UNKNOWN';
2458
$self->{'_s'} = $s || 'UNKNOWN';
2459
$self->{'_word_size'} = $w || 'UNKNOWN';
2460
$self->{'_t1'} = $t1 || 'UNKNOWN';
2461
$self->{'_t2'} = $t2 || 'UNKNOWN';
2462
$self->{'_x1'} = $x1 || 'UNKNOWN';
2463
$self->{'_x2'} = $x2 || 'UNKNOWN';
2464
$self->{'_gap'} = $gap || 'UNKNOWN';
2467
if(!$self->{'_matrix'}) {
2468
$self->{'_matrix'} = $DEFAULT_MATRIX.'?';
2469
if($self->strict > 0) {
2470
$self->warn("Can't determine scoring matrix. Assuming $DEFAULT_MATRIX.");
2475
=head2 _set_gapping_wu
2477
Usage : n/a; internal function called by _set_blast1_stats()
2478
Purpose : Determine if gapping_wu was on for WashU Blast reports.
2479
Comments : In earlier versions, gapping was always specified
2480
: but in the current version (2.0a19MP), gapping is on by default
2481
: and there is no positive "gapping" indicator in the Parameters
2484
See Also : L<_set_blast1_stats()|_set_blast1_stats>
2488
#--------------------
2489
sub _set_gapping_wu {
2490
#--------------------
2491
my ($self, $data) = @_;
2493
if($data =~ /gaps?$Newline/so) {
2494
$self->{'_gapped'} = ($data =~ /nogaps?$Newline/so) ? 0 : 1;
2496
$self->{'_gapped'} = 1;
2502
Usage : n/a; internal function called by _parse_footer()
2503
Purpose : Determine the date on which the Blast analysis was performed.
2504
Comments : Date information is not consistently added to Blast output.
2505
: Uses superclass method set_date() to set date from the file,
2508
See Also : L<_parse_footer()|_parse_footer>, B<Bio::Tools::SeqAnal::set_date()>,L<Links to related modules>
2518
### Network BLAST reports from NCBI are time stamped as follows:
2519
#Fri Apr 18 15:55:41 EDT 1997, Up 1 day, 19 mins, 1 user, load: 19.54, 19.13, 17.77
2520
if($data =~ /Start:\s+(.+?)\s+End:/s) {
2521
## Calling superclass method to set the date.
2522
## If we can't get date from the report, file date is obtained.
2523
$self->set_date($1);
2524
} elsif($data =~ /Date:\s+(.*?)$Newline/so) {
2525
## E-mailed reports have a Date: field
2526
$self->set_date($1);
2527
} elsif( $data =~ /done\s+at (.+?)$Newline/so ) {
2528
$self->set_date($1);
2529
} elsif( $data =~ /$Newline([\w:, ]+), Up \d+/so ) {
2530
$self->set_date($1);
2532
## Otherwise, let superclass attempt to get the file creation date.
2533
$self->set_date() if $self->file;
2539
Usage : n/a; called automatically during Blast report parsing.
2540
Purpose : Sets the length of the query sequence (extracted from report).
2541
Returns : integer (length of the query sequence)
2542
Throws : Exception if cannot determine the query sequence length from
2544
: Exception if the length is below the min_length cutoff (if any).
2545
Comments : The logic here is a bit different from the other _set_XXX()
2546
: methods since the significance of the BLAST report is assessed
2547
: if MIN_LENGTH is set.
2549
See Also : B<Bio::Tools::SeqAnal::length()>, L<Links to related modules>
2556
my ($self, $data) = @_;
2559
if( $data =~ m/$Newline\s+\(([\d|,]+) letters[\);]/so ) {
2562
# printf "Length = $length in BLAST for %s$Newline",$self->name; <STDIN>;
2564
$self->throw("Can't determine sequence length from BLAST report.");
2568
if(defined($Blast->{'_min_length'})) {
2570
if($length < $Blast->{'_min_len'}) {
2571
$self->throw("Query sequence too short for ${\$self->name} ($length)",
2572
"Minimum length is $Blast->{'_min_len'}");
2576
$self->length($length); # defined in superclass.
2579
=head2 _set_database
2581
Usage : n/a; called automatically during Blast report parsing.
2582
Purpose : Sets the name of the database used by the BLAST analysis.
2583
: Extracted from raw BLAST report.
2584
Throws : Exception if the name of the database cannot be determined.
2585
Comments : The database name is used by methods or related objects
2586
: for database-specific parsing.
2588
See Also : L<parse()|parse>, B<Bio::Tools::SeqAnal::database()>,B<Bio::Tools::SeqAnal::_set_db_stats()>,L<Links to related modules>
2595
# This now only sets data base information extracted from the report footer.
2597
my ($self, $data) = @_;
2599
my ($name, $date, $lets, $seqs);
2601
my $strict = $self->strict > 0;
2603
# This is fail-safe since DB name usually gets set in _parse_header()
2604
# In some reports, the database is only listed at bottom (NCBI 2.0.8).
2605
if($data =~ m/Database: +(.+?)$Newline/so ) {
2607
} elsif(not $self->database) {
2608
$self->warn("Can't determine database name from BLAST report.");
2611
if($data =~ m/Posted date: +(.+?)$Newline/so ) {
2613
} elsif($data =~ m/Release date: +(.+?)$Newline/so ) {
2616
$self->warn("Can't determine database release date.");
2619
if($data =~ m/letters in database: +([\d,]+)/si ||
2620
$data =~ m/length of database: +([\d,]+)/si ) {
2623
$self->warn("Can't determine number of letters in database.\n$data\n");
2626
if($data =~ m/sequences in database: +([\d,]+)/si ||
2627
$data =~ m/number of sequences: +([\d,]+)/si ) {
2630
$self->warn("Can't determine number of sequences in database.\n$data\n");
2633
$self->_set_db_stats( -NAME => $name,
2634
-RELEASE => $date || '',
2635
-LETTERS => $lets || '',
2636
-SEQS => $seqs || ''
2642
Usage : n/a; called automatically during Blast report parsing.
2643
Purpose : Set the name of the query and the query description.
2644
: Extracted from the raw BLAST report.
2645
Returns : String containing name of query extracted from report.
2646
Throws : Warning if the name of the query cannont be obtained.
2648
See Also : B<Bio::Tools::SeqAnal::query_desc()>,L<Links to related modules>
2658
if($data =~ m/${Newline}Query= *(.+?)$Newline/so ) {
2660
$info =~ s/TITLE //;
2661
# Split the query line into two parts.
2662
# Using \s instead of ' '
2663
$info =~ /(\S+?)\s(.*)/;
2664
$self->query_desc($2 || '');
2665
# set name of Blast object and return.
2666
$self->name($1 || 'UNKNOWN');
2668
$self->warn("Can't determine query sequence name from BLAST report.");
2670
# print STDERR "$Newline NAME = ${\$self->name}$Newline";
2673
=head2 _parse_signif
2675
Usage : &_parse_signif(string, layout, gapped);
2676
: This is a class function.
2677
Purpose : Extracts the P- or Expect value from a single line of a BLAST description section.
2678
Example : &_parse_signif("PDB_UNIQUEP:3HSC_ heat-shock cognate ... 799 4.0e-206 2", 1);
2679
: &_parse_signif("gi|758803 (U23828) peritrophin-95 precurs 38 0.19", 2);
2680
Argument : string = line from BLAST description section
2681
: layout = integer (1 or 2)
2682
: gapped = boolean (true if gapped Blast).
2683
Returns : Float (0.001 or 1e-03)
2691
my ($line, $layout, $gapped) = @_;
2694
my @linedat = split();
2696
# When processing both Blast1 and Blast2 reports
2697
# in the same run, offset needs to be configured each time.
2700
$offset = 1 if $layout == 1 or not $gapped;
2702
my $signif = $linedat[ $#linedat - $offset ];
2705
if(not $signif =~ /[.-]/) {
2706
$offset = ($offset == 0 ? 1 : 0);
2707
$signif = $linedat[ $#linedat - $offset ];
2710
$signif = "1$signif" if $signif =~ /^e/i;
2715
## BEGIN ACCESSOR METHODS THAT INCORPORATE THE STATIC $Blast OBJECT.
2719
## Overridden method to incorporate the BLAST object.
2721
return $self->SUPER::program(@_) if @_; # set
2722
$self->SUPER::program || $Blast->SUPER::program; # get
2725
sub program_version {
2726
## Overridden method to incorporate the BLAST object.
2728
return $self->SUPER::program_version(@_) if @_; # set
2729
$self->SUPER::program_version || $Blast->SUPER::program_version; # get
2733
## Overridden method to incorporate the BLAST object.
2735
return $self->SUPER::database(@_) if @_; # set
2736
$self->SUPER::database || $Blast->SUPER::database; # get
2739
sub database_letters {
2740
## Overridden method to incorporate the BLAST object.
2742
return $self->SUPER::database_letters(@_) if @_; # set
2743
$self->SUPER::database_letters || $Blast->SUPER::database_letters; # get
2746
sub database_release {
2747
## Overridden method to incorporate the BLAST object.
2749
return $self->SUPER::database_release(@_) if @_; # set
2750
$self->SUPER::database_release || $Blast->SUPER::database_release; # get
2754
## Overridden method to incorporate the BLAST object.
2756
return $self->SUPER::database_seqs(@_) if @_; # set
2757
$self->SUPER::database_seqs || $Blast->SUPER::database_seqs; # get
2761
## Overridden method to incorporate the BLAST object.
2763
return $self->SUPER::date(@_) if @_; # set
2764
$self->SUPER::date || $Blast->SUPER::date; # get
2768
## Overridden method to incorporate the BLAST object.
2770
return $Blast->SUPER::best(@_) if @_; # set
2771
$Blast->SUPER::best; # get
2776
Usage : $blast->signif();
2777
Purpose : Gets the P or Expect value used as significance screening cutoff.
2778
Returns : Scientific notation number with this format: 1.0e-05.
2780
Comments : Screening of significant hits uses the data provided on the
2781
: description line. For Blast1 and WU-Blast2, this data is P-value.
2782
: for Blast2 it is an Expect value.
2784
: Obtains info from the static $Blast object if it has not been set
2785
: for the current object.
2787
See Also : L<_set_signif()|_set_signif>
2795
my $sig = $self->{'_significance'} || $Blast->{'_significance'};
2796
sprintf "%.1e", $sig;
2801
Usage : $blast->is_signif();
2802
Purpose : Determine if the BLAST report contains significant hits.
2805
Comments : BLAST reports without significant hits but with defined
2806
: significance criteria will throw exceptions during construction.
2807
: This obviates the need to check significant() for
2810
See Also : L<_set_signif()|_set_signif>
2815
sub is_signif { my $self = shift; return $self->{'_is_significant'}; }
2818
# is_signif() doesn't incorporate the static $Blast object but is included
2819
# here to be with the other 'signif' methods.
2823
Usage : $blast->signif_fmt( [FMT] );
2824
Purpose : Allows retrieval of the P/Expect exponent values only
2825
: or as a two-element list (mantissa, exponent).
2826
Usage : $blast_obj->signif_fmt('exp');
2827
: $blast_obj->signif_fmt('parts');
2828
Returns : String or '' if not set.
2829
Argument : String, FMT = 'exp' (return the exponent only)
2830
: = 'parts'(return exponent + mantissa in 2-elem list)
2831
: = undefined (return the raw value)
2832
Comments : P/Expect values are still stored internally as the full,
2833
: scientific notation value.
2834
: This method uses the static $Blast object since this issue
2835
: will pertain to all Blast reports within a given set.
2836
: This setting is propagated to Bio::Tools::Blast::Sbjct.pm.
2844
if(@_) { $Blast->{'_signif_fmt'} = shift; }
2845
$Blast->{'_signif_fmt'} || '';
2850
Usage : $blast->min_length();
2851
Purpose : Gets the query sequence length used as significance screening criteria.
2854
Comments : Obtains info from the static $Blast object if it has not been set
2855
: for the current object.
2857
See Also : L<_set_signif()|_set_signif>, L<signif()|signif>
2865
$self->{'_min_length'} || $Blast->{'_min_length'};
2870
Usage : $blast->gapped();
2871
Purpose : Set/Get boolean indicator for gapped BLAST.
2874
Comments : Obtains info from the static $Blast object if it has not been set
2875
: for the current object.
2883
if(@_) { $self->{'_gapped'} = shift; }
2884
$self->{'_gapped'} || $Blast->{'_gapped'};
2889
Usage : n/a; internal method.
2890
Purpose : Set/Get indicator for collecting full statistics from report.
2891
Returns : Boolean (0 | 1)
2892
Comments : Obtains info from the static $Blast object which gets set
2893
: by _init_parse_params().
2901
$Blast->{'_get_stats'};
2906
Usage : n/a; internal method.
2907
Purpose : Set/Get indicator for the layout of the report.
2908
Returns : Integer (1 | 2)
2909
: Defaults to 2 if not set.
2910
Comments : Blast1 and WashU-Blast2 have a layout = 1.
2911
: This is intended for internal use by this and closely
2912
: allied modules like Sbjct.pm and HSP.pm.
2914
: Obtains info from the static $Blast object if it has not been set
2915
: for the current object.
2924
# Optimization if we know all reports share the same stats.
2925
if($Blast->{'_share'}) {
2926
$Blast->{'_layout'} = shift;
2928
$self->{'_layout'} = shift;
2931
$self->{'_layout'} || $Blast->{'_layout'} || 2;
2935
## END ACCESSOR METHODS THAT INCORPORATE THE STATIC $Blast OBJECT.
2940
Usage : $blast->hits();
2941
Purpose : Get a list containing all BLAST hit (Sbjct) objects.
2942
: Get the numbers of significant hits.
2943
Examples : @hits = $blast->hits();
2944
: $num_signif = $blast->hits();
2945
Returns : List context : list of Bio::Tools::Blast::Sbjct.pm objects
2946
: or an empty list if there are no hits.
2947
: Scalar context: integer (number of significant hits)
2948
: or zero if there are no hits.
2949
: (Equivalent to num_hits()).
2950
Argument : n/a. Relies on wantarray.
2952
: Not throwing exception because the absence of hits may have
2953
: resulted from stringent significance criteria, not a failure
2956
See Also : L<hit()|hit>, L<num_hits()|num_hits>, L<is_signif()|is_signif>, L<_set_signif()|_set_signif>
2966
my @ary = ref($self->{'_hits'}) ? @{$self->{'_hits'}} : ();
2969
return $self->num_hits();
2972
# my $num = ref($self->{'_hits'}) ? scalar(@{$self->{'_hits'}}) : 0;
2973
# my @ary = ref($self->{'_hits'}) ? @{$self->{'_hits'}} : ();
2976
# # returning list containing all hits or empty list.
2977
# ? $self->{'_is_significant'} ? @ary : ()
2978
# # returning number of hits or 0.
2979
# : $self->{'_is_significant'} ? $num : 0;
2984
Example : $blast_obj->hit( [class] )
2985
Purpose : Get a specific hit object.
2986
: Provides some syntactic sugar for the hits() method.
2987
Usage : $hitObj = $blast->hit();
2988
: $hitObj = $blast->hit('best');
2989
: $hitObj = $blast->hit('worst');
2990
: $hitObj = $blast->hit( $name );
2991
Returns : Object reference for a Bio::Tools::Blast::Sbjct.pm object.
2992
: undef if there are no hit (Sbjct) objects defined.
2993
Argument : Class (or no argument).
2994
: No argument (default) = highest scoring hit (same as 'best').
2995
: 'best' or 'first' = highest scoring hit.
2996
: 'worst' or 'last' = lowest scoring hit.
2997
: $name = retrieve a hit by seq id (case-insensitive).
2998
Throws : Exception if the Blast object has no significant hits.
2999
: Exception if a hit cannot be found when supplying a specific
3000
: hit sequence identifier as an argument.
3001
Comments : 'best' = lowest significance value (P or Expect) among significant hits.
3002
: 'worst' = highest sigificance value (P or Expect) among significant hits.
3004
See Also : L<hits()|hits>, L<num_hits()|num_hits>, L<is_signif()|is_signif>
3011
my( $self, $option) = @_;
3014
if($Blast->{'_no_aligns'} || ! ref($self->{'_hits'})) {
3018
$self->{'_is_significant'} or
3019
$self->throw("There were no significant hits.",
3020
"Use num_hits(), hits(), is_signif() to check.");
3022
my @hits = @{$self->{'_hits'}};
3024
return $hits[0] if $option =~ /^(best|first|1)$/i;
3025
return $hits[$#hits] if $option =~ /^(worst|last)$/i;
3029
return $_ if $_->name() =~ /$option/i;
3032
$self->throw("Can't get hit for: $option");
3037
Usage : $blast->num_hits( ['total'] );
3038
Purpose : Get number of significant hits or number of total hits.
3039
Examples : $num_signif = $blast-num_hits;
3040
: $num_total = $blast->num_hits('total');
3042
Argument : String = 'total' (or no argument).
3043
: No argument (Default) = return number of significant hits.
3044
: 'total' = number of total hits.
3046
: Not throwing exception because the absence of hits may have
3047
: resulted from stringent significance criteria, not a failure
3049
Comments : A significant hit is defined as a hit with an expect value
3050
: (or P value for WU-Blast) at or below the -signif parameter
3051
: used when parsing the report. Additionally, if a filter function
3052
: was supplied, the significant hit must also pass that
3055
See Also : L<hits()|hits>, L<hit()|hit>, L<is_signif()|is_signif>, L<_set_signif()|_set_signif>, L<parse()|parse>
3062
my( $self, $option) = @_;
3065
$option =~ /total/i and return $self->{'_num_hits'} || 0;
3067
# Default: returning number of significant hits.
3068
# return $self->{'_num_hits_significant'} || 0;
3069
# return 0 if not ref $self->{'_hits'};
3071
if(ref $self->{'_hits'}) {
3072
return scalar(@{$self->{'_hits'}});
3074
return $self->{'_num_hits_significant'} || 0;
3080
Usage : $blast->lowest_p()
3081
Purpose : Get the lowest P-value among all hits in a BLAST report.
3082
: Syntactic sugar for $blast->hit('best')->p().
3083
Returns : Float or scientific notation number.
3084
: Returns -1.0 if lowest_p has not been set.
3086
Throws : Exception if the Blast report does not report P-values
3087
: (as is the case for NCBI Blast2).
3088
Comments : A value is returned regardless of whether or not there were
3089
: significant hits ($DEFAULT_SIGNIF, currently 999).
3091
See Also : L<lowest_expect()|lowest_expect>, L<lowest_signif()|lowest_signif>, L<highest_p()|highest_p>, L<signif_fmt()|signif_fmt>
3100
# Layout 2 = NCBI Blast 2.x does not report P-values.
3101
$self->_layout == 2 and
3102
$self->throw("Can't get P-value with BLAST2.",
3103
"Use lowest_signif() or lowest_expect()");
3105
return $self->{'_lowestSignif'} || -1.0;
3108
=head2 lowest_expect
3110
Usage : $blast->lowest_expect()
3111
Purpose : Get the lowest Expect value among all hits in a BLAST report.
3112
: Syntactic sugar for $blast->hit('best')->expect()
3113
Returns : Float or scientific notation number.
3114
: Returns -1.0 if lowest_expect has not been set.
3116
Throws : Exception if there were no significant hits and the report
3117
: does not have Expect values on the description lines
3118
: (i.e., Blast1, WashU-Blast2).
3120
See Also : L<lowest_p()|lowest_p>, L<lowest_signif()|lowest_signif>, L<highest_expect()|highest_expect>, L<signif_fmt()|signif_fmt>
3129
if ($self->_layout == 2) {
3130
return $self->{'_lowestSignif'} || -1.0;
3133
if($self->{'_is_significant'}) {
3134
my $bestHit = $self->{'_hits'}->[0];
3135
return $bestHit->expect();
3137
$self->throw("Can't get lowest expect value: no significant hits ",
3138
"The format of this report requires expect values to be extracted$Newline".
3139
"from the hits themselves.");
3145
Example : $blast->highest_p( ['overall'])
3146
Purpose : Get the highest P-value among all hits in a BLAST report.
3147
: Syntactic sugar for $blast->hit('worst')->p()
3148
: Can also get the highest P-value overall (not just among signif hits).
3149
Usage : $p_signif = $blast->highest_p();
3150
: $p_all = $blast->highest_p('overall');
3151
Returns : Float or scientific notation number.
3152
: Returns -1.0 if highest_p has not been set.
3153
Argument : String 'overall' or no argument.
3154
: No argument = get highest P-value among significant hits.
3155
Throws : Exception if object is created from a Blast2 report
3156
: (which does not report P-values).
3158
See Also : L<highest_signif()|highest_signif>, L<lowest_p()|lowest_p>, L<_set_signif()|_set_signif>, L<signif_fmt()|signif_fmt>
3165
my ($self, $overall) = @_;
3167
# Layout 2 = NCBI Blast 2.x does not report P-values.
3168
$self->_layout == 2 and
3169
$self->throw("Can't get P-value with BLAST2.",
3170
"Use highest_signif() or highest_expect()");
3172
$overall and return $self->{'_highestSignif'} || -1.0;
3173
$self->hit('worst')->p();
3176
=head2 highest_expect
3178
Usage : $blast_object->highest_expect( ['overall'])
3179
Purpose : Get the highest Expect value among all significant hits in a BLAST report.
3180
: Syntactic sugar for $blast->hit('worst')->expect()
3181
Examples : $e_sig = $blast->highest_expect();
3182
: $e_all = $blast->highest_expect('overall');
3183
Returns : Float or scientific notation number.
3184
: Returns -1.0 if highest_exoect has not been set.
3185
Argument : String 'overall' or no argument.
3186
: No argument = get highest Expect-value among significant hits.
3187
Throws : Exception if there were no significant hits and the report
3188
: does not have Expect values on the description lines
3189
: (i.e., Blast1, WashU-Blast2).
3191
See Also : L<lowest_expect()|lowest_expect>, L<highest_signif()|highest_signif>, L<signif_fmt()|signif_fmt>
3195
#-------------------
3196
sub highest_expect {
3197
#-------------------
3198
my ($self, $overall) = @_;
3200
if ( $overall and $self->_layout == 2) {
3201
return $self->{'_highestSignif'} || -1.0;
3204
if($self->{'_is_significant'}) {
3205
return $self->hit('worst')->expect;
3207
$self->throw("Can't get highest expect value: no significant hits ",
3208
"The format of this report requires expect values to be extracted$Newline".
3209
"from the hits themselves.");
3213
=head2 lowest_signif
3215
Usage : $blast_obj->lowest_signif();
3216
: Syntactic sugar for $blast->hit('best')->signif()
3217
Purpose : Get the lowest P or Expect value among all hits
3218
: in a BLAST report.
3219
: This method is syntactic sugar for $blast->hit('best')->signif()
3220
: The value returned is the one which is reported in the decription
3221
: section of the Blast report.
3222
: For Blast1 and WU-Blast2, this is a P-value,
3223
: for NCBI Blast2, it is an Expect value.
3224
Example : $blast->lowest_signif();
3225
Returns : Float or scientific notation number.
3226
: Returns -1.0 if lowest_signif has not been set.
3229
Status : Deprecated. Use lowest_expect() or lowest_p().
3230
Comments : The signif() method provides a way to deal with the fact that
3231
: Blast1 and Blast2 formats differ in what is reported in the
3232
: description lines of each hit in the Blast report. The signif()
3233
: method frees any client code from having to know if this is a P-value
3234
: or an Expect value, making it easier to write code that can process
3235
: both Blast1 and Blast2 reports. This is not necessarily a good thing, since
3236
: one should always know when one is working with P-values or
3237
: Expect values (hence the deprecated status).
3238
: Use of lowest_expect() is recommended since all hits will have an Expect value.
3240
See Also : L<lowest_p()|lowest_p>, L<lowest_expect()|lowest_expect>, L<signif()|signif>, L<signif_fmt()|signif_fmt>, L<_set_signif()|_set_signif>
3249
return $self->{'_lowestSignif'} || -1.0;
3252
=head2 highest_signif
3254
Usage : $blast_obj->highest_signif('overall');
3255
: Syntactic sugar for $blast->hit('worst')->signif()
3256
Purpose : Get the highest P or Expect value among all hits
3257
: in a BLAST report.
3258
: The value returned is the one which is reported in the decription
3259
: section of the Blast report.
3260
: For Blast1 and WU-Blast2, this is a P-value,
3261
: for NCBI Blast2, it is an Expect value.
3262
Example : $blast->highest_signif();
3263
Returns : Float or scientific notation number.
3264
: Returns -1.0 if highest_signif has not been set.
3265
Argument : Optional string 'overall' to get the highest overall significance value.
3267
Status : Deprecated. Use highest_expect() or highest_p().
3268
Comments : Analogous to lowest_signif(), q.v.
3270
See Also : L<lowest_signif()|lowest_signif>, L<lowest_p()|lowest_p>, L<lowest_expect()|lowest_expect>, L<signif()|signif>, L<signif_fmt()|signif_fmt>, L<_set_signif()|_set_signif>
3274
#---------------------
3275
sub highest_signif {
3276
#---------------------
3277
my ($self, $overall) = @_;
3279
$overall and return $self->{'_highestSignif'} || -1.0;
3281
if($self->{'_is_significant'}) {
3282
my $worst_hit = $self->hit('worst');
3283
if(defined $worst_hit) {
3284
return $worst_hit->signif;
3286
return $self->{'_highestSignif'};
3293
Usage : $blast_object->matrix();
3294
Purpose : Get the name of the scoring matrix used.
3295
: This is extracted from the report.
3297
Returns : string or undef if not defined
3302
sub matrix { my $self = shift; $self->{'_matrix'} || $Blast->{'_matrix'}; }
3307
Usage : $blast_object->filter();
3308
Purpose : Get the name of the low-complexity sequence filter used.
3309
: (SEG, SEG+XNU, DUST, NONE).
3310
: This is extracted from the report.
3312
Returns : string or undef if not defined
3317
sub filter { my $self = shift; $self->{'_filter'} || $Blast->{'_filter'}; }
3322
Usage : $blast_object->expect();
3323
Purpose : Get the expect parameter (E) used for the Blast analysis.
3324
: This is extracted from the report.
3326
Returns : string or undef if not defined.
3331
sub expect { my $self = shift; $self->{'_expect'} || $Blast->{'_expect'}; }
3334
=head2 karlin_altschul
3336
Usage : $blast_object->karlin_altschul();
3337
Purpose : Get the Karlin_Altschul sum statistics (Lambda, K, H)
3338
: These are extracted from the report.
3340
Returns : list of three floats (Lambda, K, H)
3341
: If not defined, returns list of three zeros)
3345
#---------------------
3346
sub karlin_altschul {
3347
#---------------------
3349
if(defined($self->{'_lambda'})) {
3350
($self->{'_lambda'}, $self->{'_k'}, $self->{'_h'});
3351
} elsif(defined($Blast->{'_lambda'})) {
3352
($Blast->{'_lambda'}, $Blast->{'_k'}, $Blast->{'_h'});
3360
Usage : $blast_object->word_size();
3361
Purpose : Get the word_size used during the Blast analysis.
3362
: This is extracted from the report.
3364
Returns : integer or undef if not defined.
3372
$self->{'_word_size'} || $Blast->{'_word_size'};
3377
Usage : $blast_object->s();
3378
Purpose : Get the s statistic for the Blast analysis.
3379
: This is extracted from the report.
3381
Returns : integer or undef if not defined.
3386
sub s { my $self = shift; $self->{'_s'} || $Blast->{'_s'}; }
3391
Usage : $blast_object->gap_creation();
3392
Purpose : Get the gap creation penalty used for a gapped Blast analysis.
3393
: This is extracted from the report.
3395
Returns : integer or undef if not defined.
3397
See Also : L<gap_extension()|gap_extension>
3405
$self->{'_gapCreation'} || $Blast->{'_gapCreation'};
3408
=head2 gap_extension
3410
Usage : $blast_object->gap_extension();
3411
Purpose : Get the gap extension penalty used for a gapped Blast analysis.
3412
: This is extracted from the report.
3414
Returns : integer or undef if not defined.
3416
See Also : L<gap_extension()|gap_extension>
3420
#-------------------
3422
#-------------------
3424
$self->{'_gapExtension'} || $Blast->{'_gapExtension'};
3427
=head2 ambiguous_aln
3429
Usage : $blast_object->ambiguous_aln();
3430
Purpose : Test all hits and determine if any have an ambiguous alignment.
3431
Example : print "ambiguous" if $blast->ambiguous_aln();
3432
Returns : Boolean (true if ANY significant hit has an ambiguous alignment)
3435
Status : Experimental
3436
Comments : An ambiguous BLAST alignment is defined as one where two or more
3437
: different HSPs have significantly overlapping sequences such
3438
: that it is not possible to create a unique alignment
3439
: by simply concatenating HSPs. This may indicate the presence
3440
: of multiple domains in one sequence relative to another.
3441
: This method only indicates the presence of ambiguity in at
3442
: least one significant hit. To determine the nature of the
3443
: ambiguity, each hit must be examined.
3445
See Also : B<Bio::Tools::Blast::Sbjct::ambiguous_aln()>,L<Links to related modules>
3453
foreach($self->hits()) {
3454
return 1 if ($_->ambiguous_aln() ne '-');
3461
Usage : $blast_object->overlap([integer]);
3462
Purpose : Set/Get the number of overlapping residues allowed when tiling multiple HSPs.
3463
: Delegates to Bio::Tools::Blast::Sbjct::overlap().
3464
Throws : Exception if there are no significant hits.
3465
Status : Experimental
3467
See Also : B<Bio::Tools::Blast::Sbjct::overlap()>,L<Links to related modules>
3475
if(not $self->hits) {
3476
$self->throw("Can't get overlap data without significant hits.");
3478
$self->hit->overlap();
3483
Usage : @data = $blast_object->homo_data( %named_params );
3484
Purpose : Gets specific similarity data about each significant hit.
3485
Returns : Array of strings:
3486
: "Homology data" for each HSP is in the format:
3487
: "<integer> <start> <stop>"
3488
: Data for different HSPs are tab-delimited.
3489
Argument : named parameters passed along to the hit objects.
3491
Status : Experimental
3492
Comments : This is a very experimental method used for obtaining an
3494
: 1) how many HSPs are in a Blast alignment
3495
: 2) how strong the similarity is between sequences in the HSP
3496
: 3) the endpoints of the alignment (sequence monomer numbers)
3498
See Also : B<Bio::Tools::Blast::Sbjct::homol_data()>,L<Links to related modules>
3506
my ($self, %param) = @_;
3507
my @hits = $self->hits();
3510
## Note: Homology data can be either for the query sequence or the hit
3511
## (Sbjct) sequence. Default is for sbjct. This is specifyable via
3512
## $param{-SEQ}='sbjct' || 'query'.
3515
push @data, $_->homol_data(%param);
3520
=head1 REPORT GENERATING METHODS
3524
Usage : $blast_obj->table( [get_desc]);
3525
Purpose : Output data for each HSP of each hit in tab-delimited format.
3526
Example : print $blast->table;
3527
: print $blast->table(0);
3528
: # Call table_labels() to print labels.
3529
Argument : get_desc = boolean, if false the description of each hit is not included.
3530
: Default: true (if not defined, include description column).
3531
Returns : String containing tab-delimited set of data for each HSP
3532
: of each significant hit. Different HSPs are separated by newlines.
3533
: Left-to-Right order of fields:
3534
: 1 QUERY_NAME # Sequence identifier of the query.
3535
: 2 QUERY_LENGTH # Full length of the query sequence.
3536
: 3 SBJCT_NAME # Sequence identifier of the sbjct ("hit".
3537
: 4 SBJCT_LENGTH # Full length of the sbjct sequence.
3538
: 5 EXPECT # Expect value for the alignment.
3539
: 6 SCORE # Blast score for the alignment.
3540
: 7 BITS # Bit score for the alignment.
3541
: 8 NUM_HSPS # Number of HSPs (not the "N" value).
3542
: 9 HSP_FRAC_IDENTICAL # fraction of identical substitutions.
3543
: 10 HSP_FRAC_CONSERVED # fraction of conserved ("positive") substitutions.
3544
: 11 HSP_QUERY_ALN_LENGTH # Length of the aligned portion of the query sequence.
3545
: 12 HSP_SBJCT_ALN_LENGTH # Length of the aligned portion of the sbjct sequence.
3546
: 13 HSP_QUERY_GAPS # Number of gaps in the aligned query sequence.
3547
: 14 HSP_SBJCT_GAPS # Number of gaps in the aligned sbjct sequence.
3548
: 15 HSP_QUERY_START # Starting coordinate of the query sequence.
3549
: 16 HSP_QUERY_END # Ending coordinate of the query sequence.
3550
: 17 HSP_SBJCT_START # Starting coordinate of the sbjct sequence.
3551
: 18 HSP_SBJCT_END # Ending coordinate of the sbjct sequence.
3552
: 19 HSP_QUERY_STRAND # Strand of the query sequence (TBLASTN/X only)
3553
: 20 HSP_SBJCT_STRAND # Strand of the sbjct sequence (TBLASTN/X only)
3554
: 21 HSP_FRAME # Frame for the sbjct translation (TBLASTN/X only)
3555
: 22 SBJCT_DESCRIPTION (optional) # Full description of the sbjct sequence from
3556
: # the alignment section.
3558
Comments : This method does not collect data based on tiling of the HSPs.
3559
: The table will contains redundant information since the hit name,
3560
: id, and other info for the hit are listed for each HSP.
3561
: If you need more flexibility in the output format than this
3562
: method provides, design a custom function.
3564
See Also : L<table_tiled()|table_tiled>, L<table_labels()|table_labels>, L<_display_hits()|_display_hits>
3571
my ($self, $get_desc) = @_;
3574
$get_desc = defined($get_desc) ? $get_desc : 1;
3575
# $str .= $self->_table_labels($get_desc) unless $self->{'_labels'};
3577
my $sigfmt = $self->signif_fmt();
3578
$sigfmt eq 'parts' and $sigfmt = 'exp'; # disallow 'parts' format for this table.
3579
my $sigprint = $sigfmt eq 'exp' ? 'd' : '.1e';
3582
foreach $hit($self->hits) {
3583
foreach $hsp($hit->hsps) {
3584
# Note: range() returns a 2-element list.
3585
$str .= sprintf "%s\t%d\t%s\t%d\t%$sigprint\t%d\t%d\t%d\t%.2f\t%.2f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s\t%s\t%s\t%s$Newline",
3586
$self->name, $self->length, $hit->name, $hit->length,
3587
$hit->expect($sigfmt), $hit->score, $hit->bits,
3588
$hit->num_hsps, $hsp->frac_identical, $hsp->frac_conserved,
3589
$hsp->length('query'), $hsp->length('sbjct'),
3591
$hsp->range('query'), $hsp->range('sbjct'),
3592
$hsp->strand('query'), $hsp->strand('sbjct'), $hsp->frame,
3593
($get_desc ? $hit->desc : '');
3596
$str =~ s/\t$Newline/$Newline/gs;
3602
Usage : print $blast_obj->table_labels( [get_desc] );
3603
Purpose : Get column labels for table().
3604
Returns : String containing column labels. Tab-delimited.
3605
Argument : get_desc = boolean, if false the description column is not included.
3606
: Default: true (if not defined, include description column).
3609
See Also : L<table()|table>
3616
my ($self, $get_desc) = @_;
3617
$get_desc = defined($get_desc) ? $get_desc : 1;
3618
my $descstr = $get_desc ? 'DESC' : '';
3619
my $descln = $get_desc ? '-----' : '';
3621
my $str = sprintf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s$Newline",
3622
'QUERY', 'Q_LEN', 'SBJCT', 'S_LEN', 'EXPCT', 'SCORE', 'BITS', 'HSPS',
3623
'IDEN', 'CONSV', 'Q_ALN', 'S_ALN', 'Q_GAP', 'S_GAP',
3624
'Q_BEG', 'Q_END', 'S_BEG', 'S_END', 'Q_STR', 'S_STR', 'FRAM', $descstr;
3625
$str .= sprintf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s$Newline",
3626
'-----', '-----', '-----', '-----', '-----', '-----', '-----', '-----',
3627
'-----', '-----', '-----', '-----', '-----', '-----',
3628
'-----', '-----', '-----','-----', '-----', '-----','-----', $descln;
3630
$self->{'_labels'} = 1;
3631
$str =~ s/\t$Newline/$Newline/gs;
3637
Purpose : Get data from tiled HSPs in tab-delimited format.
3638
: Allows only minimal flexibility in the output format.
3639
: If you need more flexibility, design a custom function.
3640
Usage : $blast_obj->table_tiled( [get_desc]);
3641
Example : print $blast->table_tiled;
3642
: print $blast->table_tiled(0);
3643
: # Call table_labels_tiled() if you want labels.
3644
Argument : get_desc = boolean, if false the description of each hit is not included.
3645
: Default: true (include description).
3646
Returns : String containing tab-delimited set of data for each HSP
3647
: of each significant hit. Multiple hits are separated by newlines.
3648
: Left-to-Right order of fields:
3649
: 1 QUERY_NAME # Sequence identifier of the query.
3650
: 2 QUERY_LENGTH # Full length of the query sequence.
3651
: 3 SBJCT_NAME # Sequence identifier of the sbjct ("hit".
3652
: 4 SBJCT_LENGTH # Full length of the sbjct sequence.
3653
: 5 EXPECT # Expect value for the alignment.
3654
: 6 SCORE # Blast score for the alignment.
3655
: 7 BITS # Bit score for the alignment.
3656
: 8 NUM_HSPS # Number of HSPs (not the "N" value).
3657
: 9 FRAC_IDENTICAL* # fraction of identical substitutions.
3658
: 10 FRAC_CONSERVED* # fraction of conserved ("positive") substitutions .
3659
: 11 FRAC_ALN_QUERY* # fraction of the query sequence that is aligned.
3660
: 12 FRAC_ALN_SBJCT* # fraction of the sbjct sequence that is aligned.
3661
: 13 QUERY_ALN_LENGTH* # Length of the aligned portion of the query sequence.
3662
: 14 SBJCT_ALN_LENGTH* # Length of the aligned portion of the sbjct sequence.
3663
: 15 QUERY_GAPS* # Number of gaps in the aligned query sequence.
3664
: 16 SBJCT_GAPS* # Number of gaps in the aligned sbjct sequence.
3665
: 17 QUERY_START* # Starting coordinate of the query sequence.
3666
: 18 QUERY_END* # Ending coordinate of the query sequence.
3667
: 19 SBJCT_START* # Starting coordinate of the sbjct sequence.
3668
: 20 SBJCT_END* # Ending coordinate of the sbjct sequence.
3669
: 21 AMBIGUOUS_ALN # Ambiguous alignment indicator ('qs', 'q', 's').
3670
: 22 SBJCT_DESCRIPTION (optional) # Full description of the sbjct sequence from
3671
: # the alignment section.
3673
: * Items marked with a "*" report data summed across all HSPs
3674
: after tiling them to avoid counting data from overlapping regions
3677
Comments : This function relies on tiling of the HSPs since it calls
3678
: frac_identical() etc. on the hit as opposed to each HSP individually.
3680
See Also : L<table()|table>, L<table_labels_tiled()|table_labels_tiled>, B<Bio::Tools::Blast::Sbjct::"HSP Tiling and Ambiguous Alignments">, L<Links to related modules>
3687
my ($self, $get_desc) = @_;
3690
$get_desc = defined($get_desc) ? $get_desc : 1;
3693
my $sigfmt = $self->signif_fmt();
3694
$sigfmt eq 'parts' and $sigfmt = 'exp'; # disallow 'parts' format for this table.
3695
my $sigprint = $sigfmt eq 'exp' ? 'd' : '.1e';
3697
foreach $hit($self->hits) {
3698
$str .= sprintf "%s\t%d\t%s\t%d\t%$sigprint\t%d\t%d\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s\t%s$Newline",
3699
$self->name, $self->length, $hit->name, $hit->length,
3700
$hit->expect($sigfmt), $hit->score, $hit->bits,
3701
$hit->num_hsps, $hit->frac_identical, $hit->frac_conserved,
3702
$hit->frac_aligned_query, $hit->frac_aligned_hit,
3703
$hit->length_aln('query'), $hit->length_aln('sbjct'),
3704
$hit->gaps('list'), $hit->range('query'), $hit->range('sbjct'),
3705
$hit->ambiguous_aln, ($get_desc ? $hit->desc : '');
3707
$str =~ s/\t$Newline/$Newline/gs;
3711
=head2 table_labels_tiled
3713
Usage : print $blast_obj->table_labels_tiled( [get_desc] );
3714
Purpose : Get column labels for table_tiled().
3715
Returns : String containing column labels. Tab-delimited.
3716
Argument : get_desc = boolean, if false the description column is not included.
3717
: Default: true (include description column).
3720
See Also : L<table_tiled()|table_tiled>
3724
#---------------------
3725
sub table_labels_tiled {
3726
#---------------------
3727
my ($self, $get_desc) = @_;
3728
my $descstr = $get_desc ? 'DESC' : '';
3729
my $descln = $get_desc ? '-----' : '';
3731
my $str = sprintf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s$Newline",
3732
'QUERY', 'Q_LEN', 'SBJCT', 'S_LEN', 'EXPCT', 'SCORE', 'BITS',
3733
'HSPS', 'FR_ID', 'FR_CN', 'FR_ALQ', 'FR_ALS', 'Q_ALN',
3734
'S_ALN', 'Q_GAP', 'S_GAP', 'Q_BEG', 'Q_END', 'S_BEG', 'S_END',
3736
$str =~ s/\t$Newline/$Newline/;
3737
$str .= sprintf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s$Newline",
3738
'-----', '-----', '------', '-----', '-----','-----', '-----',
3739
'-----', '-----', '-----', '-----', '-----', '-----',
3740
'-----', '-----', '-----','-----','-----', '-----',
3741
'-----','-----', $descln;
3743
$self->{'_labels_tiled'} = 1;
3744
$str =~ s/\t$Newline/$Newline/gs;
3750
Usage : $blast_object->display( %named_parameters );
3751
Purpose : Display information about Bio::Tools::Blast.pm data members,
3752
: E.g., parameters of the report, data for each hit., etc.
3753
: Overrides Bio::Root::Object::display().
3754
Example : $object->display(-SHOW=>'stats');
3755
: $object->display(-SHOW=>'stats,hits');
3756
Argument : Named parameters: (TAGS CAN BE UPPER OR LOWER CASE)
3757
: -SHOW => 'file' | 'hits' | 'homol'
3758
: -WHERE => filehandle (default = STDOUT)
3759
Returns : n/a (print/printf is called)
3760
Status : Experimental
3761
Comments : For tab-delimited output, see table().
3763
See Also : L<_display_homol()|_display_homol>, L<_display_hits()|_display_hits>, L<_display_stats()|_display_stats>, L<table()|table>, B<Bio::Root::Tools::SeqAnal::display()>,L<Links to related modules>,
3770
my( $self, %param ) = @_;
3772
$self->SUPER::display(%param);
3773
my $OUT = $self->fh();
3775
$self->show =~ /homol/i and $self->_display_homol($OUT);
3776
$self->show =~ /hits/i and $self->_display_hits( %param );
3780
=head2 _display_homol
3782
Usage : n/a; called automatically by display()
3783
Purpose : Print homology data for hits in the BLAST report.
3785
Argument : one argument = filehandle object.
3786
Returns : printf call.
3787
Status : Experimental
3789
See Also : L<homol_data()|homol_data>, L<display()|display>
3793
#-------------------
3794
sub _display_homol {
3795
#-------------------
3796
my( $self, $OUT ) = @_;
3798
print $OUT "${Newline}BLAST HOMOLOGY DATA FOR: ${\$self->name()}$Newline";
3799
print $OUT '-'x40,"$Newline";
3801
foreach ( $self->homol_data()) {
3802
print $OUT "$_$Newline";
3806
=head2 _display_stats
3808
Usage : n/a; called automatically by display()
3809
Purpose : Display information about the Blast report "meta" data.
3810
: Overrides Bio::Tools::SeqAnal::_display_stats() calling it first.
3812
Argument : one argument = filehandle object.
3813
Returns : printf call.
3814
Status : Experimental
3816
See Also : L<display()|display>, B<Bio::Tools::SeqAnal::_display_stats()>,L<Links to related modules>
3820
#--------------------
3821
sub _display_stats {
3822
#--------------------
3823
my( $self, $OUT ) = @_;
3825
$self->SUPER::_display_stats($OUT);
3826
printf( $OUT "%-15s: %s$Newline", "GAPPED", $self->gapped ? 'YES' : 'NO');
3827
printf( $OUT "%-15s: %d$Newline", "TOTAL HITS", $self->num_hits('total'));
3828
printf( $OUT "%-15s: %s$Newline", "CHECKED ALL", $Blast->{'_check_all'} ? 'YES' : 'NO');
3829
printf( $OUT "%-15s: %s$Newline", "FILT FUNC", $Blast->{'_filt_func'} ? 'YES' : 'NO');
3830
if($self->min_length) {
3831
printf( $OUT "%-15s: Length >= %s$Newline", "MIN_LENGTH", $self->min_length);
3834
my $num_hits = $self->num_hits;
3835
my $signif_str = ($self->_layout == 1) ? 'P' : 'EXPECT';
3837
printf( $OUT "%-15s: %d$Newline", "SIGNIF HITS", $num_hits);
3838
# Blast1: signif = P-value, Blast2: signif = Expect value.
3840
printf( $OUT "%-15s: %s ($signif_str-VALUE)$Newline", "SIGNIF CUTOFF", $self->signif);
3841
printf( $OUT "%-15s: %s$Newline", "LOWEST $signif_str", $self->lowest_signif());
3842
printf( $OUT "%-15s: %s$Newline", "HIGHEST $signif_str", $self->highest_signif());
3844
printf( $OUT "%-15s: %s (OVERALL)$Newline", "HIGHEST $signif_str", $self->highest_signif('overall'));
3846
if($Blast->_get_stats) {
3847
my $warn = ($Blast->{'_share'}) ? '(SHARED STATS)' : '';
3848
printf( $OUT "%-15s: %s$Newline", "MATRIX", $self->matrix() || 'UNKNOWN');
3849
printf( $OUT "%-15s: %s$Newline", "FILTER", $self->filter() || 'UNKNOWN');
3850
printf( $OUT "%-15s: %s$Newline", "EXPECT", $self->expect() || 'UNKNOWN');
3851
printf( $OUT "%-15s: %s, %s, %s %s$Newline", "LAMBDA, K, H", $self->karlin_altschul(), $warn);
3852
printf( $OUT "%-15s: %s$Newline", "WORD SIZE", $self->word_size() || 'UNKNOWN');
3853
printf( $OUT "%-15s: %s %s$Newline", "S", $self->s() || 'UNKNOWN', $warn);
3855
printf( $OUT "%-15s: %s$Newline", "GAP CREATION", $self->gap_creation() || 'UNKNOWN');
3856
printf( $OUT "%-15s: %s$Newline", "GAP EXTENSION", $self->gap_extension() || 'UNKNOWN');
3859
print $OUT "$Newline";
3862
=head2 _display_hits
3864
Usage : n/a; called automatically by display()
3865
Purpose : Display data for each hit. Not tab-delimited.
3867
Argument : one argument = filehandle object.
3868
Returns : printf call.
3869
Status : Experimental
3870
Comments : For tab-delimited output, see table().
3872
See Also : L<display()|display>, B<Bio::Tools::Blast::Sbjct::display()>, L<table()|table>, L<Links to related modules>
3878
my( $self, %param ) = @_;
3879
my $OUT = $self->fh();
3880
my @hits = $self->hits();
3882
## You need a wide screen to see this properly.
3884
print $OUT "${Newline}BLAST HITS FOR: ${\$self->name()} length = ${\$self->length}$Newline";
3885
print "(This table requires a wide display.)$Newline";
3886
print $OUT '-'x80,"$Newline";
3888
print $self->table_labels_tiled(0);
3889
print $self->table_tiled(0);
3891
## Doing this interactively since there is potentially a lot of data here.
3892
## Not quite satisfied with this approach.
3894
if (not $param{-INTERACTIVE}) {
3898
print "${Newline}DISPLAY FULL HSP DATA? (y/n): [n] ";
3899
chomp( $reply = <STDIN> );
3905
print $OUT "$Newline$Newline",'-'x80,"$Newline";
3906
print $OUT "HSP DATA FOR HIT #$count (hit <RETURN>)";
3907
print $OUT "$Newline",'-'x80;<STDIN>;
3908
$param{-SHOW} = 'hsp';
3909
$_->display( %param );
3917
Usage : $blast_object->to_html( [%named_parameters] )
3918
Purpose : To produce an HTML-formatted version of a BLAST report
3919
: for efficient navigation of the report using a web browser.
3920
Example : # Using the static Blast object:
3921
: # Can read from STDIN or from a designated file:
3922
: $Blast->to_html($file);
3923
: $Blast->to_html(-FILE=>$file, -HEADER=>$header);
3924
: (if no file is supplied, STDIN will be used).
3925
: # saving HTML to an array:
3926
: $Blast->to_html(-FILE=>$file, -OUT =>\@out);
3927
: # Using a pre-existing blast object (must have been built from
3928
: # a file, not STDIN:
3929
: $blastObj->to_html();
3930
Returns : n/a, either prints report to STDOUT or saves to a supplied array
3931
: if an '-OUT' parameter is defined (see below).
3932
Argument : %named_parameters: (TAGS ARE AND CASE INSENSITIVE).
3933
: -FILE => string containing name of a file to be processed.
3934
: If not a valid file or undefined, STDIN will be used.
3935
: Can skip the -FILE tag if supplying a filename
3936
: as a single argument.
3938
: This should be an HTML-formatted string to be used
3939
: as a header for the page, typically describing query sequence,
3940
: database searched, the date of the analysis, and any
3942
: If not supplied, no special header is used.
3943
: Regardless of whether a header is supplied, the
3944
: standard info at the top of the report is highlighted.
3945
: This should include the <HEADER></HEADER> section
3946
: of the page as well.
3948
: -IN => array reference containing a raw Blast report.
3949
: each line in a separate element in the array.
3950
: If -IN is not supplied, read() is called
3951
: and data is then read either from STDIN or a file.
3953
: -OUT => array reference to hold the HTML output.
3954
: If not supplied, output is sent to STDOUT.
3955
Throws : Exception is propagated from $HTML::get_html_func()
3956
: and Bio::Root::Object::read().
3957
Comments : The code that does the actual work is located in
3958
: Bio::Tools::Blast::HTML::get_html_func().
3959
Bugs : Some hypertext links to external databases may not be
3960
: correct. This due in part to the dynamic nature of
3962
: Hypertext links are not added to hits without database ids.
3963
TODO : Possibly create a function to produce fancy default header
3964
: using data extracted from the report (requires some parsing).
3965
: For example, it would be nice to always include a date
3967
See Also : B<Bio::Tools::Blast::HTML::get_html_func()>, B<Bio::Root::Object::read()>, L<Links to related modules>
3974
my ($self, @param) = @_;
3976
# Permits syntax such as: $blast->to_html($filename);
3977
my ($file, $header_html, $in_aref, $out_aref) =
3978
$self->_rearrange([qw(FILE HEADER IN OUT)], @param);
3980
$self->file($file) if $file;
3982
# Only setting the newline character once for efficiency.
3983
$Newline ||= $Util->get_newline(-client => $self, @param);
3985
$header_html ||= '';
3986
(ref($out_aref) eq 'ARRAY') ? push(@$out_aref, $header_html) : print "$header_html$Newline";
3988
require Bio::Tools::Blast::HTML;
3989
Bio::Tools::Blast::HTML->import(qw(&get_html_func));
3992
eval{ $func = &get_html_func($out_aref); };
4000
$out_aref ? push(@$out_aref, "<html><body>$Newline") : print "<html><body>$Newline";
4003
if (ref ($in_aref) =~ /ARRAY/) {
4004
# If data is being supplied, process it.
4005
foreach(@$in_aref) {
4009
# Otherwise, read it, processing as we go.
4011
$self->read(-FUNC => $func, @param);
4013
$out_aref ? push(@$out_aref, "$Newline</pre></body></html>") : print "$Newline</pre></body></html>";
4017
# Check for trivial error (report already HTML formatted).
4018
if($@ =~ /HTML formatted/) {
4019
print STDERR "\a${Newline}Blast report appears to be HTML formatted already.$Newline$Newline";
4030
#####################################################################################
4032
#####################################################################################
4034
=head1 FOR DEVELOPERS ONLY
4038
Information about the various data members of this module is provided for those
4039
wishing to modify or understand the code. Two things to bear in mind:
4043
=item 1 Do NOT rely on these in any code outside of this module.
4045
All data members are prefixed with an underscore to signify that they are private.
4046
Always use accessor methods. If the accessor doesn't exist or is inadequate,
4047
create or modify an accessor (and let me know, too!). (An exception to this might
4048
be for Sbjct.pm or HSP.pm which are more tightly coupled to Blast.pm and
4049
may access Blast data members directly for efficiency purposes, but probably
4052
=item 2 This documentation may be incomplete and out of date.
4054
It is easy for these data member descriptions to become obsolete as
4055
this module is still evolving. Always double check this info and search
4056
for members not described here.
4060
An instance of Bio::Tools::Blast.pm is a blessed reference to a hash containing
4061
all or some of the following fields:
4064
--------------------------------------------------------------
4065
_significance P-value or Expect value cutoff (depends on Blast version:
4066
Blast1/WU-Blast2 = P-value; Blast2 = Expect value).
4067
Values GREATER than this are deemed not significant.
4069
_significant Boolean. True if the query has one or more significant hit.
4071
_min_length Integer. Query sequences less than this will be skipped.
4073
_confirm_significance Boolean. True if client has supplied significance criteria.
4075
_gapped Boolean. True if BLAST analysis has gapping turned on.
4077
_hits List of Sbjct.pm objects.
4079
_num_hits Number of hits obtained from the BLAST report.
4081
_num_hits_significant Number of significant based on Significant data members.
4083
_highestSignif Highest P or Expect value overall (not just what is stored in _hits).
4085
_lowestSignif Lowest P or Expect value overall (not just what is stored in _hits).
4087
The static $Blast object has a special set of members:
4096
Miscellaneous statistical parameters:
4097
-------------------------------------
4098
_filter, _matrix, _word_size, _expect, _gapCreation, _gapExtension, _s,
4101
INHERITED DATA MEMBERS
4102
-----------------------
4103
(See Bio::Tools::SeqAnal.pm for inherited data members.)