PowerBLAST: RELEASE 2 (1/6/97)
(README last modified 2/23/99)

PowerBLAST can now be run with either the command line
interface ('pblcmd') or a REAL graphical user interface

You still need a connection to both the network BLAST server
and the Entrez network service to run the program.

Please read the manual carefully. If you have any questions
about the display of the TEXT alignment, please check the
examples in this file.

For questions and bug reports, please send email to:
blast-help@ncbi.nlm.nih.gov

############################################################
############################################################
1)Graphical User Interface.
2)Search with Multiple BLAST Programs.
3)Save the Settings for Both PowerBLAST Specific Options and
  BLAST Search Parameters.
4)Options for Dumping out an HTML Page for TEXT Alignment
5)Option for Monitoring the Search Process

############################################################
############################################################
1)Manual for Graphical User Interface (powblast)
2)Manual for Command Line Interface (pblcmd)
3)Examples for the TEXT output format
4)Excerpts from the draft of the PowerBLAST paper that
  explain the system and algorithms of PowerBLAST

############################################################
Manual for Graphical User Interface
############################################################
PowerBLAST starts up with a window for setting up the
PowerBLAST special options as well as the BLAST search

***********************
Set up BLAST Parameters
***********************
Push the "Blast Program" button at the top right of the
window to open a new window titled "Parameter and DataBase
for Blast Search". This window allows you to set parameters
for multiple database searches with multiple BLAST programs.
It has two sections. The top section sets parameters for
searching against nucleotide databases, and the bottom
section for searching against protein databases.

In the top section, you can select multiple databases by
checking boxes such as "nr", "sts", and "est". In the bottom
section, you can check boxes such as "nr", "swissprot", and
"pdb". In addition to the check boxes, you can type the name
of a BLAST search database in the dialog box "other". This
gives you the flexibility to include a search database that
is not covered by the check boxes.

You can check "BLASTN" and "TBLASTN" for searching against
nucleotide databases. You can check "BLASTX" and "BLASTP"
for searching against protein databases. If no BLAST program
is selected, PowerBLAST will run BLASTP for a protein query
sequence and BLASTN for a nucleotide query sequence.
PowerBLAST also checks the consistency between the query and
the selected BLAST programs. For example, if the query
sequence is a protein and one of the selected programs is
BLASTN, it will skip the BLASTN option.
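
The defaulting and consistency check described above can be
sketched in a few lines of Python. This is only an
illustration, not PowerBLAST's actual code: the function name
programs_to_run is hypothetical, and the query types accepted
by each program are taken from the paper excerpt at the end
of this file.

```python
# Illustrative sketch only; the names here are not part of PowerBLAST.
# Query type expected by each BLAST program (per the paper excerpt below).
NUCLEOTIDE_QUERY_PROGRAMS = {"BLASTN", "BLASTX"}
PROTEIN_QUERY_PROGRAMS = {"BLASTP", "TBLASTN"}


def programs_to_run(query_is_protein, selected):
    """Return the BLAST programs that would run for one query.

    selected is the list of checked programs; it may be empty.
    """
    if not selected:
        # Nothing checked: fall back to the documented default.
        return ["BLASTP"] if query_is_protein else ["BLASTN"]
    compatible = (PROTEIN_QUERY_PROGRAMS if query_is_protein
                  else NUCLEOTIDE_QUERY_PROGRAMS)
    # Skip any selected program that cannot take this query type.
    return [program for program in selected if program in compatible]


if __name__ == "__main__":
    print(programs_to_run(True, ["BLASTN", "BLASTP"]))
```

For a protein query with both BLASTN and BLASTP checked, only
BLASTP remains, which mirrors the "skip the BLASTN option"
behavior described above.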

c)Parameters for BLAST search
Set the parameters for the BLAST search in the dialog box
following the program name. !!!!NOTE!!!! The behavior of an
empty dialog box differs in this version from the previous
one. If no parameter is set, PowerBLAST will use the default
settings of the regular BLAST search. In the old version,
the defaults for BLASTN and BLASTX were set with high cutoff
scores (M=1 N=-3 S=40 S2=40 for BLASTN and S=90 S2=90
-filter=seg for BLASTX). If you are processing a large
genomic sequence, those two sets of parameters work quite
well, and you may want to keep them as your default search
parameters. For searching against protein databases, it is
a good idea to always set -filter=seg to filter the low
complexity regions in a protein sequence. If you have
questions about setting up the BLAST search parameters, push
the "Help" button at the bottom of the window to obtain the
email address for BLAST search help.

If you push the "Cancel" button, the options revert to the
previous settings. If you push "Accept", the new settings
take effect. If you push "Help", you get the email address
for BLAST search help.

If this is the FIRST time you run PowerBLAST, it is strongly
recommended that you set up the BLAST options and push
"Accept". Then, in the main window, push "Save Setting". The
current settings will become the default settings the next
time you run PowerBLAST. All you need to do is just load the

***********************
Input Query sequences
***********************
You can load the query sequences either a) from an input
file or b) by "pasting" into the window. The input can be a
FASTA formatted sequence file (containing one or multiple
sequences), a list of accession numbers, or a list of gis.
If the input is from a FILE, it can also be a list of file
names. If accessions or gis are supplied, the query
sequences will be fetched directly from the Entrez server.

a) Type in the file name or push the "Read Input File"
button to load the input file.

b) Click in the empty panel underneath "Or Paste Query
Formatted As:". This is important!!! If you don't place the
cursor properly, you may paste the buffer in the wrong
place!!! After that, go to the pulldown menu "Edit" and
select "Paste" to paste the buffer into the panel. Only
three formats are supported for this option: FASTA, GI, or
Accessions. The default is FASTA. You can specify the format
by selecting from the pulldown list. If you push the "Clear
Window" button, the panel will become empty again so you can

You can monitor the PowerBLAST search process by selecting
"Use Monitor", or disable monitoring by de-selecting the
option. On Unix machines, the monitor can sometimes be quite
annoying, and you have to be prepared to click OK when a
query has no

***********************
Mask the Repeat Region
***********************
Type in the file name of a FASTA formatted repeat sequence
library, or load it by pushing the "Mask Repeats" button. A
sample file for human repeats, humrep.fsa, is supplied in the

Select one of the radio buttons, "None", "SIM", "SIM2", or
"SIM3", for the gapped alignment algorithm. SIM works for
DNA-DNA and protein-protein alignments. SIM2 and SIM3 work
for DNA-DNA alignments only. Note!! There is NO gapped
alignment algorithm available for DNA-protein (BLASTX) and
protein-DNA (TBLASTN) alignments.

*********************
*********************
Select "Low Complexity" to filter the low complexity regions
in a DNA sequence with dust.
Select "Self Hit" if your query is a GenBank sequence and
you don't want to see any hits to itself.

*************************
Organism Specific Search
*************************
You can restrict the BLAST search results to include or
exclude a specific group of organisms. Use the radio buttons
"None", "Include", and "Exclude" to make the choice, and put
the organism name in the dialog box. Either a taxname, such
as Homo sapiens, or a common name, such as mammal, will do.

The output from PowerBLAST will be saved in one or more
selected file formats: "TEXT" (file extension .ali), "HTML"
(file extension .html), "Seq-align" (file extension .sat),
and "Seq-entry" (file extension .ent). Both Seq-align
(*.sat) and Seq-entry (*.ent) files can only be viewed with
Chromoscope. The TEXT file can be viewed directly, and the
HTML file can be opened with a WWW browser, such as
Netscape. The HTML file has hotlinks to the sequences in the
public databases.

The "Search" button is disabled when there is no query
sequence. It is activated when there is an input file or
when the paste panel contains data.

The "Save Setting" button is very useful!!! It saves all the
parameters that you have set (which include the BLAST search
parameters as well as the other PowerBLAST specific options)
in the powblast configuration file. The next time you run
the program, all the options will come up as the default
settings, and all you need to do is load the query sequence.
If you do NOT "Save Setting", the program will start next
time with the same parameters as the current start-up.

The "Quit" button quits the program.

**********************
Some Advanced Options
**********************
The default formatting displays the annotated features
together with the alignments. You can disable this function
by going to the pulldown menu "Option" and deselecting "Show

If you are processing a large number of queries at one time,
you may find the monitor annoying and turn it off. However,
you may still want to know which records found no hit. Go to
the pulldown menu "File" and select "Save Error Log" to open
an error log file for recording the sequences

############################################################
Manual for Command Line Interface
############################################################
The name of the program is "pblcmd". The argument list is
much more complicated than in the previous version because
new options are available.
To review the parameter list, type pblcmd -

power blast arguments:
  -i  The file name for power blast job [String]
  -c  Reset the options 0=No 1=Reset 2=Reset+Save 3=Modify
  -l  The repeat FASTA library file for filtering [String]
  -d  dust the sequence before blast [T/F] Optional
  -f  filter the blast output with the organism? 0=NO 1=Keep
  -o  the name for organism for filtering [String] Optional
  -s  compute gapped alignment 0=No 1=sim1 2=sim2 3=sim3
  -a  export the results as 1=text(*.ali) 8=HTML(*.html)
      2=Seq-align(*.sat) 4=Seq-entry(*.ent)
  -b  type of blast 0=default 1=blastn 2=blastp 4=blastx
  -N  Search Nucleotide databases: 1=nr 2=est 4=sts 8=month
      64=mito 128=kabat 512=pdb 1024=epd 2048=yeast 4096=gss
  -A  Search Protein databases: 1=nr 8=month 128=kabat
      256=swissprot 512=pdb
  -n  Parameters for BLASTN, use quote [String] Optional
  -x  Parameters for BLASTX, use quote [String] Optional
  -p  Parameters for BLASTP, use quote [String] Optional
  -t  Parameters for TBLASTN, use quote [String] Optional
  -q  filter out the GenBank query itself [T/F] Optional
  -m  Enable the Monitor [T/F] Optional

  -i: the file name for the query sequence(s), which can
      be a FASTA formatted file with multiple sequences, a
      list of accessions/locus/gis, or a list of file

  -c: options for saving the settings
      -c0 take the default settings from the config
      -c1 reset all the parameters by taking the values
          from the command line
      -c2 same as c1 and save the settings to the
      -c3 take the default settings from the config
          file, modify the values with the user setting in
      -c4 same as c3 and save the settings to the

      If this is the FIRST time you run pblcmd, it is
      strongly recommended that you use -c2 to set up
      your options in most of the search fields. The
      settings will be saved into the config file, and
      the next time you run the program, if you choose
      -c0, it will automatically set up the previous options

      This option is a little bit awkward. I tried to
      mimic the GUI interface for saving the settings
      and being able to modify some but not all the

  -l  a FASTA formatted repeat library file for human
      repeats, humrep, is included in this package. If
      you want to filter human repeats, just do

  -d  -dT mask the low complexity regions in DNA query
          sequences with the dust program.
      -dF no dusting

  -f  -f0 No organism filtering

  -o  Name of the organism. Use quotes. -o"human"

  -s  -s0 Do not run gapped alignment

  -a  The format for output files. You can save multiple
      formats by adding the numbers together. For example,
      -a5 (1+4) produces both the TEXT alignment (*.ali
      file) and the Seq-entry ASN.1 file (*.ent file).
      -a1 TEXT alignment with the extension .ali
      -a8 HTML page with the extension .html
      -a2 ASN.1 Seq-align file with the extension .sat.
          You can view it in Chromoscope.
      -a4 Seq-entry ASN.1 file with the extension .ent.
          You can view it in Chromoscope.

  -b  BLAST programs. You can select multiple BLAST
      programs by adding the numbers together.
      -b0 default. Use BLASTN for a DNA query and
          BLASTP for a protein query.

      PowerBLAST also checks the consistency between the
      query and the selected BLAST programs. For example,
      if the query sequence is a protein and one of the
      selected programs is BLASTN, it will skip the

  -N  the Nucleotide Databases for BLAST Search. You can
      run searches against multiple databases by adding
      the numbers together.

  -A  the Protein Databases for BLAST Search. The
      settings are similar to -N.

  -n  Parameters for BLASTN search.
  -x  Parameters for BLASTX search.
  -p  Parameters for BLASTP search.
  -t  Parameters for TBLASTN search.

      !!!!NOTE!!!! The behavior when no BLAST parameters
      are specified differs in this version from the
      previous one. If no parameter is set, PowerBLAST
      will use the default settings of the regular BLAST
      search. In the old version, the defaults for BLASTN
      and BLASTX were set with high cutoff scores (M=1
      N=-3 S=40 S2=40 for BLASTN and S=90 S2=90
      -filter=seg for BLASTX). If you are processing a
      large genomic sequence, those two sets of
      parameters work quite well, and you may want to
      keep them as your default search parameters. For
      searching against protein databases, it is a good
      idea to always set -filter=seg to filter the low
      complexity regions in a

  -q  -qT if the query is a GenBank sequence, filter the

  -m  -mT monitor the process
      -mF turn off the monitor

pblcmd -iH_214K23.seq -c2 -lhumrep -dT -f1 -o"human" -s2 -a5
-b5 -N3 -A257 -n"M=1 N=-3 S=40 S2=40" -x"S=90 S2=90

With these settings, pblcmd resets the parameters and saves
them into the configuration file. It takes the input
sequence file H_214K23.seq, runs it against the human repeat
library to find the repeats, and masks the low complexity
regions in the query with dust. It keeps only the human hits
from the BLAST search and runs SIM2 to produce gapped
alignments. It saves the results in both the TEXT file and
the Seq-entry ASN.1 file. It searches both the nr and est
databases with BLASTN, using the parameters "M=1 N=-3 S=40
S2=40". For BLASTX, the parameters are "S=90 S2=90
-filter=seg". It will run with a monitor.
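
The numeric values of -a, -b, -N and -A are sums of the codes
listed in the argument summary, so they can be read as bit
flags. The Python sketch below decodes the values used in the
example command. It is an illustration only: the decode
function is hypothetical, and the tables contain only the
codes that appear in this README (the argument listing here
is incomplete, so other codes may exist).

```python
# Illustrative sketch only; tables hold just the codes listed in this README.
OUTPUT_FORMATS = {1: "TEXT (*.ali)", 2: "Seq-align (*.sat)",
                  4: "Seq-entry (*.ent)", 8: "HTML (*.html)"}
BLAST_PROGRAMS = {1: "blastn", 2: "blastp", 4: "blastx"}
NUCLEOTIDE_DBS = {1: "nr", 2: "est", 4: "sts", 8: "month", 64: "mito",
                  128: "kabat", 512: "pdb", 1024: "epd", 2048: "yeast",
                  4096: "gss"}
PROTEIN_DBS = {1: "nr", 8: "month", 128: "kabat", 256: "swissprot",
               512: "pdb"}


def decode(value, table):
    """Split a summed option value into the names of its component codes."""
    names = [name for code, name in sorted(table.items()) if value & code]
    known = 0
    for code in table:
        known |= code
    leftover = value & ~known
    if leftover:
        # A code that this README does not list; report it rather than guess.
        names.append("unlisted code(s): %d" % leftover)
    return names


if __name__ == "__main__":
    print("-a5  ", decode(5, OUTPUT_FORMATS))
    print("-b5  ", decode(5, BLAST_PROGRAMS))
    print("-N3  ", decode(3, NUCLEOTIDE_DBS))
    print("-A257", decode(257, PROTEIN_DBS))
```

Read this way, -b5 selects blastn and blastx, -N3 selects nr
and est, and -A257 selects nr and swissprot.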

It is a long parameter list. But once it is set with -c2,
all you need the next time is to run the search with
pblcmd -i<input file> to get the same results.

############################################################
Examples for TEXT output of the Alignment
############################################################

a) a simple DNA-DNA alignment

12> 297 aattaaactgtatattctggataaataaaattatttcgac
L24443> 1347 ........................................
D31734> 1344 ........................................
3'UTR > 1344 ****************************************
polyA_sign > 1367 ******
U25274> 1262 ...................a.....at...

In this output format, 12 is the query sequence. L24443,
D31734, and U25274 are the BLAST hits. All the residues of
the query sequence are displayed, while in the hit
sequences, only the mismatched residues are displayed (the a
and at in U25274). The identical residues are displayed as
dots ".". The ">" symbol shows the orientation of the
alignment: ">" for the plus strand and "<" for the minus
strand. The number following ">" indicates the position in

For sequence D31734, there are two annotated features in the
region where there is high similarity to the query. Both are
marked with "*" underneath the DNA sequence. One is the
3'UTR:
3'UTR > 1344 ****************************************
The other is the polyA signal:
polyA_sign > 1367 ******
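
The dot notation used for the hit sequences is easy to
reproduce. The following Python sketch is an illustration
only (render_hit is a hypothetical name, not a PowerBLAST
function). It assumes the query and the hit have already been
aligned to the same length, and it prints a dot for every
identical residue and the hit's own character otherwise.

```python
# Illustrative sketch only; not part of PowerBLAST.
def render_hit(aligned_query, aligned_hit):
    """Show a hit relative to the query: '.' for identity, else the hit residue."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        "." if q.lower() == h.lower() else h
        for q, h in zip(aligned_query, aligned_hit)
    )


if __name__ == "__main__":
    query = "aattaaactg"
    hit = "aatcaaactg"
    print(query)
    print(render_hit(query, hit))
```

A gap character such as "-" in the hit never equals a query
residue, so it is printed as is.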

b) the combined view of BLASTN and BLASTX

214K> 8837 ttgggtttctagactaaatacagtgtgggaatacacaata
X03557> 192 ...aa..c......a.......................--
56-KDa> 43 I E F L D K Y S V G I H N
G05877> 24 ...an..c......a.......................--
______________________________________________________
frame=+1> I G F L D * I Q C G N T Q Y
_______________________________________________________
frame=+3 > W V S R L N T V W E Y T I
P09914 42 Q I E F . D . K Y S V G .
307041 42 Q I E F . D . K Y S V G .
A25407 42 Q I E F . D . K Y S V G .

The results from BLASTN and BLASTX are separated by the line
____ into three panels. The top shows the results from
BLASTN; the middle and the bottom show the results from
BLASTX with frame = +1 and frame = +3, respectively.

The query sequence 214K has two BLASTN hits in this region:
X03557 and G05877. The gapped alignments were computed by
SIM2, and they are displayed as multiple pairwise
alignments. A gap in the master sequence, i.e. the query
sequence, is displayed as an insertion in the matching
sequence. At position 8852 of the query sequence, both of
the hit sequences contain 2-bp insertions represented by \.

At the end of the line, both have 2-bp gaps represented by
dashes (--). In the aligned region, the mRNA sequence X03557
has a coding region feature, which is presented by labeling
each amino acid in the middle of its 3-base codon. As a
result, the protein sequence displayed in this panel is
derived from the annotation on the DNA sequence.

In the BLASTX display, the conceptual translation in the
specified reading frame is displayed underneath the
separation line. The conceptual translation is compared with
matching sequences from the protein database. Identical
residues are shown as dots. In this view, there are 3
protein sequences, P09914, 307041, and A25407, all of which
align to the query sequence in both frame +1 and frame +3.
The alignments for the frame +1 translation stop at position
8852 on the query sequence, which corresponds to the 2-bp
gap in the query sequence (displayed as 2-bp insertions on
the matching

############################################################
Algorithms: Excerpt from the draft of PowerBLAST paper
############################################################

Figure 1 illustrates the data processes in PowerBLAST. Prior to a BLAST search, SIM2
computes repeat regions in the query sequence, and the results are automatically annotated
as repeat features on the query sequence. Those, together with the low complexity regions
in a DNA sequence identified by dust (Kuzio, unpublished), are masked in a copy of the
query sequence, which is sent to the BLAST server for the database search. Four types of
BLAST search may be conducted with PowerBLAST: BLASTN compares a nucleotide
query to a nucleotide database; BLASTP compares a protein query to a protein database;
BLASTX compares a translated nucleotide query to a protein database; TBLASTN
compares a protein query to a translated nucleotide database. Large sequences are split
into overlapping pieces and the results are merged at the end. An interface was developed
to enable searches against multiple databases with multiple BLAST programs (Figure 2).
Organism specific results can be obtained at any level of the taxonomy index by filtering
the HSP alignments inclusively or exclusively with the Entrez Taxonomy Server. A suite of
SIM algorithms (SIM, SIM2, SIM3) may be selected to compute more refined gapped
alignments. The details of repeat filtering, processing of large sequences, organism
filtering, and gapped alignments are described below.

To identify repeat regions in the query sequence, PowerBLAST uses the SIM2 algorithm
to compute the top n non-intersecting gapped alignments between the query sequence and
repeat sequences in a user supplied FASTA library file. A sample file for human repeat
sequences, humrep (Makalowski, unpublished), is included in the package. In order to
reduce false positive and false negative results, various parameters were tested in an
experiment that compared the ALU repeats identified by SIM2 with the annotations in the
public records (Makalowski and Zhang, unpublished). The optimal choice is the
combination of score >= 20 and sequence identity > 65%. The end points of the
alignments are taken as repeat regions, and if there are tandem repeats of the same repeat
element, the leftmost and rightmost positions are recorded as the end points of a
single repeat region. The repeat regions are also annotated automatically as features
on the query sequence. Since repeat features are derived from the gapped alignments, the
query sequence is broken into overlapping pieces if its length exceeds 10,000 bp,
because it is faster to compute alignments multiple times than to process the whole
sequence at one time.
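
As a rough illustration of the filtering and tandem-repeat merging described above, the
following Python sketch keeps only alignments with score >= 20 and identity > 65%, then
collapses neighboring alignments to the same repeat element into one region. It is not the
PowerBLAST implementation. The tuple layout, the 0-based inclusive coordinates, the max_gap
adjacency rule, and the "n" mask character are all assumptions made for the example.

```python
# Illustrative sketch only; data layout and adjacency rule are assumptions.
from collections import defaultdict

MIN_SCORE = 20        # cutoff quoted in the text
MIN_IDENTITY = 0.65   # identity must be strictly greater than 65%


def repeat_regions(alignments, max_gap=0):
    """Return sorted (element, start, stop) repeat regions on the query.

    alignments: iterable of (element, start, stop, score, identity) with
    0-based inclusive query coordinates (an assumption of this sketch).
    max_gap: how far apart two hits to the same element may be and still
    count as one tandem region (hypothetical parameter).
    """
    spans = defaultdict(list)
    for element, start, stop, score, identity in alignments:
        if score >= MIN_SCORE and identity > MIN_IDENTITY:
            spans[element].append((start, stop))

    regions = []
    for element, intervals in spans.items():
        intervals.sort()
        cur_start, cur_stop = intervals[0]
        for start, stop in intervals[1:]:
            if start <= cur_stop + 1 + max_gap:
                # Tandem copy: extend to the rightmost end point.
                cur_stop = max(cur_stop, stop)
            else:
                regions.append((element, cur_start, cur_stop))
                cur_start, cur_stop = start, stop
        regions.append((element, cur_start, cur_stop))
    return sorted(regions, key=lambda region: region[1])


def mask(sequence, regions, mask_char="n"):
    """Return a copy of the sequence with every repeat region masked."""
    chars = list(sequence)
    for _element, start, stop in regions:
        for i in range(start, min(stop, len(chars) - 1) + 1):
            chars[i] = mask_char
    return "".join(chars)
```

Only the masked copy would be sent to the search; the original sequence is kept, which
matches the later statement that the unmasked query is used for the gapped alignments.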

Processing Large Genomic Sequence
Because the memory and CPU-time requirements vary with the type of BLAST program as well
as the composition and length of the query, PowerBLAST uses an empirically derived
maximum search size for each BLAST program. For BLASTN, the maximum size is
8000 bp; for BLASTP, it is 4000 aa; for BLASTX, it is 3000 bp; and for TBLASTN, it is
2000 aa. If the query sequence exceeds the threshold, it is broken into overlapping pieces
and each piece is submitted as a separate query to the Network BLAST server. When the
entire sequence has been processed, the HSPs from the same match sequence are sorted by
location. If two neighboring HSPs overlap and cover the same diagonal, they are
merged into a larger HSP. The statistics from the HSP that has the higher score are assigned
to the new HSP as an approximation of the real statistical value.
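
The splitting and merging steps can be sketched as follows. This is a simplified Python
illustration under stated assumptions, not the PowerBLAST code: the overlap size is not given
in the text, so the overlap parameter is hypothetical, and an HSP is modeled as an ungapped
(q_start, q_stop, s_start, score) tuple with half-open query coordinates on the plus strand.

```python
# Illustrative sketch only; overlap size and HSP layout are assumptions.
MAX_QUERY_SIZE = {"BLASTN": 8000, "BLASTP": 4000, "BLASTX": 3000, "TBLASTN": 2000}


def split_query(length, program, overlap=100):
    """Return half-open (start, stop) pieces no longer than the program's limit."""
    max_size = MAX_QUERY_SIZE[program]
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be smaller than the maximum size")
    if length <= max_size:
        return [(0, length)]
    pieces, start = [], 0
    while True:
        stop = min(start + max_size, length)
        pieces.append((start, stop))
        if stop == length:
            return pieces
        start += max_size - overlap


def merge_hsps(hsps):
    """Merge neighboring HSPs of one match sequence that share a diagonal.

    The diagonal of an ungapped HSP is q_start - s_start. The merged HSP
    keeps the higher score as an approximation, as described in the text.
    """
    merged = []
    for q_start, q_stop, s_start, score in sorted(hsps):
        if merged:
            p_start, p_stop, p_s_start, p_score = merged[-1]
            same_diagonal = (q_start - s_start) == (p_start - p_s_start)
            if same_diagonal and q_start < p_stop:
                merged[-1] = (p_start, max(p_stop, q_stop), p_s_start,
                              max(p_score, score))
                continue
        merged.append((q_start, q_stop, s_start, score))
    return merged
```

Coordinates of HSPs found in a piece would first have to be shifted by the piece's start
position so that all HSPs refer to the full query before merging.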

PowerBLAST employs two strategies for organism filtering to achieve the most efficient
network communication with the Entrez Taxonomy Server. If the selected organism has fewer
than 1000 records in the public databases, all the Ids are loaded into memory, and the BLAST
hits are compared locally with the list of Ids. Otherwise, the Ids of the matching
sequences are sent over the network to the Entrez server for evaluation. The user may
choose either to include or exclude a certain taxonomy class.
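
The choice between the two strategies can be sketched as below. This Python example is only
an illustration: load_organism_ids and ids_in_organism are hypothetical callbacks standing in
for the two kinds of request to the taxonomy server, and they do not correspond to any real
Entrez API.

```python
# Illustrative sketch only; the two callbacks are hypothetical stand-ins.
LOCAL_LIMIT = 1000  # record count below which all Ids are loaded locally


def filter_hits(hit_ids, organism_record_count, load_organism_ids,
                ids_in_organism, mode="include"):
    """Keep or drop BLAST hits that belong to the selected organism.

    load_organism_ids(): returns every Id recorded for the organism.
    ids_in_organism(ids): returns the subset of ids that belong to it.
    """
    if organism_record_count < LOCAL_LIMIT:
        # Small organism: fetch its Ids once and compare locally.
        organism_ids = set(load_organism_ids())
        matching = {hit for hit in hit_ids if hit in organism_ids}
    else:
        # Large organism: send only the hit Ids for evaluation.
        matching = set(ids_in_organism(hit_ids))

    if mode == "include":
        return [hit for hit in hit_ids if hit in matching]
    if mode == "exclude":
        return [hit for hit in hit_ids if hit not in matching]
    raise ValueError("mode must be 'include' or 'exclude'")
```

Either way, the amount of data sent over the network is bounded by the smaller of the two
lists, which is the point of having two strategies.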

Three algorithms, SIM, SIM2, and SIM3, can be selected to compute gapped alignments
between the query sequence and the database matches. The original unmasked query
sequence is used as the input to the SIM programs to ensure that the repeat regions are
included in the alignments. SIM is a space efficient algorithm that generates the top n
non-intersecting Smith-Waterman alignments between DNA-DNA or protein-protein
sequences. However, it may be too slow for long sequences. SIM2 and SIM3 are much
faster than SIM, but they only compute DNA-DNA alignments. SIM2 improves the speed
by first constructing the n best non-intersecting chains of "fragments". It then applies the
traditional dynamic programming algorithm to compute an optimal gapped alignment in a
region delimited by the chain. SIM3 computes global alignments for sequences that have
high similarity; it can only be used when a high cutoff score is set for the BLAST search.
HSPs from a BLAST search supply the orientation and approximate range as input to the
SIM programs, so that the computation is much more efficient than aligning the entire
sequences. The HSPs are sorted by location, and the gaps between neighboring HSPs are
analyzed to determine whether more than one alignment needs to be computed, because a large
gap may impose a heavy penalty that terminates the alignment. The threshold is set to
200 with the default setting of the SIM programs. The ends of the HSPs are extended
(1000 bp for DNA sequences, 100 aa for protein sequences) so that the SIM programs
are able to compute more accurate end points.
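
The way HSPs seed the gapped alignment can be sketched as follows. This Python example is a
simplified illustration, not the PowerBLAST code. It assumes that the threshold of 200 is a
gap length in residues measured on the query only, which the text does not state explicitly,
and it models each HSP as a half-open (start, stop) range on the query.

```python
# Illustrative sketch only; the meaning of the 200 threshold is an assumption.
GAP_THRESHOLD = 200                        # assumed: residues between HSPs
EXTENSION = {"dna": 1000, "protein": 100}  # end extension quoted in the text


def alignment_regions(hsps, seq_len, kind="dna"):
    """Group sorted HSPs into regions to hand to a SIM program.

    Neighboring HSPs separated by more than GAP_THRESHOLD start a new
    region (a separate alignment). Each region is then extended at both
    ends and clipped to the sequence bounds.
    """
    extension = EXTENSION[kind]
    groups = []
    for start, stop in sorted(hsps):
        if groups and start - groups[-1][1] <= GAP_THRESHOLD:
            groups[-1][1] = max(groups[-1][1], stop)
        else:
            groups.append([start, stop])
    return [(max(0, start - extension), min(seq_len, stop + extension))
            for start, stop in groups]
```

A fuller version would track the matching range on the database sequence in the same way and
carry the orientation along with each region.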