README for stand-alone BLAST
(last updated 12/18/2001)

This document provides information on stand-alone BLAST. Topics covered are
setting up stand-alone BLAST, command-line options for stand-alone BLAST,
and a release history of the different versions.

BLAST binaries are provided for IRIX6.2, Solaris2.6 (Sparc), Solaris2.7 (Intel),
DEC OSF1 (ver. 4.0D), LINUX/Intel, HPUX, Macintosh, and Win32 systems.
We will attempt to produce binaries for other platforms upon request.

Stand-alone binaries are available from ftp://ftp.ncbi.nih.gov/blast/executables/

Please remember to FTP in binary mode.

Setting up Standalone BLAST for UNIX:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is
suggested that you do this in a separate directory, perhaps called
blast.

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you
will need to have a .ncbirc file that contains the following lines:

Where "path/data/" is the path to the location of the Standalone BLAST
"data" subdirectory. For example:

The data subdirectory should automatically appear in the directory where
the downloaded file was extracted. Please note that in many cases it may
be necessary to give the entire path, including the machine name and/or
the network you are located on. Your systems administrator can help
you if you do not know the entire path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you
call the Standalone BLAST program from or in your home directory.
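
Step 2 can be sketched in shell as follows. This is only a sketch under two
assumptions that the text above does not spell out: that the file uses an
[NCBI] section with a Data key (the usual NCBI toolkit layout), and that
BLAST was unpacked into a directory called blast under the current
directory (BLAST_DIR is a hypothetical variable name).

```shell
# Hypothetical install location; change it to wherever the archive was extracted.
BLAST_DIR="${BLAST_DIR:-$PWD/blast}"
mkdir -p "$BLAST_DIR/data"

# Write .ncbirc into the directory BLAST will be called from.
# Assumption: an [NCBI] section whose Data key points at the "data" subdirectory.
cat > "$BLAST_DIR/.ncbirc" <<EOF
[NCBI]
Data=$BLAST_DIR/data
EOF

# Show what was written.
cat "$BLAST_DIR/.ncbirc"
```

Using the full path in Data avoids surprises when BLAST is later called from
a different working directory.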

3) Format your BLAST database files. The main advantage of Standalone
BLAST is the ability to create your own BLAST databases. This can be done
with any file of FASTA formatted protein or nucleotide sequences. If you
are interested in creating your own database files you should refer to
the sections "Non-redundant defline syntax" and "Appendix 1: Sequence
Identifier Syntax" of the README in the BLAST database directory
(ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA
description available from the BLAST search pages
(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/)
you will find the downloadable BLAST database files. For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb). This is a compressed, FASTA formatted file of
nucleotide sequences. Once it is uncompressed, you will need to
format the database using the 'formatdb' program which comes with your
Standalone BLAST executable. The list of arguments for this program and
all other BLAST programs is located at the end of the README in the
Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/).
You can also get these arguments by running each of the BLAST programs
(formatdb, blastall etc.) with a single hyphen as the argument (Example:
formatdb -). This document shows only the basic commands for formatting
the database and running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file is not
needed once formatdb has finished, and you can delete it.

Next create a test nucleotide file to run against the new database. It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
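
One way to build test.txt from the shell is sketched below. The
">test_query" defline is an illustrative addition (FASTA files normally
start with one), and only the first line of the sequence is repeated here
for brevity; paste the remaining lines in the same way.

```shell
QUERY=test.txt                      # file name used in the text

# Write a defline followed by the sequence lines.
# ">test_query" is a hypothetical name chosen for this sketch.
{
    printf '>test_query\n'
    printf '%s\n' \
        AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
} > "$QUERY"

# Sanity check: count sequence lines containing anything other than A, C, G, T or N.
BAD=$(grep -v '^>' "$QUERY" | grep -c '[^ACGTN]' || true)
echo "sequence lines with unexpected characters: $BAD"
```

A non-zero count usually means stray line numbers or spaces were pasted in
along with the sequence.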

To run the first search enter the following command from the UNIX
command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory.

Now you are ready to create your own databases and run BLAST searches.
For more information you should refer to the Standalone BLAST README
(ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature.
This will give you some idea of all the programs BLAST supports and the
use of different parameters for increasing or decreasing the stringency
of your results.

If you have any questions please send them to the
blast-help@ncbi.nlm.nih.gov e-mail address.

Setting up Standalone BLAST for Windows
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable for the Windows platform.

1) Download and decompress the Standalone BLAST Windows binary
blastcz.exe. We suggest doing this in its own directory, perhaps called
blast. This is a 'self-extracting' archive, so all you need to do is run
it, either from a Command Prompt (DOS Prompt) or by selecting "Run"
from the Windows "Start" button and browsing to the blastcz.exe file.

2) Create an ncbi.ini file. In order for Standalone BLAST to operate,
you will need to have an ncbi.ini file that contains the following
lines:

Where "C:\path\data\" is the path to the location of the Standalone
BLAST "data" subdirectory. For example:

This data subdirectory should automatically appear in the directory
where the downloaded file was extracted.

Make sure that your ncbi.ini file is in the Windows or WINNT directory
on your machine. Note: If you already have an ncbi.ini file on your
machine from installing other NCBI software (Network Entrez, Sequin etc.)
you can skip this step. However, if you see the following error
message, you should rename the old ncbi.ini file to something like
ncbi.bak and follow the instructions in step 2 above.

FATAL ERROR: FindPath failed.

3) The main advantage of Standalone BLAST is the ability to create your
own BLAST databases. This can be done with any file of FASTA formatted
protein or nucleotide sequences. If you are interested in creating your
own database you should refer to the sections "Non-redundant defline
syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in
the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can
also refer to the FASTA description available from the BLAST search
pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/)
you will find the downloadable BLAST database files. For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb). This is a compressed, FASTA formatted file of
nucleotide sequences. (If you do not have a copy of UNIX
"uncompress" for your Windows PC, contact NCBI Info at
info@ncbi.nlm.nih.gov.)

Once it is uncompressed, you will need to format the database using the
'formatdb' program which comes with your Standalone BLAST executable.
The list of arguments for this program and all other BLAST programs is
located at the end of the README in the Standalone BLAST FTP directory
(ftp://ftp.ncbi.nih.gov/blast/executable/). You can also get these
arguments by running each of the BLAST programs (formatdb, blastall
etc.) with a single hyphen as the argument (Example: formatdb -). This
document shows only the basic commands for formatting the database and
running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file can be
removed once formatdb has been run.

Next create a test nucleotide file to run against the new database. It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search, enter the following command:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory. Now you are ready to create your own databases and run
BLAST searches. For more information you should refer to the Standalone
BLAST README (ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST
literature. This will give you some idea of all the programs BLAST
supports and the use of different parameters for increasing or
decreasing the stringency of your results.

If you have any questions please send them to the
blast-help@ncbi.nlm.nih.gov e-mail address.

SGI recommends the following threads patches on IRIX6 systems:

For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations:
----------------------

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if
it can read the entire BLAST database into memory and then keep using it
there. The resources consumed reading a database into memory can easily
outweigh the cost of a BLAST search, so the memory of a machine is
normally more important than the CPU speed. This means that one should have
sufficient memory for the largest BLAST database one will use, then run all
the searches against this database in serial, then run queries against
another database in serial. This guarantees that each database will be read
into memory only once. As of Aug. 1997 the EST FASTA file is about 500 Meg,
which translates to about 170-200 Meg of BLAST database. At least another
100-200 Meg should be allowed for memory consumed by the actual BLAST
program. All of the FASTA databases together are about 1.5 Gig; the BLAST
databases produced from them will probably be about another Gig or so. 4 Gig
of disk space, to make room for software and output, is probably a pretty

There is now a separate document describing formatdb (README.formatdb). Please
refer to it for information on formatting FASTA files for BLAST searches.

Blastall may be used to perform all five flavors of blast comparison. One
may obtain the blastall options by executing 'blastall -' (note the dash). A
typical use of blastall, to perform a blastn search (nucl. vs. nucl.)
of a file called QUERY, would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed
against the 'nr' database. If a protein vs. protein search is desired,
then 'blastn' should be replaced with 'blastp' etc.
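
As a rough guide to choosing the -p value, the sketch below maps the type
of the query and of the database to the program names that fit. The helper
name pick_program and its "prot"/"nucl" arguments are hypothetical and are
not part of BLAST itself.

```shell
# pick_program QUERY_TYPE DB_TYPE  (each "prot" or "nucl")
# Prints the blastall -p value(s) that fit that combination.
pick_program() {
    case "$1:$2" in
        prot:prot) echo "blastp" ;;
        nucl:prot) echo "blastx" ;;          # nucleotide query is translated
        prot:nucl) echo "tblastn" ;;         # nucleotide database is translated
        nucl:nucl) echo "blastn tblastx" ;;  # direct, or both sides translated
        *) echo "usage: pick_program prot|nucl prot|nucl" >&2; return 1 ;;
    esac
}

pick_program nucl prot
```

For a nucleotide query against a nucleotide database, blastn compares the
sequences directly, while tblastx compares their translations.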

Some of the most commonly used blastall options are:

-p Program Name [String]

Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

-d Database [String]

The database specified must first be formatted with formatdb.
Multiple database names (bracketed by quotations) will be accepted.
Naming nr and est together, for example, will search both the nr and est
databases, presenting the results as if one 'virtual' database consisting
of all the entries from both were searched. The
statistics are based on the 'virtual' database of nr and est.
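
The quoting matters because the shell, not blastall, splits the command
line. The sketch below uses a small hypothetical helper, count_args, to
show the difference; QUERY and out.QUERY are the placeholder names used
earlier in this document.

```shell
DBS="nr est"                     # two formatted databases, space-separated

count_args() { echo "$#"; }      # helper: how many arguments did the shell pass?

count_args -d "$DBS"             # prints 2: quoted, so -d receives one value
count_args -d $DBS               # prints 3: unquoted, "est" becomes a stray argument

# Run the search only where blastall and the query file are actually present.
if command -v blastall >/dev/null 2>&1 && [ -f QUERY ]; then
    blastall -p blastn -d "$DBS" -i QUERY -o out.QUERY
fi
```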

-i Query File [File In]

The query should be in FASTA format. If multiple FASTA entries are in the input
file, all queries will be searched.

-e Expectation value (E) [Real]

-o BLAST report Output File [File Out] Optional

-F Filter query sequence (DUST with blastn, SEG with others) [String]

BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the
other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit
and are accessed automatically.

If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever).

This option also takes a string as an argument. One may use such a
string to change the specific parameters of seg or invoke other filters.
Please see the "Filtering Strings" section (below) for details.

-S Query strands to search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer]

-T Produce HTML output [T/F]

-l Restrict search of database to list of GI's [String] Optional

This option specifies that only a subset of the database should be
searched, determined by the list of gi's (i.e., NCBI identifiers) in a
file. One can obtain a list of gi's for a given Entrez query from
http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should
be in the same directory as the database, or in the directory that
BLAST is called from.
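
A gi list for -l might be prepared as sketched below. The layout (one gi
per line) is an assumption, the file name subset.gi is hypothetical, and
the numbers are placeholders for illustration only; substitute a list
obtained from Entrez.

```shell
GILIST=subset.gi                       # hypothetical file name

# Placeholder GI numbers; sort -u drops the duplicate.
printf '%s\n' 1000001 1000002 1000002 1000003 | sort -u > "$GILIST"

# Every line should be a bare integer; count the lines that are not.
BAD=$(grep -vc '^[0-9][0-9]*$' "$GILIST" || true)
echo "$(wc -l < "$GILIST") gi's listed, $BAD malformed line(s)"

# Restrict a search to the listed entries, if blastall and QUERY are present.
if command -v blastall >/dev/null 2>&1 && [ -f QUERY ]; then
    blastall -p blastp -d nr -i QUERY -l "$GILIST" -o out.QUERY
fi
```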

-U Use lower case filtering of FASTA sequence [T/F] Optional

This option specifies that any lower-case letters in the input FASTA file

Documentation for PSI-TBLASTN

PSI-TBLASTN is a variant of blastall that searches a protein query
sequence against a nucleotide sequence database using a position
specific matrix created by PSI-BLAST. The nucleotide sequence database
is dynamically translated in all reading frames during the PSI-TBLASTN
search. Using a position specific matrix may enable finding more
distantly related sequences.

blastpgp [takes a protein query and performs a PSI-BLAST search to
create a position specific matrix using a protein database]

blastall [reads the position specific matrix and performs the PSI-TBLASTN
search]

A user would typically run blastpgp to create and save a position
specific matrix, followed by a run of blastall for the PSI-TBLASTN search.

blastpgp must be executed with the -C option followed by a file name to
save the position specific score matrix.

blastall with the "-p psitblastn" option executes the PSI-TBLASTN search; the
-R option is followed by the name of the file that contains the
position specific score matrix. All other options that apply when
using "blastall -p tblastn ..." also apply when using "blastall -p
psitblastn ...", but there are some restrictions on parameters: 1) The
query must be the same as the one used in blastpgp for creating the
position specific matrix. 2) By default, blastpgp has filtering off
(-F F) and blastall has filtering on (-F T). To ensure consistent
usage of the blastpgp/psitblastn combination, the -F option should be
explicitly set in one or the other run.
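
One way to respect both restrictions is to keep the query name and the
filter setting in variables and pass them to both programs, as in the
sketch below. The file and database names (ff.chd, nr, yeast) are
illustrative, and setting -F in both runs is simply the most explicit
choice.

```shell
QUERY=ff.chd          # must be the same query file in both runs
CKP=ff.chd.ckp        # checkpoint written by -C, read back by -R
FILTER=F              # one filter setting, passed to both programs

STEP1="blastpgp -d nr -i $QUERY -j 2 -F $FILTER -C $CKP"
STEP2="blastall -p psitblastn -d yeast -i $QUERY -F $FILTER -R $CKP"
echo "$STEP1"
echo "$STEP2"

# Run for real only when both programs and the query file are available.
if command -v blastpgp >/dev/null 2>&1 && command -v blastall >/dev/null 2>&1 \
        && [ -f "$QUERY" ]; then
    $STEP1 && $STEP2
fi
```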

One may run PSI-BLAST to create and save a position specific score matrix:

blastpgp -d nr -i ff.chd -j 2 -C ff.chd.ckp

The position specific score matrix is saved in ff.chd.ckp. Then, using
this matrix, one may run a PSI-TBLASTN search:

blastall -i ff.chd -d yeast -p psitblastn -R ff.chd.ckp

Note that this allows the score matrix to be constructed using one
database (nr in the example) and then used to search a second database
(yeast in the example). Even if the two database names are the same,
blastpgp uses the protein version while "blastall -p psitblastn" uses
the nucleotide version.

Blastpgp performs gapped blastp searches and can be used to perform
iterative searches in psi-blast and phi-blast mode. See the PSI-BLAST and
PHI-BLAST sections (below) for a description of this binary. The options may be
obtained by executing 'blastpgp -'.

-T Produce HTML output [T/F]

-Q Output File for PSI-BLAST Matrix in ASCII [File Out] Optional

Bl2seq performs a comparison between two sequences using either the blastn or
blastp algorithm. Both sequences must be either nucleotides or proteins.
The options may be obtained by executing 'bl2seq -'.

-i First sequence [File In]
-j Second sequence [File In]
-p Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String]
-o alignment output file [File Out]
-d theor. db size (zero is real size) [Integer]
-a SeqAnnot output file [File Out] Optional
-G Cost to open a gap (zero invokes default behavior) [Integer]
-E Cost to extend a gap (zero invokes default behavior) [Integer]
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]
-W Wordsize (zero invokes default behavior) [Integer]
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
-r Reward for a nucleotide match (blastn only) [Integer]
-F Filter query sequence (DUST with blastn, SEG with others) [String]
-e Expectation value (E) [Real]
-S Query strands to search against database (blastn only). 3 is both, 1 is top, 2 is bottom [Integer]
-T Produce HTML output [T/F]

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted
using the '-o' option. An example fastacmd call would be:

fastacmd -d nr -s p38398

The fastacmd options are:

-s Search string: GIs, accessions and locuses may be used (delimited
by comma or space) [String] Optional
-i Input file with GIs/accessions/locuses for batch retrieval [String] Optional
-a Retrieve duplicated accessions [T/F] Optional
-l Line length for sequence [Integer] Optional
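
For batch retrieval, the -i and -l options can be combined as in this
sketch. The file names ids.txt and retrieved.fasta are hypothetical, and
the one-identifier-per-line layout of the input file is an assumption.

```shell
IDS=ids.txt                         # hypothetical input file for -i

# Start the list with the accession from the example above;
# further identifiers would go on their own lines (assumed layout).
printf '%s\n' p38398 > "$IDS"

if command -v fastacmd >/dev/null 2>&1; then
    # -l 60 wraps the retrieved sequence lines at 60 characters.
    fastacmd -d nr -i "$IDS" -l 60 > retrieved.fasta
else
    echo "fastacmd not found; $(wc -l < "$IDS") identifier(s) left in $IDS"
fi
```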

The -F argument can take a string as input specifying that seg should be
run with certain values or that other non-standard filters should be used.
This section describes this syntax.

The seg options can be changed by using:

which specifies a window of 10, locut of 1.0 and hicut of 1.5.

A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991))
and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked

There are three parameters for this: window, cutoff (prob of a coiled-coil), and
linker (distance between two coiled-coil regions that should be linked
together). These are now set to

One may also change the coiled-coil parameters in a manner analogous to
seg; for example, -F "C 28 40.0 32" will change the window to 28.

One may also run both seg and coiled-coil together by using a ";":

Filtering by dust may also be specified by:

It is possible to specify that the masking should only be done during
the process of building the initial words by starting the filtering
command with 'm', e.g.:

which specifies that seg (with default arguments) should be used for masking,
but that the masking should only be done when the words are being built.
This masking option is available with all filters.

If the -U option (to mask any lower-case sequence in the input FASTA file) is used and
one does not wish any other filtering, but does wish to mask when building the lookup tables
then one should specify:

This is the only case where "m" should be specified alone.
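
The filter strings described in this section can be assembled as sketched
below. Only -F "C 28 40.0 32" appears verbatim in the text; the seg form
"S 10 1.0 1.5" and the mask-only form "m S" are inferred by analogy and
should be treated as assumptions. The helper names seg_filter and
mask_only are hypothetical.

```shell
# seg_filter WINDOW LOCUT HICUT: build a seg filter string (assumed "S ..." form).
seg_filter() { echo "S $1 $2 $3"; }

# mask_only FILTER: prefix 'm' so masking applies only while words are built.
mask_only() { echo "m $1"; }

SEG=$(seg_filter 10 1.0 1.5)        # window 10, locut 1.0, hicut 1.5
LOOKUP_ONLY=$(mask_only S)          # seg with default arguments, lookup stage only

# Print the resulting command lines; the quotes keep each string as one -F value.
echo "blastall -p blastp -d nr -i QUERY -F \"$SEG\""
echo "blastall -p blastp -d nr -i QUERY -F \"$LOOKUP_ONLY\""
```

As with multiple database names, the whole filter string has to reach the
program as a single quoted argument.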

The blastpgp program can do an iterative search in which
sequences found in one round of searching are used to build
a score model for the next round of searching. In this usage,
the program is called Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the accompanying paper, the BLAST algorithm is
not tied to a specific score matrix. Traditionally, it has been
implemented using an AxA substitution matrix where A is the alphabet size.
PSI-BLAST instead uses a QxA matrix, where Q is the length of the query
sequence; at each position the cost of a letter depends on the position
with respect to the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained
multiple alignment among the query and the sequences found with
sufficiently low e-value in round i. The top part of the output for
each round divides the sequences into two groups: sequences found
previously and used in the score model, and sequences not used in the
score model. The output currently includes lots of diagnostics
requested by users at NCBI. To skip quickly from the output of
one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere
in the output. PSI-BLAST "converges" and stops if all sequences
found at round i+1 below the e-value threshold were already in
the model at the beginning of the round.
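
The tip about searching for "producing" can be scripted. The sketch below
counts the rounds in a saved report and lists the line numbers to jump to;
the function names and the file name psi.out are hypothetical.

```shell
# rounds FILE: number of lines containing "producing" (one per round, per the text).
rounds() { grep -c 'producing' "$1" || true; }

# jump_points FILE: line numbers at which each round's header appears.
jump_points() { grep -n 'producing' "$1" | cut -d: -f1; }

OUT=psi.out                         # hypothetical blastpgp output file
if [ -f "$OUT" ]; then
    echo "rounds reported: $(rounds "$OUT")"
    jump_points "$OUT"
fi
```

If fewer rounds are reported than were requested with -j, the search
converged early.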

There are several blastpgp parameters specifically for PSI-BLAST:
-j is the maximum number of rounds (default 1; i.e., regular BLAST)
-h is the e-value threshold for including sequences in the
score matrix model (default 0.001)
-c is the "constant" used in the pseudocount formula specified in the

The -C and -R flags provide a "checkpointing" facility whereby
a score model can be stored and later reused.
-C stores the query and frequency count ratio matrix in a file.
-R restarts from a file stored previously.
When using -R, it is required that the query specified on the command line
match exactly the query in the restart file.
The checkpoint files are stored in a byte-encoded (not human readable)
format, so as to prevent roundoff error between writing and reading.

Users who also develop their own sequence analysis software may wish
to develop their own scoring systems. For this purpose the code
in posit.c that writes out the checkpoint can be easily adapted to
write out scoring systems derived by other algorithms in such
a way that PSI-BLAST can read the files in later.
The checkpoint structure is general in the sense that it can handle
any position-specific matrix that fits in the Karlin-Altschul
statistical framework for BLAST scoring.
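
Because -R requires the same query that was used with -C, it can help to
record a checksum of the query next to the checkpoint and compare it before
restarting. This is a sketch of that bookkeeping only; the helper names and
the ".query" suffix are hypothetical, and a byte-for-byte comparison is
stricter than BLAST itself may require.

```shell
# record_query QUERY CKP: remember a checksum of the query next to the checkpoint.
record_query() { cksum < "$1" > "$2.query"; }

# same_query QUERY CKP: succeed only if QUERY is byte-identical to the recorded one.
same_query() {
    [ -f "$2.query" ] && [ "$(cksum < "$1")" = "$(cat "$2.query")" ]
}

# Intended use (seq1 and seq1.ckp are placeholder names):
#   blastpgp -i seq1 -d nr -j 3 -C seq1.ckp && record_query seq1 seq1.ckp
#   same_query seq1 seq1.ckp && blastpgp -i seq1 -d nr -j 2 -R seq1.ckp
```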

The -B flag provides a way to jump start PSI-BLAST from a master-slave
multiple alignment computed outside PSI-BLAST. The multiple alignment
must include the query sequence as one of the sequences, but it need
not be the first sequence. The multiple alignment must be specified
in a format that is derived from Clustal, but without some headers and
trailers. See example below. The rules are also described by the
following words. Suppose the multiple alignment has N sequences. It
may be presented in 1 or more blocks, where each block presents a
range of columns from the multiple alignment. E.g., the first block
might have columns 1-60, the second block might have columns 61-95,
the third block might have columns 96-128. Each block should have N
rows, 1 row per sequence. The sequences should be in the same order
in every block. Blocks are separated by 1 or more blank lines.
Within a block there are no blank lines, and each line consists of 1
sequence identifier followed by some white space followed by
characters (and gaps) for that sequence in the multiple alignment. In
each column, all letters must be in upper case, or all letters must be
in lower case. Upper case means that this column is to be given
position-specific scores. Lower case means to use the underlying
matrix (specified by -M) for this column; e.g., if the query sequence
has an 'l' residue in the column, then the standard scores for
matching an L are used in the column.

A sample usage would be:

blastpgp -i seq1 -B align1 -j 2 -d nr

where seq1 is the query
align1 is the alignment file
-j 2 indicates to do 2 rounds
-d nr indicates to use the nr database
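
The format rules for the -B alignment file can be checked before running
blastpgp. The sketch below is a hypothetical helper, not part of BLAST; it
tests only the rules stated above (same number of rows and same identifier
order in every block, and no column mixing upper and lower case).

```shell
# check_alignment FILE: exit status 0 if FILE follows the block rules, 1 otherwise.
check_alignment() {
    awk '
    function fail(msg) { print "line " NR ": " msg; bad = 1 }

    # Called at the end of each block: check row count and per-column case.
    function end_block(   r, c, ch, up, lo, len) {
        if (block == 0) n = row
        else if (row != n) fail("block has " row " rows, expected " n)
        len = length(seq[1])
        for (c = 1; c <= len; c++) {
            up = 0; lo = 0
            for (r = 1; r <= row; r++) {
                ch = substr(seq[r], c, 1)
                if (ch != tolower(ch)) up = 1        # an upper-case letter
                else if (ch != toupper(ch)) lo = 1   # a lower-case letter
            }
            if (up && lo) fail("column " c " of block " (block + 1) " mixes upper and lower case")
        }
        block++
        row = 0
    }

    NF == 0 { if (row > 0) end_block(); next }       # blank line closes a block

    {
        row++
        if (NF != 2) fail("expected an identifier and a sequence")
        if (block == 0) id[row] = $1
        else if (id[row] != $1) fail("identifier order differs from the first block")
        seq[row] = $2
    }

    END { if (row > 0) end_block(); exit (bad ? 1 : 0) }
    ' "$1"
}
```

A call such as "check_alignment align1 && blastpgp -i seq1 -B align1 -j 2 -d nr"
would then start the search only if the file passes these checks.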

copied below were kindly supplied by L. Aravind from a paper
he and Chris Ponting published in Protein Science:

Aravind L, Ponting CP, Homologues of 26S proteasome subunits
are regulators of transcription and translation, Protein Science

L. Aravind (aravind@ncbi.nlm.nih.gov) was the first user
and helped define how -B should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)
helped design a more flexible input format for the alignments.
If you like how -B works, let them know.
If you do not like how -B works, complain to
A. Schaffer (schaffer@helix.nih.gov) who did the implementation.

IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA
LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL
SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------
T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------
YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

26SPS9_Hs sladfekaltdy-----------------------------------------------------------------------------------
F57B9_Ce rslkdfqvafgsf----------------------------------------------------------------------------------
YDL097c_Sc aynnrslldfntalkqy------------------------------------------------------------------------------
YMJ5_Ce krslkdfvkalaeh---------------------------------------------------------------------------------
FUS6_ARATH vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------
COS41.8_Ci eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------
644879 kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------
YPR108w_Sc yasdyasyfpyllety-------------------------------------------------------------------------------
eif-3p110_Hs -----------------------------------------------------------------------------------------------
T23D8.4_Ce -----------------------------------------------------------------------------------------------
YD95_Sp ylcdysgffrtladve-------------------------------------------------------------------------------
KIAA0107_Hs rysvffqslavv-----------------------------------------------------------------------------------
F49C12.8_Hs esyydchydrffiqlaale----------------------------------------------------------------------------
Int-6_Mm wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

26SPS9_Hs ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ
COS41.8_Ci ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET
644879 ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ
YPR108w_Sc ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP
YD95_Sp ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search
program that combines matching of regular expressions
with local alignments surrounding the match.
The most important features of the program have been
incorporated into the BLAST software framework
partly for user convenience and partly so that
PHI-BLAST may be combined seamlessly with PSI-BLAST.
Other features that do not fit into the BLAST framework
will be released later as a separate program and/or
separate Web page query options.

One very restrictive way to identify protein motifs
is by regular expressions that must contain each instance
of the motif. The PROSITE database is a compilation of
restricted regular expressions that describe protein motifs.
Given a protein sequence S and a regular expression pattern P
occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P
and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences
because it filters out those cases where the pattern occurrence is
probably random and not indicative of homology.
PHI-BLAST may be preferable to other flavors of BLAST because
it is faster and because it allows the user to express
a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.
S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method
very similar to (and much of the same code as) gapped BLAST.
However, the method of evaluating statistical significance is different, and
745
In the stand-alone mode the typical PHI-BLAST usage looks like:
746
blastpgp -i <query file> -k <pattern file> -p patseedp
748
where -i is followed by the file containing the query in FASTA format
749
where -k is followed by the file containing the pattern in a syntax given below
750
and "patseedp" indicates the mode of usage, not representing any file.
752
The syntax for the query sequence is FASTA format as for all other
753
BLAST queries. The syntax for patterns follows the rules of
754
PROSITE and is documented in detail below.
755
The specified pattern is not required to be in the PROSITE list.
756
Most of the other BLAST flags can be used with PHI-BLAST.
757
One important exception is that PHI-BLAST requires gapped
758
alignments (i.e. forbids -g F in the flags) because ungapped
759
alignments do not make sense for almost all patterns in PROSITE.
761
There is a second mode of PHI-BLAST usage that is important when
762
the specified pattern occurs more than once in the query.
763
In this case, the user may be interested in restricting the
764
search for local alignments to a subset of the pattern occurrences.
765
This can be done with a search that looks like:
766
blastpgp -i <query file> -k <pattern file> -p seedp
768
in which case the use of the "seedp" option requires the user to
769
specify the location(s) of the interesting pattern occurrence(s)
770
in the pattern file. The syntax for how to specify pattern
771
occurrences is below. When there are multiple pattern occurrences in the
772
query it may be important to decide how many are of interest because
773
the E-value for matches is effectively multiplied by the number
774
of interesting pattern occurrences.
776
The PHI-BLAST Web page supports only the "patseedp" option.
778
PHI-BLAST is integrated with PSI-BLAST. In the command-line
779
mode, PSI-BLAST can be invoked by using the -j option, as usual.
780
When this is done as:
781
blastpgp -i <query file> -k <pattern file> -p patseedp -j <number of rounds>
783
then the first round of searching uses PHI-BLAST and all subsequent
784
rounds use PSI-BLAST.
785
In the Web page setting, the user must explicitly invoke one round
786
at a time, and the PHI-BLAST Web page provides the option to
787
initiate a PSI-BLAST round with the PHI-BLAST results.
788
To describe a combined usage, use the term "PHI-PSI-BLAST"
789
(Pattern-Hit Initiated, Position-Specific Iterated BLAST).
791
Determining statistical significance.
793
When a query sequence Q matches a database sequence D in PHI-BLAST,
794
it is useful to subdivide Q and D into 3 disjoint pieces
795
Qleft Qpattern Qright
796
Dleft Dpattern Dright
798
The substrings Qpattern and Dpattern contain the pattern specified
799
in the pattern file. The pieces Qpattern and Dpattern are aligned
800
and that alignment is displayed as part of the PHI-BLAST output,
801
but the score for that alignment is mostly ignored.
802
The "reduced" score r of an alignment is the sum of the scores obtained
803
by aligning Qleft with Dleft and by aligning Qright with Dright.
805
The expected number of alignments with a reduced score >= x
807
CN(Lambda*x + 1)e^(-Lambda *x)
810
C and Lambda are "constants" depending on the score matrix and the
812
N is (number of occurrences of pattern in database) * (number of
813
occurrences of pattern in Q)
814
e is the base of the natural logarithm.
816
It is important to understand that this method of computing
817
the statistical significance of a PHI-BLAST alignment is mathematically
818
different from the method used for BLAST and PSI-BLAST alignments.
819
However, both methods provide E-values, so the E-values are
820
displayed with a similar output syntax.
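
As a rough illustration of the formula above, the following Python sketch
evaluates the expected number of alignments for a given reduced score.
The function name and the numeric values used for C and Lambda are
hypothetical placeholders; the real constants depend on the scoring
system and are not given in this document.

```python
import math


def phi_blast_expected_alignments(x, C, lam, n_db, n_query):
    """Expected number of alignments with reduced score >= x.

    Implements C * N * (Lambda*x + 1) * e^(-Lambda*x), where
    N = (pattern occurrences in database) * (pattern occurrences in Q).
    """
    N = n_db * n_query
    return C * N * (lam * x + 1.0) * math.exp(-lam * x)


if __name__ == "__main__":
    # Hypothetical constants, chosen only to exercise the function.
    C, lam = 0.5, 0.3
    for score in (0, 20, 40, 60):
        e = phi_blast_expected_alignments(score, C, lam, n_db=100, n_query=2)
        print(score, e)
```

Because N is a product, doubling the number of interesting pattern
occurrences in the query doubles the value returned for the same
reduced score.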
822
Rules for pattern syntax for PHI-BLAST.
824
The syntax for patterns in PHI-BLAST follows the conventions
825
of PROSITE. When using the stand-alone program, it
826
is permissible to have multiple patterns in a file separated
827
by a blank line between patterns. When using the Web-page
828
only one pattern is allowed per query.
830
Valid protein characters for PHI-BLAST patterns:
831
ABCDEFGHIKLMNPQRSTVWXYZU
833
Valid DNA characters for PHI-BLAST patterns:
836
Other useful delimiters:
837
[ ] means any one of the characters enclosed in the brackets
838
e.g., [LFYT] means one occurrence of L or F or Y or T
839
- means nothing (this is a spacer character used by PROSITE)
840
x with nothing following means any residue
841
x(5) means 5 positions in which any residue is allowed (and similarly for any other
842
single number in parentheses after x)
843
x(2,4) means 2 to 4 positions where any residue is allowed,
844
and similarly for any other two numbers separated by a comma;
845
the first number should be < the second number.
846
> can occur only at the end of a pattern and means nothing
847
it may occur before a period
848
(another spacer used by PROSITE)
850
. may be used at the end of the pattern and means nothing
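
The rules above can be sketched as a small Python function that turns a
PROSITE-style pattern into a Python regular expression. This is only an
illustration of the syntax, not the matching code used by PHI-BLAST; the
function name is hypothetical, and the sketch accepts a repeat count in
parentheses only after x, as described above.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regex string."""
    # Trailing '>' signs and periods mean nothing, in any combination.
    core = pattern.strip().rstrip(".>")
    pieces = []
    for element in core.split("-"):  # '-' is only a spacer
        if not element:
            continue
        wild = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wild:
            low, high = wild.groups()
            if low and high:
                pieces.append(".{%s,%s}" % (low, high))  # x(2,4)
            elif low:
                pieces.append(".{%s}" % low)             # x(5)
            else:
                pieces.append(".")                       # bare x
        elif re.fullmatch(r"[A-Z]|\[[A-Z]+\]", element):
            pieces.append(element)  # one residue or a [..] choice
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(pieces)


if __name__ == "__main__":
    print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>."))
```

The resulting string can be passed to re.search() to locate an
occurrence of the pattern in a protein sequence.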
852
When using the stand-alone program, the pattern should
853
be in a file, with the first line starting:
855
followed by 2 spaces and a text string giving the pattern a name.
857
There should also be a line starting
859
followed by 2 spaces followed by the pattern description.
861
All other PROSITE codes in the first two columns are allowed,
862
but only the HI code, described below, is relevant to PHI-BLAST.
864
Here is an example from PROSITE.
866
ID CNMP_BINDING_2; PATTERN.
868
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
869
DE Cyclic nucleotide-binding domain signature 2.
870
PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
871
NR /RELEASE=32,49340;
872
NR /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
873
NR /FALSE_NEG=1; /PARTIAL=1;
874
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;
878
gives the pattern a name.
880
AC, DT, DE, NR, NR, CC
881
are relevant to PROSITE users, but irrelevant to PHI-BLAST.
882
These lines are tolerated, but ignored by PHI-BLAST.
886
describes the pattern as:
899
any 5 to 11 characters
915
In this case the pattern ends with a period.
916
It can end with nothing after the last specifying symbol
917
or any number of > signs or periods or combination thereof.
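
To make the file layout concrete, here is a minimal Python sketch of a
reader for such a pattern file. It is an illustration under stated
assumptions, not the parser used by the stand-alone program: the function
name is hypothetical, records are assumed to be separated by blank lines,
and HI lines are kept verbatim rather than interpreted.

```python
EXAMPLE = """\
ID  CNMP_BINDING_2; PATTERN.
DE  Cyclic nucleotide-binding domain signature 2.
PA  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
"""


def parse_pattern_file(text):
    """Return one dict per pattern with 'name', 'pattern' and 'hi' keys."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():          # blank line ends the current pattern
            if current:
                records.append(current)
                current = {}
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "ID":
            current["name"] = rest
        elif code == "PA":
            current["pattern"] = rest
        elif code == "HI":
            current.setdefault("hi", []).append(rest)
        # All other two-letter codes (AC, DT, DE, NR, CC, ...) are ignored.
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    for record in parse_pattern_file(EXAMPLE):
        print(record["name"], "->", record["pattern"])
```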
919
Here is another example, illustrating the use of an HI line.
921
ID ER_TARGET; PATTERN.
922
PA [KRHQSA]-[DENQ]-E-L>.
926
In this example, the HI lines specify that the pattern
927
occurs twice, once from positions 19 through 22 in the
928
sequence and once from positions 201 through 204 in the
930
These specifications are relevant when stand-alone PHI-BLAST is
933
option, in which the interesting occurrences of the pattern
934
in the sequence are specified. In this case the
935
HI lines specify which occurrence(s) of the pattern
936
should be used to find good alignments.
938
In general, the seedp option is more useful than the
939
standard patseedp option ONLY when the
940
pattern occurs K > 1 times in the sequence AND
941
the user is interested in matching to J < K of those
943
Then using the HI lines enables the user to specify which
944
occurrences are of interest.
946
Additional functionality related to PHI-BLAST.
948
PHI-BLAST takes as input both a pattern and a query sequence containing
949
that pattern and searches a sequence database for
950
other sequences containing the same pattern and having a good alignment.
951
One may be interested in asking two related, simpler questions:
953
1. Given a sequence and a database of patterns, which patterns occur
954
in the sequence and where?
956
2. Given a pattern and a sequence database, which sequences contain the
959
These queries can be answered with software closely related to PHI-BLAST,
960
but they do not fit into the output framework of BLAST because the
961
answers are simple lists without alignments and with no notion of
962
statistical significance.
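
The two questions above amount to plain pattern matching. The Python
sketch below shows one way to answer them with regular expressions; it is
not the NCBI implementation. The function names and the toy sequences are
hypothetical, and the patterns are assumed to have been converted to
Python regex syntax already.

```python
import re


def find_occurrences(sequence, regex):
    """Return 1-based (start, end) pairs, one per matching start position."""
    compiled = re.compile(regex)
    hits = []
    pos = 0
    while pos <= len(sequence):
        match = compiled.search(sequence, pos)
        if match is None:
            break
        hits.append((match.start() + 1, match.end()))
        pos = match.start() + 1   # step by one so overlapping hits are kept
    return hits


def patterns_in_sequence(sequence, patterns):
    """Question 1: which patterns occur in the sequence, and where?"""
    found = {}
    for name, regex in patterns.items():
        hits = find_occurrences(sequence, regex)
        if hits:
            found[name] = hits
    return found


def sequences_with_pattern(regex, database):
    """Question 2: which database sequences contain the pattern?"""
    return [name for name, seq in database.items() if re.search(regex, seq)]


if __name__ == "__main__":
    er_target = "[KRHQSA][DENQ]EL"   # regex form of [KRHQSA]-[DENQ]-E-L
    toy = "MKDELGGHDELA"             # made-up sequence for illustration
    print(patterns_in_sequence(toy, {"ER_TARGET": er_target}))
    print(sequences_with_pattern(er_target, {"s1": toy, "s2": "GGGG"}))
```

As in the text, the answers are simple lists of names and positions, with
no alignments and no E-values.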
964
The NCBI toolbox includes another program, currently called
966
to answer the two queries above.
968
Query 1 can be asked with:
969
seedtop -i <sequence file> -k <pattern file> -p patmatchp
971
Query 2 can be asked with:
972
seedtop -d <database> -k <pattern file> -p patternp
974
The -k argument is used similarly in all queries and the file
975
format is always the same. The standard pattern database is
976
PROSITE, but others (or a subset) can be used.
977
There are plans afoot to offer the patmatchp query (number 1) on
978
the PHI-BLAST web page or in its vicinity, but this would
979
be restricted to having PROSITE as the pattern database.
983
Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
984
David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
985
"Protein sequence similarity searches using patterns as seeds", Nucleic
986
Acids Res. 26:3986-3990.
988
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
989
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
990
"Gapped BLAST and PSI-BLAST: a new generation of protein database
991
search programs", Nucleic Acids Res. 25:3389-3402.
993
Karlin, Samuel and Stephen F. Altschul (1990). Methods for
994
assessing the statistical significance of molecular sequence
995
features by using general scoring schemes. Proc. Natl. Acad.
998
Karlin, Samuel and Stephen F. Altschul (1993). Applications
999
and statistics for multiple high-scoring segments in molecular
1000
sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.
1002
Schaffer, Alejandro A., L. Aravind, Thomas L. Madden, Sergei Shavirin,
1003
John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
1004
Improving PSI-BLAST Protein Database Search Sensitivity with Composition-Based
1005
Statistics and Other Refinements. Nucleic Acids Res. 29:2994-3005.
1010
Notes for 2.2.2 release:
1014
1.) Version 4 of the BLAST databases is now fully supported. This version
1015
has some enhancements described in README.formatdb and fixes some problems
1016
described below. Use the "-A" option on formatdb to produce the new database
1017
version. The BLAST binaries for release 2.2.2 are entirely compatible with
1018
both the current and the new version of the BLAST databases. Old BLAST binaries
1019
are not necessarily compatible with the new database format.
1021
2.) Fastacmd will dump out an entire BLAST database in FASTA format if the
1022
new -D option is used.
1024
3.) Fastacmd will separate definition lines from different GI's that have
1025
been merged together in nr (as they all have the same sequence) by control-A's
1026
if the new -c option is used.
1031
1.) A problem has been fixed that caused tblastn searches to miss some protein matches,
1032
if the database sequence was longer than 15 million bases.
1034
2.) The old (current) version of the BLAST databases has a "rollover" problem if
1035
the total number of bases in a single volume is greater than 4294967295. The new
1036
database version (#4) allows eight bytes for this.
1038
3.) The old (current) version of the BLAST database format does not handle ambiguity
1039
characters in a nucleotide database sequence if it is over 16 million characters long.
1040
The new version of the BLAST database does.
1042
4.) A performance problem that caused mutexes to be acquired too often for
1043
multi-threaded runs with four or more CPU's has been fixed. Thanks to Haruna
1044
Cofer of SGI for help in finding the cause.
1046
5.) A problem that caused ungapped blastp/blastx/tblastn/tblastx to crash on
1047
certain matrices (e.g., pam10) has been fixed.
1049
6.) Some blastpgp problems with using the -B (for reading a master-slave alignment) and
1050
reading checkpoint files (-C) have been resolved.
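
The "rollover" in item 2.) above can be pictured with a short Python
sketch. It assumes, as the limit of 4294967295 suggests, that the old
format keeps the total number of bases in an unsigned 4-byte field; the
function names are hypothetical and nothing here reads a real BLAST
database.

```python
UINT32_MAX = 2**32 - 1   # 4294967295, the limit quoted above


def stored_in_4_bytes(total_bases):
    """Value that survives when only the low 4 bytes are kept."""
    return total_bases & 0xFFFFFFFF


def stored_in_8_bytes(total_bases):
    """Value that survives when 8 bytes are available."""
    return total_bases & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    for total in (UINT32_MAX, UINT32_MAX + 1, 5_000_000_000):
        print(total, stored_in_4_bytes(total), stored_in_8_bytes(total))
```

A total one larger than the limit wraps around to zero in four bytes,
while the eight-byte field keeps the true value.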
1053
Notes for 2.2.1 release:
1057
1.) BLAST and PSI-BLAST improvements as described in
1058
Schaffer et al., Nucleic Acids Research 2001 Jul 15;29(14):2994-3005.
1059
These include improvements to the use of composition-based statistics
1060
and improvements to the edge-correction effects. Composition-based
1061
statistics were initially implemented in release 2.1.1, but the
1062
implementation is improved in release 2.2.1.
1064
2.) Formatdb automatically produces database volumes for input
1065
consisting of more than 4 billion letters.
1067
3.) Formatdb can produce an alias file for a given database and GI list
1068
as well as convert a GI list to the more efficient binary format. See
1069
details in README.formatdb.
1071
4.) RPSBLAST now works properly with 'scaled' databases. The scaling factor must
1072
be set when executing the program 'makemat' (which takes PSI-BLAST checkpoints
1073
as input). Scaling-up the matrix improves the precision of the (integer) calculations.
1075
5.) Tabular output has now been added to blastpgp and rpsblast; use the "-m 8" option.
1077
6.) Blastpgp will now process multiple queries.
1081
1.) A problem with the -K option (for culling) that caused BLAST to crash has been fixed.
1083
2.) A problem with the "gnl" identifier and multi-volume databases has been fixed.
1085
3.) A problem that caused BLASTN to very rarely find suboptimal alignments has been fixed.
1087
4.) A problem that could cause makemat to crash has been fixed.
1089
5.) Some multi-threading problems pointed out by Henry Gabb of KAI were fixed.
1091
6.) Some PC-lint errors and warnings pointed out by Russ Williams of United Devices
1095
Notes for 2.1.3 release:
1099
1.) Addition of PSI-TBLASTN ability to blastall, see description in
1102
2.) Database sequences over 5 million bases in length are now broken
1103
into chunks to keep memory usage reasonable.
1105
3.) Blastall now allows one to enter a location if it is desired
1106
to search a subsequence of the query.
1108
4.) Formatdb can produce a new BLAST database format using the -A option.
1109
The BLAST programs can read this format as well as the current format (the
1110
program automatically identifies which version it should work with). This
1111
new format stores the sequence definition lines in a structured manner
1112
(as ASN.1); this will allow future versions of BLAST to better present
1113
taxonomic information as well as information about other resources (e.g.,
1114
UniGene, LocusLink) for a database sequence.
1116
5.) Blastall can now produce tab-delimited output; use "-m 8" to specify this.
1118
6.) Improved Karlin-Altschul parameters are now being used; they were
1119
calculated using the "island" method
1121
7.) A "gapped" check was added to BLASTN to ensure that if a hit is low-scoring
1122
after an ungapped extension, but high-scoring after a gapped extension, it will
1125
8.) The formatdb error messages have been improved for the case of illegal
1126
characters in the sequence.
1128
9.) The number of HSP's saved in an ungapped search has been increased to 400 from 200.
1132
1.) A problem with XML output was fixed.
1134
2.) A problem with the seg filtering under LINUX was
1135
fixed (many thanks to Eric Cabot at GCG for pointing this out).
1137
3.) A problem with format of BLAST reports if the "-o" flag
1138
was not used when the database was produced was fixed
1139
(thanks again to Eric Cabot).
1141
4.) A problem with reading the BLAST database caused by a 4-byte signed integer
1142
that should have been unsigned was fixed (thanks to Haruna Cofer at SGI
1143
for pointing this out).
1145
5.) A problem with copymat under NT and IRIX was fixed.
1148
Notes for 2.1.2 release:
1152
1.) Release of rpsblast. Rpsblast performs a search against a database
1153
of profiles. See README.rps for full details.
1155
2.) Release of blastclust. BLASTCLUST automatically and systematically clusters protein sequences
1156
based on pairwise matches found using the BLAST algorithm. See README.bcl for
1159
3.) Release of megablast. Megablast uses the greedy algorithm of Webb Miller et al.
1160
for nucleotide sequence alignment search and concatenates many queries to save
1161
time spent scanning the database. See README.mbl for full details.
1163
4.) XML output can now be produced. Use the '-m 7' option for this.
1164
The XML output is still experimental.
1166
5.) The default behavior of the culling (-K) option has been changed. Previously
1167
this option was set to 100, meaning that if more than 100 HSP's had a
1168
hit to a region, lower-scoring ones would be dropped. The option is now
1169
zero, which turns off this behavior. In a few cases this change will
1170
result in more database sequences being reported. The previous behavior can
1171
be recovered by using '-K 100' on the command-line.
1175
1.) A bug that caused only the last SeqAnnot to be written (if the -O option
1176
was used) when multiple sequences were searched has been fixed. All
1177
SeqAnnots are printed out.
1179
2.) A bug that caused the search space (set on the command line with the -Y option)
1180
to be ignored for some blastx and tblastn calculations has been fixed.
1182
3.) A failure to close a file if a gilist was used (using the -l option) was
1183
fixed. Many thanks to David Mathog at CalTech for spotting this problem
1184
and suggesting a fix.
1186
4.) A bug that caused all the database names listed in an alias file to be
1187
printed, rather than the "TITLE" field has been fixed.
1195
1.) Addition of composition-based statistics:
1197
BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported
1198
alignments. This improves E-value accuracy, thereby reducing the number of false positive results.
1200
The improved statistics are achieved with a scaling procedure [1,2] which in effect employs a slightly different scoring system for each database sequence. As a result,
1201
raw BLAST alignment scores in general will not correspond precisely to those implied by any standard substitution matrix. Furthermore, identical alignments can receive
1202
different scores, based upon the compositions of the sequences they involve. The improved statistics are now used by default for all rounds of searching on the
1203
PSI-BLAST page, but not on the BLAST page. Therefore, if one uses default settings, the results of the first round of searching will be different on the BLAST and
1206
In addition, adjustments have been made to two PSI-BLAST parameters: the pseudocount constant default has been changed from 10 to 7, and the E-value threshold for
1207
including matches in the PSI-BLAST model has been changed from 0.001 to 0.002.
1209
1. Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402.
1210
2. Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011.
1213
Notes for 2.0.14 release:
1218
1.) extra line returns between sequences in a FASTA file
1219
causes formatdb to produce corrupted databases.
1221
2.) ";" at the beginning of a line was not being treated as a comment.
1223
3.) a problem with the formatter causes blast to core-dump if
1224
the FASTA definition line only contains an identifier and
1227
4.) a problem in the ungapped extension for protein sequences
1228
causes a rare problem.
1230
5.) the '-U' option that causes lower-case sequence to be masked
1231
does not work correctly for blastx.
1234
Notes for 2.0.13 release:
1238
1.) The output format for pairwise alignments was changed to
1239
put each new gi (if the sequence has redundant gi's) on a
1240
new line. If HTML output is specified then each gi is hyperlinked.
1244
1.) An NCBI toolkit problem parsing the new RefSeq format in FASTA files
1245
(two bars instead of three) was fixed. This fix applies to all
1246
BLAST binaries (formatdb, blastall, blastpgp, etc.).
1248
2.) A problem that caused BLAST version 2.0.12 under NT to freeze in
1249
multithreaded mode has been fixed.
1251
Notes for 2.0.12 release:
1255
1.) Bl2seq can now perform nucleotide-protein (blastx style) comparisons.
1256
This necessitated changing the '-p' option from a Boolean to a
1257
string. Valid arguments are "blastn", "blastp", or "blastx".
1261
1.) A problem in the NCBI threads library that caused BLAST to sometimes
1262
stick was corrected. Many thanks to Haruna Cofer and colleagues at SGI
1263
for providing a fix.
1265
2.) A problem that caused BLAST to core-dump (especially on long queries)
1266
has been fixed. Many thanks to Gary Williams for providing examples.
1268
3.) A problem that prevented the search of multiple multivolume databases
1273
Notes for 2.0.11 release:
1277
1.) Optimizations were contributed by Chris Joerg of COMPAQ. These changes
1278
reduce the number of cache misses, unroll loops, and make some instructions
1279
unnecessary. These improvements can speed up BLAST for long sequences
1282
2.) A database is now only memory-mapped while being searched. If multiple databases
1283
are searched and the total exceeds the allowed memory-map limit this allows
1284
all databases to be searched as memory-mapped files. If a database cannot
1285
be memory-mapped it is read as an ordinary file, rather than causing an error.
1289
1.) Formatdb was fixed to correct a problem with FASTA string identifiers under NT.
1291
2.) Blastpgp was fixed to prevent a core-dump under LINUX
1293
3.) BLASTN was found to miss some hits near the expect value cutoff. This has been
1298
Notes for 2.0.10 release:
1302
1.) Bl2seq, a utility to compare two sequences using the blastn or blastp approach,
1303
is included in the archive. See the full description in the README.bls for details.
1305
2.) A 'sparse' option ('-s') has been added to formatdb. This option limits the indices
1306
for the string identifiers (used by formatdb) to accessions (i.e., no locus names).
1307
This is especially useful for sequence sets like the EST's where the accession and locus
1308
names are identical. Formatdb runs faster and produces smaller temporary files if this
1309
option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
1311
3.) A volume option ('-v') has been added to formatdb. This option breaks up large
1312
FASTA files into 'volumes' (each with a maximum size of 2 billion letters).
1313
As part of the creation of a volume, formatdb writes a new type of BLAST database file,
1314
called an alias file, with the extension 'nal' or 'pal'. This option
1315
should be used if one wishes to formatdb large databases (e.g., over 2 billion
1318
4.) It is now possible to jump start the command line version of PSI-BLAST (blastpgp)
1319
from a multiple alignment that includes the query sequence using the -B option. Details
1322
5.) The maximum wordsize limit for BLASTN has been removed.
1326
1.) A problem if the database length, set by the '-z' option was greater than
1327
2 billion, was fixed.
1329
2.) A core-dump that resulted from the use of the coiled-coil masking
1330
('-F C') was fixed by including a file needed for the data directory.
1332
3.) A bug was fixed that caused some very short alignments to be assigned incorrect
1335
4.) A bug was fixed that caused formatdb to produce incorrect BLAST databases if
1336
the input was ASN.1.
1338
5.) A serious performance problem with BLASTN and longer words (greater than 16)
1341
Notes for 2.0.9 release:
1345
1.) two new options have been added to blastall: to produce output in HTML and
1346
to search a subset of the database based upon a list of GI's. Please see
1347
the options section for full information.
1349
2.) two new options have been added to blastpgp: to produce HTML output and to
1350
produce an ASCII version of the PSI-BLAST Matrix. Please see the options section
1351
for more information.
1353
3.) formatdb has a new option to allow specification of a 'base' name. See the options
1354
section for full details.
1356
4.) it is possible to mask only during the phase when the lookup table is being built,
1357
but not during the extensions. See the options section for full details.
1361
1.) a problem that occurred when too many HSP's aligned to the same part
1362
of the query from one database sequence has been fixed.
1364
2.) a problem that caused seedtop to not perform pattern-matching for DNA
1365
sequences has been fixed.
1367
3.) the number of HSP's saved for ungapped BLAST and tblastx is now limited to
1368
200 to prevent problems with memory and speed.
1370
4.) a missing thread join that caused problems under DEC Alpha has been added.
1372
5.) a formatting problem with the database summary at the beginning of the
1373
BLAST output (if multiple databases totaling over 2 Gig) has been fixed.
1375
6.) a bug in formatdb that caused a core-dump if the total number of sequences was an
1376
exact multiple of 100000 was fixed.
1379
Notes for 2.0.8 release:
1383
1.) Frame and strand information was added to the output. Examples of the
1384
new output format may be found at http://www.ncbi.nlm.nih.gov/BLAST/example.html.
1386
2.) An option that specifies the query strand to be searched (for blastn, blastx, and tblastx)
1387
has been added. The option is '-S'.
1391
1.) The problem with the 'too-wide' parameter input screen under NT was fixed.
1393
2.) BLAST no longer core-dumps when the query is NULL.
1395
3.) BLAST no longer core-dumps when the query contains an '@' and blastx or tblastx is selected.
1397
Notes for 2.0.7 release:
1401
1.) BLAST now multi-threads properly under LINUX.
1403
2.) A problem with very redundant databases and psi-blast was fixed.
1405
3.) A problem with the formatting of the number of identities and positives
1406
was fixed. This affected results on the minus strand only and did not
1407
affect the expect value or scores.
1409
4.) A problem that caused tblastn to core-dump very occasionally was corrected.
1411
5.) A problem with multiple patterns in PHI-BLAST was fixed.
1413
6.) A limit on the number of HSP's that were saved (100) was removed.
1415
Notes for 2.0.6 release:
1419
1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST for
1422
2.) SEG has become an integral part of the NCBI toolkit and it is no longer necessary
1423
to install it separately. It is also now supported under non-UNIX platforms.
1425
3.) Access to filtering options.
1427
If one uses "-F T" then normal filtering by seg or dust (for blastn)
1428
occurs (likewise "-F F" means no filtering whatsoever). The seg options
1429
can be changed by using:
1433
which specifies a window of 10, locut of 1.0 and hicut of 1.5. One may
1434
also specify coiled-coil filtering by specifying:
1438
There are three parameters for this: window, cutoff (probability of a coiled-coil), and
1439
linker (distance between two coiled-coil regions that should be linked
1440
together). These are now set to
1446
One may also change the coiled-coil parameters in a manner analogous to
1449
-F "C 28 40.0 32" will change the window to 28.
1451
One may also run both seg and coiled-coil filtering together by using a ";":
1455
4.) BLAST has been changed to reduce the number of redundant hits that a user
1456
may see. This is achieved by keeping track of the number of hits completely
1457
contained in a certain region and eliminating those lower scoring hits that
1458
are redundant with others. This behavior may be controlled with the -K and -L
1461
-K Number of best hits from a region to keep [Integer]
1463
-L Length of region used to judge hits [Integer]
1466
Setting -K to zero turns off this feature. This is the default only on blastall.
1470
1.) There was a problem with the procedure that called the external utility seg.
1471
The need to fix this was obviated by the integration of seg into the toolkit.
1472
This showed up under LINUX.
1474
2.) There was a memory problem with formatdb that has been fixed. This showed up
1475
mostly under NT and LINUX.
1477
3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root user)
1480
Notes for 2.0.5 release:
1484
1.) The BLAST version is printed by formatdb in its log file.
1486
2.) Multi-database searches no longer require that the -o option be used when
1487
preparing the databases (i.e., with formatdb).
1491
1.) A serious bug with multi-database iterative searches was fixed (thanks to
1492
Steve Brenner for providing an example).
1494
2.) 'lcl' is not formatted in the BLAST report when the sequence identifier
1495
is a local identifier or does not contain a bar ("|").
1497
3.) A large memory leak in formatdb was fixed.
1499
4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines
1500
if the binary was made under 2.6 was fixed.
1502
5.) Better error checking was added to protect against core-dumps.
1504
6.) Some problems with the sum statistics treatment of the blastx and tblastn
1505
programs reported by D. Rozenbaum were fixed. The number of alignments
1506
involved in a sum group was misrepresented. Also the incorrect length for
1507
the database sequence was used, sometimes causing a slight change in the
1510
7.) A problem with blastpgp was fixed that reported incorrect values for
1511
matrices other than BLOSUM62 during iterative searches.
1513
Notes for 2.0.4 release:
1517
1.) multiple database searches:
1519
Version 2.0.4 will accept multiple database names (bracketed by quotations).
1524
which will search both the nr and est databases, presenting the results as if one
1525
'virtual' database consisting of all the entries from both were searched. The
1526
statistics are based on the 'virtual' database.
1530
-W Word size, default if zero [Integer]
1532
-z Effective length of the database (use zero for the real size) [Integer]
1535
3.) The number of identities, positives, and gaps are now printed out before the
1536
alignments for gapped blastx, tblastn, and tblastx. Additionally this feature is
1537
now also enabled for ungapped BLAST.
1539
4.) Formatdb now accepts ASN.1, as well as FASTA, as input.
1543
1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start codon in
1546
2.) The last alignment of the last sequence being presented was incorrectly dropped
1547
in some cases. This change could affect the statistical significance of the last database
1548
sequence if the dropped alignment had a lower e-value than any other alignments from the
1549
same database sequence.