<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
4
<html xmlns="http://www.w3.org/1999/xhtml">
7
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
14
README for standalone MEGABLAST
15
(last updated 7/2/2004)
18
Mega BLAST uses the greedy algorithm of Zhang et al. [1] for nucleotide
19
sequence alignment search and concatenates many queries to save time spent
20
scanning the database. This program is optimized for aligning sequences that
21
differ slightly as a result of sequencing or other similar "errors". It is up to
22
10 times faster than more common sequence similarity programs and therefore can
23
be used to swiftly compare two large sets of sequences against each other.
Most of the options are similar to those in the blastall binary (see the
README.bls file for their descriptions). Note that the megablast binary does
not require the program option. Below are more detailed explanations of the
options that are either specific to Mega BLAST or have a different meaning:
30
-----------------------------
When the word size W (-W option) is divisible by 4, a Mega BLAST search is
guaranteed to find all perfect matches of length W + 3. Perfect matches as
short as W may also be found, but this is not guaranteed. Any value of W not
divisible by 4 is equivalent to the nearest value divisible by 4 (with 4*i+2
equivalent to 4*i).
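The rounding rule above can be sketched in a few lines of Python. This is only
an illustration of the stated rule; the function name is hypothetical and is
not part of megablast.

```python
def effective_word_size(w: int) -> int:
    """Round W to the nearest multiple of 4; the tie 4*i + 2 goes down to 4*i."""
    remainder = w % 4
    if remainder <= 2:
        # 4*i, 4*i + 1 and 4*i + 2 all behave like 4*i.
        return w - remainder
    # 4*i + 3 is closest to 4*(i + 1).
    return w + 1


for w in (16, 17, 18, 19):
    print(w, "->", effective_word_size(w))
```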
40
-----------------------------
42
-G, -E Affine gapping penalties.
If these options are not set (both are 0), non-affine gapping is assumed,
with gap opening penalty 0 and gap extension penalty E, which can be computed
from match reward r and mismatch penalty q by the formula: E = r/2 - q. The
affine version of Mega BLAST requires significantly more memory, so it should
be avoided if possible, especially when some of the query or database
sequences are very long.
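As a worked example of the formula E = r/2 - q, the sketch below evaluates it
for a couple of reward/penalty pairs. It assumes that q is given as a negative
score (as the mismatch penalty is usually passed on the command line); the
function name is hypothetical.

```python
def non_affine_gap_extension(reward: float, mismatch: float) -> float:
    """Gap extension penalty E = r/2 - q used when -G and -E are both 0.

    Assumption: `mismatch` (q) is a negative score, e.g. -3.
    """
    return reward / 2 - mismatch


# With r = 1 and q = -3: E = 0.5 - (-3) = 3.5
print(non_affine_gap_extension(1, -3))
```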
51
-----------------------------
53
-D Type of the Mega BLAST output.
55
0: Produce one-line output for each alignment, in the form
57
'subject-id'=='[+-]query-id' (s_off q_off s_end q_end) score
59
Here subject(query)-id is a gi number, an accession or some other type of
60
identifier found in the FASTA definition line of the respective sequence.
62
+ or - corresponds to same or different strand alignment.
For non-affine gapping parameters, the score is the total number of
differences (mismatches + gaps). For the affine case it is the actual (raw)
score of the alignment.
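A line in the -D 0 format described above could be parsed along these lines.
This is a sketch only: the exact whitespace is assumed from the pattern shown
above, the names D0Hit and parse_d0_line are hypothetical, and the identifiers
in the usage line are made up for illustration.

```python
import re
from typing import NamedTuple


class D0Hit(NamedTuple):
    subject_id: str
    strand: str      # "+" = same strand, "-" = different strand
    query_id: str
    s_off: int
    q_off: int
    s_end: int
    q_end: int
    score: int       # differences (non-affine) or raw score (affine)


# 'subject-id'=='[+-]query-id' (s_off q_off s_end q_end) score
_D0_LINE = re.compile(
    r"^'(?P<subject>[^']*)'=='(?P<strand>[+-])(?P<query>[^']*)'\s+"
    r"\((?P<s_off>\d+)\s+(?P<q_off>\d+)\s+(?P<s_end>\d+)\s+(?P<q_end>\d+)\)\s+"
    r"(?P<score>-?\d+)\s*$"
)


def parse_d0_line(line: str) -> D0Hit:
    match = _D0_LINE.match(line)
    if match is None:
        raise ValueError(f"not a -D 0 line: {line!r}")
    return D0Hit(
        match["subject"], match["strand"], match["query"],
        int(match["s_off"]), int(match["q_off"]),
        int(match["s_end"]), int(match["q_end"]),
        int(match["score"]),
    )


# Made-up example line, for illustration only.
print(parse_d0_line("'gi|111'=='+gi|222' (10 1 110 101) 2"))
```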
68
1: Show the same output as level 0, plus the endpoints and percentage
69
of identical nucleotides for each ungapped segment in the alignment.
71
2: Show the traditional BLAST (blastn) output.
73
3: Show one-line output for each alignment, with the following fields:
76
Query id, Subject id, percent of identity, alignment length, number of
77
mismatches (not including gaps), number of gap openings, start of
78
alignment in query, end of alignment in query, start of alignment in
79
subject, end of alignment in subject, expected value, bit score.
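The twelve fields listed above map naturally onto a small record. The sketch
below assumes the fields are separated by tab characters, which this section
does not state; the field names and the function name are hypothetical, and
the sample line is made up.

```python
D3_FIELDS = (
    "query_id", "subject_id", "percent_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "evalue", "bit_score",
)
_INT_FIELDS = ("alignment_length", "mismatches", "gap_openings",
               "q_start", "q_end", "s_start", "s_end")
_FLOAT_FIELDS = ("percent_identity", "evalue", "bit_score")


def parse_d3_line(line: str) -> dict:
    """Split one -D 3 line into named, typed fields (assumes tab separators)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(D3_FIELDS):
        raise ValueError(f"expected {len(D3_FIELDS)} fields, got {len(values)}")
    record = dict(zip(D3_FIELDS, values))
    for name in _INT_FIELDS:
        record[name] = int(record[name])
    for name in _FLOAT_FIELDS:
        record[name] = float(record[name])
    # Reverse-strand hits are printed with subject start greater than end.
    record["reverse_strand"] = record["s_start"] > record["s_end"]
    return record


# Made-up example line, for illustration only.
sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t500\t301\t1e-100\t370"
print(parse_d3_line(sample))
```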
If the alignment is from a reverse strand, the subject start and end are
printed in reverse order, reflecting the actual direction of the alignment.
85
-----------------------------
The -F option is described in the README.bls file and in general works as it
does in other BLAST programs. It actually combines two settings: the type of
filtering, and the stages of the search at which the filtered regions are
masked. The option is specified by a string that lists all types of filters
the user wants to apply, separated by semicolons or spaces. The available
filters for nucleotide BLAST or Mega BLAST searches are:
98
L - low complexity (equivalent to D)
Finally, if the letter 'm' is included in the filter string, all types of
filters mask the query sequence regions only at the word-finding stage and do
not affect the extension stage.
For example, if the option -F "m D;R" is specified, both dust and human
repeats filtering will be applied, but the alignments will be extended through
the filtered areas. With option -F "L;V" the dust and vector screen filters
will be applied, and the filtered areas will be masked for all stages of the
search.
110
The -F m option affects the lower case filtering (specified by the -U option)
111
as well. Therefore if one wants to use lower case filtering, but allow the
112
extension through lower case regions of the query sequence, the -F m -U T
113
combination of options must be used.
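The structure of the filter string can be illustrated with a short parser.
This is a simplified sketch, not the parser used by megablast: it handles only
the single-letter filters named in this section (D, L, R, V) and the 'm' flag,
and the function name is hypothetical.

```python
import re


def parse_filter_string(filter_string: str):
    """Return (filters, lookup_only) for a -F value such as "m D;R".

    filters     -- set of filter letters, with L folded into D (equivalent)
    lookup_only -- True if 'm' is present, i.e. masking applies only to the
                   word-finding stage and not to extension
    """
    # Filters are separated by semicolons or spaces.
    tokens = [t for t in re.split(r"[;\s]+", filter_string) if t]
    lookup_only = "m" in tokens
    filters = {"D" if t == "L" else t for t in tokens if t != "m"}
    return filters, lookup_only


print(parse_filter_string("m D;R"))
print(parse_filter_string("L;V"))
```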
115
-----------------------------
119
As in BLAST, this value provides a cutoff threshold for the extension
120
algorithm tree exploration. When the score of a given branch drops below the
121
current best score minus the X-dropoff, the exploration of this branch stops.
123
-y X-dropoff for ungapped extensions.
124
-Z X-dropoff for the final gapped extension.
Both -y and -Z are used only in conjunction with the -n T option, i.e. when
non-greedy gapped extension is performed, as in blastn.
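The X-dropoff idea is easiest to see in the ungapped case. The sketch below is
a simplified illustration of the stopping rule only (extension to the right,
no gaps); it is not the greedy algorithm used by Mega BLAST, and the function
name and default scores are assumptions. All values are raw scores.

```python
def xdrop_extend(query, subject, q_start, s_start,
                 reward=1, penalty=-3, x_dropoff=10):
    """Extend an ungapped alignment to the right until the running score
    falls more than x_dropoff below the best score seen so far.

    Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += reward if same else penalty
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_dropoff:
            break  # score dropped below best - X: abandon this branch
    return best, best_len


print(xdrop_extend("AAAATAAAA", "AAAACAAAA", 0, 0, x_dropoff=5))
print(xdrop_extend("AAAATAAAA", "AAAACAAAA", 0, 0, x_dropoff=2))
```

With the larger dropoff the extension survives the single mismatch and reaches
the end of the sequences; with the smaller one it stops at the mismatch.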
128
Note that all of these are the raw values, as opposed to bit values for other
130
-----------------------------
132
-e The cutoff expectation value.
134
By default this value is set to a very large number, i.e. effectively there
135
is no expectation value cutoff.
137
-----------------------------
139
-v Maximal number of database sequences to report alignments from.
140
-b Maximal number of reported alignments for a given database sequence.
142
These options are meaningful only in conjunction with -D 2.
144
-----------------------------
146
-J Believe the query defline.
148
The default is T (TRUE) for all types of output except -D 2. In the latter
149
case, the default is F (FALSE), unless a SeqAlign ASN.1 output is required,
150
specified by the -O option.
151
Note: this option must be set to F (FALSE) if the sequence IDs in the FASTA
154
-----------------------------
156
-M Maximal total length of queries to be concatenated for a single megablast
159
Setting this value below the default (20,000,000) can reduce the memory
footprint of the program for large searches.
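One way to picture the effect of -M is as a grouping of queries into batches
whose total length stays within the limit. The sketch below is an assumption
about the general idea, not a description of megablast internals; in
particular, how a single query longer than the limit is handled is not stated
here, and this sketch simply places it in a batch of its own.

```python
def batch_queries(lengths, max_total=20_000_000):
    """Group query indices into consecutive batches whose summed length
    does not exceed max_total (a query longer than max_total gets its
    own batch)."""
    batches, current, total = [], [], 0
    for index, length in enumerate(lengths):
        if current and total + length > max_total:
            batches.append(current)
            current, total = [], 0
        current.append(index)
        total += length
    if current:
        batches.append(current)
    return batches


print(batch_queries([5, 5, 5, 20, 1], max_total=10))
```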
162
-----------------------------
164
-P Maximal number of positions for a hash value.
This option provides a very simple type of filtering if it is set to a
non-zero value. Namely, any pattern of length 12 (when the word size is
greater than or equal to 16) or of length 8 (for smaller word sizes) that
appears more than P times in all of the query sequences together is masked
and not included in the search look-up table. If such masking occurs,
megablast shows a warning message on the standard output. This can be useful
when running megablast on very long unmasked sequences, for which the search
might take a very long time if the -P option is not set.
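The counting rule behind -P can be sketched as follows. This is a minimal
illustration under stated assumptions: patterns are counted as plain
substrings on one strand, and the function name is hypothetical.

```python
from collections import Counter


def overrepresented_patterns(queries, word_size, max_positions):
    """Return the patterns that occur more than max_positions (P) times
    in all query sequences together. P = 0 disables the filter."""
    if max_positions <= 0:
        return set()
    # Pattern length 12 for word sizes >= 16, otherwise 8.
    k = 12 if word_size >= 16 else 8
    counts = Counter(
        seq[i:i + k]
        for seq in queries
        for i in range(len(seq) - k + 1)
    )
    return {pattern for pattern, n in counts.items() if n > max_positions}


queries = ["A" * 10, "ACGTACGTAC"]
print(overrepresented_patterns(queries, word_size=12, max_positions=2))
```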
175
-----------------------------
177
-O ASN.1 Seqalign file.
179
This option specifies a file name for writing ASN.1 output. It is only
180
meaningful in conjunction with -D 2. The ASN.1 will consist of separate
181
ASN.1 codes for each query sequence:
184
All hits for first query
187
All hits for second query
191
-----------------------------
193
-s Minimal hit score to report.
By default this value is set to W, where W is the word size (-W option), so
it is effectively ignored (since all found alignments are extended from an
exact match of length at least W).
199
-----------------------------
201
-Q Masked query output.
All regions of the query sequences that were hit by any found alignment are
masked with N's. The output is written to the file specified by the -Q
option. It can be used only in conjunction with -D 2.
207
-----------------------------
209
-f Show full IDs in the output.
211
By default, for -D 0 and -D 1 outputs, the sequence IDs are reported as GIs
212
or accession numbers (if GIs are not available). If -f is set to T, full IDs
213
will be shown, unless -J option is set to F. In the latter case full deflines
214
will be shown for the query sequences.
216
-----------------------------
218
-U Use lower case filtering of FASTA sequences.
As in the blastall binary, this option allows lower case in the query
sequences to be treated as masked residues. The default for this option is
FALSE, in which case lower case is treated identically to upper case.
224
-----------------------------
226
-p Cutoff by percentage of identity
Alignments with identity percentage below the value of this option are not
reported in any output format except -D 0 (with the latter the traceback is
not performed, so it is impossible to calculate the percentage of identical
nucleotides).
233
-----------------------------
235
-t Discontiguous word template length.
237
If this is not zero, the discontiguous word approach is used. The supported
template lengths are 16, 18, and 21. The word size (-W parameter) must be 11 or
12.
241
-N Discontiguous template type: coding (0), non-coding (1), or both (2).
242
For each of the three template lengths, two discontiguous templates are
243
supported. One of them, called coding, is based on the '110' pattern, the other
244
is optimal, or close to optimal, based on the hit probability simulations for
The exact templates are:
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
If the 'both' option is chosen, all initial matches satisfying either of the
two template types are extended.
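To show how a discontiguous template is applied, the sketch below checks
whether two windows of length t agree at every position marked 1 in the
template. Reading 1 as "must match" and 0 as "ignored" is the usual
interpretation of such templates and is an assumption here; the function names
are hypothetical, and only the t = 16 templates from the list above are
included to keep the example short.

```python
TEMPLATES = {
    (11, 16, "coding"):     "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"):     "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
}


def template_hit(a: str, b: str, template: str) -> bool:
    """True if a and b agree at every '1' position of the template."""
    if not (len(a) == len(b) == len(template)):
        raise ValueError("windows and template must have the same length")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


def initial_match(a: str, b: str, word_size: int, length: int, kind: str) -> bool:
    """kind is 'coding', 'non-coding' or 'both' (either template may match)."""
    kinds = ("coding", "non-coding") if kind == "both" else (kind,)
    return any(template_hit(a, b, TEMPLATES[(word_size, length, k)]) for k in kinds)


# The number of 1s in each template equals the word size W.
for (w, t, kind), template in TEMPLATES.items():
    print(w, t, kind, template.count("1"), len(template))

a = "ACGTACGTACGTACGT"
b = "ACTTACGTACGTACGT"  # differs from a only at the third base
print(initial_match(a, b, 11, 16, "coding"))      # third position is 0: ignored
print(initial_match(a, b, 11, 16, "non-coding"))  # third position is 1: must match
```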
264
-----------------------------
266
-g Generate words for every base of the database.
In both blastn and traditional megablast, the database sequences are
compressed 4:1, and words are looked up only at the beginning of each byte,
i.e. at every 4th base. With this option, words are looked up starting at
every base of the database sequence.
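The difference between the two lookup modes comes down to the stride over the
database sequence. The sketch below lists the word start offsets in each mode;
it illustrates the description above rather than the actual implementation,
and the function name is hypothetical.

```python
def lookup_offsets(subject_length: int, word_size: int, every_base: bool = False):
    """Start offsets (0-based) at which words are looked up in a database
    sequence: every 4th base by default, every base with -g set."""
    step = 1 if every_base else 4
    return range(0, subject_length - word_size + 1, step)


print(list(lookup_offsets(20, 12)))                   # byte boundaries only
print(list(lookup_offsets(20, 12, every_base=True)))  # every base
```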
273
-----------------------------
275
-H Maximal number of HSPs to save per database sequence.
277
-----------------------------
281
<p>[1] Zhang Z., Schwartz S., Wagner L., & Miller W. (2000),
282
"A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.
283
<a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&list_uids=10890397&dopt=Citation">