<TITLE> Discontiguous Mega BLAST </TITLE>
What is discontiguous Mega BLAST?

This version of Mega BLAST is designed specifically for comparing diverged
sequences, especially sequences from different organisms, whose alignments
have a low degree of identity and for which the original Mega BLAST is not very
effective. The major difference is the use of the 'discontiguous word'
approach to finding initial offset pairs, from which the gapped extension is
performed.

Both Mega BLAST and all previous versions of nucleotide-nucleotide BLAST look
for exact matches of a certain length as the starting points for gapped
alignments. When comparing less conserved sequences, i.e. when the expected
identity between them is about 80% or below, this traditional approach
becomes much less productive than it is for more highly conserved sequences.
Depending on the length of the exact match used to start the alignments, it
either misses many statistically significant alignments or, on the contrary,
finds too many short random alignments.

According to [1], as well as our own probability simulations, if the initial
'words' are based not on an exact match but on a match at a certain set of
nonconsecutive positions within longer segments of the sequences, the
productivity of the word-finding algorithm is much higher. Fewer words are
found overall, but more of them end up producing statistically significant
alignments than with contiguous words whose length is equal to, or even
shorter than, the number of matched positions in the discontiguous word.

As an example, we can define a pattern (template) of 0s and 1s of length e.g.
21: 100101100101100101101. For each pair of offsets in the query and subject
sequences being compared, we compare the 21-nucleotide segments of these
sequences ending at these offsets, and require a match only at those positions
that correspond to the 1s in the template.
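
The template comparison described above can be sketched in a few lines of
Python. This is only an illustration, not the Mega BLAST implementation: the
function name template_hit and the convention that an offset is an exclusive
end index are assumptions made for the example.

```python
# Illustrative sketch only; names and offset convention are assumptions.
TEMPLATE = "100101100101100101101"  # the 21-position template quoted in the text


def template_hit(query: str, subject: str, q_end: int, s_end: int,
                 template: str = TEMPLATE) -> bool:
    """Return True if the segments ending at q_end / s_end (exclusive ends)
    agree at every position where the template has a '1'."""
    t = len(template)
    if q_end < t or s_end < t or q_end > len(query) or s_end > len(subject):
        return False  # the segment would run off one of the sequences
    q_seg = query[q_end - t:q_end]
    s_seg = subject[s_end - t:s_end]
    # Positions under a '0' are ignored; positions under a '1' must be identical.
    return all(q == s for q, s, bit in zip(q_seg, s_seg, template) if bit == "1")
```

With this template, two 21-nucleotide segments that differ at any or all of
the ten '0' positions still count as a word hit, while a single difference
under a '1' rejects the pair.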

This approach has several advantages. First, given the expected percent
identity of the alignments between two sequences, the conditional probability
of finding word hits that satisfy a discontiguous template is higher than for
contiguous words with the same number of positions required to match.
If two word hits are required to initiate a gapped extension, the effect
of the discontiguous word approach is even larger. In both cases higher
sensitivity is achieved because there is less correlation between successive
words as the database sequence is scanned across the query sequence.
Second, when comparing coding sequences, conservation of the third nucleotide
of each codon is not essential, so there is no need to require it when
matching initial words. This favors templates based on the '110' pattern,
which are called 'coding'.
Finally, to achieve even higher sensitivity, one can combine two different
discontiguous word templates and accept an initial word hit at a given
position if either of them matches.
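
The sensitivity argument can be explored with a small Monte Carlo experiment.
The sketch below is not the simulation referred to in the text; it assumes a
simplified model in which every column of an ungapped alignment matches
independently with a fixed probability, and the region length (64), identity
(0.70), trial count and function name are all choices made for this example.

```python
import random


def hit_probability(templates: list[str], identity: float, region: int,
                    trials: int = 4000, seed: int = 0) -> float:
    """Estimate the chance that at least one of `templates` produces a word
    hit somewhere in an ungapped alignment of `region` columns, assuming each
    column matches independently with probability `identity`."""
    rng = random.Random(seed)
    masks = [[i for i, bit in enumerate(t) if bit == "1"] for t in templates]
    spans = [len(t) for t in templates]
    found = 0
    for _ in range(trials):
        matches = [rng.random() < identity for _ in range(region)]
        # A hit needs a match at every '1' position of some template placement.
        if any(all(matches[start + i] for i in ones)
               for ones, span in zip(masks, spans)
               for start in range(region - span + 1)):
            found += 1
    return found / trials


contiguous = hit_probability(["1" * 11], 0.70, 64)
discontiguous = hit_probability(["1110010110110111"], 0.70, 64)
combined = hit_probability(["1110010110110111", "1101101101101101"], 0.70, 64)
```

All three calls require 11 matching positions per word. Because the same seed
is used, they are evaluated on identical simulated alignments, so the
estimates can be compared directly; the last call illustrates accepting a hit
from either of two templates.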

The following options specific to this approach are supported:

Template length: 16, 18, 21.
Word size (i.e. number of 1s in the template): 11, 12.
Template type: coding, non-coding.
Require two words for extension: yes/no.

The 'coding' templates are based on the 110 pattern, although most of them
need more 0s than that pattern provides, so some of the 110 triplets become
010 or 100. These templates are the most effective for comparing coding
regions.

The non-coding templates attempt to minimize the correlation between successive
words when the database sequence is shifted by 4 positions against the query
sequence. This means more 1s are concentrated at the ends of the template (at
least 3 on each side).
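
One simple way to look at this correlation is to count how many required
positions a template shares with a shifted copy of itself: the more positions
two overlapping words share, the more their hits depend on each other. The
helper below is an illustrative measure chosen for this example, not
necessarily the criterion used to design the templates.

```python
def shared_positions(template: str, shift: int) -> int:
    """Count positions that are '1' both in `template` and in a copy of it
    shifted by `shift` positions."""
    return sum(1 for i in range(len(template) - shift)
               if template[i] == "1" and template[i + shift] == "1")


# Compare a contiguous 11-mer with the W = 11, t = 16 non-coding template.
for shift in range(1, 5):
    print(shift,
          shared_positions("1" * 11, shift),
          shared_positions("1110010110110111", shift))
```

For every shift from 1 to 4 the contiguous word shares more required positions
with its shifted copy than the non-coding template does.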

When the option to require two words for extension is chosen, two word hits
matching the template must be found within a distance of 50 nucleotides of one
another.
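
A minimal sketch of such a two-hit filter follows. It assumes that the two
word hits must lie on the same diagonal (subject offset minus query offset),
as in other two-hit BLAST heuristics; the text above does not state this, and
the function name and data layout are hypothetical.

```python
from collections import defaultdict

WINDOW = 50  # maximum distance between the two word hits, as stated in the text


def two_hit_seeds(hits, window: int = WINDOW):
    """Given single word hits as (query_offset, subject_offset) pairs, return
    the hits that follow another hit on the same diagonal within `window`.
    The same-diagonal requirement is an assumption of this sketch."""
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)
    seeds = []
    for diag, offsets in by_diagonal.items():
        offsets.sort()
        for prev, cur in zip(offsets, offsets[1:]):
            if 0 < cur - prev <= window:
                seeds.append((cur, cur + diag))  # second hit triggers extension
    return seeds
```

A lone hit on a diagonal, or two hits further apart than the window, would not
start a gapped extension under this scheme.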

Below are the exact discontiguous word template patterns for the different
combinations of word size (W) and template length (t):

W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
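
For scripting purposes the table can be held in a lookup structure and checked
for internal consistency. The dictionary layout and function name below are
illustrative choices; only the patterns themselves come from the table.

```python
# Keys are (word size W, template length t, template type); values are the
# patterns from the table above.
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
    (11, 18, "coding"): "101101100101101101",
    (11, 18, "non-coding"): "111010010110010111",
    (12, 18, "coding"): "101101101101101101",
    (12, 18, "non-coding"): "111010110010110111",
    (11, 21, "coding"): "100101100101100101101",
    (11, 21, "non-coding"): "111010010100010010111",
    (12, 21, "coding"): "100101101101100101101",
    (12, 21, "non-coding"): "111010010110010010111",
}


def check_templates(templates=TEMPLATES) -> bool:
    """Verify that each pattern has length t and contains exactly W ones."""
    for (word_size, length, _kind), pattern in templates.items():
        if set(pattern) - {"0", "1"}:
            return False
        if len(pattern) != length or pattern.count("1") != word_size:
            return False
    return True
```

Calling check_templates() confirms that every pattern has the stated length
and the stated number of 1s.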

[1] Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive
homology search", Bioinformatics 2002 Mar;18(3):440-5.