<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
RPS Blast: Reversed Position Specific Blast

RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database
of profiles. This is the opposite of PSI-BLAST, which searches a profile
against a database of sequences, hence the 'Reverse'. RPS-BLAST
uses a BLAST-like algorithm, finding single- or double-word hits
and then performing an ungapped extension on these candidate matches.
If a sufficiently high-scoring ungapped alignment is produced, a gapped
extension is performed and those (gapped) alignments with a sufficiently
low expect value are reported. This procedure contrasts with IMPALA,
which performs a Smith-Waterman calculation between the query and
each profile, rather than using a word-hit approach to identify
matches that should be extended.
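
The word-hit and ungapped-extension stages described above can be illustrated with a toy sketch. This is not the RPS-BLAST implementation: the profile layout (a list of per-position score dictionaries), the names word_score, find_word_hits and extend_ungapped, the word size, the threshold, the mismatch score and the X-drop cutoff are all assumptions chosen for illustration. Only single-word hits are shown, and the gapped stage is omitted.

```python
MISMATCH = -1  # assumed score for residues a toy profile position does not list


def word_score(profile, start, word):
    """Sum the profile scores of `word` placed at profile position `start`."""
    return sum(profile[start + k].get(ch, MISMATCH) for k, ch in enumerate(word))


def find_word_hits(query, profile, word_size=3, threshold=6):
    """Return (query_pos, profile_pos) pairs whose word score reaches the threshold."""
    hits = []
    for q in range(len(query) - word_size + 1):
        word = query[q:q + word_size]
        for p in range(len(profile) - word_size + 1):
            if word_score(profile, p, word) >= threshold:
                hits.append((q, p))
    return hits


def extend_ungapped(query, profile, q, p, word_size=3, x_drop=4):
    """Extend a hit along its diagonal in both directions without gaps.

    Extension in a direction stops once the running score falls more than
    `x_drop` below the best score seen. Returns (best_score, start, end),
    where query[start:end] is the aligned query segment.
    """
    score = best = word_score(profile, p, query[q:q + word_size])
    start, end = q, q + word_size

    # Extend to the right of the word.
    i, j = q + word_size, p + word_size
    while i < len(query) and j < len(profile):
        score += profile[j].get(query[i], MISMATCH)
        if score > best:
            best, end = score, i + 1
        elif best - score > x_drop:
            break
        i, j = i + 1, j + 1

    # Extend to the left, starting again from the best score so far.
    score = best
    i, j = q - 1, p - 1
    while i >= 0 and j >= 0:
        score += profile[j].get(query[i], MISMATCH)
        if score > best:
            best, start = score, i
        elif best - score > x_drop:
            break
        i, j = i - 1, j - 1

    return best, start, end


# Toy profile: each position rewards one residue with +3.
toy_profile = [{ch: 3} for ch in "MKVLA"]
toy_query = "GGMKVLAGG"

hits = find_word_hits(toy_query, toy_profile)
print(hits)
print(extend_ungapped(toy_query, toy_profile, *hits[0]))
```

In a real search only ungapped alignments that score highly enough would go on to the gapped extension; the cutoffs used here are placeholders.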

RPS-BLAST uses a BLAST database, but also has some other files that
contain a precomputed lookup table for the profiles to allow the search
to proceed faster. Unfortunately it was not possible to make this
lookup table architecture independent (like the BLAST databases themselves),
so one cannot take an RPS-BLAST database prepared on a big-endian
system (e.g., Solaris SPARC) and run it on a little-endian system
(e.g., NT). The RPS-BLAST database must be prepared again for the little-endian
system.
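
As a small illustration of the byte-order constraint, the sketch below compares the byte order of the machine that built the lookup table with that of the current host. It assumes you have recorded the build machine's byte order yourself ("big" or "little"); nothing here reads it from the RPS-BLAST files, and the function name needs_rebuild is hypothetical.

```python
import sys


def needs_rebuild(built_on_byteorder, host_byteorder=sys.byteorder):
    """Return True when the lookup table was built on a machine whose byte
    order differs from the host's, so the database must be prepared again."""
    for value in (built_on_byteorder, host_byteorder):
        if value not in ("big", "little"):
            raise ValueError(f"unknown byte order: {value!r}")
    return built_on_byteorder != host_byteorder


# Example: a database prepared on a big-endian machine, checked on this host.
print(sys.byteorder, needs_rebuild("big"))
```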

The CD-Search databases for RPS-BLAST can be found at:

ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

It is necessary to untar the archive and run copymat and formatdb.
It is not necessary to run makemat on the databases from this
directory.

RPS-BLAST was coded by Sergei Shavirin with some help from Tom Madden.
RPS-BLAST reuses some of the IMPALA code for precomputing the lookup tables
and all of the IMPALA code for evaluating the statistical significance of a match.

1. Binary files used in RPS Blast:

The following binary files are used to set up and run RPS Blast:

makemat : primary profile preprocessor
(converts a collection of binary profiles, created by the -C option
of PSI-BLAST, into portable ASCII form);

copymat : secondary profile preprocessor
(converts ASCII matrices, produced by the primary preprocessor,
into a database that can be read into memory quickly);

formatdb : general BLAST database formatter;

rpsblast : search program (searches a database of score
matrices, prepared by copymat, producing BLAST-like output).

2. Conversion of profiles into searchable database

*Note*: if you are starting with *.mtx files obtained from the NCBI FTP site or
another source, you should skip the steps listed in 2.1.

2.1. Primary preprocessing

Prepare the following files:

i. a collection of PSI-BLAST-generated profiles with arbitrary
names and suffix .chk;

ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a 3-character
suffix starting with c;
the sequences can have deflines; they need not be sequences in nr or
in any other sequence database; if the sequences have deflines, then
the deflines must be unique;

iii. a list of profile file names, one per line, named
<database_name>.pn;

iv. a list of master sequence file names, one per line, in the same
order as the list of profile names, named
<database_name>.sn.
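
The two list files can be written together so that their order stays in step. The sketch below is one way to do that; the function name write_profile_lists and the example file names (globin.chk, globin.csq and so on) are hypothetical, and the suffix checks simply restate the naming rules above.

```python
import tempfile
from pathlib import Path


def write_profile_lists(db_name, pairs, out_dir="."):
    """Write <db_name>.pn and <db_name>.sn, keeping both lists in the same order.

    `pairs` is a sequence of (profile_file, master_sequence_file) names.
    """
    pairs = list(pairs)
    for profile_file, sequence_file in pairs:
        if not profile_file.endswith(".chk"):
            raise ValueError(f"profile file must end in .chk: {profile_file}")
        suffix = Path(sequence_file).suffix  # includes the dot, e.g. ".csq"
        if len(suffix) != 4 or suffix[1] != "c":
            raise ValueError(
                "master sequence suffix must be 3 characters starting with c: "
                + sequence_file
            )
    out_dir = Path(out_dir)
    pn_path = out_dir / f"{db_name}.pn"
    sn_path = out_dir / f"{db_name}.sn"
    # One name per line, profiles and master sequences in matching order.
    pn_path.write_text("".join(f"{p}\n" for p, _ in pairs))
    sn_path.write_text("".join(f"{s}\n" for _, s in pairs))
    return pn_path, sn_path


# Hypothetical file names, written to a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    pn, sn = write_profile_lists(
        "mydb", [("globin.chk", "globin.csq"), ("kinase.chk", "kinase.csq")], tmp
    )
    print(pn.read_text(), sn.read_text(), sep="")
```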

The following files will be created:

a. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx;

b. a list of ASCII matrix files, named
<database_name>.mn;

c. an ASCII file with auxiliary information, named
<database_name>.aux.

Arguments to makemat:

-P database name (required)
-G Cost to open a gap (optional)
-E Cost to extend a gap (optional)
-U Underlying amino acid scoring matrix (optional)
-d Underlying sequence database used to create profiles (optional)
-z Effective size of sequence database given by -d
   default = current size of -d option
   Note: It may make sense to use -z without -d when the
   profiles were created with an older, smaller version of an
   underlying sequence database.
-S Scaling factor for matrix outputs to avoid round-off problems
   default = PRO_DEFAULT_SCALING_UP (currently defined as 100)
   Use 1.0 to have no scaling
   Output scores will be scaled back down to a unit scale to make
   them look more like BLAST scores, but we found that working with a larger
   scale helps with round-off problems.
-H get help (overrides all other arguments)

Note: It is not enforced that the values of -G and -E passed to makemat
were actually used in making the checkpoints. However, the values fed
in to makemat are propagated to copymat and rpsblast.

ATTENTION: It is strongly recommended to use -S 1; the scaling factor
should be set to 1 for rpsblast at this point in time.
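
The round-off effect that the -S option addresses can be seen in a generic arithmetic sketch. It is not makemat's actual computation: the function rounded_sum and the example scores are made up, and it only shows why rounding to integers at a larger scale loses less information than rounding at unit scale.

```python
def rounded_sum(scores, scale):
    """Round each score to an integer at the given scale, sum the integers,
    then scale the total back down to unit scale."""
    return sum(round(s * scale) for s in scores) / scale


scores = [0.4, 0.4, 0.4]  # made-up fractional scores
print(rounded_sum(scores, 1))    # every score rounds to 0 at unit scale
print(rounded_sum(scores, 100))  # rounding at scale 100 keeps the fractions
```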

2.2. Secondary preprocessing

Prepare the following files:

i. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx
(created by makemat);

ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a 3-character
suffix starting with c;

iii. a list of ASCII matrix files, named
<database_name>.mn
(created by makemat);

iv. a list of master sequence file names, one per
line, in the same order as the list of matrix names, named
<database_name>.sn;

v. an ASCII file with auxiliary information, named
<database_name>.aux
(created by makemat).

The files input to copymat are in ASCII format and thus portable
between machines with different encodings for machine-readable files.

The following files will be created:

a. a huge binary file, containing all profile matrices, named
<database_name>.rps;
b. a huge binary file, containing the lookup table for the BLAST search
corresponding to the matrices, named <database_name>.loo;
c. a file containing the concatenation of all FASTA "profile master sequences",
named <database_name> (without extension).
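
A quick way to confirm that secondary preprocessing produced the three files listed above is to check for them by name. This is a convenience sketch only; the helper names and the "mydb" database name are hypothetical, and the check looks at file names, not file contents.

```python
import tempfile
from pathlib import Path


def expected_copymat_outputs(db_name):
    """File names listed above as copymat outputs for a given database name."""
    return [f"{db_name}.rps", f"{db_name}.loo", db_name]


def missing_copymat_outputs(db_name, directory="."):
    """Return the expected output files that are absent from `directory`."""
    directory = Path(directory)
    return [
        name
        for name in expected_copymat_outputs(db_name)
        if not (directory / name).is_file()
    ]


# Demonstration in a temporary directory holding only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mydb.rps").write_bytes(b"")
    print(missing_copymat_outputs("mydb", tmp))  # ['mydb.loo', 'mydb']
```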

Arguments to copymat:

-P database name (required)
-H get help (overrides all other arguments)
-r format data for RPS Blast

ATTENTION: the "-r" parameter has to be set to TRUE to format data for
RPS Blast at this step.

NOTE: copymat requires a fair amount of memory as it first constructs
the lookup table in memory before writing it to disk. Users have
found that they require a machine with at least 500 MB of memory for this
task.

2.3. Creating a BLAST database from the <database_name> file containing
all "profile master sequences"

The "formatdb" program should be run to create a regular BLAST database of all
"profile master sequences":

formatdb -i <database_name> -o T
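
The preparation steps in section 2 can be summarised as an ordered list of command lines. The sketch below only builds the argument lists and does not run anything. The function name preparation_commands is hypothetical, and writing the TRUE value of copymat's -r option as "T" is an assumption modelled on the "-o T" in the formatdb command above.

```python
def preparation_commands(db_name, start_from_mtx=False):
    """Return the command lines for preparing an RPS-BLAST database, in order.

    When starting from existing *.mtx files, the makemat step (2.1) is skipped.
    """
    commands = []
    if not start_from_mtx:
        # Primary preprocessing; -S 1 follows the recommendation in the text.
        commands.append(["makemat", "-P", db_name, "-S", "1"])
    # Secondary preprocessing; "-r T" is an assumed spelling of TRUE.
    commands.append(["copymat", "-P", db_name, "-r", "T"])
    # Regular BLAST database of the profile master sequences.
    commands.append(["formatdb", "-i", db_name, "-o", "T"])
    return commands


for command in preparation_commands("mydb", start_from_mtx=True):
    print(" ".join(command))
```

Each list could be handed to subprocess.run(command, check=True) once the binaries are on the PATH.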

Arguments to RPS Blast

-i query sequence file (required)
-d RPS BLAST Database [File In]
-p query sequence is protein (if FALSE, a 6-frame translation will be
   conducted, as in the blastx program)
-P 0 for multiple hits 1-pass, 1 for single hit 1-pass [Integer]
-o output file (optional)
-e Expectation value threshold (E), (optional, same as for BLAST)
-m alignment view (optional, same as for BLAST)
-z effective length of database (optional)
   -1 = length given via -z option to makemat
   default (0) implies length is actual length of profile library
   adjusted for end effects
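
A search invocation can be assembled from the options listed above. This sketch only builds the argument list; the function name rpsblast_command, the file names, and the "T"/"F" spelling of the -p value are assumptions, not taken from the rpsblast documentation.

```python
def rpsblast_command(query_file, database, protein_query=True,
                     evalue=None, output_file=None):
    """Build an rpsblast argument list from the options described above."""
    command = [
        "rpsblast",
        "-i", query_file,
        "-d", database,
        "-p", "T" if protein_query else "F",  # assumed boolean spelling
    ]
    if evalue is not None:
        command += ["-e", str(evalue)]
    if output_file is not None:
        command += ["-o", output_file]
    return command


# Hypothetical file names.
print(" ".join(rpsblast_command("query.fa", "mydb", evalue=0.01,
                                output_file="hits.txt")))
# rpsblast -i query.fa -d mydb -p T -e 0.01 -o hits.txt
```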

A. Documentation of the .mtx file format

Format of the .mtx file:

ka#-* = Karlin/Altschul parameters, block #. There are three blocks,
each containing four floating point numbers on separate lines.
pX-Y = The position-specific scores as integers.

The first element of this file format is [L]. This is the sequence
length. The second line contains the sequence itself, in NCBI AA
notation. After this, there are three KA blocks (four lines of
floating point numbers each), then the positional scores.

The positional scores are arranged in a grid. Each line contains 26
elements, corresponding to the 26 elements in the NCBI AA encoding,
and there are L lines, where L is the previously mentioned sequence
length.

Using the symbols mentioned above, it looks something like this:

<p1-1> <p1-2> <p1-3> ... <p1-26>
<p2-1> <p2-2> <p2-3> ... <p2-26>
...
<pL-1> <pL-2> <pL-3> ... <pL-26>

One can find the explanation for the three blocks of KA parameters in
makemat's source code, lines 188-190:

putMatrixKbp(checkFile, compactSearch->kbp_gap_std[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, compactSearch->kbp_gap_psi[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, sbp->kbp_ideal, scaleScores, 1/scalingFactor);

Thus, the first KA block holds the standard parameters (kbp_gap_std), the
second holds the PSI-BLAST parameters (kbp_gap_psi), and the third holds the
ideal parameters (kbp_ideal).
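
The layout described in this appendix can be read with a short parser. This is a minimal sketch that follows the prose description only: a length line, a sequence line, twelve floating point lines (three KA blocks of four), then L rows of 26 integers. The function name parse_mtx and the dictionary keys are hypothetical (the block labels are borrowed from the C identifiers above), blank lines are ignored, and the scores are returned as stored, without undoing any -S scaling.

```python
def parse_mtx(text):
    """Parse the contents of a .mtx file as described in this appendix."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    length = int(lines[0])
    sequence = lines[1]
    if len(sequence) != length:
        raise ValueError("sequence does not match the declared length")

    # Three Karlin/Altschul blocks, four values each, one value per line.
    ka_values = [float(value) for value in lines[2:14]]
    if len(ka_values) != 12:
        raise ValueError("expected 12 Karlin/Altschul values")
    ka_blocks = {
        "gap_std": ka_values[0:4],
        "gap_psi": ka_values[4:8],
        "ideal": ka_values[8:12],
    }

    # L rows of 26 integer scores, one row per sequence position.
    rows = [[int(value) for value in line.split()] for line in lines[14:14 + length]]
    if len(rows) != length or any(len(row) != 26 for row in rows):
        raise ValueError("expected L rows of 26 scores")

    return {"length": length, "sequence": sequence,
            "ka_blocks": ka_blocks, "scores": rows}


# A made-up two-residue example, not real makemat output.
sample = "\n".join(
    ["2", "MK"]
    + ["0.5"] * 12
    + [" ".join(["0"] * 26), " ".join(["1"] * 26)]
)
parsed = parse_mtx(sample)
print(parsed["length"], parsed["sequence"], len(parsed["scores"]))  # 2 MK 2
```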