<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
IMPALA: Integrating Matrix Profiles And Local Alignments

1. Files in Distribution

The following IMPALA source code files are distributed:

Run the following commands in the directory containing the IMPALA source code files:

This will result in three binary executable files:
makemat : primary profile preprocessor
(converts a collection of binary profiles, created by the -C option
of PSI-BLAST, into portable ASCII form);

copymat : secondary profile preprocessor
(converts ASCII matrices, produced by the primary preprocessor,
into a database that can be read into memory quickly);

impala : search program (searches a database of score
matrices, prepared by copymat, producing BLAST-like output).
3. Conversion of profiles into a searchable database

3.1. Primary preprocessing

Prepare the following files:

i. a collection of PSI-BLAST-generated profiles with arbitrary
names and suffix .chk;

ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a
3-character suffix starting with c;
the sequences can have deflines; they need not be sequences in nr or
in any other sequence database; if the sequences have deflines, then
the deflines must be unique;

iii. a list of profile file names, one per line, named
<database_name>.pn;

iv. a list of master sequence file names, one per line, in the same
order as the list of profile names, named
<database_name>.sn.
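
The two name lists must stay in the same order, which is easy to get wrong by hand. The following is a minimal Python sketch of one way to generate them, not part of the IMPALA distribution. It assumes (hypothetically) that each master sequence file shares its base name with the profile and uses the suffix .csq; the helper name write_name_lists is also invented for illustration.

```python
from pathlib import Path


def write_name_lists(directory, database_name, seq_suffix=".csq"):
    """Write <database_name>.pn and <database_name>.sn in matching order.

    Assumption: the master sequence for profile X.chk is X<seq_suffix>,
    where seq_suffix is a hypothetical 3-character suffix starting with 'c'.
    """
    directory = Path(directory)
    # Sort the profile names so the order is reproducible.
    profiles = sorted(p.name for p in directory.glob("*.chk"))
    sequences = []
    for name in profiles:
        seq_name = Path(name).with_suffix(seq_suffix).name
        if not (directory / seq_name).is_file():
            raise FileNotFoundError(f"no master sequence {seq_name} for {name}")
        sequences.append(seq_name)
    # One file name per line, same order in both lists.
    (directory / f"{database_name}.pn").write_text(
        "".join(n + "\n" for n in profiles))
    (directory / f"{database_name}.sn").write_text(
        "".join(n + "\n" for n in sequences))
    return profiles, sequences
```

If your master sequence files follow a different naming scheme, only the line that derives seq_name needs to change.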
The following files will be created:

i. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx;

ii. a list of ASCII matrix files, named
<database_name>.mn;

iii. an ASCII file with auxiliary information, named
<database_name>.aux.
-P database name (required)
-G Cost to open a gap (optional)
-E Cost to extend a gap (optional)
-U Underlying amino acid scoring matrix (optional)
-d Underlying sequence database used to create profiles (optional)
-z Effective size of sequence database given by -d
   default = current size of the database given by -d
   Note: It may make sense to use -z without -d when the
   profiles were created with an older, smaller version of an
-S Scaling factor for matrix outputs to avoid round-off problems
   default = PRO_DEFAULT_SCALING_UP (currently defined as 100)
   Use 1.0 to have no scaling
   Output scores will be scaled back down to a unit scale to make
   them look more like BLAST scores, but we found that working with a
   larger scale helps with round-off problems.
-H get help (overrides all other arguments)

Note: makemat does not check that the values of -G and -E passed to it
were actually used in making the checkpoints. However, the values passed
to makemat are propagated to copymat and impala.
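
The round-off argument behind the -S option can be illustrated with a small Python sketch. This is not IMPALA's arithmetic: it assumes, for illustration only, that scores are rounded to integers at the chosen scale, and the per-position scores and the helper name summed_score are hypothetical.

```python
def summed_score(position_scores, scale):
    """Round each score to an integer at the given scale, sum, then scale back."""
    total = sum(round(s * scale) for s in position_scores)
    return total / scale


# Hypothetical per-position scores; the exact sum is 4.0.
scores = [0.4] * 10

unit = summed_score(scores, 1)      # every 0.4 rounds to 0 at unit scale
scaled = summed_score(scores, 100)  # every 0.4 is kept as the integer 40
```

At unit scale the fractional part of every score is lost before summing, while at a scale of 100 it survives and the result is scaled back down only at the end.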
3.2. Secondary preprocessing

Prepare the following files:

i. a collection of ASCII files, corresponding to each of the
original profiles, named
<profile_name>.mtx
(created by makemat);

ii. a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a
3-character suffix starting with c;

iii. a list of ASCII matrix files, named
<database_name>.mn
(created by makemat);

iv. a list of master sequence file names, one per
line, in the same order as the list of matrix names, named
<database_name>.sn;

v. an ASCII file with auxiliary information, named
<database_name>.aux
(created by makemat).

The files input to copymat are in ASCII format and thus portable
between machines with different encodings for machine-readable files.
The following files will be created:

i. a huge binary file, containing all profile matrices, named
<database_name>.mat.

-P database name (required)
-H get help (overrides all other arguments)
Before you start searching, check that you have copies of, or soft
links to, all the files associated with the PSSM library. If the
library has K PSSMs, you should have

K files with names ending in .mtx
K files with names ending in a 3-letter extension starting with c
1 file with name ending in .pn
1 file with name ending in .sn
1 file with name ending in .aux
1 file with name ending in .mn
1 file with name ending in .mat
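
The checklist above can be automated. The following Python sketch is not part of IMPALA; it assumes that a single library lives alone in one directory, and it excludes .chk files from the master sequence count because that suffix also starts with c. The helper name check_library is hypothetical.

```python
from pathlib import Path


def check_library(directory, k):
    """Return a list of problems; an empty list means all counts match."""
    suffixes = [p.suffix for p in Path(directory).iterdir() if p.is_file()]
    problems = []

    def expect(label, found, wanted):
        if found != wanted:
            problems.append(f"{label}: expected {wanted}, found {found}")

    expect(".mtx", suffixes.count(".mtx"), k)
    # Master sequences: a dot plus 3 characters starting with 'c'.
    # Assumption: .chk files are PSI-BLAST profiles, so they are skipped.
    seq_count = sum(
        1 for s in suffixes
        if len(s) == 4 and s.startswith(".c") and s != ".chk"
    )
    expect("3-letter extension starting with c", seq_count, k)
    for ext in (".pn", ".sn", ".aux", ".mn", ".mat"):
        expect(ext, suffixes.count(ext), 1)
    return problems
```

Calling check_library("/path/to/library", K) before a search reports any missing or surplus files as readable messages.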
-i query sequence file (required)
-P database of profiles (required)
-o output file (optional)
-e Expectation value threshold (E) (optional, same as for BLAST)
-m alignment view (optional, same as for BLAST)
-z effective length of database (optional)
   -1 = length given via -z option to makemat
   default (0) implies length is actual length of profile library
   adjusted for end effects
-H get help (overrides all other options)
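
Putting the three programs together, the following Python sketch builds the argument lists for a full run without executing anything. It uses only program names and options listed in this document; the function name pipeline_commands and the example file names are hypothetical, and it assumes the binaries would be found on your PATH if you chose to run them.

```python
def pipeline_commands(database, query, output=None, evalue=None):
    """Return argument lists for makemat, copymat and impala, in run order."""
    makemat = ["makemat", "-P", database]
    copymat = ["copymat", "-P", database]
    impala = ["impala", "-i", query, "-P", database]
    if output is not None:
        impala += ["-o", output]       # optional output file
    if evalue is not None:
        impala += ["-e", str(evalue)]  # optional expectation value threshold
    return [makemat, copymat, impala]


# Hypothetical database and file names, for illustration only.
commands = pipeline_commands("mydb", "query.fa", output="query.out")
```

Each returned list could be passed to subprocess.run as is; makemat and copymat only need to be rerun when the profile collection changes.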
5. Directory convention

Since IMPALA requires a large number of files, it may be convenient
to store your IMPALA files in various directories. For copymat,
makemat, and impala the following parsing convention applies
to the string that follows the -P argument.
If the string starts with a '/', then it is deemed to be a full
path name. Whatever prefix occurs up to and including the rightmost
'/' is deemed to be a prefix that should be prepended to all
file names in the .sn, .pn, and .mn files.

Example: If you call any of the 3 programs including the
argument -P /foo/bar/wolf1187

/foo/bar/ is prepended to every filename listed in
before opening the file, but the files
themselves are not changed.
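
The parsing convention can be expressed as a short Python sketch. This is an illustration of the rule as described, not IMPALA's own code; the function name resolve_names is hypothetical, and returning the names unchanged when the string does not start with '/' is an assumption, since that case is not described here.

```python
def resolve_names(p_argument, listed_names):
    """Apply the -P prefix rule to file names read from a .sn, .pn or .mn file."""
    if not p_argument.startswith("/"):
        # Assumption: without a leading '/', names are used as listed.
        return list(listed_names)
    # Everything up to and including the rightmost '/' is the prefix.
    prefix = p_argument[: p_argument.rfind("/") + 1]
    return [prefix + name for name in listed_names]


# With -P /foo/bar/wolf1187, each listed name is opened under /foo/bar/.
resolved = resolve_names("/foo/bar/wolf1187", ["first.chk", "second.chk"])
```

The list files on disk are left untouched; the prefix is applied only to the names in memory before each file is opened.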
IMPALA output closely mimics the output of the BLASTP family of programs and
should be compatible with SEALS BLAST parsers.

Send suggestions, comments, and complaints only to Alejandro Schaffer
schaffer@helix.nih.gov
Schaffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V.,
Aravind, L., Altschul, S.F., IMPALA: Matching a Protein Sequence
Against a Collection of PSI-BLAST-Constructed Position-Specific
Score Matrices, Bioinformatics, to appear.

Please cite the above paper if you publish any results computed by IMPALA.