1
\documentclass[12pt]{article}
2
\usepackage[a4paper,top=20mm,bottom=20mm,left=20mm,right=20mm]{geometry}
10
\usepackage{optionman}
12
\newcommand{\Substring}[3]{#1[#2..#3]}
14
\newcommand{\Program}[0]{\texttt{tagerator}\xspace}
16
\title{\Program: a program for mapping short sequence tags\\
19
\author{\begin{tabular}{c}
20
\textit{Stefan Kurtz}\\
21
Center for Bioinformatics,\\
28
\section{Preliminary definitions}
29
By \(S\) let us denote the concatenation of all subject sequences.
30
By \(edist(u,v)\) we denote the unit edit distance of two strings
32
Consider a sequence \(p\) of length \(m\). An approximate match of \(p\) with
33
up to \(k\) differences is a substring \(v\) of \(S\) such that
34
\(edist(p,v)\leq k\). An approximate prefix match of \(p\) with \(k\)
35
differences and at most \(t\) occurrences is a substring \(v\) of \(S\) such
36
that \(v\) occurs at most \(t\) times in \(S\) and
37
\(edist(v,\Substring{p}{1}{i})\leq k\) for some \(i\in[1,m]\).
39
\section{The program \Program}
41
The program \Program is called as follows:
43
\noindent\texttt{gt} \Program [\textit{options}] \Showoption{q} \Showoptionarg{files} [\textit{options}]
45
\Showoptionarg{files} is a white space separated list of at least one
46
filename. Any sequence occurring in any file specified in
47
\Showoptionarg{files} is called \textit{short sequence tag} or \textit{tag}
48
for short. In addition to the mandatory option \Showoption{q}, the program
49
must be called with either option \Showoption{pck} or \Showoption{esa},
50
which specify to use a packed index and an enhanced suffix array,
51
respectively. Both indices are constructed from a given set of subject
54
\Program maps each short sequence tag, say \(p\) of length \(m\)
55
against the given index. The length of the tag is limited by the
56
size of a pointer. If \texttt{gt} is a 32-bit binary, then \(m\) must be
57
smaller or equal to 32. If \texttt{gt} is a 64-bit binary, \(m\) must be
58
smaller or equal to 64. The program runs in three basic modes:
61
In the \textit{ms}-mode, it computes for all \(i\in[1,m]\) the length \(\ell\)
62
of the longest prefix of \(\Substring{w}{i}{m}\) matching any substring of
64
In addition, it reports start positions of such a prefix in \(S\).
65
As these values make up the well known matching statistics, we denote
66
it by \textit{ms}. The length value \(\ell\) is the matching statistics length
67
and the position values are the matching statistics positions.
68
Both, the matching statistics length and the matching statistics position
69
is uniquely determined. The matching statistics mode requires to specify
70
a maximum number of occurrences, see option \Showoption{maxocc}.
72
In the \textit{cdiff}-mode, it computes all start positions of approximate
73
matches with up to \(k\) differences in \(S\). For each start position of
74
an approximate match, say \(j\), it reports the minimum integer \(\ell\) such
75
that \(edist(p,\Substring{S}{j}{j+\ell-1})\leq k\). Since this mode matches
76
the complete sequence \(p\), we call this mode \textit{cdiff},
77
for complete difference.
79
In the \textit{pdiff}-mode, it computes all start positions of approximate
80
prefix matches with up to \(k\) differences and at most \(t\) occurrences in
81
\(S\). For each start position of an approximate prefix match, say \(j\), it
82
reports the minimum integers \(i\) and \(\ell\) such that
83
\(edist(\Substring{p}{1}{i},\Substring{S}{j}{j+\ell-1})\leq k\). Since this
84
mode matches a prefix of \(p\) with some differences, we call this mode
85
\textit{pdiff}. This mode requires to specify a maximum number of
86
occurrences, see option \Showoption{maxocc}.
89
The following options are available in \Program:
91
\begin{Justshowoptions}
92
\Option{q}{$\Showoptionarg{files}$}{
93
Specify a white space separated list of query files (in multiple \Fasta format)
94
containing the tags. At least one query file must be given. The files may be in
95
gzipped format, in which case they have to end with the suffix \texttt{.gz}.
98
\Option{esa}{$\Showoptionarg{indexname}$}{
99
Use the given enhanced suffix array index to map the short sequence tags.
102
\Option{pck}{$\Showoptionarg{indexname}$}{
103
Use the packed index (an efficient representation of the FMindex)
104
to map the short sequence tags.
107
\Option{e}{$\Showoptionarg{k}$}{
108
Specify the number of differences allowed. \(k\) must be a non-negative number.
109
\(k=0\) means that no differences are allowed (exact matching) and \(k>0\)
110
means a positive number of differences. If this option is not used, then
111
the program runs in \textit{ms}-mode, i.e.\ it computes the matching statistics
112
for each short sequence tag.
116
Do not compute direct matches, i.e.\ matches on the forward strand. If
117
this option is not used, then matches are computed on the forward strand.}
120
Do not compute palindromic matches, i.e.\ matches on the
121
reverse complemented strand. If
122
this option is not used, then matches are computed on the reverse complemented
125
\Option{maxocc}{$\Showoptionarg{t}$}{
126
Specify the maximum number of occurrences of exact prefix matches (in case of
127
the matching statistics) or approximate prefix matches.
130
\Option{withwildcards}{}{
131
Output matches containing wildcard characters (e.g. N). This option cannot be
132
used in any of the following cases:
135
with option \Showoption{pck},
137
in the \textit{ms}-mode,
139
for all modes with \(k=0\).
144
For each tag, only show matches for the smallest possible distance. That is,
145
if a tag has exact matches in the input index, then only exact matches are
146
shown. If there are no exact matches, but matches with distance 1, then
147
only these are shown. If there are no matches matches with distance 1 (and
148
hence no exact matches), but with distance 2, then only these are shown etc.
151
\Option{output}{$\Showoptionarg{key}_{1}\ldots\Showoptionarg{key}_{q}$}{
152
Use combination of the following keywords to specify output according to
157
keyword & shows the following\\\hline
158
tagnum & show ordinal number of tag\\
159
tagseq & show tag sequence\\
160
dblength & show length of match in database\\
161
dbstartpos& show start position of match in database\\
162
abspos & show absolute value of dbstartpos\\
163
dbsequence& show sequence of match\\
164
strand & show strand\\
165
edist & show edit distance\\
166
tagstartpos& show start position of match in tag (only for
167
\Showoption{maxocc})\\
168
taglength & show length of match in tag (only for option
169
\Showoption{maxocc})\\
170
tagsuffixseq& show suffix tag involved in match (only for option
174
This option only has an effect when used in the cdiff-mode. If in cdiff
175
mode this option is not used, then the output is such as if keywords
176
\texttt{tagnum}, \texttt{tagseq}, \texttt{dblength}, \texttt{dbstartpos},
177
\texttt{strand} were used.
182
\end{Justshowoptions}
183
The following conditions must be satisfied:
186
Option \Showoption{q} is mandatory.
188
Either option \Showoption{pck} or \Showoption{esa} must be used. Both cannot
191
If option \Showoption{e} is not used, then option \Showoption{maxocc}
196
Suppose that in some directory, say \texttt{homo-sapiens}, we have 24 gzipped
197
\Fasta files containing all 24 human chromomsomes. These may have been
199
\url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}.
201
In the first step, we construct the packed index for the entire human genome:
204
gt packedindex mkindex -dna -dir rev -parts 12 -bsize 10 -sprank -locfreq 32
205
-tis -ssp -indexname pck-human -db homo-sapiens/*.gz
208
The program runs for little more than two hours and delivers
209
an index \texttt{human-all} consisting of three files:
213
-rw-r----- 1 kurtz gistaff 37 2008-07-13 17:00 pck-human.al1
214
-rw-r----- 1 kurtz gistaff 2.6G 2008-07-13 19:22 pck-human.bdx
215
-rw-r----- 1 kurtz gistaff 3.3K 2008-07-13 19:22 pck-human.prj
218
Suppose that the compressed file \texttt{Q1.gz} contains short sequence tags
219
in multiple \Fasta format. In a first call we run \Program in \textit{ms}-mode:
221
\EXECUTE{gt tagerator -q Q1.gz -maxocc 10 -pck pck-human | head -n 25}
223
The first line shortly reports the kind of computation performed. The second
224
and third line give the name of the index containing the subject sequences.
225
It is reported whether it is a packed index or an enhanced suffix array. Then,
226
for each tag its length, say \(m\), and the tag \(p\) itself is shown, followed
227
by a block of \(m\) lines each containing one integers. For all \(i\in[1,m]\),
228
the first column in the \(i\)th line is the matching statistics length, say
229
\(l\). This means that \(\Substring{p}{i}{i+l-1}\) is the maximum length prefix
230
of \(\Substring{p}{i}{m}\) that occurs as a substring in the index. The
231
second column gives the strand of the match: a \texttt{+} stands for the
232
forward strand and a \texttt{-} for the reverse strand. If the number of
233
occurrences of \(\Substring{p}{i}{i+l-1}\) is smaller than the maximum
234
occurrence parameter (\(10\) in the above case), then all matching statistics
235
positions are reported in ascending order. If it is larger than the maximum
236
occurrence parameter, no positions are shown.
237
So, for example, the sequence \texttt{gcttgctg} of length 8 starting at tag
238
position 3 is the longest prefix of \texttt{gcttgctgctgca} that occurs as
239
a substring in the index. It occurs at the positions 25583, 88695, 213546,
240
etc. Note that the matching statistics lines for the forward strand are
241
followed by the matching statistics block on the reverse strand. Note however,
242
that the matching statistics positions are always reported with respect to the
245
In a second call, we run \Program in \textit{cdiff}-mode:
247
\EXECUTE{gt tagerator -q Q1.gz -pck pck-human -e 2 | head -n 28}
249
The first line shortly reports the kind of computation performed. The second
250
and third line are as in the previous example. Then, for
251
each tag its length and the tag itself is shown, followed by
252
a block of lines each containing three integers and one of the
253
symbols \texttt{+} or \texttt{-}, denoting the strand of the match.
254
Each such triple of integers reports the length of an approximate match,
255
the subject sequence in which the match occurs, and the relative position
256
of the match in this sequence. For each start position only the shortest
257
length is reported. If there are no
258
approximate matches, then no such line appears. Note that the match always
259
refers to the complete pattern.
261
Our third example concerns the \textit{pdiff}-mode:
263
\EXECUTE{gt tagerator -q Q1.gz -pck pck-human -e 1 -maxocc 5 | head -n 20}
265
The output is similar as in the previous example, except that each match
266
is reported by four integers with a sign between the first three and the
267
last number. The first number reports the length of the
268
approximate match, the second reports the subject sequence number in
269
which the match occur, the third its relative position in the subject
270
sequence. The last number reports the length of the prefix
271
of the tag involved in the approximate match. Note that in some cases,
272
the two length values are different.