.TH SPIDEY 1 2005-01-25 NCBI "NCBI Tools User's Manual"
.SH NAME
spidey \- align mRNA sequences to a genome
.SH SYNOPSIS
.B spidey
[\|\fB\-F\fP\ \fIN\fP\|]
[\|\fB\-L\fP\ \fIN\fP\|]
[\|\fB\-M\fP\ \fIfilename\fP\|]
[\|\fB\-N\fP\ \fIfilename\fP\|]
[\|\fB\-R\fP\ \fIfilename\fP\|]
[\|\fB\-S\fP\ \fIp/m\fP\|]
[\|\fB\-T\fP\ \fIN\fP\|]
[\|\fB\-a\fP\ \fIfilename\fP\|]
[\|\fB\-c\fP\ \fIN\fP\|]
[\|\fB\-e\fP\ \fIX\fP\|]
[\|\fB\-f\fP\ \fIX\fP\|]
[\|\fB\-g\fP\ \fIX\fP\|]
\fB\-i\fP\ \fIfilename\fP
[\|\fB\-k\fP\ \fIfilename\fP\|]
[\|\fB\-l\fP\ \fIN\fP\|]
\fB\-m\fP\ \fIfilename\fP
[\|\fB\-n\fP\ \fIN\fP\|]
[\|\fB\-o\fP\ \fIstr\fP\|]
[\|\fB\-p\fP\ \fIN\fP\|]
[\|\fB\-r\fP\ \fIc/d/m/p/v\fP\|]
[\|\fB\-t\fP\ \fIfilename\fP\|]
.SH DESCRIPTION
\fBspidey\fP is a tool for aligning one or more mRNA sequences to a
given genomic sequence. \fBspidey\fP was written with two main goals
in mind: find good alignments regardless of intron size; and avoid
getting confused by nearby pseudogenes and paralogs. Towards the
first goal, \fBspidey\fP uses BLAST and Dot View (another local
alignment tool) to find its alignments; since these are both local
alignment tools, \fBspidey\fP does not intrinsically favor shorter or
longer introns and has no maximum intron size. To avoid mistakenly
including exons from paralogs and pseudogenes, \fBspidey\fP first
defines windows on the genomic sequence and then performs the
mRNA-to-genomic alignment separately within each window. Because of
the way the windows are constructed, neighboring paralogs or
pseudogenes should be in separate windows and should not be included
in the final spliced alignment.
.SS Initial alignments and construction of genomic windows
\fBspidey\fP takes as input a single genomic sequence and a set of
mRNA accessions or FASTA sequences. All processing is done one mRNA
sequence at a time. The first step for each mRNA sequence is a
high-stringency BLAST against the genomic sequence. The resulting
hits are analyzed to find the genomic windows.
.PP
The BLAST alignments are sorted by score and then assigned into
windows by a recursive function which takes the first alignment and
then goes down the alignment list to find all alignments that are
consistent with the first (same strand of mRNA, both the mRNA and
genomic coordinates are nonoverlapping and linearly consistent). On
subsequent passes, the remaining alignments are examined and are put
into their own nonoverlapping, consistent windows, until no alignments
are left. Depending on how many gene models are desired, the
top \fIn\fP windows are chosen to go on to the next step and the others
are discarded.
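.PP
The window construction described above can be sketched in Python as
follows. This is only an illustration, not the code \fBspidey\fP
itself runs: the \fBHit\fP record and its field names are
hypothetical, each candidate is tested against every hit already in
the window (an assumption; the text only mentions the first), a loop
stands in for the recursion, and windows are ranked simply by the
order in which they were seeded.
.PP
.nf
```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit; all field names are hypothetical."""
    strand: str   # "plus" or "minus" strand of the mRNA
    m_start: int  # mRNA interval, half-open
    m_end: int
    g_start: int  # genomic interval, half-open
    g_end: int
    score: float


def consistent(a: Hit, b: Hit) -> bool:
    """Same strand, no overlap, and the same order on mRNA and genome."""
    if a.strand != b.strand:
        return False
    first, second = sorted((a, b), key=lambda h: h.m_start)
    if first.m_end > second.m_start:
        return False  # mRNA coordinates overlap
    if a.strand == "plus":
        return first.g_end <= second.g_start
    return second.g_end <= first.g_start  # minus strand runs backwards


def build_windows(hits: List[Hit], n: int) -> List[List[Hit]]:
    """Seed a window with the best remaining hit, collect the hits
    consistent with it, then repeat on whatever is left over."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window = [remaining[0]]
        leftover = []
        for hit in remaining[1:]:
            if all(consistent(hit, member) for member in window):
                window.append(hit)
            else:
                leftover.append(hit)
        windows.append(window)
        remaining = leftover
    return windows[:n]
```
.fi
.PP
With two copies of a two-exon gene on the same genomic sequence, the
hits from each copy end up in separate windows, and asking for one
gene model keeps only the window seeded by the best-scoring hit.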
.SS Aligning in each window
Once the genomic windows are constructed, the initial BLAST alignments
are freed and another BLAST search is performed, this time with the
entire mRNA against the genomic region defined by the window, and at a
lower stringency than the initial search. \fBspidey\fP then uses a
greedy algorithm to generate a high-scoring, nonoverlapping subset of
the alignments from the second BLAST search. This consistent set is
analyzed carefully to make sure that the entire mRNA sequence is
covered by the alignments. When gaps are found between the
alignments, the appropriate region of genomic sequence is searched
against the missing mRNA, first using a very low-stringency BLAST and,
if the BLAST fails to find a hit, using DotView functions to locate
the alignment. When gaps are found at the ends of the alignments, the
BLAST and DotView searches are actually allowed to extend past the
boundaries of the window. If the 3' end of the mRNA does not align
completely, it is first examined for the presence of a poly(A) tail.
No attempt is made to align the portion of the mRNA that seems to be a
poly(A) tail; sometimes there is a poly(A) tail that does align to the
genomic sequence, and these are noted because they indicate the
possibility of a pseudogene.
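.PP
As a rough illustration of the greedy selection and the coverage check
described above, the following Python sketch keeps alignments
best-first and then lists the stretches of mRNA left uncovered. The
tuple layout and function names are hypothetical, only mRNA
coordinates are compared (genomic consistency is ignored for brevity),
and the real program may differ in its details.
.PP
.nf
```python
from typing import List, Tuple

# (mrna_start, mrna_end, score), half-open mRNA coordinates;
# a hypothetical layout chosen for this sketch
Alignment = Tuple[int, int, float]


def greedy_subset(alignments: List[Alignment]) -> List[Alignment]:
    """Take alignments best-first, skipping any that overlap one already kept."""
    kept: List[Alignment] = []
    by_score = sorted(alignments, key=lambda a: a[2], reverse=True)
    for start, end, score in by_score:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


def uncovered(kept: List[Alignment], mrna_len: int) -> List[Tuple[int, int]]:
    """Return the mRNA intervals that no kept alignment covers."""
    gaps = []
    pos = 0
    for start, end, _ in sorted(kept):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < mrna_len:
        gaps.append((pos, mrna_len))
    return gaps
```
.fi
.PP
Each interval returned by \fBuncovered\fP corresponds to a region that
would be searched again at lower stringency; a final interval reaching
the 3' end is the one that would first be checked for a poly(A) tail.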
.PP
Now that the mRNA is completely covered by the set of alignments, the
boundaries of the alignments (there should be one alignment per exon
now) are adjusted so that the alignments abut each other precisely and
so that they are adjacent to good splice donor and acceptor sites.
Most commonly, two adjacent exons' alignments overlap by as much as 20
or 30 base pairs on the mRNA sequence. The true exon boundary may lie
anywhere within this overlap, or (as we have seen empirically) even a
few base pairs outside the overlap. To position the exon boundaries,
the overlap plus a few base pairs on each side is examined for splice
donor sites, using functions that have different splice matrices
depending on the organism chosen. The top few splice donor sites (by
score) are then evaluated as to how much they affect the original
alignment boundaries. The site that affects the boundaries the least
is chosen, and is evaluated as to the presence of an acceptor site.
The alignments are truncated or extended as necessary so that they
terminate at the splice donor site and so that they do not overlap.
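.PP
The choice among candidate donor sites can be pictured with the
following Python sketch. It is a guess at the shape of the procedure,
not the actual implementation: the candidate list, the
\fBtop_k\fP parameter, the way disturbance is measured (total number
of bases by which the two alignment ends would move), and the use of
the donor score to break ties are all assumptions.
.PP
.nf
```python
from typing import List, Tuple


def choose_boundary(candidates: List[Tuple[int, float]],
                    left_end: int, right_start: int, top_k: int = 3) -> int:
    """Pick an exon boundary from (mRNA position, donor score) candidates.

    left_end is where the upstream alignment currently ends and
    right_start is where the downstream one currently starts, so
    left_end > right_start when the two alignments overlap.
    """
    # keep only the top few candidates by donor score
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

    def disturbance(candidate: Tuple[int, float]) -> int:
        pos = candidate[0]
        # bases trimmed from or added to each of the two alignments
        return abs(left_end - pos) + abs(right_start - pos)

    # least disturbance wins; a higher donor score breaks ties
    return min(best, key=lambda c: (disturbance(c), -c[1]))[0]
```
.fi
.PP
Under this measure every position inside the overlap disturbs the
alignments equally, so the score decides among them, while positions
outside the overlap are penalized by their distance from it.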
.PP
The windows are examined carefully to get the percent identity per
exon, the number of gaps per exon, the overall percent identity, the
percent coverage of the mRNA, presence of an aligning or non-aligning
poly(A) tail, number of splice donor sites and the presence or absence
of splice donor and acceptor sites for each exon, and the occurrence
of an mRNA that has a 5' or 3' end (or both) that does not align to
the genomic sequence. If the overall percent identity and percent
length coverage are above the user-defined cutoffs, a summary report
is printed, and, if requested, a text alignment showing identities and
mismatches is also printed.
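.PP
The two cutoff checks amount to simple arithmetic, sketched here in
Python. The per-exon tuple layout and the function names are
hypothetical, and whether a value exactly equal to a cutoff passes is
an assumption of this sketch.
.PP
.nf
```python
from typing import List, Tuple

# (aligned mRNA bases, identical bases) per exon; a hypothetical layout
Exon = Tuple[int, int]


def summarize(exons: List[Exon], mrna_len: int) -> Tuple[float, float]:
    """Return (overall percent identity, percent coverage of the mRNA)."""
    aligned = sum(length for length, _ in exons)
    matches = sum(ident for _, ident in exons)
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / mrna_len if mrna_len else 0.0
    return identity, coverage


def passes(exons: List[Exon], mrna_len: int,
           min_identity: float, min_coverage: float) -> bool:
    """True when both values reach their cutoffs."""
    identity, coverage = summarize(exons, mrna_len)
    return identity >= min_identity and coverage >= min_coverage
```
.fi
.PP
For example, two exons covering 300 of 400 mRNA bases with 294
identical bases give 98 percent identity and 75 percent coverage, so
the model is reported with a coverage cutoff of 70 but not of 80.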
.SS Interspecies alignments
\fBspidey\fP is capable of performing interspecies alignments. The
major difference in interspecies alignments is that the mRNA-genomic
identity will not be close to 100% as it is in intraspecies
alignments; also, the alignments have numerous and lengthy gaps. If
\fBspidey\fP is used in its normal mode to do interspecies alignments,
it produces gene models with a large number of short exons. When the
interspecies flag is set, \fBspidey\fP uses different BLAST parameters
to encourage more and longer gaps and to penalize mismatches less
heavily. This way, the alignments for the exons are much longer
and more closely approximate the actual gene structure.
.SS Extracting CDS alignments
When \fBspidey\fP is run in network-aware mode or when ASN.1 files are
used for the mRNA records, it is capable of extracting a CDS alignment
from an mRNA alignment and printing the CDS information also. Since
the CDS alignment is just a subset of the mRNA alignment, it is
relatively straightforward to truncate the exon alignments as
necessary and to generate a CDS alignment. Furthermore, the
untranslated regions are now defined, so the percent identity for the
5' and 3' untranslated regions is also calculated.
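.PP
Truncating exon alignments to the coding region can be sketched in
Python as below. The tuple layout is hypothetical, and the sketch
assumes ungapped exon alignments on the plus strand so that genomic
coordinates shift by the same number of bases as the mRNA
coordinates.
.PP
.nf
```python
from typing import List, Tuple

# (mrna_start, mrna_end, genomic_start) for an ungapped plus-strand exon
# alignment, half-open mRNA coordinates; a hypothetical layout
Exon = Tuple[int, int, int]


def clip_to_cds(exons: List[Exon], cds_start: int, cds_end: int) -> List[Exon]:
    """Keep only the part of each exon alignment inside [cds_start, cds_end)."""
    clipped = []
    for m_start, m_end, g_start in exons:
        new_start = max(m_start, cds_start)
        new_end = min(m_end, cds_end)
        if new_start < new_end:
            # shift the genomic start by the number of mRNA bases trimmed
            clipped.append((new_start, new_end, g_start + new_start - m_start))
    return clipped
```
.fi
.PP
Exons lying entirely in an untranslated region drop out of the result,
and the bases trimmed from the first and last coding exons are the
ones that belong to the 5' and 3' untranslated regions.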
.SH OPTIONS
A summary of options is included below.
.PP
Start of genomic interval desired (from; 0-based).
.PP
Input file is a GI list.
.PP
The extra-large intron size to use (default = 220000).
.TP
\fB\-M\fP\ \fIfilename\fP
File with donor splice matrix.
.TP
\fB\-N\fP\ \fIfilename\fP
File with acceptor splice matrix.
.TP
\fB\-R\fP\ \fIfilename\fP
File (including path) to repeat BLAST database for filtering.
.PP
Restrict to plus (p) or minus (m) strand of genomic sequence.
.PP
Stop of genomic interval desired (to; 0-based).
.PP
Use extra-large intron sizes (increases the limit for initial and
terminal introns from 100kb to 240kb and for all others from 35kb to
120kb); may result in significantly longer compute times.
.TP
\fB\-a\fP\ \fIfilename\fP
Output file for alignments when directed to a separate file with
\fB\-p\ 3\fP (default = spidey.aln).
.PP
Identity cutoff, in percent, for quality control purposes.
.PP
Also try to align coding sequences corresponding to the given mRNA
records (may require network access).
.PP
First-pass e-value (default = 1.0e-10). Higher values increase speed
at the cost of sensitivity.
.PP
Second-pass e-value (default = 0.001).
.PP
Third-pass e-value (default = 10).
.TP
\fB\-i\fP\ \fIfilename\fP
Input file containing the genomic sequence in ASN.1 or FASTA format.
If your computer is running on a network that can access GenBank, you
can substitute the desired accession number for the filename.
.PP
Print ASN.1 alignment?
.TP
\fB\-k\fP\ \fIfilename\fP
File for ASN.1 output with \fB\-j\fP (default = spidey.asn).
.PP
Length coverage cutoff, in percent.
.TP
\fB\-m\fP\ \fIfilename\fP
Input file containing the mRNA sequence(s) in ASN.1 or FASTA format,
or a list of their accessions (with \fB\-G\fP). If your computer is
running on a network that can access GenBank, you can substitute a
single accession number for the filename.
.PP
Number of gene models to return per input mRNA (default = 1).
.PP
Main output file (default = stdout; contents controlled by \fB\-p\fP).
.PP
summary and alignments together (default)
.PP
summary and alignments in different files
.TP
\fB\-r\fP\ \fIc/d/m/p/v\fP
Organism of genomic sequence, used to determine splice matrices.
.PP
Dictyostelium discoideum
.PP
Tune for interspecies alignments.
.TP
\fB\-t\fP\ \fIfilename\fP
File with feature table, in 4 tab-delimited columns:
.PP
(e.g., \fBNM_04377.1\fP)
.PP
(only \fBrepetitive_region\fP is currently supported)
.PP
Make a multiple alignment of all input mRNAs (which must overlap on
the genomic sequence).
.PP
Consider lowercase characters in input FASTA sequences to be masked.
.SH AUTHOR
Sarah Wheelan and others at the National Center for Biotechnology
Information; Steffen Moeller contributed to this documentation.
.SH SEE ALSO
<http://www.ncbi.nlm.nih.gov/spidey>