<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.docbook.org/xml/4.5/docbookx.dtd">
<chapter id="chap_sanger">
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="versionfile"/>
<firstname>Bastien</firstname>
<surname>Chevreux</surname>
<email>bach@chevreux.org</email>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="copyrightfile"/>
<attribution>Solomon Short</attribution>
<emphasis><quote>Just when you think it's finally settled, it isn't.
<title>Short usage introduction to MIRA3</title>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="warning_frontofchapter.xml"/>
This guide assumes that you have basic working knowledge of Unix systems,
know the basic principles of sequencing (and sequence assembly) and what
assemblers do. Furthermore, it is advised to read through the main
documentation of the assembler as this is really just a getting started
<sect1 id="sect_sanger_important_notes">
For working parameter settings for assemblies involving 454, IonTorrent,
Solexa or PacBio data, please also read the MIRA help files dedicated to these
<sect1 id="sect_sanger_quick_start_for_the_impatient">
Quick start for the impatient
This example assumes that you have a few sequences in FASTA format that
may or may not have been preprocessed - that is, where sequencing vector
has been cut back or masked out. If quality values are also present in a
FASTA-like format, so much the better.
We need to give a name to our project: throughout this example, we will
assume that the sequences we are working with are
from <emphasis>Bacillus</emphasis>
<emphasis>chocorafoliensis</emphasis> (<emphasis>Bchoc</emphasis> for short), a well-known,
chocolate-adoring bug from the <emphasis>Bacillus</emphasis> genus which is able to make a
couple of hundred grams of chocolate vanish in just a few minutes.
Our project will therefore be named 'bchoc'.
<sect2 id="sect_sanger_estimating_memory_needs">
Estimating memory needs
<emphasis>"Do I have enough memory?"</emphasis> used to be one of the
most frequently asked questions. To answer it,
please use <command>miramem</command>, which will give you an
estimate. Basically, you just need to start the program and answer the
questions; for more information, please refer to the corresponding
section in the main MIRA documentation.
Take this estimate with a grain of salt: depending on the properties of
the sequences, actual memory usage can deviate from the estimate by +/- 30%.
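As a rough illustration of what a variation of +/- 30% means in practice, the
following shell sketch turns a single estimate into a lower and an upper bound.
The helper name <literal>mem_range</literal> and the example figure of 10 are
assumptions made purely for illustration; they are not part of
<command>miramem</command>.

```shell
# Hypothetical helper (not part of MIRA): given a memory estimate,
# print the lower and upper bound implied by a +/- 30% variation.
mem_range() {
  awk -v est="$1" 'BEGIN { printf "%.1f %.1f\n", est * 0.7, est * 1.3 }'
}

# Example: an estimate of 10 (in whatever unit the estimate was given).
mem_range 10    # prints "7.0 13.0"
```

If the upper bound does not fit comfortably into the memory of your machine,
it is probably safer to plan for a larger machine than to rely on the estimate
alone.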
<sect2 id="sect_sanger_preparing_and_starting_an_assembly_from_scratch_with_fasta_files">
Preparing and starting an assembly from scratch with FASTA files
<sect3 id="sect_sanger_with_data_preclipped_or_prescreened_for_vector_sequence">
With data pre-clipped or pre-screened for vector sequence
The following steps let you quickly start a simple assembly if
your sequencing provider gave you data that was pre-clipped or
pre-screened for vector sequence:
<prompt>$</prompt> <userinput>mkdir bchoc_assembly1</userinput>
<prompt>$</prompt> <userinput>cd bchoc_assembly1</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/sequences.fasta bchoc_in.sanger.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira --project=bchoc --job=denovo,genome,accurate,sanger --fasta</userinput></screen>
<emphasis role="underline">Explanation:</emphasis> we created a
directory for the assembly, copied the sequences into it (to make
things easier for us, we named the file directly in a format
suitable for mira to load it automatically) and we also copied
quality values for the sequences into the same directory. As the last
step, we started mira with options telling it that
our project is named 'bchoc' and hence, input and output files
will have this as prefix;
the data is in a FASTA formatted file;
the data should be assembled <emphasis>de-novo</emphasis> as
a <emphasis>genome</emphasis> at an assembly quality level
of <emphasis>accurate</emphasis> and that the reads we are
assembling were generated with Sanger technology.
By giving mira the project name 'bchoc'
(<literal>--project=bchoc</literal>) and naming the sequence file with
the appropriate extension <filename>_in.sanger.fasta</filename>, mira
automatically loaded that file for assembly. When
additional quality values are available
(<filename>bchoc_in.sanger.fasta.qual</filename>), these are also
automatically loaded and used for the assembly.
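The naming convention described above can be checked before starting an
assembly. The following shell sketch prints the two file names used in this
example for a given project name and reports whether they are present in the
current directory. The helper names <literal>expected_inputs</literal> and
<literal>check_inputs</literal> are hypothetical and not part of MIRA; project
names containing whitespace are not handled.

```shell
# Hypothetical helper: print the input file names used in this example
# for a given project name (Sanger reads in FASTA plus quality file).
expected_inputs() {
  project="$1"
  printf '%s\n' "${project}_in.sanger.fasta" "${project}_in.sanger.fasta.qual"
}

# Hypothetical helper: report which of these files exist in the current
# directory; returns 1 if at least one of them is missing.
check_inputs() {
  rc=0
  for f in $(expected_inputs "$1"); do
    if [ -f "$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling <literal>check_inputs bchoc</literal> inside
<filename>bchoc_assembly1</filename> lists both files as found once the two
<command>cp</command> commands from the example have been executed.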
If there is no file with quality values available, MIRA will stop
immediately. You will need to provide parameters on the command line
that explicitly switch off loading and using quality files.
Not using quality values is <emphasis role="bold">NOT</emphasis>
recommended. Read the corresponding section in the MIRA reference
<sect3 id="sect_sanger_using_ssaha2_smalt_to_screen_for_vector_sequence">
Using SSAHA2 / SMALT to screen for vector sequence
If your sequencing provider gave you data that was NOT pre-clipped
for vector sequence, you can do this yourself in a pretty robust
manner using SSAHA2 -- or its successor, SMALT -- from the Sanger
Centre. You just need to know which sequencing vector the provider
used and have its sequence in FASTA format (ask your provider).
Note that this screening is a valid method for any type of Sanger
sequencing vectors, 454 adaptors, Solexa adaptors and paired-end
For SSAHA2 follow these steps (most are the same as in the example
<prompt>$</prompt> <userinput>mkdir bchoc_assembly1</userinput>
<prompt>$</prompt> <userinput>cd bchoc_assembly1</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/sequences.fasta bchoc_in.sanger.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>ssaha2 -output ssaha2 \
-kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 \
/path/where/the/vector/data/resides/vector.fasta \
bchoc_in.sanger.fasta > bchoc_ssaha2vectorscreen_in.txt</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes</userinput></screen>
<emphasis role="underline">Explanation:</emphasis> there are just
two differences from the example above:
calling SSAHA2 to generate a file which contains information on
the vector sequence hitting your sequences.
telling mira with <literal>SANGER_SETTINGS -CL:msvs=yes</literal>
to load this vector screening data for
For SMALT, the only difference is that you use SMALT to generate
the vector-screen file and ask it to write the file in SSAHA2
format. As SMALT works in two steps (indexing and then mapping), you
need to run both steps before calling MIRA. For example:
<prompt>bchoc_assembly1$</prompt> <userinput>smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>smalt map -f ssaha -d -1 -m 7 smaltidxdb bchoc_in.sanger.fasta > bchoc_smaltvectorscreen_in.txt</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes</userinput></screen>
Please note that, due to subtle differences between output of SSAHA2
(in ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the
source of the screening (and the parsing method it needs) by the
name of the screen file. Therefore, screens done with SSAHA2 need to
have the postfix <filename>*_ssaha2vectorscreen_in.txt</filename> in
the file name and screens done with SMALT need
<filename>*_smaltvectorscreen_in.txt</filename>.
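Because the file name decides how MIRA parses the screen, it can help to derive
the name from the screening tool instead of typing it by hand. The shell sketch
below does just that for the two postfixes named above. The helper name
<literal>screen_file_name</literal> and its tool keywords
<literal>ssaha2</literal> and <literal>smalt</literal> are assumptions made for
this illustration; they are not options of MIRA, SSAHA2 or SMALT.

```shell
# Hypothetical helper: print the vector-screen file name for a project,
# depending on which tool produced the screen. Returns 1 for unknown tools.
screen_file_name() {
  project="$1"
  tool="$2"
  case "$tool" in
    ssaha2) echo "${project}_ssaha2vectorscreen_in.txt" ;;
    smalt)  echo "${project}_smaltvectorscreen_in.txt" ;;
    *)      return 1 ;;
  esac
}

screen_file_name bchoc smalt    # prints "bchoc_smaltvectorscreen_in.txt"
```

The printed name can then be used as the redirection target of the
<command>ssaha2</command> or <command>smalt map</command> call shown in the
examples.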
<sect1 id="sect_sanger_calling_mira_from_the_command_line">
Calling mira from the command line
Mira can be used in many different ways: building assemblies from
scratch, performing reassembly on existing projects, assembling
sequences from closely related strains, assembling sequences against an
existing backbone (mapping assembly), etc. Mira comes with a number
of <emphasis role="bold">quick switches</emphasis>, i.e., switches that
turn on parameter combinations which should be suited for most needs.
E.g.: <literal>mira --project=foobar --job=sanger --fasta
--highlyrepetitive</literal>
The line above will tell mira that our project will have the general
name <emphasis>foobar</emphasis> and that the sequences are to be loaded
from FASTA files, the sequence input file being
named <filename>foobar_in.sanger.fasta</filename> (with sequence quality
values, if available, in <filename>foobar_in.sanger.fasta.qual</filename>). The reads
come from Sanger technology and mira is told to expect a genome
containing nasty repeats. The result files will be in a directory
named <filename>foobar_results</filename>; statistics about the assembly
will be available in the <filename>foobar_info</filename> directory,
e.g., a summary of contig statistics in
<filename>foobar_info/foobar_info_contigstats.txt</filename>. Notice
that the <emphasis>--job=</emphasis> switch is missing some
specifications; mira will automatically fill in the remaining defaults
(i.e., denovo,genome,accurate in the example above).
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fasta --highlyrepetitive</literal>
This is the same as the previous example except that mira will perform a
mapping assembly in 'accurate' quality of the sequences against
one or more backbone sequences. mira will therefore additionally load the backbone
sequence(s) from the file <filename>foobar_backbone_in.fasta</filename>
(FASTA being the default type of backbone sequence to be loaded) and, if
existing, quality values for the backbone
from <filename>foobar_backbone_in.fasta.qual</filename>.
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fasta --highlyrepetitive -SB:bft=gbf</literal>
As above, except we have added an <emphasis role="bold">extensive
switch</emphasis> (<arg>-SB:bft</arg>) to tell mira that the backbones
are in a GenBank format file (GBF). MIRA will therefore load the
backbone sequence(s) from the file
<filename>foobar_backbone_in.gbf</filename>. Note that the GBF file can
also contain multiple entries, i.e., it can be a GBFF file.
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fastq --highlyrepetitive -SB:bft=gbf</literal>
As above, except we have changed the input type for all files from FASTA
<sect1 id="sect_sanger_using_multiple_processors">
Using multiple processors
This feature is in its infancy: presently, only the SKIM algorithm uses
multiple threads. The number of threads for this stage can be
set via the <arg>-GE:not</arg>
parameter, e.g., <literal>-GE:not=4</literal> to use 4 threads.
<sect1 id="sect_sanger_usage_examples">
<sect2 id="sect_sanger_assembly_from_scratch_with_gap4_and_exp_files">
Assembly from scratch with GAP4 and EXP files
A simple GAP4 project will do nicely. Please take care of the
following: you need already preprocessed experiment / FASTA / PHD
files, i.e., at least the sequencing vector should have been tagged
(in EXP files) or masked out (FASTA or PHD files). Ideally, a
reasonably strict quality clipping has also been done for the
EXP files; pregap4 should do this for you.
Step 1: Create a file of filenames (named
<filename>mira_in.fofn</filename>) for the project you wish to
assemble. The file of filenames should contain the newline-separated
names of the EXP files and nothing else.
Step 2: Execute the mira assembly, optionally using command line
options or output redirection:
<prompt>$</prompt> <userinput>/path/to/the/mira/package/mira <replaceable>... other options ...</replaceable></userinput></screen>
<prompt>$</prompt> <userinput>mira <replaceable>... other options ...</replaceable></userinput></screen>
if MIRA is in a directory which is in your PATH. The result of the
assembly will now be in a directory
named <filename>mira_results</filename> where you will
find <filename>mira_out.caf</filename>, <filename>mira_out.html</filename>
etc. or in gap4 direct assembly format in
the <filename>mira_out.gap4da</filename> sub-directory.
Step 3a: <emphasis>(This is not recommended
anymore)</emphasis> Change to the gap4da directory and start gap4:
<prompt>$</prompt> <userinput>cd mira_results/mira_out.gap4da</userinput>
<prompt>mira_results/mira_out.gap4da$</prompt> <userinput>gap4</userinput></screen>
choose the menu 'File->New' and enter a name for your new database
(like 'demo'). Then choose the menu 'Assembly->Directed
assembly'. Enter the text 'fofn' in the entry
labelled <emphasis>Input readings from List or file
name</emphasis> and enter the text 'failures' into the entry
labelled <emphasis>Save failures to List or file name</emphasis>.
Step 3b: <emphasis>(Recommended)</emphasis> As an alternative to
step 3a, one can use the caf2gap converter (see below)
<prompt>mira_results$</prompt> <userinput>caf2gap -project demo -version 0 -ace mira_out.caf</userinput>
<prompt>mira_results$</prompt> <userinput>gap4 DEMO.0</userinput></screen>
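The file of filenames from step 1 can be generated with a small shell loop.
The sketch below assumes that the experiment files sit in the current directory
and carry the extension <filename>.exp</filename>; both the extension and the
helper name <literal>make_fofn</literal> are assumptions made for illustration,
so adapt the pattern to however your files are actually named.

```shell
# Hypothetical helper: write the names of all *.exp files in the current
# directory, one per line, to mira_in.fofn (the file name used in step 1).
make_fofn() {
  : > mira_in.fofn                 # start with an empty file
  for f in *.exp; do
    # If nothing matches, the shell hands back the pattern itself; skip it.
    if [ -e "$f" ]; then
      printf '%s\n' "$f" >> mira_in.fofn
    fi
  done
}
```

Run <literal>make_fofn</literal> in the directory that holds the EXP files and
inspect <filename>mira_in.fofn</filename> before starting the assembly in
step 2.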
<title>Out-of-the-box example</title>
MIRA comes with a few really small toy projects to test usability on a
given system. Go to the minidemo directory and follow the instructions
given in the section for own projects above, but start with step 2.
You may also want to start mira while redirecting the output
to a file for later analysis.
<sect2 id="sect_sanger_reassembly_of_gap4_edited_projects">
Reassembly of GAP4 edited projects
It is sometimes desirable to reassemble a project that has already been
edited, for example when hidden data in reads has been uncovered or
when some repetitive bases have been tagged manually. The canonical
way to do this is by using CAF files as data exchange format and the
caf2gap and gap2caf converters available from the Sanger Centre
(<ulink url="http://www.sanger.ac.uk/Software/formats/CAF/"/>).
The project will be completely reassembled: contig joins or breaks
that have been made in the GAP4 database will be lost, and you will get an
entirely new assembly with what mira determines to be the best
Step 1: Convert your GAP4 project with the gap2caf tool. Assuming
that the assembly is in the GAP4
database <filename>CURRENT.0</filename>, convert it with the
<prompt>$</prompt> <userinput>gap2caf -project CURRENT -version 0 -ace > newstart_in.caf</userinput></screen>
The name <emphasis>"newstart"</emphasis> will be the project name
of the new assembly project.
Step 2: Start mira with the -caf option and tell it the name of
your new reassembly project:
<prompt>$</prompt> <userinput>mira -caf=newstart</userinput></screen>
(and other options like --job etc. at will.)
Step 3: Convert the resulting CAF file
<filename>newstart_assembly/newstart_d_results/newstart_out.caf</filename>
to a gap4 database format as explained above and start gap4 with
<prompt>$</prompt> <userinput>cd newstart_assembly/newstart_d_results</userinput>
<prompt>newstart_assembly/newstart_d_results$</prompt> <userinput>caf2gap -project reassembled -version 0 -ace newstart_out.caf</userinput>
<prompt>newstart_assembly/newstart_d_results$</prompt> <userinput>gap4 REASSEMBLED.0</userinput></screen>
<sect2 id="sect_sanger_using_backbones_to_perform_a_mapping_assembly_against_a_reference_sequence">
Using backbones to perform a mapping assembly against a reference sequence
<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->
One useful feature of mira is the ability to assemble against already
existing reference sequences or contigs (also called a mapping assembly). The
parameters that control the behaviour of the assembly in these cases are in
the <arg>-STRAIN/BACKBONE</arg> section of the parameters.
Please have a look at the example in the <filename>minidemo/bbdemo2</filename> directory
which maps sequences from <emphasis>C.jejuni RM1221</emphasis> against (parts of) the genome
of <emphasis>C.jejuni NCTC1168</emphasis>.
There are a few things to consider when using backbone sequences:
Backbone sequences can be as long as needed! They are not subject
to normal read length constraints of a maximum of 10k bases. That
is, if one wants to load one or several entire chromosomes of a
bacterium or lower eukaryote as backbone sequence(s), this is just
Backbone sequences can be single sequences as provided by, e.g.,
FASTA, FASTQ or GenBank files. But backbone sequences can also be
whole assemblies when they are provided in, e.g., CAF format. This
opens the possibility to perform semi-hybrid assemblies by
first assembling reads from one sequencing technology de-novo
(e.g. 454) and then mapping reads from another sequencing technology
(e.g. Solexa) to the whole 454 alignment instead of mapping them to
A semi-hybrid assembly will therefore contain, like a hybrid
assembly, the reads of both sequencing technologies.
Backbone sequences will not be reversed! They will always appear in
forward direction in the output of the assembly. Please note: if the
backbone sequence consists of a CAF file that contains contigs which contain
reversed reads, then the contigs themselves will be in forward direction.
But the reads they contain that are in reverse complement direction will of
course also stay in reverse complement direction.
Backbone sequences will not be assembled together! That is, if a
sequence of the backbones has a perfect overlap with another backbone
sequence, they will still not be merged.
Reads are assembled to backbones in a first come, first served
Suppose you have two identical backbones and one read which
would match both, then the read would be mapped to the first
backbone. If you had two (almost) identical reads, the first
read would go to the first backbone, the second read to the
second backbone. With three almost identical reads, the first
backbone would get two reads, the second backbone one read.
Only in backbones loaded from CAF files: contigs made out of single
reads (singlets) lose their status as backbones and will be returned to the
normal read pool for the assembly process. That is, these sequences will be
assembled to other backbones or with each other.
Examples for using backbone sequences:
Example 1: assume you have a genome of an existing organism. From
that, a mutant has been made by mutagenesis and you are skimming
the genome in shotgun mode for mutations. For this, you would
generate a <emphasis>straindata</emphasis> file that gives the name of
the mutant strain to the newly sequenced reads and simply assemble
those against your existing genome, using the following
<literal>-SB:lsd=yes:lb=yes:bsn=<replaceable>nameOriginalStrain</replaceable>:bft=<replaceable>caf|fasta|gbf</replaceable></literal>
When loading backbones from CAF, the qualities of the consensus
bases will be calculated by mira according to the normal consensus
computing rules. When loading backbones from FASTA or GBF, one
can set the expected overall quality of the sequences (e.g. 1
error in 1000 bases = quality of 30) with
<arg>-SB:bbq=30</arg>. It is recommended to have the backbone
quality at least as high as the <arg>-CO:mgqrt</arg> value, so
that mira can automatically detect and report SNPs.
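The relation between an expected error rate and a quality value quoted above
(1 error in 1000 bases = quality of 30) follows the usual Phred convention,
Q = -10 * log10(P), where P is the error probability. The sketch below merely
evaluates this formula with <command>awk</command>; the helper name
<literal>phred_quality</literal> is made up for this illustration and is not
part of MIRA.

```shell
# Hypothetical helper: Phred-style quality for "errors" errors in "bases"
# bases, i.e. Q = -10 * log10(errors / bases), rounded to an integer.
phred_quality() {
  awk -v e="$1" -v n="$2" 'BEGIN { printf "%.0f\n", -10 * log(e / n) / log(10) }'
}

phred_quality 1 1000    # the case from the text; prints 30
```

By the same formula, a backbone with an expected 1 error in 100 bases would
correspond to a quality of 20.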
Example 2: suppose that you are in the process of performing a
shotgun sequencing and you want to determine the moment when you
have enough reads. One could make a complete assembly each day when
new sequences arrive. However, starting with genomes the size of a
lower eukaryote, this may become prohibitive from the
computational point of view. A quick and efficient way to resolve
this problem is to use the CAF file of the previous assembly as
backbone and simply add the new reads to the pool. The number of
singlets remaining after the assembly versus the total number of
reads of the project is a good measure for the coverage of the
Example 3: in EST assembly with miraSearchESTSNPs, existing cDNA
sequences can also be useful when added to the project during step
3 (in the file <filename>step3_in.par</filename>). They will
provide a framework to which mRNA-contigs built in previous steps
will be assembled, allowing for a fast evaluation of the
results. Additionally, they provide a direction for the assembled
sequences so that one does not need to invert single contigs by