~ubuntu-branches/ubuntu/trusty/mira/trusty-proposed

Viewing changes to doc/docbook/attic/chap_sanger_part.xml

Committer: Package Import Robot
Author(s): Thorsten Alteholz, Andreas Tille
Date: 2014-02-02 22:51:35 UTC
mfrom: (7.1.1 sid)
Revision ID: package-import@ubuntu.com-20140202225135-nesemzj59jjgogh0

Tags: 4.0-1

http://bugs.debian.org/735798

[ Thorsten Alteholz ]
* New upstream version
* debian/rules: add boost dir in auto_configure (Closes: #735798)

[ Andreas Tille ]
* cme fix dpkg-control
* debian/patches/{make.patch,spelling.patch}: applied upstream (thus removed)

files added:
doc/docbook/attic/chap_454_part.xml

doc/docbook/attic/chap_iontor_part.xml

doc/docbook/attic/chap_pacbio_part.xml

doc/docbook/attic/chap_sanger_part.xml

doc/docbook/attic/chap_solexa_part.xml

doc/docbook/bookfigures/results_miraconvert.png

doc/docbook/chap_dataprep_part.xml

doc/docbook/chap_denovo_part.xml

doc/docbook/chap_mapping_part.xml

doc/docbook/chap_seqtechdesc_part.xml

doc/docbook/chap_specialparams_part.xml

doc/docbook/replace_all.sh

m4/ax_boost_base.m4

m4/ax_boost_filesystem.m4

m4/ax_boost_iostreams.m4

m4/ax_boost_regex.m4

m4/ax_boost_system.m4

m4/ax_boost_thread.m4

m4/ax_check_zlib.m4

m4/ax_cxx_have_std.m4

m4/ax_cxx_have_stl.m4

m4/ax_cxx_namespaces.m4

m4/ax_lib_expat.m4

m4/libtool.m4

m4/ltoptions.m4

m4/ltsugar.m4

m4/ltversion.m4

m4/lt~obsolete.m4

src/mira/warnings.C

src/mira/warnings.H

src/modules

src/modules/Makefile.am

src/modules/Makefile.in

src/modules/misc.C

src/modules/misc.H

src/modules/mod_bait.C

src/modules/mod_bait.H

src/modules/mod_convert.C

src/modules/mod_convert.H

src/modules/mod_dbgreplay.C

src/modules/mod_memestim.C

src/modules/mod_memestim.H

src/modules/mod_mira.C

src/modules/mod_mira.H

src/modules/mod_tagsnp.C

src/modules/mod_tagsnp.H

src/progs/quirks.C

src/util/fmttext.C

src/util/fmttext.H

files removed:
.pc/boost-minimal.patch

.pc/boost-minimal.patch/src

.pc/boost-minimal.patch/src/progs

.pc/boost-minimal.patch/src/progs/Makefile.in

config

config/m4

config/m4/ax_boost_base.m4

config/m4/ax_boost_filesystem.m4

config/m4/ax_boost_iostreams.m4

config/m4/ax_boost_regex.m4

config/m4/ax_boost_system.m4

config/m4/ax_boost_thread.m4

config/m4/ax_check_zlib.m4

config/m4/ax_cxx_have_std.m4

config/m4/ax_cxx_have_stl.m4

config/m4/ax_cxx_namespaces.m4

config/m4/ax_lib_expat.m4

config/m4/libtool.m4

config/m4/ltoptions.m4

config/m4/ltsugar.m4

config/m4/ltversion.m4

config/m4/lt~obsolete.m4

doc/docbook/bookfigures/ion_dh10bcovB13_12kb.png

doc/docbook/bookfigures/ion_dh10bcovB13_320kb.png

doc/docbook/bookfigures/results_convert_project.png

doc/docbook/chap_454_part.xml

doc/docbook/chap_iontor_part.xml

doc/docbook/chap_pacbio_part.xml

doc/docbook/chap_sanger_part.xml

doc/docbook/chap_solexa_part.xml

src/progs/convert_project.C

src/progs/dbgreplay.C

src/scripts/fastaselect.tcl

src/scripts/fastqselect.tcl

files modified:
.pc/applied-patches

.tarball-version

.version

HELP_WANTED

INSTALL

Makefile.am

Makefile.in

README

README_build.html

THANKS

aclocal.m4

build-aux/git-version-gen

configure

configure.ac

debian/changelog

debian/control

debian/mira-assembler.lintian-overrides

debian/patches/series

debian/rules

debian/watch

distribution/Makefile

distribution/Makefile.am

distribution/Makefile.in

distribution/README

doc/Makefile

doc/Makefile.in

doc/docbook/Makefile

doc/docbook/Makefile.am

doc/docbook/Makefile.in

doc/docbook/attic/README.txt

doc/docbook/book_3rdparty.xml

doc/docbook/book_definitiveguide.xml

doc/docbook/chap_3p_ghbambus_part.xml

doc/docbook/chap_bitsandpieces_part.xml

doc/docbook/chap_est_part.xml

doc/docbook/chap_faq_part.xml

doc/docbook/chap_hard_part.xml

doc/docbook/chap_installation_part.xml

doc/docbook/chap_intro_part.xml

doc/docbook/chap_logfiles_part.xml

doc/docbook/chap_maf_part.xml

doc/docbook/chap_mirautils_part.xml

doc/docbook/chap_preface_part.xml

doc/docbook/chap_reference_part.xml

doc/docbook/chap_results_part.xml

doc/docbook/chap_seqadvice_part.xml

doc/docbook/copyrightfile

doc/docbook/warning_frontofchapter.xml

minidemo/bbdemo2/runme.sh

minidemo/estdemo2/README

src/Makefile.am

src/Makefile.in

src/caf/Makefile.in

src/caf/caf.C

src/debuggersupport/Makefile.in

src/debuggersupport/dbgsupport.C

src/errorhandling/Makefile.in

src/errorhandling/errorhandling.C

src/errorhandling/errorhandling.H

src/io/Makefile.in

src/io/fasta.C

src/io/fasta.H

src/io/fastq-lh.H

src/io/gap4_ft_so_map.xxd

src/io/ncbiinfoxml.C

src/io/ncbiinfoxml.H

src/io/phd.H

src/io/scf.C

src/io/scf.H

src/memorc/Makefile.in

src/memorc/main.C

src/memorc/memorc.C

src/mira/CHANGES.txt

src/mira/Makefile.am

src/mira/Makefile.in

src/mira/TODO

src/mira/adaptorsforclip.454.xxd

src/mira/ads.C

src/mira/align.C

src/mira/align.H

src/mira/assembly.C

src/mira/assembly.H

src/mira/assembly_info.C

src/mira/assembly_info.H

src/mira/assembly_io.C

src/mira/assembly_misc.C

src/mira/assembly_output.C

src/mira/assembly_output.H

src/mira/assembly_reduceskimhits.C

src/mira/assembly_swalign.C

src/mira/contig.C

src/mira/contig.H

src/mira/contig_analysis.C

src/mira/contig_consensus.C

src/mira/contig_edit.C

src/mira/contig_featureinfo.C

src/mira/contig_output.C

src/mira/contig_pairconsistency.C

src/mira/dataprocessing.C

src/mira/dataprocessing.H

src/mira/dynamic.C

src/mira/dynamic.H

src/mira/gbf_parse.H

src/mira/gff_parse.C

src/mira/hashstats.C

src/mira/hashstats.H

src/mira/hdeque.H

src/mira/maf_parse.C

src/mira/manifest.C

src/mira/manifest.H

src/mira/parameters.C

src/mira/parameters.H

src/mira/parameters_flexer.cc

src/mira/parameters_flexer.ll

src/mira/parameters_tokens.h

src/mira/pcrcontainer.C

src/mira/pcrcontainer.H

src/mira/ppathfinder.C

src/mira/ppathfinder.H

src/mira/preventinitfiasco.C

src/mira/read.C

src/mira/read.H

src/mira/readgrouplib.C

src/mira/readgrouplib.H

src/mira/readpool.C

src/mira/readpool.H

src/mira/sam_collect.C

src/mira/skim.C

src/mira/skim.H

src/mira/skim_farc.C

src/mira/skim_lowbph.C

src/mira/structs.H

src/progs/Makefile.am

src/progs/Makefile.in

src/progs/mira.C

src/progs/miramer.C

src/progs/quirks.H

src/scripts/Makefile.am

src/scripts/Makefile.in

src/scripts/fasta2frag.tcl

src/stdinc/Makefile.in

src/stdinc/defines.H

src/stdinc/stlincludes.H

src/stdinc/types.H

src/support/GTAGDB

src/support/Makefile.in

src/util/Makefile.am

src/util/Makefile.in

src/util/dptools.H

src/util/fileanddisk.C

src/util/fileanddisk.H

src/util/misc.C

src/version.H

src/version.stub


doc/docbook/attic/chap_sanger_part.xml

<?xml version="1.0" ?>

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.docbook.org/xml/4.5/docbookx.dtd">

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="versionfile"/>

<firstname>Bastien</firstname>

<surname>Chevreux</surname>

<email>bach@chevreux.org</email>

</author>

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="copyrightfile"/>

</chapterinfo>

<attribution>Solomon Short</attribution>

<para>

<emphasis><quote>Just when you think it's finally settled, it isn't.

</quote></emphasis>

</para>

</blockquote>

<title>Short usage introduction to MIRA3</title>

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="warning_frontofchapter.xml"/>

<para>

This guide assumes that you have basic working knowledge of Unix systems,

know the basic principles of sequencing (and sequence assembly) and what

assemblers do. Furthermore, it is advised to read through the main

documentation of the assembler as this is really just a getting started

guide.

</para>

<title>

Important notes

</title>

<para>

For working parameter settings for assemblies involving 454, IonTorrent,

Solexa or PacBio data, please also read the MIRA help files dedicated to these

platforms.

</para>

</sect1>

<title>

Quick start for the impatient

</title>

<para>

This example assumes that you have a few sequences in FASTA format that
may or may not have been preprocessed, that is, where sequencing vector
has been cut back or masked out. If quality values are also present in a
FASTA-like format, so much the better.

</para>

<para>

We need to give a name to our project: throughout this example, we will
assume that the sequences we are working with are
from <emphasis>Bacillus</emphasis>
<emphasis>chocorafoliensis</emphasis> (or short: <emphasis>Bchoc</emphasis>), a well-known,
chocolate-adoring bug from the <emphasis>Bacillus</emphasis> family which is able to make a
couple of hundred grams of chocolate vanish in just a few minutes.

</para>

<para>

Our project will therefore be named 'bchoc'.

</para>

<title>

Estimating memory needs

</title>

<para>

<emphasis>"Do I have enough memory?"</emphasis> used to be one of the
most frequently asked questions. To answer it,
please use <command>miramem</command>, which will give you an
estimate. Basically, you just need to start the program and answer the
questions; for more information please refer to the corresponding
section in the main MIRA documentation.

</para>

<para>

Take this estimate with a grain of salt: depending on the properties of
the sequences, the estimate can be off by +/- 30%.

</para>

</sect2>

<title>

Preparing and starting an assembly from scratch with FASTA files

</title>

<para>

</para>

<title>

With data pre-clipped or pre-screened for vector sequence

</title>

<para>

The following steps will allow you to quickly start a simple assembly if
your sequencing provider gave you data which was pre-clipped or
pre-screened for vector sequence:

</para>

<screen>
<prompt>$</prompt> <userinput>mkdir bchoc_assembly1</userinput>
<prompt>$</prompt> <userinput>cd bchoc_assembly1</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/sequences.fasta bchoc_in.sanger.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira --project=bchoc --job=denovo,genome,accurate,sanger --fasta</userinput></screen>

<para>
<emphasis role="underline">Explanation:</emphasis> we created a
directory for the assembly, copied the sequences into it (to make
things easier for us, we named the file directly in a format
suitable for mira to load it automatically) and we also copied
quality values for the sequences into the same directory. As the last
step, we started mira with options telling it that
</para>
<itemizedlist>
<listitem>
<para>
our project is named 'bchoc' and hence, input and output files
will have this as prefix;
</para>
</listitem>
<listitem>
<para>
the data is in a FASTA formatted file;
</para>
</listitem>
<listitem>
<para>
the data should be assembled <emphasis>de-novo</emphasis> as
a <emphasis>genome</emphasis> at an assembly quality level
of <emphasis>accurate</emphasis> and that the reads we are
assembling were generated with Sanger technology.
</para>
</listitem>
</itemizedlist>

<para>
By giving mira the project name 'bchoc'
(<literal>--project=bchoc</literal>) and naming the sequence file with
an appropriate extension <filename>_in.sanger.fasta</filename>, mira
automatically loaded that file for assembly. When there are
additional quality values available
(<filename>bchoc_in.sanger.fasta.qual</filename>), these are also
automatically loaded and used for the assembly.
</para>
<note>
If there is no file with quality values available, MIRA will stop
immediately. You will need to provide parameters to the command line
which explicitly switch off loading and using quality files.
</note>
<warning>
Not using quality values is <emphasis role="bold">NOT</emphasis>
recommended. Read the corresponding section in the MIRA reference
manual.
</warning>
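<para>
The file naming convention described in this section can be scripted. The
following is a minimal sketch for a POSIX shell; the helper name
<literal>stage_inputs</literal> is hypothetical and not part of MIRA. It
copies a FASTA file and, if one is given, its quality file into the current
directory under the names that mira loads automatically for Sanger data.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Hypothetical helper (not part of MIRA): copy a FASTA file and its optional
# quality file into the current directory under the names mira loads
# automatically for Sanger data: <project>_in.sanger.fasta[.qual]
stage_inputs() {
    project=$1; fasta=$2; qual=$3
    cp "$fasta" "${project}_in.sanger.fasta" || return 1
    if [ -n "$qual" ] && [ -f "$qual" ]; then
        cp "$qual" "${project}_in.sanger.fasta.qual"
    else
        # Without a quality file MIRA stops unless told otherwise.
        echo "warning: no quality file staged for project $project" >&2
    fi
}
```
]]></programlisting>
<para>
A call such as <literal>stage_inputs bchoc /your/path/sequences.fasta
/your/path/qualities.someextension</literal> would then replace the two
<command>cp</command> commands shown earlier.
</para>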

</sect3>
<title>
Using SSAHA2 / SMALT to screen for vector sequence
</title>
<para>
If your sequencing provider gave you data which was NOT pre-clipped
for vector sequence, you can do this yourself in a pretty robust
manner using SSAHA2 -- or the successor, SMALT -- from the Sanger
Centre. You just need to know which sequencing vector the provider
used and have its sequence in FASTA format (ask your provider).
</para>
<para>
Note that this screening is a valid method for any type of Sanger
sequencing vectors, 454 adaptors, Solexa adaptors and paired-end
adaptors etc.
</para>
<para>
For SSAHA2 follow these steps (most are the same as in the example
above):
</para>

<screen>
<prompt>$</prompt> <userinput>mkdir bchoc_assembly1</userinput>
<prompt>$</prompt> <userinput>cd bchoc_assembly1</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/sequences.fasta bchoc_in.sanger.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>ssaha2 -output ssaha2 \
  -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 \
  /path/where/the/vector/data/resides/vector.fasta \
  bchoc_in.sanger.fasta > bchoc_ssaha2vectorscreen_in.txt</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes</userinput></screen>

<para>
<emphasis role="underline">Explanation:</emphasis> there are just
two differences from the example above:
</para>
<itemizedlist>
<listitem>
<para>
calling SSAHA2 to generate a file which contains information on
the vector sequence hitting your sequences.
</para>
</listitem>
<listitem>
<para>
telling mira with <literal>SANGER_SETTINGS
-CL:msvs=yes</literal> to load this vector screening data for
Sanger data
</para>
</listitem>
</itemizedlist>
<para>
For SMALT, the only difference is that you use SMALT for generating
the vector-screen file and ask SMALT to generate it in SSAHA2
format. As SMALT works in two steps (indexing and then mapping), you
also need to perform it in two steps and then call MIRA. E.g.:
</para>

<screen>
<prompt>bchoc_assembly1$</prompt> <userinput>smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>smalt map -f ssaha -d -1 -m 7 smaltidxdb bchoc_in.sanger.fasta > bchoc_smaltvectorscreen_in.txt</userinput>
<prompt>bchoc_assembly1$</prompt> <userinput>mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes</userinput></screen>

<note>
Please note that, due to subtle differences between output of SSAHA2
(in ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the
source of the screening (and the parsing method it needs) by the
name of the screen file. Therefore, screens done with SSAHA2 need to
have the postfix <filename>*_ssaha2vectorscreen_in.txt</filename> in
the file name and screens done with SMALT need
<filename>*_smaltvectorscreen_in.txt</filename>.
</note>
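<para>
Because the screen file name decides how MIRA parses the file, it can help to
derive the name from the tool rather than typing it by hand. The sketch below
assumes a POSIX shell; <literal>screen_file_name</literal> is a hypothetical
helper, not a MIRA command.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Hypothetical helper (not part of MIRA): print the vector-screen file name
# that matches the screening tool, following the postfix rule in the note.
screen_file_name() {
    project=$1; tool=$2
    case $tool in
        ssaha2) echo "${project}_ssaha2vectorscreen_in.txt" ;;
        smalt)  echo "${project}_smaltvectorscreen_in.txt" ;;
        *)      echo "unknown screening tool: $tool" >&2; return 1 ;;
    esac
}
```
]]></programlisting>
<para>
The SMALT mapping step could then redirect its output to
<literal>"$(screen_file_name bchoc smalt)"</literal> instead of a
hand-written file name.
</para>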

</sect3>
</sect2>
</sect1>
<title>
Calling mira from the command line
</title>
<para>
Mira can be used in many different ways: building assemblies from
scratch, performing reassembly on existing projects, assembling
sequences from closely related strains, assembling sequences against an
existing backbone (mapping assembly), etc. Mira comes with a number
of <emphasis role="bold">quick switches</emphasis>, i.e., switches that
turn on parameter combinations which should be suited for most needs.
</para>

<para>
E.g.: <literal>mira --project=foobar --job=sanger --fasta
--highlyrepetitive</literal>
</para>
<para>
The line above will tell mira that our project will have the general
name <emphasis>foobar</emphasis> and that the sequences are to be loaded
from FASTA files, the sequence input file being
named <filename>foobar_in.sanger.fasta</filename> (and the sequence quality
file, if
available, <filename>foobar_in.sanger.fasta.qual</filename>). The reads
come from Sanger technology and mira is prepared for the genome
containing nasty repeats. The result files will be in a directory
named <filename>foobar_results</filename>; statistics about the assembly
will be available in the <filename>foobar_info</filename> directory,
e.g., a summary of contig statistics in
<filename>foobar_info/foobar_info_contigstats.txt</filename>. Notice
that the <emphasis>--job=</emphasis> switch is missing some
specifications; mira will automatically fill in the remaining defaults
(i.e., denovo,genome,accurate in the example above).
</para>

<para>
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fasta --highlyrepetitive</literal>
</para>
<para>
This is the same as the previous example except mira will perform a
mapping assembly in 'accurate' quality of the sequences against one or
more backbone sequences. mira will therefore additionally load the backbone
sequence(s) from the file <filename>foobar_backbone_in.fasta</filename>
(FASTA being the default type of backbone sequence to be loaded) and, if
existing, quality values for the backbone
from <filename>foobar_backbone_in.fasta.qual</filename>.
</para>
<para>
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fasta --highlyrepetitive -SB:bft=gbf</literal>
</para>
<para>
As above, except we have added an <emphasis role="bold">extensive
switch</emphasis> (<arg>-SB:bft</arg>) to tell mira that the backbones
are in a GenBank format file (GBF). MIRA will therefore load the
backbone sequence(s) from the file
<filename>foobar_backbone_in.gbf</filename>. Note that the GBF file can
also contain multiple entries, i.e., it can be a GBFF file.
</para>
<para>
E.g.: <literal>mira --project=foobar --job=mapping,accurate,sanger
--fastq --highlyrepetitive -SB:bft=gbf</literal>
</para>
<para>
As above, except we have changed the input type for all files from FASTA
to FASTQ.
</para>
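<para>
The file and directory names mentioned in the examples above all derive from
the project name. The following sketch lists them for a given project; it
assumes a POSIX shell and FASTA input, and <literal>mira_names</literal> is a
hypothetical helper, not part of MIRA. The optional second argument is the
backbone file type (<literal>fasta</literal> or <literal>gbf</literal>).
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Hypothetical helper (not part of MIRA): list the files and directories that
# the examples above associate with a project name (FASTA input assumed).
# Arguments: project name, optional backbone file type (fasta or gbf).
mira_names() {
    project=$1
    bft=${2:-}                                   # empty for de-novo assembly
    echo "${project}_in.sanger.fasta"            # reads
    echo "${project}_in.sanger.fasta.qual"       # optional quality values
    if [ -n "$bft" ]; then
        echo "${project}_backbone_in.${bft}"     # backbone for mapping
    fi
    echo "${project}_results"                    # result directory
    echo "${project}_info/${project}_info_contigstats.txt"
}
```
]]></programlisting>
<para>
For instance, <literal>mira_names foobar gbf</literal> prints the five names
that belong to the GenBank mapping example above.
</para>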

</sect1>
<title>
Using multiple processors
</title>
<para>
This feature is in its infancy: presently, only the SKIM algorithm uses
multiple threads. The number of threads for this stage can be
set via the <arg>-GE:not</arg>
parameter, e.g., <literal>-GE:not=4</literal> to use 4 threads.
</para>
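<para>
If the thread count should follow the machine rather than a fixed number, the
option can be built from the number of online processors. This is a sketch
under the assumption that <command>getconf _NPROCESSORS_ONLN</command> is
available (it is on common Linux, macOS and BSD systems);
<literal>mira_thread_option</literal> is a hypothetical helper name.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Sketch: build the -GE:not option from the number of online processors.
# Falls back to a single thread if the count cannot be determined.
mira_thread_option() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;   # not a plain number: use one thread
    esac
    echo "-GE:not=$n"
}
```
]]></programlisting>
<para>
The result can be appended to a mira call, for example as
<literal>$(mira_thread_option)</literal> at the end of the command line.
</para>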

</sect1>
<title>
Usage examples
</title>
<para>
</para>
<title>
Assembly from scratch with GAP4 and EXP files
</title>
<para>
A simple GAP4 project will do nicely. Please take care of the
following: you need already preprocessed experiment / FASTA / PHD
files, i.e., at least the sequencing vector should have been tagged
(in EXP files) or masked out (FASTA or PHD files). It would be nice if
some kind of not too lazy quality clipping had also been done for the
EXP files; pregap4 should do this for you.
</para>

<orderedlist>
<listitem>
<para>
Step 1: Create a file of filenames (named
<filename>mira_in.fofn</filename>) for the project you wish to
assemble. The file of filenames should contain the newline
separated names of the EXP-files and nothing else.
</para>
</listitem>
<listitem>
<para>
Step 2: Execute the mira assembly, optionally using command line
options or output redirection:
</para>
<screen>
<prompt>$</prompt> <userinput>/path/to/the/mira/package/mira <replaceable>... other options ...</replaceable></userinput></screen>
<para>
or simply
</para>
<screen>
<prompt>$</prompt> <userinput>mira <replaceable>... other options ...</replaceable></userinput></screen>
<para>
if MIRA is in a directory which is in your PATH. The result of the
assembly will now be in a directory
named <filename>mira_results</filename> where you will
find <filename>mira_out.caf</filename>, <filename>mira_out.html</filename>
etc. or in gap4 direct assembly format in
the <filename>mira_out.gap4da</filename> sub-directory.
</para>
</listitem>
<listitem>
<para>
Step 3a: <emphasis>(This is not recommended
anymore)</emphasis> Change to the gap4da directory and start gap4:
</para>
<screen>
<prompt>$</prompt> <userinput>cd mira_results/mira_out.gap4da</userinput>
<prompt>mira_results/mira_out.gap4da$</prompt> <userinput>gap4</userinput></screen>
<para>
Choose the menu 'File->New' and enter a name for your new database
(like 'demo'). Then choose the menu 'Assembly->Directed
assembly'. Enter the text 'fofn' in the entry
labelled <emphasis>Input readings from List or file
name</emphasis> and enter the text 'failures' into the entry
labelled <emphasis>Save failures to List or file name</emphasis>.
Press "OK".
</para>
<para>
That's it.
</para>
</listitem>
<listitem>
<para>
Step 3b: <emphasis>(Recommended)</emphasis> As an alternative to
step 3a, one can use the caf2gap converter (see below)
</para>
<screen>
<prompt>mira_results$</prompt> <userinput>caf2gap -project demo -version 0 -ace mira_out.caf</userinput>
<prompt>mira_results$</prompt> <userinput>gap4 DEMO.0</userinput></screen>
</listitem>
</orderedlist>
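<para>
Step 1 of the list above can be automated. The following sketch writes one
EXP file name per line into <filename>mira_in.fofn</filename>. It assumes
that the experiment files are in the current directory and end in
<filename>.exp</filename>, which is an assumption about your naming scheme and
not a MIRA requirement; <literal>make_fofn</literal> is a hypothetical helper.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Sketch for step 1: write one EXP file name per line into mira_in.fofn.
# Assumption: experiment files are in the current directory and end in .exp
make_fofn() {
    : > mira_in.fofn                       # start with an empty file
    for f in *.exp; do
        [ -e "$f" ] || continue            # pattern matched nothing
        printf '%s\n' "$f" >> mira_in.fofn
    done
    wc -l < mira_in.fofn | tr -d ' '       # report the number of entries
}
```
]]></programlisting>
<para>
Calling <literal>make_fofn</literal> in the directory with the EXP files
creates the file and prints how many names it contains.
</para>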

<para>
</para>
<formalpara>
<title>Out-of-the box example</title>
MIRA comes with a few really small toy projects to test usability on a
given system. Go to the minidemo directory and follow the instructions
given in the section for own projects above, but start with step 2.
Optionally, you might want to start mira while redirecting the output
to a file for later analysis.
</formalpara>
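<para>
Redirecting the output for later analysis can be done with plain shell
redirection. The sketch below wraps this in a small function for a POSIX
shell; <literal>run_logged</literal> is a hypothetical helper name, and the
log file name is your choice.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Sketch: run a command and store everything it prints (stdout and stderr)
# in a log file. The exit status of the command is passed through.
run_logged() {
    logfile=$1; shift
    "$@" > "$logfile" 2>&1
}
```
]]></programlisting>
<para>
For example, <literal>run_logged bchoc_log.txt mira --project=bchoc
--job=denovo,genome,accurate,sanger --fasta</literal> would keep the complete
mira output in <filename>bchoc_log.txt</filename>.
</para>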

</sect2>
<title>
Reassembly of GAP4 edited projects
</title>
<para>
It is sometimes desirable to reassemble a project that has already been
edited, for example when hidden data in reads has been uncovered or
when some repetitive bases have been tagged manually. The canonical
way to do this is by using CAF files as data exchange format and the
caf2gap and gap2caf converters available from the Sanger Centre
(<ulink url="http://www.sanger.ac.uk/Software/formats/CAF/"/>).
</para>
<warning>
The project will be completely reassembled: contig joins or breaks
that have been made in the GAP4 database will be lost, and you will get an
entirely new assembly with what mira determines to be the best
assembly.
</warning>

<itemizedlist>
<listitem>
<para>
Step 1: Convert your GAP4 project with the gap2caf tool. Assuming
that the assembly is in the GAP4
database <filename>CURRENT.0</filename>, the call is:
</para>
<screen>
<prompt>$</prompt> <userinput>gap2caf -project CURRENT -version 0 -ace > newstart_in.caf</userinput></screen>
<para>
The name <emphasis>"newstart"</emphasis> will be the project name
of the new assembly project.
</para>
</listitem>
<listitem>
<para>
Step 2: Start mira with the -caf option and tell it the name of
your new reassembly project:
</para>
<screen>
<prompt>$</prompt> <userinput>mira -caf=newstart</userinput></screen>
<para>
(and other options like --job etc. at will.)
</para>
</listitem>
<listitem>
<para>
Step 3: Convert the resulting CAF file
<filename>newstart_assembly/newstart_d_results/newstart_out.caf</filename>
to a gap4 database format as explained above and start gap4 with
the new database:
</para>
<screen>
<prompt>$</prompt> <userinput>cd newstart_assembly/newstart_d_results</userinput>
<prompt>newstart_assembly/newstart_d_results$</prompt> <userinput>caf2gap -project reassembled -version 0 -ace newstart_out.caf</userinput>
<prompt>newstart_assembly/newstart_d_results$</prompt> <userinput>gap4 REASSEMBLED.0</userinput></screen>
</listitem>
</itemizedlist>
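<para>
The three steps above always follow the same pattern, so the commands can be
generated from the database and project names. The following sketch only
prints the commands and does not run them; <literal>reassembly_plan</literal>
is a hypothetical helper, and the commands it prints are the ones shown in the
steps above with the names substituted.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Sketch: print (not run) the commands of the reassembly round trip.
# Arguments: GAP4 database name, its version, name of the new MIRA project,
# name of the resulting GAP4 database.
reassembly_plan() {
    db=$1; version=$2; project=$3; newdb=$4
    echo "gap2caf -project $db -version $version -ace > ${project}_in.caf"
    echo "mira -caf=$project"
    # The last command is meant to be run in the MIRA results directory.
    echo "caf2gap -project $newdb -version 0 -ace ${project}_out.caf"
}
```
]]></programlisting>
<para>
<literal>reassembly_plan CURRENT 0 newstart reassembled</literal> prints the
three commands used in this section, which can be reviewed before running
them by hand.
</para>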

</sect2>
<title>
Using backbones to perform a mapping assembly against a reference sequence
</title>
<para>
One useful feature of mira is the ability to assemble against already
existing reference sequences or contigs (also called a mapping assembly). The
parameters that control the behaviour of the assembly in these cases are in
the <arg>-STRAIN/BACKBONE</arg> section of the parameters.
</para>
<para>
Please have a look at the example in the <filename>minidemo/bbdemo2</filename> directory
which maps sequences from <emphasis>C.jejuni RM1221</emphasis> against (parts of) the genome
of <emphasis>C.jejuni NCTC11168</emphasis>.
</para>
<para>
There are a few things to consider when using backbone sequences:
</para>

<orderedlist>
<listitem>
<para>
Backbone sequences can be as long as needed! They are not subject
to the normal read length limit of 10k bases. That
is, if one wants to load one or several entire chromosomes of a
bacterium or lower eukaryote as backbone sequence(s), this is just
fine.
</para>
</listitem>
<listitem>
<para>
Backbone sequences can be single sequences like provided by, e.g.,
FASTA, FASTQ or GenBank files. But backbone sequences also can be
whole assemblies when they are provided as, e.g., CAF format. This
opens the possibility to perform semi-hybrid assemblies by
first assembling reads from one sequencing technology de-novo
(e.g. 454) and then mapping reads from another sequencing technology
(e.g. Solexa) to the whole 454 alignment instead of mapping them to
the 454 consensus.
</para>
<para>
A semi-hybrid assembly will therefore contain, like a hybrid
assembly, the reads of both sequencing technologies.
</para>
</listitem>
<listitem>
<para>
Backbone sequences will not be reversed! They will always appear in
forward direction in the output of the assembly. Please note: if the
backbone sequence consists of a CAF file that contains contigs which contain
reversed reads, then the contigs themselves will be in forward direction.
But the reads they contain that are in reverse complement direction will of
course also stay in reverse complement direction.
</para>
</listitem>
<listitem>
<para>
Backbone sequences will not be assembled together! That is, if a
sequence of the backbones has a perfect overlap with another backbone
sequence, they will still not be merged.
</para>
</listitem>
<listitem>
<para>
Reads are assembled to backbones in a first come, first served
scattering strategy.
</para>
<para>
Suppose you have two identical backbones and one read which
would match both, then the read would be mapped to the first
backbone. If you had two (almost) identical reads, the first
read would go to the first backbone, the second read to the
second backbone. With three almost identical reads, the first
backbone would get two reads, the second backbone one read.
</para>
</listitem>
<listitem>
<para>
Only in backbones loaded from CAF files: contigs made out of single
reads (singlets) lose their status as backbones and will be returned to the
normal read pool for the assembly process. That is, these sequences will be
assembled to other backbones or with each other.
</para>
</listitem>
</orderedlist>
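<para>
The scattering example in the list above (one, two or three almost identical
reads on two identical backbones) behaves like dealing the reads out in turn.
The following toy script reproduces those counts. It is an illustration of
the described outcome only, not MIRA's actual implementation, and
<literal>scatter_reads</literal> is a hypothetical name.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Toy illustration only (not MIRA's implementation): deal N almost identical
# reads out to B identical backbones in turn and print the count per backbone.
scatter_reads() {
    reads=$1; backbones=$2
    b=1
    while [ "$b" -le "$backbones" ]; do
        # Backbone b receives reads b, b+B, b+2B, ... up to N.
        if [ "$reads" -ge "$b" ]; then
            count=$(( (reads - b) / backbones + 1 ))
        else
            count=0
        fi
        echo "backbone $b: $count"
        b=$((b + 1))
    done
}
```
]]></programlisting>
<para>
<literal>scatter_reads 3 2</literal> reports two reads for the first backbone
and one for the second, matching the description above.
</para>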

<para>
</para>
<para>
Examples for using backbone sequences:
</para>
<itemizedlist>
<listitem>
<para>
Example 1: assume you have a genome of an existing organism. From
that, a mutant has been made by mutagenesis and you are skimming
the genome in shotgun mode for mutations. You would generate for
this a <emphasis>straindata</emphasis> file that gives the name of
the mutant strain to the newly sequenced reads and simply assemble
those against your existing genome, using the following
parameters:
</para>
<para>
<literal>-SB:lsd=yes:lb=yes:bsn=<replaceable>nameOriginalStrain</replaceable>:bft=<replaceable>caf|fasta|gbf</replaceable></literal>
</para>
<para>
When loading backbones from CAF, the qualities of the consensus
bases will be calculated by mira according to normal consensus
computing rules. When loading backbones from FASTA or GBF, one
can set the expected overall quality of the sequences (e.g. 1
error in 1000 bases = quality of 30) with
<arg>-SB:bbq=30</arg>. It is recommended to have the backbone
quality at least as high as the <arg>-CO:mgqrt</arg> value, so
that mira can automatically detect and report SNPs.
</para>
</listitem>
<listitem>
<para>
Example 2: suppose that you are in the process of performing a
shotgun sequencing and you want to determine the moment when you
have enough reads. One could make a complete assembly each day when
new sequences arrive. However, starting with genomes the size of a
lower eukaryote, this may become prohibitive from the
computational point of view. A quick and efficient way to resolve
this problem is to use the CAF file of the previous assembly as
backbone and simply add the new reads to the pool. The number of
singlets remaining after the assembly versus the total number of
reads of the project is a good measure for the coverage of the
project.
</para>
</listitem>
<listitem>
<para>
Example 3: in EST assembly with miraSearchESTSNPs, existing cDNA
sequences can also be useful when added to the project during step
3 (in the file <filename>step3_in.par</filename>). They will
provide a framework against which mRNA-contigs built in previous steps
will be assembled, allowing for a fast evaluation of the
results. Additionally, they provide a direction for the assembled
sequences so that one does not need to invert single contigs by
hand afterwards.
</para>
</listitem>
</itemizedlist>
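<para>
Two numbers from the examples above are easy to compute on the command line:
the quality value for an expected error rate (1 error in 1000 bases gives a
quality of 30, following the usual Phred scale Q = -10 * log10(p)), and the
share of reads left as singlets. The sketch below assumes a POSIX shell with
<command>awk</command>; both function names are hypothetical helpers and not
part of MIRA.
</para>
<programlisting><![CDATA[
```shell
#!/bin/sh
# Phred-scaled quality for an expected error rate p: Q = -10 * log10(p).
# awk's log() is the natural logarithm, hence the division by log(10).
phred_quality() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# Fraction of reads left as singlets (example 2); arguments are plain
# counts: number of singlets, total number of reads (must be non-zero).
singlet_fraction() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.3f\n", s / t }'
}
```
]]></programlisting>
<para>
The first helper can feed the backbone quality switch directly, e.g.
<literal>-SB:bbq=$(phred_quality 0.001)</literal>.
</para>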

<para>
</para>
</sect2>
</sect1>
</chapter>
