~ubuntu-branches/ubuntu/trusty/mira/trusty-proposed

Viewing changes to doc/docbook/attic/chap_454_part.xml

Committer: Package Import Robot
Author(s): Thorsten Alteholz, Thorsten Alteholz, Andreas Tille
Date: 2014-02-02 22:51:35 UTC
mfrom: (7.1.1 sid)
Revision ID: package-import@ubuntu.com-20140202225135-nesemzj59jjgogh0

Tags: 4.0-1

http://bugs.debian.org/735798

[ Thorsten Alteholz ]
* New upstream version
* debian/rules: add boost dir in auto_configure (Closes: #735798)

[ Andreas Tille ]
* cme fix dpkg-control
* debian/patches/{make.patch,spelling.patch}: applied upstream (thus removed)

files added:
doc/docbook/attic/chap_454_part.xml

doc/docbook/attic/chap_iontor_part.xml

doc/docbook/attic/chap_pacbio_part.xml

doc/docbook/attic/chap_sanger_part.xml

doc/docbook/attic/chap_solexa_part.xml

doc/docbook/bookfigures/results_miraconvert.png

doc/docbook/chap_dataprep_part.xml

doc/docbook/chap_denovo_part.xml

doc/docbook/chap_mapping_part.xml

doc/docbook/chap_seqtechdesc_part.xml

doc/docbook/chap_specialparams_part.xml

doc/docbook/replace_all.sh

m4/ax_boost_base.m4

m4/ax_boost_filesystem.m4

m4/ax_boost_iostreams.m4

m4/ax_boost_regex.m4

m4/ax_boost_system.m4

m4/ax_boost_thread.m4

m4/ax_check_zlib.m4

m4/ax_cxx_have_std.m4

m4/ax_cxx_have_stl.m4

m4/ax_cxx_namespaces.m4

m4/ax_lib_expat.m4

m4/libtool.m4

m4/ltoptions.m4

m4/ltsugar.m4

m4/ltversion.m4

m4/lt~obsolete.m4

src/mira/warnings.C

src/mira/warnings.H

src/modules

src/modules/Makefile.am

src/modules/Makefile.in

src/modules/misc.C

src/modules/misc.H

src/modules/mod_bait.C

src/modules/mod_bait.H

src/modules/mod_convert.C

src/modules/mod_convert.H

src/modules/mod_dbgreplay.C

src/modules/mod_memestim.C

src/modules/mod_memestim.H

src/modules/mod_mira.C

src/modules/mod_mira.H

src/modules/mod_tagsnp.C

src/modules/mod_tagsnp.H

src/progs/quirks.C

src/util/fmttext.C

src/util/fmttext.H

files removed:
.pc/boost-minimal.patch

.pc/boost-minimal.patch/src

.pc/boost-minimal.patch/src/progs

.pc/boost-minimal.patch/src/progs/Makefile.in

config

config/m4

config/m4/ax_boost_base.m4

config/m4/ax_boost_filesystem.m4

config/m4/ax_boost_iostreams.m4

config/m4/ax_boost_regex.m4

config/m4/ax_boost_system.m4

config/m4/ax_boost_thread.m4

config/m4/ax_check_zlib.m4

config/m4/ax_cxx_have_std.m4

config/m4/ax_cxx_have_stl.m4

config/m4/ax_cxx_namespaces.m4

config/m4/ax_lib_expat.m4

config/m4/libtool.m4

config/m4/ltoptions.m4

config/m4/ltsugar.m4

config/m4/ltversion.m4

config/m4/lt~obsolete.m4

doc/docbook/bookfigures/ion_dh10bcovB13_12kb.png

doc/docbook/bookfigures/ion_dh10bcovB13_320kb.png

doc/docbook/bookfigures/results_convert_project.png

doc/docbook/chap_454_part.xml

doc/docbook/chap_iontor_part.xml

doc/docbook/chap_pacbio_part.xml

doc/docbook/chap_sanger_part.xml

doc/docbook/chap_solexa_part.xml

src/progs/convert_project.C

src/progs/dbgreplay.C

src/scripts/fastaselect.tcl

src/scripts/fastqselect.tcl

files modified:
.pc/applied-patches

.tarball-version

.version

HELP_WANTED

INSTALL

Makefile.am

Makefile.in

README

README_build.html

THANKS

aclocal.m4

build-aux/git-version-gen

configure

configure.ac

debian/changelog

debian/control

debian/mira-assembler.lintian-overrides

debian/patches/series

debian/rules

debian/watch

distribution/Makefile

distribution/Makefile.am

distribution/Makefile.in

distribution/README

doc/Makefile

doc/Makefile.in

doc/docbook/Makefile

doc/docbook/Makefile.am

doc/docbook/Makefile.in

doc/docbook/attic/README.txt

doc/docbook/book_3rdparty.xml

doc/docbook/book_definitiveguide.xml

doc/docbook/chap_3p_ghbambus_part.xml

doc/docbook/chap_bitsandpieces_part.xml

doc/docbook/chap_est_part.xml

doc/docbook/chap_faq_part.xml

doc/docbook/chap_hard_part.xml

doc/docbook/chap_installation_part.xml

doc/docbook/chap_intro_part.xml

doc/docbook/chap_logfiles_part.xml

doc/docbook/chap_maf_part.xml

doc/docbook/chap_mirautils_part.xml

doc/docbook/chap_preface_part.xml

doc/docbook/chap_reference_part.xml

doc/docbook/chap_results_part.xml

doc/docbook/chap_seqadvice_part.xml

doc/docbook/copyrightfile

doc/docbook/warning_frontofchapter.xml

minidemo/bbdemo2/runme.sh

minidemo/estdemo2/README

src/Makefile.am

src/Makefile.in

src/caf/Makefile.in

src/caf/caf.C

src/debuggersupport/Makefile.in

src/debuggersupport/dbgsupport.C

src/errorhandling/Makefile.in

src/errorhandling/errorhandling.C

src/errorhandling/errorhandling.H

src/io/Makefile.in

src/io/fasta.C

src/io/fasta.H

src/io/fastq-lh.H

src/io/gap4_ft_so_map.xxd

src/io/ncbiinfoxml.C

src/io/ncbiinfoxml.H

src/io/phd.H

src/io/scf.C

src/io/scf.H

src/memorc/Makefile.in

src/memorc/main.C

src/memorc/memorc.C

src/mira/CHANGES.txt

src/mira/Makefile.am

src/mira/Makefile.in

src/mira/TODO

src/mira/adaptorsforclip.454.xxd

src/mira/ads.C

src/mira/align.C

src/mira/align.H

src/mira/assembly.C

src/mira/assembly.H

src/mira/assembly_info.C

src/mira/assembly_info.H

src/mira/assembly_io.C

src/mira/assembly_misc.C

src/mira/assembly_output.C

src/mira/assembly_output.H

src/mira/assembly_reduceskimhits.C

src/mira/assembly_swalign.C

src/mira/contig.C

src/mira/contig.H

src/mira/contig_analysis.C

src/mira/contig_consensus.C

src/mira/contig_edit.C

src/mira/contig_featureinfo.C

src/mira/contig_output.C

src/mira/contig_pairconsistency.C

src/mira/dataprocessing.C

src/mira/dataprocessing.H

src/mira/dynamic.C

src/mira/dynamic.H

src/mira/gbf_parse.H

src/mira/gff_parse.C

src/mira/hashstats.C

src/mira/hashstats.H

src/mira/hdeque.H

src/mira/maf_parse.C

src/mira/manifest.C

src/mira/manifest.H

src/mira/parameters.C

src/mira/parameters.H

src/mira/parameters_flexer.cc

src/mira/parameters_flexer.ll

src/mira/parameters_tokens.h

src/mira/pcrcontainer.C

src/mira/pcrcontainer.H

src/mira/ppathfinder.C

src/mira/ppathfinder.H

src/mira/preventinitfiasco.C

src/mira/read.C

src/mira/read.H

src/mira/readgrouplib.C

src/mira/readgrouplib.H

src/mira/readpool.C

src/mira/readpool.H

src/mira/sam_collect.C

src/mira/skim.C

src/mira/skim.H

src/mira/skim_farc.C

src/mira/skim_lowbph.C

src/mira/structs.H

src/progs/Makefile.am

src/progs/Makefile.in

src/progs/mira.C

src/progs/miramer.C

src/progs/quirks.H

src/scripts/Makefile.am

src/scripts/Makefile.in

src/scripts/fasta2frag.tcl

src/stdinc/Makefile.in

src/stdinc/defines.H

src/stdinc/stlincludes.H

src/stdinc/types.H

src/support/GTAGDB

src/support/Makefile.in

src/util/Makefile.am

src/util/Makefile.in

src/util/dptools.H

src/util/fileanddisk.C

src/util/fileanddisk.H

src/util/misc.C

src/version.H

src/version.stub

doc/docbook/attic/chap_454_part.xml

<?xml version="1.0" ?>

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.docbook.org/xml/4.5/docbookx.dtd">

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="versionfile"/>

<firstname>Bastien</firstname>

<surname>Chevreux</surname>

<email>bach@chevreux.org</email>

</author>

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="copyrightfile"/>

</chapterinfo>

<attribution>Solomon Short</attribution>

<para>

<emphasis><quote>Upset causes changes. Change causes upset.</quote></emphasis>

</para>

</blockquote>

<title>Assembly of 454 data with MIRA3</title>

<title>

Introduction

</title>

<para>

MIRA can assemble 454 type data either on its own or together with any

other technology MIRA knows how to handle (Illumina, Sanger,

etc.). Paired-end sequences coming from genomic projects can also be

used if you take care to prepare your data the way MIRA needs it.

</para>

<para>

MIRA goes a long way to assemble sequence in the best possible way: it

uses multiple passes, learning in each pass from errors that occurred in

the previous passes. There are routines specialised in handling oddities

that occur in different sequencing technologies.

</para>

<note>

<title>Tip</title> Use the MIRA version of

the <command>sff_extract</command> script, which is provided as a

download in the MIRA 3rd party software package. This script knows

about adaptor information and all the little important details when

extracting data from SFF into FASTQ (or FASTA) format.

</note>

<title>

Some reading requirements

</title>

<para>

This guide assumes that you have basic working knowledge of Unix

systems, know the basic principles of sequencing (and sequence

assembly) and what assemblers do.

</para>

<para>
While there are step-by-step walkthroughs on how to set up your 454
data and then perform an assembly, this guide expects you to read at
some point in time
</para>
<itemizedlist>
<listitem>
<para>
the <emphasis>"Caveats when using 454 data"</emphasis> section of
this document (just below). <emphasis
role="bold">This. Is. Important. Read. It!</emphasis>
</para>
</listitem>
<listitem>
<para>
the <emphasis>mira_usage</emphasis> introductory help file so that
you have a basic knowledge of how to set up projects in MIRA for
Sanger sequencing projects.
</para>
</listitem>
<listitem>
<para>
the <emphasis>GS FLX Data Processing Software Manual</emphasis>
from Roche Diagnostics (or the corresponding manual for the GS20
or Titanium instruments).
</para>
</listitem>
<listitem>
<para>
and last but not least the <emphasis>mira_reference</emphasis>
help file to look up some command line options.
</para>
</listitem>
</itemizedlist>

<para>

</para>

</sect2>

<title>

Playing around with some demo data

</title>

<para>
If you want to jump into action, I suggest you walk through the
"Walkthrough: combined unpaired and paired-end assembly of Brucella
ceti" section of this document to get a feeling for how things
work. That particular walkthrough uses paired and unpaired 454 data
from the NCBI short read archive, so be prepared to download a couple
of hundred MiB.
</para>
<para>
But please do not forget to come back to the "Caveats" section just
below later; it contains pointers to common traps lurking in the
depths of high-throughput sequencing.
</para>
</sect2>

<sect2>
<title>
Estimating memory needs
</title>
<para>
<emphasis>"Do I have enough memory?"</emphasis> used to be one of the
most frequently asked questions. To answer this question,
please use miramem, which will give you an estimate. Basically, you
just need to start the program and answer the questions; for more
information please refer to the corresponding section in the main MIRA
documentation.
</para>
<para>
Take this estimate with a grain of salt: depending on the properties of
the sequences, the estimate can be off by +/- 30%.
</para>
<para>
Take these estimates with an even larger grain of salt for
eukaryotes. Some of them are incredibly repetitive, and this currently
leads to an explosion of some secondary tables in MIRA. I'm
working on it.
</para>
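<para>
To make the +/- 30% figure concrete, here is a minimal Python sketch that
turns a single estimate into a planning range. The function name and the
10 GiB input are hypothetical; only the 30% tolerance comes from the text
above, and it should be treated as optimistic for repetitive eukaryotes.
</para>

```python
def memory_estimate_range(estimate_gib, tolerance=0.30):
    """Return (low, high) bounds around a memory estimate.

    The 0.30 default mirrors the +/- 30% rule of thumb from the text;
    it is not something miramem itself reports.
    """
    low = estimate_gib * (1.0 - tolerance)
    high = estimate_gib * (1.0 + tolerance)
    return low, high


# Hypothetical example: miramem suggested roughly 10 GiB.
low, high = memory_estimate_range(10.0)
print(f"plan for {low:.1f} to {high:.1f} GiB")
```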

</sect2>
</sect1>

<sect1>
<title>
Caveats when using 454 data
</title>
<para>
Please take some time to read this section. If you're really eager to
jump into action, then feel free to skip forward to the walkthrough, but
make sure to come back later.
</para>
<sect2>
<title>
Screen. Your. Sequences! (part 1)
</title>
<para>
Or at least use the vector clipping info provided in the SFF file and
have it put into the standard NCBI TRACEINFO XML format. Yes, that's
right: vector clipping info.
</para>
<para>
Here's the short story: 454 reads can contain a kind of vector
sequence. To be more precise, they can - and very often do - contain
the sequence of the (A or B)-adaptors that were used for sequencing.
</para>
<para>
To quote a competent bioinformatician who thankfully dug through quite
some data and patent filings to find out what is going on: "These
adaptors consist of a PCR primer, a sequencing primer and a key. The
B-adaptor is always in because it's needed for the emPCR and
sequencing. If the fragments are long enough, then one usually does
not reach the adaptor at all. But if the fragments are too short -
tough luck."
</para>
<para>
Basically it's tough luck for a lot of the 454 sequencing
projects I have seen so far, both for public data (sequences available
at the NCBI trace archive) and non-public data.
</para>
</sect2>

<sect2>
<title>
Screen. Your. Sequences! (optional part 2)
</title>
<para>
Some labs use specially designed tags for their sequencing (I've heard
of cases with up to 20 bases). Because the tag sequences are always
practically identical, they will behave like vector sequences in an
assembly. As with any other assembler: if you happen to get such a
project, then you must take care that those tags are filtered out of,
or masked in, your sequences before they go into an assembly. If you
don't, the results will be messy at best.
</para>
<note>
<title>Tip</title> Put your FASTAs through SSAHA2 or, better, SMALT
with the sequence of your tags as masking target. MIRA can read the
SSAHA2 output (or SMALT when using "-f ssaha" output) and mask
internally using the MIRA <arg>-CL:msvs*</arg> parameters.
</note>
</sect2>

<sect2>
<title>
To right clip or not to right clip?
</title>
<para>
Sequences coming from the GS20, FLX or Titanium usually have pretty
good clip points set by the Roche/454 preprocessing software. There
is, however, a tendency to overestimate the quality towards the end of
the sequences and declare sequence parts as 'good' which really
shouldn't be.
</para>
<para>
Sometimes, these bad parts toward the end of sequences are so
annoyingly bad that they prevent MIRA from correctly building contigs,
that is, instead of one contig you might get two.
</para>
<para>
MIRA has the <arg>-CL:pec</arg> clipping option to deal with these
annoyances (standard for all <literal>--job=genome</literal>
assemblies). This algorithm performs <emphasis>proposed end
clipping</emphasis>, which will guarantee that the ends of reads are
clean when the coverage of a project is high enough.
</para>
<para>
For genomic sequences, the term 'enough' is somewhat fuzzy:
everything above a coverage of 15x should be no problem at all, and
coverages above 10x should also be fine. Things start to get tricky
below 10x, but give it a try. Below 6x, however, switch off
the <arg>-CL:pec</arg> option.
</para>
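<para>
The rule of thumb above can be written down as a small Python sketch.
It assumes coverage is estimated as total sequenced bases divided by the
expected genome size; the function name, the example numbers and the
handling of the exact boundary values (15x, 10x, 6x) are assumptions made
for illustration and are not part of MIRA.
</para>

```python
def pec_advice(total_bases, genome_size):
    """Map an estimated coverage onto the -CL:pec rule of thumb.

    Coverage is approximated as total_bases / genome_size. The text does
    not say what to do at exactly 15x, 10x or 6x, so these boundaries are
    a choice made for this sketch.
    """
    coverage = total_bases / genome_size
    if coverage >= 15:
        advice = "no problem at all, keep -CL:pec"
    elif coverage >= 10:
        advice = "should be fine, keep -CL:pec"
    elif coverage >= 6:
        advice = "tricky, but give it a try"
    else:
        advice = "switch off -CL:pec"
    return coverage, advice


# Hypothetical project: 300 Mbases of reads for a 10 Mbase genome.
coverage, advice = pec_advice(300_000_000, 10_000_000)
print(f"{coverage:.1f}x: {advice}")
```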

</sect2>
<sect2>
<title>
Left clipping wrongly preprocessed data
</title>
<para>
Short intro, to be expanded (see the example in the B. ceti walkthrough).
</para>
</sect2>
</sect1>

<sect1>
<title>
Walkthrough: a 454 assembly with unpaired reads
</title>
<sect2>
<title>
Preparing the 454 data for MIRA
</title>
<para>
The basic data type you will get from the sequencing instruments will
be SFF files. Those files contain almost all information needed for an
assembly, but they need to be converted into more standard files
before MIRA can use this information.
</para>
<para>
Let's assume we just sequenced a bug (<emphasis>Bacillus
chocorafoliensis</emphasis>) and internally our department uses the
short <emphasis>bchoc</emphasis> mnemonic for the
project/organism/whatever. So, whenever you
see <emphasis>bchoc</emphasis> in the following text, you can replace
it by whatever name suits you.
</para>
<para>
For this example, we will assume that you have created a directory
<filename>myProject</filename> for the data of your project and that
the SFF files are in there. Doing a <literal>ls -lR</literal> should
give you something like this:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -lR</userinput>
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff
</screen>
<para>
As you can see, this sequencing project has 3 <filename>SFF</filename>
files.
</para>
<para>
We'll use <command>sff_extract</command>:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>sff_extract -o bchoc EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff</userinput></screen>
<note>
For more information on how to use <command>sff_extract</command>,
please refer to the chapter in the NCBI Trace and Short Read archive.
</note>
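<para>
If you script your conversions, the call above can be assembled
programmatically. The following Python sketch only builds the argument
list; it uses nothing beyond the <literal>-o</literal> option shown in
this walkthrough, and the helper name is hypothetical.
</para>

```python
import shlex


def sff_extract_command(prefix, sff_files):
    """Build the sff_extract call used above: -o PREFIX, then the SFF files."""
    if not sff_files:
        raise ValueError("need at least one SFF file")
    return ["sff_extract", "-o", prefix, *sff_files]


cmd = sff_extract_command(
    "bchoc", ["EV10YMP01.sff", "EV5RTWS01.sff", "EVX95GF02.sff"]
)
# Show the command as it would be typed at the prompt.
print(shlex.join(cmd))
```

<para>
Passing the list to <literal>subprocess.run(cmd, check=True)</literal>
would then run the conversion, provided sff_extract is on your path.
</para>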

<para>
This can take some time: the 1.2 million FLX reads from this
example need approximately 9 minutes for conversion. Your directory
should now look something like this:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc.fastq
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc.xml
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff</screen>
<para>
At this point, the SFFs are not needed anymore. You can remove them
from this directory if you want.
</para>
</sect2>

<sect2>
<title>
Writing a manifest
</title>
<para>
The manifest is a configuration file for an assembly: it controls what
type of assembly you want to do and which data should go into the
assembly. For this first example, we just need a very simple manifest:
</para>
<screen>
# A manifest file can contain comment lines, these start with the #-character

# First part of a manifest: defining some basic things

# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)

<userinput>project = <replaceable>MyFirstAssembly</replaceable>
job = <replaceable>genome,denovo,accurate</replaceable>
parameters = <replaceable>-GE:not=4</replaceable></userinput>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups", for more information
# please consult the MIRA manual, chapter "Reference"

<userinput>readgroup = <replaceable>SomeUnpaired454ReadsIGotFromTheLab</replaceable>
technology = <replaceable>454</replaceable>
data = <replaceable>bchoc.fastq</replaceable> <replaceable>bchoc.xml</replaceable></userinput></screen>
<para>
Save the above lines into a file, we'll use
<filename>bchoc_1st_manifest.conf</filename> in this example.
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc.fastq
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc.xml
-rw-r--r-- 1 bach users 756 2011-11-05 17:57 bchoc_1st_manifest.conf</screen>
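<para>
When you run many similar projects, you may prefer to generate the
manifest instead of typing it. The Python sketch below renders the same
keys used in the example above (project, job, parameters, readgroup,
technology, data). The helper is hypothetical and simply writes
<literal>key = value</literal> lines; it does not validate anything
against MIRA.
</para>

```python
def render_manifest(project, job, parameters, readgroups):
    """Render manifest text from general settings plus a list of readgroups.

    Each readgroup is a list of (key, value) pairs and is written in the
    given order, separated from the previous block by an empty line.
    """
    lines = [
        f"project = {project}",
        f"job = {job}",
        f"parameters = {parameters}",
    ]
    for group in readgroups:
        lines.append("")
        for key, value in group:
            lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


manifest = render_manifest(
    "MyFirstAssembly",
    "genome,denovo,accurate",
    "-GE:not=4",
    [[
        ("readgroup", "SomeUnpaired454ReadsIGotFromTheLab"),
        ("technology", "454"),
        ("data", "bchoc.fastq bchoc.xml"),
    ]],
)
print(manifest, end="")
```

<para>
Writing the returned text to <filename>bchoc_1st_manifest.conf</filename>
gives a file equivalent to the hand-written one, minus the comments.
</para>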

</sect2>

<sect2>
<title>
Starting the assembly
</title>
<para>
Starting the assembly is now just a matter of one line:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>mira <replaceable>bchoc_1st_manifest.conf &gt;&amp;log_assembly.txt</replaceable></userinput></screen>
<para>
Now, that was easy, wasn't it? In the above example - for assemblies
having only 454 data and if you followed the walkthrough on how to
prepare the data - all you might want to adapt at
first are the following lines in the manifest file:
</para>
<itemizedlist>
<listitem>
<para>
project= (for naming your assembly project)
</para>
</listitem>
<listitem>
<para>
job= (perhaps to change the quality of the assembly to 'draft')
</para>
</listitem>
<listitem>
<para>
parameters= -GE:not=xxx (perhaps to change the number of processors)
</para>
</listitem>
</itemizedlist>
<para>
Of course, you are free to change any option via the extended
parameters, but this is covered in the MIRA main reference manual.
</para>
</sect2>
</sect1>

<sect1>
<title>
Walkthrough: a paired-end Sanger / unpaired 454 hybrid assembly
</title>
<para>
Preparing the data for a Sanger / 454 hybrid assembly takes some more steps
but is not really more complicated than a normal Sanger-only or 454-only
assembly.
</para>
<para>
In the following sections, the files with 454 input data will have
<filename>.454.</filename> in the name, files with Sanger data have
<filename>.sanger.</filename>. That's just a convention I use; you do
not need to do that, but it helps to keep things nicely organised.
</para>
<sect2>
<title>
Preparing the 454 data
</title>
<para>
Please proceed exactly in the same way as described for the assembly
of unpaired 454-only data in the section above, but stop before
writing a manifest and starting the actual assembly. The only
difference: in the <command>sff_extract</command> part, use "-o" with
the parameter "bchoc.454" to get the files named accordingly.
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>sff_extract -o bchoc.454 EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff</userinput></screen>
<para>
In the end you should have two files (FASTQ and TRACEINFO) for the 454
data ready.
</para>
</sect2>

<sect2>
<title>
Preparing the Sanger data
</title>
<para>
There are quite a number of sequencing providers out there, all with
different pre-processing pipelines and different output
file types. MIRA supports quite a number of them; the most
important would probably be
</para>
<orderedlist>
<listitem>
<para>
(preferred option) FASTQ files and ancillary data in NCBI
TRACEINFO XML format.
</para>
</listitem>
<listitem>
<para>
(preferred option) FASTA files which are coupled with FASTA quality
files and ancillary data in NCBI TRACEINFO XML format.
</para>
</listitem>
<listitem>
<para>
(preferred option) CAF (from the Sanger Institute) files that
contain the sequence, quality values and ancillary data like
clippings etc.
</para>
</listitem>
<listitem>
<para>
(secondary option) EXP files as written by the Staden pregap4 package.
</para>
</listitem>
</orderedlist>
<para>
Your sequencing provider MUST have performed at least a sequencing
vector clip on this data. It is also a good idea to have the provider
do a quality clip, as they usually know best what quality they can expect
from their instruments (although MIRA can also do this if you want).
</para>
<para>
You can either perform clipping the hard way by physically removing
all clipped bases from the input (this is
called <emphasis>trimming</emphasis>), or you can keep the clipped
bases in the input file and provide clipping information in ancillary
data files. This clipping information then MUST be present in the
ancillary data (either the TRACEINFO XML, or in the combined CAF, or
in the EXP files), together with other standard data like, e.g.,
mate-pair information when using a paired-end approach.
</para>
<para>
This example assumes that the data is provided as FASTA together with a
quality file and ancillary data in NCBI TRACEINFO XML format.
</para>
<para>
Put these files (appropriately renamed) into the directory with the
454 data.
</para>
<para>
Here's approximately how the directory with the preprocessed data
should now look:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.454.fastq
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.454.xml

-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.sanger.fastq
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.sanger.xml</screen>
</sect2>

<sect2>
<title>
Writing a manifest
</title>
<para>
This assembly contains unpaired 454 data and paired-end Sanger
data. Let's assume the 454 data to be exactly the same as for the
previous walkthrough. For the Sanger data, let's assume the template
DNA size for the Sanger library to be between 2500 and 3500 bases and
the read naming to follow the TIGR naming scheme:
</para>
<screen>
# A manifest file can contain comment lines, these start with the #-character

# First part of a manifest: defining some basic things

# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)

<userinput>project = <replaceable>MyFirstHybridAssembly</replaceable>
job = <replaceable>genome,denovo,accurate</replaceable>
parameters = <replaceable>-GE:not=4</replaceable></userinput>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups", for more information
# please consult the MIRA manual, chapter "Reference"

<userinput>readgroup = <replaceable>SomeUnpaired454ReadsIGotFromTheLab</replaceable>
technology = <replaceable>454</replaceable>
data = <replaceable>bchoc.454.*</replaceable></userinput>

# Note the wildcard "bchoc.454.*" in the data line above: this
# will load both the FASTQ and XML data

<userinput>readgroup = <replaceable>SomePairedSangerReadsIGotFromTheLab</replaceable>
technology = <replaceable>sanger</replaceable>
template_size = <replaceable>2500 3500</replaceable>
segment_placement = <replaceable>---&gt; &lt;---</replaceable>
segment_naming = <replaceable>TIGR</replaceable>
data = <replaceable>bchoc.sanger.*</replaceable></userinput></screen>
<para>
If you compare the manifest above with the manifest in the walkthrough
for using only unpaired 454 data, you will see that large parts, i.e.,
the definition of the job, the parameters and the 454 read group, are
<emphasis>exactly</emphasis> the same. The only differences are in the
naming of the assembly project (in <literal>project =</literal>), and
the definition of a second readgroup containing the Sanger sequencing
data.
</para>
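<para>
To see the structure of such a manifest at a glance, it can help to read
it back into general settings and readgroups. The Python sketch below
does this under simplifying assumptions taken from the examples in this
chapter only: one <literal>key = value</literal> pair per line, comments
starting with #, and every <literal>readgroup</literal> line opening a
new group. It is not MIRA's own parser, and the
<literal>segment_placement</literal> line is left out of the sample text
for brevity.
</para>

```python
def parse_manifest(text):
    """Split manifest text into general settings and a list of readgroups."""
    settings, readgroups, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        # Split on the first "=" only, so values such as -GE:not=4 survive.
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "readgroup":
            current = {"readgroup": value}
            readgroups.append(current)
        elif current is None:
            settings[key] = value
        else:
            current[key] = value
    return settings, readgroups


example = """\
# general part
project = MyFirstHybridAssembly
job = genome,denovo,accurate
parameters = -GE:not=4

readgroup = SomeUnpaired454ReadsIGotFromTheLab
technology = 454
data = bchoc.454.*

readgroup = SomePairedSangerReadsIGotFromTheLab
technology = sanger
template_size = 2500 3500
segment_naming = TIGR
data = bchoc.sanger.*
"""

settings, readgroups = parse_manifest(example)
for group in readgroups:
    print(group["readgroup"], group["technology"], group["data"])
```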

</sect2>
<sect2>
<title>
Starting the hybrid assembly
</title>
<para>
Quite unsurprisingly, the command to start the assembly is exactly the same as always:
</para>
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>mira <replaceable>myassembly_manifest.conf</replaceable> &gt;&amp;log_assembly.txt</userinput></screen>
</sect2>
</sect1>

<sect1>
<title>
Walkthrough: combined unpaired and paired-end assembly of Brucella ceti
</title>
<para>
Here's a walkthrough which should help you in setting up your own assemblies. You
do not need to set up your directory structures as I do, but for this
walkthrough it could help.
</para>
<note>
This walkthrough was written at a time when the NCBI still offered SFFs
for 454 data, which it no longer does. However, the approach is
still valid for your own data, for which you should get SFFs.
</note>
<note>
This walkthrough was written at a time when the primary input for 454
data in MIRA was FASTA + FASTA quality files. This has shifted
nowadays to FASTQ as input (it's more compact and faster to parse). I'm
sure you will be able to make the necessary changes to the command line
of <command>sff_extract</command> yourself :-)
</note>
<sect2>
<title>
Preliminaries
</title>
<para>
Please make sure that sff_extract is working properly and that you have
at least version 0.2.1 (use <literal>sff_extract -v</literal>). Please also make sure
that SSAHA2 can be run correctly (test this by running <literal>ssaha2 -v</literal>).
</para>
</sect2>

<sect2>
<title>
Preparing your file system
</title>
<para>
Note: this is how I set up a project, feel free to implement whatever
structure suits your needs.
</para>
<screen>
<prompt>$</prompt> <userinput>mkdir bceti</userinput>
<prompt>$</prompt> <userinput>cd bceti</userinput>
<prompt>bceti_assembly$</prompt> <userinput>mkdir origdata data assemblies</userinput></screen>
<para>
Your directory should now look like this:
</para>
<screen>
<prompt>arcadia:bceti$</prompt> <userinput>ls -l</userinput>
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 assemblies
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 data
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 origdata</screen>
<para>
Explanation of the structure:
</para>
<itemizedlist>
<listitem>
<para>
the <filename>origdata</filename> directory will contain the 'raw'
result files that one might get from sequencing.
</para>
</listitem>
<listitem>
<para>
the <filename>data</filename> directory will contain the
preprocessed sequences for the assembly, ready to be used by MIRA.
</para>
</listitem>
<listitem>
<para>
the <filename>assemblies</filename> directory will contain
assemblies we make with our data (we might want to make more than
one).
</para>
</listitem>
</itemizedlist>
<para>
</para>
</sect2>

613

614

<title>

615

Getting the data

616

</title>

617

<note>

618

Since early summer 2009, the NCBI does not offer SFF files anymore,

619

which is a pity. This guide will nevertheless allow you to perform

620

similar assemblies on own data.

621

</note>

622

<para>

623

Please browse to

624

625

and

626

627

and download the SFF files to the <filename>origdata</filename>

628

directory (press the download button on those pages).

629

</para>

630

<para>

631

En passant, note the following: SRR005481 is described to be a 454 FLX

632

data set where the library contains unpaired data ("Library Layout:

633

SINGLE"). SRR005482 has also 454 FLX data, but this time it's

634

paired-end data ("Library Layout: PAIRED

635

(ORIENTATION=forward)"). Knowing this will be important later on in

636

the process.

637

</para>

638

639

<prompt>arcadia:bceti$</prompt> <userinput>cd origdata</userinput>

640

<prompt>arcadia:origdata$</prompt> <userinput>ls -l</userinput>

641

-rw-r--r-- 1 bach users 240204619 2008-11-08 16:49 SRR005481.sff.gz

642

-rw-r--r-- 1 bach users 211333635 2008-11-08 16:55 SRR005482.sff.gz</screen>

643

<para>

644

We need to unzip those files:

645

</para>

646

647

<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>gunzip *.gz</userinput></screen>

648

<para>

649

And now this directory should look like this

650

</para>

651

652

<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>ls -l</userinput>

653

-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff

654

-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff</screen>

655

<para>

656

Now move into the (still empty) <filename>data</filename> directory:

657

</para>

658

<screen>

659

<prompt>arcadia:origdata$</prompt> <userinput>cd ../data</userinput></screen>

660

</sect2>

661

662

<title>

663

Data preprocessing with sff_extract

664

</title>

665

<para>

666

</para>

667

668

<title>

669

Extracting unpaired data from SFF

670

</title>

671

<para>

672

We will first extract the data from the unpaired experiment

673

(SRR005481); the generated file names should all start

674

with <emphasis>bceti</emphasis>:

675

</para>

676

<screen>

677

<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>sff_extract -o bceti ../origdata/SRR005481.sff</userinput>

678

Working on '../origdata/SRR005481.sff':

679

Converting '../origdata/SRR005481.sff' ... done.

680

Converted 311201 reads into 311201 sequences.

681

682

********************************************************************************

683

WARNING: weird sequences in file ../origdata/SRR005481.sff

684

685

After applying left clips, 307639 sequences (=99%) start with these bases:

686

TCTCCGTC

687

688

This does not look sane.

689

690

Countermeasures you *probably* must take:

691

1) Make your sequence provider aware of that problem and ask whether this can be

692

corrected in the SFF.

693

2) If you decide that this is not normal and your sequence provider does not

694

react, use the --min_left_clip of sff_extract.

695

(Probably '--min_left_clip=13' but you should cross-check that)

696

********************************************************************************</screen>

697

<para>

698

(Note: I got this on the SRR005481 data set downloaded in October

699

2008. In the meantime, the sequencing center or NCBI may have

700

corrected the error)

701

</para>

702

<para>

703

Wait a minute ... what happened here?

704

</para>

705

<para>

706

We launched a pretty standard extraction of reads where the whole

707

sequence was extracted and saved in the FASTA files and FASTA

708

quality files, and clipping information will be given in the

709

XML. Additionally, the clipped parts of every read will be shown in

710

lower case in the FASTA file.

711

</para>

712

<para>

713

After two or three minutes, the directory looked like this:

714

</para>

715

<screen>

716

<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>ls -l</userinput>

717

-rw-r--r-- 1 bach users 91863124 2008-11-08 17:15 bceti.fasta

718

-rw-r--r-- 1 bach users 264238484 2008-11-08 17:15 bceti.fasta.qual

719

-rw-r--r-- 1 bach users 52197816 2008-11-08 17:15 bceti.xml</screen>

720

</sect3>

721

722

<title>

723

Dealing with wrong clip-offs in the SFF

724

</title>

725

<para>

726

In the example above, sff_extract discovered an unusual sequence

727

pattern and gave a (stern) warning: almost all the sequences

728

created for the FASTA file had a skew in the distribution of bases.

729

</para>

730

<para>

731

Let's have a look at the first 30 bases of the first 20 sequences of

732

the FASTA that was created:

733

</para>

734

<screen>

735

<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>head -40 bceti.fasta | grep -v ">" | cut -c 1-30</userinput>

736

tcagTCTCCGTCGCAATCGCCGCCCCCACA

737

tcagTCTCCGTCGGCGCTGCCCGCCCGATA

738

tcagTCTCCGTCGTGGAGGATTACTGGGCG

739

tcagTCTCCGTCGGCTGTCTGGATCATGAT

740

tcagTCTCCGTCCTCGCGTTCGATGGTGAC

741

tcagTCTCCGTCCATCTGTCGGGAACGGAT

742

tcagTCTCCGTCCGAGCTTCCGATGGCACA

743

tcagTCTCCGTCAGCCTTTAATGCCGCCGA

744

tcagTCTCCGTCCTCGAAACCAAGAGCGTG

745

tcagTCTCCGTCGCAGGCGTTGGCGCGGCG

746

tcagTCTCCGTCTCAAACAAAGGATTAGAG

747

tcagTCTCCGTCCTCACCCTGACGGTCGGC

748

tcagTCTCCGTCTTGTGCGGTTCGATCCGG

749

tcagTCTCCGTCTGCGGACGGGTATCGCGG

750

tcagTCTCCGTCTCGTTATGCGCTCGCCAG

751

tcagTCTCCGTCTCGCATTTTCCAACGCAA

752

tcagTCTCCGTCCGCTCATATCCTTGTTGA

753

tcagTCTCCGTCCTGTGCTGGGAAAGCGAA

754

tcagTCTCCGTCTCGAGCCGGGACAGGCGA

755

tcagTCTCCGTCGTCGTATCGGGTACGAAC</screen>

756

<para>

757

What you see is the following: the leftmost 4

758

characters <literal>tcag</literal> of every read are the last bases

759

of the standard 454 sequencing adaptor A. The fact that they are

760

given in lower case means that they are clipped away in the SFF

761

(which is good).

762

</para>

763

<para>

764

However, if you look closely, you will see that there is something

765

peculiar: after the adaptor sequence, all reads seem to start with

766

exactly the same sequence <literal>TCTCCGTC</literal>. This is *not*

767

sane.

768

</para>

769

<para>

770

This means that the left clip of the reads in the SFF has not been

771

set correctly. The reason for this is probably a wrong value which

772

was used in the 454 data processing pipeline. This seems to be a

773

problem especially when custom sequencing adaptors are used.

774

</para>

775

<para>

776

In this case, the result is pretty catastrophic: out of the 311201

777

reads in the SFF, 307639 (98.86%) show this behaviour. We will

778

certainly need to get rid of these first 12 bases.

779

</para>
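<para>
The same check can be scripted. The following is a minimal Python sketch, not
part of sff_extract or MIRA: it assumes a FASTA file in which clipped bases
are written in lower case (as described above) and reports the most common
start of the unclipped part of the reads. The function names, the k-mer
length of 8 and the third sample read are assumptions made for illustration.
</para>

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def common_start(lines, k=8):
    """Return (most common k-mer after the left clip, its count, total reads).

    Leading lower-case characters are treated as clipped bases and skipped.
    """
    counts = Counter()
    total = 0
    for _, seq in read_fasta(lines):
        total += 1
        start = 0
        while start < len(seq) and seq[start].islower():
            start += 1
        counts[seq[start:start + k]] += 1
    if not counts:
        return None, 0, 0
    kmer, count = counts.most_common(1)[0]
    return kmer, count, total


# Two reads taken from the listing above plus one hypothetical "normal" read.
sample = [
    ">read1", "tcagTCTCCGTCGCAATCGCCGCCCCCACA",
    ">read2", "tcagTCTCCGTCGGCGCTGCCCGCCCGATA",
    ">read3", "tcagACGTACGTTTGACCA",
]
result = common_start(sample)
print(result)
```

<para>
To run this on a real file, pass an open file handle instead of the sample
list, for example <literal>common_start(open("bceti.fasta"))</literal>. A
single k-mer shared by nearly all reads is the symptom discussed here.
</para>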

780

<para>

781

Now, in cases like these, there are three steps that you really

782

should follow:

783

</para>

784

<orderedlist>

785

<listitem>

786

<para>

787

Is this something that you expect from the experimental setup?

788

If yes, then all is OK and you don't need to take further

789

action. But I suppose that for 99% of all people, these abnormal

790

sequences are not expected.

791

</para>

792

</listitem>

793

<listitem>

794

<para>

795

Contact. Your. Sequence. Provider! The underlying problem is

796

something that *MUST* be resolved on their side, not on

797

yours. It might be a simple human mistake, but it might very

798

well be a symptom of a deeper problem in their quality

799

assurance. Notify. Them. Now!

800

</para>

801

</listitem>

802

<listitem>

803

<para>

804

In the meantime (or if the sequencing provider does not react),

805

you can use the <arg>--min_left_clip</arg> command line option

806

of sff_extract as suggested in the warning message.

807

</para>

808

</listitem>

809

</orderedlist>

810

<para>

811

</para>

812

<para>

813

So, to correct for this error, we will redo the extraction of the

814

sequences from the SFF, this time telling sff_extract to set the left

815

clip at base 13 or higher:

816

</para>

817

<screen>

818

<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>sff_extract -o bceti --min_left_clip=13 ../origdata/SRR005481.sff</userinput>

819

Working on '../origdata/SRR005481.sff':

820

Converting '../origdata/SRR005481.sff' ... done.

821

Converted 311201 reads into 311201 sequences.

822

<prompt>arcadia:sff_from_ncbi/bceti_assembly/data$</prompt> <userinput>ls -l</userinput>

823

-rw-r--r-- 1 bach users 91863124 2008-11-08 17:31 bceti.fasta

824

-rw-r--r-- 1 bach users 264238484 2008-11-08 17:31 bceti.fasta.qual

825

-rw-r--r-- 1 bach users 52509017 2008-11-08 17:31 bceti.xml</screen>

826

<para>

827

This concludes the small intermezzo on how to deal with wrong left

828

clips.

829

</para>

830

</sect3>

831

</sect2>

832

833

<title>

834

Preparing an assembly

835

</title>

836

<para>

837

Preparing an assembly is now just a matter of setting up a directory and

838

linking the input files into that directory.

839

</para>

840

<screen>

841

<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>cd ../assemblies/</userinput>

842

<prompt>arcadia:bceti_assembly/assemblies$</prompt> <userinput>mkdir arun_08112008</userinput>

843

<prompt>arcadia:bceti_assembly/assemblies$</prompt> <userinput>cd arun_08112008</userinput>

844

<prompt>arcadia:assemblies/arun_08112008$</prompt> <userinput>ln -s ../../data/* .</userinput>

845

<prompt>arcadia:bceti_assembly/assemblies/arun_08112008$</prompt> <userinput>ls -l</userinput>

846

lrwxrwxrwx 1 bach users 29 2008-11-08 18:17 bceti.454.fasta -> ../../data/bceti.454.fasta

847

lrwxrwxrwx 1 bach users 34 2008-11-08 18:17 bceti.454.fasta.qual -> ../../data/bceti.454.fasta.qual

848

lrwxrwxrwx 1 bach users 33 2008-11-08 18:17 bceti.454.xml -> ../../data/bceti.454.xml</screen>

849

</sect2>

850

851

<title>

852

Starting the assembly 2

853

</title>

854

<para>

855

Start an assembly with the options you like, for example like this:

856

</para>

857

<screen>

858

<!-- TODO: outdated invocation; rewrite this example to use a manifest file -->
<prompt>$</prompt> <userinput>mira --project=bceti --job=denovo,genome,accurate,454 >&log_assembly</userinput></screen>

859

</sect2>

860

</sect1>

861

862

<title>

863

What to do with the MIRA result files?

864

</title>

865

<note>

866

Please consult the corresponding section in the

867

<emphasis>mira</emphasis> <emphasis>usage</emphasis> document; it

868

contains much more information than this stub.

869

</note>

870

<para>

871

But basically, after the assembly has finished, you will find four

872

directories. The <filename>tmp</filename> directory can be deleted

873

without remorse as it contains logs and a tremendous amount of

874

temporary data (dozens of gigabytes for bigger

875

projects). The <filename>info</filename> directory has some text files

876

with basic statistics and other informative files. Start by having a

877

look at <filename>*_info_assembly.txt</filename>; it'll give you a

878

first idea on how the assembly went.

879

</para>

880

<para>

881

The <filename>results</filename> directory finally contains the assembly

882

files in different formats, ready to be used for further processing with

883

other tools.

884

</para>

885

<para>

886

If you used the uniform read distribution option, you will inevitably

887

need to filter your results as this option produces larger and better

888

alignments, but also more ``debris contigs''. For this, use

889

miraconvert, which is distributed together with the MIRA package.

890

</para>

891

<para>

892

Also very important when analysing 454 assemblies: screen the small

893

contigs ( < 1000 bases) for abnormal behaviour. You wouldn't be the

894

first to have some human DNA contamination in a bacterial sequencing. Or

895

some herpes virus sequence in a bacterial project. Or some bacterial DNA

896

in a human data set. Check whether these small contigs

897

</para>

898

<itemizedlist>

899

<listitem>

900

<para>

901

have a different GC content than the large contigs

902

</para>

903

</listitem>

904

<listitem>

905

<para>

906

produce BLAST hits, when searched against some selected databases,

907

in other organisms that you certainly were not

908

sequencing.

909

</para>

910

</listitem>

911

</itemizedlist>
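<para>
The GC content check can be automated along the following lines. This is a
minimal Python sketch rather than a MIRA tool: it assumes the contigs have
already been loaded into a dictionary mapping names to sequences, and the
function names as well as the allowed GC difference of 0.10 are arbitrary
illustrative choices. Only the 1000 base cutoff comes from the text above.
</para>

```python
def gc_content(seq):
    """Return the fraction of G and C among the A, C, G, T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def flag_small_contigs(contigs, size_cutoff=1000, max_gc_delta=0.10):
    """Return the names of small contigs with a suspicious GC content.

    contigs maps contig names to sequences. The reference GC content is
    computed over all contigs of at least size_cutoff bases pooled together;
    a smaller contig is flagged if its GC content differs from that
    reference by more than max_gc_delta (an assumed threshold).
    """
    large = "".join(s for s in contigs.values() if len(s) >= size_cutoff)
    if not large:
        return []
    reference = gc_content(large)
    return sorted(
        name
        for name, seq in contigs.items()
        if len(seq) < size_cutoff
        and abs(gc_content(seq) - reference) > max_gc_delta
    )


# Hypothetical toy data: one large contig and two small ones.
contigs = {
    "big": "GC" * 400 + "AT" * 200,  # 1200 bases, two thirds GC
    "small_ok": "GCGCAT" * 10,       # 60 bases, two thirds GC
    "small_odd": "AT" * 50,          # 100 bases, no GC at all
}
flagged = flag_small_contigs(contigs)
print(flagged)
```

<para>
Contigs flagged this way are only candidates for contamination; a BLAST
search as described above is still needed to find out where they come from.
</para>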

912

<para>

913

</para>

914

</sect1>

915

</chapter>

916
