<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.docbook.org/xml/4.5/docbookx.dtd">
<chapter id="chap_454">
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="versionfile"/>
<firstname>Bastien</firstname>
<surname>Chevreux</surname>
<email>bach@chevreux.org</email>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="copyrightfile"/>
<attribution>Solomon Short</attribution>
<emphasis><quote>Upset causes changes. Change causes upset.</quote></emphasis>
<title>Assembly of 454 data with MIRA3</title>
<sect1 id="sect_454_introduction">
MIRA can assemble 454 type data either on its own or together with any
other technology MIRA knows how to handle (Illumina, Sanger,
etc.). Paired-end sequences coming from genomic projects can also be
used if you take care to prepare your data the way MIRA needs it.

MIRA goes a long way to assemble sequence in the best possible way: it
uses multiple passes, learning in each pass from errors that occurred in
the previous passes. There are routines specialised in handling oddities
that occur in different sequencing technologies.

<title>Tip</title> Use the MIRA version of
the <command>sff_extract</command> script, which is provided as a
download in the MIRA 3rd party software package. This script knows
about adaptor information and all the little important details when
extracting data from SFF into FASTQ (or FASTA) format.
<sect2 id="sect_454_some_reading_requirements">
<title>Some reading requirements</title>
This guide assumes that you have basic working knowledge of Unix
systems, know the basic principles of sequencing (and sequence
assembly) and what assemblers do.

While there are step-by-step walkthroughs on how to set up your 454
data and then perform an assembly, this guide expects you to read at

the <emphasis>"Caveats when using 454 data"</emphasis> section of
this document (just below). <emphasis role="bold">This. Is. Important. Read. It!</emphasis>

the <emphasis>mira_usage</emphasis> introductory help file so that
you have a basic knowledge of how to set up projects in MIRA for
Sanger sequencing projects.

the <emphasis>GS FLX Data Processing Software Manual</emphasis>
from Roche Diagnostics (or the corresponding manual for the GS20
or Titanium instruments).

and last but not least the <emphasis>mira_reference</emphasis>
help file to look up some command line options.
<sect2 id="sect_454_playing_around_with_some_demo_data">
<title>Playing around with some demo data</title>
If you want to jump into action, I suggest you walk through the
"Walkthrough: combined unpaired and paired-end assembly of Brucella
ceti" section of this document to get a feeling for how things
work. That particular walkthrough is with paired and unpaired 454 data
from the NCBI short read archive, so be prepared to download a couple

But please do not forget to come back to the "Caveats" section just
below later: it contains pointers to common traps lurking in the
depths of high throughput sequencing.
<sect2 id="sect_454_estimating_memory_needs">
<title>Estimating memory needs</title>
<emphasis>"Do I have enough memory?"</emphasis> used to be one of the
most frequently asked questions. To answer it,
please use miramem, which will give you an estimate. Basically, you
just need to start the program and answer the questions. For more
information, please refer to the corresponding section in the main MIRA

Take this estimate with a grain of salt: depending on the properties
of the sequences, the estimate can be off by +/- 30%.

Take these estimates with an even larger grain of salt for
eukaryotes. Some of them are incredibly repetitive, and this
currently leads to the explosion of some secondary tables in MIRA. I'm
<sect1 id="sect_454_caveats_when_using_454_data">
<title>Caveats when using 454 data</title>
Please take some time to read this section. If you're really eager to
jump into action, then feel free to skip forward to the walkthrough, but
make sure to come back later.
<sect2 id="sect_454_screen_your_sequences!_part_1">
<title>Screen. Your. Sequences! (part 1)</title>
Or at least use the vector clipping info provided in the SFF file and
have it put into the standard NCBI TRACEINFO XML format. Yes, that's
right: vector clipping info.

Here's the short story: 454 reads can contain a kind of vector
sequence. To be more precise, they can - and very often do - contain
the sequence of the (A or B)-adaptors that were used for sequencing.

To quote a competent bioinformatician who thankfully dug through quite
a lot of data and patent filings to find out what is going on: "These
adaptors consist of a PCR primer, a sequencing primer and a key. The
B-adaptor is always in because it's needed for the emPCR and
sequencing. If the fragments are long enough, then one usually does
not reach the adaptor at all. But if the fragments are too short -

Basically it's tough luck for a lot of 454 sequencing
projects I have seen so far, both for public data (sequences available
at the NCBI trace archive) and non-public data.
<sect2 id="sect_454_screen_your_sequences!_optional_part_2">
<title>Screen. Your. Sequences! (optional part 2)</title>
Some labs use specially designed tags for their sequencing (I've heard
of cases with up to 20 bases). Because the tag sequences are always
identical, they will behave like vector sequences in an assembly. As
with any other assembler: if you happen to get such a project, then you
must take care that those tags are filtered out or masked
from your sequences before going into an assembly. If you don't, the
results will be messy at best.

<title>Tip</title> Put your FASTAs through SSAHA2 or, better, SMALT
with the sequence of your tags as masking target. MIRA can read the
SSAHA2 output (or SMALT when using "-f ssaha" output) and mask
internally using the MIRA <arg>-CL:msvs*</arg> parameters.
<sect2 id="sect_454_to_right_clip_or_not_to_right_clip?">
<title>To right clip or not to right clip?</title>
Sequences coming from the GS20, FLX or Titanium usually have pretty
good clip points set by the Roche/454 preprocessing software. There
is, however, a tendency to overestimate the quality towards the end of
the sequences and declare sequence parts as 'good' which really

Sometimes, these bad parts toward the end of sequences are so
annoyingly bad that they prevent MIRA from correctly building contigs,
that is, instead of one contig you might get two.

MIRA has the <arg>-CL:pec</arg> clipping option to deal with these
annoyances (standard for all <literal>--job=genome</literal>
assemblies). This algorithm performs <emphasis>proposed end
clipping</emphasis>, which will guarantee that the ends of reads are
clean when the coverage of a project is high enough.

For genomic sequences: the term 'enough' being somewhat fuzzy
... everything above a coverage of 15x should be no problem at all,
coverages above 10x should also be fine. Things start to get tricky
below 10x, but give it a try. Below 6x however, switch off
the <arg>-CL:pec</arg> option.
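A quick way to see where a project falls relative to these thresholds is to divide the number of sequenced bases by the expected genome size. The following shell function is only a sketch and not part of MIRA: the name <literal>estimate_coverage</literal> is made up for this example, the genome size is your own estimate, and the FASTQ file is assumed to use exactly four lines per read. Because it also counts bases that will later be clipped, the result is an upper bound.

```shell
# Rough coverage estimate: total bases in a FASTQ file divided by the
# expected genome size. Hypothetical helper, not part of MIRA.
# Assumes a 4-line-per-record FASTQ (no wrapped sequence lines).
estimate_coverage() {
    fastq="$1"        # FASTQ file, e.g. bchoc.fastq
    genome_size="$2"  # expected genome size in bases (your own estimate)
    awk -v gs="$genome_size" '
        NR % 4 == 2 { total += length($0) }   # the sequence is line 2 of each record
        END { printf "%.1f\n", total / gs }
    ' "$fastq"
}
```

For example, <literal>estimate_coverage bchoc.fastq 4000000</literal> prints the estimate for an assumed genome size of 4 Mb.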
<sect2 id="sect_454_left_clipping_wrongly_preprocessed_data">
<title>Left clipping wrongly preprocessed data</title>
Short intro, to be expanded (see the example in the B. ceti walkthrough).
<sect1 id="sect_454_walkthrough_a_454_assembly_unpaired_reads">
<title>Walkthrough: a 454 assembly with unpaired reads</title>
<sect2 id="sect_454_preparing_the_454_data_for_mira">
<title>Preparing the 454 data for MIRA</title>
The basic data type you will get from the sequencing instruments will
be SFF files. Those files contain almost all information needed for an
assembly, but they need to be converted into more standard files
before MIRA can use this information.

Let's assume we just sequenced a bug (<emphasis>Bacillus
chocorafoliensis</emphasis>) and internally our department uses the
short <emphasis>bchoc</emphasis> mnemonic for our
project/organism/whatever. So, whenever you
see <emphasis>bchoc</emphasis> in the following text, you can replace
it by whatever name suits you.

For this example, we will assume that you have created a directory
<filename>myProject</filename> for the data of your project and that
the SFF files are in there. Doing a <literal>ls -lR</literal> should
give you something like this:

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -lR</userinput>
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff</screen>

As you can see, this sequencing project has 3 <filename>SFF</filename>

We'll use <command>sff_extract</command>:

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>sff_extract -o bchoc EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff</userinput></screen>
For more information on how to use <command>sff_extract</command>,
please refer to the chapter in the NCBI Trace and Short Read archive.

This can take some time: the 1.2 million FLX reads from this
example need approximately 9 minutes for conversion. Your directory
should now look something like this:

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc.fastq
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc.xml
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff</screen>

At this point, the SFFs are not needed anymore. You can remove them
from this directory if you want.
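Before deleting the SFFs, it can be reassuring to check that the FASTQ file holds the number of reads you expect. The following is a minimal sketch, assuming the FASTQ file uses exactly four lines per read; the function name <literal>count_fastq_reads</literal> is made up for this example.

```shell
# Count the reads in a FASTQ file, assuming 4 lines per record.
# Hypothetical helper; fails if the line count is not a multiple of 4.
count_fastq_reads() {
    awk 'END {
        if (NR % 4 != 0) {
            print "line count is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        print NR / 4
    }' "$1"
}
```

Calling <literal>count_fastq_reads bchoc.fastq</literal> prints the read count, which you can compare with the number of reads you expected from the sequencing run.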
<sect2 id="sect_454_writing_a_manifest">
The manifest is a configuration file for an assembly: it controls what
type of assembly you want to do and which data should go into the
assembly. For this first example, we just need a very simple manifest:

<screen>
# A manifest file can contain comment lines, these start with the #-character

# First part of a manifest: defining some basic things

# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)

<userinput>project = <replaceable>MyFirstAssembly</replaceable>
job = <replaceable>genome,denovo,accurate</replaceable>
parameters = <replaceable>-GE:not=4</replaceable></userinput>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups", for more information
# please consult the MIRA manual, chapter "Reference"

<userinput>readgroup = <replaceable>SomeUnpaired454ReadsIGotFromTheLab</replaceable>
technology = <replaceable>454</replaceable>
data = <replaceable>bchoc.fastq</replaceable> <replaceable>bchoc.xml</replaceable></userinput></screen>

Save the above lines into a file; we'll use
<filename>bchoc_1st_manifest.conf</filename> in this example.
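If you prefer to create the file from the shell rather than in an editor, a here-document works. This is just a convenience sketch: the function name <literal>write_bchoc_manifest</literal> is made up for this example, and the content is the manifest from this section without its comment lines.

```shell
# Write the simple 454-only manifest of this walkthrough to the file
# given as first argument. Hypothetical helper, not part of MIRA.
write_bchoc_manifest() {
    cat > "$1" <<'EOF'
project = MyFirstAssembly
job = genome,denovo,accurate
parameters = -GE:not=4

readgroup = SomeUnpaired454ReadsIGotFromTheLab
technology = 454
data = bchoc.fastq bchoc.xml
EOF
}
```

Running <literal>write_bchoc_manifest bchoc_1st_manifest.conf</literal> in the project directory creates the file; adapt the project name, thread count and data files to your own project.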
<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc.fastq
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc.xml
-rw-r--r-- 1 bach users 756 2011-11-05 17:57 bchoc_1st_manifest.conf</screen>
<sect2 id="sect_454_starting_the_assembly">
<title>Starting the assembly</title>
Starting the assembly is now just a matter of one line:

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>mira <replaceable>bchoc_1st_manifest.conf</replaceable> &gt;&amp;log_assembly.txt</userinput></screen>

Now, that was easy, wasn't it? In the above example - for assemblies
having only 454 data and if you followed the walkthrough on how to
prepare the data - all you might want to adapt at first
are the following lines in the manifest file:

project= (for naming your assembly project)

job= (perhaps to change the quality of the assembly to 'draft')

parameters= -GE:not=xxx (perhaps to change the number of processors)

Of course, you are free to change any option via the extended
parameters, but this is covered in the MIRA main reference manual.
<sect1 id="sect_454_walkthrough_a_sanger_454_hybrid_assembly">
<title>Walkthrough: a paired-end Sanger / unpaired 454 hybrid assembly</title>
Preparing the data for a Sanger / 454 hybrid assembly takes some more steps
but is not really more complicated than a normal Sanger-only or 454-only

In the following sections, the files with 454 input data will have
<filename>.454.</filename> in the name, files with Sanger data have
<filename>.sanger.</filename>. That's just a convention I use; you do
not need to do that, but it helps to keep things nicely organised.
<sect2 id="sect_454_preparing_the_454_data">
<title>Preparing the 454 data</title>
Please proceed exactly in the same way as described for the assembly
of unpaired 454-only data in the section above, but stop before
writing a manifest and starting the actual assembly. The only difference: in the <command>sff_extract</command> part, use "-o" with the parameter "bchoc.454" to get the files named accordingly.

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>sff_extract -o bchoc.454 EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff</userinput></screen>

In the end you should have two files (FASTQ and TRACEINFO) for the 454
<sect2 id="sect_454_preparing_the_sanger_data">
<title>Preparing the Sanger data</title>
There are quite a number of sequencing providers out there, all with
different pre-processing pipelines and different output
file types. MIRA supports quite a number of them; the most
important are probably

(preferred option) FASTQ files and ancillary data in NCBI
TRACEINFO XML format.

(preferred option) FASTA files which are coupled with FASTA quality
files and ancillary data in NCBI TRACEINFO XML format.

(preferred option) CAF (from the Sanger Institute) files that
contain the sequence, quality values and ancillary data like

(secondary option) EXP files as the Staden pregap4 package writes.

Your sequencing provider MUST have performed at least a sequencing
vector clip on this data. A quality clip might also be good to do by
the provider as they usually know best what quality they can expect
from their instruments (although MIRA can do this also if you want).

You can either perform clipping the hard way by physically removing
all clipped bases from the input (this is
called <emphasis>trimming</emphasis>), or you can keep the clipped
bases in the input file and provide clipping information in ancillary
data files. This clipping information MUST then be present in the
ancillary data (either the TRACEINFO XML, or in the combined CAF, or
in the EXP files), together with other standard data like, e.g.,
mate-pair information when using a paired-end approach.

This example assumes that the data is provided as FASTQ together with
ancillary data in NCBI TRACEINFO XML format.

Put these files (appropriately renamed) into the directory with the

Here's how the directory with the preprocessed data should now look

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>ls -l</userinput>
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.454.fastq
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.454.xml
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.sanger.fastq
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc.sanger.xml</screen>
<sect2 id="sect_454_manifest_for_hybrid_assembly">
This assembly contains unpaired 454 data and paired-end Sanger
data. Let's assume the 454 data to be exactly the same as for the
previous walkthrough. For the Sanger data, let's assume the template
DNA size for the Sanger library to be between 2500 and 3500 bases and
the read naming to follow the TIGR naming scheme:

<screen>
# A manifest file can contain comment lines, these start with the #-character

# First part of a manifest: defining some basic things

# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de-novo in accurate mode
# As special parameter, we want to use 4 threads in parallel (where possible)

<userinput>project = <replaceable>MyFirstHybridAssembly</replaceable>
job = <replaceable>genome,denovo,accurate</replaceable>
parameters = <replaceable>-GE:not=4</replaceable></userinput>

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups", for more information
# please consult the MIRA manual, chapter "Reference"

<userinput>readgroup = <replaceable>SomeUnpaired454ReadsIGotFromTheLab</replaceable>
technology = <replaceable>454</replaceable>
data = <replaceable>bchoc.454.*</replaceable></userinput>

# Note the wildcard "bchoc.454.*" in the data line above: this
# will load both the FASTQ and XML data

<userinput>readgroup = <replaceable>SomePairedSangerReadsIGotFromTheLab</replaceable>
technology = <replaceable>sanger</replaceable>
template_size = <replaceable>2500 3500</replaceable>
segment_placement = <replaceable>---&gt; &lt;---</replaceable>
segment_naming = <replaceable>TIGR</replaceable>
data = <replaceable>bchoc.sanger.*</replaceable></userinput></screen>

If you compare the manifest above with the manifest in the walkthrough
for using only unpaired 454 data, you will see that large parts, i.e.,
the definition of the job, the parameters and the 454 read group, are
<emphasis>exactly</emphasis> the same. The only differences are in the
naming of the assembly project (in <literal>project =</literal>), and
the definition of a second readgroup containing the Sanger sequencing
<sect2 id="sect_454_starting_the_hybrid_assembly">
<title>Starting the hybrid assembly</title>
Quite unsurprisingly, the command to start the assembly is exactly the same as always:

<screen>
<prompt>arcadia:/path/to/myProject$</prompt> <userinput>mira <replaceable>myassembly_manifest.conf</replaceable> &gt;&amp;log_assembly.txt</userinput></screen>
<sect1 id="sect_454_walkthrough:_combined_unpaired_and_pairedend_assembly_of_brucella_ceti">
<title>Walkthrough: combined unpaired and paired-end assembly of Brucella ceti</title>
Here's a walkthrough which should help you in setting up your own assemblies. You
do not need to set up your directory structures as I do, but for this
walkthrough it could help.

This walkthrough was written at a time when the NCBI still offered SFFs
for 454 data, which it no longer does. However, the approach is
still valid for your own data, for which you should get SFFs.

This walkthrough was written at a time when the primary input for 454
data in MIRA was FASTA + FASTA quality files. This has shifted
nowadays to FASTQ as input (it's more compact and faster to parse). I'm
sure you will be able to make the necessary changes to the command line
of <command>sff_extract</command> yourself :-)
<sect2 id="sect_454_preliminaries">
Please make sure that sff_extract is working properly and that you have
at least version 0.2.1 (use <literal>sff_extract -v</literal>). Please also make sure
that SSAHA2 can be run correctly (test this by running <literal>ssaha2 -v</literal>).
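Comparing dotted version numbers by eye is easy to get wrong (0.2.10 is newer than 0.2.1), so a small helper can do it. This is only a sketch: the function name <literal>version_at_least</literal> is made up for this example, and you pass in the version number yourself after reading it from the output of <literal>sff_extract -v</literal>, whose exact format is not assumed here.

```shell
# Succeed (exit status 0) if dotted version HAVE is at least WANT.
# Hypothetical helper: components are compared numerically, left to
# right; missing components count as 0.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        nh = split(have, h, ".")
        nw = split(want, w, ".")
        n = (nh > nw) ? nh : nw
        for (i = 1; i <= n; i++) {
            a = h[i] + 0
            b = w[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}
```

For example, <literal>version_at_least 0.2.8 0.2.1 &amp;&amp; echo "sff_extract is recent enough"</literal>, with 0.2.8 standing in for whatever version your installation reports.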
<sect2 id="sect_454_preparing_your_filesystem">
<title>Preparing your file system</title>
Note: this is how I set up a project, feel free to implement whatever
structure suits your needs.

<screen>
<prompt>$</prompt> <userinput>mkdir bceti_assembly</userinput>
<prompt>$</prompt> <userinput>cd bceti_assembly</userinput>
<prompt>bceti_assembly$</prompt> <userinput>mkdir origdata data assemblies</userinput></screen>

Your directory should now look like this:

<screen>
<prompt>arcadia:bceti_assembly$</prompt> <userinput>ls -l</userinput>
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 assemblies
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 data
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 origdata</screen>

Explanation of the structure:

the <filename>origdata</filename> directory will contain the 'raw'
result files that one might get from sequencing.

the <filename>data</filename> directory will contain the
preprocessed sequences for the assembly, ready to be used by MIRA.

the <filename>assemblies</filename> directory will contain
assemblies we make with our data (we might want to make more than
<sect2 id="sect_454_getting_the_data">
Since early summer 2009, the NCBI no longer offers SFF files,
which is a pity. This guide will nevertheless allow you to perform
similar assemblies on your own data.

<ulink url="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005481&amp;cmd=viewer&amp;m=data&amp;s=viewer"/>
<ulink url="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005482&amp;cmd=viewer&amp;m=data&amp;s=viewer"/>
and download the SFF files to the <filename>origdata</filename>
directory (press the download button on those pages).

En passant, note the following: SRR005481 is described as a 454 FLX
data set where the library contains unpaired data ("Library Layout:
SINGLE"). SRR005482 also has 454 FLX data, but this time it's
paired-end data ("Library Layout: PAIRED
(ORIENTATION=forward)"). Knowing this will be important later on in

<screen>
<prompt>arcadia:bceti_assembly$</prompt> <userinput>cd origdata</userinput>
<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 240204619 2008-11-08 16:49 SRR005481.sff.gz
-rw-r--r-- 1 bach users 211333635 2008-11-08 16:55 SRR005482.sff.gz</screen>

We need to unzip those files:

<screen>
<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>gunzip *.gz</userinput></screen>

And now this directory should look like this:

<screen>
<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff
-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff</screen>

Now move into the (still empty) <filename>data</filename> directory:

<screen>
<prompt>arcadia:bceti_assembly/origdata$</prompt> <userinput>cd ../data</userinput></screen>
<sect2 id="sect_454_data_preprocessing_with_sff_extract">
<title>Data preprocessing with sff_extract</title>
<sect3 id="sect_454_extracting_unpaired_data_from_sff">
<title>Extracting unpaired data from SFF</title>
We will first extract the data from the unpaired experiment
(SRR005481); the generated file names should all start
with <emphasis>bceti</emphasis>:

<screen>
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>sff_extract -o bceti ../origdata/SRR005481.sff</userinput>
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.

********************************************************************************
WARNING: weird sequences in file ../origdata/SRR005481.sff

After applying left clips, 307639 sequences (=99%) start with these bases:

This does not look sane.

Countermeasures you *probably* must take:
1) Make your sequence provider aware of that problem and ask whether this can be
corrected in the SFF.
2) If you decide that this is not normal and your sequence provider does not
react, use the --min_left_clip of sff_extract.
(Probably '--min_left_clip=13' but you should cross-check that)
********************************************************************************</screen>

(Note: I got this on the SRR005481 data set downloaded in October
2008. In the meantime, the sequencing center or NCBI may have

Wait a minute ... what happened here?

We launched a pretty standard extraction of reads in which the whole
sequences are extracted and saved in the FASTA and FASTA
quality files, while the clipping information is written to the
XML file. Additionally, the clipped parts of every read are shown in
lower case in the FASTA file.

After two or three minutes, the directory looked like this:

<screen>
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 91863124 2008-11-08 17:15 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:15 bceti.fasta.qual
-rw-r--r-- 1 bach users 52197816 2008-11-08 17:15 bceti.xml</screen>
<sect3 id="sect_454_dealing_with_wrong_clipoffs_in_the_sff">
<title>Dealing with wrong clip-offs in the SFF</title>
In the example above, sff_extract discovered an unusual sequence
pattern and gave a (stern) warning: almost all the sequences
created for the FASTA file had a skew in the distribution of bases.

Let's have a look at the first 30 bases of the first 20 sequences of
the FASTA that was created:

<screen>
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>head -40 bceti.fasta | grep -v ">" | cut -c 1-30</userinput>
tcagTCTCCGTCGCAATCGCCGCCCCCACA
tcagTCTCCGTCGGCGCTGCCCGCCCGATA
tcagTCTCCGTCGTGGAGGATTACTGGGCG
tcagTCTCCGTCGGCTGTCTGGATCATGAT
tcagTCTCCGTCCTCGCGTTCGATGGTGAC
tcagTCTCCGTCCATCTGTCGGGAACGGAT
tcagTCTCCGTCCGAGCTTCCGATGGCACA
tcagTCTCCGTCAGCCTTTAATGCCGCCGA
tcagTCTCCGTCCTCGAAACCAAGAGCGTG
tcagTCTCCGTCGCAGGCGTTGGCGCGGCG
tcagTCTCCGTCTCAAACAAAGGATTAGAG
tcagTCTCCGTCCTCACCCTGACGGTCGGC
tcagTCTCCGTCTTGTGCGGTTCGATCCGG
tcagTCTCCGTCTGCGGACGGGTATCGCGG
tcagTCTCCGTCTCGTTATGCGCTCGCCAG
tcagTCTCCGTCTCGCATTTTCCAACGCAA
tcagTCTCCGTCCGCTCATATCCTTGTTGA
tcagTCTCCGTCCTGTGCTGGGAAAGCGAA
tcagTCTCCGTCTCGAGCCGGGACAGGCGA
tcagTCTCCGTCGTCGTATCGGGTACGAAC</screen>

What you see is the following: the leftmost 4
characters <literal>tcag</literal> of every read are the last bases
of the standard 454 sequencing adaptor A. The fact that they are
given in lower case means that they are clipped away in the SFF

However, if you look closely, you will see that there is something
peculiar: after the adaptor sequence, all reads seem to start with
exactly the same sequence <literal>TCTCCGTC</literal>. This is *not*

This means that the left clip of the reads in the SFF has not been
set correctly. The reason for this is probably a wrong value which
was used in the 454 data processing pipeline. This seems to be a
problem especially when custom sequencing adaptors are used.

In this case, the result is pretty catastrophic: out of the 311201
reads in the SFF, 307639 (98.85%) show this behaviour. We will
certainly need to get rid of these first 12 bases.
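The share of affected reads can also be counted directly from the FASTA file. The following is a minimal sketch, assuming one sequence line per read (as in the output of the <literal>head</literal> command above) and lower-case letters marking the clipped bases; the function name <literal>count_prefix_after_clip</literal> is made up for this example.

```shell
# Count how many reads in a FASTA file start with PREFIX once the
# clipped (lower-case) bases at the start of the read are removed.
# Hypothetical helper; assumes unwrapped sequences (one line per read).
count_prefix_after_clip() {
    fasta="$1"
    prefix="$2"
    awk -v p="$prefix" '
        /^>/ { next }                    # skip FASTA header lines
        {
            n++
            sub(/^[acgtn]+/, "")         # drop the clipped, lower-case start
            if (index($0, p) == 1) hits++
        }
        END { printf "%d of %d reads (%.2f%%)\n", hits, n, (n ? 100 * hits / n : 0) }
    ' "$fasta"
}
```

Calling <literal>count_prefix_after_clip bceti.fasta TCTCCGTC</literal> reports how many reads carry the suspicious prefix; the number of bases to clip is then the length of the lower-case start plus the length of the prefix.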
Now, in cases like these, there are three steps that you really

Is this something that you expect from the experimental setup?
If yes, then all is OK and you don't need to take further
action. But I suppose that for 99% of all people, these abnormal
sequences are not expected.

Contact. Your. Sequence. Provider! The underlying problem is
something that *MUST* be resolved on their side, not on
yours. It might be a simple human mistake, but it might very
well be a symptom of a deeper problem in their quality
assurance. Notify. Them. Now!

In the meantime (or if the sequencing provider does not react),
you can use the <arg>--min_left_clip</arg> command line option
of sff_extract as suggested in the warning message.

So, to correct for this error, we will redo the extraction of the
sequence from the SFF, this time telling sff_extract to set the left
clip starting at base 13 at the lowest:

<screen>
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>sff_extract -o bceti --min_left_clip=13 ../origdata/SRR005481.sff</userinput>
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>ls -l</userinput>
-rw-r--r-- 1 bach users 91863124 2008-11-08 17:31 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:31 bceti.fasta.qual
-rw-r--r-- 1 bach users 52509017 2008-11-08 17:31 bceti.xml</screen>

This concludes the small intermezzo on how to deal with wrong left
<sect2 id="sect_454_preparing_an_assembly">
<title>Preparing an assembly</title>
Preparing an assembly is now just a matter of setting up a directory and
linking the input files into that directory.

<screen>
<prompt>arcadia:bceti_assembly/data$</prompt> <userinput>cd ../assemblies/</userinput>
<prompt>arcadia:bceti_assembly/assemblies$</prompt> <userinput>mkdir arun_08112008</userinput>
<prompt>arcadia:bceti_assembly/assemblies$</prompt> <userinput>cd arun_08112008</userinput>
<prompt>arcadia:bceti_assembly/assemblies/arun_08112008$</prompt> <userinput>ln -s ../../data/* .</userinput>
<prompt>arcadia:bceti_assembly/assemblies/arun_08112008$</prompt> <userinput>ls -l</userinput>
lrwxrwxrwx 1 bach users 29 2008-11-08 18:17 bceti.454.fasta -> ../../data/bceti.454.fasta
lrwxrwxrwx 1 bach users 34 2008-11-08 18:17 bceti.454.fasta.qual -> ../../data/bceti.454.fasta.qual
lrwxrwxrwx 1 bach users 33 2008-11-08 18:17 bceti.454.xml -> ../../data/bceti.454.xml</screen>
<sect2 id="sect_454_starting_the_assembly_2">
<title>Starting the assembly 2</title>
Write a manifest for this data set as described in the walkthroughs above, then start the assembly, for example like this:

<screen>
<prompt>$</prompt> <userinput>mira <replaceable>bceti_manifest.conf</replaceable> &gt;&amp;log_assembly</userinput></screen>
<sect1 id="sect_454_what_to_do_with_the_mira_result_files?">
<title>What to do with the MIRA result files?</title>
Please consult the corresponding section in the
<emphasis>mira_usage</emphasis> document; it
contains much more information than this stub.

But basically, after the assembly has finished, you will find four
directories. The <filename>tmp</filename> directory can be deleted
without remorse as it contains logs and a tremendous amount of
temporary data (dozens of gigabytes for bigger
projects). The <filename>info</filename> directory has some text files
with basic statistics and other informative files. Start by having a
look at the <filename>*_info_assembly.txt</filename>, it'll give you a
first idea of how the assembly went.

The <filename>results</filename> directory finally contains the assembly
files in different formats, ready to be used for further processing with

If you used the uniform read distribution option, you will inevitably
need to filter your results as this option produces larger and better
alignments, but also more "debris contigs". For this, use
miraconvert, which is distributed together with the MIRA package.

Also very important when analysing 454 assemblies: screen the small
contigs (&lt; 1000 bases) for abnormal behaviour. You wouldn't be the
first to have some human DNA contamination in a bacterial sequencing project. Or
some herpes virus sequence in a bacterial project. Or some bacterial DNA
in a human data set. Look whether these small contigs

have a different GC content than the large contigs

whether a BLAST of these sequences against some selected databases
brings up hits in other organisms that you certainly were not