~ubuntu-branches/ubuntu/trusty/clustalx/trusty

Committer: Bazaar Package Importer
Author(s): Charles Plessy, Steffen Moeller, Charles Plessy
Date: 2009-10-21 13:25:44 UTC
mfrom: (1.1.1 upstream)
Revision ID: james.westby@ubuntu.com-20091021132544-r4hbcnjxp354wxh0

Tags: 2.0.12-1

http://bugs.debian.org/550893

* New upstream release (LP: #423648, #393769):
  - Uses Qt instead of lesstif.
  - Includes new code for UPGMA guide trees.
  - Includes iterative alignment facility.

[ Steffen Moeller ]
* New upstream release.
* Updated watch file (Closes: #550893).
* Removed LICENSE from debian/clustalx.docs
* rename to clustalx seems no longer required in debian/rules
* moved clustalx.1 into debian folder (eases working with svn-buildpackage)
* added German translation to desktop file

[ Charles Plessy ]
* Updated my email address.
* debian/copyright made machine-readable.
* Added various information to debian/upstream-metadata.yaml.
* Switched to Debhelper 7.
  (debian/rules, debian/control, debian/patches, debian/compat)
* Removed useless Debhelper file debian/clustalx.dirs.
* Updated package description.
* Hardcoded the location of accessory files to /usr/share/clustalx.
  (debian/patches/hardcode-accessory-file-locations.patch)
* Documented in debian/README.source that the documentation for quilt
  is in /usr/share/doc/quilt.
* Added upstream changelog downloaded from upstream website
  (debian/rules, debian/CHANGELOG.upstream).
* Incremented Standards-Version to reflect conformance with Policy 3.8.3
  (debian/control, no other changes needed).
* Updated homepage in debian/clustalw.1.

files added:
AlignOutputFileNames.cpp

AlignOutputFileNames.h

AlignmentFormatOptions.cpp

AlignmentFormatOptions.h

AlignmentParameters.cpp

AlignmentParameters.h

AlignmentViewerWidget.cpp

AlignmentViewerWidget.h

AlignmentWidget.cpp

AlignmentWidget.h

BootstrapTreeDialog.cpp

BootstrapTreeDialog.h

ClustalQtParams.h

ColorFileXmlParser.cpp

ColorFileXmlParser.h

ColorParameters.cpp

ColorParameters.h

ColumnScoreParams.cpp

ColumnScoreParams.h

FileDialog.cpp

FileDialog.h

HardCodedColorScheme.cpp

HardCodedColorScheme.h

HelpDisplayWidget.cpp

HelpDisplayWidget.h

HistogramWidget.cpp

HistogramWidget.h

KeyController.cpp

KeyController.h

LICENSE

LowScoringSegParams.cpp

LowScoringSegParams.h

PSPrinter.cpp

PSPrinter.h

PairwiseParams.cpp

PairwiseParams.h

PostscriptFileParams.cpp

PostscriptFileParams.h

ProteinGapParameters.cpp

ProteinGapParameters.h

QTUtility.cpp

QTUtility.h

README

Resources.cpp

Resources.h

SaveSeqFile.cpp

SaveSeqFile.h

SearchForString.cpp

SearchForString.h

SecStructOptions.cpp

SecStructOptions.h

SeqNameWidget.cpp

SeqNameWidget.h

TreeFormatOptions.cpp

TreeFormatOptions.h

TreeOutputFileNames.cpp

TreeOutputFileNames.h

WritePostscriptFile.cpp

WritePostscriptFile.h

clustalW

clustalW/Clustal.cpp

clustalW/Clustal.h

clustalW/Help.cpp

clustalW/Help.h

clustalW/alignment

clustalW/alignment/Alignment.cpp

clustalW/alignment/Alignment.h

clustalW/alignment/AlignmentOutput.cpp

clustalW/alignment/AlignmentOutput.h

clustalW/alignment/ObjectiveScore.cpp

clustalW/alignment/ObjectiveScore.h

clustalW/alignment/Sequence.cpp

clustalW/alignment/Sequence.h

clustalW/clustalw_version.h

clustalW/fileInput

clustalW/fileInput/ClustalFileParser.cpp

clustalW/fileInput/ClustalFileParser.h

clustalW/fileInput/EMBLFileParser.cpp

clustalW/fileInput/EMBLFileParser.h

clustalW/fileInput/FileParser.cpp

clustalW/fileInput/FileParser.h

clustalW/fileInput/FileReader.cpp

clustalW/fileInput/FileReader.h

clustalW/fileInput/GDEFileParser.cpp

clustalW/fileInput/GDEFileParser.h

clustalW/fileInput/InFileStream.cpp

clustalW/fileInput/InFileStream.h

clustalW/fileInput/MSFFileParser.cpp

clustalW/fileInput/MSFFileParser.h

clustalW/fileInput/PIRFileParser.cpp

clustalW/fileInput/PIRFileParser.h

clustalW/fileInput/PearsonFileParser.cpp

clustalW/fileInput/PearsonFileParser.h

clustalW/fileInput/RSFFileParser.cpp

clustalW/fileInput/RSFFileParser.h

clustalW/general

clustalW/general/Array2D.h

clustalW/general/ClustalWResources.cpp

clustalW/general/ClustalWResources.h

clustalW/general/DebugLog.cpp

clustalW/general/DebugLog.h

clustalW/general/InvalidCombination.cpp

clustalW/general/OutputFile.cpp

clustalW/general/OutputFile.h

clustalW/general/RandomAccessLList.h

clustalW/general/SequenceNotFoundException.h

clustalW/general/SquareMat.h

clustalW/general/Stats.cpp

clustalW/general/Stats.h

clustalW/general/SymMatrix.cpp

clustalW/general/SymMatrix.h

clustalW/general/UserParameters.cpp

clustalW/general/UserParameters.h

clustalW/general/Utility.cpp

clustalW/general/Utility.h

clustalW/general/VectorOutOfRange.cpp

clustalW/general/VectorOutOfRange.h

clustalW/general/VectorUtility.h

clustalW/general/clustalw.h

clustalW/general/debuglogObject.h

clustalW/general/general.h

clustalW/general/param.h

clustalW/general/statsObject.h

clustalW/general/userparams.h

clustalW/general/utils.h

clustalW/interface

clustalW/interface/CommandLineParser.cpp

clustalW/interface/CommandLineParser.h

clustalW/interface/InteractiveMenu.cpp

clustalW/interface/InteractiveMenu.h

clustalW/multipleAlign

clustalW/multipleAlign/Iteration.cpp

clustalW/multipleAlign/Iteration.h

clustalW/multipleAlign/LowScoreSegProfile.cpp

clustalW/multipleAlign/LowScoreSegProfile.h

clustalW/multipleAlign/MSA.cpp

clustalW/multipleAlign/MSA.h

clustalW/multipleAlign/MyersMillerProfileAlign.cpp

clustalW/multipleAlign/MyersMillerProfileAlign.h

clustalW/multipleAlign/ProfileAlignAlgorithm.h

clustalW/multipleAlign/ProfileBase.cpp

clustalW/multipleAlign/ProfileBase.h

clustalW/multipleAlign/ProfileStandard.cpp

clustalW/multipleAlign/ProfileStandard.h

clustalW/multipleAlign/ProfileWithSub.cpp

clustalW/multipleAlign/ProfileWithSub.h

clustalW/pairwise

clustalW/pairwise/FastPairwiseAlign.cpp

clustalW/pairwise/FastPairwiseAlign.h

clustalW/pairwise/FullPairwiseAlign.cpp

clustalW/pairwise/FullPairwiseAlign.h

clustalW/pairwise/PairwiseAlignBase.h

clustalW/substitutionMatrix

clustalW/substitutionMatrix/SubMatrix.cpp

clustalW/substitutionMatrix/SubMatrix.h

clustalW/substitutionMatrix/globalmatrix.h

clustalW/substitutionMatrix/matrices.h

clustalW/tree

clustalW/tree/AlignmentSteps.cpp

clustalW/tree/AlignmentSteps.h

clustalW/tree/ClusterTree.cpp

clustalW/tree/ClusterTree.h

clustalW/tree/ClusterTreeAlgorithm.h

clustalW/tree/ClusterTreeOutput.cpp

clustalW/tree/ClusterTreeOutput.h

clustalW/tree/NJTree.cpp

clustalW/tree/NJTree.h

clustalW/tree/RandomGenerator.cpp

clustalW/tree/RandomGenerator.h

clustalW/tree/Tree.cpp

clustalW/tree/Tree.h

clustalW/tree/TreeInterface.cpp

clustalW/tree/TreeInterface.h

clustalW/tree/UPGMA

clustalW/tree/UPGMA/Node.cpp

clustalW/tree/UPGMA/Node.h

clustalW/tree/UPGMA/RootedClusterTree.cpp

clustalW/tree/UPGMA/RootedClusterTree.h

clustalW/tree/UPGMA/RootedGuideTree.cpp

clustalW/tree/UPGMA/RootedGuideTree.h

clustalW/tree/UPGMA/RootedTreeOutput.cpp

clustalW/tree/UPGMA/RootedTreeOutput.h

clustalW/tree/UPGMA/UPGMAAlgorithm.cpp

clustalW/tree/UPGMA/UPGMAAlgorithm.h

clustalW/tree/UPGMA/upgmadata.h

clustalW/tree/UnRootedClusterTree.cpp

clustalW/tree/UnRootedClusterTree.h

clustalW/tree/dayhoff.h

clustalx.hlp

clustalx.pro

coldna.xml

colprint.xml

colprot.xml

debian/CHANGELOG.upstream

debian/README.source

debian/patches/hardcode-accessory-file-locations.patch

debian/upstream-metadata.yaml

installer

main.cpp

mainwindow.cpp

mainwindow.h

moc_AlignOutputFileNames.cpp

moc_AlignmentFormatOptions.cpp

moc_AlignmentParameters.cpp

moc_AlignmentViewerWidget.cpp

moc_AlignmentWidget.cpp

moc_BootstrapTreeDialog.cpp

moc_ColorFileXmlParser.cpp

moc_ColumnScoreParams.cpp

moc_FileDialog.cpp

moc_HelpDisplayWidget.cpp

moc_HistogramWidget.cpp

moc_LowScoringSegParams.cpp

moc_PSPrinter.cpp

moc_PairwiseParams.cpp

moc_ProteinGapParameters.cpp

moc_SaveSeqFile.cpp

moc_SearchForString.cpp

moc_SecStructOptions.cpp

moc_SeqNameWidget.cpp

moc_TreeFormatOptions.cpp

moc_TreeOutputFileNames.cpp

moc_WritePostscriptFile.cpp

moc_mainwindow.cpp

usr/local

usr/local/qt

usr/local/qt/qt-x11-opensource-src-4.5.2

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/g++.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/linux.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/unix.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/default_post.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/default_pre.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/exclusive_builds.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/include_source_dir.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/lex.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/moc.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt_config.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt_functions.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/release.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/resources.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/static.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/uic.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/unix

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/unix/thread.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/warn_on.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/yacc.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/qconfig.pri

version.h

files removed:
README_W

README_X

alnscore.c

amenu.c

calcgapcoeff.c

calcprf1.c

calcprf2.c

calctree.c

clustalv.doc

clustalw.c

clustalw.doc

clustalw.h

clustalw.ms

clustalw.new

clustalw_help

clustalx.c

clustalx.html

clustalx_help

coldna.par

colprint.par

colprot.par

dayhoff.h

debian/clustalx.dirs

debian/patches/amenu.c.patch

debian/patches/clustal-help.patch

debian/patches/clustalw.h.patch

debian/patches/clustalx.html.patch

debian/patches/clustalx_help.patch

debian/patches/interface.c.patch

debian/patches/makefile.patch

debian/patches/sequence.c.patch

debian/patches/trees.c.patch

debian/patches/util.c.patch

debian/patches/xmenu.c.patch

gcgcheck.c

general.h

globin.pep

gon90.bla

interface.c

makefile

makefile.alpha

makefile.linux

makefile.sgi

makefile.sun

malign.c

matrices.h

matrixseries.gon

pairalign.c

param.h

prfalign.c

random.c

readmat.c

sequence.c

showpair.c

trees.c

util.c

xcolor.c

xdisplay.c

xmenu.c

xmenu.h

xscore.c

xutils.c

files modified:
debian/changelog

debian/clustalx.1

debian/clustalx.docs

debian/clustalx.install

debian/clustalx.menu

debian/compat

debian/control

debian/copyright

debian/patches/series

debian/rules

debian/watch


clustalw_help

This is the on-line help file for CLUSTAL W (version 1.83).

It should be named or defined as: clustalw_help

except with MSDOS in which case it should be named CLUSTALW.HLP

For full details of usage and algorithms, please read the CLUSTALW.DOC file.

Toby Gibson EMBL, Heidelberg, Germany.

Des Higgins UCC, Cork, Ireland.

Julie Thompson IGBMC, Strasbourg, France.

>>NEW <<

Fasta output

===========

Write/read sequences with a range specified. The command line syntax
for range specification is flexible. You can use any one of the
following forms:

-range=n:m

-range=n-m

-range="n m"

where n is the starting position and m is the length of the sequence.
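
The three accepted spellings can be illustrated with a small parser. This is a
minimal sketch in Python, not the program's own code: the function names
parse_range and apply_range are hypothetical, and it assumes n is a 1-based
starting position and m a length.

```python
import re

def parse_range(arg):
    """Parse -range=n:m, -range=n-m or -range="n m" into the pair (n, m)."""
    match = re.fullmatch(r'-range="?(\d+)[:\- ](\d+)"?', arg)
    if match is None:
        raise ValueError(f"unrecognised range option: {arg!r}")
    return int(match.group(1)), int(match.group(2))

def apply_range(sequence, start, length):
    """Return `length` residues beginning at 1-based position `start` (assumed)."""
    return sequence[start - 1:start - 1 + length]

if __name__ == "__main__":
    n, m = parse_range("-range=3:4")
    print(apply_range("ABCDEFGHIJ", n, m))
```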

Range and range numbers.

=======================

Include range numbers in the output.

-seqno_range=on/off

The sequence range will be appended to the names of the sequences.

PIM: Percentage Identity Matrix

===============================

>>HELP 1 << General help for CLUSTAL W (1.81)

Clustal W is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT: all sequences must be in 1 file, one after another.

7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,

Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.

All non-alphabetic characters (spaces, digits, punctuation marks) are ignored

except "-" which is used to indicate a GAP ("." in MSF-RSF).

To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to

INPUT them; go to menu item 2 to do the multiple alignment.

PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to

add a new sequence to an old alignment, or to use secondary structure to guide

the alignment process. GAPS in the old alignments are indicated using the "-"

character. PROFILES can be input in ANY of the allowed formats; just

use "-" (or "." for MSF-RSF) for each gap position.

PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in

with "-" characters to indicate gaps) OR after a multiple alignment while the

alignment is still in memory.

The program tries to automatically recognise the different file formats used

and to guess whether the sequences are amino acid or nucleotide. This is not

always foolproof.

FASTA and NBRF-PIR formats are recognised by having a ">" as the first

character in the file.

EMBL-Swiss Prot formats are recognised by the letters

ID at the start of the file (the token for the entry name field).

CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

GCG-MSF format is recognised by one of the following:

- the word PileUp at the start of the file.

- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT

at the start of the file.

- the word MSF on the first line of the file, and the characters ..

at the end of this line.

GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of

the file.

If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the

sequence will be assumed to be nucleotide. This works in 97.3% of cases

but watch out!
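
The recognition rules above can be sketched in a few lines of Python. This is
an illustration only, not the program's actual logic: the function names
guess_format and looks_like_nucleotide are hypothetical, GDE detection is left
out because the text does not describe it, and FASTA and NBRF-PIR are not
distinguished because both start with ">".

```python
def guess_format(text):
    """Guess the input format from the start of the file, following the rules above."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        return "FASTA or NBRF-PIR"
    if stripped.startswith("ID"):
        return "EMBL-SWISSPROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    msf_markers = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_markers):
        return "GCG-MSF"
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG-MSF"
    return "unknown"

def looks_like_nucleotide(sequence, threshold=0.85):
    """Apply the 85% rule: count A, C, G, T, U and N among alphabetic characters."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    hits = sum(c in "ACGTUN" for c in letters)
    return hits / len(letters) >= threshold
```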

>>HELP 2 << Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment. You will be prompted for 2 output files: 1 for the
alignment itself; another to store a dendrogram that describes the similarity
of the sequences to each other.

Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ...Do complete multiple alignments now):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the
   approximate groupings of the sequences by similarity (stored in a file).

3) the final multiple alignment is carried out, using the dendrogram as a guide.


PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.


RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
during multiple alignment if you wish to change the parameters and try again.
This only takes effect just before you do a second multiple alignment. You
can make phylogenetic trees after alignment whether or not this is ON.
If you turn this OFF, the new gaps are kept even if you do a second multiple
alignment. This allows you to iterate the alignment gradually. Sometimes, the
alignment is improved by a second or third pass.

SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
screen as well as to the output file.

You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).


OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).

>>HELP 3 << Help for pairwise alignment parameters

A distance is calculated between every pair of sequences and these are used to
construct the dendrogram which guides the final multiple alignment. The scores
are calculated from separate pairwise alignments. These can be calculated using
2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
and Lipman (extremely fast but approximate).

You can choose between the 2 alignment methods using menu option 8. The
slow-accurate method is fine for short sequences but will be VERY SLOW for
many (e.g. >100) long (e.g. >1000 residue) sequences.

SLOW-ACCURATE alignment parameters:
These parameters do not have any effect on the speed of the alignments.
They are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the
screen. The scores are converted to distances for the trees.

1) Gap Open Penalty: the penalty for opening a gap in the alignment.
2) Gap extension penalty: the penalty for extending a gap by 1 residue.
3) Protein weight matrix: the scoring table which describes the similarity
   of each amino acid to each other.
4) DNA weight matrix: the scores assigned to matches and mismatches
   (including IUB ambiguity codes).


FAST-APPROXIMATE alignment parameters:

These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.

K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may need to increase the default.

GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.

TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are
used in the alignment. This parameter specifies how many. Decrease for speed;
increase for sensitivity.

WINDOW SIZE: This is the number of diagonals around each of the 'best'
diagonals that will be used. Decrease for speed; increase for sensitivity.
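
The k-tuple and top-diagonal ideas described for the fast method can be
illustrated as follows. This is a minimal Python sketch of the counting step
only, not the Wilbur and Lipman implementation used by the program; the
function name top_diagonals and its parameters are hypothetical, and the
window and gap penalty steps are not modelled.

```python
from collections import Counter, defaultdict

def top_diagonals(seq_a, seq_b, ktuple=2, top=5):
    """Count exact k-tuple matches per diagonal and return the best diagonals.

    A diagonal is identified by the offset i - j, where i and j are the
    0-based start positions of a matching k-tuple in seq_a and seq_b.
    """
    # Index every k-tuple of the first sequence by its start position.
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktuple + 1):
        positions[seq_a[i:i + ktuple]].append(i)

    # Each exact k-tuple match adds one hit to the diagonal it lies on.
    diagonal_hits = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], ()):
            diagonal_hits[i - j] += 1

    # Keep only the diagonals with the most matches.
    return diagonal_hits.most_common(top)
```

A larger ktuple gives fewer candidate matches (faster, less sensitive), and a
smaller top keeps fewer diagonals, which mirrors the trade-offs listed above.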

>>HELP 4 << Help for multiple alignment parameters

These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.

Each step in the final multiple alignment consists of aligning two alignments
or sequences. This is done progressively, following the branching order in
the GUIDE TREE. The basic parameters to control this are two gap penalties and
the scores for various identical-non-identical residues.

1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
cost of opening up every new gap and the cost of every item in a gap.
Increasing the gap opening penalty will make gaps less frequent. Increasing
the gap extension penalty will make gaps shorter. Terminal gaps are not
penalised.

3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
distantly related sequences until after the most closely related sequences have
been aligned. The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level
to any other sequences will be aligned later.


4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
and 1; a weight of zero means that the transitions are scored as mismatches,
while a weight of 1 gives the transitions the match score. For distantly related
DNA sequences, the weight should be near to zero; for closely related sequences
it can be useful to assign a higher score.


5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
weight matrices. The default for proteins in version 1.8 is the PAM series
derived by Gonnet and colleagues. Note, a series is used! The actual matrix
that is used depends on how similar the sequences to be aligned at this
alignment step are. Different matrices work differently at each evolutionary
distance.

6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
can be selected. The default is the matrix used by BESTFIT for comparison of
nucleic acid sequences.

Further help is offered in the weight matrix menu.


7) In the weight matrices, you can use negative as well as positive values if
you wish, although the matrix will be automatically adjusted to all positive
scores, unless the NEGATIVE MATRIX option is selected.

8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty
options which are only used in protein alignments.
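
The effect of the TRANSITION WEIGHT can be illustrated with a small scoring
function. This is a sketch under stated assumptions, not the program's scoring
code: the names is_transition and dna_pair_score are hypothetical, the default
match and mismatch scores are placeholders, and weights between 0 and 1 are
assumed to interpolate linearly between the mismatch and match scores.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """True for A <--> G or C <--> T substitutions."""
    a, b = a.upper(), b.upper()
    if a == b:
        return False
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES

def dna_pair_score(a, b, transition_weight, match=1.0, mismatch=0.0):
    """Score one aligned pair of bases.

    A weight of 0 scores a transition as a mismatch and a weight of 1 gives
    it the match score; values in between are interpolated (assumption).
    """
    if a.upper() == b.upper():
        return match
    if is_transition(a, b):
        return mismatch + transition_weight * (match - mismatch)
    return mismatch
```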

>>HELP A << Help for protein gap parameters.

1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
or increase the gap opening penalties at each position in the alignment or
sequence. See the documentation for details. As an example, positions that
are rich in glycine are more likely to have an adjacent gap than positions that
are rich in valine.

2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
a run (5 or more residues) of hydrophilic amino acids; these are likely to
be loop or random coil regions where gaps are more common. The residues that
are "considered" to be hydrophilic are set by menu item 3.

4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too
close to each other. Gaps that are less than this distance apart are penalised
more than other gaps. This does not prevent close gaps; it makes them less
frequent, promoting a block-like appearance of the alignment.

5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes
of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).
If you turn this off, end gaps will be ignored for this purpose. This is
useful when you wish to align fragments where the end gaps are not biologically
meaningful.

>>HELP 5 << Help for output format options.

Six output formats are offered. You can choose any (or all 6 if you wish).

CLUSTAL format output is a self explanatory alignment format. It shows the
sequences aligned in blocks. It can be read in again at a later date to
(for example) calculate a phylogenetic tree or add a new sequence with a
profile alignment.

GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe
Felsenstein. This is an extremely widely used package for doing every
imaginable form of phylogenetic analysis (MUCH more than the modest
introduction offered by this program).

NBRF-PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
characters "-" are used to indicate the positions of gaps in the multiple
alignment. These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.

GDE: this is the flat file format used by the GDE package of Steven Smith.

NEXUS: the format used by several phylogeny programs, including PAUP and
MacClade.

GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
lower case.

CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
alignment lines in clustalw format.

OUTPUT ORDER is used to control the order of the sequences in the output
alignments. By default, the order corresponds to the order in which the
sequences were aligned (from the guide tree-dendrogram), thus automatically
grouping closely related sequences. This switch can be used to set the order
to the same as the input file.

PARAMETER OUTPUT: This option allows you to save all your parameter settings
in a parameter file. This file can be used subsequently to rerun Clustal W
using the same parameters.

>>HELP 6 << Help for profile and structure alignments

By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
alignments allow you to store alignments of your favourite sequences and add
new sequences to them in small bunches at a time. A profile is simply an
alignment of one or more sequences (e.g. an alignment output file from CLUSTAL
W). Each input can be a single sequence. One or both sets of input sequences
may include secondary structure assignments or gap penalty masks to guide the
alignment.

The profiles can be in any of the allowed input formats with "-" characters
used to specify gaps (except for MSF-RSF where "." is used).

You have to specify the 2 profiles by choosing menu items 1 and 2 and giving
2 file names. Then Menu item 3 will align the 2 profiles to each other.
Secondary structure masks in either profile can be used to guide the alignment.

Menu item 4 will take the sequences in the second profile and align them to
the first profile, 1 at a time. This is useful to add some new sequences to
an existing alignment, or to align a set of sequences to a known structure.
In this case, the second profile would not be pre-aligned.


The alignment parameters can be set using menu items 5, 6 and 7. These are
EXACTLY the same parameters as used by the general, automatic multiple
alignment procedure. The general multiple alignment procedure is simply a
series of profile alignments. Carrying out a series of profile alignments on
larger and larger groups of sequences, allows you to manually build up a
complete alignment, if necessary editing intermediate alignments.

SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure
parameters. If a solved structure is available, it can be used to guide the
alignment by raising gap penalties within secondary structure elements, so
that gaps will preferentially be inserted into unstructured surface loops.
Alternatively, a user-specified gap penalty mask can be supplied directly.

A gap penalty mask is a series of numbers between 1 and 9, one per position in
the alignment. Each number specifies how much the gap opening penalty is to be
raised at that position (raised by multiplying the basic gap opening penalty
by the number) i.e. a mask figure of 1 at a position means no change
in gap opening penalty; a figure of 4 means that the gap opening penalty is
four times greater at that position, making gaps 4 times harder to open.

The format for gap penalty masks and secondary structure masks is explained
in the help under option 0 (secondary structure options).
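
The arithmetic behind a gap penalty mask can be shown in a few lines. This is
a minimal Python sketch of the multiplication described above; the function
name apply_gap_mask and the base penalty of 10.0 in the usage example are
illustrative assumptions, not values taken from the program.

```python
def apply_gap_mask(base_gap_open, mask):
    """Return the per-position gap opening penalty for a mask of digits 1-9."""
    if any(ch not in "123456789" for ch in mask):
        raise ValueError("a gap penalty mask may only contain the digits 1 to 9")
    # Each digit multiplies the basic gap opening penalty at its position.
    return [base_gap_open * int(digit) for digit in mask]

if __name__ == "__main__":
    # A figure of 1 leaves the penalty unchanged; a figure of 4 quadruples it.
    print(apply_gap_mask(10.0, "1124"))
```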

>>HELP B << Help for secondary structure - gap penalty masks

The use of secondary structure-based penalties has been shown to improve the
accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty
masks to be supplied with the input sequences. The masks work by raising gap
penalties in specified regions (typically secondary structure elements) so that
gaps are preferentially opened in the less well conserved regions (typically
surface loops).

Options 1 and 2 control whether the input secondary structure information or
gap penalty masks will be used.

Option 3 controls whether the secondary structure and gap penalty masks should
be included in the output alignment.

Options 4 and 5 provide the value for raising the gap penalty at core Alpha
Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues
denote the A and B core structure notation. The basic gap penalties are
multiplied by the amount specified.

Option 6 provides the value for the gap penalty in Loops. By default this
penalty is not raised. In CLUSTAL format, loops are specified by "." in the
secondary structure notation.

Option 7 provides the value for setting the gap penalty at the ends of
secondary structures. Ends of secondary structures are observed to grow
and-or shrink in related structures. Therefore by default these are given
intermediate values, lower than the core penalties. All secondary structure
read in as lower case in CLUSTAL format gets the reduced terminal penalty.

Options 8 and 9 specify the range of structure termini for the intermediate
penalties. In the alignment output, these are indicated as lower case.
For Alpha Helices, by default, the range spans the end helical turn. For
Beta Strands, the default range spans the end residue and the adjacent loop
residue, since sequence conservation often extends beyond the actual H-bonded
Beta Strand.

CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input
files. For many 3-D protein structures, secondary structure information is
recorded in the feature tables of SWISS-PROT database entries. You should
always check that the assignments are correct - some are quite inaccurate.
CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.

FT   HELIX       100    115
FT   STRAND      118    119

The structure and penalty masks can also be read from CLUSTAL alignment format
as comment lines beginning "!SS_" or "!GM_" e.g.

!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

Note that the mask itself is a set of numbers between 1 and 9 each of which is
assigned to the residue(s) in the same column below.

In GDE flat file format, the masks are specified as text and the names must
begin with "SS_ or "GM_.

Either a structure or penalty mask or both may be used. If both are included in
an alignment, the user will be asked which is to be used.
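
Reading the "!SS_" and "!GM_" comment lines out of a CLUSTAL format alignment
can be sketched as follows. This is an illustrative Python example rather than
the program's parser: the function name read_masks is hypothetical, and it
assumes each mask line consists of the prefixed sequence name, whitespace, and
the mask string, with masks for the same name concatenated across blocks.

```python
def read_masks(lines):
    """Collect secondary structure and gap penalty masks keyed by sequence name."""
    structures = {}
    penalties = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # not a "name mask" line
        name, data = fields
        if name.startswith("!SS_"):
            key = name[len("!SS_"):]
            structures[key] = structures.get(key, "") + data
        elif name.startswith("!GM_"):
            key = name[len("!GM_"):]
            penalties[key] = penalties.get(key, "") + data
    return structures, penalties

if __name__ == "__main__":
    example = [
        "!SS_HBA_HUMA    ..aaaAAAAA",
        "!GM_HBA_HUMA    1122244444",
        "HBA_HUMA        VLSPADKTNV",
    ]
    print(read_masks(example))
```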

>>HELP C << Help for secondary structure - gap penalty mask output options

The options in this menu let you choose whether or not to include the masks
in the CLUSTAL W output alignments. Showing both is useful for understanding
how the masks work. The secondary structure information is itself very useful
in judging the alignment quality and in seeing how residue conservation
patterns vary with secondary structure.

>>HELP 7 << Help for phylogenetic trees

1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be
read in from a file in any format, or it can be the result of a full multiple
alignment that you have just carried out and that is still in memory.


*************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! ***************

The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ method to the distance matrix.
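The first of the two steps, the distance calculation, can be sketched in a few lines of Python. This is only an illustration: the function names, the toy alignment and the choice to skip any position where either of the two sequences has a gap are assumptions of this example, and the NJ step itself is not shown.

```python
def percent_divergence(seq_a, seq_b, gap="-"):
    """Percent of compared positions at which two aligned sequences differ.

    Positions where either sequence has a gap are skipped (an assumption
    of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * differing / compared


def distance_matrix(alignment):
    """Percent divergence for every ordered pair of sequence names."""
    names = list(alignment)
    return {
        (x, y): percent_divergence(alignment[x], alignment[y])
        for x in names
        for y in names
    }


# Toy alignment, invented for illustration.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TTCGA"}
distances = distance_matrix(toy)
print(distances[("s1", "s2")])
```

The resulting dictionary is the distance matrix that a neighbour-joining routine would take as input.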

2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
ANY of the sequences have a gap will be ignored. This means that 'like' will be
compared to 'like' in all distances, which is highly desirable. It also
automatically throws away the most ambiguous parts of the alignment, which are
concentrated around gaps (usually). The disadvantage is that you may throw away
much of the data if there are many gaps (which is why it is difficult for us to
make it the default).
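What this option does to the data can be shown with a short Python sketch. The function name `toss_gap_columns` and the toy alignment are made up for the example; the gap character is assumed to be "-".

```python
def toss_gap_columns(alignment, gap="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if gap not in column]
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


# Toy alignment, invented for illustration: columns 3 and 6 contain a gap.
toy = {"s1": "ACGTACGT", "s2": "AC-TACGA", "s3": "ACGTA-GA"}
trimmed = toss_gap_columns(toy)
print(trimmed)
```

Two gaps in different sequences already cost two of the eight columns here, which is the trade-off the text describes for heavily gapped alignments.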

3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
option makes no difference. For greater divergence, it corrects for the fact
that observed distances underestimate actual evolutionary distances. This is
because, as sequences diverge, more than one substitution will happen at many
sites. However, you only see one difference when you look at the present day
sequences. Therefore, this option has the effect of stretching branch lengths
in trees (especially long branches). The corrections used here (for DNA or
proteins) are both due to Motoo Kimura. See the documentation for details.

Where possible, this option should be used. However, for VERY divergent
sequences, the distances cannot be reliably corrected. You will be warned if
this happens. Even if none of the distances in a data set exceed the reliable
threshold, if you bootstrap the data, some of the bootstrap distances may
randomly exceed the safe limit.
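For a feel of how such a correction stretches larger distances, the sketch below uses the commonly published forms of Kimura's corrections: the empirical protein formula K = -ln(1 - p - 0.2 p^2) and the two-parameter DNA distance. Treat it as an illustration under that assumption; the exact formulas, thresholds and warnings used by the program are described in its documentation and may differ in detail.

```python
import math


def kimura_protein(p):
    """Kimura's empirical correction for a proportion p of differing sites."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("distance too large to be corrected")
    return -math.log(inner)


def kimura_dna(transitions, transversions):
    """Kimura two-parameter distance from the observed proportions of
    transitions and transversions."""
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance too large to be corrected")
    return -0.5 * math.log(a * math.sqrt(b))


# Observed proportions chosen only for illustration.
for p in (0.05, 0.30, 0.60):
    print(p, round(kimura_protein(p), 3))
```

The gap between observed and corrected values grows with divergence, and beyond a certain point the logarithm is undefined, which is one way to see why very divergent sequences cannot be corrected reliably.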

4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
tree and all branch lengths. The root of the tree can only be inferred by
using an outgroup (a sequence that you are certain, on biological grounds,
branches at the outside of the tree) OR, if you assume a degree of constancy
in the 'molecular clock', you can place the root in the 'middle' of the tree
(roughly equidistant from all tips).

5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
By default, the bootstrap values are correctly placed on the tree branches of
the phylip format output tree. The toggle allows them to be placed on the
nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView
and Phylowin) only support node labelling but not branch labelling. Care
should be taken to note which branches and labels go together.

6) OUTPUT FORMATS: four different formats are allowed. None of these displays
the tree visually. Useful display programs accepting PHYLIP format include
NJplot (from Manolo Gouy and supplied with Clustal W) and TreeView (Mac-PC);
alternatively, get the PHYLIP package itself and use the tree drawing
facilities there. (Get the PHYLIP package anyway if you are interested in
trees). The NEXUS format can be read into PAUP or MacClade.

>>HELP 8 << Help for choosing a weight matrix

For protein alignments, you use a weight matrix to determine the similarity of
non-identical amino acids. For example, Tyr aligned with Phe is usually judged
to be 'better' than Tyr aligned with Pro.

There are three 'in-built' series of weight matrices offered. Each consists of
several matrices which work differently at different evolutionary distances. To
see the exact details, read the documentation. Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from almost
identical sequences to highly divergent ones). For very similar sequences, it
is best to use a strict weight matrix which only gives a high score to
identities and the most favoured conservative substitutions. For more divergent
sequences, it is appropriate to use "softer" matrices which give a high score
to many other frequent substitutions.

1) BLOSUM (Henikoff). These matrices appear to be the best available for
carrying out database similarity (homology) searches. The matrices used are:
Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W
versions.)

2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
We use the PAM 20, 60, 120 and 350 matrices.

3) GONNET. These matrices were derived using almost the same procedure as the
Dayhoff one (above) but are much more up to date and are based on a far larger
data set. They appear to be more sensitive than the Dayhoff series. We use the
GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for
Clustal W version 1.8.

We also supply an identity matrix which gives a score of 1.0 to two identical
amino acids and a score of zero otherwise. This matrix is not very useful.
Alternatively, you can read in your own (just one matrix, not a series).

A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The values in the new weight matrix must be integers
and the scores should be similarities. You can use negative as well as positive
values if you wish, although the matrix will be automatically adjusted to all
positive scores.

For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
available:

1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.

INPUT FORMAT: The format used for a new matrix is the same as that of the BLAST
program. Any lines beginning with a # character are assumed to be comments. The
first non-comment line should contain a list of amino acids in any order, using
the 1 letter code, followed by a * character. This should be followed by a
square matrix of integer scores, with one row and one column for each amino
acid. The last row and column of the matrix (corresponding to the * character)
contain the minimum score over the whole matrix.
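A reader for this layout might look roughly like the Python sketch below. The function name `parse_matrix`, the toy three-residue matrix and its values are invented for illustration, and accepting an optional leading row label (as found in BLAST matrix files) is an assumption of this example rather than something stated in the help text.

```python
def parse_matrix(text):
    """Parse a BLAST-style matrix into (symbols, {(row, column): score})."""
    rows = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    ]
    symbols = rows[0].split()  # header: residue letters followed by '*'
    data_rows = rows[1:]
    if len(data_rows) != len(symbols):
        raise ValueError("matrix is not square")
    scores = {}
    for row_symbol, line in zip(symbols, data_rows):
        fields = line.split()
        if len(fields) == len(symbols) + 1:
            fields = fields[1:]  # drop a leading row label if present
        if len(fields) != len(symbols):
            raise ValueError("row length does not match the header")
        for column_symbol, value in zip(symbols, fields):
            scores[(row_symbol, column_symbol)] = int(value)
    return symbols, scores


# Toy matrix; the values are invented for illustration only.
toy = """# comment lines start with a hash
   A  R  N  *
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6 -2
* -2 -2 -2 -2
"""
symbols, scores = parse_matrix(toy)
print(symbols, scores[("A", "R")])
```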

>>HELP 9 << Help for command line parameters
DATA (sequences)

-INFILE=file.ext     :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).

VERBS (do things)

-OPTIONS             :list the command line parameters
-HELP or -CHECK      :outline the command line params.
-ALIGN               :do full multiple alignment
-TREE                :calculate NJ tree.
-BOOTSTRAP(=n)       :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT             :output the input sequences in a different file format.

PARAMETERS (set things)

***General settings:***
-INTERACTIVE         :read command line, then enter normal interactive menus
-QUICKTREE           :use FAST algorithm for the alignment guide tree
-TYPE=               :PROTEIN or DNA sequences
-NEGATIVE            :protein alignment with negative values in matrix
-OUTFILE=            :sequence alignment file name
-OUTPUT=             :GCG, GDE, PHYLIP, PIR or NEXUS
-OUTORDER=           :INPUT or ALIGNED
-CASE                :LOWER or UPPER (for GDE output only)
-SEQNOS=             :OFF or ON (for Clustal output only)
-SEQNO_RANGE=        :OFF or ON (NEW: for all output formats)
-RANGE=m,n           :sequence range to write starting m to m+n.

***Fast Pairwise Alignments:***
-KTUPLE=n            :word size
-TOPDIAGS=n          :number of best diags.
-WINDOW=n            :window around best diags.
-PAIRGAP=n           :gap penalty
-SCORE               :PERCENT or ABSOLUTE

***Slow Pairwise Alignments:***
-PWMATRIX=           :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX=        :DNA weight matrix=IUB, CLUSTALW or filename
-PWGAPOPEN=f         :gap opening penalty
-PWGAPEXT=f          :gap extension penalty

***Multiple Alignments:***
-NEWTREE=            :file for new guide tree
-USETREE=            :file for old guide tree
-MATRIX=             :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX=          :DNA weight matrix=IUB, CLUSTALW or filename
-GAPOPEN=f           :gap opening penalty
-GAPEXT=f            :gap extension penalty
-ENDGAPS             :no end gap separation pen.
-GAPDIST=n           :gap separation pen. range
-NOPGAP              :residue-specific gaps off
-NOHGAP              :hydrophilic gaps off
-HGAPRESIDUES=       :list hydrophilic res.
-MAXDIV=n            :% ident. for delay
-TYPE=               :PROTEIN or DNA
-TRANSWEIGHT=f       :transitions weighting

***Profile Alignments:***
-PROFILE             :Merge two alignments by profile alignment
-NEWTREE1=           :file for new guide tree for profile1
-NEWTREE2=           :file for new guide tree for profile2
-USETREE1=           :file for old guide tree for profile1
-USETREE2=           :file for old guide tree for profile2

***Sequence to Profile Alignments:***
-SEQUENCES           :Sequentially add profile2 sequences to profile1 alignment
-NEWTREE=            :file for new guide tree
-USETREE=            :file for old guide tree

***Structure Alignments:***
-NOSECSTR1           :do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2           :do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT=STRUCTURE or MASK or BOTH or NONE :output in alignment file
-HELIXGAP=n          :gap penalty for helix core residues
-STRANDGAP=n         :gap penalty for strand core residues
-LOOPGAP=n           :gap penalty for loop regions
-TERMINALGAP=n       :gap penalty for structure termini
-HELIXENDIN=n        :number of residues inside helix to be treated as terminal
-HELIXENDOUT=n       :number of residues outside helix to be treated as terminal
-STRANDENDIN=n       :number of residues inside strand to be treated as terminal
-STRANDENDOUT=n      :number of residues outside strand to be treated as terminal

***Trees:***
-OUTPUTTREE=nj OR phylip OR dist OR nexus
-SEED=n              :seed number for bootstraps.
-KIMURA              :use Kimura's correction.
-TOSSGAPS            :ignore positions with gaps.
-BOOTLABELS=node OR branch :position of bootstrap values in tree display
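The parameters listed above combine into a single command line: one data argument, one or more verbs, then any settings. The Python sketch below only assembles such an argument list. The helper `build_command`, the executable name `clustalw`, the file name `seqs.fasta` and the gap opening value are all assumptions for illustration; only the flag names come from the list above.

```python
def build_command(program, infile, verbs=(), **parameters):
    """Assemble an argument list from an input file, verbs and settings."""
    args = [program, f"-INFILE={infile}"]
    args += [f"-{verb.upper()}" for verb in verbs]
    for name, value in parameters.items():
        flag = f"-{name.upper()}"
        # A value of True means a bare switch such as -QUICKTREE.
        args.append(flag if value is True else f"{flag}={value}")
    return args


# Hypothetical run: align protein sequences and write PHYLIP output.
command = build_command(
    "clustalw",            # assumed executable name
    "seqs.fasta",          # assumed input file
    verbs=["align"],
    type="PROTEIN",
    output="PHYLIP",
    gapopen=10.0,          # illustrative value, not a documented default
    quicktree=True,
)
print(" ".join(command))
```

A list built this way could be handed to `subprocess.run(command, check=True)` on a machine where the program is installed under that name.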

>>HELP 0 << Help for tree output format options

Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances,
4) Nexus

None of these formats displays the results graphically. Many packages can
display trees in the PHYLIP format (2 below). It can also be imported into
the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display.
NEXUS format trees can be read by PAUP and MacClade.

1) Clustal format output.
This format is verbose and lists all of the distances between the sequences and
the number of alignment positions used for each. The tree is described at the
end of the file. It lists the sequences that are joined at each alignment step
and the branch lengths. After two sequences are joined, the pair is referred to
later as a NODE. The number of a NODE is the number of the lowest sequence in
that NODE.

2) Phylip format output.
This format is the New Hampshire format, used by many phylogenetic analysis
packages. It consists of a series of nested parentheses, describing the
branching order, with the sequence names and branch lengths. It can be used by
the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the
trees graphically. This is the same format used during multiple alignment for
the guide trees.
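To show what the nested-parentheses notation looks like, the Python sketch below takes a small New Hampshire format string and lists each sequence name with its branch length. The tree string, the sequence names and the helper `leaves_with_lengths` are invented for illustration, and the regular expression is a deliberately simple assumption that will not cope with quoted names or comments.

```python
import re


def leaves_with_lengths(newick):
    """Return {sequence_name: branch_length} for a New Hampshire string."""
    # A leaf is a name that follows '(' or ',' and is followed by ':length'.
    pattern = re.compile(r"[(,]\s*([^(),:;\s]+)\s*:\s*([0-9.eE+-]+)")
    return {name: float(length) for name, length in pattern.findall(newick)}


# Toy unrooted tree with four sequences; names and lengths are made up.
tree = "((seq1:0.10,seq2:0.12):0.05,seq3:0.30,seq4:0.25);"
leaves = leaves_with_lengths(tree)
print(leaves)
```

The inner pair of parentheses groups seq1 and seq2, and the length written after the closing parenthesis (0.05) belongs to the internal branch that joins that pair to the rest of the tree.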

Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other
packages that can read and display New Hampshire format are TreeView (Mac/PC),
TreeTool (UNIX), and Phylowin.

3) The distances only.
This format just outputs a matrix of all the pairwise distances in a format
that can be used by the Phylip package. It used to be useful when one could not
produce distances from protein sequences in the Phylip package but is now
redundant (Protdist of Phylip 3.5 now does this).

4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
including PAUP and MacClade. The format is described fully in:
Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.

5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
By default, the bootstrap values are placed on the nodes of the phylip format
output tree. This is inaccurate as the bootstrap values should be associated
with the tree branches and not the nodes. However, this format can be read and
displayed by TreeTool, TreeView and Phylowin. An option is available to
correctly place the bootstrap values on the branches with which they are
associated.
