~ubuntu-branches/ubuntu/trusty/clustalx/trusty

Viewing changes to debian/patches/clustalx.html.patch

Committer: Bazaar Package Importer
Author(s): Charles Plessy, Steffen Moeller, Charles Plessy
Date: 2009-10-21 13:25:44 UTC
mfrom: (1.1.1 upstream)
Revision ID: james.westby@ubuntu.com-20091021132544-r4hbcnjxp354wxh0

Tags: 2.0.12-1

http://bugs.debian.org/550893

* New upstream release (LP: #423648, #393769):
  - Uses Qt instead of lesstif.
  - Includes new code for UPGMA guide trees.
  - Includes iterative alignment facility.

[ Steffen Moeller ]
* New upstream release.
* Updated watch file (Closes: #550893).
* Removed LICENSE from debian/clustalx.docs
* rename to clustalx seems no longer required in debian/rules
* moved clustalx.1 into debian folder (eases working with svn-buildpackage)
* added German translation to desktop file

[ Charles Plessy ]
* Updated my email address.
* debian/copyright made machine-readable.
* Added various informations in debian/upstream-metadata.yaml.
* Switched to Debhelper 7.
  (debian/rules, debian/control, debian/patches, debian/compat)
* Removed useless Debhelper file debian/clustalx.dirs.
* Updated package description.
* Hardcoded the localisation of accessory files in /usr/share/clustalx.
  (debian/patches/hardcode-accessory-file-locations.patch)
* Documented in debian/README.source that the documentation for quilt
  is in /usr/share/doc/quilt.
* Added upstream changelog downloaded from upstream website
  (debian/rules, debian/CHANGELOG.upstream).
* Incremented Standards-Version to reflect conformance with Policy 3.8.3
  (debian/control, no other changes needed).
* Updated homepage in debian/clustalw.1.

files added:
AlignOutputFileNames.cpp

AlignOutputFileNames.h

AlignmentFormatOptions.cpp

AlignmentFormatOptions.h

AlignmentParameters.cpp

AlignmentParameters.h

AlignmentViewerWidget.cpp

AlignmentViewerWidget.h

AlignmentWidget.cpp

AlignmentWidget.h

BootstrapTreeDialog.cpp

BootstrapTreeDialog.h

ClustalQtParams.h

ColorFileXmlParser.cpp

ColorFileXmlParser.h

ColorParameters.cpp

ColorParameters.h

ColumnScoreParams.cpp

ColumnScoreParams.h

FileDialog.cpp

FileDialog.h

HardCodedColorScheme.cpp

HardCodedColorScheme.h

HelpDisplayWidget.cpp

HelpDisplayWidget.h

HistogramWidget.cpp

HistogramWidget.h

KeyController.cpp

KeyController.h

LICENSE

LowScoringSegParams.cpp

LowScoringSegParams.h

PSPrinter.cpp

PSPrinter.h

PairwiseParams.cpp

PairwiseParams.h

PostscriptFileParams.cpp

PostscriptFileParams.h

ProteinGapParameters.cpp

ProteinGapParameters.h

QTUtility.cpp

QTUtility.h

README

Resources.cpp

Resources.h

SaveSeqFile.cpp

SaveSeqFile.h

SearchForString.cpp

SearchForString.h

SecStructOptions.cpp

SecStructOptions.h

SeqNameWidget.cpp

SeqNameWidget.h

TreeFormatOptions.cpp

TreeFormatOptions.h

TreeOutputFileNames.cpp

TreeOutputFileNames.h

WritePostscriptFile.cpp

WritePostscriptFile.h

clustalW

clustalW/Clustal.cpp

clustalW/Clustal.h

clustalW/Help.cpp

clustalW/Help.h

clustalW/alignment

clustalW/alignment/Alignment.cpp

clustalW/alignment/Alignment.h

clustalW/alignment/AlignmentOutput.cpp

clustalW/alignment/AlignmentOutput.h

clustalW/alignment/ObjectiveScore.cpp

clustalW/alignment/ObjectiveScore.h

clustalW/alignment/Sequence.cpp

clustalW/alignment/Sequence.h

clustalW/clustalw_version.h

clustalW/fileInput

clustalW/fileInput/ClustalFileParser.cpp

clustalW/fileInput/ClustalFileParser.h

clustalW/fileInput/EMBLFileParser.cpp

clustalW/fileInput/EMBLFileParser.h

clustalW/fileInput/FileParser.cpp

clustalW/fileInput/FileParser.h

clustalW/fileInput/FileReader.cpp

clustalW/fileInput/FileReader.h

clustalW/fileInput/GDEFileParser.cpp

clustalW/fileInput/GDEFileParser.h

clustalW/fileInput/InFileStream.cpp

clustalW/fileInput/InFileStream.h

clustalW/fileInput/MSFFileParser.cpp

clustalW/fileInput/MSFFileParser.h

clustalW/fileInput/PIRFileParser.cpp

clustalW/fileInput/PIRFileParser.h

clustalW/fileInput/PearsonFileParser.cpp

clustalW/fileInput/PearsonFileParser.h

clustalW/fileInput/RSFFileParser.cpp

clustalW/fileInput/RSFFileParser.h

clustalW/general

clustalW/general/Array2D.h

clustalW/general/ClustalWResources.cpp

clustalW/general/ClustalWResources.h

clustalW/general/DebugLog.cpp

clustalW/general/DebugLog.h

clustalW/general/InvalidCombination.cpp

clustalW/general/OutputFile.cpp

clustalW/general/OutputFile.h

clustalW/general/RandomAccessLList.h

clustalW/general/SequenceNotFoundException.h

clustalW/general/SquareMat.h

clustalW/general/Stats.cpp

clustalW/general/Stats.h

clustalW/general/SymMatrix.cpp

clustalW/general/SymMatrix.h

clustalW/general/UserParameters.cpp

clustalW/general/UserParameters.h

clustalW/general/Utility.cpp

clustalW/general/Utility.h

clustalW/general/VectorOutOfRange.cpp

clustalW/general/VectorOutOfRange.h

clustalW/general/VectorUtility.h

clustalW/general/clustalw.h

clustalW/general/debuglogObject.h

clustalW/general/general.h

clustalW/general/param.h

clustalW/general/statsObject.h

clustalW/general/userparams.h

clustalW/general/utils.h

clustalW/interface

clustalW/interface/CommandLineParser.cpp

clustalW/interface/CommandLineParser.h

clustalW/interface/InteractiveMenu.cpp

clustalW/interface/InteractiveMenu.h

clustalW/multipleAlign

clustalW/multipleAlign/Iteration.cpp

clustalW/multipleAlign/Iteration.h

clustalW/multipleAlign/LowScoreSegProfile.cpp

clustalW/multipleAlign/LowScoreSegProfile.h

clustalW/multipleAlign/MSA.cpp

clustalW/multipleAlign/MSA.h

clustalW/multipleAlign/MyersMillerProfileAlign.cpp

clustalW/multipleAlign/MyersMillerProfileAlign.h

clustalW/multipleAlign/ProfileAlignAlgorithm.h

clustalW/multipleAlign/ProfileBase.cpp

clustalW/multipleAlign/ProfileBase.h

clustalW/multipleAlign/ProfileStandard.cpp

clustalW/multipleAlign/ProfileStandard.h

clustalW/multipleAlign/ProfileWithSub.cpp

clustalW/multipleAlign/ProfileWithSub.h

clustalW/pairwise

clustalW/pairwise/FastPairwiseAlign.cpp

clustalW/pairwise/FastPairwiseAlign.h

clustalW/pairwise/FullPairwiseAlign.cpp

clustalW/pairwise/FullPairwiseAlign.h

clustalW/pairwise/PairwiseAlignBase.h

clustalW/substitutionMatrix

clustalW/substitutionMatrix/SubMatrix.cpp

clustalW/substitutionMatrix/SubMatrix.h

clustalW/substitutionMatrix/globalmatrix.h

clustalW/substitutionMatrix/matrices.h

clustalW/tree

clustalW/tree/AlignmentSteps.cpp

clustalW/tree/AlignmentSteps.h

clustalW/tree/ClusterTree.cpp

clustalW/tree/ClusterTree.h

clustalW/tree/ClusterTreeAlgorithm.h

clustalW/tree/ClusterTreeOutput.cpp

clustalW/tree/ClusterTreeOutput.h

clustalW/tree/NJTree.cpp

clustalW/tree/NJTree.h

clustalW/tree/RandomGenerator.cpp

clustalW/tree/RandomGenerator.h

clustalW/tree/Tree.cpp

clustalW/tree/Tree.h

clustalW/tree/TreeInterface.cpp

clustalW/tree/TreeInterface.h

clustalW/tree/UPGMA

clustalW/tree/UPGMA/Node.cpp

clustalW/tree/UPGMA/Node.h

clustalW/tree/UPGMA/RootedClusterTree.cpp

clustalW/tree/UPGMA/RootedClusterTree.h

clustalW/tree/UPGMA/RootedGuideTree.cpp

clustalW/tree/UPGMA/RootedGuideTree.h

clustalW/tree/UPGMA/RootedTreeOutput.cpp

clustalW/tree/UPGMA/RootedTreeOutput.h

clustalW/tree/UPGMA/UPGMAAlgorithm.cpp

clustalW/tree/UPGMA/UPGMAAlgorithm.h

clustalW/tree/UPGMA/upgmadata.h

clustalW/tree/UnRootedClusterTree.cpp

clustalW/tree/UnRootedClusterTree.h

clustalW/tree/dayhoff.h

clustalx.hlp

clustalx.pro

coldna.xml

colprint.xml

colprot.xml

debian/CHANGELOG.upstream

debian/README.source

debian/patches/hardcode-accessory-file-locations.patch

debian/upstream-metadata.yaml

installer

main.cpp

mainwindow.cpp

mainwindow.h

moc_AlignOutputFileNames.cpp

moc_AlignmentFormatOptions.cpp

moc_AlignmentParameters.cpp

moc_AlignmentViewerWidget.cpp

moc_AlignmentWidget.cpp

moc_BootstrapTreeDialog.cpp

moc_ColorFileXmlParser.cpp

moc_ColumnScoreParams.cpp

moc_FileDialog.cpp

moc_HelpDisplayWidget.cpp

moc_HistogramWidget.cpp

moc_LowScoringSegParams.cpp

moc_PSPrinter.cpp

moc_PairwiseParams.cpp

moc_ProteinGapParameters.cpp

moc_SaveSeqFile.cpp

moc_SearchForString.cpp

moc_SecStructOptions.cpp

moc_SeqNameWidget.cpp

moc_TreeFormatOptions.cpp

moc_TreeOutputFileNames.cpp

moc_WritePostscriptFile.cpp

moc_mainwindow.cpp

usr/local

usr/local/qt

usr/local/qt/qt-x11-opensource-src-4.5.2

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/g++.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/linux.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/common/unix.conf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/default_post.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/default_pre.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/exclusive_builds.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/include_source_dir.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/lex.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/moc.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt_config.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/qt_functions.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/release.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/resources.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/static.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/uic.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/unix

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/unix/thread.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/warn_on.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/features/yacc.prf

usr/local/qt/qt-x11-opensource-src-4.5.2/mkspecs/qconfig.pri

version.h

files removed:
README_W

README_X

alnscore.c

amenu.c

calcgapcoeff.c

calcprf1.c

calcprf2.c

calctree.c

clustalv.doc

clustalw.c

clustalw.doc

clustalw.h

clustalw.ms

clustalw.new

clustalw_help

clustalx.c

clustalx.html

clustalx_help

coldna.par

colprint.par

colprot.par

dayhoff.h

debian/clustalx.dirs

debian/patches/amenu.c.patch

debian/patches/clustal-help.patch

debian/patches/clustalw.h.patch

debian/patches/clustalx.html.patch

debian/patches/clustalx_help.patch

debian/patches/interface.c.patch

debian/patches/makefile.patch

debian/patches/sequence.c.patch

debian/patches/trees.c.patch

debian/patches/util.c.patch

debian/patches/xmenu.c.patch

gcgcheck.c

general.h

globin.pep

gon90.bla

interface.c

makefile

makefile.alpha

makefile.linux

makefile.sgi

makefile.sun

malign.c

matrices.h

matrixseries.gon

pairalign.c

param.h

prfalign.c

random.c

readmat.c

sequence.c

showpair.c

trees.c

util.c

xcolor.c

xdisplay.c

xmenu.c

xmenu.h

xscore.c

xutils.c

files modified:
debian/changelog

debian/clustalx.1

debian/clustalx.docs

debian/clustalx.install

debian/clustalx.menu

debian/compat

debian/control

debian/copyright

debian/patches/series

debian/rules

debian/watch

debian/patches/clustalx.html.patch

Index: clustalw-1.83/clustalx.html

===================================================================

--- clustalw-1.83.orig/clustalx.html

+++ clustalw-1.83/clustalx.html

@@ -2029,6 +2029,2118 @@

Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997)

The ClustalX windows interface: flexible strategies for multiple sequence

+alignment aided by quality analysis tools. Nucleic Acids Research, 24:4876-4882.

+

+

+

+

+

+The ClustalW program is described in the manuscript:

+

+

+

+Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the

+sensitivity of progressive multiple sequence alignment through sequence

+weighting, positions-specific gap penalties and weight matrix choice. Nucleic

+Acids Research, 22:4673-4680.

+

+

+

+

+

+The ClustalV program is described in the manuscript:

+

+

+

+Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for

+multiple sequence alignment. CABIOS 8,189-191.

+

+

+

+

+

+The original Clustal program is described in the manuscripts:

+

+

+

+Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence

+alignments on a microcomputer.

+CABIOS 5,151-153.

+

+

+Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple

+sequence alignment on a microcomputer. Gene 73,237-244.

+

+

+

+Some tips on using Clustal X:

+

+

+

+Jeannmougin,F., Thompson,J.D., Gouy,M., Higgins,D.G. and Gibson,T.J. (1998)

+Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5.

+

+

+

+Some tips on using Clustal W:

+

+

+

+Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for

+multiple sequence alignments. Methods Enzymol., 266, 383-402.

+

+

+

+You can get the latest version of the ClustalX program by anonymous ftp to:

+

+

+

+ftp-igbmc.u-strasbg.fr

+ftp.embl-heidelberg.de

+ftp.ebi.ac.uk

+

+

+

+Or, have a look at the following WWW site:

+

+

+

+http://www-igbmc.u-strasbg.fr/BioInfo/

+

+

+

+<A HREF="#INDEX"> Back to Index </A>

+<HEAD>

+<TITLE>ClustalX Help</TITLE>

+</HEAD>

+<BODY BGCOLOR=white>

+<CENTER><H1>ClustalX Help</H1></CENTER>

+

+You can get the latest version of the ClustalX program here:

+

+<DL><DD>

+<A HREF="ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/">

+ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/</A>

+</DL>

+For full details of usage and algorithms, please read the <A HREF="clustalw.doc">ClustalW.Doc</A> file.

+<PRE>

+Toby Gibson EMBL, Heidelberg, Germany.

+Des Higgins UCC, Cork, Ireland.

+Julie Thompson/Francois Jeanmougin IGBMC, Strasbourg, France.

+</PRE>

+<CENTER><H2><A NAME="Index">Index</A></H2></CENTER>

+<OL>

+<LI><A HREF="#G"> General help for CLUSTAL X (1.8)

+</A></LI>

+<LI><A HREF="#F"> Input / Output Files

+</A></LI>

+<LI><A HREF="#E"> Editing Alignments

+</A></LI>

+<LI><A HREF="#M"> Multiple Alignments

+</A></LI>

+<LI><A HREF="#P"> Profile and Structure Alignments

+</A></LI>

+<LI><A HREF="#B"> Secondary Structure / Gap Penalty Masks

+</A></LI>

+<LI><A HREF="#T"> Phylogenetic Trees

+</A></LI>

+<LI><A HREF="#C"> Colors

+</A></LI>

+<LI><A HREF="#Q"> Alignment Quality Analysis

+</A></LI>

+<LI><A HREF="#9"> Command Line Parameters

+</A></LI>

+<LI><A HREF="#R"> References

+</A></LI>

+</OL>

+<CENTER><H2><A NAME="G"> General help for CLUSTAL X (1.8)

+</A></H2></CENTER>

+

+

+

+Clustal X is a windows interface for the ClustalW multiple sequence alignment

+program. It provides an integrated environment for performing multiple sequence

+and profile alignments and analysing the results. The sequence alignment is

+displayed in a window on the screen. A versatile coloring scheme has been

+incorporated allowing you to highlight conserved features in the alignment.

+The pull-down menus at the top of the window allow you to select all the

+options required for traditional multiple sequence and profile alignment.

+

+

+You can cut-and-paste sequences to change the order of the alignment; you can

+select a subset of sequences to be aligned; you can select a sub-range of the

+alignment to be realigned and inserted back into the original alignment.

+

+

+Alignment quality analysis can be performed and low-scoring segments or

+exceptional residues can be highlighted.

+

+

+ClustalX is available for a number of different platforms including: SUN

+Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECStations, Microsoft

+Windows (32 bit) for PC's, Linux ELF for x86 PC's and Macintosh PowerMac. (See

+the README file for Installation instructions.)

+

+

+

+

+<H4>

+SEQUENCE INPUT

+</H4>

+

+

+Sequences and profiles (a term for pre-existing alignments) are input using

+the FILE menu. Invalid options will be disabled. All sequences must be included

+into 1 file. 7 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT,

+Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9 RSF and GDE flat file.

+All non-alphabetic characters (spaces, digits, punctuation marks) are ignored

+except "-" which is used to indicate a GAP ("." in MSF/RSF).

+

+

+<H4>

+SEQUENCE / PROFILE ALIGNMENTS

+</H4>

+

+

+Clustal X has two modes which can be selected using the switch directly above

+the sequence display: MULTIPLE ALIGNMENT MODE and PROFILE ALIGNMENT MODE.

+

+

+To do a MULTIPLE ALIGNMENT on a set of sequences, make sure MULTIPLE ALIGNMENT

+MODE is selected. A single sequence data area is then displayed. The ALIGNMENT

+menu then allows you to either produce a guide tree for the alignment, or to do

+a multiple alignment following the guide tree, or to do a full multiple

+alignment.

+

+

+In PROFILE ALIGNMENT MODE, two sequence data areas are displayed, allowing you

+to align 2 alignments (termed profiles). Profiles are also used to add a new

+sequence to an old alignment, or to use secondary structure to guide the

+alignment process. GAPS in the old alignments are indicated using the "-"

+character. PROFILES can be input in ANY of the allowed formats; just use "-"

+(or "." for MSF/RSF) for each gap position. In Profile Alignment Mode, a button

+"Lock Scroll" is displayed which allows you to scroll the two profiles together

+using a single scroll bar. When the Lock Scroll is turned off, the two profiles

+can be scrolled independently.

+

+

+<H4>

+PHYLOGENETIC TREES

+</H4>

+

+

+Phylogenetic trees can be calculated from old alignments (read in with "-"

+characters to indicate gaps) OR after a multiple alignment while the alignment

+is still displayed.

+

+

+<H4>

+ALIGNMENT DISPLAY

+</H4>

+

+

+The alignment is displayed on the screen with the sequence names on the left

+hand side. The sequence alignment is for display only, it cannot be edited here

+(except for changing the sequence order by cutting-and-pasting on the sequence

+names).

+

+

+A ruler is displayed below the sequences, starting at 1 for the first residue

+position (residue numbers in the sequence input file are ignored).

+

+

+A line above the alignment is used to mark strongly conserved positions. Three

+characters ('*', ':' and '.') are used:

+

+

+'*' indicates positions which have a single, fully conserved residue

+

+

+':' indicates that one of the following 'strong' groups is fully conserved:-

+<PRE>

+ STA

+ NEQK

+ NHQK

+ NDEQ

+ QHRK

+ MILV

+ MILF

+ HY

+ FYW

+</PRE>

+

+

+'.' indicates that one of the following 'weaker' groups is fully conserved:-

+<PRE>

+ CSA

+ ATV

+ SAG

+ STNK

+ STPA

+ SGND

+ SNDEQK

+ NDEQHK

+ NEQHRK

+ FVLIM

+ HFY

+</PRE>

+

+

+These are all the positively scoring groups that occur in the Gonnet Pam250

+matrix. The strong and weak groups are defined as strong score >0.5 and weak

+score =<0.5 respectively.
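
Note on the help text above (not part of the patch): the marking rule for the conservation line can be sketched in a few lines of Python. This is only an illustration of the rule as the help text states it, not ClustalX's own code. The function names are hypothetical, and leaving a column unmarked when it contains a gap is an assumption the help text does not confirm.

```python
# Residue groups copied from the help text above.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (hypothetical helper)."""
    residues = set(column.upper())
    # Assumption: a column that is empty or contains a gap is left unmarked.
    if not residues or "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # every residue falls inside one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # every residue falls inside one 'weaker' group
    return " "


def conservation_line(alignment):
    """Build the marker line for a list of equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*alignment))
```

For example, `conservation_line(["ASC", "ATS", "AAA"])` marks the first column as fully conserved, the second as a strong group (STA) and the third as a weak group (CSA).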

+

+

+For profile alignments, secondary structure and gap penalty masks are displayed

+above the sequences, if any data is found in the profile input file.

+

+

+

+

+

+<A HREF="#INDEX"> Back to Index </A>

+<CENTER><H2><A NAME="F"> Input / Output Files

+</A></H2></CENTER>

+

+

+

+LOAD SEQUENCES reads sequences from one of 7 file formats, replacing any

+sequences that are already loaded. All sequences must be in 1 file. The formats

+that are automatically recognised are: NBRF/PIR, EMBL/SWISSPROT, Pearson

+(Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file. All

+non-alphabetic characters (spaces, digits, punctuation marks) are ignored

+except "-" which is used to indicate a GAP ("." in MSF/RSF).

+

+

+The program tries to automatically recognise the different file formats used

+and to guess whether the sequences are amino acid or nucleotide. This is not

+always foolproof.

+

+

+FASTA and NBRF/PIR formats are recognised by having a ">" as the first

+character in the file.

+

+

+EMBL/Swiss Prot formats are recognised by the letters "ID" at the start of the

+file (the token for the entry name field).

+

+

+CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

+

+

+GCG/MSF format is recognised by one of the following:

+<UL>

+<LI>

+ - the word PileUp at the start of the file.

+</LI><LI>

+ - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT

+ at the start of the file.

+</LI><LI>

+ - the word MSF on the first line of the file, and the characters ..

+ at the end of this line.

+</LI>

+</UL>

+

+

+GCG/RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of

+the file.
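
Note on the help text above (not part of the patch): the recognition rules can be read as a short chain of prefix checks. The Python sketch below follows only the rules the help text spells out. `guess_format` is a hypothetical name, the order of the checks is an assumption, and the text gives no rule for telling FASTA from NBRF/PIR or for GDE flat files, so those cases are not resolved here.

```python
def guess_format(text):
    """Guess a sequence file format from the start of its contents (hypothetical helper)."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""

    if text.startswith(">"):
        # The help text gives the same rule for both formats.
        return "FASTA or NBRF/PIR"
    if text.startswith("ID"):
        return "EMBL/SWISSPROT"
    if text.startswith("CLUSTAL"):
        return "CLUSTAL"
    if text.startswith("!!RICH_SEQUENCE"):
        return "GCG/RSF"
    if text.startswith(("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")):
        return "GCG/MSF"
    # Third MSF rule: "MSF" on the first line, and ".." at the end of that line.
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG/MSF"
    return "unknown"
```

As the help text warns, checks of this kind are not foolproof: any file that happens to begin with the letters "ID" would be taken for EMBL/Swiss-Prot.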

+

+

+

+

+If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the

+sequence will be assumed to be nucleotide. This works in 97.3% of cases but

+watch out!
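
Note on the help text above (not part of the patch): the 85% rule amounts to a single ratio test. The sketch below is an illustration, not the program's code. `looks_like_nucleotide` is a hypothetical name, and counting only alphabetic characters is an assumption based on the earlier statement that non-alphabetic characters are ignored.

```python
def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    # Assumption: gaps, digits and punctuation do not count towards the ratio.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if c in "ACGTUN")
    return hits / len(letters) >= threshold
```

A short protein made mostly of A, C, G, T and N residues would be misclassified by this test, which is the kind of case the help text warns about.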

+

+

+APPEND SEQUENCES is only valid in MULTIPLE ALIGNMENT MODE. The input sequences

+do not replace those already loaded, but are appended at the end of the

+alignment.

+

+

+SAVE SEQUENCES AS... offers the user a choice of one of six output formats:

+CLUSTAL, NBRF/PIR, GCG/MSF, PHYLIP, NEXUS or GDE. All sequences are written

+to a single file. Options are available to save a range of the alignment,

+switch between UPPER/LOWER case for GDE files, and to output SEQUENCE NUMBERING

+for CLUSTAL files.

+

+

+LOAD PROFILE 1 reads sequences in the same 7 file formats, replacing any

+sequences already loaded as Profile 1. This option will also remove any

+sequences which are loaded in Profile 2.

+

+

+LOAD PROFILE 2 reads sequences in the same 7 file formats, replacing any

+sequences already loaded as Profile 2.

+

+

+SAVE PROFILE 1 AS... is similar to the Save Sequences option except that only

+those sequences in Profile 1 will be written to the output file.

+

+

+SAVE PROFILE 2 AS... is similar to the Save Sequences option except that only

+those sequences in Profile 2 will be written to the output file.

+

+

+WRITE ALIGNMENT AS POSTSCRIPT will write the sequence display to a postscript

+format file. This will include any secondary structure / gap penalty mask

+information and the consensus and ruler lines which are displayed on the

+screen. The Alignment Quality curve can be optionally included in the output

+file.

+

+

+WRITE PROFILE 1 AS POSTSCRIPT is similar to WRITE ALIGNMENT AS POSTSCRIPT

+except that only the profile 1 display will be printed.

+

+

+WRITE PROFILE 2 AS POSTSCRIPT is similar to WRITE ALIGNMENT AS POSTSCRIPT

+except that only the profile 2 display will be printed.

+

+

+

+

+<H4>

+POSTSCRIPT PARAMETERS

+</H4>

+

+

+A number of options are available to allow you to configure your postscript

+output file.

+

+

+PS COLORS FILE:

+

+

+The exact RGB values required to reproduce the colors used in the alignment

+window will vary from printer to printer. A PS colors file can be specified

+that contains the RGB values for all the colors required by each of your

+postscript printers.

+

+

+By default, Clustal X looks for a file called 'colprint.par' in the current

+directory (if your running under UNIX, it then looks in your home directory,

+and finally in the directories in your PATH environment variable). If no PS

+colors file is found or a color used on the screen is not defined here, the

+screen RGB values (from the Color Parameter File) are used.

+

+

+The PS colors file consists of one line for each color to be defined, with the

+color name followed by the RGB values (on a scale of 0 to 1). For example,

+

+

+RED 0.9 0.1 0.1

+

+

+Blank lines and comments (lines beginning with a '#' character) are ignored.
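
Note on the help text above (not part of the patch): the PS colors file format is simple enough to parse in a few lines. The Python sketch below reads the format as described (one color per line, name followed by three values between 0 and 1, blank lines and '#' comments skipped). `parse_ps_colors` is a hypothetical name, and rejecting out-of-range values with an error is an assumption; the help text does not say how ClustalX handles them.

```python
def parse_ps_colors(text):
    """Parse PS colors file contents into {name: (r, g, b)} (hypothetical helper)."""
    colors = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment lines are ignored, as the help text states.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected NAME R G B")
        name = fields[0]
        rgb = tuple(float(value) for value in fields[1:])
        # Assumption: values outside the documented 0..1 scale are an error.
        if not all(0.0 <= value <= 1.0 for value in rgb):
            raise ValueError(f"line {line_number}: RGB values must be between 0 and 1")
        colors[name] = rgb
    return colors
```

With the example line from the help text, `parse_ps_colors("RED 0.9 0.1 0.1")` yields a single entry for `RED`.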

+

+

+

+

+PAGE SIZE: The alignment can be displayed on either A4, A3 or US Letter size

+pages.

+

+

+ORIENTATION: The alignment can be displayed on either a landscape or portrait

+page.

+

+

+PRINT HEADER: An optional header including the postscript filename, and

+creation date can be printed at the top of each page.

+

+

+PRINT QUALITY CURVE: The Alignment Quality curve which is displayed underneath

+the alignment on the screen can be included in the postscript output.

+

+

+PRINT RULER: The ruler which is displayed underneath the alignment on the

+screen can be included in the postscript output.

+

+

+PRINT RESIDUE NUMBERS: Sequence residue numbers can be printed at the right

+hand side of the alignment.

+

+

+RESIZE TO FIT PAGE: By default, the alignment is scaled to fit the page size

+selected. This option can be turned off, in which case a font size of 10 will

+be used for the sequences.

+

+

+PRINT FROM POSITION/TO: A range of the alignment can be printed. The default

+is to print the full alignment. The first and last residues to be printed are

+specified here.

+

+

+USE BLOCK LENGTH: The alignment can be divided into blocks of residues. The

+number of residues in a block is specified here. More than one block may then

+be printed on a single page. This is useful for long alignments of a small

+number of sequences. If the block length is set to 0, The alignment will not

+be divided into blocks, but printed across a number of pages.

+

+

+

+<A HREF="#INDEX"> Back to Index </A>

+<CENTER><H2><A NAME="E"> Editing Alignments

+</A></H2></CENTER>

+

+

+

+Clustal X allows you to change the order of the sequences in the alignment, by

+cutting-and-pasting the sequence names.

+

+

+To select a group of sequences to be moved, click on a sequence name and drag

+the cursor until all the required sequences are highlighted. Holding down the

+Shift key when clicking on the first name will add new sequences to those

+already selected.

+

+

+(Options are provided to Select All Sequences, Select Profile 1 or Select

+Profile 2.)

+

+

+The selected sequences can be removed from the alignment by using the EDIT

+menu, CUT option.

+

+

+To add the cut sequences back into an alignment, select a sequence by clicking

+on the sequence name. The cut sequences will be added to the alignment,

+immediately following the selected sequence, by the EDIT menu, PASTE option.

+

+

+To add the cut sequences to an empty alignment (eg. when cutting sequences from

+Profile 1 and pasting them to Profile 2), click on the empty sequence name

+display area, and select the EDIT menu, PASTE option as before.

+

+

+The sequence selection and sequence range selection can be cleared using the

+EDIT menu, CLEAR SEQUENCE SELECTION and CLEAR RANGE SELECTION options

+respectively.

+

+

+To search for a string of residues in the sequences, select the sequences to be

+searched by clicking on the sequence names. You can then enter the string to

+search for by selecting the SEARCH FOR STRING option. If the string is found in

+any of the sequences selected, the sequence name and column number is printed

+below the sequence display.

+

+

+In PROFILE ALIGNMENT MODE, the two profiles can be merged (normally done after

+alignment) by selecting ADD PROFILE 2 TO PROFILE 1. The sequences currently

+displayed as Profile 2 will be appended to Profile 1.

+

+

+The REMOVE ALL GAPS option will remove all gaps from the sequences currently

+selected.

+WARNING: This option removes ALL gaps, not only those introduced by ClustalX,

+but also those that were read from the input alignment file. Any secondary

+structure information associated with the alignment will NOT be automatically

+realigned.

+

+

+The REMOVE GAP-ONLY COLUMNS will remove those positions in the alignment which

+contain gaps in all sequences. This can occur as a result of removing divergent

+sequences from an alignment, or if an alignment has been realigned.
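
Note on the help text above (not part of the patch): removing gap-only columns is a column filter over the alignment. The sketch below shows the idea under simple assumptions. `remove_gap_only_columns` is a hypothetical name, the alignment is taken to be a list of equal-length strings, and "-" is assumed to be the only gap character.

```python
def remove_gap_only_columns(alignment, gap="-"):
    """Drop every column in which all sequences have a gap (hypothetical helper)."""
    # Keep the index of each column that has at least one non-gap character.
    keep = [
        index
        for index, column in enumerate(zip(*alignment))
        if any(char != gap for char in column)
    ]
    return ["".join(sequence[index] for index in keep) for sequence in alignment]
```

Unlike REMOVE ALL GAPS, this keeps the remaining columns aligned: `remove_gap_only_columns(["A--", "-C-"])` drops only the last column.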

+

+

+

+<A HREF="#INDEX"> Back to Index </A>

+<CENTER><H2><A NAME="M"> Multiple Alignments

+</A></H2></CENTER>

+

+

+

+Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above

+the sequence display area. Then, use the ALIGNMENT menu to do multiple

+alignments.

+

+

+Multiple alignments are carried out in 3 stages:

+

+

+1) all sequences are compared to each other (pairwise alignments);

+

+

+2) a dendrogram (like a phylogenetic tree) is constructed, describing the

+approximate groupings of the sequences by similarity (stored in a file).

+

+

+3) the final multiple alignment is carried out, using the dendrogram as a guide.

+

+

+The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option.

+You can skip the first stages (pairwise alignments; guide tree) by using an old

+guide tree file (DO ALIGNMENT FROM GUIDE TREE); or you can just produce the

+guide tree with no final multiple alignment (PRODUCE GUIDE TREE ONLY).

+

+

+

+

+REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the

+alignment. Sequences can be selected by clicking on the sequence names - see

+Editing Alignments for more details. The unselected sequences are then 'fixed'

+and a profile is made including only the unselected sequences. Each of the

+selected sequences in turn is then realigned to this profile. The realigned

+sequences will be displayed as a group at the end the alignment.

+

+

+

+

+REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the

+alignment. A residue range can be selected by clicking on the sequence display

+area. A multiple alignment is then performed, following the 3 stages described

+above, but only using the selected residue range. Finally the new alignment of

+the range is pasted back into the full sequence alignment.

+

+

+By default, gap penalties are used at each end of the subrange in order to

+penalise terminal gaps. If the REALIGN SEGMENT END GAP PENALTIES option is

+switched off, gaps can be introduced at the ends of the residue range at no

+cost.

+

+

+

+

+ALIGNMENT PARAMETERS displays a sub-menu with the following options:

+

+

+RESET NEW GAPS BEFORE ALIGNMENT will remove any new gaps introduced into the

+sequences during multiple alignment if you wish to change the parameters and

+try again. This only takes effect just before you do a second multiple

+alignment. You can make phylogenetic trees after alignment whether or not this

+is ON. If you turn this OFF, the new gaps are kept even if you do a second

+multiple alignment. This allows you to iterate the alignment gradually.

+Sometimes, the alignment is improved by a second or third pass.

+

+

+RESET ALL GAPS BEFORE ALIGNMENT will remove all gaps in the sequences including

+gaps which were read in from the sequence input file. This only takes effect

+just before you do a second multiple alignment. You can make phylogenetic

+trees after alignment whether or not this is ON. If you turn this OFF, all

+gaps are kept even if you do a second multiple alignment. This allows you to

+iterate the alignment gradually. Sometimes, the alignment is improved by a

+second or third pass.

+

+

+

+

+PAIRWISE ALIGNMENT PARAMETERS control the speed/sensitivity of the initial

+alignments.

+

+

+MULTIPLE ALIGNMENT PARAMETERS control the gaps in the final multiple

+alignments.

+

+

+PROTEIN GAP PARAMETERS displays a temporary window which allows you to set

+various parameters only used in the alignment of protein sequences.

+

+

+(SECONDARY STRUCTURE PARAMETERS, for use with the Profile Alignment Mode only,

+allows you to set various parameters only used with gap penalty masks.)

617

+

618

+

619

+SAVE LOG FILE will write the alignment calculation scores to a file. The log

620

+filename is the same as the input sequence filename, with an extension .log

621

+appended.

622

+

623

+

624

+

625

+

626

+<H4>

627

+OUTPUT FORMAT OPTIONS

628

+</H4>

629

+

630

+

631

+You can choose from 6 different alignment formats (CLUSTAL, GCG, NBRF/PIR,

632

+PHYLIP, GDE and NEXUS). You can choose more than one (or all 6 if you wish).

633

+

634

+

635

+CLUSTAL format output is a self-explanatory alignment format. It shows the

636

+sequences aligned in blocks. It can be read in again at a later date to (for

637

+example) calculate a phylogenetic tree or add in new sequences by profile

638

+alignment.

639

+

640

+

641

+GCG output can be used by any of the GCG programs that can work on multiple

642

+alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG

643

+.msf format files (multiple sequence file); new in version 7 of GCG.

644

+

645

+

646

+NEXUS format is used by several phylogeny programs, including PAUP and

647

+MacClade.

648

+

649

+

650

+PHYLIP format output can be used for input to the PHYLIP package of Joe

651

+Felsenstein. This is a very widely used package for doing every imaginable

652

+form of phylogenetic analysis (MUCH more than the modest introduction

653

+offered by this program).

654

+

655

+

656

+NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap

657

+characters "-" are used to indicate the positions of gaps in the multiple

658

+alignment. These files can be re-used as input in any part of clustal that

659

+allows sequences (or alignments or profiles) to be read in.

660

+

661

+

662

+GDE: this format is used by the GDE package of Steven Smith and is understood

663

+by SEQLAB in GCG 9 or later.

664

+

665

+

666

+GDE OUTPUT CASE: sequences in GDE format may be written in either upper or

667

+lower case.

668

+

669

+

670

+CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the

671

+alignment lines in clustalw format.

672

+

673

+

674

+OUTPUT ORDER is used to control the order of the sequences in the output

675

+alignments. By default, it uses the order in which the sequences were aligned

676

+(from the guide tree/dendrogram), thus automatically grouping closely related

677

+sequences. It can be switched to be the same as the original input order.

678

+

679

+

680

+PARAMETER OUTPUT: This option will save all your parameter settings in a

681

+parameter file (suffix .par) during alignment. The file can be subsequently

682

+used to rerun ClustalW using the same parameters.

683

+

684

+

685

+

686

+

687

+<H3>

688

+ALIGNMENT PARAMETERS

689

+</H3>

690

+

691

+

692

+

693

+PAIRWISE ALIGNMENT PARAMETERS

694

+

695

+

696

+

697

+A distance is calculated between every pair of sequences and these are used to

698

+construct the phylogenetic tree which guides the final multiple alignment. The

699

+scores are calculated from separate pairwise alignments. These can be

700

+calculated using 2 methods: dynamic programming (slow but accurate) or by the

701

+method of Wilbur and Lipman (extremely fast but approximate).

702

+

703

+

704

+You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS

705

+option. The slow/accurate method is fast enough for short sequences but will be

706

+VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.

707

+

708

+

709

+

710

+

711

+

712

+SLOW-ACCURATE alignment parameters:

713

+

714

+

715

+

716

+These parameters do not have any effect on the speed of the alignments. They

717

+are used to give initial alignments which are then rescored to give percent

718

+identity scores. These % scores are the ones which are displayed on the

719

+screen. The scores are converted to distances for the trees.
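
The rescoring step described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and both the decision to skip gapped columns and the conversion `1 - identity/100` are assumptions, not a statement of what Clustal X does internally.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not scored
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Assumed conversion: a 100% identical pair has distance 0, a 0% pair has distance 1."""
    return 1.0 - pid / 100.0
```

For example, `identity_to_distance(percent_identity("ACGT", "ACGA"))` gives a distance of 0.25 for two sequences that agree at three of four positions.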

720

+

721

+

722

+Gap Open Penalty: the penalty for opening a gap in the alignment.

723

+

724

+

725

+Gap Extension Penalty: the penalty for extending a gap by 1 residue.

726

+

727

+

728

+Protein Weight Matrix: the scoring table which describes the similarity of

729

+each amino acid to each other.

730

+

731

+

732

+Load protein matrix: allows you to read in a comparison table from a file.

733

+

734

+

735

+DNA weight matrix: the scores assigned to matches and mismatches (including

736

+IUB ambiguity codes).

737

+

738

+

739

+Load DNA matrix: allows you to read in a comparison table from a file.

740

+

741

+

742

+See the Multiple alignment parameters, MATRIX option below for details of the

743

+matrix input format.

744

+

745

+

746

+

747

+

748

+

749

+FAST-APPROXIMATE alignment parameters:

750

+

751

+

752

+

753

+These similarity scores are calculated from fast, approximate, global

754

+alignments, which are controlled by 4 parameters. 2 techniques are used to make

755

+these alignments very fast: 1) only exactly matching fragments (k-tuples) are

756

+considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)

757

+are used.

758

+

759

+

760

+GAP PENALTY: This is a penalty for each gap in the fast alignments. It has

761

+little effect on the speed or sensitivity except for extreme values.

762

+

763

+

764

+K-TUPLE SIZE: This is the size of exactly matching fragment that is used.

765

+INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.

766

+For longer sequences (e.g. >1000 residues) you may wish to increase the

767

+default.

768

+

769

+

770

+TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary

771

+dot-matrix plot) is calculated. Only the best ones (with most matches) are used

772

+in the alignment. This parameter specifies how many. Decrease for speed;

773

+increase for sensitivity.

774

+

775

+

776

+WINDOW SIZE: This is the number of diagonals around each of the 'best'

777

+diagonals that will be used. Decrease for speed; increase for sensitivity.
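
The K-TUPLE SIZE and TOP DIAGONALS ideas can be illustrated with a small sketch. It counts exact k-tuple matches on each diagonal of an imaginary dot-matrix plot and then keeps the diagonals with the most matches. The function names and the convention that a diagonal is identified by `i - j` are assumptions made for this example; the real implementation may differ.

```python
from collections import Counter


def diagonal_ktuple_counts(seq_a: str, seq_b: str, k: int) -> Counter:
    """Count exact k-tuple matches per diagonal (diagonal = position in a - position in b)."""
    positions = {}
    for j in range(len(seq_b) - k + 1):
        positions.setdefault(seq_b[j:j + k], []).append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def top_diagonals(counts: Counter, n: int) -> list:
    """Return the n diagonals with the most k-tuple matches (ties broken by diagonal index)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [diagonal for diagonal, _ in ranked[:n]]
```

A larger `k` produces fewer candidate matches (faster, less sensitive), and a smaller `n` keeps fewer diagonals, which mirrors the speed/sensitivity trade-off described above.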

778

+

779

+

780

+

781

+

782

+

783

+MULTIPLE ALIGNMENT PARAMETERS

784

+

785

+

786

+

787

+These parameters control the final multiple alignment. This is the core of the

788

+program and the details are complicated. To fully understand the use of the

789

+parameters and the scoring system, you will have to refer to the documentation.

790

+

791

+

792

+Each step in the final multiple alignment consists of aligning two alignments

793

+or sequences. This is done progressively, following the branching order in the

794

+GUIDE TREE. The basic parameters to control this are two gap penalties and the

795

+scores for various identical/non-identical residues.

796

+

797

+

798

+The GAP OPENING and EXTENSION PENALTIES can be set here. These control the

799

+cost of opening up every new gap and the cost of every item in a gap.

800

+Increasing the gap opening penalty will make gaps less frequent. Increasing

801

+the gap extension penalty will make gaps shorter. Terminal gaps are not

802

+penalised.
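
As a minimal sketch of how the two penalties combine, assuming a simple affine model in which every residue position in the gap pays the extension penalty (the text does not say whether the first position is also charged, so this is an assumption):

```python
def gap_cost(length: int, gap_open: float, gap_extend: float, terminal: bool = False) -> float:
    """Cost of one gap under an assumed affine model: open + extend * length."""
    if length <= 0 or terminal:
        return 0.0  # no gap, or a terminal gap, which is not penalised
    return gap_open + gap_extend * length
```

With `gap_open=10.0` and `gap_extend=0.5`, one gap of length 4 costs 12.0 while two gaps of length 2 cost 22.0, which is why raising the opening penalty makes gaps less frequent.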

803

+

804

+

805

+The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most distantly

806

+related sequences until after the most closely related sequences have been

807

+aligned. The setting shows the percent identity level required to delay the

808

+addition of a sequence; sequences that are less identical than this level to

809

+any other sequences will be aligned later.

810

+

811

+

812

+The TRANSITION WEIGHT gives transitions (A<-->G or C<-->T i.e. purine-purine or

813

+pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero

814

+means that the transitions are scored as mismatches, while a weight of 1 gives

815

+the transitions the match score. For distantly related DNA sequences, the

816

+weight should be near to zero; for closely related sequences it can be useful

817

+to assign a higher score. The default is set to 0.5.
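
The effect of the TRANSITION WEIGHT can be sketched as below. The text only fixes the two end points (weight 0 gives the mismatch score, weight 1 gives the match score); the linear interpolation in between, the function name and the default scores are assumptions for illustration.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def dna_pair_score(x: str, y: str, match_score: float = 1.0,
                   mismatch_score: float = 0.0,
                   transition_weight: float = 0.5) -> float:
    """Score one aligned DNA pair, giving transitions a weighted share of the match score."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match_score
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        # Transition (A<->G or C<->T): interpolate between mismatch and match.
        return mismatch_score + transition_weight * (match_score - mismatch_score)
    return mismatch_score  # transversion
```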

818

+

819

+

820

+

821

+

822

+The PROTEIN WEIGHT MATRIX option allows you to choose a series of weight

823

+matrices. For protein alignments, you use a weight matrix to determine the

824

+similarity of non-identical amino acids. For example, Tyr aligned with Phe is

825

+usually judged to be 'better' than Tyr aligned with Pro.

826

+

827

+

828

+There are three 'in-built' series of weight matrices offered. Each consists of

829

+several matrices which work differently at different evolutionary distances. To

830

+see the exact details, read the documentation. Crudely, we store several

831

+matrices in memory, spanning the full range of amino acid distance (from almost

832

+identical sequences to highly divergent ones). For very similar sequences, it

833

+is best to use a strict weight matrix which only gives a high score to

834

+identities and the most favoured conservative substitutions. For more divergent

835

+sequences, it is appropriate to use "softer" matrices which give a high score

836

+to many other frequent substitutions.

837

+

838

+

839

+1) BLOSUM (Henikoff). These matrices appear to be the best available for

840

+carrying out database similarity (homology) searches. The matrices currently

841

+used are: Blosum 80, 62, 45 and 30. BLOSUM was the default in earlier Clustal X

842

+versions.

843

+

844

+

845

+2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We

846

+currently use the PAM 20, 60, 120, 350 matrices.

847

+

848

+

849

+3) GONNET. These matrices were derived using almost the same procedure as the

850

+Dayhoff one (above) but are much more up to date and are based on a far larger

851

+data set. They appear to be more sensitive than the Dayhoff series. We

852

+currently use the GONNET 80, 120, 160, 250 and 350 matrices. This series is the

853

+default for Clustal X version 1.8.

854

+

855

+

856

+We also supply an identity matrix which gives a score of 10 to two identical

857

+amino acids and a score of zero otherwise. This matrix is not very useful.

858

+

859

+

860

+Load protein matrix: allows you to read in a comparison matrix from a file.

861

+This can be either a single matrix or a series of matrices (see below for

862

+format).

863

+

864

+

865

+

866

+

867

+DNA WEIGHT MATRIX option allows you to select a single matrix (not a series)

868

+used for aligning nucleic acid sequences. Two hard-coded matrices are available:

869

+

870

+

871

+1) IUB. This is the default scoring matrix used by BESTFIT for the comparison

872

+of nucleic acid sequences. X's and N's are treated as matches to any IUB

873

+ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

874

+

875

+

876

+2) CLUSTALW(1.6). A previous system used by ClustalW, in which matches score

877

+1.0 and mismatches score 0. All matches for IUB symbols also score 0.

878

+

879

+

880

+Load DNA matrix: allows you to read in a nucleic acid comparison matrix from a

881

+file (just one matrix, not a series).

882

+

883

+

884

+

885

+

886

+SINGLE MATRIX INPUT FORMAT

887

+The format used for a single matrix is the same as the BLAST program. The

888

+scores in the new weight matrix should be similarities. You can use negative as

889

+well as positive values if you wish, although the matrix will be automatically

890

+adjusted to all positive scores, unless the NEGATIVE MATRIX option is selected.

891

+Any lines beginning with a # character are assumed to be comments. The first

892

+non-comment line should contain a list of amino acids in any order, using the 1

893

+letter code, followed by a * character. This should be followed by a square

894

+matrix of scores, with one row and one column for each amino acid. The last row

895

+and column of the matrix (corresponding to the * character) contain the minimum

896

+score over the whole matrix.
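
A reader for the single matrix format described above might look like the following sketch. It is not the parser used by Clustal X; the function name is hypothetical, and it assumes that each matrix row may optionally start with its residue letter, as BLAST matrix files usually do.

```python
def parse_single_matrix(text: str):
    """Parse a BLAST-style matrix into (symbols, {(row, col): score})."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]  # skip blanks and comments
    if not lines:
        raise ValueError("no matrix data found")
    symbols = lines[0].split()  # residue letters, normally ending with '*'
    scores = {}
    for row_index, line in enumerate(lines[1:len(symbols) + 1]):
        fields = line.split()
        if len(fields) == len(symbols) + 1:
            fields = fields[1:]  # drop the leading row label
        if len(fields) != len(symbols):
            raise ValueError(f"row {row_index + 1} has the wrong number of scores")
        for col_symbol, value in zip(symbols, fields):
            scores[(symbols[row_index], col_symbol)] = float(value)
    if len(scores) != len(symbols) ** 2:
        raise ValueError("matrix is not square")
    return symbols, scores
```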

897

+

898

+

899

+MATRIX SERIES INPUT FORMAT

900

+ClustalX uses different matrices depending on the mean percent identity of the

901

+sequences to be aligned. You can specify a series of matrices and the range of

902

+the percent identity for each matrix in a matrix series file. The file is

903

+automatically recognised by the word CLUSTAL_SERIES at the beginning of the

904

+file. Each matrix in the series is then specified on one line which should

905

+start with the word MATRIX. This is followed by the lower and upper limits of

906

+the sequence percent identities for which you want to apply the matrix. The

907

+final entry on the matrix line is the filename of a Blast format matrix file

908

+(see above for details of the single matrix file format).

909

+

910

+

911

+Example.

912

+

913

+

914

+CLUSTAL_SERIES

915

+

916

+

917

+MATRIX 81 100 /us1/user/julie/matrices/blosum80

918

+MATRIX 61 80 /us1/user/julie/matrices/blosum62

919

+MATRIX 31 60 /us1/user/julie/matrices/blosum45

920

+MATRIX 0 30 /us1/user/julie/matrices/blosum30
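
The series file layout in the example above can be read with a few lines of code. This is a hedged sketch with hypothetical function names; it assumes integer percent identity limits and whitespace-separated fields, and it does not open the matrix files themselves.

```python
def parse_matrix_series(text: str):
    """Return a list of (lower, upper, filename) tuples from a CLUSTAL_SERIES file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "CLUSTAL_SERIES":
        raise ValueError("file does not start with CLUSTAL_SERIES")
    series = []
    for line in lines[1:]:
        fields = line.split(None, 3)
        if len(fields) != 4 or fields[0] != "MATRIX":
            raise ValueError(f"unexpected line: {line!r}")
        series.append((int(fields[1]), int(fields[2]), fields[3]))
    return series


def matrix_for_identity(series, percent_identity: int):
    """Pick the matrix file whose range contains the given integer percent identity."""
    for lower, upper, filename in series:
        if lower <= percent_identity <= upper:
            return filename
    return None  # no range covers this value
```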

921

+

922

+

923

+

924

+

925

+

926

+PROTEIN GAP PARAMETERS

927

+

928

+

929

+

930

+RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or

931

+increase the gap opening penalties at each position in the alignment or

932

+sequence. See the documentation for details. As an example, positions that are

933

+rich in glycine are more likely to have an adjacent gap than positions that are

934

+rich in valine.

935

+

936

+

937

+HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a

938

+run (5 or more residues) of hydrophilic amino acids; these are likely to be

939

+loop or random coil regions where gaps are more common. The residues that are

940

+"considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.

941

+

942

+

943

+GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close

944

+to each other. Gaps that are less than this distance apart are penalised more

945

+than other gaps. This does not prevent close gaps; it makes them less frequent,

946

+promoting a block-like appearance of the alignment.

947

+

948

+

949

+END GAP SEPARATION treats end gaps just like internal gaps for the purposes of

950

+avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you

951

+turn this off, end gaps will be ignored for this purpose. This is useful when

952

+you wish to align fragments where the end gaps are not biologically meaningful.

953

+

954

+

955

+

956

+

957

+

958

+<A HREF="#INDEX"> Back to Index </A>

959

+<CENTER><H2><A NAME="P"> Profile and Structure Alignments

960

+</A></H2></CENTER>

961

+

962

+

963

+

964

+By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile

965

+alignments allow you to store alignments of your favourite sequences and add

966

+new sequences to them in small bunches at a time. A profile is simply an

967

+alignment of one or more sequences (e.g. an alignment output file from Clustal

968

+X). Each input can be a single sequence. One or both sets of input sequences

969

+may include secondary structure assignments or gap penalty masks to guide the

970

+alignment.

971

+

972

+

973

+Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above

974

+the sequence display area. Then, use the ALIGNMENT menu to do profile and

975

+secondary structure alignments.

976

+

977

+

978

+The profiles can be in any of the allowed input formats with "-" characters

979

+used to specify gaps (except for GCG/MSF where "." is used).

980

+

981

+

982

+You have to load the 2 profiles by choosing FILE, LOAD PROFILE 1 and LOAD

983

+PROFILE 2. Then ALIGNMENT, ALIGN PROFILE 2 TO PROFILE 1 will align the 2

984

+profiles to each other. Secondary structure masks in either profile can be used

985

+to guide the alignment. This option compares all the sequences in profile 1

986

+with all the sequences in profile 2 in order to build guide trees which will be

987

+used to calculate sequence weights, and select appropriate alignment parameters

988

+for the final profile alignment.

989

+

990

+

991

+You can skip the first stage (pairwise alignments; guide trees) by using old

992

+guide tree files (ALIGN PROFILES FROM GUIDE TREES).

993

+

994

+

995

+The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second

996

+profile and align them to the first profile, 1 at a time. This is useful to

997

+add some new sequences to an existing alignment, or to align a set of sequences

998

+to a known structure. In this case, the second profile set need not be

999

+pre-aligned.

1000

+

1001

+

1002

+You can skip the first stage (pairwise alignments; guide tree) by using an old

1003

+guide tree file (ALIGN SEQUENCES TO PROFILE 1 FROM TREE).

1004

+

1005

+

1006

+SAVE LOG FILE will write the alignment calculation scores to a file. The log

1007

+filename is the same as the input sequence filename, with an extension .log

1008

+appended.

1009

+

1010

+

1011

+The alignment parameters can be set using the ALIGNMENT PARAMETERS menu,

1012

+Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options.

1013

+These are EXACTLY the same parameters as used by the general, automatic

1014

+multiple alignment procedure. The general multiple alignment procedure is

1015

+simply a series of profile alignments. Carrying out a series of profile

1016

+alignments on larger and larger groups of sequences allows you to manually

1017

+build up a complete alignment, if necessary editing intermediate alignments.

1018

+

1019

+

1020

+

1021

+SECONDARY STRUCTURE PARAMETERS

1022

+

1023

+

1024

+

1025

+Use this menu to set secondary structure options. If a solved structure is

1026

+known, it can be used to guide the alignment by raising gap penalties within

1027

+secondary structure elements, so that gaps will preferentially be inserted into

1028

+unstructured surface loop regions. Alternatively, a user-specified gap penalty

1029

+mask can be supplied for a similar purpose.

1030

+

1031

+

1032

+A gap penalty mask is a series of numbers between 1 and 9, one per position in

1033

+the alignment. Each number specifies how much the gap opening penalty is to be

1034

+raised at that position (raised by multiplying the basic gap opening penalty

1035

+by the number) i.e. a mask figure of 1 at a position means no change

1036

+in gap opening penalty; a figure of 4 means that the gap opening penalty is

1037

+four times greater at that position, making gaps 4 times harder to open.
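
The multiplication described above can be written out as a short sketch (the function name is hypothetical):

```python
def apply_gap_penalty_mask(mask: str, gap_open: float) -> list:
    """Return the per-position gap opening penalty: base penalty times the mask digit."""
    if not mask or any(ch not in "123456789" for ch in mask):
        raise ValueError("mask must consist of digits 1-9, one per alignment position")
    return [gap_open * int(ch) for ch in mask]
```

For a base gap opening penalty of 10.0, the mask `1124` yields penalties of 10.0, 10.0, 20.0 and 40.0 at the four positions.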

1038

+

1039

+

1040

+The format for gap penalty masks and secondary structure masks is explained in

1041

+a separate help section.

1042

+

1043

+

1044

+

1045

+<A HREF="#INDEX"> Back to Index </A>

1046

+<CENTER><H2><A NAME="B"> Secondary Structure / Gap Penalty Masks

1047

+</A></H2></CENTER>

1048

+

1049

+

1050

+

1051

+The use of secondary structure-based penalties has been shown to improve the

1052

+accuracy of sequence alignment. Clustal X now allows secondary structure / gap

1053

+penalty masks to be supplied with the input sequences used during profile

1054

+alignment. (NB. The secondary structure information is NOT used during multiple

1055

+sequence alignment). The masks work by raising gap penalties in specified

1056

+regions (typically secondary structure elements) so that gaps are

1057

+preferentially opened in the less well conserved regions (typically surface

1058

+loops).

1059

+

1060

+

1061

+The USE PROFILE 1(2) SECONDARY STRUCTURE / GAP PENALTY MASK options control

1062

+whether the input 2D-structure information or gap penalty masks will be used

1063

+during the profile alignment.

1064

+

1065

+

1066

+The OUTPUT options control whether the secondary structure and gap penalty

1067

+masks should be included in the Clustal X output alignments. Showing both is

1068

+useful for understanding how the masks work. The 2D-structure information is

1069

+itself useful in judging the alignment quality and in seeing how residue

1070

+conservation patterns vary with secondary structure.

1071

+

1072

+

1073

+The HELIX and STRAND GAP PENALTY options provide the value for raising the gap

1074

+penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL

1075

+format, capital residues denote the A and B core structure notation. Basic gap

1076

+penalties are multiplied by the amount specified.

1077

+

1078

+

1079

+The LOOP GAP PENALTY option provides the value for the gap penalty in Loops.

1080

+By default this penalty is not raised. In CLUSTAL format, loops are specified

1081

+by "." in the secondary structure notation.

1082

+

1083

+

1084

+The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap

1085

+penalty at the ends of secondary structures. Ends of secondary structures are

1086

+known to grow or shrink, comparing related structures. Therefore by default

1087

+these are given intermediate values, lower than the core penalties. All

1088

+secondary structure read in as lower case in CLUSTAL format gets the reduced

1089

+terminal penalty.

1090

+

1091

+

1092

+The HELIX and STRAND TERMINAL POSITIONS options specify the range of structure

1093

+termini for the intermediate penalties. In the alignment output, these are

1094

+indicated as lower case. For Alpha Helices, by default, the range spans the

1095

+end-helical turn (3 residues). For Beta Strands, the default range spans the

1096

+end residue and the adjacent loop residue, since sequence conservation often

1097

+extends beyond the actual H-bonded Beta Strand.

1098

+

1099

+

1100

+Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input

1101

+files. For many 3-D protein structures, secondary structure information is

1102

+recorded in the feature tables of SWISS-PROT database entries. You should

1103

+always check that the assignments are correct - some are quite inaccurate.

1104

+Clustal X looks for SWISS-PROT HELIX and STRAND assignments e.g.

1105

+

1106

+

1107

+

1108

+

1109

+<PRE>

1110

+FT HELIX 100 115

1111

+FT STRAND 118 119

1112

+</PRE>
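
A minimal sketch of picking out such assignments is given below. It assumes whitespace-separated fields exactly as in the example above; real SWISS-PROT entries use a column-based layout with additional fields, so treat this as an illustration rather than a full parser. The function name is hypothetical.

```python
def parse_ft_structure(lines):
    """Collect (kind, start, end) tuples for FT HELIX and FT STRAND lines."""
    features = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "FT" and fields[1] in ("HELIX", "STRAND"):
            features.append((fields[1], int(fields[2]), int(fields[3])))
    return features
```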

1113

+

1114

+

1115

+The structure and penalty masks can also be read from CLUSTAL alignment format

1116

+as comment lines beginning "!SS_" or "!GM_" e.g.

1117

+

1118

+

1119

+<PRE>

1120

+!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA

1121

+!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444

1122

+HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

1123

+</PRE>

1124

+

1125

+

1126

+Note that the mask itself is a set of numbers between 1 and 9 each of which is

1127

+assigned to the residue(s) in the same column below.
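
In the example above the "!GM_" line looks like it follows from the "!SS_" line: loops (".") map to 1, lower-case termini to 2 and upper-case core residues to 4. The sketch below reproduces that mapping. The numeric defaults are read off the example and are an assumption, as is the function name; the actual values are whatever is set in the SECONDARY STRUCTURE PARAMETERS menu.

```python
def structure_to_mask(ss: str, helix: int = 4, strand: int = 4,
                      terminal: int = 2, loop: int = 1) -> str:
    """Translate a CLUSTAL secondary structure string into a gap penalty mask string."""
    table = {"A": helix, "B": strand, "a": terminal, "b": terminal, ".": loop}
    try:
        return "".join(str(table[ch]) for ch in ss)
    except KeyError as exc:
        raise ValueError(f"unexpected structure symbol: {exc.args[0]!r}") from None
```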

1128

+

1129

+

1130

+In GDE flat file format, the masks are specified as text and the names must

1131

+begin with "SS_" or "GM_".

1132

+

1133

+

1134

+Either a structure or penalty mask or both may be used. If both are included

1135

+in an alignment, the user will be asked which is to be used.

1136

+

1137

+

1138

+

1139

+

1140

+

1141

+<A HREF="#INDEX"> Back to Index </A>

1142

+<CENTER><H2><A NAME="T"> Phylogenetic Trees

1143

+</A></H2></CENTER>

1144

+

1145

+

1146

+

1147

+Before calculating a tree, you must have an ALIGNMENT in memory. This can be

1148

+input using the FILE menu, LOAD SEQUENCES option or you should have just

1149

+carried out a full multiple alignment and the alignment is still in memory.

1150

+Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!

1151

+

1152

+

1153

+The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First

1154

+you calculate distances (percent divergence) between all pairs of sequences from

1155

+a multiple alignment; second you apply the NJ method to the distance matrix.

1156

+

1157

+

1158

+To calculate a tree, use the DRAW N-J TREE option. This gives an UNROOTED tree

1159

+and all branch lengths. The root of the tree can only be inferred by using an

1160

+outgroup (a sequence that you are certain branches at the outside of the tree

1161

+.... certain on biological grounds) OR if you assume a degree of constancy in

1162

+the 'molecular clock', you can place the root in the 'middle' of the tree

1163

+(roughly equidistant from all tips).

1164

+

1165

+

1166

+BOOTSTRAP N-J TREE uses a method for deriving confidence values for the

1167

+groupings in a tree (first adapted for trees by Joe Felsenstein). It involves

1168

+making N random samples of sites from the alignment (N should be LARGE, e.g.

1169

+500 - 1000); drawing N trees (1 from each sample) and counting how many times

1170

+each grouping from the original tree occurs in the sample trees. You can set N

1171

+using the NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In

1172

+practice, you should use a large number of bootstrap replicates (1000 is

1173

+recommended, even if it means running the program for an hour on a slow

1174

+computer). You can also supply a seed number for the random number generator

1175

+here. Different runs with the same seed will give the same answer. See the

1176

+documentation for more details.
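
The resampling step of the bootstrap can be sketched as follows: alignment columns (sites) are drawn at random with replacement until a pseudo-alignment of the original length is obtained. The function name is hypothetical and Python's `random.Random` stands in for whatever generator the program uses; only the idea that a fixed seed reproduces the same sample carries over.

```python
import random


def bootstrap_columns(alignment, seed: int):
    """Resample alignment columns with replacement; same seed gives the same sample."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]
```

Calling this N times with different seeds and building one tree per sample gives the N sample trees against which the groupings of the original tree are counted.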

1177

+

1178

+

1179

+EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where

1180

+ANY of the sequences have a gap will be ignored. This means that 'like' will

1181

+be compared to 'like' in all distances, which is highly desirable. It also

1182

+automatically throws away the most ambiguous parts of the alignment, which are

1183

+concentrated around gaps (usually). The disadvantage is that you may throw away

1184

+much of the data if there are many gaps (which is why it is difficult for us to

1185

+make it the default).
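
As an illustration of what this option does, the sketch below removes every column in which at least one sequence has a gap. It assumes "-" is the gap character and uses a hypothetical function name.

```python
def drop_gap_columns(alignment):
    """Keep only the alignment columns in which no sequence has a gap ('-')."""
    if not alignment:
        return []
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]
```

With the input `["AC-T", "A-GT", "ACGT"]` only the first and last columns survive, which shows how quickly data is lost when gaps are scattered through the alignment.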

1186

+

1187

+

1188

+CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this option

1189

+makes no difference. For greater divergence, this option corrects for the fact

1190

+that observed distances underestimate actual evolutionary distances. This is

1191

+because, as sequences diverge, more than one substitution will happen at many

1192

+sites. However, you only see one difference when you look at the present day

1193

+sequences. Therefore, this option has the effect of stretching branch lengths

1194

+in trees (especially long branches). The corrections used here (for DNA or

1195

+proteins) are both due to Motoo Kimura. See the documentation for details.
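
To show the kind of stretching involved, the sketch below uses the protein correction commonly attributed to Kimura, K = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites. The text above only says that the corrections are due to Kimura, so the exact formula and the function name here are assumptions; consult the documentation for what the program actually applies.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Corrected distance from observed fraction of differences p (assumed Kimura formula)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The sequences are too divergent for the correction to be defined.
        raise ValueError("distance cannot be reliably corrected")
    return -math.log(inner)
```

For small p the corrected value is close to p itself, while an observed divergence of 0.5 is stretched to roughly 0.8.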

1196

+

1197

+

1198

+Where possible, this option should be used. However, for VERY divergent

1199

+sequences, the distances cannot be reliably corrected. You will be warned if

1200

+this happens. Even if none of the distances in a data set exceed the reliable

1201

+threshold, if you bootstrap the data, some of the bootstrap distances may

1202

+randomly exceed the safe limit.

1203

+

1204

+

1205

+SAVE LOG FILE will write the tree calculation scores to a file. The log

1206

+filename is the same as the input sequence filename, with an extension .log

1207

+appended.

1208

+

1209

+

1210

+<H4>

1211

+OUTPUT FORMAT OPTIONS

1212

+</H4>

1213

+

1214

+

1215

+Four different formats are allowed. None of these displays the tree visually.

1216

+You can display the tree using the NJPLOT program distributed with Clustal X

1217

+OR get the PHYLIP package and use the tree drawing facilities there.

1218

+

1219

+

1220

+1) CLUSTAL FORMAT TREE. This format is verbose and lists all of the distances

1221

+between the sequences and the number of alignment positions used for each. The

1222

+tree is described at the end of the file. It lists the sequences that are

1223

+joined at each alignment step and the branch lengths. After two sequences are

1224

+joined, the pair is referred to later as a NODE. The number of a NODE is the number

1225

+of the lowest sequence in that NODE.

1226

+

1227

+

1228

+2) PHYLIP FORMAT TREE. This format is the New Hampshire format, used by many

1229

+phylogenetic analysis packages. It consists of a series of nested parentheses,

1230

+describing the branching order, with the sequence names and branch lengths. It

1231

+can be read by the NJPLOT program distributed with ClustalX. It can also be

1232

+used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see

1233

+the trees graphically. This is the same format used during multiple alignment

1234

+for the guide trees. Some other packages that can read and display New

1235

+Hampshire format are TreeTool, TreeView, and Phylowin.
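
To make the nested-parentheses layout concrete, the sketch below renders a small tree as a New Hampshire string. The in-memory representation (a list of `(child, branch_length)` pairs, where a child is either a sequence name or another such list) is invented for this example and is not a Clustal X data structure.

```python
def to_newick(tree) -> str:
    """Render a nested list of (child, branch_length) pairs as a New Hampshire string."""
    def render(node):
        if isinstance(node, str):
            return node  # a leaf: just the sequence name
        children = ",".join(f"{render(child)}:{length}" for child, length in node)
        return f"({children})"
    return render(tree) + ";"
```

For instance, `to_newick([("A", 0.1), ("B", 0.2), ([("C", 0.3), ("D", 0.4)], 0.5)])` produces `(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);`, an unrooted tree with three branches at the top level.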

1236

+

1237

+

1238

+3) PHYLIP DISTANCE MATRIX. This format just outputs a matrix of all the

1239

+pairwise distances in a format that can be used by the PHYLIP package. It used

1240

+to be useful when one could not produce distances from protein sequences in the

1241

+Phylip package but is now redundant (PROTDIST of Phylip 3.5 now does this).

1242

+

1243

+

1244

+4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,

1245

+including PAUP and MacClade. The format is described fully in:

1246

+Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.

1247

+NEXUS: an extensible file format for systematic information.

1248

+Systematic Biology 46:590-621.

1249

+

1250

+

1251

+BOOTSTRAP LABELS ON: By default, the bootstrap values are correctly placed on

1252

+the tree branches of the phylip format output tree. The toggle allows them to

1253

+be placed on the nodes, which is incorrect, but some display packages (e.g.

1254

+TreeTool, TreeView and Phylowin) only support node labelling but not branch

1255

+labelling. Care should be taken to note which branches and labels go together.

1256

+

1257

+

1258

+

1259

+

1260

+

1261

+<A HREF="#INDEX"> Back to Index </A>

1262

+<CENTER><H2><A NAME="C"> Colors

1263

+</A></H2></CENTER>

1264

+

1265

+

1266

+

1267

+Clustal X provides a versatile coloring scheme for the sequence alignment

1268

+display. The sequences (or profiles) are colored automatically when they are

1269

+loaded. Sequences can be colored either by assigning a color to specific

1270

+residues, or on the basis of an alignment consensus. In the latter case, the

1271

+alignment consensus is calculated automatically, and the residues in each

1272

+column are colored according to the consensus character assigned to that

1273

+column. In this way, you can choose to highlight, for example, conserved

1274

+hydrophilic or hydrophobic positions in the alignment.

1275

+

1276

+

1277

+The 'rules' used to color the alignment are specified in a COLOR PARAMETER

1278

+FILE. Clustal X automatically looks for a file called 'colprot.par' for protein

1279

+sequences or 'coldna.par' for DNA, in the current directory. (If you are running

1280

+under UNIX, it then looks in your home directory, and finally in the

1281

+directories in your PATH environment variable.)

1282

+

1283

+

1284

+By default, if no color parameter file is found, protein sequences are colored

1285

+by residue as follows:

1286

+

1287

+

1288

+<PRE>

1289

+ Color Residue Code

1290

+

1291

+

1292

+ ORANGE GPST

1293

+ RED HKR

1294

+ BLUE FWY

1295

+ GREEN ILMV

1296

+</PRE>

1297

+

1298

+

1299

+In the case of DNA sequences, the default colors are as follows:

1300

+

1301

+

1302

+<PRE>

1303

+ Color Residue Code

1304

+

1305

+

1306

+ ORANGE A

1307

+ RED C

1308

+ BLUE T

1309

+ GREEN G

1310

+</PRE>

1311

+

1312

+

1313

+

1314

+

1315

+The default BACKGROUND COLORING option shows the sequence residues using a

1316

+black character on a colored background. It can be switched off to show

1317

+residues as a colored character on a white background.

1318

+

1319

+

1320

+Either BLACK AND WHITE or DEFAULT COLOR options can be selected. The Color

1321

+option looks first for the color parameter file (as described above) and, if no

1322

+file is found, uses the default residue-specific colors.

1323

+

1324

+

1325

+You can specify your own coloring scheme by using the LOAD COLOR PARAMETER FILE

1326

+option. The format of the color parameter file is described below.

1327

+

1328

+

1329

+<H4>

1330

+COLOR PARAMETER FILE

1331

+</H4>

1332

+

1333

+

1334

+This file is divided into 3 sections:

1335

+

1336

+

1337

+1) the names and rgb values of the colors

1338

+2) the rules for calculating the consensus

1339

+3) the rules for assigning colors to the residues

1340

+

1341

+

1342

+An example file is given here.

1343

+

1344

+

1345

+<PRE>

1346

+ --------------------------------------------------------------------

1347

+@rgbindex

1348

+RED 0.9 0.1 0.1

1349

+BLUE 0.1 0.1 0.9

1350

+GREEN 0.1 0.9 0.1

1351

+YELLOW 0.9 0.9 0.0

1352

+

1353

+

1354

+@consensus

1355

+% = 60% w:l:v:i:m:a:f:c:y:h:p

1356

+# = 80% w:l:v:i:m:a:f:c:y:h:p

1357

+- = 50% e:d

1358

++ = 60% k:r

1359

+q = 50% q:e

1360

+p = 50% p

1361

+n = 50% n

1362

+t = 50% t:s

1363

+

1364

+

1365

+@color

1366

+g = RED

1367

+p = YELLOW

1368

+t = GREEN if t:%:#

1369

+n = GREEN if n

1370

+w = BLUE if %:#:p

1371

+k = RED if +

1372

+ --------------------------------------------------------------------

1373

+</PRE>

1374

+

1375

+

1376

+The first section is optional and is identified by the header @rgbindex. If

1377

+this section exists, each color used in the file must be named and the rgb

1378

+values specified (on a scale from 0 to 1). If the rgb index section is not

1379

+found, the following set of hard-coded colors will be used.

1380

+

1381

+

1382

+<PRE>

1383

+RED 0.9 0.1 0.1

1384

+BLUE 0.1 0.1 0.9

1385

+GREEN 0.1 0.9 0.1

1386

+ORANGE 0.9 0.7 0.3

1387

+CYAN 0.1 0.9 0.9

1388

+PINK 0.9 0.5 0.5

1389

+MAGENTA 0.9 0.1 0.9

1390

+YELLOW 0.9 0.9 0.0

1391

+</PRE>

1392

+

1393

+

1394

+The second section is optional and is identified by the header @consensus. It

1395

+defines how the consensus is calculated.

1396

+

1397

+

1398

+The format of each consensus parameter is:

1399

+

1400

+

1401

+<PRE>

1402

+c = n% residue_list

1403

+

1404

+

1405

+ where

1406

+ c is a character used to identify the parameter.

1407

+ n is an integer value used as the percentage cutoff

1408

+ point.

1409

+ residue_list is a list of residues denoted by a single

1410

+ character, delimited by a colon (:).

1411

+</PRE>

1412

+

1413

+

1414

+For example: # = 60% w:l:v:i

1415

+

1416

+

1417

+will assign a consensus character # to any column in the alignment which

1418

+contains more than 60% of the residues w,l,v and i.
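The consensus rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ClustalX implementation: the function names are hypothetical, every character in the column (gaps included) counts towards the denominator, and "more than" is read as a strict comparison.

```python
def parse_consensus_rule(line):
    """Split 'c = n% residue_list' into (c, n, set of residues)."""
    char, rest = line.split("=", 1)
    cutoff, residue_list = rest.split("%", 1)
    return char.strip(), int(cutoff), set(residue_list.strip().split(":"))


def consensus_chars(column, rules):
    """Return every consensus character whose rule the column satisfies."""
    residues = [r.lower() for r in column]
    satisfied = set()
    for char, cutoff, members in rules:
        hits = sum(1 for r in residues if r in members)
        # Assumption: the cutoff is exceeded strictly ("more than n%").
        if 100.0 * hits / len(residues) > cutoff:
            satisfied.add(char)
    return satisfied


rules = [parse_consensus_rule("# = 60% w:l:v:i")]
print(consensus_chars("WLVIA", rules))  # 4 of 5 residues (80%) are in the list
print(consensus_chars("WLAAA", rules))  # 2 of 5 residues (40%) are not enough
```

Returning a set rather than a single character allows one column to satisfy several consensus parameters at once, which is what the conditional color rules need.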

1419

+

1420

+

1421

+

1422

+

1423

+The third section is identified by the header @color, and defines how colors

1424

+are assigned to each residue in the alignment.

1425

+

1426

+

1427

+The color parameters can take one of two formats:

1428

+

1429

+

1430

+<PRE>

1431

+1) r = color

1432

+2) r = color if consensus_list

1433

+

1434

+

1435

+ where

1436

+ r is a character used to denote a residue.

1437

+ color is one of the colors in the GDE color lookup table.

1438

+ consensus_list is a list of consensus characters, each a single

1439

+ character, delimited by a colon (:).

1440

+</PRE>

1441

+

1442

+

1443

+Examples:

1444

+1) g = ORANGE

1445

+

1446

+

1447

+will color all glycines ORANGE, regardless of the consensus.

1448

+

1449

+

1450

+2) w = BLUE if w:%:#

1451

+

1452

+

1453

+will color BLUE any tryptophan which is found in a column with a consensus of

1454

+w, % or #.
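The two color parameter formats can be sketched in Python as shown here. The function names are hypothetical, and the choice to return the first matching rule is an assumption made for illustration; the source does not say how ClustalX resolves several rules for the same residue.

```python
def parse_color_rule(line):
    """Split 'r = COLOR' or 'r = COLOR if c1:c2' into (residue, color, consensus set or None)."""
    residue, rest = line.split("=", 1)
    parts = rest.split(" if ", 1)
    color = parts[0].strip()
    needed = set(parts[1].strip().split(":")) if len(parts) == 2 else None
    return residue.strip(), color, needed


def color_for(residue, column_consensus, rules):
    """Return the color of a residue given the consensus characters of its column."""
    for rule_residue, color, needed in rules:
        if rule_residue != residue.lower():
            continue
        # An unconditional rule always applies; a conditional rule applies when
        # at least one of its consensus characters holds for the column.
        if needed is None or needed & column_consensus:
            return color
    return None  # no rule matched: the residue is left uncolored


color_rules = [parse_color_rule("g = ORANGE"), parse_color_rule("w = BLUE if w:%:#")]
print(color_for("G", set(), color_rules))
print(color_for("W", {"#"}, color_rules))
print(color_for("W", {"-"}, color_rules))
```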

1455

+

1456

+

1457

+

1458

+

1459

+

1460

+<A HREF="#INDEX"> Back to Index </A>

1461

+<CENTER><H2><A NAME="Q"> Alignment Quality Analysis

1462

+</A></H2></CENTER>

1463

+

1464

+

1465

+

1466

+<H3>

1467

+QUALITY SCORES

1468

+</H3>

1469

+

1470

+

1471

+Clustal X provides an indication of the quality of an alignment by plotting

1472

+a 'conservation score' for each column of the alignment. A high score indicates

1473

+a well-conserved column; a low score indicates low conservation. The quality

1474

+curve is drawn below the alignment.

1475

+

1476

+

1477

+Two methods are also provided to indicate single residues or sequence segments

1478

+which score badly in the alignment.

1479

+

1480

+

1481

+Low-scoring residues are expected to occur at a moderate frequency in all the

1482

+sequences because of their steady divergence due to the natural processes of

1483

+evolution. The most divergent sequences are likely to have the most outliers.

1484

+However, the highlighted residues are especially useful in pointing to

1485

+sequence misalignments. Note that clustering of highlighted residues is a

1486

+strong indication of misalignment. This can arise due to various reasons, for

1487

+example:

1488

+

1489

+

1490

+ 1. Partial or total misalignments caused by a failure in the

1491

+ alignment algorithm. This usually occurs only in difficult alignment cases.

1492

+

1493

+

1494

+ 2. Partial or total misalignments because at least one of the

1495

+ sequences in the given set is partly or completely unrelated to the

1496

+ other sequences. It is up to the user to check that the set of

1497

+ sequences is alignable.

1498

+

1499

+

1500

+ 3. Frameshift translation errors in a protein sequence causing local

1501

+ mismatched regions to be heavily highlighted. These are surprisingly

1502

+ common in database entries. If suspected, a 3-frame translation of

1503

+ the source DNA needs to be examined.

1504

+

1505

+

1506

+Occasionally, highlighted residues may point to regions of some biological

1507

+significance. This might happen for example if a protein alignment contains a

1508

+sequence which has acquired new functions relative to the main sequence set. It

1509

+is important to exclude other explanations, such as error or the natural

1510

+divergence of sequences, before invoking a biological explanation.

1511

+

1512

+

1513

+

1514

+

1515

+<H3>

1516

+LOW-SCORING SEGMENTS

1517

+</H3>

1518

+

1519

+

1520

+Unreliable regions in the alignment can be highlighted using the Low-Scoring

1521

+Segments option. A sequence-weighted profile is used to indicate any segments

1522

+in the sequences which score badly. Because the profile calculation may take

1523

+some time, an option is provided to calculate LOW-SCORING SEGMENTS. The

1524

+segment display can then be toggled on or off without having to repeat the

1525

+time-consuming calculations.

1526

+

1527

+

1528

+For details of the low-scoring segment calculation, see the CALCULATION section

1529

+below.

1530

+

1531

+

1532

+

1533

+

1534

+<H4>

1535

+LOW-SCORING SEGMENT PARAMETERS

1536

+</H4>

1537

+

1538

+

1539

+MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be

1540

+hidden by increasing the minimum length of segments which will be displayed.

1541

+

1542

+

1543

+DNA MARKING SCALE is used to remove less significant segments from the

1544

+highlighted display. Increase the scale to display more segments; decrease the

1545

+scale to remove the least significant.

1546

+

1547

+

1548

+

1549

+

1550

+PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of each

1551

+amino acid to each other. The matrix is used to calculate the sequence-

1552

+weighted profile scores. There are four 'in-built' Log-Odds matrices offered:

1553

+the Gonnet PAM 80, 120, 250, 350 matrices. A more stringent matrix which only

1554

+gives a high score to identities and the most favoured conservative

1555

+substitutions may be more suitable when the sequences are closely related. For

1556

+more divergent sequences, it is appropriate to use "softer" matrices which give

1557

+a high score to many other frequent substitutions. This option automatically

1558

+recalculates the low-scoring segments.

1559

+

1560

+

1561

+

1562

+

1563

+DNA WEIGHT MATRIX: Two hard-coded matrices are available:

1564

+

1565

+

1566

+1) IUB. This is the default scoring matrix used by BESTFIT for the comparison

1567

+of nucleic acid sequences. X's and N's are treated as matches to any IUB

1568

+ambiguity symbol. All matches score 1.0; all mismatches for IUB symbols score

1569

+0.9.

1570

+

1571

+

1572

+2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score

1573

+1.0 and mismatches score 0. All matches for IUB symbols also score 0.

1574

+

1575

+

1576

+A new matrix can be read from a file on disk, if the filename consists only

1577

+of lower case characters. The values in the new weight matrix should be

1578

+similarities and should be NEGATIVE for infrequent substitutions.

1579

+

1580

+

1581

+INPUT FORMAT. The format used for a new matrix is the same as that of the BLAST

1582

+program. Any lines beginning with a # character are assumed to be comments. The

1583

+first non-comment line should contain a list of amino acids in any order, using

1584

+the 1 letter code, followed by a * character. This should be followed by a

1585

+square matrix of scores, with one row and one column for each amino acid. The

1586

+last row and column of the matrix (corresponding to the * character) contain

1587

+the minimum score over the whole matrix.
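The input format described above can be read with a short Python sketch like the one below. The toy three-symbol matrix and the function name `read_weight_matrix` are hypothetical; the sketch assumes integer scores and tolerates the row labels that BLAST matrix files normally carry at the start of each row.

```python
TOY_MATRIX = """\
# toy matrix for illustration only
   A  C  *
A  2 -1 -1
C -1  3 -1
* -1 -1 -1
"""


def read_weight_matrix(text):
    """Parse a BLAST-style matrix into (symbols, {(row, column): score})."""
    # Blank lines and lines beginning with '#' are comments and are skipped.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    symbols = lines[0].split()  # first non-comment line: residue codes, then '*'
    matrix = {}
    for row_symbol, line in zip(symbols, lines[1:]):
        fields = line.split()
        if fields[0] == row_symbol:  # drop the optional row label
            fields = fields[1:]
        if len(fields) != len(symbols):
            raise ValueError("matrix row for %r is not square" % row_symbol)
        for col_symbol, value in zip(symbols, fields):
            matrix[row_symbol, col_symbol] = int(value)
    return symbols, matrix


symbols, matrix = read_weight_matrix(TOY_MATRIX)
print(symbols, matrix["A", "C"], matrix["C", "C"])
```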

1588

+

1589

+

1590

+<H4>

1591

+QUALITY SCORE PARAMETERS

1592

+</H4>

1593

+

1594

+

1595

+You can customise the column 'quality scores' plotted underneath the alignment

1596

+display using the following options.

1597

+

1598

+

1599

+SCORE PLOT SCALE: this is a scalar value from 1 to 10, which can be used to

1600

+change the scale of the quality score plot.

1601

+

1602

+

1603

+RESIDUE EXCEPTION CUTOFF: this is a scalar value from 1 to 10, which can be

1604

+used to change the number of residue exceptions which are highlighted in the

1605

+alignment display. (For an explanation of this cutoff, see the CALCULATION OF

1606

+RESIDUE EXCEPTIONS section below.)

1607

+

1608

+

1609

+PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of

1610

+each amino acid to each other.

1611

+

1612

+

1613

+DNA WEIGHT MATRIX: two hard-coded matrices are available: IUB and CLUSTALW(1.6).

1614

+

1615

+

1616

+For more information about the weight matrices, see the help above for

1617

+the Low-scoring Segments Weight Matrix.

1618

+

1619

+

1620

+For details of the quality score calculations, see the CALCULATION section

1621

+below.

1622

+

1623

+

1624

+

1625

+

1626

+

1627

+SHOW LOW-SCORING SEGMENTS

1628

+

1629

+

1630

+

1631

+The low-scoring segment display can be toggled on or off. This option does not

1632

+recalculate the profile scores.

1633

+

1634

+

1635

+

1636

+

1637

+

1638

+SHOW EXCEPTIONAL RESIDUES

1639

+

1640

+

1641

+

1642

+This option highlights individual residues which score badly in the alignment

1643

+quality calculations. Residues which score exceptionally low are highlighted by

1644

+using a white character on a grey background.

1645

+

1646

+

1647

+

1648

+SAVE QUALITY SCORES TO FILE

1649

+

1650

+

1651

+

1652

+The quality scores that are plotted underneath the alignment display can also

1653

+be saved in a text file. Each column in the alignment is written on one line in

1654

+the output file, with the value of the quality score at the end of the line.

1655

+Only the sequences currently selected in the display are written to the file.

1656

+One use for quality scores is to color residues in a protein structure by

1657

+sequence conservation. In this way conserved surface residues can be

1658

+highlighted to locate functional regions such as ligand-binding sites.

1659

+

1660

+

1661

+

1662

+

1663

+<H3>

1664

+CALCULATION OF QUALITY SCORES

1665

+</H3>

1666

+

1667

+

1668

+Suppose we have an alignment of m sequences of length n. Then, the alignment

1669

+can be written as:

1670

+

1671

+

1672

+<PRE>

1673

+ A11 A12 A13 .......... A1n

1674

+ A21 A22 A23 .......... A2n

1675

+ .

1676

+ .

1677

+ Am1 Am2 Am3 .......... Amn

1678

+</PRE>

1679

+

1680

+

1681

+We also have a residue comparison matrix of size R where C(i,j) is the score

1682

+for aligning residue i with residue j.

1683

+

1684

+

1685

+We want to calculate a score for the conservation of the jth position in the

1686

+alignment.

1687

+

1688

+

1689

+To do this, we define an R-dimensional sequence space. For the jth position in

1690

+the alignment, each sequence consists of a single residue which is assigned a

1691

+point S in the space. S has R dimensions, and for sequence i, the rth dimension

1692

+is defined as:

1693

+

1694

+

1695

+<PRE>

1696

+ Sr = C(r,Aij)

1697

+</PRE>

1698

+

1699

+

1700

+We then calculate a consensus value for the jth position in the alignment. This

1701

+value X also has R dimensions, and the rth dimension is defined as:

1702

+

1703

+

1704

+<PRE>

1705

+ Xr = ( SUM (Fij * C(i,r)) ) / m

1706

+ 1<=i<=R

1707

+</PRE>

1708

+

1709

+

1710

+where Fij is the count of residues i at position j in the alignment.

1711

+

1712

+

1713

+Now we can calculate the distance Di between each sequence i and the consensus

1714

+position X in the R-dimensional space.

1715

+

1716

+

1717

+<PRE>

1718

+ Di = SQRT ( SUM (Xr - Sr)(Xr - Sr) )

1719

+ 1<=r<=R

1720

+

1721

+

1722

+</PRE>

1723

+

1724

+

1725

+The quality score for the jth position in the alignment is defined as the mean

1726

+of the sequence distances Di.

1727

+

1728

+

1729

+The score is normalised by multiplying by the percentage of sequences which

1730

+have residues (and not gaps) at this position.
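The calculation above can be followed step by step in this Python sketch. It is a literal transcription of the formulas for a single column, not the ClustalX source: the function name, the two-letter toy alphabet and the identity-like matrix are hypothetical, gaps are assumed to be skipped when counting residues, and the normalisation uses a fraction rather than a percentage. Any further rescaling the program applies before plotting is not shown.

```python
import math


def column_quality(column, C, alphabet, gap="-"):
    """Mean distance of the residues in one column from the column consensus."""
    m = len(column)                       # number of sequences
    residues = [a for a in column if a != gap]
    if not residues:
        return 0.0
    # Consensus point X: Xr = SUM(Fij * C(i,r)) / m, summed over residue types i.
    # Summing C over the residues present is the same as weighting by counts Fij.
    X = [sum(C[a][r] for a in residues) / m for r in alphabet]
    distances = []
    for a in residues:
        S = [C[r][a] for r in alphabet]   # Sr = C(r, Aij)
        distances.append(math.sqrt(sum((x - s) ** 2 for x, s in zip(X, S))))
    mean_distance = sum(distances) / len(distances)
    # Normalise by the fraction of sequences with a residue at this position.
    return mean_distance * len(residues) / m


# Hypothetical comparison matrix: 1 for identical residues, 0 otherwise.
toy_C = {"A": {"A": 1, "C": 0}, "C": {"A": 0, "C": 1}}
print(column_quality("AAAA", toy_C, "AC"))
print(column_quality("AACC", toy_C, "AC"))
```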

1731

+

1732

+

1733

+<H3>

1734

+CALCULATION OF RESIDUE EXCEPTIONS

1735

+</H3>

1736

+

1737

+

1738

+The jth residue of the ith sequence is considered as an exception if the

1739

+distance Di of the sequence from the consensus value X is greater than (Upper

1740

+Quartile + Inter Quartile Range * Cutoff). The value used as a cutoff for

1741

+displaying exceptions can be set from the SCORE PARAMETERS menu. A high cutoff

1742

+value will only display very significant exceptions; a low value will allow

1743

+more, less significant, exceptions to be highlighted.

1744

+

1745

+

1746

+(NB. Sequences which contain gaps at this position are not included in the

1747

+exception calculation.)
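The exception test can be sketched in Python as below. The function name is hypothetical, and the way the quartiles are estimated (the "inclusive" method of `statistics.quantiles`, Python 3.8 or later) is an assumption; the source does not say which quartile definition ClustalX uses.

```python
import statistics


def exception_indices(distances, cutoff):
    """Indices of sequences whose distance exceeds UQ + IQR * cutoff."""
    lower_q, _, upper_q = statistics.quantiles(distances, n=4, method="inclusive")
    threshold = upper_q + (upper_q - lower_q) * cutoff
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances Di for one column; gapped sequences are already left out.
column_distances = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0]
print(exception_indices(column_distances, 1))   # low cutoff: the outlier is flagged
print(exception_indices(column_distances, 30))  # very high cutoff: nothing is flagged
```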

1748

+

1749

+

1750

+

1751

+

1752

+<H3>

1753

+CALCULATION OF LOW-SCORING SEGMENTS

1754

+</H3>

1755

+

1756

+

1757

+Suppose we have an alignment of m sequences of length n. Then, the alignment

1758

+can be written as:

1759

+

1760

+

1761

+<PRE>

1762

+ A11 A12 A13 .......... A1n

1763

+ A21 A22 A23 .......... A2n

1764

+ .

1765

+ .

1766

+ Am1 Am2 Am3 .......... Amn

1767

+</PRE>

1768

+

1769

+

1770

+We also have a residue comparison matrix of size R where C(i,j) is the score

1771

+for aligning residue i with residue j.

1772

+

1773

+

1774

+We calculate sequence weights by building a neighbour-joining tree, in which

1775

+branch lengths are proportional to divergence. Summing the branches by branch

1776

+ownership provides the weights. See Thompson et al., CABIOS, 10, 19 (1994) and

+Henikoff et al., JMB, 243, 574 (1994).

1778

+

1779

+

1780

+To find the low-scoring segments in a sequence Si, we build a weighted profile

1781

+of the remaining sequences in the alignment. Suppose we find residue r at

1782

+position j in the sequence; then the score for the jth position in the sequence

1783

+is defined as

1784

+

1785

+

1786

+<PRE>

1787

+ Score(Si,j) = Profile(j,r) where Profile(j,r) is the profile score

1788

+ for residue r at position j in the

1789

+ alignment.

1790

+</PRE>

1791

+

1792

+

1793

+These residue scores are summed along the sequence in both forward and backward

1794

+directions. If the sum of the scores is positive, then it is reset to zero.

1795

+Segments which score negatively in both directions are considered as

1796

+'low-scoring' and will be highlighted in the alignment display.
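The forward and backward summation can be sketched in Python as follows. The function name and the example scores are hypothetical, and the per-residue profile scores are taken as given; the minimum segment length and marking scale parameters are not modelled here.

```python
def low_scoring_mask(scores):
    """Mark positions whose running sum is negative in both directions."""

    def running_sums(values):
        total, sums = 0.0, []
        for value in values:
            # A running sum that becomes positive is reset to zero.
            total = min(0.0, total + value)
            sums.append(total)
        return sums

    forward = running_sums(scores)
    backward = running_sums(scores[::-1])[::-1]
    return [f < 0 and b < 0 for f, b in zip(forward, backward)]


# Hypothetical profile scores Score(Si,j) along one sequence.
profile_scores = [2, 1, -3, -2, -1, 4, 3]
print(low_scoring_mask(profile_scores))
```

In this example only the three negative positions in the middle are marked, because the flanking positive scores pull one of the two running sums back to zero.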

1797

+

1798

+

1799

+

1800

+

1801

+

1802

+<A HREF="#INDEX"> Back to Index </A>

1803

+<CENTER><H2><A NAME="9"> Command Line Parameters

1804

+</A></H2></CENTER>

1805

+<CENTER><H3> DATA (sequences)

1806

+</H3></CENTER>

1807

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1808

+<TR>

1809

+<TD>Parameter</TD>

1810

+<TD>Description</TD>

1811

+</TR>

1812

+<TR>

1813

+<TD><TT>-PROFILE1=file.ext and -PROFILE2=file.ext </TT></TD>

1814

+<TD>profiles (aligned sequences)</TD>

1815

+</TR>

1816

+</TABLE></CENTER>

1817

+<CENTER><H3> VERBS (do things)

1818

+</H3></CENTER>

1819

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1820

+<TR>

1821

+<TD>Parameter</TD>

1822

+<TD>Description</TD>

1823

+</TR>

1824

+<TR>

1825

+<TD><TT>-HELP or -CHECK </TT></TD>

1826

+<TD>outline the command line parameters</TD>

1827

+</TR>

1828

+<TR>

1829

+<TD><TT>-ALIGN </TT></TD>

1830

+<TD>do full multiple alignment </TD>

1831

+</TR>

1832

+<TR>

1833

+<TD><TT>-TREE </TT></TD>

1834

+<TD>calculate NJ tree</TD>

1835

+</TR>

1836

+<TR>

1837

+<TD><TT>-BOOTSTRAP(=n) </TT></TD>

1838

+<TD>bootstrap an NJ tree (n = number of bootstraps; default = 1000)</TD>

1839

+</TR>

1840

+<TR>

1841

+<TD><TT>-CONVERT </TT></TD>

1842

+<TD>output the input sequences in a different file format</TD>

1843

+</TR>

1844

+</TABLE></CENTER>

1845

+<CENTER><H3> PARAMETERS (set things)

1846

+</H3></CENTER>

1847

+<CENTER>***General settings:***

1848

+</CENTER>

1849

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1850

+<TR>

1851

+<TD>Parameter</TD>

1852

+<TD>Description</TD>

1853

+</TR>

1854

+<TR>

1855

+<TD><TT>-INTERACTIVE </TT></TD>

1856

+<TD>read command line, then enter normal interactive menus</TD>

1857

+</TR>

1858

+<TR>

1859

+<TD><TT>-QUICKTREE </TT></TD>

1860

+<TD>use FAST algorithm for the alignment guide tree</TD>

1861

+</TR>

1862

+<TR>

1863

+<TD><TT>-TYPE= </TT></TD>

1864

+<TD>PROTEIN or DNA sequences</TD>

1865

+</TR>

1866

+<TR>

1867

+<TD><TT>-NEGATIVE </TT></TD>

1868

+<TD>protein alignment with negative values in matrix</TD>

1869

+</TR>

1870

+<TR>

1871

+<TD><TT>-OUTFILE= </TT></TD>

1872

+<TD>sequence alignment file name</TD>

1873

+</TR>

1874

+<TR>

1875

+<TD><TT>-OUTPUT= </TT></TD>

1876

+<TD>GCG, GDE, PHYLIP, PIR or NEXUS</TD>

1877

+</TR>

1878

+<TR>

1879

+<TD><TT>-OUTORDER= </TT></TD>

1880

+<TD>INPUT or ALIGNED</TD>

1881

+</TR>

1882

+<TR>

1883

+<TD><TT>-CASE= </TT></TD>

1884

+<TD>LOWER or UPPER (for GDE output only)</TD>

1885

+</TR>

1886

+<TR>

1887

+<TD><TT>-SEQNOS= </TT></TD>

1888

+<TD>OFF or ON (for Clustal output only)</TD>

1889

+</TR>

1890

+</TABLE></CENTER>

1891

+<CENTER><H3>***Fast Pairwise Alignments:***

1892

+</H3></CENTER>

1893

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1894

+<TR>

1895

+<TD>Parameter</TD>

1896

+<TD>Description</TD>

1897

+</TR>

1898

+<TR>

1899

+<TD><TT>-TOPDIAGS=n </TT></TD>

1900

+<TD>number of best diagonals</TD>

1901

+</TR>

1902

+<TR>

1903

+<TD><TT>-WINDOW=n </TT></TD>

1904

+<TD>window around best diagonals</TD>

1905

+</TR>

1906

+<TR>

1907

+<TD><TT>-PAIRGAP=n </TT></TD>

1908

+<TD>gap penalty</TD>

1909

+</TR>

1910

+<TR>

1911

+<TD><TT>-SCORE= </TT></TD>

1912

+<TD>PERCENT or ABSOLUTE</TD>

1913

+</TR>

1914

+</TABLE></CENTER>

1915

+<CENTER><H3>***Slow Pairwise Alignments:***

1916

+</H3></CENTER>

1917

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1918

+<TR>

1919

+<TD>Parameter</TD>

1920

+<TD>Description</TD>

1921

+</TR>

1922

+<TR>

1923

+<TD><TT>-PWDNAMATRIX= </TT></TD>

1924

+<TD>DNA weight matrix=IUB, CLUSTALW or filename</TD>

1925

+</TR>

1926

+<TR>

1927

+<TD><TT>-PWGAPOPEN=f </TT></TD>

1928

+<TD>gap opening penalty</TD>

1929

+</TR>

1930

+<TR>

1931

+<TD><TT>-PWGAPEXT=f </TT></TD>

1932

+<TD>gap extension penalty</TD>

1933

+</TR>

1934

+</TABLE></CENTER>

1935

+<CENTER><H3>***Multiple Alignments:***

1936

+</H3></CENTER>

1937

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1938

+<TR>

1939

+<TD>Parameter</TD>

1940

+<TD>Description</TD>

1941

+</TR>

1942

+<TR>

1943

+<TD><TT>-USETREE= </TT></TD>

1944

+<TD>file for old guide tree</TD>

1945

+</TR>

1946

+<TR>

1947

+<TD><TT>-MATRIX= </TT></TD>

1948

+<TD>Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename</TD>

1949

+</TR>

1950

+<TR>

1951

+<TD><TT>-DNAMATRIX= </TT></TD>

1952

+<TD>DNA weight matrix=IUB, CLUSTALW or filename</TD>

1953

+</TR>

1954

+<TR>

1955

+<TD><TT>-GAPOPEN=f </TT></TD>

1956

+<TD>gap opening penalty</TD>

1957

+</TR>

1958

+<TR>

1959

+<TD><TT>-GAPEXT=f </TT></TD>

1960

+<TD>gap extension penalty</TD>

1961

+</TR>

1962

+<TR>

1963

+<TD><TT>-ENDGAPS </TT></TD>

1964

+<TD>no end gap separation penalty</TD>

1965

+</TR>

1966

+<TR>

1967

+<TD><TT>-GAPDIST=n </TT></TD>

1968

+<TD>gap separation penalty range</TD>

1969

+</TR>

1970

+<TR>

1971

+<TD><TT>-NOPGAP </TT></TD>

1972

+<TD>residue-specific gaps off</TD>

1973

+</TR>

1974

+<TR>

1975

+<TD><TT>-NOHGAP </TT></TD>

1976

+<TD>hydrophilic gaps off</TD>

1977

+</TR>

1978

+<TR>

1979

+<TD><TT>-HGAPRESIDUES= </TT></TD>

1980

+<TD>list of hydrophilic residues</TD>

1981

+</TR>

1982

+<TR>

1983

+<TD><TT>-MAXDIV=n </TT></TD>

1984

+<TD>% identity for delay</TD>

1985

+</TR>

1986

+<TR>

1987

+<TD><TT>-TYPE= </TT></TD>

1988

+<TD>PROTEIN or DNA</TD>

1989

+</TR>

1990

+<TR>

1991

+<TD><TT>-TRANSWEIGHT=f </TT></TD>

1992

+<TD>transitions weighting</TD>

1993

+</TR>

1994

+</TABLE></CENTER>

1995

+<CENTER><H3>***Profile Alignments:***

1996

+</H3></CENTER>

1997

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

1998

+<TR>

1999

+<TD>Parameter</TD>

2000

+<TD>Description</TD>

2001

+</TR>

2002

+<TR>

2003

+<TD><TT>-NEWTREE1= </TT></TD>

2004

+<TD>file for new guide tree for profile1</TD>

2005

+</TR>

2006

+<TR>

2007

+<TD><TT>-NEWTREE2= </TT></TD>

2008

+<TD>file for new guide tree for profile2</TD>

2009

+</TR>

2010

+<TR>

2011

+<TD><TT>-USETREE1= </TT></TD>

2012

+<TD>file for old guide tree for profile1</TD>

2013

+</TR>

2014

+<TR>

2015

+<TD><TT>-USETREE2= </TT></TD>

2016

+<TD>file for old guide tree for profile2</TD>

2017

+</TR>

2018

+</TABLE></CENTER>

2019

+<CENTER><H3>***Sequence to Profile Alignments:***

2020

+</H3></CENTER>

2021

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

2022

+<TR>

2023

+<TD>Parameter</TD>

2024

+<TD>Description</TD>

2025

+</TR>

2026

+<TR>

2027

+<TD><TT>-NEWTREE= </TT></TD>

2028

+<TD>file for new guide tree</TD>

2029

+</TR>

2030

+<TR>

2031

+<TD><TT>-USETREE= </TT></TD>

2032

+<TD>file for old guide tree</TD>

2033

+</TR>

2034

+</TABLE></CENTER>

2035

+<CENTER><H3>***Structure Alignments:***

2036

+</H3></CENTER>

2037

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

2038

+<TR>

2039

+<TD>Parameter</TD>

2040

+<TD>Description</TD>

2041

+</TR>

2042

+<TR>

2043

+<TD><TT>-NOSECSTR2 </TT></TD>

2044

+<TD>do not use secondary structure/gap penalty mask for profile 2</TD>

2045

+</TR>

2046

+<TR>

2047

+<TD><TT>-SECSTROUT=STRUCTURE or MASK or BOTH or NONE </TT></TD>

2048

+<TD>output in alignment file</TD>

2049

+</TR>

2050

+<TR>

2051

+<TD><TT>-HELIXGAP=n </TT></TD>

2052

+<TD>gap penalty for helix core residues </TD>

2053

+</TR>

2054

+<TR>

2055

+<TD><TT>-STRANDGAP=n </TT></TD>

2056

+<TD>gap penalty for strand core residues</TD>

2057

+</TR>

2058

+<TR>

2059

+<TD><TT>-LOOPGAP=n </TT></TD>

2060

+<TD>gap penalty for loop regions</TD>

2061

+</TR>

2062

+<TR>

2063

+<TD><TT>-TERMINALGAP=n </TT></TD>

2064

+<TD>gap penalty for structure termini</TD>

2065

+</TR>

2066

+<TR>

2067

+<TD><TT>-HELIXENDIN=n </TT></TD>

2068

+<TD>number of residues inside helix to be treated as terminal</TD>

2069

+</TR>

2070

+<TR>

2071

+<TD><TT>-HELIXENDOUT=n </TT></TD>

2072

+<TD>number of residues outside helix to be treated as terminal</TD>

2073

+</TR>

2074

+<TR>

2075

+<TD><TT>-STRANDENDIN=n </TT></TD>

2076

+<TD>number of residues inside strand to be treated as terminal</TD>

2077

+</TR>

2078

+<TR>

2079

+<TD><TT>-STRANDENDOUT=n</TT></TD>

2080

+<TD>number of residues outside strand to be treated as terminal </TD>

2081

+</TR>

2082

+</TABLE></CENTER>

2083

+<CENTER><H3>***Trees:***

2084

+</H3></CENTER>

2085

+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>

2086

+<TR>

2087

+<TD>Parameter</TD>

2088

+<TD>Description</TD>

2089

+</TR>

2090

+<TR>

2091

+<TD><TT>-SEED=n </TT></TD>

2092

+<TD>seed number for bootstraps</TD>

2093

+</TR>

2094

+<TR>

2095

+<TD><TT>-KIMURA </TT></TD>

2096

+<TD>use Kimura's correction</TD>

2097

+</TR>

2098

+<TR>

2099

+<TD><TT>-TOSSGAPS </TT></TD>

2100

+<TD>ignore positions with gaps</TD>

2101

+</TR>

2102

+<TR>

2103

+<TD><TT>-BOOTLABELS=node OR branch </TT></TD>

2104

+<TD>position of bootstrap values in tree display</TD>

2105

+</TR>

2106

+</TABLE></CENTER>

2107

+

2108

+<A HREF="#INDEX"> Back to Index </A>

2109

+<CENTER><H2><A NAME="R"> References

2110

+</A></H2></CENTER>

2111

+

2112

+

2113

+

2114

+

2115

+The ClustalX program is described in the manuscript:

2116

+

2117

+

2118

+

2119

+Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997)

2120

+The ClustalX windows interface: flexible strategies for multiple sequence

2121

alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.
