Index: clustalw-1.83/clustalx.html
===================================================================
--- clustalw-1.83.orig/clustalx.html
+++ clustalw-1.83/clustalx.html
@@ -2029,6 +2029,2118 @@
Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997)
The ClustalX windows interface: flexible strategies for multiple sequence
+alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.
+
+The ClustalW program is described in the manuscript:
+
+Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the
+sensitivity of progressive multiple sequence alignment through sequence
+weighting, position-specific gap penalties and weight matrix choice. Nucleic
+Acids Research, 22:4673-4680.
+
+The ClustalV program is described in the manuscript:
+
+Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for
+multiple sequence alignment. CABIOS 8,189-191.
+
+The original Clustal program is described in the manuscripts:
+
+Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence
+alignments on a microcomputer. CABIOS 5,151-153.
+
+Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple
+sequence alignment on a microcomputer. Gene 73,237-244.
+
+Some tips on using Clustal X:
+
+Jeanmougin,F., Thompson,J.D., Gouy,M., Higgins,D.G. and Gibson,T.J. (1998)
+Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5.
+
+Some tips on using Clustal W:
+
+Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for
+multiple sequence alignments. Methods Enzymol., 266, 383-402.
+
+You can get the latest version of the ClustalX program by anonymous ftp to:
+
+ftp-igbmc.u-strasbg.fr
+ftp.embl-heidelberg.de
+
+Or, have a look at the following WWW site:
+
+http://www-igbmc.u-strasbg.fr/BioInfo/
+
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
+<TITLE>ClustalX Help</TITLE>
+
+<CENTER><H1>ClustalX Help</H1></CENTER>
+You can get the latest version of the ClustalX program here:
+
+<A HREF="ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/">
+ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/</A>
+<P>For full details of usage and algorithms, please read the <A HREF="clustalw.doc"><EM>ClustalW.Doc</EM></A> file.</P>
+Toby Gibson EMBL, Heidelberg, Germany.
+Des Higgins UCC, Cork, Ireland.
+Julie Thompson/Francois Jeanmougin IGBMC, Strasbourg, France.
+<CENTER><H2><A NAME="Index">Index</A></H2></CENTER>
+<LI><A HREF="#G"> General help for CLUSTAL X (1.8)
+<LI><A HREF="#F"> Input / Output Files
+<LI><A HREF="#E"> Editing Alignments
+<LI><A HREF="#M"> Multiple Alignments
+<LI><A HREF="#P"> Profile and Structure Alignments
+<LI><A HREF="#B"> Secondary Structure / Gap Penalty Masks
+<LI><A HREF="#T"> Phylogenetic Trees
+<LI><A HREF="#C"> Colors
+<LI><A HREF="#Q"> Alignment Quality Analysis
+<LI><A HREF="#9"> Command Line Parameters
+<LI><A HREF="#R"> References
+
+<CENTER><H2><A NAME="G"> General help for CLUSTAL X (1.8)
+
+Clustal X is a windows interface for the ClustalW multiple sequence alignment
+program. It provides an integrated environment for performing multiple sequence
+and profile alignments and analysing the results. The sequence alignment is
+displayed in a window on the screen. A versatile coloring scheme has been
+incorporated allowing you to highlight conserved features in the alignment.
+The pull-down menus at the top of the window allow you to select all the
+options required for traditional multiple sequence and profile alignment.
+
+You can cut-and-paste sequences to change the order of the alignment; you can
+select a subset of sequences to be aligned; you can select a sub-range of the
+alignment to be realigned and inserted back into the original alignment.
+
+Alignment quality analysis can be performed and low-scoring segments or
+exceptional residues can be highlighted.
+
+ClustalX is available for a number of different platforms including: SUN
+Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECStations, Microsoft
+Windows (32 bit) for PCs, Linux ELF for x86 PCs and Macintosh PowerMac. (See
+the README file for installation instructions.)
+
+Sequences and profiles (a term for pre-existing alignments) are input using
+the FILE menu. Invalid options will be disabled. All sequences must be included
+in 1 file. 7 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT,
+Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9 RSF and GDE flat file.
+All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
+except "-" which is used to indicate a GAP ("." in MSF/RSF).
+
+SEQUENCE / PROFILE ALIGNMENTS
+
+Clustal X has two modes which can be selected using the switch directly above
+the sequence display: MULTIPLE ALIGNMENT MODE and PROFILE ALIGNMENT MODE.
+
+To do a MULTIPLE ALIGNMENT on a set of sequences, make sure MULTIPLE ALIGNMENT
+MODE is selected. A single sequence data area is then displayed. The ALIGNMENT
+menu then allows you to either produce a guide tree for the alignment, or to do
+a multiple alignment following the guide tree, or to do a full multiple
+
+In PROFILE ALIGNMENT MODE, two sequence data areas are displayed, allowing you
+to align 2 alignments (termed profiles). Profiles are also used to add a new
+sequence to an old alignment, or to use secondary structure to guide the
+alignment process. GAPS in the old alignments are indicated using the "-"
+character. PROFILES can be input in ANY of the allowed formats; just use "-"
+(or "." for MSF/RSF) for each gap position. In Profile Alignment Mode, a button
+"Lock Scroll" is displayed which allows you to scroll the two profiles together
+using a single scroll bar. When the Lock Scroll is turned off, the two profiles
+can be scrolled independently.
+
+Phylogenetic trees can be calculated from old alignments (read in with "-"
+characters to indicate gaps) OR after a multiple alignment while the alignment
+
+The alignment is displayed on the screen with the sequence names on the left
+hand side. The sequence alignment is for display only; it cannot be edited here
+(except for changing the sequence order by cutting-and-pasting on the sequence
+
+A ruler is displayed below the sequences, starting at 1 for the first residue
+position (residue numbers in the sequence input file are ignored).
+
+A line above the alignment is used to mark strongly conserved positions. Three
+characters ('*', ':' and '.') are used:
+
+'*' indicates positions which have a single, fully conserved residue
+
+':' indicates that one of the following 'strong' groups is fully conserved:-
+
+'.' indicates that one of the following 'weaker' groups is fully conserved:-
+
+These are all the positively scoring groups that occur in the Gonnet Pam250
+matrix. The strong and weak groups are defined as strong score >0.5 and weak
+score <=0.5 respectively.
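The marking rule described above can be sketched in a few lines of Python. This is an illustration only, not ClustalX's own code: the group lists are an assumption taken from the ClustalW documentation (they are not reproduced in this section), and leaving gap-containing columns unmarked is likewise an assumption.

```python
# Residue groups as given in the ClustalW documentation (assumed here;
# the lists themselves do not appear in this section of the help text).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column: str) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues or "." in residues:
        return " "          # assumption: columns containing gaps are not marked
    if len(residues) == 1:
        return "*"          # a single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"          # all residues fall inside one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."          # all residues fall inside one 'weaker' group
    return " "


def conservation_line(alignment: list) -> str:
    """Build the marker line for a list of equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*alignment))
```

For example, `conservation_line(["MKSA", "MRAA"])` marks the first and last columns as identical and the two middle columns as members of a strong group.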
+
+For profile alignments, secondary structure and gap penalty masks are displayed
+above the sequences, if any data is found in the profile input file.
+
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
+<CENTER><H2><A NAME="F"> Input / Output Files
+
+LOAD SEQUENCES reads sequences from one of 7 file formats, replacing any
+sequences that are already loaded. All sequences must be in 1 file. The formats
+that are automatically recognised are: NBRF/PIR, EMBL/SWISSPROT, Pearson
+(Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file. All
+non-alphabetic characters (spaces, digits, punctuation marks) are ignored
+except "-" which is used to indicate a GAP ("." in MSF/RSF).
+
+The program tries to automatically recognise the different file formats used
+and to guess whether the sequences are amino acid or nucleotide. This is not
+
+FASTA and NBRF/PIR formats are recognised by having a ">" as the first
+character in the file.
+
+EMBL/Swiss Prot formats are recognised by the letters "ID" at the start of the
+file (the token for the entry name field).
+
+CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
+
+GCG/MSF format is recognised by one of the following:
+
+ - the word PileUp at the start of the file.
+ - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
+ at the start of the file.
+ - the word MSF on the first line of the file, and the characters ..
+ at the end of this line.
+
+GCG/RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
+
+If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
+sequence will be assumed to be nucleotide. This works in 97.3% of cases but
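The recognition rules above can be sketched as follows. This is a hedged illustration, not the ClustalX implementation: the function names are hypothetical, GDE detection is omitted because its rule is not described here, and counting only alphabetic characters towards the 85% threshold is an assumption.

```python
def guess_format(text: str) -> str:
    """Guess the file format from the start of the file, per the rules above."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        return "FASTA or NBRF/PIR"
    if stripped.startswith("ID"):
        return "EMBL/SWISSPROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG/RSF"
    msf_words = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_words) or (
        "MSF" in first_line and first_line.rstrip().endswith("..")
    ):
        return "GCG/MSF"
    return "unknown"


def looks_like_nucleotide(sequence: str, threshold: float = 0.85) -> bool:
    """Apply the 85% rule: A, C, G, T, U and N count as nucleotide characters."""
    letters = [c for c in sequence.upper() if c.isalpha()]  # assumption: gaps ignored
    if not letters:
        return False
    hits = sum(c in "ACGTUN" for c in letters)
    return hits / len(letters) >= threshold
```

Because the guess is a heuristic, a short protein rich in A, C, G and T could still be misclassified, which is why it is worth checking the detected type after loading.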
+
+APPEND SEQUENCES is only valid in MULTIPLE ALIGNMENT MODE. The input sequences
+do not replace those already loaded, but are appended at the end of the
+
+SAVE SEQUENCES AS... offers the user a choice of one of six output formats:
+CLUSTAL, NBRF/PIR, GCG/MSF, PHYLIP, NEXUS or GDE. All sequences are written
+to a single file. Options are available to save a range of the alignment,
+switch between UPPER/LOWER case for GDE files, and to output SEQUENCE NUMBERING
+
+LOAD PROFILE 1 reads sequences in the same 7 file formats, replacing any
+sequences already loaded as Profile 1. This option will also remove any
+sequences which are loaded in Profile 2.
+
+LOAD PROFILE 2 reads sequences in the same 7 file formats, replacing any
+sequences already loaded as Profile 2.
+
+SAVE PROFILE 1 AS... is similar to the Save Sequences option except that only
+those sequences in Profile 1 will be written to the output file.
+
+SAVE PROFILE 2 AS... is similar to the Save Sequences option except that only
+those sequences in Profile 2 will be written to the output file.
+
+WRITE ALIGNMENT AS POSTSCRIPT will write the sequence display to a postscript
+format file. This will include any secondary structure / gap penalty mask
+information and the consensus and ruler lines which are displayed on the
+screen. The Alignment Quality curve can be optionally included in the output
+
+WRITE PROFILE 1 AS POSTSCRIPT is similar to WRITE ALIGNMENT AS POSTSCRIPT
+except that only the profile 1 display will be printed.
+
+WRITE PROFILE 2 AS POSTSCRIPT is similar to WRITE ALIGNMENT AS POSTSCRIPT
+except that only the profile 2 display will be printed.
+
+POSTSCRIPT PARAMETERS
+
+A number of options are available to allow you to configure your postscript
+
+The exact RGB values required to reproduce the colors used in the alignment
+window will vary from printer to printer. A PS colors file can be specified
+that contains the RGB values for all the colors required by each of your
+postscript printers.
+
+By default, Clustal X looks for a file called 'colprint.par' in the current
+directory (if you are running under UNIX, it then looks in your home directory,
+and finally in the directories in your PATH environment variable). If no PS
+colors file is found or a color used on the screen is not defined here, the
+screen RGB values (from the Color Parameter File) are used.
+
+The PS colors file consists of one line for each color to be defined, with the
+color name followed by the RGB values (on a scale of 0 to 1). For example,
+
+Blank lines and comments (lines beginning with a '#' character) are ignored.
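A minimal reader for a file in the PS colors layout described above might look like this. It is a sketch under stated assumptions: the function name is hypothetical, fields are assumed to be whitespace-separated, and the sample color values in the usage note are made up for illustration.

```python
def parse_ps_colors(text: str) -> dict:
    """Parse 'NAME R G B' lines, skipping blank lines and '#' comments."""
    colors = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 'NAME R G B', got: {raw!r}")
        name = fields[0]
        rgb = tuple(float(value) for value in fields[1:])
        if not all(0.0 <= value <= 1.0 for value in rgb):
            raise ValueError(f"RGB values must lie between 0 and 1: {raw!r}")
        colors[name] = rgb
    return colors
```

Calling `parse_ps_colors("# printer A\nRED 0.9 0.1 0.1\n")` returns a dictionary that maps `RED` to the tuple `(0.9, 0.1, 0.1)`.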
+
+PAGE SIZE: The alignment can be displayed on either A4, A3 or US Letter size
+
+ORIENTATION: The alignment can be displayed on either a landscape or portrait
+
+PRINT HEADER: An optional header including the postscript filename, and
+creation date can be printed at the top of each page.
+
+PRINT QUALITY CURVE: The Alignment Quality curve which is displayed underneath
+the alignment on the screen can be included in the postscript output.
+
+PRINT RULER: The ruler which is displayed underneath the alignment on the
+screen can be included in the postscript output.
+
+PRINT RESIDUE NUMBERS: Sequence residue numbers can be printed at the right
+hand side of the alignment.
+
+RESIZE TO FIT PAGE: By default, the alignment is scaled to fit the page size
+selected. This option can be turned off, in which case a font size of 10 will
+be used for the sequences.
+
+PRINT FROM POSITION/TO: A range of the alignment can be printed. The default
+is to print the full alignment. The first and last residues to be printed are
+
+USE BLOCK LENGTH: The alignment can be divided into blocks of residues. The
+number of residues in a block is specified here. More than one block may then
+be printed on a single page. This is useful for long alignments of a small
+number of sequences. If the block length is set to 0, the alignment will not
+be divided into blocks, but printed across a number of pages.
+
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
+<CENTER><H2><A NAME="E"> Editing Alignments
+
+Clustal X allows you to change the order of the sequences in the alignment, by
+cutting-and-pasting the sequence names.
+
+To select a group of sequences to be moved, click on a sequence name and drag
+the cursor until all the required sequences are highlighted. Holding down the
+Shift key when clicking on the first name will add new sequences to those
+
+(Options are provided to Select All Sequences, Select Profile 1 or Select
+
+The selected sequences can be removed from the alignment by using the EDIT
+
+To add the cut sequences back into an alignment, select a sequence by clicking
+on the sequence name. The cut sequences will be added to the alignment,
+immediately following the selected sequence, by the EDIT menu, PASTE option.
+
+To add the cut sequences to an empty alignment (e.g. when cutting sequences from
+Profile 1 and pasting them to Profile 2), click on the empty sequence name
+display area, and select the EDIT menu, PASTE option as before.
+
+The sequence selection and sequence range selection can be cleared using the
+EDIT menu, CLEAR SEQUENCE SELECTION and CLEAR RANGE SELECTION options
+
+To search for a string of residues in the sequences, select the sequences to be
+searched by clicking on the sequence names. You can then enter the string to
+search for by selecting the SEARCH FOR STRING option. If the string is found in
+any of the sequences selected, the sequence name and column number are printed
+below the sequence display.
+
+In PROFILE ALIGNMENT MODE, the two profiles can be merged (normally done after
+alignment) by selecting ADD PROFILE 2 TO PROFILE 1. The sequences currently
+displayed as Profile 2 will be appended to Profile 1.
+
+The REMOVE ALL GAPS option will remove all gaps from the sequences currently
+WARNING: This option removes ALL gaps, not only those introduced by ClustalX,
+but also those that were read from the input alignment file. Any secondary
+structure information associated with the alignment will NOT be automatically
+
+The REMOVE GAP-ONLY COLUMNS option will remove those positions in the alignment
+which contain gaps in all sequences. This can occur as a result of removing
+divergent sequences from an alignment, or if an alignment has been realigned.
+
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
+<CENTER><H2><A NAME="M"> Multiple Alignments
+
+Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above
+the sequence display area. Then, use the ALIGNMENT menu to do multiple
+
+Multiple alignments are carried out in 3 stages:
+
+1) all sequences are compared to each other (pairwise alignments);
+
+2) a dendrogram (like a phylogenetic tree) is constructed, describing the
+approximate groupings of the sequences by similarity (stored in a file);
+
+3) the final multiple alignment is carried out, using the dendrogram as a guide.
+
+The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option.
+You can skip the first stages (pairwise alignments; guide tree) by using an old
+guide tree file (DO ALIGNMENT FROM GUIDE TREE); or you can just produce the
+guide tree with no final multiple alignment (PRODUCE GUIDE TREE ONLY).
+
+REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the
+alignment. Sequences can be selected by clicking on the sequence names - see
+Editing Alignments for more details. The unselected sequences are then 'fixed'
+and a profile is made including only the unselected sequences. Each of the
+selected sequences in turn is then realigned to this profile. The realigned
+sequences will be displayed as a group at the end of the alignment.
+
+REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the
+alignment. A residue range can be selected by clicking on the sequence display
+area. A multiple alignment is then performed, following the 3 stages described
+above, but only using the selected residue range. Finally the new alignment of
+the range is pasted back into the full sequence alignment.
+
+By default, gap penalties are used at each end of the subrange in order to
+penalise terminal gaps. If the REALIGN SEGMENT END GAP PENALTIES option is
+switched off, gaps can be introduced at the ends of the residue range at no
+
+ALIGNMENT PARAMETERS displays a sub-menu with the following options:
+
+RESET NEW GAPS BEFORE ALIGNMENT will remove any new gaps introduced into the
+sequences during multiple alignment if you wish to change the parameters and
+try again. This only takes effect just before you do a second multiple
+alignment. You can make phylogenetic trees after alignment whether or not this
+is ON. If you turn this OFF, the new gaps are kept even if you do a second
+multiple alignment. This allows you to iterate the alignment gradually.
+Sometimes, the alignment is improved by a second or third pass.
+
+RESET ALL GAPS BEFORE ALIGNMENT will remove all gaps in the sequences including
+gaps which were read in from the sequence input file. This only takes effect
+just before you do a second multiple alignment. You can make phylogenetic
+trees after alignment whether or not this is ON. If you turn this OFF, all
+gaps are kept even if you do a second multiple alignment. This allows you to
+iterate the alignment gradually. Sometimes, the alignment is improved by a
+second or third pass.
+
+PAIRWISE ALIGNMENT PARAMETERS control the speed/sensitivity of the initial
+
+MULTIPLE ALIGNMENT PARAMETERS control the gaps in the final multiple
+
+PROTEIN GAP PARAMETERS displays a temporary window which allows you to set
+various parameters only used in the alignment of protein sequences.
+
+(SECONDARY STRUCTURE PARAMETERS, for use with the Profile Alignment Mode only,
+allows you to set various parameters only used with gap penalty masks.)
+
+SAVE LOG FILE will write the alignment calculation scores to a file. The log
+filename is the same as the input sequence filename, with an extension .log
+
+OUTPUT FORMAT OPTIONS
+
+You can choose from 6 different alignment formats (CLUSTAL, GCG, NBRF/PIR,
+PHYLIP, GDE and NEXUS). You can choose more than one (or all 6 if you wish).
+
+CLUSTAL format output is a self-explanatory alignment format. It shows the
+sequences aligned in blocks. It can be read in again at a later date to (for
+example) calculate a phylogenetic tree or add in new sequences by profile
+
+GCG output can be used by any of the GCG programs that can work on multiple
+alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
+.msf format files (multiple sequence file); new in version 7 of GCG.
+
+NEXUS format is used by several phylogeny programs, including PAUP and
+
+PHYLIP format output can be used for input to the PHYLIP package of Joe
+Felsenstein. This is a very widely used package for doing every imaginable
+form of phylogenetic analysis (MUCH more than the modest introduction
+offered by this program).
+
+NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
+characters "-" are used to indicate the positions of gaps in the multiple
+alignment. These files can be re-used as input in any part of clustal that
+allows sequences (or alignments or profiles) to be read in.
+
+GDE: this format is used by the GDE package of Steven Smith and is understood
+by SEQLAB in GCG 9 or later.
+
+GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
+
+CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
+alignment lines in clustalw format.
+
+OUTPUT ORDER is used to control the order of the sequences in the output
+alignments. By default, it uses the order in which the sequences were aligned
+(from the guide tree/dendrogram), thus automatically grouping closely related
+sequences. It can be switched to be the same as the original input order.
+
+PARAMETER OUTPUT: This option will save all your parameter settings in a
+parameter file (suffix .par) during alignment. The file can be subsequently
+used to rerun ClustalW using the same parameters.
+
+ALIGNMENT PARAMETERS
+
+PAIRWISE ALIGNMENT PARAMETERS
+
+A distance is calculated between every pair of sequences and these are used to
+construct the phylogenetic tree which guides the final multiple alignment. The
+scores are calculated from separate pairwise alignments. These can be
+calculated using 2 methods: dynamic programming (slow but accurate) or by the
+method of Wilbur and Lipman (extremely fast but approximate).
+
+You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS
+option. The slow/accurate method is fast enough for short sequences but will be
+VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.
+
+SLOW-ACCURATE alignment parameters:
+
+These parameters do not have any effect on the speed of the alignments. They
+are used to give initial alignments which are then rescored to give percent
+identity scores. These % scores are the ones which are displayed on the
+screen. The scores are converted to distances for the trees.
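The rescoring step can be illustrated with a small Python sketch. Both choices below are assumptions made for the example rather than a statement of what ClustalX computes: identity is counted over columns where neither sequence has a gap, and the distance is taken as one minus the fractional identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_to_distance(pid: float) -> float:
    """Convert a percent identity score into a distance between 0 and 1."""
    return 1.0 - pid / 100.0
```

With these definitions, two identical sequences are at distance 0 and two sequences sharing no residues are at distance 1, which is the kind of value a tree-building step expects.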
+
+Gap Open Penalty: the penalty for opening a gap in the alignment.
+
+Gap Extension Penalty: the penalty for extending a gap by 1 residue.
+
+Protein Weight Matrix: the scoring table which describes the similarity of
+each amino acid to every other.
+
+Load protein matrix: allows you to read in a comparison table from a file.
+
+DNA weight matrix: the scores assigned to matches and mismatches (including
+IUB ambiguity codes).
+
+Load DNA matrix: allows you to read in a comparison table from a file.
+
+See the Multiple alignment parameters, MATRIX option below for details of the
+matrix input format.
+
+FAST-APPROXIMATE alignment parameters:
+
+These similarity scores are calculated from fast, approximate, global
+alignments, which are controlled by 4 parameters. 2 techniques are used to make
+these alignments very fast: 1) only exactly matching fragments (k-tuples) are
+considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
+
+GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
+little effect on the speed or sensitivity except for extreme values.
+
+K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
+INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
+For longer sequences (e.g. >1000 residues) you may wish to increase the
+
+TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
+dot-matrix plot) is calculated. Only the best ones (with most matches) are used
+in the alignment. This parameter specifies how many. Decrease for speed;
+increase for sensitivity.
+
+WINDOW SIZE: This is the number of diagonals around each of the 'best'
+diagonals that will be used. Decrease for speed; increase for sensitivity.
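To make the K-TUPLE SIZE and TOP DIAGONALS parameters concrete, here is a rough Python sketch of the counting step only. It is not the Wilbur and Lipman implementation used by the program: the function name and defaults are hypothetical, and the WINDOW SIZE and GAP PENALTY stages are left out.

```python
from collections import Counter, defaultdict


def top_diagonals(seq_a: str, seq_b: str, ktuple: int = 2, top: int = 5) -> list:
    """Count exact k-tuple matches per diagonal and return the best diagonals.

    A diagonal is identified by the offset i - j, where i and j are the start
    positions of a matching k-tuple in seq_a and seq_b respectively.
    """
    # Index every k-tuple of seq_a by its start positions.
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktuple + 1):
        positions[seq_a[i:i + ktuple]].append(i)

    # Walk seq_b and credit each match to its diagonal.
    counts = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], ()):
            counts[i - j] += 1

    # Keep only the diagonals with the most matches.
    return counts.most_common(top)
```

A larger `ktuple` produces fewer, more specific matches (faster, less sensitive), while a larger `top` keeps more candidate diagonals (slower, more sensitive), which mirrors the advice given for each parameter.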
+
+MULTIPLE ALIGNMENT PARAMETERS
+
+These parameters control the final multiple alignment. This is the core of the
+program and the details are complicated. To fully understand the use of the
+parameters and the scoring system, you will have to refer to the documentation.
+
+Each step in the final multiple alignment consists of aligning two alignments
+or sequences. This is done progressively, following the branching order in the
+GUIDE TREE. The basic parameters to control this are two gap penalties and the
+scores for various identical/non-identical residues.
+
+The GAP OPENING and EXTENSION PENALTIES can be set here. These control the
+cost of opening up every new gap and the cost of every item in a gap.
+Increasing the gap opening penalty will make gaps less frequent. Increasing
+the gap extension penalty will make gaps shorter. Terminal gaps are not
+
+The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most distantly
+related sequences until after the most closely related sequences have been
+aligned. The setting shows the percent identity level required to delay the
+addition of a sequence; sequences that are less identical than this level to
+any other sequences will be aligned later.
+
+The TRANSITION WEIGHT gives transitions (A<-->G or C<-->T i.e. purine-purine or
+pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero
+means that the transitions are scored as mismatches, while a weight of 1 gives
+the transitions the match score. For distantly related DNA sequences, the
+weight should be near to zero; for closely related sequences it can be useful
+to assign a higher score. The default is set to 0.5.
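The effect of the transition weight can be shown with a short scoring function. This is a sketch under assumptions: the help text fixes only the two end points (weight 0 gives the mismatch score, weight 1 gives the match score), so the linear interpolation in between, the function name and the default match and mismatch values are illustrative choices.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(x: str, y: str, match: float = 1.0, mismatch: float = 0.0,
                   transition_weight: float = 0.5) -> float:
    """Score one aligned pair of bases, giving transitions partial credit."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition_weight must lie between 0 and 1")
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    if is_transition:
        # Assumption: interpolate linearly between mismatch and match scores.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversions always receive the mismatch score
```

With the default weight of 0.5, an A/G pair scores halfway between a mismatch and a match, while an A/C pair (a transversion) still scores as a plain mismatch.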
+
+The PROTEIN WEIGHT MATRIX option allows you to choose a series of weight
+matrices. For protein alignments, you use a weight matrix to determine the
+similarity of non-identical amino acids. For example, Tyr aligned with Phe is
+usually judged to be 'better' than Tyr aligned with Pro.
+
+There are three 'in-built' series of weight matrices offered. Each consists of
+several matrices which work differently at different evolutionary distances. To
+see the exact details, read the documentation. Crudely, we store several
+matrices in memory, spanning the full range of amino acid distance (from almost
+identical sequences to highly divergent ones). For very similar sequences, it
+is best to use a strict weight matrix which only gives a high score to
+identities and the most favoured conservative substitutions. For more divergent
+sequences, it is appropriate to use "softer" matrices which give a high score
+to many other frequent substitutions.
+
+1) BLOSUM (Henikoff). These matrices appear to be the best available for
+carrying out data base similarity (homology searches). The matrices currently
+used are: Blosum 80, 62, 45 and 30. BLOSUM was the default in earlier Clustal X
+
+2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We
+currently use the PAM 20, 60, 120, 350 matrices.
+
+3) GONNET. These matrices were derived using almost the same procedure as the
+Dayhoff one (above) but are much more up to date and are based on a far larger
+data set. They appear to be more sensitive than the Dayhoff series. We
+currently use the GONNET 80, 120, 160, 250 and 350 matrices. This series is the
+default for Clustal X version 1.8.
+
+We also supply an identity matrix which gives a score of 10 to two identical
+amino acids and a score of zero otherwise. This matrix is not very useful.
+
+Load protein matrix: allows you to read in a comparison matrix from a file.
+This can be either a single matrix or a series of matrices (see below for
+
+The DNA WEIGHT MATRIX option allows you to select a single matrix (not a series)
+used for aligning nucleic acid sequences. Two hard-coded matrices are available:
+
+1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
+of nucleic acid sequences. X's and N's are treated as matches to any IUB
+ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
+
+2) CLUSTALW(1.6). A previous system used by ClustalW, in which matches score
+1.0 and mismatches score 0. All matches for IUB symbols also score 0.
+
+Load DNA matrix: allows you to read in a nucleic acid comparison matrix from a
+file (just one matrix, not a series).
+
+SINGLE MATRIX INPUT FORMAT
+The format used for a single matrix is the same as that used by the BLAST
+program. The scores in the new weight matrix should be similarities. You can
+use negative as well as positive values if you wish, although the matrix will
+be automatically adjusted to all positive scores, unless the NEGATIVE MATRIX
+option is selected.
+Any lines beginning with a # character are assumed to be comments. The first
+non-comment line should contain a list of amino acids in any order, using the 1
+letter code, followed by a * character. This should be followed by a square
+matrix of scores, with one row and one column for each amino acid. The last row
+and column of the matrix (corresponding to the * character) contain the minimum
+score over the whole matrix.
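A reader for the single-matrix layout described above could be sketched as follows. The function name is hypothetical, and accepting an optional leading residue letter on each row (as BLAST matrix files usually have) is an assumption that goes beyond what the text states.

```python
def parse_blast_matrix(text: str) -> dict:
    """Parse a BLAST-style matrix into a {(row, column): score} dictionary."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    if not rows:
        raise ValueError("no matrix data found")

    symbols, body = rows[0], rows[1:]   # header: residue letters, ending in '*'
    if len(body) != len(symbols):
        raise ValueError("matrix is not square")

    scores = {}
    for row_symbol, fields in zip(symbols, body):
        if len(fields) == len(symbols) + 1:
            fields = fields[1:]         # assumption: drop a leading row label
        if len(fields) != len(symbols):
            raise ValueError(f"row {row_symbol!r} has the wrong number of scores")
        for col_symbol, value in zip(symbols, fields):
            scores[(row_symbol, col_symbol)] = float(value)
    return scores
```

After parsing, `scores[("*", "*")]` should equal the minimum score in the matrix if the file follows the convention described for the last row and column.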
+
+MATRIX SERIES INPUT FORMAT
+ClustalX uses different matrices depending on the mean percent identity of the
+sequences to be aligned. You can specify a series of matrices and the range of
+the percent identity for each matrix in a matrix series file. The file is
+automatically recognised by the word CLUSTAL_SERIES at the beginning of the
+file. Each matrix in the series is then specified on one line which should
+start with the word MATRIX. This is followed by the lower and upper limits of
+the sequence percent identities for which you want to apply the matrix. The
+final entry on the matrix line is the filename of a Blast format matrix file
+(see above for details of the single matrix file format).
+
+MATRIX 81 100 /us1/user/julie/matrices/blosum80
+MATRIX 61 80 /us1/user/julie/matrices/blosum62
+MATRIX 31 60 /us1/user/julie/matrices/blosum45
+MATRIX 0 30 /us1/user/julie/matrices/blosum30
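The series file layout lends itself to a small lookup routine. The sketch below is illustrative only: the function names are hypothetical, and rounding the mean percent identity to the nearest integer before the range check is an assumption (the help text does not say how fractional values are handled).

```python
def parse_matrix_series(text: str) -> list:
    """Parse a CLUSTAL_SERIES file into (low, high, filename) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("CLUSTAL_SERIES"):
        raise ValueError("file does not start with CLUSTAL_SERIES")
    series = []
    for line in lines[1:]:
        fields = line.split(None, 3)    # MATRIX, low, high, filename
        if fields[0] != "MATRIX":
            continue                    # ignore anything that is not a MATRIX line
        if len(fields) != 4:
            raise ValueError(f"malformed MATRIX line: {line!r}")
        series.append((int(fields[1]), int(fields[2]), fields[3]))
    return series


def matrix_for_identity(series: list, mean_pid: float):
    """Return the matrix filename whose range contains the percent identity."""
    pid = round(mean_pid)               # assumption: compare on whole percents
    for low, high, filename in series:
        if low <= pid <= high:
            return filename
    return None
```

Using the four MATRIX lines from the example above, a set of sequences with a mean identity of 72% would select the blosum62 file.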
+
+PROTEIN GAP PARAMETERS
+
+RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or
+increase the gap opening penalties at each position in the alignment or
+sequence. See the documentation for details. As an example, positions that are
+rich in glycine are more likely to have an adjacent gap than positions that are
+
+HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a
+run (5 or more residues) of hydrophilic amino acids; these are likely to be
+loop or random coil regions where gaps are more common. The residues that are
+"considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
+
+GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
+to each other. Gaps that are less than this distance apart are penalised more
+than other gaps. This does not prevent close gaps; it makes them less frequent,
+promoting a block-like appearance of the alignment.
+
+END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
+avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you
+turn this off, end gaps will be ignored for this purpose. This is useful when
+you wish to align fragments where the end gaps are not biologically meaningful.
+
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
959
+<CENTER><H2><A NAME="P"> Profile and Structure Alignments
964
+By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
965
+alignments allow you to store alignments of your favourite sequences and add
966
+new sequences to them in small bunches at a time. A profile is simply an
967
+alignment of one or more sequences (e.g. an alignment output file from Clustal
968
+X). Each input can be a single sequence. One or both sets of input sequences
969
+may include secondary structure assignments or gap penalty masks to guide the
973
+Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above
974
+the sequence display area. Then, use the ALIGNMENT menu to do profile and
975
+secondary structure alignments.
978
+The profiles can be in any of the allowed input formats with "-" characters
979
+used to specify gaps (except for GCG/MSF where "." is used).
982
+You have to load the 2 profiles by choosing FILE, LOAD PROFILE 1 and LOAD
983
+PROFILE 2. Then ALIGNMENT, ALIGN PROFILE 2 TO PROFILE 1 will align the 2
984
+profiles to each other. Secondary structure masks in either profile can be used
985
+to guide the alignment. This option compares all the sequences in profile 1
986
+with all the sequences in profile 2 in order to build guide trees which will be
987
+used to calculate sequence weights, and select appropriate alignment parameters
988
+for the final profile alignment.
991
+You can skip the first stage (pairwise alignments; guide trees) by using old
992
+guide tree files (ALIGN PROFILES FROM GUIDE TREES).
995
+The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second
996
+profile and align them to the first profile, one at a time. This is useful to
997
+add some new sequences to an existing alignment, or to align a set of sequences
998
+to a known structure. In this case, the second profile set need not be
1002
+You can skip the first stage (pairwise alignments; guide tree) by using an old
1003
+guide tree file (ALIGN SEQUENCES TO PROFILE 1 FROM TREE).
1006
+SAVE LOG FILE will write the alignment calculation scores to a file. The log
1007
+filename is the same as the input sequence filename, with an extension .log
1011
+The alignment parameters can be set using the ALIGNMENT PARAMETERS menu,
1012
+Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options.
1013
+These are EXACTLY the same parameters as used by the general, automatic
1014
+multiple alignment procedure. The general multiple alignment procedure is
1015
+simply a series of profile alignments. Carrying out a series of profile
1016
+alignments on larger and larger groups of sequences allows you to manually
1017
+build up a complete alignment, if necessary editing intermediate alignments.
1021
+SECONDARY STRUCTURE PARAMETERS
1025
+Use this menu to set secondary structure options. If a solved structure is
1026
+known, it can be used to guide the alignment by raising gap penalties within
1027
+secondary structure elements, so that gaps will preferentially be inserted into
1028
+unstructured surface loop regions. Alternatively, a user-specified gap penalty
1029
+mask can be supplied for a similar purpose.
1032
+A gap penalty mask is a series of numbers between 1 and 9, one per position in
1033
+the alignment. Each number specifies how much the gap opening penalty is to be
1034
+raised at that position (raised by multiplying the basic gap opening penalty
1035
+by the number) i.e. a mask figure of 1 at a position means no change
1036
+in gap opening penalty; a figure of 4 means that the gap opening penalty is
1037
+four times greater at that position, making gaps 4 times harder to open.
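+The multiplication described above can be sketched in a few lines of Python.
+This is only an illustration; the function name and the base penalty value are
+hypothetical, not taken from the Clustal X source.

```python
def masked_gap_open(base_penalty, mask):
    """Return the gap opening penalty at each position of a penalty mask.

    `mask` is a string of digits 1-9, one per alignment position; each digit
    multiplies the basic gap opening penalty at that position.
    """
    penalties = []
    for ch in mask:
        if ch not in "123456789":
            raise ValueError("mask figures must be between 1 and 9")
        penalties.append(base_penalty * int(ch))
    return penalties
```

+With a basic penalty of 10.0, the mask "1124" would give penalties of 10.0,
+10.0, 20.0 and 40.0 at the four positions.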
1040
+The format for gap penalty masks and secondary structure masks is explained in
1041
+a separate help section.
1045
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
1046
+<CENTER><H2><A NAME="B"> Secondary Structure / Gap Penalty Masks
1051
+The use of secondary structure-based penalties has been shown to improve the
1052
+accuracy of sequence alignment. Clustal X now allows secondary structure / gap
1053
+penalty masks to be supplied with the input sequences used during profile
1054
+alignment. (NB. The secondary structure information is NOT used during multiple
1055
+sequence alignment). The masks work by raising gap penalties in specified
1056
+regions (typically secondary structure elements) so that gaps are
1057
+preferentially opened in the less well conserved regions (typically surface
1061
+The USE PROFILE 1(2) SECONDARY STRUCTURE / GAP PENALTY MASK options control
1062
+whether the input 2D-structure information or gap penalty masks will be used
1063
+during the profile alignment.
1066
+The OUTPUT options control whether the secondary structure and gap penalty
1067
+masks should be included in the Clustal X output alignments. Showing both is
1068
+useful for understanding how the masks work. The 2D-structure information is
1069
+itself useful in judging the alignment quality and in seeing how residue
1070
+conservation patterns vary with secondary structure.
1073
+The HELIX and STRAND GAP PENALTY options provide the value for raising the gap
1074
+penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL
1075
+format, capital residues denote the A and B core structure notation. Basic gap
1076
+penalties are multiplied by the amount specified.
1079
+The LOOP GAP PENALTY option provides the value for the gap penalty in Loops.
1080
+By default this penalty is not raised. In CLUSTAL format, loops are specified
1081
+by "." in the secondary structure notation.
1084
+The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap
1085
+penalty at the ends of secondary structures. Ends of secondary structures are
1086
+known to grow or shrink between related structures. Therefore by default
1087
+these are given intermediate values, lower than the core penalties. All
1088
+secondary structure read in as lower case in CLUSTAL format gets the reduced
1092
+The HELIX and STRAND TERMINAL POSITIONS options specify the range of structure
1093
+termini for the intermediate penalties. In the alignment output, these are
1094
+indicated as lower case. For Alpha Helices, by default, the range spans the
1095
+end-helical turn (3 residues). For Beta Strands, the default range spans the
1096
+end residue and the adjacent loop residue, since sequence conservation often
1097
+extends beyond the actual H-bonded Beta Strand.
1100
+Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input
1101
+files. For many 3-D protein structures, secondary structure information is
1102
+recorded in the feature tables of SWISS-PROT database entries. You should
1103
+always check that the assignments are correct - some are quite inaccurate.
1104
+Clustal X looks for SWISS-PROT HELIX and STRAND assignments e.g.
1115
+The structure and penalty masks can also be read from CLUSTAL alignment format
1116
+as comment lines beginning "!SS_" or "!GM_" e.g.
1120
+!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
1121
+!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444
1122
+HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
1126
+Note that the mask itself is a set of numbers between 1 and 9 each of which is
1127
+assigned to the residue(s) in the same column below.
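+To show how the structure notation relates to a penalty mask, the Python
+sketch below converts a secondary structure string into mask figures. The
+penalty values (4 for core residues, 2 for termini, 1 for loops) are
+assumptions chosen to match the example above; in Clustal X they are set by
+the HELIX, STRAND, LOOP and TERMINAL penalty options. The function name is
+hypothetical.

```python
def structure_to_mask(ss, helix=4, strand=4, loop=1, terminal=2):
    """Convert a CLUSTAL-style secondary structure string to a penalty mask.

    'A' / 'B' are core helix / strand residues, lower case 'a' / 'b' are
    structure termini, and anything else (e.g. '.') is treated as loop.
    """
    figures = []
    for ch in ss:
        if ch == "A":
            figures.append(helix)
        elif ch == "B":
            figures.append(strand)
        elif ch in "ab":
            figures.append(terminal)
        else:
            figures.append(loop)
    return "".join(str(f) for f in figures)
```

+Applied to the start of the !SS_ line above, "..aaaAAA" would become
+"11222444", which agrees with the start of the !GM_ line.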
1130
+In GDE flat file format, the masks are specified as text and the names must
1131
+begin with "SS_" or "GM_".
1134
+Either a structure or penalty mask or both may be used. If both are included
1135
+in an alignment, the user will be asked which is to be used.
1141
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
1142
+<CENTER><H2><A NAME="T"> Phylogenetic Trees
1147
+Before calculating a tree, you must have an ALIGNMENT in memory. This can be
1148
+input using the FILE menu, LOAD SEQUENCES option or you should have just
1149
+carried out a full multiple alignment and the alignment is still in memory.
1150
+Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!
1153
+The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
1154
+you calculate distances (percent divergence) between all pairs of sequences from
1155
+a multiple alignment; second you apply the NJ method to the distance matrix.
1158
+To calculate a tree, use the DRAW N-J TREE option. This gives an UNROOTED tree
1159
+and all branch lengths. The root of the tree can only be inferred by using an
1160
+outgroup (a sequence that you are certain branches at the outside of the tree
1161
+.... certain on biological grounds) OR if you assume a degree of constancy in
1162
+the 'molecular clock', you can place the root in the 'middle' of the tree
1163
+(roughly equidistant from all tips).
1166
+BOOTSTRAP N-J TREE uses a method for deriving confidence values for the
1167
+groupings in a tree (first adapted for trees by Joe Felsenstein). It involves
1168
+making N random samples of sites from the alignment (N should be LARGE, e.g.
1169
+500 - 1000); drawing N trees (1 from each sample) and counting how many times
1170
+each grouping from the original tree occurs in the sample trees. You can set N
1171
+using the NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In
1172
+practice, you should use a large number of bootstrap replicates (1000 is
1173
+recommended, even if it means running the program for an hour on a slow
1174
+computer). You can also supply a seed number for the random number generator
1175
+here. Different runs with the same seed will give the same answer. See the
1176
+documentation for more details.
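+The resampling step of the bootstrap can be sketched as follows. This Python
+fragment only draws the N random samples of sites (with replacement) from a
+seeded generator; building a tree from each sample and counting the groupings
+is not shown. The function name is hypothetical, and the random number
+generator is Python's, not the one used by Clustal X.

```python
import random


def bootstrap_samples(alignment, n_trials, seed):
    """Yield n_trials resampled alignments, each the same size as the input.

    `alignment` is a list of equal-length aligned sequences. Columns are
    drawn at random with replacement; the same seed gives the same samples.
    """
    rng = random.Random(seed)
    length = len(alignment[0])
    for _ in range(n_trials):
        columns = [rng.randrange(length) for _ in range(length)]
        yield ["".join(seq[c] for c in columns) for seq in alignment]
```

+Because the generator is seeded, two runs with the same seed produce identical
+samples, which mirrors the behaviour described for the seed option.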
1179
+EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
1180
+ANY of the sequences have a gap will be ignored. This means that 'like' will
1181
+be compared to 'like' in all distances, which is highly desirable. It also
1182
+automatically throws away the most ambiguous parts of the alignment, which are
1183
+concentrated around gaps (usually). The disadvantage is that you may throw away
1184
+much of the data if there are many gaps (which is why it is difficult for us to
1185
+make it the default).
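+A minimal Python sketch of this filtering is given below. It assumes gaps are
+written as "-" or "." and that all sequences have the same length; the
+function name is hypothetical.

```python
def strip_gap_columns(alignment, gap_chars="-."):
    """Drop every alignment position where ANY sequence has a gap."""
    length = len(alignment[0])
    keep = [
        j for j in range(length)
        if all(seq[j] not in gap_chars for seq in alignment)
    ]
    return ["".join(seq[j] for j in keep) for seq in alignment]
```

+For the three sequences "AC-GT", "A-CGT" and "ACCG-", only the first and
+fourth positions are gap-free, so each sequence is reduced to "AG". This also
+shows how quickly data can be lost when gaps are frequent.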
1188
+CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this option
1189
+makes no difference. For greater divergence, this option corrects for the fact
1190
+that observed distances underestimate actual evolutionary distances. This is
1191
+because, as sequences diverge, more than one substitution will happen at many
1192
+sites. However, you only see one difference when you look at the present day
1193
+sequences. Therefore, this option has the effect of stretching branch lengths
1194
+in trees (especially long branches). The corrections used here (for DNA or
1195
+proteins) are both due to Motoo Kimura. See the documentation for details.
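+As an illustration of the stretching effect, the Python sketch below applies
+Kimura's empirical protein formula, K = -ln(1 - D - D*D/5), where D is the
+observed fraction of differing sites. That Clustal X uses exactly this form
+for proteins is an assumption here; the documentation gives the corrections
+actually used. The function name is hypothetical.

```python
import math


def kimura_protein_distance(observed):
    """Correct an observed distance D (0..1) for multiple substitutions.

    Uses Kimura's empirical formula K = -ln(1 - D - D^2 / 5). The formula is
    undefined for very divergent sequences, where the argument of the
    logarithm is no longer positive.
    """
    argument = 1.0 - observed - observed * observed / 5.0
    if argument <= 0.0:
        raise ValueError("distance too large to be corrected reliably")
    return -math.log(argument)
```

+Under this formula an observed distance of 0.05 is barely changed (about
+0.052), while an observed distance of 0.5 is stretched to about 0.80.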
1198
+Where possible, this option should be used. However, for VERY divergent
1199
+sequences, the distances cannot be reliably corrected. You will be warned if
1200
+this happens. Even if none of the distances in a data set exceed the reliable
1201
+threshold, if you bootstrap the data, some of the bootstrap distances may
1202
+randomly exceed the safe limit.
1205
+SAVE LOG FILE will write the tree calculation scores to a file. The log
1206
+filename is the same as the input sequence filename, with an extension .log
1211
+OUTPUT FORMAT OPTIONS
1215
+Four different formats are allowed. None of these displays the tree visually.
1216
+You can display the tree using the NJPLOT program distributed with Clustal X
1217
+OR get the PHYLIP package and use the tree drawing facilities there.
1220
+1) CLUSTAL FORMAT TREE. This format is verbose and lists all of the distances
1221
+between the sequences and the number of alignment positions used for each. The
1222
+tree is described at the end of the file. It lists the sequences that are
1223
+joined at each alignment step and the branch lengths. After two sequences are
1224
+joined, the pair is referred to later as a NODE. The number of a NODE is the number
1225
+of the lowest sequence in that NODE.
1228
+2) PHYLIP FORMAT TREE. This format is the New Hampshire format, used by many
1229
+phylogenetic analysis packages. It consists of a series of nested parentheses,
1230
+describing the branching order, with the sequence names and branch lengths. It
1231
+can be read by the NJPLOT program distributed with ClustalX. It can also be
1232
+used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see
1233
+the trees graphically. This is the same format used during multiple alignment
1234
+for the guide trees. Some other packages that can read and display New
1235
+Hampshire format are TreeTool, TreeView, and Phylowin.
1238
+3) PHYLIP DISTANCE MATRIX. This format just outputs a matrix of all the
1239
+pairwise distances in a format that can be used by the PHYLIP package. It used
1240
+to be useful when one could not produce distances from protein sequences in the
1241
+Phylip package but is now redundant (PROTDIST of Phylip 3.5 now does this).
1244
+4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
1245
+including PAUP and MacClade. The format is described fully in:
1246
+Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.
1247
+NEXUS: an extensible file format for systematic information.
1248
+Systematic Biology 46:590-621.
1251
+BOOTSTRAP LABELS ON: By default, the bootstrap values are correctly placed on
1252
+the tree branches of the phylip format output tree. The toggle allows them to
1253
+be placed on the nodes, which is incorrect, but some display packages (e.g.
1254
+TreeTool, TreeView and Phylowin) support only node labelling, not branch
1255
+labelling. Care should be taken to note which branches and labels go together.
1261
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
1262
+<CENTER><H2><A NAME="C"> Colors
1267
+Clustal X provides a versatile coloring scheme for the sequence alignment
1268
+display. The sequences (or profiles) are colored automatically, when they are
1269
+loaded. Sequences can be colored either by assigning a color to specific
1270
+residues, or on the basis of an alignment consensus. In the latter case, the
1271
+alignment consensus is calculated automatically, and the residues in each
1272
+column are colored according to the consensus character assigned to that
1273
+column. In this way, you can choose to highlight, for example, conserved
1274
+hydrophilic or hydrophobic positions in the alignment.
1277
+The 'rules' used to color the alignment are specified in a COLOR PARAMETER
1278
+FILE. Clustal X automatically looks for a file called 'colprot.par' for protein
1279
+sequences or 'coldna.par' for DNA, in the current directory. (If you are running
1280
+under UNIX, it then looks in your home directory, and finally in the
1281
+directories in your PATH environment variable).
1284
+By default, if no color parameter file is found, protein sequences are colored
1285
+by residue as follows:
1289
+ Color Residue Code
1299
+In the case of DNA sequences, the default colors are as follows:
1303
+ Color Residue Code
1315
+The default BACKGROUND COLORING option shows the sequence residues using a
1316
+black character on a colored background. It can be switched off to show
1317
+residues as a colored character on a white background.
1320
+Either BLACK AND WHITE or DEFAULT COLOR options can be selected. The Color
1321
+option looks first for the color parameter file (as described above) and, if no
1322
+file is found, uses the default residue-specific colors.
1325
+You can specify your own coloring scheme by using the LOAD COLOR PARAMETER FILE
1326
+option. The format of the color parameter file is described below.
1330
+COLOR PARAMETER FILE
1334
+This file is divided into 3 sections:
1337
+1) the names and rgb values of the colors
1338
+2) the rules for calculating the consensus
1339
+3) the rules for assigning colors to the residues
1342
+An example file is given here.
1346
+ --------------------------------------------------------------------
1355
+% = 60% w:l:v:i:m:a:f:c:y:h:p
1356
+# = 80% w:l:v:i:m:a:f:c:y:h:p
1372
+ --------------------------------------------------------------------
1376
+The first section is optional and is identified by the header @rgbindex. If
1377
+this section exists, each color used in the file must be named and the rgb
1378
+values specified (on a scale from 0 to 1). If the rgb index section is not
1379
+found, the following set of hard-coded colors will be used.
1389
+MAGENTA 0.9 0.1 0.9
1394
+The second section is optional and is identified by the header @consensus. It
1395
+defines how the consensus is calculated.
1398
+The format of each consensus parameter is:-
1402
+c = n% residue_list
1406
+ c is a character used to identify the parameter.
1407
+ n is an integer value used as the percentage cutoff
1409
+ residue_list is a list of residues denoted by a single
1410
+ character, delimited by a colon (:).
1414
+For example: # = 60% w:l:v:i
1417
+will assign a consensus character # to any column in the alignment which
1418
+contains more than 60% of the residues w,l,v and i.
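+A consensus parameter of this kind could be evaluated for one column roughly
+as in the Python sketch below. "More than" is read here as strictly greater
+than the cutoff, which is an assumption, and the function name is hypothetical.

```python
def consensus_char(column, symbol, cutoff_percent, residue_list):
    """Return `symbol` if the column satisfies the consensus parameter.

    `column` is a string with one residue per sequence, and `residue_list`
    is a colon-delimited list such as "w:l:v:i". Returns None otherwise.
    """
    residues = set(residue_list.lower().split(":"))
    hits = sum(1 for r in column.lower() if r in residues)
    percent = 100.0 * hits / len(column)
    return symbol if percent > cutoff_percent else None
```

+For the parameter "# = 60% w:l:v:i", the column "WLVIA" has 80% of its
+residues in the list and is assigned "#", whereas "WLAAA" (40%) is not.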
1423
+The third section is identified by the header @color, and defines how colors
1424
+are assigned to each residue in the alignment.
1427
+The color parameters can take one of two formats:
1432
+2) r = color if consensus_list
1436
+ r is a character used to denote a residue.
1437
+ color is one of the colors in the GDE color lookup table.
1438
+ residue_list is a list of residues denoted by a single
1439
+ character, delimited by a colon (:).
1447
+will color all glycines ORANGE, regardless of the consensus.
1450
+2) w = BLUE if w:%:#
1453
+will color BLUE any tryptophan which is found in a column with a consensus of
1460
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
1461
+<CENTER><H2><A NAME="Q"> Alignment Quality Analysis
1471
+Clustal X provides an indication of the quality of an alignment by plotting
1472
+a 'conservation score' for each column of the alignment. A high score indicates
1473
+a well-conserved column; a low score indicates low conservation. The quality
1474
+curve is drawn below the alignment.
1477
+Two methods are also provided to indicate single residues or sequence segments
1478
+which score badly in the alignment.
1481
+Low-scoring residues are expected to occur at a moderate frequency in all the
1482
+sequences because of their steady divergence due to the natural processes of
1483
+evolution. The most divergent sequences are likely to have the most outliers.
1484
+However, the highlighted residues are especially useful in pointing to
1485
+sequence misalignments. Note that clustering of highlighted residues is a
1486
+strong indication of misalignment. This can arise due to various reasons, for
1490
+ 1. Partial or total misalignments caused by a failure in the
1491
+ alignment algorithm. Usually only in difficult alignment cases.
1494
+ 2. Partial or total misalignments because at least one of the
1495
+ sequences in the given set is partly or completely unrelated to the
1496
+ other sequences. It is up to the user to check that the set of
1497
+ sequences is alignable.
1500
+ 3. Frameshift translation errors in a protein sequence causing local
1501
+ mismatched regions to be heavily highlighted. These are surprisingly
1502
+ common in database entries. If suspected, a 3-frame translation of
1503
+ the source DNA needs to be examined.
1506
+Occasionally, highlighted residues may point to regions of some biological
1507
+significance. This might happen for example if a protein alignment contains a
1508
+sequence which has acquired new functions relative to the main sequence set. It
1509
+is important to exclude other explanations, such as error or the natural
1510
+divergence of sequences, before invoking a biological explanation.
1516
+LOW-SCORING SEGMENTS
1520
+Unreliable regions in the alignment can be highlighted using the Low-Scoring
1521
+Segments option. A sequence-weighted profile is used to indicate any segments
1522
+in the sequences which score badly. Because the profile calculation may take
1523
+some time, an option is provided to calculate LOW-SCORING SEGMENTS. The
1524
+segment display can then be toggled on or off without having to repeat the
1525
+time-consuming calculations.
1528
+For details of the low-scoring segment calculation, see the CALCULATION section
1535
+LOW-SCORING SEGMENT PARAMETERS
1539
+MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be
1540
+hidden by increasing the minimum length of segments which will be displayed.
1543
+DNA MARKING SCALE is used to remove less significant segments from the
1544
+highlighted display. Increase the scale to display more segments; decrease the
1545
+scale to remove the least significant.
1550
+PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of each
1551
+amino acid to each other. The matrix is used to calculate the sequence-
1552
+weighted profile scores. There are four 'in-built' Log-Odds matrices offered:
1553
+the Gonnet PAM 80, 120, 250, 350 matrices. A more stringent matrix which only
1554
+gives a high score to identities and the most favoured conservative
1555
+substitutions may be more suitable when the sequences are closely related. For
1556
+more divergent sequences, it is appropriate to use "softer" matrices which give
1557
+a high score to many other frequent substitutions. This option automatically
1558
+recalculates the low-scoring segments.
1563
+DNA WEIGHT MATRIX: Two hard-coded matrices are available:
1566
+1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
1567
+of nucleic acid sequences. X's and N's are treated as matches to any IUB
1568
+ambiguity symbol. All matches score 1.0; all mismatches for IUB symbols score
1572
+2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score
1573
+1.0 and mismatches score 0. All matches for IUB symbols also score 0.
1576
+A new matrix can be read from a file on disk, if the filename consists only
1577
+of lower case characters. The values in the new weight matrix should be
1578
+similarities and should be NEGATIVE for infrequent substitutions.
1581
+INPUT FORMAT. The format used for a new matrix is the same as the BLAST
1582
+program. Any lines beginning with a # character are assumed to be comments. The
1583
+first non-comment line should contain a list of amino acids in any order, using
1584
+the 1 letter code, followed by a * character. This should be followed by a
1585
+square matrix of scores, with one row and one column for each amino acid. The
1586
+last row and column of the matrix (corresponding to the * character) contain
1587
+the minimum score over the whole matrix.
1591
+QUALITY SCORE PARAMETERS
1595
+You can customise the column 'quality scores' plotted underneath the alignment
1596
+display using the following options.
1599
+SCORE PLOT SCALE: this is a scalar value from 1 to 10, which can be used to
1600
+change the scale of the quality score plot.
1603
+RESIDUE EXCEPTION CUTOFF: this is a scalar value from 1 to 10, which can be
1604
+used to change the number of residue exceptions which are highlighted in the
1605
+alignment display. (For an explanation of this cutoff, see the CALCULATION OF
1606
+RESIDUE EXCEPTIONS section below.)
1609
+PROTEIN WEIGHT MATRIX: the scoring table which describes the similarity of
1610
+each amino acid to each other.
1613
+DNA WEIGHT MATRIX: two hard-coded matrices are available: IUB and CLUSTALW(1.6).
1616
+For more information about the weight matrices, see the help above for
1617
+the Low-scoring Segments Weight Matrix.
1620
+For details of the quality score calculations, see the CALCULATION section
1627
+SHOW LOW-SCORING SEGMENTS
1631
+The low-scoring segment display can be toggled on or off. This option does not
1632
+recalculate the profile scores.
1638
+SHOW EXCEPTIONAL RESIDUES
1642
+This option highlights individual residues which score badly in the alignment
1643
+quality calculations. Residues which score exceptionally low are highlighted by
1644
+using a white character on a grey background.
1648
+SAVE QUALITY SCORES TO FILE
1652
+The quality scores that are plotted underneath the alignment display can also
1653
+be saved in a text file. Each column in the alignment is written on one line in
1654
+the output file, with the value of the quality score at the end of the line.
1655
+Only the sequences currently selected in the display are written to the file.
1656
+One use for quality scores is to color residues in a protein structure by
1657
+sequence conservation. In this way conserved surface residues can be
1658
+highlighted to locate functional regions such as ligand-binding sites.
1664
+CALCULATION OF QUALITY SCORES
1668
+Suppose we have an alignment of m sequences of length n. Then, the alignment
1673
+ A11 A12 A13 .......... A1n
1674
+ A21 A22 A23 .......... A2n
1677
+ Am1 Am2 Am3 .......... Amn
1681
+We also have a residue comparison matrix of size R where C(i,j) is the score
1682
+for aligning residue i with residue j.
1685
+We want to calculate a score for the conservation of the jth position in the
1689
+To do this, we define an R-dimensional sequence space. For the jth position in
1690
+the alignment, each sequence consists of a single residue which is assigned a
1691
+point S in the space. S has R dimensions, and for sequence i, the rth dimension
1700
+We then calculate a consensus value for the jth position in the alignment. This
1701
+value X also has R dimensions, and the rth dimension is defined as:
1705
+ Xr = ( SUM (Fij * C(i,r)) ) / m
1710
+where Fij is the count of residues i at position j in the alignment.
1713
+Now we can calculate the distance Di between each sequence i and the consensus
1714
+position X in the R-dimensional space.
1718
+ Di = SQRT ( SUM (Xr - Sr)(Xr - Sr) )
1725
+The quality score for the jth position in the alignment is defined as the mean
1726
+of the sequence distances Di.
1729
+The score is normalised by multiplying by the percentage of sequences which
1730
+have residues (and not gaps) at this position.
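+The calculation above can be sketched in Python for a single column. Several
+details are assumptions made for the sake of a runnable example: the point S
+for a sequence is taken as Sr = C(r, a), where a is that sequence's residue;
+the consensus is computed over the non-gap residues only; the normalisation
+uses a fraction rather than a percentage; and the three-residue matrix holds
+made-up values. The function names are hypothetical.

```python
import math

# Toy symmetric comparison matrix over three residues (hypothetical values).
RESIDUES = "AGS"
C = {
    "A": {"A": 2, "G": 0, "S": 1},
    "G": {"A": 0, "G": 5, "S": 0},
    "S": {"A": 1, "G": 0, "S": 2},
}


def column_distances(residues):
    """Distance Di of each residue in a gap-free column from the consensus X."""
    m = len(residues)
    # Xr = SUM over residue types i of (count of i * C(i, r)) / m
    X = [sum(residues.count(i) * C[i][r] for i in RESIDUES) / m
         for r in RESIDUES]
    distances = []
    for a in residues:
        S = [C[r][a] for r in RESIDUES]  # assumed definition of the point S
        distances.append(math.sqrt(sum((x - s) ** 2 for x, s in zip(X, S))))
    return distances


def column_score(column, gap="-"):
    """Mean distance, scaled by the fraction of sequences without a gap."""
    residues = [a for a in column if a != gap]
    if not residues:
        return 0.0
    distances = column_distances(residues)
    mean_distance = sum(distances) / len(distances)
    return mean_distance * len(residues) / len(column)
```

+In this sketch a fully conserved column such as "AAA" has all its points at
+the consensus, so every distance Di is zero.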
1734
+CALCULATION OF RESIDUE EXCEPTIONS
1738
+The jth residue of the ith sequence is considered as an exception if the
1739
+distance Di of the sequence from the consensus value X is greater than (Upper
1740
+Quartile + Inter Quartile Range * Cutoff). The value used as a cutoff for
1741
+displaying exceptions can be set from the SCORE PARAMETERS menu. A high cutoff
1742
+value will only display very significant exceptions; a low value will allow
1743
+more, less significant, exceptions to be highlighted.
1746
+(NB. Sequences which contain gaps at this position are not included in the
1747
+exception calculation.)
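+The exception test can be sketched in Python as below. The quartiles are
+computed with the standard library's default method, which may differ from
+the method used by Clustal X, and the function name is hypothetical. At least
+two distances are assumed.

```python
import statistics


def exception_indices(distances, cutoff):
    """Return the indices of sequences whose distance Di is exceptional.

    A distance is exceptional if it is greater than
    Upper Quartile + Inter Quartile Range * cutoff.
    """
    q1, _median, q3 = statistics.quantiles(distances, n=4)
    threshold = q3 + (q3 - q1) * cutoff
    return [i for i, d in enumerate(distances) if d > threshold]
```

+Raising the cutoff raises the threshold, so fewer residues are reported,
+in line with the behaviour described above.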
1753
+CALCULATION OF LOW-SCORING SEGMENTS
1757
+Suppose we have an alignment of m sequences of length n. Then, the alignment
1762
+ A11 A12 A13 .......... A1n
1763
+ A21 A22 A23 .......... A2n
1766
+ Am1 Am2 Am3 .......... Amn
1770
+We also have a residue comparison matrix of size R where C(i,j) is the score
1771
+for aligning residue i with residue j.
1774
+We calculate sequence weights by building a neighbour-joining tree, in which
1775
+branch lengths are proportional to divergence. Summing the branches by branch
1776
+ownership provides the weights. See Thompson et al., CABIOS, 10, 19 (1994) and
1776
+Henikoff et al., JMB, 243, 574 (1994).
1780
+To find the low-scoring segments in a sequence Si, we build a weighted profile
1781
+of the remaining sequences in the alignment. Suppose we find residue r at
1782
+position j in the sequence; then the score for the jth position in the sequence
1787
+ Score(Si,j) = Profile(j,r) where Profile(j,r) is the profile score
1788
+ for residue r at position j in the
1793
+These residue scores are summed along the sequence in both forward and backward
1794
+directions. If the sum of the scores is positive, then it is reset to zero.
1795
+Segments which score negatively in both directions are considered as
1796
+'low-scoring' and will be highlighted in the alignment display.
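+The forward and backward summation can be sketched in Python as follows. The
+input is assumed to be the list of per-position scores Score(Si,j) for one
+sequence; how Clustal X handles gaps and the minimum segment length is not
+shown. The function name is hypothetical.

```python
def low_scoring_mask(scores):
    """Mark positions that score negatively in both directions.

    The scores are summed along the sequence; whenever the running sum
    becomes positive it is reset to zero. A position is low-scoring if the
    running sum is negative in both the forward and the backward pass.
    """
    def running_sums(values):
        total, sums = 0.0, []
        for v in values:
            total = min(0.0, total + v)  # reset to zero if positive
            sums.append(total)
        return sums

    forward = running_sums(scores)
    backward = running_sums(scores[::-1])[::-1]
    return [f < 0 and b < 0 for f, b in zip(forward, backward)]
```

+For the scores 2, -1, -3, 1, 4, 2, only the second and third positions are
+negative in both passes and would be highlighted.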
1802
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
1803
+<CENTER><H2><A NAME="9"> Command Line Parameters
1805
+<CENTER><H3> DATA (sequences)
1807
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1809
+<TD><STRONG>Parameter</STRONG></TD>
1810
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1813
+<TD><TT>-PROFILE1=file.ext and -PROFILE2=file.ext </TT></TD>
1814
+<TD><EM>profiles (aligned sequences)</EM></TD>
1817
+<CENTER><H3> VERBS (do things)
1819
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1821
+<TD><STRONG>Parameter</STRONG></TD>
1822
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1825
+<TD><TT>-HELP or -CHECK </TT></TD>
1826
+<TD><EM>outline the command line parameters</EM></TD>
1829
+<TD><TT>-ALIGN </TT></TD>
1830
+<TD><EM>do full multiple alignment </EM></TD>
1833
+<TD><TT>-TREE </TT></TD>
1834
+<TD><EM>calculate NJ tree</EM></TD>
1837
+<TD><TT>-BOOTSTRAP(=n) </TT></TD>
1838
+<TD><EM>bootstrap a NJ tree (n= number of bootstraps; def. = 1000)</EM></TD>
1841
+<TD><TT>-CONVERT </TT></TD>
1842
+<TD><EM>output the input sequences in a different file format</EM></TD>
1845
+<CENTER><H3> PARAMETERS (set things)
1847
+<CENTER><P><STRONG>***General settings:***
1848
+</STRONG></P></CENTER>
1849
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1851
+<TD><STRONG>Parameter</STRONG></TD>
1852
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1855
+<TD><TT>-INTERACTIVE </TT></TD>
1856
+<TD><EM>read command line, then enter normal interactive menus</EM></TD>
1859
+<TD><TT>-QUICKTREE </TT></TD>
1860
+<TD><EM>use FAST algorithm for the alignment guide tree</EM></TD>
1863
+<TD><TT>-TYPE= </TT></TD>
1864
+<TD><EM>PROTEIN or DNA sequences</EM></TD>
1867
+<TD><TT>-NEGATIVE </TT></TD>
1868
+<TD><EM>protein alignment with negative values in matrix</EM></TD>
1871
+<TD><TT>-OUTFILE= </TT></TD>
1872
+<TD><EM>sequence alignment file name</EM></TD>
1875
+<TD><TT>-OUTPUT= </TT></TD>
1876
+<TD><EM>GCG, GDE, PHYLIP, PIR or NEXUS</EM></TD>
1879
+<TD><TT>-OUTORDER= </TT></TD>
1880
+<TD><EM>INPUT or ALIGNED</EM></TD>
1883
+<TD><TT>-CASE= </TT></TD>
1884
+<TD><EM>LOWER or UPPER (for GDE output only)</EM></TD>
1887
+<TD><TT>-SEQNOS= </TT></TD>
1888
+<TD><EM>OFF or ON (for Clustal output only)</EM></TD>
1891
+<CENTER><H3>***Fast Pairwise Alignments:***
1893
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1895
+<TD><STRONG>Parameter</STRONG></TD>
1896
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1899
+<TD><TT>-TOPDIAGS=n </TT></TD>
1900
+<TD><EM>number of best diags.</EM></TD>
1903
+<TD><TT>-WINDOW=n </TT></TD>
1904
+<TD><EM>window around best diags.</EM></TD>
1907
+<TD><TT>-PAIRGAP=n </TT></TD>
1908
+<TD><EM>gap penalty</EM></TD>
1911
+<TD><TT>-SCORE= </TT></TD>
1912
+<TD><EM>PERCENT or ABSOLUTE</EM></TD>
1915
+<CENTER><H3>***Slow Pairwise Alignments:***
1917
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1919
+<TD><STRONG>Parameter</STRONG></TD>
1920
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1923
+<TD><TT>-PWDNAMATRIX= </TT></TD>
1924
+<TD><EM>DNA weight matrix=IUB, CLUSTALW or filename</EM></TD>
1927
+<TD><TT>-PWGAPOPEN=f </TT></TD>
1928
+<TD><EM>gap opening penalty</EM></TD>
1931
+<TD><TT>-PWGAPEXT=f </TT></TD>
1932
+<TD><EM>gap extension penalty</EM></TD>
1935
+<CENTER><H3>***Multiple Alignments:***
1937
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
1939
+<TD><STRONG>Parameter</STRONG></TD>
1940
+<TD><STRONG><EM>Description</EM></STRONG></TD>
1943
+<TD><TT>-USETREE= </TT></TD>
1944
+<TD><EM>file for old guide tree</EM></TD>
1947
+<TD><TT>-MATRIX= </TT></TD>
1948
+<TD><EM>Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename</EM></TD>
1951
+<TD><TT>-DNAMATRIX= </TT></TD>
1952
+<TD><EM>DNA weight matrix=IUB, CLUSTALW or filename</EM></TD>
+<TD><TT>-GAPOPEN=f </TT></TD>
+<TD><EM>gap opening penalty</EM></TD>
+<TD><TT>-GAPEXT=f </TT></TD>
+<TD><EM>gap extension penalty</EM></TD>
+<TD><TT>-ENDGAPS </TT></TD>
+<TD><EM>no end gap separation penalty</EM></TD>
+<TD><TT>-GAPDIST=n </TT></TD>
+<TD><EM>gap separation penalty range</EM></TD>
+<TD><TT>-NOPGAP </TT></TD>
+<TD><EM>residue-specific gaps off</EM></TD>
+<TD><TT>-NOHGAP </TT></TD>
+<TD><EM>hydrophilic gaps off</EM></TD>
+<TD><TT>-HGAPRESIDUES= </TT></TD>
+<TD><EM>list of hydrophilic residues</EM></TD>
+<TD><TT>-MAXDIV=n </TT></TD>
+<TD><EM>% identity for delay</EM></TD>
+<TD><TT>-TYPE= </TT></TD>
+<TD><EM>PROTEIN or DNA</EM></TD>
+<TD><TT>-TRANSWEIGHT=f </TT></TD>
+<TD><EM>transitions weighting</EM></TD>
+<CENTER><H3>***Profile Alignments:***
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
+<TD><STRONG>Parameter</STRONG></TD>
+<TD><STRONG><EM>Description</EM></STRONG></TD>
+<TD><TT>-NEWTREE1= </TT></TD>
+<TD><EM>file for new guide tree for profile1</EM></TD>
+<TD><TT>-NEWTREE2= </TT></TD>
+<TD><EM>file for new guide tree for profile2</EM></TD>
+<TD><TT>-USETREE1= </TT></TD>
+<TD><EM>file for old guide tree for profile1</EM></TD>
+<TD><TT>-USETREE2= </TT></TD>
+<TD><EM>file for old guide tree for profile2</EM></TD>
+<CENTER><H3>***Sequence to Profile Alignments:***
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
+<TD><STRONG>Parameter</STRONG></TD>
+<TD><STRONG><EM>Description</EM></STRONG></TD>
+<TD><TT>-NEWTREE= </TT></TD>
+<TD><EM>file for new guide tree</EM></TD>
+<TD><TT>-USETREE= </TT></TD>
+<TD><EM>file for old guide tree</EM></TD>
+<CENTER><H3>***Structure Alignments:***
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
+<TD><STRONG>Parameter</STRONG></TD>
+<TD><STRONG><EM>Description</EM></STRONG></TD>
+<TD><TT>-NOSECSTR2 </TT></TD>
+<TD><EM>do not use secondary structure/gap penalty mask for profile 2</EM></TD>
+<TD><TT>-SECSTROUT=STRUCTURE or MASK or BOTH or NONE </TT></TD>
+<TD><EM>output in alignment file</EM></TD>
+<TD><TT>-HELIXGAP=n </TT></TD>
+<TD><EM>gap penalty for helix core residues </EM></TD>
+<TD><TT>-STRANDGAP=n </TT></TD>
+<TD><EM>gap penalty for strand core residues</EM></TD>
+<TD><TT>-LOOPGAP=n </TT></TD>
+<TD><EM>gap penalty for loop regions</EM></TD>
+<TD><TT>-TERMINALGAP=n </TT></TD>
+<TD><EM>gap penalty for structure termini</EM></TD>
+<TD><TT>-HELIXENDIN=n </TT></TD>
+<TD><EM>number of residues inside helix to be treated as terminal</EM></TD>
+<TD><TT>-HELIXENDOUT=n </TT></TD>
+<TD><EM>number of residues outside helix to be treated as terminal</EM></TD>
+<TD><TT>-STRANDENDIN=n </TT></TD>
+<TD><EM>number of residues inside strand to be treated as terminal</EM></TD>
+<TD><TT>-STRANDENDOUT=n</TT></TD>
+<TD><EM>number of residues outside strand to be treated as terminal </EM></TD>
+<CENTER><H3>***Trees:***
+<CENTER><TABLE ALIGN=ABSCENTER BORDER=1 CELLSPACING=1 CELLPADDING=5>
+<TD><STRONG>Parameter</STRONG></TD>
+<TD><STRONG><EM>Description</EM></STRONG></TD>
+<TD><TT>-SEED=n </TT></TD>
+<TD><EM>seed number for bootstraps</EM></TD>
+<TD><TT>-KIMURA </TT></TD>
+<TD><EM>use Kimura's correction</EM></TD>
+<TD><TT>-TOSSGAPS </TT></TD>
+<TD><EM>ignore positions with gaps</EM></TD>
+<TD><TT>-BOOTLABELS=node OR branch </TT></TD>
+<TD><EM>position of bootstrap values in tree display</EM></TD>
+<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
+<CENTER><H2><A NAME="R"> References
+The ClustalX program is described in the manuscript:
+Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997)
+The ClustalX windows interface: flexible strategies for multiple sequence
alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.