Universitaet Bielefeld - GSF Research Center - Universite de Rouen - 
Rhone-Poulenc Rorer



                            DIALIGN 2.1

                            User Guide



                           Developed by: 
                        

     Burkhard Morgenstern, Said Abdeddaim, Klaus Hahn, Thomas Werner, 
     Kornelie Frech, and Andreas Dress 


at GSF, AG BioDV, University of Bielefeld - FSPM, GSF, Institute of 
Biomathematics and Biometry, North Carolina State University, 
Department of Genetics, Universite de Rouen, LIFAR  - ABISS, Faculte 
des Sciences et Techniques, and Rhone-Poulenc Rorer.


E-mail contact:  burkhard@mathematik.uni-bielefeld.de


               
                           Important Note:
 
Use of DIALIGN 2 is subject to the copyright notice (COPYRIGHT).

Distribution of copies of this user guide to all users of the program is
explicitly encouraged since it will facilitate using DIALIGN.



                            Reference: 

     B. Morgenstern (1999). 
     DIALIGN 2: improvement of the segment-to-segment approach to
     multiple sequence alignment.
     Bioinformatics 15, 211 - 218.

Public research assisted by DIALIGN should cite this article. For more 
information, updated references etc. please visit the DIALIGN home page at

     http://bibiserv.techfak.uni-bielefeld.de/dialign/





                            Program Input:

There are two ways to run DIALIGN on your computer: you can run the program 
interactively, or you can enter parameters via the command line. In either 
case, the sequences must be contained in a single sequence file.

Sequence file:

DIALIGN requires an ASCII file containing the sequences to be aligned. Four 
different file formats are supported: IG, FASTA, EMBL and GCG-RSF format. 
The following is an example of the FASTA sequence file format: 



        >HTL2  
        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
        SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
        STFNEYTDSLILAPL
        >MMLV   
        PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
        GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
        CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
        >HEPB 
        RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
        VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
        GRTSLYADSPSVPSHLPDRVH
        >ECOL   
        MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
        EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
        VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV




The first line for each sequence starts with ">" and contains the name of 
the sequence. Please make sure that the first line in the input file is
not empty and that the first character in the first line is not blank.
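
The requirements above can be checked before DIALIGN is started. The
following Python sketch is not part of DIALIGN; the function name
check_fasta_text is hypothetical, and the sketch tests only the rules stated
in this section (first line not empty, not starting with a blank, and
starting with ">").

```python
def check_fasta_text(text):
    """Return a list of problems found in FASTA-formatted input text.

    Only the rules stated in the user guide are checked; this is an
    illustrative helper, not DIALIGN's own input parser.
    """
    problems = []
    lines = text.splitlines()
    if not lines or lines[0] == "":
        problems.append("first line is empty")
    elif lines[0][0] in " \t":
        problems.append("first line starts with a blank")
    elif not lines[0].startswith(">"):
        problems.append("first line does not start with '>'")
    return problems


if __name__ == "__main__":
    example = ">HTL2\nLDTAPCLFSDGS\n>MMLV\nPDADHTWYTDGS\n"
    print(check_fasta_text(example))
```

An empty result list means that none of the three problems was found.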

Options:

     Sequence Type: 

     The user can decide if nucleic acid or protein sequences are to be 
     aligned. 

     Threshold T: 

     As described in our papers, the program DIALIGN constructs alignments 
     from gapfree pairs of similar segments of the sequences. Such segment 
     pairs are referred to as `diagonals'. 

     Every possible diagonal is given a so-called weight reflecting the 
     degree of similarity between the two segments involved. The overall 
     score of an alignment is then defined as the sum of the weights of the 
     diagonals it consists of, and the program tries to find an alignment with
     maximum score -- in other words, the program tries to find a consistent
     collection of diagonals with maximum sum of weights. This novel scoring
     scheme for alignments is the basic difference between DIALIGN and other
     global or local alignment methods. Note that DIALIGN does not employ any 
     kind of gap penalty. 

     It is possible to use a threshold T for the quality of the diagonals. 
     In this case, a diagonal is considered for alignment only if its 
     `weight' exceeds this threshold. Regions of lower similarity are ignored. 

     In the first version of the program (DIALIGN 1), this threshold was in 
     many situations absolutely necessary to obtain meaningful alignments. 
     By contrast, DIALIGN 2 should produce reasonable alignments without a 
     threshold, i.e. with T = 0. This is the most important difference between
     DIALIGN 2 and the first version of the program. Nevertheless, it is still
     possible to use a positive threshold T to filter out regions of lower 
     significance and to include only high-scoring diagonals in the 
     alignment.
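
     The scoring scheme and the effect of the threshold can be illustrated
     with a few lines of Python. This is only a sketch: the weights below are
     made-up numbers, the function name alignment_score is hypothetical, and
     the search for a consistent collection of diagonals, which is the hard
     part of the program, is not shown.

```python
def alignment_score(diagonal_weights, threshold=0.0):
    """Sum the weights of the diagonals whose weight exceeds the threshold.

    Returns (score, number of diagonals kept). The input is assumed to be
    the weights of an already consistent collection of diagonals.
    """
    kept = [w for w in diagonal_weights if w > threshold]
    return sum(kept), len(kept)


if __name__ == "__main__":
    weights = [4.0, 1.5, 0.5, 7.0]            # illustrative values only
    print(alignment_score(weights))           # T = 0: all four diagonals count
    print(alignment_score(weights, 2.0))      # T = 2: only the two strongest
```

     With T = 0 every diagonal of positive weight contributes to the score;
     raising T discards the weaker diagonals, and no gap penalty enters the
     sum at any point.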

     Translation of `nucleotide diagonals' into `peptide diagonals': 

     If (possibly) coding nucleic acid sequences are to be aligned, DIALIGN 
     optionally translates the compared `nucleic acid segments' to `peptide 
     segments' according to the genetic code -- without (necessarily) 
     presupposing any of the three possible reading frames, so all 
     combinations of reading frames get checked for significant similarity. 
     If this option is used, the similarity among segments will be assessed 
     on the `peptide level' rather than on the `nucleic acid level'. 

     We strongly recommend using the `translation' option if nucleic acid 
     sequences are expected to contain protein coding regions, as it will 
     significantly increase the sensitivity of the alignment procedure in 
     such cases. 
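
     The idea of comparing segments on the `peptide level' can be sketched
     in Python as follows. This guide does not describe how DIALIGN performs
     the translation internally, so the code is an assumption-based
     illustration: it uses the standard genetic code, the names
     translate_segment and frame_translations are hypothetical, and unknown
     codons are marked with `X' by choice of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_segment(segment):
    """Translate a nucleotide segment codon by codon.

    Trailing bases that do not fill a codon are dropped; codons containing
    unknown characters are translated to 'X' (a choice made for this sketch).
    """
    segment = segment.upper().replace("U", "T")
    codons = (segment[i:i + 3] for i in range(0, len(segment) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def frame_translations(sequence):
    """Return the peptide translation for each of the three reading frames."""
    return [translate_segment(sequence[offset:]) for offset in range(3)]


if __name__ == "__main__":
    print(frame_translations("ATGGCCTAA"))
```

     Because no reading frame is presupposed, each sequence yields three
     candidate translations, and segments from two sequences can be compared
     in every combination of frames.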

     `*' characters: 

     The user can specify the maximum number of `*' characters per column.
     These characters indicate the degree of local similarity among sequences 
     in the DIALIGN alignment. They are only a rough measure of local 
     similarity. 
     Since in EVERY alignment, the region of highest similarity will get 
     the specified maximum number of stars, they only reflect the RELATIVE 
     degree of similarity WITHIN a given alignment and are NOT an absolute
     measure of similarity. Nevertheless, they are useful to spot conserved 
     domains within the sequences.

     `overlap weights':

     This option improves the sensitivity of the program if multiple sequences
     are aligned but it also increases the running time, especially if large
     numbers of sequences are aligned. By default, `overlap weights' are used
     if up to 35 sequences are aligned but switched off for larger data sets. 
     In the command-line version, `overlap weights' can be switched on or off 
     for data sets of any size, see below.
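
     The default rule and its two command-line overrides can be summarized
     in a short Python sketch. The function name overlap_weights_used is
     hypothetical. This guide does not say what happens if both -ow and -iw
     are given, so the sketch rejects that combination; this is an
     assumption, not documented program behavior.

```python
def overlap_weights_used(num_sequences, flags=()):
    """Decide whether overlap weights are used for a data set.

    -ow forces them on, -iw forces them off; otherwise they are used
    if up to 35 sequences are aligned.
    """
    if "-ow" in flags and "-iw" in flags:
        # Not covered by the user guide; rejected here by assumption.
        raise ValueError("-ow and -iw given together")
    if "-ow" in flags:
        return True
    if "-iw" in flags:
        return False
    return num_sequences <= 35


if __name__ == "__main__":
    print(overlap_weights_used(35), overlap_weights_used(36))
```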


Entering parameters via command line:

If you want to enter options via command line, the program call is 

 
  dialign [ options ] <seq_file>
  

where <seq_file> is the name of the input sequence file. In the command-line
version, some more options are available:

 -cw             separate output file in CLUSTAL W format.

 -fa             separate output file in FASTA format.

 -fn <out_file>  output file is named <out_file>. 

 -iw             overlap weights NOT used (by default, overlap weights are
                 used if up to 35 sequences are aligned).

 -max_link       "maximum linkage" clustering used to construct sequence tree
                 (instead of UPGMA).

 -min_link       "minimum linkage" clustering used.

 -msf            separate output file in MSF format.

 -n              input sequences are nucleic acid sequences. No translation 
                 of diagonals. 

 -nt             input sequences are nucleic acid sequences and `nucleic acid 
                 segments' are translated to `peptide segments'. 

 -o              fast version, resulting alignments may be slightly different.

 -ow             overlap weights used, regardless of number of input sequences.

 -stars x        maximum number of `*' characters indicating 
                 degree of local similarity among sequences = x. 

 -sto            Results written to standard output. 

 -thr x          Threshold T = x. 
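
When DIALIGN is called from a script, the argument list can be assembled
programmatically. The Python sketch below uses only options listed above;
the function name build_dialign_command, its parameter names and the file
name seqs.fasta are hypothetical. It assumes that protein sequences are
expected when neither -n nor -nt is given, which this guide implies but does
not state explicitly.

```python
def build_dialign_command(seq_file, seq_type="protein", threshold=None,
                          stars=None, out_file=None, fasta=False,
                          to_stdout=False):
    """Build the argument list for a call 'dialign [ options ] <seq_file>'."""
    cmd = ["dialign"]
    if seq_type == "nucleic":
        cmd.append("-n")                 # nucleic acids, no translation
    elif seq_type == "nucleic_translated":
        cmd.append("-nt")                # nucleic acids, translated diagonals
    elif seq_type != "protein":
        raise ValueError("unknown sequence type: " + seq_type)
    if threshold is not None:
        cmd += ["-thr", str(threshold)]  # threshold T
    if stars is not None:
        cmd += ["-stars", str(stars)]    # maximum number of '*' characters
    if fasta:
        cmd.append("-fa")                # separate output file in FASTA format
    if out_file is not None:
        cmd += ["-fn", out_file]         # name of the output file
    if to_stdout:
        cmd.append("-sto")               # results written to standard output
    cmd.append(seq_file)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_dialign_command("seqs.fasta",
                                         seq_type="nucleic_translated",
                                         threshold=2, stars=9)))
```

The returned list can be passed to subprocess.run from the Python standard
library, provided that the dialign executable is installed and on the search
path.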



Similarity Matrix:

DIALIGN 2 uses the BLOSUM62 amino acid substitution matrix. In the current 
version, it is NOT possible to replace BLOSUM62 by other similarity matrices,
since the probability values contained in the files n_prob and p_prob refer 
to the BLOSUM62 matrix. 



                             Program Output: 

If default options are used, DIALIGN creates a single file containing

    - An alignment of the input sequences in DIALIGN format. 
    - The same alignment in FASTA format. 
    - A sequence tree in PHYLIP format. This tree is constructed by applying 
      the UPGMA clustering method to the DIALIGN similarity scores. It roughly 
      reflects the different degrees of similarity among sequences. For 
      detailed phylogenetic analysis, we recommend the usual methods for 
      phylogenetic reconstruction. 


This is the DIALIGN alignment format: 

  
HTL2          1   ldtapcLFSD GS------PQ KAAYVLWDQT IL---QQDIT PLPSHethSA
MMLV          1   pdadhtwYTD GSSLLQEGQR KAGAAVTTET eviwaKALDA G---T---SA
HEPB          1   rpglcQVFAD AT------PT GWGLVMGHQR MR---GTFSA PLPIHt----
ECOL          1   mlkqvEIFTD GSCLGNPGPG GYGAILRYRG RE---KTFSA GytrT---TN
                                                                
                       ***** ********** ********** **   ***** *****   **
                        **** **      ** ********** **   ***** *****   **
                         *** **      ** ********** **   *****           
                                     ** ******                          
                                                                        


HTL2         42   QKGELLALIC GLRAAKPWPS LNIFLDSKYL IKYLHslaig aflgtsah--
MMLV         45   QRAELIALTQ ALKMAEgkk- LNVYTDSRYA FATAHIHGEI YRRRGLLTSE
HEPB         38   --AELLAACF Arsrsgan-- -IIGTDN--- ---------- ----------
ECOL         45   NRMELMAAIV ALEALKEHCE VILSTDSQYV RQGITQWIHN WKKRGWKTAD
                                                                
                  ********** ********** ********** ********** **********
                  ********** ********** ********** ********** **********
                     ******* ******     ********** *****                
                     ******* ******     ********** *****                
                                          ********                      


HTL2         90   -------QT- --LQAALPPL LQGKTIYLHH VRSHT----- -NLPDPISTF
MMLV         94   GKEIKNKDE- --ILALLKAL FLPKRLSIIH CPGHQ----- -KGHSAEARG
HEPB         60   ---------- ---SVVLSR- ---------- ---KYTSFPW LLGCAANWI-
ECOL         95   KKPVKNVDlw qrLDAALGQ- ---------- ---HQIKWEW VKGHAGHPE-
                                                                
                  *********    ******** ********** ********** **********
                  ********                                              
                         *                                              
                                                                        
                                                        


HTL2        124   NEYTDSLILA pl-------- ---------- ---------- ----------
MMLV        135   NRMADQAARK AAITETPDTS tll------- ---------- ----------
HEPB         82   LRGTSFVYVP SALNPADDPS rgrlglsrpl lrlpfrpttg rtslyadsps
ECOL        130   NERCDELARA AAMNPTledt gyqvev---- ---------- ----------
                                                                
                  ********** **********                                 
                  ********** ******                                     
                                                                        
                                                                        
                                                                        


HTL2        136   ----------
MMLV              ----------
HEPB        132   vpshlpdrvh
ECOL        156   ----------
                    



     Names of aligned sequences are shown on the left hand side of the 
     alignment. 
     
     Numbers on the left hand side of the alignment denote the position 
     of the first residue in a line within the respective sequence. 
     
     Capital letters denote aligned residues, i.e. residues involved in 
     at least one of the `diagonals' the alignment consists of. Lower-case
     letters denote residues not belonging to any of these selected 
     `diagonals'. They are not considered to be aligned by DIALIGN. Thus, 
     if a lower-case letter stands in the same column as other letters,
     this is purely by chance; these residues are not considered to be 
     homologous. 
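
     The letter-case convention makes it easy to separate aligned from
     unaligned residues when post-processing an alignment. A minimal Python
     sketch follows; the function name residue_counts is hypothetical, and
     the input is assumed to be the residue part of one alignment line,
     without the sequence name and position number.

```python
def residue_counts(row):
    """Count aligned residues, unaligned residues and gaps in one row.

    Upper-case letters are aligned residues, lower-case letters are
    residues not belonging to any selected diagonal, '-' is a gap.
    Blanks between the blocks of ten columns are ignored.
    """
    aligned = sum(1 for ch in row if ch.isupper())
    unaligned = sum(1 for ch in row if ch.islower())
    gaps = row.count("-")
    return aligned, unaligned, gaps


if __name__ == "__main__":
    # Start of the HTL2 row from the example alignment.
    print(residue_counts("ldtapcLFSD GS------PQ"))
```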

     The number of `*' characters below the alignment reflects the degree of 
     local similarity among sequences. More precisely: They represent the sum 
     of `weights' of diagonals connecting residues at the respective position.

     The number of `*' characters is normalized such that regions of maximum
     similarity always have m `*' characters per column - no matter how strong
     this maximum similarity is. m can be specified by the user; the default 
     value is m = 5. 
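
     The normalization can be pictured with the following Python sketch. The
     exact formula used by DIALIGN is not given in this guide; the sketch
     ASSUMES linear scaling of the per-column weight sums followed by
     rounding, and the function name star_counts and the numbers are
     purely illustrative.

```python
def star_counts(column_weights, max_stars=5):
    """Scale per-column weight sums so that the maximum gets max_stars.

    Linear scaling and rounding are assumptions of this sketch, not the
    documented DIALIGN formula.
    """
    top = max(column_weights, default=0)
    if top <= 0:
        return [0] * len(column_weights)
    return [round(max_stars * w / top) for w in column_weights]


if __name__ == "__main__":
    print(star_counts([10.0, 6.0, 0.0, 2.0]))
    print(star_counts([100.0, 60.0, 0.0, 20.0]))
```

     Both calls give the same star counts although the second data set has
     ten times larger weights. This is why the stars indicate only the
     relative degree of similarity within one alignment.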



This is FASTA alignment format: 


>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------
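
An alignment in this format can be read back with a few lines of Python. The
sketch below is not part of DIALIGN; the function names read_fasta_alignment
and ungapped are hypothetical. It joins the lines of each record into one
row and shows how the original sequence is recovered by removing the gap
characters.

```python
def read_fasta_alignment(text):
    """Return {sequence name: aligned row} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def ungapped(row):
    """Remove gap characters and return the sequence in upper case."""
    return row.replace("-", "").upper()


if __name__ == "__main__":
    text = ">A\nabC-\n--DE\n>B\nAB--\nCDef\n"
    rows = read_fasta_alignment(text)
    print(rows)
    print({name: ungapped(row) for name, row in rows.items()})
```

All rows of one alignment have the same length, which can serve as a simple
consistency check after reading.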



This is PHYLIP tree format: 

 
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);
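
The tree is written in the nested-parentheses notation used by PHYLIP, with a
branch length after each colon. The Python sketch below parses such a string
and sums the branch lengths from the root to each leaf. It is a small
illustrative parser under the assumption of well-formed input with plain
names and branch lengths; the function name leaf_depths is hypothetical.

```python
def leaf_depths(newick):
    """Return {leaf name: summed branch length from the root}."""
    text = "".join(newick.split()).rstrip(";")
    pos = 0

    def read_length():
        """Read an optional ':<number>' at the current position."""
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            start = pos + 1
            pos = start
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            return float(text[start:pos])
        return 0.0

    def read_node():
        """Parse one subtree; return (leaf depths below it, its branch length)."""
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            below = {}
            while True:
                child, length = read_node()
                for name, depth in child.items():
                    below[name] = depth + length
                if text[pos] == ",":
                    pos += 1              # another child follows
                else:
                    break
            pos += 1                      # skip ")"
            return below, read_length()
        start = pos
        while pos < len(text) and text[pos] not in ":,()":
            pos += 1
        return {text[start:pos]: 0.0}, read_length()

    depths, _ = read_node()
    return depths


if __name__ == "__main__":
    tree = """((HTL2:0.111024,
    (MMLV:0.078471,
    ECOL:0.078471):0.032554):0.121218,
    HEPB:0.232242);"""
    for name, depth in sorted(leaf_depths(tree).items()):
        print(name, round(depth, 6))
```

For the example tree, all four leaves lie at the same distance from the root
up to rounding in the last digit, as expected for a tree built by UPGMA
clustering.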



Trees can be visualized using the treetool program contained in the PHYLIP 
software package:

  http://www.no.embnet.org/phylip.html

Burkhard Morgenstern, December 1999 






