Delila Program: malign

malign program

Documentation for the malign program is below, with links to related programs in the "see also" section.

{   version = 2.72; (* of malign.p 2022 May 21}

(* begin module describe.malign *)
(*
name
   malign: optimal alignment of a book, based on minimum uncertainty

synopsis
   malign(inst: in, book: in, malignp: in, uncert: out, newalign: out,
      optalign: out, optinst: out, malignxyin: out, output: out)

files

   inst: delila instructions of the form 'get from 56 -5 to 56 +10;'

   book: the book generated by delila using inst

   malignp: parameter file with the following parameters:
      winleft, winright: left and right ends of window for calculating
          uncertainty, relative to aligned base
      shiftmin, shiftmax: minimum and maximum shift of aligned base
      iseed: integer random seed
      nranseq: number of random sequences, or 0 to use sequences in book
      nshuffle: number of times to redo alignment after random shuffle
      ifpaired: 1 to treat each pair of sequences as complementary strands,
         0 not to
      standout: output run #, pass # and H to standard output every pass
         if 1, every run if 0, or not at all if -1
      npassout: output H and alignment every npassout passes to file newalign,
         or only at end of runs if zero, or not at all if -1
      nshiftout: output L and H(L) every nshiftout sequence shifts (to file
         uncert), or only at end of passes if zero, or not at all if -1
      tolerance: tolerance in change of H
      ntolpass: maximum number of passes with change below tolerance

      new parameter allowed but not required (default is i):

      alignmenttype: char; 'f' means alignment by First internal coordinate
         base, 'b' means alignment by Book, 'i' means alignment by
         Instructions.  See the alist program for more information.

         Normally one will align by delila instructions.

         If this parameter is 'f', then the first base of the book is
         considered the zero base and if it is 'b' then the zero base is
         given by the coordinates of each piece in the book.

   uncert: uncertainty as function of position, for the last run, at the
      end of each pass or after selected number of sequence shifts
      Controlled by variable nshiftout.

   newalign: values of H and the relative alignments; starting, final, and
      intermediate if selected.
      Controlled by variable npassout.

   optalign: user-readable listing of unique optimal relative alignments
      and number of times each was achieved

   optinst: list of unique optimal alignments in absolute coordinates,
      to be used to make inst file for selected alignment
         This file is like optalign, but the coordinates are for the
         original sequence.

   malignxyin: a list of the number of occurrences of alignments and their
      H values in bits.  This may be plotted with xyplo, as described
      in the paper.  Each line contains these numbers:
         rank:  from 1 to the number of alignment classes
         occurrences:  how many times the class was found
         H: the uncertainty of the alignment, in bits
         R: the information content of the alignment, in bits
            with small sample correction.

description
   Given a book of aligned sequences, this program searches for the alignment
   of the sequences that has the lowest uncertainty, i.e. the highest value of
   Rsequence.  The user specifies the "window" of bases within which
   uncertainty is calculated, and the maximum number of bases that each
   sequence is allowed to shift from the original alignment.  The program
   considers each sequence in turn, shifting it to an alignment with minimum
   uncertainty while holding the other sequences fixed.  A "pass" is complete
   when all sequences have been considered.  A "run" is complete when no
   alignments have changed in the preceding pass, and the alignment is then
   considered "optimal".  The first run starts with the original alignment;
   every run after that starts with a "shuffled" alignment obtained by shifting
   each sequence independently by a random amount between the allowed limits.
   The program maintains a list of all of the unique optimal alignments
   achieved from these starting alignments, and it outputs them in order of
   increasing uncertainty.

   In version 2.33 and earlier, the program did not keep track of the
   organism and chromosome names in bestinst.  This file is now superceeded
   by the malin program which copies the inst file to cinst and modifies
   it according to one of the alignments in optinst.

   Statistical testing

   We have found a method to work with malign that gives reliable
   results.  First run malign many times (e.g. 100) on the sites of
   interest using the timeseed (with at least 1 second delay between
   runs so that the timeseed changes).  Collect the information
   content distribution.  Then extract the same length sequences from
   random places on the host organism or use comp to get the
   composition of the host and the markov program to generate a random
   set.  Run malign again with the same number of sequences and
   parameters.

   If you find that the two distributions differ significantly (using
   the ttest program) then you've got something.  This was useful for
   us in a paper we are just finishing - in one case we see a distinct
   pattern clearly distinguishable from the random host sequences and
   in another the distributions were identical.

********************************************************************************

Summary of file output:

Malign produces:

  uncert, newalign, optalign, optinst, malignxyin, output

-- output --

   Line 7 of "malignp" controls output
   Parameter:
      1 - every run, pass, and uncertainty will be outputed to the
          screen
      0 - only the lowest uncertainty run will be outputed
     -1 - nothing will be outputed

   *NOTE* no file will be produced regardless of the parameter

-- newalign --

   Line 8 of "malignp" controls newalign
   Parameter:
      1 - produces a full newalign file
      0 - produces a smaller newalign file (about half the size)
     -1 - produces no newalign file

   *NOTE* this is the largest file produced and is unnecessary

-- uncert --

   Line 9 of "malignp" controls uncert
   Parameter:
      1 - produces a full uncert file
      0 - produces a smaller uncert file (about half the size)
     -1 - produces an empty uncert file

   *NOTE* file will be produced regardless of the parameter,
          however this file is large and unnecessary

-- optalign --

   *NOTE* this file will always be produced and is needed to run malin

-- optinst --

   *NOTE* this file will always be produced and is needed to
          run malin

-- malignxyin --

   *NOTE* this file will always be produced and can be used to
          plot data using xyplo

If you set our parameters so newalign and uncert are not created,
this can save some space.

(Thanks to Brent M. Jewett for compiling this information on 2001 Feb 7.)

********************************************************************************

documentation

A paper describing the algorithm in detail is available from
<A href="https://alum.mit.edu/www/toms/papers/malign/malign.pdf"
>https://alum.mit.edu/www/toms/papers/malign/malign.pdf</A>

@article{Schneider.Mastronarde.malign,
author = "T. D. Schneider
 and D. Mastronarde",
title = "{Fast} multiple alignment of ungapped {DNA}
sequences using information theory and a relaxation method",
journal = "Discrete Applied Mathematics",
note = "https://alum.mit.edu/www/toms/papers/malign",
volume = "71",
pages = "259-268",
year = "1996"}

The use of malign to align sequences with a subtle pattern is described in:

@article{Toledano1994,
author = "M. B. Toledano
 and I. Kullik
 and F. Trinh
 and P. T. Baird
 and T. D. Schneider
 and G. Storz",
title = "Redox-Dependent Shift of {OxyR-DNA} Contacts Along
An Extended {DNA} Binding Site:
A Mechanism for Differential Promoter Selection",
journal = "Cell",
volume = "78",
pages = "897-909",
year = "1994"}

For how the information content and small sample correction are computed:

@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}

see also

   Paper Schneider.Mastronarde.malign:
   https://alum.mit.edu/www/toms/paper/malign/

   Program to graph the malignxyin file: xyplo.p
   You can use the malign.xyplop  file for the xyplop
   and the malignxyin for the xyin.  Set the xyplom to be empty.
   I ALWAYS make this graph to see what is going on.

   Program to make delila instructions from nth alignment of malign:
   malin.p

   Example parameter file, malignp: malignp

   A COMPLETE SET FOR DEMONSTRATING THE PROGRAM
   malign.inst instructions for grabbing the first 6 EcoRI sites on
                the E. coli genome, but messed up by setting the last
                digit to zero
   malign.book The Delila book corresponding to malign.inst
   malign.malignp Parameter file for malign to realign the inst and book.

   If one uses malin to pick the first alignment, one finds that they are
   correctly realigned:

                       -                   +
                       1--------- +++++++++1
                       098765432101234567890
                       .....................
EcoRI U00096  3842 + 1 cgacctgccggaattcagcct
      U00096 12889 + 2 tctggttgaagaattcaagaa
      U00096 32545 + 3 tcagggtatcgaattcgacta
      U00096 50237 + 4 ggtattcagcgaattccacga
      U00096 56282 + 5 agaggtagcggaattcgttct
      U00096 96860 + 6 gctacgtcaggaattcctgct


   Program for listing aligned sequences, as above: alist.p

   Program for comparing two distributions: ttest.p

author
   David Mastronarde and Tom Schneider

bugs
   The realignment algorithm, which shifts all sequences by the same amount
   to attempt to keep the window near its original position, is somewhat
   ad hoc in nature and the effects of different settings for it parameters
   have not been explored.  If the window spans two real sites with competing
   alignments, many optimal but meaningless alignments with similar
   uncertainties may be obtained.  The random sequences can't be examined.

   For the computation of Rsequence, composition is assumed to be
   equiprobable, there is no provision for reading in a cmp file yet.

technical notes

   The malignxyin file Rsequence has the small sample correction.  The choice
   between the estimate and the more exact computation is determined
   by constant "kickover".

   The constant maxlen is one longer than the longest sequence.

   The constant maxnseq is the maximum number of sequences.

*)
(* end module describe.malign *)
{This manual page was created by makman 1.45}


{created by htmlink 1.62}