Delila Program: multiscan

multiscan program

Documentation for the multiscan program is below, with links to related programs in the "see also" section.

{   version = 4.56; (* of multiscan.p 2022 Oct 08}

(* begin module describe.multiscan *)
(*

name
   multiscan: scan a Delila system book for many multipart binding sites

synopsis
   multiscan(book: in, ribl: in, scanp: in, histog: in, multiscanp: in,
   data: out, scanfeatures: out, scaninst: out, output: out);

files

   book: a book from the delila system

   ribl: a concatenated file containing all the ribls of all elements used in
         the scan.  The order of these ribls needs to match the order of
         elements in the scanp file file, and the order of elements in the
         multiscanp file.

   scanp: a concatenated file containing scanp files for all of the
          elements.  The scanp files are the same as with the scan program,
          and a detailed description of these files is in scan.p.

   multiscanp: the parameter file for multiscan.

      Parameterversion: Real;   This is the version of the paramter file, if
      it is less than the version used by the program, the program will not
      work.

      Maximum range:  Integer;   This is the range with which the sites have
      to be inside for them to be able to work together.  So if a minus
      10, a minus 35, and an Activator do not fall within this range then
      they will not be considered as a feasible combination of sites.  This
      is used so that the user can limit the spacing histograms, you may want
      to do a scan that has a smaller range than what your maximum histogram
      spacings allow, without having to change your histograms, so you change
      your maximum range to limit it.

      Total Ri cutoff: Real;  The total information for a site has to be
      above this value.  The equation of total information is given in
      greater detail in the program description.

      Filter toggle: char; This tells whether or not to do a filter based on
      the strongest parent site.  This is the first site in the list of
      sites.  So if your first site was a minus 10 site, followed by a -35,
      and activator, it would find the best combination of -35 and activators
      for the -10 and only report that combination.

   'f' means filter.  'n' means don't filter.

      Essential number cutoff: Real; So if a number is non-essential we only
      include it if it contributes positive information.  If you want your
      cutoff to be higher, you raise this value and the non-essential (info -
      GS) has to be greater than this to be included.  If it is negative it
      uses the scanp cutoff value only and doesn't subtract the gap, so if
      the scanp allows for it to be a site, then it will be included, no
      matter if it is positive or not, or if its GS is larger than it.

      Individual model string: "string", (string), char, char, char,
      char+integer; This is the string that defines how you treat each
      model.  The paramters are as follows:

   '#' if the character following the filter character (above)
      is '#', then the groupnumber is reported after the bits on each
      total line, e.g. in the form "bits-#16".  This is useful for
      debugging extremely complex maps.

 Model name: string; This is the name of the model.  This name needs
 to be the same as that in the Ribl file.  The model name needs to be
 in " ".

 Parent name: string;  The name that this model points to.  This needs
 to be the exact same name as the parent model.  The parent name needs
 to be in ( ).

 Essentiality: char;  Is it essential 'e' or non-essential 'n'.  See
 above for a better explanation of these terms.

 Orientation toggle: char;  Does the site have to have the same 's',
 different 'd', or any 'n' orientation relative to its parent.  So the
 minus 35 has to have the same orientation as the minus 10 so its
 toggle would be 's'.  But, an activator could have either orientation
 so its toggle would be 'n'.

 GS toggle: char;  It tells whether or not to include the gap surprisal.
            'g' you include the gap surprisal in the total information.
            'n' you do not include the gap surprisal.

     If the gap surprisal is turned off, then the program uses the
     furthest range in the histog that has some value greater than
     zero, and all bases in between, regardless of whether they are
     above zero.  Also it will write out the GS as 'NA' in the lister
     map, which means the GS is not counted (ie set to zero).

 Bar toggle: char; If you want to write out the gap surprsial bar then
 use the symbol 'b'.  Any other symbol will keep the GS bar from
 getting made.


 Overlap toggle and overlap difference: char and integer;  This
 controls the amount of overlapping that the site will allow.  You can
 have some overlap, no overlap, or complete overlap.
        'o' means allow overlap, no matter how much it overlaps.
        'n' means allow no overlap at all.
 'd' means allow overlap as long as the zero coordinates are further
  than the overlap distance apart.

   histog: concatenated histog files for each of the elements.  These files
      need to be in the same order as the multiscanp, scanp, and ribl files.
      Histog files need to be whole numbers and can be generated using
      genhis.

   data: The results.  Comments are lines that begin with '*'.  This file
      contains information containing all of the components in the scan.  A
      line is given for each element that gives the model name, coordinate,
      matrix orienation, Ri evaluation, gap size, and gap surprisal.  For
      each grouping the total information and the order of elements, and all
      elements used are reported.

   scanfeatures: The results in the "features" format for input to the lister
      program.  This consists of comment lines (beginning with "*"),

      definition lines (as shown above), and features of the form:
      @ U00096 3646159.0 -1 "p10" " 4.0 bits"     3.97    -0.0     0.0

      See program lister.p for details.

   scaninst:  The results are given in the form of delila instructions:
      name "dnaA"; piece K01789; get from 229 -100 to 229 +100 direction -;
      The name is the long name in the book.
      This is turned on by the last scanp parameter set in the scanp file.


   output: messages to the user

description

Multiscan searches a Delila book with an information-theory based
flexible binding site model and reports the located features.

Multiscan is the 3rd upgrade of scan. The first upgrade was to be able
to handle two part models, and was called Discan.  Discan worked by
comparing scanfeature files of two different models with some spacing
histogram to compute the total information of the site based on the
information of the two parts and the Gap surprisal of the
spacing between the two parts.  Biscan did the same computation but
did not use the scanfeatures output, instead it scanned the sequence
directly.

The biscan algorithm was to find the first site, and then based on the
spacing histogram provided, looked to see if a second site was near
by, using the same code that Scan uses.  If it found a suitable
site it would then report the site and all of the relevant information
about it.   Biscan is much faster and has more flexibility in control
than discan did, and therefore maded discan obsolete.

Discan and Biscan could only handle two part models, Multiscan can
handle an infinite number of models binding together with a large
number of controls over the binding characteristics of each model.
Multiscan can replace Biscan, but cannot replace scan (yet).

Here is a brief description of multiscan and the way that it can be
manipulated:

The models that are used by multiscan are either "essential" or
"non-essential".  Essential sites have to be present for a site to be
found (for example, in a promoter the minus 10 has to bind in order
for the polymerase to work).  Non-essential sites, are sites that do
not have be present for the system to work (for example and activator
does not have to be present for the RNAp to start initiation).  This
distinction between essential and non-essential sites is important in
understandind the multiscan algorithm.

Non-essential sites are only included if the site contributes to the
total information of the system.  That is, if some activator had an
individual information of 3 bits, but a gap surprisal value of 4 bits,
it would have an overall net nformation contribution of -1 bits to the
system, not improving the total strength of the site.  Therefore it
will not be included.

Anoter important concept is the idea of parent sites.  Since we based
the gap surprisal value on the distance between two sites, we have to
identify which model another model will bind relative to.  For
example, the minus 35 will bind relative to the minus 10 so its parent
is the minus 10.  Some activator will bind relative to the minus 35 so
its parent is the minus 35 and its grandparent is the minus 10.
Another activator may bind relative to the minus 35 so its parent is
also the minus 35 and can be thought of as the brother of the first
activator.  But like a parent, that molecule has to be bound in order
to have a child bind relative to it.  So if a parent does not find a
binding site, no children can bind.

The math for the calculating the total information of a site is the
same as presented in biscan and in the Shultaberger, Schneider
ribosome paper.  The only difference is that instead of two parts, we
now have unlimited parts.  So the math is:

Total info = Ri(UP) - GS(UP-35) + Ri(35) + Ri(10) - GS(10-35)  if
Ri(UP) - GS(UP-35) > 0 bits

else

Total info = Ri(10) + Ri(35) - GS(10-35).

The lister maps are fairly complicated now.  Since there are so many
components which could be part of a model, we used the following
techniques to help group them.  First:  All connecting bars have the
coordinate of the parent molecule.  That is if you had a p10, p35, and
Fis and p10 was the first site in your model (in the paramter file,
not spacially on the DNA), the connector bars for all the other
sites would contain its coordinate in their feature name tags.

Second: The total information line has the names of all the components
used in the model, and the order that they fall on the DNA.  So if Fis
was the most upstream site, followed by p35, and then p10 the line
would read Fis-p35-p10.

You use catted scanp files, catted ribl files, and catted histog files
as well as a multiscanp file.

examples

   an example multiscanp file:

1.03    version of multiscan that this parameter file is designed for.
200     maximum range in bases
5.0     total Ri cutoff
f       f means filter out duplicates of the features
0       essential number cutoff (Ri - gap), if negative use scanp cutoff, no gap.
"p10" (null) e s n n o3  parent, essential or non, orient tog, use GS, bar, overlap tog
"p35" (p10)  e s g n o3  parent, essential or non, orient tog, use GS, bar, overlap tog
"Fis" (p35)  e s n b o3  parent, essential or non, orient tog, use GS, bar, overlap tog
.

The order of gap surprisal numbers is now (2005 Sep 1) defined to be:
  gapsurp - gap surprisal (in Ri position)
  MinGapSurprisal - minimum gap (in P position)
  MaxGapSurprisal - maximum gap (in Z position)
This makes the computation of the sum easier.

documentation

@article{Schneider-Ri1997,
author = "T. D. Schneider
 and J. Spouge",
title = "{Information content of individual genetic sequences}",
journal = "J. Theor. Biol.",
volume = "189",
pages = "427--441",
pmid = "9446751",
note = "\htmladdnormallink
{http://dx.doi.org/10.1006/jtbi.1997.0540}
{http://dx.doi.org/10.1006/jtbi.1997.0540},
\htmladdnormallink
{https://alum.mit.edu/www/toms/papers/ri/}
{https://alum.mit.edu/www/toms/papers/ri/}",
year = "1997"}

@article{Schneider-walker1997,
author = "T. D. Schneider",
title = "Sequence Walkers:
a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences",
journal = "Nucleic Acids Res.",
volume = "25",
comment = "walker.tex, November 1, issue 21",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/papers/walker/}
{https://alum.mit.edu/www/toms/papers/walker/},
erratum: NAR 26(4): 1135, 1998",
pages = "4408--4415",
pmid = "9336476",
pmcid = "PMC147041",
year = "1997"}

@article{Shultzaberger.Schneider2001,
author = "R. K. Shultzaberger
 and R. E. Bucheimer
 and K. E. Rudd
 and T. D. Schneider",
title = "{Anatomy of \emph{Escherichia coli}
Ribosome Binding Sites}",
journal = "J. Mol. Biol.",
volume = "313",
pages = "215--228",
pmid = "11601857",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/papers/flexrbs/}
{https://alum.mit.edu/www/toms/papers/flexrbs/},
\htmladdnormallink
{http://dx.doi.org/10.1006/jmbi.2001.5040}
{http://dx.doi.org/10.1006/jmbi.2001.5040}",
year = "2001"}

@article{Shultzaberger.Schneider-flexprom2007,
author = "R. K. Shultzaberger
 and Zehua Chen
 and Karen A. Lewis
 and T. D. Schneider",
title = "{Anatomy of \emph{Escherichia coli} $\sigma^{70}$ promoters}",
journal = "Nucleic Acids Res.",
volume = "35",
pages = "771--788",
pmid = "17189297",
pmcid = "PMC1807945",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/papers/flexprom/}
{https://alum.mit.edu/www/toms/papers/flexprom/}",
year = "2007"}

see also
   sites.p, ri.p, genhis.p, lister.p, dnaplot.p

   The first published paper using the program:
   https://alum.mit.edu/www/toms/paper/flexprom/

   This example lister map using multiscan output
    shows multiple sequence walkers:
image for flexprom-fig5

author
   Ryan Kent Shultzaberger

bugs
   * Multiscan cannot handle negative coordinates!!
     (2016 Aug 30 fix in progress)

   * The large connecting bar may not be right if the series of features do
     not have the most important site as the furthest downstream, and a positive
     orientation.  You need to check this.

   * Sites with even symmetry have coordinates at position x.5, so the
     distribution has to have half base values (-0.5 0.5, 1.5, 2.5).
     I don't know if you can use them.

   * There is a problem with coupling non-essential sites.  Let me give an
     example:  You want to look at where the sigma 70 model (composed of a
     minus 10 and minus 35) binds relative to a RBS, but in your scan you do
     not want the promoter to be essential.  That is, you can identify RBSs
     and only include the promoter if it has positive information.  Since the
     promoter is a flexible thing, and you need both a p10 and p35 for the
     site to be complete, then the algorithm should only include the promoter
     if it contains both a p10, and p35.  The current algorithm does not know
     that both things need to be present and would include the p10 if it was
     good, even if the p35 wasn't.   This is a tricky concept.  My
     recommendation is to make all flexible elements essential, and that way
     you will not have any problems.

   * The rho normalization factor parameter has not been added, so the
     information contribution of any regulator is the Ri of the site.

   * It would be nice if the program could determine the range size
     based on the sizes of the individual histograms.

   * If some weird segmentation fault comes up, it may be related to freef
     being a multifeatureptr, instead of a featuretypeptr.  For some reason
     the program worked with freef as a featuretypeptr.  Try changing it back
     in themain and see what happens.

   * 2005 Sep 1: The Ri value is now properly reported for the features,
     but the P and Z values are set to zero.  This should be provided
     at some point.

   * 2016 Sep 20: multitype records use parameters: allparameters
     but should use the parameterlist instead - this would save memory!
     Those parameterlist is not even used!

technical notes
   The mean and standard deviation of the Ri distribution are stored just
   after the Ri(b,l) table in the ribl file.  They are produced automatically
   by the ri program.
*)
(* end module describe.multiscan *)
{This manual page was created by makman 1.45}


{created by htmlink 1.62}