Delila Program: alpro

alpro program

Documentation for the alpro program is below, with links to related programs in the "see also" section.

{   version = 2.09; (* of alpro.p 2020 Mar 26}

(* begin module describe.alpro *)
(*
name
   alpro: frequency and information of aligned sequences

synopsis
   alpro(protseq: in, alprop: inout, symvec: out, sequ: out, output: out)

files
   protseq:  Aligned sequences in one of two formats.

      The first line, intended for identification of the entire data set, is
      skipped.  This header line must begin with an asterisk '*' or '>'.

      When the header begins with '>', fasta format is used, otherwise the
      original protseq format is used.

      In the original protseq format, the remaining lines are used for the
      sequences.  They are divided into `entries'.  The beginning of an entry
      has any (positive) number of identification lines, each of which begins
      with an asterisk '*'.  The sequence follows.  Gaps are indicated with
      dashes (-).  The end of the sequence is indicated by a period.  The
      program automatically figures out what the sequences are so that the
      correct kind of information calculation can be made.  Sequences can be
      DNA (ACGT - 4 characters), RNA (with U - 4 characters), protein (20
      characters) or alphabetic (26 characters).
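
      As a rough illustration of the original protseq layout described
      above, the following Python sketch splits a file into entries and
      guesses the alphabet.  The names parse_protseq and guess_alphabet are
      hypothetical and are not part of alpro, and the alphabet test is only
      an assumption about how such detection might work, not the rule the
      program itself applies.

```python
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")


def parse_protseq(text):
    """Return a list of (identification_lines, sequence) pairs."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("*"):
        raise ValueError("header line must begin with '*'")
    entries = []
    ids, seq = [], []
    for line in lines[1:]:  # the data set header line is skipped
        if line.startswith("*"):
            ids.append(line[1:].strip())
            continue
        for ch in line:
            if ch == " ":
                continue  # spaces inside a sequence are ignored
            if ch == ".":
                # a period ends the current sequence and its entry
                entries.append((ids, "".join(seq)))
                ids, seq = [], []
            else:
                seq.append(ch)
    return entries


def guess_alphabet(sequences):
    """Guess the sequence type from the symbols used (dashes are gaps)."""
    symbols = set("".join(sequences).upper()) - set("-")
    if symbols <= DNA:
        return "DNA"
    if symbols <= RNA:
        return "RNA"
    if symbols <= PROTEIN:
        return "protein"
    return "alphabetic"
```

      Checking the narrowest alphabet first means that a data set using only
      A, C, G and T is reported as DNA even though those letters are also
      valid amino acid codes.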

      Fasta format has two differences.  First, all identification lines begin
      with '>'.  Second, sequences do not end with a period.  Instead, they
      end with the next sequence entry identifier (i.e. another '>') or the end
      of the file.  In this format dashes '-' or dots '.' may be used
      as the alignment character.

      If fasta format is used then the dots represent bases of the
      first sequence.  (New as of 2007 jul 16; previously the dot
      became a dash.)

      Spaces are allowed in the sequence, but they are ignored.
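
      The treatment of dots in fasta input can be sketched as follows.  This
      is a minimal illustration, not alpro's own code: expand_dots is a
      hypothetical name, and it assumes that a dot stands for the symbol at
      the same column of the first sequence.

```python
def expand_dots(sequences):
    """Replace '.' in later sequences with the first sequence's symbol."""
    if not sequences:
        return []
    first = sequences[0]
    expanded = [first]
    for seq in sequences[1:]:
        chars = []
        for i, ch in enumerate(seq):
            if ch == "." and i < len(first):
                chars.append(first[i])  # copy the symbol from the first sequence
            else:
                chars.append(ch)  # ordinary symbols and dashes are kept
        expanded.append("".join(chars))
    return expanded
```

      A dot that lies beyond the end of the first sequence is left unchanged
      in this sketch, since there is no symbol to copy.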

   alprop: parameters to control the program, a series of lines:

      1. parameterversion: The version number of the program.  This allows the
         user to be warned if an old parameter file is used.

      2. alignment: alignment point for the sequences.  This allows one to
      assign the numbering in the symvec.

      3. normalization: 4 integers (a, c, g, t) giving the relative
      frequencies of random sequence to normalize against.  Use "1 1 1 1"
      normally.  If the data represent randomized chemical synthesis and
      there are biases in the bases, use the base biases.  Normalization is
      performed on the frequencies using equations 1 and 2 of Schneider1989:

      fo(b,l) = rho(b) fi(b,l)                                   (1)
      f'o(b,l) = f'i(b,l) rho(b) / sum_b [f'i(b,l) rho(b)]       (2)

      For fi(b,l) being the frequencies defined by this normalization line,
      fo(b,l) should = 1/4 for DNA.  This defines rho(b).  (Note: rho is not a
      function of position for this program, so rho(b) not rho(b,l).)

      DO NOT USE THIS FEATURE UNLESS YOU HAVE GENERATED SYNTHETIC RANDOM
      SEQUENCE as in Schneider1989.  USE 1:1:1:1 NORMALLY.

      See Schneider.ridebate1999 for further discussion.

      4. varlogo:  If the first letter is 'v' then the makelogo
      program will produce a 'varlogo'.  This method was invented by
      Peter Shenkin (Shenkin.Mastrandrea1991).  In a regular sequence
      logo the vertical scale is the information content.  However in
      some systems, as in the immunoglobulin variable regions, one is
      not interested in the conservation, but rather the degree of
      variability.  This is best expressed as the uncertainty Hafter
      rather than the information R = Hbefore - Hafter.  Basically, it
      "turns over" the curve.

      5. genomic composition: 4 integers (gna, gnc, gng, gnt) giving the
      numbers of A, C, G and T in the genome of the organism from which the
      DNA or RNA sequences come.  This genomic composition is used to
      compute Hgenome.  The information content is Rsequence(l) = Hgenome -
      (H(b,l) + e(n)), where e(n) is the small sample correction.  You can
      use '1 1 1 1' to set an equiprobable genome.  See
      Schneider.ridebate1999 for a discussion of relevant issues.

      6. sequ: If the first letter is 's' then create a file called
      sequ that contains the full sequences followed by a period which
      can be used by makebk to create a Delila book.  This allows one
      to convert periods from the protseq into the corresponding
      letter of the initial sequence.

      Old versions of the alprop parameter file will be automatically upgraded
      to the new version if you set the version number to less than 1.
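
      The calculations controlled by parameters 3, 4 and 5 can be sketched in
      Python as follows.  This is an illustration under stated assumptions,
      not the code of alpro: all function names are hypothetical, the
      normalization counts are assumed to be nonzero, and e(n) is taken to
      be the approximate small sample correction (s - 1) / (2 ln(2) n) for s
      symbols, whereas the program may compute the correction differently.

```python
from math import log, log2


def rho_from_normalization(counts):
    """Parameter 3: choose rho(b) so that rho(b) fi(b) is equiprobable."""
    total = sum(counts.values())
    target = 1.0 / len(counts)
    return {b: target / (n / total) for b, n in counts.items()}


def normalize(freqs, rho):
    """Equation 2: weight observed frequencies by rho(b), then renormalize."""
    weighted = {b: freqs[b] * rho[b] for b in freqs}
    z = sum(weighted.values())
    return {b: w / z for b, w in weighted.items()}


def uncertainty_bits(probs):
    """H = -sum p log2 p in bits; symbols with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def small_sample_correction(n, s=4):
    """Approximate e(n) for n sequences and s symbols (an assumption here)."""
    return (s - 1) / (2 * log(2) * n)


def r_sequence(freqs, n, genome_counts):
    """Parameter 5: Rsequence(l) = Hgenome - (H(b,l) + e(n))."""
    total = sum(genome_counts)
    h_genome = uncertainty_bits([g / total for g in genome_counts])
    h_after = uncertainty_bits(freqs.values())
    return h_genome - (h_after + small_sample_correction(n, len(genome_counts)))


# One position of an alignment of 50 DNA sequences, equiprobable genome.
rho = rho_from_normalization({"a": 1, "c": 1, "g": 1, "t": 1})
column = normalize({"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1}, rho)
h_after = uncertainty_bits(column.values())  # the quantity a varlogo displays
r = r_sequence(column, n=50, genome_counts=[1, 1, 1, 1])
```

      With "1 1 1 1" as the normalization every rho(b) is 1, so equation 2
      leaves the observed frequencies unchanged.  Here h_after is the
      uncorrected uncertainty of the position and r is the corresponding
      information under the assumptions above.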

   symvec:  Table of frequencies and information content.  The information
      measure is corrected for small sample size (Schneider et al, 1986).
      The format of this file is the same as produced by dalvec.

   sequ:  raw sequences followed by periods for creating
      a delila book.  This is generated only if the 6th parameter
      is 's'.

   output: messages to the user

description

   Take an aligned set of sequences and produce input to the makelogo program
   for producing a sequence logo.  Small sample size and odd genomic
   composition are accounted for.

   The program will take lines that begin with '>' to accommodate fasta format.
   However, sequences still must end with a period.

   This program provides a 'short cut' for making logos.  The "longer" route
   (longer in the number of programs and in complexity, but not significantly
   in computing time) is formed by these Delila programs:

   dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p

   * dbbk converts from genbank to delila format
   * catal creates a delila library
   * delila extracts the precise sequences you want (powerful!)
   * alist shows the extracted, aligned sequences
   * encode converts the aligned sequences to binary vectors
   * rseq converts the binary vectors to a table of computed information
   * dalvec converts the table of computed information to a symvec

   Why use alpro?  Because it is currently the *only* way to get a protein
   sequence logo, and it is currently the only way to handle sequences with
   gaps in them (someday Delila will do these things).  Why use the above
   Delila programs?  Because they provide much more flexibility for choosing
   the range of sites (via Delila) and interfacing with the sequence walker
   programs (via the information table, rsdata).

examples

* Example protseq file
* This is an example sequence.
AG-EGCTT.
* This is the second example sequence.
* It is the last one.
YLREBS-A.

Example parameter file (NOTE CHANGED FORMAT AS OF 1999 NOVEMBER 29!):

1.71  version of alpro that this parameter file is designed for.
1        alignment point
1 1 1 1  normalization bases
normal   a first letter 'v' will give varlogo
1 1 1 1  genomic composition
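
A parameter file in this layout could be read along the following lines.  This
is only a sketch: parse_alprop is a hypothetical helper, not part of alpro, and
it assumes that blank lines are skipped and that any text after the leading
values on a line is a comment, as the example above suggests.

```python
def parse_alprop(text):
    """Read the leading values of each alprop line into a dictionary."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 5:
        raise ValueError("expected at least five parameter lines")
    return {
        "parameterversion": float(lines[0].split()[0]),
        "alignment": int(lines[1].split()[0]),
        "normalization": [int(t) for t in lines[2].split()[:4]],
        "varlogo": lines[3].lstrip().startswith("v"),
        "genomic_composition": [int(t) for t in lines[4].split()[:4]],
        # The example above has no sixth line, so sequ is treated as optional.
        "sequ": len(lines) > 5 and lines[5].lstrip().startswith("s"),
    }
```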

The files globin.protseq (see below) and protseq.fasta are working examples.
Use protseq.makelogop and colors.protein with makelogo.  If you also use
protein.wave as the wave file, you can see how much the logo corresponds to
an alpha helix.

documentation

@article{Hein1990,
author = "Jotun Hein",
title = "Unified approach to alignment and phylogenies",
journal = "Methods Enzymol",
volume = "183",
pages = "626-645",
year = "1990"}

@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}

@article{Schneider.Stephens.Logo,
author = "T. D. Schneider
 and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}

@article{Schneider1989,
author = "T. D. Schneider
 and G. D. Stormo",
title = "Excess Information at Bacteriophage {T7} Genomic Promoters
Detected by a Random Cloning Technique",
year = "1989",
journal = "Nucl. Acids Res.",
volume = "17",
pages = "659-674"}

@article{Schneider.ridebate1999,
author = "T. D. Schneider",
title = "Measuring Molecular Information",
journal = "Journal of Theoretical Biology",
volume = "201",
pages = "87-92",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/ridebate/}
{https://alum.mit.edu/www/toms/paper/ridebate/}",
year = "1999"}

@article{Shenkin.Mastrandrea1991,
author = "P. S. Shenkin
 and B. Erman
 and L. D. Mastrandrea",
title = "{Information-theoretical entropy as a measure of sequence
variability}",
journal = "Proteins",
volume = "11",
pages = "297--313",
pmid = "1758884",
comment = "was Shenkin1991",
year = "1991"}

see also

   Standard parameter file:  alprop

   PROTEIN EXAMPLE: THE GLOBINS
   To try the alpro program, use the standard alprop
   with a copy of these files as your protseq:
   Example input file for protseq:  globin.protseq
   Example like globin.protseq but in fasta format:  globin.fasta

   The symvec file generated by alpro with this globin data should be
   close to or identical with this symvec:  globin.symvec

   Then you can use the program that makes the logo, makelogo.p, to
   create a logo.  You will need these files:
   symvec (from above or from the archive):  globin.symvec
   marks (currently empty):  globin.marks
   colors file to use for proteins:  protein.colors
   wave file to use for proteins:  protein.wave
   makelogop (parameter file to use for this globin example): globin.makelogop
   NOTE:  each file needs to have the name that makelogo expects.  Get
    the file and rename it.

   After you run makelogo, the resulting sequence logo should be like this:
   globin.logo.ps

   Read the manual page on makelogo.p to learn how to control the display
   more.

   There is a more powerful way to make DNA logos.  See:
   https://alum.mit.edu/www/toms/logoprograms.html

   Related programs:
   dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p

   What the heck is Pascal system error 0?
   See:
   https://alum.mit.edu/www/toms/pascalp2c.html#system.error.0

   Michael Sauder <michael_sauder@stromix.com> has generously written two
   perl scripts to convert files into the protseq format that alpro uses:

       convert from CLUSTAL format to protseq: clustalw2alpro.pl
       convert from     MSF format to protseq: msf2alpro.pl

author

  Dr. Thomas D. Schneider
  Laboratory of Experimental and Computational Biology
  permanent email: toms@alum.mit.edu
  https://alum.mit.edu/www/toms/

bugs

technical notes

   Historical note:  The program originally only created a vector that
   contained the characters of the alphabet, so the output was called an
   'alvec'.  To reflect the use of symbols, the name of the output file was
   changed to symvec, but I like 'alpro' and 'prosym' is awkward, so I
   decided to keep the name alpro.  Later I generalized the program to handle
   DNA or RNA or alphabetic sequences, but kept the name.  Now it might be
   considered to be the 'alignment professional'.  Oh well.

   The feature which adjusts the stack height when there is a small amount of
   data (described in the second paragraph of page 6100 of the logo paper)
   has been removed now because the ability to display the variance as a
   standard deviation by makelogo alerts the person that the position has
   little data in it.  Thanks to Peter Shenkin for the suggestion.

   The original feature was described as follows:

      "Positions that contain mostly spacer characters for the alignment are
      also reduced in weight by multiplying the information by the actual
      number of sequences at the spacer position and dividing it by the maximum
      number.  Thus if there are 10,000 sequences, a position with 200 A's
      would be close to 2 bits of pattern.  However, since the position
      only represents 2% of the sequences, this program would only give it a
      weight of 0.02*2 = 0.04 bits.  A better method is not known.  However,
      this prevents one from being fooled by positions that don't appear in
      most sequences."
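
   The arithmetic in the quoted example can be reproduced with a short Python
   sketch.  The function name is hypothetical and, since the feature has been
   removed, this reflects only the historical description and not the current
   behavior of alpro.

```python
def spacer_weighted_information(info_bits, n_at_position, n_sequences):
    """Scale the information by the fraction of sequences present."""
    return info_bits * n_at_position / n_sequences


# 200 of 10,000 sequences have a symbol at the position, all of them A's.
weight = spacer_weighted_information(2.0, 200, 10000)
```

   The fraction 200/10000 is the 0.02 of the quoted text, so the 2 bits are
   scaled down to about 0.04 bits.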

*)
(* end module describe.alpro *)
{This manual page was created by makman 1.45}


{created by htmlink 1.62}