Delila Program: exon

Documentation for the exon program is below, with links to related programs in the "see also" section.

{  version = 2.38; (* of exon.p 2018 Mar 06}

(* begin module describe.exon *)
(*
name
   exon: determine lengths of exons in GenBank entries

synopsis
   exon(exonp: inout, db: in,
           dinst: out, ainst: out, einst: out,
           lengths: out, exonfeatures: out, output: out)

files
   exonp: parameters to control the program, one per line:

     0: parameterversion: The version number of the program.  This allows the
        user to be warned if an old parameter file is used.
        (Introduced 2007 Dec 10.)

     1: If 'n' then the end exons are not included.  These do not
        have reliable lengths.

        Even if end exons are included, the program will never add the Delila
        instructions for the very ends of the CDS, because these are not
        reliable.  Often they are CAP or polyA sites.  Specifically, the
        first coordinate of the CDS is likely to be a CAP and so should not
        be added to the acceptors in ainst, while the last coordinate of the
        CDS is likely to be a polyA site and so should not be added to the
        donors in dinst.

     2: If 'd' then gobs of debugging output are printed
        to the output file.  If 'v' then verbose output is given
        but not debugging information.  (Verbose output is also
        turned on when debugging.)

     3: Two constants, theDfromrange and theDtorange, that determine
        the from and to range to be written for Donor Delila instructions.

     4: Two constants, theAfromrange and theAtorange, that determine
        the from and to range to be written for Acceptor Delila instructions.

     5: If the first character is 'e' then exon features are also
        used.  If the second character is 'i' then intron features are
        also used.

     6: If the first character is 'a' (for alternative) then exon
        features that have one end point the same are included.  If it is not
        'a' then only exons that are completely different are included.

     7: 5 characters that determine the harshness of which entries
        to keep.  The categories are:

           single letter   name            string in GenBank:
           p               putative        "putative"
           n               notexperimental "not_experimental"
           g               geneprediction  "gene prediction"
           u               unpublished     "Unpublished"
           s               pseudo          "pseudo"

        The letters 'pngus' are on the parameter line.
        If a letter is capitalized, then any entry with that string
        in it ANYWHERE will be killed.  This is harsh but effective
        at removing GenBank crap.

     8: If the first character is 'n' (for "notes") then if there
        is no /gene or /number for a feature, the program will
        use the /note qualifier.  WARNING:  Despite 15 years of complaining
        to GenBank, names in notes are NOT PARSABLE and may cause ill health.

     9: If the first character is 'r', 'R' or 'm' (for "mRNA")
        then the mRNA feature is used instead of CDS.
        Otherwise CDS is used.
        (Introduced 2007 Dec 10.)

     If exonp is old (before having a parameter version) exon will
     attempt to upgrade it.
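
     The capital-letter rule of parameter 7 can be sketched as follows.
     This is a minimal illustration in Python, not the exon.p source: the
     function name entry_is_killed is hypothetical, the match is assumed
     to be case sensitive, and only the capitalized case described above
     is modelled.

```python
# Hypothetical sketch of the parameter 7 filter; not taken from exon.p.
# Table from the documentation: letter -> string searched for in the entry.
CATEGORIES = [
    ("p", "putative"),
    ("n", "not_experimental"),
    ("g", "gene prediction"),
    ("u", "Unpublished"),
    ("s", "pseudo"),
]

def entry_is_killed(entry_text, harshness):
    """Return True if a capitalized letter's string occurs anywhere in the entry."""
    for letter, phrase in CATEGORIES:
        # Assumption: a capital letter anywhere on the parameter line switches
        # the category on, and the search is case sensitive.
        if letter.upper() in harshness and phrase in entry_text:
            return True
    return False

entry = "DEFINITION  putative kinase gene, exons 1-3."
print(entry_is_killed(entry, "pngus"))  # no capital letters on the line
print(entry_is_killed(entry, "Pngus"))  # 'P' is capitalized and "putative" occurs
```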

   db: a set of GenBank entries

   lengths: A list of the exon lengths found in db.

   dinst: Delila instructions for donor sites.

   ainst: Delila instructions for acceptor sites.

   einst: Delila instructions for exons.  The acceptor from (theAfromrange)
          and donor to (theDtorange) are used to extend beyond the exon
          edges.

   exonfeatures: Locations of exons in the format for the Lister program.

   output: messages to the user

description
   The program searches for 'CDS'.  If the next word is 'join' it parses out
   the parts of the CDS, determining the lengths of the exons.  If
   'complement' is found, the complementary exons are identified.
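
   The parsing step can be illustrated with a short Python sketch.  It is
   not the exon.p algorithm: the function names are hypothetical, only
   the outer complement(join(...)) form is recognized, references to
   other entries are not handled, and the coordinates in the example are
   made up.  Lengths assume that both end coordinates are inclusive.

```python
import re

def parse_cds_location(location):
    """Return (is_complement, [(start, end), ...]) for a simple CDS location."""
    text = location.replace(" ", "")
    # Assumption: only an outer complement(...) wrapper is recognized.
    is_complement = text.startswith("complement(")
    # Each exon is written as start..end; < and > mark undefined ends.
    spans = re.findall(r"([<>]?\d+)\.\.([<>]?\d+)", text)
    exons = [(int(a.lstrip("<>")), int(b.lstrip("<>"))) for a, b in spans]
    return is_complement, exons

def exon_lengths(exons):
    """Length of each exon, counting both end coordinates."""
    return [end - start + 1 for start, end in exons]

comp, exons = parse_cds_location("join(190..255,337..2799,2801..3733)")
print(comp, exon_lengths(exons))  # False [66, 2463, 933]
```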

   To ensure a clean data set, the program eliminates:

   * single exons in a locus (unreliable data for lengths)
   * exons which have one end not defined (< or > mark)
   * exons at the beginning or end of the CDS (unreliable data)
   * exons that are references to other entries
   * duplicate exons within a single locus
   * exons that have any coordinates the same as other exons in the same
     entry.  This (arbitrarily) eliminates alternative splice cases.
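
   Several of these rules can be sketched in Python.  This is an
   illustration under stated assumptions, not the exon.p code: the
   function name clean_exons and the tuple layout are hypothetical, every
   rule is applied unconditionally (the real program is controlled by
   exonp), one copy of a duplicated exon is kept, and references to other
   entries are not modelled.

```python
from collections import Counter

def clean_exons(cds_list):
    """cds_list: one list per CDS of (start, end, partial) exons in CDS order."""
    pooled = []
    for exons in cds_list:
        if len(exons) < 2:
            continue  # a single exon has no reliable length
        # Drop the first and last exon of each CDS, then partial exons.
        for start, end, partial in exons[1:-1]:
            if not partial:
                pooled.append((start, end))
    # Assumption: a duplicated exon is kept once, in first-seen order.
    unique = list(dict.fromkeys(pooled))
    # Count how many distinct exons use each coordinate.
    counts = Counter(c for span in unique for c in set(span))
    # Keep only exons that share no coordinate with any other exon.
    return [span for span in unique if all(counts[c] == 1 for c in span)]

cds_a = [(1, 100, True), (201, 300, False), (401, 500, False), (601, 700, False)]
cds_b = [(1, 100, True), (201, 300, False), (401, 550, False), (601, 700, False)]
print(clean_exons([cds_a, cds_b]))  # [(201, 300)]
```

   In the example the two exons starting at 401 share a coordinate, so
   both are dropped as a possible alternative splice.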

   To remove further junk from the database, entries that contain any of
   these phrases are skipped:

       'not_experimental'
       'gene prediction'
       'Unpublished'

   GenBank contains many mRNA sequences masquerading as DNA.  They can be
   identified by zero length introns.  They are ruthlessly eliminated.
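
   A zero length intron means that two consecutive exons abut.  A minimal
   Python sketch of such a check, assuming ascending inclusive
   coordinates (the function name is hypothetical and this is not the
   exon.p code):

```python
def has_zero_length_intron(exons):
    """exons: ascending list of (start, end); True if two neighbours abut."""
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # Intron length is the number of bases strictly between the exons.
        if next_start - prev_end - 1 == 0:
            return True
    return False

print(has_zero_length_intron([(1, 100), (101, 200)]))  # True
print(has_zero_length_intron([(1, 100), (150, 200)]))  # False
```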

   If a CDS has no /gene name in the feature table, it will be named like
   this:

       U00096.CDS.190-255      no /gene, no /number

   If a CDS has a /gene name in the feature table, it would be nice to name
   it like this:

       U00096.thrA             (this name can fail)

   Unfortunately that alone will fail because all exons end up being named
   the same!  So if there is a /number the name will include it:

       M95740.IDUA.exon-3      /gene, /number

   If there is a /gene but no /number the range will be given:

       M95740.IDUA.427-512     /gene, no /number

   So there are three options for names.
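
   The three naming options can be written as a small Python function.
   This is a sketch of the rules as stated above, not the exon.p code;
   the function name and argument names are hypothetical, and the case of
   a /number without a /gene is not described here, so the sketch falls
   back to the range form for it.

```python
def exon_name(accession, start, end, gene=None, number=None):
    """Build a name from the accession, range, and optional /gene and /number."""
    span = "%d-%d" % (start, end)
    if gene is None:
        # no /gene: use the feature key and the range
        return "%s.CDS.%s" % (accession, span)
    if number is not None:
        # /gene and /number
        return "%s.%s.exon-%s" % (accession, gene, number)
    # /gene but no /number: use the range
    return "%s.%s.%s" % (accession, gene, span)

print(exon_name("U00096", 190, 255))                        # U00096.CDS.190-255
print(exon_name("M95740", 427, 512, gene="IDUA", number=3)) # M95740.IDUA.exon-3
print(exon_name("M95740", 427, 512, gene="IDUA"))           # M95740.IDUA.427-512
```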

   * The exons are placed into ascending order.
   * The Delila name command is used to name the pieces.

examples

documentation

@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
  and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}

see also
   dbinst.p

author
   Thomas Dana Schneider

bugs

technical notes

   The program deals with alternative splicing by removing any exon that has
   any coordinate the same as another exon.

   The program can accept only a single type of organism to be put into
   the instruction files.  It is not clear that one would ever want to
   mix organisms for this analysis.

   The zero coordinate for splice junctions follows the convention of
   Stephens.Schneider.Splice: it is the base on the intron side of the splice
   junction.
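
   One reading of this convention, sketched in Python: for an exon with
   inclusive coordinates start..end, the acceptor zero is the last intron
   base before the exon and the donor zero is the first intron base after
   it.  The function name is hypothetical and the handling of the
   complementary strand is an assumption made by mirroring the direct
   strand case; it is not taken from exon.p.

```python
def splice_zero_coordinates(start, end, complement=False):
    """Return (acceptor_zero, donor_zero) for one internal exon, start <= end."""
    if complement:
        # Assumption: on the complement the exon is read from end down to
        # start, so the acceptor lies beyond end and the donor before start.
        return end + 1, start - 1
    # Direct strand: intron bases adjacent to the exon on each side.
    return start - 1, end + 1

print(splice_zero_coordinates(427, 512))                   # (426, 513)
print(splice_zero_coordinates(427, 512, complement=True))  # (513, 426)
```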

   2007 Dec 05.  The exon program would compile and run with exonmax
   set to 15000.  Unfortunately this is not enough for H.sapiens
   chromosome 1 (NC_000001, 247249719 bp).  With a larger exonmax the
   program compiles (gpc) but gives a 'Segmentation Fault' when run.
   The reason (thanks to David Bryant) is that the stack size in Unix
   is restricted.  The Unix command 'limit' gives
   'stacksize       8192 kbytes'.

   There are at least 3 solutions.

   1. The Unix stack size can be increased by the command:

         limit stacksize 65538

   Doing so solved the problem.

   Although this works, it requires changing an operating system
   setting, so it is not very portable.

   2. The exonmax determines the number of exonrecords:

   fealist: array[1..exonmax] of exonrecord;

   The exonrecords use the standard 'string' for the gene name, which
   has an array of characters whose size is determined by constant
   maxstring for which the default is 150.  Setting maxstring to 20
   solves the problem.  Although this would work, perhaps it is best
   to allow long names.

   3. Put the array into the program heap, which is not subject to the
   stack size limit, instead of the stack.  This requires a program
   change.  I implemented it by making the fealist be a pointer to an
   array: 'fealist^.a' replaced 'fealist' throughout the code.  This
   worked.
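
   The effect of maxstring on the stack footprint can be estimated with a
   little arithmetic, shown here as a Python sketch.  It counts only the
   name characters at one byte each and ignores every other field of
   exonrecord and everything else on the stack, so the results are rough
   upper bounds and not measurements of exon.p.  The function name is
   hypothetical.

```python
STACK_LIMIT_BYTES = 8192 * 1024  # 'stacksize 8192 kbytes' reported by limit

def max_records(limit_bytes, maxstring):
    """Upper bound on array elements if each needs maxstring bytes for its name."""
    return limit_bytes // maxstring

for maxstring in (150, 20):
    print(maxstring, max_records(STACK_LIMIT_BYTES, maxstring))
# 150 55924
# 20 419430
```

   Under these assumptions, shortening the names raises the bound on
   exonmax by a factor of about 7.5, while moving the array to the heap
   avoids the stack limit altogether.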

   Thanks to David Bryant for pointing out the situation and
   explaining the possible solution of putting the data on the heap.

*)
(* end module describe.exon *)
{This manual page was created by makman 1.45}


{created by htmlink 1.62}