Markov Chain Promoter Finder McPromoter MM:II

Changes between version 3 and MM:II

The web interface was rewritten; submitting sequences with recent browser versions should now cause far fewer problems. We also added sequence upload in addition to cut and paste.
The underlying code has been sped up by a factor of 5 to 10; we therefore now allow submissions of up to 20 kb.
McPromoter MM:II has a button to choose your preferred cut-off value for hits. The lower the cut-off, the higher the sensitivity and the more hits you will get, but also the more false positives among them.
Note: Because the algorithms differ, the threshold and output values lie on different ranges for Drosophila and human: the Drosophila models give predictions between 0 and 1, the human models between approximately -0.5 and +0.1. You can see the score profile along the whole sequence in the graphics attached to the result mail.

How do I submit my sequence?

Upload your DNA sequence, or paste it into the sequence box. Your sequence should consist of one-letter nucleotide codes (A, C, G, T). Characters that do not uniquely determine a base (e.g. R or N) are replaced at random. The sequence should be in plain text or FASTA format. FASTA format looks like this:

>gb|V00574|HSRAS1 Human germ line gene homologous to bladder carcinoma oncogene T24
GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGG
CGGCGCCGCGCTGCCGGCGCAGGGAGGGCCTCTGGT

Please be aware that lines longer than 1024 characters will be truncated! You can choose whether to show predictions for the forward strand only or for the backward strand as well.
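Before submitting, it can help to check a sequence against the constraints described on this page (short lines, 300 bp minimum, 20 kb maximum). The following Python sketch is only an illustration and not part of McPromoter: the function name prepare_sequence, the 60-character line width and the reading of 20 kb as 20,000 bases are assumptions. It also resolves ambiguity codes itself by drawing from the bases each IUPAC code stands for, which is one possible reading of "replaced at random"; the server's own rule may differ.

```python
import random

# IUPAC ambiguity codes mapped to the bases they can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

MIN_LEN = 300      # minimum length stated on this page
MAX_LEN = 20000    # assumption: "20 kb" read as 20,000 bases


def prepare_sequence(raw, header="my_sequence", width=60, seed=None):
    """Return a FASTA string containing only A/C/G/T, wrapped to short lines."""
    rng = random.Random(seed)
    bases = []
    for line in raw.splitlines():
        if line.startswith(">"):
            continue  # drop any existing FASTA header
        for ch in line.upper():
            if ch in "ACGT":
                bases.append(ch)
            elif ch in IUPAC:
                # Assumption: pick one of the compatible bases at random.
                bases.append(rng.choice(IUPAC[ch]))
            elif ch.isspace() or ch.isdigit():
                continue  # ignore layout characters from plain-format files
            else:
                raise ValueError("unexpected character: %r" % ch)
    seq = "".join(bases)
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError("sequence length %d outside %d..%d"
                         % (len(seq), MIN_LEN, MAX_LEN))
    # Wrap to short lines so that no line comes near the 1024-character limit.
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(chunks) + "\n"
```

Passing a fixed seed makes the random replacement reproducible, which is convenient when the same sequence is submitted more than once.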

The program works by shifting a 300-base window over the sequence and evaluating its content every 10 bases. If a promoter is detected, the position within the window at which the model enters the initiator state is reported. As a consequence, NO predictions are made within the first 250 bases on the forward strand or the last 250 bases on the backward strand, and your sequence has to be at least 300 bp long.
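To get a feeling for how many windows are evaluated for a given sequence, the sliding-window scheme can be sketched as follows. This is an illustration only: the function name window_starts is hypothetical, and it assumes that the first window starts at the first base and that a trailing partial window is not evaluated.

```python
WINDOW = 300  # window length in bases, as stated above
STEP = 10     # shift between consecutive windows, as stated above


def window_starts(seq_len, window=WINDOW, step=STEP):
    """Return the 0-based start positions of all full windows."""
    if seq_len < window:
        return []  # sequences shorter than one window cannot be scored
    return list(range(0, seq_len - window + 1, step))
```

For example, a sequence of 1000 bases gives window starts 0, 10, ..., 700, that is 71 windows, while a sequence of 299 bases gives none.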
The output of McPromoter is a list of predicted transcription start sites in GFF format. The score printed next to each predicted site is the output of the predictor and lies between 0 and 1 (Drosophila) or between approximately -0.5 and +0.1 (human), with larger values being better. The threshold defines the minimum score a promoter must reach to be reported. If there are multiple predictions within 500 bases (Drosophila) or 2000 bases (human), only the best one is shown.
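If you want to post-process a result list yourself, the thresholding and "best hit within a neighborhood" rule can be approximated as below. This is a sketch under assumptions, not the server's code: it assumes the standard tab-separated GFF column order (sequence name, source, feature, start, end, score, strand, ...), numeric scores, and a greedy selection that goes through the hits from best to worst score. The function names are hypothetical.

```python
def parse_gff(text):
    """Parse tab-separated GFF lines into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a complete GFF record
        hits.append({
            "seqname": fields[0],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "score": float(fields[5]),
            "strand": fields[6],
        })
    return hits


def best_per_neighborhood(hits, threshold, min_distance):
    """Keep hits scoring at least `threshold`; among hits on the same strand
    that lie within `min_distance` bases of each other, keep only the best."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit["score"] < threshold:
            continue
        clashes = any(
            hit["strand"] == k["strand"]
            and abs(hit["start"] - k["start"]) <= min_distance
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    return sorted(kept, key=lambda h: h["start"])
```

With min_distance set to 500 for Drosophila or 2000 for human, this mirrors the distances given above; lowering the threshold argument shows which additional sites would be reported.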

How can I obtain the most meaningful results with McPromoter?

McPromoter MM:II has been greatly improved over the last release, especially for Drosophila, but of course, false positives are still around. Anything that you know about your sequence is thus helpful to restrict the search to meaningful parts of it. Look only at the strand on which your gene is located, and use other results (e.g. BLAST hits or cDNA/EST alignments) to throw out parts of your sequence which most likely do not contain a promoter.
Any cut-off threshold is a compromise between sensitivity and specificity. Thus, if you don't get a hit, try a lower threshold. You don't have to re-run your sequence for that; a look at the attached plots reveals almost everything.
If the system detects multiple hits close to each other, the result list contains only the best one; if you expect multiple initiation sites, a look at the graphics might also help to reveal neighboring maxima.

What are those plots attached to the result email good for?

We provide a plot for each strand, depicting the system output over your submitted sequence. This can help to quickly find local maxima that are below the threshold, or multiple hits that are close to each other (see the section above).

The plot (or two, if you chose the "forward & reverse" option) contains the output values of the promoter model along the sequence. The selected cut-off value is also shown.

And now I want to know: What is inside McPromoter?

Version 3.0 of McPromoter is a hybrid method to look for eukaryotic polymerase II transcription start sites. The system contains a background model consisting of states for coding and non-coding sequences and a promoter model which divides a promoter into six consecutive segments. The models are applied to a window of 300 bases, and in the Drosophila model, the promoter and the background likelihoods are then fed into a neural network, along with likelihoods representing the DNA structure in the six segments. The program makes a prediction for each local maximum above the threshold, applying the neural network in steps of 10 bp along the sequence. Because McPromoter is a statistical system, it does not require certain patterns to be present, only that the combination of all features scores well enough. For example, even if the TATA box score is very low, there can still be predictions if the other features score well.
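The step "a prediction for each local maximum above the threshold" can be illustrated with a small peak-picking routine. This is a minimal sketch, not the McPromoter implementation: it assumes the score profile is available as a plain list with one value per 10 bp step, it ignores the two end points of the profile, and the function name local_maxima is hypothetical.

```python
def local_maxima(scores, threshold, step=10):
    """Return (position, score) pairs for local maxima at or above threshold.

    `scores` holds one predictor output per step along the sequence;
    the position is reported as index * step.
    """
    peaks = []
    for i in range(1, len(scores) - 1):
        s = scores[i]
        # A peak rises above its left neighbor and does not fall below
        # its right neighbor, so a flat top is reported once.
        if s >= threshold and s > scores[i - 1] and s >= scores[i + 1]:
            peaks.append((i * step, s))
    return peaks
```

Calling it on the profile [0.1, 0.4, 0.2, 0.8, 0.9, 0.3, 0.6, 0.1] with threshold 0.5 returns the peaks at positions 40 and 60; with threshold 0.3 the weaker peak at position 10 is reported as well, which is the sensitivity versus specificity trade-off in miniature.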

The models were trained on a representative set of vertebrate promoters and human non-promoter sequences, and on D. melanogaster promoters and non-promoters, respectively (see link below). Cross-validation on the human promoter/non-promoter data set delivered an equal recognition rate of 86.9%, with a correlation coefficient (CC) of 0.67. On the promoters of known genes in human chromosome 22, we could identify 52% of the promoters with one false positive every 84 kb. Cross-validation on the Drosophila promoter/non-promoter data set delivered an equal recognition rate of 89.2%, with a CC of 0.78. On a set of 92 Drosophila genes from the well-studied Adh region, we could identify 52% of the promoters with one false positive every 12 kb.
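This page does not define the correlation coefficient. A definition commonly used when evaluating promoter and gene predictors is the correlation between predicted and true class labels, computed from the counts of true and false positives and negatives (also known as the Matthews correlation coefficient). The sketch below assumes that definition; whether it matches the exact computation behind the figures above is not stated here, and the function name is hypothetical.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation between predicted and true labels from confusion counts.

    tp/tn: correctly classified promoters / non-promoters
    fp/fn: non-promoters called promoters / promoters that were missed
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention for the degenerate case of an empty row or column
    return (tp * tn - fp * fn) / denom
```

The value is 1 for a perfect classifier, 0 for one that does no better than chance, and -1 for one that gets every example wrong.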

Get the whole story!

U. Ohler, Computational Promoter Recognition in Eukaryotic Genomic DNA.
All you (n)ever wanted to know about promoter finding... either as a book from the LOGOS publishing house, or as a preprint of the thesis, submitted to the University of Erlangen in 2001 (pdf).

Our methods were also described in detail in the following papers. Please cite paper (7) when quoting results obtained with McPromoter for Drosophila, and paper (3) for results on human.

(1) U. Ohler, S. Harbeck, H. Niemann, E. Noeth and M. G. Reese
Interpolated Markov chains for eukaryotic promoter recognition.
Bioinformatics 15(5), p. 362-369, 1999.

(2) U. Ohler, S. Harbeck and H. Niemann
Discriminative training of language model classifiers.
Proc. European Conference on Speech Communication and Technology (EUROSPEECH), Budapest 1999.

(3) U. Ohler, G. Stemmer, S. Harbeck and H. Niemann
Stochastic segment models of eukaryotic promoter regions.
Proc Pacific Symposium on Biocomputing 5:377-388, Honolulu 2000.

(4) U. Ohler
Promoter prediction on a genomic scale - the Adh experience.
Genome Res 10(4):539-542, 2000.

(5) U. Ohler and H. Niemann
Identification and analysis of eukaryotic promoters: recent computational approaches.
Trends Genet. 17:56-60, 2001.

(6) U. Ohler, H. Niemann, G. Liao and G. M. Rubin
Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition.
Bioinformatics 17:S199-S206, 2001.

(7) U. Ohler, G. Liao, H. Niemann and G. M. Rubin
Computational analysis of core promoters in the Drosophila genome.
Genome Biol. 3:research0087.1-0087.12, 2002.

Preprints of papers (1)-(6) can be found here.
More information and our training and test sequences are publicly available (click here).

McPromoter promoter predictor by Uwe Ohler
Web interface by Moussa Sagna and Uwe Ohler