Prodigal's algorithm for gene prediction follows the basic principle of KISS (Keep It Simple, Stupid). Compared to other methods, Prodigal's naive log-likelihood functions seem deceptively simple. Despite its lack of complexity (no Hidden Markov Model, no Interpolated Markov Model, etc.), Prodigal nonetheless achieves good results.
The basic steps of the Prodigal algorithm can be summarized as follows:
- Constructing a training set for protein coding: Many genefinders just take all open reading frames, or ORFs, above a certain size and consider them to be real genes. While this may be fine for low-GC genomes, this assumption proves dangerous in high GC organisms. Due to the lack of A and T in high GC genomes, there are many fewer stop codons. Long ORFs occur simply by chance in high GC genomes, and many of them aren't real genes at all. Prodigal addresses this problem with GC frame plot based training, wherein it examines all the ORFs in a genome looking for a bias for G or C in the 1st, 2nd, and 3rd positions of each codon. It then does a dynamic programming across the entire genome, building gene models using this frame plot bias as its only coding scoring function. While the gene models built by this initial run are far from perfect, they provide a sound enough basis to gather coding statistics from, and a far better basis than merely choosing all ORFs above a certain size.
- Building log-likelihood coding statistics from the training data: Prodigal gathers dicodon (hexamer) statistics for all the genes in its initial dynamic programming model. The coding function is a simple log-likelihood of signal to background. Once this function has been established, every potential gene in the genome (all possible starts and stops) is scored. This simple function, plus the two factors listed below, are all Prodigal uses in the way of coding scores.
- Sharpening coding scores: Once Prodigal has scored all potential candidates in a given ORF, it then implements a "sharpening" of the coding score, wherein it penalizes all potential start candidates that lie downstream from a higher-scoring start. The reason for this is that if we choose a more interior start in the dynamic programming stage, we are also NOT choosing the region upstream of that start. Therefore, an additional penalty is assigned representing this bypassing of a good coding region. For example, gene 3701-4000 has a score of 100. Gene 3763-4000 has a score of 75. We revise the score of gene 3763-4000 to be 75 minus the coding not selected (25), or 50.
- Length factor to coding: A static length factor is added to the coding score. This factor is higher in low GC genomes, and lower in high GC genomes. If an ORF is especially long, but has negative coding, its coding score is artificially replaced with a small positive coding per base. This enables the long ORF to be recognized as a true gene, but it won't be chosen over a genuinely good alternative.
- Iterative start training: For every open reading frame containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded. This set of "coding peaks", although usually only 60-70% likely to be the true gene start, provides a sound foundation for start training. These starts are examined for ATG/GTG/TTG frequency and ribosomal binding site (RBS) motifs. The starts are then rescored based on these discoveries, and the new set of starts with the highest score in each ORF is selected. The start trainer iterates until the set of "best starts" no longer changes (usually only a few iterations). This final set of "best starts" is used as the training set for start scoring, and data is gathered from this set regarding RBS motifs, distances, and ATG/GTG/TTG frequency.
- Final dynamic programming: A final dynamic programming is performed over the set of all start-stop pairs in the genome. Each potential gene's score is the sum of its start score and its coding score. Some small overlap is allowed between two genes on the same strand, and a greater amount of overlap is allowed for 3' ends of two genes on opposite strands. Bonuses are given to the scoring for potential operon distances, with larger bonuses given to -1 and -4 base overlaps between two genes on the same strand.