
RNA Secondary Structure Prediction PDF Free: A Survey of Datasets and Evaluation Metrics



The work of Knudsen and Hein (12) (here denoted as the KH-99 algorithm) combines an explicit evolutionary model of RNA sequences with a probabilistic model for secondary structures. It assumes an alignment and gives one common structural prediction for all the sequences.


A structure prediction for three hypothetical sequences. In the top alignment, gaps are treated as unknown nucleotides. The structure, shown as parentheses, includes pairs between nucleotides and gaps. In the parenthesis notation, matching parentheses indicate positions that form base pairs. In the bottom alignment, the columns with gaps have been left out of the prediction, because








RNA secondary structure prediction methods based on probabilistic modeling can be developed using stochastic context-free grammars (SCFGs). Such methods can readily combine different sources of information that can be expressed probabilistically, such as an evolutionary model of comparative RNA sequence analysis and a biophysical model of structure plausibility. However, the number of free parameters in an integrated model for consensus RNA structure prediction can become untenable if the underlying SCFG design is too complex. Thus a key question is, what small, simple SCFG designs perform best for RNA secondary structure prediction?


Nine different small SCFGs were implemented to explore the tradeoffs between model complexity and prediction accuracy. Each model was tested for single sequence structure prediction accuracy on a benchmark set of RNA secondary structures.


Four SCFG designs had prediction accuracies near the performance of current energy minimization programs. One of these designs, introduced by Knudsen and Hein in their PFOLD algorithm, has only 21 free parameters and is significantly simpler than the others.


In addition to the Knudsen and Hein approach, at least three other SCFG-based approaches to RNA secondary structure prediction have been described. These include an SCFG-based mirror of the standard Zuker algorithm for single-sequence structure prediction [35], and two "pair-SCFG" approaches for simultaneous folding and alignment of two homologous RNAs [31, 36]. All four papers use different underlying SCFG designs. No group appears to have explored different possible SCFG designs before settling on the one they used. Only Knudsen and Hein reported any benchmark results for the accuracy of their secondary structure predictions [18, 26]. It is not known how different designs affect the accuracy of SCFG-based secondary structure prediction. Flexibility in model design comes from the fact that SCFG probability parameter estimation can be done by counting frequencies in databases of trusted RNA secondary structures, so it is easy to parameterize different models that vary in complexity and capture different features of RNA structure. In contrast, energy minimization algorithms are based on a standard set of thermodynamic parameters, most of which are determined experimentally [2, 7], so it would take substantial effort to develop a radically new thermodynamic model.


Design decisions are likely to be particularly important in consensus structure prediction applications, because a natural trade-off arises. A complex RNA folding SCFG might predict structures for single sequences better than a simpler model, but extending a complex RNA folding SCFG to deal with multiple evolutionarily correlated sequences can easily result in a combinatorial explosion of parameters, making the model impractical. One wants to build consensus prediction models on top of small, simple (i.e. "lightweight") SCFG designs that sacrifice as little RNA structure prediction accuracy as possible, relative to state-of-the-art energy minimization approaches.


Here we explore the impact of different SCFG designs on single-sequence RNA secondary structure prediction accuracy. Our goal is to identify lightweight SCFG model designs that can serve as cores underlying more complex integrated approaches. We have implemented nine different lightweight SCFGs, estimated their parameters from rRNA structure data, evaluated their prediction accuracy on a benchmark of trusted RNA structures, and compared these results to the accuracy of energy minimization methods.


Dynamic programming algorithms for non-pseudoknotted RNA secondary structure prediction work by calculating scores for optimal foldings of all subsequences x_i ... x_j, starting with subsequences of zero length and working outwards recursively on increasingly longer subsequences [2]. A simple example of such an RNA folding algorithm is given in [3].
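One well-known recursion of this kind is Nussinov-style base-pair maximization. The sketch below is only an illustration and is not necessarily the algorithm cited as [3]; the function names, the set of allowed pairs (Watson-Crick plus G-U), and the `min_loop` parameter are assumptions made for the example.

```python
# Bases allowed to pair in this illustration (assumption: Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=0):
    """Return (max number of nested base pairs, dot-bracket structure).

    best[i][j] holds the optimal score for the subsequence seq[i..j].
    Entries with i > j stay 0 and stand for empty subsequences.
    min_loop is a hypothetical knob: i and j may pair only if j - i > min_loop.
    """
    n = len(seq)
    if n == 0:
        return 0, ""
    best = [[0] * n for _ in range(n)]

    def pair_score(i, j):
        # Score when i pairs with j, or None if that pair is not allowed.
        if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
            return best[i + 1][j - 1] + 1
        return None

    # Fill by increasing subsequence length, shortest first.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j unpaired
            paired = pair_score(i, j)
            if paired is not None:
                score = max(score, paired)
            for k in range(i + 1, j):  # bifurcation into two substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score

    # Traceback: recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif pair_score(i, j) == best[i][j]:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return best[0][n - 1], "".join(structure)
```

For instance, `nussinov("GGGAAACCC")` pairs the three G residues with the three C residues and leaves the A residues unpaired.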


Nonstochastic CFGs are used in pattern search applications, where one represents an RNA structural consensus as a CFG and asks whether a particular sequence matches that query. They are not useful for structure prediction. For the CFG above, for example, any RNA sequence has a huge number of valid parse trees, each of which corresponds to a possible RNA secondary structure. However, our problem in structure prediction is not to determine whether an RNA sequence has at least one possible structure. Given a sequence, we want to score and rank the possible parse trees for that sequence to infer the optimal one. To score and rank parse trees, we need to use stochastic context-free grammars. In addition, we need efficient algorithms for finding the optimal SCFG parse tree for a given sequence.


The near-exact correspondence between the CYK algorithm and standard dynamic programming algorithms for RNA folding should be clear. SCFG algorithms are essentially the same as existing RNA folding algorithms, but the scoring system is probabilistic, based on factoring the score for a structure down into a sum of log probability terms, rather than factoring the structure into a sum of energy terms or arbitrary base-pair scores. The thermodynamic scoring parameters for energy minimization are largely derived by experimental melting studies of small model structures [7]; in contrast, SCFG log probability parameters are derived from frequencies observed in training sets of known RNA secondary structures. That is, instead of scoring a G-C pair stacked on a C-G base pair by adding a term for the free energy contribution of the GC/CG stack, an SCFG would add a log probability that GC/CG stacks are observed in known RNA structures.
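To make the correspondence concrete, here is a minimal sketch of CYK on a toy single-nonterminal grammar, S → a S b | a S | a | S S. The grammar, the rule names, and every probability below are hypothetical values chosen for illustration; they are not one of the grammars or trained parameter sets discussed in the text. The recursion has the same shape as a folding algorithm, but each term is a log probability.

```python
import math

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

# Hypothetical toy parameters, NOT trained values. Rule probabilities sum to 1.
RULE_LOGP = {
    "pair": math.log(0.3),  # S -> a S b
    "left": math.log(0.3),  # S -> a S
    "end": math.log(0.2),   # S -> a
    "bif": math.log(0.2),   # S -> S S
}
SINGLE_LOGP = math.log(0.25)  # uniform emission over A, C, G, U


def pair_logp(a, b):
    # 6 canonical pairs at 0.15 plus 10 other pairs at 0.01 sum to 1.
    return math.log(0.15 if a + b in CANONICAL else 0.01)


def cyk(seq):
    """Return (log probability of the best parse, its dot-bracket structure)."""
    n = len(seq)
    if n == 0:
        raise ValueError("this toy grammar cannot derive the empty sequence")
    score = [[float("-inf")] * n for _ in range(n)]
    back = [[None] * n for _ in range(n)]
    for i in range(n):
        score[i][i] = RULE_LOGP["end"] + SINGLE_LOGP
        back[i][i] = ("end",)
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cands = [(RULE_LOGP["left"] + SINGLE_LOGP + score[i + 1][j], ("left",))]
            if span >= 2:  # a pair must enclose at least one residue here
                cands.append((RULE_LOGP["pair"] + pair_logp(seq[i], seq[j])
                              + score[i + 1][j - 1], ("pair",)))
            for k in range(i, j):
                cands.append((RULE_LOGP["bif"] + score[i][k] + score[k + 1][j],
                              ("bif", k)))
            score[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    # Follow the back pointers to read off the structure of the best parse.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        rule = back[i][j]
        if rule[0] == "left":
            stack.append((i + 1, j))
        elif rule[0] == "pair":
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        elif rule[0] == "bif":
            stack.extend([(i, rule[1]), (rule[1] + 1, j)])
    return score[0][n - 1], "".join(structure)
```

Because this toy grammar is structurally ambiguous, the value returned is the score of the best parse tree, which is not in general the probability of the reported structure.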


The CYK algorithm finds the optimal parse tree; we want the optimal secondary structure. The optimal parse tree gives us the optimal structure if and only if there is a one-to-one correspondence between parse trees and secondary structures. However, a given secondary structure does not necessarily have a unique parse tree. For instance, consider the two possible parse trees for the example in Figure 1, both of which express the same set of base pairs but use different series of production rules. (We consider two structures to be identical if they have the same set of base pairs.) When multiple valid parse trees describe the same secondary structure, we call the grammar structurally ambiguous. If a grammar is structurally ambiguous, then we cannot equate the probability of a parse tree with the probability of its structure [43]. The probability of a structure is a sum over the probabilities of all parse trees consistent with that structure. This summation cannot be reconciled with the CYK algorithm; an optimal structure cannot be calculated efficiently if we need to sum over multiple possible parse trees for each structure. Thus, we must either use grammars that are structurally unambiguous, or accept as an approximation that the optimal parse tree gives us the optimal structure. We explore this issue in the results.
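The distinction can be written compactly. The notation here is an assumption introduced for illustration: x is the sequence, π a parse tree, s a secondary structure, and struct(π) the set of base pairs implied by π.

```latex
\[
P(s \mid x) \;=\; \sum_{\pi \,:\, \mathrm{struct}(\pi) = s} P(\pi \mid x),
\qquad
\hat{\pi} \;=\; \arg\max_{\pi} P(\pi \mid x),
\qquad
\hat{s} \;=\; \arg\max_{s} P(s \mid x).
\]
```

CYK returns the parse tree \hat{\pi}. Its structure struct(\hat{\pi}) is guaranteed to equal \hat{s} only when each structure has exactly one parse tree, so that the sum has a single term.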


An interesting difference between the thermodynamic and probabilistic approaches with respect to ambiguity is worth noting. The thermodynamic scoring scheme is not normalized, so structural ambiguity is not an issue for finding optimal structures; regardless of how many different ways there are of scoring the energy of a structure, the lowest energy structure still wins. However, ambiguity becomes a painstaking issue for calculating the equilibrium partition function [21], where one must be careful not to count any structure more than once. For SCFG-based methods, with normalized probabilities as scores, exactly the opposite is the case. Ambiguity is an issue for optimal structure prediction, but the summed Inside calculation (the analog of the summed partition function calculation) gives the correct result even for ambiguous grammars.
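As an illustration of the summed calculation, the following sketch computes the Inside probability for a toy ambiguous grammar, S → a S b | a S | a | S S. The grammar and all numbers are hypothetical and chosen only for the example. Every parse tree contributes to the total exactly once, so the result is the probability of the sequence under the model even though several parse trees may share a structure.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

# Hypothetical toy parameters, NOT trained values. Rule probabilities sum to 1.
P_PAIR, P_LEFT, P_END, P_BIF = 0.3, 0.3, 0.2, 0.2
E_SINGLE = 0.25  # uniform emission over A, C, G, U


def e_pair(a, b):
    # 6 canonical pairs at 0.15 plus 10 other pairs at 0.01 sum to 1.
    return 0.15 if a + b in CANONICAL else 0.01


def inside(seq):
    """Return P(seq) summed over all parse trees of the toy grammar.

    Plain probabilities are used for readability; for long sequences a real
    implementation would work in log space or rescale to avoid underflow.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("this toy grammar cannot derive the empty sequence")
    prob = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prob[i][i] = P_END * E_SINGLE  # S -> a
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            total = P_LEFT * E_SINGLE * prob[i + 1][j]  # S -> a S
            if span >= 2:  # S -> a S b, enclosing at least one residue
                total += P_PAIR * e_pair(seq[i], seq[j]) * prob[i + 1][j - 1]
            for k in range(i, j):  # S -> S S
                total += P_BIF * prob[i][k] * prob[k + 1][j]
            prob[i][j] = total
    return prob[0][n - 1]
```

The only change from a CYK-style recursion is that the maximum over alternatives is replaced by a sum.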


The RNA SCFG shown above factors a secondary structure into scoring terms for each individual base pair and each individual unpaired residue. In this paper we will examine four additional grammars of this type. However, state of the art thermodynamic models use a loop-dependent thermodynamic model that factors a structure in a more complex way, into nearest-neighbor base stacking terms (as opposed to individual base pairs) and tables of penalties for different lengths of different kinds of loops (bulge, interior, hairpin, and multifurcation). SCFG methods can also capture more sophisticated folding features.


The parameters of each SCFG were estimated from frequencies observed in annotated secondary structures. The training data were large and small subunit rRNAs, obtained from the European Ribosomal Database [47, 48]. Sequences containing more than 5% ambiguous bases and with less than 40% base pairing were discarded. The resulting data set was then filtered to remove sequences with greater than 80% identity. The final training set contained randomly chosen equal numbers of LSU and SSU sequences from the filtered data, totaling 278 sequences, 586,293 nucleotides, and 146,759 base pairs.


For a given grammar, a parse tree for each structure is determined from the secondary structure annotation, and the number of occurrences of each production type is counted. Production probabilities are then estimated from these counts using a Laplace (plus-one) prior [10].
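A minimal sketch of the plus-one estimate, assuming the production counts have already been collected from the parse trees. The function name, the data layout, and the example grammar are hypothetical.

```python
def estimate_production_probs(grammar, counts):
    """Estimate production probabilities with a Laplace (plus-one) prior.

    grammar: dict mapping each nonterminal to the list of its productions.
    counts:  dict mapping (nonterminal, production) to the observed count;
             productions never observed may simply be absent.
    Returns a dict mapping (nonterminal, production) to a probability, such
    that the productions of each nonterminal sum to 1.
    """
    probs = {}
    for lhs, productions in grammar.items():
        # Add one pseudocount to every production of this nonterminal.
        smoothed = {rhs: counts.get((lhs, rhs), 0) + 1 for rhs in productions}
        total = sum(smoothed.values())
        for rhs, c in smoothed.items():
            probs[(lhs, rhs)] = c / total
    return probs


# Example with made-up counts for a hypothetical one-nonterminal grammar.
toy_grammar = {"S": ["aSb", "aS", "a"]}
toy_counts = {("S", "aSb"): 6, ("S", "aS"): 1}
toy_probs = estimate_production_probs(toy_grammar, toy_counts)
```

The pseudocount keeps productions that never occur in the training data from receiving probability zero, which would otherwise make any parse tree using them impossible.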

