7. Local Structure Prediction of Proteins


as well as increased confidence in predictions made could be gained from testing the possibility of it containing a coiled-coil supersecondary structure.

The program COILS2 (Lupas et al., 1991; Lupas, 1996) compares a query sequence with a database of known parallel two-stranded coiled coils. A similarity score is derived and compared to two score distributions, one for globular proteins (without coiled coils) and one for known coiled-coil structures. The two comparisons are then converted into a probability that the query sequence adopts a coiled-coil conformation. Since the program assumes the presence of heptad repeats, probabilities are derived using default window lengths of 14, 21, and 28 amino acids. The program can also use user-defined window lengths for the prediction of extreme coiled-coil lengths. A recently updated scoring matrix, based on data from recent coiled-coil structures and containing amino acid type propensities for various positions in the heptad repeat, shows improved recognition of coiled-coil elements. The COILS2 method accurately recognizes left-handed two-stranded coiled coils but loses sensitivity for coiled-coil structures consisting of more than two strands. It is also unable to recognize right-handed or buried coiled-coil helices and is therefore not applicable to transmembrane coiled-coil structures, which are known to adopt coiled-coil conformations essentially similar to those of soluble proteins, albeit with dramatically different and more hydrophobic constituent amino acids (Langosch and Heringa, 1998).

Software Package 2. WD-repeats Prediction

The server “WD repeat Family of Proteins” (see http://bmerc-www.bu.edu/wdrepeat/) is able to recognize putative WD-repeat sequences associated with 4- to 9-bladed 3D WD-repeat structures. The underlying models combine a particular so-called Type-1 structural model with sequence-specific pattern information. Multidomain proteins can be handed to the server intact; the region containing the WD-repeat domain will be identified by the server automatically.

The analysis algorithm is based on probabilistic Discrete State-space Models (DSMs), and optimal filtering and smoothing algorithms (Stultz et al., 1993). The mathematical basis for the models and algorithms is described in White et al. (1994).

A protein sequence submitted to the server is first classified as “generic” or “wd repeat.” The class “generic” is designed for proteins not containing WD repeats. The superclass “wd repeat” is designed for the WD-repeat family of proteins. Under this superclass, there are six macroclasses for WD-repeat proteins, each of which contains a different number of WD repeats. A sequence containing fewer than four WD repeats will not be reported as a WD-repeat protein. This is due to the assumption that all WD-repeat proteins adopt a β-propeller fold, which must have at least four blades to form a circular structure. The 4- to 9-bladed (WD4 to WD9) models that can be produced by the server correspond to sequence length ranges of 187–279, 233–332, 278–385, 323–437, 368–489, and 413–541 residues, respectively. To handle longer sequences, the algorithm is able to add a leader and a trailer to the models on the fly. Therefore, all models can recognize WD repeats within sequences longer than their maximum domain length, up to an upper limit on sequence length of 1000 residues.
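The length ranges quoted above overlap, so a domain of a given length can be compatible with more than one blade count. The following minimal sketch simply tabulates those ranges and looks up the compatible models; the names `WD_MODEL_RANGES`, `candidate_models`, and `within_server_limit` are hypothetical and are not part of the server, which selects models probabilistically rather than by length alone.

```python
# Domain length ranges (in residues) quoted in the text for the
# 4- to 9-bladed models. All names here are illustrative only.
WD_MODEL_RANGES = {
    "WD4": (187, 279),
    "WD5": (233, 332),
    "WD6": (278, 385),
    "WD7": (323, 437),
    "WD8": (368, 489),
    "WD9": (413, 541),
}

# Upper limit on the total sequence length accepted, as stated in the text.
MAX_SEQUENCE_LENGTH = 1000


def candidate_models(domain_length):
    """Return the models whose quoted length range contains domain_length."""
    return [
        name
        for name, (shortest, longest) in WD_MODEL_RANGES.items()
        if shortest <= domain_length <= longest
    ]


def within_server_limit(sequence_length):
    """Check a full sequence length against the quoted 1000-residue limit."""
    return sequence_length <= MAX_SEQUENCE_LENGTH


print(candidate_models(300))
```

A 300-residue domain, for instance, falls inside both the WD5 and WD6 ranges, which is why length alone cannot decide the number of blades.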


V.A. Simossis and J. Heringa

Each WD repeat has two conserved profiles denoted “profile 1” and “profile 2” (which may be approximated as “GHXXXVXXVXFX” and “XLASGSXDXTIKVWD,” respectively, as shown at http://bmerc-www.bu.edu/wdrepeat/) that are used in the DSM prediction. The probabilities of occurrence of each of these profiles will be reported if WD repeats are identified in the sequence. In addition, the strands within each of the aligned putative WD repeats will be designated, although individual β-strand probabilities will not be reported. To provide the user with insight into the 3D orientation of the WD repeats, a skeleton coordinate file in PDB format is included.
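To give a feel for what the approximate consensus strings express, the sketch below scans a sequence for the window that best matches such a string, treating X as a wildcard. This is a deliberately crude stand-in under stated assumptions: the server uses probabilistic DSMs, not string matching, and the function names and the fraction-matched score are hypothetical.

```python
# Approximate consensus strings quoted in the text; "X" means any residue.
PROFILE_1 = "GHXXXVXXVXFX"
PROFILE_2 = "XLASGSXDXTIKVWD"


def consensus_score(window, consensus):
    """Fraction of the non-X consensus positions matched by the window."""
    informative = [(w, c) for w, c in zip(window, consensus) if c != "X"]
    return sum(w == c for w, c in informative) / len(informative)


def best_hit(sequence, consensus):
    """Return (start, score) of the best-matching window, or None.

    Ties are resolved in favour of the earliest window. Scoring by the
    fraction of matched positions is an assumption of this sketch.
    """
    width = len(consensus)
    if len(sequence) < width:
        return None
    scored = [
        (consensus_score(sequence[i:i + width], consensus), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scored, key=lambda item: (item[0], -item[1]))
    return start, score


# Toy sequence with an exact profile-1 match starting at index 4.
toy = "AAAA" + "GHAAAVAAVAFA" + "CC"
print(best_hit(toy, PROFILE_1))
```

Real WD repeats are far more variable than these strings suggest, which is the motivation for the probabilistic models described above.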

7.8.3 Disordered Region Prediction

Software Package 1. PONDR

The PONDR suite contains several disorder prediction methods (Obradovic et al., 2003). The predictions from the methods VL2 and VLXT come from ensembles of feedforward NNs trained on combinations of amino acid composition, flexibility, and sequence complexity. Sequence information is parsed using windows of generally 21 amino acids. The amino acid attributes are calculated over this window, and these values are used as inputs for the NNs, which calculate a value for the central amino acid in each window. These prediction values are then smoothed over a sliding window of 9 amino acids. If a residue value exceeds a threshold, the residue is declared disordered. Another predictor, VL3, was trained using ordinary least squares regression with partitioning of the training set to cluster various “flavors” of disorder (Vucetic et al., 2003). Recently, a new disorder predictor, VSL1, was added to the PONDR suite. The VSL1 predictor obtained the best results in a comparison including 20 different disorder prediction methods presented at the CASP6 structure prediction meeting in December 2004. The methods in the PONDR suite are not freely available.

Software Package 2. FoldIndex

The FoldIndex program is based on the calculations developed by Uversky et al. (2000) and predicts whether a sequence will fold by computing its mean net charge and mean hydrophobicity. The window parameter for the FoldIndex classifier was set to 31 residues, as this value achieved the highest accuracy on a validation set. The resulting data show that the combination of low mean hydrophobicity and relatively high net charge represents an important prerequisite for the absence of regular structure in proteins under physiologic conditions, thus leading to “natively unfolded” proteins.

Software Package 3. DisEMBL

Linding et al. (2003a) developed the NN-based method DisEMBL. The authors carefully selected a number of protein sets, including a coil and a “hot loop” set, to train the neural nets using 5-fold cross validation, while the best parameter settings were selected based on ROC curves. The optimal network architecture used a window size of 19 residues and 30 hidden units. For the coil and hot loop NN ensembles, the score distributions of positive and negative test examples were estimated using Gaussian kernel density estimation. Based on these distributions, a calibration curve for converting NN output scores to probabilities was constructed. To predict disorder for an unknown query sequence, the network output is smoothed and the resulting amino acid disorder probabilities are plotted.

Software Package 4. GLOBPLOT

The GLOBPLOT method (Linding et al., 2003b) is based on the hypothesis that the tendency for disorder can be expressed as P = RC − SS, where RC and SS are the propensities for a given amino acid to be in “random coil” and regular “secondary structure,” respectively. The RC and SS propensity values were derived by the authors from a data set containing a single representative of each superfamily in the SCOP database (version 1.59). The two types of propensities were then combined in a single “Russell/Linding” amino acid propensity set, which is able to discriminate between disorder and globular packing.

Software Package 5. DISOPRED

The DISOPRED2 method (Ward et al., 2004) exploits an SVM classifier based on a linear kernel function and compares favorably to the above methods across the range of decision thresholds. Ward et al. (2004) also noted that using homologous sequences improves disorder prediction slightly as compared to single sequence prediction, but the beneficial effect is clearly lower than that for secondary structure prediction.

Software Package 6. PDISORDER

The PDISORDER method (Softberry, Inc.) exploits a combination of machine learning techniques comprising NNs, linear discriminant functions, and an acute smoothing procedure. At the recent CASP6 prediction assessment workshop, the method scored high in terms of the correlations it yields with crystallographic B-factors, which are included as evidence for disorder.

Software Package 7. DISpro

Cheng et al. (2005) reported a state-of-the-art disorder prediction accuracy of 92.8% with a false positive rate of 5% on large cross-validated tests. Their method DISpro uses evolutionary information in the form of profiles, predicted secondary structure and relative solvent accessibility, and ensembles of 1D-recursive NNs. The method shows an improved performance over previous methods, for example using the CASP5 data set (Cheng et al., 2005).
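Returning to the charge-hydrophobicity idea behind FoldIndex (Software Package 2 above), the calculation can be sketched in a few lines. Several ingredients below are assumptions not given in this chapter: the Kyte-Doolittle hydropathy scale rescaled to the range 0 to 1, unit charges for D, E, K, and R only, the linear boundary coefficients 2.785 and 1.151 as recalled from the FoldIndex publication, and the truncation of windows at the sequence ends. It is an illustration of the principle rather than a reimplementation of the program.

```python
# Kyte-Doolittle hydropathy values (assumed scale, not given in the chapter).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed residue charges at neutral pH; histidine is treated as neutral.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def fold_index(segment):
    """Positive values suggest a folded segment, negative values an unfolded one."""
    if not segment:
        raise ValueError("segment must not be empty")
    # Mean hydrophobicity on a 0-1 scale.
    mean_h = sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in segment) / len(segment)
    # Absolute mean net charge per residue.
    mean_r = abs(sum(CHARGE.get(aa, 0.0) for aa in segment)) / len(segment)
    # Assumed boundary coefficients (see the lead-in).
    return 2.785 * mean_h - mean_r - 1.151


def windowed_fold_index(sequence, window=31):
    """Score each residue from the window centred on it (31 residues in the text).

    Windows are truncated at the sequence ends, an assumption of this sketch.
    """
    half = window // 2
    return [
        fold_index(sequence[max(0, i - half): i + half + 1])
        for i in range(len(sequence))
    ]


print(fold_index("I" * 31) > 0, fold_index("K" * 31) < 0)
```

A purely hydrophobic, uncharged stretch scores positive, whereas a highly charged stretch of low hydrophobicity scores negative, in line with the “natively unfolded” criterion described for FoldIndex.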



7.8.4 Internal Repeats Recognition

Software Package 1. REPRO

Heringa and Argos (1993) adapted the basic Waterman and Eggert algorithm to repeat situations within a single protein by demanding, in addition to top-scoring alignments being nonintersecting, that locally aligned fragments do not overlap. They introduced a graph-based iterative clustering mechanism, which takes the resulting list of top-scoring nonoverlapping local alignments for a single query sequence, declares the N-terminal matched amino acid pair in each top alignment as the start sites of a repeat pair, and then attempts to delineate associated start sites within the top alignments (i.e., find more repeats internal to the top alignment) that match the repeat type, based on alignment consistency with already clustered members of the repeat type. If such new repeats are found, the cluster procedure is iterated. The cluster consistency criterion assesses the number of established repeats that align with a putative repeat, and selects it only if three or more such top-scoring alignments can be found and if at least one of these associated alignments has already contributed one or more repeat members to the current repeat type and can therefore be trusted to be “in phase” with that repeat type. After the clustering phase, the repeats can be multiply aligned and turned into a profile, which can then be slid over the query sequence to verify the repeats already found and possibly detect new incarnations missed by the preceding algorithmic steps (Heringa and Argos, 1993). If new repeats are found, the profile can be updated and the procedure iterated. The REPRO algorithm is able to detect multiple repeat types independently, and is a sensitive but slow technique. A web server for the REPRO algorithm is available at http://ibivu.cs.vu.nl (George and Heringa, 2000).

Software Package 2. Pellegrini et al.

A quick algorithm for calculating the length and copy number of internal repeat sets has been devised by Pellegrini et al. (1999). The method uses the Waterman and Eggert algorithm and converts the scores of the selected top alignments to probabilities. An N × N path matrix, where N is the length of the protein sequence, is then filled with ones for matrix cells corresponding to local nonintersecting alignments whose probabilities pass a preset threshold, and with zeros elsewhere. Two simple summing protocols are then applied to this matrix to obtain an approximate notion of the repeat length and copy number, although the repeat boundaries are not determined. Marcotte et al. (1999) used the algorithm to derive a general census of repeats in proteins using the SWISS-PROT protein sequence database.

Software Package 3. RADAR

The method RADAR (Heger and Holm, 2000) basically follows the algorithmic steps of the REPRO method (Heringa and Argos, 1993). It calculates nonintersecting local alignments, and then uses these in an iterative procedure to determine the shortest nonreducible repeat unit and the associated boundaries. A profile is constructed from a multiple alignment of a repeat set, and slid over the query sequence to capture more repeats. The whole procedure is then iterated in an attempt to find multiple repeat types. The RADAR step to find the shortest possible repeat unit includes an iterative wraparound DP algorithm to detect the smallest repeat unit within a potentially reducible set of repeats. The RADAR method is sensitive and sufficiently fast for genomic application.

Table 7.1 A list of all prediction methods independently assessed by the EVA server and their corresponding overall scores and test set sizes, until the end of 2004. Methods whose names are in boldface have been covered in Section 7.8

[Table body largely lost in extraction. Surviving column headers: Test set; Server URL (assume “http://” at the start of each address). Surviving entries: www.compbio.dundee.ac.uk/~www-jpred/submit.html; PROF king, www.aber.ac.uk/~phiwww/prof/]

Software Package 4. REP

Andrade et al. (2000) produced a supervised repeats detection method REP, which searches the query sequence using a number of profiles, each profile containing the information of a multiple alignment of a known repeats family. The user can scan the query sequence for the following repeat types: Ankyrin, Armadillo, HAT, HEAT, HEAT AAA, HEAT ADB, HEAT IMB, Kelch, Leucine-Rich Repeats, PFTA, PFTB, RCC1, TPR, and WD40.

Table 7.2 A list of methods for predicting coiled-coil and WD-repeat protein regions from sequence

[Table body largely lost in extraction. Surviving column header: Server URL (assume “http://” at the start of each address). Surviving entries: Lupas et al., 1991; The same URL]

Table 7.3 A list of methods for predicting disordered protein regions from sequence

[Table body largely lost in extraction. Surviving column header: Server URL (assume “http://” at the start of each address). Surviving entries, in source order:
Priluski et al., 2005
www.ics.uci.edu/~baldig/; Neural net; Cheng et al., 2005
Neural net; Linding et al., 2003a
Amino acid; Linding et al., 2003b
Ward et al., 2004
Neural net; Obradovic et al., 2003
Neural network; Acute smoothing procedure]

Table 7.4 List of methods for internal repeats recognition


[Table body largely lost in extraction. Surviving column header: Server URL (assume “http://” at the start of each address). Surviving entries, in source order:
Waterman and; Pellegrini et al., 1999; et al.
Local alignment; Heger and Holm, 2000
www.embl-heidelberg.de/~andrade/; Checking known repeat types; Andrade et al., 2000
Local alignment, graph clustering; George and Heringa,
Szklarczyk and Heringa, 2004]





Software Package 5. TRUST

Szklarczyk and Heringa (2004) developed the method TRUST for detecting internal repeats in proteins based on the transitivity of repeats. The authors reported increased sensitivity and accuracy, achieved by exploiting the concept of transitivity of alignments, which relies on mutual reinforcement (or attenuation) of repeat signals and can thus be used as a noise filter. Starting from local suboptimal alignments, the application of transitivity allows (1) identification of distant repeat homologues for which no alignments were found; (2) gaining confidence about consistently well-aligned regions; and (3) reducing the contribution of nonhomologous repeats. The resulting increase in consistency generally leads to a virtually noise-free profile representing a generalized repeat with high fidelity. The TRUST method also employs a rigid statistical test for self-sequence and profile-sequence alignments.
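Point (1) above can be illustrated with a toy calculation: if position i is aligned with position j in one local alignment and j with k in another, transitivity suggests that i and k are related even when no alignment pairs them directly. The sketch below performs a single such inference step on residue-position pairs. It is a simplified illustration under stated assumptions, not the TRUST algorithm: the function name and the pair-based input are hypothetical, and the score reinforcement, attenuation, and statistical testing that TRUST applies are omitted.

```python
from collections import defaultdict


def transitive_pairs(aligned_pairs):
    """Infer position pairs (i, k) implied by a shared partner j.

    aligned_pairs holds (i, j) residue-position pairs taken from local
    self-alignments. Only pairs not already observed are returned, each
    ordered so that i < k.
    """
    partners = defaultdict(set)
    for i, j in aligned_pairs:
        partners[i].add(j)
        partners[j].add(i)

    observed = {frozenset(pair) for pair in aligned_pairs}
    inferred = set()
    for linked in partners.values():
        # Any two positions aligned to the same partner are candidates.
        for i in linked:
            for k in linked:
                if i < k and frozenset((i, k)) not in observed:
                    inferred.add((i, k))
    return inferred


# Positions 1-10 and 10-20 are aligned, so 1-20 is inferred.
print(transitive_pairs([(1, 10), (10, 20)]))
```

In a fuller treatment, each inferred pair would carry a score derived from the alignments that support it, so that consistent signals reinforce one another and inconsistent ones are attenuated.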

7.9 Resources

This section contains useful resources available at the time this chapter was written for online software applications and other useful material.

7.9.1 Secondary Structure Prediction

7.9.2 Supersecondary Structure Prediction

7.10 Summary

This chapter presents an overview of issues in predicting local structural features of proteins. The inherent hierarchical order of protein structure is discussed in a bottom-up fashion, from secondary structure via supersecondary structure to prediction aspects of local three-dimensional structure, the latter including protein disordered region detection and internal repeats recognition. Some approaches to use these structural features in multiple sequence alignment are also discussed. State-of-the-art prediction methods are described and the addresses of their web interfaces, if available, are provided.


