Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Computational Methods for Protein Structure Prediction & Modeling V1 - Xu Xu and Liang

.pdf
Скачиваний:
61
Добавлен:
10.08.2013
Размер:
10.5 Mб
Скачать

7. Local Structure Prediction of Proteins

247

Garnier, J., Gibrat, J.F., and Robson, B. 1996. GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol 266:540– 553.

Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97–120.

George, R.A., and Heringa, J. 2000. The REPRO server: Finding protein internal sequence repeats through the Web. Trends Biochem. Sci. 25:515–517.

Gibrat, J.F., Garnier, J., and Robson, B. 1987. Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J. Mol. Biol. 198:425–443.

Ginalski, K., Pas, J., Wyrwicz, L.S., von Grotthuss, M., Bujnicki, J.M., and Rychlewski, L. 2003. ORFeus: Detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res. 31:3804–3807.

Ginalski, K., von Grotthuss, M., Grishin, N.V., and Rychlewski, L. 2004. Detecting distant homology with Meta-BASIC. Nucleic Acids Res. 32:W576–581.

Guermeur, Y., Geourjon, C., Gallinari, P., and Deleage, G. 1999. Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics 15:413–421.

Guo, J., Chen, H., Sun, Z., and Lin, Y. 2004. A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 54:738–743.

Hedman, M., Deloof, H., Von Heijne, G., and Elofsson, A. 2002. Improved detection of homologous membrane proteins by inclusion of information from topology predictions. Protein Sci. 11:652–658.

Heger, A., and Holm, L. 2000. Rapid automatic detection and alignment of repeats in protein sequences. Proteins 41:224–237.

Heringa, J. 1994. The evolution and recognition of protein sequence repeats. Comput. Chem. 18:233–243.

Heringa, J. 1998. Detection of internal repeats: How common are they? Curr. Opin. Struct. Biol. 8:338–345.

Heringa, J. 1999. Two strategies for sequence comparison: Profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem. 23:341–364.

Heringa, J. 2000. Computational methods for protein secondary structure prediction using multiple sequence alignments. Curr. Protein Pept. Sci. 1:273–301.

Heringa, J. 2002. Local weighting schemes for protein multiple sequence alignment.

Comput. Chem. 26:459–477.

Heringa, J., and Argos, P. 1993. A method to recognize distant repeats in protein sequences. Proteins 17:391–341.

Hu, H.J., Pan, Y., Harrison, R., and Tai, P.C. 2004. Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans. Nanobiosci. 3:265–271.

Hu, W.P., Kolinski, A., and Skolnick, J. 1997. Improved method for prediction of protein backbone U-turn positions and major secondary structural elements between U-turns. Proteins 29:443–460.

248

V.A. Simossis and J. Heringa

Hua, S., and Sun, Z. 2001. A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. J. Mol. Biol. 308:397–407.

Huang, C.H., Lin, Y.S., Yang, Y.L., Huang, S.W., and Chen, C.W. 1998. The telomeres of Streptomyces chromosomes contain conserved palindromic sequences with potential to form complex secondary structures. Mol. Microbiol. 28:905–916.

Huang, X.Q., Hardison, R.C., and Miller, W. 1990. A space-efficient algorithm for local similarities. Comput. Appl. Biosci. 6:373–381.

Hughey, R., and Krogh, A. 1996. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput. Appl. Biosci. 12:95–107.

Hutchinson, E.G., and Thornton, J.M. 1993. The Greek key motif: Extraction, classification and analysis. Protein Eng. 6:233–245.

Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195–202.

Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L., and Hughey, R. 1999. Predicting protein structure using only sequence information. Proteins Suppl. 3:121–125.

Karplus, K., Barrett, C., and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846–856.

Karplus, K., Karchin, R., Barrett, C., Tu, S., Cline, M., Diekhans, M., Grate, L., Casper, J., and Hughey, R. 2001. What is the value added by human intervention in protein structure prediction? Proteins Suppl. 5:86–91.

Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y., Diekhans, M., and Hughey, R. 2003. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins 53 (Suppl.6):491– 496.

Karplus, K., Karchin, R., Hughey, R., Draper, J., Mandel-Gutfreund, Y., Casper, J., and Diekhans, M. 2002. SAM-T02: Protein structure prediction with neural nets, hidden Markov models, and fragment packing. CASP 5.

Kim, D.E., Chivian, D., and Baker, D. 2004. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32:W526–531.

Kim, H., and Park, H. 2003. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng. 16:553–560.

King, R.D., Ouali, M., Strong, A.T., Aly, A., Elmaghraby, A., Kantardzic, M., and Page, D. 2000. Is it better to combine predictions? Protein Eng. 13:15–19.

Kirshenbaum, K., Young, M., and Highsmith, S. 1999. Predicting allosteric switches in myosins. Protein Sci. 8:1806–1815.

Kleinjung, J., Romein, J., Lin, K., and Heringa, J. 2004. Contact-based sequence alignment. Nucleic Acids Res. 32:2464–2473.

Koh, I.Y., Eyrich, V.A., Marti-Renom, M.A., Przybylski, D., Madhusudhan, M.S., Eswar, N., Grana, O., Pazos, F., Valencia, A., Sali, A., and Rost, B. 2003. EVA: Evaluation of protein structure prediction servers. Nucleic Acids Res. 31:3311– 3315.

Kolaskar, A.S., and Kulkarni-Kale, U. 1992. Sequence alignment approach to pick up conformationally similar protein fragments. J. Mol. Biol. 223:1053–1061.

7. Local Structure Prediction of Proteins

249

Kolinski, A., Skolnick, J., Godzik, A., and Hu, W.P. 1997. A method for the prediction of surface “U”-turns and transglobular connections in small proteins. Proteins 27:290–308.

Krogh, A., Brown, M., Mian, I.S., Sjolander, K., Haussler, D. 1994. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235:1501–1531.

Kuhn, M., Meiler, J., and Baker, D. 2004. Strand–loop–strand motifs: Prediction of hairpins and diverging turns in proteins. Proteins 54:282–288.

Kurtz, S., and Schleiermacher, C. 1999. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics 15:426–427.

Langosch, D., and Heringa, J. 1998. Interaction of transmembrane helices by a knobs- into-holes packing characteristic of soluble coiled coils, Proteins: Struct. Func. and Gen. 31:150–159.

Lim, V.I. 1974. Structural principles of the globular organization of protein chains. A stereochemical theory of globular protein secondary structure. J. Mol. Biol. 88:857–872.

Lin, K., Simossis, V.A., Taylor, W.R., and Heringa, J. 2005. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21:152–159.

Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., and Russell, R.B. 2003a. Protein disorder prediction: Implications for structural proteomics. Structure 11:1453–1459.

Linding, R., Russell, R.B., Neduva, V., and Gibson, T.J. 2003b. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res. 31:3701– 3708.

Luisi, D.L., Wu, W.J., and Raleigh, D.P. 1999. Conformational analysis of a set of peptides corresponding to the entire primary sequence of the N-terminal domain of the ribosomal protein L9: Evidence for stable native-like secondary structure in the unfolded state. J. Mol. Biol. 287:395–407.

Lupas, A. 1996. Prediction and analysis of coiled-coil structures. Methods Enzymol 266:513–525.

Lupas, A., Van Dyke, M., and Stock, J. 1991. Predicting coiled coils from protein sequences, Science 252:1162–1164.

Luthy, R., Xenarios, I., and Bucher, P. 1994. Improving the sensitivity of the sequence profile method. Protein Sci. 3:139–146.

Macdonald, J.R., and Johnson, W.C., Jr. 2001. Environmental features are important in determining protein secondary structure. Protein Sci. 10:1172–1177.

Marcotte, E.M., Pellegrini, M., Yeates, T.O., and Eisenberg, D. 1999. A census of protein repeats. J. Mol. Biol. 293:151–160.

McGuffin, L.J., and Jones, D.T. 2003. Benchmarking secondary structure prediction for fold recognition. Proteins 52:166–175.

McLachlan, A.D. 1972. Repeating sequences and gene duplication in proteins. J. Mol. Biol. 64:417–437.

McLachlan, A.D. 1977. Analysis of periodic patterns in amino acid sequences: Collagen. Biopolymers 16:1271–1297.

250 V.A. Simossis and J. Heringa

McLachlan, A.D. 1979. Gene duplications in the structural evolution of chymotrypsin. J. Mol. Biol. 128:49–79.

McLachlan, A.D. 1983. Analysis of gene duplication repeats in the myosin rod. J. Mol. Biol. 169:15–30.

McLachlan, A.D., and Stewart, M. 1976. The 14-fold periodicity in alphatropomyosin and the interaction with actin. J. Mol. Biol. 103:271–298.

Mehta, P.K., Heringa, J., and Argos, P. 1995. A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%. Protein Sci. 4:2517–2525.

Meiler, J., and Baker, D. 2003. Coupled prediction of protein secondary and tertiary structure. Proc. Natl. Acad. Sci. USA 100:12105–12110.

Metropolis, N., and Ulam, S. 1949. The Monte Carlo method. J. Am. Stat. Assoc. 44:335–341.

Minor, D.L., Jr., and Kim, P.S. 1996. Context-dependent secondary structure formation of a designed protein sequence. Nature 380:730–734.

Minsky, M.L., and Papert, S. 1988. Perceptrons: An Introduction to Computational Geometry. Cambridge, Mass., MIT Press.

Mittelman, D., Sadreyev, R., and Grishin, N. 2003. Probabilistic scoring measures for profile–profile comparison yield more accurate short seed alignments. Bioinformatics 19:1531–1539.

Nagano, K. 1973. Logical analysis of the mechanism of protein folding. I. Predictions of helices, loops and beta-structures from primary structure. J. Mol. Biol. 75:401–420.

Needleman, S.B., and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443– 453.

Noble, W.S. 2004. Support vector machine applications in computational biology. In Kernel Methods in Computational Biology (J.-p. Vert, B. Schoelkopf, and K. Tsuda, Eds.). Cambridge, Mass., MIT Press, pp. 71–92.

Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C.J., and Dunker, A.K. 2003. Predicting intrinsic disorder from amino acid sequence. Proteins 53 (Suppl. 6):566–572.

Ohlson, T., Wallner, B., and Elofsson, A. 2004. Profile–profile methods provide improved fold-recognition: A study of different profile–profile alignment methods. Proteins 57:188–197.

Ouali, M., and King, R.D. 2000. Cascaded multiple classifiers for secondary structure prediction. Protein Sci. 9:1162–1176.

Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C. 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284:1201–1210.

Pellegrini, M., Marcotte, E.M., and Yeates, T.O. 1999. A fast algorithm for genomewide analysis of proteins with repeated sequences. Proteins 35:440–446.

Petersen, T.N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S., Gippert, G.P., and Lund, O. 2000. Prediction of protein secondary structure at 80% accuracy. Proteins 41:17–20.

7. Local Structure Prediction of Proteins

251

Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. 2002. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 47:228–235.

Prilusky, J., Felder, C.E., Zeev-Ben-Mordehai, T., Rydberg, E., Man, O., Beckmann, J.S., Silman, I., and Sussman, J.L. 2005. FoldIndex: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21:3435–3438.

Przybylski, D., and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins 46:197–205.

Ptitsyn, O.B. 1994. Kinetic and equilibrium intermediates in protein folding. Protein Eng 7:593–596.

Raghava, G.P.S. 2000. Protein secondary structure prediction using nearest neighbor and neural network approach. CASP 4, 75-76.

Raghava, G.P.S. 2002a. APSSP2: A combination method for protein secondary structure prediction based on neural network and example based learning. CASP 5. URL: http://www.imtech.res.in/raghava/apssp2/

Raghava, G.P.S. 2002b. APSSP: Automatic method for protein secondary structure prediction. CASP 5. URL: http://www.imtech.res.in/raghava/apssp2/

Ramirez-Alvarado, M., Serrano, L., and Blanco, F.J. 1997. Conformational analysis of peptides corresponding to all the secondary structure elements of protein L B1 domain: Secondary structure propensities are not conserved in proteins with the same fold. Protein Sci. 6:162–174.

Rao, S.T., and Rossmann, M.G. 1973. Comparison of super secondary structures in proteins. J. Mol. Biol. 76:241–256.

Reymond, M.T., Merutka, G., Dyson, H.J., and Wright, P.E. 1997. Folding propensities of peptide fragments of myoglobin. Protein Sci. 6:706–716.

Romero, P., Obradovic, Z., Li, X., Garner, E.C., Brown, C.J., and Dunker, A.K. 2001. Sequence complexity of disordered protein. Proteins 42:38–48.

Rose, G.D. 1979. Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134:447–470.

Rost, B., and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584–599.

Rost, B., Sander, C., and Schneider, R. 1994. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235:13–26.

Rost, B., Schneider, R., and Sander, C. 1997. Protein fold recognition by predictionbased threading. J. Mol. Biol. 270:471–480.

Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9:232–241.

Salem, G.M., Hutchinson, E.G., Orengo, C.A., and Thornton, J.M. 1999. Correlation of observed fold frequency with the occurrence of local structural motifs. J. Mol. Biol. 287:969–981.

Sander, C., and Schneider, R. 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9: 56–68.

252

V.A. Simossis and J. Heringa

Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29:2994–3005.

Schiffer, M., and Edmundson, A.B. 1967. Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys. J. 7:121–135.

Schoelkopf, B., Tsuda, K., and Vert, J.-P.(Eds.). 2004. Kernel Methods in Computational Biology. Cambridge, Mass., MIT Press.

Schulz, G.E. 1988. A critical evaluation of methods for prediction of protein secondary structures. Annu. Rev. Biophys. Biophys. Chem. 17:1–21.

Selbig, J., Mevissen, T., and Lengauer, T. 1999. Decision tree-based formation of consensus protein secondary structure prediction. Bioinformatics 15:1039–1046.

Simossis, V.A., and Heringa, J. 2004a. The influence of gapped positions in multiple sequence alignments on secondary structure prediction methods. Comput. Biol. Chem. 28(5-6:351–366.

Simossis, V.A., and Heringa, J. 2004b. Integrating protein secondary structure prediction and multiple sequence alignment. Curr. Protein Pept. Sci. 5:249–266.

Simossis, V.A., and Heringa, J. 2005. SYMPRED consensus secondary structure prediction. http://ibi.vu.nl/programs/sympredwww/

Smit, A., Hubley, R., and Green, P. 2004. RepeatMasker open-3.0. 1996–2004. http://www.repeatmasker.org.

Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol.. Biol. 147:195–197.

Soding, J. 2005. Protein homology detection by HMM–HMM comparison. Bioinformatics 21:951–960.

Stultz, C.M., White, J.V., and Smith, T.F. 1993. Structural analysis based on statespace modeling.Protein Sci. 2:305–314.

Sun, Z., Rao, X., Peng, L., and Xu, D. 1997. Prediction of protein supersecondary structures based on the artificial neural network method. Protein Eng. 10:763– 769.

Szklarczyk, R., and Heringa, J. 2004. Tracking repeats using significance and transitivity. Bioinformatics 20 (Suppl. 1):I311–I317.

Taylor, W.R., Heringa, J., Baud, F., and Flores, T.P. 2002. A Fourier analysis of symmetry in protein structure. Protein Eng. 15:79–89.

Teodorescu, O., Galor, T., Pillardy, J., and Elber, R. 2004. Enriching the sequence substitution matrix by structural information. Proteins 54:41–48.

Tomii, K., and Akiyama, Y. 2004. FORTE: A profile–profile comparison tool for protein fold recognition. Bioinformatics 20:594–595.

Unger, R., and Moult, J. 1993. Finding the lowest free energy conformation of a protein is an NP-hard problem: Proof and implications. Bull. Math. Biol. 55:1183–1198.

Uversky, V.N., Gillespie, J.R., and Fink, A.L. 2000. Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins 41:415–427.

7. Local Structure Prediction of Proteins

253

van Belkum, A., Scherer, S., van Alphen, L., and Verbrugh, H. 1998. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 62:275–293.

Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, Springer. Vapnik, V.N. 1998. Statistical Learning Theory. New York, Wiley.

Vihinen, M., Torkkila, E., and Riikonen, P. 1994. Accuracy of protein flexibility predictions. Proteins 19:141–149.

von Ohsen, N., Sommer, I., and Zimmer, R. 2003. Profile–profile alignment: A powerful tool for protein structure prediction. Pac. Symp. Biocomput. 252–263. von Ohsen, N., Sommer, I., Zimmer, R., and Lengauer, T. 2004. Arby: Automatic protein structure prediction using profile–profile alignment and confidence mea-

sures. Bioinformatics 20:2228–2235.

Vucetic, S., Brown, C.J., Dunker, A.K., and Obradovic, Z. 2003. Flavors of protein disorder. Proteins 52:573–584.

Wang, G., and Dunbrack, R.L., Jr. 2004. Scoring profile-to-profile sequence alignments. Protein Sci. 13:1612–1626.

Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F., and Jones, D.T. 2004. The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138– 2139.

Ward, J.J., McGuffin, L.J., Buxton, B.F., and Jones, D.T. 2003. Secondary structure prediction with support vector machines. Bioinformatics 19:1650–1655.

Waterman, M.S., and Eggert, M. 1987. A new algorithm for best subsequence alignments with application to tRNA–rRNA comparisons. J. Mol. Biol. 197: 723–728.

Wetlaufer, D.B. 1973. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl. Acad. Sci. USA 70:697–701.

White, J.V., Stultz, C.M., and Smith, T.F. 1994. Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:35– 75.

Wootton, J.C., and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266:554–571.

Wright, P.E., and Dyson, H.J. 1999. Intrinsically unstructured proteins: Re-assessing the protein structure–function paradigm. J. Mol. Biol. 293:321–331.

Xie, Q., Arnold, G.E., Romero, P., Obradovic, Z., Garner, E., and Dunker, A.K. 1998. The sequence attribute method for determining relationships between sequence and protein disorder. Genome Inform. Ser. Workshop Genome Inform.

9:193–200.

Yona, G., and Levitt, M. 2002. Within the twilight zone: A sensitive profile– profile comparison tool based on information theory. J. Mol. Biol. 315:1257– 1275.

Young, M., Kirshenbaum, K., Dill, K.A., and Highsmith, S. 1999. Predicting conformational switches in proteins. Protein Sci. 8:1752–1764.

Yu, L., White, J.V., and Smith, T.F. 1998. A homology identification method that combines protein sequence and structure information. Protein Sci. 7:2499– 2510.

254

V.A. Simossis and J. Heringa

Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. 1999. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34:220–223.

Zvelebil, M.J., Barton, G.J., Taylor, W.R., and Sternberg, M.J. 1987. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. 195:957–961.

Further Reading

Jones, N.C., and Pevzner P.A. 2004. An Introduction to Bioinformatics Algorithms.

Cambridge, MA, MIT Press.

Konopka, A.K., and Crabbe, M.J.C. (Eds.). 2004. Compact Handbook of Computational Biology. New York, Dekker.

8 Protein Contact Map Prediction

Xin Yuan and Christopher Bystroff

8.1 Introduction

Proteins are linear chains that fold into characteristic shapes and features. To understand proteins and protein folding, we try to represent the protein molecule in such a way that its features are easy to see and manipulate. A simple representation facilitates algorithm design for structure prediction. The simplicity of the threestate character string representation of secondary structure is part of the reason for secondary structure prediction receiving so much attention early in the era of computational biology. One-dimensional strings are easily understood, parsed, mined, and manipulated. But secondary structure alone does not tell us enough about the overall shapes and features of a protein. We need a simple way to represent the overall tertiary structure of a protein.

Here we explore a two-dimensional Boolean matrix representation of protein structure, where each dimension is the residue number and each value is true if the residues are spatial neighbors and false otherwise—called a “contact map.” A contact map is the simplest representation of a protein that can be faithfully projected back into three dimensions. As such it has received increased attention in recent years from bioinformaticists, who see this as a data structure that is readily amenable to data mining and machine learning.

The first goal of this chapter is to introduce the contact map data structure, how it is calculated from the three-dimensional structure and how it is transformed from two dimensions into three. Then we will explore a series of computational methods that have attempted to predict contact maps directly from the primary sequence, with or without the help of template structures from the protein database. Next, we will discuss the various ways that contact map predictions may be evaluated for accuracy. Finally, we will present some of the other ways contact maps have proven useful.

8.2 Definition of Interresidue Contacts and Contact Maps

Interresidue contacts have been defined in various ways. Routinely, a contact is said to exist when a certain distance is below a threshold. The distance may be that between C atoms (Vendruscolo et al., 1997), between the C atoms (Lund et al., 1997; Thomas et al., 1996; Olmea and Valencia, 1997; Fariselli et al., 2001a,b), or it may be the minimum distance between any pair of atoms belonging to the side

255

256

Xin Yuan and Christopher Bystroff

chain or to the backbone of the two residues (Fariselli and Casadio, 1999; Mirny and Domany, 1996). The following definitions of the interresidue distance Di j have appeared at some time in the literature, each associated with a threshold distance:

1.The distance between alpha carbons (CA)

2.The distance between beta carbons (CB), using alpha carbon for glycine

3.The minimum distance between the van der Waals (vdW) spheres of heavy atoms (HS)

4.The minimum distance between the vdW spheres of backbone heavy atoms (BB)

5.The minimum distance between the vdW spheres of side-chain heavy atoms or alpha carbons (SC)

By one informed account, the best definition for interresidue distance is SC, the minimum distance between side-chain or alpha-carbon atoms (Berrera et al., 2003). Using

˚

a cutoff distance of around 1.0 A between vdW spheres, SC-based contact maps efficiently recognized homologue sequences in a “threading” experiment, where a query sequence is assigned an energy score for every possible alignment of the sequence to a set of template structures. In a threading experiment, the definition of a contact determines which residues in the sequence are used to sum the energy. Using the minimum distance between residues and using a short distance cutoff makes the definition of a contact more energetically realistic, and this makes the sum of amino acid contact potentials a better approximation of the true energy. Contact potentials are energy functions that measure the pairwise side-chain-dependent free energy of residue–residue contacts, irrespective of the side-chain conformations. Contact potentials may be derived from contact maps statistics, as described in a later section.

Simpler, backbone atom-based definitions (CA or CB) with longer distance cutoffs are more readily projected into three dimensions, since the atomic positions used to calculate the distance depend only on the backbone angles. CB is slightly more meaningful than CA in an energetic sense, since side chains that point toward each other, and therefore have a shorter CB distance than CA distance, are more likely to make an actual physical contact. Using the minimum distance measures (HS, BB, or SC) can make projection into three dimensions more difficult because we have not saved the precise identity of the contacting atoms in the Boolean matrix.

˚

CB distances with a cutoff of 8 A were chosen for use in the Critical Assessment of Structure Prediction (CASP) experiments (Moult et al., 2003), and this is the definition that we will discuss here, unless otherwise specified.

Having defined what we mean by the distance Di j , the definition of a contact map is a straightforward distance threshold. For a protein of N amino acids the contact map is an N × N matrix C whose elements are given, for all i, j = 1, . . . , N , by

1

if Dij < Dcutoff

 

Ci j = 0

otherwise

(8.1)

The set of all Di j is commonly referred to as the “distance matrix.” Therefore, we can think of Ci j as a thresholded distance matrix. The mean difference between two