Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.) Computational biochemistry and biophysic.pdf

Dunbrack

predictive distribution and can be achieved by making draws from the posterior distribution and then, for each drawn value of θ, making a draw from the likelihood function, i.e.,

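$$
p(\tilde{y} \mid y) = \int_\Theta p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta
= \frac{\int_\Theta p(\tilde{y} \mid \theta)\, p(y \mid \theta)\, p(\theta)\, d\theta}{\int_\Theta p(y \mid \theta)\, p(\theta)\, d\theta}
\tag{56}
$$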

This distribution resembles the data closely for rotamer (3, 3, 3) but also forms a very reasonable distribution when there are only seven data points (3, 3, 1). A good posterior predictive distribution for any protein structural feature can be used in simulations of protein folding or structure prediction.
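The two-step sampling scheme behind Eq. (56) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model used for the rotamer data in this chapter: it swaps in a conjugate Dirichlet-multinomial model so that the posterior can be sampled directly, and the counts, the flat prior, and the function names `draw_dirichlet` and `posterior_predictive_draws` are all hypothetical.

```python
import random


def draw_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def posterior_predictive_draws(counts, prior_alpha, n_new, n_draws, seed=0):
    """Simulate replicated count vectors y~ given observed counts y.

    Step 1: theta ~ p(theta | y), here Dirichlet(prior_alpha + counts).
    Step 2: y~    ~ p(y~ | theta), here Multinomial(n_new, theta).
    """
    rng = random.Random(seed)
    post_alpha = [a + c for a, c in zip(prior_alpha, counts)]
    categories = list(range(len(counts)))
    draws = []
    for _ in range(n_draws):
        theta = draw_dirichlet(post_alpha, rng)                   # posterior draw
        sample = rng.choices(categories, weights=theta, k=n_new)  # likelihood draw
        draws.append([sample.count(k) for k in categories])
    return draws


# Hypothetical data: seven observations spread over three states.
counts = [4, 2, 1]
prior_alpha = [1.0, 1.0, 1.0]  # assumed flat Dirichlet prior
draws = posterior_predictive_draws(counts, prior_alpha, n_new=7, n_draws=5000)

# Average predicted frequency of each state across all replicated data sets.
mean_freq = [sum(d[k] for d in draws) / (7 * len(draws)) for k in range(3)]
print(mean_freq)
```

Averaged over many draws, the predicted frequencies should approach the posterior mean of θ, which in this toy setup is (α_k + n_k) / Σ_j (α_j + n_j). With only seven observations the prior still visibly smooths the prediction, which is the behavior described above for sparsely populated rotamers.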

V. CONCLUSION

The field of statistics arose in the eighteenth and nineteenth centuries because of the need to develop good public policy based on demographic and economic data. Applications in the natural sciences were immediate, but generally natural scientists have lagged behind in their knowledge of modern statistics compared to social scientists. This is unfortunate, because many algorithms and methodologies have been developed in the last 20 years or so that make feasible sophisticated analysis of very complex data sets. Bayesian statistics has been used fruitfully in molecular and structural biology in recent years but has enjoyed more applications in genetics and clinical research and in the social sciences. Bayesian methods are particularly useful in modeling complex data, where the distribution of information may be uneven or hierarchical. This is true not only of the sequence and structure databases described in this chapter but also of more recently developed experimental methods such as DNA microarrays for analyzing mRNA expression levels over many thousands of genes [100–106]. The computational challenges for this kind of data are immense [107,108]. Particularly now, when the influx of data in biology is overwhelming, Bayesian statistical analysis promises to be an important tool.

ACKNOWLEDGMENTS

I thank Prof. Marc Sobel of Temple University for many useful discussions on Bayesian statistics. This work was funded in part by an appropriation from the Commonwealth of Pennsylvania and NIH Grant CA06927.

REFERENCES

1. RL Dunbrack Jr. Culling the PDB by resolution and sequence identity. 1999. http://www.fccc.edu/research/labs/dunbrack/culledpdb.html

2. CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, JM Thornton. CATH—A hierarchic classification of protein domain structures. Structure 5:1093–1108, 1997.

3. L Holm, C Sander. Touring protein fold space with Dali/FSSP. Nucleic Acids Res 26:316–319, 1998.

Bayesian Statistics

4. TJ Hubbard, B Ailey, SE Brenner, AG Murzin, C Chothia. SCOP: A structural classification of proteins database. Nucleic Acids Res 27:254–256, 1999.

5. JS Shoemaker, IS Painter, BS Weir. Bayesian statistics in genetics: A guide for the uninitiated. Trends Genet 15:354–358, 1999.

6. S Greenland. Probability logic and probability induction. Epidemiology 9:322–332, 1998.

7. GM Petersen, G Parmigiani, D Thomas. Missense mutations in disease genes: A Bayesian approach to evaluate causality. Am J Hum Genet 62:1516–1524, 1998.

8. DA Berry, DK Stangl, eds. Bayesian Biostatistics. New York: Marcel Dekker, 1996.

9. G D’Agostini. Bayesian reasoning in high energy physics: Principles and applications. CERN Lectures, 1998.

10. TJ Loredo. In: PF Fougère, ed. From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics. Dordrecht, The Netherlands: Kluwer, 1990, pp 81–142.

11. TJ Loredo. In: ED Feigelson, GJ Babu, eds. The Promise of Bayesian Inference for Astrophysics. New York: Springer-Verlag, 1992, pp 275–297.

12. E Parent, P Hubert, B Bobée, J Miquel, eds. Statistical and Bayesian Methods in Hydrological Sciences. Paris: UNESCO Press, 1998.

13. CE Buck, WG Cavanaugh, CD Litton. The Bayesian Approach to Interpreting Archaeological Data. New York: Wiley, 1996.

14. A Zellner. An Introduction to Bayesian Inference in Econometrics. New York: Wiley, 1971.

15. J Zhu, JS Liu, CE Lawrence. Bayesian adaptive sequence alignment algorithms. Bioinformatics 14:25–39, 1998.

16. K Sjölander, K Karplus, M Brown, R Hughey, A Krogh, IS Mian, D Haussler. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 12:327–345, 1996.

17. K Karplus, K Sjölander, C Barrett, M Cline, D Haussler, R Hughey, L Holm, C Sander. Predicting protein structure using hidden Markov models. Proteins Suppl: 134–139, 1997.

18. RH Lathrop, TF Smith. Global optimum protein threading with gapped alignment and empirical pair score functions. J Mol Biol 255:641–665, 1996.

19. RH Lathrop, JR Rogers Jr, TF Smith, JV White. A Bayes-optimal sequence–structure theory that unifies protein sequence–structure recognition and alignment. Bull Math Biol 60:1039–1071, 1998.

20. RA Chylla, JL Markley. Improved frequency resolution in multidimensional constant-time experiments by multidimensional Bayesian analysis. J Biomol NMR 3:515–533, 1993.

21. DA d’Avignon, GL Bretthorst, ME Holtzer, A Holtzer. Thermodynamics and kinetics of a folded–folded transition at valine-9 of a GCN4-like leucine zipper. Biophys J 76:2752–2759, 1999.

22. JA Lukin, AP Gove, SN Talukdar, C Ho. Automated probabilistic method for assigning backbone resonances of (13C,15N)-labeled proteins. J Biomol NMR 9:151–166, 1997.

23. MT McMahon, E Oldfield. Determination of order parameters and correlation times in proteins: A comparison between Bayesian, Monte Carlo and simple graphical methods. J Biomol NMR 13:133–137, 1999.

24. MF Ochs, RS Stoyanova, F Arias-Mendoza, TR Brown. A new method for spectral decomposition using a bilinear Bayesian approach. J Magn Reson 137:161–176, 1999.

25. TO Yeates. The asymmetric regions of rotation functions between Patterson functions of arbitrarily high symmetry. Acta Crystallogr A 49:138–141, 1993.

26. S Doublie, S Xiang, CJ Gilmore, G Bricogne, CW Carter Jr. Overcoming non-isomorphism by phase permutation and likelihood scoring: Solution of the TrpRS crystal structure. Acta Crystallogr A 50:164–182, 1994.

27. CW Carter Jr. Entropy, likelihood and phase determination. Structure 3:147–150, 1995.

28. RL Dunbrack Jr, FE Cohen. Bayesian statistical analysis of protein sidechain rotamer preferences. Protein Sci 6:1661–1681, 1997.

29. P Baldi, S Brunak. Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press, 1998.

30. ET Jaynes. Probability Theory: The Logic of Science. http://bayes.wustl.edu/etj/prob.html. 1999.

31. M Gardner. The Second Scientific American Book of Mathematical Puzzles and Diversions. New York: Simon and Schuster, 1961.

32. T Bayes. An essay towards solving a problem in the doctrine of chances. Phil Trans Roy Soc Lond 53:370, 1763.

33. PS Laplace. Théorie Analytique des Probabilités. Paris: Courcier, 1812.

34. TM Porter. The Rise of Statistical Thinking. Princeton, NJ: Princeton Univ Press, 1988.

35. JO Berger, M Delampady. Testing precise hypotheses. Stat Sci 2:317–352, 1987.

36. TS Kuhn. Structure of Scientific Revolutions. Chicago: Univ Chicago Press, 1974.

37. DV Lindley. The 1988 Wald Memorial Lecture: The present position of Bayesian statistics. Stat Sci 5:44–89, 1990.

38. H Jeffreys. Theory of Probability. Oxford: Clarendon Press, 1939.

39. LJ Savage. The Foundations of Statistics. New York: Wiley, 1954.

40. WR Gilks, S Richardson, DJ Spiegelhalter, eds. Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996.

41. IJ Good. The Bayes/non-Bayes compromise: A brief review. J Am Stat Assoc 87:597–606, 1992.

42. J Cornfield. In: DL Meyer, RO Collier, eds. The Frequency Theory of Probability, Bayes’ Theorem, and Sequential Clinical Trials. Bloomington, IN: Phi Delta Kappa, 1970, pp 1–28.

43. M Bower, FE Cohen, RL Dunbrack Jr. Prediction of protein sidechain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J Mol Biol 267:1268–1282, 1997.

44. A Gelman, JB Carlin, HS Stern, DB Rubin. Bayesian Data Analysis. London: Chapman & Hall, 1995.

45. N Metropolis, S Ulam. The Monte Carlo method. J Am Stat Assoc 44:335–341, 1949.

46. N Metropolis, AW Rosenbluth, MN Rosenbluth, AH Teller, E Teller. Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092, 1953.

47. WK Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109, 1970.

48. CP Robert. In: WR Gilks, S Richardson, DJ Spiegelhalter, eds. Mixtures of Distributions: Inference and estimation. London: Chapman & Hall, 1996, pp 441–464.

49. M Gribskov, AD McLachlan, D Eisenberg. Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358, 1987.

50. JU Bowie, ND Clarke, CO Pabo, RT Sauer. Identification of protein folds: Matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins Struct Funct Genet 7:257–264, 1990.

51. M Brown, R Hughey, A Krogh, IS Mian, K Sjölander, D Haussler. Using Dirichlet mixture priors to derive hidden Markov models for protein families. Intelligent Systems in Molecular Biology 1:47–55, 1993.

52. K Karplus. Evaluating regularizers for estimating distributions of amino acids. Intelligent Systems in Molecular Biology 3:188–196, 1995.

53. RL Tatusov, EV Koonin, DJ Lipman. A genomic perspective on protein families. Science 278:631–637, 1997.

54. TL Bailey, M Gribskov. The megaprior heuristic for discovering protein sequence patterns. Intelligent Systems in Molecular Biology 4:15–24, 1996.

55. S Pietrokovski, JG Henikoff, S Henikoff. The BLOCKS database—A system for protein classification. Nucleic Acids Res 24:197–200, 1996.

56. C Dodge, R Schneider, C Sander. The HSSP database of protein structure–sequence alignments and family profiles. Nucleic Acids Res 26:313–315, 1998.

57. AE Sluder, SW Mathews, D Hough, VP Yin, CV Maina. The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes. Genome Res 9:103–120, 1999.

58. MO Dayhoff, WC Barker, PJ McLaughlin. Inferences from protein and nucleic acid sequences: Early molecular evolution, divergence of kingdoms and rates of change. Orig Life 5:311–330, 1974.

59. MO Dayhoff. The origin and evolution of protein superfamilies. Fed Proc 35:2132–2138, 1976.

60. WC Barker, MO Dayhoff. Evolution of homologous physiological mechanisms based on protein sequence data. Comp Biochem Physiol [B] 62:1–5, 1979.

61. JS Liu, CE Lawrence. Bayesian inference on biopolymer models. Bioinformatics 15:38–52, 1999.

62. TF Smith, MS Waterman. Identification of common molecular subsequences. J Mol Biol 147:195–197, 1981.

63. M Hendlich, P Lackner, S Weitckus, H Flöckner, R Froschauer, K Gottsbacher, G Casari, MJ Sippl. Identification of native protein folds amongst a large number of incorrect models. J Mol Biol 216:167–180, 1990.

64. MAS Saqi, PA Bates, MJE Sternberg. Towards an automatic method of predicting protein structure by homology: An evaluation of suboptimal sequence alignments. Protein Eng 5:305–311, 1992.

65. DT Jones, WR Taylor, JM Thornton. A new approach to protein fold recognition. Nature 358:86–89, 1992.

66. SH Bryant, CE Lawrence. An empirical energy function for threading protein sequence through the folding motif. Proteins Struct Funct Genet 16:92–112, 1993.

67. R Abagyan, D Frishman, P Argos. Recognition of distantly related proteins through energy calculations. Proteins Struct Funct Genet 19:132–140, 1994.

68. TJ Hubbard, J Park. Fold recognition and ab initio structure predictions using hidden Markov models and β-strand pair potentials. Proteins Struct Funct Genet 23:398–402, 1995.

69. NN Alexandrov. SARFing the PDB. Protein Eng 9:727–732, 1996.

70. D Fischer, D Eisenberg. Protein fold recognition using sequence-derived predictions. Protein Sci 5:947–955, 1996.

71. TR Defay, FE Cohen. Multiple sequence information for threading algorithms. J Mol Biol 262:314–323, 1996.

72. B Rost, R Schneider, C Sander. Protein fold recognition by prediction-based threading. J Mol Biol 270:471–480, 1997.

73. WR Taylor. Multiple sequence threading: An analysis of alignment quality and stability. J Mol Biol 269:902–943, 1997.

74. V DiFrancesco, J Garnier, PJ Munson. Protein topology recognition from secondary structure sequences: Application of the hidden Markov models to the alpha class proteins. J Mol Biol 267:446–463, 1997.

75. S Henikoff, JG Henikoff. Performance evaluation of amino acid substitution matrices. Proteins 17:49–61, 1993.

76. JG Henikoff, S Henikoff. BLOCKS database and its applications. Methods Enzymol 266:88–105, 1996.

77. S Henikoff, JG Henikoff, S Pietrokovski. BLOCKS+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics 15:471–479, 1999.

78. M Gerstein, M Levitt. A structural census of the current population of protein sequences. Proc Natl Acad Sci USA 94:11911–11916, 1997.

79. PY Chou, GD Fasman. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol 47:45–148, 1978.

80. JF Gibrat, J Garnier, B Robson. Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 198:425–443, 1987.

81. N Qian, TJ Sejnowski. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202:865–884, 1988.

82. LH Holley, M Karplus. Protein secondary structure prediction with a neural network. Proc Natl Acad Sci USA 86:152–156, 1989.

83. B Rost, C Sander. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins Struct Funct Genet 19:55–72, 1994.

84. AL Delcher, S Kasif, HR Goldberg, WH Hsu. Protein secondary structure modelling with probabilistic networks. Intelligent Systems in Molecular Biology 1:109–117, 1993.

85. JM Chandonia, M Karplus. Neural networks for secondary structure and structural class predictions. Protein Sci 4:275–285, 1995.

86. JM Chandonia, M Karplus. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Sci 5:768–774, 1996.

87. GE Arnold, AK Dunker, SJ Johns, RJ Douthart. Use of conditional probabilities for determining relationships between amino acid sequence and protein secondary structure. Proteins 12:382–399, 1992.

88. P Stolorz, A Lapedes, Y Xia. Predicting protein secondary structure using neural net and statistical methods. J Mol Biol 225:363–377, 1992.

89. MJ Thompson, RA Goldstein. Predicting protein secondary structure with probabilistic schemata of evolutionarily derived information. Protein Sci 6:1963–1975, 1997.

90. MJ Thompson, RA Goldstein. Predicting solvent accessibility: Higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins Struct Funct Genet 25:38–47, 1996.

91. J Janin, S Wodak, M Levitt, B Maigret. Conformations of amino acid side-chains in proteins. J Mol Biol 125:357–386, 1978.

92. E Benedetti, G Morelli, G Nemethy, HA Scheraga. Statistical and energetic analysis of sidechain conformations in oligopeptides. Int J Peptide Protein Res 22:1–15, 1983.

93. JW Ponder, FM Richards. Tertiary templates for proteins: Use of packing criteria in the enumeration of allowed sequences for different structural classes. J Mol Biol 193:775–792, 1987.

94. MJ McGregor, SA Islam, MJE Sternberg. Analysis of the relationship between sidechain conformation and secondary structure in globular proteins. J Mol Biol 198:295–310, 1987.

95. RL Dunbrack Jr, M Karplus. Backbone-dependent rotamer library for proteins: Application to sidechain prediction. J Mol Biol 230:543–571, 1993.

96. RL Dunbrack Jr, M Karplus. Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nature Struct Biol 1:334–340, 1994.

97. H Schrauber, F Eisenhaber, P Argos. Rotamers: To be or not to be? An analysis of amino acid sidechain conformations in globular proteins. J Mol Biol 230:592–612, 1993.

98. J Kuszewski, AM Gronenborn, GM Clore. Improving the quality of NMR and crystallographic protein structures by means of a conformational database potential derived from structure databases. Protein Sci 5:1067–1080, 1996.

99. BI Dahiyat, SL Mayo. Protein design automation. Protein Sci 5:895–903, 1996.

100. M Schena, D Shalon, RW Davis, PO Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470, 1995.

101. M Schena, D Shalon, R Heller, A Chai, PO Brown, RW Davis. Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA 93:10614–10619, 1996.

102. D Shalon, SJ Smith, PO Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res 6:639–645, 1996.

103. MB Eisen, PT Spellman, PO Brown, D Botstein. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci USA 95:14863–14868, 1998.

104. M Wilson, J DeRisi, HH Kristensen, P Imboden, S Rane, PO Brown, GK Schoolnik. Exploring drug-induced alterations in gene expression in Mycobacterium tuberculosis by microarray hybridization. Proc Natl Acad Sci USA 96:12833–12838, 1999.

105. GP Yang, DT Ross, WW Kuang, PO Brown, RJ Weigel. Combining SSH and cDNA microarrays for rapid identification of differentially expressed genes. Nucleic Acids Res 27:1517–1523, 1999.

106. VR Iyer, MB Eisen, DT Ross, G Schuler, T Moore, JCF Lee, JM Trent, LM Staudt, J Hudson Jr, MS Boguski, D Lashkari, D Shalon, D Botstein, PO Brown. The transcriptional program in the response of human fibroblasts to serum. Science 283:83–87, 1999.

107. MQ Zhang. Large-scale gene expression data analysis: A new challenge to computational biologists. Genome Res 9:681–688, 1999.

108. JM Claverie. Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet 8:1821–1832, 1999.

109. S-PLUS, Version 3.4. Mathsoft Inc., 1996.