Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Friesner R.A. (ed.) - Advances in chemical physics, computational methods for protein folding (2002)(en)

.pdf
Скачиваний:
12
Добавлен:
15.08.2013
Размер:
6.52 Mб
Скачать

Ubiquitin

1UBQ

17.44

22.95

15.79

31.58

15.79

5.26

7.89

2.63

21.05

23.68

11.84

28.95

59.21

19.22

31.34

50.35

68.92

84.21

 

1UBQ-

17.44

22.95

15.79

31.58

15.79

5.26

7.89

2.63

21.05

23.68

13.16

28.95

57.89

22.34

29.45

49.14

70.87

85.53

 

V26A

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Unique

1COA

16.07

25.11

17.19

21.88

15.62

4.69

4.69

6.25

29.69

21.88

17.19

29.69

53.12

22.52

32.99

45.66

75.26

76.56

 

1DIV

10.62

18.96

33.93

19.64

16.07

8.93

5.36

3.57

12.50

39.29

28.57

21.43

50.00

33.88

23.60

43.77

70.29

71.43

 

1FKB

30.99

28.96

7.48

38.32

22.43

5.61

2.80

1.87

21.50

10.28

2.80

33.64

63.55

8.61

33.15

58.92

73.23

76.64

 

1IMQ

16.14

18.77

52.33

0.00

11.63

11.63

0.00

0.00

24.42

52.33

40.70

5.81

53.49

38.60

10.73

51.59

79.38

82.56

 

2ABD

17.03

19.80

56.98

0.00

10.47

8.14

3.49

0.00

20.93

60.47

54.65

0.00

45.35

50.22

8.33

42.34

76.71

81.40

 

2AIT

27.62

37.32

0.00

40.54

8.11

16.22

0.00

2.70

32.43

0.00

12.16

36.49

51.35

15.25

36.12

49.67

77.62

67.57

 

2PTL

20.53

33.11

19.35

38.71

9.68

14.52

0.00

1.61

16.13

19.35

24.19

41.94

33.87

24.46

38.45

38.30

75.24

85.48

 

2PDD

6.45

15.00

44.19

0.00

16.28

9.30

0.00

0.00

30.23

44.19

34.88

16.28

48.84

32.75

23.04

45.88

71.79

74.42

 

2VIK

25.87

20.53

19.05

23.81

15.08

16.67

2.38

0.00

23.02

21.43

24.60

23.02

52.38

24.16

27.98

48.43

71.23

77.78

aThe secondary structure contents were obtained with the program (DSSP) [69], and the secondary structure predictions and propensities were obtained with the program PRED2ARY [70] (these descriptors are expressed as percentages of the total numbers of residues). Each of the mutations involved the substitution of an alanine into a helix; because such a change is likely to increase the propensity for forming a helix in that region, the contact orders and secondary structure content were taken to be the same as those of the wild types, and the secondary structure propensities and predictions were calculated with the modified sequences. Likewise, the structural data for the two forms of horse cytochrome c (1HRC) were taken to be the same. A contact was defined as two heavy atoms that are within

˚

 

P

 

by at least two residues (i.e.,

i; i þ 1 and i; i þ 2 contacts are ignored). The (unnormalized) contact order is

4 A

of

each other and separated

c ¼

1

 

i>j

ði; jÞjsi sjj, where nc

is the total number of contacts, si

is the sequence position of the residue containing atom i, and ði; jÞ selects the atoms

nc

 

 

 

 

 

 

(i and j) that are in contact (as defined above). The normalized contact order (c=n) is multiplied by 100 for consistency with Refs. 12 and 13.

15

16

aaron r. dinner et al.

correlated with others (Table IV), consideration of all of them is useful because exhaustive enumeration or a genetic algorithm (GA) is employed to determine which to include for optimal fitting and prediction.

The database consists of 33 proteins. Twenty-four of these fall into six structurally related groups, and nine are structurally unique. The former are SH3 domains [1NYF (82 to 148), 1PKS, 1SHG, and 1SRL], Ig-like b-sandwiches [1FNF (1326 to 1415), 1FNF (1416 to 1509), 1HNG, 1TEN (802 to 891), 1TIT, and 1WIT], members of the acylphosphatase family (1APS, 1HDN, 1PBA, 1URN, and 2HQI), cytochromes (1HRC, 1HRC-oxidized, 1YCC), cold shock proteins [1CSP and 1MJC (2 to 70)], l-repressor variants (1LMB wild type and G46A/G48A), and ubiquitin variants (1UBQ wild type and V26A). The remainder of the proteins are 1COA (20 to 83), 1DIV (1 to 56), 1FKB, 1IMQ, 2ABD, 2AIT, 2PDD, 2PTL (94 to 155), and 2VIK. Numbers in parentheses indicate the residue numbers of the domain or fragment studied.

To cross-validate the results, each group of structurally related proteins is left out of the training set in turn and used to test the network. Such a partitioning scheme (in contrast to a jackknife one, for example) minimizes the likelihood of biasing the results in favor of structural descriptors (see Section II). Its use yields true predictions (denoted ‘‘cv’’) in contrast to fits of the data, in which all the proteins are included during the training (denoted ‘‘trn’’). The latter tend to yield inflated accuracy statistics, but we describe them here as well for comparison with earlier studies [12,13,20,47], which failed to cross-validate their results [however, it should be noted that the relationship in Ref. 12 has been used successfully for blind predictions (K. W. Plaxco and D. Baker, personal communication)].

C.Single-Descriptor Models

We begin by examining the relationship between log kf and each individual descriptor.

1.Linear Correlations

The first column of statistics given in Table I contains the Pearson linear correlation coefficients between the descriptor values (x) and log kf ðrx;log kf Þ. This is the statistical measure used by Plaxco et al. in their analysis of a subset of the descriptors considered here [12,14]. Consistent with their results, the two coefficients with the largest magnitudes are associated with the contact order (c and c=n). Several descriptors not examined by Plaxco et al. [12,14] exhibit jrx;log kf j > 0:5 as well: the a-helix content and propensity (h and ph), total helix content (a), and b-sheet content (e). Additional linear statistics are provided in Table V. Physical interpretations of the results are given in Section IV.E.

2.Neural Network Predictions

The second and third columns of statistics in Table I measure the ability of a single-input neural network to predict the folding rate. They contain Pearson

TABLE IV

Descriptor–Descriptor Pearson Linear Correlation Coefficientsa

 

 

G

G=n

m

m=n

n

nc

c

c=n

h

e

t

s

g

b

o

a

Ph

Pe

Po

ph

pe

po

qe

qa

0

G

 

0.94

0.50

0.38

0.50

0.37

0.02

0.29

0.24

0.41

0.26

0.05

0.34

0.02

0.28

0.20

0.01

0.17

0.16

0.06

0.18

0.10

0.04

0.17

1

G=n

0.93

 

0.43

0.40

0.20

0.16

0.16

0.25

0.20

0.39

0.19

0.03

0.25

0.22

0.30

0.16

0.01

0.07

0.09

0.05

0.08

0.00

0.02

0.02

2

m

0.28

0.27

 

0.95

0.39

0.31

0.12

0.09

0.02

0.06

0.05

0.01 0.27 0.08

0.22

0.01

0.11

0.04

0.14

0.13

0.08

0.14

0.20

0.06

3

m=n

0.21

0.26

0.98

 

0.12

0.13

0.01

0.04

0.03

0.02

0.12

0.06

0.16

0.06

0.20

0.05

0.12

0.13

0.06

0.13

0.16

0.04

0.21

0.21

4

n

0.21

0.14 0.10 0.27

 

0.73

0.48

0.10

0.17

0.15

0.27

0.17

0.40

0.47

0.07

0.12

0.02

0.26

0.23

0.02

0.27

0.28

0.12

0.41

5

nc

0.19

0.38 0.20 0.28

0.60

 

0.17

0.28

0.50

0.44

0.14

0.15 0.34 0.42

0.12

0.46

0.34

0.52 0.03

0.34

0.55

0.02

0.00

0.19

6

c

0.11

0.00

0.34

0.43

0.34 0.02

 

0.81

0.63

0.66

0.03

0.29

0.28

0.22

0.18

0.67

0.49

0.51

0.28

0.56

0.50

0.44

0.26

0.46

7

c=n

0.08

0.18

0.23

0.12

0.68 0.54

0.43

0.39

0.84

0.83

0.22

0.47

0.10

0.08

0.24

0.86

0.61

0.77

0.20

0.68

0.77

0.35

0.18

0.26

8

h

0.01

0.06 0.10 0.08

0.08

0.39

0.39

0.75

0.89

0.01

0.22 0.11 0.34

0.41

0.99

0.83

0.82 0.51

0.88

0.88 0.60

0.23

0.24

9

e

0.18

0.12

0.13

0.12

0.15 0.39

0.44

0.45

0.17

0.00

0.07

0.02

0.12

0.00

0.89

0.64

0.77

0.26

0.71

0.80

0.38

0.11

0.18

10

t

0.15

0.26 0.35 0.36

0.29

0.34

0.02

0.27

0.08

0.52

0.70

0.20

0.32

0.27

0.01

0.17

0.01

0.26

0.11

0.00

0.21

0.23

0.14

11

s

0.30

0.34

0.01

0.01

0.04

0.02

0.05

0.04

0.20

0.35

0.55

0.38

0.25

0.45

0.27

0.20

0.32

0.00

0.23

0.31

0.06

0.01

0.02

12

g

0.05

0.15 0.14 0.16

0.17

0.19

0.11

0.28

0.01

0.31

0.36

 

0.42

0.09

0.02

0.07

0.09 0.03

0.09

0.06 0.10

0.29

0.15

13

b

0.04

0.08

0.05

0.02

0.26

0.00

0.03

0.30

0.47

0.09

0.15

0.34

0.49

 

0.26

0.29

0.32

0.31

0.20

0.27

0.32

0.13

0.22

0.03

14

o

0.26

0.33

0.29

0.28

0.13 0.42

0.08

0.23

0.73

0.24

0.41

0.23

0.08

0.43

0.70

0.42

0.50

0.22

0.59

0.50

0.31

0.59

0.10

0.23

15

a

0.02

0.10 0.13 0.12

0.11

0.43

0.41

0.45

0.97

0.81

0.16

0.07

0.22

0.34

 

0.84

0.84 0.51

0.89

0.89 0.62

0.19

0.27

16

Ph

0.23

0.19

0.20

0.19

0.04

0.13

0.45

0.38

0.68

0.66

0.05

0.11

0.33

0.19

0.41

0.74

0.69

0.80 0.80

0.99

0.86 0.83

0.27

0.23

17

Pe

0.22

0.32

0.31

0.33

0.17 0.08

0.31

0.43

0.33

0.33

0.13

0.25

0.62

0.16

0.26

0.46

 

0.29

0.83

0.98

0.38

0.18

0.03

18

Po

0.51

0.53

0.01

0.01

0.08

0.12

0.35

0.16

0.66

0.63

0.03

0.36

0.07

0.13

0.35

0.62

0.79

0.10

 

0.76

0.41

0.96

0.25

0.34

19

ph

0.21

0.16

0.29

0.28

0.06

0.15

0.36

0.33

0.74

0.67

0.07

0.10

0.32

0.24

0.51

0.79

0.98

0.68 0.77

0.83

0.89 0.82

0.21

0.22

20

pe

0.12

0.22

0.42

0.43

0.16 0.19

0.24

0.37

0.55

0.48

0.22

0.25

0.60

0.19

0.47

0.67

0.80

0.93

0.31

 

0.48

0.15

0.06

21

po

0.47

0.48

0.06

0.04

0.05

0.08

0.35

0.19

0.69

0.63

0.09

0.41

0.06

0.21

0.38

0.65

0.84

0.21

0.97

0.84

0.39

0.06

0.21

0.36

22

qe

0.18

0.26

0.53

0.53

0.21 0.34

0.42

0.14

0.08

0.06

0.48

0.14

0.22

0.13

0.12

0.03

0.03

0.02 0.02

0.00

0.06

 

0.37

23

qa

0.20

0.15

0.42

0.39

0.18

0.21

0.55

0.58

0.29

0.20

0.12

0.09

0.30

0.15

0.19

0.21

0.11

0.17

0.30

0.05

0.16

0.24

0.54

 

aData for all 33 proteins are above the diagonal, and data for the 14 proteins with c > 21 are below the diagonal.

17

18

 

aaron r. dinner et al.

 

 

 

 

TABLE V

 

 

 

 

Linear Regression Statistics for log kf

 

 

 

 

 

 

Index

Symbol

rtrn

rcv

q2

 

 

 

 

cv

0

G

0.29

0.02

0.09

1

G=n

0.37

0.13

0.05

2

m

0.04

0.65

0.19

3

m=n

0.04

0.52

0.20

4

n

0.10

0.53

0.27

5

nc

0.08

0.60

0.24

6

c

0.73

0.70

0.48

7

c=n

0.79

0.77

0.59

8

h

0.63

0.55

0.30

9

e

0.67

0.59

0.34

10

t

0.04

0.76

0.23

11

s

0.11

0.52

0.19

12

g

0.01

0.75

0.41

13

b

0.15

0.43

0.26

14

o

0.05

0.74

0.31

15

a

0.63

0.57

0.32

16

Ph

0.47

0.29

0.06

17

Pe

0.48

0.31

0.08

18

Po

0.27

0.27

0.28

19

ph

0.51

0.37

0.13

20

pe

0.47

0.28

0.05

21

po

0.40

0.07

0.09

22

qe

0.21

0.21

0.14

23

qa

0.40

0.12

0.07

linear correlation coefficients (rtrn and rcv) between observed and calculated values of log kf ; thus, only positive values of r are significant. Because there are only 24 different input possibilities, it is feasible to consider each one in turn, so that use of a genetic algorithm is not necessary at this stage. However, the NN weights depend on the random number generator seed through the training procedure. Consequently, for each descriptor, the network was trained independently with ten different seeds. The maximum correlation coefficient for each set of 10 networks corresponding to a particular descriptor is listed in Table I; the average standard deviation for a given descriptor was 0.03 for rtrn and 0.06 for rcv.

As stated above, the coefficients denoted ‘‘trn’’ are for results obtained with networks trained on all 33 proteins; in other words, they are not true predictions since all the data are included in the training set. For descriptors that are linearly related to log kf ; rtrn is expected to be comparable in magnitude to rx;log kf (in fact, for linear regression, rtrn ¼ jrx;log kf j), whereas, for ones that are non-linearly related, it should be higher. Thus, rtrn can be viewed as essentially a nonlinear version of the statistic employed in Ref. 12. Accordingly, most of the descriptors that exhibit high rtrn were included in the analysis of rx;log kf .

statistical analysis of protein folding kinetics

19

The coefficients denoted ‘‘cv’’ are for the predictions obtained with the structurally based cross-validation scheme. Negative values of rcv indicate that the accuracy of the network is lower than that which would be obtained from random guesses. If a network fails in this way when confronted with novel test data, it has derived a spurious relationship by memorizing the information in the training set at the expense of learning more general rules. The highest rcv do correspond to the highest rtrn, but overall the cross-validated coefficients are much lower. The large differences between rtrn and rcv in many cases (Table I) make clear that the former is a relatively indiscriminate statistic for such a small database. If linear regression is used, rtrn and rcv are often closer due to the decreased flexibility of the fitting method (Table V). However, such an approach fails to identify nonlinear relationships and can hide complexities in the results.

In summary, the contact order yields relatively good prediction of log kf but is not alone in doing so. Several measures of the propensity of the sequence for a given structure also exhibit significant relationships with the folding rate. Although rcv values for the various descriptors obtained from the secondary structure prediction program (indices 16 to 21 in Table I) are lower than those for measures of the known native structure (indices 6 to 15), the former correlations may be sufficiently high that the calculated descriptors could be used to identify particularly fast or slow proteins without the need for highresolution structures. The stability, which has been suggested to be of importance based on model studies, exhibits no clear relation to the folding rate. An essential additional point of the single-descriptor analysis is that large differences are observed between most of the values obtained with and without cross-validation. This highlights the need for care in assessing the significance of correlations when working with small numbers of sequences.

D. Multiple-Descriptor Models

We present results for twoand three-descriptor models; addition of a fourth descriptor yielded no significant improvement in predictive accuracy. In the twodescriptor case there are only 276 possible input combinations, so we examine each explicitly, whereas, in the three-descriptor case there are 2024, so we use the genetic algorithm (GA) to optimize the descriptor selection. Use of the GA in the two-descriptor case gives models of comparable quality to the exhaustive search, but this test of the algorithm is not very stringent because the space of input combinations is small. Because both the GA and the NN depend on the random number generator seed, several trials were performed in each case (as detailed in Section IV.D.2).

1.Two Descriptors

The best five two-descriptor models are shown in Table VI, and selected examples to illustrate the types of behavior that are observed are shown in Fig. 2.

20

aaron r. dinner et al.

There is a significant increase in fitting ability (training) and, more importantly, in predictive accuracy (cross-validation) upon adding a second descriptor. In Figure 2, we see that the squares (&) tend to be closer to the ideal line than the circles ( ), particularly for lower log kf (slower-folding proteins). To quantitate the improvement, we calculated Wold’s E statistic from the q2cv values (Table VI). While these figures suggested to us that the additional descriptors significantly improve the accuracies of the cross-validated predictions, general confidence limits are not straightforward to calculate. Consequently, we did the following. We shuffled the values of each secondary descriptor (other than c=n) 10 times and then trained neural networks to predict log kf as for the actual data. Averages and standard deviations of the correlation coefficients are reported in Table VII. We see that, even though the rtrn values are comparable to those in Table VI, the

calculated log kf

calculated log kf

5.0

4.0

3.0

2.0

1.0

0.0

−1.0

−1.0 0.0 1.0 2.0 3.0 4.0 5.0 observed log kf

(a)

5.0

4.0

3.0

2.0

1.0

0.0

−1.0

−1.0 0.0 1.0 2.0 3.0 4.0 5.0 observed log kf

(c)

calculated log kf

calculated log kf

5.0

4.0

3.0

2.0

1.0

0.0

−1.0

−1.0 0.0 1.0 2.0 3.0 4.0 5.0 observed log kf

(b)

5.0

4.0

3.0

2.0

1.0

0.0

−1.0

−1.0 0.0 1.0 2.0 3.0 4.0 5.0 observed log kf

(d)

Figure 2. Comparison of observed and calculated values of log kf for selected models. (a and b) c=nð Þ; c=n and G=n (&); and c=n; G=n and peð4Þ. (c and d) c=nð Þ; c=n and nc (&); and c=n; nc; and Gð4Þ. (a and c) Training set fits. (b and d) Cross-validated predictions.

statistical analysis of protein folding kinetics

21

TABLE VI

 

The Best (as Measured by rcv) Five Two-Descriptor Models Obtained by Examining All Possible Combinations for Ten Different Random Number Generator Seedsa

Descriptors

rtrn

rcv

q2

E

 

 

 

 

cv

 

c=n

G=n

0.89

0.81

0.66

0.74

c=n

Ph

0.87

0.80

0.63

0.81

c=n

nc

0.89

0.79

0.62

0.82

c=n

ph

0.86

0.77

0.57

0.93

c=n

qa

0.84

0.77

0.59

0.89

aFor the calculation of E, q2cv was compared with that for c=n. Statistics for linear regression and additional measures of the predictive accuracy are available in Tables VII and VIII.

rcv values are close to that for c=n by itself (Table I); the NN ignores the randomized descriptor. The fact that the rcv values for the actual data are two to four standard deviations above the average rcv values for the randomized data demonstrates that the improvement is significant and is not due to the increase in the number of fitting parameters.

The best predictions are obtained with G=n paired with c=n ( G with c is the sixth best set of inputs with rcv ¼ 0.77 and E ¼ 0.76) This combination of input descriptors was investigated previously [15], but it is of interest that it ranks first in the exhaustive search performed here. To better understand the physical basis for the correlations, we show the dependence of log kf on c=n andG=n in Fig. 3a. When c=n is small (c=n 19; mainly a-helical proteins), folding is always fast (kf > 400 s 1), whereas when c=n is large (c=n 25; either mixed-a/b or b-sheet proteins), the rate spans over three orders of magnitude. Thus, proteins with lower contact orders fold fast regardless of their stabilities, whereas for those with higher contact orders, the rate increases with G=n. As described in Ref. 15, a single-input neural network can be trained to predict log kf from G for the 14 proteins with c > 21 (Fig. 4); rtrn ¼ 0:81, and rcv ¼ 0:64, which confirms that stability plays a significant role in determining the folding rates of mixed-a/b and b-sheet proteins. For these 14

TABLE VII

Randomization Tests for the Models in Table VIa

Descriptors

rtrn

rcv

q2

 

 

 

 

cv

c=n

G=n

0:83 0:01

0:71 0:03

0:49 0:04

c=n

Ph

0:84 0:03

0:68 0:07

0:43 0:12

c=n

nc

0:87 0:02

0:69 0:04

0:46 0:05

c=n

ph

0:84 0:02

0:68 0:06

0:42 0:10

c=n

qa

0:84 0:00

0:68 0:07

0:44 0:11

aIn each case, the second descriptor listed was shuffled 10 times, and the networks were trained as for the original data. Values shown are averages for the 10 trials; ranges indicate standard deviations.

(rpe;ph

22

 

aaron r. dinner et al.

 

 

 

 

 

TABLE VIII

 

 

 

Linear Regression Statistics for the Models in Table VI

 

 

 

 

 

 

Descriptors

rtrn

rcv

q2

E

 

 

 

 

cv

 

c=n

G=n

0.81

0.72

0.47

1.27

c=n

Ph

0.79

0.75

0.57

1.04

c=n

nc

0.82

0.79

0.62

0.92

c=n

ph

0.79

0.75

0.56

1.05

c=n qa

0.80

0.77

0.60

0.97

 

 

 

proteins, r G;log kf

¼ 0:80 while

rc;log kf ¼ 0:22; E ¼ ð1 qc2; GÞ=ð1 qc2Þ

¼ 0:23.

Two of the other models in Table VI combine the contact order with a measure of the a-helical propensity: c=n with either Ph or ph: These pairings essentially reflect the results of the previous section. The remaining model couples c=n with nc, which reveals a secondary dependence on protein size. Consistent with the sign of rnc;log kf (Table I), the functional dependences of log kf on these descriptors for the models in Table VI indicate that shorter proteins fold faster than longer ones (Fig. 3b).

2.Three Descriptors

As mentioned above, there are 2024 possible combinations of three descriptors, so we use a GA to identify the inputs that are likely to yield the greatest predictive accuracy. Use of the GA requires selection of a particular measure of predictive accuracy to decide which models to keep at each cycle. Because we are interested primarily in cross-validated predictions, rcv is a natural choice. However, the structurally based partitioning scheme is less straightforward to automate than a jackknife one. Consequently, for the GNN, we used the Pearson linear correlation coefficient for the jackknife cross-validated outputs (rjck) and subsequently tested each selected combination of descriptors with the structurally based cross-validation scheme (rcv). We performed five GNN trials, from each of which we saved the best 20 models. Of these 100 models, 46 were unique, and each of these was subjected to 10 trials with the structurally based cross-validation scheme.

In general, the GA combines the descriptors that were identified above by the two-dimensional exhaustive search (c; c=n; G; G=n; and nc) to further refine the predictions (Tables IX to XI and Fig. 2). The propensity for sheet structure ðpeÞ appears in two of the five models; not surprisingly, it is strongly anticorrelated with the propensity for helical structure, which appeared in Table VI

¼ 0:89). In considering these results, it is necessary to keep in mind that the database is small, so that there is a danger of overfitting (but see Table X). Nevertheless, given this disclaimer, we see that simultaneous consideration of multiple descriptors improves prediction of the folding rate and that both the

statistical analysis of protein folding kinetics

23

12

104

102 kf

1

0.15

 

0.10

 

n

 

 

 

18

G

/

 

 

24

 

 

0.05

 

 

c/n

30

 

 

 

36

 

 

(a)

104

102 kf

12

18

24

c/n

1

400

600

800

n

c

 

 

30

1000

 

 

36

 

(b)

Figure 3. Functional dependence of calculated folding rate (kf , in s 1) on the normalized contact order (c=n) and either (a) the normalized stability ( G=n in kcal/mol) or (b) the total number of atomic contacts ðncÞ.

24

aaron r. dinner et al.

 

5

 

 

 

 

 

 

 

 

 

4

 

 

 

 

 

 

 

 

 

3

 

 

 

 

 

 

 

 

f

 

 

 

 

 

 

 

 

 

k

2

 

 

 

 

 

 

 

 

log

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

Figure 4.

Observed (points)

 

 

 

 

 

 

and calculated (line) log kf

as a

 

 

 

 

 

 

 

 

−1

 

 

 

 

 

function of the stability in kcal/

 

2

4

6

8

10

mol for the

14 proteins in

the

 

0

 

 

 

 

∆G

 

 

database with c > 21:

 

size and the stability play significant secondary roles that could not have been anticipated from the single-descriptor analyses.

E.Physical Bases of the Observed Correlations

Consistent with earlier, single-descriptor linear analyses of protein folding [12,13,50], the primary determinants of the folding rate are measures that characterize the native structure; that is, proteins with more sequentially local interactions tend to fold faster. As discussed below, the equilibrium structure and the kinetics are connected by the fact that the structure of the transition state resembles that of the native state in many small proteins [50]. Thus, the kinetics and the underlying thermodynamics of the reaction are affected in a similar way, in accord with linear free energy relations.

The microscopic origin for the statistical dependence of the folding kinetics on the structure is the stochastic diffusive search that is required to find the

TABLE IX

The Best (as Measured by rcv) Five Unique Three-Descriptor Models Obtained from the GNN Protocol for Ten Different Random Number Generator Seedsa

Descriptors

 

rtrn

rjck

rcv

q2

E

 

 

 

 

 

 

cv

 

c=n

G=n

pe

0.92

0.84

0.86

0.74

0.76

c=n

G

nc

0.93

0.84

0.84

0.70

0.80

c=n

G=n

nc

0.92

0.81

0.83

0.67

0.97

c=n

G

c

0.90

0.83

0.83

0.66

0.81

c=n

G

pe

0.91

0.80

0.83

0.67

0.72

aFor the calculation of E, q2cv was compared with the highest observed q2cv of the six possible twodescriptor models that could be formed from the three selected inputs (corresponding to the unshuffled pair in Table X). Statistics for linear regression and additional measures of the predictive accuracy are available in Table X and XI.