Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.) Computational biochemistry and biophysic.pdf

Comparative Protein Structure Modeling


and the segment-matching method of Levitt [74]. The accuracies of the methods were similar. They were able to predict correctly approximately 50% of χ1 angles and 35% of both χ1 and χ2 angles. In typical comparative modeling applications where the backbone is closer to the native structures (2 Å RMSD), these numbers increase by approximately 20% [146].
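A minimal sketch of how such a side chain accuracy figure might be computed, assuming the predicted and reference χ angles are already available in degrees. The 40° tolerance and the function names are assumptions made for illustration; the criterion used in the cited studies is not stated here.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_correct(predicted, reference, tol=40.0):
    """Fraction of residues whose chi angles all lie within `tol` degrees.

    `predicted` and `reference` are equal-length lists holding one tuple of
    chi angles per residue, e.g. (chi1,) or (chi1, chi2).
    The tolerance is an assumed value, not one taken from the source.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = 0
    for pred, ref in zip(predicted, reference):
        if all(angle_diff(p, r) <= tol for p, r in zip(pred, ref)):
            hits += 1
    return hits / len(predicted)
```

Passing one-element tuples scores χ1 alone, while two-element tuples score the stricter case in which both χ1 and χ2 must be correct.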

III. AB INITIO PROTEIN STRUCTURE MODELING METHODS

This section briefly reviews prediction of the native structure of a protein from its sequence of amino acid residues alone. These methods can be contrasted to the threading methods for fold assignment [Section II.A] [39–47,147], which detect remote relationships between sequences and folds of known structure, and to comparative modeling methods discussed in this review, which build a complete all-atom 3D model based on a related known structure. The methods for ab initio prediction include those that focus on the broad physical principles of the folding process [148–152] and the methods that focus on predicting the actual native structures of specific proteins [44,153,154,240]. The former frequently rely on extremely simplified generic models of proteins, generally do not aim to predict native structures of specific proteins, and are not reviewed here.

Although comparative modeling is the most accurate modeling approach, it is limited by its absolute need for a related template structure. For more than half of the proteins and two-thirds of domains, a suitable template structure cannot be detected or is not yet known [9,11]. In those cases where no useful template is available, the ab initio methods are the only alternative. These methods are currently limited to small proteins and at best result only in coarse models with an RMSD error for the Cα atoms that is greater than 4 Å. However, one of the most impressive recent improvements in the field of protein structure modeling has occurred in ab initio prediction [155–157].
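For reference, the RMSD quoted here can be computed as in the sketch below. It assumes that the two coordinate sets list the same Cα atoms in the same order and have already been superposed; the superposition step is not shown, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the structures are already optimally superposed and that index i
    refers to the same atom in both lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Because the squared deviations are averaged before the square root is taken, a few badly placed atoms can dominate the value.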

Ab initio prediction relies on the thermodynamic hypothesis of protein folding [158]. The thermodynamic hypothesis suggests that the native structure of a protein sequence corresponds to its global free energy minimum state. Accordingly, ab initio prediction methods are generally formulated as optimizations. As such, they can be distinguished by the representation of a protein and its degrees of freedom, the function that defines the energy for each of the allowed conformations, and the optimization method that attempts to find the global minimum on a given energy surface.

Although the folding of short proteins has been simulated at the atomic level of detail [159,160], a simplified protein representation is often applied. Simplifications include using one or a few interaction centers per residue [161] as well as a lattice representation of a protein [162]. Some methods are hierarchical in that they begin with a simplified lattice representation and end up with an atomistic detailed molecular dynamics simulation [163].

The energy functions for folding simulations include atom-based potentials from molecular mechanics packages [164] such as CHARMM [81], AMBER [165], and ECEPP [166], the statistical potentials of mean force derived from many known protein structures [167], and simplified potentials based on chemical intuition [168–171]. Some methods also incorporate non-physical spatial restraints obtained from multiple sequence alignments and other considerations to reduce the size of the conformational space that needs to be explored [172–176].


Fiser et al.

Many different optimization methods [177,178]—even enumerations with some lattice models [171]—have been applied to the protein folding problem. These methods include molecular dynamics simulations [179,180], Monte Carlo sampling [173,181,182], the diffusion equation method [183], and genetic algorithm optimization [184–186]. A recent and particularly successful approach assembles the whole protein model from relatively short building blocks [187–189]. Many candidate blocks are obtained from known protein structures by relying on energetic, geometrical, and sequence similarity filters. The model of a whole protein is then assembled from such pieces by a Monte Carlo optimization of a statistical energy function [188].
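The three ingredients named above (a representation, an energy function, and a search method) can be illustrated with a toy example. The sketch below uses the two-dimensional HP lattice model as a stand-in; this model, the contact energy of -1, and all function names are assumptions chosen for illustration and are not taken from the methods cited here. The search is a plain exhaustive enumeration of self-avoiding walks, which is feasible only for very short chains.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def hp_energy(sequence, path):
    """Energy of a lattice conformation: -1 per H-H contact.

    A contact is a pair of H residues on adjacent lattice sites that are
    not neighbors along the chain.
    """
    index_at = {site: i for i, site in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look right and up only, so that each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy


def lowest_energy_fold(sequence):
    """Enumerate all self-avoiding walks and return (energy, path) of the best."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues")
    best_energy, best_path = None, None

    def extend(path, occupied):
        nonlocal best_energy, best_path
        if len(path) == len(sequence):
            energy = hp_energy(sequence, path)
            if best_energy is None or energy < best_energy:
                best_energy, best_path = energy, list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                extend(path, occupied)
                path.pop()
                occupied.remove(site)

    # Fix the first bond to remove equivalent rotated copies of each walk.
    start = [(0, 0), (1, 0)]
    extend(start, set(start))
    return best_energy, best_path
```

The number of walks grows exponentially with chain length, which is why realistic applications replace the enumeration with sampling methods such as Monte Carlo or genetic algorithms.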

There is scope for combining the comparative modeling and ab initio methods. The modeling of inserted loops in comparative prediction is based primarily on the sequence information alone. In addition, the alignment errors as well as large distortions of the target relative to the template require that such regions be modeled ab initio without relying on the template structure. It is likely that the ab initio approaches will help reduce some of the limitations of comparative modeling.

IV. ERRORS IN COMPARATIVE MODELS

The errors in comparative models can be divided into five categories [58] (Fig. 7):

1. Errors in side chain packing.

2. Distortions or shifts of a region that is aligned correctly with the template structures.

3. Distortions or shifts of a region that does not have an equivalent segment in any of the template structures.

4. Distortions or shifts of a region that is aligned incorrectly with the template structures.

5. A misfolded structure resulting from using an incorrect template.

Significant methodological improvements are needed to address all of these errors. Errors 3–5 are relatively infrequent when sequences with more than 40% identity to the templates are modeled. For example, in such a case, approximately 90% of the main chain atoms are likely to be modeled with an RMS error of about 1 Å. In this range of sequence similarity, the alignment is mostly straightforward to construct, there are not many gaps, and structural differences between the proteins are usually limited to loops and side chains. When sequence identity is between 30% and 40%, the structural differences become larger, and the gaps in the alignment are more frequent and longer. As a result, the main chain RMS error increases to about 1.5 Å for about 80% of the residues. The rest of the residues are modeled with large errors because the methods generally fail to model structural distortions and rigid-body shifts and are unable to recover from misalignments. Below 40% sequence identity, misalignments and insertions in the target sequence become the major problems. Insertions longer than about eight residues cannot yet be modeled accurately, but shorter loops can frequently be modeled successfully [92,119,239]. When sequence identity drops below 30%, the main problem becomes the identification of related templates and their alignment with the sequence to be modeled (Fig. 8). In general, it can be expected that about 20% of residues will be misaligned and consequently incorrectly modeled with an error greater than 3 Å at this level of sequence similarity [51]. This is a serious impediment for comparative modeling because it appears that at least one-half of all related protein pairs are related at less than 30% sequence identity [9,190].

Figure 7 Typical errors in comparative modeling. (a) Errors in side chain packing. The Trp 109 residue in the crystal structure of mouse cellular retinoic acid binding protein I (thin line) is compared with its model (thick line) and with the template mouse adipocyte lipid-binding protein (broken line). (b) Distortions and shifts in correctly aligned regions. A region in the crystal structure of mouse cellular retinoic acid binding protein I (thin line) is compared with its model (thick line), and with the template fatty acid binding protein (broken line). (c) Errors in regions without a template. The Cα trace of the 112–117 loop is shown for the X-ray structure of human eosinophil neurotoxin (thin line), its model (thick line), and the template ribonuclease A structure (residues 111–117; broken line). (d) Errors due to misalignments. The N-terminal region in the crystal structure of human eosinophil neurotoxin (thin line) is compared with its model (thick line). The corresponding region of the alignment with the template ribonuclease A is shown. The black lines show correct equivalences, that is, residues whose Cα atoms are within 5 Å of each other in the optimal least-squares superposition of the two X-ray structures. The "a" characters in the bottom line indicate helical residues. (e) Errors due to an incorrect template. The X-ray structure of α-trichosanthin (thin line) is compared with its model (thick line), which was calculated using indole-3-glycerophosphate synthase as the template. (From Ref. 146.)
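The identity ranges quoted in this section can be collected into a small lookup, sketched below. The function names are illustrative, and the identity calculation assumes one common convention (identical residues divided by the number of aligned, non-gap positions); other definitions of percentage identity exist and give somewhat different numbers.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percentage of aligned (non-gap) positions that hold identical residues.

    Both arguments are rows of the same alignment and must have equal length.
    The choice of denominator is an assumption of this sketch.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("alignment rows must have equal length")
    aligned = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def expected_accuracy(identity):
    """Rough expectation for a model, using the ranges quoted in the text."""
    if identity > 40.0:
        return "about 90% of main chain atoms within ~1 Å RMS error"
    if identity >= 30.0:
        return "about 80% of residues within ~1.5 Å main chain RMS error"
    return "about 20% of residues misaligned, with errors above 3 Å"
```

These are average expectations only; the accuracy of an individual model depends on the quality of the alignment and on the regions of interest.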

It has been pointed out that a comparative model is frequently more distant from the actual target structure than the closest template structure used to calculate the model [191]. However, at least for some modeling methods, this is the case only when there are errors in the template–target alignment used for modeling and when the correct structure-based template–target alignment is used for comparing the template with the actual target structure [58]. In contrast, the model is generally closer to the target structure than any of the templates if the modeling target–template alignment is used in evaluating the similarity between the actual target structure and the template [58]. As a result, using a model is generally better than using the template structure even when the alignment is incorrect, because the actual target structure, and therefore the correct template–target alignment, are not available in practical modeling applications.

Figure 8 Average model accuracy as a function of the percentage identity between the target and template sequences. (a) The models were calculated entirely automatically, based on single template structures. As the sequence identity between the target sequence and the template structure decreases, the average structural similarity between the template and the target also decreases (dashed line, triangles). Structure overlap is defined as the fraction of equivalent Cα atoms. For comparison of the model with the actual structure (continuous line, circles), two Cα atoms were considered equivalent if they were within 3.5 Å of each other and belonged to the same residue. For comparison of the template structure with the actual structure (dashed line, triangles), two Cα atoms were considered equivalent if they were within 3.5 Å of each other after alignment and rigid-body superposition by the ALIGN3D command in MODELLER. (b) Three models (solid line) compared with their corresponding experimental structures (dotted line). The models were calculated with MODELLER in a completely automated fashion before the experimental structures were available [146]. When multiple sequence and structure information is used and the alignments are edited by hand, the models can be significantly more accurate than shown in this plot [58].
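The structure overlap measure used in Figure 8 for comparing a model with the actual structure can be sketched as follows. The code assumes that both coordinate lists are indexed by the same residues and are already superposed; it does not reproduce the ALIGN3D procedure of MODELLER, and the function name is illustrative.

```python
import math


def structure_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of residues whose Cα atoms lie within `cutoff` Å of each other.

    Index i must refer to the same residue in both lists, and the two
    structures are assumed to be superposed already.
    """
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    equivalent = 0
    for a, b in zip(model_ca, native_ca):
        distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        if distance <= cutoff:
            equivalent += 1
    return equivalent / len(native_ca)
```

Unlike an RMSD, this fraction is bounded between 0 and 1 and is not inflated by a few residues with very large deviations.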


To put the errors in comparative models into perspective, we list the differences among structures of the same protein that have been determined experimentally (Fig. 9). The 1 Å accuracy of main chain atom positions corresponds to X-ray structures defined at a low resolution of about 2.5 Å and with an R-factor of about 25% [192], as well as to medium resolution NMR structures determined from 10 interproton distance restraints per residue [193]. Similarly, differences between the highly refined X-ray and NMR structures of the same protein also tend to be about 1 Å [193]. Changes in the environment (e.g., oligomeric state, crystal packing, solvent, ligands) can also have a significant effect on the structure [194]. Overall, comparative modeling based on templates with more than 40% identity is almost as good as medium resolution experimental structures, simply because the proteins at this level of similarity are likely to be as similar to each other as are the structures for the same protein determined by different experimental techniques under different conditions. However, the caveat in comparative protein modeling is that some regions, mainly loops and side chains, may have larger errors.

Figure 9 Relative accuracy of comparative models. Upper left panel, comparison of homologous structures that share 40% sequence identity. Upper right panel, conformations of ileal lipid-binding protein that satisfy the NMR restraints set equally well. Lower left panel, comparison of two independently determined X-ray structures of interleukin 1β. Lower right panel, comparison of the X-ray and NMR structures of erabutoxin. The figure was prepared using the program MOLSCRIPT [236].

A particularly informative way to test protein structure modeling methods, including comparative modeling, is provided by the biennial meetings on critical assessment of techniques for protein structure prediction (CASP) [191,195,196]. The most recent meeting was held in December 1998 [241]. Protein modelers are challenged to model sequences with unknown 3D structure and to submit their models to the organizers before the meeting. At the same time, the 3D structures of the prediction targets are being determined by X-ray crystallography or NMR methods. They become available only after the models are calculated and submitted. Thus, a bona fide evaluation of protein structure modeling methods is possible.

V. MODEL EVALUATION

Essential for interpreting 3D protein models is the estimation of their accuracy, both the overall accuracy and the accuracy in the individual regions of a model. The errors in models arise from two main sources, the failure of the conformational search to find the optimal conformation and the failure of the scoring function to identify the optimal conformation. The 3D models are generally evaluated by relying on geometrical preferences of the amino acid residues or atoms that are derived from known protein structures. Empirical relationships between model errors and target–template sequence differences can also be used. It is convenient to approach an evaluation of a given model in a hierarchical manner [9]. It first needs to be assessed if the model at least has the correct fold. The model will have a correct fold if the correct template is picked and if that template is aligned at least approximately correctly with the target sequence. Once the fold of a model is confirmed, a more detailed evaluation of the overall model accuracy can be performed based on the overall sequence similarity on which the model is based (Fig. 8). Finally, a variety of error profiles can be constructed to quantify the likely errors in the different regions of a model. A good strategy is to evaluate the models by using several different methods and identify the consensus between them. In addition, energy functions are in general designed to work at a certain level of detail and are not appropriate to judge the models at a finer or coarser level [197]. There are many model evaluation programs and servers [198,199] (Table 1).
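The consensus strategy mentioned above might be sketched as follows. The sketch assumes that each evaluation method has already been reduced to a per-residue list of flags (True where the method suspects an error); the method names, the data layout, and the majority rule are all hypothetical and do not correspond to the output of any particular program.

```python
def consensus_flags(profiles, min_votes=None):
    """Indices of residues flagged as likely errors by enough methods.

    `profiles` maps a method name to a list of booleans, one per residue.
    By default a residue is reported when a strict majority of the methods
    flag it; this voting rule is an assumption of the sketch.
    """
    if not profiles:
        raise ValueError("need at least one error profile")
    lengths = {len(flags) for flags in profiles.values()}
    if len(lengths) != 1:
        raise ValueError("all profiles must cover the same residues")
    if min_votes is None:
        min_votes = len(profiles) // 2 + 1
    n_residues = lengths.pop()
    return [
        i
        for i in range(n_residues)
        if sum(bool(flags[i]) for flags in profiles.values()) >= min_votes
    ]
```

Lowering `min_votes` makes the check more sensitive at the cost of reporting more regions that only a single method considers doubtful.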

A basic requirement for a model is that it have good stereochemistry. The most useful programs for evaluating stereochemistry are PROCHECK [200], PROCHECK-NMR [201], AQUA [201], SQUID [202], and WHATCHECK [203]. The features of a model that are checked by these programs include bond lengths, bond angles, peptide bond and side chain ring planarities, chirality, main chain and side chain torsion angles, and clashes between non-bonded pairs of atoms. In addition to good stereochemistry, a model also has to have low energy according to a molecular mechanics force field, such as that of CHARMM22 [80]. However, low molecular mechanical energy does not ensure