
5. Protein Structure Comparison and Classification


As can be seen, there are gray areas in the SCOP database and scientists managing SCOP use their discretion in deciding these classifications.

CATH (Orengo et al., 1997) is built using the SSAP program (Orengo and Taylor, 1996) and is also tuned manually. Like SCOP, it is hierarchical and has four levels: class, architecture, topology, and homologous superfamily. Class is similar to SCOP's class: proteins are categorized based on the types of SSEs they contain. The architecture-level classification of a protein is assigned manually based on the orientation of its SSEs in 3D, without considering their connectivity. At the topology level, both SSE orientation and the connections between SSEs are taken into account. At the lowest level, proteins are grouped into homologous superfamilies when there is sufficient evidence of an evolutionary relationship.

FSSP (Fold classification based on Structure–Structure alignment of Proteins) (Holm and Sander, 1996) is constructed using the DALI program (Holm and Sander, 1993). A representative set of all available protein structures is selected, and all proteins in this set are aligned against each other. The resulting Z-scores are used to build a fold tree with an average-linkage clustering algorithm. FSSP is fully automated.

Getz et al. (2002) assign SCOP and CATH classifications to query proteins using FSSP; the assignment is based on the scores of the proteins in FSSP's answer set for the query. They found strong correlations between the two classification schemes: the CATH topology of a protein can be identified from its SCOP fold with 93% accuracy, and the SCOP fold of a protein can be identified from its CATH topology with 82% accuracy.

5.5.2 Automated Classification

With the exponential growth in the number of newly determined protein structures, manually curated databases have become harder to maintain. Automated classification schemes that produce classifications of a quality comparable to the manual ones are therefore needed. Automated techniques are required to find proteins with similar structural features in a database containing thousands of structures, under a variety of distance metrics. They are equally important for newly determined structures, where the appropriate category must be identified: it is not reasonable to expect every researcher to examine thousands of proteins in order to decide the classification of a new protein.

A number of schemes have been proposed for automated classification (Getz et al., 2002; Lindahl and Elofsson, 2000), fold recognition (Lundstrom et al., 2001; Portugaly and Linial, 2000), and structure prediction from protein sequences via prediction servers (Fischer, 2003; Kim et al., 2004). However, these schemes consider only sequence information and ignore other available sources of information, such as structure. An approach developed in Can et al. (2004) uses sequence and structure information simultaneously for the automated classification of proteins; it is discussed in detail next.


In the protein classification problem, there is an existing classification scheme such as SCOP, and there is a set of new proteins that has to be classified based on the implicit rules and conventions of the existing classification. A classification algorithm simplifies the update of databases such as SCOP that have to accommodate many new proteins periodically. The main questions that need to be answered during classification are:

Does the query protein belong to an existing category (family/superfamily/fold), or does it need a new category to be defined?

If the query protein belongs to an existing category, what is its classification (label)?

The hierarchical nature of classification databases simplifies the classification problem. For example, if a protein belongs to a family, then its superfamily and fold are known. In the case of SCOP, the easiest classification level is family and the hardest one is fold. A hierarchical classifier can take advantage of this. The first attempt is to assign the protein to a family. If this is successful, then the other levels are already known. In case the protein does not belong to an existing family, a superfamily level classification is attempted. If the protein does not belong to an existing superfamily, a fold level classification is attempted.
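This cascade can be summarized in a few lines of code. The sketch below is only illustrative: classify_at_level is a hypothetical helper standing in for whatever family-, superfamily-, or fold-level classifier is available, returning a category label or None when no category passes its cutoff.

```python
# Minimal sketch of the hierarchical classification cascade described above.
# classify_at_level(query, level) is a hypothetical helper that returns the
# best-matching category at the given SCOP level, or None if nothing passes
# the cutoff for that level.

def classify_hierarchically(query, classify_at_level):
    """Try family first; fall back to superfamily, then fold."""
    for level in ("family", "superfamily", "fold"):
        label = classify_at_level(query, level)
        if label is not None:
            # A family hit implies the superfamily and fold; a superfamily
            # hit implies the fold; a fold hit fixes only the fold.
            return level, label
    return None, None  # candidate for a new fold
```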

5.5.3 Building a Classifier from a Comparison Tool

Given a protein comparison tool (sequence or structure), a classifier can be designed by comparing a given query against all proteins of known classification. The similarity scores thus obtained provide a measure of proximity of the query protein to the categories defined in the classification. After sorting these scores, the first classification question (does the query belong to an existing category?) can be answered as follows: if no category shows a similarity greater than some predefined cutoff, the query protein is classified as not belonging to an existing category. The cutoff can be determined by investigating the score distribution of each tool.

A 1-nearest-neighbor (1NN) classification can be used for the second classification question (what is the category of the query protein?). The query protein is assigned to the category of the protein found to be most similar, i.e., the protein that receives the highest score from the comparison tool.
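A minimal sketch of such a classifier is given below, assuming a generic score(query, target) function that wraps one of the comparison tools (higher scores meaning more significant similarity) and a database dictionary mapping proteins of known classification to their category labels; both names are placeholders, not part of any of the cited tools.

```python
# Sketch of a 1-nearest-neighbor classifier built from a comparison tool.
# `score(query, target)` stands in for any of the tools discussed above;
# `database` maps protein identifiers to their known category labels.

def one_nn_classify(query, database, score, cutoff):
    best_id, best_score = None, float("-inf")
    for prot_id in database:
        s = score(query, prot_id)
        if s > best_score:
            best_id, best_score = prot_id, s
    if best_score < cutoff:
        return None           # question 1: no existing category fits well enough
    return database[best_id]  # question 2: label of the nearest neighbor
```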

In Can et al. (2004), five component classifiers were defined using two sequence and three structure comparison tools. The first sequence tool uses the Hidden Markov Model (HMM) library from the SUPERFAMILY database (Gough, 2002). This library is manually curated to classify proteins at the SCOP superfamily level. The models in the SUPERFAMILY database can be searched using HMM-based search tools such as HMMER (Eddy, 1998) and SAM (Hughey and Krogh, 1995). These tools assign a similarity score to a protein sequence according to its match with a model. This classifier is referred to as HMMER.

The second sequence comparison tool is PSI-Blast (Altschul and Koonin, 1998). PSI-Blast is an improved version of Blast that works in iterations: in the first iteration, Blast is run and a new scoring scheme is created based on the set of close neighbors; this process of searching and redefining the score matrix can then be repeated.

Fig. 5.14 Comparison of VAST and DALI scores for a set of proteins.

The structure comparison tools used were CE (Shindyalov and Bourne, 1998), VAST (Madej et al., 1995), and DALI (Holm and Sander, 1993). Each of these tools performs comparisons with a different technique. As a result, they provide a different view of structural relationships between proteins. For example, even though both VAST and DALI compare structures, they assign different scores to the same pair of proteins, as can be seen in Fig. 5.14. [In a similar study, Shindyalov and Bourne (2000) compared CE and DALI scores and showed that there were many proteins that were found similar by CE and dissimilar by DALI, and vice versa.] By exploiting these differences, one can achieve better performance with a combination of the tools.

Each sequence- and structure-comparison method described above assigns a score to a pair of proteins that indicates the statistical significance of their similarity. In particular, the Z-scores reported by CE and DALI, the p-values reported by VAST, and the e-values (as log(e-value)) reported by HMMER and PSI-Blast are used as the similarity scores.
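Because the tools report their significance in different units, a small normalization step is convenient before the scores are compared or binned. The sketch below is a hedged illustration of one way to map all outputs onto a common larger-is-more-significant scale; the exact convention used in Can et al. (2004) is not reproduced here.

```python
import math

# Illustrative normalization: map raw tool outputs onto a common
# "larger means more significant" footing before any further processing.

def to_similarity(tool, raw):
    if tool in ("CE", "DALI"):          # Z-scores: already larger-is-better
        return raw
    if tool == "VAST":                  # p-values: smaller is more significant
        return -math.log10(max(raw, 1e-300))
    if tool in ("HMMER", "PSI-Blast"):  # e-values: smaller is more significant
        return -math.log10(max(raw, 1e-300))
    raise ValueError(f"unknown tool: {tool}")
```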

5.5.3.1 Performance of Component Classifiers

The individual performance (accuracy) of the tools when used as component classifiers was tested in a number of experiments. It was assumed that the classifications of all proteins in SCOP v1.59 (DS159) are known, and the goal of the component classifiers was to classify the new proteins introduced in SCOP v1.61 (QS161) into families, superfamilies, and folds. A hierarchical classification scheme is used (Camoglu et al., 2005): at the family level, all new proteins are queried; at the superfamily level, only proteins without family-level similarities are queried; at the fold level, only proteins without family- or superfamily-level similarities are queried.


At the family level, the sequence tools outperform the structure tools by achieving 94.5 and 92.6% accuracy. The highest success rate of the structure tools is only 89% by VAST. For 83% of the query proteins, all five tools make correct decisions (as to whether the query protein belongs to an existing category), for 4.1% of the queries four tools, for 1.9% of the queries three tools, for 6.9% of the queries two tools, and for 2.7% of the queries only one tool makes the correct decision. An interesting point here is that for 98.2% of the proteins, at least one tool is successful. So, it is theoretically possible to classify up to 98.2% of these proteins correctly by combining the results of the individual tools.

At the superfamily level, the performance of the structure tools improves relative to the sequence tools, as expected. However, the overall performance of the tools drops significantly compared to the family level. This is expected since classification at the superfamily level is harder. HMMER has one of the best performances with a 79.1% accuracy; this is no surprise considering that HMMER is manually tuned for superfamily classifications. Among the structure tools, VAST has the best performance with a 78.6% success rate. PSI-Blast performs poorly with a success rate of only 66.1%. Only 44.7% of the queries can be classified correctly by all five tools. For 4% of the queries, none of them is successful. This again raises the possibility of achieving better accuracy through a combination of the tools. These results are depicted in Fig. 5.15.

Fig. 5.15 Performance of individual classifiers on recognizing the members of existing superfamilies, and on assigning categories to them, for the new proteins in SCOP v1.61. The first set of bars represents the performance for recognition of existing members; the second set represents the performance for the assignment of categories.

Structure tools outperform sequence tools at the fold level. VAST has the best performance with an 85% success rate. PSI-Blast has the worst performance, with a 60.7% success rate. All five tools are successful for only 30.9% of the queries (Camoglu et al., 2005).

Once a tool has marked a new protein as a member of an existing category, the classification of the query protein is completed by assigning it to the same category as its nearest neighbor. The next question is the accuracy of this assignment, i.e., whether the assigned category is the correct one. The accuracies of the tools are high at the family level: all except DALI have success rates above 90%. HMMER has the best performance with 94.8% accuracy, followed by PSI-Blast with 92.3%. For 76.5% of the queries, all five tools assign the correct family label; for only 2.1% of the queries, none of them succeeds. The superfamily results can be seen in Fig. 5.15. At the fold level, all tools perform poorly: PSI-Blast is not able to make even one correct fold assignment, whereas VAST assigns correct folds to 54% of the queries. For 35.1% of the queries, none of the tools is able to assign the correct fold label.

5.5.4 Automated Classification Using Ensemble Classifier

As evident from the earlier experiments, an ensemble classifier can potentially obtain higher classification accuracy than any single component classifier. There are many studies in the area of machine learning and pattern recognition that address the intelligent design of ensemble classifiers (Duda et al., 2001; Meir and Ratsch, 2003; Schapire and Singer, 1999). These include both competitive models (e.g., bagging and binning) and collaborative models (e.g., boosting).

Camoglu et al. (2005) employ a hierarchical decision tree to decide whether the query protein belongs to an existing category. To combine different tools, their results must be normalized to a consistent scale. This is achieved by dividing the scores into bins: a bin corresponds to a tool-neutral accuracy range (e.g., 90–100%, 80–100%) rather than a tool-specific similarity score. The bins for each tool are crafted manually to achieve maximum performance. All proteins' 1NN scores are obtained and sorted, bin boundaries are placed on this sorted list, and an accuracy is computed for each bin. For example, if bin i is labeled with an accuracy of x%, then the predicted accuracy of any protein whose score falls in that bin is x%.
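A minimal sketch of this binning step is shown below, assuming the 1NN results for a labeled training set are available as (score, correct) pairs and the bin boundaries have been chosen by hand; the helper names are illustrative only.

```python
# Sketch of the binning idea: estimate the accuracy within manually chosen
# score bins, then use the bin accuracy as the predicted confidence for new
# queries. `results` is a list of (score, correct) pairs from the training
# set; `boundaries` are hand-picked bin edges in descending score order.

def bin_accuracies(results, boundaries):
    bins = [[] for _ in range(len(boundaries) + 1)]
    for score, correct in results:
        idx = sum(score < b for b in boundaries)  # index of the bin this score falls in
        bins[idx].append(correct)
    return [sum(b) / len(b) if b else None for b in bins]

def predicted_accuracy(score, boundaries, accuracies):
    # A query inherits the accuracy label of the bin its 1NN score falls into.
    return accuracies[sum(score < b for b in boundaries)]
```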

After the bins for each tool are constructed, a decision tree is created. At each level of this tree, a combination of tools is run and, depending on their decisions, the query protein is classified as a member of an existing category or as new. The decision tree for the family level is presented in Fig. 5.16. At the first level of the tree, PSI-Blast and HMMER are run to classify the query protein. Each tool reports a confidence for its classification decision, and these decisions and confidences are merged (Can et al., 2004). If the confidence of the consensus decision is greater than 95%, the query protein is classified as a member of an existing family. If the confidence is lower than 60%, it is classified as a member of a new family. If the confidence is between 60 and 95%, the tools at this level cannot make a confident decision, and the next level of the tree is used. At the next level, all of the structure tools (CE, VAST, and DALI) are run for a consensus decision; there are again two thresholds, 85 and 70%. At the last level, there is only one threshold, 80%: if the confidence is greater than this threshold, the query protein is classified as a member of an existing family, otherwise as a member of a new family.

Fig. 5.16 The decision tree for recognizing whether a query protein belongs to an existing family. Level 1: HMM+Blast; level 2: CE + VAST + DALI; level 3: HMM+Blast. At each level, a high consensus confidence assigns the query to an existing family, a low confidence assigns it to a new family, and intermediate confidences fall through to the next level.
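The cascade of Fig. 5.16 can be written down directly from the thresholds in the text. In the sketch below, confidence(tools, query) is a stand-in for the consensus step of Can et al. (2004), for example an average of the tools' binned accuracies; it is an assumption of this illustration, not their exact merging rule.

```python
# Sketch of the family-level decision tree of Fig. 5.16. The thresholds follow
# the text; `confidence(tools, query)` is a placeholder for the consensus of
# the listed tools' binned confidences, returned as a value in [0, 1].

def family_decision(query, confidence):
    c = confidence(("HMMER", "PSI-Blast"), query)    # level 1: sequence tools
    if c > 0.95:
        return "existing family"
    if c < 0.60:
        return "new family"

    c = confidence(("CE", "VAST", "DALI"), query)    # level 2: structure tools
    if c > 0.85:
        return "existing family"
    if c < 0.70:
        return "new family"

    c = confidence(("HMMER", "PSI-Blast"), query)    # level 3: single threshold
    return "existing family" if c > 0.80 else "new family"
```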

A decision tree has two components: (1) the combination of tools at each level and (2) the thresholds at each level. The first issue is addressed by using domain knowledge of the classification schemes together with the classification results of the component classifiers (e.g., Fig. 5.15). For example, at the family level it is known that sequence tools should have priority, whereas at the superfamily and fold levels structure tools are more important. Finding the right thresholds requires more care. The thresholds could be determined automatically by examining different choices and selecting the values that maximize accuracy on the training data, but this approach tends to overfit. The distributions are therefore analyzed manually and the thresholds set on the basis of this analysis. An example of such an analysis is depicted in Fig. 5.17: although placing the cutoff at point A2 is more suitable for the training data, there is a natural separation at point M2; if A2 is used as the threshold, it overfits the data and the eventual performance suffers. The thresholds determined for superfamily- and fold-level classification are shown in Table 5.1.

Fig. 5.17 An example histogram of the confidence levels for the training data. A1 and A2 are the thresholds found by the automated greedy approach; M1 and M2 are the manual thresholds. A1 and A2 overfit the data, whereas M1 and M2 do not.

Table 5.1 Heuristic decision tree rules for recognition of members of existing superfamilies and folds. At each level, a combination of tools is run and the probability of being a member of an existing category is assigned to each protein. Proteins with probabilities above the indicated range are assigned to the predicted category, those within the range are passed to the next step, and those below the range are deemed new. For the last level, only a single threshold exists.

              Level 1   Range      Level 2   Range      Level 3   Threshold
Superfamily   VAST      45%:93%    HMMER     40%:75%    CE+DALI   55%
Fold          VAST      50%:85%    CE        80%:90%    DALI      60%

5.5.5 Experimental Analysis

To validate that the consensus classifier indeed improves the classification performance, the standard validation technique in pattern recognition is applied (Duda et al., 2001). Two data sets are used: a training set and a test set. The ensemble classifier is trained on the proteins introduced in SCOP 1.61 using the classifications of proteins in SCOP 1.59, and it is validated on the proteins introduced in SCOP 1.63 using the classifications of proteins in SCOP 1.61. The automated ensemble classifier performed well in assigning proteins to their correct classifications, achieving 98% success for family assignments, 83% for superfamily assignments, and 61% for fold assignments. The performance of each tool and of the ensemble classifier for superfamily classification (training and evaluation phases) is shown in Table 5.2. The benefits of an ensemble classifier are obvious. Complete results appear in Camoglu et al. (2005).

Table 5.2 Performance of individual tools and the ensemble classifier on the training and test data sets. The performance is measured for two cases: Existing (ability to recognize the proteins from the existing classifications) and Assignment (ability to assign correct classifications).

                         Existing                Assignment
                  Training   Evaluation    Training   Evaluation
Superfamily
  HMMER            78.63       79.29        68.9        39.18
  CE               71.77       38.03        81.1        43.81
  VAST             78.63       66.67        81.71       55.67
  DALI             77.42       75.24        80.49       74.74
  PSI-Blast        66.13       31.39        11.59       24.23
  Ensemble         80.65       83.82        93.29       82.99

5.6 Concluding Remarks

A number of algorithms for protein comparison were presented in this chapter. Besides a survey of the current state of the art of techniques for problems in structure analysis such as pairwise alignment, multiple alignment, and motif discovery, techniques that scale to large data sets were discussed in depth. In this vein, an approach based on index structures was presented for pairwise protein structure comparison. In this technique, SSE-based features were extracted from database proteins and inserted into an index structure. Given a query protein, its features were also extracted and compared using the index. A graph-based approach was used to extend and evaluate the matches. Embedding this algorithm into well-known pairwise structure analysis algorithms led to significant speed-ups. As data sets such as the PDB grow in size, scalability of structure comparison algorithms will be an important factor.

Classification of proteins was also presented. A number of existing classification schemes based on protein structure were discussed, and a technique for the automated classification of proteins was investigated in depth. This approach combines a number of sequence and structure comparison tools in a decision tree to increase classification accuracy. The use of multiple data sources is an increasingly important theme in biological analysis; some aspects of such an approach were illustrated here by the use of multiple sequence and structure comparison tools for protein classification.

Comparing protein structures quantitatively poses a number of challenges: defining an appropriate notion of similarity, developing algorithms based on these notions that scale to an exponentially increasing number of structures, and understanding the significance of a score. The substantial amount of research in this field reflects these challenges and the current level of activity. Other opportunities for future research include flexible models for alignment, the simultaneous use of sequence and structure in pairwise and multiple alignment, and the evolutionary characterization of structures and protein–protein interactions.

5.7 References and Resources

5.7.1 Definitions

Protein structure alignment: one-to-one mapping between the residues of two protein structures.


Root-mean-square distance: the root-mean-square distance (RMSD) between two proteins A and B under a correspondence R of size k and a transformation f is defined as

\[
\mathrm{RMSD}(A, B, R, f) \;=\; \sqrt{\frac{1}{k}\sum_{i=1}^{k} \operatorname{dist}^{2}\!\bigl(a_i,\, f(R(a_i))\bigr)}
\]

where a_i denotes the ith aligned residue of A under the correspondence R.
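Given a correspondence and a transformation that has already been applied, the formula translates directly into code. The sketch below assumes `pairs` holds the corresponding coordinate triples (a_i, f(R(a_i))); computing the optimal superposition itself (e.g., the least-squares fit of Arun et al., 1987) is a separate step.

```python
import math

# Direct transcription of the RMSD formula above. `pairs` is a list of
# ((x, y, z), (x, y, z)) tuples: the coordinates of a_i and of its
# correspondent f(R(a_i)), already transformed into A's frame.

def rmsd(pairs):
    k = len(pairs)
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in pairs
    )
    return math.sqrt(total / k)
```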

Distance matrix: a two-dimensional matrix where each entry Mij stores the Euclidean distance between the ith and jth residues of a protein.

Structural motif: a substructure that is common in the structures of a set of proteins.

Multiple structure alignment: the alignment of a set of related proteins that results in a consensus structure which has the minimum RMSD sum to the protein structures in the set.

Feature vector: a vector composed of numerical values that summarizes the properties of an object.

Index structure: a data structure that organizes a set of objects and supports efficient retrieval.

5.7.2 Resources

Name        Description                    Link
CATH        Classification                 http://cathwww.biochem.ucl.ac.uk/latest/
CE          Pairwise & multiple alignment  http://cl.sdsc.edu/
DALI        Pairwise alignment             http://www.ebi.ac.uk/dali/
FSSP        Classification                 http://ekhidna.biocenter.helsinki.fi/dali/start
MASS        Multiple alignment             http://bioinfo3d.cs.tau.ac.il/MASS/
MultiProt   Multiple alignment             http://bioinfo3d.cs.tau.ac.il/MultiProt/
PDB         Structure repository           http://www.rcsb.org/pdb/Welcome.do
ProtDex     Database search                http://xena1.ddns.comp.nus.edu.sg/genesis/PD2˜.htm
PSI         Database search                http://bioserver.cs.ucsb.edu/proteinstructuresimilarity.php
SCOP        Classification                 http://scop.mrc-lmb.cam.ac.uk/scop/
SSAP        Pairwise alignment             http://www.cathdb.info/cgi-bin/cath/GetSsapRasmol.pl
Trilogy     Motif finding                  http://theory.lcs.mit.edu/trilogy/
URMS        Pairwise alignment             http://cbsusrv01.tc.cornell.edu/urms/
VAST        Pairwise alignment             http://www.ncbi.nih.gov/Structure/VAST/vastsearch.html

5.8 Further Reading

For readers interested in pairwise protein structure comparison, we recommend “Protein structure comparison by alignment of distance matrices” by Holm and Sander (1993), “Approximate protein structural alignment in polynomial time” by Kolodny and Linial (2004), and “Sensitivity and selectivity in protein structure comparison” by Sierk and Pearson (2004). For more information on motif detection and multiple structure alignment, we recommend “Discovery of sequence–structure patterns across diverse proteins” by Bradley et al. (2002) and “MASS: multiple structural alignment by secondary structures” by Dror et al. (2003). More information about database searches can be found in “Index-based similarity search for protein structure databases” by Camoglu et al. (2004) and “Rapid 3d protein structure database searching using information retrieval techniques” by Aung and Tan (2004). For further discussion of the comparison of classification databases and automated classification techniques, we recommend “Automated assignment of SCOP and CATH protein structure classifications from FSSP scores” by Getz et al. (2002) and “Decision tree based information integration for automated protein classification” by Camoglu et al. (2005).

Acknowledgment

This work was supported in part by NSF grants DBI-0213903 and EF-0331697.

References

Alexandrov, N., and D. Fischer. 1996. Analysis of topological and nontopological structural similarities in the PDB: New examples from old structures. Proteins 25:354–365.

Altschul, S. F., and E. V. Koonin. 1998. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem Sci. 23:444–447.

Arun, K., T. Huang, and S. Blostein. 1987. Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell. 9:698–700.

Aung, Z., and K.-L. Tan. 2004. Rapid 3d protein structure database searching using information retrieval techniques. Bioinformatics 20:1045–1052.

Beckmann, N., H.-P. Kriegel, R. Schneider, and B. Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pp. 322–331, Atlantic City, NJ.

Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. 2000. The Protein Data Bank. Nucleic Acids Res. 28:235–242.

Binkowski, T. A., B. DasGupta, and J. Liang. 2004. Order independent structural alignment of circularly permuted proteins. In IEEE EMBS, July.

Bradley, P., P. S. Kim, and B. Berger. 2002. TRILOGY: Discovery of sequence–structure patterns across diverse proteins. Proc. Natl. Acad. Sci. USA 99:8500–8503.

Brown, N., C. Orengo, and W. Taylor. 1996. A protein structure comparison methodology. Comput. Chem. 20:359–380.