Foundation of Mathematical Biology / The Elements of Statistical Learning
Peptide Binding: Background
Milik M, Sauer D, Brunmark AP et al.,
Nature Biotechnology, 16:753-6, 1998.
Predict the amino acid sequences of peptides that bind to the particular MHC class I molecule, Kb.
The peptides of interest are 8-mers which may result from proteolysis of invading viral particles.
Some bind to class I MHC molecules.
These complexes are presented on the infected cell surface, where they are recognized by cytotoxic T lymphocytes, which destroy the infected cell.
Hence, MHC binding is an essential prerequisite for any peptide to induce an immune response
⇒ the task of identifying peptides that bind to MHC molecules is immunologically important.
Peptide Binding: Problem
Studies have shown that binding peptides typically have
specific amino acids at specific anchor positions.
Rules for predicting binding based solely on anchor position preferences, motifs, are inadequate.
Binding is also known to be influenced by
(i) presence of secondary anchor positions, and
(ii) between-position amino acid interactions.
It is the search for this more complex structure that constitutes the problem of interest.
Complex structure ⋈ Artificial Neural Networks.
[Figure: eight bar-chart panels, Position 1 through Position 8. Each panel shows the relative frequency (y-axis) of each amino acid (x-axis, one-letter codes A to Y) among Binders and Non-Binders.]
Peptide Binding: Data Structure, Issues
Binary outcome: Binding (yes/no).
8 unordered categorical covariates:
the amino acids at the respective positions.
Highly polymorphic data: 18, 20, 20, 20, 20, 20, 19, 20 distinct amino acids at positions 1 to 8, respectively.
Key concerns: large number of corresponding indicator variables, between-position interactions.
To avert related difficulties, Milik et al. use selected biophysical and biochemical properties of amino acids: adequacy? ⇒ potential information loss.
This structure is representative of a vast class of problems: Genotype ↦ Phenotype.
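The indicator-variable coding mentioned above can be sketched as follows. This is a minimal illustration, not the coding used by Milik et al.: the function name `one_hot_peptide` and the example string are assumptions, and the full 20-letter alphabet is used at every position without dropping unobserved or reference levels.

```python
# Standard one-letter amino acid codes (20 letters).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_peptide(peptide, alphabet=AMINO_ACIDS):
    """Encode an 8-mer as a flat list of 0/1 indicators, one block of
    len(alphabet) indicators per position (hypothetical helper)."""
    if len(peptide) != 8:
        raise ValueError("expected an 8-mer")
    indicators = []
    for residue in peptide:
        if residue not in alphabet:
            raise ValueError(f"unknown amino acid: {residue!r}")
        # Exactly one indicator in this block is 1: the observed residue.
        indicators.extend(1 if residue == aa else 0 for aa in alphabet)
    return indicators

# Illustrative 8-mer only; any string over the alphabet would do.
x = one_hot_peptide("SIINFEKL")
print(len(x), sum(x))  # 160 8
```

Even before any interactions are considered, each peptide becomes a 160-column row with only eight nonzero entries, which is the source of the "large number of indicator variables" concern.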
Peptide Binding: Regression Difficulties
Problems occur irrespective of outcome type.
Regression modelling of binding:
Default starting model includes each position. This entails estimating 149 coefficients;
just assimilating the output will be difficult.
This is for a simple model in a small (8-mer) setting.
Adjacent and/or second-nearest-neighbor amino acids affect the ability to bind to MHC:
this suggests including third-order interactions.
But there are problems even for second-order interactions: SAS and S-Plus break for lack of dynamic memory. This is not remedied by expansion or forward selection.
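The coefficient counts behind these difficulties follow from the per-position level counts given earlier (18, 20, 20, 20, 20, 20, 19, 20). The sketch below assumes reference-level (treatment) coding, so a position with k observed amino acids contributes k - 1 indicators; the variable names are illustrative.

```python
from itertools import combinations

# Distinct amino acids observed at positions 1..8 (from the data description).
levels = [18, 20, 20, 20, 20, 20, 19, 20]

# Reference-level coding: each position contributes (levels - 1) indicators.
main_df = [k - 1 for k in levels]
n_main = sum(main_df)

# Every pair of positions contributes the product of their indicator counts.
n_pairwise = sum(a * b for a, b in combinations(main_df, 2))

# Restricting to adjacent positions (1-2, 2-3, ..., 7-8) is smaller but still large.
n_adjacent = sum(a * b for a, b in zip(main_df, main_df[1:]))

print(n_main)      # 149
print(n_pairwise)  # 9711
print(n_adjacent)  # 2451
```

The main-effects count reproduces the 149 coefficients quoted above (excluding the intercept). Adding all second-order interactions brings thousands of further terms, which is consistent with the memory failures reported.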
Full Tree // Training data
[Figure: full classification tree grown on the training data. Root node: class 1, 92/223. First split on pos8 (A,C,D,E,G,H,K,N,P,Q,R,S,T,V,W versus F,I,L,M,Y), giving nodes labelled 0 (17/101) and 1 (8/122). Subsequent splits on pos1, pos5, pos2 and pos6.]
Tree Deviance versus Tree Size // Test data
[Figure: test-data deviance (y-axis, roughly 80 to 120) plotted against tree size (x-axis, 2 to 8).]
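For reference, the deviance plotted against tree size can be computed from leaf-level counts. The sketch below uses the usual binomial deviance, minus twice the log-likelihood of the observed classes under the leaf probabilities estimated from training data. The function name, the clipping constant and all numbers are hypothetical; none are taken from the peptide data.

```python
import math

def tree_deviance(leaves, eps=1e-6):
    """Binomial deviance of a tree evaluated on test data.

    Each leaf is (p_hat, n_binders, n_total): p_hat is the binder probability
    estimated in that leaf from training data; the counts are test cases
    falling in that leaf. eps clips p_hat to avoid log(0) for pure leaves
    (an assumption of this sketch).
    """
    log_lik = 0.0
    for p_hat, n_binders, n_total in leaves:
        p = min(max(p_hat, eps), 1.0 - eps)
        n_non = n_total - n_binders
        log_lik += n_binders * math.log(p) + n_non * math.log(1.0 - p)
    return -2.0 * log_lik

# Hypothetical one-leaf and two-leaf trees scored on the same 40 test cases.
stump = [(0.5, 20, 40)]
split = [(0.9, 18, 20), (0.1, 2, 20)]
print(tree_deviance(stump), tree_deviance(split))
```

Repeating this for each tree in a pruning sequence gives one point per tree size. The size with the lowest test deviance is a natural candidate for the final tree.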
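Predictions: test data
[Figure: four-leaf tree applied to the test data. Root node: class 1, 37/87. Split on pos8: A,C,D,E,G,H,K,N,P,Q,R,S,T,V,W (class 0, 7/37) versus F,I,L,M,Y (class 1, 7/50). Second-level splits on pos1 (A,C,D,E,F,G,H,I,K,L,N,P,R,V versus Q,S,T,Y) and pos5 (E,P,S,T,V versus A,F,I,L,M,N,Y). Leaf labels as extracted: 0, 0, 0, 1 with counts 1/23, 6/14, 0/1, 2/44.]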
Peptide Binding: Tree Attributes
A salient feature of trees with unordered categorical covariates (amino acids) is their flexible (exhaustive) and automated handling of groups of levels: this avoids computing and examining individual coefficients, and covariate integrity is preserved.
Interactions are readily accommodated.
Easy interpretation and prediction via the tree schematic.
An oft-cited deficiency of tree methods is that piecewise-constant response surfaces provide poor or inefficient approximations to smooth response surfaces; this motivated the MARS modifications (HTF, Section 9.4).
Here such concerns are moot. The notion of a smooth response surface requires ordered covariates; otherwise there is nothing to be smooth with respect to.
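To make the "groups of levels" point concrete: a covariate with q unordered levels admits 2^(q-1) - 1 binary partitions (524,287 for q = 20). For a binary outcome with Gini or entropy impurity, a standard shortcut is to order the levels by their proportion of class 1 and examine only the q - 1 cuts along that ordering. The sketch below illustrates this shortcut on made-up data. The function names and the toy sample are assumptions, and this is not necessarily the procedure that produced the trees shown earlier.

```python
from collections import defaultdict

def gini(n_ones, n):
    """Gini impurity of a node with n cases, n_ones of them in class 1."""
    p = n_ones / n
    return 2.0 * p * (1.0 - p)

def best_level_split(levels, y):
    """Best binary grouping of a categorical predictor for a 0/1 outcome.

    Levels are sorted by their proportion of 1s; only the cuts along that
    ordering are scored, by weighted Gini impurity of the two child nodes.
    Returns (impurity, left_levels, right_levels), or None if only one level.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [n_ones, n_total]
    for lev, yi in zip(levels, y):
        counts[lev][0] += yi
        counts[lev][1] += 1
    ordered = sorted(counts, key=lambda lev: counts[lev][0] / counts[lev][1])
    n, n_ones = len(y), sum(y)
    best = None
    left_n = left_ones = 0
    for cut in range(1, len(ordered)):
        lev = ordered[cut - 1]
        left_ones += counts[lev][0]
        left_n += counts[lev][1]
        right_n, right_ones = n - left_n, n_ones - left_ones
        impurity = (left_n * gini(left_ones, left_n)
                    + right_n * gini(right_ones, right_n)) / n
        if best is None or impurity < best[0]:
            best = (impurity, set(ordered[:cut]), set(ordered[cut:]))
    return best

# Toy data (hypothetical): residue at one position and a 0/1 binding label.
residues = list("LLLMMFYKKDDEAAG")
binds = [1] * 7 + [0] * 8
impurity, left, right = best_level_split(residues, binds)
print(impurity, sorted(left), sorted(right))
```

The result is a pair of level groups, which is exactly the form of the pos8 and pos5 splits in the tree figures. No per-level coefficient is ever estimated.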