Brereton, Chemometrics

PATTERN RECOGNITION 221
Similarly, procrustes analysis in chemistry involves comparing two diagrams, such as two PC scores plots. One such plot is the reference and a second plot is manipulated to resemble the reference plot as closely as possible. This manipulation is done mathematically involving up to three main transformations.
1. Reflection. This transformation is a consequence of the inability to control the sign of a principal component.
2. Rotation.
3. Scaling (or stretching). This transformation is used because the scales of the two types of measurements may be very different.
4. Translation.
If two datasets are already standardised, transformation 3 may not be necessary, and the fourth transformation is not often used.
The aim is to reduce the root mean square difference between the scores of the reference dataset and the transformed dataset:
$$
d = \sqrt{\sum_{i=1}^{I} \sum_{a=1}^{A} \left({}^{\mathrm{ref}}t_{ia} - {}^{\mathrm{trans}}t_{ia}\right)^{2} \Big/ I}
$$
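This fit can be sketched numerically with NumPy. The function name and the synthetic test data below are our own (not from the text), and centring/translation is omitted for brevity; the reflection/rotation step is the standard orthogonal Procrustes solution via the SVD.

```python
import numpy as np

def procrustes_fit(ref, target):
    """Reflect/rotate and scale `target` (an I x A score matrix) so that
    it matches `ref` as closely as possible; return the transformed
    scores and the root mean square residual between the two."""
    # Orthogonal Procrustes: the rotation/reflection matrix R minimising
    # ||ref - target @ R||_F comes from the SVD of target.T @ ref.
    u, _, vt = np.linalg.svd(target.T @ ref)
    R = u @ vt
    rotated = target @ R
    # Least-squares scaling (stretching) factor.
    c = np.trace(ref.T @ rotated) / np.trace(rotated.T @ rotated)
    trans = c * rotated
    rms = np.sqrt(np.sum((ref - trans) ** 2) / ref.shape[0])
    return trans, rms

# A rotated, reflected and stretched copy of a reference scores plot
# should be matched essentially perfectly.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
target = 2.0 * ref @ rot
target[:, 0] *= -1                 # reflection
_, rms = procrustes_fit(ref, target)
print(rms < 1e-8)                  # True
```

Because rotation, reflection and scaling are all undone exactly here, the residual is zero to machine precision; for two real scores plots the residual measures how closely the plots can be made to agree.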
For case study 2, it might be of interest to compare performances using different mobile phases (solvents). The original data were obtained using methanol: are similar separations achievable using acetonitrile or THF? The experimental measurements are presented in Tables 4.14 and 4.15. PCA is performed on the standardised data (transposing the matrices as appropriate). Figure 4.21 illustrates the two scores plots using
[Scores plot: PC1 (horizontal axis) against PC2 (vertical axis), with points labelled Inertsil ODS, Inertsil ODS-2, Inertsil ODS-3, Kromasil C-18, Kromasil C8, Symmetry C18, Supelco ABZ+ and Purospher]
Figure 4.21
Comparison of scores plots for methanol and acetonitrile (case study 2)
222 CHEMOMETRICS
Table 4.14 Chromatographic parameters corresponding to case study 2, obtained using acetonitrile as mobile phase.

Parameter | Inertsil ODS | Inertsil ODS-2 | Inertsil ODS-3 | Kromasil C18 | Kromasil C8 | Symmetry C18 | Supelco ABZ+ | Purospher
Pk | 0.13 | 0.07 | 0.11 | 0.15 | 0.13 | 0.37 | 0 | 0
PN | 7 340 | 10 900 | 13 500 | 7 450 | 9 190 | 9 370 | 18 100 | 8 990
PN(df) | 5 060 | 6 650 | 6 700 | 928 | 1 190 | 3 400 | 7 530 | 2 440
PAs | 1.55 | 1.31 | 1.7 | 4.39 | 4.36 | 1.92 | 2.16 | 2.77
Nk | 0.19 | 0.08 | 0.16 | 0.16 | 0.15 | 0.39 | 0 | 0
NN | 15 300 | 11 800 | 10 400 | 13 300 | 16 800 | 5 880 | 16 100 | 10 700
NN(df) | 7 230 | 6 020 | 5 470 | 3 980 | 7 860 | 648 | 6 780 | 3 930
NAs | 1.81 | 1.91 | 1.81 | 2.33 | 1.83 | 5.5 | 2.03 | 2.2
Ak | 2.54 | 1.56 | 2.5 | 2.44 | 2.48 | 2.32 | 0.62 | 0.2
AN | 15 500 | 16 300 | 14 900 | 11 600 | 16 300 | 13 500 | 13 800 | 9 700
AN(df) | 9 100 | 10 400 | 9 480 | 3 680 | 8 650 | 7 240 | 7 060 | 4 600
AAs | 1.51 | 1.62 | 1.67 | 2.6 | 1.85 | 1.72 | 1.85 | 1.8
Ck | 1.56 | 0.85 | 1.61 | 1.39 | 1.32 | 1.43 | 0.34 | 0.11
CN | 14 600 | 14 900 | 13 500 | 13 200 | 18 100 | 13 100 | 18 000 | 9 100
CN(df) | 13 100 | 12 500 | 12 200 | 10 900 | 15 500 | 10 500 | 11 700 | 5 810
CAs | 1.01 | 1.27 | 1.17 | 1.2 | 1.17 | 1.23 | 1.67 | 1.49
Qk | 7.34 | 3.62 | 7.04 | 5.6 | 5.48 | 5.17 | 1.4 | 0.92
QN | 14 200 | 16 700 | 13 800 | 14 200 | 16 300 | 11 100 | 10 500 | 4 200
QN(df) | 12 800 | 13 800 | 11 400 | 10 300 | 12 600 | 5 130 | 7 780 | 2 220
QAs | 1.03 | 1.34 | 1.37 | 1.44 | 1.41 | 2.26 | 1.35 | 2.01
Bk | 0.67 | 0.41 | 0.65 | 0.64 | 0.65 | 0.77 | 0.12 | 0
BN | 15 900 | 12 000 | 12 800 | 14 100 | 19 100 | 12 900 | 13 600 | 5 370
BN(df) | 8 100 | 8 680 | 6 210 | 5 370 | 8 820 | 5 290 | 6 700 | 2 470
BAs | 1.63 | 1.5 | 1.92 | 2.11 | 1.9 | 1.97 | 1.82 | 1.42
Dk | 5.73 | 4.18 | 6.08 | 6.23 | 6.26 | 5.5 | 1.27 | 0.75
DN | 14 400 | 20 200 | 17 700 | 11 800 | 18 500 | 15 600 | 14 600 | 11 800
DN(df) | 10 500 | 15 100 | 13 200 | 3 870 | 12 600 | 10 900 | 10 400 | 8 950
DAs | 1.39 | 1.51 | 1.54 | 2.98 | 1.65 | 1.53 | 1.49 | 1.3
Rk | 14.62 | 10.8 | 15.5 | 15.81 | 14.57 | 13.81 | 3.41 | 2.22
RN | 12 100 | 19 400 | 17 500 | 10 800 | 16 600 | 15 700 | 14 000 | 10 200
RN(df) | 9 890 | 13 600 | 12 900 | 3 430 | 12 400 | 11 600 | 10 400 | 7 830
RAs | 1.3 | 1.66 | 1.62 | 3.09 | 1.52 | 1.54 | 1.49 | 1.32
methanol and acetonitrile, as procrustes rotation has been performed to ensure that they agree as closely as possible, whereas Figure 4.22 is for methanol and THF. It appears that acetonitrile has similar properties to methanol as a mobile phase, but THF is very different.
It is not necessary to restrict each measurement technique to two PCs; indeed, in many practical cases four or five PCs are employed. Computer software is available to compare scores plots and provide a numerical indicator of the closeness of the fit, but it is not easy to visualise. Because PCs do not often have a physical meaning, it is important to recognise that in some cases it is necessary to include several PCs for a meaningful result. For example, if two datasets are characterised by four PCs, and each one is of approximately equal size, the first PC for the reference dataset may correlate most closely with the third for the comparison dataset, so including only the
Table 4.15 Chromatographic parameters corresponding to case study 2, obtained using THF as mobile phase.

Parameter | Inertsil ODS | Inertsil ODS-2 | Inertsil ODS-3 | Kromasil C18 | Kromasil C8 | Symmetry C18 | Supelco ABZ+ | Purospher
Pk | 0.05 | 0.02 | 0.02 | 0.04 | 0.04 | 0.31 | 0 | 0
PN | 17 300 | 12 200 | 10 200 | 14 900 | 18 400 | 12 400 | 16 600 | 13 700
PN(df) | 11 300 | 7 080 | 6 680 | 7 560 | 11 400 | 5 470 | 10 100 | 7 600
PAs | 1.38 | 1.74 | 1.53 | 1.64 | 1.48 | 2.01 | 1.65 | 1.58
Nk | 0.05 | 0.02 | 0.01 | 0.03 | 0.03 | 0.33 | 0 | 0
NN | 13 200 | 9 350 | 7 230 | 11 900 | 15 800 | 3 930 | 14 300 | 11 000
NN(df) | 7 810 | 4 310 | 4 620 | 7 870 | 11 500 | 543 | 7 870 | 6 140
NAs | 1.57 | 2 | 1.76 | 1.43 | 1.31 | 5.58 | 1.7 | 1.7
Ak | 2.27 | 1.63 | 1.79 | 1.96 | 2.07 | 2.37 | 0.66 | 0.19
AN | 14 800 | 15 300 | 13 000 | 13 500 | 18 300 | 12 900 | 14 200 | 11 000
AN(df) | 10 200 | 11 300 | 10 700 | 9 830 | 14 200 | 9 430 | 9 600 | 7 400
AAs | 1.36 | 1.46 | 1.31 | 1.36 | 1.32 | 1.37 | 1.51 | 1.44
Ck | 0.68 | 0.45 | 0.52 | 0.53 | 0.54 | 0.84 | 0.15 | 0
CN | 14 600 | 11 800 | 9 420 | 11 600 | 16 500 | 11 000 | 12 700 | 8 230
CN(df) | 12 100 | 9 170 | 7 850 | 9 820 | 13 600 | 8 380 | 9 040 | 5 100
CAs | 1.12 | 1.34 | 1.22 | 1.14 | 1.15 | 1.28 | 1.42 | 1.48
Qk | 4.67 | 2.2 | 3.03 | 2.78 | 2.67 | 3.28 | 0.9 | 0.48
QN | 10 800 | 12 100 | 9 150 | 10 500 | 13 600 | 8 000 | 10 100 | 5 590
QN(df) | 8 620 | 9 670 | 7 450 | 8 760 | 11 300 | 3 290 | 7 520 | 4 140
QAs | 1.17 | 1.36 | 1.3 | 1.19 | 1.18 | 2.49 | 1.3 | 1.38
Bk | 0.53 | 0.39 | 0.42 | 0.46 | 0.51 | 0.77 | 0.11 | 0
BN | 14 800 | 12 100 | 11 900 | 14 300 | 19 100 | 13 700 | 15 000 | 4 290
BN(df) | 8 260 | 6 700 | 8 570 | 9 150 | 14 600 | 9 500 | 9 400 | 3 280
BAs | 1.57 | 1.79 | 1.44 | 1.44 | 1.3 | 1.37 | 1.57 | 1.29
Dk | 3.11 | 3.08 | 2.9 | 2.89 | 3.96 | 4.9 | 1.36 | 0.66
DN | 10 600 | 15 000 | 10 600 | 9 710 | 14 400 | 10 900 | 11 900 | 8 440
DN(df) | 8 860 | 12 800 | 8 910 | 7 800 | 11 300 | 7 430 | 8 900 | 6 320
DAs | 1.15 | 1.32 | 1.28 | 1.28 | 1.28 | 1.6 | 1.37 | 1.31
Rk | 12.39 | 12.02 | 12.01 | 11.61 | 15.15 | 19.72 | 5.6 | 3.08
RN | 9 220 | 17 700 | 13 000 | 10 800 | 13 800 | 12 100 | 12 000 | 9 160
RN(df) | 8 490 | 13 900 | 11 000 | 8 820 | 12 000 | 8 420 | 9 630 | 7 350
RAs | 1.07 | 1.51 | 1.32 | 1.31 | 1.27 | 1.64 | 1.32 | 1.25
first two components in the model could result in very misleading conclusions. It is usually a mistake to compare PCs of equivalent significance to each other, especially when their sizes are fairly similar.
Procrustes analysis can be used to answer fairly sophisticated questions. For example, in sensory research, are the results of a taste panel comparable to chemical measurements? If so, can the rather expensive and time-consuming taste panel be replaced by chromatography? A second use of procrustes analysis is to reduce the number of tests, an example being clinical trials. Sometimes 50 or more bacteriological tests are performed, but can these be reduced to 10 or fewer? A way to check this is to perform PCA on the results of all 50 tests and compare the scores plot with that obtained using a subset of 10 tests. If the two scores plots provide comparable information, the 10 selected tests are just as good as the full set of tests. This can be of significant economic benefit.
[Scores plot: PC1 (horizontal axis) against PC2 (vertical axis), with points labelled Inertsil ODS, Inertsil ODS-2, Inertsil ODS-3, Kromasil C-18, Kromasil C8, Symmetry C18, Supelco ABZ+ and Purospher]
Figure 4.22
Comparison of scores plots for methanol and THF (case study 2)
4.4 Unsupervised Pattern Recognition: Cluster Analysis
Exploratory data analysis such as PCA is used primarily to determine general relationships between data. Sometimes more complex questions need to be answered, such as, do the samples fall into groups? Cluster analysis is a well established approach that was developed primarily by biologists to determine similarities between organisms. Numerical taxonomy emerged from a desire to determine relationships between different species, for example genera, families and phyla. Many textbooks in biology show how organisms are related using family trees.
The chemist also wishes to relate samples in a similar manner. Can protein sequences from different animals be related and does this tell us about the molecular basis of evolution? Can the chemical fingerprint of wines be related and does this tell us about the origins and taste of a particular wine? Unsupervised pattern recognition employs a number of methods, primarily cluster analysis, to group different samples (or objects) using chemical measurements.
4.4.1 Similarity
The first step is to determine the similarity between objects. Table 4.16 consists of six objects, 1–6, and five measurements, A–E. What are the similarities between the objects? Each object has a relationship to the remaining five objects. How can a numerical value of similarity be defined? A similarity matrix can be obtained, in which the similarity between each pair of objects is calculated using a numerical indicator. Note that it is possible to preprocess data prior to calculation of a number of these measures (see Section 4.3.6).
Table 4.16 Example of cluster analysis.

Object | A | B | C | D | E
1 | 0.9 | 0.5 | 0.2 | 1.6 | 1.5
2 | 0.3 | 0.2 | 0.6 | 0.7 | 0.1
3 | 0.7 | 0.2 | 0.1 | 0.9 | 0.1
4 | 0.1 | 0.4 | 1.1 | 1.3 | 0.2
5 | 1.0 | 0.7 | 2.0 | 2.2 | 0.4
6 | 0.3 | 0.1 | 0.3 | 0.5 | 0.1
Table 4.17 Correlation matrix.

  | 1 | 2 | 3 | 4 | 5 | 6
1 | 1.000 | | | | |
2 | −0.041 | 1.000 | | | |
3 | 0.503 | 0.490 | 1.000 | | |
4 | −0.018 | 0.925 | 0.257 | 1.000 | |
5 | −0.078 | 0.999 | 0.452 | 0.927 | 1.000 |
6 | 0.264 | 0.900 | 0.799 | 0.724 | 0.883 | 1.000
Four of the most popular ways of determining how similar objects are to each other are as follows.
1. Correlation coefficient between samples. A correlation coefficient of 1 implies that samples have identical characteristics, which all objects have with themselves. Some workers use the square or absolute value of a correlation coefficient, and it depends on the precise physical interpretation as to whether negative correlation coefficients imply similarity or dissimilarity. In this text we assume that the more negative is the correlation coefficient, the less similar are the objects. The correlation matrix is presented in Table 4.17. Note that the top right-hand side is not presented as it is the same as the bottom left-hand side. The higher is the correlation coefficient, the more similar are the objects.
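As a sketch (not from the text), this similarity matrix can be generated directly from the Table 4.16 data with NumPy; spot values such as the 0.925 between objects 2 and 4 can be checked against Table 4.17 (a few printed entries involving object 5 differ slightly from values recomputed this way).

```python
import numpy as np

# Rows are objects 1-6, columns are measurements A-E (Table 4.16).
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

# Correlation coefficient between every pair of objects (rows);
# only the lower triangle is printed in Table 4.17.
corr = np.corrcoef(X)
print(round(float(corr[1, 3]), 3))   # objects 2 and 4 -> 0.925
print(round(float(corr[0, 1]), 3))   # objects 1 and 2 -> -0.041
```

`np.corrcoef` treats each row as one variable, which is exactly the "correlation between samples" needed here.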
2. Euclidean distance. The distance between two samples k and l is defined by
$$
d_{kl} = \sqrt{\sum_{j=1}^{J}\left(x_{kj} - x_{lj}\right)^{2}}
$$
where there are $J$ measurements and $x_{ij}$ is the $j$th measurement on sample $i$; for example, $x_{23}$ is the third measurement on the second sample, equalling 0.6 in Table 4.16. The smaller is this value, the more similar are the samples, so this distance measure works in an opposite manner to the correlation coefficient and strictly is a dissimilarity measure. The results are presented in Table 4.18. Although correlation coefficients vary between −1 and +1, this is not true for the Euclidean distance, which has no limit, although it is always positive. Sometimes the equation is presented in matrix format:
$$
d_{kl} = \sqrt{(\mathbf{x}_k - \mathbf{x}_l)\,.\,(\mathbf{x}_k - \mathbf{x}_l)'}
$$
|
|||
|
Table 4.18 Euclidean distance matrix. |
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
1 |
2 |
3 |
4 |
5 |
6 |
|
|
|
|
|
|
|
|
|
|
1 |
0.000 |
|
|
|
|
|
|
|
2 |
1.838 |
0.000 |
|
|
|
|
|
|
3 |
1.609 |
0.671 |
0.000 |
|
|
|
|
|
4 |
1.800 |
0.837 |
1.253 |
0.000 |
|
|
|
|
5 |
2.205 |
2.245 |
2.394 |
1.600 |
0.000 |
|
|
|
6 |
1.924 |
0.374 |
0.608 |
1.192 |
2.592 |
0.000 |
|
|
|
|
|
|
|
|
|
|
|
[Figure: two panels, labelled 'Euclidean distance' and 'Manhattan distance']

Figure 4.23
Euclidean and Manhattan distances
where the objects are row vectors as in Table 4.16; this method is easy to implement in Excel or Matlab.
3. Manhattan distance. This is defined slightly differently to the Euclidean distance and is given by
$$
d_{kl} = \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|
$$
The difference between the Euclidean and Manhattan distances is illustrated in Figure 4.23. The values are given in Table 4.19; note the Manhattan distance will always be greater than (or in exceptional cases equal to) the Euclidean distance.
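Both distance measures are simple to sketch in NumPy (the helper names below are our own); the values can be checked against Tables 4.18 and 4.19, for example for objects 2 and 6.

```python
import numpy as np

# Objects 1-6 (rows) and measurements A-E (columns) from Table 4.16.
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

def euclidean(xk, xl):
    # d_kl = sqrt( sum_j (x_kj - x_lj)^2 )
    return np.sqrt(np.sum((xk - xl) ** 2))

def manhattan(xk, xl):
    # d_kl = sum_j |x_kj - x_lj|
    return np.sum(np.abs(xk - xl))

# Objects 2 and 6 (rows 1 and 5 with zero-based indexing):
print(round(float(euclidean(X[1], X[5])), 3))   # 0.374 (Table 4.18)
print(round(float(manhattan(X[1], X[5])), 3))   # 0.6   (Table 4.19)
```

Looping over all pairs confirms the point in the text: the Manhattan distance is never smaller than the Euclidean distance (it is the 1-norm versus the 2-norm of the same difference vector).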
Table 4.19 Manhattan distance matrix.

  | 1 | 2 | 3 | 4 | 5 | 6
1 | 0 | | | | |
2 | 3.6 | 0 | | | |
3 | 2.7 | 1.1 | 0 | | |
4 | 3.4 | 1.6 | 2.3 | 0 | |
5 | 3.8 | 4.4 | 4.3 | 3.2 | 0 |
6 | 3.6 | 0.6 | 1.1 | 2.2 | 5 | 0
4. Mahalanobis distance. This method is popular with many chemometricians and, whilst superficially similar to the Euclidean distance, it takes into account that some variables may be correlated and so measure more or less the same properties. The distance between objects k and l is best defined in matrix terms by
$$
d_{kl} = \sqrt{(\mathbf{x}_k - \mathbf{x}_l)\,.\,\mathbf{C}^{-1}\,.\,(\mathbf{x}_k - \mathbf{x}_l)'}
$$
where C is the variance–covariance matrix of the variables, a matrix symmetric about the diagonal, whose elements represent the covariance between any two variables, of dimensions J × J . See Appendix A.3.1 for definitions of these parameters; note that one should use the population rather than sample statistics. This measure is very similar to the Euclidean distance except that the inverse of the variance–covariance matrix is inserted as a scaling factor. However, this method cannot easily be applied where the number of measurements (or variables) exceeds the number of objects, because the variance–covariance matrix would not have an inverse. There are some ways around this (e.g. when calculating spectral similarities where the number of wavelengths far exceeds the number of spectra), such as first performing PCA and then retaining the first few PCs for subsequent analysis. For a meaningful measure, the number of objects must be significantly greater than the number of variables, otherwise there are insufficient degrees of freedom for measurement of this parameter. In the case of Table 4.16, the Mahalanobis distance would be an inappropriate measure unless either the number of samples is increased or the number of variables decreased. This distance metric does have its uses in chemometrics, but more commonly in the areas of supervised pattern recognition (Section 4.5) where its properties will be described in more detail. Note in contrast that if the number of variables is small, although the Mahalanobis distance is an appropriate measure, correlation coefficients are less useful.
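A minimal sketch with NumPy, using the population variance–covariance matrix as the text recommends. Because Table 4.16 has almost as many variables (5) as objects (6), only the first two measurements (A and B) are used here, so that the objects comfortably outnumber the variables and C is invertible; the function name is our own.

```python
import numpy as np

# Measurements A and B only for objects 1-6 (Table 4.16).
X = np.array([
    [0.9, 0.5], [0.3, 0.2], [0.7, 0.2],
    [0.1, 0.4], [1.0, 0.7], [0.3, 0.1],
])

C = np.cov(X, rowvar=False, bias=True)   # population variance-covariance
Cinv = np.linalg.inv(C)

def mahalanobis(xk, xl):
    # d_kl = sqrt( (x_k - x_l) . C^{-1} . (x_k - x_l)' )
    d = xk - xl
    return np.sqrt(d @ Cinv @ d)

print(mahalanobis(X[0], X[0]))   # 0.0: an object is at zero distance from itself
```

Replacing `Cinv` with the identity matrix recovers the ordinary Euclidean distance, which makes the relationship between the two measures explicit.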
There are several other related distance measures in the literature, but normally good reasons are required if a very specialist distance measure is to be employed.
4.4.2 Linkage
The next step is to link the objects. The most common approach is called agglomerative clustering whereby single objects are gradually connected to each other in groups. Any similarity measure can be used in the first step, but for simplicity we will illustrate this using only the correlation coefficients of Table 4.17. Similar considerations apply to all the similarity measures introduced in Section 4.4.1, except that in the other cases the lower the distance the more similar the objects.
1. From the raw data, find the two objects that are most similar (closest together). According to Table 4.17, these are objects 2 and 5, as they have the highest correlation coefficient (0.999); remember that because only five measurements have been recorded, there are only four degrees of freedom for the calculation of correlation coefficients, meaning that high values can be obtained fairly easily.
2. Next, form a ‘group’ consisting of the two most similar objects. Four of the original objects (1, 3, 4 and 6) and a group consisting of objects 2 and 5 together remain, leaving a total of five new groups: four consisting of a single original object and one consisting of two ‘clustered’ objects.
3. The tricky bit is to decide how to represent this new grouping. As in the case of distance measures, there are several approaches. The main task is to recalculate the numerical similarity values between the new group and the remaining objects. There are three principal ways of doing this.
(a) Nearest neighbour. The similarity of the new group to each other group is given by the higher of the similarities of the two original objects to that group. For example, object 6 has a correlation coefficient of 0.900 with object 2, and 0.883 with object 5. Hence the correlation coefficient with the new combined group consisting of objects 2 and 5 is 0.900.
(b) Furthest neighbour. This is the opposite of nearest neighbour, and the lowest similarity is used, 0.883 in our case. Note that the furthest neighbour method of linkage refers only to the calculation of similarity measures after new groups are formed; the two groups (or objects) with highest similarity are always joined first.
(c) Average linkage. The average similarity is used, 0.892 in our case. There are, in fact, two different ways of doing this, according to the size of each group being joined together. Where they are of equal size (e.g. each consists of one object), both methods are equivalent. The two different ways are as follows.
• Unweighted. If group A consists of $N_A$ objects and group B of $N_B$ objects, the new similarity measure is given by
$$
s_{AB} = (N_A s_A + N_B s_B)/(N_A + N_B)
$$
• Weighted. The new similarity measure is given by
$$
s_{AB} = (s_A + s_B)/2
$$
The terminology indicates that for the unweighted method, the new similarity measure takes into consideration the number of objects in a group, the conventional terminology possibly being the opposite to what is expected. For the first link, each method provides identical results.
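The three recalculation rules can be written out directly (a sketch; the function names are ours). Using the correlations of object 6 with object 2 (0.900) and with object 5 (0.883) from Table 4.17:

```python
def nearest_neighbour(s_a, s_b):
    # highest similarity of the two merged objects/groups
    return max(s_a, s_b)

def furthest_neighbour(s_a, s_b):
    # lowest similarity of the two merged objects/groups
    return min(s_a, s_b)

def average_unweighted(s_a, s_b, n_a, n_b):
    # takes the sizes of the merged groups into account
    return (n_a * s_a + n_b * s_b) / (n_a + n_b)

def average_weighted(s_a, s_b):
    # ignores group sizes
    return (s_a + s_b) / 2

# Similarity of object 6 to the new group {2, 5}:
print(nearest_neighbour(0.900, 0.883))    # 0.9
print(furthest_neighbour(0.900, 0.883))   # 0.883
print(average_weighted(0.900, 0.883))     # ≈ 0.8915, i.e. 0.892 to 3 d.p.
```

For groups of one object each (`n_a = n_b = 1`), the unweighted and weighted averages coincide, as the text notes for the first link.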
There are numerous further linkage methods, but it would be rare for a chemist to need many combinations of similarity and linkage methods; a good rule of thumb is to check the result using a combination of approaches.
The new data matrix using nearest neighbour clustering is presented in Table 4.20, with the new values shaded. Remember that there are many similarity measures and methods for linking, so this table represents only one possible way of handling the information.
4.4.3 Next Steps
The next steps consist of continuing to group the data just as above, until all objects have joined one large group. Since there are six original objects, there will be five steps before this is achieved. At each step, the most similar pair of objects or clusters are identified, then they are combined into one new cluster, until all objects have been joined. The calculation is illustrated in Table 4.20, using nearest neighbour linkage, with the most similar objects at each step indicated in bold type, and the new similarity measures shaded. In this particular example, all objects ultimately belong to the same cluster, although arguably object 1 (and possibly 3) does not have a very high similarity to the main group. In some cases, several clusters can be formed, although ultimately one large group is usually formed.
It is normal then to determine at what similarity measure each object joined a larger group, and so which objects resemble each other most.
4.4.4 Dendrograms
Often the result of hierarchical clustering is presented in the form of a dendrogram (sometimes called a ‘tree diagram’). The objects are organised in a row, according
Table 4.20 Nearest neighbour cluster analysis, using correlation coefficients for similarity measures, and data in Table 4.16.
[Dendrogram with objects in the order 2, 5, 4, 6, 3, 1 along the horizontal axis; the vertical axis shows similarity]
Figure 4.24
Dendrogram for cluster analysis example
to their similarities: the vertical axis represents the similarity measure at which each successive object joins a group. Using nearest neighbour linkage and correlation coefficients for similarities, the dendrogram for Table 4.20 is presented in Figure 4.24. It can be seen that object 1 is very different from the others. In this case all the other objects appear to form a single group, but other clustering methods may give slightly different results. A good approach is to perform several different methods of cluster analysis and compare the results. If similar clusters are obtained, no matter which method is employed, we can rely on the results.
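The whole procedure (similarity, linkage, dendrogram) is available in SciPy. Below is a sketch using nearest neighbour (`'single'`) linkage on the Table 4.16 data, with 1 − r as the dissimilarity; any monotonically decreasing conversion of the correlation gives the same merge order. (Correlations recomputed from Table 4.16 differ in the last digits from some printed entries of Table 4.17, but the clustering sequence is unaffected.)

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Objects 1-6 (rows) and measurements A-E (columns) from Table 4.16.
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

# SciPy's agglomerative routines expect dissimilarities, so convert the
# correlation matrix to a distance and condense it to a flat vector.
dist = squareform(1 - np.corrcoef(X), checks=False)
Z = linkage(dist, method='single')       # nearest neighbour linkage

# The first merge joins objects 2 and 5 (zero-based rows 1 and 4),
# mirroring the first step described for Table 4.20.
print([int(i) + 1 for i in sorted(Z[0, :2])])   # [2, 5]

# Leaf order along the dendrogram (compare with Figure 4.24).
leaves = dendrogram(Z, no_plot=True)['leaves']
print([i + 1 for i in leaves])
```

Rerunning with `method='complete'` (furthest neighbour) or `method='average'` is a quick way to carry out the comparison of clustering methods recommended above.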
4.5 Supervised Pattern Recognition
Classification (often called supervised pattern recognition) is at the heart of chemistry. Mendeleev’s periodic table, grouping of organic compounds by functionality and listing different reaction types all involve classification. Much of traditional chemistry involves grouping chemical behaviour. Most early texts in organic, inorganic and analytical chemistry are systematically divided into chapters according to the behaviour or structure of the underlying compounds or techniques.
So the modern chemist also has a significant need for classification. Can a spectrum be used to determine whether a compound is a ketone or an ester? Can the chromatogram of a tissue sample be used to determine whether a patient is cancerous or not? Can we record the spectrum of an orange juice and decide its origin? Is it possible to monitor a manufacturing process by near-infrared spectroscopy and decide whether the product is acceptable or not? Supervised pattern recognition is used to assign samples to a number of groups (or classes). It differs from cluster analysis where, although the relationship between samples is important, there are no predefined groups.
