Brereton Chemometrics

PATTERN RECOGNITION

221

 

 

Similarly, procrustes analysis in chemistry involves comparing two diagrams, such as two PC scores plots. One such plot is the reference and a second plot is manipulated to resemble the reference plot as closely as possible. This manipulation is done mathematically involving up to three main transformations.

1. Reflection. This transformation is a consequence of the inability to control the sign of a principal component.

2. Rotation.

3. Scaling (or stretching). This transformation is used because the scales of the two types of measurements may be very different.

4. Translation.

If two datasets are already standardised, transformation 3 may not be necessary, and the fourth transformation is not often used.

The aim is to reduce the root mean square difference between the scores of the reference dataset and the transformed dataset:

$$d = \sqrt{\sum_{i=1}^{I} \sum_{a=1}^{A} \left({}^{ref}t_{ia} - {}^{trans}t_{ia}\right)^{2} \Big/ I}$$
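The rotation/reflection step of transformations 1 and 2, together with the stretch of transformation 3, can be sketched with the standard orthogonal Procrustes solution based on an SVD. This is a minimal numpy illustration, not the software routines the text alludes to; the function name, scaling choice and synthetic data are invented for the example:

```python
import numpy as np

def procrustes_rms(T_ref, T, scale=True):
    """Rotate/reflect (and optionally scale) T to match T_ref, then return
    the root mean square difference between the two sets of scores."""
    # Orthogonal Procrustes: R minimises ||T @ R - T_ref|| over rotations/reflections
    U, s, Vt = np.linalg.svd(T.T @ T_ref)
    R = U @ Vt
    c = s.sum() / (T ** 2).sum() if scale else 1.0   # least-squares stretch factor
    T_trans = c * (T @ R)
    I = T_ref.shape[0]
    return np.sqrt(((T_ref - T_trans) ** 2).sum() / I), T_trans

# Synthetic check: a rotated, reflected and stretched copy should align exactly
rng = np.random.default_rng(0)
T_ref = rng.standard_normal((8, 2))               # reference scores, I = 8, A = 2
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # random orthogonal matrix
rms, _ = procrustes_rms(T_ref, 2.0 * T_ref @ Q)
print(rms)   # essentially zero
```

Translation (transformation 4) is omitted here, matching the remark above that it is not often used.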

For case study 2, it might be of interest to compare performances using different mobile phases (solvents). The original data were obtained using methanol: are similar separations achievable using acetonitrile or THF? The experimental measurements are presented in Tables 4.14 and 4.15. PCA is performed on the standardised data (transposing the matrices as appropriate). Figure 4.21 illustrates the two scores plots using

[Scores plot: PC2 (vertical) versus PC1 (horizontal); points labelled Kromasil C-18, Symmetry C18, Kromasil C8, Inertsil ODS-3, Purospher, Inertsil ODS, Inertsil ODS-2 and Supelco ABZ+]

Figure 4.21

Comparison of scores plots for methanol and acetonitrile (case study 2)

222

CHEMOMETRICS

Table 4.14 Chromatographic parameters corresponding to case study 2, obtained using acetonitrile as mobile phase.

Parameter   Inertsil   Inertsil   Inertsil   Kromasil   Kromasil   Symmetry   Supelco    Purospher
            ODS        ODS-2      ODS-3      C18        C8         C18        ABZ+

Pk          0.13       0.07       0.11       0.15       0.13       0.37       0          0
PN          7340       10900      13500      7450       9190       9370       18100      8990
PN(df)      5060       6650       6700       928        1190       3400       7530       2440
PAs         1.55       1.31       1.7        4.39       4.36       1.92       2.16       2.77
Nk          0.19       0.08       0.16       0.16       0.15       0.39       0          0
NN          15300      11800      10400      13300      16800      5880       16100      10700
NN(df)      7230       6020       5470       3980       7860       648        6780       3930
NAs         1.81       1.91       1.81       2.33       1.83       5.5        2.03       2.2
Ak          2.54       1.56       2.5        2.44       2.48       2.32       0.62       0.2
AN          15500      16300      14900      11600      16300      13500      13800      9700
AN(df)      9100       10400      9480       3680       8650       7240       7060       4600
AAs         1.51       1.62       1.67       2.6        1.85       1.72       1.85       1.8
Ck          1.56       0.85       1.61       1.39       1.32       1.43       0.34       0.11
CN          14600      14900      13500      13200      18100      13100      18000      9100
CN(df)      13100      12500      12200      10900      15500      10500      11700      5810
CAs         1.01       1.27       1.17       1.2        1.17       1.23       1.67       1.49
Qk          7.34       3.62       7.04       5.6        5.48       5.17       1.4        0.92
QN          14200      16700      13800      14200      16300      11100      10500      4200
QN(df)      12800      13800      11400      10300      12600      5130       7780       2220
QAs         1.03       1.34       1.37       1.44       1.41       2.26       1.35       2.01
Bk          0.67       0.41       0.65       0.64       0.65       0.77       0.12       0
BN          15900      12000      12800      14100      19100      12900      13600      5370
BN(df)      8100       8680       6210       5370       8820       5290       6700       2470
BAs         1.63       1.5        1.92       2.11       1.9        1.97       1.82       1.42
Dk          5.73       4.18       6.08       6.23       6.26       5.5        1.27       0.75
DN          14400      20200      17700      11800      18500      15600      14600      11800
DN(df)      10500      15100      13200      3870       12600      10900      10400      8950
DAs         1.39       1.51       1.54       2.98       1.65       1.53       1.49       1.3
Rk          14.62      10.8       15.5       15.81      14.57      13.81      3.41       2.22
RN          12100      19400      17500      10800      16600      15700      14000      10200
RN(df)      9890       13600      12900      3430       12400      11600      10400      7830
RAs         1.3        1.66       1.62       3.09       1.52       1.54       1.49       1.32

methanol and acetonitrile, as procrustes rotation has been performed to ensure that they agree as closely as possible, whereas Figure 4.22 is for methanol and THF. It appears that acetonitrile has similar properties to methanol as a mobile phase, but THF is very different.

It is not necessary to restrict each measurement technique to two PCs; indeed, in many practical cases four or five PCs are employed. Computer software is available to compare scores plots and provide a numerical indicator of the closeness of the fit, although the result is not easy to visualise. Because PCs often have no direct physical meaning, it is important to recognise that in some cases several PCs must be included for a meaningful result. For example, if two datasets are characterised by four PCs of approximately equal size, the first PC of the reference dataset may correlate most closely with the third PC of the comparison dataset, so including only the


Table 4.15 Chromatographic parameters corresponding to case study 2, obtained using THF as mobile phase.

Parameter   Inertsil   Inertsil   Inertsil   Kromasil   Kromasil   Symmetry   Supelco    Purospher
            ODS        ODS-2      ODS-3      C18        C8         C18        ABZ+

Pk          0.05       0.02       0.02       0.04       0.04       0.31       0          0
PN          17300      12200      10200      14900      18400      12400      16600      13700
PN(df)      11300      7080       6680       7560       11400      5470       10100      7600
PAs         1.38       1.74       1.53       1.64       1.48       2.01       1.65       1.58
Nk          0.05       0.02       0.01       0.03       0.03       0.33       0          0
NN          13200      9350       7230       11900      15800      3930       14300      11000
NN(df)      7810       4310       4620       7870       11500      543        7870       6140
NAs         1.57       2          1.76       1.43       1.31       5.58       1.7        1.7
Ak          2.27       1.63       1.79       1.96       2.07       2.37       0.66       0.19
AN          14800      15300      13000      13500      18300      12900      14200      11000
AN(df)      10200      11300      10700      9830       14200      9430       9600       7400
AAs         1.36       1.46       1.31       1.36       1.32       1.37       1.51       1.44
Ck          0.68       0.45       0.52       0.53       0.54       0.84       0.15       0
CN          14600      11800      9420       11600      16500      11000      12700      8230
CN(df)      12100      9170       7850       9820       13600      8380       9040       5100
CAs         1.12       1.34       1.22       1.14       1.15       1.28       1.42       1.48
Qk          4.67       2.2        3.03       2.78       2.67       3.28       0.9        0.48
QN          10800      12100      9150       10500      13600      8000       10100      5590
QN(df)      8620       9670       7450       8760       11300      3290       7520       4140
QAs         1.17       1.36       1.3        1.19       1.18       2.49       1.3        1.38
Bk          0.53       0.39       0.42       0.46       0.51       0.77       0.11       0
BN          14800      12100      11900      14300      19100      13700      15000      4290
BN(df)      8260       6700       8570       9150       14600      9500       9400       3280
BAs         1.57       1.79       1.44       1.44       1.3        1.37       1.57       1.29
Dk          3.11       3.08       2.9        2.89       3.96       4.9        1.36       0.66
DN          10600      15000      10600      9710       14400      10900      11900      8440
DN(df)      8860       12800      8910       7800       11300      7430       8900       6320
DAs         1.15       1.32       1.28       1.28       1.28       1.6        1.37       1.31
Rk          12.39      12.02      12.01      11.61      15.15      19.72      5.6        3.08
RN          9220       17700      13000      10800      13800      12100      12000      9160
RN(df)      8490       13900      11000      8820       12000      8420       9630      7350
RAs         1.07       1.51       1.32       1.31       1.27       1.64       1.32       1.25

 

first two components in the model could result in very misleading conclusions. It is usually a mistake to compare PCs of equivalent significance to each other, especially when their sizes are fairly similar.

Procrustes analysis can be used to answer fairly sophisticated questions. For example, in sensory research, are the results of a taste panel comparable to chemical measurements? If so, can the rather expensive and time-consuming taste panel be replaced by chromatography? A second use of procrustes analysis is to reduce the number of tests, an example being clinical trials. Sometimes 50 or more bacteriological tests are performed, but can these be reduced to 10 or fewer? A way to check this is to perform PCA on the results of all 50 tests and compare the scores plot with that obtained using a subset of 10 tests. If the two scores plots provide comparable information, the 10 selected tests are just as good as the full set, which can be of significant economic benefit.


 

 

 

 

 

 

[Scores plot: PC2 (vertical) versus PC1 (horizontal); points labelled Symmetry C18, Kromasil C-18, Inertsil ODS-3, Purospher, Kromasil C8, Inertsil ODS, Inertsil ODS-2 and Supelco ABZ+]

Figure 4.22

Comparison of scores plots for methanol and THF (case study 2)

4.4 Unsupervised Pattern Recognition: Cluster Analysis

Exploratory data analysis such as PCA is used primarily to determine general relationships between data. Sometimes more complex questions need to be answered, such as: do the samples fall into groups? Cluster analysis is a well established approach that was developed primarily by biologists to determine similarities between organisms. Numerical taxonomy emerged from a desire to determine the relationships between different species and, in turn, between genera, families and phyla. Many textbooks in biology show how organisms are related using family trees.

The chemist also wishes to relate samples in a similar manner. Can protein sequences from different animals be related and does this tell us about the molecular basis of evolution? Can the chemical fingerprint of wines be related and does this tell us about the origins and taste of a particular wine? Unsupervised pattern recognition employs a number of methods, primarily cluster analysis, to group different samples (or objects) using chemical measurements.

4.4.1 Similarity

The first step is to determine the similarity between objects. Table 4.16 consists of six objects, 1–6, and five measurements, A–E. What are the similarities between the objects? Each object has a relationship to the remaining five objects. How can a numerical value of similarity be defined? A similarity matrix can be obtained, in which the similarity between each pair of objects is calculated using a numerical indicator. Note that it is possible to preprocess data prior to calculation of a number of these measures (see Section 4.3.6).


Table 4.16 Example of cluster analysis.

      A     B     C     D     E
1    0.9   0.5   0.2   1.6   1.5
2    0.3   0.2   0.6   0.7   0.1
3    0.7   0.2   0.1   0.9   0.1
4    0.1   0.4   1.1   1.3   0.2
5    1.0   0.7   2.0   2.2   0.4
6    0.3   0.1   0.3   0.5   0.1

Table 4.17 Correlation matrix.

       1       2       3       4       5       6
1    1.000
2    0.041   1.000
3    0.503   0.490   1.000
4    0.018   0.925   0.257   1.000
5    0.078   0.999   0.452   0.927   1.000
6    0.264   0.900   0.799   0.724   0.883   1.000

 

Four of the most popular ways of determining how similar objects are to each other are as follows.

1. Correlation coefficient between samples. A correlation coefficient of 1 implies that samples have identical characteristics, which all objects have with themselves. Some workers use the square or absolute value of a correlation coefficient, and it depends on the precise physical interpretation as to whether negative correlation coefficients imply similarity or dissimilarity. In this text we assume that the more negative is the correlation coefficient, the less similar are the objects. The correlation matrix is presented in Table 4.17. Note that the top right-hand side is not presented as it is the same as the bottom left-hand side. The higher is the correlation coefficient, the more similar are the objects.
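Table 4.17 can be checked by correlating the rows of the data matrix against each other. A small numpy sketch using the values of Table 4.16 (the row order follows the table; variable names are illustrative):

```python
import numpy as np

# Table 4.16: rows are objects 1-6, columns measurements A-E
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

# Correlation between samples: correlate the ROWS of X (np.corrcoef's default)
R = np.corrcoef(X)
print(round(float(R[1, 4]), 3))   # objects 2 and 5: 0.999
```

With only five measurements per object there are few degrees of freedom, so high correlations such as the 0.999 between objects 2 and 5 arise quite easily, as the text notes later.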

2. Euclidean distance. The distance between two samples k and l is defined by

$$d_{kl} = \sqrt{\sum_{j=1}^{J}\left(x_{kj} - x_{lj}\right)^{2}}$$

where there are J measurements and x_ij is the jth measurement on sample i; for example, x_23 is the third measurement on the second sample, equalling 0.6 in Table 4.16. The smaller this value, the more similar the samples, so this distance measure works in the opposite sense to the correlation coefficient and is strictly a dissimilarity measure. The results are presented in Table 4.18. Although correlation coefficients vary between −1 and +1, this is not true for the Euclidean distance, which has no upper limit, although it is always positive. Sometimes the equation is presented in matrix format:

$$d_{kl} = \sqrt{(\mathbf{x}_k - \mathbf{x}_l)\,.\,(\mathbf{x}_k - \mathbf{x}_l)^{\mathrm{T}}}$$


Table 4.18 Euclidean distance matrix.

       1       2       3       4       5       6
1    0.000
2    1.838   0.000
3    1.609   0.671   0.000
4    1.800   0.837   1.253   0.000
5    2.205   2.245   2.394   1.600   0.000
6    1.924   0.374   0.608   1.192   2.592   0.000

[Figure panels: Euclidean distance (straight line between two points); Manhattan distance (path along the axes)]

Figure 4.23

Euclidean and Manhattan distances

where the objects are row vectors as in Table 4.16; this method is easy to implement in Excel or Matlab.
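The same vectorised calculation is equally easy in numpy; the following sketch reproduces the distances of Table 4.18 from the data of Table 4.16 (broadcasting replaces an explicit double loop; variable names are illustrative):

```python
import numpy as np

# Table 4.16: rows are objects 1-6, columns measurements A-E
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

# d_kl = sqrt(sum_j (x_kj - x_lj)^2) for every pair of rows at once
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))
print(np.round(D, 3))   # matches Table 4.18, e.g. D[0, 1] is 1.838
```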

3. Manhattan distance. This is defined slightly differently from the Euclidean distance and is given by

$$d_{kl} = \sum_{j=1}^{J}\left|x_{kj} - x_{lj}\right|$$

The difference between the Euclidean and Manhattan distances is illustrated in Figure 4.23. The values are given in Table 4.19; note the Manhattan distance will always be greater than (or in exceptional cases equal to) the Euclidean distance.
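A matching numpy sketch for the Manhattan distance, which also checks the observation that it is never smaller than the Euclidean distance (same Table 4.16 data; variable names are illustrative):

```python
import numpy as np

# Table 4.16: rows are objects 1-6, columns measurements A-E
X = np.array([
    [0.9, 0.5, 0.2, 1.6, 1.5],
    [0.3, 0.2, 0.6, 0.7, 0.1],
    [0.7, 0.2, 0.1, 0.9, 0.1],
    [0.1, 0.4, 1.1, 1.3, 0.2],
    [1.0, 0.7, 2.0, 2.2, 0.4],
    [0.3, 0.1, 0.3, 0.5, 0.1],
])

diff = X[:, None, :] - X[None, :, :]
D_euc = np.sqrt((diff ** 2).sum(axis=2))   # Euclidean, for comparison
D_man = np.abs(diff).sum(axis=2)           # Manhattan (city-block) distance
print(round(float(D_man[1, 5]), 1))        # objects 2 and 6: 0.6, as in Table 4.19
```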


Table 4.19 Manhattan distance matrix.

       1       2       3       4       5       6
1    0
2    3.6     0
3    2.7     1.1     0
4    3.4     1.6     2.3     0
5    3.8     4.4     4.3     3.2     0
6    3.6     0.6     1.1     2.2     5.0     0

4. Mahalanobis distance. This method is popular with many chemometricians and, whilst superficially similar to the Euclidean distance, it takes into account that some variables may be correlated and so measure more or less the same properties. The distance between objects k and l is best defined in matrix terms by

$$d_{kl} = \sqrt{(\mathbf{x}_k - \mathbf{x}_l)\,.\,\mathbf{C}^{-1}\,.\,(\mathbf{x}_k - \mathbf{x}_l)^{\mathrm{T}}}$$

where C is the variance–covariance matrix of the variables, a matrix symmetric about the diagonal, whose elements represent the covariance between any two variables, of dimensions J × J . See Appendix A.3.1 for definitions of these parameters; note that one should use the population rather than sample statistics. This measure is very similar to the Euclidean distance except that the inverse of the variance–covariance matrix is inserted as a scaling factor. However, this method cannot easily be applied where the number of measurements (or variables) exceeds the number of objects, because the variance–covariance matrix would not have an inverse. There are some ways around this (e.g. when calculating spectral similarities where the number of wavelengths far exceeds the number of spectra), such as first performing PCA and then retaining the first few PCs for subsequent analysis. For a meaningful measure, the number of objects must be significantly greater than the number of variables, otherwise there are insufficient degrees of freedom for measurement of this parameter. In the case of Table 4.16, the Mahalanobis distance would be an inappropriate measure unless either the number of samples is increased or the number of variables decreased. This distance metric does have its uses in chemometrics, but more commonly in the areas of supervised pattern recognition (Section 4.5) where its properties will be described in more detail. Note in contrast that if the number of variables is small, although the Mahalanobis distance is an appropriate measure, correlation coefficients are less useful.
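The calculation can be sketched in numpy as follows. Because, as noted above, Table 4.16 has too few objects for a meaningful variance–covariance matrix, the data here are synthetic and invented purely for illustration (seed, mixing matrix and sizes are all assumptions), chosen so that objects comfortably outnumber variables:

```python
import numpy as np

# Illustrative data only: 50 objects, 3 correlated variables,
# so the variance-covariance matrix is invertible
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3)) @ np.array([[1.0, 0.5, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 1.0]])

C = np.cov(X, rowvar=False, bias=True)   # population statistics, as the text advises
C_inv = np.linalg.inv(C)

def mahalanobis(xk, xl):
    """Mahalanobis distance between two row vectors, scaled by C^-1."""
    d = xk - xl
    return float(np.sqrt(d @ C_inv @ d))

print(mahalanobis(X[0], X[1]))
```

Setting C to the identity matrix recovers the ordinary Euclidean distance, which makes the role of the inverse variance–covariance matrix as a scaling factor explicit.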

There are several other related distance measures in the literature, but normally good reasons are required if a very specialist distance measure is to be employed.

4.4.2 Linkage

The next step is to link the objects. The most common approach is called agglomerative clustering whereby single objects are gradually connected to each other in groups. Any similarity measure can be used in the first step, but for simplicity we will illustrate this using only the correlation coefficients of Table 4.17. Similar considerations apply to all the similarity measures introduced in Section 4.4.1, except that in the other cases the lower the distance the more similar the objects.


1. From the raw data, find the two objects that are most similar (closest together).

According to Table 4.17, these are objects 2 and 5, as they have the highest correlation coefficient (=0.999) (remember that because only five measurements have been recorded there are only four degrees of freedom for calculation of correlation coefficients, meaning that high values can be obtained fairly easily).

2. Next, form a ‘group’ consisting of the two most similar objects. Four of the original objects (1, 3, 4 and 6) remain, together with a group consisting of objects 2 and 5, leaving a total of five new groups: four consisting of a single original object and one consisting of two ‘clustered’ objects.

3. The tricky bit is to decide how to represent this new grouping. As in the case of distance measures, there are several approaches. The main task is to recalculate the numerical similarity values between the new group and the remaining objects. There are three principal ways of doing this.

(a) Nearest neighbour. The similarity of the new group to all other groups is given by the highest similarity of either of the original objects to each other object. For example, object 6 has a correlation coefficient of 0.900 with object 2, and 0.883 with object 5. Hence the correlation coefficient with the new combined group consisting of objects 2 and 5 is 0.900.

(b) Furthest neighbour. This is the opposite of nearest neighbour, and the lowest similarity is used, 0.883 in our case. Note that the furthest neighbour method of linkage refers only to the calculation of similarity measures after new groups are formed; the two groups (or objects) with highest similarity are always joined first.

(c) Average linkage. The average similarity is used, 0.892 in our case. There are, in fact, two different ways of doing this, according to the size of each group being joined together. Where the groups are of equal size (e.g. each consists of one object), both methods are equivalent. The two approaches are as follows.

Unweighted. If group A consists of NA objects and group B of NB objects, the new similarity measure is given by

$$s_{AB} = (N_A s_A + N_B s_B)/(N_A + N_B)$$

Weighted. The new similarity measure is given by

$$s_{AB} = (s_A + s_B)/2$$

The terminology indicates that for the unweighted method, the new similarity measure takes into consideration the number of objects in a group, the conventional terminology possibly being the opposite to what is expected. For the first link, each method provides identical results.

There are numerous further linkage methods, but it would be rare for a chemist to need many combinations of similarity and linkage measures; a good rule of thumb is to check the result using several combinations of approaches.

The new data matrix using nearest neighbour clustering is presented in Table 4.20, with the new values shaded. Remember that there are many similarity measures and methods for linking, so this table represents only one possible way of handling the information.
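The nearest neighbour procedure of steps 1–3 can be sketched directly from the correlation matrix of Table 4.17. This is an illustrative numpy implementation (not the software behind Table 4.20); at each step it finds the most similar pair, records the merge, and updates the matrix by keeping the higher similarity:

```python
import numpy as np

# Correlation similarities between the six objects (Table 4.17, symmetrised)
S = np.array([
    [1.000, 0.041, 0.503, 0.018, 0.078, 0.264],
    [0.041, 1.000, 0.490, 0.925, 0.999, 0.900],
    [0.503, 0.490, 1.000, 0.257, 0.452, 0.799],
    [0.018, 0.925, 0.257, 1.000, 0.927, 0.724],
    [0.078, 0.999, 0.452, 0.927, 1.000, 0.883],
    [0.264, 0.900, 0.799, 0.724, 0.883, 1.000],
])

groups = [{i + 1} for i in range(6)]      # 1-based object labels
sim = S.copy()
np.fill_diagonal(sim, -np.inf)            # ignore self-similarity
history = []                              # (members of new group, joining similarity)

while len(groups) > 1:
    a, b = divmod(int(np.argmax(sim)), sim.shape[0])
    lo, hi = min(a, b), max(a, b)
    history.append((sorted(groups[lo] | groups[hi]), sim[lo, hi]))
    merged = np.maximum(sim[lo], sim[hi])  # nearest neighbour: keep the HIGHER similarity
    groups[lo] |= groups[hi]
    del groups[hi]
    sim[lo, :] = merged
    sim[:, lo] = merged
    sim = np.delete(np.delete(sim, hi, 0), hi, 1)
    sim[lo, lo] = -np.inf

for members, s in history:
    print(members, round(float(s), 3))
```

Objects 2 and 5 join first at 0.999, and object 1 joins last at 0.503, consistent with the dendrogram discussion that follows.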


4.4.3 Next Steps

The next steps consist of continuing to group the data just as above, until all objects have joined one large group. Since there are six original objects, there will be five steps before this is achieved. At each step, the most similar pair of objects or clusters are identified, then they are combined into one new cluster, until all objects have been joined. The calculation is illustrated in Table 4.20, using nearest neighbour linkage, with the most similar objects at each step indicated in bold type, and the new similarity measures shaded. In this particular example, all objects ultimately belong to the same cluster, although arguably object 1 (and possibly 3) does not have a very high similarity to the main group. In some cases, several clusters can be formed, although ultimately one large group is usually formed.

It is normal then to determine at what similarity measure each object joined a larger group, and so which objects resemble each other most.

4.4.4 Dendrograms

Often the result of hierarchical clustering is presented in the form of a dendrogram (sometimes called a ‘tree diagram’). The objects are organised in a row, according

Table 4.20 Nearest neighbour cluster analysis, using correlation coefficients for similarity measures, and data in Table 4.16.

[Dendrogram: objects in the order 2, 5, 4, 6, 3, 1 along the horizontal axis; vertical axis labelled ‘Similarity’]

Figure 4.24

Dendrogram for cluster analysis example

to their similarities: the vertical axis represents the similarity measure at which each successive object joins a group. Using nearest neighbour linkage and correlation coefficients for similarities, the dendrogram for Table 4.20 is presented in Figure 4.24. It can be seen that object 1 is very different from the others. In this case all the other objects appear to form a single group, but other clustering methods may give slightly different results. A good approach is to perform several different methods of cluster analysis and compare the results. If similar clusters are obtained, no matter which method is employed, we can rely on the results.

4.5 Supervised Pattern Recognition

Classification (often called supervised pattern recognition) is at the heart of chemistry. Mendeleev’s periodic table, grouping of organic compounds by functionality and listing different reaction types all involve classification. Much of traditional chemistry involves grouping chemical behaviour. Most early texts in organic, inorganic and analytical chemistry are systematically divided into chapters according to the behaviour or structure of the underlying compounds or techniques.

So the modern chemist also has a significant need for classification. Can a spectrum be used to determine whether a compound is a ketone or an ester? Can the chromatogram of a tissue sample be used to determine whether a patient is cancerous or not? Can we record the spectrum of an orange juice and decide its origin? Is it possible to monitor a manufacturing process by near-infrared spectroscopy and decide whether the product is acceptable or not? Supervised pattern recognition is used to assign samples to a number of groups (or classes). It differs from cluster analysis where, although the relationship between samples is important, there are no predefined groups.
