
EXPERIMENTAL DESIGN |
109 |
x1 | x2 | y
1 | 1 | 11.1540
1 | 0 | 12.4607
−1 | 0 | 6.3716
0 | −1 | 6.1280
0 | 1 | 2.1698
1. By constructing the design matrix and then using the pseudo-inverse, calculate the coefficients for the best fit model given by the equation
$$ y = b_0 + b_1 x_1 + b_2 x_2 + b_{11} x_1^2 + b_{22} x_2^2 + b_{12} x_1 x_2 $$
2. From these coefficients, calculate the 12 predicted responses, and so the residual (modelling) error as the sum of squares of the residuals.
3. Calculate the contribution to this error of the replicates simply by calculating the average response over the four replicates, and then subtracting each replicate response, and summing the squares of these residuals.
4. Calculate the sum of squares lack-of-fit error by subtracting the value in question 3 from that in question 2.
5. Divide the lack-of-fit and replicate errors by their respective degrees of freedom and comment.
Problem 2.10 The Application of a Plackett–Burman Design to the Screening of Factors Influencing a Chemical Reaction
Section 2.3.3
The yield of a reaction of the form
A + B → C
is to be studied as influenced by 10 possible experimental conditions, as follows:
Factor | Condition | Units | Low | High
x1 | % NaOH | % | 40 | 50
x2 | Temperature | °C | 80 | 110
x3 | Nature of catalyst | | A | B
x4 | Stirring | | Without | With
x5 | Reaction time | min | 90 | 210
x6 | Volume of solvent | ml | 100 | 200
x7 | Volume of NaOH | ml | 30 | 60
x8 | Substrate/NaOH ratio | mol/ml | 0.5 × 10⁻³ | 1 × 10⁻³
x9 | Catalyst/substrate ratio | mol/ml | 4 × 10⁻³ | 6 × 10⁻³
x10 | Reagent/substrate ratio | mol/mol | 1 | 1.25

110 |
CHEMOMETRICS |
|
|
The design, including an eleventh dummy factor, is as follows, with the observed yields:
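Expt No. | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | Yield (%)
1 | − | − | − | − | − | − | − | − | − | − | − | 15
2 | + | + | − | + | + | + | − | − | − | + | − | 42
3 | − | + | + | − | + | + | + | − | − | − | + | 3
4 | + | − | + | + | − | + | + | + | − | − | − | 57
5 | − | + | − | + | + | − | + | + | + | − | − | 38
6 | − | − | + | − | + | + | − | + | + | + | − | 37
7 | − | − | − | + | − | + | + | − | + | + | + | 74
8 | + | − | − | − | + | − | + | + | − | + | + | 54
9 | + | + | − | − | − | + | − | + | + | − | + | 56
10 | + | + | + | − | − | − | + | − | + | + | − | 64
11 | − | + | + | + | − | − | − | + | − | + | + | 65
12 | + | − | + | + | + | − | − | − | + | − | + | 59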
1. Why is a dummy factor employed? Why is a Plackett–Burman design more desirable than a two level fractional factorial in this case?
2. Verify that all the columns are orthogonal to each other.
3. Set up a design matrix, D, and determine the coefficients b0 to b11.
4. An alternative method for calculating the coefficients for factorial designs such as the Plackett–Burman design is to multiply the yields of each experiment by the levels of the corresponding factor, summing these and dividing by 12. Verify that this provides the same answer for factor 1 as using the inverse matrix.
5. A simple method for reducing the number of experimental conditions for further study is to look at the size of the factors and eliminate those that are less than the dummy factor. How many factors remain and what are they?
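The calculations in questions 2 to 5 can be sketched with NumPy as below. This is only an illustrative sketch: the sign strings are transcribed from the design table, and the variable names (`signs`, `D`, `b_alt`, `keep`) are assumptions introduced here, not notation from the text.

```python
import numpy as np

# Signs for x1..x11 (x11 is the dummy factor), one string per experiment,
# transcribed from the Plackett-Burman table.
signs = [
    "-----------",
    "++-+++---+-",
    "-++-+++---+",
    "+-++-+++---",
    "-+-++-+++--",
    "--+-++-+++-",
    "---+-++-+++",
    "+---+-++-++",
    "++---+-++-+",
    "+++---+-++-",
    "-+++---+-++",
    "+-+++---+-+",
]
X = np.array([[1.0 if c == "+" else -1.0 for c in row] for row in signs])
y = np.array([15, 42, 3, 57, 38, 37, 74, 54, 56, 64, 65, 59], dtype=float)

# Design matrix D: an intercept column followed by the 11 factor columns.
D = np.column_stack([np.ones(len(y)), X])

# Question 2: for mutually orthogonal columns, D'D is 12 times the identity.
gram = D.T @ D

# Question 3: coefficients b0..b11 via the pseudo-inverse.
b = np.linalg.pinv(D) @ y

# Question 4: the shortcut (multiply yields by levels, sum, divide by 12).
b_alt = D.T @ y / len(y)

# Question 5: keep real factors whose coefficient is larger in magnitude
# than that of the dummy factor x11.
dummy = abs(b[11])
keep = [f"x{j}" for j in range(1, 11) if abs(b[j]) > dummy]

print(np.round(b, 3))
print(keep)
```

Comparing `b` with `b_alt` shows why the shortcut works: when the columns are orthogonal, the pseudo-inverse reduces to the transpose of D divided by the number of experiments.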
Problem 2.11 Use of a Constrained Mixture Design to Investigate the Conductivity of a Molten Salt System
Section 2.5.4 Section 2.5.2.2
A molten salt system consisting of three components is prepared, and the aim is to investigate the conductivity according to the relative proportion of each component. The three components are as follows:
Component | Salt | Lower limit | Upper limit
x1 | NdCl3 | 0.2 | 0.9
x2 | LiCl | 0.1 | 0.8
x3 | KCl | 0.0 | 0.7
The experiment is coded to give pseudo-components so that a value of 1 corresponds to the upper limit, and a value of 0 to the lower limit of each component. The experimental results are as follows:
z1 | z2 | z3 | Conductivity (Ω⁻¹ cm⁻¹)
1 | 0 | 0 | 3.98
0 | 1 | 0 | 2.63
0 | 0 | 1 | 2.21
0.5 | 0.5 | 0 | 5.54
0.5 | 0 | 0.5 | 4.00
0 | 0.5 | 0.5 | 2.33
0.3333 | 0.3333 | 0.3333 | 3.23
1. Represent the constrained mixture space, diagrammatically, in the original mixture space. Explain why the constraints are possible and why the new reduced mixture space remains a triangle.
2. Produce a design matrix consisting of seven columns in the true mixture space as follows. The true composition of component 1 is given by z1(U1 − L1) + L1, where U and L are the upper and lower bounds for the component. Convert all three columns of the matrix above using this equation and then set up a design matrix, containing three single factor terms, and all possible two and three factor interaction terms (using a Scheffé model).
3. Calculate the model linking the conductivity to the proportions of the three salts.
4. Predict the conductivity when the proportion of the salts is 0.209, 0.146 and 0.645.
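A minimal sketch of questions 2 to 4, assuming NumPy, is given below. The helper `scheffe_terms` and the other variable names are hypothetical, and the tabulated value 0.3333 is entered as an exact third so that the converted proportions sum to 1.

```python
import numpy as np

lower = np.array([0.2, 0.1, 0.0])   # L for NdCl3, LiCl, KCl
upper = np.array([0.9, 0.8, 0.7])   # U for NdCl3, LiCl, KCl

third = 1.0 / 3.0                   # table value 0.3333, taken as exactly 1/3
Z = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5],
    [third, third, third],
])
y = np.array([3.98, 2.63, 2.21, 5.54, 4.00, 2.33, 3.23])


def scheffe_terms(x):
    """Seven model terms: x1, x2, x3, x1x2, x1x3, x2x3, x1x2x3."""
    x = np.atleast_2d(x)
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    return np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])


# Question 2: pseudo-components -> true proportions, x = z(U - L) + L.
X = Z * (upper - lower) + lower
D = scheffe_terms(X)

# Question 3: seven experiments and seven terms, so the fit is exact.
b = np.linalg.solve(D, y)

# Question 4: prediction at the requested true proportions.
x_new = np.array([0.209, 0.146, 0.645])
y_new = (scheffe_terms(x_new) @ b)[0]
print(np.round(b, 3), round(float(y_new), 3))
```

Because the design is saturated, the model reproduces the seven measurements exactly, so there are no residual degrees of freedom with which to judge the quality of the prediction.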
Problem 2.12 Use of Experimental Design and Principal Components Analysis for Reduction of Number of Chromatographic Tests
Section 2.4.5 Section 4.3.6.4 Section 4.3 Section 4.4.1
The following table represents the result of a number of tests performed on eight chromatographic columns, involving performing chromatography on eight compounds at pH 3 in methanol mobile phase, and measuring four peakshape parameters. Note that you may have to transpose the matrix in Excel for further work. The aim is to reduce the number of experimental tests necessary using experimental design. Each test is denoted by a mnemonic. The first letter (e.g. P) stands for a compound; the second part of the name (k, N, N(df) or As) stands for one of four peakshape/retention time measurements.
Test | Inertsil ODS | Inertsil ODS-2 | Inertsil ODS-3 | Kromasil C18 | Kromasil C8 | Symmetry C18 | Supelco ABZ+ | Purospher
Pk | 0.25 | 0.19 | 0.26 | 0.3 | 0.28 | 0.54 | 0.03 | 0.04
PN | 10 200 | 6930 | 7420 | 2980 | 2890 | 4160 | 6890 | 6960
PN(df) | 2650 | 2820 | 2320 | 293 | 229 | 944 | 3660 | 2780
PAs | 2.27 | 2.11 | 2.53 | 5.35 | 6.46 | 3.13 | 1.96 | 2.08
Nk | 0.25 | 0.12 | 0.24 | 0.22 | 0.21 | 0.45 | 0 | 0
NN | 12 000 | 8370 | 9460 | 13 900 | 16 800 | 4170 | 13 800 | 8260
NN(df) | 6160 | 4600 | 4880 | 5330 | 6500 | 490 | 6020 | 3450
NAs | 1.73 | 1.82 | 1.91 | 2.12 | 1.78 | 5.61 | 2.03 | 2.05
Ak | 2.6 | 1.69 | 2.82 | 2.76 | 2.57 | 2.38 | 0.67 | 0.29
AN | 10 700 | 14 400 | 11 200 | 10 200 | 13 800 | 11 300 | 11 700 | 7160
AN(df) | 7790 | 9770 | 7150 | 4380 | 5910 | 6380 | 7000 | 2880
AAs | 1.21 | 1.48 | 1.64 | 2.03 | 2.08 | 1.59 | 1.65 | 2.08
Ck | 0.89 | 0.47 | 0.95 | 0.82 | 0.71 | 0.87 | 0.19 | 0.07
CN | 10 200 | 10 100 | 8500 | 9540 | 12 600 | 9690 | 10 700 | 5300
CN(df) | 7830 | 7280 | 6990 | 6840 | 8340 | 6790 | 7250 | 3070
CAs | 1.18 | 1.42 | 1.28 | 1.37 | 1.58 | 1.38 | 1.49 | 1.66
Qk | 12.3 | 5.22 | 10.57 | 8.08 | 8.43 | 6.6 | 1.83 | 2.17
QN | 8800 | 13 300 | 10 400 | 10 300 | 11 900 | 9000 | 7610 | 2540
QN(df) | 7820 | 11 200 | 7810 | 7410 | 8630 | 5250 | 5560 | 941
QAs | 1.07 | 1.27 | 1.51 | 1.44 | 1.48 | 1.77 | 1.36 | 2.27
Bk | 0.79 | 0.46 | 0.8 | 0.77 | 0.74 | 0.87 | 0.18 | 0
BN | 15 900 | 12 000 | 10 200 | 11 200 | 14 300 | 10 300 | 11 300 | 4570
BN(df) | 7370 | 6550 | 5930 | 4560 | 6000 | 3690 | 5320 | 2060
BAs | 1.54 | 1.79 | 1.74 | 2.06 | 2.03 | 2.13 | 1.97 | 1.67
Dk | 2.64 | 1.72 | 2.73 | 2.75 | 2.27 | 2.54 | 0.55 | 0.35
DN | 9280 | 12 100 | 9810 | 7070 | 13 100 | 10 000 | 10 500 | 6630
DN(df) | 5030 | 8960 | 6660 | 2270 | 7800 | 7060 | 7130 | 3990
DAs | 1.71 | 1.39 | 1.6 | 2.64 | 1.79 | 1.39 | 1.49 | 1.57
Rk | 8.62 | 5.02 | 9.1 | 9.25 | 6.67 | 7.9 | 1.8 | 1.45
RN | 9660 | 13 900 | 11 600 | 7710 | 13 500 | 11 000 | 9680 | 5140
RN(df) | 8410 | 10 900 | 7770 | 3460 | 9640 | 8530 | 6980 | 3270
RAs | 1.16 | 1.39 | 1.65 | 2.17 | 1.5 | 1.28 | 1.41 | 1.56
1. Transpose the data so that the 32 tests correspond to columns of a matrix (variables) and the eight chromatographic columns to the rows of a matrix (objects). Standardise each column by subtracting the mean and dividing by the population standard deviation (Chapter 4, Section 4.3.6.4). Why is it important to standardise these data?
2. Perform PCA (principal components analysis) on these data and retain the first three loadings (methods for performing PCA are discussed in Chapter 4, Section 4.3; see also Appendix A.2.1 and relevant sections of Appendices A.4 and A.5 if you are using Excel or Matlab).
3. Take the three loadings vectors and transform to a common scale as follows. For each loadings vector select the most positive and most negative values, and code these to +1 and −1, respectively. Scale all the intermediate values in a similar fashion, leading to a new scaled loadings matrix of 32 columns and 3 rows. Produce the new scaled loadings vectors.
4. Select a factorial design as follows, with one extra point in the centre, to obtain a range of tests which is a representative subset of the original tests:

Design point | PC1 | PC2 | PC3
1 | − | − | −
2 | + | − | −
3 | − | + | −
4 | + | + | −
5 | − | − | +
6 | + | − | +
7 | − | + | +
8 | + | + | +
9 | 0 | 0 | 0
Calculate the Euclidean distance of each of the 32 scaled loadings from each of the nine design points; for example, for the first design point, calculate the Euclidean distance of the loadings scaled as in question 3 from the point (−1,−1,−1), by the equation
$$ d_1 = \sqrt{(p_{11} + 1)^2 + (p_{12} + 1)^2 + (p_{13} + 1)^2} $$
(Chapter 4, Section 4.4.1).
5. Indicate the chromatographic parameters closest to the nine design points. Hence recommend a reduced number of chromatographic tests and comment on the strategy.
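The sequence of steps in this problem (standardise, PCA, rescale the loadings, measure distances to the design points) can be sketched as follows. To keep the sketch short, a random 8 × 32 matrix stands in for the real table, so the numbers it prints mean nothing chemically. The linear min-to-max rescaling is one possible reading of question 3; another is to scale the positive and negative values separately so that zero stays at zero. All names are assumptions made for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
# SYNTHETIC stand-in for the 8 columns x 32 tests matrix; replace it with
# the transposed table to work the actual problem.
X = rng.normal(size=(8, 32))

# Question 1: standardise each test (population standard deviation, ddof=0).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Question 2: PCA via the singular value decomposition; rows of Vt are loadings.
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:3]                                   # 3 x 32 loadings matrix


def scale_to_unit_range(p):
    """Map the most negative loading to -1 and the most positive to +1."""
    return 2.0 * (p - p.min()) / (p.max() - p.min()) - 1.0


# Question 3: scaled loadings, still 3 rows by 32 columns.
P_scaled = np.vstack([scale_to_unit_range(p) for p in P])

# Question 4: eight factorial points (PC1 changing fastest, as in the table)
# plus the centre point.
design = np.array(
    [(a, b, c) for c, b, a in product([-1, 1], repeat=3)] + [(0, 0, 0)],
    dtype=float,
)

# Euclidean distance of every test (row) from every design point (column).
diff = P_scaled.T[:, None, :] - design[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))      # 32 x 9

# Question 5: index of the test closest to each design point.
nearest = dist.argmin(axis=0)
print(nearest)
```

With the real data, the indices in `nearest` would be mapped back to the test mnemonics (Pk, PN, and so on). Note that the signs of PCA loadings are arbitrary, so the test matched to a given design point can differ between software packages.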
Problem 2.13 A Mixture Design with Constraints
Section 2.5.4
It is desired to perform a three factor mixture design with constraints on each factor as follows:
Limit | x1 | x2 | x3
Lower | 0.0 | 0.2 | 0.3
Upper | 0.4 | 0.6 | 0.7
1. The mixture design is normally represented as an irregular polygon, with, in this case, six vertices. Calculate the percentage of each factor at the six coordinates.
2. It is desired to perform 13 experiments, namely on the six corners, in the middle of the six edges and in the centre. Produce a table of the 13 mixtures.
3. Represent the experiment diagrammatically.
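One way to find the vertices is to fix two components at a bound, let the third make the total up to 1, and keep the point if the third component is also within its limits. The sketch below does this with NumPy and then adds edge midpoints and a centre. The function names are hypothetical, and taking the centre as the mean of the six vertices is an assumption (other definitions of the centre are possible).

```python
import numpy as np
from itertools import product

lower = np.array([0.0, 0.2, 0.3])
upper = np.array([0.4, 0.6, 0.7])
TOL = 1e-9  # tolerance for floating point comparisons against the bounds


def extreme_vertices(lower, upper):
    """Fix all but one component at a bound; the free one closes the sum to 1."""
    n = len(lower)
    found = set()
    for free in range(n):
        others = [i for i in range(n) if i != free]
        for bounds in product(*[(lower[i], upper[i]) for i in others]):
            x = np.empty(n)
            x[others] = bounds
            x[free] = 1.0 - sum(bounds)
            if lower[free] - TOL <= x[free] <= upper[free] + TOL:
                found.add(tuple(np.round(x, 10)))
    return np.array(sorted(found))


def order_around_centre(V):
    """Sort vertices by angle about their mean so neighbours share an edge."""
    c = V.mean(axis=0)
    angle = np.arctan2(V[:, 1] - c[1], V[:, 0] - c[0])
    return V[np.argsort(angle)]


V = order_around_centre(extreme_vertices(lower, upper))   # six corners
mid = 0.5 * (V + np.roll(V, -1, axis=0))                  # six edge midpoints
centre = V.mean(axis=0, keepdims=True)                    # assumed centre
points = np.vstack([V, mid, centre])                      # 13 mixtures

print(np.round(100 * points, 1))                          # percentages
```

Ordering the vertices by angle relies on the constrained region being convex, which holds here because it is the intersection of the mixture triangle with simple lower and upper bounds.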
Problem 2.14 Construction of Five Level Calibration Designs
Section 2.3.4
The aim is to construct a five level partial factorial (or calibration) design involving 25 experiments and up to 14 factors, each at levels −2, −1, 0, 1 and 2. Note that this design is only one of many possible such designs.

1. Construct the experimental conditions for the first factor using the following rules.
• The first experiment is at level −2.
• This level is repeated for experiments 2, 8, 14 and 20.
• The levels for experiments 3–7 are given as follows (0, 2, 0, 0, 1).
• A cyclic permuter of the form 0 → −1 → 1 → 2 → 0 is then used. The blocks of experiments 9–13, 15–19 and 21–25 are related by this permuter, each block being one permutation away from the previous block, so experiments 9 and 10 are at levels −1 and 0, for example.
2. Construct the experimental conditions for the other 13 factors as follows.
• Experiment 1 is always at level −2 for all factors.
• The conditions for experiments 2–25 for the other factors are simply the cyclic permutation of the previous factor as explained in Section 2.3.4.
So produce the matrix of experimental conditions.
3. What is the difference vector used in this design?
4. Calculate the correlation coefficients between all pairs of factors 1–14. Plot the two graphs of the levels of factor 1 versus factors 2 and 7. Comment.
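The construction rules can be turned into a short NumPy sketch. Two assumptions are made here and should be checked against Section 2.3.4: the 24 rows that are cycled are taken to be experiments 2 to 25, and each new factor is obtained by shifting the previous one up by one row (so experiment 2 of factor 2 equals experiment 3 of factor 1). The function names are hypothetical.

```python
import numpy as np

REPEATER = -2                             # level of experiments 1, 2, 8, 14, 20
SEED_BLOCK = [0, 2, 0, 0, 1]              # experiments 3-7 of factor 1
PERMUTER = {0: -1, -1: 1, 1: 2, 2: 0}     # 0 -> -1 -> 1 -> 2 -> 0


def first_factor():
    """Levels of factor 1 for the 25 experiments."""
    levels = [REPEATER]                   # experiment 1
    block = list(SEED_BLOCK)
    for _ in range(4):
        levels.append(REPEATER)           # experiments 2, 8, 14 and 20
        levels.extend(block)              # experiments 3-7, 9-13, 15-19, 21-25
        block = [PERMUTER[v] for v in block]
    return np.array(levels)


def calibration_design(n_factors=14):
    """Experiment 1 fixed at -2; the other 24 rows are shifted cyclically."""
    cycle = first_factor()[1:]
    cols = [np.concatenate([[REPEATER], np.roll(cycle, -k)])
            for k in range(n_factors)]
    return np.column_stack(cols)


design = calibration_design()

# Question 4: correlation coefficients between all pairs of factors.
corr = np.corrcoef(design, rowvar=False)
print(design[:, 0])
print(np.round(corr[0], 3))               # factor 1 against factors 1-14
```

Inspecting the first row of `corr` is a quick way to see which factors are uncorrelated with factor 1 before plotting the pairs requested in question 4.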
Problem 2.15 A Four Component Mixture Design Used for Blending of Olive Oils
Section 2.5.2.2
Fourteen blends of olive oils from four cultivars A–D are mixed together in the design below presented together with a taste panel score for each blend. The higher the score the better the taste of the olive oil.
A | B | C | D | Score
1 | 0 | 0 | 0 | 6.86
0 | 1 | 0 | 0 | 6.50
0 | 0 | 1 | 0 | 7.29
0 | 0 | 0 | 1 | 5.88
0.5 | 0.5 | 0 | 0 | 7.31
0.5 | 0 | 0.5 | 0 | 6.94
0.5 | 0 | 0 | 0.5 | 7.38
0 | 0.5 | 0.5 | 0 | 7.00
0 | 0.5 | 0 | 0.5 | 7.13
0 | 0 | 0.5 | 0.5 | 7.31
0.33333 | 0.33333 | 0.33333 | 0 | 7.56
0.33333 | 0.33333 | 0 | 0.33333 | 7.25
0.33333 | 0 | 0.33333 | 0.33333 | 7.31
0 | 0.33333 | 0.33333 | 0.33333 | 7.38
1. It is desired to produce a model containing 14 terms, namely four linear, six two component and four three component terms. What is the equation for this model?
2. Set up the design matrix and calculate the coefficients.
3. A good way to visualise the data is via contours in a mixture triangle, allowing three components to vary and constraining the fourth to be constant. Using a step size of 0.05, calculate the estimated responses from the model in question 2 when D is absent and A + B + C = 1. A table of 231 numbers should be produced. Using a contour plot, visualise these data. If you use Excel, the upper right-hand half of the plot may contain meaningless data; to remove these, simply cover up this part of the contour plot with a white triangle. In modern versions of Matlab and some other software packages, triangular contour plots can be obtained straightforwardly. Comment on the optimal blend using the contour plot when D is absent.
4. Repeat the contour plot in question 3 for the following: (i) A + B + D = 1, (ii) B + C + D = 1 and (iii) A + C + D = 1, and comment.
5. Why, in this example, is a strategy of visualisation of the mixture contours probably more informative than calculating a single optimum?
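Questions 2 and 3 can be sketched as follows. The blends are generated in the same order as the rows of the table (pure components, then binary blends, then ternary blends), with 0.33333 entered as an exact third. The names `subsets`, `blend` and `model_terms` are hypothetical helpers, and plotting is left out.

```python
import numpy as np
from itertools import combinations

N_COMP = 4  # cultivars A, B, C, D
scores = np.array([6.86, 6.50, 7.29, 5.88,                 # pure A, B, C, D
                   7.31, 6.94, 7.38, 7.00, 7.13, 7.31,     # AB AC AD BC BD CD
                   7.56, 7.25, 7.31, 7.38])                # ABC ABD ACD BCD

# One subset of cultivars per experiment and per model term (4 + 6 + 4 = 14).
subsets = [s for k in (1, 2, 3) for s in combinations(range(N_COMP), k)]


def blend(subset):
    """Equal proportions of the cultivars in `subset`, zero elsewhere."""
    x = np.zeros(N_COMP)
    x[list(subset)] = 1.0 / len(subset)
    return x


def model_terms(x):
    """Linear, two component and three component product terms."""
    x = np.atleast_2d(x)
    return np.column_stack([x[:, list(s)].prod(axis=1) for s in subsets])


X = np.array([blend(s) for s in subsets])
D = model_terms(X)                      # 14 x 14 design matrix
b = np.linalg.solve(D, scores)          # saturated design: exact fit

# Question 3: D absent, A + B + C = 1, step size 0.05 (231 blends).
STEP = 20
grid = np.array([[i / STEP, j / STEP, (STEP - i - j) / STEP, 0.0]
                 for i in range(STEP + 1) for j in range(STEP + 1 - i)])
pred = model_terms(grid) @ b
best = grid[pred.argmax()]
print(len(grid), best, round(float(pred.max()), 3))
```

The array `pred` holds the 231 estimated scores and can be passed, together with the first two columns of `grid`, to whatever contouring routine is available. The other three faces in question 4 follow by changing which column of `grid` is set to zero.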
Problem 2.16 Central Composite Design Used to Study the Extraction of Olive Seeds in a Soxhlet
Section 2.4 Section 2.2.2
Three factors, namely (1) irradiation power as a percentage, (2) irradiation time in seconds and (3) number of cycles, are used to study the focused microwave assisted Soxhlet extraction of olive oil seeds, the response measuring the percentage recovery, which is to be optimised. A central composite design is set up to perform the experiments.
The results are as follows, using coded values of the variables:
Factor 1 | Factor 2 | Factor 3 | Response
−1 | −1 | −1 | 46.64
−1 | −1 | 1 | 47.23
−1 | 1 | −1 | 45.51
−1 | 1 | 1 | 48.58
1 | −1 | −1 | 42.55
1 | −1 | 1 | 44.68
1 | 1 | −1 | 42.01
1 | 1 | 1 | 43.03
−1 | 0 | 0 | 49.18
1 | 0 | 0 | 44.59
0 | −1 | 0 | 49.22
0 | 1 | 0 | 47.89
0 | 0 | −1 | 48.93
0 | 0 | 1 | 49.93
0 | 0 | 0 | 50.51
0 | 0 | 0 | 49.33
0 | 0 | 0 | 49.01
0 | 0 | 0 | 49.93
0 | 0 | 0 | 49.63
0 | 0 | 0 | 50.54
1. A 10 parameter model is to be fitted to the data, consisting of the intercept, all single factor linear and quadratic terms and all two factor interaction terms. Set up the design matrix, and by using the pseudo-inverse, calculate the coefficients of the model using coded values.

2. The true values of the factors are as follows:
Variable | −1 | +1
Power (%) | 30 | 60
Time (s) | 20 | 30
Cycles | 5 | 7
Re-express the model in question 1 in terms of the true values of each variable, rather than the coded values.
3. Using the model in question 1 and the coded design matrix, calculate the 20 predicted responses and the total error sum of squares for the 20 experiments.
4. Determine the sum of squares replicate error as follows: (i) calculate the mean response for the six replicates; (ii) calculate the difference between the true and average response, square these and sum the six numbers.
5. Determine the sum of squares lack-of-fit error as follows: (i) replace the six replicate responses by the average response for the replicates; (ii) using the 20 responses (with the replicates averaged) and the corresponding predicted responses, calculate the differences, square them and sum them.
6. Verify that the sums of squares in questions 4 and 5 add up to the total error obtained in question 3.
7. How many degrees of freedom are available for assessment of the replicate and lack-of-fit errors? Using this information, comment on whether the lack-of-fit is significant, and hence whether the model is adequate.
8. The significance of each term can be determined by omitting the term from the overall model. Assess the significance of the linear term due to the first factor and the interaction term between the first and third factors in this way. Calculate a new design matrix with nine rather than ten columns, removing the relevant column, and also remove the corresponding coefficients from the equation. Determine the new predicted responses using nine terms, and calculate the increase in sum of squares error over that obtained in question 3. Comment on the significance of these two terms.
9. Using coded values, determine the optimum conditions as follows. Discard the two interaction terms that are least significant, resulting in eight remaining terms in the equation. Obtain the partial derivatives with respect to each of the three variables, and set up three equations equal to zero. Show that the optimum value of the third factor is given by −b3/(2b33), where the coefficients correspond to the linear and quadratic terms in the equations. Hence calculate the optimum coded values for each of the three factors.
10. Determine the optimum true values corresponding to the conditions obtained in question 9. What is the percentage recovery at this optimum? Comment.
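The error decomposition in questions 1 and 3 to 7 can be sketched with NumPy as below. The column order of the design matrix (intercept, linear, quadratic, interaction) and all variable names are choices made for this illustration only.

```python
import numpy as np
from itertools import product

# Coded design: 8 factorial points, 6 axial points at +/-1, 6 centre replicates.
factorial = list(product([-1, 1], repeat=3))
axial = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
X = np.array(factorial + axial + [(0, 0, 0)] * 6, dtype=float)
y = np.array([46.64, 47.23, 45.51, 48.58, 42.55, 44.68, 42.01, 43.03,
              49.18, 44.59, 49.22, 47.89, 48.93, 49.93,
              50.51, 49.33, 49.01, 49.93, 49.63, 50.54])

x1, x2, x3 = X.T
D = np.column_stack([np.ones(len(y)), x1, x2, x3,
                     x1**2, x2**2, x3**2,
                     x1 * x2, x1 * x3, x2 * x3])

# Question 1: ten coefficients via the pseudo-inverse.
b = np.linalg.pinv(D) @ y

# Question 3: predicted responses and total error sum of squares.
y_hat = D @ b
ss_total = ((y - y_hat) ** 2).sum()

# Question 4: replicate error from the six centre points.
is_rep = (X == 0).all(axis=1)
rep_mean = y[is_rep].mean()
ss_rep = ((y[is_rep] - rep_mean) ** 2).sum()

# Question 5: lack-of-fit, with the replicates replaced by their mean.
y_avg = np.where(is_rep, rep_mean, y)
ss_lof = ((y_avg - y_hat) ** 2).sum()

# Question 7: degrees of freedom and the ratio of the two mean squares.
df_rep = int(is_rep.sum()) - 1
df_lof = len(y) - D.shape[1] - df_rep
f_ratio = (ss_lof / df_lof) / (ss_rep / df_rep)

print(np.round(b, 3))
print(round(ss_total, 4), round(ss_rep, 4), round(ss_lof, 4), round(f_ratio, 2))
```

The ratio `f_ratio` would normally be compared with a tabulated F value for `df_lof` and `df_rep` degrees of freedom. For question 8, the same steps can be repeated after deleting the relevant column of `D`.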
Problem 2.17 A Three Component Mixture Design
Section 2.5.2
A three factor simplex centroid mixture design is performed, with the following results:
x1 | x2 | x3 | Response
1 | 0 | 0 | 9
0 | 1 | 0 | 12
0 | 0 | 1 | 17
0.5 | 0.5 | 0 | 3
0.5 | 0 | 0.5 | 18
0 | 0.5 | 0.5 | 14
0.3333 | 0.3333 | 0.3333 | 11
1. A seven term model consisting of three linear terms, three two factor interaction terms and one three factor interaction term is fitted to the data. Give the equation for this model, compute the design matrix and calculate the coefficients.
2. Instead of seven terms, it is decided to fit the model only to the three linear terms. Calculate these coefficients using only three terms in the model employing the pseudo-inverse. Determine the root mean square error for the predicted responses, and comment on the difference in the linear terms in question 1 and the significance of the interaction terms.
3. It is possible to convert the model of question 1 to a seven term model in two independent factors, consisting of two linear terms, two quadratic terms, two linear interaction terms and a quadratic term of the form x1x2(x1 + x2). Show how the models relate algebraically.
4. For the model in question 3, set up the design matrix, calculate the new coefficients and show how these relate to the coefficients calculated in question 1 using the relationship obtained in question 3.
5. The matrices in questions 1, 2 and 4 all have inverses. However, a model that consisted of an intercept term and three linear terms would not, and it is impossible to use regression analysis to fit the data under such circumstances. Explain these observations.
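Questions 1, 2 and 5 can be explored numerically with the sketch below. The value 0.3333 is entered as an exact third, the root mean square error is computed by dividing by the number of experiments (dividing by the residual degrees of freedom is an equally common convention), and the variable names are assumptions.

```python
import numpy as np

third = 1.0 / 3.0                       # table value 0.3333, taken as exactly 1/3
X = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5],
    [third, third, third],
])
y = np.array([9, 12, 17, 3, 18, 14, 11], dtype=float)
x1, x2, x3 = X.T

# Question 1: seven term model, seven experiments, exact solution.
D7 = np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])
b7 = np.linalg.solve(D7, y)

# Question 2: three linear terms only, fitted with the pseudo-inverse.
D3 = X
b3 = np.linalg.pinv(D3) @ y
rmse3 = np.sqrt(np.mean((y - D3 @ b3) ** 2))

# Question 5: adding an intercept makes the columns linearly dependent,
# because x1 + x2 + x3 = 1 reproduces the intercept column.
D4 = np.column_stack([np.ones(len(y)), X])
rank4 = np.linalg.matrix_rank(D4)

print(np.round(b7, 3))
print(np.round(b3, 3), round(float(rmse3), 3))
print(rank4, "independent columns out of", D4.shape[1])
```

A rank of 3 for a four column matrix means that the product of its transpose with itself is singular, which is the numerical form of the observation in question 5.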
Chemometrics: Data Analysis for the Laboratory and Chemical Plant.
Richard G. Brereton
Copyright © 2003 John Wiley & Sons, Ltd.
ISBNs: 0-471-48977-8 (HB); 0-471-48978-6 (PB)
3 Signal Processing
3.1 Sequential Signals in Chemistry
Sequential signals are surprisingly widespread in chemistry, and require a large number of methods for analysis. Most data are obtained via computerised instruments such as those for NIR, HPLC or NMR, and raw information such as peak integrals, peak shifts and positions is often dependent on how the information from the computer is first processed. An appreciation of this step is essential prior to applying further multivariate methods such as pattern recognition or classification. Spectra and chromatograms are examples of series that are sequential in time or frequency. However, time series also occur very widely in other areas of chemistry, for example in the area of industrial process control and natural processes.
3.1.1 Environmental and Geological Processes
An important source of data involves recording samples regularly with time. Classically such time series occur in environmental chemistry and geochemistry. A river might be sampled for the presence of pollutants such as polyaromatic hydrocarbons or heavy metals at different times of the year. Is there a trend, and can this be related to seasonal factors? Different and fascinating processes occur in rocks, where depth in the sediment relates to burial time. For example, isotope ratios are a function of climate, as relative evaporation rates of different isotopes are temperature dependent: certain specific cyclical changes in the Earth’s rotation have resulted in the Ice Ages, and the resulting climate changes leave a systematic chemical record. A whole series of methods for time series analysis based primarily on the idea of correlograms (Section 3.4) can be applied to explore such types of cyclicity, which are often hard to elucidate. Many of these approaches were first used by economists and geologists who also encounter related problems.
One of the difficulties is that long-term and interesting trends are often buried within short-term random fluctuations. Statisticians distinguish between various types of noise which interfere with the signal as discussed in Section 3.2.3. Interestingly, the statistician Herman Wold, who is known among many chemometricians for the early development of the partial least squares algorithm, is probably more famous for his work on time series, studying this precise problem.
In addition to obtaining correlograms, a large battery of methods is available to smooth time series, many based on so-called ‘windows’, whereby data are smoothed over a number of points in time. A simple method is to take the average reading over five points in time, but sometimes this could miss out important information about cyclicity, especially for a process that is sampled slowly compared with the rate of oscillation. A number of linear filters have been developed which are applicable to this type of data (Section 3.3), this procedure often being described as convolution.
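The five point average can be written as a convolution with a window of five equal weights. The sketch below uses NumPy on a synthetic series (a linear trend plus an oscillation with a period of exactly five samples) to illustrate the warning above: the trend survives, but the cyclicity is removed entirely. The series and the function name `moving_average` are invented for this illustration.

```python
import numpy as np


def moving_average(x, width=5):
    """Smooth x by convolving it with `width` equal weights of 1/width."""
    window = np.ones(width) / width
    # mode="valid" keeps only positions where the window fits completely,
    # so the output is shorter than the input by width - 1 points.
    return np.convolve(x, window, mode="valid")


n = np.arange(50)
trend = 0.1 * n                          # slow, long-term trend
fast = np.sin(2 * np.pi * n / 5)         # oscillation with a 5 sample period
series = trend + fast                    # SYNTHETIC time series

smoothed = moving_average(series)

# The window spans one full period, so the oscillation averages to zero
# and only the trend (at the window centres) is left.
print(np.round(smoothed[:5], 3))
print(np.round(trend[2:7], 3))
```

A window that is shorter than the period, or weights that are not all equal, would attenuate the oscillation rather than remove it.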