
Brereton Chemometrics
APPENDICES
Figure A.30
PCR dialog box
The PCR dialog box, illustrated in Figure A.30, is considerably more complicated. It is always necessary to have a training set consisting of an ‘x’ block and a ‘c’ block. The latter may consist of more than one column. For PCR, unlike PLS, all columns are treated independently so there is no analogue of PLS2. You can choose one of three options.
(1) ‘Training set only’ is primarily for building and validating models. It only uses the training set. You need only to specify an ‘x’ and ‘c’ block training set. The number of objects in both sets must be identical.
(2) ‘Prediction of unknowns’ is used to predict concentrations from an unknown series of samples. It is necessary to have an ‘x’ and ‘c’ block training set as well as an ‘x’ block for the unknowns. A model will be built from the training set and applied to the unknowns. There can be any number of unknowns, but the number of variables in the two ‘x’ blocks must be identical.
(3) ‘Use test set (predict and compare)’ allows two sets of blocks where concentrations are known, a training set and a test set. The number of objects in the training and test set will normally differ, but the number of variables in both datasets must be identical.
There are three methods for data scaling, as in PCA, but the relevant column means and standard deviations are always obtained from the training set. If there is a test set, then the training set parameters will be used to scale the test set, so that the test set itself is unlikely to be exactly mean centred or standardised. Similar scaling is performed on both the ‘c’ and ‘x’ blocks simultaneously. If you want to apply other forms of scaling (such as summing rows to a constant total), this can be performed manually in Excel and PCA can be performed without further preprocessing. Cross-validation is performed only on the ‘c’ block or concentration predictions; if you choose cross-validation you can only do this on the training set. If you want to perform cross-validation on the ‘x’ block, use the PCA facility.
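The reuse of training set parameters for a test set can be sketched in Matlab as follows. This is only an illustration and not the Add-in's own code: the data and the names Xtrain, Xtest, m and s are invented, and the population standard deviation (division by the number of objects) is an assumption of this sketch.

```matlab
% Illustrative data (hypothetical): 4 training and 2 test objects, 2 variables
Xtrain = [1 10; 2 14; 3 12; 6 20];
Xtest  = [2 11; 5 18];

% Column means and standard deviations come from the training set only.
% std(X,1) divides by the number of objects (an assumption of this sketch).
m = mean(Xtrain);
s = std(Xtrain, 1);

% Standardise both blocks with the training set parameters.
% ones(n,1)*m repeats the row vector m down n rows.
ntrain = size(Xtrain, 1);
ntest  = size(Xtest, 1);
Ztrain = (Xtrain - ones(ntrain,1)*m) ./ (ones(ntrain,1)*s);
Ztest  = (Xtest  - ones(ntest,1)*m)  ./ (ones(ntest,1)*s);

mean(Ztrain)   % zero (to rounding error) for every column
mean(Ztest)    % generally not zero, since the training means were used
```

The scaled training set has zero column means and unit standard deviations, whereas the scaled test set generally has neither.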
There are a number of types of output. Eigenvalues, scores and loadings (of the training set) are the same as in PCA, whereas the coefficients relate the PCs to the concentration estimates, and correspond to the matrix R as described in Chapter 5, Section 5.4.1. This information is available if requested in all cases except for cross-validation. Separate statistics can be obtained for the ‘c’ block predictions. There are three levels of output. ‘Summary only’ involves just the errors, including the training set error (adjusted by the number of degrees of freedom to give ¹Ecal as described in Chapter 5, Section 5.6.1), the cross-validated error (divided by the number of objects in the training set) and the test set error, as appropriate to the relevant calculation. If the ‘Predictions’ option is selected, then the predicted concentrations are also displayed, and ‘Predictions and Residuals’ provides the residuals as well (if appropriate for the training and test sets), although these can also be calculated manually. If the ‘Show all models’ option is selected, then predicted ‘c’ values and the relevant errors (according to the information required) for 1, 2, 3, up to the chosen number of PCs are displayed. If this option is not selected, only information for the full model is provided.
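A minimal Matlab sketch of what such output might contain is given below; it is not the Add-in's implementation. The data are invented, the scores are obtained by singular value decomposition of the mean-centred ‘x’ block, the coefficients are obtained by least squares, and the divisor I - a - 1 for the training set error is an assumption of this sketch that should be checked against Chapter 5, Section 5.6.1.

```matlab
% Illustrative training set (hypothetical): 6 samples, 3 variables, 1 compound
X = [1.0  2.1 0.9; 2.0  3.9 2.2; 3.1  6.2 2.8;
     4.0  7.8 4.1; 5.2 10.1 5.0; 6.1 11.9 6.2];
c = [1.1; 1.9; 3.2; 3.9; 5.1; 6.0];

[I, J] = size(X);
cbar = mean(c);
Xc = X - ones(I,1)*mean(X);     % mean-centre the x block
cc = c - cbar;                  % mean-centre the c block

[U, S, V] = svd(Xc, 0);         % economy-size SVD
Tall = U*S;                     % scores for all components
A = 3;                          % largest model size to report

rss  = zeros(A,1);              % residual sum of squares for each model
Ecal = zeros(A,1);              % training set error for each model
for a = 1:A
    T = Tall(:,1:a);            % scores of the a-component model
    r = T \ cc;                 % coefficients relating scores to c
    chat = T*r + cbar;          % predicted concentrations
    rss(a)  = sum((c - chat).^2);
    Ecal(a) = sqrt(rss(a)/(I - a - 1));   % assumed degrees of freedom
end
[rss Ecal]                      % one row per model, as in 'Show all models'
```

Because the scores are orthogonal, the residual sum of squares cannot increase as components are added, although the adjusted error may.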
Figure A.31
PLS dialog box
The PLS dialog box, illustrated in Figure A.31, is very similar to PCR, except that there is an option to perform PLS1 (‘one c variable at a time’) (see Chapter 5, Section 5.5.1) as well as PLS2 (Chapter 5, Section 5.5.2). However, even when performing PLS1, it is possible to use several variables in the ‘c’ block; each variable, however, is modelled independently. Instead of coefficients (in PCR) we have ‘C-loadings’ (Q) for PLS, as well as the ‘X-loadings’ (P), although there is only one scores matrix. Strictly, there are no eigenvalues for PLS, but the size of each component is given by the magnitude, which is the product of the sum of squares of the scores and X-loadings for each PLS component. Note that the loadings in the method described in this text are neither normalised nor orthogonal. If one selects PLS2, there will be a single set of ‘Scores’ and ‘X-loadings’ matrices, however many columns there are in the ‘c’ block, but ‘C-loadings’ will be in the form of a matrix. If PLS1 is selected and there is more than one column in the ‘c’ block, separate ‘Scores’ and ‘X-loadings’ matrices are generated for each compound, as well as an associated ‘C-loadings’ vector, so the output can become extensive unless one is careful to select the appropriate options.
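The magnitude of a PLS component can be illustrated with a short Matlab sketch of one common PLS1 formulation. It is assumed here, rather than confirmed, that these steps match those of Chapter 5, Section 5.5.1, which remains the definitive description; the data and all variable names are invented.

```matlab
% Illustrative training set (hypothetical): 6 samples, 3 variables, 1 compound
X = [1.0  2.1 0.9; 2.0  3.9 2.2; 3.1  6.2 2.8;
     4.0  7.8 4.1; 5.2 10.1 5.0; 6.1 11.9 6.2];
c = [1.1; 1.9; 3.2; 3.9; 5.1; 6.0];

[I, J] = size(X);
Xr = X - ones(I,1)*mean(X);     % mean-centred x block, deflated in the loop
cr = c - mean(c);               % mean-centred c block, deflated in the loop
A = 2;                          % number of PLS components

T = zeros(I,A);                 % scores
P = zeros(A,J);                 % X-loadings, one row per component
q = zeros(A,1);                 % C-loadings
magnitude = zeros(A,1);
for a = 1:A
    h  = Xr' * cr;                    % direction of maximum covariance with c
    t  = Xr*h / sqrt(sum(h.^2));      % scores
    p  = (t' * Xr) / sum(t.^2);       % X-loadings (not normalised)
    qa = (cr' * t) / sum(t.^2);       % C-loading
    magnitude(a) = sum(t.^2) * sum(p.^2);
    Xr = Xr - t*p;                    % remove the component from the x block
    cr = cr - t*qa;                   % remove the component from the c block
    T(:,a) = t;  P(a,:) = p;  q(a) = qa;
end
magnitude
```

The magnitude plays the role that an eigenvalue plays in PCA, indicating the size of each component.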
For both PCR and PLS it is, of course, possible to transpose data, and this can be useful if there are a large number of wavelengths, but both the ‘x’ block and the ‘c’ block must be transposed. These facilities are not restricted to predicting concentrations in spectra of mixtures and can be used for any purpose, such as QSAR or sensory statistics.
Figure A.32
MLR dialog box
The MLR dialog box, illustrated in Figure A.32, is somewhat simpler than the others and is mainly used if two out of X, C and S are known. The type of unknown matrix is chosen and then regions of the spreadsheet of the correct size must be selected. For small datasets MLR can be performed using standard matrix operations in Excel as described in Section A.4.2.2, but for larger matrices it is necessary to have a separate tool, as there is a limitation in the Excel functions. This facility also performs regression using the pseudoinverse, and is mainly provided for completeness. Note that it is not necessary to restrict the data to spectra or concentrations.
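The pseudoinverse calculation can be sketched in Matlab under the assumption that the three matrices are related by X = C.S, with C holding concentrations and S the spectra. The numbers below are invented and noise free, so the unknown matrix is recovered exactly; with real data the result is a least squares estimate.

```matlab
% Hypothetical example: 4 mixtures, 2 compounds, 5 wavelengths
C = [1 0; 0 1; 1 1; 2 1];               % concentrations (4 x 2)
S = [0.1 0.5 0.9 0.4 0.1;
     0.7 0.3 0.2 0.6 0.8];              % spectra (2 x 5)
X = C*S;                                % mixture spectra (4 x 5), no noise

% Case 1: X and C are known, S is unknown
Sest = pinv(C)*X;

% Case 2: X and S are known, C is unknown
Cest = X*pinv(S);
```

Here pinv is the built-in pseudoinverse function; Sest and Cest reproduce S and C to within rounding error.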
This Add-in provides basic functionality for many of the multivariate methods described in Chapters 4–6 and can be used when solving the problems.
A.5 Matlab for Chemometrics
Many chemometricians use Matlab. In order to appreciate the popularity of this approach, it is important to understand the vintage of chemometrics. The first applications of quantum chemistry, another type of computational chemistry, were developed in the 1960s and 1970s when Fortran was the main numerical programming environment. Hence large libraries of routines were established over this period and to this day most quantum chemists still program in Fortran. Were the discipline of quantum chemistry to start over again, probably Fortran would not be the main programming environment of choice, but tens of thousands (or more) man-years would need to be invested to rewrite entire historical databases of programs. If we were developing an operating system that would be used by tens or hundreds of millions of people, that investment might be worthwhile, but the scientific market is much smaller, so once the environment is established, new researchers tend to stick to it as they can then exchange code and access libraries.
Although some early chemometrics code was developed in Fortran (the Arthur package of Kowalski) and Basic (Wold’s early version of SIMCA) and commercial packages are mainly written in C, most public domain code first became available in the 1980s when Matlab was an up and coming new environment. An advantage of Matlab is that it is very much oriented towards matrix operations and most chemometrics algorithms are best expressed in this way. It can be awkward to write matrix based programs in C, Basic or Fortran unless one has access to or develops specialised libraries. Matlab was originally a technical programming environment mainly for engineers and physical scientists, but over the years the user base has expanded strongly and Matlab has kept pace with new technology including extensive graphics, interfaces to Excel, numerous toolboxes for specialist use and the ability to compile software. In this section we will concentrate primarily on the basics required for chemometrics and also to solve the problems in this book; for the more experienced user there are numerous other outstanding texts on Matlab, including the extensive documentation produced by the developer of the software, MathWorks, which maintains an excellent Website. In this book you will be introduced to a number of main features, to help you solve the problems, but as you gain experience you will undoubtedly develop your own personal favourite approaches. Matlab can be used at many levels, and it is now possible to develop sophisticated packages with good graphics in this environment.
There are many versions of Matlab and of Windows and for the more elaborate interfaces between the two packages it is necessary to refer to technical manuals. We will illustrate this section with Matlab version 5.3, although many readers may have access to more up-to-date editions. All are forward compatible. There is a good on-line help facility in Matlab: type help followed by the command, or follow the appropriate menu item. However, it is useful first to have a grasp of the basics which will be described below.

Figure A.33
Matlab window
A.5.1 Getting Started
To start Matlab it is easiest to simply click the icon which should be available if properly installed, and a blank screen as in Figure A.33 will appear. Each Matlab command is typed on a separate line, terminated by the ENTER key. If the ENTER key is preceded by a semi-colon (;) there is no feedback from Matlab (unless you have made an error) and on the next line you type the next command and so on. Otherwise, you are given a response, for example the result of multiplying matrices together, which can be useful but if the information contains several lines of numbers which fill up a screen and which may not be very interesting, it is best to suppress this.
Matlab is case sensitive (unlike VBA), so the variable x is different to X. Commands are all lower case.
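A short illustrative session (the variable names are arbitrary) shows both the effect of the semi-colon and case sensitivity:

```matlab
x = 3;              % semi-colon: the value is stored but nothing is echoed
X = [1 2; 3 4];     % X is a different variable from x
x                   % no semi-colon: Matlab displays the value of x
y = x*X             % no semi-colon: the result is displayed
```

The last line multiplies every element of the matrix X by the scalar x.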
A.5.2 Directories
By default Matlab will be installed in the directory C:\matlabrxx\ on your PC, where xx relates to the edition of the package. You can choose to install elsewhere but at first it is best to stick to the standard directories, which we will assume below. You need some knowledge of DOS directory structure to use the directory commands within Matlab. According to particular combinations of versions of Windows and Matlab there is some flexibility, but keeping to the commands below is safe for the first time user.
A directory c:\matlabrxx\work will be created where the results of your session will be stored unless you specify differently. There are several commands to manage directories. The cd command changes directory so that cd c:\ changes the directory to c:\. If the new directory does not exist you must first create it with the mkdir command. It is best not to include a space in the name of the directory. The following code creates a directory called results on the c drive and makes this the current Matlab directory:
cd c:\
mkdir results
cd results
To return to the default directory simply key
cd c:\matlabrxx\work
where xx relates to the edition number, if this is where the program is stored.
If you get in a muddle, you can check the working directory by typing pwd and find out its contents using dir.
You can also use the pull down Set Path item on the File menu, but be careful about the compatibility between Matlab and various versions of Windows; it is safest to employ the line commands.
A.5.3 File Types
There are several types of files that one may wish to create and use, but there are three main kinds that are useful for the beginner.
A.5.3.1 mat Files
These files store the ‘workspace’ or variables created during a session. All matrices, vectors and scalars with unique names are saved. Many chemometricians exchange data in this format. The command save places all this information into a file called matlab.mat in the current working directory. Alternatively, you can use the Save Workspace item on the File menu. Normally you wish to save the information as a named file, in which case you enter the filename after the save command. The following code saves the results of a session as a file called mydata in the directory c:\results; the first line depends on the current working directory and requires this directory to have been created first:
cd c:\results
save mydata
If you want a space in the filename, enclose it in single quotes, e.g. 'Tuesday file'. In order to access these data in Matlab from an existing file, simply use the load command, remembering what directory you are in, or else the Load Workspace item on the File menu. This can be done several times to bring in different variables, but if two or more variables have the same names, the most recent overwrite the old ones.
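The round trip of saving and reloading can be sketched as below. To keep the example independent of any particular directory, it uses the function form of save and load and writes to the system temporary directory given by tempdir; the filename mydata_demo.mat and the variables A and b are invented for the illustration.

```matlab
A = [1 2; 3 4];
b = 5;

% Build a full filename in the temporary directory (hypothetical name)
fname = fullfile(tempdir, 'mydata_demo.mat');

save(fname, 'A', 'b');   % save only the named variables
clear A b                % remove them from the workspace
load(fname);             % A and b are restored from the file

delete(fname);           % tidy up the demonstration file
```

Typing save mydata and load mydata in the command window has the same effect for a file in the current working directory, except that save without a list of names stores every variable.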
A.5.3.2 m Files
Often it is useful to create programs which can be run again. This is done via m files. The same rules about directories apply as discussed above.
These files are simple text files and may be created in a variety of ways. One way is via the normal Notepad text editor. Simply type in a series of statements, and store them as a file with extension .m. There are five ways in which this file can be run from the Matlab command window.
1. Open the m file, cut and paste the text, place into the Matlab command window and press the return key. The program should run provided that there are no errors.
2. Start Matlab, type open together with the name of the .m file, e.g. open myprog.m, and a separate window should open; see Figure A.34. In the Tools menu select Run and then return to the main Matlab screen, where the results of the program should be displayed.
3. Similarly to method 2, you can use the Open option in the File menu.
4. Provided that you are in the correct directory, you can simply type the name of the m file, and it will run; for example, if a file called prog.m exists in the current directory, just type prog (followed by the ENTER key).
5. Finally, the program can be run via the Run Script facility in the File menu.
Figure A.34
The m file window
Another way of creating an m file is from the Matlab command window. In the File menu, select New and then M-file. You should be presented with a new Matlab Editor/Debugger window (see Figure A.34) where you can type commands. When you have finished, save the file, best done using the Save As command. Then you can either return to the Matlab window (an icon should be displayed) and run the file as in option 4 above, or run it in the editing window as in option 2 above, in which case the results will still be displayed in the Matlab window. Note that if you make changes you must save the file again before running it. If there are mistakes in the program an error message will be displayed in the Matlab window and you need to edit the commands until the program is correct.
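As an illustration, the contents of a very small m file might look as follows. The filename prog.m and the calculation are purely hypothetical; any sequence of valid Matlab statements can be stored in this way.

```matlab
% prog.m - a short script (hypothetical example)
W = [2 7 8; 0 1 6];      % a 2 x 3 matrix
colsum = sum(W);         % column sums; the semi-colon suppresses output
total = sum(colsum)      % no semi-colon, so the result (24) is displayed
```

If these lines are saved as prog.m in the current directory, typing prog in the command window runs them and leaves W, colsum and total in the workspace.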
A.5.3.3 Diary Files
These files keep a record of a session. The simplest approach is not to use diary files but just to copy and paste the text of a Matlab session, but diary files can be useful because one can selectively save just certain commands. In order to start a diary file type diary (a default file called diary will be created in the current directory) or diary filename where filename is the name of the file. This automatically opens a file into which all subsequent commands used in a session, together with their results, are stored. To stop recording simply type diary off and to start again (in the same file) type diary on.
The file can be viewed as a text file, in the Text Editor. Note that you must close the diary session before the information is saved.
A.5.4 Matrices
The key to Matlab is matrices. Understanding how Matlab copes with matrices is essential for the user of this environment.
A.5.4.1 Scalars, Vectors and Matrices
It is possible to handle scalars, vectors and matrices in Matlab. The package automatically determines the nature of a variable when first introduced. A scalar is simply a number, so
P = 2
sets up a scalar P equal to 2. Notice that there is a distinction between upper and lower case, and it is entirely possible that another scalar p (lower case) co-exists:
p = 7
It is not necessary to restrict a name to a single letter, but all matrix names must start with an alphabetic rather than numeric character and not contain spaces.
For one- and two-dimensional arrays, it is important to enclose the information within square brackets. A row vector can be defined by
Y = [2 8 7]
resulting in a 1 × 3 row vector. A column vector is treated rather differently as a matrix of three rows and one column. If a matrix or vector is typed on a single line, each new row starts after a semicolon, so a 3 × 1 column vector may be defined by
Z = [1; 4; 7]
Alternatively, it is possible to place each row on a separate line, so
Z = [1
4
7]
has the same effect. Another trick is to enter as a row vector and then take the transpose (see Section A.5.4.3).
Matrices can be similarly defined, e.g.
W = [2 7 8; 0 1 6]
and
W = [2 7 8
0 1 6]
are alternative ways, in the Matlab window, of setting up a 2 × 3 matrix.
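The definitions above can be collected into one short session, using size to confirm the dimensions Matlab has assigned. The name Zalt is introduced here only for the illustration.

```matlab
P = 2;                   % scalar
Y = [2 8 7];             % 1 x 3 row vector
Z = [1; 4; 7];           % 3 x 1 column vector, rows separated by semicolons
Zalt = [1 4 7]';         % the same column vector, entered as a row and transposed
W = [2 7 8; 0 1 6];      % 2 x 3 matrix

size(Y)                  % 1 3
size(Z)                  % 3 1
size(W)                  % 2 3
```

The function size returns the number of rows followed by the number of columns.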
One can specifically obtain the value of any element of a matrix, for example W(2,1) gives the element on the second row and first column of W which equals 0 in this case. For vectors, only one dimension is needed, so Z(2) equals 4 and Y(3) equals 7.
Figure A.35
Obtaining vectors from matrices
It is also possible to extract single rows or columns from a matrix, by using a colon operator. The second row of matrix X is denoted by X(2,:). This is exemplified in Figure A.35. It is possible to define any rectangular region of a matrix, using the colon operator. For example, if S is a matrix having dimensions 12 × 12 we may want a sub-matrix between rows 7 to 9 and columns 5 to 12, and it is simply necessary to define S(7:9, 5:12).
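These forms of indexing can be tried on the matrix W defined earlier. The matrix S below is a hypothetical 12 × 12 matrix, filled with the numbers 1 to 144 simply so that it is large enough for the ranges requested.

```matlab
W = [2 7 8; 0 1 6];

e    = W(2,1);                 % single element: 0
row2 = W(2,:);                 % second row: [0 1 6]
col3 = W(:,3);                 % third column: [8; 6]

S = reshape(1:144, 12, 12);    % hypothetical 12 x 12 matrix, filled column by column
sub = S(7:9, 5:12);            % rows 7 to 9, columns 5 to 12
size(sub)                      % 3 8
```

Requesting rows or columns beyond the dimensions of the matrix results in an error.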
If you want to find out how many matrices are in memory, use the function who, which lists all current matrices available to the program, or whos, which contains details about their size. This is sometimes useful if you have had a long Matlab session or have imported a number of datasets; see Figure A.36.
There is a special notation for the identity matrix. The command eye(3) sets up a 3 × 3 identity matrix, the number enclosed in the brackets referring to the dimensions.
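A brief illustration follows; the matrix Q is invented, and multiplying it by the identity matrix leaves it unchanged.

```matlab
I3 = eye(3);                   % 3 x 3 identity matrix
Q  = [2 0 1; 1 3 0; 0 1 4];    % hypothetical square matrix

isequal(Q*I3, Q)               % displays 1 (true)
whos                           % lists I3 and Q together with their sizes
```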
A.5.4.2 Basic Arithmetic Matrix Operations
The basic matrix operations +, − and * correspond to the normal matrix addition, subtraction and multiplication (using the dot product); for scalars these are also defined in the usual way. For the first two operations the two matrices should generally have the same dimensions, and for multiplication the number of columns of the first matrix should equal the number of rows of the second matrix. It is possible to place the results in a target or else simply display them on the screen as a default variable called ans.

Figure A.36
Use of whos command to determine how many matrices are available
Figure A.37
Simple matrix operations in Matlab
Figure A.37 exemplifies setting up three matrices, a 3 × 2 matrix X, a 2 × 3 matrix Y and a 3 × 3 matrix Z, and calculating X.Y + Z (typed in Matlab as X*Y + Z).
There are a number of elaborations based on these basic operations, but the first time user is recommended to keep things simple. However, it is worth noting that it is possible to add scalars to matrices. An example involves adding the number 2 to each element of W as defined above: either type W + 2 or first define a scalar, e.g. P = 2, and then add this using the command W + P. Similarly, one can multiply, subtract or divide all elements of a matrix by a scalar. Note that in the version described here it is not possible to add a vector to a matrix even if the vector has one dimension identical with that of the matrix; more recent releases (R2016b onwards) expand such a vector automatically.
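The operations of this section can be sketched with small matrices. The numbers are invented and are not those of Figure A.37. In the last step the vector is first expanded to the size of the matrix using ones, an approach that works in any version of Matlab.

```matlab
X = [1 2; 3 4; 5 6];           % 3 x 2
Y = [1 0 2; 0 1 1];            % 2 x 3
Z = eye(3);                    % 3 x 3

R = X*Y + Z                    % [2 2 4; 3 5 10; 5 6 17]

W  = [2 7 8; 0 1 6];
W2 = W + 2                     % adds 2 to every element: [4 9 10; 2 3 8]

v  = [1 2 3];                  % row vector with as many columns as W
Wv = W + ones(2,1)*v           % v repeated on each row: [3 9 11; 1 3 9]
```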
A.5.4.3 Matrix Functions
A significant advantage of Matlab is that there are several further very useful matrix operations. Most are in the form of functions; the arguments are enclosed in brackets. Three that are important in chemometrics are as follows:
• transpose is denoted by ', e.g. W' is the transpose of W;
• inverse is a function inv so that inv(Q) is the inverse of a square matrix Q;