Brereton Chemometrics

APPENDICES	453

Figure A.30 PCR dialog box

The PCR dialog box, illustrated in Figure A.30, is considerably more complicated. It is always necessary to have a training set consisting of an ‘x’ block and a ‘c’ block. The latter may consist of more than one column. For PCR, unlike PLS, all columns are treated independently so there is no analogy to PLS2. You can choose three options.

(1) ‘Training set only’ is primarily for building and validating models. It only uses the training set. You need only to specify an ‘x’ and ‘c’ block training set. The number of objects in both sets must be identical.

(2) ‘Prediction of unknowns’ is used to predict concentrations from an unknown series of samples. It is necessary to have an ‘x’ and ‘c’ block training set as well as an ‘x’ block for the unknowns. A model will be built from the training set and applied to the unknowns. There can be any number of unknowns, but the number of variables in the two ‘x’ blocks must be identical.

(3) ‘Use test set (predict and compare)’ allows two sets of blocks where concentrations are known, a training set and a test set. The number of objects in the training and test set will normally differ, but the number of variables in both datasets must be identical.

There are three methods for data scaling, as in PCA, but the relevant column means and standard deviations are always obtained from the training set. If there is a test set, then the training set parameters will be used to scale the test set, so that the test set is unlikely to be mean centred or standardised. Similar scaling is performed on both the ‘c’ and ‘x’ block simultaneously. If you want to apply other forms of scaling (such as summing rows to a constant total), this can be performed manually in Excel and PCA can be performed without further preprocessing. Cross-validation is performed only on the ‘c’ block or concentration predictions; if you choose cross-validation you can only do this on the training set. If you want to perform cross-validation on the ‘x’ block, use the PCA facility.
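The point about scaling a test set with training set parameters can be sketched in Matlab as follows. This is a minimal illustration only, not the Add-in's own code: the matrices Xtrain and Xtest contain made-up numbers, and std is used with its default (sample, n - 1) normalisation, which is an assumption about the form of standardisation.

```matlab
% Hypothetical 'x' blocks: 3 training objects and 2 test objects, 2 variables each
Xtrain = [1 2; 3 4; 5 9];
Xtest  = [2 3; 6 8];

% Column means and standard deviations come from the training set only
m = mean(Xtrain);
s = std(Xtrain);          % assumption: sample standard deviation (n - 1)

% Standardise both blocks with the training set parameters
ntrain = size(Xtrain, 1);
ntest  = size(Xtest, 1);
Xtrain_s = (Xtrain - ones(ntrain, 1)*m) ./ (ones(ntrain, 1)*s);
Xtest_s  = (Xtest  - ones(ntest, 1)*m)  ./ (ones(ntest, 1)*s);

mean(Xtrain_s)            % zero (to rounding error) for the training set
mean(Xtest_s)             % generally not zero for the test set
```

The rows of ones are used to repeat the mean and standard deviation vectors, so the sketch does not rely on any automatic expansion of vectors.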

There are a number of types of output. Eigenvalues, scores and loadings (of the training set) are the same as in PCA, whereas the coefficients relate the PCs to the concentration estimates, and correspond to the matrix R as described in Chapter 5, Section 5.4.1. This information is available if requested in all cases except for cross-validation. Separate statistics can be obtained for the ‘c’ block predictions. There are three levels of output. ‘Summary only’ involves just the errors including the training set error (adjusted by the number of degrees of freedom to give ${}^{1}E_{\mathrm{cal}}$ as described in Chapter 5, Section 5.6.1), the cross-validated error (divided by the number of objects in the training set) and the test set error, as appropriate to the relevant calculation. If the ‘Predictions’ option is selected, then the predicted concentrations are also displayed, and ‘Predictions and Residuals’ provides the residuals as well (if appropriate for the training and test sets), although these can also be calculated manually. If the ‘Show all models’ option is selected, then predicted ‘c’ values and the relevant errors (according to the information required) for 1, 2, 3, up to the chosen number of PCs are displayed. If this option is not selected, only information for the full model is provided.
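The quantities listed above (eigenvalues, scores, loadings and the coefficients relating the scores to the concentrations) can be reproduced for a small example in Matlab. The following is a hedged sketch of PCR on a mean-centred training set, not the Add-in's implementation: the data and the choice of two PCs are invented, and the divisor I - A - 1 in the error is an assumed form of the degrees-of-freedom adjustment.

```matlab
% Hypothetical training set: 5 objects, 3 variables, one 'c' column
X = [1 2 3; 2 1 4; 3 5 2; 4 3 6; 5 4 4];
c = [1.0; 1.2; 2.9; 3.1; 3.8];
A = 2;                                  % number of PCs retained (illustrative choice)

% Mean centre both blocks using training set means
I  = size(X, 1);
Xc = X - ones(I, 1)*mean(X);
cc = c - mean(c);

% PCA by singular value decomposition
[U, S, V] = svd(Xc, 0);
T = U(:, 1:A)*S(1:A, 1:A);              % scores
P = V(:, 1:A)';                         % loadings (one row per PC)
eigenvalues = diag(S(1:A, 1:A)).^2;     % sum of squares of each score vector

% Regression coefficients relating the scores to the concentration
r = (T'*T) \ (T'*cc);

% Predicted concentrations and a degrees-of-freedom adjusted error
chat = T*r + mean(c);
Ecal = sqrt(sum((c - chat).^2) / (I - A - 1));   % assumed form of the adjustment
```

With several columns in the ‘c’ block the same steps would simply be repeated for each column, since PCR treats the columns independently.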

The PLS dialog box, illustrated in Figure A.31, is very similar to PCR, except that there is an option to perform PLS1 (‘one c variable at a time’) (see Chapter 5, Section 5.5.1) as well as PLS2 (Chapter 5, Section 5.5.2). However, even when performing PLS1, it is possible to use several variables in the ‘c’ block; each variable, however, is modelled independently. Instead of coefficients (in PCR) we have ‘C-loadings’ (Q) for PLS, as well as the ‘X-loadings’ (P), although there is only one scores matrix. Strictly, there are no eigenvalues for PLS, but the size of each component is given by the magnitude, which is the product of the sum of squares of the scores and X-loadings for each PLS component. Note that the loadings in the method described in this text are neither normalised nor orthogonal. If one selects PLS2, there will be a single set of ‘Scores’ and ‘X-loadings’ matrices, however many columns there are in the ‘c’ block, but ‘C-loadings’ will be in the form of a matrix. If PLS1 is selected and there is more than one column in the ‘c’ block, separate ‘Scores’ and ‘X-loadings’ matrices are generated for each compound, as well as an associated ‘C-loadings’ vector, so the output can become extensive unless one is careful to select the appropriate options.

Figure A.31 PLS dialog box
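To make the terms scores, X-loadings, C-loadings and magnitude concrete, the following Matlab sketch runs PLS1 on a tiny invented dataset. It follows one common formulation of PLS1 in which the loadings are not normalised; it is written from the general description above rather than taken from the Add-in, so details such as the scaling of the weight vector h are assumptions.

```matlab
% Hypothetical training set: 4 objects, 2 variables, one 'c' column
X = [1 2; 2 1; 3 5; 4 3];
c = [1; 4; -3; 2];
A = 2;                                   % number of PLS components (illustrative)

I  = size(X, 1);
Xr = X - ones(I, 1)*mean(X);             % mean-centred 'x' block (deflated below)
cr = c - mean(c);                        % mean-centred 'c' block (deflated below)

T = zeros(I, A);
P = zeros(A, size(X, 2));
q = zeros(A, 1);
magnitude = zeros(A, 1);
for a = 1:A
    h = Xr'*cr;                          % weight vector
    t = Xr*h / sqrt(sum(h.^2));          % scores
    p = (t'*Xr) / (t'*t);                % X-loadings (row vector, not normalised)
    q(a) = (cr'*t) / (t'*t);             % C-loading
    magnitude(a) = sum(t.^2)*sum(p.^2);  % size of the component
    Xr = Xr - t*p;                       % remove the component from the 'x' block
    cr = cr - t*q(a);                    % and from the 'c' block
    T(:, a) = t;
    P(a, :) = p;
end
chat = T*q + mean(c);                    % fitted concentrations
```

For several ‘c’ columns under PLS1, the loop would be run once per column, giving a separate T, P and q each time, which is why the output can become extensive.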

For both PCR and PLS it is, of course, possible to transpose data, and this can be useful if there are a large number of wavelengths, but both the ‘x’ block and the ‘c’ block must be transposed. These facilities are not restricted to predicting concentrations in spectra of mixtures and can be used for any purpose, such as QSAR or sensory statistics.

The MLR dialog box, illustrated in Figure A.32, is somewhat simpler than the others and is mainly used if two out of X, C and S are known. The type of unknown matrix is chosen and then regions of the spreadsheet of the correct size must be selected. For small datasets MLR can be performed using standard matrix operations in Excel as described in Section A.4.2.2, but for larger matrices it is necessary to have a separate tool, as there is a limitation in the Excel functions. This facility also performs regression using the pseudoinverse, and is mainly provided for completeness. Note that it is not necessary to restrict the data to spectra or concentrations.

Figure A.32 MLR dialog box
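The idea of estimating one of X, C and S from the other two can be sketched in Matlab with the pseudoinverse function pinv. The example assumes the usual bilinear model X = C.S and uses invented, noise-free numbers; it is an illustration, not the Add-in's code.

```matlab
% Hypothetical two-compound mixture data obeying X = C*S
C = [1 0; 0 1; 1 1; 2 1];        % concentrations: 4 mixtures x 2 compounds
S = [0.1 0.5 0.9; 0.8 0.4 0.2];  % spectra: 2 compounds x 3 wavelengths
X = C*S;                         % noise-free mixture spectra: 4 x 3

% X and C known: estimate the spectra
Shat = pinv(C)*X;

% X and S known: estimate the concentrations
Chat = X*pinv(S);
```

With noise-free data both estimates reproduce the original matrices; with real data they are least squares estimates.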

This Add-in provides basic functionality for many of the multivariate methods described in Chapters 4–6 and can be used when solving the problems.

A.5 Matlab for Chemometrics

Many chemometricians use Matlab. In order to appreciate the popularity of this approach, it is important to understand the vintage of chemometrics. The first applications of quantum chemistry, another type of computational chemistry, were developed in the 1960s and 1970s when Fortran was the main numerical programming environment. Hence large libraries of routines were established over this period and to this day most quantum chemists still program in Fortran. Were the discipline of quantum chemistry to start over again, probably Fortran would not be the main programming environment of choice, but tens of thousands (or more) man-years would need to be invested to rewrite entire historical databases of programs. If we were developing an operating system that would be used by tens or hundreds of millions of people, that investment might be worthwhile, but the scientific market is much smaller, so once the environment is established, new researchers tend to stick to it as they can then exchange code and access libraries.

Although some early chemometrics code was developed in Fortran (the Arthur package of Kowalski) and Basic (Wold’s early version of SIMCA) and commercial packages are mainly written in C, most public domain code first became available in the 1980s when Matlab was an up-and-coming new environment. An advantage of Matlab is that it is very much oriented towards matrix operations and most chemometrics algorithms are best expressed in this way. It can be awkward to write matrix-based programs in C, Basic or Fortran unless one has access to or develops specialised libraries. Matlab was originally a technical programming environment mainly for engineers and physical scientists, but over the years the user base has expanded strongly and Matlab has kept pace with new technology including extensive graphics, interfaces to Excel, numerous toolboxes for specialist use and the ability to compile software. In this section we will concentrate primarily on the basics required for chemometrics and also to solve the problems in this book; for the more experienced user there are numerous other outstanding texts on Matlab, including the extensive documentation produced by the developer of the software, MathWorks, which maintains an excellent website. In this book you will be introduced to a number of main features, to help you solve the problems, but as you gain experience you will undoubtedly develop your own personal favourite approaches. Matlab can be used at many levels, and it is now possible to develop sophisticated packages with good graphics in this environment.

There are many versions of Matlab and of Windows and for the more elaborate interfaces between the two packages it is necessary to refer to technical manuals. We will illustrate this section with Matlab version 5.3, although many readers may have access to more up-to-date editions. All are forward compatible. There is a good on-line help facility in Matlab: type help followed by the command, or follow the appropriate menu item. However, it is useful first to have a grasp of the basics which will be described below.

Figure A.33 Matlab window

A.5.1 Getting Started

To start Matlab, click its icon, which should be available if the package is properly installed; a blank window as in Figure A.33 will appear. Each Matlab command is typed on a separate line, terminated by the ENTER key. If the command ends with a semicolon (;), Matlab gives no feedback (unless you have made an error) and you simply type the next command on the next line. Otherwise you are given a response, for example the result of multiplying matrices together. This can be useful, but if the output runs to several lines of numbers that fill the screen and are of little interest, it is best to suppress it.

Matlab is case sensitive (unlike VBA), so the variable x is different to X. Commands are all lower case.
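A short illustration of the semicolon and of case sensitivity; the variable names and values are arbitrary.

```matlab
x = 3;          % semicolon: the assignment is made silently
X = [1 2 3];    % a different variable from x, because names are case sensitive
y = 2*x         % no semicolon: Matlab displays y = 6
```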

A.5.2 Directories

By default Matlab will be installed in the directory C:\matlabrxx\ on your PC, where xx relates to the edition of the package. You can choose to install elsewhere but at first it is best to stick to the standard directories, which we will assume below. You need some knowledge of DOS directory structure to use the directory commands within Matlab. According to particular combinations of versions of Windows and Matlab there is some flexibility, but keeping to the commands below is safe for the first-time user.

A directory c:\matlabrxx\work will be created where the results of your session will be stored unless you specify differently. There are several commands to manage directories. The cd command changes directory so that cd c:\ changes the directory to c:\. If the new directory does not exist you must first create it with the mkdir command. It is best not to include a space in the name of the directory. The following code creates a directory called results on the c drive and makes this the current Matlab directory:

cd c:\
mkdir results
cd results

To return to the default directory simply type

cd c:\matlabrxx\work

where xx relates to the edition number, if this is where the program is stored.

If you get in a muddle, you can check the working directory by typing pwd and find out its contents using dir.
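For example, the following lines report where you are and what the directory contains; the variable name here is arbitrary.

```matlab
% Current working directory, returned as a character string
here = pwd;
disp(here)

% List the contents of the current directory
dir
```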

You can also use the pull down Set Path item on the File menu, but be careful about the compatibility between Matlab and various versions of Windows; it is safest to employ the line commands.

A.5.3 File Types

There are several types of files that one may wish to create and use, but there are three main kinds that are useful for the beginner.

A.5.3.1 mat Files

These files store the ‘workspace’ or variables created during a session. All matrices, vectors and scalars with unique names are saved. Many chemometricians exchange data in this format. The command save places all this information into a file called matlab.mat in the current working directory. Alternatively, you can use the Save Workspace item on the File menu. Normally you wish to save the information as a named file, in which case you enter the filename after the save command. The following code saves the results of a session as a file called mydata in the directory c:\results; the first line depends on the current working directory and requires you to have created c:\results beforehand:

cd c:\results
save mydata

If you want a space in the filename, enclose it in single quotes, e.g. 'Tuesday file'. In order to access these data in Matlab from an existing file, simply use the load command, remembering what directory you are in, or else the Load Workspace item on the File menu. This can be done several times to bring in different variables, but if two or more variables have the same name, the most recently loaded one overwrites the older ones.
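A round trip through a mat file might look as follows. This is a small sketch that writes to whatever the current working directory happens to be; the variables A and b are invented, and the final line simply removes the example file again.

```matlab
A = [1 2; 3 4];
b = 5;

% Write every workspace variable to mydata.mat in the current directory
save mydata

% Remove the two variables from memory, then restore them from the file
clear A b
load mydata

% Tidy up the example file
delete('mydata.mat')
```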

A.5.3.2 m Files

Often it is useful to create programs which can be run again. This is done via m files. The same rules about directories apply as discussed above.

These files are simple text files and may be created in a variety of ways. One way is via the normal Notepad text editor. Simply type in a series of statements, and store them as a file with extension .m. There are five ways in which this file can be run from the Matlab command window.

1. Open the m file, cut and paste the text, place into the Matlab command window and press the return key. The program should run provided that there are no errors.

2. Start Matlab, type open together with the name of the .m file, e.g. open myprog.m, and a separate window should open; see Figure A.34. In the Tools menu select Run and then return to the main Matlab screen, where the results of the program should be displayed.

3. Similarly to method 2, you can use the Open option in the File menu.

4. Provided that you are in the correct directory, you can simply type the name of the m file, and it will run; for example, if a file called prog.m exists in the current directory, just type prog (followed by the ENTER key).

5. Finally, the program can be run via the Run Script facility in the File menu.

Figure A.34 The m file window

Another way of creating an .m file is from the Matlab command window. In the File menu, select New and then M-file. You should be presented with a new Matlab Editor/Debugger window (see Figure A.34) where you can type commands. When you have finished, save the file, best done using the Save As command. Then you can either return to the Matlab window (an icon should be displayed) and run the file as in option 4 above, or run it in the editing window as in option 2 above, but the results will be displayed in the Matlab window. Note that if you make changes you must save the file again before running it. If there are mistakes in the program an error message will be displayed in the Matlab window and you need to edit the commands until the program is correct.
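As an illustration, the contents of a very small m file might be as follows. The filename prog.m and the data are hypothetical; once the file is saved in the current directory, typing prog runs it.

```matlab
% prog.m - a hypothetical script file saved in the current directory
X = [1 2; 3 4; 5 6];     % a 3 x 2 data matrix
colmeans = mean(X)       % column means; displayed because there is no semicolon
```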

A.5.3.3 Diary Files

These files keep a record of a session. The simplest approach is not to use diary files but just to copy and paste the text of a Matlab session, but diary files can be useful because one can selectively save just certain commands. In order to start a diary file type diary (a default file called diary will be created in the current directory) or diary filename where filename is the name of the file. This automatically opens a file into which all subsequent commands used in a session, together with their results, are stored. To stop recording simply type diary off and to start again (in the same file) type diary on.

The file can be viewed as a text file, in the Text Editor. Note that you must close the diary session before the information is saved.
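A short diary session could look like this. The filename session1.txt is made up for the example, and the function forms diary('...') and type('...') are used here so that the comments sit on their own lines.

```matlab
% Start recording into a named file in the current directory
diary('session1.txt')

% This command and its displayed result are recorded
W = [2 7 8; 0 1 6]

% Stop recording, so that the text is written to the file
diary('off')

% Display the recorded text in the command window
type('session1.txt')
```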

A.5.4 Matrices

The key to Matlab is matrices. Understanding how Matlab copes with matrices is essential for the user of this environment.

A.5.4.1 Scalars, Vectors and Matrices

It is possible to handle scalars, vectors and matrices in Matlab. The package automatically determines the nature of a variable when first introduced. A scalar is simply a number, so

P = 2

sets up a scalar P equal to 2. Notice that there is a distinction between upper and lower case, and it is entirely possible that another scalar p (lower case) co-exists:

p = 7

It is not necessary to restrict a name to a single letter, but all matrix names must start with an alphabetic rather than numeric character and not contain spaces.

For one- and two-dimensional arrays, it is important to enclose the information within square brackets. A row vector can be defined by

Y = [2 8 7]

resulting in a 1 × 3 row vector. A column vector is treated rather differently, as a matrix of three rows and one column. If a matrix or vector is typed on a single line, each new row is started by a semicolon, so a 3 × 1 column vector may be defined by

Z = [1; 4; 7]

Alternatively, it is possible to place each row on a separate line, so

Z = [1
4
7]

has the same effect. Another trick is to enter as a row vector and then take the transpose (see Section A.5.4.3).

Matrices can be similarly defined, e.g.

W = [2 7 8; 0 1 6]

and

W = [2 7 8
0 1 6]

are alternative ways, in the Matlab window, of setting up a 2 × 3 matrix.

One can specifically obtain the value of any element of a matrix, for example W(2,1) gives the element on the second row and first column of W which equals 0 in this case. For vectors, only one dimension is needed, so Z(2) equals 4 and Y(3) equals 7.

Figure A.35 Obtaining vectors from matrices

It is also possible to extract single rows or columns from a matrix, by using a colon operator. The second row of matrix X is denoted by X(2,:). This is exemplified in Figure A.35. It is possible to define any rectangular region of a matrix, using the colon operator. For example, if S is a matrix having dimensions 12 × 8 we may want a sub-matrix between rows 7 to 9 and columns 5 to 8, and it is simply necessary to define S(7:9, 5:8).
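The indexing rules can be tried on the small matrices defined earlier. In the sketch below the 12 × 8 matrix is an invented one filled with the numbers 1 to 96, and the block extracted is chosen to lie inside its dimensions.

```matlab
W = [2 7 8; 0 1 6];
Y = [2 8 7];
Z = [1; 4; 7];

w21  = W(2,1);      % element in row 2, column 1
z2   = Z(2);        % second element of the column vector
row2 = W(2,:);      % whole second row
col3 = W(:,3);      % whole third column

% Hypothetical 12 x 8 matrix holding the numbers 1 to 96, filled column by column
S = reshape(1:96, 12, 8);
sub = S(7:9, 5:8);  % rows 7 to 9 and columns 5 to 8: a 3 x 4 block
```

Asking for rows or columns beyond the size of the matrix produces an error rather than an empty result.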

If you want to find out how many matrices are in memory, use the function who, which lists all current matrices available to the program, or whos, which contains details about their size. This is sometimes useful if you have had a long Matlab session or have imported a number of datasets; see Figure A.36.

There is a special notation for the identity matrix. The command eye(3) sets up a 3 × 3 identity matrix, the number enclosed in the brackets referring to the dimensions.
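A brief example of these commands, using arbitrary variables:

```matlab
P = 2;
W = [2 7 8; 0 1 6];
I3 = eye(3);     % 3 x 3 identity matrix

% List the variables in memory, then list them with their sizes
who
whos
```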

A.5.4.2 Basic Arithmetic Matrix Operations

The basic matrix operations +, − and * correspond to the normal matrix addition, subtraction and multiplication (using the dot product); for scalars these are also defined in the usual way. For the first two operations the two matrices should generally have the same dimensions, and for multiplication the number of columns of the first matrix should equal the number of rows of the second matrix. It is possible to place the results in a target or else simply display them on the screen as a default variable called ans.

Figure A.36 Use of whos command to determine how many matrices are available

Figure A.37 Simple matrix operations in Matlab

Figure A.37 exemplifies setting up three matrices, a 3 × 2 matrix X, a 2 × 3 matrix Y and a 3 × 3 matrix Z, and calculating X.Y + Z (typed in Matlab as X*Y + Z).

There are a number of elaborations based on these basic operations, but the first-time user is recommended to keep things simple. However, it is worth noting that it is possible to add scalars to matrices. An example involves adding the number 2 to each element of W as defined above: either type W + 2 or first define a scalar, e.g. P = 2, and then add this using the command W + P. Similarly, one can multiply, subtract or divide all elements of a matrix by a scalar. Note that, in the version of Matlab described here, it is not possible to add a vector to a matrix even if the vector has one dimension identical with that of the matrix (releases from R2016b onwards expand such operands automatically).
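Since the figures are screenshots, a plain-text example of the same kind of calculation may help. The numbers below are invented and are not those shown in Figure A.37.

```matlab
X = [1 2; 3 4; 5 6];      % 3 x 2
Y = [1 0 2; 0 1 3];       % 2 x 3
Z = eye(3);               % 3 x 3
R = X*Y + Z               % 3 x 3 result, stored in R and displayed

W = [2 7 8; 0 1 6];
P = 2;
W + P                     % adds 2 to every element of W; shown as ans
W/P                       % divides every element of W by 2; shown as ans
```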

A.5.4.3 Matrix Functions

A significant advantage of Matlab is that there are several further very useful matrix operations. Most are in the form of functions; the arguments are enclosed in brackets. Three that are important in chemometrics are as follows:

• transpose is denoted by ', e.g. W' is the transpose of W;

• inverse is a function inv so that inv(Q) is the inverse of a square matrix Q;
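The transpose and inverse described above can be tried on small matrices; the matrix Q here is an arbitrary non-singular example.

```matlab
W = [2 7 8; 0 1 6];
Wt = W';                  % transpose: a 3 x 2 matrix
Q = [2 1; 1 3];           % a square, non-singular matrix
Qi = inv(Q);              % inverse of Q
check = Q*Qi;             % should equal the 2 x 2 identity matrix, to rounding error
```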
