15 Support vector machines and machine learning on documents
Improving classifier effectiveness has been an area of intensive machine learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1), and then extend the model in Section 15.2 to non-separable data, multi-class problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3: what sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4). While several machine learning methods have been applied to this task, use of SVMs has been prominent. Support vector machines are not necessarily better than other machine learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.
[Figure: a maximum margin decision hyperplane separating two classes, with the support vectors lying on the margin; the margin is maximized.]
Figure 15.1 The support vectors are the 5 points right up against the margin of the classifier.
15.1 Support vector machines: The linearly separable case
For two-class, separable training data sets, such as the one in Figure 14.8 (page 301), there are lots of possible linear separators. Intuitively, a decision boundary drawn in the middle of the void between data items of the two classes seems better than one which approaches very close to examples of one or both classes. While some learning methods such as the perceptron algorithm (see references in Section 14.7, page 314) find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). Figure 15.1 shows the margin and support vectors for a sample problem. Other data points play no part in determining the decision surface that is chosen.
Figure 15.2 An intuition for large-margin classification. Insisting on a large margin reduces the capacity of the model: the range of angles at which the fat decision surface can be placed is smaller than for a decision hyperplane (cf. Figure 14.8, page 301).
Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions: there is almost a 50% chance of the classifier deciding either way. A classifier with a large margin makes no low certainty classification decisions. This gives you a classification safety margin: a slight error in measurement or a slight document variation will not cause a misclassification. Another intuition motivating SVMs is shown in Figure 15.2. By construction, an SVM classifier insists on a large margin around the decision boundary. Compared to a decision hyperplane, if you have to place a fat separator between classes, you have fewer choices of where it can be put. As a result of this, the memory capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data is increased (cf. the discussion of the bias-variance tradeoff in Chapter 14, page 312).
Let us formalize an SVM with algebra. A decision hyperplane (page 302) can be defined by an intercept term $b$ and a decision hyperplane normal vector $\vec{w}$ which is perpendicular to the hyperplane. This vector is commonly referred to in the machine learning literature as the weight vector. To choose among all the hyperplanes that are perpendicular to the normal vector, we specify the intercept term $b$. Because the hyperplane is perpendicular to the normal vector, all points $\vec{x}$ on the hyperplane satisfy $\vec{w}^T\vec{x} = -b$. Now suppose that we have a set of training data points $D = \{(\vec{x}_i, y_i)\}$, where each member is a pair of a point $\vec{x}_i$ and a class label $y_i$ corresponding to it.¹ For SVMs, the two data classes are always named $+1$ and $-1$ (rather than 1 and 0), and the intercept term is always explicitly represented as $b$ (rather than being folded into the weight vector $\vec{w}$ by adding an extra always-on feature). The math works out much more cleanly if you do things this way, as we will see almost immediately in the definition of functional margin. The linear classifier is then:
(15.1)   $f(\vec{x}) = \mathrm{sign}(\vec{w}^T\vec{x} + b)$

A value of $-1$ indicates one class, and a value of $+1$ the other class.
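To make Equation (15.1) concrete, here is a minimal sketch (added for illustration, not part of the original text) of the linear classifier in NumPy; the weight vector, intercept term, and test points are made-up values.

```python
# Minimal sketch of the linear classifier of Equation (15.1).
# The weight vector w, intercept b, and test points are illustrative values only.
import numpy as np

w = np.array([1.0, 0.5])   # decision hyperplane normal vector (weight vector)
b = -2.5                   # intercept term

def classify(x):
    # f(x) = sign(w^T x + b); returns +1 or -1
    return int(np.sign(w @ x + b))

print(classify(np.array([3.0, 3.0])))   # +1: falls on the positive side
print(classify(np.array([1.0, 1.0])))   # -1: falls on the negative side
```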
We are confident in the classification of a point if it is far away from the decision boundary. For a given data set and decision hyperplane, we define the functional margin of the $i$th example $\vec{x}_i$ with respect to a hyperplane $\langle\vec{w}, b\rangle$ as the quantity $y_i(\vec{w}^T\vec{x}_i + b)$. The functional margin of a data set with respect to a decision surface is then twice the functional margin of any of the points in the data set with minimal functional margin (the factor of 2 comes from measuring across the whole width of the margin, as in Figure 15.3). However, there is a problem with using this definition as is: the value is underconstrained, because we can always make the functional margin as big as we wish by simply scaling up $\vec{w}$ and $b$. For example, if we replace $\vec{w}$ by $5\vec{w}$ and $b$ by $5b$ then the functional margin $y_i(5\vec{w}^T\vec{x}_i + 5b)$ is five times as large. This suggests that we need to place some constraint on the size of the $\vec{w}$ vector. To get a sense of how to do that, let us look at the actual geometry.

What is the Euclidean distance from a point $\vec{x}$ to the decision boundary? In Figure 15.3, we denote by $r$ this distance. We know that the shortest distance between a point and a hyperplane is perpendicular to the plane, and hence, parallel to $\vec{w}$. A unit vector in this direction is $\vec{w}/|\vec{w}|$. The dotted line in the diagram is then a translation of the vector $r\vec{w}/|\vec{w}|$. Let us label the point on the hyperplane closest to $\vec{x}$ as $\vec{x}'$. Then:
(15.2)   $\vec{x}' = \vec{x} - yr\dfrac{\vec{w}}{|\vec{w}|}$

where multiplying by $y$ just changes the sign for the two cases of $\vec{x}$ being on either side of the decision surface. Moreover, $\vec{x}'$ lies on the decision boundary and so satisfies $\vec{w}^T\vec{x}' + b = 0$. Hence:

(15.3)   $\vec{w}^T\!\left(\vec{x} - yr\dfrac{\vec{w}}{|\vec{w}|}\right) + b = 0$

Solving for $r$ gives:²

(15.4)   $r = y\,\dfrac{\vec{w}^T\vec{x} + b}{|\vec{w}|}$

1. As discussed in Section 14.1 (page 291), we present the general case of points in a vector space, but if the points are length normalized document vectors, then all the action is taking place on the surface of a unit sphere, and the decision surface intersects the sphere's surface.
2. Recall that $|\vec{w}| = \sqrt{\vec{w}^T\vec{w}}$.

[Figure: a point $\vec{x}$ and its projection $\vec{x}'$ onto the decision hyperplane, with the distance $r$ between them, the weight vector $\vec{w}$, and the margin $\rho$ between the two classes.]
Figure 15.3 The geometric margin of a point ($r$) and a decision boundary ($\rho$).
Again, the points closest to the separating hyperplane are support vectors. The geometric margin of the classifier is the maximum width of the band that can be drawn separating the support vectors of the two classes. That is, it is twice the minimum value over data points for $r$ given in Equation (15.4), or, equivalently, the maximal width of one of the fat separators shown in Figure 15.2. The geometric margin is clearly invariant to scaling of parameters: if we replace $\vec{w}$ by $5\vec{w}$ and $b$ by $5b$, then the geometric margin is the same, because it is inherently normalized by the length of $\vec{w}$. This means that we can impose any scaling constraint we wish on $\vec{w}$ without affecting the geometric margin. Among other choices, we could use unit vectors, as in Chapter 6, by requiring that $|\vec{w}| = 1$. This would have the effect of making the geometric margin the same as the functional margin.
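The following sketch (an illustration added here, using made-up data and a made-up, non-optimal separator) computes functional and geometric margins directly from their definitions and shows that rescaling $\vec{w}$ and $b$ changes the former but not the latter.

```python
# Sketch: functional vs. geometric margin for a fixed, hand-chosen separator (w, b)
# on a toy 2-D data set. All values below are illustrative, not an SVM solution.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 1.0], [1.0, 1.0], [0.0, 2.0]])  # toy points
y = np.array([+1, +1, -1, -1])                                   # labels in {+1, -1}
w = np.array([1.0, 0.5])                                         # a candidate weight vector
b = -2.5                                                         # and intercept term

def functional_margins(w, b, X, y):
    # y_i (w^T x_i + b) for every training point
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    # rho = 2 * min_i y_i (w^T x_i + b) / |w|, i.e. twice the smallest distance r_i
    return 2 * functional_margins(w, b, X, y).min() / np.linalg.norm(w)

print(functional_margins(w, b, X, y))          # depends on the scale of (w, b) ...
print(functional_margins(5 * w, 5 * b, X, y))  # ... five times as large
print(geometric_margin(w, b, X, y))            # scale-invariant
print(geometric_margin(5 * w, 5 * b, X, y))    # same value as the line above
```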
Since we can scale the functional margin as we please, for convenience in solving large SVMs, let us choose to require that the functional margin of all data points is at least 1 and that it is equal to 1 for at least one data vector. That is, for all items in the data:
(15.5)   $y_i(\vec{w}^T\vec{x}_i + b) \geq 1$

and there exist support vectors for which the inequality is an equality. Since each example's distance from the hyperplane is $r_i = y_i(\vec{w}^T\vec{x}_i + b)/|\vec{w}|$, the geometric margin is $\rho = 2/|\vec{w}|$. Our desire is still to maximize this geometric margin. That is, we want to find $\vec{w}$ and $b$ such that:

• $\rho = 2/|\vec{w}|$ is maximized
• For all $(\vec{x}_i, y_i) \in D$, $y_i(\vec{w}^T\vec{x}_i + b) \geq 1$

Maximizing $2/|\vec{w}|$ is the same as minimizing $|\vec{w}|/2$. This gives the final standard formulation of an SVM as a minimization problem:

(15.6)   Find $\vec{w}$ and $b$ such that:
• $\frac{1}{2}\vec{w}^T\vec{w}$ is minimized, and
• for all $\{(\vec{x}_i, y_i)\}$, $y_i(\vec{w}^T\vec{x}_i + b) \geq 1$
We are now optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are a standard, well-known class of mathematical optimization problems, and many algorithms exist for solving them. We could in principle build our SVM using standard quadratic programming (QP) libraries, but there has been much recent research in this area aiming to exploit the structure of the kind of QP that emerges from an SVM. As a result, there are more intricate but much faster and more scalable libraries available especially for building SVMs, which almost everyone uses to build models. We will not present the details of such algorithms here.
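As a hypothetical illustration of relying on such a library rather than writing the optimization ourselves, the sketch below fits a linear SVM with scikit-learn (assuming that package is available); since that implementation is a soft-margin SVM (slack variables are discussed in Section 15.2.1), a very large C value is used to approximate the hard-margin problem of Equation (15.6).

```python
# Sketch (not the book's method): fit a linear SVM with scikit-learn on a toy set.
# A very large C approximates the hard-margin formulation of Equation (15.6).
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 1.0], [1.0, 1.0], [0.0, 2.0]])  # toy points
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C: slack is effectively disallowed
clf.fit(X, y)

print(clf.coef_)                         # the learned weight vector w
print(clf.intercept_)                    # the intercept term b
print(clf.support_vectors_)              # the training points with non-zero alpha_i
print(clf.predict(np.array([[2.0, 3.0]])))  # classify a new point
```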
However, it will be helpful to what follows to understand the shape of the solution of such an optimization problem. The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with each constraint $y_i(\vec{w}^T\vec{x}_i + b) \geq 1$ in the primal problem:
(15.7)   Find $\alpha_1, \ldots, \alpha_N$ such that $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j$ is maximized, and
• $\sum_i \alpha_i y_i = 0$
• $\alpha_i \geq 0$ for all $1 \leq i \leq N$
The solution is then of the form:
[Figure: three training points on a small grid, a positive example at (2, 3) and negative examples at (1, 1) and (2, 0).]
Figure 15.4 A tiny 3 data point training set for an SVM.
(15.8)   $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$
         $b = y_k - \vec{w}^T\vec{x}_k$ for any $\vec{x}_k$ such that $\alpha_k \neq 0$
In the solution, most of the $\alpha_i$ are zero. Each non-zero $\alpha_i$ indicates that the corresponding $\vec{x}_i$ is a support vector. The classification function is then:
(15.9)   $f(\vec{x}) = \mathrm{sign}\!\left(\sum_i \alpha_i y_i \vec{x}_i^T\vec{x} + b\right)$
Both the term to be maximized in the dual problem and the classifying function involve a dot product between pairs of points ($\vec{x}$ and $\vec{x}_i$ or $\vec{x}_i$ and $\vec{x}_j$), and that is the only way the data are used – we will return to the significance of this later.
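To show the shape of this computation, the sketch below (not from the original text) evaluates the classifier of Equation (15.9) directly from a set of multipliers; the data points, labels, $\alpha_i$ values, and $b$ are placeholders chosen only to illustrate the form of the sum, not the solution of any particular optimization.

```python
# Sketch of the dual-form classifier of Equation (15.9).
# alphas, X, y, and b below are placeholder values, not a solved SVM.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 1.0], [1.0, 1.0], [0.0, 2.0]])  # training points
y = np.array([+1, +1, -1, -1])                                   # labels in {+1, -1}
alphas = np.array([0.5, 0.0, 0.5, 0.0])                          # alpha_i = 0 => not a support vector
b = -2.0                                                         # intercept from Equation (15.8)

def f(x):
    # f(x) = sign( sum_i alpha_i y_i x_i^T x + b ): only support vectors contribute
    return int(np.sign(np.sum(alphas * y * (X @ x)) + b))

print(f(np.array([4.0, 1.0])))   # classify a point using only dot products with x_i
```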
To recap, we start with a training data set. The data set uniquely defines the best separating hyperplane, and we feed the data through a quadratic optimization procedure to find this plane. Given a new point $\vec{x}$ to classify, the classification function $f(\vec{x})$ in either Equation (15.1) or Equation (15.9) is computing the projection of the point onto the hyperplane normal. The sign of this function determines the class to assign to the point. If the point is within the margin of the classifier (or another confidence threshold $t$ that we might have determined to minimize classification mistakes) then the classifier can return "don't know" rather than one of the two classes. The value of $f(\vec{x})$ may also be transformed into a probability of classification; fitting a sigmoid to transform the values is standard (Platt 2000). Also, since the margin is constant, if the model includes dimensions from various sources, careful rescaling of some dimensions may be required. However, this is not a problem if our documents (points) are on the unit hypersphere.
Example 15.1: Consider building an SVM over the (very little) data set shown in Figure 15.4. Working geometrically, for an example like this, the maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2). The optimal decision surface is orthogonal to that line and intersects it at the halfway point.
Therefore, it passes through (1.5, 2). So, the SVM decision boundary is:
$y = x_1 + 2x_2 - 5.5$
Working algebraically, with the standard constraint that $y_i(\vec{w}^T\vec{x}_i + b) \geq 1$, we seek to minimize $|\vec{w}|$. This happens when this constraint is satisfied with equality by the two support vectors. Further we know that the solution is $\vec{w} = (a, 2a)$ for some $a$. So we have that:

$a + 2a + b = -1$
$2a + 6a + b = 1$

Therefore, $a = 2/5$ and $b = -11/5$. So the optimal hyperplane is given by $\vec{w} = (2/5, 4/5)$ and $b = -11/5$. The margin $\rho$ is $2/|\vec{w}| = 2/\sqrt{4/25 + 16/25} = 2/(2\sqrt{5}/5) = \sqrt{5}$. This answer can be confirmed geometrically by examining Figure 15.4.
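The worked solution can also be checked numerically; the short sketch below (added here, not part of the original example) plugs the derived $\vec{w}$ and $b$ back into the constraints and the margin formula.

```python
# Numerically check the worked solution of Example 15.1.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 3.0]])  # the three training points
y = np.array([-1, -1, +1])                           # their labels
w = np.array([2/5, 4/5])                             # derived weight vector
b = -11/5                                            # derived intercept

margins = y * (X @ w + b)
print(margins)                 # [1.  1.4 1. ]: the support vectors (1,1) and (2,3) hit equality
print(2 / np.linalg.norm(w))   # 2.236... = sqrt(5), the margin rho
```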
Exercise 15.1   [ ]
What is the minimum number of support vectors that there can be for a data set (which contains instances of each class)?

Exercise 15.2   [ ]
The basis of being able to use kernels in SVMs (see Section 15.2.3) is that the classification function can be written in the form of Equation (15.9) (where, for large problems, most $\alpha_i$ are 0). Show explicitly how the classification function could be written in this form for the data set from Example 15.1. That is, write $f$ as a function where the data points appear and the only variable is $\vec{x}$.

Exercise 15.3   [ ]
Install an SVM package such as SVMlight (http://svmlight.joachims.org/), and build an SVM for the data set discussed in Example 15.1. Confirm that the program gives the same solution as the text. For SVMlight, or another package that accepts the same training data format, the training file would be:
+1 1:2 2:3
-1 1:2 2:0
-1 1:1 2:1
The training command for SVMlight is then:
svm_learn -c 1 -a alphas.dat train.dat model.dat
The -c 1 option is needed to turn off use of the slack variables that we discuss in Section 15.2.1. Check that the norm of the weight vector agrees with what we found in Example 15.1. Examine the file alphas.dat which contains the αi values, and check that they agree with your answers in Exercise 15.2.