18.2 Term-document matrices and singular value decompositions   407
Thus, we have SU = UΛ, or S = UΛU−1.
We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors. This will pave the way for the development of our main tool for text analysis, the singular value decomposition (Section 18.2).
Theorem 18.2. (Symmetric diagonalization theorem) Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

(18.8)

S = QΛQT,

where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, all entries of Q are real and we have Q−1 = QT.
We will build on this symmetric diagonal decomposition to build low-rank approximations to term-document matrices.
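Theorem 18.2 is easy to check numerically. The following sketch (our own illustration, not from the text; it assumes NumPy is available) computes Q and Λ for a small symmetric matrix with `numpy.linalg.eigh` and verifies both S = QΛQT and Q−1 = QT:

```python
import numpy as np

# A symmetric real-valued matrix S (the example values are invented).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric (Hermitian) matrices: it returns
# real eigenvalues and orthonormal eigenvectors.
eigenvalues, Q = np.linalg.eigh(S)
Lam = np.diag(eigenvalues)

# Theorem 18.2: S = Q Lam Q^T, and Q is orthogonal, so Q^{-1} = Q^T.
assert np.allclose(S, Q @ Lam @ Q.T)
assert np.allclose(np.linalg.inv(Q), Q.T)
```

Because S is symmetric, `eigh` is both faster and numerically safer here than the general-purpose `eig`.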
Exercise 18.1
What is the rank of the 3 × 3 diagonal matrix below?
⎛ 1 1 0 ⎞
⎜ 0 1 1 ⎟
⎝ 1 2 1 ⎠
Exercise 18.2

Show that λ = 2 is an eigenvalue of

C = ⎛ 6 −2 ⎞
    ⎝ 4  0 ⎠ .
Find the corresponding eigenvector.
Exercise 18.3
Compute the unique eigen decomposition of the 2 × 2 matrix in (18.4).
18.2 Term-document matrices and singular value decompositions
The decompositions we have been studying thus far apply to square matrices. However, the matrix we are interested in is the M × N term-document matrix C where (barring a rare coincidence) M ≠ N; furthermore, C is very unlikely to be symmetric. To this end we first describe an extension of the symmetric diagonal decomposition known as the singular value decomposition. We then show in Section 18.3 how this can be used to construct an approximate version of C. It is beyond the scope of this book to develop a full treatment of the mathematics underlying singular value decompositions; following the statement of Theorem 18.3 we relate the singular value decomposition to the symmetric diagonal decompositions from Section 18.1.1. Given C, let U be the M × M matrix whose columns are the orthogonal eigenvectors of CCT, and V be the N × N matrix whose columns are the orthogonal eigenvectors of CTC. Denote by CT the transpose of a matrix C.

Online edition (c) 2009 Cambridge UP
Theorem 18.3. Let r be the rank of the M × N matrix C. Then, there is a singular-value decomposition (SVD for short) of C of the form

(18.9)

C = UΣVT,

where

1. The eigenvalues λ1, . . . , λr of CCT are the same as the eigenvalues of CTC;

2. For 1 ≤ i ≤ r, let σi = √λi, with λi ≥ λi+1. Then the M × N matrix Σ is composed by setting Σii = σi for 1 ≤ i ≤ r, and zero otherwise.
The values σi are referred to as the singular values of C. It is instructive to examine the relationship of Theorem 18.3 to Theorem 18.2; we do this rather than derive the general proof of Theorem 18.3, which is beyond the scope of this book.
By multiplying Equation (18.9) by its transposed version, we have
(18.10)

CCT = UΣVTVΣUT = UΣ2UT.
Note now that in Equation (18.10), the left-hand side is a square symmetric real-valued matrix, and the right-hand side represents its symmetric diagonal decomposition as in Theorem 18.2. What does the left-hand side CCT represent? It is a square matrix with a row and a column corresponding to each of the M terms. The entry (i, j) in the matrix is a measure of the overlap between the ith and jth terms, based on their co-occurrence in documents. The precise mathematical meaning depends on the manner in which C is constructed based on term weighting. Consider the case where C is the term-document incidence matrix of page 3, illustrated in Figure 1.1. Then the entry (i, j) in CCT is the number of documents in which both term i and term j occur.
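To make this co-occurrence interpretation concrete, here is a small sketch with an invented 3 × 3 incidence matrix (not the one from Figure 1.1); NumPy is assumed:

```python
import numpy as np

# Rows = terms, columns = documents; entry 1 iff the term occurs in the document.
C = np.array([[1, 0, 1],   # term 0 appears in docs 0 and 2
              [1, 1, 0],   # term 1 appears in docs 0 and 1
              [0, 1, 1]])  # term 2 appears in docs 1 and 2

cooc = C @ C.T

# Entry (i, j) counts the documents containing both term i and term j;
# a diagonal entry (i, i) counts the documents containing term i.
assert cooc[0, 1] == 1   # terms 0 and 1 co-occur only in doc 0
assert cooc[0, 0] == 2   # term 0 occurs in two documents
```

With weighted (rather than 0/1) entries in C, the same product yields weighted overlaps instead of raw document counts.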
When writing down the numerical values of the SVD, it is conventional to represent Σ as an r × r matrix with the singular values on the diagonals, since all its entries outside this sub-matrix are zeros. Accordingly, it is conventional to omit the rightmost M − r columns of U corresponding to these omitted rows of Σ; likewise the rightmost N − r columns of V are omitted since they correspond in VT to the rows that will be multiplied by the N − r columns of zeros in Σ. This written form of the SVD is sometimes known as the reduced SVD or truncated SVD and we will encounter it again in Exercise 18.9. Henceforth, our numerical examples and exercises will use this reduced form.

Figure 18.1 Illustration of the singular-value decomposition. In this schematic illustration of (18.9), we see two cases illustrated. In the top half of the figure, we have a matrix C for which M > N. The lower half illustrates the case M < N.
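As a concrete (non-book) illustration, NumPy's SVD routine exposes exactly this choice between the full and the reduced written form:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((5, 3))   # an M x N matrix with M > N; full rank, so r = N

# Full SVD: U is M x M and VT is N x N; Sigma would carry M - r rows of zeros.
U_full, s, Vt = np.linalg.svd(C, full_matrices=True)
assert U_full.shape == (5, 5)

# Reduced (truncated) SVD: the rightmost columns of U that would only
# multiply zeros in Sigma are omitted.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
assert U.shape == (5, 3) and s.shape == (3,) and Vt.shape == (3, 3)

# The product of the reduced factors still recovers C exactly.
assert np.allclose(C, U @ np.diag(s) @ Vt)
```

Note that `full_matrices=False` keeps min(M, N) columns; this coincides with the r-column reduced SVD described above precisely because the random C here has full rank.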
Example 18.3: We now illustrate the singular-value decomposition of a 4 × 2 matrix of rank 2; the singular values are Σ11 = 2.236 and Σ22 = 1.
(18.11)

C = ⎛  1 −1 ⎞   ⎛ −0.632  0.000 ⎞
    ⎜  0  1 ⎟ = ⎜  0.316 −0.707 ⎟ ⎛ 2.236 0.000 ⎞ ⎛ −0.707  0.707 ⎞
    ⎜  1  0 ⎟   ⎜ −0.316 −0.707 ⎟ ⎝ 0.000 1.000 ⎠ ⎝ −0.707 −0.707 ⎠
    ⎝ −1  1 ⎠   ⎝  0.632  0.000 ⎠ .
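Decompositions like the one in Example 18.3 can be checked numerically. The sketch below (using NumPy, our choice of tool; the book prescribes none) applies the SVD to a 4 × 2 rank-2 matrix whose singular values are the ones quoted above, 2.236 and 1:

```python
import numpy as np

C = np.array([[ 1.0, -1.0],
              [ 0.0,  1.0],
              [ 1.0,  0.0],
              [-1.0,  1.0]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# The singular values are sqrt(5) = 2.236 and 1 ...
assert np.allclose(s, [np.sqrt(5.0), 1.0])
# ... and they are the square roots of the eigenvalues of C^T C (Theorem 18.3).
eigvals = np.sort(np.linalg.eigvalsh(C.T @ C))[::-1]
assert np.allclose(s ** 2, eigvals)
# The three factors reconstruct C.
assert np.allclose(C, U @ np.diag(s) @ Vt)
```

The signs of corresponding columns of U and V may differ between libraries and textbooks; any consistent choice yields a valid SVD.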
As with the matrix decompositions defined in Section 18.1.1, the singular value decomposition of a matrix can be computed by a variety of algorithms, many of which have publicly available software implementations; pointers to these are given in Section 18.5.

Exercise 18.4

Let

(18.12)

C = ⎛ 0 1 ⎞
    ⎜ 1 1 ⎟
    ⎝ 1 0 ⎠

be the term-document incidence matrix for a collection. Compute the co-occurrence matrix CCT. What is the interpretation of the diagonal entries of CCT when C is a term-document incidence matrix?
Exercise 18.5

Verify that the SVD of the matrix in Equation (18.12) is

(18.13)

U = ⎛ −0.408 −0.707 ⎞ ,  Σ = ⎛ 1.732 0.000 ⎞ ,
    ⎜ −0.816  0.000 ⎟       ⎝ 0.000 1.000 ⎠
    ⎝ −0.408  0.707 ⎠

(18.14)

VT = ⎛ −0.707 −0.707 ⎞ ,
     ⎝  0.707 −0.707 ⎠

by verifying all of the properties in the statement of Theorem 18.3.
Exercise 18.6
Suppose that C is a binary term-document incidence matrix. What do the entries of CTC represent?
Exercise 18.7

Let

C = ⎛ 0 2 1 ⎞
    ⎜ 0 3 0 ⎟
    ⎝ 2 1 0 ⎠

be a term-document matrix whose entries are term frequencies; thus term 1 occurs 2 times in document 2 and once in document 3. Compute CCT; observe that its entries are largest where two terms have their most frequent occurrences together in the same document.
18.3 Low-rank approximations
We next state a matrix approximation problem that at first seems to have little to do with information retrieval. We describe a solution to this matrix problem using singular-value decompositions, then develop its application to information retrieval.
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

(18.15)

‖X‖F = √( ∑_{i=1}^{M} ∑_{j=1}^{N} Xij² ) .

Thus, the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy, while constraining Ck to have rank at most k. If r is the rank of C, clearly Cr = C and the Frobenius norm of the discrepancy is zero in this case. When k is far smaller than r, we refer to Ck as a low-rank approximation.

The singular value decomposition can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term-document matrices. We invoke the following three-step procedure to this end:
Figure 18.2 Illustration of low rank approximation using the singular-value decomposition. The dashed boxes indicate the matrix entries affected by “zeroing out” the smallest singular values.
1. Given C, construct its SVD in the form shown in (18.9); thus, C = UΣVT.

2. Derive from Σ the matrix Σk formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.

3. Compute and output Ck = UΣkVT as the rank-k approximation to C.
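The three-step procedure can be sketched directly in code (a minimal illustration assuming NumPy; the 6 × 4 test matrix is invented):

```python
import numpy as np

def low_rank_approx(C, k):
    """Rank-k approximation of C: take the SVD, zero out the r - k
    smallest singular values, and recompose the three factors."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # step 1
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                                 # step 2
    return U @ np.diag(s_trunc) @ Vt                  # step 3

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 4))

C2 = low_rank_approx(C, 2)
assert np.linalg.matrix_rank(C2) <= 2     # the output has rank at most k
# Keeping all singular values (k = rank of C) recovers C exactly.
assert np.allclose(low_rank_approx(C, 4), C)
```

For large sparse term-document matrices one would compute only the top k singular triples with a truncated solver rather than the full SVD, but the recomposition step is the same.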
The rank of Ck is at most k: this follows from the fact that Σk has at most k non-zero values. Next, we recall the intuition of Example 18.1: the effect of small eigenvalues on matrix products is small. Thus, it seems plausible that replacing these small eigenvalues by zero will not substantially alter the product, leaving it “close” to C. The following theorem due to Eckart and Young tells us that, in fact, this procedure yields the matrix of rank k with the lowest possible Frobenius error.
Theorem 18.4.

(18.16)

min_{Z | rank(Z)=k} ‖C − Z‖F = ‖C − Ck‖F = σk+1.
Recalling that the singular values are in decreasing order σ1 ≥ σ2 ≥ · · ·, |
|||
|
we learn from Theorem 18.4 that Ck is the best rank-k approximation to C, |
|||
|
incurring an error (measured by the Frobenius norm of C − Ck) equal to σk+1. |
|||
|
Thus the larger k is, the smaller this error (and in particular, for k = r, the |
|||
|
error is zero since Σr = Σ; provided r < M, N, then σr+1 = 0 and thus |
|||
|
Cr = C). |
|
|
|
|
To derive further insight into why the process of truncating the smallest |
|||
|
r − k singular values in Σ helps generate a rank-k approximation of low error, |
|||
|
we examine the form of Ck: |
|
||
(18.17) |
C |
= |
UΣ |
VT |
|
k |
|
k |
|