- •List of Tables
- •List of Figures
- •Table of Notation
- •Preface
- •Boolean retrieval
- •An example information retrieval problem
- •Processing Boolean queries
- •The extended Boolean model versus ranked retrieval
- •References and further reading
- •The term vocabulary and postings lists
- •Document delineation and character sequence decoding
- •Obtaining the character sequence in a document
- •Choosing a document unit
- •Determining the vocabulary of terms
- •Tokenization
- •Dropping common terms: stop words
- •Normalization (equivalence classing of terms)
- •Stemming and lemmatization
- •Faster postings list intersection via skip pointers
- •Positional postings and phrase queries
- •Biword indexes
- •Positional indexes
- •Combination schemes
- •References and further reading
- •Dictionaries and tolerant retrieval
- •Search structures for dictionaries
- •Wildcard queries
- •General wildcard queries
- •Spelling correction
- •Implementing spelling correction
- •Forms of spelling correction
- •Edit distance
- •Context sensitive spelling correction
- •Phonetic correction
- •References and further reading
- •Index construction
- •Hardware basics
- •Blocked sort-based indexing
- •Single-pass in-memory indexing
- •Distributed indexing
- •Dynamic indexing
- •Other types of indexes
- •References and further reading
- •Index compression
- •Statistical properties of terms in information retrieval
- •Dictionary compression
- •Dictionary as a string
- •Blocked storage
- •Variable byte codes
- •References and further reading
- •Scoring, term weighting and the vector space model
- •Parametric and zone indexes
- •Weighted zone scoring
- •Learning weights
- •The optimal weight g
- •Term frequency and weighting
- •Inverse document frequency
- •The vector space model for scoring
- •Dot products
- •Queries as vectors
- •Computing vector scores
- •Sublinear tf scaling
- •Maximum tf normalization
- •Document and query weighting schemes
- •Pivoted normalized document length
- •References and further reading
- •Computing scores in a complete search system
- •Index elimination
- •Champion lists
- •Static quality scores and ordering
- •Impact ordering
- •Cluster pruning
- •Components of an information retrieval system
- •Tiered indexes
- •Designing parsing and scoring functions
- •Putting it all together
- •Vector space scoring and query operator interaction
- •References and further reading
- •Evaluation in information retrieval
- •Information retrieval system evaluation
- •Standard test collections
- •Evaluation of unranked retrieval sets
- •Evaluation of ranked retrieval results
- •Assessing relevance
- •A broader perspective: System quality and user utility
- •System issues
- •User utility
- •Results snippets
- •References and further reading
- •Relevance feedback and query expansion
- •Relevance feedback and pseudo relevance feedback
- •The Rocchio algorithm for relevance feedback
- •Probabilistic relevance feedback
- •When does relevance feedback work?
- •Relevance feedback on the web
- •Evaluation of relevance feedback strategies
- •Pseudo relevance feedback
- •Indirect relevance feedback
- •Summary
- •Global methods for query reformulation
- •Vocabulary tools for query reformulation
- •Query expansion
- •Automatic thesaurus generation
- •References and further reading
- •XML retrieval
- •Basic XML concepts
- •Challenges in XML retrieval
- •A vector space model for XML retrieval
- •Evaluation of XML retrieval
- •References and further reading
- •Exercises
- •Probabilistic information retrieval
- •Review of basic probability theory
- •The Probability Ranking Principle
- •The 1/0 loss case
- •The PRP with retrieval costs
- •The Binary Independence Model
- •Deriving a ranking function for query terms
- •Probability estimates in theory
- •Probability estimates in practice
- •Probabilistic approaches to relevance feedback
- •An appraisal and some extensions
- •An appraisal of probabilistic models
- •Bayesian network approaches to IR
- •References and further reading
- •Language models for information retrieval
- •Language models
- •Finite automata and language models
- •Types of language models
- •Multinomial distributions over words
- •The query likelihood model
- •Using query likelihood language models in IR
- •Estimating the query generation probability
- •Language modeling versus other approaches in IR
- •Extended language modeling approaches
- •References and further reading
- •Relation to multinomial unigram language model
- •The Bernoulli model
- •Properties of Naive Bayes
- •A variant of the multinomial model
- •Feature selection
- •Mutual information
- •Comparison of feature selection methods
- •References and further reading
- •Document representations and measures of relatedness in vector spaces
- •k nearest neighbor
- •Time complexity and optimality of kNN
- •The bias-variance tradeoff
- •References and further reading
- •Exercises
- •Support vector machines and machine learning on documents
- •Support vector machines: The linearly separable case
- •Extensions to the SVM model
- •Multiclass SVMs
- •Nonlinear SVMs
- •Experimental results
- •Machine learning methods in ad hoc information retrieval
- •Result ranking by machine learning
- •References and further reading
- •Flat clustering
- •Clustering in information retrieval
- •Problem statement
- •Evaluation of clustering
- •Cluster cardinality in K-means
- •Model-based clustering
- •References and further reading
- •Exercises
- •Hierarchical clustering
- •Hierarchical agglomerative clustering
- •Time complexity of HAC
- •Group-average agglomerative clustering
- •Centroid clustering
- •Optimality of HAC
- •Divisive clustering
- •Cluster labeling
- •Implementation notes
- •References and further reading
- •Exercises
- •Matrix decompositions and latent semantic indexing
- •Linear algebra review
- •Matrix decompositions
- •Term-document matrices and singular value decompositions
- •Low-rank approximations
- •Latent semantic indexing
- •References and further reading
- •Web search basics
- •Background and history
- •Web characteristics
- •The web graph
- •Spam
- •Advertising as the economic model
- •The search user experience
- •User query needs
- •Index size and estimation
- •Near-duplicates and shingling
- •References and further reading
- •Web crawling and indexes
- •Overview
- •Crawling
- •Crawler architecture
- •DNS resolution
- •The URL frontier
- •Distributing indexes
- •Connectivity servers
- •References and further reading
- •Link analysis
- •The Web as a graph
- •Anchor text and the web graph
- •PageRank
- •Markov chains
- •The PageRank computation
- •Hubs and Authorities
- •Choosing the subset of the Web
- •References and further reading
- •Bibliography
- •Author Index
124 |
6 Scoring, term weighting and the vector space model |
sions in practice will be far larger than three: it will equal the vocabulary size
M.
To summarize, by viewing a query as a “bag of words”, we are able to treat it as a very short document. As a consequence, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query. The resulting scores can then be used to select the top-scoring documents for a query. Thus we have
|
|
~ |
~ |
|
(6.12) |
score(q, d) = |
V(q) · V(d) |
. |
|
|
|
~ |
~ |
|
|
|
|V(q)||V(d)| |
A document may have a high cosine score for a query even if it does not contain all query terms. Note that the preceding discussion does not hinge on any specific weighting of terms in the document vector, although for the present we may think of them as either tf or tf-idf weights. In fact, a number of weighting schemes are possible for query as well as document vectors, as illustrated in Example 6.4 and developed further in Section 6.4.
Computing the cosine similarities between the query vector and each document vector in the collection, sorting the resulting scores and selecting the top K documents can be expensive — a single similarity computation can entail a dot product in tens of thousands of dimensions, demanding tens of thousands of arithmetic operations. In Section 7.1 we study how to use an inverted index for this purpose, followed by a series of heuristics for improving on this.
Example 6.4: We now consider the query best car insurance on a fictitious collection with N = 1,000,000 documents where the document frequencies of auto, best, car and
are respectively 5000, 50000, 10000 and 1000.
term |
|
query |
|
|
document |
product |
||
|
tf |
df idf wt,q |
tf |
wf |
wt,d |
|
||
auto |
0 |
5000 |
2.3 |
0 |
1 |
1 |
0.41 |
0 |
best |
1 |
50000 |
1.3 |
1.3 |
0 |
0 |
0 |
0 |
car |
1 |
10000 |
2.0 |
2.0 |
1 |
1 |
0.41 |
0.82 |
insurance |
1 |
1000 |
3.0 |
3.0 |
2 |
2 |
0.82 |
2.46 |
In this example the weight of a term in the query is simply the idf (and zero for a term not in the query, such as auto); this is reflected in the column header wt,q (the entry for auto is zero because the query does not contain the termauto). For documents, we use tf weighting with no use of idf but with Euclidean normalization. The former is shown under the column headed wf, while the latter is shown under the column headed wt,d. Invoking (6.9) now gives a net score of 0 + 0 + 0.82 + 2.46 = 3.28.
6.3.3Computing vector scores
In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K. We
Online edition (c) 2009 Cambridge UP
6.3 The vector space model for scoring |
125 |
COSINESCORE(q)
1 float Scores[N] = 0
2 Initialize Length[N]
3for each query term t
4 do calculate wt,q and fetch postings list for t 5 for each pair(d, tft,d) in postings list
6 do Scores[d] += wft,d × wt,q
7 Read the array Length[d]
8for each d
9 do Scores[d] = Scores[d]/Length[d]
10return Top K components of Scores[]
Figure 6.14 The basic algorithm for computing vector space scores.
seek the K documents of the collection with the highest vector space scores on the given query. We now initiate the study of determining the K documents with the highest vector space scores for a query. Typically, we seek these K top documents in ordered by decreasing score; for instance many search engines use K = 10 to retrieve and rank-order the first page of the ten best results. Here we give the basic algorithm for this computation; we develop a fuller treatment of efficient techniques and approximations in Chapter 7.
Figure 6.14 gives the basic algorithm for computing vector space scores. The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the K documents with the highest scores.
The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term t in turn. In Step 5 we calculate the weight in the query vector for term t. Steps 6-8 update the score of each document by adding in the contribution from term t. This process of adding in contribu-
TERM-AT-A-TIME tions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known ACCUMULATOR as accumulators. For this purpose, it would appear necessary to store, with each postings entry, the weight wft,d of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute idft; it suffices to store N/dft at the head of the postings for t. Second, we store the term frequency tft,d for each postings entry. Finally, Step 12 extracts the top K scores – this requires a priority queue
Online edition (c) 2009 Cambridge UP
126 |
6 Scoring, term weighting and the vector space model |
data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, following which each of the K top scores can be extracted from the heap at a cost of O(log N) comparisons.
Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6. In such a concurrent postings traversal we compute the scores of one document
DOCUMENT-AT-A-TIME at a time, so that it is sometimes called document-at-a-time scoring. We will say more about this in Section 7.1.5.
?Exercise 6.14
If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.
Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.
Exercise 6.16
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?
Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:
1.The weight of a term is 1 if present in the query, 0 otherwise.
2.Euclidean normalized idf.
6.4Variant tf-idf functions
For assigning a weight for each term in each document, a number of alternatives to tf and tf-idf have been considered. We discuss some of the principal ones here; a more complete development is deferred to Chapter 11. We will summarize these alternatives in Section 6.4.3 (page 128).
6.4.1Sublinear tf scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is
Online edition (c) 2009 Cambridge UP
6.4 |
Variant tf-idf functions |
|
|
127 |
to use instead the logarithm of the term frequency, which assigns a weight |
||||
given by |
|
|
|
|
|
1 + log tf |
|
if tft d > 0 |
|
(6.13) |
wft,d = 0 |
t,d |
otherwise, |
. |
In this form, we may replace tf by some other function wf as in (6.13), to |
||||
obtain: |
|
|
|
|
(6.14) |
wf-idft,d = wft,d × idft. |
|
Equation (6.9) can then be modified by replacing tf-idf by wf-idf as defined in (6.14).
6.4.2Maximum tf normalization
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document
d, let tfmax(d) = maxτ d tfτ,d, where τ ranges over all terms in d. Then, we compute a normalized term frequency for each term t in document d by
(6.15)
SMOOTHING
ntft,d = a + (1 |
− a) |
tft,d |
, |
tfmax(d) |
where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a in (6.15) is a smoothing term whose role is to damp the contribution of the second term – which may be viewed as a scaling down of tf by the largest tf value in d. We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in ntft,d from modest changes in tft,d (say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: supposed we were to take a document d and create a new document d′ by simply appending a copy of d to itself. While d′ should be no more relevant to any query than d is, the use of (6.9) would assign it twice as high a score as d. Replacing tf-idft,d in (6.9) by ntf-idft,d eliminates the anomaly in this example. Maximum tf normalization does suffer from the following issues:
1.The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
2.A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.
Online edition (c) 2009 Cambridge UP