The Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:

(11.6)   d is relevant iff P(R = 1|d, q) > P(R = 0|d, q)
BAYES RISK
Theorem 11.1. The PRP is optimal, in the sense that it minimizes the expected loss (also known as the Bayes risk) under 1/0 loss.
The proof can be found in Ripley (1996). However, it requires that all probabilities are known correctly. This is never the case in practice. Nevertheless, the PRP still provides a very useful foundation for developing models of IR.
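For concreteness, here is a minimal sketch of PRP ranking and the decision rule (11.6); the probability values are invented stand-ins for P(R = 1|d, q), which in practice must themselves be estimated:

```python
# Minimal sketch of the PRP under 1/0 loss. The scores below are
# hypothetical stand-ins for estimates of P(R = 1 | d, q).
p_relevant = {"d1": 0.9, "d2": 0.4, "d3": 0.7}

# PRP: rank all documents in decreasing order of P(R = 1 | d, q).
ranking = sorted(p_relevant, key=p_relevant.get, reverse=True)

# Bayes Optimal Decision Rule (11.6): return exactly those documents
# that are more likely relevant than nonrelevant.
returned = [d for d, p in p_relevant.items() if p > 1 - p]
```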
11.2.2 The PRP with retrieval costs
Suppose, instead, that we assume a model of retrieval costs. Let C1 be the cost of not retrieving a relevant document and C0 the cost of retrieval of a nonrelevant document. Then the Probability Ranking Principle says that if for a specific document d and for all documents d′ not yet retrieved
(11.7) C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)
then d is the next document to be retrieved. Such a model gives a formal framework where we can model differential costs of false positives and false negatives and even system performance issues at the modeling stage, rather than simply at the evaluation stage, as we did in Section 8.6 (page 168). However, we will not further consider loss/utility models in this chapter.
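As a small illustration of Equation (11.7) (the costs and probabilities here are invented), the document chosen next is the one minimizing the expected cost C0 · P(R = 0|d) − C1 · P(R = 1|d):

```python
# Sketch of cost-sensitive selection per Equation (11.7).
# C1: cost of not retrieving a relevant document;
# C0: cost of retrieving a nonrelevant one. Values are hypothetical.
C1, C0 = 5.0, 1.0
p_rel = {"d1": 0.9, "d2": 0.4, "d3": 0.7}  # estimates of P(R = 1 | d)

def expected_cost(d):
    # C0 * P(R = 0 | d) - C1 * P(R = 1 | d); lower is better.
    return C0 * (1.0 - p_rel[d]) - C1 * p_rel[d]

# The document satisfying (11.7) against all unretrieved documents
# is exactly the minimizer of this expected cost.
next_doc = min(p_rel, key=expected_cost)  # 'd1'
```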
11.3 The Binary Independence Model
BINARY INDEPENDENCE MODEL
The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical. Here, “binary” is equivalent to Boolean: documents and queries are both represented as binary term incidence vectors. That is, a document d is represented by the vector ~x = (x1, . . . , xM) where xt = 1 if term t is present in document d and xt = 0 if t is not present in d. With this representation, many possible documents have the same vector representation. Similarly, we represent q by the incidence vector ~q (the distinction between q and ~q is less central since commonly q is in the form of a set of words). “Independence” means that terms are modeled as occurring in documents independently. The model recognizes no association between terms. This assumption is far from correct, but it nevertheless often gives satisfactory results in practice; it is the “naive” assumption of Naive Bayes models, discussed further in Section 13.4 (page 265). Indeed, the Binary Independence
Model is exactly the same as the multivariate Bernoulli Naive Bayes model presented in Section 13.3 (page 263). In a sense this assumption is equivalent to an assumption of the vector space model, where each term is a dimension that is orthogonal to all other terms.
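A minimal sketch of this binary representation (the vocabulary and document are invented for illustration) shows how term frequency information is discarded and why distinct documents can share a vector:

```python
# Sketch: a document as a binary term incidence vector x over a fixed
# vocabulary of M terms. Toy vocabulary and document for illustration.
vocabulary = ["model", "probabilistic", "retrieval"]  # terms t = 1..M

def incidence_vector(doc_tokens, vocabulary):
    # x_t = 1 if term t is present in the document, else 0.
    present = set(doc_tokens)
    return [1 if t in present else 0 for t in vocabulary]

x = incidence_vector("the retrieval model of retrieval".split(), vocabulary)
# x == [1, 0, 1]: repeated occurrences of "retrieval" collapse to a
# single 1, so many different documents map to the same vector.
```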
We will first present a model which assumes that the user has a single-step information need. As discussed in Chapter 9, seeing a range of results might let the user refine their information need. Fortunately, as mentioned there, it is straightforward to extend the Binary Independence Model so as to provide a framework for relevance feedback, and we present this model in Section 11.3.4.
To make a probabilistic retrieval strategy precise, we need to estimate how terms in documents contribute to relevance: specifically, we wish to know how term frequency, document frequency, document length, and other statistics that we can compute influence judgments about document relevance, and how they can be reasonably combined to estimate the probability of document relevance. We then order documents by decreasing estimated probability of relevance.
We assume here that the relevance of each document is independent of the relevance of other documents. As we noted in Section 8.5.1 (page 166), this is incorrect: the assumption is especially harmful in practice if it allows a system to return duplicate or near duplicate documents. Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability in terms of term incidence vectors P(R|~x,~q). Then, using Bayes rule, we have:
(11.8)   P(R = 1|~x,~q) = P(~x|R = 1,~q) P(R = 1|~q) / P(~x|~q)

(11.9)   P(R = 0|~x,~q) = P(~x|R = 0,~q) P(R = 0|~q) / P(~x|~q)
Here, P(~x|R = 1,~q) and P(~x|R = 0,~q) are the probability that if a relevant or nonrelevant, respectively, document is retrieved, then that document’s representation is ~x. You should think of this quantity as defined with respect to a space of possible documents in a domain. How do we compute all these probabilities? We never know the exact probabilities, and so we have to use estimates: Statistics about the actual document collection are used to estimate these probabilities. P(R = 1|~q) and P(R = 0|~q) indicate the prior probability of retrieving a relevant or nonrelevant document respectively for a query ~q. Again, if we knew the percentage of relevant documents in the collection, then we could use this number to estimate P(R = 1|~q) and P(R = 0|~q). Since a document is either relevant or nonrelevant to a query, we must have that:
P(R = 1|~x,~q) + P(R = 0|~x,~q) = 1
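As a numeric sanity check of Equations (11.8) and (11.9) (all values below are invented), the shared denominator P(~x|~q) can be expanded by the law of total probability, after which the two posteriors sum to 1 as required:

```python
# Hypothetical values for a single incidence vector x and query q.
p_x_given_rel    = 0.02   # P(x | R = 1, q)
p_x_given_nonrel = 0.005  # P(x | R = 0, q)
p_rel_prior      = 0.01   # P(R = 1 | q); P(R = 0 | q) = 1 - p_rel_prior

# Denominator P(x | q) by the law of total probability.
p_x = p_x_given_rel * p_rel_prior + p_x_given_nonrel * (1 - p_rel_prior)

p_rel_posterior    = p_x_given_rel * p_rel_prior / p_x           # (11.8)
p_nonrel_posterior = p_x_given_nonrel * (1 - p_rel_prior) / p_x  # (11.9)
assert abs(p_rel_posterior + p_nonrel_posterior - 1.0) < 1e-12
```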
11.3.1 Deriving a ranking function for query terms
Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, because we are interested only in the ranking of documents, we work with some other quantities which are easier to compute and which give the same ordering of documents. In particular, we can rank documents by their odds of relevance (as the odds of relevance is monotonic with the probability of relevance). This makes things easier, because we can ignore the common denominator in (11.8), giving:
(11.10)   O(R|~x,~q) = P(R = 1|~x,~q) / P(R = 0|~x,~q)
                     = [P(R = 1|~q) P(~x|R = 1,~q) / P(~x|~q)] / [P(R = 0|~q) P(~x|R = 0,~q) / P(~x|~q)]
                     = [P(R = 1|~q) / P(R = 0|~q)] · [P(~x|R = 1,~q) / P(~x|R = 0,~q)]

NAIVE BAYES ASSUMPTION
The left term in the rightmost expression of Equation (11.10) is a constant for a given query. Since we are only ranking documents, there is thus no need for us to estimate it. The right-hand term does, however, require estimation, and this initially appears to be difficult: How can we accurately estimate the probability of an entire term incidence vector occurring? It is at this point that we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):
(11.11)   P(~x|R = 1,~q) / P(~x|R = 0,~q) = ∏_{t=1}^{M} P(xt|R = 1,~q) / P(xt|R = 0,~q)
So:

(11.12)   O(R|~x,~q) = O(R|~q) · ∏_{t=1}^{M} P(xt|R = 1,~q) / P(xt|R = 0,~q)
Since each xt is either 0 or 1, we can separate the terms to give:
(11.13)   O(R|~x,~q) = O(R|~q) · ∏_{t:xt=1} [P(xt = 1|R = 1,~q) / P(xt = 1|R = 0,~q)] · ∏_{t:xt=0} [P(xt = 0|R = 1,~q) / P(xt = 0|R = 0,~q)]
Henceforth, let pt = P(xt = 1|R = 1,~q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0,~q) be the probability of a term appearing in a nonrelevant document. These quantities can be visualized in the following contingency table where the columns add to 1:
                          relevant (R = 1)    nonrelevant (R = 0)
Term present (xt = 1)     pt                  ut
Term absent  (xt = 0)     1 − pt              1 − ut

Using pt and ut, the odds in Equation (11.13) can be rewritten as:

(11.14)   O(R|~x,~q) = O(R|~q) · ∏_{t:xt=1} (pt / ut) · ∏_{t:xt=0} [(1 − pt) / (1 − ut)]
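A small sketch (with invented relevance judgments) of how pt and ut might be read off such judged subsets as raw document-frequency fractions; in practice these estimates are smoothed, as discussed later in the chapter:

```python
# Sketch: estimating p_t and u_t from hypothetical judged documents,
# each represented as its set of terms (x_t = 1 iff the term occurs).
relevant_docs    = [{"probabilistic", "retrieval"},
                    {"retrieval", "model"},
                    {"probabilistic", "model"}]
nonrelevant_docs = [{"model", "train"}, {"train"}, {"retrieval", "train"}]

def fraction_containing(term, docs):
    # Raw estimate: share of documents in which the term is present.
    return sum(term in d for d in docs) / len(docs)

p_t = fraction_containing("retrieval", relevant_docs)     # 2/3
u_t = fraction_containing("retrieval", nonrelevant_docs)  # 1/3
```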
Let us make an additional simplifying assumption that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut. (This assumption can be changed, as when doing relevance feedback in Section 11.3.4.) Then we need only consider terms in the products that appear in the query, and so,
(11.15)   O(R|~q,~x) = O(R|~q) · ∏_{t:xt=qt=1} (pt / ut) · ∏_{t:xt=0, qt=1} [(1 − pt) / (1 − ut)]
The left product is over query terms found in the document and the right product is over query terms not found in the document.
We can manipulate this expression by including the query terms found in the document into the right product, but simultaneously dividing through by them in the left product, so the value is unchanged. Then we have:
(11.16)   O(R|~q,~x) = O(R|~q) · ∏_{t:xt=qt=1} [pt(1 − ut) / (ut(1 − pt))] · ∏_{t:qt=1} [(1 − pt) / (1 − ut)]
RETRIEVAL STATUS VALUE
The left product is still over query terms found in the document, but the right product is now over all query terms. That means that this right product is a constant for a particular query, just like the odds O(R|~q). So the only quantity that needs to be estimated to rank documents for relevance to a query is the left product. We can equally rank documents by the logarithm of this term, since log is a monotonic function. The resulting quantity used for ranking is called the Retrieval Status Value (RSV) in this model:
(11.17)   RSVd = log ∏_{t:xt=qt=1} [pt(1 − ut) / (ut(1 − pt))] = ∑_{t:xt=qt=1} log [pt(1 − ut) / (ut(1 − pt))]
So everything comes down to computing the RSV. Define ct:
(11.18)   ct = log [pt(1 − ut) / (ut(1 − pt))] = log [pt / (1 − pt)] + log [(1 − ut) / ut]
ODDS RATIO
The ct terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant (pt/(1 − pt)) and the odds of the term appearing if the document is nonrelevant (ut/(1 − ut)). The odds ratio is the ratio of two such odds, and then we finally take the log of that quantity. The value will be 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and positive if it is more likely to appear in relevant documents. The ct quantities function as term weights in the model, and the document score for a query is RSVd = ∑_{t:xt=qt=1} ct. Operationally, we sum them in accumulators for query terms appearing in documents, just as for the vector space model calculations discussed in Section 7.1 (page 135). We now turn to how we estimate these ct quantities for a particular collection and query.
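Putting the derivation to work, here is a minimal sketch of RSV scoring via the accumulator scheme just described; the per-term statistics (pt, ut) and the toy postings lists are invented, and in a real system would come from collection statistics and relevance estimates:

```python
from collections import defaultdict
from math import log

# Sketch of BIM scoring: RSV_d = sum of c_t over query terms present
# in d (Equations (11.17)-(11.18)). All statistics below are toy values.
stats = {"probabilistic": (0.6, 0.1),   # term -> (p_t, u_t)
         "retrieval":     (0.7, 0.3)}
postings = {"probabilistic": ["d1", "d3"], "retrieval": ["d1", "d2"]}

def c_t(term):
    p, u = stats[term]
    # Log odds ratio (11.18): log[p(1 - u)] - log[u(1 - p)].
    return log(p * (1 - u)) - log(u * (1 - p))

query = ["probabilistic", "retrieval"]
rsv = defaultdict(float)               # one accumulator per document
for t in query:
    w = c_t(t)
    for d in postings.get(t, []):      # query terms appearing in documents
        rsv[d] += w                    # accumulate the term weight

ranking = sorted(rsv, key=rsv.get, reverse=True)  # ['d1', 'd3', 'd2']
```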