- List of Tables
- List of Figures
- Table of Notation
- Preface
- Boolean retrieval
- An example information retrieval problem
- Processing Boolean queries
- The extended Boolean model versus ranked retrieval
- References and further reading
- The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Obtaining the character sequence in a document
- Choosing a document unit
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes
- References and further reading
- Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- Context sensitive spelling correction
- Phonetic correction
- References and further reading
- Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
- References and further reading
- Index compression
- Statistical properties of terms in information retrieval
- Dictionary compression
- Dictionary as a string
- Blocked storage
- Variable byte codes
- References and further reading
- Scoring, term weighting and the vector space model
- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g
- Term frequency and weighting
- Inverse document frequency
- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length
- References and further reading
- Computing scores in a complete search system
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
- Components of an information retrieval system
- Tiered indexes
- Designing parsing and scoring functions
- Putting it all together
- Vector space scoring and query operator interaction
- References and further reading
- Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- A broader perspective: System quality and user utility
- System issues
- User utility
- Results snippets
- References and further reading
- Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback
- Summary
- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation
- References and further reading
- XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- References and further reading
- Exercises
- Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs
- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback
- An appraisal and some extensions
- An appraisal of probabilistic models
- Bayesian network approaches to IR
- References and further reading
- Language models for information retrieval
- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Language modeling versus other approaches in IR
- Extended language modeling approaches
- References and further reading
- Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model
- Feature selection
- Mutual information
- Comparison of feature selection methods
- References and further reading
- Document representations and measures of relatedness in vector spaces
- k nearest neighbor
- Time complexity and optimality of kNN
- The bias-variance tradeoff
- References and further reading
- Exercises
- Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results
- Machine learning methods in ad hoc information retrieval
- Result ranking by machine learning
- References and further reading
- Flat clustering
- Clustering in information retrieval
- Problem statement
- Evaluation of clustering
- Cluster cardinality in K-means
- Model-based clustering
- References and further reading
- Exercises
- Hierarchical clustering
- Hierarchical agglomerative clustering
- Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- References and further reading
- Exercises
- Matrix decompositions and latent semantic indexing
- Linear algebra review
- Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
- References and further reading
- Web search basics
- Background and history
- Web characteristics
- The web graph
- Spam
- Advertising as the economic model
- The search user experience
- User query needs
- Index size and estimation
- Near-duplicates and shingling
- References and further reading
- Web crawling and indexes
- Overview
- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier
- Distributing indexes
- Connectivity servers
- References and further reading
- Link analysis
- The Web as a graph
- Anchor text and the web graph
- PageRank
- Markov chains
- The PageRank computation
- Hubs and Authorities
- Choosing the subset of the Web
- References and further reading
- Bibliography
- Author Index
11 Probabilistic information retrieval
During the discussion of relevance feedback in Section 9.1.2, we observed that if we have some known relevant and nonrelevant documents, then we can straightforwardly start to estimate the probability of a term t appearing in a relevant document P(t|R = 1), and that this could be the basis of a classifier that decides whether documents are relevant or not. In this chapter, we more systematically introduce this probabilistic approach to IR, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.
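To make that estimation idea concrete, here is a minimal sketch in Python (not from the book; the documents, terms, and the add-1/2 smoothing choice are all illustrative assumptions) that estimates P(t|R = 1) and P(t|R = 0) from a few hand-labelled documents, treating each document as the set of terms it contains:

```python
# Toy sketch (illustrative only): estimating P(t | R = 1) and P(t | R = 0)
# from hand-labelled documents, using presence/absence of each term.
from math import log

relevant = [            # documents judged relevant (hypothetical)
    {"jaguar", "habitat", "rainforest"},
    {"jaguar", "prey", "habitat"},
]
nonrelevant = [         # documents judged nonrelevant (hypothetical)
    {"jaguar", "car", "engine"},
    {"car", "engine", "speed"},
]

def p_term(term, docs):
    """Estimate P(term | class) as (df + 0.5) / (N + 1); add-1/2 smoothing
    is one common choice for avoiding zero probabilities."""
    df = sum(term in d for d in docs)
    return (df + 0.5) / (len(docs) + 1)

for t in ["habitat", "car"]:
    p1 = p_term(t, relevant)      # estimate of P(t | R = 1)
    p0 = p_term(t, nonrelevant)   # estimate of P(t | R = 0)
    print(t, round(p1, 2), round(p0, 2), "log-odds:", round(log(p1 / p0), 2))
```

The log-odds values printed at the end hint at how such estimates can be turned into term weights for ranking, of the kind developed more carefully later in this chapter.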
Users start with information needs, which they translate into query representations. Similarly, there are documents, which are converted into document representations (the latter differing at least by how text is tokenized, but perhaps containing fundamentally less information, as when a non-positional index is used). Based on these two representations, a system tries to determine how well documents satisfy information needs. In the Boolean or vector space models of IR, matching is done in a formally defined but semantically imprecise calculus of index terms. Given only a query, an IR system has an uncertain understanding of the information need. Given the query and document representations, a system has an uncertain guess of whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty. This chapter provides one answer as to how to exploit this foundation to estimate how likely it is that a document is relevant to an information need.
There is more than one possible retrieval model which has a probabilistic basis. Here, we will introduce probability theory and the Probability Ranking Principle (Sections 11.1–11.2), and then concentrate on the Binary Independence Model (Section 11.3), which is the original and still most influential probabilistic retrieval model. Finally, we will introduce related but extended methods which use term counts, including the empirically successful Okapi BM25 weighting scheme, and Bayesian Network models for IR (Section 11.4). In Chapter 12, we then present the alternative probabilistic language modeling approach to IR, which has been developed with considerable success in recent years.
11.1 Review of basic probability theory
We hope that the reader has seen a little basic probability theory previously. We will give a very quick review; some references for further reading appear at the end of the chapter. A variable A represents an event (a subset of the space of possible outcomes). Equivalently, we can represent the subset via a random variable, which is a function from outcomes to real numbers; the subset is the domain over which the random variable A has a particular value. Often we will not know with certainty whether an event is true in the world. We can ask the probability of the event, 0 ≤ P(A) ≤ 1. For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B). The conditional probability P(A|B) expresses the probability of event A given that event B occurred. The fundamental relationship between joint and conditional probabilities is given by the chain rule:

(11.1)  P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Without making any assumptions, the probability of a joint event equals the probability of one of the events multiplied by the probability of the other event conditioned on knowing the first event happened.

Writing P(Ā) for the probability of the complement of an event A, we similarly have:

(11.2)  P(Ā, B) = P(B|Ā)P(Ā)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

(11.3)  P(B) = P(A, B) + P(Ā, B)

From these we can derive Bayes' Rule for inverting conditional probabilities:

(11.4)  P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑_{X ∈ {A, Ā}} P(B|X)P(X) ] P(A)
This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.
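As a quick numerical check of Equations (11.1)–(11.4), the following short Python snippet (using an arbitrary, made-up joint distribution over two binary events) computes P(B) via the partition rule and recovers P(A|B) via Bayes' rule, confirming that it matches the value read directly off the joint distribution:

```python
# Numerical sanity check of (11.1)-(11.4) on a made-up joint distribution
# over two binary events A and B.  P[a][b] = P(A = a, B = b).
P = {
    True:  {True: 0.20, False: 0.30},   # P(A, B),     P(A, not-B)
    False: {True: 0.10, False: 0.40},   # P(not-A, B), P(not-A, not-B)
}

P_A = P[True][True] + P[True][False]            # marginal prior P(A)
P_B = P[True][True] + P[False][True]            # partition rule (11.3)

P_B_given_A = P[True][True] / P_A               # chain rule (11.1), rearranged
P_A_given_B_bayes = P_B_given_A * P_A / P_B     # Bayes' rule (11.4)
P_A_given_B_direct = P[True][True] / P_B        # read off the joint directly

print(P_A_given_B_bayes, P_A_given_B_direct)    # both approx. 0.667
```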
