- •List of Tables
- •List of Figures
- •Table of Notation
- •Preface
- •Boolean retrieval
- •An example information retrieval problem
- •Processing Boolean queries
- •The extended Boolean model versus ranked retrieval
- •References and further reading
- •The term vocabulary and postings lists
- •Document delineation and character sequence decoding
- •Obtaining the character sequence in a document
- •Choosing a document unit
- •Determining the vocabulary of terms
- •Tokenization
- •Dropping common terms: stop words
- •Normalization (equivalence classing of terms)
- •Stemming and lemmatization
- •Faster postings list intersection via skip pointers
- •Positional postings and phrase queries
- •Biword indexes
- •Positional indexes
- •Combination schemes
- •References and further reading
- •Dictionaries and tolerant retrieval
- •Search structures for dictionaries
- •Wildcard queries
- •General wildcard queries
- •Spelling correction
- •Implementing spelling correction
- •Forms of spelling correction
- •Edit distance
- •Context sensitive spelling correction
- •Phonetic correction
- •References and further reading
- •Index construction
- •Hardware basics
- •Blocked sort-based indexing
- •Single-pass in-memory indexing
- •Distributed indexing
- •Dynamic indexing
- •Other types of indexes
- •References and further reading
- •Index compression
- •Statistical properties of terms in information retrieval
- •Dictionary compression
- •Dictionary as a string
- •Blocked storage
- •Variable byte codes
- •References and further reading
- •Scoring, term weighting and the vector space model
- •Parametric and zone indexes
- •Weighted zone scoring
- •Learning weights
- •The optimal weight g
- •Term frequency and weighting
- •Inverse document frequency
- •The vector space model for scoring
- •Dot products
- •Queries as vectors
- •Computing vector scores
- •Sublinear tf scaling
- •Maximum tf normalization
- •Document and query weighting schemes
- •Pivoted normalized document length
- •References and further reading
- •Computing scores in a complete search system
- •Index elimination
- •Champion lists
- •Static quality scores and ordering
- •Impact ordering
- •Cluster pruning
- •Components of an information retrieval system
- •Tiered indexes
- •Designing parsing and scoring functions
- •Putting it all together
- •Vector space scoring and query operator interaction
- •References and further reading
- •Evaluation in information retrieval
- •Information retrieval system evaluation
- •Standard test collections
- •Evaluation of unranked retrieval sets
- •Evaluation of ranked retrieval results
- •Assessing relevance
- •A broader perspective: System quality and user utility
- •System issues
- •User utility
- •Results snippets
- •References and further reading
- •Relevance feedback and query expansion
- •Relevance feedback and pseudo relevance feedback
- •The Rocchio algorithm for relevance feedback
- •Probabilistic relevance feedback
- •When does relevance feedback work?
- •Relevance feedback on the web
- •Evaluation of relevance feedback strategies
- •Pseudo relevance feedback
- •Indirect relevance feedback
- •Summary
- •Global methods for query reformulation
- •Vocabulary tools for query reformulation
- •Query expansion
- •Automatic thesaurus generation
- •References and further reading
- •XML retrieval
- •Basic XML concepts
- •Challenges in XML retrieval
- •A vector space model for XML retrieval
- •Evaluation of XML retrieval
- •References and further reading
- •Exercises
- •Probabilistic information retrieval
- •Review of basic probability theory
- •The Probability Ranking Principle
- •The 1/0 loss case
- •The PRP with retrieval costs
- •The Binary Independence Model
- •Deriving a ranking function for query terms
- •Probability estimates in theory
- •Probability estimates in practice
- •Probabilistic approaches to relevance feedback
- •An appraisal and some extensions
- •An appraisal of probabilistic models
- •Bayesian network approaches to IR
- •References and further reading
- •Language models for information retrieval
- •Language models
- •Finite automata and language models
- •Types of language models
- •Multinomial distributions over words
- •The query likelihood model
- •Using query likelihood language models in IR
- •Estimating the query generation probability
- •Language modeling versus other approaches in IR
- •Extended language modeling approaches
- •References and further reading
- •Relation to multinomial unigram language model
- •The Bernoulli model
- •Properties of Naive Bayes
- •A variant of the multinomial model
- •Feature selection
- •Mutual information
- •Comparison of feature selection methods
- •References and further reading
- •Document representations and measures of relatedness in vector spaces
- •k nearest neighbor
- •Time complexity and optimality of kNN
- •The bias-variance tradeoff
- •References and further reading
- •Exercises
- •Support vector machines and machine learning on documents
- •Support vector machines: The linearly separable case
- •Extensions to the SVM model
- •Multiclass SVMs
- •Nonlinear SVMs
- •Experimental results
- •Machine learning methods in ad hoc information retrieval
- •Result ranking by machine learning
- •References and further reading
- •Flat clustering
- •Clustering in information retrieval
- •Problem statement
- •Evaluation of clustering
- •Cluster cardinality in K-means
- •Model-based clustering
- •References and further reading
- •Exercises
- •Hierarchical clustering
- •Hierarchical agglomerative clustering
- •Time complexity of HAC
- •Group-average agglomerative clustering
- •Centroid clustering
- •Optimality of HAC
- •Divisive clustering
- •Cluster labeling
- •Implementation notes
- •References and further reading
- •Exercises
- •Matrix decompositions and latent semantic indexing
- •Linear algebra review
- •Matrix decompositions
- •Term-document matrices and singular value decompositions
- •Low-rank approximations
- •Latent semantic indexing
- •References and further reading
- •Web search basics
- •Background and history
- •Web characteristics
- •The web graph
- •Spam
- •Advertising as the economic model
- •The search user experience
- •User query needs
- •Index size and estimation
- •Near-duplicates and shingling
- •References and further reading
- •Web crawling and indexes
- •Overview
- •Crawling
- •Crawler architecture
- •DNS resolution
- •The URL frontier
- •Distributing indexes
- •Connectivity servers
- •References and further reading
- •Link analysis
- •The Web as a graph
- •Anchor text and the web graph
- •PageRank
- •Markov chains
- •The PageRank computation
- •Hubs and Authorities
- •Choosing the subset of the Web
- •References and further reading
- •Bibliography
- •Author Index
36 |
2 The term vocabulary and postings lists |
Figure 2.9 Postings lists with skip pointers. The postings intersection can use a skip pointer when the end point is still less than the item on the other list.
2.3Faster postings list intersection via skip pointers
In the remainder of this chapter, we will discuss extensions to postings list data structures and ways to increase the efficiency of using postings lists. Recall the basic postings list intersection operation from Section 1.3 (page 10): we walk through the two postings lists simultaneously, in time linear in the total number of postings entries. If the list lengths are m and n, the intersection takes O(m + n) operations. Can we do better than this? That is, empirically, can we usually process postings list intersection in sublinear time? We can, if the index isn’t changing too fast.
SKIP LIST One way to do this is to use a skip list by augmenting postings lists with skip pointers (at indexing time), as shown in Figure 2.9. Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. The two questions are then where to place skip pointers and how to do efficient merging using skip pointers.
Consider first efficient merging, with Figure 2.9 as an example. Suppose
we’ve stepped through the lists in the figure until we have matched 8 on each list and moved it to the results list. We advance both pointers, giving us
16 on the upper list and 41 on the lower list. The smallest item is then the
element 16 on the top list. Rather than simply advancing the upper pointer, we first check the skip list pointer and note that 28 is also less than 41. Hence we can follow the skip list pointer, and then we advance the upper pointer
to 28 . We thus avoid stepping to 19 and 23 on the upper list. A number of variant versions of postings list intersection with skip pointers is possible depending on when exactly you check the skip pointer. One version is shown
Online edition (c) 2009 Cambridge UP
2.3 Faster postings list intersection via skip pointers |
37 |
INTERSECTWITHSKIPS( p1, p2)
1answer ← h i
2 while p1 6= NIL and p2 6= NIL
3do if docID( p1) = docID( p2)
4then ADD(answer, docID( p1))
5p1 ← next( p1)
6p2 ← next( p2)
7else if docID( p1) < docID( p2)
8 |
then if hasSkip( p1) and (docID(skip( p1)) ≤ docID( p2)) |
9 |
then while hasSkip( p1) and (docID(skip( p1)) ≤ docID( p2)) |
10 |
do p1 ← skip( p1) |
11 |
else p1 ← next( p1) |
12 |
else if hasSkip( p2) and (docID(skip( p2)) ≤ docID( p1)) |
13 |
then while hasSkip( p2) and (docID(skip( p2)) ≤ docID( p1)) |
14 |
do p2 ← skip( p2) |
15 |
else p2 ← next( p2) |
16return answer
Figure 2.10 Postings lists intersection with skip pointers.
in Figure 2.10. Skip pointers will only be available for the original postings lists. For an intermediate result in a complex query, the call hasSkip( p) will always return false. Finally, note that the presence of skip pointers only helps for AND queries, not for OR queries.
Where do we place skips? There is a tradeoff. More skips means shorter skip spans, and that we are more likely to skip. But it also means lots of comparisons to skip pointers, and lots of space storing skip pointers. Fewer skips means few pointer comparisons, but then long skip spans which means that there will be fewer opportunities to skip. A simple heuristic for placing skips, which has been found to work well in practice, is that for a postings
list of length P, use √P evenly-spaced skip pointers. This heuristic can be improved upon; it ignores any details of the distribution of query terms.
Building effective skip pointers is easy if an index is relatively static; it is harder if a postings list keeps changing because of updates. A malicious deletion strategy can render skip lists ineffective.
Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in mem-
Online edition (c) 2009 Cambridge UP
38 |
2 The term vocabulary and postings lists |
ory then the equation changes again. We discuss the impact of hardware parameters on index construction time in Section 4.1 (page 68) and the impact of index size on system speed in Chapter 5.
? |
Exercise 2.5 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[ ] |
|
Why are skip pointers not useful for queries of the form x OR y? |
|
|
|
|
||||||||||||||
|
Exercise 2.6 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[ ] |
|
|
We have a two-word query. For one term the postings list consists of the following 16 |
|||||||||||||||||
|
entries: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180] |
|
|
|
|
|
|
|||||||||||
|
and for the other it is the one entry postings list: |
|
|
|
|
|
|
|
||||||||||
|
[47]. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Work out how many comparisons would be done to intersect the two postings lists |
|||||||||||||||||
|
with the following two strategies. Briefly justify your answers: |
|
|
|
|
|||||||||||||
|
a. Using standard postings lists |
|
|
|
|
|
|
|
|
|
|
|
||||||
|
b. Using postings lists stored with skip pointers, with a skip length of √ |
|
, as sug- |
|||||||||||||||
|
P |
|||||||||||||||||
|
gested in Section 2.3. |
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
Exercise 2.7 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[ ] |
|
|
Consider a postings intersection between this postings list, with skip pointers: |
|||||||||||||||||
|
3 |
5 |
9 |
15 |
24 |
39 |
60 |
68 |
75 |
81 |
84 |
89 |
92 |
96 |
97 |
100 115 |
|
and the following intermediate result postings list (which hence has no skip pointers):
3 5 89 95 97 99 100 101
Trace through the postings intersection algorithm in Figure 2.10 (page 37).
a.How often is a skip pointer followed (i.e., p1 is advanced to skip(p1))?
b.How many postings comparisons will be made by this algorithm while intersecting the two lists?
c.How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers?
Online edition (c) 2009 Cambridge UP