of compression algorithms is of no concern. But for improved cache utilization and faster disk-to-memory transfer, decompression speeds must be high. The compression algorithms we discuss in this chapter are highly efficient and can therefore serve all three purposes of index compression.
In this chapter, we define a posting as a docID in a postings list. For example, the postings list (6; 20, 45, 100), where 6 is the termID of the list’s term, contains three postings. As discussed in Section 2.4.2 (page 41), postings in most search systems also contain frequency and position information; but we will only consider simple docID postings here. See Section 5.4 for references on compressing frequencies and positions.
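As a concrete illustration of this definition, the following minimal sketch (ours, not from the book) keeps simple docID postings as a mapping from termID to a sorted list of docIDs; the values are the example list (6; 20, 45, 100) from the text.

```python
import bisect

# A minimal in-memory representation of simple docID postings:
# each termID maps to a sorted list of docIDs, with no frequency or
# position information (cf. Section 2.4.2 for richer postings).
postings = {
    6: [20, 45, 100],  # the list (6; 20, 45, 100) from the text: three postings
}

def add_posting(index, term_id, doc_id):
    """Add doc_id to term_id's postings list, keeping docIDs sorted and unique."""
    plist = index.setdefault(term_id, [])
    pos = bisect.bisect_left(plist, doc_id)
    if pos == len(plist) or plist[pos] != doc_id:
        plist.insert(pos, doc_id)

add_posting(postings, 6, 130)
print(postings[6])  # [20, 45, 100, 130] -> now four postings for termID 6
```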
This chapter first gives a statistical characterization of the distribution of the entities we want to compress – terms and postings in large collections (Section 5.1). We then look at compression of the dictionary, using the dictionary-as-a-string method and blocked storage (Section 5.2). Section 5.3 describes two techniques for compressing the postings file, variable byte encoding and γ encoding.
5.1 Statistical properties of terms in information retrieval
As in the last chapter, we use Reuters-RCV1 as our model collection (see Table 4.2, page 70). We give some term and postings statistics for the collection in Table 5.1. “∆%” indicates the reduction in size from the previous line. “T%” is the cumulative reduction from unfiltered.
The table shows the number of terms for different levels of preprocessing (column 2). The number of terms is the main factor in determining the size of the dictionary. The number of nonpositional postings (column 3) is an indicator of the expected size of the nonpositional index of the collection. The expected size of a positional index is related to the number of positions it must encode (column 4).
In general, the statistics in Table 5.1 show that preprocessing affects the size of the dictionary and the number of nonpositional postings greatly. Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of nonpositional postings by 4% and 3%, respectively. The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table). Eliminating the 150 most common words from indexing (as stop words; cf. Section 2.2.2, page 27) cuts 25% to 30% of the nonpositional postings. But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index. As we will see later in this chapter, the postings lists of frequent words require only a few bits per posting after compression.
The deltas in the table are in a range typical of large collections.
Table 5.1 The effect of preprocessing on the number of terms, nonpositional postings, and tokens for Reuters-RCV1. “∆%” indicates the reduction in size from the previous line, except that “30 stop words” and “150 stop words” both use “case folding” as their reference line. “T%” is the cumulative (“total”) reduction from unfiltered. We performed stemming with the Porter stemmer (Chapter 2, page 33).

|                | (distinct) terms |  ∆% |  T% | nonpositional postings |  ∆% |  T% | tokens (= number of position entries in postings) |  ∆% |  T% |
|----------------|------------------|-----|-----|------------------------|-----|-----|---------------------------------------------------|-----|-----|
| unfiltered     | 484,494          |     |     | 109,971,179            |     |     | 197,879,290                                       |     |     |
| no numbers     | 473,723          |  −2 |  −2 | 100,680,242            |  −8 |  −8 | 179,158,204                                       |  −9 |  −9 |
| case folding   | 391,523          | −17 | −19 | 96,969,056             |  −3 | −12 | 179,158,204                                       |  −0 |  −9 |
| 30 stop words  | 391,493          |  −0 | −19 | 83,390,443             | −14 | −24 | 121,857,825                                       | −31 | −38 |
| 150 stop words | 391,373          |  −0 | −19 | 67,001,847             | −30 | −39 | 94,516,599                                        | −47 | −52 |
| stemming       | 322,383          | −17 | −33 | 63,812,300             |  −4 | −42 | 94,516,599                                        |  −0 | −52 |
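To make the ∆% and T% columns concrete, here is a small sketch (not from the book) that recomputes both columns for the distinct-terms counts in Table 5.1; the same code applies to the postings and token columns. Because the printed table shows whole percentages, the one-decimal values below can differ slightly from the printed figures.

```python
# Recompute the ∆% (change vs. reference line) and T% (change vs. unfiltered)
# columns of Table 5.1 for the distinct-terms counts.
terms = {
    "unfiltered":     484_494,
    "no numbers":     473_723,
    "case folding":   391_523,
    "30 stop words":  391_493,
    "150 stop words": 391_373,
    "stemming":       322_383,
}

# Reference line for ∆%: the previous line, except that both stop word rows
# are compared against "case folding" (as stated in the table caption).
reference = {
    "no numbers":     "unfiltered",
    "case folding":   "no numbers",
    "30 stop words":  "case folding",
    "150 stop words": "case folding",
    "stemming":       "150 stop words",
}

def pct_change(new, old):
    """Percentage change from old to new (negative means a reduction)."""
    return 100.0 * (new - old) / old

for row, ref in reference.items():
    delta = pct_change(terms[row], terms[ref])
    total = pct_change(terms[row], terms["unfiltered"])
    print(f"{row:15s}  ∆% = {delta:6.1f}   T% = {total:6.1f}")
```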
Note, however, that the percentage reductions can be very different for some text collections. For example, for a collection of web pages with a high proportion of French text, a lemmatizer for French reduces vocabulary size much more than the Porter stemmer does for an English-only collection, because French is a morphologically richer language than English.
The compression techniques we describe in the remainder of this chapter are lossless, that is, all information is preserved. Better compression ratios can be achieved with lossy compression, which discards some information. Case folding, stemming, and stop word elimination are forms of lossy compression. Similarly, the vector space model (Chapter 6) and dimensionality reduction techniques like latent semantic indexing (Chapter 18) create compact representations from which we cannot fully restore the original collection. Lossy compression makes sense when the “lost” information is unlikely ever to be used by the search system. For example, web search is characterized by a large number of documents, short queries, and users who only look at the first few pages of results. As a consequence, we can discard postings of documents that would only be used for hits far down the list. Thus, there are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.
Before introducing techniques for compressing the dictionary, we want to estimate the number of distinct terms M in a collection. It is sometimes said that languages have a vocabulary of a certain size. The second edition of the Oxford English Dictionary (OED) defines more than 600,000 words. But the vocabulary of most large collections is much larger than the OED. The OED does not include most names of people, locations, products, or scientific
entities like genes. These names need to be included in the inverted index, so our users can search for them.

Figure 5.1 Heaps’ law. Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit. Thus, k = 10^1.64 ≈ 44 and b = 0.49.
5.1.1 Heaps’ law: Estimating the number of terms
A better way of getting a handle on M is Heaps’ law, which estimates vocabulary size as a function of collection size:
(5.1)   M = kT^b

where T is the number of tokens in the collection. Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5. The motivation for Heaps’ law is that the simplest possible relationship between collection size and vocabulary size is linear in log–log space, and the assumption of linearity is usually borne out in practice, as shown in Figure 5.1 for Reuters-RCV1. In this case, the fit is excellent for T > 10^5 = 100,000, for the parameter values b = 0.49 and k = 44. For example, for the first 1,000,020 tokens Heaps’ law
predicts 38,323 terms:
44 × 1,000,020^0.49 ≈ 38,323.
The actual number is 38,365 terms, very close to the prediction.
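This prediction is easy to reproduce from Equation (5.1). The sketch below (an illustration, not code from the book) evaluates Heaps’ law with the Reuters-RCV1 parameters k = 44 and b = 0.49, and also shows how k and b would be obtained as a least-squares fit in log–log space from a sequence of (T, M) measurements.

```python
import math

def heaps_vocabulary(T, k=44, b=0.49):
    """Heaps' law M = k * T^b, with the Reuters-RCV1 fit as default parameters."""
    return k * T ** b

# Prediction for the first 1,000,020 tokens of Reuters-RCV1 (cf. the text):
print(round(heaps_vocabulary(1_000_020)))  # ~38,323; the observed value is 38,365

def fit_heaps(token_counts, vocab_sizes):
    """Least-squares fit of log10 M = b * log10 T + log10 k; returns (k, b).

    token_counts and vocab_sizes are equal-length sequences of measurements
    (T_i, M_i) taken at increasing prefixes of the collection."""
    xs = [math.log10(t) for t in token_counts]
    ys = [math.log10(m) for m in vocab_sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    k = 10 ** (mean_y - b * mean_x)
    return k, b
```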
The parameter k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed. Case folding and stemming reduce the growth rate of the vocabulary, whereas including numbers and spelling errors increases it. Regardless of the values of the parameters for a particular collection, Heaps’ law suggests that (i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary is quite large for large collections. These two hypotheses have been empirically shown to be true of large text collections (Section 5.4). So dictionary compression is important for an effective information retrieval system.
5.1.2 Zipf’s law: Modeling the distribution of terms
We also want to understand how terms are distributed across documents.
This helps us to characterize the properties of the algorithms for compressing postings lists in Section 5.3.

A commonly used model of the distribution of terms in a collection is Zipf’s law. It states that, if t_1 is the most common term in the collection, t_2 is the next most common, and so on, then the collection frequency cf_i of the ith most common term is proportional to 1/i:

(5.2)   cf_i ∝ 1/i.

So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. Equation (5.2) is one of the simplest ways of formalizing such a rapid decrease and it has been found to be a reasonably good model.

Equivalently, we can write Zipf’s law as cf_i = c i^k or as log cf_i = log c + k log i, where k = −1 and c is a constant to be defined in Section 5.3.2. It is therefore a power law with exponent k = −1. See Chapter 19, page 426, for another power law, a law characterizing the distribution of links on web pages.

The log–log graph in Figure 5.2 plots the collection frequency of a term as a function of its rank for Reuters-RCV1. A line with slope −1, corresponding to the Zipf function log cf_i = log c − log i, is also shown. The fit of the data to the law is not particularly good, but good enough to serve as a model for term distributions in our calculations in Section 5.3.
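As a worked illustration (ours, under the idealized assumption that Zipf’s law holds exactly with exponent −1), the fraction of tokens contributed by the m most frequent of M terms is approximately H_m / H_M, a ratio of harmonic numbers, because the collection frequencies are proportional to 1/1, 1/2, ..., 1/M. With the post-case-folding vocabulary size from Table 5.1, this recovers roughly the 30% coverage of the rule of 30; for the top 150 terms the estimate falls noticeably below the 47% token reduction in the table, a reminder that the Zipf fit is only approximate.

```python
import math

def harmonic(n):
    """Approximate harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return math.log(n) + 0.5772156649 + 1.0 / (2 * n)  # Euler-Mascheroni approximation

def zipf_coverage(top_m, num_terms):
    """Fraction of all tokens due to the top_m most frequent terms, assuming
    the collection frequency of the ith term is exactly proportional to 1/i."""
    return harmonic(top_m) / harmonic(num_terms)

M = 391_523  # distinct terms after case folding (Table 5.1)
# Roughly 30% for the top 30 terms (cf. the rule of 30) and roughly 42% for the
# top 150 terms, compared with the 47% observed in Table 5.1.
print(f"top  30 terms: ~{zipf_coverage(30, M):.0%} of tokens")
print(f"top 150 terms: ~{zipf_coverage(150, M):.0%} of tokens")
```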