- •List of Tables
- •List of Figures
- •Table of Notation
- •Preface
- •Boolean retrieval
- •An example information retrieval problem
- •Processing Boolean queries
- •The extended Boolean model versus ranked retrieval
- •References and further reading
- •The term vocabulary and postings lists
- •Document delineation and character sequence decoding
- •Obtaining the character sequence in a document
- •Choosing a document unit
- •Determining the vocabulary of terms
- •Tokenization
- •Dropping common terms: stop words
- •Normalization (equivalence classing of terms)
- •Stemming and lemmatization
- •Faster postings list intersection via skip pointers
- •Positional postings and phrase queries
- •Biword indexes
- •Positional indexes
- •Combination schemes
- •References and further reading
- •Dictionaries and tolerant retrieval
- •Search structures for dictionaries
- •Wildcard queries
- •General wildcard queries
- •Spelling correction
- •Implementing spelling correction
- •Forms of spelling correction
- •Edit distance
- •Context sensitive spelling correction
- •Phonetic correction
- •References and further reading
- •Index construction
- •Hardware basics
- •Blocked sort-based indexing
- •Single-pass in-memory indexing
- •Distributed indexing
- •Dynamic indexing
- •Other types of indexes
- •References and further reading
- •Index compression
- •Statistical properties of terms in information retrieval
- •Dictionary compression
- •Dictionary as a string
- •Blocked storage
- •Variable byte codes
- •References and further reading
- •Scoring, term weighting and the vector space model
- •Parametric and zone indexes
- •Weighted zone scoring
- •Learning weights
- •The optimal weight g
- •Term frequency and weighting
- •Inverse document frequency
- •The vector space model for scoring
- •Dot products
- •Queries as vectors
- •Computing vector scores
- •Sublinear tf scaling
- •Maximum tf normalization
- •Document and query weighting schemes
- •Pivoted normalized document length
- •References and further reading
- •Computing scores in a complete search system
- •Index elimination
- •Champion lists
- •Static quality scores and ordering
- •Impact ordering
- •Cluster pruning
- •Components of an information retrieval system
- •Tiered indexes
- •Designing parsing and scoring functions
- •Putting it all together
- •Vector space scoring and query operator interaction
- •References and further reading
- •Evaluation in information retrieval
- •Information retrieval system evaluation
- •Standard test collections
- •Evaluation of unranked retrieval sets
- •Evaluation of ranked retrieval results
- •Assessing relevance
- •A broader perspective: System quality and user utility
- •System issues
- •User utility
- •Results snippets
- •References and further reading
- •Relevance feedback and query expansion
- •Relevance feedback and pseudo relevance feedback
- •The Rocchio algorithm for relevance feedback
- •Probabilistic relevance feedback
- •When does relevance feedback work?
- •Relevance feedback on the web
- •Evaluation of relevance feedback strategies
- •Pseudo relevance feedback
- •Indirect relevance feedback
- •Summary
- •Global methods for query reformulation
- •Vocabulary tools for query reformulation
- •Query expansion
- •Automatic thesaurus generation
- •References and further reading
- •XML retrieval
- •Basic XML concepts
- •Challenges in XML retrieval
- •A vector space model for XML retrieval
- •Evaluation of XML retrieval
- •References and further reading
- •Exercises
- •Probabilistic information retrieval
- •Review of basic probability theory
- •The Probability Ranking Principle
- •The 1/0 loss case
- •The PRP with retrieval costs
- •The Binary Independence Model
- •Deriving a ranking function for query terms
- •Probability estimates in theory
- •Probability estimates in practice
- •Probabilistic approaches to relevance feedback
- •An appraisal and some extensions
- •An appraisal of probabilistic models
- •Bayesian network approaches to IR
- •References and further reading
- •Language models for information retrieval
- •Language models
- •Finite automata and language models
- •Types of language models
- •Multinomial distributions over words
- •The query likelihood model
- •Using query likelihood language models in IR
- •Estimating the query generation probability
- •Language modeling versus other approaches in IR
- •Extended language modeling approaches
- •References and further reading
- •Relation to multinomial unigram language model
- •The Bernoulli model
- •Properties of Naive Bayes
- •A variant of the multinomial model
- •Feature selection
- •Mutual information
- •Comparison of feature selection methods
- •References and further reading
- •Document representations and measures of relatedness in vector spaces
- •k nearest neighbor
- •Time complexity and optimality of kNN
- •The bias-variance tradeoff
- •References and further reading
- •Exercises
- •Support vector machines and machine learning on documents
- •Support vector machines: The linearly separable case
- •Extensions to the SVM model
- •Multiclass SVMs
- •Nonlinear SVMs
- •Experimental results
- •Machine learning methods in ad hoc information retrieval
- •Result ranking by machine learning
- •References and further reading
- •Flat clustering
- •Clustering in information retrieval
- •Problem statement
- •Evaluation of clustering
- •Cluster cardinality in K-means
- •Model-based clustering
- •References and further reading
- •Exercises
- •Hierarchical clustering
- •Hierarchical agglomerative clustering
- •Time complexity of HAC
- •Group-average agglomerative clustering
- •Centroid clustering
- •Optimality of HAC
- •Divisive clustering
- •Cluster labeling
- •Implementation notes
- •References and further reading
- •Exercises
- •Matrix decompositions and latent semantic indexing
- •Linear algebra review
- •Matrix decompositions
- •Term-document matrices and singular value decompositions
- •Low-rank approximations
- •Latent semantic indexing
- •References and further reading
- •Web search basics
- •Background and history
- •Web characteristics
- •The web graph
- •Spam
- •Advertising as the economic model
- •The search user experience
- •User query needs
- •Index size and estimation
- •Near-duplicates and shingling
- •References and further reading
- •Web crawling and indexes
- •Overview
- •Crawling
- •Crawler architecture
- •DNS resolution
- •The URL frontier
- •Distributing indexes
- •Connectivity servers
- •References and further reading
- •Link analysis
- •The Web as a graph
- •Anchor text and the web graph
- •PageRank
- •Markov chains
- •The PageRank computation
- •Hubs and Authorities
- •Choosing the subset of the Web
- •References and further reading
- •Bibliography
- •Author Index
78 |
4 Index construction |
sufficient size.
?Exercise 4.3
For n = 15 splits, r = 10 segments, and j = 3 term partitions, how long would distributed index creation take for Reuters-RCV1 in a MapReduce architecture? Base your assumptions about cluster machines on Table 4.1.
4.5Dynamic indexing
Thus far, we have assumed that the document collection is static. This is fine for collections that change infrequently or never (e.g., the Bible or Shakespeare). But most collections are modified frequently with documents being added, deleted, and updated. This means that new terms need to be added to the dictionary, and postings lists need to be updated for existing terms.
The simplest way to achieve this is to periodically reconstruct the index from scratch. This is a good solution if the number of changes over time is small and a delay in making new documents searchable is acceptable – and if enough resources are available to construct a new index while the old one is still available for querying.
If there is a requirement that new documents be included quickly, one solu- AUXILIARY INDEX tion is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory. Searches are run across both indexes and results merged. Deletions are stored in an invalidation bit vector. We can then filter out deleted documents before returning the search result. Documents are updated by deleting and reinserting
them.
Each time the auxiliary index becomes too large, we merge it into the main index. The cost of this merging operation depends on how we store the index in the file system. If we store each postings list as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index. In this scheme, the reason for keeping the auxiliary index is to reduce the number of disk seeks required over time. Updating each document separately requires up to Mave disk seeks, where Mave is the average size of the vocabulary of documents in the collection. With an auxiliary index, we only put additional load on the disk when we merge auxiliary and main indexes.
Unfortunately, the one-file-per-postings-list scheme is infeasible because most file systems cannot efficiently handle very large numbers of files. The simplest alternative is to store the index as one large file, that is, as a concatenation of all postings lists. In reality, we often choose a compromise between the two extremes (Section 4.7). To simplify the discussion, we choose the simple option of storing the index as one large file here.
Online edition (c) 2009 Cambridge UP
LOGARITHMIC MERGING
4.5 Dynamic indexing |
79 |
LMERGEADDTOKEN(indexes, Z0, token) 1 Z0 ← MERGE(Z0, {token})
2if |Z0| = n
3then for i ← 0 to ∞
4do if Ii indexes
5 |
then Zi+1 ← MERGE(Ii, Zi) |
6 |
(Zi+1 is a temporary index on disk.) |
7 |
indexes ← indexes − {Ii} |
8 |
else Ii ← Zi (Zi becomes the permanent index Ii.) |
9 |
indexes ← indexes {Ii} |
10 |
BREAK |
11 |
Z0 ← |
LOGARITHMICMERGE()
1Z0 ← (Z0 is the in-memory index.)
2 indexes ←
3while true
4do LMERGEADDTOKEN(indexes, Z0, GETNEXTTOKEN())
Figure 4.7 Logarithmic merging. Each token (termID,docID) is initially added to
in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.
In this scheme, we process each posting T/n times because we touch it during each of T/n merges where n is the size of the auxiliary index and T the total number of postings. Thus, the overall time complexity is Θ(T2/n). (We neglect the representation of terms here and consider only the docIDs. For the purpose of time complexity, a postings list is simply a list of docIDs.)
We can do better than Θ(T2/n) by introducing log2(T/n) indexes I0, I1, I2, . . . of size 20 × n, 21 × n, 22 × n . . . . Postings percolate up this sequence of indexes and are processed only once on each level. This scheme is called logarithmic merging (Figure 4.7). As before, up to n postings are accumulated in an in-memory auxiliary index, which we call Z0. When the limit n is reached, the 20 × n postings in Z0 are transferred to a new index I0 that is created on disk. The next time Z0 is full, it is merged with I0 to create an index Z1 of size 21 × n. Then Z1 is either stored as I1 (if there isn’t already an I1) or merged with I1 into Z2 (if I1 exists); and so on. We service search requests by querying in-memory Z0 and all currently valid indexes Ii on disk and merging the results. Readers familiar with the binomial heap data structure2 will recog-
2. See, for example, (Cormen et al. 1990, Chapter 19).
Online edition (c) 2009 Cambridge UP
80 |
4 Index construction |
nize its similarity with the structure of the inverted indexes in logarithmic merging.
Overall index construction time is Θ(T log(T/n)) because each posting is processed only once on each of the log(T/n) levels. We trade this efficiency gain for a slow down of query processing; we now need to merge results from log(T/n) indexes as opposed to just two (the main and auxiliary indexes). As in the auxiliary index scheme, we still need to merge very large indexes occasionally (which slows down the search system during the merge), but this happens less frequently and the indexes involved in a merge on average are smaller.
Having multiple indexes complicates the maintenance of collection-wide statistics. For example, it affects the spelling correction algorithm in Section 3.3 (page 56) that selects the corrected alternative with the most hits. With multiple indexes and an invalidation bit vector, the correct number of hits for a term is no longer a simple lookup. In fact, all aspects of an IR system – index maintenance, query processing, distribution, and so on – are more complex in logarithmic merging.
Because of this complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy. They do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched from the new index and the old index is deleted.
?Exercise 4.4
For n = 2 and 1 ≤ T ≤ 30, perform a step-by-step simulation of the algorithm in Figure 4.7. Create a table that shows, for each point in time at which T = 2 k tokens
have been processed (1 ≤ k ≤ 15), which of the three indexes I0, . . . , I3 are in use. The first three lines of the table are given below.
|
I3 |
I2 |
I1 |
I0 |
2 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
1 |
6 |
0 |
0 |
1 |
0 |
4.6Other types of indexes
This chapter only describes construction of nonpositional indexes. Except for the much larger data volume we need to accommodate, the main difference for positional indexes is that (termID, docID, (position1, position2, . . . )) triples, instead of (termID, docID) pairs have to be processed and that tokens and postings contain positional information in addition to docIDs. With this change, the algorithms discussed here can all be applied to positional indexes.
In the indexes we have considered so far, postings lists are ordered with respect to docID. As we see in Chapter 5, this is advantageous for compres-
Online edition (c) 2009 Cambridge UP
4.6 Other types of indexes |
81 |
documents
users |
0/1 |
|
|
0 if user can’t read |
|
|
|||
|
|
|
|
doc, 1 otherwise. |
RANKED RETRIEVAL SYSTEMS
SECURITY
ACCESS CONTROL LISTS
Figure 4.8 A user-document matrix for access control lists. Element (i, j) is 1 if user i has access to document j and 0 otherwise. During query processing, a user’s access postings list is intersected with the results list returned by the text part of the index.
sion – instead of docIDs we can compress smaller gaps between IDs, thus reducing space requirements for the index. However, this structure for the index is not optimal when we build ranked (Chapters 6 and 7) – as opposed to Boolean – retrieval systems. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query (see Chapter 6). In a docID-sorted index, new documents are always inserted at the end of postings lists. In an impact-sorted index (Section 7.1.5, page 140), the insertion can occur anywhere, thus complicating the update of the inverted index.
Securityis an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it. Users’ results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.
User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access them (Figure 4.8) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a “postings list” of documents they can access – the user’s access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change – we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time – even though this slows down retrieval.
Online edition (c) 2009 Cambridge UP
82 |
|
4 |
Index construction |
|
Table 4.3 The five steps in constructing an index for Reuters-RCV1 in blocked |
||||
sort-based indexing. Line numbers refer to Figure 4.2. |
|
|
||
|
|
Step |
Time |
|
|
1 |
reading of collection (line 4) |
|
|
2 |
10 initial sorts of 107 records each (line 5) |
|
|
|
3 |
writing of 10 blocks (line 6) |
|
|
|
4 |
total disk transfer time for merging (line 7) |
|
|
|
5time of actual merging (line 7) total
Table 4.4 Collection statistics for a large collection.
Symbol |
Statistic |
Value |
|
N |
# documents |
1,000,000,000 |
|
Lave |
# tokens per document |
1000 |
|
M |
# distinct terms |
44,000,000 |
|
We discussed indexes for storing and retrieving terms (as opposed to documents) in Chapter 3.
?Exercise 4.5
Can spelling correction compromise document-level security? Consider the case where a spelling correction is based on documents to which the user does not have access.
?Exercise 4.6
Total index construction time in blocked sort-based indexing is broken down in Table 4.3. Fill out the time column of the table for Reuters-RCV1 assuming a system with the parameters given in Table 4.1.
Exercise 4.7
Repeat Exercise 4.6 for the larger collection in Table 4.4. Choose a block size that is realistic for current technology (remember that a block should easily fit into main memory). How many blocks do you need?
Exercise 4.8
Assume that we have a collection of modest size whose index can be constructed with the simple in-memory indexing algorithm in Figure 1.4 (page 8). For this collection, compare memory, disk and time requirements of the simple algorithm in Figure 1.4 and blocked sort-based indexing.
Exercise 4.9
Assume that machines in MapReduce have 100 GB of disk space each. Assume further that the postings list of the term the has a size of 200 GB. Then the MapReduce algorithm as described cannot be run to construct the index. How would you modify MapReduce so that it can handle this case?
Online edition (c) 2009 Cambridge UP
