17.2 Single-link and complete-link clustering   385

[Figure 17.7 here: a number line from 0 to 7 with the five documents d1, . . . , d5 marked as × and the two complete-link clusters drawn as ellipses.]
Figure 17.7 Outliers in complete-link clustering. The five documents have the x-coordinates 1 + 2ε, 4, 5 + 2ε, 6 and 7 − ε. Complete-link clustering creates the two clusters shown as ellipses. The most intuitive two-cluster clustering is {{d1}, {d2, d3, d4, d5}}, but in complete-link clustering, the outlier d1 splits {d2, d3, d4, d5} as shown.
distances without regard to the overall shape of the emerging cluster. This effect is called chaining.
The chaining effect is also apparent in Figure 17.1. The last eleven merges of the single-link clustering (those above the 0.1 line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.
However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7, the four documents d2, d3, d4, d5 are split because of the outlier d1 at the left edge (Exercise 17.1). Complete-link clustering does not find the most intuitive cluster structure in this example.
17.2.1 Time complexity of HAC
The complexity of the naive HAC algorithm in Figure 17.2 is Θ(N³) because we exhaustively scan the N × N matrix C for the largest similarity in each of N − 1 iterations.
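The naive scheme can be sketched in a few lines of Python. This is only an illustration, not the book's Figure 17.2: the function name, the 1D similarity sim(x, y) = −|x − y|, and the choice of ε = 0.01 for the points of Figure 17.7 are all assumptions made here. The exhaustive pair scan inside the loop is the Θ(N²) step that, repeated N − 1 times, gives Θ(N³) overall.

```python
def naive_hac(points, combine, k=1):
    """Naive HAC: repeatedly merge the two most similar clusters until
    k clusters remain. combine = max gives single-link, min complete-link."""
    n = len(points)
    # similarity between points i < j; 1D points, sim = -distance
    sim = {(i, j): -abs(points[i] - points[j])
           for i in range(n) for j in range(i + 1, n)}
    clusters = {i: {i} for i in range(n)}
    while len(clusters) > k:
        # exhaustive O(N^2) scan for the most similar active pair:
        # doing this N - 1 times is what makes the algorithm Theta(N^3)
        a, b = max(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda pair: sim[pair])
        clusters[a] |= clusters.pop(b)            # a represents the merge
        for c in clusters:
            if c != a:                            # update a's similarities
                pac = (min(a, c), max(a, c))
                pbc = (min(b, c), max(b, c))
                sim[pac] = combine(sim[pac], sim[pbc])
    return sorted(sorted(c) for c in clusters.values())

# Figure 17.7's documents d1..d5 with epsilon = 0.01:
pts = [1.02, 4.0, 5.02, 6.0, 6.99]
print(naive_hac(pts, max, k=2))   # single-link:   [[0], [1, 2, 3, 4]]
print(naive_hac(pts, min, k=2))   # complete-link: [[0, 1], [2, 3, 4]]
```

Run on these points, single-link leaves the outlier d1 on its own while a chain absorbs d2 through d5, whereas complete-link splits off {d1, d2}: exactly the contrast discussed around Figures 17.1 and 17.7.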
For the four HAC methods discussed in this chapter a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8. Its time complexity is Θ(N² log N). The rows C[k] of the N × N similarity matrix C are sorted in decreasing order of similarity in the priority queues P. P[k].MAX() then returns the cluster in P[k] that currently has the highest similarity with ωk, where we use ωk to denote the kth cluster as in Chapter 16. After creating the merged cluster of ωk1 and ωk2, ωk1 is used as its representative. The function SIM computes the similarity function for potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3), and centroid similarity for centroid clustering (Section 17.4).
Online edition (c) 2009 Cambridge UP
EFFICIENTHAC(d⃗1, . . . , d⃗N)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← d⃗n · d⃗i
 4       C[n][i].index ← i
 5    I[n] ← 1
 6    P[n] ← priority queue for C[n] sorted on sim
 7    P[n].DELETE(C[n][n])   (don't want self-similarities)
 8 A ← []
 9 for k ← 1 to N − 1
10 do k1 ← arg max{k : I[k]=1} P[k].MAX().sim
11    k2 ← P[k1].MAX().index
12    A.APPEND(⟨k1, k2⟩)
13    I[k2] ← 0
14    P[k1] ← []
15    for each i with I[i] = 1 ∧ i ≠ k1
16    do P[i].DELETE(C[i][k1])
17       P[i].DELETE(C[i][k2])
18       C[i][k1].sim ← SIM(i, k1, k2)
19       P[i].INSERT(C[i][k1])
20       C[k1][i].sim ← SIM(i, k1, k2)
21       P[k1].INSERT(C[k1][i])
22 return A
clustering algorithm   SIM(i, k1, k2)
single-link            max(SIM(i, k1), SIM(i, k2))
complete-link          min(SIM(i, k1), SIM(i, k2))
centroid               ((1/Nm) v⃗m) · ((1/Ni) v⃗i)
group-average          (1/((Nm + Ni)(Nm + Ni − 1))) [(v⃗m + v⃗i)² − (Nm + Ni)]
compute C[5]                  index  1    2    3    4    5
                              sim    0.2  0.8  0.6  0.4  1.0

create P[5] (by sorting)      index  2    3    4    1
                              sim    0.8  0.6  0.4  0.2

merge 2 and 3, update         index  2    4    1
similarity of 2, delete 3     sim    0.3  0.4  0.2

delete and reinsert 2         index  4    2    1
                              sim    0.4  0.3  0.2
Figure 17.8 The priority-queue algorithm for HAC. Top: The algorithm. Center: Four different similarity measures. Bottom: An example for processing steps 6 and 16–19. This is a made-up example showing P[5] for a 5 × 5 matrix C.
SINGLELINKCLUSTERING(d1, . . . , dN)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← SIM(dn, di)
 4       C[n][i].index ← i
 5    I[n] ← n
 6    NBM[n] ← arg max{X ∈ {C[n][i] : n ≠ i}} X.sim
 7 A ← []
 8 for n ← 1 to N − 1
 9 do i1 ← arg max{i : I[i]=i} NBM[i].sim
10    i2 ← I[NBM[i1].index]
11    A.APPEND(⟨i1, i2⟩)
12    for i ← 1 to N
13    do if I[i] = i ∧ i ≠ i1 ∧ i ≠ i2
14       then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
15       if I[i] = i2
16       then I[i] ← i1
17    NBM[i1] ← arg max{X ∈ {C[i1][i] : I[i]=i ∧ i ≠ i1}} X.sim
18 return A
Figure 17.9 Single-link clustering algorithm using an NBM array. After merging two clusters i1 and i2, the first one (i1) represents the merged cluster. If I[i] = i, then i is the representative of its current cluster. If I[i] ≠ i, then i has been merged into the cluster represented by I[i] and will therefore be ignored when updating NBM[i1].
We give an example of how a row of C is processed (Figure 17.8, bottom panel). The loop in lines 1–7 is Θ(N²) and the loop in lines 9–21 is Θ(N² log N) for an implementation of priority queues that supports deletion and insertion in Θ(log N). The overall complexity of the algorithm is therefore Θ(N² log N). In the definition of the function SIM, v⃗m and v⃗i are the vector sums of ωk1 ∪ ωk2 and ωi, respectively, and Nm and Ni are the number of documents in ωk1 ∪ ωk2 and ωi, respectively.
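The centroid and group-average rows of the SIM table can be computed from exactly these statistics: the per-cluster vector sum and document count. A minimal Python sketch (the function names are mine, and documents are assumed length-normalized as in Chapter 16):

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def vadd(u, v):
    """Componentwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def centroid_sim(vm, nm, vi, ni):
    # ((1/Nm) vsum_m) . ((1/Ni) vsum_i)
    return dot(vm, vi) / (nm * ni)

def group_average_sim(vm, nm, vi, ni):
    # [ (vsum_m + vsum_i)^2 - (Nm + Ni) ] / [ (Nm + Ni)(Nm + Ni - 1) ]
    s = vadd(vm, vi)
    n = nm + ni
    return (dot(s, s) - n) / (n * (n - 1))

# Two orthogonal unit-vector documents: both similarities are 0.
print(centroid_sim([1.0, 0.0], 1, [0.0, 1.0], 1))       # 0.0
print(group_average_sim([1.0, 0.0], 1, [0.0, 1.0], 1))  # 0.0
```

The single-link and complete-link rows need no vector statistics at all; they are just max and min of the two cached similarities.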
The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering (Sections 17.3 and 17.4) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.
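For the single-link and complete-link cases, the priority-queue algorithm can be sketched in runnable Python. This is an illustrative reimplementation, not the book's code: it takes a precomputed similarity matrix rather than vectors, and because Python's heapq has no DELETE operation, the deletions in lines 16–17 of the pseudocode are emulated by lazily discarding stale heap entries (partner already merged, or similarity since updated).

```python
import heapq

def efficient_hac(C, update=max):
    """Priority-queue HAC in the style of Figure 17.8 over a precomputed
    similarity matrix C (list of lists). update = max gives single-link,
    update = min gives complete-link. Returns the merge sequence A of
    (k1, k2) pairs, where k1 represents each merged cluster."""
    n = len(C)
    C = [row[:] for row in C]           # work on a copy
    I = [True] * n                      # I[k]: cluster k is still active
    # P[k]: max-heap over (-sim, i); heapq cannot DELETE, so stale
    # entries are skipped lazily in best_sim instead.
    P = [[(-C[k][i], i) for i in range(n) if i != k] for k in range(n)]
    for h in P:
        heapq.heapify(h)

    def best_sim(k):                    # P[k].MAX().sim, with lazy cleanup
        while P[k] and (not I[P[k][0][1]]
                        or -P[k][0][0] != C[k][P[k][0][1]]):
            heapq.heappop(P[k])         # merged-away partner or stale sim
        return -P[k][0][0] if P[k] else float("-inf")

    A = []
    for _ in range(n - 1):
        k1 = max((k for k in range(n) if I[k]), key=best_sim)
        k2 = P[k1][0][1]
        A.append((k1, k2))
        I[k2] = False                   # k1 represents the merged cluster
        P[k1] = []
        for i in range(n):
            if I[i] and i != k1:
                C[i][k1] = C[k1][i] = update(C[i][k1], C[i][k2])
                heapq.heappush(P[i], (-C[i][k1], k1))
                heapq.heappush(P[k1], (-C[k1][i], i))
    return A

pts = [1.02, 4.0, 5.02, 6.0, 6.99]      # Figure 17.7's points
C = [[-abs(x - y) for y in pts] for x in pts]
print(efficient_hac(C, max))  # single-link: [(2, 3), (2, 4), (1, 2), (0, 1)]
```

GAAC and centroid clustering would substitute a recomputation from the vector sums and counts in place of `update`, following the SIM table of Figure 17.8.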
For single-link, we can introduce a next-best-merge array (NBM) as a further optimization, as shown in Figure 17.9. NBM keeps track of the best merge for each cluster. Each of the two top-level for-loops in Figure 17.9 is Θ(N²), so the overall complexity of single-link clustering is Θ(N²).
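The NBM optimization can likewise be sketched in Python, again over a precomputed similarity matrix (an assumption of this sketch, as are the names). I[i] plays the same representative-tracking role as in Figure 17.9, and NBM[i] caches only an index, which is resolved through I at merge time; the cached value stays valid for single-link because merging can only raise a cluster's single-link similarity to others.

```python
def single_link_nbm(C):
    """Theta(N^2) single-link clustering with a next-best-merge array,
    following Figure 17.9. Returns the merge sequence of (i1, i2) pairs."""
    n = len(C)
    C = [row[:] for row in C]
    I = list(range(n))        # I[i] = i  iff  i is a representative
    NBM = [max((j for j in range(n) if j != i), key=lambda j: C[i][j])
           for i in range(n)]
    A = []
    for _ in range(n - 1):
        reps = [i for i in range(n) if I[i] == i]
        i1 = max(reps, key=lambda i: C[i][NBM[i]])
        i2 = I[NBM[i1]]       # resolve the cached index to a representative
        A.append((i1, i2))
        for i in range(n):
            if I[i] == i and i not in (i1, i2):
                # single-link: merged cluster keeps the larger similarity
                C[i1][i] = C[i][i1] = max(C[i1][i], C[i2][i])
            if I[i] == i2:
                I[i] = i1     # i2's members now point at i1
        cands = [i for i in range(n) if I[i] == i and i != i1]
        if cands:             # only NBM[i1] needs recomputing
            NBM[i1] = max(cands, key=lambda i: C[i1][i])
    return A

pts = [1.02, 4.0, 5.02, 6.0, 6.99]      # Figure 17.7's points
C = [[-abs(x - y) for y in pts] for x in pts]
print(single_link_nbm(C))     # [(2, 3), (2, 4), (1, 2), (0, 1)]
```

On the points of Figure 17.7 this produces the same single-link merge sequence as the priority-queue algorithm, but with only two nested Θ(N) loops per merge.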