
16 Flat clustering
Germany, Russia and Sports) is a better choice than K = 1. In practice, Equation (16.11) is often more useful than Equation (16.13), with the caveat that we need to come up with an estimate for λ.
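The model-selection idea behind this criterion can be sketched in code. This is an illustrative sketch only: it assumes a penalized criterion of the form RSS_min(K) + λK (the form given earlier in the chapter), and the RSS values below are invented for illustration.

```python
# Hypothetical sketch: choose the cluster cardinality K that minimizes
# estimated distortion plus a penalty lambda * K. The rss_by_k values
# are made up; in practice they would come from running K-means for
# each K and taking the minimal residual sum of squares.

def choose_k(rss_by_k, lam):
    """Return the K minimizing RSS_min(K) + lam * K."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + lam * k)

rss_by_k = {1: 100.0, 2: 60.0, 3: 35.0, 4: 30.0, 5: 28.0}
print(choose_k(rss_by_k, lam=10.0))  # -> 3
```

Note how the penalty term makes K = 4 and K = 5 unattractive even though they reduce RSS slightly; the estimate of λ controls where that trade-off lands.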
Exercise 16.4
Why are documents that do not use the same term for the concept car likely to end up in the same cluster in K-means clustering?
Exercise 16.5
Two of the possible termination conditions for K-means were (1) assignment does not change, (2) centroids do not change (page 361). Do these two conditions imply each other?
16.5 Model-based clustering
In this section, we describe a generalization of K-means, the EM algorithm. It can be applied to a larger variety of document representations and distributions than K-means. In K-means, we attempt to find centroids that are good representatives. We can view the set of K centroids as a model that generates the data. Generating a document in this model consists of first picking a centroid at random and then adding some noise. If the noise is normally distributed, this procedure will result in clusters of spherical shape. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model that we recover from the data then defines clusters and an assignment of documents to clusters.
A commonly used criterion for estimating the model parameters is maximum likelihood. In K-means, the quantity exp(−RSS) is proportional to the likelihood that a particular model (i.e., a set of centroids) generated the data. For K-means, maximum likelihood and minimal RSS are equivalent criteria. We denote the model parameters by Θ. In K-means, Θ = {µ⃗1, . . . , µ⃗K}. More generally, the maximum likelihood criterion is to select the parameters Θ that maximize the log-likelihood of generating the data D:
    Θ = arg max_Θ L(D|Θ) = arg max_Θ log ∏_{n=1}^{N} P(dn|Θ) = arg max_Θ ∑_{n=1}^{N} log P(dn|Θ)
L(D|Θ) is the objective function that measures the goodness of the clustering. Given two clusterings with the same number of clusters, we prefer the one with higher L(D|Θ).
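The maximum likelihood criterion above can be sketched as follows. This is a generic sketch, not the book's implementation: `doc_prob(d, theta)` stands for P(d|Θ) under some model, and the coin-flip instance at the end is invented purely to make the sketch runnable.

```python
import math

def log_likelihood(docs, theta, doc_prob):
    """log L(D|Theta) = sum over documents of log P(d_n | Theta)."""
    return sum(math.log(doc_prob(d, theta)) for d in docs)

def best_model(docs, candidates, doc_prob):
    """Among candidate parameter sets, pick the Theta maximizing L(D|Theta)."""
    return max(candidates, key=lambda th: log_likelihood(docs, th, doc_prob))

# Toy instance: "documents" are coin flips and Theta is the heads probability.
flips = ["H", "H", "T", "H"]
heads_prob = lambda d, p: p if d == "H" else 1.0 - p
print(best_model(flips, [0.25, 0.5, 0.75], heads_prob))  # -> 0.75
```

Given two candidate clusterings (two values of Θ), `best_model` prefers the one with higher L(D|Θ), exactly as the text prescribes.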
This is the same approach we took in Chapter 12 (page 237) for language modeling and in Section 13.1 (page 265) for text classification. In text classification, we chose the class that maximizes the likelihood of generating a particular document. Here, we choose the clustering Θ that maximizes the
likelihood of generating a given set of documents. Once we have Θ, we can compute an assignment probability P(d|ωk; Θ) for each document-cluster pair. This set of assignment probabilities defines a soft clustering.
An example of a soft assignment is that a document about Chinese cars may have a fractional membership of 0.5 in each of the two clusters China and automobiles, reflecting the fact that both topics are pertinent. A hard clustering like K-means cannot model this simultaneous relevance to two topics.
Model-based clustering provides a framework for incorporating our knowledge about a domain. K-means and the hierarchical algorithms in Chapter 17 make fairly rigid assumptions about the data. For example, clusters in K-means are assumed to be spheres. Model-based clustering offers more flexibility. The clustering model can be adapted to what we know about the underlying distribution of the data, be it Bernoulli (as in the example in Table 16.3), Gaussian with non-spherical variance (another model that is important in document clustering) or a member of a different family.
A commonly used algorithm for model-based clustering is the Expectation-Maximization algorithm or EM algorithm. EM clustering is an iterative algorithm that maximizes L(D|Θ). EM can be applied to many different types of probabilistic modeling. We will work with a mixture of multivariate Bernoulli distributions here, the distribution we know from Section 11.3 (page 222) and Section 13.3 (page 263):
(16.14)

    P(d|ωk; Θ) = ( ∏_{tm ∈ d} qmk ) ( ∏_{tm ∉ d} (1 − qmk) )

where Θ = {Θ1, . . . , ΘK}, Θk = (αk, q1k, . . . , qMk), and qmk = P(Um = 1|ωk)³ are the parameters of the model. P(Um = 1|ωk) is the probability that a document from cluster ωk contains term tm. The probability αk is the prior of cluster ωk: the probability that a document d is in ωk if we have no information about d.
The mixture model then is:

(16.15)

    P(d|Θ) = ∑_{k=1}^{K} αk ( ∏_{tm ∈ d} qmk ) ( ∏_{tm ∉ d} (1 − qmk) )
In this model, we generate a document by first picking a cluster k with probability αk and then generating the terms of the document according to the parameters qmk. Recall that the document representation of the multivariate Bernoulli is a vector of M Boolean values (and not a real-valued vector).
3. Um is the random variable we defined in Section 13.3 (page 266) for the Bernoulli Naive Bayes model. It takes the values 1 (term tm is present in the document) and 0 (term tm is absent in the document).
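Equations (16.14) and (16.15) translate directly into code. In this sketch, documents are represented as sets of terms, `q[k][t]` plays the role of qmk, and `alpha[k]` of αk; the two-cluster parameter values are invented for illustration, not taken from the text.

```python
def cluster_prob(doc, k, vocab, q):
    """P(d | omega_k; Theta): product over present and absent terms (16.14)."""
    p = 1.0
    for t in vocab:
        p *= q[k][t] if t in doc else (1.0 - q[k][t])
    return p

def doc_prob(doc, vocab, alpha, q):
    """Mixture probability P(d | Theta): weighted sum over clusters (16.15)."""
    return sum(alpha[k] * cluster_prob(doc, k, vocab, q)
               for k in range(len(alpha)))

vocab = ["cocoa", "sugar"]
alpha = [0.5, 0.5]                    # uniform cluster priors
q = [{"cocoa": 0.8, "sugar": 0.1},    # cluster 0 favors cocoa
     {"cocoa": 0.1, "sugar": 0.8}]    # cluster 1 favors sugar
print(round(doc_prob({"cocoa"}, vocab, alpha, q), 4))  # -> 0.37
```

Note that absent terms contribute factors (1 − qmk): a document is scored on what it omits as well as what it contains, reflecting the Boolean vector representation of the multivariate Bernoulli.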
Online edition (c) 2009 Cambridge UP
How do we use EM to infer the parameters of the clustering from the data? That is, how do we choose parameters Θ that maximize L(D|Θ)? EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids; the parameters of the instance of EM in this section are the αk and qmk.
The maximization step recomputes the conditional parameters qmk and the priors αk as follows:
(16.16)

    Maximization step:  qmk = ( ∑_{n=1}^{N} rnk I(tm ∈ dn) ) / ( ∑_{n=1}^{N} rnk )

(16.17)

    αk = ( ∑_{n=1}^{N} rnk ) / N

where I(tm ∈ dn) = 1 if tm ∈ dn and 0 otherwise, and rnk is the soft assignment of document dn to cluster k as computed in the preceding iteration. (We'll address the issue of initialization in a moment.) These are the maximum likelihood estimates for the parameters of the multivariate Bernoulli from Table 13.3 (page 268), except that documents are assigned fractionally to clusters here. These maximum likelihood estimates maximize the likelihood of the data given the model.
The expectation step computes the soft assignment of documents to clusters given the current parameters qmk and αk:
    Expectation step:  rnk = αk ( ∏_{tm ∈ dn} qmk ) ( ∏_{tm ∉ dn} (1 − qmk) ) / [ ∑_{k=1}^{K} αk ( ∏_{tm ∈ dn} qmk ) ( ∏_{tm ∉ dn} (1 − qmk) ) ]
This expectation step applies Equations (16.14) and (16.15) to computing the likelihood that ωk generated document dn. It is the classification procedure for the multivariate Bernoulli in Table 13.3. Thus, the expectation step is nothing other than Bernoulli Naive Bayes classification (including normalization, i.e., dividing by the denominator, to get a probability distribution over clusters).
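The alternation of maximization step (16.16–16.17) and expectation step can be sketched as a compact EM loop. This is an illustrative sketch, not the book's code: documents are term sets, `r[n][k]` is the soft assignment rnk, the four toy documents and the two-document seeding are invented, and smoothing with a small ε follows the caveat in the caption of Table 16.3.

```python
def m_step(docs, r, vocab, eps=1e-4):
    """Recompute alpha_k and q_mk from soft assignments (16.16, 16.17)."""
    K, N = len(r[0]), len(docs)
    alpha, q = [], []
    for k in range(K):
        nk = sum(r[n][k] + eps for n in range(N))  # smoothed soft count of cluster k
        alpha.append(sum(r[n][k] for n in range(N)) / N)
        q.append({t: sum(r[n][k] + eps for n in range(N) if t in docs[n]) / nk
                  for t in vocab})
    return alpha, q

def e_step(docs, alpha, q, vocab):
    """Soft-assign documents to clusters given current parameters."""
    r = []
    for d in docs:
        scores = []
        for k in range(len(alpha)):
            p = alpha[k]
            for t in vocab:
                p *= q[k][t] if t in d else (1.0 - q[k][t])
            scores.append(p)
        z = sum(scores)  # normalization: a distribution over clusters
        r.append([s / z for s in scores])
    return r

docs = [{"cocoa", "beans"}, {"cocoa", "ghana"}, {"sugar", "cane"}, {"sugar", "beet"}]
vocab = sorted(set().union(*docs))
# Seed document 0 into cluster 0 and document 2 into cluster 1; split the rest.
r = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, 0.5]]
for _ in range(20):
    alpha, q = m_step(docs, r, vocab)
    r = e_step(docs, alpha, q, vocab)
print([[round(x, 2) for x in row] for row in r])
```

As in the text, cluster membership spreads from the seeds through shared terms: the second cocoa document follows the cocoa seed, and the second sugar document follows the sugar seed.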
We clustered a set of 11 documents into two clusters using EM in Table 16.3. After convergence in iteration 25, the first 5 documents are assigned to cluster 1 (ri,1 = 1.00) and the last 6 to cluster 2 (ri,1 = 0.00). Somewhat atypically, the final assignment is a hard assignment here. EM usually converges to a soft assignment. In iteration 25, the prior α1 for cluster 1 is 5/11 ≈ 0.45 because 5 of the 11 documents are in cluster 1. Some terms are quickly associated with one cluster because the initial assignment can “spread” to them unambiguously. For example, membership in cluster 2 spreads from document 7 to document 8 in the first iteration because they share sugar (r8,1 = 0 in iteration 1). For parameters of terms occurring in ambiguous contexts, convergence takes longer. Seed documents 6 and 7
(a)

    docID  document text               docID  document text
    1      hot chocolate cocoa beans   7      sweet sugar
    2      cocoa ghana africa          8      sugar cane brazil
    3      beans harvest ghana         9      sweet sugar beet
    4      cocoa butter                10     sweet cake icing
    5      butter truffles             11     cake black forest
    6      sweet chocolate
(b)

    Parameter      Iteration of clustering
                   0      1      2      3      4      5      15     25
    α1                    0.50   0.45   0.53   0.57   0.58   0.54   0.45
    r1,1                  1.00   1.00   1.00   1.00   1.00   1.00   1.00
    r2,1                  0.50   0.79   0.99   1.00   1.00   1.00   1.00
    r3,1                  0.50   0.84   1.00   1.00   1.00   1.00   1.00
    r4,1                  0.50   0.75   0.94   1.00   1.00   1.00   1.00
    r5,1                  0.50   0.52   0.66   0.91   1.00   1.00   1.00
    r6,1           1.00   1.00   1.00   1.00   1.00   1.00   0.83   0.00
    r7,1           0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
    r8,1                  0.00   0.00   0.00   0.00   0.00   0.00   0.00
    r9,1                  0.00   0.00   0.00   0.00   0.00   0.00   0.00
    r10,1                 0.50   0.40   0.14   0.01   0.00   0.00   0.00
    r11,1                 0.50   0.57   0.58   0.41   0.07   0.00   0.00
    qafrica,1             0.000  0.100  0.134  0.158  0.158  0.169  0.200
    qafrica,2             0.000  0.083  0.042  0.001  0.000  0.000  0.000
    qbrazil,1             0.000  0.000  0.000  0.000  0.000  0.000  0.000
    qbrazil,2             0.000  0.167  0.195  0.213  0.214  0.196  0.167
    qcocoa,1              0.000  0.400  0.432  0.465  0.474  0.508  0.600
    qcocoa,2              0.000  0.167  0.090  0.014  0.001  0.000  0.000
    qsugar,1              0.000  0.000  0.000  0.000  0.000  0.000  0.000
    qsugar,2              1.000  0.500  0.585  0.640  0.642  0.589  0.500
    qsweet,1              1.000  0.300  0.238  0.180  0.159  0.153  0.000
    qsweet,2              1.000  0.417  0.507  0.610  0.640  0.608  0.667
Table 16.3  The EM clustering algorithm. The table shows a set of documents (a) and parameter values for selected iterations during EM clustering (b). Parameters shown are prior α1, soft assignment scores rn,1 (both omitted for cluster 2), and lexical parameters qm,k for a few terms. The authors initially assigned document 6 to cluster 1 and document 7 to cluster 2 (iteration 0). EM converges after 25 iterations. For smoothing, the rnk in Equation (16.16) were replaced with rnk + ε where ε = 0.0001.