frequency). There is no need to length-normalize vectors. Using Rocchio relevance feedback as in Equation (9.3), what would the revised query vector be after relevance feedback? Assume α = 1, β = 0.75, γ = 0.25.
Exercise 9.6
Omar has implemented a relevance feedback web search system, where he is going to do relevance feedback based only on words in the title text returned for a page (for efficiency). The user is going to rank 3 results. The first user, Jinxing, queries for:
banana slug
and the top three titles returned are:
banana slug Ariolimax columbianus
Santa Cruz mountains banana slug
Santa Cruz Campus Mascot
Jinxing judges the first two documents relevant, and the third nonrelevant. Assume that Omar’s search engine uses term frequency but neither length normalization nor IDF. Assume that he is using the Rocchio relevance feedback mechanism, with α = β = γ = 1. Show the final revised query that would be run. (Please list the vector elements in alphabetical order.)
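To make the mechanism in these exercises concrete, here is a minimal sketch of the Rocchio update of Equation (9.3) over raw term-frequency vectors. The function and the toy data are illustrative, not part of the exercise statement; clipping negative components to zero follows the usual Rocchio convention.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio relevance feedback on sparse term-frequency vectors.

    The query and each document are Counters mapping term -> weight.
    Returns alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    with negative components clipped to zero.
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    revised = Counter()
    for t in terms:
        w = alpha * query[t]
        if relevant:
            w += beta * sum(d[t] for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d[t] for d in nonrelevant) / len(nonrelevant)
        revised[t] = max(w, 0.0)  # negative weights are set to 0
    return revised

# Toy run in the style of Exercise 9.6 (alpha = beta = gamma = 1):
q = Counter({"banana": 1, "slug": 1})
rel = [Counter({"banana": 1, "slug": 1, "ariolimax": 1, "columbianus": 1}),
       Counter({"santa": 1, "cruz": 1, "mountains": 1, "banana": 1, "slug": 1})]
nonrel = [Counter({"santa": 1, "cruz": 1, "campus": 1, "mascot": 1})]
print(sorted(rocchio(q, rel, nonrel, 1, 1, 1).items()))
```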
9.2 Global methods for query reformulation
In this section we more briefly discuss three global methods for expanding a query: by simply aiding the user in doing so, by using a manual thesaurus, and through building a thesaurus automatically.
9.2.1 Vocabulary tools for query reformulation
Various user supports in the search process can help the user see how their searches are or are not working. This includes information about words that were omitted from the query because they were on stop lists, what words were stemmed to, the number of hits on each term or phrase, and whether words were dynamically turned into phrases. The IR system might also suggest search terms by means of a thesaurus or a controlled vocabulary. A user can also be allowed to browse lists of the terms that are in the inverted index, and thus find good terms that appear in the collection.
9.2.2 Query expansion
In relevance feedback, users give additional input on documents (by marking documents in the results set as relevant or not), and this input is used to reweight the terms in the query for documents. In query expansion, on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms.
Figure 9.6 An example of query expansion in the interface of the Yahoo! web search engine in 2006. The expanded query suggestions appear just below the “Search Results” bar.
Some search engines (especially on the web) suggest related queries in response to a query; the users then opt to use one of these alternative query suggestions. Figure 9.6 shows an example of query suggestion options being presented in the Yahoo! web search engine. The central question in this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis, using some form of thesaurus. For each term t in a query, the query can be automatically expanded with synonyms and related words of t from the thesaurus. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms.
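As a concrete illustration of thesaurus-based expansion with down-weighted added terms, here is a minimal sketch; the toy thesaurus contents and the 0.5 weight for added terms are illustrative assumptions, not values from the text.

```python
# Hypothetical toy thesaurus: term -> synonyms/related words.
THESAURUS = {
    "cancer": ["neoplasms", "tumor"],
    "itch": ["pruritus"],
}

def expand_query(terms, added_weight=0.5):
    """Expand each query term with its thesaurus entries.

    Original terms keep weight 1.0; added terms get a smaller weight,
    so they broaden the match without dominating the ranking.
    """
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for syn in THESAURUS.get(t, []):
            # Never overwrite an original query term with a lower weight.
            weighted.setdefault(syn, added_weight)
    return weighted

print(expand_query(["skin", "itch"]))
# {'skin': 1.0, 'itch': 1.0, 'pruritus': 0.5}
```

Down-weighting the added terms keeps a loosely related thesaurus entry from dominating the ranking while still broadening the match.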
Methods for building a thesaurus for query expansion include:
•Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept. The subject headings of traditional library subject indexes, such as the Library of Congress Subject Headings or the Dewey Decimal system, are examples of a controlled vocabulary. Use of a controlled vocabulary is quite common for well-resourced domains. A well-known example is the Unified Medical Language System (UMLS) used with MedLine for querying the biomedical research literature. For example, in Figure 9.7, neoplasms was added to a search for cancer.
•User query: cancer
•PubMed query: (“neoplasms”[TIAB] NOT Medline[SB]) OR “neoplasms”[MeSH Terms] OR cancer[Text Word]
•User query: skin itch
•PubMed query: (“skin”[MeSH Terms] OR (“integumentary system”[TIAB] NOT Medline[SB]) OR “integumentary system”[MeSH Terms] OR skin[Text Word]) AND ((“pruritus”[TIAB] NOT Medline[SB]) OR “pruritus”[MeSH Terms] OR itch[Text Word])
Figure 9.7 Examples of query expansion via the PubMed thesaurus. When a user issues a query on the PubMed interface to Medline at http://www.ncbi.nlm.nih.gov/entrez/, their query is mapped on to the Medline vocabulary as shown.
This Medline query expansion also contrasts with the Yahoo! example. The Yahoo! interface is a case of interactive query expansion, whereas PubMed does automatic query expansion. Unless the user chooses to examine the submitted query, they may not even realize that query expansion has occurred.
•A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term. The UMLS metathesaurus is one example of such a thesaurus. Statistics Canada maintains a thesaurus of preferred terms, synonyms, broader terms, and narrower terms for matters on which the government collects statistics, such as goods and services. The thesaurus is bilingual, in English and French.
•An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to automatically induce a thesaurus; see Section 9.2.3.
•Query reformulations based on query log mining. Here, we exploit the manual query reformulations of other users to make suggestions to a new user. This requires a huge query volume, and is thus particularly appropriate to web search.
Thesaurus-based query expansion has the advantage of not requiring any user input. Use of query expansion generally increases recall and is widely used in many science and engineering fields. As well as such global analysis techniques, it is also possible to do query expansion by local analysis, for instance, by analyzing the documents in the result set. User input is now usually required, but a distinction remains as to whether the user is giving feedback on documents or on query terms.
Word           Nearest neighbors
absolutely     absurd, whatsoever, totally, exactly, nothing
bottomed       dip, copper, drops, topped, slide, trimmed
captivating    shimmer, stunningly, superbly, plucky, witty
doghouse       dog, porch, crawling, beside, downstairs
makeup         repellent, lotion, glossy, sunscreen, skin, gel
mediating      reconciliation, negotiate, case, conciliation
keeping        hoping, bring, wiping, could, some, would
lithographs    drawings, Picasso, Dali, sculptures, Gauguin
pathogens      toxins, bacteria, organisms, bacterial, parasite
senses         grasp, psyche, truly, clumsy, naive, innate
Figure 9.8 An example of an automatically generated thesaurus. This example is based on the work in Schütze (1998), which employs latent semantic indexing (see Chapter 18).
9.2.3 Automatic thesaurus generation
As an alternative to the cost of a manual thesaurus, we could attempt to generate a thesaurus automatically by analyzing a collection of documents. There are two main approaches. One is simply to exploit word co-occurrence: we say that words co-occurring in a document or paragraph are likely to be in some sense similar or related in meaning, and simply count text statistics to find the most similar words. The other approach is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies. For example, we say that entities that are grown, cooked, eaten, and digested are more likely to be food items. Simply using word co-occurrence is more robust (it cannot be misled by parser errors), but using grammatical relations is more accurate.
The simplest way to compute a co-occurrence thesaurus is based on term-term similarities. We begin with a term-document matrix $A$, where each cell $A_{t,d}$ is a weighted count $w_{t,d}$ for term $t$ and document $d$, with the weighting chosen so that the rows of $A$ are length-normalized. If we then calculate $C = AA^{T}$, then $C_{u,v}$ is a similarity score between terms $u$ and $v$, with a larger number being better. Figure 9.8 shows an example of a thesaurus derived in basically this manner, but with an extra step of dimensionality reduction via Latent Semantic Indexing, which we discuss in Chapter 18. While some of the thesaurus terms are good or at least suggestive, others are marginal or bad. The quality of the associations is typically a problem. Term ambiguity easily introduces irrelevant statistically correlated terms.
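The following is a minimal sketch of this computation under the stated setup; the small dense matrix and its counts are invented for illustration, a real collection would need sparse matrices, and the LSI dimensionality-reduction step used for Figure 9.8 is omitted.

```python
import numpy as np

# Toy weighted term-document matrix A (terms x documents). Rows are
# length-normalized so that C = A A^T holds cosine similarities
# between terms.
terms = ["apple", "fruit", "computer"]
A = np.array([[2.0, 1.0, 3.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
A /= np.linalg.norm(A, axis=1, keepdims=True)  # length-normalize rows

C = A @ A.T  # C[u, v] is the similarity score between terms u and v

# Report the most similar other term for each term:
for u, t in enumerate(terms):
    v = max((v for v in range(len(terms)) if v != u),
            key=lambda v: C[u, v])
    print(f"{t} -> {terms[v]} ({C[u, v]:.2f})")
```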