Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
An_Introduction_to_Information_Retrieval.pdf
Скачиваний:
419
Добавлен:
26.03.2016
Размер:
6.9 Mб
Скачать

10.2 Challenges in XML retrieval

201

We can represent queries as trees in the same way. This is a query-by- example approach to query language design because users pose queries by creating objects that satisfy the same formal description as documents. In Figure 10.4, q1 is a search for books whose titles score highly for the keywords Julius Caesar. q2 is a search for books whose author elements score highly for Julius Caesar and whose title elements score highly for Gallic war.3

10.2Challenges in XML retrieval

STRUCTURED DOCUMENT RETRIEVAL PRINCIPLE

INDEXING UNIT

In this section, we discuss a number of challenges that make structured retrieval more difficult than unstructured retrieval. Recall from page 195 the basic setting we assume in structured retrieval: the collection consists of structured documents and queries are either structured (as in Figure 10.3) or unstructured (e.g., summer holidays).

The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare’s plays for Macbeth's castle, should we return the scene, the act or the entire play in Figure 10.2? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit.

One criterion for selecting the most appropriate part of a document is the structured document retrieval principle:

Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.

This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query title#"Macbeth" applied to Figure 10.2. The title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.

Parallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (page 20), we discussed the need for a document unit or indexing unit in indexing and retrieval. In unstructured retrieval, it is usually clear what the right document

3. To represent the semantics of NEXI queries fully we would also need to designate one node in the tree as a “target node”, for example, the section in the tree in Figure 10.3. Without the designation of a target node, the tree in Figure 10.3 is not a search for sections embedded in articles (as specified by NEXI), but a search for articles that contain sections.

Online edition (c) 2009 Cambridge UP

202

10 XML retrieval

Figure 10.5 Partitioning an XML document into non-overlapping indexing units.

unit is: files on your desktop, email messages, web pages on the web etc. In structured retrieval, there are a number of different approaches to defining the indexing unit.

One approach is to group nodes into non-overlapping pseudodocuments as shown in Figure 10.5. In the example, books, chapters and sections have been designated to be indexing units, but without overlap. For example, the leftmost dashed indexing unit contains only those parts of the tree dominated by book that are not already part of other indexing units. The disadvantage of this approach is that pseudodocuments may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in Figure 10.5 merges three disparate elements, the class, author and title elements.

We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books or the play element for Shakespeare’s works. We can then postprocess search results to find for each book or play the subelement that is the best hit. For example, the query Macbeth's castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii as the best-matching subelement. Unfortunately, this twostage retrieval process fails to return the best subelement for many queries because the relevance of a whole book is often not a good predictor of the relevance of small subelements within it.

Instead of retrieving large units and identifying subelements (top down), we can also search all leaves, select the most relevant ones and then extend them to larger units in postprocessing (bottom up). For the query Macbeth's castle in Figure 10.1, we would retrieve the title Macbeth’s castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act or the play. This approach has a similar problem as the last one: The relevance of a leaf element is often not a good predictor of the relevance

Online edition (c) 2009 Cambridge UP

10.2 Challenges in XML retrieval

203

of elements it is contained in.

The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number which cannot be interpreted without context. Also, indexing all elements means that search results will be highly redundant. For the query Macbeth's castle and the document in Figure 10.1, we would return all of the play, act, scene and title elements on the path between the root node and Macbeth's castle. The leaf node would then occur four times in the result set, once directly and three times as part of other elements. We call elements that are contained

NESTED ELEMENTS within each other nested. Returning redundant nested elements in a list of returned hits is not very user-friendly.

Because of the redundancy caused by nested elements it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:

discard all small elements

discard all element types that users do not look at (this requires a working XML retrieval system that logs this information)

discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available)

only keep element types that a system designer or librarian has deemed to be useful search results

In most of these approaches, result sets will still contain nested elements. Thus, we may want to remove some elements in a postprocessing step to reduce redundancy. Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.

If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type. But as we discussed in the introduction, users often don’t know what the name of an element in the collection is (Is the Vatican a country or a city?) or they may not know how to compose structured queries at all.

Online edition (c) 2009 Cambridge UP

204

 

 

10

XML retrieval

 

 

 

book

 

book

 

book

author

 

author

book

creator

firstname

lastname

Gates

Gates

Gates

Bill

Gates

q3

q4

d2

d3

 

Figure 10.6 Schema heterogeneity: intervening nodes and mismatched names.

A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics as defined in Section 6.2.1 (page 117). For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example.

One solution is to compute idf for XML-context/term pairs, e.g., to compute different idf weights for author#"Gates" and section#"Gates". Unfortunately, this scheme will run into sparse data problems – that is, many XML-context pairs occur too rarely to reliably estimate df (see Section 13.2, page 260, for a discussion of sparseness). A compromise is only to consider the parent node x of the term and not the rest of the path from the root to x to distinguish contexts. There are still conflations of contexts that are harmful in this scheme. For instance, we do not distinguish names of authors and names of corporations if both have the parent node name. But most important distinctions, like the example contrast author#"Gates" vs. section#"Gates", will be respected.

In many cases, several different XML schemas occur in a collection since the XML documents in an IR application often come from more than one

SCHEMA source. This phenomenon is called schema heterogeneity or schema diversity

HETEROGENEITY and presents yet another challenge. As illustrated in Figure 10.6 comparable elements may have different names: creator in d2 vs. author in d3. In other cases, the structural organization of the schemas may be different: Author

Online edition (c) 2009 Cambridge UP

10.2 Challenges in XML retrieval

205

names are direct descendants of the node author in q3, but there are the intervening nodes firstname and lastname in d3. If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3 although both documents are relevant. Some form of approximate matching of element names in combination with semi-automatic matching of different document structures can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods.

Schema heterogeneity is one reason for query-document mismatches like q3/d2 and q3/d3. Another reason is that users often are not familiar with the element names and the structure of the schemas of collections they search as mentioned. This poses a challenge for interface design in XML retrieval. Ideally, the user interface should expose the tree structure of the collection and allow users to specify the elements they are querying. If we take this approach, then designing the query interface in structured retrieval is more complex than a search box for keyword queries in unstructured retrieval.

We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes EXTENDED QUERY allowed. We call such queries extended queries. The tree in Figure 10.3 and q4 in Figure 10.6 are examples of extended queries. We show edges that are interpreted as descendant relationships as dashed arrows. In q4, a dashed arrow connects book and Gates. As a pseudo-XPath notation for q4, we adopt book//#"Gates": a book that somewhere in its structure contains the word Gates where the path from the book node to Gates can be arbitrarily long.

The pseudo-XPath notation for the extended query that in addition specifies

that Gates occurs in a section of the book is book//section//#"Gates". It is convenient for users to be able to issue such extended queries without having to specify the exact structural configuration in which a query term should occur – either because they do not care about the exact configuration or because they do not know enough about the schema of the collection to be able to specify it.

In Figure 10.7, the user is looking for a chapter entitled FFT (q5). Suppose there is no such chapter in the collection, but that there are references to books on FFT (d4). A reference to a book on FFT is not exactly what the user is looking for, but it is better than returning nothing. Extended queries do not help here. The extended query q6 also returns nothing. This is a case where we may want to interpret the structural constraints specified in the query as hints as opposed to as strict conditions. As we will discuss in Section 10.4, users prefer a relaxed interpretation of structural constraints: Elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results.

Online edition (c) 2009 Cambridge UP

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]