
Rivero L.Encyclopedia of database technologies and applications.2006


tors using the complex and error-prone EXISTS function (Dadashzadeh, 1992, 1993).

CONCLUSION

One of the most important promises of the relational data model has been that it frees the decision maker, the manager, from the necessity of resorting to an intermediary, the programmer, in retrieving information from the organization’s database in response to unanticipated needs. That promise is founded on the availability of very high-level relational query languages such as SQL, QBE, and relational algebra. Unfortunately, the current implementations of these query languages fail to support users adequately in formulating complex queries involving set comparison that tend to arise in ad hoc decision-making situations. In this article, we have examined these shortcomings and proposed solutions to overcome them with minimal effort.

C. J. Date (2000) has remarked, “A hundred years from now, I’m quite sure, database systems will still be based on Codd’s relational foundation.” If that proves even partly true, there is justifiable hope that the continuing standardization efforts of SQL will finally address the undue complexity of formulating set comparison queries by reintroducing the relational calculus features present in the original definition of SEQUEL.

REFERENCES

Blanning, R. W. (1993). Relational division in information management. Decision Support Systems, 9(4), 313-324.

Celko, J. (1997). Joe Celko’s SQL puzzles & answers. San Francisco: Morgan Kaufmann.

Chamberlin, D. D., Astrahan, M.M., Eswaran, K.P., Griffiths, P.P., Lorie, R.A., Mehl, J.W., Reisner, P., & Wade, B.W. (1976). SEQUEL2: A unified approach to data definition, manipulation, and control. IBM Journal of Research & Development, 20(6), 560-575.

Codd, E. F. (1971). A data base sublanguage founded on the relational calculus. In Proceedings of the 1971 ACM SIGFIDET Workshop on Data Description, Access and Control (pp. 35-68). New York: Association for Computing Machinery.

Dadashzadeh, M. (1989). An improved division operator for relational algebra. Information Systems, 14(5), 431-437.

Set Comparison in Relational Query Languages

Dadashzadeh, M. (1992). A proposed change to the SQL standard. In P. C. Tinnirello (Ed.), Handbook of systems management: Development and support (pp. 465-472). Boston: Auerbach.

Dadashzadeh, M. (1993). A human factor study of set comparison constructs in SQL. TIMS/ORSA Joint National Meeting.

Dadashzadeh, M. (2001). Set comparison queries in SQL. In S. Becker (Ed.), Developing quality complex database systems: Practices, techniques, and technologies (pp. 303-316). Hershey, PA: Idea Group.

Dadashzadeh, M. (2002). Converting Paradox’s QBE set queries into Access 2000 SQL. Review of Business Information Systems, 6(2), 43-54.

Dadashzadeh, M. (2003). A simpler approach to set comparison queries in SQL. Journal of Information Systems Education, 14(4), 345-348.

Date, C. J. (1992). Why quantifier order is important. In C. J. Date & H. Darwen (Eds.), Relational database writings 1989-1991 (pp. 107-114). Reading, MA: Addison-Wesley.

Date, C. J. (2000). The database relational model: A retrospective review and analysis. Reading, MA: Addison-Wesley.

Matos, V. M., & Grasser, R. (2002). A simpler (and better) SQL approach to relational division. Journal of Information Systems Education, 13(2), 85-87.

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). New York: McGraw-Hill.

Rao, S. G., Badia, A., & Van Gucht, D. (1996). Providing better support for a class of decision support queries. In Proceedings of the 1996 SIGMOD International Conference on Management of Data (pp. 217-227). New York: Association for Computing Machinery.

Zloof, M. M. (1975). Query by Example. In Proceedings of the 1975 National Computer Conference (pp. 431-438). Montvale, NJ: American Federation of Information Processing Societies.

KEY TERMS

Generalized Division: A generalized version of the binary relational algebra division operation where the tuples of the left table are grouped by an attribute and only those groups whose set of values for another specified attribute satisfies a desired set comparison (e.g., equality) with a set of similar values from the right table are passed to the output table. The generalized division operator can be expressed in terms of the five principal relational algebra operations.

QBE: Query by Example. A graphical query and update language for relational databases introduced by IBM and popularized by Paradox RDBMS.

RDBMS: Relational database management system. A software application for managing databases utilizing the relational data model.

Relational Algebra: A collection of unary and binary operators that take one or two tables as input and produce a table as output. The relational algebra operators of Cartesian Product, Selection, Projection, set Difference, and Union are considered to be necessary and sufficient for extracting any desired subset of data from a relational database.

Relational Calculus: A notation founded on predicate calculus dealing with descriptive expressions that are equivalent to the operations of relational algebra. Two forms of the relational calculus exist: the tuple calculus and the domain calculus.

Relational Completeness: A relational query language is said to be relationally complete if it can express each of the five principal operations of relational algebra.

Relational Data Model: A logical way of organizing a database as a collection of interrelated tables. The logical relationship between tables representing related data is accomplished through shared columns, where the primary key column of one table appears as a foreign key column in another.

Set Comparison Query: A database query in which the desired records must be found based on a comparison of sets of values using set comparison operations such as inclusion, containment, and equality.

SQL: Structured Query Language. The standard language for definition and manipulation of relational databases.
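The Generalized Division entry above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `generalized_division`, the `supplies` table, and the `required` list are hypothetical, and Python's built-in set comparisons stand in for the set comparison chosen by the user.

```python
from collections import defaultdict
import operator


def generalized_division(left, group_attr, value_attr, right_values,
                         compare=operator.eq):
    """Group `left` rows by `group_attr` and keep the groups whose set of
    `value_attr` values satisfies `compare` against the right-hand set."""
    groups = defaultdict(set)
    for row in left:
        groups[row[group_attr]].add(row[value_attr])
    target = set(right_values)
    return sorted(g for g, values in groups.items() if compare(values, target))


# Hypothetical left table (supplier, part) and right table (required parts).
supplies = [
    {"supplier": "S1", "part": "P1"},
    {"supplier": "S1", "part": "P2"},
    {"supplier": "S2", "part": "P1"},
    {"supplier": "S3", "part": "P1"},
    {"supplier": "S3", "part": "P2"},
    {"supplier": "S3", "part": "P3"},
]
required = ["P1", "P2"]

# Set equality: suppliers that supply exactly the required parts.
equal = generalized_division(supplies, "supplier", "part", required)
# Containment (classic division): suppliers that supply at least the required parts.
contains = generalized_division(supplies, "supplier", "part", required, operator.ge)
# Inclusion: suppliers that supply only required parts.
within = generalized_division(supplies, "supplier", "part", required, operator.le)
```

Passing the comparison as a parameter mirrors the definition: the grouping step is fixed, and only the set comparison varies.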


Set Valued Attributes

Karthikeyan Ramasamy

Juniper Networks, USA

Prasad M. Deshpande

IBM Almaden Research Center, USA

INTRODUCTION

About three decades ago, when Codd (1970) invented the relational database model, it took the database world by storm. The enterprises that adopted it early won a large competitive edge. The past two decades have witnessed tremendous growth of relational database systems, and today the relational model is by far the dominant data model and is the foundation for leading DBMS products, including IBM DB2, Informix, Oracle, Sybase, and Microsoft SQL Server. Relational databases have become a multibillion-dollar industry.

However, as these databases grew, so did the complexity of the data being stored in them, driven by the emergence of a new class of applications. It quickly became apparent that relational databases suffer from various deficiencies and limitations. Relational database systems support a small, fixed collection of data types (e.g., integers, dates, strings) that has proven adequate for traditional applications. The new class of applications needs to handle more complex data, including the hierarchical data of computer-aided design and modeling (CAD/CAM), multimedia data, and documents. Support for this kind of data requires the database to incorporate abstract data types and type constructors based on object-oriented concepts. This led to the development of object database systems along two distinct paths:

Object-Oriented Database Systems: These systems were developed with the goal of adding persistence to object-oriented languages that support complex types.

Object-Relational Database Systems: These systems are an attempt to extend the relational database systems with the functionality needed to support complex types.

Object relational systems are characterized by:

Abstract Data Types: They represent the ability to add a new data type into the system that is seamlessly treated as equivalent to built-in types. The definition of a new data type describes the data fields and the methods that operate on these fields.

Type Constructors: They are used to construct new types by composing base or abstract data types. The major classes of type constructors are composites (records), collections, and references. The class of collections can be further divided into sets, bags, arrays, and lists.

Inheritance: It allows the creation of new data types by derivation from existing types.

This article focuses on the set type constructor, popularly referred to as set-valued attributes. A set-valued attribute represents a collection of elements of the same type with a uniqueness constraint. To illustrate the usefulness of set-valued attributes, consider the following schema, in which a product item and the colors it is available in need to be described. In a standard relational schema, we need two tables as follows:

PRODUCT(Id: integer, Name: string, Manufacturer: string)

COLORS(Id: integer, ColorName: string)

This schema is complex, and queries against it are correspondingly more complex, since relating a product to its colors requires a join. An instance of these tables is shown in Figure 1.

In an object-relational database system that supports set-valued attributes, we can describe the same by a single table:

PRODUCT(Id: integer, Name: string, Manufacturer: string, Colors: set (string))

where the construct set indicates that the Colors attribute is a set of strings. This schema is more intuitive and concise from a data modeling and querying perspective. An instance of this table is shown in Figure 2.
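The difference between the two schemas can be sketched in a few lines. This is an illustration only: the color assignments in `colors` are assumed for the example, and the function names are hypothetical. The query "names of products available in a given color" needs a join in the two-table schema but only a membership test in the set-valued schema.

```python
# Two-table relational schema: PRODUCT and COLORS linked by Id.
product = [
    (1001, "VacuumCleaner", "Sears"),
    (1002, "Food Processor", "Kitchen Aid"),
    (1003, "Juice Blender", "Oster"),
]
colors = [  # assumed color rows, for illustration only
    (1001, "Black"), (1001, "Red"),
    (1002, "Black"), (1002, "White"),
    (1003, "Blue"), (1003, "Grey"),
]


def names_with_color_join(color):
    """Join PRODUCT and COLORS on Id, then filter on ColorName."""
    return [name
            for (pid, name, _maker) in product
            for (cid, cname) in colors
            if pid == cid and cname == color]


# Single-table schema with a set-valued Colors attribute.
product_sv = [
    (pid, name, maker, {c for (cid, c) in colors if cid == pid})
    for (pid, name, maker) in product
]


def names_with_color_set(color):
    """No join: test membership in the set-valued attribute."""
    return [name for (_pid, name, _maker, cols) in product_sv if color in cols]
```

Both functions return the same product names; the second simply reads closer to the way the question is asked.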

This article examines set-valued attributes, their background, and how they are supported in object-relational database systems.

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.


Figure 1. An instance of relational schema: PRODUCT and COLORS

PRODUCT

Id    Name            Manufacturer
1001  VacuumCleaner   Sears
1002  Food Processor  Kitchen Aid
1003  Juice Blender   Oster

COLORS

Id    ColorName
1001  Black
1001  Red
1002  White
1003  Black
1004  Blue
1004  Grey

BACKGROUND

Research in set-valued attributes has been conducted from different perspectives: data modeling, nested relational databases, object-oriented databases, and object-relational databases.

Data Modeling

Set-valued attributes have long been studied in the context of data modeling. Some of the earlier semantic data models (SDM) incorporate sets and collections of entities (Hammer & McLeod, 1981). A further discussion on data modeling using sets can be found in Brodie (1981, 1984). These studies describe how sets can capture the semantic notions of the real world with ease.

Nested Relational Databases

The nested relational model relaxes the assumption that relational attributes are atomic. In order to extend the power of the relational model, Zaniolo (1983) proposes a query language called GEM. GEM adds sets as a data type. Here the sets are viewed as a logical collection, and the set operations of equality and containment are defined as a part of the query language. Zaniolo also notes that set operations are very expensive to support in standard relational systems. Extension of relational models with set-valued attributes with reference to statistical databases (SDB) has been studied in detail in Ozsoyoglu, Ozsoyoglu, and Matos (1987). The DASDBS project at the Technical University of Darmstadt (Schek & Scholl, 1989) supported the nested relational algebra with nest and unnest operations. The unnest operation breaks up set instances into individual tuples, duplicating the other attributes for each set element. The nest operation combines multiple tuples on a given attribute to form a set when the values of the remaining attributes match.
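The two operations of the nested relational algebra can be sketched on rows held as Python dictionaries. This is a minimal illustration, not the DASDBS implementation; the function names and the sample rows are assumptions. Here `unnest` flattens a set-valued attribute into one tuple per element, and `nest` groups tuples that agree on all remaining attributes.

```python
def unnest(rows, attr):
    """Emit one row per element of the set-valued attribute `attr`,
    duplicating the other attributes."""
    return [{**row, attr: element}
            for row in rows
            for element in sorted(row[attr])]


def nest(rows, attr):
    """Combine rows that agree on every attribute except `attr`,
    collecting the `attr` values into a set."""
    groups = {}
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != attr))
        groups.setdefault(key, set()).add(row[attr])
    return [{**dict(key), attr: values} for key, values in groups.items()]


# Assumed sample rows.
nested = [
    {"Id": 1001, "Name": "VacuumCleaner", "Colors": {"Black", "Red"}},
    {"Id": 1002, "Name": "Food Processor", "Colors": {"White"}},
]
flat = unnest(nested, "Colors")
```

In this sketch a row whose set is empty disappears under `unnest`, so nesting the result does not restore it; the two operations are inverses only for rows with non-empty sets.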

Object-Oriented Databases

From the object-oriented database system perspective, there were two different attempts: one adds persistence to object-oriented languages, and the other combines the features of a database system with those of an object-oriented language. Such systems supported many collection types, including sets. Three early projects laid the foundation in this area: Gemstone (Copeland & Maier, 1984) was based on Smalltalk, Vbase (Andrews & Harris, 1987) was based on a CLU-like language, and Orion (Bannerjee et al., 1987) was based on the Common LISP Object System (CLOS). New SQL-like languages were designed to support powerful querying for these systems. These query languages allowed nested queries as well as universal and existential quantification.

Object-Relational Databases

Object-relational systems typically start from a relational model and its SQL language and build from there. Early systems supported row types and collection types like sets. The best-known research implementations of object-relational database systems are POSTGRES (Stonebraker, 1997; Stonebraker & Kemnitz, 1991) from the University of California, Berkeley and Paradise (Patel et al., 1997) from the University of Wisconsin. POSTGRES supported the dynamic addition of new types, complex objects including sets, inheritance, and rules. Paradise departs from POSTGRES in that it is a parallel object-relational database system. Its main contribution is to explore the parallelization of object-relational features in a shared-nothing environment.

SET-VALUED ATTRIBUTES

In order to incorporate full support for set-valued attributes into object-relational database systems, the following issues must be addressed:


Extension of data definition language (DDL) and data manipulation language (DML) to accommodate sets

Storage of tuples/records containing set-valued attributes

Indexing of set-valued attributes

Support for efficient set operations

Extension of DDL and DML

There are many proposals on the extension of data definition language and data manipulation language to support abstract data types and set-valued attributes. GEM (Zaniolo, 1983) extends the relational language QUEL to provide support for the definition and querying of set-valued attributes. It supports the set membership operator IN and various other operators, including = (set equality), != (set does not equal), > (superset), >= (superset equal), < (subset), and <= (subset equal). As an illustrative example using our object-relational schema of PRODUCT, a query that retrieves the product names having a color of either black or blue can be formulated as:

RANGE of P is PRODUCT

RETRIEVE (P.Name)

WHERE “Black” IN P.Colors OR “Blue” IN P.Colors    (Q1)

O2, an object-oriented database system, provides a data definition language that allows defining classes (whose instances encapsulate behavior) and types (whose instances are values). Each of the attributes can be either atomic or composite (using set, list, and tuple constructors). It defines an SQL-like syntax for filtering on sets and other composite type instances (Bancilhon, 1992). In the query language of O2, query (Q1) can be written as:

SELECT P.Name

FROM P IN PRODUCT, C IN P.Colors

WHERE C = “Black” OR C = “Blue”    (Q2)

Here the operator IN associates a variable referring to individual tuples of the given table. Other proposals include extension of QUEL for POSTGRES (Stonebraker, 1987).
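The set operators described in this section have close counterparts in Python's built-in set type, which makes it easy to experiment with the semantics of the two queries. The sketch below is only an analogy: the `products` instance is assumed, and the Python comprehensions are not GEM or O2 syntax. The first comprehension tests membership once per tuple, as in (Q1); the second iterates over (tuple, element) pairs, as in (Q2).

```python
# Assumed instance of PRODUCT with a set-valued Colors attribute.
products = [
    {"Name": "VacuumCleaner", "Colors": {"Black", "Red"}},
    {"Name": "Food Processor", "Colors": {"Black", "White"}},
    {"Name": "Juice Blender", "Colors": {"Blue", "Grey"}},
]

# (Q1)-style: one membership predicate per tuple.
q1 = [p["Name"] for p in products
      if "Black" in p["Colors"] or "Blue" in p["Colors"]]

# (Q2)-style: range over each element C of P.Colors and filter on C.
q2 = [p["Name"] for p in products for c in p["Colors"]
      if c == "Black" or c == "Blue"]

# Python analogues of the set comparison operators.
a, b = {"Black", "Red"}, {"Black"}
is_equal = a == b          # set equality
is_not_equal = a != b      # set inequality
is_superset = a >= b       # superset or equal
is_proper_superset = a > b
is_subset = b <= a         # subset or equal
is_proper_subset = b < a
```

In this Python rendering, the (Q2)-style formulation would list a product once for each matching color, so a product available in both black and blue would appear twice unless duplicates are removed.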

Storing Sets

To support efficient querying, set instances need to be stored in specialized storage representations. Storage representations for sets can be classified based on two orthogonal characteristics: nesting and location. Nesting describes whether the elements in the set are grouped together or scattered. Location specifies whether the set elements are stored along with the rest of the attributes in the relation or vertically decomposed and stored separately. Hence, the four feasible representations are nested internal, nested external, unnested internal, and unnested external. Unnested internal is not very useful since it replicates the rest of the attributes for each set element, leading to update anomalies. These storage representations have been proposed in Stonebraker (1996) and studied in detail from a performance perspective in Ramasamy (2001). Nested internal representation is referred to as a normalized storage model in Hafez and Ozsoyoglu (1988), and a variant of nested external representation is presented as a decomposed storage model (DSM) in Copeland and Khoshafian (1985). Comparison studies in Ramasamy (2001) show that the choice of representation depends heavily on the nature of the queries posed by the user. Unnested representations suffer from the high cost of fetching tuples from the buffer pool and of predicate processing, since each set element is stored as a separate tuple. Nested representations perform better with queries containing set predicates, since predicate evaluation takes advantage of the clustering of set elements within the tuple.
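The four representations can be pictured with plain Python structures. This is a logical sketch only, with assumed sample rows and hypothetical names; it says nothing about page layout or buffer management, which is where the performance differences reported in the literature arise.

```python
# Assumed logical rows: (Id, Name, Colors).
rows = [
    (1001, "VacuumCleaner", {"Black", "Red"}),
    (1002, "Food Processor", {"Black", "White"}),
]

# Nested internal: elements grouped and stored inside the tuple.
nested_internal = [(pid, name, sorted(cols)) for pid, name, cols in rows]

# Nested external: elements grouped, but stored apart from the base tuple.
base = [(pid, name) for pid, name, _ in rows]
nested_external = {pid: sorted(cols) for pid, _, cols in rows}

# Unnested external: one separate tuple per set element.
unnested_external = [(pid, c) for pid, _, cols in rows for c in sorted(cols)]

# Unnested internal: base attributes replicated for every set element.
unnested_internal = [(pid, name, c)
                     for pid, name, cols in rows for c in sorted(cols)]


def ids_containing_nested(color):
    """One tuple fetched per product; predicate evaluated on the grouped set."""
    return [pid for pid, _, cols in nested_internal if color in cols]


def ids_containing_unnested(color):
    """One tuple fetched per set element."""
    return sorted({pid for pid, c in unnested_external if c == color})
```

The replication in `unnested_internal` is visible directly: every product name appears once per color, which is the source of the update anomalies mentioned above.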

Indexing Sets

Lookup queries for sets typically require an indexing structure on the attribute to reduce the response time. The indexing structures for sets can be either nested or unnested. Nested indices treat each set as a single entity, whereas unnested indices treat each set element as an indexable entity. For example, consider the object-relational instance of PRODUCT. Let us assume that the record identifiers for each of the rows are rid1, rid2, and rid3, where rid1 refers to the first row and so on. An unnested index on the attribute PRODUCT.Colors will consist of the entries: (“Black,” <rid1, rid2>), (“Blue,” <rid3>), (“White,” <rid2>), (“Red,” <rid1>), and (“Grey,” <rid3>). The unnested index can be any of the well-known indexing structures like B+-tree, linear hashing, extensible hashing, etc.
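The unnested index entries listed above can be reproduced with a short sketch. The color sets per record identifier are inferred from those entries; a Python dictionary stands in for the B+-tree or hash structure, and the function names are hypothetical.

```python
from collections import defaultdict

# Colors per record identifier, inferred from the index entries in the text.
product_colors = {
    "rid1": {"Black", "Red"},
    "rid2": {"Black", "White"},
    "rid3": {"Blue", "Grey"},
}


def build_unnested_index(table):
    """Map every set element to the list of record ids whose set contains it."""
    index = defaultdict(list)
    for rid in sorted(table):
        for element in table[rid]:
            index[element].append(rid)
    return dict(index)


def lookup_containing(index, query):
    """Record ids whose set contains every element of a non-empty query set,
    found by intersecting the entries of the individual elements."""
    if not query:
        raise ValueError("query set must be non-empty in this sketch")
    postings = [set(index.get(element, ())) for element in query]
    return sorted(set.intersection(*postings))


index = build_unnested_index(product_colors)
```

A single-element lookup reads one entry, while a multi-element containment lookup intersects several entries.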

Ishikawa, Kitagawa, and Ohbo (1993) examined the use of signature files for subset predicates. The signature file for a set-valued attribute is created by scanning the entire relation and computing the signature for each set instance. The lookup algorithm converts the set being searched into a signature, matches it against all the signatures, and identifies the potential candidates. The problem with signature files is that there can be “false drops,” which require the candidate sets to be examined to determine the actual result. Signature files are not an indexing method in the strict sense, as the entire signature file must be scanned when evaluating a lookup. RD-trees (Hellerstein & Pfeffer, 1994) are a variant of the R-tree (Guttman, 1984) that describe the transitive containment relation and, unlike signature files, allow narrowing the search to the appropriate data tuples. The RD-tree uses signatures as the equivalent of bounding boxes to guide the search from one level of the tree to the next lower level. As in the R-tree, multiple branches of the tree must be examined to filter the candidates. Inverted files (Brown, Callan, & Croft, 1994) are an example of an unnested index where each set instance is broken into individual elements and an index is created over them that maps each set element to the tuples in which it occurs. They are a popular technique for single-element lookups.
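The signature-based lookup described above can be sketched as a filter followed by a verification step. The sketch makes several assumptions: a 16-bit signature, one bit per element chosen with CRC32, and hypothetical function names; it is not the scheme evaluated by Ishikawa, Kitagawa, and Ohbo. The lookup finds the sets that contain every element of the query set.

```python
import zlib

SIG_BITS = 16  # assumed signature width


def element_bit(element):
    """Bit position assigned to one element (CRC32 is an arbitrary choice)."""
    return 1 << (zlib.crc32(element.encode("utf-8")) % SIG_BITS)


def signature(elements):
    """Superimpose the bits of all elements into one bit vector."""
    sig = 0
    for element in elements:
        sig |= element_bit(element)
    return sig


def build_signature_file(table):
    """One (rid, signature) entry per set instance, built by a full scan."""
    return [(rid, signature(values)) for rid, values in table.items()]


def lookup_containing(table, sig_file, query):
    """Return (candidates, results) for sets containing all elements of `query`."""
    q = signature(query)
    # Filter: every bit of the query signature must be set in the candidate.
    candidates = [rid for rid, sig in sig_file if (sig & q) == q]
    # False drop resolution: examine the actual sets.
    results = [rid for rid in candidates if query <= table[rid]]
    return candidates, results


# Assumed instance.
table = {
    "rid1": {"Black", "Red"},
    "rid2": {"Black", "White"},
    "rid3": {"Blue", "Grey"},
}
sig_file = build_signature_file(table)
candidates, results = lookup_containing(table, sig_file, {"Black"})
```

The filter can never miss a qualifying set, because a set that contains the query elements has all of their bits set; it can only admit extra candidates, which the verification step removes.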

Operations on Sets

Operations on set-valued attributes can be broadly classified as select operations (involving only a single relation) and join operations (involving at least two relations). Select operations retrieve tuples that satisfy a set predicate, as illustrated by queries (Q1) and (Q2) in the section on Extension of DDL and DML. The predicate can be equality, subset, superset, or membership. These operations can be executed either by a sequential scan of the entire relation or, more efficiently, by using an index on the set-valued attribute. Join operations involve set-valued attributes of two different relations. Two tuples, one from each relation, are joinable if their set instances satisfy the join predicate. The join predicate can be equality, containment, or intersection. As an example, we can write the query that retrieves the pairs of products such that the colors available for one product include all the colors available for the other.

SELECT P1.Name, P2.Name

FROM PRODUCT P1, PRODUCT P2

WHERE P1.Colors ⊆ P2.Colors    (Q3)

Query Q3 involves a self set containment join. Another query, which retrieves the pairs of products that are available in one or more common colors, can be written as a set intersection join as follows:

SELECT P1.Name, P2.Name

FROM PRODUCT P1, PRODUCT P2

WHERE P1.Colors ∩ P2.Colors ≠ ∅    (Q4)
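The two join queries can be tried out with a naive nested-loop evaluation. The sketch is an assumption-laden illustration: the `products` instance, including the hypothetical "Toaster" row, is invented for the example, and the predicate of (Q3) is read as subset containment of the first product's colors in the second's.

```python
from itertools import product as cartesian

# Assumed instance; "Toaster" is a hypothetical row added so that a
# containment pair exists.
products = [
    {"Name": "VacuumCleaner", "Colors": {"Black", "Red"}},
    {"Name": "Food Processor", "Colors": {"Black", "White"}},
    {"Name": "Juice Blender", "Colors": {"Blue", "Grey"}},
    {"Name": "Toaster", "Colors": {"Black"}},
]

# (Q3)-style self set containment join: P1.Colors is a subset of P2.Colors.
q3 = [(p1["Name"], p2["Name"])
      for p1, p2 in cartesian(products, repeat=2)
      if p1["Colors"] <= p2["Colors"]]

# (Q4)-style self set intersection join: the two color sets share an element.
q4 = [(p1["Name"], p2["Name"])
      for p1, p2 in cartesian(products, repeat=2)
      if p1["Colors"] & p2["Colors"]]
```

As in the queries as written, every product is also paired with itself; an additional predicate on the product identifiers would be needed to exclude such pairs.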

Set equality joins can be evaluated easily using either sort-merge join or hash join. Sort-merge join requires the elements in each set instance to be sorted so that an ordering on sets can be defined. Set containment and intersection joins are much more complex to evaluate. Set containment joins have been addressed in Mamoulis (2003), Melnik and Molina (2003), and Ramasamy (2001). These algorithms are based on a divide-and-conquer approach where the tuples of the first relation are partitioned based on the hash value of an element of the set, and the tuples of the second relation are replicated for each of the elements in the set. Once the partitioning phase is complete, the corresponding partitions are joined together, and the results of each partition are merged for the final result. The algorithms differ in two aspects: (1) the partitioning phase and (2) whether the complete sets or approximations of the sets using signatures are projected during the intermediate partitioning step.
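The divide-and-conquer idea can be sketched as follows. This is a simplified in-memory illustration under stated assumptions, not the algorithm of any of the cited papers: the function names, the CRC32 hash, the choice of the smallest element as the partitioning element, and the handling of empty sets are all choices made for the example. Each tuple of the first relation goes to exactly one partition, and each tuple of the second relation is replicated to the partition of every one of its elements.

```python
import zlib


def bucket(element, n_partitions):
    """Partition number of one set element (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(element).encode("utf-8")) % n_partitions


def containment_join(R, S, n_partitions=4):
    """Return sorted (r_id, s_id) pairs with r_set a subset of s_set.
    R and S are lists of (id, set) tuples."""
    r_parts = [[] for _ in range(n_partitions)]
    s_parts = [[] for _ in range(n_partitions)]
    empty_r = []
    # Partition R on the hash of one designated element of each set.
    for rid, rset in R:
        if not rset:
            empty_r.append(rid)  # the empty set is contained in every set
        else:
            r_parts[bucket(min(rset), n_partitions)].append((rid, rset))
    # Replicate each S tuple to the partition of every one of its elements.
    for sid, sset in S:
        for p in {bucket(e, n_partitions) for e in sset}:
            s_parts[p].append((sid, sset))
    # Join corresponding partitions and merge the partial results.
    result = [(rid, sid) for rid in empty_r for sid, _ in S]
    for r_part, s_part in zip(r_parts, s_parts):
        result.extend((rid, sid)
                      for rid, rset in r_part
                      for sid, sset in s_part
                      if rset <= sset)
    return sorted(result)
```

The partitioning is safe because, whenever a set from R is contained in a set from S, the designated element of the R set is also an element of the S set, so the S tuple has been replicated into the partition that holds the R tuple. Since each R tuple lives in one partition only, no pair is produced twice.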

FUTURE TRENDS

Set-valued attributes and their operations are growing in significance as new applications emerge. These applications can take advantage of them from a performance perspective or a functionality perspective. To name a few applications, sets can be used to store customer transaction records in the database so that the items a customer has bought are clustered as a logical collection to facilitate association rule mining. Self set containment joins have been identified as useful in the candidate generation step of association rule mining (Rantzau, 2003). Sets have also been suggested as a way to improve performance when storing and processing XML data (Shanmugasundaram et al., 1999). Considerable research effort is still needed on incorporating sets in parallel database systems, on more efficient algorithms for join operations in a parallel environment, and on estimating the result size of queries involving sets.

CONCLUSION

Set-valued attributes are an important addition to object-relational systems. The set is a type constructor that constructs a new type as a collection with a uniqueness constraint. Full support for sets within a database requires extending the DDL to define set types and the DML to pose queries over them. Storing sets in a database system efficiently requires new types of storage organizations. Query evaluation requires fast lookup of set-valued attributes using new types of indices and a new class of algorithms for fast evaluation of complex join operations. We have outlined the major issues concerning set-valued attributes and identified the solutions, trade-offs, and their advantages and disadvantages.

REFERENCES

Andrews, T., & Harris, C. (1987). Combining language and database advances in an object-oriented development environment. Proceedings of the ACM Conference on Object Oriented Programming Systems, Languages, and Applications, October.

Bancilhon, F., Delobel, C., & Kanellakis, P. (1992). Building an object-oriented database system: The story of O2. Morgan Kaufmann.

Bannerjee, J., Chou, H. T., Garza, J., Kim, W., Woelk, D., & Ballou, N. (1987). Data model issues for object-oriented applications. ACM Transactions on Office Information Systems, 5(1).

Brodie, M. L. (1981). Association: Database abstraction. In Information Modeling and Analysis. North Holland.

Brodie, M. L. (1984). On the development of data models. In Conceptual Modeling. Springer-Verlag.

Brown, E. W., Callan, J. P., & Croft, W. B. (1994). Fast incremental indexing for full-text information retrieval. Proceedings of 20th Conference on Very Large Databases, September.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Copeland, G. P., & Khoshafian, S. (1985). A decomposition storage model. Proceedings of the ACM SIGMOD Conference on Management of Data, May.

Copeland, G. P., & Maier, D. (1984). Making Smalltalk a database system. Proceedings of the ACM SIGMOD Conference on Management of Data.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. Proceedings of ACM SIGMOD International Conference on Management of Data, June.

Hafez, A., & Ozsoyoglu, G. (1988). The partial normalized model of nested relations. Proceedings of International Conference on Very Large Databases, September.

Hammer, M., & McLeod, D. (1981). Database description with SDM: A semantic data model. ACM Transactions on Database Systems, 6(3).

Hellerstein, J. M., & Pfeffer, A. (1994). The RD-tree: An index structure for sets (Tech. Rep. No. 1252). Madison: University of Wisconsin–Madison, Computer Sciences Department.

Ishikawa, Y., Kitagawa, H., & Ohbo, N. (1993). Evaluation of signature files as set access facilities in OODBs. Proceedings of ACM SIGMOD International Conference on Management of Data.

Mamoulis, N. (2003). Efficient processing of joins on set-valued attributes. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data.

Melnik, S., & Molina, H. G. (2003). Adaptive algorithms for set containment joins. ACM Transactions on Database Systems, 28(1).

Ozsoyoglu, G., Ozsoyoglu, Z. M., & Matos, V. (1987). Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Transactions on Database Systems, 12(4).

Patel, J., Yu, J., Kabra, N., Tufte, K., Nag, B., Burger, J., et al. (1997). Building a scalable geo-spatial DBMS: Technology, implementation and evaluation. Proceedings of ACM SIGMOD Conference on Management of Data.

Ramasamy, K. (2000). Set containment joins: The good, the bad and the ugly. Proceedings of the Conference on Very Large Databases.

Ramasamy, K. (2001). Efficient storage and query processing of set-valued attributes. Unpublished doctoral dissertation, University of Wisconsin–Madison.

Rantzau, R. (2003). Processing frequent itemset discovery queries by division and set containment join operators. Proceedings of the Eighth ACM SIGMOD workshop on Research issues in Data Mining and Knowledge Discovery.

Schek, H. J., & Scholl, M. H. (1989). The two roles of nested relations in the DASDBS project. In S. Abiteboul, P. C. Fisher, & H. J. Schek (Eds.), Lecture Notes in Computer Science. Springer-Verlag.

Shanmugasundaram, J., Tufte, K., He, G., Zhang, C., DeWitt, D., & Naughton, J. (1999). Relational databases for querying XML documents: Limitations and opportunities. Proceedings of the Conference on Very Large Databases.

Stonebraker, M. (1987). The design of POSTGRES storage system. Proceedings of the International Conference on Very Large Databases.


Stonebraker, M. (1996). Object-relational DBMS: The next great wave. Morgan Kaufmann.

Stonebraker, M., & Kemnitz, G. (1991). The POSTGRES next generation database management system. Communications of the ACM, 34(10), 78-92.

Zaniolo, C. (1983). The database language GEM. Proceedings of the ACM SIGMOD International Conference on Management of Data.

KEY TERMS

Association Rule Mining: A rule in the form of “if this then that” that associates events in a database; for example, the association between purchased items at a supermarket. The process of examining the data for such rules is called association rule mining.

Data Definition Language (DDL): A language used by a database management system which allows users to define the database, specifying data types, structures, and constraints on the data.

Data Manipulation Language (DML): A language used by a database management system that allows users to manipulate data (querying, inserting, and updating of data).

False Drops: False drops are a property of signature files. Since signature files use hashing to set the bits corresponding to the set elements, it is possible for two or more set elements to set the same bits. When a query signature is evaluated against the signatures in a signature file, there is a probability that the signatures match but the actual sets do not. Such matches are called false drops.

Set Containment Joins: A set containment join between relations R(a, {b}) and S(c, {d}) pairs tuples from the two relations such that {b} is a subset of {d}.

Set Equality Joins: A set equality join between relations R(a, {b}) and S(c, {d}) pairs tuples from the two relations such that {b} is equal to {d}.

Set Intersection Joins: A set intersection join between relations R(a, {b}) and S(c, {d}) pairs tuples from the two relations such that {b} ∩ {d} ≠ ∅.

Signatures: A signature is a fixed length bit vector that is computed by applying a function M iteratively to every element e in the set and setting the bit determined by M(e).


Signature Files and Signature File Construction

Yangjun Chen

University of Winnipeg, Canada

Yong Shi

University of Manitoba, Canada

INTRODUCTION

An important question in information retrieval is how to create a database index that can be searched efficiently for the data one seeks. Today, one or more of the following four techniques are frequently used: full text searching, B-trees, inversion, and the signature file. Full text searching imposes no space overhead but requires a long response time. In contrast, B-trees, inversion, and the signature file work quickly but need a large intermediary representation structure (index), which provides direct links to relevant data. In this paper, we concentrate on signature files and discuss different approaches to constructing a signature file.

The signature technique can be used not only in document databases but also in relational and object-oriented databases. In a document database, a set of semistructured (XML) documents is stored, and queries related to keywords are frequently evaluated. To speed up the evaluation of such queries, we can construct signatures for words and superimpose them to establish signatures for document blocks, which can be used to cut off nonrelevant documents as early as possible when evaluating a query. In particular, such a method can be extended to handle the so-called containment queries, for which not only the keywords but also the hierarchical structure of a document has to be considered. We can also handle queries issued to a relational or an object-oriented database using the signature technique by establishing signatures for attribute values and tuples, as well as for tables and classes.

BACKGROUND

The signature file method was originally introduced as a text indexing methodology (Faloutsos, 1985; Faloutsos, Lee, Plaisant & Shneiderman, 1990). Nowadays, however, it is utilized in a wide range of applications, such as in office filing (Christodoulakis, Theodoridou, Ho, Papa, & Pathria, 1986), hypertext systems (Faloutsos et al.), relational and object-oriented databases (Chang & Schek, 1989; Ishikawa, Kitagawa, & Ohbo, 1993; Lee & Lee, 1992; Sacks-Davis, Kent, Ramamohanarao, Thom, & Zobel, 1995; Yong, Lee, & Kim, 1994), as well as data mining (Andre-Joesson & Badal, 1997). It requires much smaller storage space than inverted files and can handle insertion and update operations in databases easily.

Typical query processing with a signature file proceeds as follows. When a query is given, a query signature is formed from the query value. Then each signature in the signature file is compared with the query signature. If a signature in the file matches the query signature, the corresponding data object becomes a candidate that may satisfy the query. Such an object is called a drop. The next step of query processing is false drop resolution: each drop is accessed and checked to determine whether it actually satisfies the query condition. Drops that fail the test are called false drops, while the qualified data objects are called actual drops.

A variety of approaches for constructing signature files have been proposed, such as bit-slice files, S-trees, and signature trees. In the following, we give an overview of all of them and discuss a new application of signatures to the tree inclusion problem, which is important for containment query evaluation in document databases.

SIGNATURE FILES AND SIGNATURE FILE ORGANIZATION

Signature Files

Intuitively, a signature file can be considered as a set of bit strings, which are called signatures. Compared to the inverted index, the signature file is more efficient in handling new insertions and queries on parts of words. But the scheme introduces information loss. More specifically, its output usually involves a number of false drops, which may only be identified by means of a full text scanning on every text block short-listed in the output. Also, for each query processed, the entire signature file needs to be searched (Faloutsos, 1985; Faloutsos, 1992). Consequently, the signature file method involves high processing and I/O cost. This problem is mitigated by partitioning the signature file, as well as by exploiting parallel computer architecture (Ciaccia & Zezula, 1996).

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.



During the creation of a signature file, each word is processed separately by a hashing function. The scheme sets a constant number (m) of 1s in the [1..F] range. The resulting binary pattern is called the word signature. Each text is seen to consist of fixed-size logical blocks and each block involves a constant number (D) of noncommon, distinct words. The D word signatures of a block are superimposed (bit OR-ed) to produce a single F-bit pattern, which is the block signature stored as an entry in the signature file.
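The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the hashing scheme of any particular system: the use of SHA-256 to pick bit positions, and the names `word_signature` and `block_signature`, are assumptions made for the example. Only the parameters F (signature length), m (bits set per word), and the superimposition by bitwise OR come from the text.

```python
import hashlib


def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Return an F-bit signature (as an int) with exactly m bits set.

    Bit positions are derived from a hash of the word (an assumed choice);
    the hash is re-salted with a counter until m distinct positions are found.
    Requires m <= F.
    """
    positions = set()
    counter = 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:4], "big") % F)
        counter += 1
    signature = 0
    for p in positions:
        signature |= 1 << p
    return signature


def block_signature(words, F: int = 12, m: int = 4) -> int:
    """Superimpose (bitwise OR) the signatures of the distinct words of a block."""
    signature = 0
    for w in set(words):
        signature |= word_signature(w, F, m)
    return signature


block = ["SGML", "database", "information"]
s = block_signature(block)
print(format(s, "012b"))
```

Because the block signature is the OR of its word signatures, every bit set in the signature of a word of the block is also set in the block signature, which is the property the matching step relies on.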

Figure 1 depicts the signature generation and comparison process of a block containing three words (hence D = 3), say “SGML,” “database,” and “information.” Each signature is of length F = 12, in which m = 4 bits are set to 1. When a query arrives, the block signatures are scanned and many nonqualifying blocks are discarded. The rest are either checked (so that the “false drops” are discarded; see below) or returned to the user as they are. Concretely, a query specifying certain values to be searched for is transformed into a query signature sq in the same way as for word signatures. The query signature is then compared to every block signature in the signature file. Three possible outcomes of the comparison are exemplified in Figure 1: (1) the block matches the query; that is, for every bit set in sq, the corresponding bit in the block signature s is also set (i.e., s ∧ sq = sq) and the block really contains the query word; (2) the block doesn’t match the query (i.e., s ∧ sq ≠ sq); and (3) the signature comparison indicates a match but the block in fact doesn’t match the search criteria (false drop). In order to eliminate false drops, the block must be examined after the block signature signifies a successful match.
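The three outcomes can be checked mechanically with the bit patterns of Figure 1. The sketch below is illustrative only; the helper names `parse`, `signature_matches`, and `classify` are hypothetical. One value is an assumption: the last group of the “SGML” signature is taken to be 110, which is the only value consistent with m = 4 and with the object signature 110 110 111 110 shown in the figure.

```python
def parse(bits: str) -> int:
    """Convert a spaced bit string such as '010 000 100 110' to an int."""
    return int(bits.replace(" ", ""), 2)


def signature_matches(s: int, sq: int) -> bool:
    """True if every bit set in the query signature sq is also set in s."""
    return (s & sq) == sq


def classify(block_words, s, query_word, sq) -> str:
    """Return 'no match', 'actual drop', or 'false drop' for one query."""
    if not signature_matches(s, sq):
        return "no match"
    # False drop resolution: the block itself must be examined.
    return "actual drop" if query_word in block_words else "false drop"


block_words = {"SGML", "database", "information"}
word_sigs = {
    "SGML": parse("010 000 100 110"),  # last group assumed, see lead-in
    "database": parse("100 010 010 100"),
    "information": parse("010 100 011 000"),
}

# Object signature (OS): bitwise OR of the word signatures.
os_sig = 0
for sig in word_sigs.values():
    os_sig |= sig

queries = {
    "SGML": parse("010 000 100 110"),
    "XML": parse("011 000 100 100"),
    "informatik": parse("110 100 100 000"),
}
for word, sq in queries.items():
    print(word, "->", classify(block_words, os_sig, word, sq))
```

The query “informatik” passes the signature test although the word is not in the block, which is why a matching signature only makes the block a candidate.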

In a signature file, a set of signatures is sequentially stored, which is easy to implement and requires low storage space and low update cost. However, when evaluating a query, a full scan of the signature file has to be performed. Therefore, it is generally slow in retrieval.

Figure 1(b) shows a simple signature file. To determine the length of signatures, we use the following formula (Faloutsos, 1985):

F × ln 2 = m × D    (1)
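Equation (1) can be turned into a small calculation. The sketch below solves it for F and rounds up to a whole number of bits; the function names are hypothetical, and the fill-ratio estimate assumes that each word sets its m bits independently and uniformly, which is an idealization rather than a claim from the text.

```python
import math


def signature_length(m: int, D: int) -> int:
    """Smallest integer F satisfying F * ln 2 >= m * D (Equation 1)."""
    return math.ceil(m * D / math.log(2))


def expected_fill(F: int, m: int, D: int) -> float:
    """Expected fraction of 1s in a block signature.

    Assumes each of the D words sets m distinct, uniformly chosen bits
    independently of the other words.
    """
    return 1.0 - (1.0 - m / F) ** D


F = signature_length(m=4, D=3)
print(F, round(expected_fill(F, 4, 3), 3))
```

With m = 4 and D = 3 this gives F = 18, and under the stated assumption roughly half of the bits of a block signature end up set. Figure 1 uses a shorter signature (F = 12), presumably to keep the illustration compact.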

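Figure 2. Illustration for bit-slice file

The eight signatures below are stored column-wise: bit-slice file i holds column i (one bit per signature), and the OID file holds the object identifiers in the same order.

bit-slice files:    OID file:
1 2 3 4 5 6 7 8
1 0 1 0 1 0 0 1     o1
0 1 1 0 0 0 1 1     o2
0 0 1 0 1 1 0 1     o3
1 1 1 0 1 0 0 0     o4
0 0 1 1 1 0 0 1     o5
1 1 1 0 0 0 1 0     o6
0 1 0 1 0 0 1 1     o7
0 1 0 1 0 1 1 0     o8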

Bit-Slice Files

A signature file can be stored in a column-wise manner. That is, the signatures in the file are stored vertically in a set of F files (Ishikawa et al., 1993), where the i-th file holds the i-th bit of every signature, as shown in Figure 2.

With such a data structure, the signatures are checked slice-by-slice (rather than signature-by-signature) to find matching signatures. To demonstrate the retrieval, consider the query signature sq = 10110000. First, we check the first bit-slice file and find that only three positions (the first, fourth, and sixth) match the first bit in sq. Then, we check the second bit-slice file. This time, however, only those three positions are checked. Since the second bit in sq is 0, no positions are filtered out. Next, we check the third bit-slice file against the third bit in sq. Because all three positions are set to 1 in it, the same positions in the next bit-slice file (i.e., the fourth) are checked against the fourth bit in sq. Since none of the three positions in the fourth bit-slice file matches this bit, the search stops and reports nil.
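The walk-through above can be reproduced with a short Python sketch over the eight signatures of Figure 2. It is an in-memory illustration with hypothetical names (`bit_slice_search`, `slices`), not a disk-based implementation. As a simplification, a slice whose query bit is 0 is skipped outright, since it cannot filter out any position.

```python
# The eight signatures of Figure 2, one string per object o1..o8.
signatures = [
    "10101001", "01100011", "00101101", "11101000",
    "00111001", "11100010", "01010011", "01010110",
]
F = len(signatures[0])

# Column-wise storage: slices[i] holds bit i of every signature.
slices = ["".join(sig[i] for sig in signatures) for i in range(F)]


def bit_slice_search(slices, sq):
    """Return (matching positions, number of slices scanned).

    Positions are 1-based, as in the text. Only slices whose query bit
    is 1 are scanned, and the scan stops once no candidate is left.
    """
    candidates = set(range(len(slices[0])))
    scanned = 0
    for i, bit in enumerate(sq):
        if bit == "0":
            continue  # a 0 in the query imposes no condition
        scanned += 1
        candidates = {p for p in candidates if slices[i][p] == "1"}
        if not candidates:
            break
    return sorted(p + 1 for p in candidates), scanned


print(bit_slice_search(slices, "10110000"))
print(bit_slice_search(slices, "10100000"))
```

For sq = 10110000 the candidate set shrinks to positions 1, 4, and 6 after the first slice, survives the third slice, and becomes empty at the fourth, so only three of the eight slices are read.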

From this process, we can see that only some of the F bit-slice files have to be scanned, so the search cost is generally lower than that of a sequential signature file. However, the update cost becomes larger. For example, an insertion of a new signature requires about F disk accesses, one for each bit-slice file.

Figure 1. Signature generation, comparison and signature files

(a) Signature generation and comparison

block: ... SGML ... databases ... information ...

word signatures:
SGML                     010 000 100 110
database                 100 010 010 100
information              010 100 011 000
object signature (OS)    110 110 111 110

queries:       query signatures:     matching results:
SGML           010 000 100 110       match with OS
XML            011 000 100 100       no match with OS
informatik     110 100 100 000       false drop

(b) Signature file and OID file

signature file:     OID file:
1 0 1 0 1 0 0 1     o1
0 1 1 0 0 0 1 1     o2
0 0 1 0 1 1 0 1     o3
1 1 1 0 1 0 0 0     o4
0 0 1 1 1 0 0 1     o5
1 1 1 0 0 0 1 0     o6
0 1 0 1 0 0 1 1     o7
0 1 0 1 0 1 1 0     o8
