
Mylopoulos, J., Gal, A., Kontogiannis, K., & Stanley, M. (1996). A generic integration architecture for cooperative information systems. Proceedings of the 1st IFCIS International Conference on Cooperative Information Systems (CoopIS’96) (pp. 208-217). IEEE.

Paton, N.W. (1998). Active rules for databases. Springer-Verlag.

Paton, N.W., Díaz, O., Williams, M.H., Campin, J., Dinn, A., & Jaime, A. (1993). Dimensions of active behaviour. Proceedings of the 1st Workshop on Rules in Databases Systems (RIDS-93).

Schwiderski, S. (1996). Monitoring the behaviour of distributed systems. PhD thesis, University of Cambridge.

Vargas-Solar, G. (2000). Service d’événements flexible pour l’intégration d’applications bases de données réparties. PhD thesis, University Joseph Fourier.

Vargas-Solar, G., & Collet, C. (2002). ADEES, adaptable and extensible event service. Proceedings of the 13th International Conference on Database Expert Systems and Applications (DEXA’02).

KEY TERMS

Active Mechanism: A system responsible for detecting events and reacting automatically to such events according to predefined active rules or ECA rules. Traditionally, active mechanisms are embedded within Active Database Management Systems (ADBMS). Unbundled active mechanisms have been proposed for federated database systems and for component database systems.

Active Rule: A rule represented by an ECA structure, meaning: when an event is produced, if the condition is verified, execute an action. The event part represents a situation that triggers the rule, the condition part represents the state of a system execution or of a database, and the action part denotes a set of operations or a program.

Event Instance: An event is an instance of an event type, associated with a point in time that belongs to the validity interval of its type.


Event Management Model: Defines policies for detecting, producing, and notifying instances of event types.

Event Type: Represents a class of situations produced within an event producer and that is interesting for a consumer. In the context of active systems, an event type represents the class of significant situations that trigger an active rule. Such situations are produced within a validity interval which is the time interval during which instances of the event type can be detected.

Federated Database System: A system that integrates a number of pre-existing autonomous DBMS which can be homogeneous or heterogeneous. They can use different underlying data models, data definition and manipulation facilities, and transaction management and concurrency control mechanisms. DBMS in the federation can be integrated by a mediator providing a unified view of data: a global schema, a global query language, a global catalogue and a global transaction manager. The underlying transaction model considers, in general, a set of transactions synchronized by a global transaction. Synchronization is achieved using protocols such as the Two-Phase Commit protocol.

Rule Execution Model: Defines policies for triggering a rule, evaluating its condition, and executing its action within the execution of an existing system (e.g., DBMS, FDBMS). It also defines policies used for executing a set of rules. Rule execution models have been characterized in taxonomies such as those proposed by Coupaye (1996), Paton et al. (1993), and Fraternali and Tanca (1995).

Rule Model: Describes the structure of a rule (i.e., its event, condition, and action parts).

ENDNOTES

1. Events are ordered according to an instant of production that belongs to a global time, which can be computed using methods such as those proposed in Schwiderski (1996) and Lamport and Melliar-Smith (1985).

2. The triggering transaction is the transaction within which events are produced; a triggered transaction is the one that executes a triggered rule.


 

Advanced Query Optimization

 

 

 


 

 

 

 

 

Antonio Badia

University of Louisville, USA

INTRODUCTION

Query optimization has been an active area of research ever since the first relational systems were implemented. In the last few years, research in the area has experienced renewed impetus, thanks to new developments such as data warehousing. In this article, we overview some of the recent advances made in complex query optimization. This article assumes knowledge of SQL and relational algebra, as well as the basics of query processing; in particular, the reader is assumed to understand cost optimization of basic SQL blocks (Select-Project-Join queries). After explaining the basic unnesting approach to provide some background, we overview four complementary techniques: source and algebraic transformations (in particular, moving outerjoins and pushing down aggregates), query rewrite (materialized views), new indexing techniques (bitmap and join indices), and different methods to build the answer (online aggregation and sampling). Throughout this article, we will use subquery to refer to a nested SQL query and outer query to refer to an SQL query that contains a nested query. The TPCH benchmark database (TPC, n.d.) is used as a source of examples. This database includes (in ascending size order) tables Nation, Customer (with a foreign key to Nation), Order (with a foreign key to Customer), and Lineitem (with a foreign key to Order). All attributes from a table are prefixed with the initial of the table's name ("c_" for Customer, and so on).

BACKGROUND

One of the most powerful features of SQL is its ability to express complex conditions using nested queries, which can be non-correlated or correlated. Traditional execution of such queries by the tuple iteration paradigm is known to be inefficient, especially for correlated subqueries. The seminal idea to improve this evaluation method was developed by Won Kim (1982), who showed that it is possible to transform many queries with nested subqueries into equivalent queries without subqueries. Kim divided subqueries into four different classes: non-correlated, aggregated subqueries (type-A); non-correlated, not aggregated subqueries (type-N); correlated, not aggregated subqueries (type-J); and correlated, aggregated subqueries (type-JA). Nothing can be done for type-A queries, at least in a traditional relational framework. Type-N queries correspond to those using the IN, NOT IN, EXISTS, and NOT EXISTS SQL predicates. Kim's observation was that queries that use IN can be rewritten by transforming the IN predicate into a join (however, what is truly needed is a semijoin). Type-J queries are treated essentially as type-N.
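As a simple illustration (a sketch using the TPCH tables introduced above, not an example from Kim's paper), a type-N query such as:

Select c_custkey
From Customer
Where c_custkey IN (Select o_custkey From Order)

can be rewritten by turning the IN predicate into a join:

Select distinct c_custkey
From Customer, Order
Where c_custkey = o_custkey

where the DISTINCT clause compensates for the duplicates the join may introduce when a customer has several Orders; a semijoin avoids such duplicates by construction.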

Type-JA is the most interesting type. For one, it is a very common kind of query in SQL. Moreover, other queries can be rewritten as type-JA. For instance, a query with EXISTS can be rewritten as follows: a condition like EXISTS (SELECT attr ...) (where attr is an attribute) is transformed into 0 < (SELECT COUNT(*) ...) (the change to "*" is needed to deal with null values). To deal with type-JA queries, we first transform the subquery into a query with aggregation; the result is left in a temporary table, which is then joined with the main query. For instance, the query:

Select c_custkey
From Customer
Where c_acctbal > 10000 and c_acctbal >
      (Select sum(o_totalprice)
       From Order
       Where o_custkey = c_custkey)

is executed as:

Create Table Temp as
  (Select o_custkey, sum(o_totalprice) as sumprice
   From Order
   Group by o_custkey)

Select c_custkey
From Customer, Temp
Where o_custkey = c_custkey and c_acctbal > 10000
      and c_acctbal > sumprice

However, this approach fails on several counts: first, non-matching customers (customers without Orders) are lost in the rewriting (they will fail to qualify in the final join), although they would have made it into the original query (if their account balances were appropriate); this is sometimes called the zero-count bug. Second, the approach is incorrect when the correlation uses a predicate other than equality. To solve both problems at once, several authors suggested a new strategy: first, outerjoin the relations involved in the correlation (the outerjoin will keep values with no match); then, compute the aggregation. Thus, the example above would be computed as follows:

Create Table Temp(ckey, sumprice) as
  (Select c_custkey, sum(o_totalprice)
   From Customer Left Outer Join Order on o_custkey = c_custkey
   Group by c_custkey)

Select c_custkey
From Customer, Temp
Where c_custkey = ckey and c_acctbal > 10000
      and c_acctbal > sumprice

This approach still has two major drawbacks in terms of efficiency. First of all, we note that we are only interested in certain customers (those with an account balance over a given amount); however, all customers are considered in the outerjoin. The Magic Sets approach (Seshadri et al., 1996) takes care of this problem by computing first the values that we are interested in. For instance, in the example above, we proceed as follows:

Create table C_mag as
  (Select c_custkey, c_acctbal
   From Customer
   Where c_acctbal > 10000);

Create table Magic as
  (Select distinct c_custkey as key
   From C_mag); /* this is the Magic Set */

Create table Correl (custkey, sumprice) as
  (Select key, sum(o_totalprice)
   From Magic Left Outer Join Order on key = o_custkey
   Group by key);

Select c_custkey
from C_mag, Correl
where c_custkey = custkey and c_acctbal > sumprice

QUERY OPTIMIZATION

Query Transformations

Although the approaches discussed are an improvement over the naïve approach, they introduce another issue that must be taken care of: the introduction of outerjoins in the query plan is problematic, since outerjoins, unlike joins, do not commute among themselves or with other operators such as (regular) joins and selections. This is a problem because query optimizers work, in large part, by deciding in which order to carry out the operations in a query, using the fact that traditional relational algebra operators commute and can therefore be executed in a variety of orders. To deal with this issue, Galindo-Legaria and Rosenthal (1997) give conditions under which an outerjoin can be optimized. This work is expanded in Rao et al. (2001). Intuitively, this line of work is based on two ideas: sometimes, outerjoins can be substituted by regular joins (for instance, if a selection is to be applied to the result of the outerjoin and the selection mentions any of the padded attributes, then all padded tuples that the outerjoin adds over a regular join will be eliminated anyway, since they contain null values and null values do not pass any condition, except the IS NULL predicate); and sometimes, outerjoins can be moved around (the generalized outerjoin proposed in the work above keeps some extra attributes so that interactions with joins are neutralized).
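As a sketch of the first idea (an illustrative query, not taken from the cited papers), consider:

Select c_name, o_totalprice
From Customer Left Outer Join Order on c_custkey = o_custkey
Where o_totalprice > 1000

Here the selection mentions o_totalprice, an attribute that is padded with nulls for customers without matching Orders; since those padded tuples cannot satisfy the condition, the outerjoin can be replaced by a regular join, which the optimizer is then free to reorder with the other joins and selections.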

Finally, one last note on unnesting: most approaches do not deal properly with operators involving negation or, equivalently, universal quantification, like operators involving the ALL comparison. It has been suggested that operators using ALL could be rewritten as antijoins; a query like:

Select c_custkey

From Customer

Where c_acctbal > ALL (Select o_totalprice From Order where c_custkey = o_custkey)

can be decided by antijoining Customer and Order on the condition (c_custkey = o_custkey AND c_acctbal <= o_totalprice) (since a tuple in Customer would be present in the antijoin only if its c_acctbal was never less than or equal to a total price, that is, if the c_acctbal was greater than all total prices); unfortunately, this reasoning does not hold when there are nulls present in either attribute. A different approach dealing with these operators is that of Akinde and Böhlen (2003). This work introduces a multidimensional join operator (the MD-join), in which two relations can be joined on several conditions. The MD-join can be annotated with aggregate operations, each one computed on the result of a different join condition. Then, the above query would be computed by an MD-join between Customer and Order, where (a) a grouping by c_custkey and a count(*) are computed over the join via condition c_custkey = o_custkey; and

(b) a grouping by c_custkey and a count(*) are computed over the join via condition c_custkey = o_custkey and c_acctbal > o_totalprice. The tuples where the two counts are the same are then picked up by a selection. Intuitively, we count all tuples for a given value that could possibly fulfill the ALL operator and all tuples that actually fulfill it; if both numbers are the same, then obviously all tuples have fulfilled the condition, and the value should be made part of the result. Another strategy to deal with this difficult case is outlined in Badia (2003), where a Boolean aggregate is proposed: after outerjoining Customer and Order and grouping by c_custkey, the aggregate simply ANDs the comparisons between c_acctbal and all related o_totalprice values; if the final result is true, then the operator is fulfilled.
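The count-comparison idea can be sketched in plain SQL (this is only an illustration of the reasoning; the cited work relies on the dedicated MD-join operator rather than on such a rewriting):

Select c_custkey
From Customer Left Outer Join Order on c_custkey = o_custkey
Group by c_custkey
Having count(o_custkey) =
       sum(case when c_acctbal > o_totalprice then 1 else 0 end)

-- count(o_custkey) counts the Orders of each customer (0 if there are none),
-- while the sum counts the Orders whose total price the account balance exceeds;
-- equal counts mean the ALL condition holds (vacuously so for customers with no Orders)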

Another strategy developed recently in query optimization is the pushing down of aggregates. To push down an operator is simply to execute it earlier in the query process; it is well known that pushing down conditions is usually an effective way to reduce computation costs. For instance, in the query:

Select c_name, sum(c_acctbal)
From Customer, Nation
Where c_nationkey = n_nationkey and c_mktsegment = 'dairy'
Group by c_name

a result can be produced by joining Customer and Nation first and then selecting the appropriate customers, or by first selecting and then joining the result of the selection with the table Nation. In many cases, the latter is more effective, as it gets rid of undesired data sooner. However, the grouping (and the aggregate computation) must wait until all other operations are executed. To understand why it would be desirable to push down a group by, note that the output of such an operation is usually (much) smaller than the input; like a selection, a group by tends to reduce the volume of data for further computation (there is another motivation, to be discussed later). Gupta, Harinarayan, and Quass (1995) study this issue and show that, in some cases, the group by can be pushed down; that is, it can be done before joins or selections. Their basic idea, later extended in Goel and Iyer (1996), is that a group by breaks down a relation into groups; anything that treats such groups as a whole (instead of doing different things to different tuples in the group) is compatible with the grouping operation. Thus, in the case of selections, if a selection gets rid of all tuples in a group, or keeps all tuples in a group, the selection and the grouping can be done in any order. For instance, the query

Select c_name, sum(c_acctbal)

From Customer

Where c_name = 'Jones'

Group by c_name

is one such case. The reason is that since all tuples in a group share the same name, they are all equally affected by the condition. What happens when this is not the case?

For instance, in our previous example, the condition clearly interferes with the grouping. However, we can first group by c_name and c_mktsegment, then apply the condition, and finally group by c_name alone. By adding the attribute in the selection condition to the grouping, we make groups that are not altered by the selection. Note that now we have two groupings to do, but the second one only has to combine the groups already formed by the first one. Note also that any sums computed by the first grouping can be "rolled up" to achieve the final, correct result.
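A sketch of this two-stage evaluation for the Customer/Nation query above (illustrative SQL only; an optimizer performs the transformation on the algebraic plan):

Create Table PreGroup as
  (Select c_name, c_mktsegment, c_nationkey, sum(c_acctbal) as partial_sum
   From Customer
   Group by c_name, c_mktsegment, c_nationkey)

Select c_name, sum(partial_sum)
From PreGroup, Nation
Where c_nationkey = n_nationkey and c_mktsegment = 'dairy'
Group by c_name

The first grouping adds c_mktsegment (so that the selection keeps or drops whole groups) and c_nationkey (needed for the join; since n_nationkey is the key of Nation, the join does not duplicate partial sums); the second grouping rolls the partial sums up to groups on c_name alone.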

As for the join, the problem can be attacked by looking at a join as a Cartesian product followed by a selection. We have already dealt with the issues introduced by selection; hence, we need to look at issues introduced by Cartesian products. The obvious issue is one of multiplicity. Assume the query:

Select c_name, sum(c_acctbal)

From Customer, Order

Where c_custkey = o_custkey

Group by c_name

Since all the attributes involved in the grouping and aggregation are from the Customer table, one may be tempted to push the group by down to the Customer table before the join with Order. The problem with this approach is that it disturbs the computation of the aggregate: each tuple in Customer may match several tuples in Order; therefore, computing the sum before or after the join will yield different results. The basic observation here is the following: the Cartesian product may alter a computation by repeating some existing values but never by creating any new values (Chaudhuri & Shim, 1994). Some aggregates, like Min and Max, are duplicate-insensitive, that is, they are not affected by the presence of duplicates; thus, there is nothing to worry about. Other aggregates, like Sum, Count, and Avg, are duplicate-sensitive (Chaudhuri & Shim, 1995). In this case, one needs to know how many duplicates of a value a join introduces in order to recreate the desired result. To do so, an aggregation is introduced on the other relation in the join, Order, with count(*) as the aggregate. This gives the desired information, which can then be used to calculate the appropriate final result. In our example above, we can group Customer by c_name, sum the c_acctbal, group Order by o_custkey and compute count(*) on each such group (that is, compute how many times a given customer key appears in Order); then both grouped relations can be joined and the sum on the Customer side multiplied by the count obtained on the Order side. Note that the relations are grouped before the join, thereby probably reducing their sizes and making join processing much more efficient.
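A sketch of this rewriting for the Customer/Order query above (again purely illustrative; c_custkey is kept in the first grouping because the counts are per customer key):

Create Table CustGroup as
  (Select c_custkey, c_name, sum(c_acctbal) as partial_sum
   From Customer
   Group by c_custkey, c_name)

Create Table OrderCount as
  (Select o_custkey, count(*) as cnt
   From Order
   Group by o_custkey)

Select c_name, sum(partial_sum * cnt)
From CustGroup, OrderCount
Where c_custkey = o_custkey
Group by c_name

Multiplying each partial sum by cnt recreates exactly the duplicates that the join would have introduced before the aggregation, so the final sums match those of the original query.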


Data Warehousing Optimizations

In a DW environment, it is common to have complex queries that involve joins and groupings of large tables. Therefore, special indices and processing techniques have been developed to support these kinds of queries over large relations.

For joins, using join indices is a common strategy. A join index is a table containing rowids or tuple ids of tuples in several relations, such that the join index represents (precomputes) the join of the relations. Thus, given relations R, S with attributes A, B respectively, the join index for the join of R and S on condition A = B is a binary table with entries of the form (Rtid, Stid), where Rtid is a pointer to a tuple t_R in R and Stid is a pointer to a tuple t_S in S such that t_R.A = t_S.B. Note that the pointers may be physical or logical; when they are logical, there is some kind of translation to a physical pointer so that, given a tid, we can locate a tuple on disk.
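As a sketch (using the Oracle-style rowid pseudo-column; how tuple identifiers are exposed is system-specific), such a join index could be materialized as:

Create Table RS_JoinIndex as
  (Select R.rowid as Rtid, S.rowid as Stid
   From R, S
   Where R.A = S.B)

Each entry pairs the identifiers of two matching tuples, so a later join of R and S on A = B can fetch the matching tuples directly instead of recomputing the join.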

The same idea can be extended to more than two relations. In the context of a data warehouse, where most queries involve the join of the fact table and several dimension tables, a star join index can be constructed. A typical star index contains rows of the form (Frowid, Drowid1, ..., Drowidn), with Frowid a pointer to the Fact table and each Drowidi a pointer to Dimension table i, indicating how the Fact table would join each dimension table row by row.

A problem with join indices, especially with star indices, is that they may grow very large. The star index described above has as many rows as the fact table, which is usually quite large. A possible improvement is the O’Neil-Graefe schema (Graefe & O’Neil, 1995), in which separate join indices are implemented for each dimension table (i.e., each join index reflects the join of the fact table and one of the dimension tables) using bitmaps; when a query requires several such indices to be used, the selected bitmaps are all ANDed together to do the join in one step.

Another index used in data warehousing is the bitmap index. A bitmap is a sequence of bits. A bitmap index is used to indicate the presence or absence of a value in a row. Assume relation R has m rows and attribute A in R can have one of n values. For each value, we build a bitmap with m bits such that the ith bit is set to 1 if the given value of A is the one in the ith row of R, and to 0 otherwise. For instance, attribute Rank with values Assistant, Associate, and Full would generate 3 bitmaps; bit 35 would be set to 1 in the first bitmap if the value of Rank in record 35 is Assistant, and to 0 otherwise. Note that the n bitmaps are complementary: each bit will be 1 in only one of them. Therefore, the higher the value of n, the sparser the bitmaps will be (i.e., the proportion of 0s to the total number of bits will be high). Since this is wasteful, bitmaps are usually utilized when n is small. The value n is said to be the cardinality of the attribute; therefore, bitmaps are more useful with low cardinality attributes, since fewer, denser bitmaps can then be used. To make them useful for high cardinality attributes, bitmaps can be compressed, but this in turn presents other problems (see below). Another strategy is to represent the values not by a single bit but by a collection of bits. For instance, if n = 8, we would need 8 bitmaps in the schema just introduced; instead, we can devote 3 bits to representing each possible value of the attribute as one of the numbers 0-7 in binary (hence, a logarithmic number of bits is enough). Thus, such bitmaps are made up of small collections of bits.

The attraction of using bitmaps resides in two characteristics. The first is the ability of most CPUs to process them in a very efficient manner. The second is that bitmaps provide access to the base relation in sorted order; therefore, sequential or list prefetch can be used. Note that in order to access the data, we have to transform the position i of each bit that is set to 1 into a pointer to the ith row of the base table; however, this usually can be achieved with low overhead. Furthermore, bit manipulation can be used for several problems, like range predicates, multiple predicates over one relation, or (in combination with join indices) complex predicates involving joins and selections.
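For example, in Oracle-style DDL (syntax and availability vary across systems, and not all systems expose bitmap indices explicitly):

Create Bitmap Index cust_mkt_idx on Customer(c_mktsegment);
Create Bitmap Index cust_nation_idx on Customer(c_nationkey);

Select count(*)
From Customer
Where c_mktsegment = 'dairy' and c_nationkey = 7

Both conditions are on low cardinality attributes; the two selected bitmaps can be ANDed bit by bit, and the resulting bitmap identifies the qualifying rows (or even answers the count directly) before the base table is touched.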

We can also use bitmaps in the context of trees. In a leaf of a tree, we have a value and a list of Rowids pointing to the rows that have that value. When the list of Rowids is long, substituting it with a bitmap may reduce the size of the representation. IBM is said to have implemented trees that can decide on the fly whether to use lists of Rowids or bitmaps, depending on the data.

Finally, another optimization that is specific to DWs is the use of aggregate or summary tables (Gupta & Mumick, 1999). These are materialized views created by aggregating the fact table on some dimension(s). Whenever a query comes in that needs to aggregate the fact table on exactly those aggregates and dimensions (or sometimes a subset of them, as some aggregates can be computed stagewise), we use the summary table instead of the fact table. Since the summary table is usually much smaller than the fact table due to the grouping, the query can be processed much faster. However, in order to use materialized views in a data warehouse, two questions must be answered:

Many aggregations are possible; which ones should be materialized? If a fact has n dimensions, and each dimension has a hierarchy with m levels, in principle there are m x n ways to aggregate the data. In fact, since any subset of values is a possible grouping, there are 2^(m x n) ways to summarize. Clearly, it is impossible to materialize all the views. Several issues have been researched: which views are more useful in answering queries (that is, which views would be used by more queries)? Which materialized aggregates constitute a good trade-off of space (storage constraints) vs. time (query response time)? Given two views, can one be computed from the other, so that it is redundant to materialize both of them? Most approaches are syntactical; that is, they do not take into account query workloads but try to determine a solution from the DW schema alone. A greedy algorithm, for example, can assign some utility and cost to each grouping, start with a low cost grouping, and then expand as long as the utility increases. These and other ideas are explored in the papers collected in Gupta and Mumick (1999).

When can a query use a summary table instead of the original table? This is a complex question, since it may not be enough to check whether the view is a subpart of the relational expression denoted by the query (so that it can be substituted into it). For example, assume a relation R and a view V defined by:

Select A, B, C, sum(D) From R Group By A, B, C

Assume that the query

Select A, B, sum(D) From R Group By A, B

is issued. This query can use view V, since its grouping can be obtained by further aggregating (rolling up) the groups of V, and SUM can be computed stagewise. In order to deduce this, we can compare the query trees for V and the query; a subtree of the query tree can be made identical to the tree for V in part by pushing down the group by operation as described earlier (this is the other motivation for that line of work). But even the query:

Select A, B, E, sum(D) From R, S Where R.A = S.Y Group By A, B, E

with S.Y a primary key of S (so R.A is a foreign key) and E an attribute of S, can use V, although realizing so requires sophisticated reasoning. See Cohen, Nutt and Serebrenik (1999) for an approach to this issue.
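Both rewritings can be sketched in SQL (actual systems match views against the query's algebraic plan; the column name SD for the stored sums is ours, for illustration). Assuming V is materialized with schema (A, B, C, SD), the first query is answered by rolling V up:

Select A, B, sum(SD)
From V
Group by A, B

and the second by rolling up and joining with S:

Select A, B, E, sum(SD)
From V, S
Where V.A = S.Y
Group by A, B, E

Since SUM can be computed stagewise, summing the stored partial sums gives the same result as summing D directly over R, and since R.A is a foreign key referencing the key S.Y, the join does not duplicate any row of V.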

Approximate Answers

All approaches presented so far compute an exact answer, even if it takes a long time, and then return it. Lately, some work has investigated the idea of computing approximate answers, with the goal of returning some useful information to the user in a much shorter time. We sketch two ideas: sampling and online aggregation. Sampling is simply a development of statistical notions: if, instead of scanning a whole table, we scan a sample of it, we can produce a result very fast, even though it would not be exact. It is possible, however, to give a reasonable answer with a sample that is randomly obtained and has a minimum size. The trick in this approach is to obtain the right sample. If, for instance, we want to obtain a sample of the join of two large tables (say, Order and Lineitem), and we obtain a sample of Order and one of Lineitem and join them, the resulting sample may not be large enough, because the space that it represents is that of the Cartesian product of Order and Lineitem. However, since in this join Order holds the primary key and Lineitem the foreign key, we can take a random sample of Lineitem (even a small one) and join the sample with Order (which can be quite efficient with a small sample). The result gives us a very good approximation of the join. A different approach is implemented in the Aqua system (Acharya, Gibbons & Poosala, 1999). Instead of using sampling, this system builds approximations, called synopses, on top of which an aggregate query can be run. For instance, the query:

Select sum(l_quantity) from Order, Lineitem

where l_orderkey = o_orderkey and o_orderstatus = 'F'

is rewritten to

Select sum(l_quantity) * 100 from js_Order, bs_lineitem

where l_orderkey = o_orderkey and o_orderstatus = 'F'

where the tables js_Order and bs_lineitem are created by the system to implement a join synopsis, a 1% sample of the join between Order and Lineitem (that is why the resulting sum is scaled by 100).
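The basic sampling idea above can also be sketched directly in SQL using the TABLESAMPLE clause (supported, with varying syntax and semantics, by several systems); here a 1% Bernoulli sample of Lineitem, the foreign key side, is joined with Order and the result scaled up:

Select sum(l_quantity) * 100
From Order, Lineitem TABLESAMPLE BERNOULLI (1)
Where o_orderkey = l_orderkey

Each sampled Lineitem tuple joins with exactly one Order tuple, so the sample preserves the distribution of the join, and the factor 100 compensates for the 1% sampling rate.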

Online aggregation focuses, as its name indicates, on queries involving aggregates. Because an aggregation produces one result from many data points, it is a primary target for computing approximate answers. The work at Berkeley (Haas & Hellerstein, 1999) is based on computing the aggregate incrementally and showing the results to the user, together with a confidence interval, as they are computed. If the users decide they have enough accuracy, or enough information altogether, they can stop the computation. The basic sampling idea explained above is refined to form the basis of the ripple join, an algorithm designed to minimize the time to produce an initial, approximate answer with a reasonable level of precision (Haas & Hellerstein, 1999). This work is extended in Ioannidis and Poosala (1999). While previous work focused on aggregate queries and used sampling-based techniques, this article presents a histogram-based approach, developing a histogram algebra that allows general queries to be handled. Since answers can now be sets, an error measure to quantify the error in an approximate query answer is also introduced. Finally, a related effort is that of Chaudhuri and Gravano (1999), in which queries requiring only the top k tuples of an answer (according to some ranking mechanism) can be efficiently evaluated using statistical information from the DBMS. An excellent, if somewhat dated, tutorial on this still-active area of research is Haas and Hellerstein (2001).

FUTURE TRENDS

Query optimization continues to be a very active and exciting area of research. Several areas seem to be currently favored by researchers: one is query optimization for XML, with new techniques developed for this data model (tree matching, pattern and regular expression matching, tree indexing). Another is the support for new types of answers: approximate, top-k, and ranked answers can all benefit from specific techniques. New indexing methods that use information in histograms and approximation techniques (especially for aggregate queries) are another important area. In the context of data warehousing, refinements to support cube operations and the reuse of materialized views continue to be fundamental. Finally, support for operations on sets (like set joins and set containment) is also being studied, due to their wide applicability.

CONCLUSION

Perhaps because of the practical interest of the subject, research on query optimization has never really stopped. There have been considerable advances in the area in recent years. Thanks in part to the development of data warehouses, new problems have appeared and new techniques have been developed. In this article, we have overviewed some of the main recent ideas.

REFERENCES

Acharya, S., Gibbons, P., & Poosala, V. (1999, September 7-10). Aqua: A fast decision support system using approximate query answers. Proceedings of 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.


Akinde, M., & Böhlen, M. (2003, March 5-8). Efficient computation of subqueries in complex OLAP. Proceedings of the 19th International Conference on Data Engineering, Bangalore, India.

Badia, A. (2003, September 3-5). Computing SQL subqueries with boolean aggregates. Proceedings of the 5th International Data Warehousing and Knowledge Discovery Conference, Prague, Czech Republic.

Chaudhuri, S., & Gravano, L. (1999, September 7-10). Evaluating top-k selection queries. Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.

Chaudhuri, S., & Shim, K. (1994, September 12-15). Including group-by in query optimization. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, Chile.

Chaudhuri, S., & Shim, K. (1995). An overview of costbased optimization of queries with aggregates. Data Engineering Bulletin, 18(3), 3-10.

Cohen, S., Nutt, W., & Serebrenik, A. (1999, June 14-15). Algorithms for rewriting aggregate queries using views. Proceedings of the International Workshop on Design and Management of Data Warehouses, Heidelberg, Germany.

Galindo-Legaria, C., & Rosenthal, A. (1997). Outerjoin simplification and reordering for query optimization. ACM Transactions on Database Systems, 22(1), 43-74.

Goel, P., & Iyer, B. (1996, June 4-6). SQL query optimization: Reordering for a general class of queries. Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada.

Graefe, G., & O’Neil, P. (1995). Multi-table joins through bitmapped join indices. SIGMOD Record, 24(3), 8-11.

Gupta, A., Harinarayan, V., & Quass, D. (1995, September 11-15). Aggregate-query processing in data warehousing environments. Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland.

Gupta, A., & Mumick, I.S. (Eds.). (1999). Materialized views: Techniques, implementations and applications. Cambridge, MA: MIT Press.

Haas, P., & Hellerstein, J. (1999, June 1-3). Ripple joins for online aggregation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania.


Haas, P., & Hellerstein, J. (2001, May 21-24). Online query processing. Tutorial at the ACM SIGMOD International Conference on Management of Data, Santa Barbara, California. Retrieved January 23, 2005, from http://control.cs.berkeley.edu

Ioannidis, Y., & Poosala, V. (1999, September 7-10). Histogram-based approximation of set-valued query answers. Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.

Kim, W. (1982). On optimizing an SQL-like nested query. ACM Transactions on Database Systems, 7(3), 443-469.

Rao, J., Lindsay, B., Lohman, G., Pirahesh, H., & Simmen, D. (2001, April 2-6). Using EELs, a practical approach to outerjoin and antijoin reordering. Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany.

Seshadri, P., Hellerstein, J., Pirahesh, H., Leung, H.T., Ramakrishnan, R., Srivastava, D., et al. (1996, June 4-6). Cost-based optimization for Magic: Algebra and implementation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada.

Seshadri, P., Pirahesh, H., & Leung, T.Y. (1996). Complex query decorrelation. Proceedings of the 12th International Conference on Data Engineering, February 26-March 1, New Orleans, Louisiana.

TPC. (n.d.). The TPCH benchmark. Retrieved January 23, 2005, from http://www.tpc.org/tpch

KEY TERMS

Bitmap Index: An index containing a series of bitmaps such that for attribute A on relation R, each bitmap tells us if a given record in R has a certain value for A. Bitmap indices are often used in decision support environments, since used in conjunction with other bitmap or regular indices, they can cut down on disk accesses for selections, thereby improving query response time.

Magic Set: Given an SQL query with a correlated, nested subquery, the magic set is the set of values that are relevant, as parameters, for the computation of the subquery. This set is obtained by computing all conditions in the outer query except the one involving the subquery.

Materialized View: A view that is computed at definition time and stored on disk as a regular relation. Materialized views offer the advantage that they can be used to precompute part of a query and then be reused as many times as needed. The main challenge with materialized views is to keep them updated when the relations on which they are defined change; incremental techniques are used to make this efficient.

Outerjoin: Extension of the regular join operator. Given two relations R and S and a condition C involving attributes of R and S, the outerjoin will generate as its output: (a) all the tuples in the Cartesian product of R and S that respect condition C (i.e., a regular join); plus (b) all the tuples in R that do not have a match in S fulfilling C, padded with null values on the attributes of S; plus (c) all the tuples in S that do not have a match in R fulfilling C, padded with null values on the attributes of R. A left outerjoin will output only (a) and (b), and a right outerjoin will output only (a) and (c).

Query Rewriting: An approach to query optimization that rewrites the original SQL query into another equivalent SQL query (or a sequence of several SQL queries) that offers the possibility of yielding better-performing query plans once they are processed by the query optimizer.

Semijoin: Variant of the regular join operator. The (left) semijoin of relations R and S on condition C involving attributes of R and S will generate as its output all the tuples in R that have at least one matching tuple in S fulfilling C (i.e., the tuples in R that would participate in a regular join), without repetitions (the right semijoin is defined similarly). Algebraically, the (left) semijoin is defined as the projection onto the schema of relation R of the join of R and S.

Unnesting: A particular technique in query rewriting that aims at transforming SQL queries with subqueries in them into equivalent queries without subqueries, also called flattening.


Applying Database Techniques to the Semantic Web

María del Mar Roldán-García

University of Malaga, Spain

Ismael Navas-Delgado

University of Malaga, Spain

José F. Aldana-Montes

University of Malaga, Spain

INTRODUCTION

Information on the Web has grown very quickly. The semantics of this information are becoming explicit and the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001) is emerging. Ontologies provide a formal representation of the real world by defining concepts and relationships between them. In order to provide semantics to Web resources, instances of such concepts and relationships are used to annotate them. These annotations on the resources, which are based on ontologies, are the foundation of the Semantic Web. Because of the Web’s size we have to deal with large amounts of knowledge. All this information must be represented and managed efficiently to guarantee the feasibility of the Semantic Web. Querying and reasoning over instances of ontologies will make the Semantic Web useful.

Knowledge representation is a well-known problem for artificial intelligence researchers. Explicit semantics is defined by means of formal languages. Description logics (Nardi & Brachman, 2003) is a family of logical formalisms for representing and reasoning on complex classes of individuals (called concepts) and their relationships (expressed by binary relations called roles). DL formalism allows the description of concepts, relationships, and individuals (i.e., the knowledge base). All of them, together with complex concept formation and concept retrieval and realization, provide a query/reasoning language for the knowledge base. Research in description logics deals with new ways to query a knowledge base efficiently. On the other hand, knowledge representation research does not deal with large amounts of information, and its results can only be applied to very small knowledge bases (with a small number of instances), which are not the knowledge bases we expect to find on the Semantic Web. As a consequence, reasoning algorithms are not scalable and usually are main-memory-oriented algorithms.

All these problems increase when we deal with the context of the Semantic Web. In such an environment,

reasoning and querying should be scalable, distributable, and efficient. In this context, we will find a fairly large amount of distributed instances. On the Semantic Web, not only reasoning on concepts is necessary, but also reasoning at the instance level and efficient instance retrieval. Therefore, it is necessary to allow (efficient) queries and (efficient) reasoning to be compatible in such a distributed (massive in instances) environment as the Semantic Web. This makes it necessary to develop efficient knowledge storage mechanisms that use secondary memory (in order to guarantee scale-up) and allow efficient reasoning about concepts and relationships defined by an ontology, and about its instances as well. Furthermore, it is fundamental to provide efficient disk-oriented reasoning mechanisms, particularly for the instance retrieval problem. We believe database technology (not necessarily database systems) is a must for this purpose, because it is developed to deal with large amounts of data and massive storage.

BACKGROUND

Description Logics

Description logics (DL) are a logical formalism, related to semantic networks and frame systems, for representing and reasoning on complex classes of individuals (called concepts) and their relationships (expressed by binary relations called roles). Typically, we distinguish between atomic (or primitive) concepts (and roles), and complex concepts defined by using DL constructors. Different DL languages vary in the set of constructors provided.

A DL knowledge base has two components: (1) a terminological part (the Tbox) that contains a set of concept descriptions and represents the general schema modeling the domain of interest and (2) an assertional part (the Abox) that is a partial instantiation of this schema,


comprising a set of assertions relating either individuals to classes or individuals to each other. Many of the applications only require reasoning in the Tbox, but in an environment like the Semantic Web, we also need Abox reasoning. Figure 1 shows an example of a description logic Tbox; Figure 2 shows an example of an Abox for this Tbox.

Figure 1. Example of a description logic Tbox

String ⊑ Thing
Person ⊑ Thing
Professor ⊑ Person
Doctor ⊑ Professor
NoDoctor ⊑ Professor
Student ⊑ Person
PhDStudent ⊑ Student
UnderGradStudent ⊑ Student
Student ⊑ director.Professor
Student ⊑ name.String

Figure 2. Example of an Abox for the Figure 1 Tbox

Roldan: Person
Navas: Person
Aldana: Person
Roldan: NoDoctor
Navas: NoDoctor
Aldana: Doctor
<Roldan, Aldana>: director
<Navas, Aldana>: director
<Roldan, "Maria del Mar">: name
<Navas, "Ismael">: name
<Aldana, "Jose F.">: name

The reasoning tasks in a Tbox are consistency (satisfiability), which checks whether the knowledge is meaningful; subsumption, which checks whether all the individuals belonging to one concept (the subsumee) also belong to another concept (the subsumer); and equivalence, which checks whether two classes denote the same set of instances (Nardi & Brachman, 2003). All of these reasoning mechanisms are reducible to satisfiability, as long as we use a concept language closed under negation.
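For instance, for concepts C and D (in a language that provides concept conjunction and negation):

C ⊑ D holds iff C ⊓ ¬D is unsatisfiable, and C ≡ D holds iff both C ⊓ ¬D and D ⊓ ¬C are unsatisfiable.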

Typically, the basic reasoning tasks in an Abox are instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept; knowledge base consistency, which amounts to verifying whether every concept in the knowledge base admits at least one individual; and realization, which finds the most specific concept an individual object is an instance of (Nardi & Brachman, 2003).

In recent years significant advances have been made in the design of sound and complete algorithms for DLs. Moreover, systems using these algorithms have also been developed (Haarslev & Möller, 2001; Horrocks, 1998). Most of these approaches only deal with Tbox reasoning, but in an environment like the Semantic Web, we also need Abox reasoning. Although some DL systems provide sound and complete Abox reasoning, they provide only a very weak Abox query language. A query means retrieving instances that satisfy certain restrictions or qualifications and hence are of interest to a user.

Web Ontology Languages

The recognition of the key role that ontologies play in the future of the Web has led to the extension of Web markup languages in order to provide languages to define Web-based ontologies. Examples of these languages are XML Schema,1 RDF,2 and RDF Schema (Brickley & Guha, 2002).

Even with RDF Schema, RDF has very weak semantics. Still, it provides a good foundation for interchanging data and enabling true Semantic Web languages to be layered

Figure 3. OWL class constructors

Figure 4. OWL axioms
