Rivero, L. Encyclopedia of Database Technologies and Applications. 2006.

Relational, Object-Oriented and

Object-Relational Data Models

Antonio Badia

University of Louisville, USA

INTRODUCTION

The relational data model is the dominant paradigm in the commercial database market today, and it has been for several years. However, there have been challenges to the model over the years, and they have influenced its evolution and that of database technology. The object-oriented revolution that started in programming languages arrived in the database area in the form of a brand-new data model. The relational model managed not only to survive the newcomer but to remain a dominant force, transformed into the object-relational model (also called extended relational, or universal) and relegating object-oriented databases to a niche product. Although this market has many nontechnical aspects, there are certainly important technical differences among these data models. In this article I describe the basic components of the relational, object-oriented, and object-relational data models. I do not, however, discuss query languages, implementation, or system issues. A basic comparison is given, and then future trends are discussed.

BACKGROUND

In order to facilitate the comparison among the models, I will use the same example throughout the article. The example involves a universe of people; some are professors and some are students. Each professor works at a department, and some professors chair departments. Students have professors as advisors. The exact situation is depicted as an entity-relationship (ER) diagram (Chen, 1976) in Figure 1. The notation denotes that Chairs is a one-to-one relationship, Faculty is a one-to-many relationship, and Teaches is a many-to-many relationship. Also, Phones is a multivalued attribute, and Student and Professor are subclasses of Person. The inverted-triangle symbol is used to denote a class/subclass relationship; the reader is warned that different authors use different symbols for this purpose.

The relational data model (Date, 2004) is well known; I simply overview its main concepts here, as it serves as the baseline for the comparison. A domain is a nonempty set; intuitively, it provides a pool of values. Every domain is assumed to come with a name (an infinite number of names, technically). Given a schema or list of domain names R = {A1,…,An}, a relation on R is a subset of the Cartesian product A1 × … × An. The elements of the relation are called tuples; each tuple is made up of a list of values a1,…,an, with ai coming from domain Ai. A key K for relation r in schema R is a subset of R (i.e., a set of attributes) such that any two tuples in r are the same if they have the same value for K. Intuitively, the key represents (stands for) the whole tuple, a fact that is exploited in relational database design.
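As an informal illustration (my own sketch, not part of the standard model), a relation can be represented in Python as a list of tuples mapping attribute names to values, and a key constraint can be checked directly from the definition; all names and data here are illustrative:

```python
# A relation is a set of tuples; each tuple maps attribute names to values.
# A set of attributes K is a key if no two distinct tuples agree on K.

def satisfies_key(relation, key):
    """Check that `key` (an iterable of attribute names) is a key of `relation`."""
    seen = set()
    for tup in relation:
        k = tuple(tup[a] for a in key)
        if k in seen:          # two distinct tuples share the same key value
            return False
        seen.add(k)
    return True

person = [
    {"Ssn": 1, "Name": "Ann", "Age": 30},
    {"Ssn": 2, "Name": "Bob", "Age": 30},
]

print(satisfies_key(person, ["Ssn"]))   # True: Ssn identifies each tuple
print(satisfies_key(person, ["Age"]))   # False: two tuples share Age 30
```

Note that the check follows the definition above literally: a set of attributes fails to be a key as soon as two distinct tuples agree on it.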

The domains allowed in most implementations are data types that the computer can easily handle: different numerical types (integers, reals, etc.), characters, and strings. Most database systems also offer a "date" and a "time" domain, to facilitate the expression of temporal information, as well as large object types, frequently used to deal with multimedia data. However, no complex types are allowed. "Complex" here means, roughly, offering the ability to store more than one value. Because tuples will be used to represent entities, this means that attributes with multiple values (or many relationships among objects) will force an object to be represented over several tuples, perhaps even over several tables. This characteristic, called the first normal form, will become an issue later on in this article.

Figure 1. Entity-relationship diagram for the example

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

A relational-database schema is a set of relation schemas plus a set of integrity constraints. An integrity constraint is a condition specified over the database schema that restricts the data that can be stored in the relations. The most important constraints are the key constraint, which specifies that certain attribute(s) form a key in a relation, and the foreign key constraint, which specifies that certain attribute(s) form a foreign key in a relation. The foreign key attributes K1 have all their values drawn from some primary key K2; K1 is said to refer to K2. Foreign key constraints (also called referential integrity constraints) are the glue that holds the relations in a database together, by making sure that values in an attribute that need to refer to a certain entity do so. It is therefore necessary to specify, when talking about a relational database, which primary keys and integrity constraints are supposed to hold. Here, as an example, is how our model would be represented in a relational database; for simplicity, I use a stylized syntax, with the relation name first and the attributes as a list in parentheses. Each foreign key declaration follows the attribute name, and each relation is followed by a primary key declaration:

Person(Ssn, Name, Age); primary key: Ssn.
Professor(Ssn, Name, Age, Rank, Salary); primary key: Ssn.
Student(Ssn, Name, Age, GPA); primary key: Ssn.
Department(Name, Office); primary key: Name.
Dept-Phone(Name, Phone); primary key: (Name, Phone).
Faculty(Ssn1 foreign key refers to Professor, Name foreign key refers to Department); primary key: Ssn1.
Chairs(Ssn1 foreign key refers to Professor, Name foreign key refers to Department, Date); primary key: Ssn1.
Teaches(Ssn1 foreign key refers to Professor, Ssn2 foreign key refers to Student); primary key: (Ssn1, Ssn2).

Note that each entity gives rise to a table, and each relationship does, too. There is another option in the translation, affecting one-to-many and one-to-one relationships. A one-to-many relationship, such as Faculty, could be represented in the table corresponding to the entity on the one side (in this case, Professor), and so could one-to-one relationships (in fact, for one-to-one relationships there is a choice of tables). Many-to-many relationships need their own separate tables. Also, the multivalued attribute Phones cannot be added to the Department table for the reasons already cited (first normal form).

Because having several tuples repeating all department information, each with a different phone, would create redundancy, a separate table is created, with the department name representing the whole department (because Name is the primary key of Department). This is the process of normalization, on which relational database design is based. As for the class/subclass relation, there is no direct facility to capture it in relational databases; it must be simulated by one of two methods. Both methods have one table for the superclass and one for each of the subclasses. The first method puts, in the tables corresponding to the subclasses, only the attributes proper to the subclass; the second method combines, in the tables corresponding to the subclasses, the attributes of the subclass and the superclass. I have chosen the second option in the database above. Note that the table corresponding to the superclass is still needed in the second method, in case there are elements that are objects of the superclass only and not of any subclass.
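A foreign key constraint from the schema above, such as Faculty.Name referring to Department.Name, can be checked mechanically. The following Python sketch is my own illustration (the relation contents are invented for the example):

```python
# Hedged sketch: checking a referential integrity constraint by hand.
# Faculty(Ssn1, Name): Name must refer to the primary key Name of Department.

department = [{"Name": "CS", "Office": "A1"}, {"Name": "Math", "Office": "B2"}]
faculty = [{"Ssn1": 1, "Name": "CS"}, {"Ssn1": 2, "Name": "Math"}]

def satisfies_foreign_key(referencing, fk_attr, referenced, pk_attr):
    """Every foreign key value must appear as a primary key value."""
    pk_values = {t[pk_attr] for t in referenced}
    return all(t[fk_attr] in pk_values for t in referencing)

print(satisfies_foreign_key(faculty, "Name", department, "Name"))  # True
faculty.append({"Ssn1": 3, "Name": "History"})  # dangling reference
print(satisfies_foreign_key(faculty, "Name", department, "Name"))  # False
```

This is exactly the "glue" role described above: a dangling reference (a professor assigned to a nonexistent department) violates the constraint.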

MAIN THRUST: ADVANCED DATA MODELS

The Object-Oriented Data Model

There are many variations of the object-oriented data model. In this article, I use the Object Data Management Group (ODMG) model (Catell et al., 2000) and its Object Definition Language (ODL), because they constitute the standard data model for object-oriented databases.

The basic building blocks of the ODMG data model are objects and literals. A literal's value may be simple or complex. There are three types of literals: atomic, collection, and structured. Atomic literals correspond to basic data types: integers (long, short, unsigned), floating-point numbers (float, double), Boolean, single character, string, and enumeration types. Structured literals have a tuple structure; they include built-in types such as Date and Time, and the user can define further structured literals as needed, using a Struct construct. Collection literals specify a collection of objects or literals; the collection types are Set, Bag, List (all homogeneous), Array, and Dictionary, and each collection type has a group of built-in operators. Objects, on the other hand, have an object identifier (OID) and a value, unlike literals, which have a value but no OID. Objects may have a name and can be of atomic or collection type. Atomic objects are not necessarily objects without internal structure; they correspond to atomic or structured literals. For each object, properties (i.e., attributes and relationships) and operations are specified. Values of attributes are typically literals (atomic or complex) but can be OIDs. Values of relationships are always object names or a collection applied to object names.
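The distinction between objects (identity plus value) and literals (value only) loosely parallels identity versus equality in ordinary object-oriented languages. The following Python sketch is a hypothetical analogy of my own, not ODMG semantics:

```python
class Person:
    """An 'object' in the ODMG sense: it has identity independent of its value."""
    def __init__(self, name, age):
        self.name = name
        self.age = age

p = Person("Ann", 30)
oid = id(p)                    # Python's id() plays the role of an OID here
p.name, p.age = "Anne", 31     # every attribute value changes...
print(id(p) == oid)            # ...but it is still the same object -> True

# Literals, by contrast, are pure values: two equal tuples are indistinguishable.
print(("Ann", 30) == ("Ann", 30))   # True
```

An object survives arbitrary changes to its value because its identity lives in the OID; a literal that "changes" is simply a different literal.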

Object Definition Language (ODL) is used to implement the ODMG data model. In ODL, classes are declared by giving them an interface, using the keyword interface; an interface declares the structure of an object and is noninstantiable (i.e., there cannot be objects instantiating the interface). ODL uses the keyword class for instantiable declarations. Each class is a collection of attributes that are either primitive (i.e., of a basic data type) or relationships (i.e., their value is an object or set of objects of a given class). A key for a class can optionally be declared. Note that all objects have object identifiers, which are immutable, regardless of whether they have keys (whose values can be changed). Inheritance among classes is supported directly in the model and is declared using the keyword extends; however, only single inheritance is allowed. In the ODMG model, only binary relationships are explicitly represented, through a pair of inverse references (using the inverse keyword). An example of class declarations implementing the model in Figure 1 follows. Because only binary relationships without attributes can be directly represented in ODL, binary relationships with attributes and n-ary relationships (n > 2) must be represented as classes themselves, a process called reification.

class Person {
    attribute int ssn;
    attribute string name;
    attribute int age;
};

class Professor extends Person {
    attribute int rank;
    attribute float salary;
    relationship Department dept inverse Department::faculty;
    relationship set<Student> teaches inverse Student::taught;
    relationship Chairs chairing inverse Chairs::chairsp;
};

class Student extends Person {
    attribute float GPA;
    relationship Department department;
    relationship set<Professor> taught inverse Professor::teaches;
};

class Department {
    attribute string name;
    attribute string office;
    attribute set<string> phones;
    relationship Chairs chairperson inverse Chairs::chairsd;
    relationship set<Professor> faculty inverse Professor::dept;
};

class Chairs {
    attribute date Date;
    relationship Department chairsd inverse Department::chairperson;
    relationship Professor chairsp inverse Professor::chairing;
};

Note that only binary relationships without attributes can be nicely captured in the model. A binary relationship with attributes (Chairs) must be reified (i.e., converted to a class) in order to be represented. The same would happen to relationships that involved more than two entities. On the other hand, complex or multivalued attributes are not a problem; they can easily be represented using structures or sets, so that each entity in our ER model corresponds to a class.
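Reification can be mimicked in any object language: the relationship with attributes becomes a class holding one reference to each participant plus the relationship's own attributes. A Python sketch of my own (names and data are illustrative):

```python
class Professor:
    def __init__(self, name):
        self.name = name

class Department:
    def __init__(self, name):
        self.name = name

class Chairs:
    """Reified relationship: one Professor chairs one Department since `date`."""
    def __init__(self, professor, department, date):
        self.professor = professor    # reference to one participant
        self.department = department  # reference to the other participant
        self.date = date              # the relationship's own attribute

cs = Department("CS")
ann = Professor("Ann")
chairs = Chairs(ann, cs, "2004-08-15")
print(chairs.professor.name, chairs.department.name, chairs.date)
```

The same pattern extends to n-ary relationships: the reified class simply holds n references instead of two.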

The Object-Relational Data Model

The object-relational (also called extended-relational, or universal) data model is exemplified by the latest version of the Structured Query Language, SQL-99 (Melton, 1999). It can be seen as an attempt to capture as many of the object-oriented concepts introduced in the previous subsection as possible and wrap them in a relational shell. A more modest view regards the model as extending the base of the relational model (instead of the model itself) by making it easier to add more complex data types to serve as domain definitions. Here the basics of the standard (Eisenberg, Kulkarni, Michaels, Melton, & Zemke, 2004) are described, because each commercial DBMS has its own version of the model, sometimes with different names and different syntax.

One of the basic ideas is to replace domains with (possibly complex) types, called user-defined types (UDTs). The name comes from the fact that the model provides constructors so that users can define their own types as needed by an application; the emphasis is on extensibility. UDTs come in two kinds: distinct types and structured types. Distinct types are based on a single built-in data type. The following is an example of a UDT called age based on the built-in type integer:

CREATE TYPE age AS INTEGER (CHECK age BETWEEN 0 and 100) FINAL;

Distinct types are not compatible with any other type (including the one they are based on); hence, expressions such as age + 20 or age > 7 are illegal. However, the CAST operator can be used as a workaround: CAST(age AS INTEGER) + 20 is allowed. The keyword FINAL refers to whether a type can be extended with subtypes: distinct types cannot be, structured ones can. As an option, operations and comparisons can be defined for types. Distinct types are first-class: they can be used in a column declaration, an SQL variable, and so forth. Here is a table where the above declaration is used for a column:

CREATE TABLE person (
    Ssn INTEGER,
    Name CHARACTER VARYING(100),
    Age age)
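The behavior of a distinct type (no implicit mixing with its base type, an explicit CAST required, and a range check on creation) can be approximated in Python. This is my own sketch of the idea; in SQL the type checker enforces it, whereas here a wrapper class does:

```python
class Age:
    """Emulates the distinct type `age`: wraps an int but is not an int."""
    def __init__(self, value):
        if not 0 <= value <= 100:      # plays the role of the CHECK constraint
            raise ValueError("age out of range")
        self._value = value

    def cast_to_int(self):
        """Plays the role of CAST(age AS INTEGER)."""
        return self._value

a = Age(42)
# a + 20 would raise TypeError: the distinct type is not its base type.
print(a.cast_to_int() + 20)   # allowed only after the explicit cast -> 62
```

The point of such incompatibility is to prevent accidental mixing of semantically different quantities that happen to share a representation.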


Structured types can have internal structure, with parts called attributes. Attributes do not have to be built-in; they may be complex (SQL-99 offers the built-in structured types ARRAY and ROW), but they cannot be of the type being defined (i.e., recursion is not allowed). Structured types do not have identity. One cannot specify constraints on the attributes of a structured type. A definition can be destroyed with the command DROP TYPE; it can also be changed with the command ALTER TYPE, but there are restrictions on what can be changed. For instance, it is not allowed to add or delete attributes if the type being changed is a supertype, an element of another type, or a reference (see the following section for an explanation of references).

Structured types can be used as columns of tables and also as tuples of tables. For example, one could have defined a structured type as follows instead of the previous table:

CREATE TYPE person AS (
    Ssn INTEGER,
    Name CHARACTER VARYING(100),
    Age age)

And then use the type as a row type:

CREATE TABLE Persons OF person
    (REF IS person-id SYSTEM GENERATED)

(I will explain REF shortly). This yields a typed table, a table whose rows are of a structured type. The attributes of the type become attributes of the table.

Structured types can have methods defined on them. An observer and a mutator are methods automatically created for every attribute; they provide encapsulation. A creator method is also defined, which is invoked by NEW. Types may be NOT INSTANTIABLE (i.e., cannot have values of that type; this is used for abstract superclasses) or INSTANTIABLE.
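Observer and mutator methods correspond to getters and setters; Python's property mechanism gives a compact analogue of the encapsulation they provide. A sketch under my own naming, not SQL semantics:

```python
class PersonType:
    """Emulates a structured type whose attributes are encapsulated."""
    def __init__(self, name):
        self._name = name          # the stored attribute

    @property
    def name(self):                # observer: reads the attribute
        return self._name

    @name.setter
    def name(self, value):         # mutator: updates the attribute
        self._name = value

p = PersonType("Ann")              # plays the role of the creator / NEW
print(p.name)                      # observer call -> Ann
p.name = "Anne"                    # mutator call
print(p.name)                      # -> Anne
```

As in SQL-99, client code never touches the stored attribute directly; all access goes through the automatically supplied accessor pair.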

It is possible to define a hierarchy of types by declaring a new type UNDER another; (single) inheritance applies:

CREATE TYPE student UNDER person AS (
    GPA DECIMAL(3,2),
    Department REF(department))

Typed tables also have inheritance hierarchies, corresponding to the hierarchies of the types:

CREATE TABLE Students OF student UNDER Persons
    (Department WITH OPTIONS SCOPE Departments)

Subtables cannot have primary keys, but supertables can. Subtables can, however, have UNIQUE, NOT NULL, CHECK, and referential integrity constraints. There is also a self-reference value created automatically, unique for each row of the table: SQL-99 provides REFERENCE types, which give structured types an ID. Maximal supertables must specify the self-referencing column (it is inherited by subtables, which cannot specify such a column on their own). References can be generated by the system, by using some built-in type, or by using some attributes of the type. The declaration of table Students specifies that the reference (REF) to a department in each student row must be a valid reference to a department within the table Departments, which is a table of the department type.

SQL-99 also introduces typed views, which can likewise form hierarchies.

I now show the rest of the declarations needed to complete the modeling of Figure 1 in the object-relational data model:

CREATE TYPE Professor UNDER person AS (
    Rank INTEGER,
    Salary DECIMAL(5,2),
    Department REF(department),
    Chairs REF(department))

CREATE TYPE department AS (
    Name CHARACTER VARYING(100),
    Office CHARACTER VARYING(100),
    Phones CHARACTER(10) ARRAY[5],
    Chairperson REF(Professor))

CREATE TABLE Faculty OF Professor UNDER Persons
    (Chairs WITH OPTIONS SCOPE Departments)

CREATE TABLE Departments OF department
    (REF IS dept-id SYSTEM GENERATED,
     Chairperson WITH OPTIONS SCOPE Faculty)

CREATE TABLE Teaches (
    Teacher REF(Professor) SCOPE Faculty,
    Pupil REF(Student) SCOPE Students)

A few observations are in order. First, we could have created tables of persons (or faculty, or students, or departments) without first declaring a type, by giving each row in the table the appropriate attributes (as in the first example of table person); this results in tables similar to those in the original relational model. However, that option would not have allowed us to model type/subtype relationships and to have inheritance. Second, binary relationships that are one-to-one (or one-to-many) can be represented through references, similar to ODL. In this case, we add the SCOPE clause to limit references to objects appearing in a table. However, there is no concept of inverse here; it is up to the user to make sure that the references in, for instance, Chairs and Chairperson are consistent. Note that there is no need for both relations; having Chairs in Professor would be enough to establish the relationship, and indeed that is how it would usually be modeled; Chairperson has been added to Department for illustration only. (Note, however, that finding the chairperson of a given department could be difficult and costly without this link.) Finally, note that many-to-many relationships (and n-ary relationships, n > 2) must be modeled in a manner similar to the original relational model, the only difference being that references (instead of primary keys and foreign keys) can be used to refer to the original objects.
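Because SQL-99 has no inverse clause, keeping Chairs and Chairperson mutually consistent is the application's job. The following Python sketch of that invariant check is hypothetical (identifiers stand in for REF values):

```python
# Each professor may chair at most one department; each department records
# its chairperson. The two reference columns must be inverses of each other.

professors = {"p1": {"chairs": "d1"}, "p2": {"chairs": None}}
departments = {"d1": {"chairperson": "p1"}}

def references_consistent(professors, departments):
    """Chairs and Chairperson must agree in both directions."""
    for pid, p in professors.items():
        d = p["chairs"]
        if d is not None and departments[d]["chairperson"] != pid:
            return False
    for did, dept in departments.items():
        pid = dept["chairperson"]
        if pid is not None and professors[pid]["chairs"] != did:
            return False
    return True

print(references_consistent(professors, departments))   # True
departments["d1"]["chairperson"] = "p2"                 # now inconsistent
print(references_consistent(professors, departments))   # False
```

In ODL, the inverse declaration makes the system maintain this invariant automatically; in the object-relational model, a check like this must be coded by hand (e.g., in triggers or application logic).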

Comparison

For a meaningful comparison among the models introduced, I focus on the fundamental aspect of a data model: modeling power. How does each model deal with the basic constructs of most conceptual models (entities or classes, and relationships among them)? Is there something that is expressible in one model and not in the others? How easy (or intuitive, or natural) is it to model complex situations?

The previous analysis makes it possible to answer these questions. In theory, there are situations that can be modeled in the object-oriented data model and that do not have a good representation in the relational data model (although they can be modeled with some ingenuity). In particular, set-valued attributes are tricky to handle in the relational model and lead to the dissemination of related information over several tables; also, type/subtype relations and inheritance must be simulated. In practice, however, the limitations of object-oriented models in dealing with relationships (especially relationships with attributes and relationships that are many-to-many) mean that neither model can claim a significant advantage. The object-relational model, offering some of the abilities of object-oriented models (inheritance, identifiers) and of relational models (clear modeling of relationships), may claim a small advantage over both of them. However, the model as defined by the SQL standard (Melton, 1999) still lacks the ability to handle sets well. The latest iteration of the standard seems to remove this hurdle, in which case the object-relational model could become the winner in terms of modeling power. However, no model is able to handle semistructured or irregular data very well (Abiteboul, Buneman, & Suciu, 1999; Mani & Badia, 2003).

FUTURE TRENDS

New developments in the area of data models have been directed toward capturing data in semistructured form (XML; Abiteboul, Buneman, & Suciu, 1999). Integration of XML with relational data will continue to attract researchers in coming years, because at present such integration is not well defined. It is interesting to point out that this issue has revived interest in nested relations, an old data model (Paredaens, De Bra, Gyssens, & Van Gucht, 1989) that continues to see new uses (Lee, Mani, Chiu, & Chu, 2001) because it is a hierarchical model well suited to representing XML data. Another area of interest will be the integration of unstructured information (text), since it is acknowledged that documents (e-mails, memos, technical reports, etc.) are a very important source of information that is currently outside the realm of databases. Current efforts in this direction are based on information retrieval techniques (Baeza-Yates & Ribeiro-Neto, 1999), but this approach has limitations that will keep researchers looking for other approaches. Finally, integration of multimedia data (e.g., video, audio, images), which was a big driving force behind the development of extensions to the relational model in the first place, is still an open area of inquiry. Trends like geographic information systems (GIS; Rigaux, Scholl, & Voisard, 2001) that rely heavily on such data will continue to fuel research in this area.

CONCLUSION

I have presented the basic characteristics of the (pure) relational, object-oriented, and object-relational data models. By showing them side by side, with a common example, it is possible to compare the models easily, a comparison that is absent from the literature. Note that the focus is on conceptual aspects of the models, not on query languages, implementation, or performance issues. The comparison shows that it is very difficult to establish a clear winner; each model may be better for a particular type of data. For data modeling capabilities, the object-relational model may offer somewhat more flexibility than the other two. This is not unexpected, given that this model was created in an attempt to integrate the best of the relational and object-oriented paradigms.


REFERENCES

Abiteboul, S., Buneman, P., & Suciu, D. (1999). Data on the Web: From relations to semistructured data and XML. San Francisco: Morgan Kaufmann.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: ACM Press.

Catell, R. G. (1994). Object data management. Reading, MA: Addison-Wesley.

Catell, R. G., Barry, D., Berler, M., Eastman, J., Jordan, D., Rusell, C., Schadow, O., et al. (2000). The object data standard: ODMG 3.0. San Diego, CA: Academic Press.

Chen, P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1).

Date, C. J. (2004). An introduction to database systems. Reading, MA: Addison-Wesley.

Eisenberg, J., Kulkarni, K., Michaels, J.-E., Melton, J., & Zemke, F. (2004, March). SQL:2003 has been published. SIGMOD Record, 33(1), 119-126.

Lee, D., Mani, M., Chiu, F., & Chu, W. (2001). Nesting-based relational-to-XML schema translation. In Proceedings of the 4th International Workshop on the Web and Databases (WebDB 2001), Santa Barbara, CA, May 24-25 (informal proceedings, in conjunction with ACM PODS/SIGMOD 2001).

Mani, M., & Badia, A. (2003, October). Data modeling using XML. Tutorial at the 22nd International Conference on Conceptual Modeling, Lecture Notes in Computer Science 2813. New York: Springer.

Melton, J., & Simon, A. R. (2002). SQL:1999: Understanding relational language components. San Francisco: Morgan Kaufmann.

Paredaens, J., De Bra, P., Gyssens, M., & Van Gucht, D. (1989). The structure of the relational database model. New York: Springer.

Rigaux, P., Scholl, M., & Voisard, A. (2001). Spatial databases: With application to GIS. San Francisco: Morgan Kaufmann.

KEY TERMS


Inheritance: A special relation between two classes of objects; class A inherits from class B when A is considered to possess all the characteristics of B (and possibly more) simply because it is a subclass of B. Models that support inheritance explicitly allow economical expression of facts, because declaring A as a subclass of B implies inheritance in such models.

Object Identifier: In the object-oriented data model, each object is given an object identifier (OID). The importance of OIDs is that they make the model work by reference and not by value (i.e., if an object changes the values of all its attributes, it is still the same object because of its OID). In the relational world, an object that changes the value of one attribute is a different object. Unlike keys, OIDs are immutable.

Relation: From a type point of view, a relation is simply a set of tuples. In most implementations of the relational model, however, multisets (sets with repetitions) are used.

Set: From a type point of view, a set is a collection construct that is homogeneous (i.e., it contains objects of the same type) and unbounded (i.e., it has no maximum or fixed size). Sets can only be handled as relations in the relational model, whereas the set constructor in the object-oriented data model provides a much higher degree of flexibility in organizing information.

User-Defined Type (UDT): Any type defined through the use of some basic constructors, such as CREATE TYPE. The object-relational model provides UDTs in an attempt to make the system more customizable for applications with particular requirements for data representation.


Repairing and Querying Databases with Integrity Constraints

Sergio Greco

DEIS Università della Calabria, Italy

Ester Zumpano

DEIS Università della Calabria, Italy

INTRODUCTION

Data integration aims at providing a uniform integrated access to multiple heterogeneous information sources, which were designed independently for autonomous applications and whose contents are strictly related.

There are several ways to integrate databases or possibly distributed information sources, but whatever integration architecture we choose, the heterogeneity of the sources to be integrated causes subtle problems. In particular, the database obtained from the integration process may be inconsistent with respect to integrity constraints; that is, one or more integrity constraints are not satisfied. The following example shows a case of inconsistency.

Example 1. Consider the database schema consisting of the single binary relation Teaches(Course, Professor), where the attribute Course is a key for the relation. Assume there are two different instances of the relation Teaches, D1={(c1,p1),(c2,p2)} and D2={(c1,p1),(c2,p3)}.

The two instances satisfy the constraint that Course is a key, but from their union we derive a relation that does not satisfy the constraint, since there are two distinct tuples with the same value for the attribute Course.

Obtaining consistent information from inconsistent databases is a primary issue in database management systems. In the integration of two conflicting databases, simple solutions could be based on the definition of preference criteria, such as a partial order on the source information or a majority criterion (Lin & Mendelzon, 1996). However, these solutions are not generally satisfactory, and more useful solutions are those based on (1) the computation of repairs for the database and (2) the computation of consistent answers (Arenas, Bertossi & Chomicki, 1999). The computation of repairs is based on the definition of minimal sets of insertion and deletion operations such that the resulting database satisfies all constraints. The computation of consistent answers is based on the identification of tuples satisfying the integrity constraints and on the selection of tuples matching the goal. For instance, for the integrated database of Example 1, we have two alternative repairs, consisting in the deletion of one of the tuples (c2,p2) and (c2,p3). The consistent answer to a query over the relation Teaches contains the unique tuple (c1,p1), so we do not know which professor teaches course c2.

Therefore, it is very important, in the presence of inconsistent data, to compute the set of consistent answers, but also to know which facts are unknown and if there are possible repairs for the database.
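For key constraints, repairs and consistent answers can be computed by brute force on small instances. The following Python sketch encodes Example 1 in my own way (it is an illustration of the idea, not one of the techniques surveyed below):

```python
from itertools import product

# Union of D1 and D2; Course is the key of Teaches(Course, Professor).
teaches = {("c1", "p1"), ("c2", "p2"), ("c2", "p3")}

def repairs(relation):
    """All minimal ways to delete tuples so the key (first column) holds."""
    groups = {}
    for t in relation:
        groups.setdefault(t[0], []).append(t)
    # pick exactly one representative tuple per key value
    return [set(choice) for choice in product(*groups.values())]

rs = repairs(teaches)
print(len(rs))                       # 2: keep (c2,p2) or keep (c2,p3)
consistent = set.intersection(*rs)   # tuples true in every repair
print(consistent)                    # {('c1', 'p1')}
```

The consistent answer is exactly the set of tuples surviving in all repairs: (c1,p1) is certain, while the professor of c2 remains unknown.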

BACKGROUND

A (disjunctive Datalog) rule r is a clause of the form:

A1 ∨ … ∨ Ak ← B1, ..., Bm, not Bm+1, …, not Bn,    k+m+n > 0

where A1,…, Ak, B1,…, Bn are atoms of the form p(t1,..., th), p is a predicate symbol of arity h, and the terms t1,..., th are constants or variables. The disjunction A1 ∨ ... ∨ Ak is the head of r, while the conjunction B1,…, Bm, not Bm+1,…, not Bn is the body of r. We also assume the existence of binary built-in predicate symbols (comparison operators), which can only be used in the body of rules.

An extended Datalog program extends standard Datalog programs with a different form of negation, known as classical or strong negation, which can also appear in the head of rules. An extended atom is either an atom, say A, or its negation ¬A. An extended Datalog program is a set of rules of the form:

A1 ∨ … ∨ Ak ← B1, ..., Bm, not Bm+1, …, not Bn,    k+m+n > 0

where A1,…, Ak, B1,…, Bn are extended atoms.

The semantics of an extended program P is defined by considering each negated predicate symbol, say ¬p, as a new symbol syntactically different from p and by adding to the program, for each predicate symbol p with arity n, the constraint ← p(X1,...,Xn), ¬p(X1,...,Xn).

INTEGRITY CONSTRAINTS

Integrity constraints express semantic information over data, that is, relationships that must hold among data in the theory. Generally, integrity constraints, denoted as IC, represent the interaction among data and define properties which are supposed to be explicitly satisfied by all instances over a given database schema. Therefore, they are mainly used to validate database transactions.

Definition 1. A full (or universal) integrity constraint is a formula of first-order predicate calculus of the form:

(∀ X) [ B1 ∧ ... ∧ Bn ∧ ϕ ⊃ A1 ∨ ... ∨ Am ∨ ψ1 ∨ ... ∨ ψk ]

where A1, ..., Am, B1, ..., Bn are positive base literals; ϕ, ψ1, ..., ψk are built-in literals; X denotes the list of all variables appearing in B1,...,Bn; and it is assumed that variables appearing in A1,..., Am, ϕ, ψ1, ..., ψk also appear in B1,...,Bn.

In the definition above, the conjunction B1 ∧ ... ∧ Bn ∧ ϕ is called the body and the disjunction A1 ∨ ... ∨ Am ∨ ψ1 ∨ ... ∨ ψk the head of the integrity constraint. Moreover, an integrity constraint is said to be positive if no negated literals occur in it.

TECHNIQUES FOR QUERYING AND REPAIRING DATABASES

Recently, there have been several proposals considering the integration of databases as well as the computation of queries over inconsistent databases. Most of the techniques work for restricted forms of constraints, and only recently have there been proposals considering more general constraints. In the following, we give an informal description of the main techniques proposed in the literature.

Agarwal et al. (1995) propose an extension of relational algebra, called flexible algebra, to deal with data having tuples with the same value for the key attributes and conflicting values for the other attributes. The technique only considers constraints defining functional dependencies, and it is sound only for the class of databases whose dependencies are determined by a primary key consisting of a single attribute.
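The key-based conflicts that the flexible algebra targets can be illustrated with a small sketch (relation and attribute names are invented): tuples are grouped on a single-attribute primary key, and keys whose non-key attributes disagree across tuples are reported as conflicting:

```python
from collections import defaultdict

def key_conflicts(tuples):
    """Group tuples by their first (key) attribute and report the keys
    whose remaining attributes disagree across tuples."""
    groups = defaultdict(set)
    for t in tuples:
        groups[t[0]].add(t[1:])
    return {key for key, rest in groups.items() if len(rest) > 1}

r = [("e1", "sales", 100), ("e1", "sales", 200), ("e2", "hr", 300)]
print(key_conflicts(r))  # {'e1'}
```

The flexible algebra then reasons over such conflicting groups instead of discarding them.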


Dung (1996) proposes the Integrated Relational Calculus, an extension of the flexible algebra to other key functional dependencies, based on the definition of maximal consistent subsets of a possibly inconsistent database. Dung proposes extending relations by also considering null values denoting the absence of information, with the restriction that tuples cannot have null values for the key attributes. The Integrated Relational Calculus overcomes some drawbacks of the flexible relational algebra. As both techniques consider restricted cases, the computation of answers can be done efficiently.

Lin and Mendelzon (1996) propose an approach that takes into account the majority view of the knowledge bases in order to obtain a new relation consistent with the integrity constraints. The technique proposes a formal semantics for merging first-order theories under a set of constraints.

Example 2. Consider the following three relation instances which collect information regarding author, title, and year of publication of papers:

Bib1={(John,T1,1980),(Mary,T2,1990)},

Bib2={(John,T1,1981),(Mary,T2,1990)},

Bib3={(John,T1,1981), (Frank,T3,1990)}

From the integration of the three databases Bib1, Bib2, and Bib3, we obtain the database Bib = {(John, T1, 1981), (Mary, T2, 1990), (Frank, T3, 1990)}.

Thus, the technique proposed by Lin and Mendelzon resolves the conflict about the year of publication of the paper T1 written by the author John by observing that two of the three source databases to be integrated store the value 1981; the information that is maintained is the one present in the majority of the knowledge bases.

However, the “merging by majority” technique does not resolve conflicts in all cases since information is not always present in the majority of the databases, and therefore, it is not always possible to choose between alternative values. Thus, generally, the technique stores disjunctive information, and this makes the computation of answers more complex (although the computation becomes efficient if the “merging by majority” technique can be applied); moreover, the use of the majority criteria involves discarding inconsistent data, hence, the loss of potentially useful information.
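Under the assumption that a strict majority exists for each key, the merge of Example 2 can be sketched as follows (an illustrative sketch, not Lin and Mendelzon's formal construction; ties would require the disjunctive treatment just discussed):

```python
from collections import Counter, defaultdict

def merge_by_majority(*sources):
    """Merge (author, title, year) relations: for each (author, title)
    key, keep the year stored by the majority of the sources."""
    votes = defaultdict(Counter)
    for src in sources:
        for author, title, year in src:
            votes[(author, title)][year] += 1
    return {(a, t, years.most_common(1)[0][0])
            for (a, t), years in votes.items()}

bib1 = {("John", "T1", 1980), ("Mary", "T2", 1990)}
bib2 = {("John", "T1", 1981), ("Mary", "T2", 1990)}
bib3 = {("John", "T1", 1981), ("Frank", "T3", 1990)}
print(merge_by_majority(bib1, bib2, bib3))
```

For the paper T1, the year 1981 wins with two votes out of three, reproducing the merged database of Example 2.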




Arenas et al. (1999) introduce a logical characterization of the notion of consistent answer in a possibly inconsistent database. The technique is based on the computation of an equivalent query Tω(Q) derived from the source query Q. The definition of Tω(Q) is based on the notion of residue developed in the context of semantic query optimization.

More specifically, for each literal B appearing in some integrity constraint, a residue Res(B) is computed. Intuitively, Res(B) is a universally quantified first order formula which must be true, because of the constraints, whenever B is true. Universal constraints can be rewritten as denials, that is, logic rules with empty heads of the form ← B1 ∧ … ∧ Bn.

Let A be a literal, r a denial of the form ← B1 ∧ … ∧ Bn, Bi (for some 1 ≤ i ≤ n) a literal unifying with A, and θ the most general unifier for A and Bi such that variables in A are used to substitute variables in Bi but are not substituted by other variables. Then, the residue of A with respect to r and Bi is:

Res(A,r,Bi) = not( (B1 ∧ … ∧ Bi-1 ∧ Bi+1 ∧ … ∧ Bn)θ )
            = not B1θ ∨ … ∨ not Bi-1θ ∨ not Bi+1θ ∨ ... ∨ not Bnθ.

The residue of A with respect to r is Res(A,r) = ∧Bi | A = Biθ Res(A,r,Bi), consisting of the conjunction of all the possible residues of A in r, whereas the residue of A with respect to a set of integrity constraints IC is Res(A) = ∧r∈IC Res(A,r).

Thus, the residue of a literal A is a first order formula which must be true if A is true. The operator Tω(Q) is defined as follows:

T0(Q) = Q;

Ti(Q) = Ti-1(Q) ∧ R, where R is a residue of some literal in Ti-1(Q).

The operator Tω represents the fixpoint of T.

Example 3. Consider a database D consisting of the following two relations:

Supply
Supplier  Department  Item
c1        d1          i1
c2        d2          i2

Class
Item  Type
i1    t
i2    t

and the integrity constraint defined by the following first order formula:

(∀ X, Y, Z) [ Supply(X,Y,Z) ∧ Class(Z,t) ⊃ X = c1 ]

stating that only supplier c1 can supply items of type t.

The database D = { Supply(c1, d1, i1), Supply(c2, d2, i2), Class(i1, t), Class(i2, t) } is inconsistent because the integrity constraint is not satisfied (an item of type t is also supplied by supplier c2).

This constraint can be rewritten as the denial:

← Supply(X,Y,Z) ∧ Class(Z,t) ∧ X ≠ c1

where all variables are (implicitly) universally quantified. The residues of the literals appearing in the constraint are:

Res(Supply(X,Y,Z)) = not Class(Z,t) ∨ X = c1
Res(Class(Z,t)) = not Supply(X,Y,Z) ∨ X = c1

The iteration of the operator T on the query goal Class(Z,t) gives:

T0(Class(Z,t)) = Class(Z,t),
T1(Class(Z,t)) = Class(Z,t) ∧ (not Supply(X,Y,Z) ∨ X = c1),
T2(Class(Z,t)) = Class(Z,t) ∧ (not Supply(X,Y,Z) ∨ X = c1).

At Step 2, a fixpoint is reached since the literal Class(Z,t) has already been “expanded” and the literal not Supply(X,Y,Z) has no residue associated with it. Thus, to answer the query Q = Class(Z,t) under the above integrity constraint, the query Tω(Q) = Class(Z,t) ∧ (not Supply(X,Y,Z) ∨ X = c1) is evaluated. The computation of Tω(Q) over the above database gives the result Z = i1.
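The evaluation of the rewritten query can be sketched concretely. The following snippet (an illustrative sketch, not the authors' implementation) evaluates Tω(Q) = Class(Z,t) ∧ (not Supply(X,Y,Z) ∨ X = c1) over the database of Example 3; since X and Y are universally quantified in the residue, an item Z of type t is kept only if every supplier of Z is c1:

```python
supply = {("c1", "d1", "i1"), ("c2", "d2", "i2")}
cls = {("i1", "t"), ("i2", "t")}

def consistent_answers():
    """Evaluate Class(Z, t) AND (forall X, Y: not Supply(X, Y, Z) OR X = c1):
    keep an item Z of type t only if all of its suppliers are c1."""
    return {z for (z, ty) in cls if ty == "t"
            and all(x == "c1" for (x, y, z2) in supply if z2 == z)}

print(consistent_answers())  # {'i1'}
```

Only i1 survives, matching the consistent answer Z = i1 computed in the example.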

This technique, more general than the previous ones, has been shown to be complete for universal binary integrity constraints and universally quantified queries. However, the rewriting of queries is complex since the termination conditions are not easy to detect, and the computation of answers is generally not guaranteed to be polynomial.

Arenas, Bertossi, and Chomicki (2000) propose an approach based on Logic Programs with Exceptions (LPe) for obtaining consistent query answers. An LPe is a program with the syntax of an extended logic program (ELP); that is, it may contain both logical (or strong) negation (¬) and procedural negation (not). In such a program, rules with a positive literal in the head represent a sort of general default, whereas rules with a logically negated head represent exceptions. The semantics of an LPe is obtained from the semantics for ELPs by adding extra conditions that assign higher priority to exceptions. The method, given a set of integrity constraints (ICs) and an inconsistent database instance, consists in the direct specification of database repairs in a logic programming formalism. The resulting program will have both negative and positive exceptions, strong and procedural negations, and disjunctions of literals in the head of some of the clauses; that is, it will be a disjunctive extended logic program with exceptions. As in Arenas et al. (1999), the method considers a set of integrity constraints written in the standard format:

∨i=1..n Pi(xi) ∨ ∨i=1..m (¬Qi(yi)) ∨ ϕ

where ϕ is a formula containing only built-in predicates and there is an implicit universal quantification in front. The method specifies the repairs of a database D that violates IC by means of a logic program with exceptions, ΠD. In ΠD, for each predicate P a new predicate P’ is introduced, and each occurrence of P is replaced by P’. More specifically, ΠD is obtained by introducing:

Persistence Defaults. For each base predicate P, the method introduces the persistence defaults:

P’(x) ← P(x),   ¬P’(x) ← not P(x).

The predicate P’ is the repaired version of the predicate P, so it contains the tuples corresponding to P in a repair of the original database.

Stabilizing Exceptions. From each IC and for each negative literal not Qi0 in IC, the negative exception clause is introduced:

¬Q’i0(yi0) ← ∧i=1..n ¬P’i(xi), ∧i≠i0 Q’i(yi), ϕ’

where ϕ’ is a formula logically equivalent to the logical negation of ϕ. Similarly, for each positive literal Pi1 in the constraint, the positive exception clause:

P’i1(xi1) ← ∧i≠i1 ¬P’i(xi), ∧i=1..m Q’i(yi), ϕ’

is generated. The meaning of the stabilizing exceptions is to make the ICs satisfied by the new predicates. These exceptions are necessary but not sufficient to ensure that the changes the original predicates should undergo, in order to restore consistency, are propagated to the new predicates.

Triggering Exceptions. From the IC in standard form, the disjunctive exception clause:

∨i=1..n P’i(xi) ∨ ∨i=1..m ¬Q’i(yi) ← ∧i=1..n not Pi(xi), ∧i=1..m Qi(yi), ϕ’

is produced.

The program ΠD constructed as shown above is a “disjunctive extended repair logic program with exceptions for the database instance D”. In ΠD positive defaults are blocked by negative conclusions, and negative defaults by positive conclusions.

Example 4. Consider the database D = {p(a), q(b)} with the inclusion dependency ID: p(X) ⊃ q(X).

In order to specify the database repairs, the new predicates p’ and q’ are introduced. The resulting repair program has four default rules expressing that p’ and q’ contain exactly what p and q contain, respectively:

p’(x) ← p(x);   q’(x) ← q(x);

¬p’(x) ← not p(x);   ¬q’(x) ← not q(x);

two stabilizing exceptions:

q’(x) ← p’(x);   ¬p’(x) ← ¬q’(x);

and the triggering exception:

¬p’(x) ∨ q’(x) ← p(x), not q(x).

The answer sets are { p(a), q(b), ¬p’(a), q’(b) } and { p(a), q(b), p’(a), q’(a), q’(b) }, corresponding to the two expected database repairs: deleting p(a) or inserting q(a).
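The two repairs of Example 4 can also be recovered by brute force: enumerate candidate databases over the active domain, keep those satisfying p(X) ⊃ q(X), and retain the ones whose difference from the original database is minimal under set inclusion (an illustrative sketch, feasible only for tiny instances):

```python
from itertools import combinations

def repairs(db, domain):
    """Return the databases satisfying p(X) -> q(X) whose symmetric
    difference with db is minimal under set inclusion."""
    facts = [(p, c) for p in ("p", "q") for c in domain]
    candidates = []
    for r in range(len(facts) + 1):
        for subset in combinations(facts, r):
            d = set(subset)
            # constraint check: every p(c) must be accompanied by q(c)
            if all(("q", c) in d for (pred, c) in d if pred == "p"):
                candidates.append(d)
    def delta(d):  # set of insertions/deletions relative to db
        return d ^ db
    return [d for d in candidates
            if not any(delta(e) < delta(d) for e in candidates)]

db = {("p", "a"), ("q", "b")}
for rep in repairs(db, ("a", "b")):
    print(sorted(rep))
```

The two minimal repairs found are {q(b)} and {p(a), q(a), q(b)}, mirroring the answer sets of the repair program.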

The method can be applied to a set of domain-independent binary integrity constraints IC; that is, constraints that can be checked for satisfaction by looking at the active domain and that contain at most two literals each.

In Greco and Zumpano (2000), a general framework for computing repairs and consistent answers over inconsistent databases, under constraints with universally quantified variables, was proposed. The technique is based on the rewriting of constraints into extended disjunctive rules with two different forms of negation (negation as failure and classical negation). The disjunctive program can be used for two different purposes: to compute repairs for the database and to produce consistent answers, that is, a maximal set of atoms which do not violate the constraints. The technique is sound and complete


