Rivero, L. Encyclopedia of Database Technologies and Applications (2006).


High Quality Conceptual Scheme

KEY TERMS

Conceptual Modeling: Process of forming and collecting conceptual knowledge about the Universe of Discourse (UoD), and documenting the results in the form of a conceptual schema.

Conceptual Schema: A completely or partially time-independent description of a portion of the (real or postulated) world, in the sense that a conceptual schema contains the definition of all concepts, and all relationships between concepts, allowed to be used in the description of that portion of the world.

Entity: Anything that can be conceived as real and can be named.

Explication: The act of making clear or removing obscurity from the meaning of a word, symbol, expression, and such.

Modellens: The constructed model, which may consist of different kinds of representation, such as sentences, figures, or letters.

Modeller: The modeling subject, the designer, a student, a group of agents, a camera, and so forth.

Modellum: The object to be modeled, the entity of interest, the universe of discourse, and so forth.

Ontology: In philosophy, the theory of being. In information systems design, ontology defines the terms used to describe and represent some area of knowledge.

Pragmatics: A subfield of linguistics that studies how people comprehend and produce communicative acts in concrete situations.

Universe of Discourse (UoD): A collection of all those entities that have been, are, or ever might be in a selected portion of the real world or the postulated world.


 


 

The Information Quality of Databases

 

 

 


 

 

 

 

 

InduShobha Chengalur-Smith

University at Albany, USA

M. Pamela Neely

Rochester Institute of Technology, USA

Thomas Tribunella

Rochester Institute of Technology, USA

INTRODUCTION

A database is only as good as the data in it. Transaction-processing systems and decision-support systems provide data for strategic planning and operations. Thus, it is important not only to recognize the quality of information in databases but also to deal with it. Research in information quality has addressed the issues of defining, measuring, and improving quality in databases; commercial tools have been created to automate the process of cleaning and correcting data; and new laws have been created that deal with the fallout of poor information quality. These issues are discussed in the following sections.

BACKGROUND

Data or information quality began to achieve prominence during the total quality management movement in the 1980s. However, it has relevance to individual decision making as well as to organizational data. With the advent of the World Wide Web, the concept of information overload became universal, and the notion of information quality had instant appeal. As is typical in an emerging field, there are several research themes in the information quality literature. In this section, we highlight topics in some of the major research areas.

Although data quality is now widely perceived as critical, there is less agreement on exactly what is meant by high-quality information. Clearly it is a multidimensional construct whose measure is very much context dependent. A substantial amount of work has been devoted to identifying the dimensions of data quality and their interrelationships. Early work on the dimensions of data quality looked at pairs of attributes, such as accuracy and timeliness, and the trade-offs between them (e.g., Ballou & Pazer, 1995).

A more comprehensive study that attempted to capture all the dimensions of information quality was conducted by Wang and Strong (1996). They conducted a two-stage survey and identified about 180 attributes of data quality, which they combined into 15 distinct dimensions. These dimensions were further organized into four categories: intrinsic quality, contextual quality, representation, and accessibility.

There have been several follow-on studies that focus on subsets of these dimensions. For instance, Lee and Strong (2003) examined five of these dimensions and concluded that when data collectors know why the data is being collected, it leads to better quality data. Another avenue for research has been in developing formulae and metrics for each of these dimensions (e.g., Pipino, Lee, & Wang, 2002).

Yet another branch of research in information quality considers information to be a product of an information manufacturing process (Ballou, Wang, Pazer, & Tayi, 1998). Thus, the output of an information system can be tracked much like a manufacturing product, allowing for the application of quality-control procedures to improve the quality of information products. This concept has been elaborated on by Shankaranarayan, Ziad, and Wang (2003) by creating an information product map that represents the flow and sequence of data elements that form the information product.

MAIN FOCUS

Wang, Lee, Pipino, and Strong (1998) listed the steps an organization must take to successfully manage information as a product, which has led to the development of several software tools and methods for modeling and measuring data quality.

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

Not long ago, the majority of the software tools that fell into the data quality area depended on comparing data in databases to something else (Neely, 1998). Broadly defined, data quality tools provide three functions: they audit the data at the source; clean and transform the data in the staging area; and monitor the extraction, transformation, and loading process. Some of the tools are an extension of data validation rules that should be a part of the database design but are frequently ignored in legacy systems. These tools, commonly referred to as auditing tools, involve comparing the data to a known set of parameters, which might be a minimum value, a set of dates, or a list of values. Once the data is run through the tool, results are routinely examined manually, a very time- and labor-consuming process.
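The comparison step these auditing tools perform can be sketched as a set of per-field rules; the rule set, field names, and records below are invented for the example:

```python
# Minimal sketch of a rule-based auditing tool: each field is compared to a
# known set of parameters (a minimum value, a date window, or a list of
# values). The rules, field names, and records are invented for the example.
from datetime import date

def audit(records, rules):
    """Return (record_index, field, offending_value) for every violation."""
    violations = []
    for i, rec in enumerate(records):
        for field, check in rules.items():
            if not check(rec[field]):
                violations.append((i, field, rec[field]))
    return violations

rules = {
    "salary": lambda v: v >= 0,                                     # minimum value
    "hired":  lambda v: date(1990, 1, 1) <= v <= date(2006, 1, 1),  # date window
    "dept":   lambda v: v in {"HR", "IT", "SALES"},                 # list of values
}

records = [
    {"salary": 52000, "hired": date(2001, 5, 3), "dept": "IT"},
    {"salary": -10,   "hired": date(2001, 5, 3), "dept": "OPS"},
]

print(audit(records, rules))   # [(1, 'salary', -10), (1, 'dept', 'OPS')]
```

As the text notes, the resulting violation list still has to be examined manually.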

Another category of tools, data cleansing tools, originally began as name and address tools. The core functionality of these tools is the ability to parse, standardize, correct, verify, match, and transform data. Ultimately, these tools are designed to produce accurate lists that can be used or sold.
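The parse and standardize steps of such a cleansing tool can be sketched as follows; the abbreviation table and sample addresses are invented, and real tools ship with large reference dictionaries:

```python
# Minimal sketch of the parse and standardize steps of a name-and-address
# cleansing tool: split a free-form line into components and normalise
# common street abbreviations. The abbreviation table is illustrative only.
STREET_ABBREV = {"st": "Street", "ave": "Avenue", "rd": "Road"}

def standardize_address(raw):
    """Parse a free-form address line and normalise its components."""
    out = []
    for part in raw.replace(",", " ").split():
        key = part.lower().rstrip(".")
        out.append(STREET_ABBREV.get(key, part.title()))
    return " ".join(out)

print(standardize_address("12 main st."))       # 12 Main Street
print(standardize_address("9 elm ave, apt 4"))  # 9 Elm Avenue Apt 4
```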

Acquisitions and mergers have consolidated what was once a fragmented set of software tools that addressed specific roles in the data-quality process into integrated tools that can be used throughout the process of arriving at high-quality data. Companies such as Trillium Software, which once provided tools focusing on the audit function, have now expanded their functionality into the cleansing arena. Ascential Software acquired Vality, primarily a data cleansing tool, and Ardent, primarily an auditing tool, providing it with the ability to monitor the entire process. Some of the earliest techniques in automating the process involved data mining: discovering patterns in the data and reverse-engineering the process by suggesting business rules. WizRule took an early lead in using this technology.

Data profiling is being heralded as the new generation of data quality tools. Tools that provide this functionality, such as Data Quality Suite, claim that the inherent problem with auditing data is that it prevents the user from seeing the big picture. The focus is on the data elements, whereas with data profiling, the focus is on what the data should look like, based on aggregates of the data.
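The aggregate-first idea behind profiling can be sketched as column-level statistics; the column names and rows are invented for the example:

```python
# Sketch of data profiling: instead of auditing rows one by one, derive
# aggregate statistics per column that describe what the data "should" look
# like. Column names and rows are invented for the example.
from collections import Counter

def profile(rows):
    stats = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "distinct": len(set(values)),
            "null_rate": 1 - len(non_null) / len(values),
            "most_common": Counter(non_null).most_common(1)[0][0],
        }
    return stats

rows = [
    {"state": "NY", "zip": "12203"},
    {"state": "NY", "zip": "14623"},
    {"state": None, "zip": "14623"},
]

p = profile(rows)
print(p["state"])   # distinct values, null rate, and dominant value per column
```

A profile like this gives the "big picture" against which individual values can later be judged.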

The metadata associated with data expands as the data is used for secondary purposes, such as for data warehousing, which in turn supports analytical tools, such as data mining or On-Line Analytical Processing (OLAP) tools. What was once a data dictionary, describing the physical characteristics of the data, such as field type, becomes a complex web of physical definitions, quality attributes, business rules, and organizational data. Some of the newest tools are those that attempt to manage metadata. Commercial metadata tools fall into two categories: tools for working with metadata, and centralized repositories. The metadata tools attempt to interpret technical metadata so that business users can understand it (Levin, 1998). Central repositories are based on industry standards, in an effort to allow third-party warehouse and analysis tools to interface with them. Many systems attempt to combine the technical specifications with business information (Platinum Technology, 2000); however, there is still room for improvement.

Information quality has also been studied as a contributing factor to information systems (IS) success. A widely accepted model of IS success holds that the quality of information obtained from a system impacts satisfaction with, and ultimately the success of, the system (DeLone & McLean, 2003). In an empirical study of the factors affecting the success of data warehouse projects, data quality is shown to have a significant impact (Wixom & Watson, 2001). However, organizational, project, and technical success did not result in high-quality data in the warehouse. A study by Strong, Lee, and Wang (1997) identified the major reasons for poor-quality data and suggested ways to improve the data collection and information dissemination processes.

The major impact of poor-quality data is on decision making. To make users aware of the quality of the data they are dealing with, the use of data tags has been proposed (Wang & Madnick, 1990). Databases could be tagged at varying levels of granularity and using different dimensions of data quality. For instance, by recording the time a data item was last updated, a user could assess the currency of the data. Creating separate tags for each data item would be very resource intensive. Alternatively, a tag for each record or field could be used to represent the quality of the information stored there. When considering what information the tag should contain, it is important to remember that the data will be used in the context of a specific situation. Thus, the quality of the data may be sufficient in one instance but not in another. The "fitness-for-use" concept continues to be an issue when applying data tags. Regardless, the data tags form metadata about the quality of the data.
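A record-level tag of this kind can be sketched as follows; the record layout, tag fields, and the fixed reference date are invented for the example:

```python
# Sketch of record-level data tags: quality metadata travels with the record,
# and "fitness for use" is judged by the consumer. The record layout, tag
# fields, and the fixed reference date are invented for the example.
from datetime import datetime, timedelta

AS_OF = datetime(2006, 1, 1)   # fixed "now" so the example is reproducible

def current_enough(record, max_age_days):
    """Currency check: is the tagged update time recent enough for this use?"""
    age = AS_OF - record["_tag"]["updated"]
    return age <= timedelta(days=max_age_days)

record = {
    "balance": 1200.50,
    "_tag": {"updated": datetime(2005, 11, 20), "source": "branch-7"},
}

print(current_enough(record, max_age_days=90))  # True: fine for a quarterly report
print(current_enough(record, max_age_days=7))   # False: too stale for a daily check
```

Note that the same tag yields different verdicts for different uses, which is exactly the fitness-for-use point made above.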

Clearly, keeping the metadata on information quality up to date could itself be a Herculean task. Before embarking on such an ambitious project, it would be worthwhile to examine the impact of providing this information about the quality of the data. If decision makers were unable to use this information, for any reason, whether it was because it was not in the appropriate format or because it created information overload, then the effort of producing this metadata would be futile.

Some exploratory research (Chengalur-Smith, Ballou, & Pazer, 1999) has shown that the format in which the data quality information is provided does play a role in the use of the information. Fisher, Chengalur-Smith, and Ballou (2003) determined that experienced decision makers were more likely to use the data quality information if they did not have expertise in the data domain and if they did not feel time pressure. Further research will refine these findings.


An important aspect of information quality is trust. If users do not trust the source of the data, they are unlikely to use the data or any resulting information product. There has been a lot of research devoted to interpersonal trust and, in a business setting, trust among partnering organizations. When designing information systems, the credibility of the data has to be established. Tseng and Fogg (1999) discuss the two kinds of errors that could occur when consuming information: the “gullibility” error, in which users believe information they should not, and the “incredulity” error, in which users fail to trust data that is actually credible. Tseng and Fogg emphasize that the responsibility for reducing the incredulity error rests with the designers of database and information systems.

Wathen and Burkell (2002) indicate that users first assess the surface and message credibility of a system before they evaluate its content. Surface credibility can be assessed through the interface design characteristics, and message credibility includes information quality characteristics such as source, currency, and relevance. The threshold credibility required would vary from user to user and would depend on the context, motivation, expertise, and so forth, of the user. A major challenge for system designers is creating systems that have universal access. The Holy Grail for such systems is that they can be built to adapt to individual users' preferences (Stephanidis, 2001). Thus, the data in these systems needs to be custom tailored so that the presentation pace and format are matched to users' knowledge-acquisition styles.

A major concern for databases is security. Often, databases contain confidential information (e.g., financial data). The trade-offs between security and accessibility have been studied by several researchers. One approach has been to establish different levels of access and provide only aggregate statistics to the general public but allow authorized individuals to access the entire dataset for decision making purposes (Adam & Wortmann, 1989). A related approach proposed by Muralidhar, Parsa, and Sarathy (1999) suggests perturbing the data with random noise in such a way that the relationships between the attributes do not change. Their process ensures that querying the database with the modified data will give accurate results but that the original confidential data will not be disclosed.
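A much-simplified version of additive perturbation can be sketched as follows; the salary figures are invented, and the published method is more sophisticated in that its noise also preserves the covariance structure between attributes:

```python
# Simplified illustration of additive data perturbation: zero-mean random
# noise masks individual confidential values while leaving aggregate query
# results (here, the mean) essentially unchanged. The method of Muralidhar,
# Parsa, and Sarathy additionally preserves inter-attribute relationships.
import random

random.seed(42)  # fixed seed so the sketch is reproducible

salaries = [52000, 61000, 47000, 88000, 73000]  # invented confidential data

noise = [random.gauss(0, 2000) for _ in salaries]
mean_noise = sum(noise) / len(noise)
noise = [n - mean_noise for n in noise]          # centre the noise exactly

masked = [s + n for s, n in zip(salaries, noise)]

# Individual values are masked, but aggregate queries remain accurate.
print([round(m) for m in masked])
print(round(sum(masked) / len(masked), 2))
```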

Information auditing is another aspect of information quality that is gaining in popularity. An information product is the output from an information system that is of value to some user. In 2002, the U.S. Congress enacted the Sarbanes-Oxley Act "to protect investors by improving the accuracy and reliability of corporate disclosures made pursuant to the securities laws." The act was a direct reaction to the financial malfeasance that occurred at corporations such as Enron and Global Crossing. It requires management to report "all significant deficiencies in the design or operation of internal controls which could adversely affect the issuer's ability to record, process, summarize, and report financial data." This raises the level of importance of data quality concerns and will require accountants as well as other professionals to receive more training on data-quality issues. In addition, the act requires auditors to assess internal controls and attest to their adequacy and effectiveness at maintaining accurate and reliable financial data. Analyzing internal control procedures requires auditors to think about systems and focus attention on the business processes that produce or degrade data quality.

According to the American Institute of Certified Public Accountants, data storage, data flows, and information processes, as well as the related internal control systems, must be regularly documented and tested. Data quality can be measured in terms of a risk assessment, which estimates the probability that data integrity has been maintained by the control environment. A violation of internal control, such as an improper authorization or a breakdown in the segregation of duties, should require an immediate response from management to correct the problem and maintain the organization's confidence in the quality of the data.

As part of a data audit, companies may be required to assess the quality of the reports or the other information products they generate. In the context of a relational database, an information product is generally obtained by combining tables residing in the database. Although it may be feasible to assess the quality of the individual tables in a database, it is impossible to predict all the ways in which this data may be combined in the future. Furthermore, a given combination of tables may have a very different quality profile from the constituent tables, making it impossible to surmise the quality of information products with any degree of confidence. Ballou et al. (2004) have proposed a method for estimating the quality of any information product based on samples drawn from the base tables. They developed a reference table procedure that can be used as a gauge of the quality of the information product, regardless of the combination of relational algebraic operations used to create the product.

FUTURE TRENDS

One of the technological trends that will affect data quality is markup languages. The creation of markup languages such as eXtensible Markup Language (XML) will allow for the rapid communication of data between organizations, systems, and networks. XML allows users to define an application-specific markup language and publish an open standard in the form of a taxonomy of tags. These XML-based tags can be used to identify grains of data for financial, scientific, legal, and literary applications.

For example, eXtensible Business Reporting Language (XBRL) is an XML-based open standard being developed by XBRL International Inc. The XBRL taxonomy is a royalty-free standard that is supported by a not-for-profit consortium of approximately 200 companies and agencies. Accordingly, the XBRL standard has been adopted by all major producers of accounting information systems (AIS) and enterprise resource planning (ERP) systems. These systems will produce financial information in an XBRL-compliant format so that the information can be published, retrieved, exchanged, and analyzed over the Internet. This common platform, which supports the business reporting process, will make financial data easier to exchange and increase the quantity received by managers and investors.
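The idea of tagging grains of financial data can be sketched as follows; the element names and report structure are invented for illustration, whereas a real XBRL instance document uses elements defined by the published taxonomy:

```python
# Sketch of XML-tagged financial data in the spirit of XBRL. The element
# names and the report structure are invented for illustration; a real XBRL
# instance document uses elements defined by the published taxonomy.
import xml.etree.ElementTree as ET

doc = """
<report>
  <revenue unit="USD" period="2005-Q4">1500000</revenue>
  <netIncome unit="USD" period="2005-Q4">230000</netIncome>
</report>
"""

root = ET.fromstring(doc)

# Each tagged "grain" of data can be retrieved and analyzed by machine.
facts = {el.tag: (float(el.text), el.get("unit")) for el in root}
print(facts["revenue"])   # (1500000.0, 'USD')
```

Because every fact carries its own tag, unit, and period, a consumer can extract and compare figures without knowing the layout of the report that contained them.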

Currently, XBRL is being used by the Securities and Exchange Commission (SEC) to accept financial reports that contain data in XBRL-compliant form. New ways of testing data quality and reliability using markup technology will have to be developed by investors, accountants, regulators, business executives, and financial analysts. Furthermore, XBRL will reduce latency by increasing the speed with which financial information is reported to the market. In the future, XBRL-compliant systems could theoretically change the reporting cycle from periodic to real time. However, real-time on-line financial reporting will require the development of real-time data-assurance methods.

CONCLUSION

This article covered some key research topics in information quality. We began with the attempts to capture the many dimensions of data quality and their interrelationships. We also focused on the technology that has been developed to automate data cleansing. The impact of data quality on the success of and satisfaction with information systems was also discussed. Future developments in databases may include data tags that are metadata about the quality of the data. On an individual level, the effectiveness of this metadata is still being investigated. Some preliminary research has shown that the data tags are most helpful to users who are experienced but not necessarily familiar with the subject domain of the database. Finally, the Sarbanes-Oxley Act will have wide-reaching consequences for data quality.


REFERENCES

Adam, N. R., & Wortmann, J. C. (1989). Security control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556.

Ballou, D. P., Chengalur-Smith, I. N., & Wang, R. Y. (2004). Quality estimation for information products in relational database environments.

Ballou, D. P., & Pazer, H. L. (1995). Designing information systems to optimize the accuracy timeliness tradeoffs. Information Systems Research, 6(1), 51-72.

Ballou, D. P., Wang, R. Y., Pazer, H. L., & Tayi, G. K. (1998, April). Modeling information manufacturing systems to determine information product quality. Management Science, 44(4), 462-484.

Chengalur-Smith, I. N., Ballou, D. P., & Pazer, H. L. (1999). The impact of data quality information on decision making: An exploratory analysis. IEEE Transactions on Knowledge and Data Engineering, 11(6), 853-864.

DeLone, W. H., & McLean, E. R. (2003). The DeLone and McLean model of information systems success: A ten-year update. Journal of Management Information Systems, 19(4), 9-30.

Fisher, C., Chengalur-Smith, I. N., & Ballou, D. P. (2003). The impact of experience and time on the use of data quality information in decision making. Information Systems Research, 14(2), 170-188.

Lee, Y., & Strong, D. (2003). Knowing why about data processes and data quality. Journal of Management Information Systems, 20(3), 13-39.

Levin, R. (1998). Meta matters. Information Week, 18-19.

Muralidhar, K., Parsa, R., & Sarathy, R. (1999). A general additive data perturbation method for database security. Management Science, 45(10), 1399-1415.

Neely, P. (1998). Data quality tools for data warehousing: A small sample survey. Proceedings of the International Conference on Information Quality, Cambridge, MA.

Pipino, L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-218.

Platinum Technology. (2000). Putting metadata to work in the warehouse. Computer Associates White Paper.

Shankaranarayan, G., Ziad, M., & Wang, R. Y. (2003). Managing data quality in a dynamic decision environment: An information product approach. Journal of Database Management, 14(4).


Stephanidis, C. (2001). Adaptive techniques for universal access. User Modeling and User-Adapted Interaction, 11, 159-179.

Strong, D. M., Lee, Y. W., & Wang, R. Y. (1997). 10 potholes in the road to information quality. Computer, 30(8), 38-46.

Tseng, H., & Fogg, B. J. (1999). Credibility and computing technology. Communications of the ACM, 42(5), 39-44.

Wang, R. Y., Lee, Y. W., Pipino, L., & Strong, D. M. (1998). Manage your information as a product. Sloan Management Review, 39(4), 95-105.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-34.

Wathen, C. N., & Burkell, J. (2002). Believe it or not: Factors influencing credibility on the Web. Journal of the American Society for Information Science and Technology, 53(2), 134-144.

Wixom, B. H., & Watson, H. J. (2001). An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, 25(1), 17-41.

KEY TERMS

Auditing: A systematic process of objectively obtaining and evaluating evidence regarding assertions about data and events to ascertain the degree of compliance with established criteria.

Data Tags: Quality indicators attached to fields, records, or tables in a database to make decision makers aware of the level of data quality.

Fitness for Use: Describes the many variables that need to be considered when evaluating the quality of an information product.

Information Product: The output of an information manufacturing system; it implies stages of development, such as information suppliers, manufacturers, consumers, and managers.

Information Quality: The degree to which information consistently meets the requirements and expectations of the knowledge workers in performing their jobs.

Internal Control: A system of people, technology, processes, and procedures designed to provide reasonable assurance that an organization achieves its business process goals.

Metadata: Data about data; that is, data concerning data characteristics and relationships.

Risk Assessment: A set of procedures whose purpose is to identify, analyze, and manage the possibility that data quality has been degraded due to events and circumstances such as a breakdown in internal controls, system restructuring, changes in organizational culture, or alterations in the external environment.


Integration of Data Semantics in Heterogeneous Database Federations

H. Balsters

University of Groningen, The Netherlands

INTRODUCTION

Modern information systems are often distributed in nature; data and services are spread over different component systems wishing to cooperate in an integrated setting. Information integration is a very complex problem and is relevant in several fields, such as data reengineering, data warehousing, Web information systems, e-commerce, scientific databases, and B2B applications. Information systems involving integration of cooperating component systems are called federated information systems; if the component systems are all databases then we speak of a federated database system (Rahm & Bernstein, 2001; Sheth & Larson, 1990). In this article, we will address the situation where the component systems are so-called legacy systems; i.e., systems that are given beforehand and which are to interoperate in an integrated single framework in which the legacy systems are to maintain as much as possible their respective autonomy.

A huge challenge is to build federated databases that respect so-called global transaction safety; i.e., global transactions should preserve constraints on the global level of the federation. In a federated database (or FDB, for short) one has different component databases wishing to cooperate in an integrated setting. The component systems are often legacy systems: They have been developed in isolation before development of the actual federated system (they remain to be treated as autonomous entities). Legacy systems were typically designed to support local requirements; i.e., with local data and constraints, and not taking into account any future cooperation with other systems. Different legacy systems may also harbour different underlying data models, subject to different query and transaction processing methods (flat files, network, hierarchical, relational, object-oriented, etc.). Getting a collection of autonomous legacy systems to cooperate in a single federated system is known as the interoperability problem.

The general term mediation (Wiederhold, 1995) was introduced to address the problem of interoperability. A federated database (FDB) can be seen as a special kind of mediation, where the mediator acts as a DBMS-like interface to the FDB application.

A mediator is a global service to link local data sources and local application programs. It provides integrated information while letting the component systems of the federation remain intact. Typical mediator tasks include:

- accessing and retrieving relevant data from multiple heterogeneous sources,
- transforming retrieved data so that they can be integrated,
- integrating the homogenized data.

The mediator provides a database-like interface to applications. This interface gives the application the impression of a homogeneous, monolithic database. In reality, however, queries and transactions issued against this interface are translated to queries and transactions against underlying component database systems. Mediation is typically realized by defining a suitable uniform data model on the global level of the federation; such a global uniform data model is targeted at resolving the (ontological) differences in the underlying component data models (Rahm & Bernstein, 2001). Best candidates, on the conceptual level, are semantic data models, e.g., UML/OCL (Warmer & Kleppe, 2003), with the aim to define a data model powerful enough to harbour:

- rich data structures,
- an expressive language for operations,
- a rich constraint language.

This article will discuss some problems related to semantic heterogeneity, as well as offer an overview of some possible directions in which they can be solved.

BACKGROUND

Data integration systems are characterized by an architecture based on a global schema and a set of local schemas. There are generally three situations in which the data integration problem occurs. The first is known as global-as-view (GAV), in which the global schema is defined directly in terms of the source schemas. GAV

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG

Integration of Data Semantics in Heterogeneous Database Federations

systems typically arise in the context where the source schemas are given, and the global schema is to be derived from the local schemas. The second situation is known local-as-view (LAV), in which the relation between the global schema and the sources is established by defining every source as a view over the global schema. LAV systems typically arise in the context where the global schema is given beforehand, and the local schemas are to be derived in terms of the global schema. The third situation is known as data exchange, characterized by the situation that the local source schemas, as well as the global schema, are given beforehand; the data integration problem then exists in establishing a suitable mapping between the given global schema and the given set of local schemas (Miller, Haas,

&Hernandez, 2000). An overview of data integration concentrating on LAV and GAV can be found in Lenzerini (2002); articles by Abiteboul and Douschka (1998), Grahne and Mendelzon (1999), and Halevy (2001) concentrate on LAV, whereas Cali, Calvanese, De Giaconom, and Lenzerini (2002), Türker and Saake (2000), and Vermeer and Apers (1996) concentrate on GAV. Our article focuses on legacy problems in database federations in the context of GAV; in particular, we deal with the situation that a preexisting collection of autonomous component databases is targeted to interoperate on the basis of mediation. The mediator is defined as a virtual database on the global level and is aimed at faithfully (i.e., on the basis of completeness and consistency) integrating the information in the original collection of component databases.

A major problem that we will address in this article is that of so-called semantic heterogeneity (Bouzeghoub & Lenzerini, 2001; Hull, 1997; Rahm & Bernstein, 2001; Vermeer & Apers, 1996). Semantic heterogeneity refers to disagreement on (and differences in) meaning, interpretation, or intended use of related data. The process of creation of uniform representations of data is known as data extraction, whereas data reconciliation is concerned with resolving data inconsistencies. Examples of articles concentrating on GAV as a means to tackle semantic heterogeneity in database federations are found in Balsters (2003), Cali et al. (2002), Türker and Saake (2000), and Vermeer and Apers (1996). These articles concern the following topics: Cali et al. treat data integration under global integrity constraints; Türker and Saake concern integration of local integrity constraints; and Vermeer and Apers abstract from the relational model, as we do in this article, by offering a solution based on an object-oriented data model (Balsters, de By, & Zicari, 1993; Balsters & Spelt, 1998). This article differs from the aforementioned articles in the following aspects. In contrast to Cali et al., we also take local integrity constraints into account; furthermore, whereas their approach is restricted to so-called sound views, ours is based on exact ones. The article by Türker and Saake abstracts from problems concerning data extraction by assuming the existence of a uniform data model (pertaining to all participating local databases) in which all problems regarding semantic heterogeneity have been straightened out beforehand. Our article, in contrast, offers a treatment of data extraction and reconciliation in a combined setting and as an integral part of the mapping from local to global.

OUR FOCUS: SEMANTIC HETEROGENEITY

The problems we face when trying to integrate the data found in legacy component frames are well known and extensively documented (Lenzerini, 2002; Sheth & Larson, 1990). We will focus on one of the large categories of integration problems, coined semantic heterogeneity (Balsters & de Brock, 2003a, 2003b; Bouzeghoub & Lenzerini, 2001; Hull, 1997; Vermeer & Apers, 1996). Semantic heterogeneity refers to disagreement on (and differences in) meaning, interpretation, or intended use of related data. Examples of problems in semantic heterogeneity are data extraction and data reconciliation. The process of creation of uniform representations of data is known as data extraction, whereas data reconciliation is concerned with resolving data inconsistencies. The process of data extraction can give rise to various inconsistencies due to matters pertaining to the ontologies (Rahm & Bernstein, 2001) of the different component databases. Ontology deals with the connection between syntax and semantics: how to classify and resolve discrepancies between syntactical representations on the one hand and the semantics providing their interpretations on the other.

Integration of the source database schemas into one encompassing schema can be a tricky business due to homonyms and synonyms, data conversion, default values, missing attributes, and subclassing. These five conflict categories describe the most important problems we face in dealing with the integration of data semantics. We now briefly describe in informal terms how these problems can be tackled.

Conflicts due to homonyms are resolved by mapping two same-name occurrences (but with different semantics) to different names in the integrated model. Synonyms are treated analogously, by mapping two different names (with the same semantics) to one common name. In the sequel, we will use the abbreviations hom (syn) to indicate that we have applied this method to solve a particular case of homonym (synonym) conflicts.
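Both resolutions amount to a rename mapping that is applied when local data are lifted to the integrated model. A minimal sketch, with all attribute names invented purely for illustration:

```python
# Sketch of hom/syn resolution as a rename mapping. All names are
# hypothetical. syn: two different local names with the same semantics
# map to one common global name. hom: one local name with different
# semantics per source maps to distinct global names.
RENAME = {
    ("DB1", "wage"): "salary",        # syn: DB1.wage and DB2.salary agree
    ("DB2", "salary"): "salary",
    ("DB1", "position"): "job_title", # hom: DB1.position is a job title ...
    ("DB2", "position"): "location",  # ... but DB2.position is a place
}

def to_global(src, row):
    """Lift one local tuple to the integrated schema by renaming."""
    return {RENAME[(src, key)]: value for key, value in row.items()}
```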

Conflicts due to conversion arise when two attributes have the same meaning, but their domain values are represented differently. For example, an attribute salary (occurring, say, in two classes Person and Employee in two different databases DB1 and DB2, respectively) could in both cases indicate the salary of an employee, but in the first case the salary is represented in dollars ($), while in the latter case it is given in euros (€). One of the things we can do is convert the two currencies to a common value (e.g., $, invoking a function convertTo$). Another kind of conversion can occur when the combination of two attributes in one class has the same meaning as one attribute in another class. We will use the abbreviation conv to indicate that we have applied this method to solve a particular case of a representation conflict.
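A function in the role of convertTo$ can be sketched as follows; the fixed exchange rate and the function name are assumptions made purely for illustration:

```python
# Sketch of the conv resolution: both local salary representations are
# mapped to a common currency on the global level. The exchange rate is
# a hypothetical constant, not a real quote.
EUR_TO_USD = 1.1

def convert_to_usd(amount, currency):
    """Map a locally represented amount to the common representation ($)."""
    if currency == "USD":
        return amount
    if currency == "EUR":
        return amount * EUR_TO_USD
    raise ValueError("unknown currency: " + currency)
```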

Conflicts due to default values occur when, in integrating two classes, an attribute in one class is not mentioned in the other (similar) class, but it could be added there by offering some suitable default value for all objects inside the second class. As an example, an attribute part-time (occurring, say, in class Person in database DB1) could also be added to some other related class (say, Employee in database DB2) by stipulating that the default value for all objects in the latter class will be 1.0 (indicating full-time employment). We will use the abbreviation def to indicate that we have applied this method to solve a particular case of a default conflict.
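A sketch of the def resolution, under the assumption (for illustration only) that part-time is an employment factor with 1.0 meaning full-time:

```python
# Sketch of the def resolution: an attribute missing in one local class
# is supplied on the global level with a default value. The attribute
# name and the default 1.0 (full-time) are hypothetical.
DEFAULT_PART_TIME = 1.0

def with_default(row):
    """Add the part_time attribute if the local source did not carry it."""
    out = dict(row)
    out.setdefault("part_time", DEFAULT_PART_TIME)
    return out
```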

Conflicts due to differentiation occur when the integration of two classes calls for the introduction of some additional attribute in order to discriminate between objects originally coming from these two classes, e.g., in the case of conflicting constraints. Consider as an example the classes Person (in DB1) and Employee (in DB2). Class Person could have as a constraint that salaries are less than 1,500 (in $), while class Employee could have as a constraint that salaries are at least 1,000 (in €). These two constraints seemingly conflict with each other, obstructing integration of the Person and the Employee class into a common class, say, Integrated-Person. However, by adding a discriminating attribute dep to Integrated-Person indicating whether the object comes from the DB1 or the DB2 database, one can differentiate between the two kinds of employees and state the constraint on the integrated level in a suitable way. We will use the abbreviation diff to indicate that we have applied this method to solve a particular case of a differentiation conflict.
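The diff resolution can be sketched as a global constraint that dispatches on the discriminating attribute dep; for simplicity, the sketch leaves each bound in its original currency rather than converting to a common one:

```python
# Sketch of the diff resolution: the integrated class carries a
# discriminating attribute dep so that both (seemingly conflicting)
# local constraints can be stated together on the global level.

def satisfies_constraint(obj):
    """Constraint on Integrated-Person, differentiated by origin."""
    if obj["dep"] == "DB1":   # originally a Person: salary < 1,500 (in $)
        return obj["salary"] < 1500
    if obj["dep"] == "DB2":   # originally an Employee: salary >= 1,000 (in EUR)
        return obj["salary"] >= 1000
    return False
```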

Problems pertaining to semantic heterogeneity in database integration (dealing with applications of hom, syn, conv, def, and diff) can all be treated in a uniform and systematic manner. We refer the interested reader to Sheth and Larson (1990), Hull (1997), Lenzerini (2002), and Balsters and de Brock (2003a, 2003b) for exact details on this subject, but we can give an indication of how such a database integration can take place.

The basic idea is to arrive at an encompassing, global database, say, DBINT (for integrated database), in combination with a particular set of constraints. Data extraction is performed by providing a global schema in which the local data coming from the local databases can be uniformly represented, while data reconciliation is performed by resolving constraint conflicts through suitable use of differentiation on the global level.

The strategy to integrate a collection of legacy databases into an integrated database DBINT is based on the following principle:

An integrated database DBINT is intended to hold (and maintain to hold!) exactly the “union” of the data in the source databases in the original collection of local databases.

In particular, this means that a given arbitrary update offered to the global database should correspond to a unique specific update on a (collection of) base table(s) in the original collection of local databases (and vice versa). We have to keep in mind that any update, say, t, on the global level is an update on a virtual (nonmaterialized) view; hence, it has to correspond to some concrete update, say, u, implementing t on the local level of the component frame. This is the existence requirement. There is also a uniqueness requirement that has to be met. Suppose, for example, that some update t on the global level corresponds to two different updates, say, u and u′, on the local level. This would lead to an undesirable situation, related to the fact that in a database federation the existing legacy components continue to exist after integration and are also assumed to respect existing applications (i.e., those applications running on the legacy databases irrespective of the fact that such a legacy database happens to also participate in some global encompassing federation). Such an existing application could involve a query on a legacy database, which could give two different query results depending on which of the two concrete updates u and u′ is performed on the local level. A database view, such as DBINT, satisfying both the existence and uniqueness criteria as described above is called an exact view. The general view update problem is treated extensively in Abiteboul, Hull, and Vianu (1995). In our setting, in which we wish to construct DBINT as an exact view over the component databases, the results offered in Abiteboul et al. (1995) entail that the universe of discourse of the original collection of component databases and the universe of discourse of the integrated database DBINT are, in a mathematical sense, isomorphic. Only in this way will we not lose any information when mapping the legacy components to the integrated database (and vice versa). The interested reader is referred to Balsters and de Brock (2003a, 2003b) for more details on this subject.
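The "union" principle and the exactness requirement can be illustrated by a toy mapping that is invertible: building the integrated state and splitting it back recovers the local states without loss. All relation and attribute names below are hypothetical, and the dep tag plays the role of the discriminating attribute introduced earlier:

```python
# Sketch: DBINT holds exactly the "union" of the source data. Each tuple
# is tagged with its origin (dep), so the mapping between DBINT and the
# sources is invertible, i.e., no information is lost in either direction.

db1_person = [{"name": "Alice", "salary": 1200}]
db2_employee = [{"name": "Bob", "salary": 1100}]

def build_dbint():
    """Local-to-global: the tagged union of the source relations."""
    return ([dict(r, dep="DB1") for r in db1_person] +
            [dict(r, dep="DB2") for r in db2_employee])

def split_dbint(dbint):
    """Global-to-local: recover the original local relations exactly."""
    db1 = [{k: v for k, v in r.items() if k != "dep"}
           for r in dbint if r["dep"] == "DB1"]
    db2 = [{k: v for k, v in r.items() if k != "dep"}
           for r in dbint if r["dep"] == "DB2"]
    return db1, db2
```

Because split_dbint(build_dbint()) returns the original local states, every global update has a unique local counterpart in this toy setting, which is exactly the existence-plus-uniqueness behavior required of an exact view.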

FUTURE TRENDS

Data reconciliation and data extraction are now fairly well understood on the theoretical level. A topic that has recently been attracting attention is the integration of database constraints and the introduction of so-called federation constraints on the global level of the federation (Balsters & de Brock, 2003b; Türker & Saake, 2000). Integration of local transactions within database federations (closely related to the problem of constraint integration) still calls for more research.

Federated database systems are now also appearing on a more commercial level. As a prominent and promising product we mention IBM DB2 Information Integrator (cf. http://www.ibm.com, downloaded May 2003), which enables users to build integrated databases without migrating data from local to global sources.

Many of the techniques referred to in this article are also applied in settings related to data federations, among which we mention data warehousing and enterprise application integration. Both of these application areas face substantial data integration problems.

CONCLUSION

A data federation provides for tight coupling of a collection of heterogeneous legacy databases into a global integrated system. Two major problems in constructing database federations concern achieving and maintaining consistency and a uniform representation of the data on the global level of the federation. The process of creation of uniform representations of data is known as data extraction, whereas data reconciliation is concerned with resolving data inconsistencies. A particular example of a data reconciliation problem in database federations is the integration of integrity constraints occurring within component legacy databases into a single global schema.

We have described an approach to constructing a global integrated system from a collection of (semantically) heterogeneous component databases based on the concept of exact view. A global database constructed by exact views integrates component schemas without loss of constraint information; i.e., integrity constraints available at the component level remain intact after integration on the global level of the federated database.

REFERENCES

Abiteboul, S., Hull, R., & Vianu, V. (1995). Foundations of databases. Reading, MA: Addison-Wesley.

Abiteboul, S., & Douschka, O. (1998). Complexity of answering queries using materialized views. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. New York: ACM Press.

Balsters, H. (2003). Modeling database views with derived classes in the UML/OCL-framework. In Lecture Notes in Computer Science: Vol. 2863. «UML» 2003 Sixth International Conference. Berlin, Germany: Springer.

Balsters, H., de By, R. A., & Zicari, R. (1993). Sets and constraints in an object-oriented data model. In Lecture Notes in Computer Science: Vol. 707. Proceedings of the Seventh ECOOP. Berlin, Germany: Springer.

Balsters, H., & de Brock, E. O. (2003a). An object-oriented framework for managing cooperating legacy databases. In Lecture Notes in Computer Science: Vol. 2817. Ninth International Conference on Object-Oriented Information Systems. Berlin, Germany: Springer.

Balsters, H., & de Brock, E. O. (2003b). Integration of integrity constraints in database federations. Sixth IFIP TC-11 WG 11.5 Conference on Integrity and Internal Control in Information Systems. Norwell, MA: Kluwer.

Balsters, H., & Spelt, D. (1998). Automatic verification of transactions on an object-oriented database. In Lecture Notes in Computer Science: Vol. 1369. Sixth International Workshop on Database Programming Languages. Berlin, Germany: Springer.

Bouzeghoub, M., & Lenzerini, M. (2001). Introduction. Information Systems, 26.

Cali, A., Calvanese, D., De Giacomo, G., & Lenzerini, M. (2002). Data integration under integrity constraints. In
