
Checking Integrity Constraints in a Distributed Database
simplified forms of the constraints by analyzing both the syntax of the constraints and their appropriate update operations. These methods are based on syntactic criteria. In fact, an improved set of constraints (fragment constraints) can be constructed by applying both the semantic and syntactic methods. Also, knowledge about the data distribution and the application domain can be used to create algorithms which derive an efficient set of fragment constraints from the original constraints. In order to derive integrity tests from a given initial constraint, the following steps can be followed (Ibrahim, Gray & Fiddian, 2001):
a.Transform the initial constraint into a set of logically equivalent fragment constraints which reflect the data fragmentation. At this stage, the transformations are restricted to logically equivalent transformations, without considering any reformulation of the original constraints. The transformation is one to many, which means that given an integrity constraint, the result of the transformation is a logically equivalent set of fragment constraints. There are six transformation rules that are applied during this process (Ibrahim, Gray & Fiddian, 2001) which cover the horizontal, vertical and mixed fragmentation.
b.Optimize the derived set of fragment constraints. There are several types of optimization techniques that can be applied to constraints at the fragment level, such as techniques for optimizing query processing (Chakravarthy, 1990), reformulation techniques (Qian, 1989) and theorem based techniques (McCarroll, 1995; McCune & Henschen, 1989). Constraint optimization can be performed before (pre-optimization) or after (post-optimization) compilation. The constraint optimization process should use both syntactic and semantic information about the database integrity constraints.
c.Distribute the set of fragment constraints to the appropriate site(s). Because the complexity of enforcing constraints is directly related to both the number of constraints in the constraint set and the number of sites involved, the objective of this phase is to reduce the number of constraints allocated to each site for execution at that site. Distributing the whole set of fragment constraints to every site is not cost-effective since not all fragment constraints are affected by an update and so sites may not be affected by particular updates. The decision of the distribution is based on the site allocation of the fragment relations specified in the constraint.
d.Generate the integrity tests. Many techniques have been proposed by previous researchers to generate integrity tests (simplified forms), as discussed in the previous section.
These four steps rely on the assumption that the database is consistent prior to an update operation, as well as on the fragmentation strategies used, the allocation of the fragment relations, the specification of the fragment constraints, and the generated update templates.
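As a rough, hypothetical illustration of step (d), the sketch below evaluates the kind of simplified test these steps aim to produce for an insertion: a sufficient local test against the fragment stored at the update site is tried first, and a global test over the remote fragments is needed only if it fails. The relation names, fragment contents, and site allocations are invented for the example and are not taken from the cited work.

```python
# Hypothetical sketch. Global constraint: every EMP.dept value must appear in
# DEPT, where DEPT is horizontally fragmented across sites as DEPT_1..DEPT_n.

dept_fragments = {                 # invented fragment allocation
    "site1": {"D10", "D20"},
    "site2": {"D30"},
    "site3": {"D40", "D50"},
}

def local_sufficient_test(update_site, new_dept):
    """Sufficient test: if the value is in the local DEPT fragment, the global
    constraint holds (the database is assumed consistent before the update)."""
    return new_dept in dept_fragments[update_site]

def global_test(update_site, new_dept):
    """Fallback test: also access the remote fragments."""
    return any(new_dept in frag
               for site, frag in dept_fragments.items() if site != update_site)

def check_insert(update_site, new_dept):
    if local_sufficient_test(update_site, new_dept):
        return True        # verified at a single site, no data transferred
    return global_test(update_site, new_dept)

print(check_insert("site1", "D20"))   # local sufficient test succeeds
print(check_insert("site1", "D40"))   # local test fails; global test is needed
```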
iii.Localizing Integrity Checking: It assumes efficient integrity tests can be generated. Here, efficiency is measured by analyzing three components, namely, the amount of data that needs to be accessed, the amount of data that needs to be transferred across the network and the number of sites that are involved in order to check a constraint. The intention is to derive local tests which are complete, necessary or sufficient and whose evaluation is local to a site. Constraint simplification by update analysis is preferred to constraint simplification by reformulation, since reformulation will generally require an unbounded search over all possible constraints when generating the integrity tests. Using heuristics to minimize this search may mean some optimal reformulations are not found. The fragment constraints derived by the transformation process can be classified as either local or nonlocal fragment constraints. The evaluation of a local fragment constraint is performed at a single site, thus it is similar to the evaluation of a constraint in a centralized system. Therefore, the constraint simplification techniques proposed by previous researchers can be adopted for constructing integrity tests for local fragment constraints. For nonlocal fragment constraints, a process which derives local tests from these constraints is more attractive than one which derives global tests whose evaluation spans more than one site. This means exploiting the relevant general knowledge to derive local tests. This will involve using, for example, techniques that can infer the information stored at different sites of the network.
iv.Pretest Evaluation: It adopts an evaluation scheme which avoids the need to undo the update in the case of constraint violation; pre-test evaluation is preferred precisely because it avoids this undoing process.
v.Test Optimization: Integrity test evaluation costs can be further reduced by examining the semantics of both the tests and the relevant update operations. An integrity test for a modify operation can be simplified if the attribute(s) being modified is not the attribute(s) being tested. Also, in some cases it is more efficient to construct transition tests. These simplified tests further reduce the amount of data needing to be accessed or the number of sites that might be involved, and they are more selective as more constants are substituted for the variables in these tests, which makes them easier and cheaper to evaluate than the generated initial tests.
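One simple instance of this kind of optimization can be sketched as follows (with hypothetical attribute names): if the set of attributes modified by an update does not overlap the attributes an integrity test examines, the update cannot violate that test and the test can be skipped.

```python
# Hypothetical sketch: skip a test when the modified attributes are disjoint
# from the attributes the test actually examines.

def test_needed(modified_attrs, tested_attrs):
    """Return True only if the modify operation touches a tested attribute."""
    return bool(set(modified_attrs) & set(tested_attrs))

# A test on EMP.dept need not be evaluated for an update that only changes salary.
print(test_needed({"salary"}, {"dept"}))          # False -> test skipped
print(test_needed({"salary", "dept"}, {"dept"}))  # True  -> test evaluated
```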
FUTURE TRENDS

From the above sections, it is obvious that a constraint checking mechanism for a distributed database is said to be efficient if it can minimize the number of sites involved, the amount of data accessed and the amount of data transferred across the network during the process of checking and detecting constraint violation. The overall intention of a constraint checking mechanism is shown in Figure 1. Three main components can be identified with respect to a given integrity constraint which affect the efficiency of its evaluation. These components, which are shown in Figure 1, are represented by the following: (i) the X-axis represents the number of sites involved, σ, in verifying the truth of the integrity constraint; (ii) the Y-axis represents the amount of data accessed (or the checking space), A; (iii) the Z-axis represents the amount of data transferred across the network, T = Σ(i=1..n) dt_i, where dt_i is the amount of data transferred from site i, and n is the number of remote sites involved in verifying the truth of the integrity constraint.

A constraint checking mechanism is said to be more efficient if it tends towards the origin, (0, 0, 0), in this diagram. The strategy should try to allocate the responsibility of verifying the consistency of a database to a single site, i.e. the site where the update operation is to be performed. Thus the number of sites involved, σ = 1. As the checking operation is carried out at a single site, no transferring of data across the network is required (Gupta, 1994), so T = 0. The evaluation of local tests involves a checking space and an amount of data accessed (assumed A = al) which are always smaller than the checking space and the amount of data accessed by the initial constraints, since these tests are the simplified forms of those constraints, so these are minimized as well.
Figure 1. The intention of the constraint checking mechanism (X-axis: the number of sites involved, σ; Y-axis: the amount of data accessed, A; Z-axis: the amount of data transferred across the network, T; an efficient mechanism tends towards the origin (0, 0, 0), with local tests at the point (1, al, 0))
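The efficiency criterion above can be read as a cost triple (σ, A, T). The short sketch below, using made-up data volumes, computes this triple for two checking strategies and shows why a local test evaluated at the update site corresponds to the point (1, al, 0).

```python
# Hypothetical sketch of the (sigma, A, T) cost triple described above.
# sigma: number of sites involved; A: total data accessed;
# T: sum of the data transferred from the remote sites.

def cost_triple(accessed_per_site, transferred_from_remote):
    sigma = len(accessed_per_site)
    A = sum(accessed_per_site.values())
    T = sum(transferred_from_remote.values())
    return sigma, A, T

# A global test touching three sites (made-up volumes, arbitrary units).
print(cost_triple({"site1": 500, "site2": 300, "site3": 200},
                  {"site2": 300, "site3": 200}))   # (3, 1000, 500)

# A local test evaluated only at the update site: T = 0 and A is the smaller
# checking space of the simplified test.
print(cost_triple({"site1": 50}, {}))              # (1, 50, 0)
```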
CONCLUSION
An important aim of a database system is to guarantee database consistency, which means that the data contained in a database is both accurate and valid. There are many ways in which inaccurate data may occur in a database. Several issues have been highlighted with regard to checking integrity constraints. The main issue is how to efficiently check integrity constraints in a distributed environment. Several strategies can be applied to achieve efficient constraint checking in a distributed database, such as constraint filtering, constraint optimization, localizing constraint checking, pre-test evaluation and test optimization. Here, efficiency is measured by analyzing three components, namely: the amount of data that needs to be accessed, the amount of data that needs to be transferred across the network and the number of sites that are involved in order to check a constraint.
REFERENCES
Barbara, D., & Garcia-Molina, H. (1992). The demarcation protocol: A technique for maintaining linear arithmetic constraints in distributed database systems. Proceedings of the Conference on Extending Database Technology (EDBT’92) (pp. 373-388).
Bernstein, P. A., & Blaustein, B. T. (1981). A simplification algorithm for integrity assertions and concrete views. Proceedings of the 5th International Computer Software and Applications Conference (COMPSAC’81) (pp. 90-99).
Blaustein, B. T. (1981). Enforcing database assertions: Techniques and applications. Doctoral dissertation, Harvard University.
Ceri, S., Fraternali, P., Paraboschi, S., & Tanca, L. (1994). Automatic generation of production rules for integrity maintenance. ACM Transactions on Database Systems, 19(3), 367-422.
Chakravarthy, U. S. (1990). Logic-based approach to semantic query optimization. ACM Transactions on Database Systems, 15(2), 162-207.
Codd, E. F. (1990). The relational model for database management (Version 2). Boston: Addison-Wesley.
Cremers, A. B., & Domann, G. (1983). AIM – An integrity monitor for the database system INGRES. Proceedings of the 9th International Conference on Very Large Data Bases (VLDB 9) (pp. 167-170).
Embury, S. M., Gray, P. M. D., & Bassiliades, N. D. (1993). Constraint maintenance using generated methods in the P/FDM object-oriented database. Proceedings of the 1st International Workshop on Rules in Database Systems (pp. 365-381).
Eswaran, K. P., & Chamberlin, D. D. (1975). Functional specifications of a subsystem for data base integrity. Proceedings of the 1st International Conference on Very Large Data Bases (VLDB 1), 1(1), 48-68.
Grefen, P. W. P. J. (1990). Design considerations for integrity constraint handling in PRISMA/DB1 (Prisma Project Document P508). University of Twente, Enschede.
Grefen, P. W. P. J. (1993). Combining theory and practice in integrity control: A declarative approach to the specification of a transaction modification subsystem. Proceedings of the 19th International Conference on Very Large Data Bases (VLDB 19) (pp. 581-591).
Gupta, A. (1994). Partial information based integrity constraint checking. Doctoral dissertation, Stanford University.
Hammer, M. M., & McLeod, D. J. (1975). Semantic integrity in a relational data base system. Proceedings of the 1st International Conference on Very Large Data Bases (VLDB 1), 1(1), 25-47.
Henschen, L. J., McCune, W. W., & Naqvi, S. A. (1984). Compiling constraint-checking programs from first-order formulas. Advances in Database Theory, 2, 145-169.
Hsu, A., & Imielinski, T. (1985). Integrity checking for multiple updates. Proceedings of the 1985 ACM SIGMOD International Conference on the Management of Data (pp. 152-168).
Ibrahim, H. (2002a). Extending transactions with integrity rules for maintaining database integrity. Proceedings of the International Conference on Information and Knowledge Engineering (IKE’02) (pp. 341-347).
Ibrahim, H. (2002b). A strategy for semantic integrity checking in distributed databases. Proceedings of the 9th International Conference on Parallel and Distributed Systems (ICPADS 2002) (pp. 139-144). IEEE Computer Society.
Ibrahim, H., Gray, W. A., & Fiddian, N. J. (1996). The development of a semantic integrity constraint subsystem for a distributed database (SICSDD). Proceedings of the 14th British National Conference on Databases (BNCOD 14) (pp. 74-91).
Ibrahim, H., Gray, W. A., & Fiddian, N. J. (1998). SICSDD—A semantic integrity constraint subsystem for a distributed database. Proceedings of the 1998 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’98) (pp. 1575-1582).
Ibrahim, H., Gray, W. A., & Fiddian, N. J. (2001). Optimizing fragment constraints—A performance evaluation. International Journal of Intelligent Systems—Verification and Validation Issues in Databases, Knowledge-Based Systems, and Ontologies, 16(3), 285-306.
Mazumdar, S. (1993). Optimizing distributed integrity constraints. Proceedings of the 3rd International Symposium on Database Systems for Advanced Applications (Vol. 4, pp. 327-334).
McCarroll, N. F. (1995). Semantic integrity enforcement in parallel database machines. Doctoral dissertation, University of Sheffield.
McCune, W. W., & Henschen, L. J. (1989). Maintaining state constraints in relational databases: A proof theoretic basis. Journal of the Association for Computing Machinery, 36(1), 46-68.
Nicolas, J. M. (1982). Logic for improving integrity checking in relational data bases. Acta Informatica, 18(3), 227-253.
Plexousakis, D. (1993). Integrity constraint and rule maintenance in temporal deductive knowledge bases. Proceedings of the 19th International Conference on Very Large Data Bases (VLDB 19) (pp. 146-157).
Qian, X. (1988). An effective method for integrity constraint simplification. Proceedings of the 4th International Conference on Data Engineering (ICDE 88) (pp. 338-345).
Qian, X. (1989). Distribution design of integrity constraints. Proceedings of the 2nd International Conference on Expert Database Systems (pp. 205-226).
Sheard, T., & Stemple, D. (1989). Automatic verification of database transaction safety. ACM Transactions on Database Systems, 14(3), 322-368.
Simon, E., & Valduriez, P. (1987). Design and analysis of a relational integrity subsystem (Tech. Rep. DB-015-87). Austin, TX: MCC.
Stemple, D., Mazumdar, S., & Sheard, T. (1987). On the modes and measuring of feedback to transaction designers. Proceedings of the 1987 ACM-SIGMOD International Conference on the Management of Data (pp. 374-386).
Wang, X. Y. (1992). The development of a knowledge-based transaction design assistant. Doctoral dissertation, University of Wales College of Cardiff.
KEY TERMS
Complete Test: Verifies that an update operation leads a consistent database state to either a consistent or inconsistent database state.
Data Fragmentation: Refers to the technique used to split up the global database into logical units. These logical units are called fragment relations, or simply fragments.
Database Consistency: Means that the data contained in the database is both accurate and valid.

Distributed Database: A collection of multiple, logically interrelated databases distributed over a computer network.

Global Test: Verifies that an update operation violates an integrity constraint by accessing data at remote sites.

Integrity Control: Deals with the prevention of semantic errors made by the users due to their carelessness or lack of knowledge.

Local Test: Verifies that an update operation violates an integrity constraint by accessing data at the local site.

Necessary Test: Verifies that an update operation leads a consistent database state to an inconsistent database state.

Sufficient Test: Verifies that an update operation leads a consistent database state to a new consistent database state.

Collective Knowledge Composition in a P2P Network
Boanerges Aleman-Meza
University of Georgia, USA
Christian Halaschek-Wiener
University of Georgia, USA
I. Budak Arpinar
University of Georgia, USA
INTRODUCTION
Today’s data and information management tools enable massive accumulation and storage of knowledge that is produced through scientific advancements, personal and corporate experiences, communications, interactions, and so forth. In addition, the increase in the volume of this data and knowledge continues to accelerate. The willingness and the ability to share and use this information are key factors for realizing the full potential of this knowledge scattered over many distributed computing devices and human beings. By correlating these isolated islands of knowledge, individuals can gain new insights, discover new relations (Sheth, Arpinar & Kashyap, 2003), and produce more knowledge. Despite the abundance of information, knowledge starvation still exists because most of the information cannot be used effectively for decision-making and problem-solving purposes. This is in part due to the lack of easy-to-use knowledge sharing and collective discovery mechanisms. Thus, there is an emerging need for knowledge tools that will enable users to collectively create, share, browse, and query their knowledge.
For example, many complex scientific problems increasingly require collaboration between teams of scientists who are distributed across space and time and who belong to diverse disciplines (Loser et al., 2003; Pike et al., 2003). Effective collaboration remains dependent, however, on how individual scientists (i.e., peers) can represent their meaningful knowledge, how they can query and browse each other’s knowledge space (knowledge map), and, most importantly, how they can compose their local knowledge pieces together collectively to discover new insights that are not evident to each peer locally.
A common metaphor for knowledge is that it consists of separate little factoids and that these knowledge “atoms” can be collected, stored, and passed along (Lakoff & Johnson, 1983). Views like this are what underlie the notion that an important part of knowledge management is getting access to the “right knowledge.”
While the state of the art is not at the point where we can duplicate the accomplishments of a Shakespeare or Einstein on demand, research developments allow us to craft technological and methodological support to increase the creation of new knowledge, both by individuals and by groups (Thomas, Kellogg & Erickson, 2001).
A Peer-to-Peer (P2P) network can facilitate scalable composition of knowledge compared to a centralized architecture where local knowledge maps are extracted and collected in a server periodically to find possible compositions. This kind of vision can be realized by exploiting advances in various fields. Background and enabling technologies include semantic metadata extraction and annotation, and knowledge discovery and composition. Figure 1 shows these components for an ontology-based P2P query subsystem.
Figure 1. Part of an ontology-based P2P query subsystem
BACKGROUND
Semantic Metadata Extraction and Annotation
A peer’s local knowledge can be in various formats such as Web pages (unstructured), text documents (unstructured), XML (semi-structured), RDF or OWL, and so forth. In the context of efficient collective knowledge composition, this data must be in a machine processable format, such as RDF or OWL. Thus, all data that is not in this format must be processed and converted (metadata extraction). Once this is completed, the knowledge will be suitable to be shared with other peers. The Semantic Web envisions making content machine processable, not just readable or consumable by human beings (Berners-Lee, Hendler & Lassila, 2001). This is accomplished by the use of ontologies which involve agreed terms and their relationships in different domains. Different peers can agree to use a common ontology to annotate their content and/or resolve their differences using ontology mapping techniques. Furthermore, peers’ local knowledge will be represented in a machine processable format, with the goal of enabling the automatic composition of knowledge.
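As a minimal illustration of what such machine-processable peer knowledge might look like, the following sketch uses the rdflib Python library to express a few facts from a peer's local knowledge as RDF triples annotated with terms from a shared ontology; the namespace, class names, and resource identifiers are hypothetical.

```python
# Minimal sketch (hypothetical ontology and resource names): a peer's local
# knowledge expressed as RDF triples so that it can be shared with other peers.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS

ONT = Namespace("http://example.org/shared-ontology#")   # agreed common ontology
g = Graph()
g.bind("ont", ONT)

doc = URIRef("http://example.org/peer1/report-42")        # a peer-local resource
g.add((doc, RDF.type, ONT.Experiment))
g.add((doc, ONT.studies, ONT.ProteinFolding))
g.add((doc, RDFS.label, Literal("Folding simulation results")))

# Serialized in Turtle, this fragment of the peer's knowledge map can be
# exchanged with, or queried by, other peers.
print(g.serialize(format="turtle"))
```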
Ontology-driven extraction of domain-specific semantic metadata has been a highly researched area. Both semiautomatic (Handschuh, Staab & Studer, 2003) and automatic (Hammond, Sheth & Kochut, 2002) techniques and tools have been developed, and significant work continues in this area (Vargas-Vera et al., 2002).
Knowledge Discovery and Composition
One of the approaches for knowledge discovery is to consider relations in the Semantic Web that are expressed semantically in languages like RDF(S). Anyanwu and Sheth (2003) have formally defined particular kinds of relations in the Semantic Web, namely, Semantic Associations. Discovery and ranking of these kinds of relations have been addressed in a centralized system (Aleman-Meza et al., 2005; Sheth et al., 2005). However, a P2P approach can be exploited to make the discovery of knowledge more dynamic, flexible, and scalable. Since different peers may have knowledge of related entities and relationships, they can be interconnected in order to provide a solution for a scientific problem and/or to discover new knowledge by means of composing knowledge of the otherwise isolated peers.
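A toy sketch of this idea follows, with entirely hypothetical entities and relationships: no single peer can connect the two entities of interest, but a simple search over the merged triples finds a chain of relationships (a very simple form of semantic association).

```python
# Hypothetical sketch: composing knowledge from otherwise isolated peers.
from collections import deque

peer1 = [("GeneX", "expressedIn", "TissueY")]
peer2 = [("TissueY", "affectedBy", "DiseaseZ")]
peer3 = [("DiseaseZ", "treatedWith", "DrugW")]

def find_association(triples, start, goal):
    """Breadth-first search for a chain of relationships from start to goal."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for pred, obj in adjacency.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [pred, obj]))
    return None

merged = peer1 + peer2 + peer3
print(find_association(merged, "GeneX", "DrugW"))
# ['GeneX', 'expressedIn', 'TissueY', 'affectedBy', 'DiseaseZ', 'treatedWith', 'DrugW']
```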
In order to exploit peers’ knowledge, it is necessary to make use of knowledge query languages. A vast amount of research has been aimed at the development of query languages and mechanisms for a variety of knowledge
representation models. However, there are additional special considerations to be addressed in distributed dynamic systems such as P2P.

PEER-TO-PEER NETWORKS

Recently, there has been a substantial amount of research in P2P networks. For example, P2P network topology has been an area of much interest. Basic peer networks include random coupling of peers over a transport network such as Gnutella (http://www.gnutella.com, discussed by Ripeanu, 2001) and centralized server networks such as that of the Napster (http://www.napster.com) architecture. These networks suffer from drawbacks such as scalability, lack of search guarantees, and bottlenecks. Yang and Garcia-Molina (2003) discussed super-peer networks that introduce hierarchy into the network in which super-peers have additional capabilities and duties in the network that may include indexing the content of other peers. Queries are broadcasted among super-peers, and these queries are then forwarded to leaf peers. Schlosser et al. (2003) proposed HyperCuP, a network in which a deterministic topology is maintained and known of by all nodes in the network. Therefore, nodes at least have an idea of what the network beyond their scope looks like. They can use this globally available information to reach locally optimal decisions while routing and broadcasting search messages. Content addressable networks (CAN) (Ratnasamy et al., 2001) have provided significant improvements for keyword search. If meta-information on a peer’s content is available, this information can be used to organize the network in order to route queries more accurately and for more efficient searching. Similarly, ontologies can be used to bootstrap the P2P network organization: peers and the content that they provide can be classified by relating their content to concepts in an ontology or concept hierarchy. The classification determines, to a certain extent, a peer’s location in the network. Peers can use their knowledge of this scheme to route and broadcast queries efficiently.

Peer network layouts have also combined multiple ideas briefly mentioned here. In addition, Nejdl et al. (2003) proposed a super-peer based layout for RDF-based P2P networks. Similar to content addressable networks, super-peers index the metadata context that the leaf peers have.

Efficient searching in P2P networks is very important as well. Typically, a P2P node broadcasts a search request to its neighboring peers who propagate the request to their peers and so on. However, this can be dramatically improved. For example, Yang and Garcia-Molina (2003) have described techniques to increase search effectiveness. These include iterative deepening, directed Breadth First Search, and local indices over the data contained within r-hops from itself. Ramanathan, Kalogeraki, and Pruyne (2001) proposed a mechanism in which peers monitor which other peers frequently respond successfully to their requests for information. When a peer is known to frequently provide good results, other peers attempt to move closer to it in the network by creating a new connection with that peer. This leads to clusters of peers with similar interests that allow limiting the depth of searches required to find good results. Nejdl et al. (2003) proposed using the semantic indices contained in super-peers to forward queries more efficiently. Yu and Singh (2003) proposed a vector-reputation scheme for query forwarding and reorganization of the network. Tang, Xu, and Dwarkadas (2003) made use of data semantics in the pSearch project. In order to achieve efficient search, they rely on a distributed hash table to extend LSI and VSM algorithms for their use in P2P networks.
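The following sketch, with invented peers and concepts, illustrates the general flavour of the ontology-based routing mentioned above: peers advertise the concepts their content is classified under, and a query is forwarded only to peers advertising a matching concept instead of being broadcast to every neighbour. It is not a reconstruction of any particular system discussed here.

```python
# Hypothetical sketch of concept-based query routing in a P2P network.
# Each peer advertises the ontology concepts its content is classified under;
# a query is forwarded only to peers whose advertisement matches.

advertisements = {
    "peerA": {"Genomics", "Proteomics"},
    "peerB": {"Climatology"},
    "peerC": {"Proteomics"},
}

def route_query(query_concept, adverts):
    """Return the peers a query about `query_concept` should be forwarded to."""
    return [peer for peer, concepts in adverts.items()
            if query_concept in concepts]

print(route_query("Proteomics", advertisements))   # ['peerA', 'peerC']
print(route_query("Climatology", advertisements))  # ['peerB']
```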
FUTURE TRENDS
Knowledge composition applications are fundamentally based on advances in research areas such as information retrieval, knowledge representation, and databases. As the growth of the Web continues, knowledge composition will likely exploit pieces of knowledge from the multitude of heterogeneous sources of Web content available. The field of peer-to-peer networks is, as of now, an active research area with applicability as a framework for knowledge composition. Given our experiences, we believe that future research outcomes in peer-to-peer knowledge composition will make use of a variety of knowledge sources. Knowledge will be composed from structured data (such as relational databases), semi-structured data (such as XML feeds), semantically annotated data (using the RDF model or OWL), and necessary conversions will be done using knowledge extractors. Thus, knowledge will be composed from databases, XML, ontologies, and extracted data. However, the more valuable insights will probably be possible by combining knowledge sources with unstructured Web content. Large scale analysis and composition of knowledge exploiting massive amounts of Web content remain challenging and interesting topics.
CONCLUSION
The problem of collectively composing knowledge can greatly benefit from research in the organization and discovery of information in P2P networks. Additionally, several capabilities in creating knowledge bases from heterogeneous sources provide the means for exploiting semantics in data and knowledge for knowledge composition purposes. In this respect, we have discussed the evolution
of peer-to-peer systems from a knowledge composition perspective. Although challenging research problems remain, there is great potential for moving from centralized knowledge discovery systems towards a distributed environment. Thus, research in databases, information retrieval, semantic analytics, and P2P networks provides the basis of a framework in which applications for knowledge composition can be built.
REFERENCES
Aleman-Meza, B., Halaschek-Wiener, C., Arpinar, I.B., & Sheth, A. (2005). Ranking complex relationships on the Semantic Web.
Anyanwu, K., & Sheth, A. (2003). r-Queries: Enabling querying for semantic associations on the Semantic Web. Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American, 284(5), 34.
Hammond, B., Sheth, A., & Kochut, K. (2002). Semantic enhancement engine: A modular document enhancement platform for semantic applications over heterogeneous content. In V. Kashyap & L. Shklar (Eds.), Real world Semantic Web applications (pp. 29-49). IOS Press.
Handschuh, S., Staab, S., & Studer, R. (2003). Leveraging metadata creation for the Semantic Web with CREAM. KI 2003: Advances in Artificial Intelligence, 2821, 19-33.
Lakoff, G., & Johnson, M. (1983). Metaphors we live by. Chicago: University of Chicago Press.
Loser, A., Wolpers, M., Siberski, W., & Nejdl, W. (2003). Efficient data store and discovery in a scientific P2P network. Proceedings of the ISWC 2003 Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, Sanibel Island, Florida.
Nejdl, W., Wolpers, M., Siberski, W., Schmitz, C., Schlosser, M., Brunkhorst, I., & Löser, A. (2003). Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks. Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary.
Pike, W., Ahlqvist, O., Gahegan, M., & Oswal, S. (2003). Supporting collaborative science through a knowledge and data management portal. Proceedings of the ISWC 2003 Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, Sanibel Island, Florida.
Ramanathan, M.K., Kalogeraki, V., & Pruyne, J. (2001). Finding good peers in peer-to-peer networks (Tech. Rep. No. HPL-2001-271). HP Labs.
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content addressable network. Proceedings of the ACM SIGCOMM. New York: ACM Press.
Ripeanu, M. (2001). Peer-to-peer architecture case study: Gnutella Network. Proceedings of the International Conference on Peer-to-Peer Computing, Linkoping, Sweden.
Schlosser, M., Sintek, M., Decker, S., & Nejdl, W. (2003). HyperCuP–Hypercubes, ontologies and efficient search on P2P networks. Lecture Notes in Artificial Intelligence, 2530, 112-124.
Sheth, A., Aleman-Meza, B., Arpinar, I.B., Halaschek, C., Ramakrishnan, C., Bertram, C., et al. (2005). Semantic association identification and knowledge discovery for national security applications. In L. Zhou & W. Kim (Eds.), Special Issue of Journal of Database Management on Database Technology for Enhancing National Security, 16(1), 33-53. Hershey, PA: Idea Group Publishing.
Sheth, A., Arpinar, I.B., & Kashyap, V. (2003). Relationships at the heart of Semantic Web: Modeling, discovering, and exploiting complex semantic relationships. In M. Nikravesh, B. Azvin, R. Yager, & L.A. Zadeh (Eds.), Enhancing the power of the Internet, Studies in fuzziness and soft computing (pp. 63-94). Springer-Verlag.
Tang, C., Xu, Z., & Dwarkadas, S. (2003). Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proceedings of the ACM SIGCOMM 2003, Karlsruhe, Germany.
Thomas, J.C., Kellogg, W.A., & Erickson, T. (2001). The knowledge management puzzle: Human and social factors in knowledge management. IBM Systems Journal, 40(4), 863-884.
Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., & Ciravegna, F. (2002). MnM: Ontology driven semi-automatic and automatic support for semantic markup. Proceedings of the 13th International Conference on Knowledge Engineering and Management, Sigüenza, Spain.
Yang, B., & Garcia-Molina, H. (2003). Designing a superpeer network. Proceedings of the 19th International Conference on Data Engineering, Bangalore, India.
Yu, B., & Singh, M.P. (2003). Searching social networks. Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, Melbourne, Australia.
KEY TERMS
Knowledge Composition: Knowledge composition involves assembling knowledge atoms (such as triples in RDF and OWL) to build more complex knowledge maps.
Metadata: In general terms, metadata are data about data. Examples are the size of a file, the topic of a news article, and so forth.
Ontology: From a practical perspective, ontologies define a vocabulary to describe how things are related. Relationships of type “is-a” are very basic, yet taxonomies are built with is-a relationships. The value of ontologies is in the agreement they are intended to provide (for humans, and/or machines).
OWL: The OWL Web Ontology Language is designed for use by applications that need to process the content of information instead of just presenting information to humans. OWL facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics (OWL Web Ontology Language Overview, W3C Recommendation, February 2004).
RDF(S): The Resource Description Framework is a language intended for the representation and description of ‘resources’. RDF makes use of a Vocabulary Description Language (commonly referred to as RDFS or RDF Schema) to describe classes and relations among resources. With RDF(S), we refer to both RDF and its accompanying vocabulary description language (RDF Primer, W3C Recommendation, February 2004).
Semantic Metadata: We refer to ‘semantic metadata’ as that data about data that describes the content of the data. A representative example of semantic metadata is relating data with classes of an ontology, that is, the use of ontology for describing data.
Semantic Web: The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners (W3C). The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation (Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001).

Common Information Model
James A. Fulton
The Boeing Company, USA (Retired)
INTRODUCTION
A common information model (CIM) defines information that is available for sharing among multiple business processes and the applications that support them. These common definitions are neutral with respect to the processes that produce and use that information, the applications that access the data that express that information, and the technologies in which those applications are implemented. In an architecture based on the CIM, applications map their data only to the CIM and not to other applications, and they interface only to the middleware that implements the CIM (the integration broker or IB), not to other applications. This not only reduces the number of interfaces to be built and maintained, it provides a basis for integrating applications in a way that reduces the coupling among them, thereby allowing them to be upgraded or replaced with minimal functional impact on other applications.
BACKGROUND

The concept of a common information model first emerged in public prominence under the name conceptual schema with the publication of the ANSI/SPARC Database Model (Jardine, 1977). At the time, it was intended as an approach to designing very large shared database systems.

The key thesis of the ANSI/SPARC Committee was that traditional integration practices required point-to-point interfaces between each application and the data sources it depended on, as illustrated in Figure 1. This required the development of a mapping between the definition of the data available from the source (the internal schema) and the definition of the data required by the application (the external schema). The committee referred to such an approach as a two-schema architecture. The interfaces between the components had to implement this mapping in order to transform the physical representation, the syntax, and the semantics of the data from its source to its destination.

Figure 1. Two-schema architecture (applications such as People, Payroll, Planning, and Operations connected by point-to-point interfaces)

Although such computing systems are relatively simple and quick to build, they impose significant problems during maintenance:

•Number of Interfaces: As these systems mature, the number of interfaces to be built and maintained could grow with the square of the number of applications. If N is the number of applications, I is the number of interfaces, and each application has one interface in each direction with every other application, then

I = 2 × N × (N-1) = 2N² - 2N
This is known as the N-Squared Problem. Although it is rare that every application talks to every other application, many applications share multiple interfaces. So the equation offers a reasonable approximation.
•Redundancy: The transformations required to implement communications are implemented redundantly by multiple interfaces. Each application must design its own approach to merging data from multiple sources.
•Impact Assessment: When an application changes in a way that affects its internal or external schema, every application that maps to that schema must be examined, reverified, and possibly revised.
•Scope of Knowledge: When an application is being upgraded to support new requirements, or when a new application is added to the architecture, architects have to examine every other application to determine what interfaces are required.
The ANSI/SPARC Committee recommended an alternative architecture, as depicted in Figure 2, in which interaction among applications is mediated through an integration broker that implements what they called a conceptual schema, and what this paper refers to as a common information model or CIM (also known as a Common Business Information Model, Canonical Business Information Model, Normalized Information Model, Common Business Object Model). In this architecture, individual applications map their schemas only to the conceptual schema and interface only to the component that implements the conceptual schema (herein referred to as the integration broker), which is responsible for translating data from the source application first into the neutral form of the CIM and from that into the form required by the target application. Since each exchange passes from the internal schema of the source application through the conceptual schema of the integration broker to the external schema of the destination, the committee referred to this approach as a three-schema architecture.

Figure 2. Three-schema architecture (applications such as People, Payroll, Planning, and Operations interact only through the Common Information Model)
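The sketch below is a deliberately simplified, schematic rendering of this broker role: each application registers mappings between its own schema and a neutral CIM record, and the broker translates by composing the source application's mapping into the CIM with the target application's mapping out of it. The applications, field names, and mapping functions are hypothetical.

```python
# Hypothetical sketch: an integration broker translating via a neutral CIM.
# Each application maps only to the CIM, never directly to another application.

class IntegrationBroker:
    def __init__(self):
        self.to_cim = {}      # application name -> function(app record) -> CIM record
        self.from_cim = {}    # application name -> function(CIM record) -> app record

    def register(self, app, to_cim, from_cim):
        self.to_cim[app] = to_cim
        self.from_cim[app] = from_cim

    def translate(self, source, target, record):
        cim_record = self.to_cim[source](record)      # source schema -> CIM
        return self.from_cim[target](cim_record)      # CIM -> target schema

broker = IntegrationBroker()
broker.register("Payroll",
                to_cim=lambda r: {"person_id": r["emp_no"], "name": r["emp_name"]},
                from_cim=lambda c: {"emp_no": c["person_id"], "emp_name": c["name"]})
broker.register("Planning",
                to_cim=lambda r: {"person_id": r["id"], "name": r["full_name"]},
                from_cim=lambda c: {"id": c["person_id"], "full_name": c["name"]})

print(broker.translate("Payroll", "Planning", {"emp_no": 7, "emp_name": "Ada"}))
# {'id': 7, 'full_name': 'Ada'}
```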
Although each communication requires a two-step translation instead of the one-step translation of the two-schema architecture,
•Number of Interfaces: The number of interfaces to build and maintain is substantially less, growing linearly with the number of applications, rather than with the square of that number (a worked comparison appears in the sketch after this list).
•Redundancy: The broker manages the common tasks of transforming and merging data from multiple sources, a task that would have to be done redundantly by each application in the two-schema architecture.
•Impact Assessment: When an application changes in a way that affects its mapping of its schema to the CIM, the only other mappings that must be examined, reverified, and possibly revised are those that contain data whose definition in the CIM has changed.
•Scope of Knowledge: The effect of the three-schema architecture is to hide the sources and targets of data. An application’s integration architects need only know about the common information model; they need not know the sources and targets of its
contents. Hence, those sources and targets can, in most cases, be changed without critical impact on the application.
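A small worked comparison, not taken from the original text, makes the two growth rates concrete; it assumes that in the three-schema architecture each application keeps one interface in each direction with the integration broker, so the count grows linearly.

```python
# Worked comparison of interface counts. Illustrative assumption: with a broker,
# each application has one interface in each direction with the integration
# broker, so the three-schema count is 2N rather than 2N(N-1).

def two_schema_interfaces(n):
    return 2 * n * (n - 1)        # = 2*n**2 - 2*n

def three_schema_interfaces(n):
    return 2 * n                  # one interface per direction per application

for n in (5, 10, 50):
    print(n, two_schema_interfaces(n), three_schema_interfaces(n))
# 5   40    10
# 10  180   20
# 50  4900  100
```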
Although the three-schema architecture was published in 1977 (Jardine), it was not extensively implemented for several reasons:
•It increased the scope and complexity of the documentation to be developed and managed in order to keep track of all the schemas and mappings.
•The database technology for which it was intended did not scale well to support multiple applications with different external schemas.
•The change management practices needed to approve a change to the common information model did not meet project schedule requirements.
•Commercial products were being developed that utilized their own embedded databases that did not rely on the conceptual schema.
The end result was that database applications could be delivered faster without using a three-schema architecture, and the recommendation languished.
Today, the common information model has re-emerged as a viable architectural approach, not as an architecture for physical databases but as an approach to achieving application integration. Variations on this approach to achieving the goals of the CIM usually offer one or more of three features:
1.Ontology
2.Standard Exchange Format
3.Integration Framework
Ontologies
An ontology is a specification of the information appropriate to a particular business process or domain. It defines the entities that can be the subjects of information, the properties of those entities, the relationships among them, and in some cases the fundamental operations that can be performed among them. Like the CIM, an ontology is neutral with respect to applications and technology. Moreover, although it is typically drawn from the vocabulary and practice of a particular community of users, it is also like the CIM, available for use outside that community, and may overlap with the ontologies of other communities. The CIM is essentially an integrated ontology that embraces the ontologies of all the communities that need to share information. A number of standards (described below under Standard Exchange Format) pro-