
…provide languages for building ontologies, rather than ontologies themselves.
An ontology typically leaves undefined the mechanisms by which information based on the ontology is exchanged. That is left to those who implement the ontology. However, some ontologies are embedded in an architectural framework that provides standard mechanisms for exchanging information. This is particularly the case for certain industry standards such as Aerospace Industries Association (AIA) (2004) and ISO TC184/SC4 (2004).
Standard Exchange Format
Some industry organizations are developing standards for how to define and exchange information based on agreements about common information. They provide standard languages and mechanisms, and leave it as a separate exercise, which they often facilitate, to specify the information content to be expressed in those languages.
United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport (UN/EDIFACT)
UN/EDIFACT established some of the original rules for global electronic data interchange (EDI). Although the technologies that implemented these UN standards are being replaced, the ontologies on which they are based constitute the bulk of most modern standards.
Open Applications Group (OAG)
The Open Applications Group (OAG) provides a process for integration that depends on the exchange of XML documents that conform to the OAG standard for Business Object Documents (BODs) (OAG, 2002).
Organization for the Advancement of Structured Information Standards (OASIS)
The Organization for the Advancement of Structured Information Standards (OASIS) defines standards for Electronic Business using eXtensible Markup Language (ebXML) (OASIS, 2001) and Universal Business Language (UBL) (OASIS, 2004), both neutral languages for the definition of common information for inclusion in the CIM.
World Wide Web Consortium (W3C)
The World Wide Web Consortium (W3C) has developed the Semantic Web (W3C, 2003), which “provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (W3C, 2004), which integrates a variety of applications using XML for syntax and URIs for naming.” The Semantic Web presumes the Internet as its only infrastructure foundation. It is based on formal languages for specifying ontologies to govern the interchange of information over the Internet.
Integration Frameworks
The standards mentioned above try to achieve the objectives of the CIM by standardizing the content and/or format of the exchange package, and leaving the question of how a component uses or produces that package entirely to its implementation. An alternative approach attempts to optimize the integrated system by exploiting specific technologies.
Distributed Management Task Force (DMTF)
The Distributed Management Task Force (DMTF) (2004) specifies an object-oriented framework for implementing the CIM as a collection of objects that encapsulate the routing of method invocations from client to target and back.
DMTF is the only standard described herein that actually uses the expression “Common Information Model”, which it defines as follows:
CIM provides a common definition of management information for systems, networks, applications and services, and allows for vendor extensions. CIM’s common definitions enable vendors to exchange semantically rich management information between systems throughout the network. (DMTF, 2004)
[An information model is an] abstraction and representation of the entities in a managed environment—their properties, operations, and relationships. It is independent of any specific repository, application, protocol, or platform. (DMTF, 2003, p. 6)
The ontology embedded in the DMTF CIM currently focuses on components of the computing infrastructure. However, the definitions, techniques, and supporting infrastructure are applicable regardless of the system whose elements need to be managed across a network.
The DMTF architecture is based on the standards of object-oriented systems promulgated by the Object Management Group (OMG). These include:
• The Unified Modeling Language (UML) (OMG, 2003a), which is used to define the elements of the CIM.
• The Meta-Object Facility (MOF) (OMG, 2002), which provides a meta-ontology that defines the kinds of elements that can be included in an ontology.
• XML Metadata Interchange (XMI) (OMG, 2003b), which specifies how the definitions of modeling elements that conform to MOF can be exchanged between tools as XML files.
THE COMMON INFORMATION MODEL IN INFORMATION SYSTEMS INTEGRATION
The key role of the CIM in all of the above industry activities is to reduce the coupling between components of an information system so that those components can be revised or replaced without a major redesign of other components in order to accommodate the change. The mapping of an application’s external schema, that is, the face it presents to the computing environment, to the CIM rather than to other applications, effectively encapsulates it, in other words, hides its schema and operations. The CIM “knows” which applications have elements that correspond to its own elements, which of them are sources of values for those elements, and which are users. The middleware that implements the CIM “knows” what operations and interfaces are required to move an element of data in and out of an application under what conditions. The maps constitute a service contract that allows each component to evolve as required, as long as it continues to abide by that contract in its interactions with the system.
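A highly simplified sketch may make this hub-and-spoke encapsulation concrete. The C++ below is purely our illustration (the class names, CIM element names, and registration calls are hypothetical; no particular middleware product or API is implied): each application registers an adapter against CIM element names, and only the hub knows which application sources an element and which applications consume it.

    #include <functional>
    #include <map>
    #include <string>

    // Each application exposes only an adapter; applications never see
    // each other's schemas, only CIM element names.
    struct Adapter {
        std::function<std::string()>            source; // produces a CIM-typed value
        std::function<void(const std::string&)> sink;   // consumes a CIM-typed value
    };

    class CimHub {
    public:
        void register_source(const std::string& cim_element, Adapter a) {
            sources_[cim_element] = a;
        }
        void register_user(const std::string& cim_element, Adapter a) {
            users_.insert({cim_element, a});
        }
        // Only the hub knows the routing from source to every user.
        void propagate(const std::string& cim_element) {
            const std::string value = sources_[cim_element].source();
            auto range = users_.equal_range(cim_element);
            for (auto it = range.first; it != range.second; ++it)
                it->second.sink(value);
        }
    private:
        std::map<std::string, Adapter>      sources_;
        std::multimap<std::string, Adapter> users_;
    };

    int main() {
        CimHub hub;
        // HR sources the address; payroll consumes it.
        hub.register_source("Employee.Address",
                            Adapter{[] { return std::string("123 Elm St"); }, {}});
        hub.register_user("Employee.Address",
                          Adapter{{}, [](const std::string&) { /* update local copy */ }});
        hub.propagate("Employee.Address");
        return 0;
    }

Replacing the source application then means re-registering a single adapter; the consumers and the rest of the service contract are untouched.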
Within that general framework, there are a number of dimensions that might vary among different implementations of the CIM.
Neutrality
The essence of the Common Information Model is neutrality with respect to the applications and technologies integrated through it, but that neutrality admits of degrees. The following levels of neutrality are derived in large part from John Zachman’s Framework for an Information Systems Architecture (Zachman, 1987; Zachman & Sowa, 1992).
• The Common Business Information Model (CBIM) is the most completely neutral. It defines the information to be shared among business processes, for example, employees, organizations, products, and all their myriad attributes and relationships. It also defines the changes in those business objects as they progress through various business processes. The elements of this model provide the semantics for the data that are exchanged among the computing systems, as well as for the data that those systems present to people.
• The Common Data Model (CDM) or Common Object Model (COM), depending on the extent to which object-oriented concepts are used in specifying these common assets, is less implementation-neutral in that it specifies the data that represent the elements of the CIM. The CDM provides an application (and technology) neutral vocabulary and rules of representation for specifying the contents of the data that flow among computing components and people. The elements of the CDM are mapped to elements of the CIM in order to provide well-defined semantics for data. The mappings between application schemas and the CIM are ultimately to their CDM counterparts, because that mapping must reflect not only semantic alignment but also representational alignment.
• The Common Data Model Implementation (CDMI) is a manifestation of the CIM in a given technology. Integration architects must choose whether to implement the CIM as a physical data collection or as a virtual collection that exists only ephemerally in transformations derived from the mappings of application schemas to the CDM. Among the possible physical implementations of the CIM are the following:
• Common objects in an integration broker, that is, middleware that transforms data into and out of the common objects.
• Common tables in a data warehouse, that is, a database that persists data states internally for use by other applications.
• Common objects in an application server, which may be the basis for common processing as well as for Web services, which aggregate those objects into the coarse-grained units that are suitable for business-to-business (B2B) interactions.
• OAG Business Object Documents (BODs) (OAG, 2002), which transmit a chunk of the CDMI through the network from one application to another.
Each of these implementations of the CDM is encapsulated from the applications that rely on it for data exchange. The applications connect only to the adapters that transform data into and out of the application. The internal representation between source and target adapters is irrelevant to those applications.
The use of a subset of the CDM as a schema for middleware has a number of advantages:
• It reduces the schema design effort.
• It reduces variation among the schemas of the various technologies that support integration.
• It provides a roadmap for interface design.
Scope
The scope of the CIM defines the information to be comprehended (that is, its ontology) and thus the kinds of information that can be exchanged through it.
No CIM can provide a theory of everything. Early attempts to implement shared databases presumed that development of the system would begin with an agreement on the schema for the corporate database, which would then provide concepts and guidelines for all of the components. Such a goal proved elusive. The very effort of seriously trying delayed needed applications.
The same applies to current manifestations of the Common Information Model. An enterprise CIM might be an ideal target, but it is unlikely to be achieved as an early milestone in the integration process. Instead, integration will proceed incrementally with a number of autonomous integration projects developing their own local CIMs in parallel with guidelines in place to improve the likelihood that local CIMs can be integrated into more comprehensive CIMs when the business case demands it.
Process
The use of the Common Information Model as the basis for large-scale integration is too new a practice for the processes around it to be well-founded. Thus the suggestions below should be understood as a starting point to be refined with experience.
Normalization
A canonical form is a standard for defining elements of the CIM. The most thoroughly developed approach to canonical form is E.F. Codd’s (1970) theory of normalized data. Codd’s theory was directed at database technology and was the mathematical foundation for relational databases. However, the mathematics of normalization applies to any collection of data designed for sharing across multiple applications.
Normalization is a technique that can be applied to collections of data with relatively minimal documentation. All that is required is knowledge of the functional dependencies among data elements, that is, which values of which data elements assure the uniqueness of the values of other data elements. Normalization partitions data elements into subcollections whose values are determined by the same set of determining elements, or keys.
Normalization as a technique was reinforced by the development, by Peter Chen (1976) and others, of the entity-relationship (ER) approach to data modeling, which tended to yield data structures that converged with those produced by normalization. ER modeling can be understood as the model theory that provides the formal semantics for normalized data.
The value of normalization to the development of the CIM is that it assures that any given type of data is to be found in only one place in the collection. Indeed, normalization can be understood as a process for removing redundancies from data collections, thereby removing the anomalies that complicate updates to the same data in multiple places. Normalization simplifies the mapping to application schemas by assuring non-redundancy in the CIM.
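A small worked example (ours, not from the original article) shows the mechanics. Suppose an order record carries the functional dependencies below; normalization groups each element with its determinant, so each fact is stored in exactly one place:

    % Unnormalized relation (illustrative):
    %   Order(OrderID, ProductID, CustomerID, CustomerName, Qty)
    % Functional dependencies:
    \begin{align*}
      \mathit{OrderID} &\to \mathit{CustomerID}\\
      \mathit{CustomerID} &\to \mathit{CustomerName}\\
      \mathit{OrderID},\,\mathit{ProductID} &\to \mathit{Qty}
    \end{align*}
    % Partitioning the elements by their determinants yields:
    %   Orders(OrderID, CustomerID)
    %   Customers(CustomerID, CustomerName)
    %   OrderLines(OrderID, ProductID, Qty)

Every non-key attribute now depends on exactly one key, so updating a customer's name touches one row in one place.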
INITIALIZING THE COMMON INFORMATION MODEL
Throughout the life of the CIM, the relationships implicit in Table 1 must be maintained:
• Common data must be mapped to the common information that provides its meaning (its semantics).
• Business process information must be mapped to the common information to provide a basis for information flow across processes and to validate definitions of common information against the rules of business processes.
• Application data must be mapped to business process information to provide meaning to it in the context of supporting the process.
• Application data must be mapped to common data to support the exchange of data through CIM-based middleware.
Table 1. CIM relationships

    Common Information (CIM)  |  Business Process Information
    Common Data (CDM)         |  Application Data

This cycle of relationships assures the consistency of information as the data that represent it flow among applications through the CIM. The cycle also indicates the sources from which the CIM and the CDM are to be derived:
• Business process vocabulary and rules: These come from the professional disciplines on which the processes are based, as refined through the experience and practice of those disciplines within an enterprise.
• Application data schemas: These are typically part of the documentation of the applications and reflect the developers' understanding of the disciplines and processes they are intended to support.
• Industry standards: Standard data models, such as those mentioned above, reflect a consensus on specific application domains and can be used as a subset of the CDM and CIM. Well-established standards are often supported by application interfaces, which simplify the task of mapping the application to a CDM that includes them.
• Middleware components: Middleware products often include common data definitions that result from their vendors' efforts to integrate applications.
Although none of these sources should be taken as final and binding on the CIM, each provides valuable input to the process of building the CIM and CDM.
EVOLUTION OF THE COMMON INFORMATION MODEL
As applications are added, replaced, or retired, the CIM and CDM will evolve to reflect the changing collection of information that the system makes available. In most cases, when the schema of a new application is integrated into the CIM, it results in a few additional attributes and relationships being added to already defined entities. Occasionally, especially in the early phases of CIM development, whole new entities are added with new relationships. Neither of these changes normally affects existing interfaces of the CIM to the applications that had already been integrated. There are two situations in which a change of applications can be disruptive:
• An application that is the only source of data used by other applications is retired. This poses the dilemma of either finding a new source for that data or persuading the owners of the client applications that they no longer need it.
  • In the first case, the new source can be integrated into the system, and other applications are not disrupted.
  • In the second, all of the applications and interfaces, as well as all of the business processes, that require the deprecated data must be revised.
• An application is added that requires the CIM to be restructured. For example, a geographical analysis application is added that treats streets, cities, and states as interrelated first-class objects, rather than mere text-valued attributes. In order to apply the application to existing addresses, common objects that contain addresses will have to be revised as interrelated structures of objects. As a consequence, the maps from old applications will have to be revised to accommodate the new structure. Since the map is part of the CIM service contract, this can impose a significant procedural challenge. A mitigating approach, which this author has not yet been able to test, is to leave the old CIM structures in the CIM, with appropriate mappings to the new structures, as sunsetted elements or as views, for use in existing interfaces until such time as those interfaces have to be revised for other reasons.
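A sketch of that mitigation, in C++ with purely illustrative names (this is our rendering of the address example, not code from any standard or product):

    #include <string>

    // New first-class CIM structures introduced by the geographical application.
    struct Street { std::string name; };
    struct City   { std::string name; };
    struct State  { std::string code; };

    struct StructuredAddress {
        Street street;
        City   city;
        State  state;
    };

    // Sunsetted view: preserves the old flat, text-valued attribute for
    // existing interfaces by deriving it from the new structure.
    class LegacyAddressView {
    public:
        explicit LegacyAddressView(const StructuredAddress& a) : addr_(a) {}

        // Old consumers keep calling this; the mapping to the new
        // structure lives in one place, inside the CIM.
        std::string text() const {
            return addr_.street.name + ", " + addr_.city.name +
                   " " + addr_.state.code;
        }
    private:
        const StructuredAddress& addr_;
    };

    int main() {
        StructuredAddress a{{"Elm St"}, {"Springfield"}, {"IL"}};
        LegacyAddressView view(a);
        std::string flat = view.text(); // old interfaces still see one string
        (void)flat;
        return 0;
    }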
THE FUTURE OF THE COMMON INFORMATION MODEL
The use of the CIM as a basis for integrating applications is a new practice for most application architects. It requires an investment in tools, techniques, and configuration management that has not been required by traditional application development. Given the pressures on costs and schedules that drive applications, many projects will resist such significant changes to their processes. On the other hand, there are a number of other forces that may overcome this inertia:
Business Integration
Business consolidation, restructuring, partnerships, and Web-based interactions are changing the face of enterprises. Competition is forcing companies to outsource processes and to engage in digital business-to-business (B2B) interactions in order to reduce costs. Companies that cannot engage in such interactions risk loss of business.
Typically, standard B2B interactions, whether they are implemented by standard messages, by Web services, or by something else, cannot be fully managed by a single application and require interoperation among several applications. The use of a CIM to manage the unpacking, workflow, and assembly of these standard B2B interactions is expected to simplify and reduce the cost of that interaction.
Technology
Middleware vendors are competing aggressively to support businesses in their integration activities. Some middleware products already offer significant support for a CIM-based architecture, and others are beginning to use the rhetoric.
Service-Oriented Architecture
Platform vendors are competing aggressively to support what is being referred to as a service-oriented architecture, which relies on communication through Web services. Currently, the marketing rhetoric suggests that individual applications will expose their data and operations through their own Web services. The implication is a two-schema architecture in which applications will directly invoke the services of another application. This makes every object exposed by a Web service a separate interface to which other applications can bind. Moreover, applications are likely to expose very similar objects with significant semantic differences that cannot be determined from information in the directories that define the services. It remains to be seen whether enterprises can develop semantically consistent systems of such applications without facing the maintenance penalties described above for point-to-point interfaces.
An alternative is to forego application-specific Web services and to develop a service-oriented architecture around a CIM. This allows the enterprise to specify precisely in a single place the semantics of the objects it is exposing to the world through its Web services. It also avoids redesigning a large number of existing applications to achieve a service-oriented architecture.
Model-Driven Architecture
The Object Management Group (OMG) is actively promoting what it calls a Model Driven Architecture (MDA).
MDA® provides an open, vendor-neutral approach to the challenge of business and technology change. Based firmly upon OMG’s established standards [Unified Modeling Language (OMG, 2003c), Meta-Object Facility (OMG, 2002), XML Metadata Interchange (OMG, 2003b), and Common Warehouse Metamodel (OMG, 2001)], MDA aims to separate business or application logic from underlying platform technology. Platform-independent applications built using MDA and associated standards can be realized on a range of open and proprietary platforms, including CORBA®, J2EE, .NET, and Web Services or other Web-based platforms. Fully-specified platform-independent models (including behavior) can enable intellectual property to move away from technology-specific code, helping to insulate business applications from technology evolution, and further enable interoperability. In addition, business applications, freed from technology specifics, will be more able to evolve at the different pace of business evolution. (OMG, 2003a)
The ability of MDA tools to generate working code from models and design and implementation profiles has been demonstrated. Whether they can scale to produce fully integrated systems of applications remains to be seen. One of the challenges lies in the complexity of the models that will be required at multiple levels to specify these systems and analyze their mutual impacts. Although the CIM does add a complex component to the mix, the resulting complexity of the set of models that specify the whole system is likely to be less, since it results in an architecture not burdened by the N-Squared Problem.
CONCLUSION
The Common Information Model is a technique for insulating the details of the schemas of interoperating applications from one another, allowing each to evolve while continuing to communicate. It is an integrated approach to reducing the overall complexity of systems, making them more manageable and adaptable to changing business requirements. The CIM manifests itself in several different industry standards with different technologies and varying ontologies. This paper has attempted to show the relationships among these standards and technologies, despite the differences in their respective vocabularies. Whether successful practices evolve that make the CIM and its supporting technology competitive with other similar technologies remains to be seen.
REFERENCES
Aerospace Industries Association (AIA). (2004). Retrieved February 2, 2005, from http://www.aia-aerospace.org/supplier_res/work_groups/ecwg/about_ecwg.cfm
Chen, P. (1976, March). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.
Codd, E.F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387. Abstract retrieved February 2, 2005, from http://www.informatik.uni-trier.de/~ley/db/journals/cacm/Codd70.html

DMTF (Distributed Management Task Force). (2003). CIM concepts white paper. Retrieved February 2, 2005, from http://www.dmtf.org/standards/documents/CIM/DSP0110.pdf

DMTF (Distributed Management Task Force). (2004). Common information model (CIM) standards. Retrieved February 2, 2005, from http://www.dmtf.org/standards/cim
Gruber, T. (n.d.). Ontology. Retrieved February 2, 2005, from http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
ISO TC184/SC4. (2004). Industrial data. Retrieved February 2, 2005, from http://www.tc184-sc4.org/
Jardine, D. (1977). ANSI/SPARC database model (three-schema architecture). Proceedings of the 1976 IBM SHARE Working Conference on Data Base Management Systems, Montréal, Québec.
Merriam-Webster Online. (2004). Retrieved February 2, 2005, from http://www.m-w.com/dictionary.htm
OAG (Open Applications Group). (2002). OAGIS: A “canonical” business language. Retrieved February 2, 2005, from http://www.openapplications.org/downloads/whitepapers/whitepaperdocs/whitepaper.htm
OASIS (Organization for the Advancement of Structured Information Standards). (2001). ebXML technical architecture specification v1.0.4. Retrieved February 2, 2005, from http://www.ebxml.org/specs/index.htm#technical_specifications

OASIS (Organization for the Advancement of Structured Information Standards). (2004). OASIS universal business language 1.0. Retrieved February 2, 2005, from http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=ubl

OMG (Object Management Group). (2001). Common warehouse metamodel (CWM) specification. Retrieved February 2, 2005, from http://www.omg.org/cgi-bin/apps/doc?ad/01-02-01.pdf

OMG (Object Management Group). (2002). Meta-object facility (MOF) specification. Retrieved February 2, 2005, from http://www.omg.org/cgi-bin/apps/doc?formal/03-05-02.pdf
OMG (Object Management Group). (2003a). Modeldriven architecture (MDA). Retrieved February 2, 2005, from http://www.omg.org/mda/
OMG (Object Management Group). (2003b). XML metadata interchange (XMI) specification. Retrieved February 2, 2005, from http://www.omg.org/cgi-bin/apps/doc?formal/03-05-02.pdf

OMG (Object Management Group). (2003c). Unified modeling language (UML). Retrieved February 2, 2005, from http://www.omg.org/cgi-bin/apps/doc?formal/03-03-01.pdf

United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport. (2004). Retrieved February 2, 2005, from http://www.unece.org/trade/untdid/welcome.htm

W3C (World Wide Web Consortium). (2003). Semantic Web. Retrieved February 2, 2005, from http://www.semanticweb.org/
W3C (World Wide Web Consortium). (2004). RDF/XML syntax specification (revised). Retrieved February 2, 2005, from http://www.w3.org/TR/rdf-syntax-grammar/
Zachman, J. (1987). A framework for information systems architecture. IBM Systems Journal, 26(3), 276. Retrieved February 2, 2005, from the Zachman Institute for Framework Advancement (2004), http://zifa.dynedge.com/cgi-bin/download.cgi?file_name=ZachmanSystemsJournal1.pdf&file_path=/home/zifa.com/friends/ZachmanSystemsJournal1.pdf&size=15856203

Zachman, J., & Sowa, J. (1992). Extending and formalizing the framework for information systems architecture. IBM Systems Journal, 31(3), 590. Retrieved February 2, 2005, from the Zachman Institute for Framework Advancement (2004), http://zifa.dynedge.com/cgi-bin/download.cgi?file_name=ZachmanSystemsJournal2.pdf&file_path=/home/zifa.com/friends/ZachmanSystemsJournal2.pdf&size=2405114
Zachman Institute for Framework Advancement. (2004) Retrieved February 2, 2005, from http://www.zifa.com/
KEY TERMS
Architecture: The organization of components of a system that enables their interacting with one another to achieve the objectives of the system.
Common Data Model (CDM): A definition of the data to be shared across the scope of an integrated computing system in terms that are neutral with respect to the applications and technologies that make up that system. A CDM provides a vocabulary for talking meaningfully about the data that represent the information defined in the flow, transformation, and contents of data packages that are delivered from one application or technology to another.
Common Information Model (CIM): A definition of the information to be shared across the scope of an integrated system. A CIM may span multiple domains, as long as the elements of each of the domains can be mapped uniquely to an element of the CIM.
Conceptual Schema: A definition of the information to be shared across the scope of an integrated system. This term was used by the ANSI-SPARC Committee (Jardine, 1977) for what is referred to in this paper as a Common Information Model.
Domain: A scope of information definition. A domain defines a collection of information generally recognized as appropriate to a field of study, a business process or function, or mission.
External Schema: Used by the ANSI-SPARC Committee (Jardine, 1977) for a definition of the information required by a particular application or other component.
Integration: A system is integrated when it is sufficiently interconnected that a change to any element of the system by any component of the system is reflected appropriately, that is, according to the business rules of the system, in every other component of the system. For example, if a Human Resources system is integrated, and an employee changes his or her address through any human interface that allows it, then the new address will be shown automatically every place else in the system.
Integration Broker (IB): A middleware product that uses an internal (or virtual) representation of a Common Data Model (CDM) to mediate the exchange of data and data-related services among applications. An IB manages the physical, syntactic, and semantic translation of data from any application, the validation of rules of authorization and business processing, and the transport to each application and component required to preserve systemwide consistency of the virtual collection of data.
Internal Schema: Used by the ANSI-SPARC Committee (Jardine, 1977) for a definition of the information available from a particular database, application, or other source.
Meta-Data: Literally, data about data. The term is used, often ambiguously, in either of two ways:
1. Data that defines the types of data, applications, information, business processes, and anything else of importance to those responsible for specifying, building, maintaining, deploying, operating, or using components of an information system. This kind of meta-data is typified by the contents of requirements or design specifications, or of the models produced by CASE tools, and comes in many different forms. For example, the most readily accessible meta-data for our cast of technologies might be:
• COBOL typically provides data definitions in COPYLIBS.
• Relational databases typically provide SQL definitions of tables.
• XML systems provide DTDs or XML Schemas.
• Object-oriented systems provide some form of object directory.
• Formal models of other kinds, such as UML, might be provided, but maintained models are rare.
2. Data that describes the provenance and particulars of a specific instance of a data package. For example, the header of an HTML file might specify the title, owner, dates of creation and last change, creating application, and other properties specific to that particular instance of the HTML file.
Ontology: A branch of metaphysics concerned with the nature and relations of being. A particular theory about the nature of being or the kinds of existents. (Merriam-Webster, 2004). “A specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents” (Gruber).
Schema: A definition of the data available for use by an application or system of applications.
Three-Schema Architecture: Used by the ANSI/SPARC Committee (Jardine, 1977) for a data-sharing architecture in which a client application interfaces its external schema only to a conceptual schema, which is itself interfaced to the internal schemas of available sources.
Two-Schema Architecture: Used by the ANSI/SPARC Committee (Jardine, 1977) for a data-sharing architecture in which a client application interfaces its external schema directly to the internal schema of a source.
Component-Based Generalized Database Index Model
Ashraf Gaffar
Concordia University, Canada
INTRODUCTION
The performance of a database is greatly affected by the performance of its indexes. An industrial-quality database typically has several indexes associated with it; therefore, the design of a good-quality index is essential to the success of any nontrivial database. In keeping with their significance, indexed data structures are inherently complex applications that require a lot of effort and consume a considerable amount of resources. Index frameworks rely on code reuse to reduce the costs associated with them (Lynch & Stonebraker, 1988; Stonebraker, 1986). Generalized database systems have further addressed this challenge by offering databases with indexes that can be adjusted to different data/key types, different queries, or both. The generalized search tree (GiST; Hellerstein, Naughton, & Pfeffer, 1995; Hellerstein, Papadimitriou, & Koutsoupias, 1997) is a good example of a database system with a generalized index, or generalized index database for simplicity.

Additional improvements extended the concept of generalized index databases to work on different domains by providing generalized access methods (search criteria). For example, building on the work of Hellerstein et al. (1995), Aoki (1998) provides a generalized framework that allows users to adjust the index to different search criteria such as equality, similarity, or nearest-neighbor search. This makes the database system customizable not only to finding exact records, but also to finding records that are “similar” or “close” to a given record. Users can define their own criteria of “similarity” and let the index apply them to the database and return all “similar” results. This is particularly important for more challenging domains like multimedia applications, where there is always the need to find a “close image,” a “similar sound,” or “matching fingerprints.”
The common drawback of all these improvements is a monolithic code base that allows users to adjust only the few lines of code that are meant to be customized. Outside these areas, code is often bulky and difficult to maintain or upgrade. However, with the increasing adoption of software engineering techniques, these problems can be rectified. Component-based frameworks provide solid ground for future maintenance, replacement, and additions to the code in an orderly fashion that does not reduce the quality of the system (the aging effect). Customization does not have to be limited to a few lines of the code. In our work, we provide a new model to redesign the generalization of database indexes using components. Unlike previous works, the customization is not done at the code level, but rather at higher conceptual levels from the early design stages. Our design is based on total decoupling of the design modules and connecting them through well-defined interfaces to build the database system. All this is done at the design level. After the design is completed, implementation is done by obtaining already-existing commercial off-the-shelf (COTS) components and connecting them together according to the design model. Following this design-level customization paradigm, the system can be customized as usual by the user at prespecified locations (a predefined few lines of code), but, more importantly, a large-scale customization is also possible at the design level. The system designer can redesign the generalized model by adding, removing, or replacing a few components in the design model to instantiate the model into a new concrete design. Afterwards, only the affected components of the source code need to be replaced by new ones. In our system, COTS components are used extensively, which dramatically reduces the cost of development. We needed to implement a few components with the same predefined interfaces where COTS components were not suitable, for example, when special concurrency control management or specific storage needs were necessary. This adds more flexibility to the model.
BACKGROUND
Component-based software systems have received a lot of attention recently in several application domains. Reusing existing components serves the important goals of simultaneously reducing development time and cost (time-to-market) and producing quality code. Moreover, components can be added, removed, or replaced during system maintenance or reengineering phases, which leads to faster and better results (Sanlaville, Faver, & Ledra, 2001). Quartel, Sinderen, and Ferreira Pires (1999) offer some useful techniques to model and analyze software components and their composition. To promote successful reuse, components have to be context-independent (Schmidt, 1997). C++ has adopted a component-based library as its standard library (STL, the Standard Template Library), which provides programming-level components that can be used at the design stage and later added as code fragments to generate a considerable part of the application code (Musser, Derge, & Saini, 2001). STL adopts a paradigm that separates software applications into a few component types for storage and processing. Using the STL paradigm in the early phases of conceptual design of generalized database indexes facilitates the design and implementation of the whole system by iteratively adding arbitrary combinations of STL types.
STL BUILDING BLOCKS
We built a modular framework for a generalized database by applying the STL (ANSI C++) modularity concept (Austern, 1999) in the analysis, architecture, design, and interface stages. We developed a new set of models for an index in different application domains: a linearly ordered domain, a general domain (with both depth-first and breadth-first access methods), and, eventually, a similarity search domain.
Using coherent, decoupled building blocks allows us to locate and replace a few blocks in each model to obtain a new model with no major impact on the rest of the system design. This makes modifications limited to specific regions of the system. A wealth of STL modules can be used in the replacement process. The framework also allows for introducing new compatible modules as needed.
STL is based on separating algorithms from data structures as two different types of generic building blocks (Breymann, 2000). It also introduces other generic building blocks (e.g., iterators, functors, container adaptors) to complete the set of abstract design blocks. These work as “adaptors” or “glue” that allow a large system to be built from arbitrary combinations of blocks. This emphasizes code reuse, a fundamental concept in modern software engineering. STL allows new building blocks to be written and seamlessly integrated with the existing ones, thus emphasizing flexibility and extensibility. This makes STL particularly adaptive to different programming contexts, including algorithms, data structures, and data types. We briefly explain some of the major STL building blocks that we used in the design and implementation of the index system.
Containers
A container is a common data structure that stores a group of similar objects, each of which can be a primitive data type, a class object, or, as in the database domain, a data record. The container manages its own objects: their storage (see the Allocators section) and access (see the Iterators section). The stored objects belong to the container and are accessed through its interface. Each container provides a set of public data members and member functions to provide information and facilitate access to its elements. Different container types have different ways of managing their objects; in other words, they offer different sets of functionality for dealing with their objects.
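For example, a std::map can serve as a small in-memory analogue of an index container (the analogy is ours; the authors' index containers are of course more elaborate):

    #include <map>
    #include <string>

    int main() {
        // A container of key/record pairs, analogous to a tiny index.
        std::map<int, std::string> index;

        // The container owns its elements and manages their storage.
        index.insert({42, "record for key 42"});
        index[7] = "record for key 7";

        // Information and access come only through the container's interface.
        bool present = index.count(42) > 0;
        (void)present;
        return 0;
    }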
Iterators
Containers do not allow direct access to their stored objects, but rather through another class of objects called iterators. An iterator is an object that can reference an element in a container. It allows programmers to access all elements in a sequential, random, or indexed way depending on the type of the container.
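A minimal illustration of iterator-mediated access to such a container (our sketch, not code from the framework):

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<int, std::string> index{{7, "record 7"}, {42, "record 42"}};

        // The iterator references elements without exposing the
        // container's internal (here, tree-based) structure.
        for (std::map<int, std::string>::const_iterator it = index.begin();
             it != index.end(); ++it) {
            std::cout << it->first << " -> " << it->second << '\n';
        }
        return 0;
    }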
Algorithms
Iterators separate the processing logic from the container types and element types stored inside them. The same algorithm, written once, can be applied to different containers with different stored elements by having the algorithm code deal with iterators that can access the containers. This allows for a complete implementation of generic algorithms, like search and sort, without knowing the exact type of the container (its functionality) or the type of elements stored in it.
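For instance, the generic std::find_if can search a vector and a list with the same code, because the algorithm is written against iterators rather than against any particular container (a deliberately simple sketch):

    #include <algorithm>
    #include <list>
    #include <vector>

    // A predicate handed to a generic algorithm.
    static bool is_even(int x) { return x % 2 == 0; }

    int main() {
        std::vector<int> v{1, 2, 3, 4};
        std::list<int>   l{5, 6, 7, 8};

        // One algorithm, two different containers, unknown element layout.
        std::vector<int>::iterator vi = std::find_if(v.begin(), v.end(), is_even);
        std::list<int>::iterator   li = std::find_if(l.begin(), l.end(), is_even);
        (void)vi; (void)li;
        return 0;
    }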
Allocators
As we know, containers are responsible for managing both the access and the storage of their elements. At the front end, the access is allowed by providing “suitable” iterators and a set of functions to provide information about the elements, to return a reference to an element, or to add/delete an element. At the back end, the physical storage remains an essential part to the container, allocating memory to new elements and freeing memory from deleted elements. Normally the user need not worry about storage management when using a container. This is done “behind the scenes” without any user intervention. This greatly simplifies the use of containers. Frequently, however, some applications require a
special kind of storage management and cannot simply work with the standard memory allocation. If the container was made directly responsible for the back-end storage, this would have meant that those special applications would need to discard the STL containers altogether and write their own special containers, defeating the purpose of code reuse, efficiency, and elegance of STL building blocks in this area of design. Therefore, to be able to serve those special cases as well, while still keeping the containers simple for everyone else, STL made a further separation of its building blocks. The container is not directly responsible for the storage. An allocator class was made responsible for the physical storage of the container elements, and each container handles storage by simply asking an allocator object to do just that. To keep things “behind the scenes” for users, each container uses a “default” general-purpose allocator for the storage needs, unless given a specialized allocator to handle the job.
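As a sketch of this separation, the minimal C++11-style allocator below merely forwards to malloc/free; a real specialized allocator would implement whatever storage policy the application needs. The class is our illustration, not part of the authors' framework:

    #include <cstdlib>
    #include <new>
    #include <vector>

    // A minimal custom allocator: same interface contract as the default,
    // different storage policy behind the scenes.
    template <typename T>
    struct MallocAllocator {
        using value_type = T;

        MallocAllocator() = default;
        template <typename U>
        MallocAllocator(const MallocAllocator<U>&) {}

        T* allocate(std::size_t n) {
            if (void* p = std::malloc(n * sizeof(T)))
                return static_cast<T*>(p);
            throw std::bad_alloc();
        }
        void deallocate(T* p, std::size_t) { std::free(p); }
    };

    template <typename T, typename U>
    bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
    template <typename T, typename U>
    bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }

    int main() {
        // The container code is unchanged; only the storage policy differs.
        std::vector<int, MallocAllocator<int>> v;
        v.push_back(1);
        return 0;
    }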
BUILDING BLOCKS’ CONNECTIVITY
STL provides us with a complete set of compatible building blocks that can be connected together—using logical connectivity rules—to build complex systems. Using these blocks, we can produce a “blueprint” design of the system components, which can be drawn as a directed graph. The nodes would be STL blocks and the edges would be the allowed connections between them. Semantically, a connection between two nodes would represent a “use” relationship between the two STL blocks represented by the nodes. Connecting these blocks is not, however, done at random. We cannot simply connect any two blocks of our choice and put them to work. There are certain rules that apply when trying to connect two blocks together.
THE MODULAR INDEX SYSTEM
Figure 1 shows the building blocks of the indexed database model. In this system, integrated in a larger application (Butler et al., 2002), the index is seen as a container that provides an iterator to its database contents. The only way to interact with this container is through its iterator, which provides controlled access to the elements of the container.
Figure 1. Subsystem view of the index framework

Due to the huge number of entries in a typical database, the entries are paged rather than manipulated individually (Folk & Zeollick, 1992). The index container is therefore an index of pages, recursively built as containers with their own iterators, allocator, and memory management schema. In other words, the elements that the index stores are of type “page.” Each page is itself a container on a smaller scale. While the database is a container of pages that typically resides on hard disk, the page is a memory-resident container of pairs of key and data references. Again, the page is made independent of the types of both key and data reference by having them as two templates passed to the page in the implementation phase. Therefore the same page can work with any key type and data reference type that are passed to it.
The page has a page allocator, responsible for memory management like storage allocation and deallocation. The page uses a standard in-memory allocator by default. This allocator assumes that the page is entirely stored in memory, which is a reasonable assumption regarding the concept of a page in a database. Therefore it is likely for many systems to use the default page allocator. If the system has special needs for page storage, like specific space or time management, the designer can provide a special page allocator without affecting the design of the other parts of the system. The page container is unaffected by the type of allocator used, as long as it satisfies the standard interface to provide the page with all the services it needs.
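The following schematic sketch condenses this page abstraction. The names and interfaces are our own illustration rather than the authors' code, and the pluggable matching criterion anticipates the similarity search described next:

    #include <cstdlib>
    #include <memory>
    #include <utility>
    #include <vector>

    // A memory-resident container of (key, data-reference) pairs. Key and
    // DataRef are template parameters, so the same Page works with any key
    // and data-reference types. The allocator defaults to the standard
    // in-memory allocator but can be replaced without touching the Page.
    template <typename Key, typename DataRef,
              typename Allocator = std::allocator<std::pair<Key, DataRef>>>
    class Page {
    public:
        using Entry = std::pair<Key, DataRef>;

        void insert(const Key& k, const DataRef& d) {
            entries_.push_back(Entry(k, d));
        }

        // Matching is delegated to a pluggable criterion; the iteration is
        // internal, so no page iterator is exposed to the outside world.
        template <typename Match>
        const Entry* find(const Key& probe, Match matches) const {
            for (std::size_t i = 0; i < entries_.size(); ++i)
                if (matches(entries_[i].first, probe)) return &entries_[i];
            return nullptr; // no matching entry on this page
        }

    private:
        std::vector<Entry, Allocator> entries_;
    };

    int main() {
        Page<int, long> page;
        page.insert(10, 1001L);
        // An illustrative "similarity" criterion: keys within 2 of the probe.
        struct Near { bool operator()(int a, int b) const { return std::abs(a - b) <= 2; } };
        const Page<int, long>::Entry* hit = page.find(11, Near());
        (void)hit;
        return 0;
    }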
Pages can iterate through their contents, searching for specific pairs of data that satisfy some key according to a similarity criteria provided by a search function. Again, the decoupling of components allows us to write any similarity criteria in the form of an algorithm module that decides if data entities are similar. Once a location is found, the corresponding pair is returned to the object asking for this service. This implies that the iteration process is done internally, with no page iterator needed to be visible for the outside world. The