
Rivero, L. Encyclopedia of Database Technologies and Applications. 2006.
Ontology-Based Data Integration
Agustina Buccella
Universidad Nacional del Comahue, Argentina
Alejandra Cechich
Universidad Nacional del Comahue, Argentina
Nieves Rodríguez Brisaboa
Universidade da Computación, Spain
INTRODUCTION
Nowadays, different areas of large modern enterprises use different database management systems to store and search their critical data. Competition, evolving technology, geographical distribution, and inevitably growing decentralization all contribute to this diversity. All of these databases are very important to an enterprise, but their different interfaces make their administration difficult. Therefore, retrieving information through a common interface becomes crucial in order to realize, for instance, the full value of the data contained in the databases (Hass & Lin, 2002).
In the 1990s, the term federated database emerged to characterize techniques for providing integrated data access to a set of distributed, heterogeneous, and autonomous databases (Busse, Kutsche, Leser, & Weber, 1999; Litwin, Mark, & Roussoupoulos, 1990; Sheth & Larson, 1990). Even though these concepts are rather well known today, a brief introduction will clarify their meanings in the context of our paper. They are as follows:
• Autonomy: Users and applications can access data through a federated system or through their own local system. Autonomy can be classified into three types: design autonomy, communication autonomy, and execution autonomy (Busse et al., 1999; Ozsu & Valduriez, 1999). The first type refers to data model independence, the second involves the different ways of communication among the systems, and the third refers to the independence of the execution of local operations.
• Distribution: In recent years, with the arrival of the Internet, it has become very common to see computers connected by some type of network. Generally speaking, data may be distributed among multiple sources and stored in a single computer system or in multiple computer systems. These computer systems may be geographically distributed but interconnected by a communication network.
• Heterogeneity: The different meanings that may be inferred from data stored in databases. In Cui and O'Brien (2000), heterogeneity is classified into four categories: structural, syntactical, system, and semantic. Structural heterogeneity deals with inconsistencies produced by different data models; syntactical heterogeneity deals with the consequences of using different languages and data representations; system heterogeneity deals with different supporting hardware and operating systems; and semantic heterogeneity is further classified as (a) dealing with semantically equivalent concepts, where different models use different terms to refer to the same concept (e.g., synonyms, or properties modeled differently by different systems); (b) dealing with semantically unrelated concepts, where the same term is used by different systems to denote completely different concepts; and (c) dealing with semantically related concepts, by using generalization/specialization, different classifications, and so forth. A similar classification of heterogeneity can be found in Goh (1996).
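To make the semantic cases concrete, the following sketch (all schema and term names are invented for illustration) shows how a naive shared synonym table can reconcile semantically equivalent concepts that two sources name differently:

```python
# Hypothetical sketch: two source schemas describing the same domain.
# Source A calls the concept "client"; source B calls it "customer"
# (semantically equivalent concepts expressed with different terms).
schema_a = {"client": ["name", "phone"]}
schema_b = {"customer": ["full_name", "telephone"]}

# A shared synonym table is one naive way to reconcile the terms.
synonyms = {"client": "person", "customer": "person",
            "name": "full_name", "phone": "telephone"}

def normalize(term):
    """Map a source-specific term to a canonical term, if one is known."""
    return synonyms.get(term, term)

# "client" and "customer" now resolve to the same canonical concept.
print(normalize("client") == normalize("customer"))  # True
```

This handles only case (a), synonyms; cases (b) and (c) need richer structures, which is precisely what ontologies provide.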
In this paper, we will focus on the use of ontologies because of the advantages they offer for data integration. For example, an ontology may provide a rich, predefined vocabulary that serves as a stable conceptual interface to the databases and is independent of the database schemas; the knowledge represented by the ontology may be sufficiently comprehensive to support the translation of all relevant information sources; and an ontology may support consistency management and the recognition of inconsistent data. The next section will analyze several systems that use ontologies as a tool to solve data integration problems.
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

BACKGROUND
Recently, the term federated databases has evolved into federated information systems (FIS) because of the diversity of new information sources involved in the federation, such as HTML pages, databases, and files, either static or dynamic.
A useful classification of information systems based on the dimensions of distribution and heterogeneity can be found in Busse et al. (1999). Figure 1 shows this classification, in which the addition of a third dimension, autonomy, gives rise to the term federated information systems.
In addition, the work reported in Busse et al. (1999) defines the classical architecture of federated systems (based on Sheth & Larson [1990]), which is widely cited by many researchers. Figure 2 shows this architecture. In the figure, the wrapper layer involves a number of modules belonging to a specific data organization. These modules know how to retrieve data from the underlying sources, hiding their data organizations. As the federated system is autonomous, local users may access local databases through their local applications
Figure 1. Classification of information systems.
Figure 2. Architecture of Federated Systems
independently from users of other systems. In contrast, to access the federated system, users need to go through the user interface layer.
The federated layer is one of the main components currently under analysis and study. Its importance comes from its responsibility for solving the problems related to semantic heterogeneity introduced previously. So far, different approaches have been used to model this layer. They are as diverse as they are, in some cases, complementary, and they can involve different perspectives, such as the use of ontologies (Ambite et al., 1997; Buccella, Cechich, & Brisaboa, 2003; Goh, Bressan, Siegel, & Madnick, 1999; Gray et al., 1997) or the use of metadata (Busse et al., 1999; Nam & Wang, 2002; Seligman & Rosenthal, 1996).
ONTOLOGY-BASED DATA INTEGRATION
The term ontology was introduced by Gruber (1993) as an “explicit specification of a conceptualization.” A conceptualization, in this definition, refers to an abstract model of how people commonly think about a real thing in the world; and explicit specification means that concepts and relationships of the abstract model receive explicit names and definitions.
An ontology gives the names and descriptions of domain-specific entities by using predicates that represent relationships between these entities. The ontology provides a vocabulary to represent and communicate domain knowledge, along with a set of relationships among the vocabulary's terms at a conceptual level. Therefore, because of their potential to describe the semantics of information sources and to solve heterogeneity problems, ontologies are being used for data integration tasks.
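As an illustration of this definition, the sketch below (all concept and relation names are hypothetical) represents a tiny ontology as a set of named concepts plus explicit, named relationships between them:

```python
# Minimal illustrative sketch of an ontology: named concepts and
# (subject, predicate, object) relationships at a conceptual level.
ontology = {
    "concepts": {"Publication", "Book", "Author"},
    "relations": {
        ("Book", "is_a", "Publication"),     # specialization relationship
        ("Book", "written_by", "Author"),    # domain-specific predicate
    },
}

def subclasses_of(onto, concept):
    """Concepts directly declared as specializations of `concept`."""
    return {s for (s, p, o) in onto["relations"]
            if p == "is_a" and o == concept}

print(subclasses_of(ontology, "Publication"))  # {'Book'}
```

Real ontology languages (e.g., OWL) add axioms, constraints, and inference on top of such a vocabulary; this sketch only conveys the "named concepts plus named relationships" core.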
Some surveys of ontology-based systems for data integration can be found in the literature. For example, Wache et al. (2001) focused on several aspects of the use of ontologies: the representation language, mappings, and tools. This work also classified the use of ontologies into three approaches: the single ontology approach, the multiple ontology approach, and the hybrid ontology approach (see Figure 3).
Another survey, comparing the expressiveness of ontology languages, can be found in Corcho and Gomez-Perez (2000), but in that case only languages for representing ontologies are compared.
This section will focus on how ontology-based systems address semantic heterogeneity problems. We have investigated many systems that follow the approaches of Figure 3, considering the relevant aspects shown in the hierarchy of Figure 4. This hierarchy does not denote specialization; rather, it indicates whether quality properties are improved due to a particular use of ontologies in a given system.

Figure 3. Classification of the use of ontologies
The use of ontologies refers to how ontologies help solve data integration problems. Commonly, the systems show how ontologies are used to solve integration problems and how they interact with other architectural components. In this respect, the systems describe their ontological components and the ways they solve the different semantic heterogeneity problems.
We then characterize this feature by means of three quality characteristics: reusability, changeability, and scalability. Reusability refers to the ability to reuse ontologies; that is, ontologies defined to solve other problems can be used in a system because (a) the system supports different ontological languages and (b) the system defines local ontologies. Changeability refers to the ability to change structures within an information source without producing substantial changes in the system components. Finally, scalability refers to the possibility of easily adding new information sources to the integrated system.
In the following, we describe how some ontology-based systems implement the aspects of Figure 4. In general, we have chosen systems widely cited by researchers, although some of them are still under development. Additionally, we have analyzed other recent systems that implement new ideas for solving heterogeneity problems.
Figure 4. Main aspects of the use of ontologies
Firstly, with respect to the use of ontologies, we can find similarities among some systems. For example, the ontologies defined in SIMS (Ambite et al., 1997; Arens, Hsu, & Knoblock, 1993, 1996) and Carnot (Woelk, Cannata, Huhns, Shen, & Tomlinson, 1993) do not facilitate reusability because both define a global ontology (single ontology approach) in order to integrate data. The same is true of changeability, because these systems provide no support for accommodating changes in the information sources: when one source changes, the global ontology must be rebuilt. Finally, scalability is not supported at all because, again, adding new information sources requires rebuilding the global ontology to include the new terms and mappings not yet considered.
In general, the problem with a global ontology approach is that we must manage a global integrated ontology, which involves administration, maintenance, consistency, and efficiency problems that are very hard to solve. For example, SIMS requires constructing a general domain model that encompasses the relevant parts of the database schemas. Because each database model is related to this general domain model, the integration problem is shifted from how to build a single integrated model to how to map the domain and information source models. Other similar examples are the MediWeb (Arruda, Baptista, & Lima, 2002) and DIA (Medcraft, Schiel, & Baptista, 2003) systems. On one hand, MediWeb uses XML documents as information sources (Baru, 1998). XML provides only a syntactical view of the underlying data. To allow for precise querying and semantic interpretation, the XML documents must be complemented with a conceptual model that adequately describes the semantics of their tags. Ontologies fill this semantic gap, enabling precise semantic expression for querying XML documents. These ontologies are used as a common schema. Therefore, scalability and changeability are poor, because a change in one source or the addition of a new source forces a modification of the global ontology. On the other hand, DIA does not provide any support for reusable ontologies, but its changeability and scalability are better, because its global ontology does not need to be modified when a source is added; only an ontology-schema matching table is necessary to add a new database to the integrated system.
In contrast, systems such as OBSERVER (Mena, Kashyap, Sheth, & Illarramendi, 1996, 2000) use the multiple ontology approach to alleviate the problems presented by the single approach. OBSERVER defines a model for dealing with multiple ontologies, avoiding the problems of integrating global ontologies. The different ontologies (user ontologies) can be described
using different vocabularies depending on the user's needs. In OBSERVER, every information source has one ontology server, a module that provides information about the ontologies located in the node as well as about their underlying data repositories. The system encapsulates any direct interaction with user ontologies and also any access to data repositories. Another important component of the OBSERVER system is its IRM shared repository. It can be seen as a catalog of the semantics of the system, used to solve the "vocabulary problem" (heterogeneous vocabularies used to describe the same information). The IRM component supports ontology-based interoperation by defining several kinds of interontology relationships, such as synonym, hyponym, hypernym, and overlap, among the terms of different (locally developed) ontologies. Reusability is widely supported by OBSERVER because the local ontologies might have been defined for other purposes.
From the perspective of the hybrid ontology approach, the KRAFT (Gray et al., 1997; Preece, Hui, & Gray, 1999; Preece, Hui, Gray, Jones, & Cui, 1999) system defines two kinds of ontologies: local ontologies and a shared ontology. For each knowledge source there is one local ontology, and the shared ontology formally defines the terminology of the problem domain. In order to avoid the semantic heterogeneity problems that might occur between a local ontology and the shared ontology (ontology mismatches; Visser, Jones, Bench-Capon, & Shave, 1998), an ontology mapping is also defined for each knowledge source. It is a partial function that maps terms and expressions of a local ontology to terms and expressions of the shared ontology. As the local ontologies can be defined independently, reusability is possible. Changeability and scalability are also better supported, because when a source changes or a new source needs to be added, only the local ontologies and the ontology mappings must be modified to include the changed or new information.
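A KRAFT-style ontology mapping can be pictured as a partial function over terms. The sketch below uses invented term names and is only an illustration of the idea of a partial local-to-shared mapping, not KRAFT's actual representation:

```python
# Hedged sketch: a *partial* function from terms of a local ontology
# to terms of the shared ontology. All term names are hypothetical.
local_to_shared = {
    "cpu_speed": "clock_frequency",   # local term -> shared term
    "maker": "manufacturer",
}

def to_shared(local_term):
    """Translate a local term; None when the partial map is undefined."""
    return local_to_shared.get(local_term)

# Mapped terms translate; terms outside the partial function's domain
# simply have no shared counterpart.
print(to_shared("maker"))          # manufacturer
print(to_shared("internal_code"))  # None
```

Partiality matters for changeability: a purely local term can be added to a source without touching the shared ontology at all.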
Something similar happens with mediators over ontology-based information sources (MOIS; Tzitzikas, Spyratos, & Constantopoulos, 2001), COIN (Firat, Madnick, & Grosof, 2002; Goh et al., 1999; Siegel & Madnick, 1991), and InfoSleuth (Bayardo et al., 1997; Deschaine, Brice, & Nodine, 2000; Woelk & Tomlinson, 1994). In the case of MOIS, each source is associated with an ontology, which consists of a set of terms structured by a subsumption relation. MOIS also has a mediator, a secondary source that can bridge the heterogeneities between two or more sources and provide unified access to those sources. In addition, the mediator has a number of articulations to the sources. An articulation to a source is a set of relationships between the terms of the mediator and the terms of that source. The system does not merge the ontologies to solve heterogeneity problems among the sources; instead, if the same term appears within two sources, the two terms are not assumed to be equivalent, because they may have different interpretations (meanings). Two terms are considered equivalent only if they can be shown to be equivalent using the articulations of each source. The addition of a new source is a simple task because only two steps are necessary: the selection of a mediator in the network and the design of an articulation between the selected mediator and the new source.
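The articulation idea can be sketched as follows (source and term names are invented): two source terms count as equivalent only when the articulations tie them to a common mediator term, so identical spellings alone prove nothing.

```python
# Sketch of MOIS-style articulations: per source, a set of
# (mediator_term, source_term) relationships held by the mediator.
articulations = {
    "source1": {("Vehicle", "Car")},
    "source2": {("Vehicle", "Automobile"),
                ("Crane", "Car")},   # here "Car" means a lifting car/crane
}

def equivalent(src1, term1, src2, term2):
    """Equivalent iff both terms articulate to a common mediator term."""
    m1 = {m for (m, t) in articulations[src1] if t == term1}
    m2 = {m for (m, t) in articulations[src2] if t == term2}
    return bool(m1 & m2)

print(equivalent("source1", "Car", "source2", "Automobile"))  # True
print(equivalent("source1", "Car", "source2", "Car"))         # False
```

Note how the homonym "Car" is correctly kept apart: equivalence is derived from the articulations, never from the spelling of the terms.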
COIN contains three main components: context axioms, elevation axioms, and a domain model. The domain model is a collection of the sources' primitive types, such as strings or integers, and semantic types that define the application domain corresponding to the integrated data sources. The elevation axioms act as a mapping between attributes in a source and semantic types in the domain model. Finally, the context axioms define alternative interpretations of the semantic objects in different contexts. All of these components, together with the object-oriented approach used by COIN, mean that the domain model does not need to be updated every time a new source is added. The main advantage of COIN is that it allows knowledge of data semantics to be captured independently in the sources while letting a specialized mediator (the context mediator) detect and solve potential conflicts at the time a query is submitted.
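A classic way to picture context mediation is currency and scale conversion. The sketch below is hypothetical (the contexts, rates, and names are invented for illustration, not COIN's actual formalism): the same semantic type, a price, is interpreted differently in each source's context, and a mediator converts raw values at query time.

```python
# Hedged sketch of context mediation: per-source context axioms say how
# to interpret a raw "price" value; the mediator converts on demand.
contexts = {
    "ny_source": {"currency": "USD", "scale": 1},
    "tokyo_source": {"currency": "JPY", "scale": 1000},  # 1000s of yen
}
# Hypothetical conversion rates into the receiver's currency (USD);
# 0.0078125 = 1/128, i.e., roughly 128 JPY per USD.
usd_rate = {"USD": 1.0, "JPY": 0.0078125}

def to_receiver_context(value, source):
    """Convert a raw price from the source context into plain USD."""
    ctx = contexts[source]
    return value * ctx["scale"] * usd_rate[ctx["currency"]]

print(to_receiver_context(25, "tokyo_source"))  # 195.3125
print(to_receiver_context(225, "ny_source"))    # 225.0
```

The point mirrors COIN's design: neither source is rewritten and no global schema is built; conflicting interpretations are reconciled only when a query crosses contexts.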
As a final case, the InfoSleuth system is an agent-based system that provides interoperation among autonomous systems. To add a resource to the system, the resource only needs an interface to advertise its services and let other agents make use of them immediately.
FUTURE TRENDS
Several of the systems compared here are still at a development stage and, as we have explained, some problems should be solved in order to reach a good integration. For example, systems such as SIMS and Carnot use a global ontology, which decreases reusability, changeability, and scalability. Other systems propose multiple or hybrid approaches to avoid these problems, but additional effort has to be made to reach reusability.
Finally, automatic or semiautomatic tools are necessary to help the integration process: tools to map different ontologies or assistant tools to create the ontologies. For example, the DOME (Cui, Jones, & O'Brien, 2001; Cui & O'Brien, 2000) system uses ontology extraction tools to generate the initial ontologies. Specifically, DOME uses XRA (Yang, Cui, & O'Brien, 1999), which is an ontology extraction tool
that uses a reverse engineering approach to extract an initial ontology from given data sources and their application programs.
CONCLUSION
Today, semantic heterogeneity involves many complex problems that are addressed using different approaches, among them the use of ontologies, which bring a higher degree of semantics to the treatment of the data involved in the integration.
We have analyzed several systems according to their use of ontologies, and we have evaluated three important related aspects: reusability, changeability, and scalability. Each system has implemented its own solution with advantages and disadvantages, but some common elements and some original aspects can be found. There are other important aspects we could have considered for characterizing current ontology-based systems. Among them, we should mention the use of automated tools for supporting the manipulation of ontologies.
Of course, further characterization is needed to completely understand the use of ontologies in data integration. We hope our work will motivate readers to immerse themselves more deeply in this interesting world.
REFERENCES
Ambite, J. L., Arens, Y., Ashish, N., Knoblock, C. A., et al. (1997, December 22). The SIMS manual 2.0 (University of Southern California Tech. Rep.). Retrieved October 10, 2003, from http://www.isi.edu/sims/papers/sims-manual.ps
Arens, Y., Hsu, C., & Knoblock, C. A. (1993). Retrieving and integrating data from multiple information sources.
International Journal in Intelligent and Cooperative Information Systems, 2(2), 127-158.
Arens, Y., Hsu, C., & Knoblock, C. A. (1996). Query processing in the SIMS information mediator. In A. Tate (Ed.), Advanced Planning Technology (pp. 61-69).
Menlo Park, CA: AAAI Press.
Arruda, L., Baptista, C., & Lima, C. (2002). MEDIWEB: A mediator-based environment for data integration on the Web. Databases and Information Systems Integration. ICEIS, 34-41.
Baru C. (1998). Features and requirements for an XML view definition language: Lessons from XML information mediation. Paper presented at the W3C Workshop on
Query Language (QL’98). Retrieved March 5, 2004, from http://www.w3.org/TandS/QL/QL98/pp/xmas.html
Bayardo, R. J., Jr., Bohrer, W., Brice, R., Cichocki, A., Fowler, J., Helal, A., et al. (1997). InfoSleuth: Agent-based semantic integration of information in open and dynamic environments. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 195-206).
Buccella, A., Cechich, A., & Brisaboa, N. R. (2003). An ontology approach to data integration. Journal of Computer Science and Technology, 3(2), 62-68. Retrieved March 20, 2004, from http://journal.info.unlp.edu.ar
Busse, S., Kutsche, R. D., Leser, U., & Weber, H. (1999). Federated information systems: Concepts, terminology and architectures (Tech. Rep. No. 99-9). Berlin, Germany: Technische Universität.
Corcho, O., & Gomez-Perez, A. (2000). Evaluating knowledge representation and reasoning capabilities of ontology specification languages. Proceedings of the ECAI 2000 Workshop on Applications of Ontologies and Problem-Solving Methods. Retrieved February 5, 2004, from http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/schedule.html
Cui, Z., Jones, D., & O’Brien, P. (2001). Issues in Ontology-based Information Integration. Proceedings of the IJCAI-01 Workshop on Ontologies and Information Sharing (pp. 141-146).
Cui, Z., & O’Brien, P. (2000). Domain ontology management environment. In Proceedings of the 33rd Hawaii International Conference on System Sciences.
Deschaine, L. M., Brice, R. S., & Nodine, M. (2000). Use of InfoSleuth to coordinate information acquisition, tracking and analysis in complex applications (Tech. Rep. No. MCC-INSL-008-00).
Firat, A., Madnick, S., & Grosof, B. (2002). Knowledge integration to overcome ontological heterogeneity: challenges from financial information systems. Twenty-Third International Conference on Information Systems.
Goh, C. H. (1996). Representing and reasoning about semantic conflicts in heterogeneous information sources.
Doctoral dissertation, Massachusetts Institute of Technology, Sloan School of Management.
Goh, C. H., Bressan, S., Siegel, M., & Madnick, S. E. (1999). Context interchange: New features and formalisms for the intelligent integration of information. ACM Transactions on Information Systems, 17(3), 270-293.
Gray, P. M. D., Preece, A., Fiddian, N. J., et al. (1997). KRAFT: Knowledge fusion from distributed databases
and knowledge bases. Proceedings of the 8th International Workshop on Database and Expert Systems Application (pp. 682-691).
Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.
Hass, L., & Lin, E. (2002). IBM federated database technology. Retrieved September 25, 2003, from http://www-106.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html
Litwin, W., Mark, L., & Roussoupoulos, N. (1990). Interoperability of multiple autonomous databases. ACM Computing Surveys, 22(3), 267-293.
Medcraft, P., Schiel, U., & Baptista, P. (2003). DIA: Data integration using agents. Databases and Information Systems Integration. ICEIS, 79-86.
Mena, E., Kashyap, V., Sheth, A., & Illarramendi, A. (1996). Managing multiple information sources through ontologies: relationship between vocabulary heterogeneity and loss of information. Proceedings of Knowledge Representation Meets Databases, ECAI’96 Conference,
Budapest, Hungary (pp. 50-52).
Mena, E., Kashyap, V., Sheth, A., & Illarramendi, A. (2000). OBSERVER: An approach for query processing in global information systems based on interoperation across pre-existing ontologies (pp. 1-49). Boston: Kluwer Academic. Retrieved February 15, 2004, from http://citeseer.nj.nec.com/mena96observer.html
Nam, Y., & Wang, A. (2002). Metadata Integration Assistant Generator for Heterogeneous Distributed Databases.
Proceedings of the International Conference on Ontologies, Databases, and Applications of Semantics for Large Scale Information Systems, Irvine, CA (pp. 28-30).
Ozsu, M. T., & Valduriez, P. (1999). Principles of distributed database systems (2nd Ed.). Prentice Hall.
Preece, A., Hui, K., Gray, A., Jones, D., & Cui, Z. (1999). The KRAFT architecture for knowledge fusion and transformation. Proceedings of the 19th SGES International Conference on Knowledge-Based Systems and Applied Artificial Intelligence, Berlin, Germany. Retrieved February 15, 2004, from http://www.csd.abdn.ac.uk/~apreece/Research/KRAFT.html
Preece, A., Hui, K., & Gray, P. (1999). KRAFT: Supporting virtual organisations through knowledge fusion. Artificial Intelligence for Electronic Commerce (Tech. Rep. WS-99-01). AAAI Press.
Seligman, L., & Rosenthal, A. (1996). A metadata resource
to promote data integration. Retrieved February 15, 2004, from http://www.computer.org/conferences/meta96/seligman/seligman.html
Sheth, A. P., & Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous and autonomous databases, ACM Computing Surveys, 22(3), 183-236.
Siegel, M. & Madnick, S. E. (1991). A metadata approach to resolving semantic conflicts. Proceedings of the 17th Conference on Very Large Data Bases, Barcelona, Spain, September.
Tzitzikas, Y., Spyratos, N., & Constantopoulos, P. (2001). Mediators over ontology-based information sources.
Proceedings of the 2nd International Conference on Web Information Systems Engineering, Kyoto, Japan.
Visser, P. R. S., Jones, D. M., Bench-Capon, T. J. M., & Shave, M. J. R. (1998). Assessing heterogeneity by classifying ontology mismatches. Proceedings of the International Conference on Formal Ontology in Information System (pp. 148-162).
Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., et al. (2001). Ontology-based integration of information: A survey of existing approaches. Proceedings of the IJCAI-01 Workshop: Ontologies and Information Sharing, Seattle, WA.
Woelk, D., Cannata, P., Huhns, M., Shen, W., & Tomlinson, C. (1993). Using Carnot for enterprise information integration. Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems (pp. 133-136).
Woelk, D., & Tomlinson, C. (1994). The InfoSleuth Project: Intelligent search management via semantic agents (Tech. Rep.). Retrieved February 15, 2004, from http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/woelk/woelk.html
Yang, H., Cui, Z., & O'Brien, P. (1999). Extracting ontologies from legacy systems for understanding and re-engineering. IEEE International Conference on Computer Software and Applications.
KEY TERMS
Data Integration: Process of unifying data that share some common semantics but originate from unrelated sources.
Distributed Information System: A set of information systems physically distributed over multiple sites, which are connected with some kind of communication network.
Federated Database: The same as an FIS, but the information systems involve only databases (i.e., structured sources).
Federated Information System (FIS): A set of autonomous, distributed, and heterogeneous information systems that operate together to generate useful answers for users.
Heterogeneous Information System: A set of information systems that differ in syntactical or logical aspects, such as hardware platforms, data models, or semantics.
Ontological Changeability: The ability to change structures of an information source without producing substantial changes in the ontological components of the integrated system.
Ontological Reusability: The ability to create ontologies that can be used in different contexts or systems.
Ontological Scalability: The ability to easily add new information sources without generating substantial changes in the ontological components of the integrated system.
Ontology: Provides a vocabulary to represent and communicate knowledge about the domain and a set of relationships among the terms of the vocabulary at a conceptual level.
Semantic Heterogeneity: Each information source has a specific vocabulary according to its understanding of the world. The different interpretations of the terms within each of these vocabularies cause semantic heterogeneity.
Open Source Database Management Systems
Sulayman K. Sowe
Aristotle University of Thessaloniki, Greece
Ioannis Samoladas
Aristotle University of Thessaloniki, Greece
Ioannis Stamelos
Aristotle University of Thessaloniki, Greece
INTRODUCTION
This article discusses open source database management systems (OSDBMS) trends from two broad perspectives. First, the software engineering discipline platform on which databases are built has recently witnessed a new form of software development—Free/Open Source Software Development (F/OSSD). Methodically, the F/OSSD paradigm has changed the way relational databases, initiated in the 1960s and 1970s, are developed, distributed, supported, and maintained. Second, commercial relational database management systems (RDBMS) still dominate the database market because, on one hand, vendors and users are skeptical of the boon of applications developed and distributed under the F/OSSD paradigm, and on the other hand, it has been argued that OSDBMS are not likely to follow the successful trend of other robust Free/Open Source Software (F/OSS) systems (Linux, Apache, etc.). This article presents trends in OSDBMS by looking at the morphology and landscape of the type of applications developed by the F/OSS community. Implementation of F/OSS strategies and factors mitigating the adoption and utilization of OSDBMS are explored by looking at the interactions between the F/OSSD process and database firms, vendors, and users.
The greatest claim levied against the F/OSS movement is that application development is tilted towards products that are highly modular, such as application program interfaces (APIs), rather than end-user products such as databases. The other line of thought, mostly coming from commercial RDBMS users, vendors, and enterprises, claims that databases are too critical a component of software to be left to the F/OSSD model. Both camps seem to have ample evidence to support their stance. What we present here is a compilation of views drawn from existing literature in research and narrative journals and from our experience as F/OSS participants.
BACKGROUND
Users of F/OSS, having access to the source code, are free to study what the program does, modify it to suit their needs, distribute copies to other people, and publish improved versions so that the whole F/OSS community can benefit. The license agreement under which the source code is distributed defines exactly the rights users have over the product. Methodically, F/OSS is collaboratively developed software. An egalitarian network of developers, referred to as hackers, develops software online in a decentralized environment free of hierarchical control structures. Participants rely on extensive peer collaboration through the Internet, using projects' mailing lists, e-mails, discussion forums, and so forth. Moreover, communities have been found to provide support services such as suggestions for product features, acting as distribution organs, answering queries, helping new members, and so forth. Products at various development statuses (mature, stable, alpha, etc.) can be freely downloaded, while nonprofit foundations (NPF) and second-generation companies (SGC) may distribute products at minimal cost. F/OSSD has emerged as an alternative approach to decrease cycle time and design complexity, improve software quality, and reduce costs for a large number of software applications, such as OSDBMS.
What is more interesting about the F/OSS landscape is that applications developed by the community are not uniformly distributed across all disciplines. Part of our ongoing survey of F/OSSD activities on both freshmeat.net (Figure 1) and sourceforge.net (Figure 2) shows that F/OSS projects cover wide-ranging topics, with development dominated by infrastructural or Internet-based components. In this study, databases show a moderate increase. The Freshmeat study shows a 17.18% increase in database-related projects during 20 weeks of our monitoring in 2003; Sourceforge recorded 24.20% during the same period. Popularity with the developer community shows that OSDBMS are no longer
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
TEAM LinG

Open Source Database Management Systems
Figure 1. freshmeat.net |
Figure 2. sourceforge.net |
viewed as an add-on when distributing popular F/OSS products.
One explanation for the concentration of applications in the areas described could be that the F/OSSD model tends to work best in areas where developers are themselves users. The model has had much less success with regard to OSDBMS. In addition, F/OSSD has been viewed as a kind of intrinsically motivated activity, concentrating more on personally interesting aspects of software and targeting the high-end technical user (Krishnamurthy, 2003). This leaves the development of most databases to commercial entities.
A database is a coherent and structured collection of related data, and OSDBMS are those RDBMS built and distributed by means of the F/OSS model of software development and distribution. OSDBMS users not only have access to the product's source code, which is not possible with commercial RDBMS, but may also freely download and modify the code and tailor it to fit their particular needs. The F/OSSD model yields two kinds of OSDBMS. Pure OSDBMS emanate from the F/OSSD practices of a particular individual, research/academic institution, or group of hackers with no corporate involvement. Hybrid OSDBMS, on the other hand, result from a software firm outsourcing all or part of a database's development to the F/OSS community.
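To make the definition above concrete, the following minimal sketch uses SQLite, an open source embedded DBMS that ships with Python's standard library, as one illustrative stand-in for the class of OSDBMS discussed here; the table name and data are hypothetical examples, not drawn from the article:

```python
# Minimal sketch: exercising an open source DBMS (SQLite via Python's
# stdlib sqlite3 module). The schema and rows are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, ward TEXT)"
)
cur.executemany(
    "INSERT INTO patients (name, ward) VALUES (?, ?)",
    [("Ada", "cardiology"), ("Grace", "oncology"), ("Alan", "cardiology")],
)
conn.commit()

# Query the structured collection just as one would with any RDBMS.
cur.execute("SELECT COUNT(*) FROM patients WHERE ward = ?", ("cardiology",))
count = cur.fetchone()[0]
print(count)  # prints 2
conn.close()
```

Because the engine's source code is open, a user who needed different behavior could in principle modify and rebuild the DBMS itself, which is the freedom the F/OSSD model provides and commercial RDBMS withhold.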
OSDBMS are in use in all aspects of life and industry, ranging from manipulation of data for hospital use, accounting, and scientific and engineering applications, to areas such as immigration, allowing data harmonization across geographical boundaries. The dawn of the knowledge economy has made them an important resource that can be harnessed to generate crucial information for critical and real-time decision making. Table 1 shows a list of selected databases that are reported to be free of charge for academic or personal use (Jeusfeld, 2003). Heterogeneous communities of F/OSS developers develop these databases, targeting diverse aspects and needs.
MAIN THRUST OF THE ARTICLE
The emergence of F/OSS enterprises seeks to push software development out of the academic stream into the commercial mainstream (Babcook, 2004; LaMonica, 2003b), and as a result, end-user applications such as OSDBMS are becoming more popular. Gedda (2003) substantiated that increased adoption of OSDBMS technologies by well-known organizations has also led to more enterprise recognition and, hence, consideration of OSDBMS. What is more compelling is that companies (e.g., Sybase, Oracle, Sun, IBM) are increasingly implementing open source strategies, porting programs and applications to the Linux environment while at the same time realizing that they can charge for complementary services such as post-sale support (ibid). Thus, F/OSS is redefining the software industry (Feller & Fitzgerald, 2000) in general and database development in particular. There is a gradual shift in focus from protecting software knowledge to maximizing gain from F/OSS development, use, and distribution. Furthermore, software enterprises are realizing the need to move from being in-house software developers and distributors to being a service industry in which software products are judged on their quality, reliability, and performance by the people who develop and use them.
Beyond the promise of a reduced total cost of ownership of the software and potentially better support, there is an added dimension of freedom from vendor lock-in, the situation in which an entire database application becomes dependent on a single vendor's implementation of a technology (Adams, 2002). As conjectured by Jeusfeld (2003), when OSDBMS users have access to source code, they are not forced into a perpetual upgrade cycle. A panel discussing trends in open source databases sees this freedom as having a big impact on the ecosystem of the company distributing the OSDBMS (Editingwhiz, 2003).

Table 1. Some databases available free of charge for academic or personal use
Even though F/OSS applications have made a giant leap in the server sector (e.g., Apache) and in operating system and network environments (e.g., Linux), OSDBMS have yet to make a substantial break into the database market dominated by Microsoft, Oracle, and IBM, to mention a few. Nevertheless, both Wayner (2001) and the 2004 Database Development Survey by Evans Data Corporation demonstrated that, while companies continue to use their commercial databases, there is a recognized phase shift towards OSDBMS, especially when new applications and major upgrades are needed. This is partly due to the attractive pricing of major OSDBMS (Gedda, 2003; Lowe, 2002; Martin, 2003), a viable developer and support community, and the ability to be easily integrated with other F/OSS tools and systems. However, a growing number of companies adopt a cautious approach towards full utilization of OSDBMS. This apprehensive trend is expected to continue, even though OSDBMS offerings are continually improving. As stated by Rooney (2004), the main challenges faced by large enterprises in adopting OSDBMS for mission-critical applications have consistently been scalability and third-party support. Despite success in the lower-end and mid-sized markets (Martin, 2003; Rooney, 2004), it may take a while for the low-cost attraction of OSDBMS to make a real impact on the database industry the way Linux has on the operating system market. As argued by Krishnamurthy (2003), releasing the source code and pricing are two separate decisions. What is encouraging in this sense is that releasing the source code of a given OSDBMS only improves the innovation base of the system. Understandably, one can reinstall a crashed operating system or recover from a network server malfunction with little damage to a company's valuable data, but when a database application fails, a lot is at stake, because databases contain information that is vital for the success of any industry in the information age. Yet IT managers have reservations about entrusting their companies' most valuable data to OSDBMS (Wayner, 2001; Gedda, 2003).