
Rivero, L. Encyclopedia of database technologies and applications. 2006
KEY TERMS
Data Semantics: A reflection of the real world that captures the relationship between data in a database, its use in applications, and its corresponding objects in the real world. Data semantics requires some form of agreement between the different agents (human and computer) that interact with and use the data. The field of data semantics deals with understanding and developing methodologies to find and determine data semantics, represent those semantics, and enable ways to use the representations of the semantics.
Data Quality: Ensuring that data supplied is fit for use by its consumer. Elements of data quality can extend to include the quality of the context in which that data is produced, the quality of the information architecture in which that data resides, the factual accuracy of the data item stored, and the level of completeness and lack of ambiguity.
Information Model: A rich central model of the business or of a domain within the business. A traditional data model or object model may serve as the basis for an information model, but ideally data models should be extended to a full ontology.
Semantic Information Management
Metadata: Metadata includes a data asset’s schema as well as information about an asset’s location, usage, origin, relationship to other data assets, rules associated with it, and assignment of ownership.
Metadata Repository: An information store containing information about each data asset. Repositories act as catalogs and include technical information about data assets, their structures, how they are used, and who is responsible for them.
Ontology: A taxonomic representation of terms and their agreed-upon associated meanings. A full ontology will include the formally defined relationships between terms in that ontology. An ontology should be all-encompassing, embracing every term within a universe of discourse.
Semantic Information Architecture: An architecture consisting of three core parts: metadata, an information model, and data semantics. The metadata facilitates a basic understanding of the organization’s data; the information model provides a conceptualization and representation of the business, based on ontology, to provide a formal specification of the real world; and the data semantics map the data sources to the information model to capture meaning, as a result of which we can understand the data.
Semantically Modeled Enterprise Databases
Cheryl L. Dunn
Grand Valley State University, USA
Severin V. Grabski
Michigan State University, USA
INTRODUCTION
A semantically modeled enterprise database is a reflection of the reality of the activities in which an enterprise engages and the resources and people involved in those activities. Many organizations have invested immense sums of money in enterprise resource planning systems (ERP) and associated “bolt-on” applications such as customer relationship management (CRM) and advanced planning systems (APS). A significant portion of the value of these systems is in the integrated database and associated data warehouse. To maximize value, the database should serve as a semantic representation of the organization. Otherwise, relevant information needed to reflect the organization’s activities may be omitted or may be stored in such a way that the underlying reality is hidden or disguised and is therefore of no use to decision makers.
Semantically modeled enterprise databases require their component objects to correspond closely to real-world phenomena and preclude the use of artifacts as system primitives (Dunn & McCarthy, 1997). Semantically modeled enterprise information systems allow for full integration of all system components centered on a single integrated database and facilitate joint use of information by decision makers. Researchers have advocated semantically designed information systems because they provide benefits to individual decision makers (Dunn & Grabski, 1998, 2000) and because they facilitate organizational productivity and interorganizational communication (Cherrington, Denna, & Andros, 1996; David, 1995; Geerts & McCarthy, 2001a).
Ontologically based systems with common semantics are necessary to facilitate interorganizational information systems (Dunn, Cherrington, & Hollander, 2005; Geerts & McCarthy, 2001b, 2003). Such systems are increasingly necessary as business-to-business e-commerce becomes a major component of the economy. Presently, most interorganizational data is sent via electronic data interchange (EDI), which requires very strict specifications as to how the data are sequenced and requires some investment by adopting organizations. Knowledge inherent in these systems is limited at best. Trading partners who implement systems based on the same underlying semantic model may eliminate many of the current problems.
ERP systems have been defined as “designed to process an organization’s transactions and facilitate integrated real-time planning, production and customer response” (O’Leary, 2000, p. 27). David, Dunn, and McCarthy (1999) propose the use of the resources-events-agents (REA) enterprise ontology as a basis for comparison among systems and ERP packages. We agree that REA is a robust candidate to which ERP systems may be compared because of its strong semantic, microeconomic, transaction, and accounting heritage. More importantly, semantic models must be used as a basis for the information system because of the information contained within the semantics (see Endnote 1).
The REA framework as an enterprise ontology provides a high-level definition and categorization of business concepts and rules, enterprise logic, and accounting conventions of independent and related organizations (Geerts & McCarthy, 2001b, 2003). The REA ontology includes three levels: the value chain level, the process level, and the task level. The value chain level models an enterprise’s “script” for doing business. That is, it identifies the high-level business processes or cycles (e.g., revenue, acquisition, conversion, financing; see Endnote 2) in the enterprise’s value chain and the resource flows between those processes. The process level represents the semantic components of each business process. The task (or workflow) level of the REA ontology is the most detailed level and includes a breakdown of all steps necessary for the enterprise to accomplish the business events that were included at the process level.
The robust nature of the basic REA model has been demonstrated in research using various notations and implementation tools (Geerts & McCarthy, 1991; Nakamura & Johnson, 1998). What began as a simple model to capture semantics of accounting transactions more effectively than traditional double-entry techniques progressed to become an enterprise ontology (Geerts & McCarthy, 2001, 2003) and has contributed to inter-enterprise integration attempts such as the development of ebXML standards (e.g., RosettaNet, UN/EDIFACT).
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
BACKGROUND
The core REA model for each transaction cycle consists of the following components, presented in list form for brevity’s sake. Readers are encouraged to read McCarthy (1982) for more detail.
• Two or more economic events that represent alternative sides of an economic exchange (at least one increment event and at least one decrement event).
• Two or more resources that represent what is received and given up in the economic exchange.
• Internal agents that represent the company’s personnel that are responsible for each of the economic events (at least one agent for each event).
• One external agent that represents the person or company with whom the company is engaging “at arms’ length” in the exchange.
• Duality relationship between the increment and decrement economic events.
• Stock-flow relationships between the events and the associated resources, representing the inflows or outflows of the resources resulting from the events.
• Responsibility relationships between the events and the internal agents.
• Participation relationships between the events and the external agents.
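The core components listed above can be sketched as a small data structure. The following Python sketch is illustrative only: the class names, field names, and the `violations` check are assumptions introduced here for the example and are not taken from McCarthy (1982) or from any REA implementation. For brevity it attaches a single resource and a single agent of each kind to every event.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Direction(Enum):
    INCREMENT = "increment"  # resource flows into the enterprise
    DECREMENT = "decrement"  # resource flows out of the enterprise


@dataclass(frozen=True)
class Resource:
    name: str


@dataclass(frozen=True)
class Agent:
    name: str
    internal: bool  # True for company personnel, False for an outside party


@dataclass
class EconomicEvent:
    name: str
    direction: Direction
    resource: Resource      # stock-flow relationship
    internal_agent: Agent   # responsibility relationship
    external_agent: Agent   # participation relationship


@dataclass
class Exchange:
    """Hypothetical container for events linked by a duality relationship."""
    events: List[EconomicEvent] = field(default_factory=list)

    def violations(self) -> List[str]:
        """Return the ways this exchange departs from the core pattern."""
        problems = []
        directions = {e.direction for e in self.events}
        if Direction.INCREMENT not in directions:
            problems.append("no increment event")
        if Direction.DECREMENT not in directions:
            problems.append("no decrement event")
        for e in self.events:
            if not e.internal_agent.internal:
                problems.append(f"{e.name}: responsible agent is not internal")
            if e.external_agent.internal:
                problems.append(f"{e.name}: participating agent is not external")
        return problems


# A revenue cycle: inventory is given up, cash is received.
cash = Resource("Cash")
inventory = Resource("Inventory")
salesperson = Agent("Salesperson", internal=True)
cashier = Agent("Cashier", internal=True)
customer = Agent("Customer", internal=False)

sale = EconomicEvent("Sale", Direction.DECREMENT, inventory, salesperson, customer)
receipt = EconomicEvent("Cash receipt", Direction.INCREMENT, cash, cashier, customer)

revenue_cycle = Exchange([sale, receipt])
print(revenue_cycle.violations())  # prints []
```

An exchange that contains only the sale would report a missing increment event, which is how the sketch expresses the duality requirement of at least one increment and one decrement event.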
The following components were added in the progression from accounting model to enterprise ontology, again presented in list form. Readers are encouraged to read all of the Geerts and McCarthy articles listed in the references; David, Gerard, and McCarthy (2002); and Dunn et al. (2005) for more detail.
• Separation of the ontology into operational and knowledge (planning and control) levels to facilitate budgeting and management.
• Integration of transaction cycle models into an enterprise-wide value chain model.
• Expansion of transaction cycle models into workflow or task level models.
• Separation of components into continuants (enduring objects with stable attributes that allow them to be recognized on different occasions throughout a period of time) and occurrents (processes or events that are in a state of flux).
• Type images that represent category-level abstractions of similar components.
• Commitment images that represent agreements to engage in future economic events.
• Assignment relationships between agents that represent the designation of an agent category to work with another agent category (e.g., salesperson assigned to customer).
• Custody relationships between agents and resources that represent the agents that are accountable for various resources.
• Fulfillment relationships between commitment images and the resulting economic events.
• Partner relationships between commitment images and the participating agents.
• Reservation relationships between commitment images and the resources that are the proposed subject of the future exchange.
• Typification description relationships between continuant components and the categories to which they belong, e.g., resource-resource type and agent-agent type relationships.
• Characterization description relationships between continuant type images, e.g., agent type–agent type and agent type–resource type relationships.
• Typification history relationships between physical occurrents and their types, indicating that the occurrents share the same script, e.g., event-event type.
• Scenario history relationships between abstract occurrents and other abstractions, e.g., event type–resource type.
• Business process as a description of the interaction between resources, agents, and dual events.
• Partnering as the purpose of the interaction between resources, agents, and commitments.
• Segmentation as a description of the grouping of physical categories into abstract continuant categories.
• Policy or standard as a description of the expression of knowledge-level rules between abstract types (e.g., scripts and scenarios).
• Plan as a description of the application of a script to physical occurrents.
• Strategy as a description of rules for the execution of a business process or partnering.
REA ONTOLOGY RESEARCH
Much attention has been given to the REA ontology in the literature. Published research papers have included design science and empirical methodologies. Textbooks include extensive coverage of the REA ontology. In this section we provide brief overviews of each of these categories and identify sources for interested readers.
Design science research associated with semantically modeled enterprise databases began before the advent of the REA ontology, in both computer science and accounting. As REA was developed, ideas were incorporated from scholars such as Codd (1970), Chen (1976), and Smith and Smith (1977). David et al. (2002) provide a detailed account of the design science research that influenced REA from its inception and throughout its subsequent progression. David et al. (2002) also detail the research studies that have made significant design science advances in REA, separating them into those that created new constructs, models, methods, and instantiations.
Empirical research associated with semantically modeled enterprise databases investigates both enterprise- and individual-level benefits. Dunn and Grabski (2002) provide a detailed review of empirical research at both levels. Studies such as Andros, Cherrington, and Denna (1992) and Cherrington et al. (1996) revealed significant benefits from a semantically modeled system based on the REA model, including reductions in cost and processing time and increases in employee satisfaction. David (1995) observed productivity and administrative efficiencies for systems identified as more REA-like. Research by Weber (1986) and O’Leary (2004) compared the REA semantic model to existing software to determine consistencies and differences. Weber (1986) compared REA to wholesale distribution software; O’Leary (2004) compared REA to SAP, a leading ERP software package. Both studies found high-level consistency between the software and REA but found lower-level divergences due to accounting artifacts embedded in the software.
Dunn and Grabski (2002) reviewed individual-level research on semantically modeled enterprise databases and provided the following conclusions. Conceptual modeling formalisms are superior to logical modeling formalisms for design accuracy (Sinha & Vessey, 1999; Kim & March, 1995). The focus on increment and decrement resources and events, along with the agents associated with those events, is consistent with database designers’ thought processes. Additionally, knowledge structures consistent with the REA template’s structuring orientation are associated with more accurate conceptual accounting database design (controlling for knowledge content, ability, and experience level; Gerard, 1998). The lack of mandatory properties with entities is not critical (Bodart et al., 2001), perhaps because of the semantics inherent in the modeled system. System designers distinguish between entities and relationships, with entities being primary (Weber, 1996). Accounting systems based on the REA model are perceived as more semantically expressive by end users than are accounting systems based on the traditional debit-credit-account model, and systems perceived as semantically expressive result in greater accuracy and satisfaction by end users than do non-semantically expressive systems (Dunn & Grabski, 2000). The ability to dis-embed the essential objects and relationships between the objects in complex surroundings depends on a cognitive personality trait, field independence, and leads to more accurate conceptual model design (at least for undergraduate students; Dunn & Grabski, 1998). Data and process methodologies are easier for novices than the object methodology and resulted in fewer unresolved difficulties during problem-solving processes (Vessey & Conger, 1994), and there is a more pronounced effect for process-oriented tasks (Agarwal, Sinha, & Tanniru, 1996a). Experience in process modeling matters, regardless of whether the modeling tool (process versus object-oriented) is consistent with the experience (Agarwal, Sinha, & Tanniru, 1996b).
REA is used to teach semantically modeled database design in many accounting and management information systems courses around the world. Several information systems textbooks have incorporated REA. Some cover REA in one or more chapters but also cover a wide variety of other systems topics (e.g., Gelinas, Sutton, & Oram, 2001; Hall, 2003; Romney & Steinbart, 2003). Others focus exclusively on REA (e.g., Dunn et al., 2005). The increasing coverage of REA in academic textbooks demonstrates its acceptance in the academic community.
FUTURE TRENDS
Electronic commerce is a means of conducting business whereby electronic technologies facilitate communications between trading partners and enhance the value of their business relationships. The most common current use of information and communication technologies (i.e., information systems) for electronic commerce is to transmit data back and forth between companies in rigidly structured formats. In business-to-consumer electronic commerce, businesses create Web sites from which they display their product offerings to potential customers. Customers complete online order forms that transmit the data in the company’s chosen format. Most business-to-business electronic commerce utilizes electronic data interchange with its rigid standards. Information systems structured in these ways meet the definition of conduits: they are pipes through which the data flows (and how well the data flows depends on the size and layout of the pipes). While they certainly add value beyond not having Internet Web sites or EDI or any other alternative, whether they maximize value is unclear. Use of a semantically based ontological modeling approach such as the REA ontology has the potential to enhance electronic commerce, particularly in the business-to-business environment.
Supply chain management issues are similar to those of e-commerce. Use of a semantically based ontological system like REA is a necessary but not sufficient condition to facilitate effective and efficient interorganizational system integration. Intelligent agents and automated intensional reasoning are also required to facilitate such integration, and the system semantics must not be obscured by subsequent implementation artifacts.
Practices associated with data warehouses can benefit from semantically modeled systems. Data warehouses and data marts are used in most large firms today because of the current limitations of hardware and software: it is not physically possible to store all event data on a single system that supports both current transaction processing and all system queries with sufficient performance. However, on a conceptual level, a system designed based on the REA ontology negates the need for a separate data warehouse. Information in data warehouses is simply a set of “views” of the operational data prepared for specific functions. These views could be produced directly from the operational database and manipulated by users. Only technological speed, capacity, and security issues limit the feasibility of discarding data warehouses.
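The idea that warehouse content is a set of views over operational data can be illustrated with a small sketch. The example below uses Python's standard `sqlite3` module with an in-memory database; the `sale` table, its columns, the sample rows, and the `monthly_sales` view are hypothetical and serve only to show a summary being derived directly from event-level records rather than copied into a separate store.

```python
import sqlite3

# In-memory database standing in for the operational (event-level) store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (
    sale_id   INTEGER PRIMARY KEY,
    sale_date TEXT NOT NULL,   -- ISO date, e.g. '2005-01-10'
    customer  TEXT NOT NULL,
    amount    REAL NOT NULL
);

INSERT INTO sale VALUES
    (1, '2005-01-10', 'Acme',  100.0),
    (2, '2005-01-22', 'Acme',   50.0),
    (3, '2005-02-03', 'Birch',  75.0);

-- A summary "view" of the kind a warehouse would otherwise materialize.
CREATE VIEW monthly_sales AS
SELECT substr(sale_date, 1, 7) AS month,
       customer,
       SUM(amount)             AS total_amount
FROM sale
GROUP BY substr(sale_date, 1, 7), customer;
""")

rows = conn.execute(
    "SELECT month, customer, total_amount "
    "FROM monthly_sales ORDER BY month, customer"
).fetchall()
print(rows)  # [('2005-01', 'Acme', 150.0), ('2005-02', 'Birch', 75.0)]
```

Because the view is computed on demand from the event records, it always reflects the current operational data; the performance cost of doing so at enterprise scale is the limitation the paragraph above describes.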
Many issues still need to be resolved, and as such these present many research opportunities. One issue focuses on the scalability of systems based on the REA ontology to support very small enterprises, large multinational firms, and firms of all sizes in between. Additional research is needed to build on automated intensional reasoning (Rockwell & McCarthy, 1999) and extend it to include the use of intelligent agents and object-based environments. Preservation of the semantics at an operational level, beyond that of the database itself, would allow decision makers additional insight into the problems and the information available to address the issues that they face. Again, object-based systems seem to provide the most benefit, but additional research is needed.
CONCLUSION
Semantically modeled enterprise databases have demonstrated benefits for individuals, for enterprises, and for inter-enterprise system integration. To take full advantage of the semantically rich ontological patterns and templates, the REA ontology must be implemented with current advances in artificial intelligence technology and object-oriented database technology. Many current problems faced by companies that attempt to install ERP systems and integration tools such as EDI can be minimized by use of common semantic patterns about which intelligent systems can reason. Because REA systems are ontologically driven, they allow enterprises whose business practices differ from each other to realize that their base constructs are the same and can be used as the basis of integration.
REFERENCES
Agarwal, R., Sinha, A. P., & Tanniru, M. (1996a). Cognitive fit in requirements modeling: A study of object and process methodologies. Journal of Management Information Systems, 13(2), 137-162.
Agarwal, R., Sinha, A. P., & Tanniru, M. (1996b). The role of prior experience and task characteristics in object-oriented modeling: An empirical study. International Journal of Human-Computer Studies, 45, 639-667.
Andros, D., Cherrington, J. O., & Denna, E. L. (1992, July/August). Reengineer your accounting the IBM way. The Financial Executive, 8(4), 28-31.
Bodart, F., Patel, A., Sim, M., & Weber, R. (2001). Should optional properties be used in conceptual modeling? A theory and three empirical tests. Information Systems Research, 12(4), 384-405.
Chen, P. P. (1976, March). The entity-relationship model— Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.
Cherrington, J. O., Denna, E. L., & Andros, D. P. (1996). Developing an event-based system: The case of IBM’s national employee disbursement system. Journal of Information Systems, 10(1), 51-69.
Codd, E. F. (1970, June). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.
David, J. S. (1995). An empirical analysis of REA accounting systems, productivity, and perceptions of competitive advantage. Unpublished doctoral dissertation, East Lansing, MI: Michigan State University.
David, J. S., Dunn, C. L., & McCarthy, W. E. (1999). Enterprise resource planning systems research: The necessity of explicating and examining patterns in symbolic form. East Lansing, MI: Michigan State University.
David, J. S., Gerard, G., & McCarthy, W. E. (2002). Design science: An REA perspective on the future of AIS. In V. Arnold & S. Sutton (Eds.), Researching accounting as an information systems discipline. Sarasota, FL: American Accounting Association.
Dunn, C. L., Cherrington, J. O., & Hollander, A. S. (2005). Enterprise information systems: A pattern based approach (3rd ed.). Burr Ridge, IL: McGraw-Hill Irwin.
Dunn, C. L., & Grabski, S. V. (1998). The effect of field independence on conceptual modeling performance. Advances in Accounting Information Systems, 6, 65-77.
Dunn, C. L., & Grabski, S. V. (2000). Perceived semantic expressiveness of accounting systems and the effect on task accuracy. International Journal of Accounting Information Systems, 1(2), 79-87.
Dunn, C. L., & Grabski, S. V. (2002). Evaluative research in semantically modeled accounting systems. In V. Arnold & S. Sutton (Eds.), Researching accounting as an information systems discipline. Sarasota, FL: American Accounting Association.
Dunn, C. L., & McCarthy, W. E. (1997). The REA accounting model: Intellectual heritage and prospects for progress. Journal of Information Systems, 11(1), 31-51.
Geerts, G. L., & McCarthy, W. E. (1991). Database accounting systems. In B. C. Williams & B. J. Spaul (Eds.), IT and accounting: The impact of information technology (pp. 159-183). London: Chapman & Hall.
Geerts, G. L., & McCarthy, W. E. (1998). Accounting as romance: Patterns of unrequited love and incomplete exchanges in life and in business software. Working paper presented at the Arizona State University REA Roundtable Workshop, Tempe, AZ.
Geerts, G. L., & McCarthy, W. E. (1999, July/August). An accounting object infrastructure for knowledge-based enterprise models. IEEE Intelligent Systems & Their Applications, 14(4), 89-94.
Geerts, G. L., & McCarthy, W. E. (2000). Augmented intensional reasoning in knowledge-based accounting systems. Journal of Information Systems, 14(2), 127-150.
Geerts, G. L., & McCarthy, W. E. (2001a). Using object templates from the REA accounting model to engineer business processes and tasks. The Review of Business Information Systems, 5(4), 89-108.
Geerts, G. L., & McCarthy, W. E. (2001b). The ontological foundation of REA enterprise information systems (Working paper). East Lansing, MI: Michigan State University.
Geerts, G. L., & McCarthy, W. E. (2002). An ontological analysis of the economic primitives of the extended-REA enterprise information architecture. International Journal of Accounting Information Systems, 3(1), 1-16.
Geerts, G. L., & McCarthy, W. E. (2003). Type-level specification in REA enterprise systems (Working paper). East Lansing, MI: Michigan State University.
Gelinas, U. J., Sutton, S. G., & Oram, A. E. (2001). Accounting information systems (5th ed.). Mason, OH: South-Western College.
Gerard, G. (1998). REA knowledge acquisition and related conceptual database design performance. Unpublished doctoral dissertation, East Lansing, MI: Michigan State University.
Hall, J. (2003). Accounting information systems (4th ed.). Mason, OH: South-Western College.
Kim, Y. K., & March, S. T. (1995). Comparing data modeling formalisms. Communications of the ACM, 38(6), 103-115.
McCarthy, W. E. (1982). The REA accounting model: A generalized framework for accounting systems in a shared data environment. The Accounting Review, 57(3), 554-578.
Nakamura, H., & Johnson, R. E. (1998). Adaptive framework for the REA accounting model. Proceedings of the OOPSLA’98 Business Object Workshop IV. Retrieved August 17, 2004, from http://jeffsutherland.com/oopsla98/nakamura.html
O’Leary, D. E. (2000). Enterprise resource planning systems: Systems, life cycle, electronic commerce, and risk. New York: Cambridge University Press.
O’Leary, D. E. (2004). On the relationship between REA and SAP. International Journal of Accounting Information Systems, 5(1), 65-81.
Rockwell, S. R., & McCarthy, W. E. (1999). REACH: Automated database design integrating first-order theories, reconstructive expertise, and implementation heuristics for accounting information systems. International Journal of Intelligent Systems in Accounting, Finance & Management, 8(3), 181-197.
Romney, M. B., & Steinbart, P. J. (2003). Accounting information systems (9th ed.). Upper Saddle River, NJ: Prentice Hall.
Scheer, A.-W. (1998). Business process engineering: Reference models for industrial enterprises. Secaucus, NJ: Springer-Verlag New York, Inc.
Sinha, A. P., & Vessey, I. (1999). An empirical investigation of entity-based and object-oriented data modeling. In P. De & J. DeGross (Eds.), Proceedings of the Twentieth International Conference on Information Systems (pp. 229-244). Charlotte, NC.
Smith, J. M., & Smith, D. C. (1977, June). Database abstractions: Aggregation and generalization. ACM Transactions on Database Systems, 2(2), 105-133.
Vessey, I., & Conger, S. A. (1994). Requirements specification: Learning object, process, and data methodologies. Communications of the ACM, 37(5), 102-113.
Weber, R. (1986). Data models research in accounting: An evaluation of wholesale distribution software. The Accounting Review, 61(3), 498-518.
Weber, R. (1996). Are attributes entities? A study of database designers’ memory structures. Information Systems Research, 7(2), 137-162.
KEY TERMS
Business Process: A term widely used in business to indicate anything from a single activity, such as printing a report, to a set of activities, such as an entire transaction cycle. In this paper, business process is used as a synonym of transaction cycle.
Enterprise Resource Planning System: An enterprise-wide group of software applications centered on an integrated database designed to support a business process view of the organization and to balance the supply and demand for its resources. This software has multiple modules that may include manufacturing, distribution, personnel, payroll, and financials and is considered to provide the necessary infrastructure for electronic commerce.
Ontologically-Based Information System: An information system based upon a domain ontology, whereby the ontology provides the semantics inherent within the system. These systems facilitate organizational productivity and interorganizational communication.
Process Level Model: A second-level model in the REA ontology that documents the semantic components of all the business process events.
REA Enterprise Ontology: A domain ontology that defines constructs common to all enterprises and demonstrates how those constructs may be used to design a semantically modeled enterprise database.
Semantically Modeled Enterprise Database: A database that is a reflection of the reality of the activities in which an enterprise engages and the resources and people involved in those activities. The semantics are present in the conceptual model but might not be readily apparent in the implemented database.
Task Level Model: The third-level model in the REA ontology and its most detailed level; it specifies all steps necessary for the enterprise to accomplish the business events that were included at the process level.
Value Chain: The interconnection of business processes via resources that flow between them, with value being added to the resources as they flow from one process to the next.
ENDNOTES
1. An alternative semantic model has been presented by Scheer (1998). Additional research is needed comparing these two semantic models with subsequent evaluations of commercially available ERP packages.
2. The REA framework uses the term business process to mean a set of related business events and other activities that are intended to accomplish a strategic objective of an organization. In this view, business processes represent a high level of abstraction. Some non-REA views define business processes as singular activities that are performed within a business. For example, some views consider “process sale order” to be a business process. The REA view considers “sale order” to be a business event that is made up of specific tasks, and it interacts with other business events within the “sales-collection” business process.
Semistructured Data and its Conceptual Models
Murali Mani
Worcester Polytechnic Institute, USA
Antonio Badia
University of Louisville, USA
INTRODUCTION
Semistructured data is data with no predefined schema, or a very flexible schema. Because such data does not fit well in traditional databases, new data models have been developed to deal with it. XML is the most well known of these new data models.
Traditional database systems have adapted to handle XML data by extending data types, query languages, indexing methods, and optimization techniques to the nature of XML; in addition, brand new database systems have been developed. However, the first step in developing a database is to create a conceptual model that can be used as the starting point for the design process (Davis, 1993). Existing conceptual models, most notably Entity-Relationship (E-R) models (Chen, 1976), are inadequate for semistructured data, yet relatively little research has been devoted to adapting conceptual models to the characteristics of this type of data (Badia, 2002; Elmasri et al., 2002). In this paper, we will review the main characteristics of XML and E-R, show the mismatch between them, and propose extensions to E-R that address those mismatches.
BACKGROUND
In this section, we review the basic ideas of conceptual models and XML in order to fix some vocabulary for the rest of the paper.
CONCEPTUAL MODELS
We briefly review the main characteristics of E-R models (Chen, 1976; Thalheim, 2000). There are other conceptual models, like Unified Modeling Language (UML) (Rumbaugh, Jacobson & Booch, 1999) and Object-Role Modeling (ORM) (Halpin & Bloesch, 1999); we will not include them here for lack of space, although we will make a few comments later on.
An Entity-Relationship (E-R) model (Chen, 1976; Thalheim, 2000) is based on three basic concepts: entity types, attributes, and relationships. E-R models are usually depicted in E-R diagrams; an example is given in Figure 1. Entity types are depicted as rectangles with a name inside, attributes as ovals, and relationships as lines with a diamond shape on them.
Entity types represent things, either real or conceptual. They denote sets of objects, not particular objects; in this respect, they are close to classes in object-oriented models. The set of objects modeled by an entity type is called its extension. Particular objects are called entities.
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Figure 1. Example E-R schema
Relationships are connections among entity types. Relationships may involve any number of entity types; those involving two entity types (called binary relationships) are the most common, but n-ary relationships (involving n > 2 entity types) are also possible. In particular, relationships relating one entity to itself are allowed. For example, a relationship Manager-of may relate the entity Employee to itself. To distinguish the ways in which Employee participates in this relationship, roles are added to the entity (manager and managee, in this example).
Relationships are fundamental in an E-R model in that they carry very important information in the form of constraints. A participation constraint tells us whether all objects in the extension of an entity are involved in the relationship or whether some may not be. For example, suppose entities Department and Employee have a relationship Works-for between them. If all employees work for some department, then the participation of Employee in Works-for is total. However, if there can be employees who are not assigned to a particular department, then participation is partial. A cardinality constraint tells us how many times an object in the entity’s extension may be involved in a relationship and allows us to classify binary relationships as one-to-one, one-to-many, or many-to-many. There are several notations to state constraints in an E-R diagram. The one chosen here associates with each entity type E and relationship R a pair of numbers (min, max), where min represents the minimum number of times an entity in E appears in R (thus, min represents the participation constraint by being 0 for partial and 1 for total), and max represents the maximum number of times an entity in E appears in R (thus, max represents the cardinality constraint by being 1 for one-to relationships and greater than 1 for many-to relationships). The latter case is traditionally represented by using the letters M or N. Thus, the (1,1) by Employee and Works-for indicates that all employees work for exactly one department; the (0,M) by Department and Manages indicates that not all departments manage projects, but those that do may manage more than one, and so on.
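The (min, max) notation can be read as a simple counting rule over the instances of a relationship. The following Python fragment is a minimal sketch of that reading, not part of the E-R model itself; the function name check_participation, the sample employees and departments, and the use of infinity for the letter M are all assumptions made for illustration.

```python
from collections import Counter

M = float("inf")  # assumed stand-in for the letters M or N ("many")

def check_participation(entities, relationship_pairs, position, min_count, max_count):
    """Return the entities that violate a (min, max) constraint.

    entities: all objects in the extension of the entity type
    relationship_pairs: tuples, one per instance of the relationship
    position: index of this entity type within each tuple
    """
    counts = Counter(pair[position] for pair in relationship_pairs)
    return [e for e in entities if not (min_count <= counts[e] <= max_count)]

# Hypothetical extension of Employee and instances of Works-for.
employees = ["ann", "bob", "cy"]
works_for = [("ann", "sales"), ("bob", "sales"), ("bob", "hr")]

# (1,1) by Employee: total participation, exactly one department each.
employee_violations = check_participation(employees, works_for, 0, 1, 1)

# (0,M) on the Department side: partial participation, no upper bound.
departments = ["sales", "hr", "legal"]
department_violations = check_participation(departments, works_for, 1, 0, M)

print(employee_violations)    # bob appears twice and cy not at all
print(department_violations)  # nothing can violate (0,M)
```

Under this reading, min = 1 fails for any entity that never appears in the relationship, and max = 1 fails for any entity that appears more than once.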
Entity types and relationships may have attributes, which are properties with a value. Attributes convey characteristics or descriptive information about the entity type or relationship to which they belong. Attributes may be simple or composite (made up of simpler parts, like the attribute Address of entity Department in the example, which is made up of parts named street, city, and zip), single or multivalued (capable of having one or several values for a particular entity; multivalued attributes are displayed as dual ovals, like locations in the example above, meaning that some department may have multiple locations), and primitive or derived (a derived attribute value is computable from other information in the model). A key attribute is an attribute whose value is guaranteed to exist and be different for each entity in the entity type. Therefore, the attribute is enough to point out a particular entity. All entity types are assumed to have at least one key attribute.
A contentious issue is the semantics of an entity type’s attributes, in particular, whether attributes are required (i.e., every entity of the type must have a value for each attribute of the type) or optional (i.e., some entities of the type may or may not have values for some attributes). Different authors take different views on this issue, some even arguing that it is a mistake to consider attributes optional (Bodart et al., 2001). Since this has an impact when transforming E-R models into different data models, we will point out how to deal with each view.
Some E-R models admit weak entities, entities with no key attributes; these entities are connected by a one-to-many relationship to a regular entity, called the strong entity. What characterizes a weak entity is its lack of clear identity (reflected in the lack of a key) and its dependence for existence on the strong entity. As a typical example, an entity Loan may have an associated weak entity Payment. Clearly, a loan is associated with several payments (hence, the one-to-many relationship), and if a loan ceases to exist (say it is paid off), then the associated payments also cease to exist.
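The existence dependence of a weak entity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the payment number used to tell payments of the same loan apart is an assumed discriminator that the description above does not mention.

```python
from dataclasses import dataclass, field

@dataclass
class Payment:
    # Weak entity: 'number' is an assumed discriminator that is unique
    # only among the payments of a single loan, so it is not a key.
    number: int
    amount: float

@dataclass
class Loan:
    loan_id: str                                   # key of the strong entity
    payments: list = field(default_factory=list)   # one-to-many relationship

loans = {}

def add_payment(loan_id, amount):
    """Attach a payment to its owning loan and return its full identity."""
    loan = loans[loan_id]
    payment = Payment(number=len(loan.payments) + 1, amount=amount)
    loan.payments.append(payment)
    # A payment can only be identified together with the key of its loan.
    return (loan_id, payment.number)

def remove_loan(loan_id):
    """Deleting the strong entity also removes the payments stored in it."""
    del loans[loan_id]

loans["L1"] = Loan("L1")
first = add_payment("L1", 250.0)
second = add_payment("L1", 250.0)
remove_loan("L1")
print(first, second, "L1" in loans)
```

Storing the payments inside the loan object is one design choice that makes the dependence automatic; a design with separate collections would need an explicit cascading delete.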
Many proposals for additional features have been made over the years. The most successful one is the addition of class hierarchies by introducing IS-A (class/subclass) relations between entities. This addition, obviously motivated by the success of object-oriented methods for analysis, allows the designer to recognize commonalities among entities; usually this means shared attributes exist. Shared attributes are removed and put together in a new entity (class) which is a generalization of the others, and a class-subclass relationship is created. As in object-oriented approaches, inheritance of attributes is assumed. In Figure 1, entity type Employee has two subtypes, Hourly-employee and Salaried-Employee. The IS-A relationship is indicated by a downward triangle instead of a diamond in the line joining the involved entity types. The IS-A relationship can be annotated to distinguish several situations: whether the subclasses are disjoint or not and whether the subclasses together cover the superclass (i.e., every entity of the superclass must also belong to one of the subclasses) or not. Note that both dimensions are orthogonal to each other; hence, two annotations are needed to determine the exact situation.
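Because the two annotations are independent, they can be checked separately over the extensions of the entity types. The sketch below assumes extensions are available as Python sets; the function name classify_isa and the sample data are hypothetical.

```python
def classify_isa(superclass, subclasses):
    """Return (disjoint, covering) for the given extensions.

    superclass: set of entities in the superclass
    subclasses: mapping from subclass name to its set of entities
    """
    seen = set()
    disjoint = True
    for extension in subclasses.values():
        if seen & extension:      # some entity already belongs to another subclass
            disjoint = False
        seen |= extension
    covering = superclass <= seen  # every superclass entity is in some subclass
    return disjoint, covering

# Hypothetical extensions for the hierarchy of Figure 1.
employee = {"ann", "bob", "cy"}
subtypes = {
    "Hourly-employee": {"ann"},
    "Salaried-Employee": {"bob"},
}
annotations = classify_isa(employee, subtypes)
print(annotations)  # disjoint, but cy belongs to neither subclass
```

All four combinations of the two flags are possible, which is why two annotations are needed.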
XML
We briefly review the main characteristics of XML. Conceptually, we can see the schema for an XML document as an (annotated) regular tree grammar with production rules of the form A → aX, where A is a nonterminal symbol, “a” is a terminal symbol, and X is a regular expression over the alphabet of nonterminal symbols. In a more descriptive analysis, XML describes elements (which are denoted by matching opening and closing tags) which may contain other elements and/or have attributes. XML has the ability to specify whether a subelement may appear exactly once (regular subelements); once or not at all (optional subelements); one or more times (repeated subelements); or zero, one, or several times (Kleene star subelements), and whether any one of two or more subelements may appear in a given position (choice subelements). Besides subelements, XML can also specify the attributes of an element. Attributes are properties with a (unique) value. All attributes of an element must have different names; an attribute may have a default value and may be mandatory or optional. Because an element may contain itself, recursive definitions are allowed (e.g., part-subpart definitions). In XML Schema, a set of subelements and/or attributes of an element can be declared a key for that element; that key can then be referenced by other elements through the use of foreign keys.
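The grammar view can be made concrete by treating the content of each element as a regular expression over the names of its subelements. The following Python sketch uses only the standard library; the element names, the CONTENT_MODELS table, and the function is_valid are assumptions for illustration, and the approach ignores attributes, keys, and text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models: each element name maps to a regular expression
# over child element names (every name is followed by one space as delimiter).
CONTENT_MODELS = {
    # name exactly once, location one or more times, a choice of phone or
    # email, then zero or more employees (Kleene star)
    "department": r"name (location )+(phone |email )(employee )*",
    # name exactly once, salary optional
    "employee": r"name (salary )?",
}

def is_valid(element):
    """Check an element and, recursively, its subelements against the models."""
    # Elements without an entry are assumed to be leaves with no subelements.
    model = CONTENT_MODELS.get(element.tag, r"")
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<department><name>Sales</name><location>Louisville</location>"
    "<location>Lexington</location><email>sales@example.org</email>"
    "<employee><name>Ann</name></employee></department>"
)
# Missing the required location subelement.
bad = ET.fromstring(
    "<department><name>Sales</name><email>x@example.org</email></department>"
)

print(is_valid(good), is_valid(bad))
```

Each operator in the expressions corresponds to one of the cases listed above: no operator for regular subelements, ? for optional, + for repeated, * for Kleene star, and | for choice.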
There is much more to XML and XML Schema; here we are only interested in the main structural details. The interested reader is referred to the W3C standards (Bray, Paoli & Sperberg-McQueen, 2004).
MODELING XML IN E-R
There is a straightforward translation of E-R into XML Schema that simply imitates the way E-R is translated into a relational design¹. While this translation (made possible by the ability of XML Schema to specify integrity constraints) is technically correct, it is contrary to the way XML is intended to be used. In particular, there are several XML features that are of no relevance for such a translation. In other words, if we take an E-R model and transform it to an XML Schema in the manner just indicated, there are certain features that will never be used. One of them is the ability to form union types, another one is the ability to have recursive types, and a third one is the possibility of having order in relationships. Finally, there is also the issue of whether the ability to specify how many times a subelement is present can be used. The flat translation proposed denies the need to have complex subelements, and normalization gets rid of repeated subelements (Date, 2000). There is the question of whether optional simple attributes would ever be used; this is a somewhat ambiguous question because of the ambiguity pointed out earlier in E-R semantics. If attributes are considered optional and we choose to represent E-R attributes as XML simple elements, then we should declare those elements optional (if we translate into XML attributes, the #IMPLIED, that is, optional, or #REQUIRED keyword of a DTD can be used for the same purpose).
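The difference between the flat, relational-style translation and a more idiomatic use of XML can be illustrated with a small Python sketch. Everything in it is assumed for illustration: the sample data, the attribute names dno and eno, and in particular the nested variant, which is shown only for contrast and is not a translation prescribed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical instances of Department, Employee, and Works-for.
departments = [{"dno": "d1", "name": "Sales"}]
employees = [
    {"eno": "e1", "name": "Ann", "dno": "d1"},
    {"eno": "e2", "name": "Bob", "dno": "d1"},
]

def flat_translation():
    """Relational style: one flat element per entity, linked by key values."""
    root = ET.Element("company")
    for d in departments:
        ET.SubElement(root, "department", dno=d["dno"], name=d["name"])
    for e in employees:
        # dno plays the role of a foreign key referencing a department.
        ET.SubElement(root, "employee", eno=e["eno"], name=e["name"], dno=e["dno"])
    return root

def nested_translation():
    """Assumed alternative: the relationship is expressed by nesting."""
    root = ET.Element("company")
    for d in departments:
        dept = ET.SubElement(root, "department", dno=d["dno"], name=d["name"])
        for e in employees:
            if e["dno"] == d["dno"]:
                ET.SubElement(dept, "employee", eno=e["eno"], name=e["name"])
    return root

flat = flat_translation()
nested = nested_translation()
print(ET.tostring(flat, encoding="unicode"))
print(ET.tostring(nested, encoding="unicode"))
```

In the flat output no element has subelements, so complex and repeated subelements never arise; in the nested output the employee element is a repeated complex subelement of department.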
The ability to have union types provides important flexibility in modeling. It allows us to recognize commonalities in function, for instance, even if structurally two elements are different. Assume, as an example, an entity Journal_Paper and another entity Conference_Paper. Both can be publications and, therefore, be in a certain relationship Written_by with entity Author. If this relationship is important, we would like to be able to see Journal_Paper and Conference_Paper as part of the same type (Publication) despite other differences. This is the root of the proposal to create categories in E-R models (Elmasri, Weeldreyer & Hevner, 1985). Categories are basically union entity types; when added to E-R models, they correspond to union types in XML. Following our example, Publication is a category, composed of entities Journal_Paper and Conference_Paper, and would get translated as an element embodying the union of such elements. The importance of categories is that they allow the expression of a new set of constraints, which we can call covering and overlap constraints. A covering constraint holds if every entity in a category belongs to one of the types making up the category (e.g., if all publications are either a journal paper or a conference paper). Otherwise, a non-covering constraint holds (e.g., some publications are neither journal papers nor conference papers). An overlap constraint holds if the types making up the category have empty intersections (e.g., if no paper can be at the same time a journal and a conference paper); otherwise, a non-overlap constraint holds. Note, by the way, that the same type of constraints may hold of the roles that an entity plays in relationships. Assume, for example, that entity Person is related to category Publication by relationship Writes and that the category is made up as before.
We may then want to express the fact that all persons that are authors are the authors of journal or conference papers, nothing else (a covering constraint), or that no person is an author of both journal and conference papers (an overlap constraint).
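The role-level constraints can be checked directly on the instances of Writes. The sketch below is one possible reading, with hypothetical data and names: the covering check tests whether every author has written at least one journal or conference paper, and the overlap check follows the terminology used above, where the overlap constraint holds when the two sets of authors share no member.

```python
# Hypothetical extensions of the types making up the category Publication.
journal_papers = {"j1", "j2"}
conference_papers = {"c1"}
# tr1 is an assumed technical report, outside both member types.
publications = journal_papers | conference_papers | {"tr1"}

# Instances of the relationship Writes as (person, publication) pairs.
writes = [("ann", "j1"), ("ann", "j2"), ("bob", "c1"), ("cy", "tr1")]

def authors_of(papers):
    """Persons playing the author role for at least one of the given papers."""
    return {person for person, paper in writes if paper in papers}

all_authors = {person for person, _ in writes}
journal_authors = authors_of(journal_papers)
conference_authors = authors_of(conference_papers)

# Covering on the category itself: every publication is in a member type.
category_covering = publications <= (journal_papers | conference_papers)

# Covering on roles: every author wrote a journal or a conference paper.
role_covering = all_authors <= (journal_authors | conference_authors)

# Overlap constraint on roles, in the sense used in the text:
# it holds when no person authors both kinds of paper.
role_overlap_constraint = journal_authors.isdisjoint(conference_authors)

print(category_covering, role_covering, role_overlap_constraint)
```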
The ability to have recursive types also gives the capability to model recursive relations like part and subpart. Finally, ordered relationships can be “simulated” in E-R models by adding an attribute Order to the relationship. However, it is still up to the analyst/designer to adhere to the intended semantics, as there is nothing in the E-R model that enforces the ordering.
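The simulation of order, and the fact that nothing enforces it, can be seen in a small Python sketch. The relationship Written_by, its instances, and both function names are assumptions for illustration; the consistency check stands for the kind of rule the analyst or designer would have to add outside the E-R model.

```python
# Hypothetical instances of Written_by, each carrying an added Order attribute.
written_by = [
    {"paper": "p1", "author": "bob", "order": 2},
    {"paper": "p1", "author": "ann", "order": 1},
    {"paper": "p2", "author": "cy", "order": 1},
    {"paper": "p2", "author": "ann", "order": 1},  # duplicate position
]

def authors_in_order(paper):
    """Recover the intended sequence by sorting on the Order attribute."""
    rows = [r for r in written_by if r["paper"] == paper]
    return [r["author"] for r in sorted(rows, key=lambda r: r["order"])]

def order_is_consistent(paper):
    """Check that the positions of a paper's authors are exactly 1, 2, ..., n."""
    orders = sorted(r["order"] for r in written_by if r["paper"] == paper)
    return orders == list(range(1, len(orders) + 1))

print(authors_in_order("p1"))
print(order_is_consistent("p1"), order_is_consistent("p2"))
```

The model accepts the instances of p2 even though two authors share the same position; only the extra check detects the problem.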