Rivero, L. Encyclopedia of Database Technologies and Applications. 2006.

Figure 1. RS-Tree

[Figure: a binary RS-Tree over units 1-4 (A1-D1). The root (Status={Nucleus, Satellite}, Type={elaboration}, Promotion={B1}) joins a nucleus spanning units 1-2 (A1-B1, Type={purpose}, Promotion={B1}) and a satellite spanning units 3-4 (C1-D1, Type={example}, Promotion={C1}). The leaves are the units themselves: A1 (Satellite, Promotion={A1}), B1 (Nucleus, Promotion={B1}), C1 (Nucleus, Promotion={C1}), and D1 (Satellite, Promotion={D1}).]

relation that holds between the text spans), and a salience or promotion set (the set of units that form the most important part of the text spanned by that node). A relation holds between two units, one of which is more important from the author's point of view than the other. The former is referred to as the nucleus and the latter as the satellite. If the satellite is removed, the text remains comprehensible. A summary can then be generated by simply ordering the important units of the original text, on the principle that units promoted closer to the root are more important than units promoted further from it. To illustrate this approach, consider the following piece of text dealing with reform of irrigation policy and water conservation, whose simplified RS-Tree is shown in Figure 1.

[To conserve water resources and encourage demand management in the irrigation sector,A1] [a national water saving strategy was implemented.B1] [As part of the strategy, a number of reforms were introduced in the past few years,C1] [for instance, the promotion of water users’ associations, the increase in the price of irrigation water, etc.D1]

By retrieving the most important parts, we obtain the ordering B1 > C1 > A1, D1. B1 is retrieved first because it is promoted at the root of the tree, followed by C1, which is promoted one level below B1, and so on. By applying this criterion to each RS-Tree associated with each selected segment, each delegate retrieves the most relevant portions of the segments for each subtopic under its jurisdiction.
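The ordering criterion above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: node and field names are ours, and the tree below hand-encodes the simplified RS-Tree of Figure 1. A breadth-first traversal ranks each unit by the level at which it is first promoted.

```python
from dataclasses import dataclass, field
from collections import deque

# Hypothetical representation of an RS-Tree node, carrying the three
# attributes described in the text: status, relation type, promotion set.
@dataclass
class RSNode:
    status: set
    rel_type: set
    promotion: set                         # most salient units of this span
    children: list = field(default_factory=list)

def summary_order(root):
    """Rank units: the closer to the root a unit is promoted, the earlier it appears."""
    order, seen = [], set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for unit in sorted(node.promotion):
            if unit not in seen:           # keep the shallowest promotion only
                seen.add(unit)
                order.append(unit)
        queue.extend(node.children)
    return order

# The simplified tree of Figure 1: B1 is promoted at the root, C1 one level
# below, and A1/D1 only at their own leaves.
leaves = [RSNode({"Satellite"}, {"leaf"}, {"A1"}),
          RSNode({"Nucleus"}, {"leaf"}, {"B1"}),
          RSNode({"Nucleus"}, {"leaf"}, {"C1"}),
          RSNode({"Satellite"}, {"leaf"}, {"D1"})]
left = RSNode({"Nucleus"}, {"purpose"}, {"B1"}, leaves[:2])
right = RSNode({"Satellite"}, {"example"}, {"C1"}, leaves[2:])
root = RSNode({"Nucleus"}, {"elaboration"}, {"B1"}, [left, right])

print(summary_order(root))  # B1 first, then C1, then A1 and D1
```

Running the sketch on the Figure 1 tree reproduces the ordering B1 > C1 > A1, D1 given in the text.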

Agent Communication

Once a summarizer achieves the segmentation, it sends the interface agent the list of the subtopics discussed in its segments via m1: ListSubTopics (Summarizer, Int, SubTopic-set).

The syntax used in this message and the others is

MessageLabel (idS, idR, message-parameter).

MessageLabel is the type of message sent by idS to idR. The possible values of idS and idR are Int (the interface agent), Delegate (any delegate agent), and Summarizer (any summarizer agent). Message-parameter is the content the sender transmits to the receiver.
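The message syntax can be captured as a small structure. This is an illustrative sketch only (the class and the validation are ours); the label and identifier vocabularies come from the messages m1-m7 named in the text.

```python
from dataclasses import dataclass
from typing import Any

# Agent identifiers and message labels as listed in the article.
AGENT_IDS = {"Int", "Delegate", "Summarizer"}
LABELS = {"ListSubTopics", "Delegation", "SubTopicRequest", "TextSubTopic",
          "ListMembers", "SubTopicMembers", "Summary"}

@dataclass
class Message:
    label: str       # MessageLabel, e.g. "ListSubTopics"
    sender: str      # idS
    receiver: str    # idR
    parameter: Any   # message-parameter, e.g. a SubTopic-set

    def __post_init__(self):
        # Reject anything outside the protocol's vocabulary.
        if self.label not in LABELS:
            raise ValueError(f"unknown message label: {self.label}")
        if self.sender not in AGENT_IDS or self.receiver not in AGENT_IDS:
            raise ValueError("idS and idR must be Int, Delegate, or Summarizer")

# m1: a summarizer reports the subtopics of its segments to the interface.
m1 = Message("ListSubTopics", "Summarizer", "Int", {"irrigation", "water pricing"})
print(m1.label, m1.sender, "->", m1.receiver)
```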

Then, the interface forms the groups, one per subtopic tackled in the corpus, and assigns to each group the agents dealing with the corresponding subtopic. Finally, the interface makes the delegation decision. Once the delegation is settled, the interface agent sends a notification message, m2: Delegation (Int, Delegate, SubTopic-set), to inform the concerned agents of the delegation decision.

There are two possible cases: either the agent is a delegate of a group to which it belongs, or it is a delegate of a group to which it does not belong (an agent may simultaneously be a delegate of groups of both kinds).

For the first case, the delegate agent sends a message m3: SubTopicRequest (Delegate, Summarizer, SubTopic-set), asking its acquaintances (selected according to the subtopics) to send the pieces of text dealing with the specified subtopics. Each receiver in turn sends the message m4: TextSubTopic (Summarizer, Delegate, Text-set) as a reply to the SubTopicRequest query.

In the second case, the communication flow is as follows. The delegate agent sends a request message, m5: ListMembers (Delegate, Int, SubTopic-set), to identify the members of the groups under its responsibility. This message is sent by the delegate agent to the interface, which knows all the members of the society. The interface agent responds by providing the list of the concerned agents, m6: SubTopicMembers (Int, Delegate, Member-set). After identifying all the members concerned with the specified subtopics, the delegate sends m3 to collect all the texts (segments) and waits until it receives the m4 messages.

Whatever the situation, a delegate starts the summarization of its virtual document once it has received all the replies to the m3 messages for a given subtopic; otherwise, it waits. When a delegate completes all the work assigned to it, it sends m7: Summary (Delegate, Int, Summary-set) to provide the interface with the summary of each subtopic under its responsibility.

The whole process ends when the interface is satisfied, which happens as soon as all the summarizer agents are satisfied, too. A summarizer agent is said to be satisfied if it has completed all the work assigned to it; otherwise, it is unsatisfied and continues the summarization.


Semantic Enrichment of Geographical Databases

FUTURE TRENDS

Semantic enrichment of geographical databases is concerned with supplying additional information about geographic entities without taking into account the spatial aspect (shape and topological relations); only the alphanumeric attributes are handled. One perspective is to go beyond this limitation and to profit from the spatial aspect by considering the neighborhood of the entities. For instance, whenever the results of the current mining are not satisfactory, one can extend the search space to the adjacent entities. In fact, mining information about the neighbors may yield knowledge more valuable than mining the documents related to the entity itself.

CONCLUSION

Our system aims to enrich geographical databases. A GIS provides information retrieved from the GDB; thus, any data not stored within it is unavailable. To make these databases rich sources of information, allowing any user to know almost everything about the displayed entities, we mine data available on the Web. This mining is achieved via the summarization technique, more precisely MDS, through the collaboration of a set of agents. There are essentially two classes of agents, namely, the interface agent and the summarizer agents. These agents interact to lead the system to the optimal summary.



KEY TERMS

GDB: A geographic database (GDB) is a database integrated into a GIS, storing spatial and alphanumeric data.

GIS: Geographic Information System (GIS) is a computer system capable of capturing, storing, analyzing, and displaying geographically referenced information.

MAS: Multi-Agent Systems (MAS). In a distributed universe, the systems that shelter agents are called multi-agent systems. An agent is a hardware or software entity able to act on itself and on its environment.

MDS: Multi-Document Summarization (MDS) is the process of distilling the most important information from a corpus of documents.

RS-Tree: A binary tree that describes the rhetorical structure of a coherent discourse.

Text Mining: Also known as text data mining or knowledge discovery from textual databases; the process of extracting interesting and non-trivial patterns or knowledge from text documents.

TextTiling: A technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.


Semantic Information Management

 

 

 


 

 

 

 

 

David G. Schwartz

Bar-Ilan University, Israel

Zvi Schreiber

Unicorn Solutions Inc., USA

INTRODUCTION: THE ENTERPRISE DATA PROBLEM

The need to manage enterprise data has been coming into increasingly sharp focus for some time. Years ago, data sat in silos attached to specific applications. Then came the network, with data becoming available across applications, departments, subsidiaries, and enterprises. Throughout these developments, one underlying problem has remained unsolved: Data resides in thousands of incompatible formats and cannot be systematically managed, integrated, unified, or cleansed.

To make matters worse, this incompatibility is not limited to the use of different data technologies (e.g., flat file, COBOL, IMS, relational, XML, etc.) or to the multiple different “flavors” of each technology, such as the different relational databases (Oracle, DB2, SQL Server, Sybase, etc.) in existence. The most challenging incompatibility arises from semantic differences. Each data asset is set up with its own worldview and vocabulary—known as its schema. This incompatibility exists even if both assets use the same technology.

For example, one database has a table called “client,” intending this to include channel partners, and subdivides customers into individuals and institutions; the other data asset refers to the same concept as a “patron” (although not including channel partners) and subdivides “patrons” into individuals, corporations, government, and nonprofit groups. To make matters worse, “patron” excludes international clients, despite the fact that this is not explicitly mentioned in any documentation and the original developer retired five years ago.

In a larger enterprise, this problem may be multiplied by thousands of data structures located in hundreds of incompatible databases and message formats. And the problem is growing; enterprises continue to acquire subsidiaries, reengineer processes, and integrate with partners. Moreover, developers are continuing to write new applications and to create new databases based on requests from business users without worrying about overall data management issues.

Therefore, it is imperative to find an efficient way to manage multiple applications and data sources. This requires a shift from an application-centric view of IT to a data- and information-centric view. And it requires a focus on the business meaning—or semantics—of the data. Focusing on a singular business meaning results in the effective use of a single language throughout the enterprise—the common business language approach.

BACKGROUND

Impact of the Data Problem

The enterprise data problem has a strong measurable impact on a company’s bottom line (Schreiber, 2003a).

First, the fragmented data environment inevitably leads to business information quality problems, causing businesses to mishandle information, customer relationships, and internal operations.

Second, the data problem creates a situation which all but prevents the agility that is critical to a modern enterprise responding to a constantly changing environment. Whether application deployment, business process reengineering, or mergers and acquisitions are involved, IT is unable to respond to a dynamic environment due to the fragmented and delicate nature of data assets and the hard-coded scripts keeping assets communicating with one another. This is a significant barrier to moving towards the real-time—or zero-latency—enterprise (Khosla & Pal, 2002).

Finally, IT remains unnecessarily inefficient so long as it lacks a strategic approach to data management. In the meantime, IT deals with the frustrating and costly challenge of administering databases (some of which are redundant), mapping each database multiple times and writing and manually maintaining point-to-point translation scripts.

Data needs to be systematically managed if it is to have long-term value to the enterprise. Specifically, data management must start by elevating data into information by explicitly capturing the meaning and context of the data. This process is known as data semantics (Sheth, 1995).

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.



Figure 1. Positioning of data management disciplines

[Figure: disciplines plotted on a quadrant chart whose horizontal axis runs from tactical to strategic and whose vertical axis from passive to active. Data profiling, data cleansing, and data transformation sit toward the tactical/active side; metadata repositories and data modeling toward the tactical/passive side; enterprise modeling toward the strategic/passive side; semantic information management occupies the strategic/active corner.]

An Overview of Existing Approaches

A partial solution is a metadata repository containing information about each data asset. Repositories act as catalogs and include technical information about the assets, their structure, how they are used, and who is responsible for them (DeMarco, 2000). But a passive catalog does not provide a formal understanding of data or automated support for translating and cleansing the data.

A tactical solution to data quality (Strong, Lee, & Wang, 1997a, 1997b; Wand & Wang, 1996) involves data profiling and data cleansing. While these are important tools within the overall approach to data management, enterprise information quality will never be achieved one database at a time. The real problem lies in ensuring that there is an agreed-upon understanding of each data asset and of its relationship to other data assets. Data assets must be designed, created, maintained, integrated, and decommissioned with attention to the wider quality aims. This is accomplished by understanding data as part of an overall information architecture (Schreiber, 2003b) and by understanding and validating the data with respect to a single agreed-upon business worldview and a series of business rules.

Data modeling typically supports better database design. But data models are usually entity-relationship diagrams (Chen, 2002) with limited business depth, lacking generalization/specialization capabilities (also known as inheritance, subtyping, or polymorphism), and not capturing business rules. They are also technically suitable only for modeling relational databases one at a time, while IT increasingly uses XML, especially for messaging. Furthermore, data models are typically tightly coupled to a specific relational database and are not generally used to promote a common understanding of multiple data assets.

SEMANTICS: TOWARDS A COMMON BUSINESS LANGUAGE

Companies will always struggle with a large number of physically different data formats. While a common data format will likely never be achieved, the key to efficiently managing data is to establish a common understanding. This is the idea of semantics, bridging nomenclature and terminological inconsistencies to comprehend underlying business meaning in a unified manner. Semantics can be achieved by formally capturing the meaning of data. This is accomplished by relating physical data schemas to concepts in an agreed-upon model of the business. Wand and Wang (1996) present some of the foundations for building improved data quality on an ontological basis.

This central model of the business is called an information model (Sheth & Kalinichenko, 1992). The information model does not reflect any specific data model but rather reflects the agreed-upon business view, business vocabulary, and business rules which will provide a common basis for understanding data.

Semantics builds upon traditional informal metadata (Bernstein & Bergstaesser, 1999) and captures the formal meaning of data in agreed-upon business terms. For example, the information model might capture the official business concepts of a “customer” and the more specific concepts of a “business customer” and an “individual customer.” A semantic mapping will then relate physical data schemas to this information model. For instance, a semantic mapping might capture the fact that an information model has a universally accepted concept called “individual customer” that is called “client” by a relational database table, “customer” by an XML schema, and “CUST3” by a COBOL copybook. Thus the semantic mapping formally captures the meaning of the data by reference to the agreed-upon business terminology.
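The “client”/“customer”/“CUST3” example can be made concrete with a toy mapping table. This is a minimal sketch under our own invented names (the asset identifiers and the two lookup helpers are ours, not any product's API): a semantic mapping is, at its simplest, a relation from physical schema elements to information-model concepts.

```python
# Hypothetical semantic mapping: each (asset, physical name) pair is tied
# to one concept in the agreed-upon information model.
semantic_mapping = {
    ("crm_db", "client"):         "individual customer",
    ("orders_xml", "customer"):   "individual customer",
    ("legacy_copybook", "CUST3"): "individual customer",
}

def concept_of(asset, physical_name):
    """Resolve a physical schema name to its business concept."""
    return semantic_mapping[(asset, physical_name)]

def physical_names(concept):
    """All physical representations of one business concept."""
    return {key for key, c in semantic_mapping.items() if c == concept}

print(concept_of("legacy_copybook", "CUST3"))    # individual customer
print(sorted(physical_names("individual customer")))
```

Reading the mapping in one direction interprets a cryptic physical name; reading it in the other direction enumerates every place a business concept is physically stored.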


Why take the trouble to capture data semantics? For two reasons. First, tactically, semantics saves time by capturing the meaning of data once. Without semantics, each data asset is interpreted multiple times by different developers as it is designed, implemented, integrated, cleansed, extended, extracted (for a warehouse), and eventually decommissioned. This independent interpretation is time-consuming and error-prone. With semantics, the data asset is mapped and interpreted only once. Moreover, new assets can be generated from the information model so that they use official business terminology from the outset and never require mapping.

The most significant impact of semantics is a strategic one. Semantics turns hundreds of data sources into one coherent body of information. The semantic architecture includes a record of where data is and what it means. Using this record of which business information is represented in each data asset, it becomes possible to automate the search for overlap and redundancy. The information model provides the basis for creating new data assets in a consistent way and serves as a reliable reference for understanding the interrelationship between disparate sources and for automatically planning how to translate between them. Finally, semantics provides a central basis for impact analysis and for the smooth computer-aided realization of change.

Semantics is not only being applied to managing enterprise data. The World Wide Web Consortium is working to reinvent the Web as the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001; Schwartz, 2003), where information will carry computer-readable meaning.

While some informal data semantics has existed for decades in data dictionaries, home-grown solutions, and ETL tools, a more formal semantics is key to a modern information architecture enabling an effective strategic approach to data management.

Figure 2. Semantic information management (SIM)

[Figure: a layered architecture. At the base, metadata catalogs data assets, their structures, and usage (understanding the data environment). Above it, data semantics captures the meaning of data structures by mapping them to the information model. The information model—the agreed IT model of the business—holds the desired business vocabulary: entities (e.g., customer, individual customer, business customer), relationships, and rules. At the top, active services utilize the semantic information architecture to actively support data users: information resource managers (set up the information architecture, find/decommission redundant assets, apply data policy), information technology professionals (find/understand data, data integration, impact analysis), and knowledge workers (What information is available? What does it mean? How reliable is it?).]

SEMANTIC INFORMATION MANAGEMENT: CORE PRINCIPLES

Core elements of semantic information management (SIM) may be summarized as follows (see also Figure 2).

Metadata—Know your data.

Information Model—Know your business.

Data Semantics—Understand your data.

Metadata

Before data assets can be understood, they must be cataloged. Metadata should include the asset’s schema as well as information about an asset’s location, usage, origin, relationship to other assets, rules associated with it, and assignment of ownership. Some of this metadata may be scanned automatically from assets such as relational databases or from existing sources of metadata.
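The metadata checklist above maps naturally onto a simple catalog record. The sketch below is illustrative only—field names and the example asset are invented, not drawn from a real repository product.

```python
from dataclasses import dataclass, field

# Hypothetical metadata record, following the checklist in the text:
# schema, location, usage, origin, related assets, rules, and ownership.
@dataclass
class AssetMetadata:
    name: str
    schema: dict                                     # e.g. column -> declared type
    location: str                                    # where the asset lives
    usage: list = field(default_factory=list)        # applications that use it
    origin: str = ""                                 # where the data comes from
    related_assets: list = field(default_factory=list)
    rules: list = field(default_factory=list)        # rules associated with it
    owner: str = ""                                  # assigned ownership

catalog = {}

def register(meta: AssetMetadata):
    """Add an asset to the metadata repository, keyed by name."""
    catalog[meta.name] = meta

# An invented relational asset; in practice much of this could be scanned
# automatically from the database itself.
register(AssetMetadata(
    name="crm_db.client",
    schema={"client_id": "INTEGER", "name": "VARCHAR(80)"},
    location="jdbc:oracle://crm-prod/clients",
    usage=["billing", "campaign management"],
    owner="sales IT"))

print(catalog["crm_db.client"].owner)
```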

Information Model

The information model is a rich central model of the business or of a domain within the business. A traditional data model or object model may serve as the basis for an information model, but ideally data models should be extended to a full ontology. An ontology literally means a study of what exists or as Gruber (1993) succinctly put it: “the specification of a conceptualization.” An ontological information model is therefore typically richer than a data model in its view of the business, including different levels of generalization/specialization and a layer of business rules in addition to the traditional entities and relationships. This richness allows the information model to serve as an authoritative reference by which meaning is given to multiple data assets, regardless of format or technology.

A well-organized information model will model business in five layers:

Organizational Layer: Divides and subdivides the information model into different packages reflecting different parts of the business, such as customer, product, etc., reflecting the ownership of different parts of the model.

Entity Layer: Captures the entities, things, or classes that play a role in each area of the business. Examples include people, documents, or products. The entity layer should capture the business at different levels of detail or specialization. The general concept of “product,” a specific category of products, or a specific product might all be captured as entities.

Property/Attribute Layer: Describes the vocabulary used to describe each entity and to relate entities. For example, it captures the fact that every customer is associated with a contact person or that a product is associated with a price. Properties are also known as relationships.

Business Rule Layer: Allows the information model to centrally capture business rules relating alternative vocabularies. Without this, these business rules would be hard-coded in applications and scripts around IT and would be impossible to track and change. Examples of rules are lookup tables for alternative product codes or logical/arithmetical rules relating alternative vocabularies. This layer might relate the concept of annual revenue to quarterly revenue or capture the logic for the different ways in which the marketing and service departments segment customers.

Descriptor Layer: Ensures that all concepts in the other four layers are documented in a structured way, providing definitions, synonyms, alternatives, examples, and, where needed, foreign-language names.
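The five layers can be illustrated with a toy model. Everything in this sketch is invented for the example (package names, entities, the revenue rule); it only shows how the layers fit together, including the specialization links of the entity layer and a business rule relating alternative vocabularies.

```python
# Toy five-layer information model; all names and rules are invented.
model = {
    "organizational": ["customer", "product"],                  # layer 1: packages
    "entities": {"customer": None, "product": None,             # layer 2: entities,
                 "individual customer": "customer",             # each optionally
                 "business customer": "customer"},              # specializing a parent
    "properties": {"customer": ["contact person"],              # layer 3: vocabulary
                   "product": ["price"]},
    "rules": {},                                                # layer 4: business rules
    "descriptors": {"customer": "A party that buys products."}, # layer 5: documentation
}

# A rule relating alternative vocabularies: annual revenue is derived from
# the four quarterly revenue figures.
model["rules"]["annual revenue"] = lambda quarters: sum(quarters)

def is_a(specific, general):
    """Follow the specialization links of the entity layer upward."""
    parent = model["entities"].get(specific)
    while parent is not None:
        if parent == general:
            return True
        parent = model["entities"].get(parent)
    return False

print(is_a("individual customer", "customer"))             # True
print(model["rules"]["annual revenue"]([12, 9, 11, 15]))   # 47
```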

Note that layers 1-3 are entirely compatible with UML class diagrams, popular with application developers. However, the business rule layer is a key differentiator. Whereas UML emphasizes application dynamics, an ontological information model has a different purpose—capturing business logic.


In summary, an ontological information model leverages existing data and application modeling skills to allow a richer and more intuitive view of business logic.

DATA SEMANTICS

Semantics captures the formal meaning of data. It is achieved by mapping (or rationalizing) the data’s schema to the information model.

Any database or message format with a schema can be mapped, including relational databases, XML, older hierarchical databases, network databases, and COBOL copybooks. Data that is structured without a schema (e.g., EDI messages or flat files) can be parsed and then mapped.

Software can aid the mapping process (see Figure 3) using type information, foreign keys, and even name similarities to suggest matches and to provide an efficient graphical environment. However, mapping will never be totally automatic; only a database administrator or other expert will know how to interpret data accurately.

Semantic mapping creates immediate savings. Having mapped an asset once to an information model, its relationship to all other assets may be inferred automatically. Every asset is therefore mapped only once, in contrast to the current situation in which every data asset is mapped many times, often using inappropriate tools such as MS Word or Excel.
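The “map once” payoff can be shown directly: once each asset is mapped to the information model, its correspondence to every other asset follows without any pairwise mapping. The mappings below are invented for illustration.

```python
# Hypothetical one-time mappings of physical fields to model concepts.
mappings = {
    "crm_db.client":       "individual customer",
    "orders_xml.customer": "individual customer",
    "copybook.CUST3":      "individual customer",
    "crm_db.firm":         "business customer",
}

def correspondents(asset_field):
    """Infer all other fields representing the same business concept.

    With n assets, n mappings to the model replace the n*(n-1)/2 pairwise
    mappings that direct asset-to-asset comparison would require.
    """
    concept = mappings[asset_field]
    return sorted(f for f, c in mappings.items()
                  if c == concept and f != asset_field)

print(correspondents("crm_db.client"))
# -> ['copybook.CUST3', 'orders_xml.customer']
```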

Figure 3. Computer-aided semantic mapping of a relational database to an information model


FUTURE TRENDS

Utilizing semantic information management can have a direct impact on data management, integration, and quality. Having a common understanding allows an enterprise to achieve its key strategic goals of managing data assets, integrating data, and improving data quality. In addition, these goals are achieved while addressing the entire body of information using standard business terms rather than grappling with hundreds of specific data formats.

Data Management

Data Standards: Semantic information management allows new databases and message formats to be directly generated from the information model so that they reflect an agreed-upon business terminology. This is particularly useful as enterprises move to generate internal XML schema messaging standards.

Data Discovery: Data semantics provides a single business lens onto data assets. Managers may use the semantic mappings to discover the data assets covering a particular data concept and to direct developers to the correct assets.

Eliminating Redundancy: Data discovery exposes redundancy between data assets. Redundancy not only causes unnecessary development and operational expense but also creates potential for data inconsistency and is therefore the root of many serious enterprise data quality problems.

Security: Data discovery enables uniform security and privacy policies. Once a particular business concept is highlighted as sensitive, data discovery highlights all the data assets which require securing.

Influencing Developers: Publishing metadata enables developers to find and reuse existing assets rather than invest time creating new ones. It also ensures that they are aware of relevant data usage restrictions and policies. Published semantics goes further in enabling engineers to accurately understand data, thus increasing their productivity and code quality. Finally, semantic information management code generation is attractive to developers, which in turn encourages them to store business rules and transformation code in a central way, thereby making maintenance more effective.

Increasing Data Services: Semantic information management comes with active capabilities that increase the central data services or data architecture group's value to the rest of the enterprise.

Data Integration

EAI: Enterprise application integration products (e.g., IBM WebSphere MQ, Tibco, WebMethods, Vitria, and SeeBeyond) provide an important platform for integrating applications and especially for facilitating messaging and workflow. However, these can be deployed and maintained far more cheaply and flexibly if their translation scripts are auto-generated rather than manually coded.

ETL/BI: On the informational side of IT, mature tools exist to load warehouses and to perform business intelligence analysis of the warehouse. But when designing a warehouse, SIM provides the most reliable methodology for finding and interpreting sources, cleansing them according to central rules, planning transformation scripts, and creating a warehouse schema that reflects business reality.

Corporate Portals: Nowhere is quality information and correct business vocabulary more important than in the corporate portal. Many good portal products provide the portal runtime, but SIM identifies the data sources and generates the data translations, ensuring accurate data presentation.

Version Changes: Changing a database schema may have business importance but can be costly and risky. With a SIM approach, the old and new schemas can both be mapped to the information model. This allows the data migration scripts to be inferred automatically and enables all queries acting on the database to be identified and updated automatically to reflect the new schema.

Data Quality

Unambiguous Meaning: SIM contributes to data quality by ensuring that data has a formal and unambiguous business-oriented meaning. This helps to ensure that data is not misinterpreted either by its ultimate business audience or by developers writing queries to the data.

More Accurate Data Translations: SIM can be used to automate the creation of accurate data translation scripts. These scripts are the critical links ensuring the integrity of data as it travels through the enterprise. Most importantly, the architecture can be used to update these transformations in a consistent way when a change occurs in a data source or a business rule.

Data Consistency: A common understanding of data allows incompatible data formats to be compared automatically, revealing hidden inconsistencies and overlap.
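A small sketch of this comparison, assuming hypothetical record layouts: two sources holding the same entity in incompatible formats are first normalised to shared concepts, after which a plain value comparison reveals the hidden inconsistency.

```python
def normalise(record, field_to_concept, converters=None):
    """Rewrite a source record in terms of shared business concepts,
    applying per-concept converters (e.g., name casing, unit changes)."""
    converters = converters or {}
    out = {}
    for field, value in record.items():
        concept = field_to_concept.get(field)
        if concept:
            out[concept] = converters.get(concept, lambda v: v)(value)
    return out

# Two records about the same employee, held in incompatible formats.
hr_record = {"emp_name": "ada lovelace", "salary_gbp": 52000}
payroll_record = {"full_name": "Ada Lovelace", "annual_pay": 54000}

a = normalise(hr_record, {"emp_name": "Name", "salary_gbp": "AnnualSalary"},
              {"Name": str.title})
b = normalise(payroll_record, {"full_name": "Name", "annual_pay": "AnnualSalary"})

conflicts = {c for c in a.keys() & b.keys() if a[c] != b[c]}
print(conflicts)
# {'AnnualSalary'}: same person, inconsistent pay figures across sources
```

Without the shared-concept step, the differing field names would make the two records incomparable and the inconsistency would go unnoticed.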

Central Approach to Data Rules and Cleansing: Enterprise information quality cannot be achieved by cleansing one database at a time. Business rules and constraints need only be captured once in the information model for all data sources—regardless of their format—to be automatically validated against the same central set of rules.
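A minimal sketch of the central-rules idea, with hypothetical rule definitions and source layouts: each business rule is expressed once against a concept, and any source, whatever its field names, is validated through its mapping.

```python
# Business rules are captured once, against concepts in the information model.
rules = {
    "Age": lambda v: 0 <= v <= 130,
    "Email": lambda v: "@" in v,
}

# Each source pairs its own field-to-concept mapping with a sample record.
sources = [
    ({"age": "Age", "mail": "Email"},
     {"age": 34, "mail": "jo@example.com"}),
    ({"AGE_YRS": "Age", "EMAIL_ADDR": "Email"},
     {"AGE_YRS": 212, "EMAIL_ADDR": "jo@example.com"}),
]

def violations(mapping, record):
    """Validate one record against the central rules via its mapping."""
    return [concept
            for field, concept in mapping.items()
            if concept in rules and not rules[concept](record[field])]

for i, (mapping, record) in enumerate(sources):
    print(f"source {i}: violations {violations(mapping, record)}")
# source 0 passes; source 1 fails the central Age rule despite its
# completely different field names
```

The rule set is written once; adding a new source requires only a new mapping, not a new copy of the rules.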

A SYSTEM FOR SEMANTIC INFORMATION MANAGEMENT

Semantic information management should be supported by a fully integrated suite of products. Key components (as shown in Figure 4) include:

Metadata: A repository for storing metadata on data assets, schemas, and models associated with the assets.

Data Semantics: Integrated tools for ontology modeling in order to support the creation of an information model (capabilities should include compatibility with entity-relationship and UML diagrams) and for semantically mapping data schemas to the central information model. Off-the-shelf information model libraries can also provide an important shortcut.

Figure 4. Logical architecture diagram for semantic information management (figure not reproduced: it shows a user interface serving IT professionals, knowledge workers, developers, DBAs, and information producers and consumers; data management, data integration, and data quality features such as asset discovery, impact analysis, query generation, schema generation, data comparison, and data cleansing; and, underneath, the information model, asset metadata, and metadata repository on a platform providing versioning, collaboration, permissions, and configuration, with metadata and runtime interfaces to external repositories, EAI brokers, ETL, and data cleansing products)

Data Management Services: The system should use the information model's standard business terminology as a lens through which data is managed. Data management should include the ability to author and edit the information model, discover data assets for any given business concept, administer data, create reports and statistics about data assets, test and simulate the information model, and analyze the impact of change.

Data Integration Services: The system should automatically generate code for queries and data transformation scripts between any two mapped data schemas, utilizing the common understanding provided by data semantics.

Data Quality Services: In order to provide a systematic approach to data quality, the system should support the identification and decommissioning of redundant data assets. It should support comparison for ensuring consistency among semantically different data and validation/cleansing of individual sources against the central repository of rules.

Metadata Interface: The system must be able to collect metadata and data models directly from relational databases and other asset types and to exchange metadata with other metadata repositories. Similarly, the metadata and models accumulated by the system must be open to exchange with other systems through the use of adaptors and standards such as XMI (XML Metadata Interchange standard).

Runtime Interface: A key differentiator of semantic information management technology is its active data integration capabilities. The runtime interface ensures that queries, translation scripts, schemas, and cleansing scripts generated automatically by the system can be exported in standard languages (such as SQL and XSLT) to runtime environments, including off-the-shelf ETL, EAI, and data cleansing products.

User Interface: The user interface should include a rich thick-client for power users in the data management group. A customizable Web interface is usually best for developers and business users who are primarily concerned with reading and using the metadata but not creating it.

Platform: The system should include a platform supporting version control, collaboration, permission management, and configuration for all metadata and active content in the system.
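The query-generation capability listed among these components can be sketched briefly: once a source schema is mapped to business concepts, a query for those concepts can be generated without the caller knowing the source's physical layout. Table and column names below are hypothetical.

```python
# Field-to-concept mappings for two hypothetical sources.
mappings = {
    "crm.customers": {"cust_no": "CustomerID", "sname": "Surname"},
    "billing.clients": {"client_id": "CustomerID", "family_name": "Surname"},
}

def generate_query(table, concepts):
    """Generate a SELECT over a mapped table for a list of business
    concepts, aliasing physical columns to concept names."""
    cols = {concept: col for col, concept in mappings[table].items()}
    select = ", ".join(f"{cols[c]} AS {c}" for c in concepts)
    return f"SELECT {select} FROM {table}"

print(generate_query("billing.clients", ["CustomerID", "Surname"]))
# SELECT client_id AS CustomerID, family_name AS Surname FROM billing.clients
```

The same request against "crm.customers" would yield different physical columns but identical concept names, which is what lets consumers work purely in business terminology.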

CONCLUSION

Semantics inspires a vision in which data carries unambiguous business meaning that can be found, aggregated, and used accurately and flexibly without prior knowledge of the data’s specific format. Semantic information management delivers the benefits of semantics and a common business language to enterprise IT.

In today's enterprise, data lives in many incompatible formats; the semantic differences between these formats create an environment with poor business information, lack of business flexibility, and high IT costs.

Existing solutions are insufficient. Metadata repositories are important but do not create sufficient value without semantics. Data modeling is usually limited to relational databases treated one database at a time. Data cleansing is an important way to improve a database but does not address the strategic problem of obtaining quality across hundreds of data assets.

Semantic information management addresses the core of the problem by capturing the precise meaning of data in agreed-upon terms. Key elements of the architecture are metadata (knowing your data), an information model (knowing your business), and data semantics (understanding your data).

Semantic information management creates value by delivering higher quality business information, providing the flexibility to support business change, and making IT costs lower and more predictable. It can be introduced gradually to specific projects and eventually extended to the entire enterprise.

With semantic information management, the enterprise can strive for an environment in which everyone speaks the same business language, data carries unambiguous business meaning, and the data environment is managed and integrated at will. This is accomplished while providing high quality information to the business and allowing the enterprise to adapt itself in real time.

