
Data Mining with Ontologies:

Implementations, Findings, and Frameworks

Héctor Oscar Nigro

Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina

Sandra Elizabeth González Císaro

Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina

Daniel Hugo Xodo

Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina

Information science reference

Hershey • New York

Acquisitions Editor:

Kristin Klinger

Development Editor:

Kristin Roth

Senior Managing Editor:

Jennifer Neidig

Managing Editor:

Sara Reed

Copy Editor:

Katie Smalley

Typesetter:

Jamie Snavely

Cover Design:

Lisa Tosheff

Printed at:

Yurchak Printing Inc.

Published in the United States of America by

Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue, Suite 200

Hershey PA 17033

Tel: 717-533-8845

Fax: 717-533-8661 E-mail: cust@igi-pub.com

Web site: http://www.igi-pub.com/reference

and in the United Kingdom by

Information Science Reference (an imprint of IGI Global) 3 Henrietta Street

Covent Garden London WC2E 8LU Tel: 44 20 7240 0856 Fax: 44 20 7379 0609

Web site: http://www.eurospanonline.com

Copyright © 2008 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Data mining with ontologies : implementations, findings and frameworks / Hector Oscar Nigro, Sandra Gonzalez Cisaro, and Daniel Xodo, editors.

p. cm.

Summary: “Prior knowledge in data mining is helpful for selecting suitable data and mining techniques, pruning the space of hypothesis, representing the output in a comprehensible way, and improving the overall method. This book examines methodologies and research for the development of ontological foundations for data mining to enhance the ability of ontology utilization and design”--Provided by publisher.

Includes bibliographical references and index.

ISBN 978-1-59904-618-1 (hardcover) -- ISBN 978-1-59904-620-4 (ebook)

1. Data mining. 2. Ontologies (Information retrieval) I. Nigro, Hector Oscar. II. Cisaro, Sandra Gonzalez. III. Xodo, Daniel.

QA76.9.D343D39 2008

005.74--dc22

2007007283

British Cataloguing in Publication Data

A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book set is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.

Table of Contents

Preface

Acknowledgment

Section I

Implementations

Chapter I

TODE: An Ontology-Based Model for the Dynamic Population of Web Directories / Sofia Stamou, Alexandros Ntoulas, and Dimitris Christodoulakis

Chapter II

Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology / Xuan Zhou and James Geller

Chapter III

Web Usage Mining for Ontology Management / Brigitte Trousse, Marie-Aude Aufaure, Bénédicte Le Grand, Yves Lechevallier, and Florent Masseglia

Chapter IV

SOM-Based Clustering of Multilingual Documents Using an Ontology / Minh Hai Pham, Delphine Bernhard, Gayo Diallo, Radja Messai, and Michel Simonet

Section II

Findings

Chapter V

Ontology-Based Interpretation and Validation of Mined Knowledge: Normative and Cognitive Factors in Data Mining / Ana Isabel Canhoto

Chapter VI

Data Integration Through Protein Ontology / Amandeep S. Sidhu, Tharam S. Dillon, and Elizabeth Chang

Chapter VII

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology / Josiane Mothe and Nathalie Hernandez

Chapter VIII

Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts / Stanley Loh, Daniel Lichtnow, Thyago Borges, and Gustavo Piltcher

Section III

Frameworks

Chapter IX

Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies / Vania Bogorny, Paulo Martins Engel, and Luis Otavio Alvares

Chapter X

Ontology-Based Construction of Grid Data Mining Workflows / Peter Brezany, Ivan Janciak, and A Min Tjoa

Chapter XI

Ontology-Based Data Warehousing and Mining Approaches in Petroleum Industries / Shastri L. Nimmagadda and Heinz Dreher

Chapter XII

A Framework for Integrating Ontologies and Pattern-Bases / Evangelos Kotsifakos, Gerasimos Marketos, and Yannis Theodoridis

Compilation of References

About the Contributors

Index

Detailed Table of Contents

Preface

Acknowledgment

Section I

Implementations

Chapter I

TODE: An Ontology-Based Model for the Dynamic Population of Web Directories / Sofia Stamou, Alexandros Ntoulas, and Dimitris Christodoulakis

In this chapter we study how we can organize the continuously proliferating Web content into topical categories, also known as Web directories. In this respect, we have implemented a system, named TODE, that uses a topical ontology for directories’ editing. First, we describe the process for building our ontology of Web topics, which are treated in TODE as directories’ topics. Then, we present how TODE interacts with the ontology in order to categorize Web pages into the ontology’s topics and we experimentally study our system’s efficiency in grouping Web pages thematically. We evaluate TODE’s performance by comparing its resulting categorization for a number of pages to the categorization the same pages display in the Google directory as well as to the categorizations delivered for the same set of pages and topics by a Bayesian classifier. Results indicate that our model has a noticeable potential in reducing the human-effort overheads associated with populating Web directories. Furthermore, experimental results imply that the use of a rich topical ontology significantly increases classification accuracy for dynamic contents.

Chapter II

Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology / Xuan Zhou and James Geller

This chapter introduces Raising as an operation that is used as a preprocessing step for data mining. In the Web Marketing Project, people’s demographic and interest information has been collected from the Web. Rules have been derived using this information as input for data mining. The Raising step takes advantage of an interest ontology to advance data mining and to improve rule quality. The definition and implementation of Raising are presented in this chapter. Furthermore, the effects caused by Raising are analyzed in detail, showing an improvement of the support and confidence values of useful association rules for marketing purposes.
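The abstract does not define Raising precisely, so the following is only a minimal sketch of one plausible reading: each specific interest is replaced by its ancestor at a chosen level of an interest ontology before support is counted. The ontology, the records, and the function names (`raise_interest`, `support`) are hypothetical and are not taken from the chapter.

```python
# Hypothetical interest ontology: child -> parent (None marks the root).
PARENT = {
    "interests": None,
    "sports": "interests",
    "music": "interests",
    "tennis": "sports",
    "soccer": "sports",
    "jazz": "music",
}

def ancestors(concept):
    """Return the path from the root down to `concept` (inclusive)."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path[::-1]

def raise_interest(concept, level):
    """Replace `concept` by its ancestor at depth `level` (root = 0).

    Concepts already at or above that level are returned unchanged.
    """
    path = ancestors(concept)
    return path[level] if len(path) > level else concept

def support(records, interest):
    """Fraction of records whose interest set contains `interest`."""
    return sum(interest in r for r in records) / len(records)

# Illustrative data only: each record is one person's set of interests.
records = [{"tennis"}, {"soccer"}, {"jazz"}, {"tennis", "jazz"}]
raised = [{raise_interest(i, 1) for i in r} for r in records]

print(support(records, "tennis"))  # support of the specific interest
print(support(raised, "sports"))   # support after raising to level 1
```

In this toy data, the raised concept is shared by more records than any single specific interest, which is the kind of support improvement the abstract refers to. The chapter's actual definition may differ.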

Chapter III

Web Usage Mining for Ontology Management / Brigitte Trousse, Marie-Aude Aufaure, Bénédicte Le Grand, Yves Lechevallier, and Florent Masseglia

This chapter proposes an original approach for ontology management in the context of Web-based information systems. Our approach relies on the usage analysis of the chosen Web site, as a complement to the existing approaches based on content analysis of Web pages. Our methodology is based on knowledge discovery techniques applied mainly to HTTP Web logs and aims at confronting the discovered knowledge in terms of usage with the existing ontology in order to propose new relations between concepts. We illustrate our approach on a Web site provided by French local tourism authorities (related to Metz city) with the use of clustering and sequential patterns discovery methods. One major contribution of this chapter, thus, is the application of usage analysis to support ontology evolution and/or Web site reorganization.

Chapter IV

SOM-Based Clustering of Multilingual Documents Using an Ontology / Minh Hai Pham, Delphine Bernhard, Gayo Diallo, Radja Messai, and Michel Simonet

Clustering similar documents is a difficult task for text data mining. Difficulties stem especially from the way documents are translated into numerical vectors. In this chapter, we will present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it does not rely on the words shared by documents, but rather on concepts taken from an ontology. Our goal is to cluster various medical documents in thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before applying the SOM algorithm, documents have to go through several preprocessing steps. First, textual data have to be extracted from the documents, which can be either in the PDF or HTML format. Documents are then indexed, using two kinds of indexing units: stems and concepts. After indexing, documents can be numerically represented by vectors whose dimensions correspond to indexing units. These vectors store the weight of the indexing unit within the document they represent. They are given as inputs to a SOM, which arranges the corresponding documents on a two-dimensional map. We have compared the results for two indexing schemes: stem-based indexing and conceptual indexing. We will show that using an ontology for document clustering has several advantages. It is possible to cluster documents written in several languages since concepts are language-independent. This is especially helpful in the medical domain where research articles are written in different languages. Another advantage is that the use of concepts helps reduce the size of the vectors, which, in turn, reduces processing time.
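To illustrate why conceptual indexing is language-independent, here is a minimal sketch of the indexing step only; the SOM itself is omitted. The term-to-concept lexicon, the concept identifiers, and the raw-count weighting are all assumptions made for illustration; the chapter's actual ontology and weighting scheme are not given in this abstract.

```python
import math

# Hypothetical mini-lexicon: surface term (any language) -> concept identifier.
TERM_TO_CONCEPT = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "artery": "C_ARTERY", "artère": "C_ARTERY",
    "lung": "C_LUNG", "poumon": "C_LUNG",
}
CONCEPTS = sorted(set(TERM_TO_CONCEPT.values()))

def concept_vector(text):
    """Count concept occurrences in `text`; one dimension per concept."""
    counts = dict.fromkeys(CONCEPTS, 0)
    for token in text.lower().split():
        concept = TERM_TO_CONCEPT.get(token)
        if concept is not None:
            counts[concept] += 1
    return [counts[c] for c in CONCEPTS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

english = concept_vector("the heart pumps blood through each artery")
french = concept_vector("le coeur et une artère")
other = concept_vector("the lung exchanges gases")

print(cosine(english, french))  # same concepts despite different languages
print(cosine(english, other))   # no shared concepts
```

The vectors have one dimension per concept rather than one per word, which also reflects the abstract's point about smaller vectors. Vectors of this kind would then be fed to a SOM implementation of the reader's choice.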

Section II

Findings

Chapter V

Ontology-Based Interpretation and Validation of Mined Knowledge: Normative and Cognitive Factors in Data Mining / Ana Isabel Canhoto

The use of automated systems to collect, process, and analyze vast amounts of data is now integral to the operations of many corporations and government agencies; in particular, it has gained recognition as a strategic tool in the war on crime. Data mining, the technology behind such analysis, has its origins in quantitative sciences. Yet, analysts face important issues of a cognitive nature both in terms of the input for the data mining effort, and in terms of the analysis of the output. Domain knowledge and bias information influence which patterns in the data are deemed useful and, ultimately, valid. This chapter addresses the role of cognition and context in the interpretation and validation of mined knowledge. We propose the use of ontology charts and norm specifications to map how varying levels of access to information and exposure to specific social norms lead to divergent views of mined knowledge.

Chapter VI

Data Integration Through Protein Ontology / Amandeep S. Sidhu, Tharam S. Dillon, and Elizabeth Chang

Traditional approaches to integrating protein data generally involved keyword searches, which immediately exclude unannotated or poorly annotated data. An alternative protein annotation approach is to rely on sequence identity, structural similarity, or functional identification. Some proteins have a high degree of sequence identity, structural similarity, or similarity in functions that are unique to members of that family alone. Consequently, this approach cannot be generalized to integrate the protein data. Clearly, these traditional approaches have limitations in capturing and integrating data for protein annotation. For these reasons, we have adopted an alternative method that does not rely on keywords or similarity metrics, but instead uses ontology. In this chapter we discuss the conceptual framework of protein ontology, which has a hierarchical classification of concepts represented as classes, from general to specific; a list of attributes related to each concept, for each class; a set of relations between classes to link concepts in the ontology in more complicated ways than implied by the hierarchy, to promote reuse of concepts in the ontology; and a set of algebraic operators for querying protein ontology instances.

Chapter VII

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology / Josiane Mothe and Nathalie Hernandez

This chapter introduces a method re-using a thesaurus built for a given domain, in order to create new resources of a higher semantic level in the form of an ontology. Considering ontologies for data-mining tasks relies on the intuition that the meaning of textual information depends on the conceptual relations between the objects to which they refer rather than on the linguistic and statistical relations of their content. To put forward such advanced mechanisms, the first step is to build the ontologies. The originality of the method is that it is based both on the knowledge extracted from a thesaurus and on the knowledge semiautomatically extracted from a textual corpus. The whole process is semiautomated and experts’ tasks are limited to validating certain steps. In parallel, we have developed mechanisms based on the obtained ontology to accomplish a science monitoring task. An example will be given.

Chapter VIII

Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts / Stanley Loh, Daniel Lichtnow, Thyago Borges, and Gustavo Piltcher

This chapter investigates different aspects in the construction of a domain ontology for a content-based recommender system. The recommender system suggests textual electronic documents from a digital library, based on documents read by the users and based on textual messages posted in electronic discussions through a Web chat. The domain ontology is used to represent the user’s interest and the content of the documents. In this context, the ontology is composed of a hierarchy of concepts and keywords. Each concept has a vector of keywords with associated weights. Keywords are used to identify the content of the texts (documents and messages), through the application of text mining techniques. The chapter discusses different approaches for constructing the domain ontology, including the use of text mining software tools for supervised learning, the intervention of domain experts in the engineering process and the use of a normalization step.

Section III

Frameworks

Chapter IX

Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies / Vania Bogorny, Paulo Martins Engel, and Luis Otavio Alvares

This chapter introduces the problem of mining frequent geographic patterns and spatial association rules from geographic databases. In the geographic domain most discovered patterns are trivial, non-novel, and noninteresting, and simply represent natural geographic associations intrinsic to geographic data. A large number of natural geographic associations are explicitly represented in geographic database schemas and geo-ontologies, which have not been used so far in frequent geographic pattern mining. Therefore, this chapter presents a novel approach to extract patterns from geographic databases using geo-ontologies as prior knowledge. The main goal of this chapter is to show how the large amount of knowledge represented in geo-ontologies can be used to avoid the extraction of patterns that are already known to be noninteresting.
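The abstract does not say how the geo-ontology is applied during mining. As a rough sketch under that caveat, one simple possibility is to discard any candidate itemset that contains a pair of feature types the ontology already records as a known dependence. The dependences, feature names, and functions below are hypothetical.

```python
from itertools import combinations

# Hypothetical well-known dependences that a geo-ontology might encode,
# stored as unordered pairs of geographic feature types.
KNOWN_DEPENDENCES = {
    frozenset({"gas_station", "street"}),
    frozenset({"island", "water_body"}),
}

def is_trivial(itemset):
    """True if any pair of items in `itemset` is a known dependence."""
    return any(frozenset(pair) in KNOWN_DEPENDENCES
               for pair in combinations(sorted(itemset), 2))

def prune(candidates):
    """Keep only candidate itemsets free of known dependences."""
    return [c for c in candidates if not is_trivial(c)]

# Illustrative candidate itemsets of co-located feature types.
candidates = [
    {"gas_station", "street"},
    {"gas_station", "school"},
    {"island", "water_body", "school"},
    {"school", "street"},
]
kept = prune(candidates)
print(kept)
```

Filtering candidates before support counting, rather than filtering rules afterwards, is a design choice of this sketch. It reduces the number of itemsets to count, but the chapter may place the filter elsewhere in the process.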

Chapter X

Ontology-Based Construction of Grid Data Mining Workflows / Peter Brezany, Ivan Janciak, and A Min Tjoa

This chapter introduces an ontology-based framework for automated construction of complex interactive data mining workflows as a means of improving productivity of Grid-enabled data exploration systems. The authors first characterize existing manual and automated workflow composition approaches and then present their solution called GridMiner Assistant (GMA), which addresses the whole life cycle of the knowledge discovery process. GMA is specified in the OWL language and is being developed around a novel data mining ontology, which is based on concepts of industry standards like the Predictive Model Markup Language, the Cross Industry Standard Process for Data Mining, and the Java Data Mining API. The ontology introduces basic data mining concepts like data mining elements, tasks, services, and so forth. In addition, conceptual and implementation architectures of the framework are presented and its application to an example taken from the medical domain is illustrated. The authors hope that the further research and development of this framework can lead to productivity improvements, which can have significant impact on many real-life spheres. For example, it can be a crucial factor in achievement of scientific discoveries, optimal treatment of patients, productive decision making, cutting costs, and so forth.

Chapter XI

Ontology-Based Data Warehousing and Mining Approaches in Petroleum Industries / Shastri L. Nimmagadda and Heinz Dreher

Several issues of database organization in petroleum industries are highlighted. Complex geospatial heterogeneous data structures complicate the accessibility and presentation of data in petroleum industries. The objectives of the current research are to integrate the data from different sources and connect them intelligently. A data warehousing approach supported by ontology is described for effective data mining of petroleum data sources. A petroleum ontology framework, narrating the conceptualization of petroleum ontology and methodological architectural views, is described. Ontology-based data warehousing with fine-grained multidimensional data structures facilitates the mining and visualization of data patterns, trends, and correlations hidden under massive volumes of data. Data structural designs and implementations deduced through ontology-supported data warehousing approaches will enable researchers in commercial organizations, such as those of the Western Australian petroleum industries, to map knowledge and thus interpret knowledge models for making million-dollar financial decisions.

Chapter XII

A Framework for Integrating Ontologies and Pattern-Bases / Evangelos Kotsifakos, Gerasimos Marketos, and Yannis Theodoridis

Pattern base management systems (PBMS) have been introduced as an effective way to manage the high volume of patterns available nowadays. PBMS provide pattern management functionality in the same way that a database management system provides data management functionality. However, not all the extracted patterns are interesting; some are trivial and insignificant because they do not make