
Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies

The development of toolkits that integrate data mining techniques, geographic databases, and knowledge repositories is another need for practical applications. Although a large number of algorithms have been proposed, their implementation in toolkits with friendly graphical user interfaces that cover the whole KDD process is rare. The gap between data mining techniques and geographic databases is still a problem that makes geographic data preprocessing the most effort- and time-consuming step of knowledge discovery in these databases.

CONCLUSION

This chapter presented an intelligent framework for geographic data preprocessing and SAR mining using geo-ontologies as prior knowledge. The knowledge refers to mandatory and prohibited semantic geographic constraints, which are explicitly represented in geo-ontologies because they are part of the concepts of geographic data. We showed that explicit mandatory relationships produce irrelevant patterns, and that prohibited relationships do not need to be computed, since they will never hold if the database is consistent. Possible implicit spatial relationships may lead to more interesting patterns and rules, and they can be inferred using geo-ontologies.

Experiments showed that, independent of the number of elements, one dependence is enough to prune a large number of patterns and rules, and the higher the number of eliminated semantic constraints, the larger the reduction of frequent patterns and rules. We showed that well-known dependences can be partially eliminated with intelligent data preprocessing, independently of the algorithm to be used for frequent pattern mining. To completely eliminate geographic dependences we presented a pruning method that can be applied to any algorithm that generates frequent sets, including closed frequent sets. Algorithms for mining nonredundant association rules can reduce the number of rules further if applied to the geographic domain using our method to generate frequent sets.
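The pruning idea can be pictured with a minimal sketch. This is the general idea only, not the authors' exact algorithm, and the feature types and dependences below are invented examples: a candidate frequent set is discarded as soon as it contains a pair of feature types for which the geo-ontology states a dependence.

```python
from itertools import combinations

# Dependences that a geo-ontology might state (illustrative values only).
dependences = {frozenset({"gas_station", "street"}),
               frozenset({"island", "water_body"})}

def prune(frequent_sets):
    """Keep only the sets that contain no pair with a known geographic dependence."""
    return [s for s in frequent_sets
            if not any(frozenset(pair) in dependences
                       for pair in combinations(s, 2))]

print(prune([{"gas_station", "street", "slum"},   # dropped: contains a dependence
             {"hospital", "water_body"}]))        # kept
```

Because the filter only inspects the generated sets, it can be placed after any frequent set generator, which is what makes the approach algorithm-independent.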

Considering semantics in geographic data preprocessing and frequent pattern mining has three main advantages: spatial relationships between feature types with dependences are not computed; the number of both frequent sets and association rules is significantly reduced; and, most importantly, the generated frequent sets and rules are free of associations that are known in advance to be noninteresting.

The main contribution of the method presented in this chapter for mining spatial association rules is for the data mining user, who will have to analyze far fewer obvious rules. The method is effective independently of other thresholds, and guarantees that geographic domain associations will not appear among the resultant set of rules.

ACKNOWLEDGMENT

The authors would like to thank CAPES and CNPq, which partially provided the financial support for this research; Procempa, for the geographic database; and the Nodo XLDB da Linguateca of Universidade de Lisboa, for the geographic ontology. Our special thanks to Mariusa Warpechowski and Daniela Leal Musa for the ontology modeling support, and to Sandro da Silva Camargo for the support with the data mining algorithms.


Chapter X

Ontology-Based Construction of Grid Data Mining Workflows

Peter Brezany

University of Vienna, Austria

Ivan Janciak

University of Vienna, Austria

A Min Tjoa

Vienna University of Technology, Austria

ABSTRACT

This chapter introduces an ontology-based framework for automated construction of complex interactive data mining workflows as a means of improving productivity of Grid-enabled data exploration systems. The authors first characterize existing manual and automated workflow composition approaches and then present their solution called GridMiner Assistant (GMA), which addresses the whole life cycle of the knowledge discovery process. GMA is specified in the OWL language and is being developed around a novel data mining ontology, which is based on concepts of industry standards like the predictive model markup language, cross industry standard process for data mining, and Java data mining API. The ontology introduces basic data mining concepts like data mining elements, tasks, services, and so forth. In addition, conceptual and implementation architectures of the framework are presented and its application to an example taken from the medical domain is illustrated. The authors hope that the further research and development of this framework can lead to productivity improvements, which can have significant impact on many real-life spheres. For example, it can be a crucial factor in achievement of scientific discoveries, optimal treatment of patients, productive decision making, cutting costs, and so forth.



INTRODUCTION

Grid computing is emerging as a key enabling infrastructure for a wide range of disciplines in science and engineering. Some of the hot topics in current Grid research include the issues associated with data mining and other analytical processes performed on large-scale data repositories integrated into the Grid. These processes are not implemented as monolithic codes. Instead, the standalone processing phases, implemented as Grid services, are combined to process data and extract knowledge patterns in various ways. They can now be viewed as complex workflows, which are highly interactive and may involve several subprocesses, such as data cleaning, data integration, data selection, modeling (applying a data mining algorithm), and postprocessing the mining results (e.g., visualization). The targeted workflows are often large, both in terms of the number of tasks in a given workflow and in terms of the total execution time. There are many possible choices concerning each process's functionality and parameters as well as the ways a process is combined into the workflow, but only some combinations are valid. Moreover, users need to discover Grid resources and analytical services manually and schedule these services directly on the Grid resources, essentially composing detailed workflow descriptions by hand. At present, only such a "low-productivity" working model is available to the users of the first generation data mining Grids, like GridMiner (Brezany et al., 2004) (a system developed by our research group), DiscoveryNet (Sairafi et al., 2003), and so forth. Productivity improvements can have significant impact on many real-life spheres; for example, they can be a crucial factor in achievement of scientific discoveries, optimal treatment of patients, productive decision making, cutting costs, and so forth. There is a stringent need for automatic or semiautomatic support for constructing valid and efficient data mining workflows on the Grid, and this (long-term) goal is associated with many research challenges.

The objective of this chapter is to present an ontology-based workflow construction framework reflecting the whole life cycle of the knowledge discovery process and explain the scientific rationale behind its design. We first introduce possible workflow composition approaches—we consider two main classes: (1) manual composition used by the current Grid data mining systems, for example, the GridMiner system, and (2) automated composition, which is addressed by our research and presented in this chapter. Then we relate these approaches to the work of others. The kernel part presents the whole framework built up around a data mining ontology developed by us. This ontology is based on concepts reflecting the terms of several standards, namely, the predictive model markup language, cross industry standard process for data mining, and Java data mining API. The ontology is specified by means of OWL-S, a Web ontology language for services, and uses some concepts from Weka, a popular open source data mining toolkit. Further, conceptual and implementation architectures of the framework are discussed and illustrated by an application example taken from a medical domain. Based on the analysis of future and emerging trends and associated challenges, we discuss some future research directions followed by brief conclusions.

BACKGROUND

In the context of modern service-oriented Grid architectures, the data mining workflow can be seen as a collection of Grid services that are processed on distributed resources in a well-defined order to accomplish a larger and sophisticated data exploration goal. At the highest level, functions of Grid workflow management systems could be characterized into build-time functions and run-time functions. The build-time functions are concerned with defining and modeling workflow tasks and their dependencies, while the run-time functions are concerned with managing the workflow execution and interactions with Grid resources for processing workflow applications. Users interact with workflow modeling tools to generate a workflow specification, which is submitted for execution to a run-time service called workflow enactment service, or workflow engine. Many languages, mostly based on XML, were defined for workflow description, like XLANG (Thatte, 2001), WSFL (Leymann, 2001), DSCL (Kickinger et al., 2003), and BPML (Arkin, 2002). Eventually, the WSBPEL (Arkin et al., 2005) and BPEL4WS (BEA et al., 2003) specifications emerged as the de facto standard.

In our research, we consider two main workflow composition models: manual (implemented in the fully functional GridMiner prototype (Kickinger et al., 2003)) and automated (addressed in this chapter), as illustrated in Figure 1. Within manual composition, the user constructs the target workflow specification graphically in the workflow editor by means of the advanced graphical user interface. The graphical form is converted into a workflow description document, which is passed to the workflow engine. Based on the workflow description, the engine sequentially or in parallel calls the appropriate analytical services (database access, preprocessing, OLAP, classification, clustering, etc.). During the workflow execution, the user only has the ability to stop, inspect, resume, or cancel the execution. As a result, the user has limited abilities to interact with the workflow and influence the execution process. A similar approach was implemented in the DiscoveryNet (Sairafi et al., 2003) workflow management system.

The automated composition is based on the intensive support of five components: workflow composer, resources monitoring, workflow engine, knowledge base, and reasoner.

Figure 1. Workflow composition approaches


Workflow composer: A specialized tool that interacts with the user during the workflow composition process. This chapter describes its functionality in detail.

Resources monitoring: Its main purpose is obtaining information concerning the utilization of system resources. A variety of systems exist for monitoring and managing distributed Grid-based resources and applications. For example, the monitoring and discovery system (MDS) is the information services component of the Globus Toolkit (Globus Alliance, 2005), which provides information about available resources on the Grid and their status. Moreover, MDS facilitates the discovery and characterization of resources and monitors services and computations. The information provided by resource monitoring can be continuously updated in the knowledge base (KB) to reflect the current status of the Grid resources.

Workflow engine: A runtime execution environment that performs the coordination of services as specified in the workflow description expressed in terms of a workflow language. The workflow engine is able to invoke and orchestrate the services and acts as their client, that is, it listens to the notification messages, delivers outputs, and so forth.

Knowledge base (KB) and reasoner: A set of ontologies can be used for the specification of the KB structure, which is built up using a set of instances of ontology classes and rules. The reasoner applies deductive reasoning about the stored knowledge in a logically consistent manner; it assures consistency of the ontology and answers given queries.
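To make the role of the KB and reasoner more concrete, the following toy sketch (our illustration, not part of GridMiner; the triples and the rule are invented) keeps facts as subject-predicate-object triples and uses a naive forward-chaining reasoner to derive new facts and answer a query.

```python
# Facts about services and data, stored as (subject, predicate, object) triples.
facts = {
    ("PreprocessingService", "produces", "CleanedDataset"),
    ("PreprocessingService", "hasInput", "RawDataset"),
    ("ClusteringService", "hasInput", "CleanedDataset"),
    ("RawDataset", "availableOn", "GridNodeA"),
}

def rule_depends_on(kb):
    """If A produces X and B has X as input, then B dependsOn A."""
    return {(b, "dependsOn", a)
            for (a, p, x) in kb if p == "produces"
            for (b, q, y) in kb if q == "hasInput" and y == x}

def reason(kb, rules):
    """Compute the deductive closure of the KB under the given rules."""
    kb = set(kb)
    while True:
        new = set().union(*(rule(kb) for rule in rules)) - kb
        if not new:
            return kb
        kb |= new

closed = reason(facts, [rule_depends_on])
# Query: which services must run before ClusteringService?
print([a for (s, p, a) in closed if s == "ClusteringService" and p == "dependsOn"])
```

In the framework itself this role is played by an OWL knowledge base and a description-logic reasoner rather than hand-written rules; the sketch only shows the querying and fact-derivation behavior described above.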

Due to different roles and behaviors of the presented components, we distinguish two modes of automated workflow composition: passive and active.

Passive Workflow Construction

The passive approach is based on the assumption that the workflow composer is able to compose a reasoning-based complete workflow description involving all possible scenarios of the workflow engine behavior and reflecting the status of the involved Grid resources and task parameters provided by the user at the workflow composition time. Although the KB is continuously modified by the user's entries and by information retrieved from the resource monitoring services, the composition of involved services is not updated during the workflow execution. Therefore, the composition does not reflect the 'state of the world,' which can be dynamically changed during the execution. It means that the workflow engine does not interact with the inference engine to reason about knowledge in the KB. Thus, the behavior of the engine (the decisions it takes) is steered by fixed condition statements as specified in the workflow document.

The essential tasks leading to a final outcome of the passive workflow composition approach can be summarized as follows:

1. The workflow composer constructs a complete workflow description based on the information collected in the KB and presents it to the workflow engine in an appropriate workflow language.

2. The workflow engine executes each subsequent composition step as presented in the workflow description, which includes all possible scenarios of the engine behavior.
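As a rough illustration of the two steps above, the sketch below (hypothetical service names, parameters, and structure; not the actual GridMiner workflow language) shows a composer that emits a complete description, including a branch fixed in advance, and an engine that walks through it without ever consulting the KB at run time.

```python
def compose_passive_workflow(kb):
    """Build the full workflow description once, from a snapshot of the KB."""
    return [
        {"service": "DataAccessService", "params": {"source": kb["data_source"]}},
        {"service": "PreprocessingService", "params": {"method": "cleaning"}},
        # Fixed condition statement baked into the workflow document itself.
        {"if": lambda results: results.get("rows", 0) > 10_000,
         "then": {"service": "SamplingService", "params": {"fraction": 0.1}}},
        {"service": "ClusteringService", "params": {"k": 5}},
    ]

def run_passive(workflow, invoke):
    """Execute the steps exactly as written; decisions use only fixed conditions."""
    results = {}
    for step in workflow:
        if "if" in step:
            if not step["if"](results):
                continue
            step = step["then"]
        results.update(invoke(step["service"], step["params"]))
    return results

# Example run with a dummy invoker standing in for the Grid services.
out = run_passive(compose_passive_workflow({"data_source": "patients_db"}),
                  invoke=lambda name, params: {"rows": 20_000, "last": name})
```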

Active Workflow Construction

The active approach assumes a kind of intelligent behavior by the workflow engine supported by an inference engine and the related KB. Workflow composition is done in the same way as in the passive approach, but its usability is more efficient because it reflects a 'state of the world.' It means that the outputs and effects of the executed services are propagated to the KB together with changes of the involved Grid resources. Considering these changes, the workflow engine dynamically makes decisions about next execution steps. In this approach, no workflow document is needed because the workflow engine instructs itself using an inference engine which queries and updates the KB. The KB is queried each time the workflow engine needs information to invoke a consequent service, for example, it decides which concrete service should be executed, discovers the values of its input parameters in the KB, and so forth. The workflow engine also updates the KB when there is a new result returned from an analytical service that can be reused as input for the other services.

The essential tasks leading to a final outcome in the active workflow composition approach can be summarized as follows:

1. The workflow composer constructs an abstract workflow description based on the information collected in the KB and propagates the workflow description back into the KB. The abstract workflow is not a detailed description of the particular steps in the workflow execution but instead a kind of path that leads to the demanded outcome.

2. The workflow engine executes each subsequent composition step as a result of its interaction with the KB reflecting its actual state. The workflow engine autonomously constructs directives for each service execution and adapts its behavior during the execution.
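The contrast with the passive sketch can be summarized in a short loop. This is only an illustration under assumed interfaces (next_service and invoke are placeholders, not real GridMiner components): there is no workflow document, and each next step is derived from the current content of the KB, whose state the executed services keep updating.

```python
def run_active(kb, goal, next_service, invoke):
    """kb: mutable dict of facts; next_service: a reasoner query returning the
    next runnable service name or None; invoke: runs a service on the Grid."""
    while goal not in kb:
        step = next_service(kb, goal)   # in practice, a query answered by the reasoner
        if step is None:
            raise RuntimeError("no service can produce the demanded outcome")
        outputs = invoke(step, kb)      # input parameter values are discovered in the KB
        kb.update(outputs)              # effects become visible to later decisions
    return kb[goal]
```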

Related Work

A main focus of our work presented in this chapter is on the above-mentioned passive approach of automated workflow composition. This research was partially motivated by Bernstein et al. (2001). They developed an intelligent discovery assistant (IDA), which provides users (data miners) with (1) systematic enumerations of valid data mining processes according to the constraints imposed by the users' inputs, the data, and/or the data mining ontology, in order that important and potentially fruitful options are not overlooked, and (2) effective rankings of these valid processes by different criteria (e.g., speed and accuracy) to facilitate the choice of data mining processes to execute. The IDA performs a search of the space of processes defined by the ontology. Hence, no standard language for ontology specification and no appropriate reasoning mechanisms are used in their approach. Further, they do not consider any state-of-the-art workflow management framework and language.

Substantial work has already been done on automated composition of Web services using Semantic Web technologies. For example, Majithia et al. (2004) present a framework to facilitate automated service composition in service-oriented architectures (Tsalgatidou & Pilioura, 2002) using Semantic Web technologies. The main objective of the framework is to support the discovery, selection, and composition of semantically-described heterogeneous Web services. The framework supports mechanisms to allow users to elaborate workflows at two levels of granularity: abstract and concrete workflows. Abstract workflows specify the workflow without referring to any specific service implementation. Hence, services (and data sources) are referred to by their logical names. A concrete workflow specifies the actual names and network locations of the services participating in the workflow. These two levels of workflow granularity are also considered in our approach, as shown in an application example.
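The distinction between the two granularity levels can be pictured with a small sketch. This is our illustration only; the service names, registry, and URLs are invented and are not taken from the cited framework or from GridMiner.

```python
# An abstract workflow names services logically; binding it against a registry of
# currently deployed endpoints (e.g., fed by resources monitoring) yields the
# concrete workflow with actual network locations.

abstract_workflow = ["DataAccess", "Preprocessing", "DecisionTreeClassifier"]

registry = {  # logical name -> endpoint, as it might be recorded in the KB
    "DataAccess": "https://node1.example.org/services/DataAccess",
    "Preprocessing": "https://node2.example.org/services/Preprocessing",
    "DecisionTreeClassifier": "https://node3.example.org/services/DecisionTree",
}

concrete_workflow = [{"task": name, "endpoint": registry[name]}
                     for name in abstract_workflow]
```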

Challenges associated with Grid workflow planning based on artificial intelligence concepts and with the generation of abstract and concrete workflows are addressed by Deelman et al. (2003). However, they do not consider any service-oriented architecture. Workflow representation and enactment are also investigated by the NextGrid Project (NextGrid Project, 2006). They proposed the OWL-WS (OWL for workflow and services) (Beco et al., 2006) ontology definition language. The myGrid project has developed the Taverna Workbench (Oinn et al., 2004) for the composition and execution of workflows for the life sciences community. The assisted composition approach of Sirin et al. (2004) uses the richness of Semantic Web service descriptions and information from the compositional context to filter matching services and help select appropriate services.

UNDERLYING STANDARDS AND TECHNOLOGIES

Cross Industry Standard Process for Data Mining

Cross industry standard process for data mining (CRISP-DM) (Chapman et al., 1999) is a data mining process model that describes commonly used approaches that expert data miners use to tackle the problem of organizing the phases of data mining projects. CRISP-DM does not describe a particular data mining technique; rather, it focuses on the process of a data mining project's life cycle.

The CRISP-DM data mining methodology is described in terms of a hierarchical process model consisting of sets of tasks organized at four levels of abstraction: phase, generic task, specialized task, and process instance. At the top level, the life cycle of a data mining project is organized into six phases, as depicted in Figure 2.

The sequence of the phases is not strict. Moving back and forth between different phases is always required. It depends on the outcome of each phase which one, or which particular task of a phase, has to be performed next. In this chapter, we focus our attention on three phases of the data mining project's life cycle, namely: data understanding, data preparation, and modeling.

Figure 2. Phases of CRISP-DM reference model (Chapman et al., 1999)
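For orientation, the following sketch lists the six CRISP-DM phases and one possible drill-down through the four levels of abstraction. The phase names are standard CRISP-DM; the nested task example is ours and purely illustrative, not taken from the reference model.

```python
crisp_dm_phases = [
    "business understanding", "data understanding", "data preparation",
    "modeling", "evaluation", "deployment",
]

# phase -> generic task -> specialized task -> process instance
example_hierarchy = {
    "data preparation": {                        # phase
        "clean data": {                          # generic task
            "handle missing values": [           # specialized task
                "replace missing ages by the median in the patients table",  # process instance
            ],
        },
    },
}
```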