Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Скачиваний:
26
Добавлен:
07.02.2016
Размер:
8.32 Mб
Скачать

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

Let t Lcorpus to be added to LCOnto ,

If main_term (t) LCOnto and remaining(t)

LConto thent LCOntowithF(t)=candc« subclass of » F(tmain_term(t))

(R10)

Case stUdy: BUilding and Using an OntOlOgy in the astrOnOmiCal field

We have applied the proposed method to the transformation of a thesaurus into a lightweight ontologyinthefieldofastronomy.TheIAUthesaurus

(http://www.mso.anu.edu.au/library/thesaurus/) was built to standardize the terminology of astronomy. Its use was to help librarians index and retrieve catalogues and scientific publications of the domain. Its building, requested by the InternationalUnionofAstronomyin1984,wasfinished in 1995. Its transformation into an ontology has been done in the framework of the French project DataMassinAstronomy(http://cdsweb.u-strasbg. fr/MDA/mda.html). We present in Section 6.1, the results obtained by applying the different rules presented in the article. In Section 6.2, we give an example of how the constructed ontology has been integrated to a science-monitoring task.

evaluation of the proposed method

Two corpuses of the domain have been considered. The documents they contain are abstracts of articles published in the international journal Astronomy and Astrophysics (www.edpsciences. org/aa). The first corpus is composed of articles published in 1995. The use of this corpus aims at capturing the implicit knowledge of the domain not stated in the thesaurus when it was built. The second corpus contains articles published in 2002. This corpus has been chosen to enable the update of the domain knowledge from recent documents. Domain experts have validated that both corpuses

describe the knowledge to be represented in the ontology.

Theprotocoldefinedtoevaluatethetransformation relies on presenting the results obtained by the different rules to two astronomers who accept or reject the propositions.

We present here the most significant results. The evaluation of our method in the field of astronomy has demonstrated its interest. The aim of our method is not to be automatic but to facilitate the creation of ontologies. Note that for eachstepthetaskofexpertsistoconfirmorreject propositions and not to give results. For this, our method is efficient. The identification of domain concepts and their labels is highly relevant (85% of the concepts are validated). It leads to the definition of generic concepts that are mostly validated and facilitates their integration in the ontology. Associative relations extracted from the thesaurus are validated for 84%. Thanks to the mining of the reference corpus, relevant semantic relations between existing concepts are added and new terms integrated by means of the creation of

new concepts or new relations.

Using Ontologies for a science monitoring task

Ontologies are used for semantically indexing documents. This relies on the intuition that the meaningoftextualinformation(andthewordsthat composethedocument)dependsontheconceptual relations between the objects to which they refer ratherthanonthelinguisticandstatisticalrelations of their content (Haav & Lubi, 2001).

Semantic indexing is done in two steps: identifying document concepts within documents and weighting those concepts. The weighting process uses the relations between concepts identified in the documents according to the structure of the ontology.Formoreprecisedetails,see(Hernandez et al., 2005).

Based on this semantic indexing, we propose to take the ontology into account to provide the

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

user access to the information contained in a corpus. The interface is developed to support OWL ontology visualization. A snapshot is presented in Figure 8. With the interface, it is possible to visualizeboththedomainontologylinkedtothe

content of documents and the specific metadata the user is looking for.

The user explores the corpus according to the specific theme of the domain (in our case astronomy).

Figure 8. Visualization through the astronomy ontology

Figure 9. Visualization of the knowledge learnt for an article

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

This possibility is illustrated in Figure 9. On the right hand side of the screen, the domain ontology is shown with its concepts and relations. The concepts present in the corpus are highlighted; they can be interpreted in their context as their relations to other concepts are represented. By clickingonaconcept,theusercanfindthearticles treating the theme and the researchers working on this theme. When a researcher working on the theme is selected, the windows are automatically updated and the information on this researcher is presented. In the same way, when the user is interested in accessing the article dealing with this subject, he selects it. The windows are updated: a pop-up window containing the article appears and the information linked to this article is represented as in Figure 8. The information known about this article is presented on the right hand side of the screen and all the themes of the domain treated in the article are presented in the astronomy ontology frame. The astronomy ontology concepts found in the document are presented in red on the left-hand side of the window (the article of reference 124 deals with comet and solar system). This makes it possible to evaluate rapidly if the document treats the themes in which the user is interested, who the authors of the article are, when it was published and where.

COnClUsiOn and fUtUre wOrks in the dOmain

The representation of textual contents is an issue to be addressed if systems are to handle documents efficiently. The assumption according to which documents contain all the information necessary to their representation is no longer enough. However, a more semantic indexing implies the use of terminological resources or domain knowledge. In this chapter, we have proposed a methodology that produces this type of linguistic resource in a formal way, based on ontologies. This methodology applies when a

thesaurus and a textual corpus on the domain are available. The process is based on the following main steps: extraction of initial knowledge from the thesaurus itself (basically terms, concepts and taxonomic links) and enrichment of the resulting ontology based on text mining (disambiguation of existing relations between concepts, extraction of new terms, and new relations from texts). We, thus, define an incremental process that can be applied to up-date existing ontologies as well. This process is automated as much as possible and the work of experts is limited, consisting only in validating the results (generic concepts, relations labels). Indeed, our methodology and its implementation is less demanding with regard to experts’ time (Soergel et al., 2004) (Wielinga et al, 2001). An important result is that relations betweenconceptsaresemiautomaticallyextracted and a label is associated to them.

This work is a contribution to the Semantic Web, which structures the web in a way that it is meaningful not only to computers but also to humans. (Lu et al., 2002) consider that one of the main challenges of developing the Semantic Web is through ontologies. To meet this challenge new methods and tools are needed in order to facilitate this process.

fUtUre researCh direCtiOns

Mining for information on the Semantic Web refers to various points of views (Corby et al. 2004): that of designing ontologies which aim at representingafieldofknowledge,thatofsemantically indexing resources according to these ontologies, finally,thatofprovidingaccesstoresourcesfora user or an engine. The work we present focuses on the first point of view.

Considering the same point of view, one of the challenges is the evolution of domains and related needs. Domain knowledge evolves with time. Now that methods have been defined to help ontology building, research should take into

0

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

consideration how ontology can evolve. It implies populating existing ontologies with new instances that are extracted from recent domain resources, restructuring ontologies when new concepts are identified or when knowledge become obsolete. Ontologyevolutionisalsolinkedtotheknowledge needed by information systems to better interact with users. When applications are modified or when user’s expertise of the application or the fieldchanges,theontologyhastoberestructured according to what is really useful. The PASCAL

OntologyLearningChallengeaimsatincouraging and evaluating work in such directions.

Semantic Web research incorporates a number of technologies and methodologies defined in different fields including artificial intelligence, ontology engineering, human language technology, machine learning, web services, databases, distributedsystems,softwareengineering,human computer interaction and information systems.

One of the challenges is to make these communities work together in order to integrate the advances in the different fields. Moreover, many application domains have been investigated (e- business, e-health, e-government, multimedia, e-learning, bio-pharmacy) but scalable applications is still a challenge when considering the 11.5 billion pages of indexable Web.

Another challenge is that ontologies are yet build in order to state the consensual knowledge of the domain. In order to be applied on large scale document repositories where many groups of users interact, processes should be developed toadddescriptionstoontologiesinordertodefine in which context and for whom the ontology is useful. This last point is part of the aims of the pragmatic web which is envisioned to be the new evolution of the Semantic Web.

referenCes

Aussenac-Gilles, N., Biébow, B., & Szulman, S.

(2000). Revisiting ontology design: A method

based on corpus analysis. In R. Dieng & O.

Corby (Eds.), Proceedings of the 12th European Knowledge Acquisition Workshop (EKAW’00)

(pp. 172-188).

Aussenac-Gilles, N., & Mothe, J. (2004). Ontologies as background knowledge to explore document collections. In Proceedings of RIAO (pp. 129-142)

Bozsak,E.,Ehrig,M.,Handschuh,S.,Hotho,A., Maedche, A., Motik, B., et al. (2002). KAON:

Towards a large scale Semantic Web. In Proceedings of the 3rd International Conference on E-Commerce and Web Technologies (Vol. 2455, pp. 304-313).

Bourigault, D., & Fabre, C. (2000). Approche linguistiquepourl’analysesyntaxiquedecorpus,

Cahiers de Grammaire, 25, Université Toulouse le Mirail (pp. 131-151).

Chaumier, J. (1988). Letraitementlinguistiquede l’information. Entreprise moderne d’éd.

Crouch, C.J., & Yang, B. (1992). Experiments in automatic statistical thesaurus construction. Conference on Research and Development in Information Retrieval (SIGIR)

(pp. 77-88).

Ding,Y.,&Foo,S.(2002). Ontologyresearchand development:Part1—Areviewofontologygen- eration. Journal of Information Science, 28(2).

Englmeier, K., & Mothe, J. (2003). IRAIA: A portal technology with a semantic layer coordinating multimedia retrieval and cross-owner content building. In Proceedings of the International Conference on Cross Media Service Delivery, Cross-Media Service Delivery Series. The InternationalSeriesinEngineeringandComputer Science (Vol. 740, pp. 181-192).

Fensel, S.B. (1998). Knowledge engineering: Principles and methods. Data and Knowledge Engineering, 25, 161-197.

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

Fischer, D.H. (1998). From thesauri towards ontologies? In W.M. Hadi, J. Maniez, & S. Pollitt

(Eds.), Structures and Relations in Knowledge Organization: Proceedings of the 5th International ISKO Conference (pp. 18-30). Würzburg:

Ergon.

Foskett, D.J. (1980). Thesaurus. In A. Kent &

H. Lancour (Eds), Encyclopedia of library and information science (pp. 416-463).

Gal, A., Modica, G., & Jamil, H.M. (2004). OntoBuilder: Fully automatic extraction and consolidation of ontologies from Web sources. In

Proceedingsofthe20th InternationalConference on Data Engineering. IEEE Computer Society.

Gangemi,A.,Guarino,N.,Masolo,C.,Oltramari, A., & Schneider, L. (2002). Sweetening ontologies with DOLCE. In Proceedings of the InternationalConferenceonKnowledgeEngineering and Knowledge Management (pp. 166-181).

Gómez-Pérez, A., Fernandez, M., & de Vicente,

A.J. (1996). Towards a method to conceptualize domain ontologies. In Proceedings of the European Conference on Artificial Intelligence (ECAI’96) (pp. 41-52).

Grefenstette, G. (1992). Use of syntactic context to produce term association lists for retrieval.In

Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR)

(pp. 89-97).

Guarino, N., Carrara, M., & Giaretta, P. (1994). Formalizing ontological commitments. In Proceedings of the AAAI Conference.

Haav, H.M., & Lubi, T.L. (2001). A survey of concept-based information retrieval tools on the web. In Proceedings of the 5th East-European Conference ADBIS (Vol. 2, pp. 29-41).

Hahn, U., & Schulz, S. (2004) Building a very large ontology from medical thesauri. In S. Staab

&R. Stuber (Eds.), Handbookonontologies(pp. 133-150).

Harman,D.(1992).TheDARPATIPSTERproject.

SIGIR Forum, 26(2), 26-28.

Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics.

Hernandez, N., Mothe, J., & Poulain, S. (2005). Accessing and mining scientific domains using ontologies:TheOntoExploSystem. InProceedingsofthe28th AnnualInternationalACMSIGIR

(pp. 607-608).

Lassila, O., & McGuiness, D. (2001). The role of frame-based representation on the Semantic Web. Rapport technique KSL-01-02, Knowledge Systems Laboratory, Stanford University.

Lu, S., Dong, M., & Fotouhi, F. (2002). The Semantic Web: Opportunities and challenges for next-generation Web applications. Retrievedfrom http://informationr.net/ir/7-4/paper134.html

Maedche,A.,&Staab,S.(2001).Ontologylearning for the Semantic Web. IEEE Intelligent Systems, Special Issue on the Semantic Web, 16(2).

McGuinness, D.L., van Harmelen F. (2004). OWL Web ontology language overview, W3C Recommendation. Retrieved February 10, 2004, from http://www.w3.org/TR/owl-features/

Miles, A., & Brichley, D. (2005). SKOS Core

GuideW3C Working Draft. Retrieved May 10, 2005, from http://www.w3.org/TR/swbp-skos- core-guide/

Mills, D. (2006). Semanticwaves2006. Retrieved from http://www.ift.ulaval.ca/~kone/Cours/WS/ WS-PlanCours2006.pdf

Miller, G.A. (1988). Nouns in WordNet. In C. Fellbaum (Ed.), WordNet. An electronic lexical database (pp. 23-46). MIT Press.

Porter,M.(1980).Analgorithmforsuffix.Stripping Program, 14(3), 130-137.

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

Qiu, Y., & Frei, H.P. (1993). Concept based query expension. Conference on Research and Development in Information Retrieval (SIGIR),

(pp. 160-169).

Robertson, S., Sparck, E., & Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Sciences, 27(3), 129-146.

Salton, G. (1971). The smart retrieval system. Englewood Cliffs, NJ: Prentice Hall.

Smith, M.K., Welty, C., & McGuinness, D.L. (2004).OWLWebontologylanguageguide.W3C Recommendation. Retrieved February 10, 2004, from http://www.w3.org/TR/owl-guide/

Soergel, D., Lauser, B., Liang, A., Fisseha, F.,

Keizer, J., & Katz, S. (2004). Reengineering thesauri for new applications: The AGROVOC example, Journal of Digital Information, 4(4).

Tudhope, D., Alani, H., & Jones, C. (2001).

Augmenting thesaurus relationships: Possibilities for retrieval. Journal of Digital Information, 1-8(41).

Uschold,M.,&King,M.(1995).Towardsamethodology for building ontologies. In Proceedings of the Workshop on Basic Ontological Issues in KnowledgeSharingattheInternationalJointConference on Artificial Intelligence (IJCAI1995).

Wielinga, B., Schreiber, G., Wielemaker, J., &

Sandber, J.A.C. (2001). From thesaurus to ontology. In ProceedingsoftheInternationalConference on Knowledge Capture.

additiOnal reading

BrewsterC.,AlaniH.,DasmahapatraS.,&Wilks

Y. (2004). Data driven ontology evaluation. In

Proceedings of 4th International Conference on Language Resources and Evaluation.

Chebotko A., Lu S., & Fotouhi F. (2006). Challenges for information systems towards the SemanticWeb. SIGSEMIS Journal. Retrieved from http://www.sigsemis.org/?p=9

Chrisment C., Dousset B., & Dkaki T. (2007). Combining mining and visualization tools to discover the geographic structure of a domain.

Computers Environment and Urban Systems Journal, to be published in 2007.

Corby, O., Dieng-Kuntz, R., & Faron-Zucker, C.

(2004). Querying the Semantic Web with corese search engine. In Proceedings of the European conference on artificial intelligence (pp. 705709).

DesmontilsE.,&JaquinC.(2002).IndexingaWeb site with a terminology oriented ontology. In I.

Cruz,S.Decker,J.Euzenat,&D.L.McGuinness

(Eds.), TheemergingSemanticWeb(pp. 181-197).

IOS Press.

Englmeier, K., & Mothe, J. (2003). IRAIA: A portal technology with a semantic layer coordinating multimedia retrieval and cross-owner content building. In Proceedings of the International Conference on Cross Media Service Delivery, Cross-MediaServiceDeliverySeries.InProceed- ings of the International Series in Engineering and Computer Science, 740 (pp. 181-192).

Guha, R.V., McCool, R., & Miller, E. (2003).

Semantic search. In Proceedings of the 12th International World Wide Web Conference (pp. 700-709).

Hernandez,N.,&Mothe,J.(2004).Anapproachto evaluate existing ontologies for indexing a document corpus. In Proceedings of the Eleventh InternationalConferenceonArtificialIntelligence: Methodology, Systems, Applications (AIMSA) Semantic Web Challenges (pp. 11-21).

Hernandez,N.,Mothe,J.,Chrisment,C.,&Egret,

D. (2007). Modeling context through domain ontologies. Journal of Information Retrieval, to be published in 2007.

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

Jarvelin, K., & Ingwersen, P. (2004) Information seeking research needs extensions towards tasks and technology. Information Retrieval, 10(1), 212.

Kiryakov, A., Popov, B., Terziev, I., Manov, D., & Ognyanoff, D. (2004). Semantic annotation, indexing, and retrieval. Journal of Web Semantics, 2(1).

Lozano-Tello, A., & Gómez-Pérez, A. (2004). ONTOMETRIC:Amethodtochoosetheappropriate ontology. Journal of Database Management, 15(2).

Mothe,J.,Chrisment,C.,Dousset,B.,&Alaux,J.

(2003). DocCube: Multidimensional visualisation and exploration of large document sets. Journal of the American Society for Information Science and Technology, 54(2), 650-659.

Shaban-Nejad, A., Baker, C. J. O., Haarslev, V., & Butler, G. (2005). The FungalWeb ontology: Semantic Web challenges in bioinformatics and Genomics Semantic Web challenges.

Studer, R., Benjamins, R., & Fensel, D. (1998).

Knowledge engineering: Principles and methods, data and knowledge engineering, 25(1-2) 161-197.

Uschold M. (2003). Where are the semantics in the Semantic Web? AI Magazine, 24(3), 25-36.

Vallet, D., Fernández, M., & Castells, P. (2005).

An ontology-based information retrieval model. In Proceedingsofthe2nd EuropeanSemanticWeb Conference (pp. 455-470).

Chapter VIII

Evaluating the Construction of

Domain Ontologies for

Recommender Systems

Based on Texts

Stanley Loh

Catholic University of Pelotas and Lutheran University of Brazil, Brazil

Daniel Lichtnow

Catholic University of Pelotas, Brazil

Thyago Borges

Catholic University of Pelotas, Brazil

Gustavo Piltcher

Catholic University of Pelotas, Brazil

aBstraCt

This chapter investigates different aspects in the construction of a domain ontology to a content-based recommender system. The recommender systems suggests textual electronic documents from a digital library,basedondocumentsreadbytheusersandbasedontextualmessagespostedinelectronicdiscussionsthroughaWebchat.Thedomainontologyisusedtorepresenttheuser’sinterestandthecontentof thedocuments.In this context, theontologyis composed by a hierarchyof conceptsand keywords. Each concept has a vector of keywords with weights associated. Keywords are used to identify the content of the texts (documents and messages), through the application of text mining techniques. The chapter discusses different approaches for constructing the domain ontology, including the use of text mining software tools for supervised learning, the interference of domain experts in the engineering process and the use of a normalization step.

Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts

intrOdUCtiOn

Recommender systems are an emerging technology recently used by marketing departments and vendors, that utilizes data mining and machine learning techniques to analyze behavior and preferences of users or customers and that suggests new offers.

A content-basedrecommendersystemutilizes the content of items to generate recommendations to users. Items may be products, services, objects, Websites, people, etc. According to Burke (2002), content-based recommendation is an evolution of information filtering research. In the con- tent-based approach, the recommender system learns a profile of the user’s interests based on the features present in objects associated to the user and recommends other items with similar features. When the content of an item matches the interest of a user, this item is recommended to that user.

Some works have reported improvements in retrieval tasks using ontologies. Gauch et al., (2003) argued that “one increasingly popular way to structure information is through the use of ontologies, or graphs of concepts.” Labrou and Finin (1999) use categories from Yahoo! as an ontology to describe content and features of documents. Middleton et al., (2003) associate concepts from an ontology to users’ profiles and to documents.

In a recommender system, an ontology may be used to identify and represent the content of the items and the interest (profile) of the users.

For example, supermarkets can use ontologies to classify products in sections and brands and the interest of the user may be composed by sections and brands of the items bought by the user. Then a recommender system may suggest to the user other items with the same brand or within the same section.

In this work, we consider an ontology as a formalandexplicitdefinitionofconcepts(classesor categories) and their attributes and relations (Noy

& McGuinness, 2003). A domain ontology is a descriptionof“things”thatexistorcanexistina domain (Sowa, 2002) and contains the vocabulary related to the domain (Guarino, 1998).

When dealing with items that have textual descriptions, the ontology may manage keywords that are used to represent the content of the items oruser’sprofile.Textualdescriptionsmayappear as product information (i.e., marketing slogans, technical specifications) or as content of the item (i.e., textual documents, Websites, e-mail messages).

The process of constructing such a domain ontology is important to the recommender system generatebetterrecommendations.Therearemany approachesforconstructingontologies.Thischapter deals with the special case of creating domain ontologies based on keywords or textual characteristics.Ontologicalengineeringtechniquesmust consider the textual content of the items. For this purpose, text mining techniques may be used in a supervised learning process to automate part of theconstructionprocess,minimizingthecharge on the engineer.

The chapter presents the approaches based on a real recommender system that suggests electronic documents from a digital library based on documents read by users and based on textual messages sent during electronic discussions in a Web chat.

general eXplanatiOn Of the reCOmmender system

This section presents the recommender system used as case study. The goal of the system is to provide people with useful information during a collaboration session. To do that, the system analyzes textual messages sent by users when interactinginaprivateWebchat,identifiestopics

(subjects/themes/concepts) inside the messages and recommends items cataloged in a private digital library, previously classified in the same

Evaluating the Construction of Domain Ontologies for Recommender Systems Based on Texts

topics. Figure 1 presents the architecture of the system with its main components. The digital library contains electronic documents, Web links and bibliographic references.

According to the classification of Terveen and Hill (2001), the system is a content-based recommender system because the context of the messages is matched against the content of items in the database.

The chat works like traditional chats over the Web. The difference is that it is specially constructed for this system and it is not open to nonregistered users. Thus, users have to be authorized for using the system. There is no limit for the number of persons interacting at the same time. At the moment, only one chat channel is allowed. Thus a discussion session concerns all messages sent during a day. In the future, this restriction will be eliminated.

The text mining module analyzes each message posted to the chat (like a sniffer). The words presentinthemessagearecomparedagainstterms present in a domain ontology. Generic terms like

prepositions (called stopwords) will be discarded. Each message is compared online against all concepts in the ontology.

The text mining method employed in this work

(a kind of classification task) was first presented in Loh et al.,(2000). Instead of using natural language processing (NLP) to analyze syntax and semantics, the method is based on probabilistic techniques: themes can be identified by cues.

Using a fuzzy reasoning about the cues found in a text, it is possible to calculate the likelihood of a theme or subject being present in that text. The algorithm is based on Bayesian algorithms, since it uses a prototype-like vector to represent texts and concepts and the probability theory to treat uncertainty.Themethodevaluatestherelationship between a text and a concept of the ontology using a similarity function that calculates the distance between the two vectors. The vectors representing texts and concepts are composed by a list of terms with a weight associated to each term. In the case of texts, the weight represents the relative frequency of the term in the text (number of

Figure 1. Architecture of the system

Chat - message exchange

Text Mining

 

Recommender

Module

Context

Module

Ontology

 

Database