Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
Скачиваний:
26
Добавлен:
07.02.2016
Размер:
8.32 Mб
Скачать

cepts chains, residues, and atoms that are already defined.

logic rules in pO

ThekeytoscalabilityofPOconceptualmodelis the systematic and effective composition of data and information. In this section, we present PO ontology algebra that allows composition of multiple levels of information stored in the ontology for information retrieval. By retaining a log of composition process, we can also, with minimal adaptations, replay the composition whenever any oftheunderlyingdatasourcesthatPOintegrates change. The algebra has one unary operator: Select, and three binary operations: Intersection,

Union and Difference.

select Operator

The Select operator allows us to highlight and selectportionsofthePOthatarerelevanttoquery at hand. Given the PO structure and a concept to be selected, the select operator selects the sub tree rooted at that concept. Given the PO structure and a set of concepts, the select operator selects only those edges in the PO that connect nodes in a given set. Select operator is defined as:

OS = σ (NS, ES, RS) where

NS = Nodes (condition = true) and

ES = Edges ( N NS)

intersection Operation

Intersection is the most important and interesting binary operation. Let O1 = (N1, E1, R1), and O2 = (N2, E2, R2) be the two parts of PO whose composition will provide answer to the query submitted by the user. Here N is the set of nodes orconceptsofPO,EisthesetofedgesorthePO hierarchy, and R is set of semantic relationships.

Data Integration Through Protein Ontology

The intersection of two parts of PO with respect to semantic relationships defined for PO (SR in

Section 3.4) is: OI (1, 2) = O1 ∩SR O2 = (NI, EI, RI), where:

NI = Nodes (SR (O1, O2)),

EI=Edges(E1,NI∩N1)+Edges(E2,NI∩N2) + Edges (SR (O1, O2)), and

RI = Relationships (O1, NI ∩ N1) + Relationships (O2, NI ∩ N2) + SR (O1, O2) – Edges (SR (O1, O2)).

Note that SR is different from R as it does not include sequences. The nodes in the intersection ontology are those nodes that appear in the semantic relationships, SR. The edges in the intersection ontology are the edges among nodes that are either present in the source parts of the ontology or have been established as a semantic relationship, SR. Relationships in the intersection ontology are the relationships that have not been already been modeled as edges and those relationships present in source parts of the ontology that use only concepts that occur in intersection ontology.

Union Operation

The union of two parts of PO, O1 = (N1, E1, R1), and O2 = (N2, E2, R2) with respect to semantic relationships (SR) of PO (SR in Section 4.2) is expressed as:

OI (1, 2) = O1 SR O2 = (NU, EU, RU), where, NU = N1 N2 NI (1, 2),

EU = E1 E2 EI (1, 2), and RU = R1 R2 RI (1, 2), where,

OI (1, 2) = O1 ∩SR O2 = (NI (1, 2), EI (1, 2), RI (1, 2)) is the intersection of two ontologies.

The union operation combines two parts of the ontology retaining only one copy of the concepts in the intersection.

Data Integration Through Protein Ontology

difference Operation

The difference of two parts of PO, O1 and O2, written as O1 – O2, includes portions of the first part that are not common to the second part. The difference can be rewritten as O1 – (O1 ∩SR O2). The nodes, edges, and relationships that are not in intersection but are present in the first part comprise the difference. One of the objectives of computing the difference is to optimize the maintenance of PO. As the PO database is huge and so many people data to it, difference will suggest that data is not entered properly or there is change in underlying data sources that PO integrates. Change suggested by difference is forwarded to the administrator. If the change happenstobeindifferencebetweenstructuresofparts considered, then it does not occur in intersection and is not related to any semantic relationships that establish bridged between the parts of the ontology. Therefore semantic relationships do not need to be changed. If the changes is because of changeshappeningtounderlyingdatasourcesthat

PO integrates, then set of concepts and semantic relationships need to be checked for any changes required to remove the difference.

COnClUsiOn

In this chapter we presented an overview of common conceptual representation of PO that is needed to resolve the heterogeneity among protein data sources to enable meaningful data exchange or interoperation among them. Next we discussed semantic relationships in protein data that interprets semantics that is normally not interpreted by annotating systems, since they are not aware of the specific structural, chemical, and cellular interactions of protein complexes. We also discussed sequence relationships that impose order among the children of the nodes. We also covered

PO ontology algebra that allows composition of multiple levels of information stored in the ontol-

ogy for information retrieval. The PO approach supports precise composition of information from multiple diverse sources providing semantic relationships between among such sources. This approach allows reliable exploitation of protein information sources without any imposition on sourcesthemselves.POalgebrabasedonsemantic relationships allows systematic composition, which unlike integration is more scalable.

referenCes

Achard,F.,&Barillot,E.(1997).UbiquitousdistributedobjectswithCORBA.PacificSymposium Biocomputing. London, World Scientific.

Achard, F., & Dessen, P. (1998). GenXref VI:

Automaticgenerationoflinksbetweentwoheterogeneous databases. Bioinformatics, 14, 20-24.

Baxevanis, A. (2002). The molecular biology data collection: 2002 update. NucleicAcidsResearch, 30, 1-12.

Ben-natan, R. (1995). CORBA. New York: McGraw Hill.

Benson,D.,Karsch-mizrachi,I.,Lipman,D.,Os- tell,J.,Rapp,B.,&Wheeler,D.(2000).GenBank.

Nucleic Acids Research, 28, 8-15.

Bernstein, F.C., Koetzle, T.F., Williams, G.J.,

Meyer, E.F., Brice, M.D., Rodgers, J.R., et al. (1977). The protein data bank: A computer-based archivalfileformacromolecularstructures.Journal of Molecular Biology, 112, 535-542.

Collins,F.S.,Morgan,M.,&Patrinos,A.(2003).

The human genome project: Lessons from largescale biology. Science, 300, 286-290.

Frazier, M.E., Johnson, G.M., Thomassen, D.G., Oliver,C.E.,&Patrinos,A.(2003a).Realizingthe potential of genome revolution: The genomes to life program. Science, 300, 290-293.

Frazier, M.E., Thomassen, D.G., Patrinos, A., Johnson, G.M., Oliver, C. E., & Uberbacher, E.

(2003b). Setting up the pace of discovery: The genomes to life program. In Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003). Stanford, CA: IEEE CS Press.

Galperin, M.Y. (2006). The molecular biology database collection: 2006 update. Nucleic Acids Research, 34, D3-D5.

Garavelli, J.S. (2003). The RESID database of protein modifications: 2003 developments. Nucleic Acids Research, 31, 499-501.

George,D.G.,Mewes,H-W.,&Kihara,H.(1987). Astandardizedformatforsequencedataexchange. protein Seq. Data Anal., 1, 27-29.

George, D. G., Orcutt, B.C., Mewes, H.-W., &

Tsugita, A. (1993). An object-oriented sequence database definition language (sddl). protein Seq. Data Anal., 5, 357-399.

Gouy, M., Gautier, C., Attimonelli, M., Lanave,

C., & Di Paola, G. (1985). ACNUC: A portable retrieval system for nucleic acid sequence databases: Logical and physical designs and usage.

Computer Applications in the Biosciences, 1, 167-172.

Gyssens,M.,Paredaens,P.&Gucht,D.(1990).A graph-orientedobjectdatabasemodel.InProceed- ings of the 9th ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. Nashville, TN: ACM Press.

Hadzic,F.,Dillon,T.S.,Sidhu,A.S.,Chang,E.,&

Tan, H. (2006). Mining substructures in protein data. 2006 IEEE Workshop on Data Mining in Bioinformatics(DMB2006)inconjunctionwith6th

IEEE ICDM 2006. Hong Kong:IEEE CS Press.

Hafner, C.D. & Fridman, N. (1996). Ontological foundations for biology knowledge models. In

Proceedings of the 4th International Conference

Data Integration Through Protein Ontology

onIntelligentSystemsforMolecularBiology. St. Louis: AAAI.

Han, J., & Kamber, M. (2001). Data mining: Conceptsandtechniques. San Francisco: Morgan Kaufmann.

Jagoto,A.(2000). Dataanalysisandclassification for bioinformatics. CA: Bay Press.

Karp, P.D. (1996). A strategy for database interoperation. Journal of Computational Biology, 2, 573-583.

Koonin, E.V., & Galperin, M.Y. (1997). Prokaryotic genomes: the emerging paradigm of genomebased microbiology. CurrentOpinionsinGenetic Development, 7, 757-763.

Kuonen, D. (2003). Challenges in bioinformatics for statistical data miners. Bulletin of Swiss Statistical Society, 46, 10-17.

Liu, L., Buttler, D.T.C., Han, W., Paques, H.,

Pu, C., & Rocco, D. (2003). BioSeek: Exploiting source-capability information for integrated access to multiple bioinformatics data sources. In Proceedings of the 3rd IEEE Symposium on BioinformaticsandBioengineering(BIBE2003).

Bethesda, MD: IEEE CS Press.

Maidak,B.L.,Olsen,G.J.,Larsen,N.,Overbeek, R.,Mccaughey,M.J.,&Woese,C.R.(1996).The ribosomal database project (RDP). NucleicAcids Research, 24, 82-85.

Maxam,A.M.,&Gilbert,W.(1977).Anewmethod for sequencing DNA. In ProceedingsofNational Academic of Science, 74 (pp. 560-564).

Mckusick, V.A. (1998). Mendelian inheritance in man: A catalog of human genes and genetic disorders. Baltimore: Johns Hopkins University Press.

Miyazaki,S.,Sugawara,H.,Gojobori,T.,&Tateno, Y. (2003). DNA Databank of Japan (DDBJ).

Nucleic Acids Research, 31, 13-16.

0

Data Integration Through Protein Ontology

Murzin, A.G., Brenner, S.E., Hubbard, T., & Chothia,C.(1995).SCOP:Astructuralclassification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536-540.

Narayanan, A., Keedwell, E.C., & Olsson, B. (2002).Artificialintelligencetechniquesforbioinformatics. Applied Bioinformatics, 1, 191-222.

Ng, S.K., & Wong, L. (2004). Accomplishments and challenges in bioinformatics. ITProfessional, 6, 44-50.

Ohkawa, H., Ostell, J., & Bryant, S. (1995). MMDB: An ASN.1 specification for macromolecular structure. In Proceedings of the 3rd

International Conference on Intelligent Systems for Molecular Biology. Cambridge,UK: AAAI.

Ohno-machado, L., Vinterbo, S., & Weber, G. (2002).Classificationofgeneexpressiondatausing fuzzy logic. Journal of Intelligent and Fuzzy Systems, 12, 19-24.

Ostell, J. (1990). GenInfo ASN.1 Syntax: Sequences.NCBI Technical Report Series. National Library of Medicine, NIH.

Pongor, S. (1998) Novel databases for molecular biology. Nature, 332, 24-24.

Rawlings, C.J. (1998). Designing databases for molecular biology. Nature, 334, 447-447.

Ritter, O. (1994). The integrated genomic database. In S. Suhai (Ed.), Computational methods in genome research. New York: Plenum.

Robbins, R.J. (1994). Genome informatics I: community databases. JournalofComputational Biology, 1, 173-190.

Sanger, F., Nicklen, S., & Coulson, A.R. (1977).

DNA sequencing with chain-terminating inhibitors. In ProceedingsofNationalAcademicofScience, 74 (pp. 5463-5467).

Schulze-Kremer, S. (1998). Ontologies for molecularbiology.PacificSymposiumofBiocomputing. In Proceedings of the PSB 1998 Electronic,

Hawaii.

Sidhu, A.S., Dillon, T.S., & Chang, E. (2005). Ontological foundation for protein data models.

In Proceedings of the 1st IFIP WG 2.12 & WG 12.4 International Workshop on Web Semantics (SWWS 2005). In conjunction with On The Move FederatedConferences(OTM2005).Agia Napa, Cyprus: Springer

Sidhu, A.S., Dillon, T.S., & Chang, E. (2006a). protein ontology. In Z. Ma & J.Y. Chen (Eds.),

Database modeling in biology: Practices and challenges. New York: Springer.

Sidhu, A.S., Dillon, T.S., & Chang, E. (2006b).

Advances in protein ontology project. In Proceedingsofthe19th IEEEInternationalSymposiumon Computer-BasedMedicalSystems(CBMS2006). Salt Lake City, UT: IEEE CS Press.

Sidhu,A.S.,Dillon,T.S.,Sidhu,B.S.,&Setiawan, H. (2004a). A unified representation of protein structuredatabases.InM.S.Reddy&S.Khanna

(Eds.), Biotechnologicalapproachesforsustainable development. India: Allied Publishers.

Sidhu,A.S.,Dillon,T.S.,Sidhu,B.S.,&Setiawan, H. (2004b). An XML based Semantic protein map. In A., Zanasi, N.F.F. Ebecken, & C.A.

Brebbia (Eds.), 5th International Conference on Data Mining, Text Mining and their Business Applications(DataMining2004).Malaga, Spain: WIT Press.

Sidhu, A.S., Dillon, T.S., Setiawan, H., & Sidhu,

B.S. (2004c). Comprehensive protein database representation. In A. Gramada & P.E. Bourne

(Eds.), 8th InternationalConferenceonResearch in Computational Molecular Biology 2004 (RECOMB 2004). San Diego, CA: ACM Press.

Stein, L.D., Cartinhour, S., Thierry-mieg, D., &

Thierry-Mieg, J. (1998). JADE: An approach for

interconnecting bioinformatics databases. Gene, 209, 39-43.

Stoesser,G.,Baker,W.,VanDenBroek,A.,Gar- cia-Pastor, M., Kanz, C., & Kulikova, T. (2003).

The EMBL nucleotide sequence database: Major new developments. Nucleic Acids Research, 31, 17-22.

Weissig, H., & Bourne, P.E. (2002). Protein structure resources. BiologicalCrystallography, D58, 908-915.

Wertheim, M. (1995). Call to desegregate microbial databases. Science, 269, 1516.

Data Integration Through Protein Ontology

Wesbrook,J.,Feng,Z.,Jain,S.,Bhat,T.N.,Thanki, N., Ravichandran, V., et al. (2002). The protein data bank: Unifying the archive. Nucleic Acids Research, 30, 245-248.

Wong, L. (2002). Technologies for integrating biological data. Briefings in Bioinformatics, 3, 389-404.

Wong, L. (2000). Kleisli, a functional query system. Journal of Functional Programming, 10, 19-56.

Zaki, M.J. (2005). Efficiently mining frequent trees in a forest: Algorithms and applications.

IEEE Transaction on Knowledge and Data Engineering, 17, 1021-1035.

Chapter VII

TtoO:

Mining a Thesaurus and Texts to Build and Update a Domain Ontology

Josiane Mothe

Institut de Recherche en Informatique de Toulouse Université de Toulouse, France

Nathalie Hernandez

Institut de Recherche en Informatique de Toulouse Université de Toulouse, France

aBstraCt

This chapter introduces a method re-using a thesaurus built for a given domain, in order to create new resources of a higher semantic level in the form of an ontology. Considering ontologies for data-mining tasks relies on the intuition that the meaning of textual information depends on the conceptual relations betweentheobjectstowhichtheyreferratherthanonthelinguisticandstatisticalrelationsoftheircontent. To put forward such advanced mechanisms, the first step is to build the ontologies. The originality of the method is that it is based both on the knowledge extracted from a thesaurus and on the knowledge semiautomatically extracted from a textual corpus. The whole process is semiautomated and experts’ tasks are limited to validating certain steps. In parallel, we have developed mechanisms based on the obtained ontology to accomplish a science monitoring task. An example will be given.

intrOdUCtiOn

Scientificandtechnicaldata,whatevertheirtypes

(textual, factual, formal, or nonformal), constitute strategicminesofinformationfordecisionmakers in economic intelligence and science monitoring

activities, and for researchers and engineers (e.g., forscientificandtechnologicalwatch).However, in front of the growing mass of information, these activities require increasingly powerful systems offering greater possibilities for exploring and representingthecollectedinformationorextracted

Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

knowledge. Upstream, they must ensure search, selectionandfilteringoftheelectronicinformation available.Downstream,whencommunicatingand restituting results, they must privilege ergonomics in presentation, exploration, navigation, and synthesis.

To be manipulated by advanced processes, textualdocumentcontenthasfirsttoberepresented synthetically. This process is known as indexing andconsistsindefiningasetoftermsordescriptors that best correspond to the text content. Descriptors can either be extracted from the document themselves or by considering external resources suchasthesauri.Inthefirstapproach,whichcan be fully automatic and thus more appropriate for large volumes of electronic documents, the texts are analyzed and the most representative terms are extracted (Salton, 1971). These are the terms which constitute the indexing language. The automatic weighting of index terms (Robertson

& Sparck-Jones, 1976), their stemming (Porter,

1981), the automatic query reformulation by relevance feedback (Harman, 1992) or per addition of co-occurring terms (Qiu & Frei, 1993) in information retrieval systems (IRS) are methods associated with automatic indexing and have been effective as attested by international evaluation campaigns, such as text retrieval conference

(trec.nist.gov).Onecommonpointofalltheseapproaches is that they make the assumption that the documents contain all the knowledge necessary fortheirindexing.Ontheotherhand,thesauriare used to control the terminology representing the documents by translating the natural language of the documents into a stricter language (documentary language) (Chaumier, 1988). A thesaurus is based on a hierarchical structuring of one or more domains of knowledge in which terms of one or more natural languages are organized by relations between them, represented with conventionalsigns(AFNOR,1987).WithISO2788and ANSI Z39, their contents can be standardized in terms of equivalence, hierarchical, and associative relations between lexemes. Indexing based

on a thesaurus is generally carried out manually by librarians who, starting from their expertise, choose the terms of the thesaurus constituting the index of each document read. In an IRS, the same thesaurus is then used to restrict the range of a query or, on the contrary, to extend it, according to the needs of the user and the contents of the collection. Other types of systems combine the useofathesauruswithclassificationandnavigation mechanisms. The Cat a Cone system (Hearst, 1997) or IRAIA system (Englmeier & Mothe,

2003) make use of the hierarchical structure of the thesaurus to allow the user to browse within this structure and thus to access the documents associated with the terms. Compared to automatic indexing, the thesaurus approach leads to more semantic indexing, as terms are considered within their context (meaning and related terms). However, using thesauri raises several problems: they are created manually and their construction requires considerable effort; their updating is necessary;theirformatisnotstandardized(ASCII files,HTML,databasescoexist);finally,thesauri have a weak degree of formalization since they are built to be used by domain experts and not by automatic processes. Various solutions to these issues have been proposed in the literature. Automatic thesauri construction can call upon techniques based on term correlation (Tudhope, 2001), document classification (Crouch & Yang,

1992), or natural language processing (Grefenstette, 1992). On the other hand, the standards under development within the framework of the

W3C as SKOS Core (Miles et al., 2005) aim at making the thesaurus migrate towards resources that are more homogeneous and represented in a formalwaybyusingOWLlanguage(McGuiness & Harmelen, 2004) and making these resources available on the Semantic Web. However, thesauri represent a domain in terms of indexing categories and not in terms or meaning. They do not have a level of conceptual abstraction (Soergel et al., 2004), which plays a crucial role in man-machine communication. Ontologies make it possible to

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

reconsider this problem since it is a “formal and explicit specification of a shared conceptualization”(Fensel,1998).Automaticsemanticindexing, starting from concepts rather than terms that are frequently ambiguous, then becomes possible

(Aussenac&Mothe,2004).However,thedevelopment of an ontology is costly as it requires many manual interventions. Indeed, ontology construction techniques in the literature generally do not base the development of ontologies on existing terminologicalrepresentationsofthefieldbuton a reference corpus, which is analyzed (Aussenac et al., 2000).

In this chapter, we propose to present the method we have developed and implemented to re-use a thesaurus built for a given domain, in order to create new resources of a higher semantic level in the form of a domain ontology. The originality of the method is that it is based both on the knowledge extracted from a thesaurus and on the knowledge automatically extracted from a textual corpus. It thus takes advantage of the domain terms stated in the thesaurus. This method includes the incremental population and update of an ontology, by the mining of a new set of documents. The whole process is semiautomatedandexperts’tasksarelimitedtovalidating certain steps. In parallel, we have developed mechanisms based on the obtained ontology to accomplish a science monitoring task. We will

give an example in order to illustrate the added value of using ontologies for data mining. This chapterisorganizedasfollows:Section2presents the main differences between thesauri and ontologies. In Section 3, we review the literature related to ontology building and ontology integration for representing and accessing documents. Section 4 presents our method for building the ontology from the thesaurus. Section 5 explains how we proposetominedocumentstoupdatetheontology. Finally, in Section 6, we present our case study: building a lightweight ontology in astronomy by transforming the IAU thesaurus and developing a prototype enabling a science monitoring task on the domain.

thesaUri and OntOlOgies

The main distinction between a thesaurus and an ontology is their degree of semantic engagement (Lassila & McGuiness, 2001). This degree corresponds to the level of formal specification restricting the interpretation of each concept, thus specifying their semantics.

standardization of thesauri Content

The ISO 2788 and ANSI Z39 (http://www.tech- street.com/cgi-bin/pdf/free/228866/z39-19a.pdf)

Figure 1. Relations between terms in a thesaurus

Broader Term (BT)

1

Use For (UF)

Narrower Term (NT)

n

Related Term (RT)

 

 

 

Preferred term

term

1

1

n

 

 

Use (U)

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

standards have proposed the guiding principles for building a thesaurus. It is a terminological resourceinwhichtermsareorganizedaccording to restricted relations (Foskett, 1980): equivalence (termi useforreferringto termj, insteadoftermj

use” termi), hierarchical term1 (termk broader termof terml, terml narrower term” of termk), and cross reference (termm related termto termn). The relations present in a thesaurus meeting those standards are shown in Figure 1.

From the point of view of knowledge representation,thesaurihavealowdegreeofformalization.

The distinction between a unit of meaning (or concept) and its lexicon is not clearly established.

Synonymy relations are defined between terms, but concepts are not identified. This can be explained by the initial use of thesauri, which was to express how a domain can be understood not in terms of meaning but in terms of terminology and categories to help the manual indexing of domain documents. Moreover, the semantics associated to a thesaurus are limited. Relations are vagueandambiguous.Thesemanticlinksbetween terms often reflect the planned use of the terms rather than their correct semantic relations. The relation “broaderterm”can,forexample,include the relations « is an instance of », « is a part of »

(Fischer, 1998). The associative relation «related term»isoftendifficulttoexploitasitcanconnect terms suggesting different semantics (Tudhope et al., 2001). For example, in the BIT thesaurus

(http://www.ilo.org/public/libdoc/ILO-Thesau- rus/french/tr1740.htm) describing the world of work, the term « family » is related to the term « woman » and « leave of absence », the semantic relation between those pairs of terms is intuitively different.Becauseofthechoicesmadeduringtheir elaboration,thesaurilackbothformalizationand coherence compared to ontologies.

Ontologies

Insteadofrepresentingadomainintermsofindexing categories, the aim of ontologies is to propose a

soundbaseforcommunicationbetweenmachines, but also between machines and humans. They consensuallydefinethemeaningofobjects,firstly through symbols (words or phrases) that designate and characterize them, and secondly through a structured or formal representation of their role in the domain (Aussenac, 2004). Before presenting ontologies through their degree of formalization

(lightweight and heavyweight ontologies), we definethethreeunitsofknowledgeonwhichthey rely: concept, relations, and axioms.

Concepts

A concept represents a material object, a notion, oranidea(Uschold&King,1995).Itiscomposed of three parts: one or several terms, a notion, and a set of objects. The notion corresponds to the concept’s semantics and is defined through its relations to other concepts and its attributes. The set of objects corresponds to the objects defined by the concept and is called concept extension; objectsareconceptinstances.Thetermsdesignate the concepts. They are also called concept labels. For example, the term « chair » references an artifact, piece of furniture on which one can sit and all objects having this definition. In order to identify a concept with no ambiguity, it is recommended to reference a concept with several terms that eliminate the problem of synonymy and to disambiguatethemeaningsoftermsbycomparing them to each other (Gomez-Perez et al., 1996).

Relations Between Concepts

A semantic relation R represents a kind of interaction between domain concepts c1, c2,... cn.

Thenotionof subsumption(alsocalled“isa”or taxonomic relation) is a binary relation implying the following semantic engagement: a concept c1 “subsumes” a concept c2 if all the semantic relationsdefinedforc1arealsosemanticrelationsof c2, in other terms if concept c2 is more specific than concept c1.

TtoO: Mining a Thesaurus and Texts to Build and Update a Domain Ontology

« Associative » relations are interaction relations between two concepts that are not subsumption relations. This corresponds to the notion of role in description logics and makes it possible to characterize concepts. These relations are either relations between concepts or relations between a concept and a data type. The semantics associated to it are referenced by a label. Examples of associative relations are “is a part of,” “is composed of,” “has a color”… They can also be specified with logic properties associated to the relation such as transitivity.

Axioms

Axioms aim at representing in a logic language the description of concepts and their relations. They represent the intention of concepts and relations and the knowledge that does not have a terminologicalcharacter.Theirspecificationinan ontology can have different purposes: define the signature of a relation (domain and range of the relation),definerestrictiononvaluesofarelation, definethelogicpropertyofarelation(transitivity, symmetry…). Through axioms both the consistency of the knowledge stated in the ontology is checked and new knowledge inferred.

Lightweight Ontologies vs.

Heavyweight Ontologies

Lightweight ontologies are composed of two semiotic levels (Maedche & Staab, 2002). The lexical level (L) covers all the terms or labels defined to designate the concepts, enabling their detection within textual contents. The conceptual level defined in the structure (S) of the ontology represents the concepts and semantics defined from the conceptual relations that link them.

The structure of an ontology is a tuple S :={C, R, C, σR } where:

C, R are disjoint sets containing concepts and associative relations

C : C x C is a partial order on C, it defines the hierarchy of concepts

C (c1, c2) meaning that c2 “is a” c1, it is called a taxonomic relation

σR : R C x C is the signature of an associative (or nontaxonomic) relation

The lexicon of a lightweight ontology is a tuple L : {LC,LR,F,G}

LC, LR are disjoint sets containing labels (or terms) referencing concepts and relations

F, G are two relations called reference, they enable access to the concepts and relations designated by a term and vice versa.

F LC for the concepts and GLR for the relations:

°For l LC, F(l)={c / c C}

°For c C, F-1(c)={l / l LC}

°For l LR, G(l)={r/r R}

°For r R, G-1(r)={l / l LR}

Heavyweight ontologies are based on the same structure and lexicon as lightweight ontologies and also include a set of axioms. This kind of ontology considerably restricts the interpretation of concepts, thus limiting ambiguity. However, their construction is extremely time-consum- ing and thus they cover only precisely defined domains. For data representation and data mining, considering lightweight ontologies is the first step in integrating formal knowledge in systems. Lightweight ontologies can play the role formerly played by thesauri by defining an indexing language relying on a more formalized knowledge that can be exploited in the retrieval or mining process. Moreover, as described in the next section, lightweight ontologies are easier to construct.

related wOrk: elaBOrating OntOlOgies

Designingontologiesisadifficulttaskinvolving thedevelopmentofelaborateprocessesthatextract