
Rivero, L. (Ed.). Encyclopedia of Database Technologies and Applications. 2006.

Decision System: Is a tuple (U,A,d), where (U,A) is an information system with the set A of condition attributes and the decision (attribute) d: U→Vd, where d ∉ A. Informally, d is an attribute whose value is given by an external source (oracle, expert), in contrast to the condition attributes in A, whose values are determined by the user of the system.
Functional Dependence: For attribute sets C, D, we say that D depends functionally on C, in symbols C→D, in case IND(C) ⊆ IND(D). Also non-exact (partial) functional dependencies to a degree are considered.
Indiscernibility Relation: Is defined as follows: Objects x, y are indiscernible iff information about x is equal to (similar to) information about y. In the former case the indiscernibility relation is an equivalence relation; in the latter it is a similarity relation. Any object x defines an indiscernibility class (neighborhood) of objects indiscernible with this object. Also soft cases of the indiscernibility relation are considered. The discernibility relation is a binary relation on objects defined as follows: Objects x, y are discernible iff information about x is discernible from information about y. In the simplest case, objects x, y are discernible iff it is not true that they are indiscernible.
Rough Sets
Information About Object: x in a given information system (U,A) is defined by InfA(x) = {(a, a(x)): a ∈ A}.
Information System: Is a pair (U,A) where U is the universe of objects and A is a set of attributes, i.e., functions on U with values in respective value sets Va for a ∈ A.
Lower (Upper) Approximation: Of a set X ⊆ U is the union of all indiscernibility classes contained in X (that intersect X), i.e., it is the greatest exact set contained in X (the smallest exact set containing X). Instead of exact containment, containment to a degree is also used.
Reduct: Is a subset C of B that is minimal with respect to inclusion and preserves a given indiscernibility (discernibility) constraint, e.g., IND(C)=IND(B). Many different kinds of reducts with respect to different discernibility criteria have been investigated and used in searching for relevant patterns in data.
Rough Set: Is a subset (concept) of the universe of objects U in an information system (U,A) that cannot be expressed (defined) as a union of indiscernibility classes; otherwise, the set is called exact.
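As a worked illustration of the terms above (the objects, attributes, and values here are invented), the indiscernibility classes and the lower and upper approximations can be computed directly from their definitions:

```python
from itertools import groupby

def ind_classes(U, A, info):
    """Partition U into indiscernibility classes under attribute set A;
    info[x][a] is the value of attribute a on object x."""
    key = lambda x: tuple(info[x][a] for a in A)
    return [set(g) for _, g in groupby(sorted(U, key=key), key=key)]

def approximations(U, A, info, X):
    """Lower approximation: union of classes contained in X.
    Upper approximation: union of classes that intersect X."""
    classes = ind_classes(U, A, info)
    lower = set().union(*[c for c in classes if c <= X])
    upper = set().union(*[c for c in classes if c & X])
    return lower, upper

# Four objects described by two condition attributes.
info = {
    "x1": {"colour": "red",  "size": "big"},
    "x2": {"colour": "red",  "size": "big"},    # indiscernible from x1
    "x3": {"colour": "blue", "size": "big"},
    "x4": {"colour": "blue", "size": "small"},
}
U, A = set(info), ["colour", "size"]
X = {"x1", "x3"}
lower, upper = approximations(U, A, info, X)
# lower == {"x3"}, upper == {"x1", "x2", "x3"}
```

Since the lower and upper approximations differ, X cannot be expressed as a union of indiscernibility classes, i.e., it is a rough set.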
Security Controls for Database Technology and Applications
Zoltán Kincses
Eötvös Loránd University of Sciences, Hungary
INTRODUCTION
The word security has many meanings, even for programmers and other members of the computer science community, but its sense is often conveyed by frequently used keywords, such as threat, vulnerability, exploit, and risk, and by related expressions such as attack, circumvention, manipulation, sniffing, denial of service, malicious, privacy, and so forth.
Every system is at risk of being attacked, and the risk analysis of the system will discover the severity of these risks, the protection measures, and their efficiency. There may be situations in which the risk analysis concludes that the system is not well protected against an attack, or even that new risks arose during a periodical analysis. The important thing is to implement the analysis and, based on the results, decide how to proceed with the next steps.
Without risk analysis there is no guarantee that all the protections will be sufficient, nor can the owner or maintainer of the system be conscious regarding its security. We can conclude, therefore, that security is conscious risk taking.
WHAT IS INFORMATION SECURITY?
Information security can be defined in different ways, depending on the starting point. The word security has different meanings for society, for the government, for a citizen, for a mother, for a company leader, for a developer, for a user, or even for a hardware object. The specific rules for information security can be learnt from good books and descriptions found on the Internet. The “security way of thinking,” however, can be learnt from far fewer books, publications, and newsletters (see, e.g., Anderson, 2001; Schneier, 1996, 2000, 2003). Beginners should, and professionals may, read these works before editing configuration files, using security applications, and applying security protocols, procedures, and policies to their systems.
Information security is realized through settings and installations, but without the concepts of the security policies built in from the beginning (i.e., in the planning stage of the system), information security has less value and, in many cases, is worthless because of the bad concept. As Schneier explained, “security is a process, not a product.” But he also says that “security is a trade-off”; therefore we must be careful with the security measures we apply.
Security is conscious risk taking; based on this idea, security means that the manager (user, developer, etc.) must consciously manage the risk of the system. Risk can be managed with the cycle of monitoring (MO), planning and organization (PO), acquisition and implementation (AI), and delivery and support (DS) processes defined by the COBIT framework (Information Systems Audit and Control Association [ISACA], 2003). Within this cycle the control measures can be placed, which will help us find possible security problems.
The relations among the security parameters (technical and managerial) and their effect on each other is shown in Figure 1.
MAIN THRUST OF THE CHAPTER
Controls will help us find possible security problems in a system. The best controls are preventive, followed by detective controls. At the least, corrective controls must be provided, because business continuity is served in that way (Stoneburner, Goguen, & Feringa, 2002).
Additional controls exist, such as administrative controls, but this article will detail only the three previously introduced. Administrative controls have a strong relation to the quality assurance system (i.e., the ISO 9001 series), because the existence, the content, and the form of policies, documentation, and instructions are regulated by that standard. After introducing such a system, it is easier to complete it with the security-related parts.
Samples and Rules
A sample of controls, and their types, is picked from different areas of everyday life, and includes locks (preventive control), house alarms (detective control), and
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Figure 1. Security map. Note. From IT Security: Hackers, Crackers, and Thugs, by NCR, 2002. Available online from http://www.ncr.com/repository/articles/ pdf/wcs_security.pdf
insurance (corrective control). Samples from IT systems include firewalls (preventive control), intrusion detection systems (detective controls), and backup systems (corrective control).
Which control or controls can a given security tool or defence technique resolve, and which security tool or defence technique implements a given control or controls? The answers to these two questions must produce the same result at the end of the system’s security planning. If we consider the two sets of controls (C) and tools (T), then the C→T relation must fulfil the following rules:
1. ∀t ∈ T, ∃c ∈ C: C(c)→T(t), and
2. ∀c ∈ C, ∃t ∈ T: C(c)→T(t)
The first rule means that the set of tools contains no unused tools, and the second rule means that every control in the set of controls is supported by at least one tool. In many-to-many relations, the segregation of duties must be applied, which means the avoidance of overlapping: it is not very efficient to use two virus scanners on one computer, or to think that every control can be supported by one universal tool. Of course, well-considered cases can be suitable (e.g., one firewall can stop e-mails that contain viruses or can have intrusion-detection features).
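These two rules are easy to check mechanically once the control-to-tool assignment is drafted. The sketch below (the tools and the mapping are invented for illustration) verifies that no tool is unused and that every control is supported:

```python
# tool -> set of controls it implements (hypothetical assignment)
supports = {
    "firewall":      {"preventive"},
    "ids":           {"detective"},
    "backup_system": {"corrective"},
    "virus_scanner": {"preventive", "detective"},
}
controls = {"preventive", "detective", "corrective"}

# Rule 1: every tool in T supports at least one control in C.
rule1 = all(cs & controls for cs in supports.values())

# Rule 2: every control in C is supported by at least one tool in T.
rule2 = all(any(c in cs for cs in supports.values()) for c in controls)

# rule1 and rule2 both True: the assignment is complete and has no dead weight.
```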
Multiple controls and separate tools can be applied on different levels (e.g., a virus scanner can be on a
firewall and on a single computer, also). The borders of the levels must be defined in advance, and every change in the borders or in the level membership of the computer must be followed by an audit, the topic of which is the consistency of the system and its security. The result may be that the change cannot be done in that way or that additional controls must be applied to ensure consistency.
Threats
The control measures concern the basic threats to the system. There are three basic threats: confidentiality (C), integrity (I), and availability (A).
Confidentiality means that unauthorized users or processes cannot access the protected information (e.g., data, application, configuration settings, operating system, network). Considering the information life cycle, from its creation until its unrecoverable removal, confidentiality exists only until the first unauthorized access. Access can be controlled by authentication and authorization (preventive), or by an access alerter and monitoring of the access logs (detective). The corrective control also could start with the monitoring of access logs; in case of unauthorized access, the cause must be eliminated. Unfortunately, many or all parts of the information system might become nonconfidential; that is, only the new entries after the elimination of the unauthorized access possibility can be treated as confidential.
Confidentiality generally can be resolved with encryption techniques. Symmetric encryption ensures that only those users who possess the key for the access can access the information. If it is necessary for only one user to access the information, then an asymmetric or public key encryption system can be used. In the case of managing many users and public keys, the public key infrastructure (PKI) is used.
For database technologies and applications, this means that in case of unauthorized access or manipulation, the manipulated data must be replaced with the original data. In case of unauthorized access, the data confidentiality cannot be restored, but the confidentiality of new entries can be resolved by replacing the insufficient confidentiality control with a sufficient one.
Integrity means that the information was not modified in part or in whole. Integrity can be checked against the backup copy, but there are also cryptographic techniques for this problem. Hash algorithms, based on one-way functions, produce a fixed-size output from any input, and the probability of obtaining the same output value from two different inputs is very low. This short
output (e.g., 20 bytes) is good for large-size information also, and it can be easily checked that the backup and the live data have the same hash.
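A minimal sketch of this hash check (the data are invented; SHA-1 is used because it yields the 20-byte output mentioned in the text, although stronger hashes such as SHA-256 are preferable today):

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()   # 20-byte output

backup = b"customer;balance\nalice;100\nbob;250\n"
recorded = digest(backup)                # stored at backup time

live = b"customer;balance\nalice;100\nbob;250\n"
tampered = b"customer;balance\nalice;100\nbob;999\n"

unchanged = digest(live) == recorded     # True: integrity holds
detected = digest(tampered) != recorded  # True: any change alters the hash
```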
Integrity also can be ensured by digital signature, in which the sender or data producer signs the data and, in case of manipulation (even by just one byte), the receiver will recognize that the signatures are not matching. (For details about cryptology, see Schneier, 1996.) In a PKI system, the public key of the receiver is for confidentiality (only the owner of the secret key can decrypt the information), whereas the secret key of the sender is for integrity (the signature can be done only by the secret key of the sender, and everybody can check it with the public key of the sender).
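Python’s standard library offers no asymmetric signing, so the sketch below substitutes an HMAC (a shared-secret message authentication code, not a true digital signature; the key and message are invented) to show the same sign-then-verify pattern:

```python
import hashlib
import hmac

key = b"shared-secret-key"               # hypothetical shared key
message = b"transfer 100 to account 42"

tag = hmac.new(key, message, hashlib.sha256).digest()   # "sign"

def verify(msg: bytes) -> bool:
    expected = hmac.new(key, msg, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)           # constant-time compare

ok = verify(message)                     # True: message intact
bad = verify(message + b".")             # False: even one byte breaks the tag
```

With a real public-key scheme (e.g., via a PKI), the sender would sign with a private key and anyone could verify with the public key; HMAC verification requires the shared secret on both sides.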
For database technologies and applications, this means that in case of data manipulation or partially compromised data, the intact database or partial data must be reloaded.
Availability means that the resource is available. The resource can be a physical device or information, and its availability must be assured for authorized persons with good intentions. Ill-intentioned and unauthorized persons are considered attackers. In some cases even authorized users can damage the availability by extreme usage of the resource. This causes the system to be unavailable, and in the case of intentional behaviour this is called a denial of service (DoS) attack. When many computers are involved in the extreme resource demand, the attack is called a distributed DoS (DDoS).
Availability can be ensured by reserve systems, called backup systems, from which damaged resources can be replaced or quickly recovered. Good backups of the information stored and the transactions processed in the system require a good backup policy. Not only are timely backups necessary, but also tests that show whether these backups can be used in a case of emergency. For database technologies and applications, this means that, in case of data damage or loss, the main value must be restored as soon as possible, at least from the previous backup.
The basic threats and the controls form a preventive-detective-corrective/confidentiality-integrity-availability (PreDeCo/CIA; ISACA, 2003) table, which can be filled in by the project manager with the help of the developers. Together they can find the possible security threats in the system.
PreDeCo/CIA Table
In the PreDeCo/CIA table, each cell is detailed, and the table contains only the briefest features of the cell. The questions are specified depending on the area where this method is applied. There are nine cells:
•Preventive confidentiality: The main threat is that somebody will access the system or a part of it without legal permission, because of bad access-rights settings or insufficient protection of confidential information. For example, a guest user can read the files of other users. In this case, adequate access-rights settings and encryption may help. Even if the user is able to access the file, the content will remain confidential because of the encryption.
•Preventive integrity: The attacker can change the information or the system settings. The worst case is when there is no control to check, detect, and correct this manipulation. Hashes should be applied for database integrity, either on the entire table or on subparts (e.g., rows, columns, cells), depending on the size of the database.
•Preventive availability: The attacker can render the database unavailable. Multiplied service resources can prevent this, where the balance can be set dynamically depending on the load.
•Detective confidentiality: In the worst case, top-secret data will be compromised and disclosed to unauthorized competitors. Access detection controls could help, but if they detect the action after it occurs, it is too late. For example, an attacker could conduct an analysis on a database created for a survey and profit from the results. One protection is making sure the data cannot be read or used, even in case of possession. Some encryption techniques will serve in this case.
•Detective integrity: Same as in the case of the preventive control: there are integrity hashes, and there are the checkers of these hashes. If the checkers detect integrity manipulation of the system parameters, they must react immediately.
•Detective availability: A DoS or DDoS attacks the system or a part of it, but sometimes the partial attack can cause the unavailability of the whole system. For example, the attacker targets the Web server and the whole operating system crashes.
•Corrective confidentiality: If the threats in the previous paragraphs happened, the data are no longer confidential, and there is no corrective control for the situation. For the future, applying preventive controls is the only corrective control.
Table 1. The PreDeCo/CIA table with general control types

|            | Confidentiality                                                       | Integrity                                                          | Availability                                                                         |
| Preventive | Set access rights, encryption (encoding)                              | Encryption (hash, digital signature)                               | Multiplied resources                                                                 |
| Detective  | Monitoring the access rights, analysing access logs (accountability)  | Integrity checker                                                  | Net statistics about the load balance, keep-alive messages about the system elements |
| Corrective | Nothing for the past, preventive controls                             | Install patches, update or upgrade the system, restore from backup | New resources, cutting malicious lines                                               |
•Corrective integrity: If the system is available, an exploit can hurt the integrity of the system. For example, an attacker can exploit a known bug from a remote site and gain unauthorized access to the system. If a patch is available for this security bug, it must be installed. Otherwise, self-made controls must be applied or the service must be stopped.
•Corrective availability: In the worst case, the system would be down or under an extreme load. Installing or putting into operation new resources can help serve well-meaning requests. Cutting the lines from which the extreme load arrives must be applied against malicious requests.
FUTURE TRENDS
Trends are mainly about the standardization of widely used processes. There are many standards (ISO, 2004) and guidelines in use (RFC, 2004), but for non-security experts there are three main ones: BS7799/ISO17799, COBIT, and the Common Criteria. They can help a whole company or project team think about, construct, develop, test, and audit the security of their company, product, or policies.
BS7799 / ISO17799
Developed by the British Standards Institution (BSI) in conjunction with an international user group, BS7799 represents industry-developed standards on information security management. BS7799 has two parts: Code of Practice for Information Security Management (the standard code of practice that provides guidance on how to secure an information system), and Specification for Information Security Management Systems (which specifies the management framework, objectives, and control requirements for information security management systems).
The first part was replaced by the ISO17799 standard.
The second part of the standard remains a BS standard (BS 7799-2:2002) and has been aligned with other management systems standards, including the ISO 9000 and ISO 14000 series.
COBIT
COBIT (ISACA, 2003) has been developed as a generally applicable and accepted standard for good IT security and control practices that provides a reference framework for management, users, information system (IS) auditors, and control and security practitioners.
COBIT, issued by the IT Governance Institute and now in its third edition, is increasingly internationally accepted as good practice for control over information, IT, and related risks. Its guidance enables an enterprise to implement effective governance over the IT that is pervasive and intrinsic throughout the enterprise. In particular, COBIT’s management guidelines component contains a framework responding to management’s need for control and measurability of IT by providing tools to assess and measure the enterprise’s IT capability for the 34 COBIT IT processes. The tools include:
•performance measurement elements (outcome measures and performance drivers for all IT processes);
•a list of critical success factors that provides succinct, nontechnical best practices for each IT process; and
•maturity models to assist in benchmarking and decision making for capability improvements.
Common Criteria
The Common Criteria for Information Technology Security (CC, 1998, 1999) is an international security evaluation certification that was designed to replace the various national evaluation schemes (e.g., FC/TCSEC, ITSEC, CTCPEC). The standard has three main parts and several hundred pages. Implementing it in a project, for a product, or at a company requires security specialists
who involve every member of the project, product, team, or company in this work.
The standard with its three parts became an ISO standard under the number 15408. The titles are Introduction and General Model (part 1), Security Functional Requirements (part 2), and Security Assurance Requirements (part 3).
CONCLUSION
This general model implements a method that can be applied by anyone who is sensitive to the security of his or her system. Just ask what kind of control and protection measure is included in the technology or application to manage the risk of attack against confidentiality, integrity, and availability. One can map the protections and take appropriate measures to manage the risks by filling out such a table oneself. For those who want or must check their system or product against a standard before an audit, the two basic standards and the COBIT framework will help them reach their aims.
It is very important to remember that security through obscurity is dangerous; therefore it is better to plan, implement, and audit controls instead of applying obscure security methods.
REFERENCES
Anderson, R. (2001). Security engineering: A guide to building dependable distributed systems. Retrieved October 2004 from http://www.cl.cam.ac.uk/~rja14/book.html
BS7799-2:2002. Specification for information security management (Part 2 of British Standard 7799). Retrieved from http://www.iso.ch
CC. (1998). Common Criteria [IT standard]. Retrieved from http://csrc.nist.gov/cc
CC. (1999). Common Criteria [IT standard]. Retrieved from http://www.iso.org
Information Systems Audit and Control Association. (2003). COBIT: Control objectives for information and related technology [IT standard]. Retrieved from http://www.isaca.org
International Organization for Standardization. (2004). Retrieved from http://www.iso.org
NCR. (2002). IT security: Hackers, crackers, and thugs. Retrieved from http://www.ncr.com/repository/articles/pdf/wcs_security.pdf
RFC. (2004). Requests for comments, de facto standards for the Internet, many of which are security-related. Retrieved October 2004 from http://www.ietf.org/rfc.html
Schneier, B. (1996). Applied cryptography. New York: Wiley.
Schneier, B. (2000). Secrets & lies (digital security in a networked world). New York: Wiley.
Schneier, B. (2003). Beyond fear (Thinking sensibly about security in an uncertain world). New York: Copernicus Books.
Shore, D. (2004). The ITsecurity.com Dictionary+ of information security. Retrieved October 2004 from http://www.ITsecurity.com/zones.htm?z=13
Stoneburner, G., Goguen, A., & Feringa, A. (2002, July). Risk management guide for information technology systems. Retrieved October 2004 from http://csrc.nist.gov/publications/nistpubs/800-30/sp800-30.pdf
KEY TERMS (SHORE, 2004)
Accountability: One of the fundamental requirements of information security, accountability is the property that enables activities on a system to be traced to specific entities; who or which may then be held responsible for their actions. It requires an authentication system (to identify Users) and an audit trail (to log activities against Users). Accountability supports nonrepudiation, deterrence, fault isolation, intrusion detection and prevention, and after-action recovery and legal action (forensics).
Audit: (1) The process of compiling a list of all security relevant events. The list itself is called the “audit trail” and is essential and invaluable in ensuring accountability. (2) The process of compiling a list of all software and/or hardware installed on one or more PCs.
Authentication: A secure system requires that all users must identify themselves before they can perform any other system action. Authentication is the process of establishing the validity of the user attempting to gain access, and is thus a basic component of access control.
Availability: Requirement that information and/or services be available to an authorized user on demand. The maintenance of availability is one of the prime functions of a security system. Attacks against the availability of an information resource are called denial of service (DoS) attacks.
Confidentiality: Ensures that only those entities (both users and resources such as printers and other devices) that are authorized to access data may do so. It should not be confused with privacy, which is a different concept. Data whose confidentiality has failed is said to be compromised.
Denial of Service (DoS): An attack that is specifically designed to prevent the normal functioning of a system, and thereby to prevent lawful access to that system and its data by its authorized users. DoS can be caused by the destruction or modification of data, by bringing down the system, or by overloading the system’s servers (flooding) to the extent that service to authorized users is delayed or prevented. Denial of Service attacks normally stem from external sources using telecommunications (such as via the Internet), or from disaffected or disgruntled employees who bear a grudge towards the company.
Integrity: More properly “data integrity,” it is the property that the data in question has not been changed. It is particularly important that integrity and confidentiality be combined, so that sensitive information can be neither altered without being read, nor read without being altered. Data whose integrity has failed is said to be corrupted.
Policy: The security policy is the collection of rules that define an organization’s security objectives and how those objectives are to be achieved. The security policy must also be clearly and fully documented and enforced.
Security Through Obscurity (Obfuscation): Generally a derogatory term used when vendors seek to hide security details. The basic principle is that if no one knows any of the details of the security, then, equally, no one knows any weaknesses. The problem with this argument is that you can never be certain that crackers have not already found the weaknesses—better by far to be as open as possible so that good guys can find and help solve those problems.
Semantic Enrichment of Geographical Databases
Sami Faïz
National Institute of Applied Sciences and Technology, Tunisia
Khaoula Mahmoudi
High School of Communications-Tunis (SUPCOM), Tunisia
INTRODUCTION
Our distributed Web-based multi-document summarization system is conceived to semantically enrich Geographic Databases (GDB) (Faïz, 1999; Scholl et al., 1996). In a traditional database, a city is described by its alphanumeric features: name, population count, and so forth; in a GDB, it is further described by spatial attributes which indicate its position (coordinates) in space and its shape (point, line, polygon, etc.). Despite this myriad of information (alphanumeric and spatial data), a GDB lacks an exhaustive set of information describing in a quasi-complete way the entities it handles (Faïz, 2001). Hence, a Geographic Information System (GIS) is not able to provide the end user with information that has not been fed into the GDB and that is not inherent to the application for which the GIS is designed (Bâazaoui, Faïz & Ben Ghezala, 2001, 2003; Faïz, Abbassi & Boursier, 1998). For instance, given a map displayed on the screen, it is not possible to get economic or historical information about the cities of a country whenever the GIS is concerned only with administrative boundaries. With this idea in mind, our intention is to profit from the huge mass of information available online to semantically enrich a GDB. To fulfill this purpose, and to manage the great number of documents retrieved from the Web in a quick and convenient fashion, we adopted text-mining techniques (Tan, 1999; Weiss, Apte & Damerau, 1999), and more precisely summarization. Indeed, with the fast growth of the amount of textual information available online and the multitude of documents reporting almost the same thing, there is clearly a strong need for automatic summarization that copes not with one document at a time but with a set of topically similar ones.
Such systems are referred to as Multi-Document Summarization systems (MDS). While building such summarizers, there are many user requirements that have to be satisfied, in essence, the minimization of the redundancy and the coverage of all the information reported by the set of documents.
In fact, with information growing at an astounding rate while the capacity for reading and analysis remains constant, users are becoming overwhelmed by a huge amount of information. Because summarizing such volumes is time consuming, we built a distributed system. To fulfill this objective, we used Multi-Agent Systems (MAS). The distribution is also justified because MDS can be seen as a naturally distributed problem, owing to the fact that more than one entity is involved. Our intention is to speed up the summarization process while handling the main issues inherent to the problem.
BACKGROUND
Some MDS systems will be outlined and the MAS paradigm will be presented.
Multi-Document Summarization (MDS)
MDS (Lin & Hovy, 2002; Gees et al., 2000; Mani & Bloedoran, 1999; Regina, Kathleen & Michael, 2000) consists of condensing the content of a corpus of documents while coping with some issues, essentially coverage and redundancy. The former concerns dealing with all the information conveyed by the whole collection. As for redundancy, one has to summarize the corpus of documents without retrieving portions of text reporting information already included in the summary. In what follows, we outline some MDS approaches.
Gees et al. (2000) developed a summarizer that first creates individual summaries for all the documents in the set. Afterwards, these document summaries are grouped into clusters according to the similarity of their topics. Finally, for each cluster, a representative summary is selected. In the case of a summary based on a query submitted by the user, the representative summary is the one that has the most similarity with the concerned topic. In the case of a generic summary, the representative will be the one having words that occur frequently across all the summaries of the cluster.
In Radev, Jing, and Budzikowska (2000), the summarization process begins by clustering the articles into groups relative to similar events. Then, for each cluster, the relevance of the sentences is computed. In fact, the degree of relevance (from 0 to 10) of a given sentence to the general topic of the entire cluster is determined by picking out the most frequent words across the cluster. A utility of 0 means that the sentence is not relevant to the cluster, and a 10 marks an essential sentence. Then, the sentences ranked highest according to this degree and above a fixed threshold are retained. Because many sentences have similar contents, one has to minimize the redundancy by filtering out the resulting sentences. This filtering is based on the notion of cross-sentence informational subsumption, which states that a sentence S2 subsumes S1 if it has overlapping words with S1 and presents additional information. Hence, we have to include only one sentence of each class, according to the level of detail desired. At the end of this process, each cluster is described by a set of relevant and non-redundant sentences.
Another method belonging to this field is the Maximal Marginal Relevance (MMR) Multi-Document Summarization (Goldstein et al., 2000). This method begins by segmenting the documents into passages. These may be sentences, sequences of sentences, or paragraphs. Then, the relevant passages to a given query are identified, using the cosine similarity metric. Thus, the passages under a fixed threshold are discarded. For the remainder of the passages (above the threshold), the MMR metric is applied. According to this measure, text passage has high marginal relevance if it is relevant to the query, while having minimal similarity to previously selected passages. The selected passages constitute the summary of the documents corpus.
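As a sketch of the MMR selection step just described (the passages, query, and λ weight are invented; similarity is cosine over simple bag-of-words vectors, and a small λ favours novelty over raw relevance):

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts as bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, passages, k=2, lam=0.3):
    """Greedily pick k passages: relevant to the query (first term),
    dissimilar to those already selected (second term)."""
    selected, candidates = [], list(passages)
    while candidates and len(selected) < k:
        def score(p):
            redundancy = max((cosine(p, s) for s in selected), default=0.0)
            return lam * cosine(p, query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = "storm damage in the coastal city"
passages = [
    "the storm caused heavy damage in the coastal city",
    "storm caused heavy damage in the coastal city",   # near-duplicate
    "residents were evacuated before the storm",
]
summary = mmr_select(query, passages)
# One storm-damage passage plus the evacuation passage are selected;
# the near-duplicate is skipped.
```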
Multi-Agent Systems (MAS)
MAS (Barbuceanu, 1998; Briot & Demazeau, 2001; Mahmoudi & Ghédira, 2000) are generally regarded as systems in which a group of autonomous agents interacts to perform some set of tasks or satisfy some set of goals. According to Ferber, an agent is a hardware or software entity able to act on itself and on its environment. It has a partial representation of its environment and is able to communicate with other agents. It aims at an individual goal, and its behavior is the result of its observations, knowledge, abilities, and the interactions it can have with other agents and with the environment (Ferber, 1997). These agents have several concerns to cope with, and they inhabit and interact within dynamic and not entirely predictable environments. Relevant examples of MAS include a group of agents moving a heavy object, telecommunication networks, and so forth.
Semantic Enrichment of Geographical Databases
These systems have proven their efficiency, as witnessed by the large number of publications reporting their application in heterogeneous domains.
MAIN THRUST OF THE ARTICLE
The overall process and the main modules of the system will be detailed.
Overall Enrichment Process
The semantic enrichment occurs when a user is looking for information about geographic entities. The GDB is queried first to retrieve the stored data. Whenever the user is not satisfied with the response, the user starts up the summarization system. A mining of the online documents is then triggered, which ends by returning the generated summaries.
The system relies on two kinds of agents: an interface agent and summarizer agents. Regardless of its type, every agent has a simple structure: acquaintances (the agents it knows), a local memory gathering its knowledge, and a mailbox storing the received messages that it will process later on.
The interface agent is responsible for launching the overall summarization process. It gets the set of documents resulting from an information retrieval task triggered by the user handling a map. Afterwards, it creates the summarizer agents, whose number is equal to that of the documents. The latter simultaneously execute a segmentation algorithm in order to determine the subtopics discussed throughout the texts of their documents. Thus, each document is seen as nothing but a set of subtopics related to the main topic. Then, the interface agent carries out a delegation task that allocates to each subtopic a delegate (one of the summarizers) responsible for condensing the segments relative to that subtopic. The summary of the corpus of documents is the merging of the partial summaries generated by the delegates. Thus, each delegate agent extracts the most salient pieces of text from the set of segments under its jurisdiction. It considers one or all of the segments and builds its (or their) Rhetorical Structure trees (RS-trees). The summary of a given subtopic under the responsibility of a delegate is generated by gathering the most relevant sentences.
Segmentation
Each summarizer agent executes a segmentation algorithm in order to detect the subtopics discussed in its document. To this end, we adopted the TextTiling algorithm (Hearst, 1997). The algorithm begins by converting the raw text to streams of tokens. These are grouped into sequences, called token-sequences, of size w (the number of tokens). These are in turn gathered into blocks of size k (the number of token-sequences). Similarity values are then computed for every token-sequence gap; that is, a score is assigned to token-sequence gap i reflecting how similar the token-sequences from i-k to i are to the token-sequences from i+1 to i+k+1, according to the following formula:
score(i) = ( Σ_t w_{t,b1} · w_{t,b2} ) / sqrt( (Σ_t w_{t,b1}²) · (Σ_t w_{t,b2}²) )

where t ranges over the tokens, w_{t,b} is the weight (frequency) of token t in block b, and b1 and b2 are the two text blocks b1 = {token-sequence_{i-k}, ..., token-sequence_i} and b2 = {token-sequence_{i+1}, ..., token-sequence_{i+k+1}}.
Then, the depth at each gap between the blocks is computed to determine the deepest valleys, which mark the boundaries between segments relating to different subtopics. For each segment, topic identification is fulfilled by taking the most frequent words as topics.
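The gap scoring and valley detection described above can be sketched as follows. This is a simplified assumption-laden reading of TextTiling (no smoothing or stemming, which the full algorithm includes); the helper names are ours.

```python
import math

def block_weights(token_sequences):
    """Merge several token-sequences into one frequency vector."""
    w = {}
    for seq in token_sequences:
        for t in seq:
            w[t] = w.get(t, 0) + 1
    return w

def gap_score(seqs, i, k):
    """score(i): cosine similarity between the block of token-sequences
    i-k..i and the block i+1..i+k+1, as in the formula above."""
    b1 = block_weights(seqs[max(0, i - k):i + 1])
    b2 = block_weights(seqs[i + 1:i + k + 2])
    dot = sum(b1[t] * b2.get(t, 0) for t in b1)
    n1 = math.sqrt(sum(x * x for x in b1.values()))
    n2 = math.sqrt(sum(x * x for x in b2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def depth_scores(scores):
    """Depth at gap i: how far score(i) sits below the highest score on
    each side; deep valleys suggest subtopic boundaries."""
    depths = []
    for i, s in enumerate(scores):
        left = max(scores[:i + 1])
        right = max(scores[i:])
        depths.append((left - s) + (right - s))
    return depths
```

A vocabulary shift between consecutive blocks drives the gap score toward zero, which then shows up as a deep valley in the depth curve.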
Delegation
At this stage, the purpose is to assign a delegate agent to each virtual document. The latter is so called because it is not the result of an information retrieval task, but the output of gathering the segments dealing with the same subtopic, which are distributed among the summarizer agents.
In fact, the interface has to take into account the cost incurred by each delegation. Formally, the cost function is:
f(s) = α · workload + δ · communication

where s is a given allocation, and α and δ are weight coefficients determined experimentally.
The workload is defined as follows for each delegate k:
workload = Σ_i Σ_j segment-size_{ij}

where i ranges over the set of subtopics assigned to agent k, j ranges over the set of all agents tackling subtopic i, and segment-size_{ij} is the size (number of sentences) of the segment retrieved from the text of agent j and dealing with subtopic i.
The communication is defined as:

communication = Σ_j β

with β = 1 and j ranging over the acquaintances of agent k.
It costs 0 if agent k is a delegate of one of the subtopics it already handles. Otherwise, it is equal to the total number of agents dealing with the subtopic at hand. In fact, each subtopic stemming from the segmentation is considered a group, and the agents that deal with this subtopic are its members. Hence, an agent can simultaneously be a member of more than one group whenever its document tackles more than one subtopic. The interface processes the groups with the fewest members first. For instance, it first processes the groups of cardinality one, which have a unique candidate (the unique member) for the delegation. For each group (subtopic), the interface considers all its members ordered by increasing workload, so as not to overload some agents more than others, which would be quite wasteful. Each summarizer agent already assigned is removed from all the groups to which it belongs, to give other agents the opportunity to participate in the summarization process. When a group has no members left (all its members have been removed because they are already assigned) and is not yet allocated, the interface considers all the summarizers (its members as well as the rest of the summarizer agents) ordered by increasing cost (value of f), and assigns the subtopic at hand to the agent with the minimum cost. Hence, a delegate can be responsible for condensing a subtopic that its native document does not initially tackle. Finally, any summarizer agent can be a delegate of 0, 1, or several subtopics that it has to condense.
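The greedy delegation scheme described above can be sketched as follows. The data layout (a `groups` map from subtopic to the set of agents tackling it, and per-(subtopic, agent) segment sizes), the equal weight coefficients, and the helper names are illustrative assumptions, not the chapter's implementation.

```python
ALPHA, DELTA = 1.0, 1.0  # weight coefficients, tuned experimentally

def all_agents(groups):
    return set().union(*groups.values())

def cost(agent, topic, groups, segment_size, load):
    """f(s) = alpha*workload + delta*communication if `topic` is
    delegated to `agent`, given its current workload."""
    members = groups[topic]
    new_load = load.get(agent, 0) + sum(
        segment_size[(topic, m)] for m in members)
    # Communication is 0 when the agent itself tackles the subtopic,
    # else one message per agent dealing with it
    comm = 0 if agent in members else len(members)
    return ALPHA * new_load + DELTA * comm

def delegate(groups, segment_size):
    """Greedy assignment: smallest groups first; free members ordered
    by increasing workload; already-assigned agents leave all groups;
    an empty group falls back to the globally cheapest agent."""
    load, delegates, assigned = {}, {}, set()
    for topic in sorted(groups, key=lambda t: len(groups[t])):
        members = [a for a in groups[topic] if a not in assigned]
        if members:
            best = min(members, key=lambda a: load.get(a, 0))
        else:
            # No free member left: consider every summarizer by cost
            best = min(all_agents(groups),
                       key=lambda a: cost(a, topic, groups,
                                          segment_size, load))
        delegates[topic] = best
        assigned.add(best)
        load[best] = load.get(best, 0) + sum(
            segment_size[(topic, m)] for m in groups[topic])
    return delegates
```

The fallback branch is what allows a delegate to condense a subtopic its own document never tackled.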
Text Extraction
Once the delegation is over, each delegate uses one of the following alternatives for each subtopic it governs: it arbitrarily selects any of the segments, it selects the longest segment (in the hope that it gives more details on the subtopic), or it considers all the segments. From the selected segments, each delegate then derives the most important pieces of text that best summarize the subtopics by building RS-trees (Marcu, 1999). An RS-tree is a binary tree whose leaves denote elementary textual units and whose internal nodes correspond to contiguous text spans. Each node has a status (nucleus or satellite), a type (the rhetorical