
A Development Environment for Customer-Oriented Web Business

KEY TERMS

ASP: An Active Server Page (ASP) is an HTML page that includes one or more scripts that are processed on a Microsoft Web server before the page is sent to the user. An ASP is somewhat similar to a server-side include or a common gateway interface (CGI) application in that all involve programs that run on the server, usually tailoring a page for the user.

CASE: Computer-Aided Software (or Systems) Engineering. Software tools that provide computer-assisted support for some portion of the software or systems development process, especially on large and complex projects involving many software components and people.

HTML: Hypertext Markup Language. A markup language that uses tags in paired angle brackets to identify and represent the structure and layout of Web pages rendered by Web browsers.

Meta-Data: Data that describe the properties or characteristics of other data, including the structure, source, use, value, and meaning of the data. Meta-data are often described simply as data about data.

Repository: A repository is a centralized database where meta-data about database structure, applications, Web pages, users, and other application components is stored and maintained. It provides a set of mechanisms and structures to achieve seamless data-to-tool and data-to-data integration.

Scenario: A scenario is similar to a use case or script that describes interactions at a technical level. Scenarios are more informal and capture customers’ requirements in a natural fashion.

Web Mining: Web mining is the integration of information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web. Web mining is used to capture customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign.


Digital Media Warehouses


Menzo Windhouwer

Center for Mathematics and Computer Science, The Netherlands

Martin Kersten

Center for Mathematics and Computer Science, The Netherlands

INTRODUCTION

Due to global trends like the rise of the Internet, cheap storage space, and the ease of digital media acquisition, vast collections of digital media are becoming ubiquitous. Futuristic usage scenarios, like ambient technologies, strive to open up these collections for the consumer market. However, this requires high-level semantic knowledge of the items in the collections. Experience in the development and usage of multimedia retrieval systems has shown that within a specific, but still limited, domain, semantic knowledge can be automatically extracted and exploited. However, when the domain is unspecified and unrestricted, that is, when the collection becomes a warehouse, semantic knowledge quickly boils down to generics. Research on Digital Media Warehouses (DMWs) focuses on improving this situation by providing additional support for annotation extraction and maintenance to build semantically rich knowledge bases.

The following sections introduce these DMW topics. For closely related topics, such as multimedia storage and the usage/indexing of the extracted annotations for multimedia retrieval, the reader is directed to the wide variety of literature on multimedia databases (for example, Subrahmaniam, 1997).

BACKGROUND

Media objects are semi-structured or even unstructured by nature. They are mostly sensor data: a natural phenomenon represented as a series of numbers. There is no, or only very limited, meta-data available. Adding annotations is labor-intensive work, which has long been the task of librarians. However, the vastness and diversity of DMWs make this approach no longer feasible. Automatic annotation programs are envisioned to take over or support this task.

The automatic extraction of semantically meaningful annotations (also known as features) has been, and still is, a major research area in computer vision. A main problem is bridging the semantic gap. For visual data, this gap can be defined as follows:

The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data has for a user in a given situation (Smeulders et al., 2000, p. 1349).

This definition can be generalized to digital media objects of any modality without loss of validity.

The semantic gap has not been bridged and may never be. One of the main reasons is the role of ambiguity: the more abstract a semantic concept is, the more subjective interpretations are possible, for example, due to cultural context-sensitivity. Eakins (1996) distinguishes three levels of features:

Level 1: primitive features; for example, color, texture and shape;

Level 2: derived (or logical) features; for example, containment objects of a given type or individual objects;

Level 3: abstract attributes; for example, named events or types of activities, or emotional or religious significance.

The higher the level, the more subjective, and thus more ambiguous, annotations become. State-of-the-art annotation extraction algorithms reach level 2. Level 3 algorithms are currently only applicable in clearly defined and distinguishable (narrow) domains.

The most popular approach is to start with level 1 features and aggregate them into higher-level, semantically meaningful concepts. Common intuition is that annotations of multiple modalities should be combined, hopefully improving and reinforcing each other's findings.

A wide variety of extraction algorithms can be found in scientific publications, mostly dedicated to one specific modality (Del Bimbo, 1999; Foote, 1999; Jurafsky & Martin, 2000). Models to combine features range from hard-coded rules to machine learning algorithms (Mitchell, 1997). Examples of frameworks encompassing these techniques are COBRA (Petkovic, 2003) and the compositional semantics method used in Colombo, Del Bimbo, and Pala (1999).

Figure 1 shows an annotation example, which uses various extraction algorithms: detectors to extract basic (level 1) features (color features and a skin map), a hand-coded Boolean rule to classify the image as a photograph, and a neural network to determine whether the photo contains a face.


Figure 1. Example of annotation extraction and dependencies

[Diagram: raw data flows through feature extraction to concept extraction. Feature detectors produce level 1 color features (number: 14,137; prevalent: 0.01; saturation: 0.36) and a skin map; a Boolean rule concludes "this image is a photo"; a neural net then decides whether the photo contains a face. Output/input dependencies and context dependencies between the steps are marked.]
ANNOTATION EXTRACTION

Higher level annotations depend on the availability of lower level annotations. These dependencies imply an order of execution of extraction algorithms. A fixed set of algorithms can never take all possible modalities and contexts into account and produce all needed subjective and objective annotations. So, the order will not be hardcoded but described in a declarative specification, thus allowing easy extensibility.

The annotation extraction process is concerned with executing the correct extraction algorithms in the specified order. Two basic implementation approaches, sketched in code after the list, are:

1. The pool approach: Annotations live in a data pool. Extraction algorithms become active when their input is available. This approach, where the algorithms are known as daemons, is taken in the LIBRARIAN extension to the AltaVista indexing engine (De Vries, Eberman & Kovalcin, 1998).

2. The pipeline approach: Annotations are pushed or pulled through a pipeline of extraction algorithms. The processing graphs of the MOODS system (Griffioen, Yavatkar & Adams, 1997) follow this approach.
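To make the two styles concrete, the following is a minimal sketch in Python; the detector names, the dictionary-based annotation store, and the two-stage chain are illustrative assumptions, not the LIBRARIAN or MOODS implementations.

```python
# Minimal sketch of the two orchestration styles; detector names and the
# dictionary-based annotation store are illustrative assumptions.

def color_features(obj, ann):
    # level 1 feature detector (stand-in values)
    return {"number": 14137, "prevalent": 0.01, "saturation": 0.36}

def photo_rule(obj, ann):
    # Boolean rule over previously extracted color features
    return ann["color_features"]["saturation"] > 0.2

DETECTORS = {
    "color_features": ([], color_features),        # (inputs, algorithm)
    "photo_rule": (["color_features"], photo_rule),
}

def pool_run(obj):
    """Pool approach: daemons fire as soon as their inputs are available."""
    pool = {}
    while True:
        ready = [n for n, (needs, _) in DETECTORS.items()
                 if n not in pool and all(m in pool for m in needs)]
        if not ready:
            return pool
        for n in ready:
            pool[n] = DETECTORS[n][1](obj, pool)

def pipeline_run(obj):
    """Pipeline approach: annotations flow through a fixed stage order."""
    ann = {}
    for n in ("color_features", "photo_rule"):  # declarative stage order
        ann[n] = DETECTORS[n][1](obj, ann)
    return ann

print(pool_run(obj=None))
print(pipeline_run(obj=None))
```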

When ambiguity is allowed, special care should be taken to maintain the complete context of the annotations in the extraction process. This context will serve as a means for supporting disambiguation at query time. One approach is to store the annotations in a tree, where a path from the root to an annotation reflects the processing order. Whole subtrees can thus be easily culled out. This approach is taken in the ACOI system (Windhouwer, 2003), where non-deterministic context-sensitive grammars are used to describe the processing order and (ambiguous) annotations live in the resulting parse forests. This system distinguishes two types of dependencies (also shown in Figure 1):

1. Output/input dependencies: An algorithm depends on the annotations produced by another algorithm; for example, the face detector depends on the skin detector.

2. Context dependencies: An algorithm is deliberately placed in the context of another algorithm; for example, the face detector is only executed when the photo detector was successful.

While output/input dependencies can be automatically discovered based on the algorithms' signatures (Haidar, Joly & Bahsoun, 2004), the context dependencies are domain-based design decisions explicitly made by the developer of the DMW.
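The two dependency types translate into two different pre-conditions before an algorithm may run; the fragment below is a hypothetical sketch of that check, not ACOI code.

```python
# Hypothetical sketch: an algorithm may run only when its *inputs* exist
# (output/input dependency) and its *context* detectors succeeded.

def may_run(detector, ann, succeeded):
    has_inputs = all(n in ann for n in detector["inputs"])
    in_context = all(n in succeeded for n in detector["contexts"])
    return has_inputs and in_context

face_detector = {"inputs": ["skin_map"], "contexts": ["photo_rule"]}

# The skin map is available, but the photo rule has not (yet) succeeded:
print(may_run(face_detector, ann={"skin_map": object()}, succeeded=set()))
# -> False: the context dependency blocks execution.
```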

ANNOTATION MAINTENANCE

The approaches outlined in the previous section are able to produce the collection of annotations, as far as they can be (semi-) automatically extracted, to describe the semantic content of a media object (in its context). However, in due time new extraction algorithms will become available, making it possible to capture a new (subjective and ambiguous) viewpoint or improve old ones. Also the source media objects may change over time. This may happen beyond the control of the DMW, as it may be a virtual collection (the objects do not reside in and are not controlled by the DMW).


All these (external) sources of change lead to starting a new extraction process for (all) media objects. However, extraction algorithms can be very expensive: a frame-by-frame analysis of a video may take several hours. Reruns of these expensive algorithms should be prevented. The dependencies of a new or updated algorithm should thus be investigated to localize affected annotations and limit the number of revalidation runs. Support for this kind of incremental maintenance of media annotations is still in its infancy. The ACOI system (Windhouwer, 2003) supports it by deriving a dependency graph from the process specification. This graph is analyzed, and an incremental extraction process is started which will only produce the affected annotations.

Looking once more at Figure 1, the constants in the photo decision rule can be changed without affecting the output/input dependencies, that is, without re-executing the lower-level feature extraction. However, after revalidation of the updated decision rule, it may be necessary to also revalidate the face detector; that is, the dependencies are followed to incrementally update the annotations in the DMW.
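A minimal sketch of this kind of localization is given below; the dependency graph mirrors Figure 1, but its encoding and the traversal are assumptions rather than the ACOI implementation.

```python
# Assumed dependency graph for Figure 1: edges point from a producer to the
# detectors that depend on it (output/input or context).
DEPENDENTS = {
    "color_features": ["photo_rule"],
    "skin_map": ["face_detector"],
    "photo_rule": ["face_detector"],  # context dependency also propagates
    "face_detector": [],
}

def affected(changed):
    """Detectors downstream of `changed` whose annotations need revalidation."""
    todo, seen = [changed], set()
    while todo:
        for nxt in DEPENDENTS[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

# Changing a constant in the photo rule leaves the expensive level 1
# extraction untouched; only the rule and the face detector are revisited.
print(affected("photo_rule"))  # {'face_detector'}
```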

FUTURE TRENDS

As mentioned in the introduction, envisioned application scenarios, like ambient environments, contribute to an even more rapid growth of DMWs. So, the future will show an even greater need for transparent management of the annotation process. Consumers will only spend a limited amount of time on manual annotation or supervision of a learning process, which stresses the need for incremental maintenance with its capability to leverage the reuse of already accumulated knowledge. The manual annotation burden can be further decreased when MPEG-7 (Martínez, 2003) becomes more common. Commercial digital media will then, by default, provide more detailed meta-data, which may provide a good starting point for more advanced and personal annotations.

The two basic approaches to steering annotation extraction, that is, the pool and pipeline approaches, are based on very basic computer science models. Other (research) areas where these models are common, for example, workflow management and database view maintenance, can provide more insight for their proper and advanced use in this new area. For example, a coordination system like Galaxy (Galaxy, 2003) closely resembles an annotation extraction management system, and study of its underlying hub model may provide fruitful insights for more dynamic annotation systems.

CONCLUSION


Multimedia collections are growing and diverging at a rapid rate, turning into DMWs. However, opening up these types of collections poses a major problem. Manually annotating the whole collection is too labor-intensive and will have to be supported by annotation extraction algorithms. As the set of needed algorithms and annotations will grow dynamically over time, they have to be managed in a transparent way. Executing a declarative specification of this process is already well understood and bears many resemblances to existing practices. However, incremental maintenance of the created collection of annotations is still in its infancy; bearing future trends in mind, it will be a key issue in building useful and maintainable collections.

REFERENCES

Colombo, C., Del Bimbo, A., & Pala, P. (1999). Semantics in visual information retrieval. IEEE MultiMedia, 6(3), 38-53.

De Vries, A.P., Eberman, B., & Kovalcin, D.E. (1998). The design and implementation of an infrastructure for multimedia digital libraries. Proceedings of the 1998 International Database Engineering & Applications Symposium (pp. 103-110).

Del Bimbo, A. (1999). Visual information retrieval. San Francisco: Morgan Kaufmann.

Eakins, J.P. (1996). Automatic image content retrieval: Are we getting anywhere? Proceedings of the 3rd International Conference in Electronic Library and Visual Information Research (pp. 123-135).

Foote, J. (1999). An overview of audio information retrieval. Multimedia Systems, 7(1), 2-10.

Galaxy. (2003). Galaxy Communicator. Retrieved February 3, 2005, from http://communicator.sourceforge.net/

Griffioen, J., Yavatkar, R., & Adams, R. (1997). A framework for developing content-based retrieval systems. In Intelligent multimedia retrieval. Cambridge, MA: AAAI Press/MIT Press.

Haidar, B., Joly, P., & Bahsoun, J.-P. (2004, Month 00). An open platform for dynamic multimedia indexing. Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2004), Lisboa, Portugal.


Jurafsky, D., & Martin, J.H. (2000). Speech and language processing. Prentice Hall.

Martínez, J.M. (2003). MPEG-7 overview. Retrieved February 3, 2005, from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Mitchell, T. (1997). Machine learning. McGraw-Hill.

Petkovic, M. (2003). Content-based video retrieval supported by database technology. PhD thesis, Centre for Telematics and Information Technology, Enschede, The Netherlands.

Smeulders, A.W., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.

Subrahmaniam, V.S. (1997). Principles of multimedia database systems. San Francisco: Morgan Kaufmann.

Windhouwer, M.A. (2003). Feature grammar systems: Incremental maintenance of indexes to digital media warehouses. PhD thesis, University of Amsterdam, Amsterdam, The Netherlands.

KEY TERMS

Annotation: Information about a (multimedia) object. This information may directly describe the (semantic) content of the object (e.g., this photo shows a cityscape) or describe its relations to other (external) objects (for example, this photo was made in the year 1999).

Annotation Extraction Algorithm: An algorithm that automatically extracts (sets of) annotations, which describe (the content of) a media object. The input of such an algorithm can consist of the media object itself combined with previously extracted annotations or other additional information.

Annotation Extraction Process: The (semi-)automatic extraction of annotations aimed at describing (the content of) a digital media object. In this process, annotation extraction algorithms are called in the correct order, so their output and input dependencies are met.

Annotation Maintenance: When the annotation extraction algorithms or the media objects change, the already extracted annotations have to be revalidated. If a dependency description for the extraction algorithms is available, an incremental extraction process can be started, where only the affected annotations are (re)produced.

Annotation Pipeline: A media object (and its growing collection of annotations) is pushed or pulled through a sequence of annotation extraction algorithms.

Annotation Pool: Annotations of digital media are stored in a data pool. Annotation extraction algorithms populate this pool when their input is available.

Digital Media Warehouse: A vast collection of digitized media objects from an unrestricted set of different domains.

Semantic Gap: The lack of coincidence between the information that one can extract from the data and the interpretation that the same data has for a user in a given situation.

Discovering Association Rules in Temporal Databases

Juan M. Ale

Universidad de Buenos Aires, Argentina

Gustavo H. Rossi

Universidad Nacional de La Plata, Argentina

INTRODUCTION

The problem of the discovery of association rules comes from the need to find interesting patterns in supermarket transaction data. Since transaction data are temporal, we expect to find patterns that depend on time. For example, when gathering data about products purchased in a supermarket, the time of the purchase is stamped in the transaction.

In large data volumes, as used for data mining purposes, we may find information related to products that did not necessarily exist throughout the complete data-gathering period. So we can find a new product, such as a DVD player, that was introduced after the beginning of the gathering, as well as a product, like a 5 1/4-inch flexible disk unit, that had been discontinued before the end of the same gathering. It would be possible for that new product to participate in associations, but it may not be included in any rule because of support restrictions. Suppose we have gathered transactions during 10 years. If the total number of transactions is 10,000,000 and we fix a minimum support of 0.5%, then a particular product must appear in at least 50,000 transactions to be considered frequent. Now, take a product that has been sold during these 10 years and has just the minimum support: It appears on average in 5,000 transactions per year. Consider now another product that was introduced two years ago and that appears in 20,000 transactions per year. The total number of transactions in which it occurs is 40,000; for that reason, it is not frequent, even though it is four times as popular as the first one. However, if we consider just the transactions generated since the product appeared in the market, its support might be above the stipulated minimum. In our example, the support for the new product would be 2%, relative to its lifetime, since in two years the total of transactions would be about 2,000,000 and this product appears in 40,000 of them. Therefore, these new products should appear in interesting and potentially useful association rules. Moreover, we should consider the case of some products that may be frequent just in some subintervals strictly contained in their period of life but not in the entire interval corresponding to their lifespan.
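The arithmetic of this example can be replayed in a few lines; the only assumption beyond the figures in the text is that transactions are spread evenly over the years.

```python
total_txns = 10_000_000        # 10 years of gathering
per_year = total_txns // 10    # assume an even spread over the years

old_product = 5_000 * 10       # sold throughout the 10 years
new_product = 20_000 * 2       # introduced two years before the end

print(old_product / total_txns)      # 0.005 -> exactly the 0.5% minimum
print(new_product / total_txns)      # 0.004 -> below the minimum support
print(new_product / (2 * per_year))  # 0.02  -> 2% relative to its lifetime
```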

We solve this problem by incorporating time in the model of discovery of association rules. We call these new rules general temporal association rules.

One by-product of this idea is the possibility of eliminating outdated rules, according to the user’s criteria. Moreover, it is possible to delete obsolete sets of items as a function of their lifetime, reducing the amount of work to be done in the determination of the frequent items and, hence, in the determination of the rules.

The temporal association rules introduced in Ale and Rossi (2000) are an extension of the nontemporal model. The basic idea is to limit the search for frequent sets of items, or itemsets, to the lifetime of the itemset's members. On the other hand, to avoid considering frequent an itemset with a very short period of life (for example, an item that is sold once), the concept of temporal support is introduced. Thus, each rule has an associated time frame, corresponding to the lifetime of the items participating in the rule. If the extent of a rule's lifetime exceeds a minimum stipulated by the user, we analyze whether the rule is frequent in that period. This concept allows us to find rules that, from the traditional frequency viewpoint, would not be possible to discover.

The lifespan of an itemset may include a set of subintervals. The subintervals are those in which the given itemset (a) has maximal temporal support and (b) is frequent. This new model addresses the solution of two problems: (1) itemsets that are not frequent in their entire lifespan but just in certain subintervals, and (2) the discovery of every itemset frequent in, at least, the subintervals resulting from the intersection of the lifespans of its components, assuring in this way the anti-monotone property (Agrawal & Srikant, 1994). Because of this, we call the rules formed from these kinds of frequent itemsets "general."

BACKGROUND

Previous work about data mining that includes temporal aspects is usually related to the analysis of event sequences (Agrawal & Srikant, 1995; Bettini, Wang, Jajodia, & Lin, 1998; Mannila, Toivonen, & Verkamo, 1995). The usual objective is to discover regularities in the occurrence of certain events and temporal relationships between the different events. In particular, in Mannila et al. the authors discuss the problem of recognizing frequent episodes in an event sequence; an episode is defined there as a collection of events that occur during time intervals of a specific size. Meanwhile, Agrawal and Srikant (1995) review the problem of discovering sequential patterns in transactional databases. The solution consists of creating a sequence for every customer and looking for frequent patterns in each sequence. In Bettini et al. the authors consider more complex patterns, treating temporal distances with multiple granularities. Chakrabarti, Sarawagi, and Dom (1998), in a totally different approach, use the minimum description length principle together with an encoding scheme to analyze the variation of inter-item correlation along time. That analysis, whose goal is extracting temporally surprising patterns, is an attempt to substitute for the role of domain knowledge in searching for interesting patterns.

Now we will analyze how the present work relates to others, specifically in mining temporal association rules. All of them have the same goals as ours: the discovery of association rules and their periods or time intervals of validity. Our proposal was formulated independently of the others but shares some similarities with them. In Ozden, Ramaswamy, and Silberschatz (1998), the authors study the problem of association rules that exist in certain time intervals and display regular cyclic variations over time. They present algorithms for efficiently discovering what they called "cyclic association rules." It is assumed that time intervals are specified by the user.

In Ramaswamy, Mahajan, and Silberschatz (1998), the authors study how the association rules vary over time, generalizing the work in Ozden et al. (1998). They introduce the notion of calendar algebra to describe temporal phenomena of interest to the users and present algorithms for discovering “calendric association rules,” that is, association rules that follow the temporal patterns set forth in the user-supplied calendar expressions.

The third study (Chen, Petrounias, & Heathfield, 1998) also suggests calendar time expressions to represent temporal rules. They present only the basic ideas of the algorithms for discovering the temporal rules. The fourth study (Li, Ning, Wang, & Jajodia, 2001) is the most elaborated expression within the calendar approach. The authors define calendar schemas and temporal patterns in these schemas. They also define two types of association rules: precise-match and fuzzy-match. They try to find association rules within the time intervals defined by the schemas.

Finally, in Lee, Lin, and Chen (2001) and Lee, Chen, and Lin (2003), the authors introduce some kinds of temporal rules and propose an algorithm to discover temporal association rules in a publication database. The basic idea is to partition the publication database in light of exhibition periods of items. Possibly, this work is the most similar to ours. The notion of exhibition period is similar to our lifespan (Ale & Rossi, 2000) for an itemset, and the same happens with the concept of the maximal common exhibition period, which we have again called lifespan when applying the concept to an itemset. The differences with the present work (Ale & Rossi, 2002) are more significant because we define frequent subintervals within an itemset's lifespan and, in this way, the a priori property holds.

Our approach is based on taking into account the items' period of life, or lifespan, this being the period between the first and the last time the item appears in transactions in the database. We compute the support of an itemset in the interval defined by its lifespan or subintervals contained in it, and define the temporal support as the minimum interval width. We consider the history of an itemset as a time series, opening the possibility of performing different kinds of analysis on it, based on its wavelet transform. However, this last part is not included in this article because of space restrictions. Our approach differs from the others in that it is not necessary to impose intervals or calendars, since the lifespan is intrinsic to the data. Even more, we find association rules and the times when they hold, so the calendar approach becomes a special case. Moreover, our model satisfies the downward closure property, on which a-priori-based algorithms are based.

THE GENERAL TEMPORAL MODEL

Let T = { t0, t1, t2, . . . } be a set of times over which a linear order <T is defined, where ti <T tj means ti occurs before or is earlier than tj (Tansel et al., 1993). We will assume that T is isomorphic to N (natural numbers) and restrict our attention to closed intervals [ti, tj].

Let R = {A1, ..., Ap}, where the Ai's are called items; a transaction database d is a collection of subsets of R. Each transaction s in d is a set of items such that s ⊆ R. Associated with s we have a time stamp ts, which represents the valid time of transaction s.

Example 1.1.a: R = {A, B, C, D, E, F, G, H, I}. d is the collection of 10 transactions with tids 100, ..., 1000 and time stamps 1, ..., 10, respectively; see Figure 1(a).

We consider d to be temporally ordered. Every item has a period of life, or lifespan, in the database, which explicitly represents the temporal duration of the item information, i.e., the time in which the item is relevant to the user. The lifespan of an item Ai is given by an interval [t1, t2], with t1 ≤ t2, where t1 is the time stamp of the first transaction in d that contains Ai, and t2 is the time stamp of the last transaction in d that contains Ai.

Figure 1. Example 1—The transaction database and the set of 1-item-set candidates

(a) The database d:

T   Tid   Items
1   100   ABCFHI
2   200   ABCG
3   300   CDI
4   400   ACI
5   500   DEHI
6   600   AF
7   700   BCI
8   800   CHG
9   900   BE
10  1000  ACD

(b) The lifespan (LS) and support (Sup) of the items (or 1-itemsets), and the frequent lifespan (FLS): the subintervals and corresponding support. This is the set of candidate itemsets.

Itemset   LS/Sup         FLS: Subintervals/Support
A         [1,10], 0.5    <[1,10], 0.5>
B         [1,9], 0.44    <[1,4], 0.5>, <[6,9], 0.5>
C         [1,10], 0.7    <[1,10], 0.7>
D         [3,10], 0.37   <[3,6], 0.5>
E         [5,9], 0.4     --------
F         [1,6], 0.33    --------
G         [2,8], 0.29    --------
H         [1,8], 0.37    <[5,8], 0.5>
I         [1,7], 0.71    <[1,7], 0.71>

τ = 3, σ = 0.5

With each item Ai and database d, we associate a lifespan defined by a time interval [Ai.t1, Ai.t2], or simply [t1, t2] if Ai is understood. The lifespan of Ai is denoted lAi and can be defined as the minimal interval I (w.r.t. set inclusion) satisfying: for each time t, if t is associated with a transaction containing Ai, then t is in I. In addition, we define ld, the lifespan of d, as ld = ∪i lAi.

Example 1.1.b: Observe, for instance, lA = [1,10], lD = [3,10], lF = [1,6], and so on, in Figure 1(b) in the column headed by LS/Sup.

The set of transactions in d that contain X is indicated by V(X) = {s | s ∈ d ∧ X ⊆ s} (we omit d for the sake of clarity). If the cardinality of X is k, X is called a k-item-set.

We can estimate the lifespan of a k-item-set X, with k > 1, by [t, t’] where t = max{t1| [t1, t2] is the lifespan of an item Ai in X} and t’= min{ t2| [t1, t2] is the lifespan of an item Ai in X }.

Example 1.2.a: In Figure 2(b) we have the 2-item-set candidates with their computed lifespan, such as lAB = [1,9], lAD = [3,6], lDI = [3,7], etc.

Let X ⊆ R be a set of items and lX its lifespan. If d is the set of transactions of the database, then dlX is the subset of transactions of d whose time stamps ti ∈ lX.

With |dlX| we indicate the number of transactions of dlX. The inclusion of time allows us to determine whether an itemset is frequent by computing the ratio between the number of transactions that contain the itemset and the number of transactions in the database whose valid time is included in the itemset's lifespan.
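The following sketch makes the lifespan estimate and the support ratio concrete on the database of Figure 1(a); the dictionary encoding of d is an assumption for illustration.

```python
# The database d of Figure 1(a): time stamp -> items of the transaction.
d = {1: "ABCFHI", 2: "ABCG", 3: "CDI", 4: "ACI", 5: "DEHI",
     6: "AF", 7: "BCI", 8: "CHG", 9: "BE", 10: "ACD"}

def lifespan(x):
    """[max of the first stamps, min of the last stamps] over the items of x."""
    stamps = [[t for t, s in d.items() if a in s] for a in x]
    return max(min(ts) for ts in stamps), min(max(ts) for ts in stamps)

def support(x, t1, t2):
    """|V(x, [t1, t2])| / |d[t1, t2]|: support of x relative to the window."""
    window = [s for t, s in d.items() if t1 <= t <= t2]
    return sum(all(a in s for a in x) for s in window) / len(window)

print(lifespan("DI"))       # (3, 7), as in Figure 2(b)
print(support("DI", 3, 7))  # 0.4 over the whole lifespan
print(support("DI", 3, 6))  # 0.5 over the frequent subinterval [3,6]
```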

Figure 2. Example 1—The frequent 1-item-sets and the 2-item-set candidates

(a) The frequent 1-itemsets, their lifespans (LS), and their frequent lifespans (FLS) with corresponding support.

Itemset   LS       FLS: Subintervals/Support
A         [1,10]   <[1,10], 0.5>
B         [1,9]    <[1,4], 0.5>, <[6,9], 0.5>
C         [1,10]   <[1,10], 0.7>
D         [3,10]   <[3,6], 0.5>
H         [1,8]    <[5,8], 0.5>
I         [1,7]    <[1,7], 0.71>

(b) The 2-item-set candidates, their lifespans and support, and their frequent lifespans with support.

Itemset   LS/Sup         FLS: Subintervals/Support
AB        [1,9], 0.22    <[1,4], 0.5>
AC        [1,10], 0.4    <[1,6], 0.5>
AD        [3,6], 0.0     --------
AH        [5,8], 0.0     --------
AI        [1,7], 0.28    <[1,4], 0.5>
BC        [1,9], 0.33    <[1,4], 0.5>
BI        [1,7], 0.28    --------
CD        [3,10], 0.25   --------
CH        [1,8], 0.25    --------
CI        [1,7], 0.57    <[1,7], 0.57>
DI        [3,7], 0.4     <[3,6], 0.5>
HI        [1,7], 0.28    --------

τ = 3, σ = 0.5


Given an itemset X, the temporal support of X is the lifespan’s amplitude of X, namely |lX|.

We also define a threshold for the temporal support: if ld is the lifespan of the database and |ld| is its duration, then the threshold of the temporal support τ is a fraction of |ld|. On the other hand, the user could specify a time instant to such that any item whose lifespan [t1, t2] has t2 < to is considered obsolete.

In certain cases, an itemset may not be frequent in the interval corresponding to its entire period of life but just in one or more subintervals of this period. Then, it could participate in interesting rules in these portions of its lifespan. These subintervals do not depend strictly on the data but on the parameters, i.e., the threshold for the temporal support τ and the minimum support σ, provided by the user. We will not be interested in every possible subinterval but only in those that are maximal with respect to the temporal support and are frequent in their time frame. Another way for defining the subintervals is taking the subsets of contiguous time units, according to the granularity selected (for instance, days or weeks), in which the itemset has enough frequency, and whose union is not less than τ. We will use the first option.

We use τ = 3 and σ = 0.5 in the example. We could have fixed to = 5; thus, no item would be considered obsolete since their lifespans’ upper limits are all greater than 5.

Given an itemset X, the frequent lifespan of X (flX) is the set of subintervals, contained in its lifespan, in which the itemset is frequent.

We can exemplify this last case with the item B that has flB = {[1, 4], [6, 9]}, in Figure 1(b).

As set operations are valid over lifespans, the lifespan lX of the k-item-set X, where X is the union of the (k-1)-item- sets V and W with lifespans lV and lW, respectively, is given by lX = lV ∩ lW. The same is not always true for frequent lifespans because of the restrictions of minimum frequency.

The support of X in d over its lifespan lX, denoted s(X, lX) (we again omit d for the sake of clarity), is the set of fractions of transactions in d that contain X in every maximal interval corresponding to lX. For each subinterval [t, t'] we compute the support as |V(X, [t, t'])| / |d[t, t']|. Given a threshold of support σ ∈ [0, 1] and a threshold of temporal support τ, X is frequent in its lifespan lX if there exists at least one subinterval [t, t'] in lX such that s(X, [t, t']) ≥ σ, |[t, t']| ≥ τ, and [t, t'] is maximal in lX. In this case, it is said that X has minimum support in lX.

Table 1. Notation

lX

life span of the item set X

dlX subset of transactions in d with time stamps in lX

τthreshold for temporal support or minimum temporal support

σthreshold for support or minimum support

θthreshold for confidence or minimum confidence

to

threshold for obsolescence

flX

frequent life span for item set X

s(X, lX) support of X in its life span lX


In some cases flX contains a single subinterval, and this may be equal or smaller than the period of life for X.

The example in Figure 2(a) shows different cases: Itemset A, with lifespan [1,10], is frequent in its entire period of life, and its support is 0.5; itemset B, with lifespan [1,9], is frequent in two subintervals, [1,4] and [6,9], both with support 0.5; itemset D, with lifespan [3,10], is frequent just in the subinterval [3,6], with support 0.5. In Figure 3(a) we can observe, for instance, the 2-item-set AB, with lifespan [1,9], which is frequent in [1,4] with support 0.5, and the itemset DI, with lifespan [3,7], frequent lifespan [3,6], and support 0.5.
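A frequent lifespan can be computed naively by scanning all subintervals of the lifespan and keeping the maximal frequent ones; the sketch below reproduces flB from Figure 1(b), assuming the same encoding of d and the convention that |[t, t']| counts t' - t + 1 time units.

```python
d = {1: "ABCFHI", 2: "ABCG", 3: "CDI", 4: "ACI", 5: "DEHI",
     6: "AF", 7: "BCI", 8: "CHG", 9: "BE", 10: "ACD"}
TAU, SIGMA = 3, 0.5

def support(x, i, j):
    window = [s for t, s in d.items() if i <= t <= j]
    return sum(all(a in s for a in x) for s in window) / len(window)

def frequent_lifespan(x, t1, t2):
    """Maximal subintervals of [t1, t2] with width >= TAU and support >= SIGMA."""
    good = [(i, j) for i in range(t1, t2 + 1) for j in range(i, t2 + 1)
            if j - i + 1 >= TAU and support(x, i, j) >= SIGMA]
    # keep only intervals not strictly contained in another good interval
    return [(i, j) for (i, j) in good
            if not any(p <= i and j <= q and (p, q) != (i, j)
                       for (p, q) in good)]

print(frequent_lifespan("B", 1, 9))  # [(1, 4), (6, 9)], i.e., flB
```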

For the sake of clarity we have included Table 1 with the main notation.

A general temporal association rule for d is an expression of the form X → Y : flX∪Y, where X ⊆ R, Y ⊆ R \ X, and flX∪Y is the frequent lifespan of X ∪ Y, in a granularity determined by the user.

Example 2: Given the frequent itemset ABC, we are able to consider different rules such as AB → C : {[1,4]}, AC → B : {[1,4]}, etc. Also, we could consider I → C : {[1,7]} and C → I : {[1,7]}.

The confidence of a rule X → Y : flX∪Y, denoted by conf(X → Y : flX∪Y), is the conditional probability that a transaction of d, randomly selected in the frequent lifespan flX∪Y and containing X, also contains Y.

The conditional probability varies according to the subinterval considered within flX∪Y. Then it is expressed as:

conf(X → Y : flX∪Y) = { s(X ∪ Y, [t, t']) / s(X, [t, t']) | [t, t'] ∈ flX∪Y }

where

flX∪Y = { [t1, t2] : |[t1, t2]| ≥ τ and s(X ∪ Y, [t1, t2]) ≥ σ and ¬∃ [tj, tk] ([t1, t2] ⊂ [tj, tk] and s(X ∪ Y, [tj, tk]) ≥ σ) }.

Example 3: We compute the confidence of the rule C → I : {[1,7]} as:

conf(C → I : {[1,7]}) = s(CI, [1,7]) / s(C, [1,7]) = 0.57 / 0.71 = 0.80.

The general temporal association rule X → Y : flX∪Y holds in d with support s1, ..., sp, temporal support |flX∪Y|, and confidence c1, ..., cp, if s1%, ..., sp% of the transactions of d in flX∪Y = {subinterval1, ..., subintervalp} contain X ∪ Y, and c1%, ..., cp% of the transactions of d that contain X also contain Y in the time frames [t, t'] such that [t, t'] ∈ flX∪Y.


Figure 3. Example 1—The frequent 2-item sets, the 3-item sets candidates and the frequent 3-item sets

(a) The frequent 2-itemsets, their lifespans, and their frequent lifespans with support.

Itemset   LS       FLS: Subintervals/Support
AB        [1,9]    <[1,4], 0.5>
AC        [1,10]   <[1,6], 0.5>
AI        [1,7]    <[1,4], 0.5>
BC        [1,9]    <[1,4], 0.5>
CI        [1,7]    <[1,7], 0.57>
DI        [3,7]    <[3,6], 0.5>

(b) The 3-item-set candidates, their lifespans and support, and their frequent lifespans with support.

Itemset   LS/Sup        FLS: Subintervals/Support
ABC       [1,9], 0.22   <[1,4], 0.5>
ACI       [1,7], 0.28   <[1,4], 0.5>

(c) The frequent 3-itemsets with their frequent lifespans: L3 = {ABC: [1,4], ACI: [1,4]}

(d) There is no 4-item-set candidate: C4 = ∅

τ = 3, σ = 0.5

Example 4: The rule C → I : {[1,7]} holds in d with support 0.57, temporal support 7, and confidence 0.80.

The discovery of all the association rules in a transaction set d can be made in two phases (Agrawal, Imielinski, & Swami, 1993). In the following paragraphs, we introduce suitable modifications to support the discovery of general temporal association rules.

Phase 1T: Find every itemset X ⊆ R such that X is frequent in its lifespan lX, i.e., s(X, [t, t']) ≥ σ, |[t, t']| ≥ τ, and [t, t'] maximal, for [t, t'] in lX; that is, flX ≠ ∅.

Phase 2T: Use the frequent itemsets X to find the rules: Verify, for every Y ⊂ X with Y ≠ ∅, whether the rule (X \ Y) → Y : flX is satisfied with enough confidence, in other words, whether it exceeds the minimum confidence θ established in the interval [t, t'] for all [t, t'] in flX.
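A condensed, self-contained sketch of both phases on the running example might look as follows; the encoding of d and the interval-width convention are assumptions, and a real implementation would use a-priori-style candidate generation instead of this naive quadratic scan.

```python
from itertools import combinations

d = {1: "ABCFHI", 2: "ABCG", 3: "CDI", 4: "ACI", 5: "DEHI",
     6: "AF", 7: "BCI", 8: "CHG", 9: "BE", 10: "ACD"}
TAU, SIGMA, THETA = 3, 0.5, 0.7

def sup(x, i, j):
    window = [s for t, s in d.items() if i <= t <= j]
    return sum(all(a in s for a in x) for s in window) / len(window)

def fls(x):
    """Frequent lifespan: maximal subintervals with enough width and support."""
    t1 = max(min(t for t, s in d.items() if a in s) for a in x)
    t2 = min(max(t for t, s in d.items() if a in s) for a in x)
    good = [(i, j) for i in range(t1, t2 + 1) for j in range(i, t2 + 1)
            if j - i + 1 >= TAU and sup(x, i, j) >= SIGMA]
    return [(i, j) for i, j in good
            if not any(p <= i and j <= q and (p, q) != (i, j)
                       for p, q in good)]

# Phase 1T: level-wise search for itemsets with a non-empty frequent lifespan.
frequent = {}
level = [frozenset(a) for a in sorted({a for s in d.values() for a in s})]
while level:
    found = {x: f for x in level if (f := fls(x))}
    frequent.update(found)
    level = {x | y for x in found for y in found if len(x | y) == len(x) + 1}

# Phase 2T: keep rules (X \ Y) -> Y whose confidence reaches THETA on each
# frequent subinterval of X.
for x, intervals in frequent.items():
    for k in range(1, len(x)):
        for y in map(frozenset, combinations(x, k)):
            for i, j in intervals:
                conf = sup(x, i, j) / sup(x - y, i, j)
                if conf >= THETA:
                    print(set(x - y), "->", set(y), [i, j], round(conf, 2))
```

Run on the example, Phase 1T reproduces the frequent itemsets of Figures 1-3 (e.g., flB = [(1, 4), (6, 9)] and flABC = [(1, 4)]), and Phase 2T emits, among others, the rule C -> I on [1, 7] with confidence 0.8, matching Examples 3 and 4.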

FUTURE TRENDS

Traditionally, temporal data mining is applied to static and regular temporal data sets. But real-world data might be complex and produced in huge amounts in real time. These new aspects of temporal data will demand new theories and algorithms that should cope with the following.

Data streams: Some temporal data is stored only temporarily and requires real-time analysis; in particular, association rule analysis over data just produced by communication switches, POS devices, etc.

Heterogeneous data types: Temporal data is usually expressed as partly categorical events and partly numerical time series. There exists a need to analyze all possible data in a uniform way.

On the other hand, in order to improve our capabilities of analysis, it is necessary to enrich the algorithms with knowledge from other sources, such as Markov-process-based modeling and other modeling techniques. This will benefit areas such as applications in bioinformatics: We believe that high-performance temporal data mining tools will play a crucial role in the analysis of the ever-growing databases of biosequences/biostructures.

CONCLUSION AND FUTURE WORK

We have presented a model for the discovery of general temporal association rules. Each itemset has an associated lifespan, which comes from the explicitly defined time in database transactions. In particular, each frequent itemset has an associated frequent lifespan, that is, a set of intervals in which it is frequent.

As immediate future work, we will analyze the problem of the maintenance of temporal association rules.

In another direction, we are studying the problem of trend dependencies. We analyze a historical database and try to discover all the trend dependencies. For this, we find all the frequent itemsets in their lifespan and then analyze their temporal trends.

We are also interested in finding substitute items in market basket analysis. Under certain conditions, it is
