
Rivero L.Encyclopedia of database technologies and applications.2006


tation of the DFT which avoids recomputation. Moreover, the R-tree is equipped with a deferred update policy in order to avoid index adjustments every time a new value for a streaming time series arrives. Experiments on synthetic random walk time series and on real time series data have shown that the proposed approach is very efficient in comparison with previously proposed methods.
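As an illustration of how DFT coefficients of a sliding window can be maintained without recomputation, the following minimal sketch uses the standard sliding-window DFT identity. It is not taken from the approach discussed above; the function names dft and slide_dft, the unnormalized DFT convention, and the fixed window length n are assumptions made for the example.

```python
import cmath

def dft(window, num_coeffs):
    """Direct (unnormalized) DFT: first num_coeffs coefficients of the window."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(window))
        for f in range(num_coeffs)
    ]

def slide_dft(coeffs, oldest, newest, n):
    """Update the coefficients when the window slides by one value.

    oldest is the value leaving the window, newest the value entering it.
    Each coefficient is updated in O(1) instead of being recomputed in O(n).
    """
    return [
        cmath.exp(2j * cmath.pi * f / n) * (c - oldest + newest)
        for f, c in enumerate(coeffs)
    ]
```

For a window of length n, each arriving value then costs one multiplication and two additions per retained coefficient.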

FUTURE TRENDS

In recent years, research interest has focused on streaming time series. Apart from the investigation of more efficient techniques for similarity search, significant work has been done on data mining of streaming data. The challenge is to overcome the difficulty of continuous data change and apply clustering algorithms to streaming time series. Some interesting results have been reported (Guha et al., 2003).

Another important research direction is the management of continuous queries over streaming time series. In this case, users pose queries that must be continuously evaluated for a time interval. Therefore, when a new value for a time series arrives, the value must be used to determine which queries are satisfied. Continuous queries over data streams are studied in Babcock et al. (2002), Chandrasekaran and Franklin (2002), Gao and Wang (2002a, 2002b) and Gao et al. (2002c).

So far we have focused on one-dimensional time series, since there is only one measurement that changes over time. However, there are applications that require the manipulation of multi-dimensional time series. For example, consider an object that changes location, and we are interested in tracking its position. Assuming that the object moves in two-dimensional space, there are two values (x and y coordinates) that change over time. Some interesting research proposals for multi-dimensional time series can be found in Vlachos et al. (2003).

CONCLUSION

Time series data are used to model values that change over time, and they are successfully applied to diverse fields such as online stock analysis, computer network monitoring, network traffic management, and seismic wave analysis. In order to manipulate time series effectively and efficiently, sophisticated processing tools are required from the database viewpoint.

Time series are categorized as static or streaming. In static time series, each sequence has a static length, whereas in the streaming case, new values are continuously appended. This dynamic characteristic of streaming time series poses significant difficulties and challenges in storage and query processing.

A fundamental operation in a time series database system is the processing of similarity queries. To achieve this goal, there is a need for a distance function, an appropriate representation, and an efficient indexing scheme. By using the filter-refinement processing technique, similarity range queries, similarity nearest-neighbor queries, and similarity join queries can be answered very efficiently.
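The filter-refinement idea for a similarity range query can be sketched as follows. This is an illustrative example only: it assumes equal-length sequences, the Euclidean distance, and a representation consisting of the first k coefficients of an orthonormal DFT, whose distance never exceeds the true distance. The names dft_prefix and range_query are hypothetical, and a linear scan stands in for the index that would normally support the filter step.

```python
import cmath
import math

def dft_prefix(series, k):
    """First k coefficients of the orthonormal DFT (scaled by 1/sqrt(n))."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(series))
        / math.sqrt(n)
        for f in range(k)
    ]

def euclidean(a, b):
    """Euclidean distance; works for real and complex coordinates."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def range_query(database, query, eps, k=2):
    """Return all sequences within distance eps of the query."""
    q_feat = dft_prefix(query, k)
    # Filter step: the distance on k coefficients is a lower bound of the
    # true distance, so discarding on it never loses a qualifying sequence.
    candidates = [s for s in database if euclidean(dft_prefix(s, k), q_feat) <= eps]
    # Refinement step: verify each candidate with the exact distance.
    return [s for s in candidates if euclidean(s, query) <= eps]
```

In practice the reduced representations would be computed once and stored in a multi-dimensional index, so that the filter step does not touch every sequence.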

REFERENCES

Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. Proceedings of FODO (pp. 69-84), Evanston, Illinois.

Agrawal, R., Lin, K.-I., Sawhney, H.S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in time-series databases. Proceedings of VLDB, Zurich, Switzerland.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of ACM PODS (pp. 1-16), Madison, Wisconsin.

Babu, S., & Widom, J. (2001). Continuous queries over data streams. SIGMOD Record, 30(3), 109-120.

Beckmann, N., Kriegel, H.-P., Schneider, R., & Seeger, B. (1990). The R*-tree: An efficient and robust access method for points and rectangles. Proceedings of ACM SIGMOD, Atlantic City, NJ (pp. 322-331).

Berchtold, S., Keim, D., & Kriegel H.-P. (1996). The X-tree: An index structure for high-dimensional data. Proceedings of VLDB, Bombay, India.

Bozkaya, T., Yazdani, N., & Ozsoyoglu, M. (1997). Matching and indexing sequences of different lengths. Proceedings of CIKM, Las Vegas, Nevada.

Chan, K., & Fu, A.W. (1999). Efficient time series matching by wavelets. Proceedings of IEEE ICDE (pp. 201-208).

Chandrasekaran, S., & Franklin, M.J. (2002). Streaming queries over streaming data. Proceedings of VLDB, Hong Kong, China.

Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. Proceedings of ACM SIGMOD, Minneapolis, Minnesota (pp. 419-429).


Similarity Search in Time Series Databases

Gao, L., Yao, Z., & Wang, X.S. (2002c). Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching. Proceedings of VLDB, Hong Kong, China.

Gao, L., & Wang, X.S. (2002a). Continually evaluating similarity-based pattern queries on a streaming time series. Proceedings of ACM SIGMOD, Madison, Wisconsin.

Gao, L., & Wang, X.S. (2002b). Improving the performance of continuous queries on fast data streams: Time series case. Proceedings of SIGMOD/DMKD Workshop, Madison, Wisconsin.

Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., & Strauss, M.J. (2003). One-pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3), 541-554.

Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), 515-528.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. Proceedings of ACM SIGMOD (pp. 47-57). Boston.

Kontaki, M., & Papadopoulos, A.N. (2004). Similarity search in streaming time sequences. Proceedings of SSDBM (to appear), Santorini, Greece.

Lin, K., Jagadish, H.V., & Faloutsos, C. (1995). The TV-tree: An index structure for high dimensional data. The VLDB Journal, 3, 517-542.

Liu, X., & Ferhatosmanoglu, H. (2003). Efficient k-NN search on streaming data series. Proceedings of SSTD, Santorini, Greece.

Park, S., Chu, W.W., Yoon, J., & Hsu, C. (2000). Efficient searches for similar subsequences of different lengths in sequence databases. Proceedings of IEEE ICDE.

Vlachos, M., Hatjieleftheriou, M., Gunopoulos, D., & Keogh, E. (2003). Indexing multidimensional time series with support for multiple distance measures. Proceedings of ACM SIGKDD, Washington, DC.

Weber, R., Schek, H.-J., & Blott, S. (1998). A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of VLDB, New York (pp. 194-205).

Yi, B.-K., Jagadish, H.V., & Faloutsos, C. (1998). Efficient retrieval of similar time sequences under time warping. Proceedings of IEEE ICDE, Orlando, Florida (pp. 201-208).

Yi, B.-K., & Faloutsos, C. (2000). Fast time sequence indexing for arbitrary Lp norms. Proceedings of VLDB, Cairo, Egypt.

KEY TERMS

Data Mining: A research field which investigates the extraction of useful knowledge from large datasets. Clustering and association rule mining are two examples of data mining techniques.

Dimensionality Reduction: It is a technique that is used to lower the dimensionality of the original dataset. Each object is transformed to another object which is described by less information. It is very useful for indexing purposes, since it increases the speed of the filtering step.

Distance Function: It is used to express the similarity between two objects. It is usually normalized in the range between 0 and 1. Examples of distance functions used for time series data are the Euclidean distance and the Time Warping distance.

Filter-Refinement Processing: A technique used in query processing, which is composed of the filter step and the refinement step. The filter step discards parts of the database that cannot contribute to the final answer and determines a set of candidate objects, which are then processed by the refinement step. Filtering is usually enhanced by efficient indexing schemes for improved performance.

Similarity Queries: These are queries that retrieve objects which are similar to a query object. There are three basic similarity query types, namely, similarity range, similarity nearest-neighbor, and similarity join.

Streaming Time Series: It is composed of a sequence of values, where each value corresponds to a time instance. The length changes, since new values are appended.

Time Series: It is composed of a sequence of values, where each value corresponds to a time instance. The length remains constant.
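The two distance functions named under Distance Function above can be sketched as follows. This is a minimal illustration, assuming numeric sequences and the textbook dynamic-programming formulation of time warping with absolute difference as the local cost; neither function normalizes its result to the range between 0 and 1, and the function names are chosen for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def time_warping(a, b):
    """Time warping distance; the sequences may differ in length."""
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i values of a
    # with the first j values of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a value of b is repeated
                cost[i][j - 1],      # a value of a is repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[len(a)][len(b)]
```

Unlike the Euclidean distance, time warping lets one value align with several values of the other sequence, so sequences that differ only by local stretching along the time axis come out as close.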


Spatio-Temporal Indexing Techniques¹

Michael Vassilakopoulos

TEI of Thessaloniki, Greece

Antonio Corral

University of Almeria, Spain

INTRODUCTION

Time and space are ubiquitous aspects of reality. Temporal and spatial information appear together in many everyday activities, and many information systems of modern life should be able to handle such information. For example, information systems for traffic control, fleet management, environmental management, military applications, local and public administration, and academic institutions need to manage information with spatial characteristics that change over time, or in other words, spatio-temporal information. The need for spatio-temporal applications has been strengthened by recent developments in mobile telephony technology, mobile computing, positioning technology, and the evolution of the World Wide Web.

Research and technology that aim at the development of Database Management Systems (DBMSs) that can handle spatial, temporal, and spatio-temporal information have been developed over the last few decades. The embedding of spatio-temporal capabilities in DBMSs and GISs is a hot research area that will continue to attract researchers and the informatics industry in the years to come.

In spatio-temporal applications, many sorts of spatio-temporal information appear, for example, an area covered by an evolving storm, the changing population of the suburbs of a city, or the changing coastlines caused by ebb and tide. However, one sort of spatio-temporal information is quite common (and in some respects easier to study) and has attracted the most research effort: moving objects or points, for example, a moving vehicle, an aircraft, or a wandering animal.

One key issue for the development of an efficient spatio-temporal DBMS (STDBMS) is the use of spatio-temporal access methods at the physical level of the DBMS. The efficient storage, retrieval, and querying of spatio-temporal information demands the use of specialized indexing techniques that minimize the cost during management of such information.

In this article, we report on the research efforts that have addressed the indexing of moving points and other spatio-temporal information. Moreover, we discuss the possible research trends within this area of rising importance.

BACKGROUND

The term spatial data refers to multidimensional data, like points, line segments, regions, polygons, volumes, or other kinds of geometric entities, while the term temporal data refers to data varying in the course of time. Since in database applications the amount of data that should be maintained is too large for main memory, external memory (hard disk) is considered as a storage means. Specialized access methods are used to index disk pages and, in most cases, have the form of a tree. Numerous indexing techniques have been proposed for the maintenance of spatial and temporal data. Two good sources of related information are the survey by Gaede and Günther (1998) and the survey by Salzberg and Tsotras (1999) for spatial and temporal access methods, respectively.

In recent years, several researchers have focused on spatio-temporal data (spatial data that vary in the course of time) and the related indexing methods for answering spatio-temporal queries. A spatio-temporal query is a query that retrieves data according to a set of spatial and temporal relationships, for example, “find the vehicles that will be within 5 km of a specified point within the next 5 minutes”. A number of recent short reviews that summarize such indexing techniques (especially indexing of moving points) have already appeared in the literature. There are several ways of categorizing spatio-temporal access methods (several viewpoints, or classifications). In the rest of this section, we report on the approach followed by each of these reviews and on the material that the interested reader would find there.
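The example query above can be made concrete with a small sketch that scans all vehicles without any index. It assumes that each vehicle moves linearly with a known constant velocity, that positions are in km and time is in minutes; the names min_distance and vehicles_within and the input layout are hypothetical.

```python
import math

def min_distance(pos, vel, target, horizon):
    """Smallest distance to target reached during the time interval [0, horizon]."""
    rx, ry = pos[0] - target[0], pos[1] - target[1]
    vx, vy = vel
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0:
        t_best = 0.0  # the object does not move
    else:
        # Time of closest approach on the unbounded line, clamped to the horizon.
        t_best = min(max(-(rx * vx + ry * vy) / speed_sq, 0.0), horizon)
    return math.hypot(rx + vx * t_best, ry + vy * t_best)

def vehicles_within(vehicles, target, radius, horizon):
    """Identifiers of vehicles that come closer than radius within the horizon.

    vehicles maps an identifier to ((x, y), (vx, vy)) at the current time.
    """
    return [
        vid
        for vid, (pos, vel) in vehicles.items()
        if min_distance(pos, vel, target, horizon) < radius
    ]
```

The indexing techniques reviewed in this article aim at answering such queries without examining every object.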

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

In the book Spatiotemporal Databases: The ChoroChronos Approach, which was authored within the ChoroChronos project and edited by Sellis et al. (2003), Chapter 6 is entitled “Access Methods and Query Processing Techniques” and reviews spatio-temporal access methods that appeared up to 2001. The main classification followed in this chapter is between methods belonging to the R-tree family and methods belonging to the Quadtree family. The principle guiding the hierarchical decomposition of data distinguishes between these two indexing approaches. The two fundamental principles, or hierarchies, are:

the data space hierarchy: a region containing data is split (when, for example, a maximum capacity is exceeded) to sub-regions in a way that depends on these data (for example, each of two sub-regions contains half of the data), and

the embedding space hierarchy: a region containing data is split (when a certain criterion holds) to subregions in a predefined way (for example, a square region is always split in four quadrant sub-regions).

R-trees are data-driven, while Quadtrees are space-driven access methods.
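The difference between the two hierarchies can be illustrated with two split rules. This is a simplified sketch: the space-driven rule cuts a square region into its four quadrants regardless of the data, while the data-driven rule divides the points into two halves by a median split on x, which merely stands in for the more elaborate split heuristics of actual R-tree variants. All function names are chosen for the example.

```python
def split_space_driven(region):
    """Split (x0, y0, x1, y1) into four equal quadrants, ignoring the data."""
    x0, y0, x1, y1 = region
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]

def bounding_box(points):
    """Smallest axis-parallel rectangle enclosing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def split_data_driven(points):
    """Split the points into two halves; the sub-regions follow the data."""
    ordered = sorted(points)  # order by x, ties broken by y
    half = len(ordered) // 2
    groups = [ordered[:half], ordered[half:]]
    return [(group, bounding_box(group)) for group in groups]
```

With the space-driven rule the sub-regions are known in advance but may hold very different numbers of points; with the data-driven rule each sub-region holds about half of the points but its extent depends on where the points lie.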

The June 2002 (Vol. 25, No. 2) issue of the IEEE Data Engineering Bulletin (http://sites.computer.org/debull/A02june/issue1.htm) is devoted to “Indexing of Moving Objects” and is an excellent source of updated information. Pfoser (2002) reviews techniques that index the trajectories of moving objects that follow unconstrained movement (e.g., vessels at sea), constrained movement (e.g., pedestrians), and movement in transportation networks (e.g., trains or cars). These techniques allow the answering of queries concerning the past of the objects.

On the other hand, Papadopoulos et al. (2002) review techniques that index mobile objects and allow answering of queries about their future positions. The basis of these techniques is the duality transform, that is, mapping of data from one data space to another data space, where answering of queries is easier, or more efficient (mapping of a line segment representing a trajectory to a point in space of equal dimensionality).

Agarwal and Procopiuc (2002) classify indexing techniques according to the consideration of time as another dimension (called time oblivious approach that can be used for answering queries about the past), the use of kinetic data structures (that can be used for answering present queries or even queries that arrive in chronological order), and the combined use of the two techniques (that can be used for efficiently answering queries about the near past or future).

Jensen and Saltenis (2002) discuss a number of techniques that may lead to improved update performance of moving-objects indices. Their paper is a good source of possible future research trends.

Chon, Agrawal, and El Abbadi (2002) report on managing object trajectories by following a partitioning approach that is best suited to answering time-dependent shortest path queries (where the cost of edges varies with time).

Mokbel, Ghanem, and Aref (2003) review numerous spatio-temporal access methods, classifying them according to their ability to index only the past, only the present, and the present together with the future status of data. A very descriptive figure that displays the evolution of spatio-temporal access methods with the underlying spatial and temporal structures is included.

Tzouramanis, Vassilakopoulos, and Manolopoulos (2004) review and compare four temporal extensions of the Linear Region Quadtree that can store and manipulate consecutive raster images and answer spatio-temporal queries referring to the past.

MAIN THRUST OF THE ARTICLE

In this section, we briefly review the most fundamental spatio-temporal indexing techniques.

Quadtree-Based Methods

The Quadtree is a four-way tree where each node corresponds to a subquadrant of the quadrant of its parent node (the root corresponds to the whole space). These trees subdivide space in a hierarchical and regular fashion. They are mainly designed for main memory; however, several alternatives for secondary memory have been proposed. The most widely used Quadtree is the Region Quadtree that stores regional data in the form of raster images. More details appear in Gaede and Günther (1998).
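A main-memory Region Quadtree for a raster image can be sketched as follows. The example assumes a square binary image whose side is a power of two, given as a list of rows; the nested-list representation and the NW, NE, SW, SE child order are choices made for the illustration.

```python
def build_quadtree(image, row=0, col=0, size=None):
    """Return a pixel value for a uniform quadrant, else a list of 4 subtrees."""
    if size is None:
        size = len(image)
    first = image[row][col]
    uniform = all(
        image[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if uniform:
        return first  # leaf: the whole quadrant has one colour
    half = size // 2
    return [
        build_quadtree(image, row, col, half),                # NW
        build_quadtree(image, row, col + half, half),         # NE
        build_quadtree(image, row + half, col, half),         # SW
        build_quadtree(image, row + half, col + half, half),  # SE
    ]
```

For a 4 x 4 image whose north-west quadrant is entirely black (1) and which has one more black pixel in the south-east quadrant, the result is [1, 0, 0, [0, 1, 0, 0]]: uniform quadrants become leaves and only the mixed quadrant is subdivided further.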

Tayeb, Ulusoy, and Wolfson (1998) used the PMR-quadtree for indexing future trajectories of moving objects. The PMR tree is a tree based on quadtrees, capable of indexing line segments. The internal part of the tree consists of an ordinary region quadtree residing in main memory. The leaf nodes of this quadtree point to the bucket pages that hold the actual line segments and reside on disk. Each line segment is stored in every bucket whose quadrant (region) it crosses. This introduces data replication: every trajectory (a semi-infinite line) is stored in all the quadrants that it crosses.

Raptopoulou, Vassilakopoulos, and Manolopoulos (2004) used a new Quadtree-based structure, called XBR tree, for indexing past trajectories of moving objects. XBR trees (External Balanced Regular trees) are secondary memory balanced structures that subdivide space in a quadtree manner into disjoint regions. In their paper, XBR trees are shown to excel over PMR trees when used for indexing past trajectories.

Tzouramanis et al. (2003, 2004) present and compare four different extensions of the Linear Region Quadtree (the Time-Split Linear Quadtree, the Multiversion Linear Quadtree, the Multiversion Access Structure for Evolving Raster Images, and Overlapping Linear Quadtrees) for indexing a sequence of evolving raster data (e.g., images of clouds as they evolve in the course of time). A Linear Region Quadtree is an external memory version of the Region Quadtree, where each quadrant is represented by a codeword stored in a B+-tree. In their paper, temporal window queries (e.g., find the regions intersecting the query window within a time interval) are studied.
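The codeword idea behind a Linear Region Quadtree can be illustrated with a short sketch. It assumes a square binary image whose side is a power of two and encodes each black quadrant by the sequence of directional digits (0 = NW, 1 = NE, 2 = SW, 3 = SE) on the path from the root. The variable-length string encoding is a simplification chosen for the example, and a plain sorted list stands in for the B+-tree.

```python
def black_codewords(image, row=0, col=0, size=None, prefix=""):
    """Codewords of the maximal black quadrants of a binary raster image."""
    if size is None:
        size = len(image)
    cells = [
        image[r][c]
        for r in range(row, row + size)
        for c in range(col, col + size)
    ]
    if all(v == 1 for v in cells):
        return [prefix]  # a completely black quadrant yields one codeword
    if all(v == 0 for v in cells):
        return []        # white quadrants are not stored
    half = size // 2
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # NW, NE, SW, SE
    codes = []
    for digit, (dr, dc) in enumerate(offsets):
        codes += black_codewords(image, row + dr, col + dc, half, prefix + str(digit))
    return codes
```

Because the quadrants are visited in digit order, the resulting list is already sorted, which is the property that allows such codewords to be kept in an ordered one-dimensional index.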

R-Tree-Based Methods

An R-tree is a balanced multiway tree for secondary storage, where each node is related to a Minimum Bounding Rectangle (MBR), the minimum rectangle that bounds the data elements contained in the node. The MBR of the root bounds all the data stored in the tree. The most widely used R-tree is the R*-tree. More details appear in Gaede and Günther (1998).
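The role of MBRs in searching can be sketched with a small hand-built tree. The example only illustrates how node MBRs prune a window query; it does not implement insertion, node splitting, or balancing, and the dictionary-based node layout and function names are assumptions made for the illustration.

```python
def mbr(rects):
    """Minimum bounding rectangle of rectangles given as (x0, y0, x1, y1)."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )

def intersects(a, b):
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def make_leaf(entries):
    """entries is a list of (rectangle, object identifier) pairs."""
    return {"mbr": mbr([rect for rect, _ in entries]), "entries": entries}

def make_inner(children):
    return {"mbr": mbr([child["mbr"] for child in children]), "children": children}

def search(node, window):
    """Identifiers of all objects whose rectangle intersects the window."""
    if not intersects(node["mbr"], window):
        return []  # the whole subtree is pruned
    if "entries" in node:
        return [oid for rect, oid in node["entries"] if intersects(rect, window)]
    hits = []
    for child in node["children"]:
        hits.extend(search(child, window))
    return hits
```

A subtree is visited only if its MBR intersects the query window, so a query far from the data of a node never reads that node's descendants.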

Xu, Han, and Lu (1990) presented an R-tree variation called RT-tree. The RT-tree couples time intervals with spatial ranges in each node of the tree: each MBR is accompanied by the time interval during which the related object or node is valid. Queries that embed time may require traversal of the whole tree.

In the same paper, the MR-tree was presented. This R-tree variation employs the idea of overlapping between a sequence of trees (like the Overlapping Linear Quadtrees mentioned above) by storing common subtrees between consecutive trees only once. This tree suffers from reduced performance for temporal window queries. Additionally, a small change in a common subtree causes replication of this subtree between consecutive trees. Nascimento and Silva (1998) presented the HR-tree that is very similar to the MR-tree. Tao and Papadias (2001a) presented an improved version called the HR+-tree that avoids replication of subtrees to some extent. In the HR+-tree, a node is allowed to have multiple parents.

Theodoridis, Vazirgiannis, and Sellis (1996) presented the 3D R-tree that treats time as one of the three dimensions. Nascimento, Silva, and Theodoridis (1999) presented the 2+3 R-tree: a 2D R-tree is used for current two-dimensional points and a 3D R-tree for the historical 3D trajectories. Depending on the query time, both trees may need to be searched.

Tao and Papadias (2001b) presented the MV3R-tree. This consists of a Multiversion R-tree (like the Multiversion Linear Quadtree mentioned above) to process timestamp queries and a 3D R-tree to process long interval queries.

Pfoser, Jensen, and Theodoridis (2000) presented the STR-tree (Spatio-Temporal R-tree), an R-tree with a different insert/split algorithm. This algorithm aims at keeping the line segments belonging to the same trajectory together as much as possible. In the same paper, the TB-tree (Trajectory-Bundle tree) was introduced. This is an R-tree-like structure that strictly preserves trajectories. A leaf node can only contain segments belonging to the same trajectory. The TB-tree outperforms the STR-tree in trajectory-based queries (queries involving the topology of trajectories and derived information, such as speed and heading of objects).

Hadjieleftheriou et al. (2002) use the partially-persistent R-tree (PPR-tree) to index general spatio-temporal data (not necessarily point data) that are allowed to move/change with a general motion over time. Two problems that arise are excessive dead space and overlap. These problems are overcome by making use of artificial updates.

Saltenis et al. (2000) proposed the TPR-tree (Time Parameterized R-tree) for indexing the current and future positions of moving points. This tree introduced the idea of parametric bounding rectangles in the R-tree (bounding rectangles that expand according to the maximum and minimum velocities of the objects that they enclose). To avoid the case where the bounding rectangles grow to be very large, whenever the position of an object is updated, all the bounding rectangles on the nodes along the path to the leaf at which this object is stored are recomputed. Tao, Papadias, and Sun (2003) presented an improvement called TPR*-tree that uses new insertion and deletion algorithms that aim at minimizing a certain cost function. Moreover, Tao et al. (2004) introduced the STP tree, which is a generalization of the previous two trees to arbitrary polynomial functions describing the movement of objects.
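The idea of a parametric bounding rectangle can be illustrated as follows. The sketch assumes two-dimensional points with constant velocities and computes a rectangle whose edges move with the minimum and maximum velocities of the enclosed points, so that it bounds them at every time not earlier than the reference time. The dictionary layout and function names are hypothetical, and the insertion and update algorithms of the TPR-tree are not shown.

```python
def make_tpbr(points, t_ref):
    """Parametric bounding rectangle of points (x, y, vx, vy) valid from t_ref."""
    return {
        "t_ref": t_ref,
        "lo": (min(p[0] for p in points), min(p[1] for p in points)),
        "hi": (max(p[0] for p in points), max(p[1] for p in points)),
        "v_lo": (min(p[2] for p in points), min(p[3] for p in points)),
        "v_hi": (max(p[2] for p in points), max(p[3] for p in points)),
    }

def rect_at(box, t):
    """Bounding rectangle (x0, y0, x1, y1) at time t >= t_ref."""
    dt = t - box["t_ref"]
    return (
        box["lo"][0] + box["v_lo"][0] * dt,
        box["lo"][1] + box["v_lo"][1] * dt,
        box["hi"][0] + box["v_hi"][0] * dt,
        box["hi"][1] + box["v_hi"][1] * dt,
    )

def position_at(point, t_ref, t):
    """Position of a linearly moving point (x, y, vx, vy) at time t."""
    x, y, vx, vy = point
    return (x + vx * (t - t_ref), y + vy * (t - t_ref))
```

Such a rectangle never shrinks as time advances, which is why recomputing the bounding rectangles on updates, as described above, is needed to keep them tight.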

Frentzos (2003) proposed the Fixed Network R-tree (FNR-tree) for indexing objects moving on fixed networks. The network is represented by a set of line segments (links) stored in a 2D R-tree. This tree is accompanied by a forest of 1D R-trees. There exists one 1D R-tree for each leaf node of the 2D R-tree. This 1D R-tree is used to index the time intervals that any moving object was moving on a link stored in this leaf.

Transformation-Based Methods

Kollios, Gunopoulos, and Tsotras (1999) used the duality transformation to index the current and future positions of moving objects. A moving point in 1D space is represented by a trajectory (line segment) that is transformed to a point in the 2D space (accordingly, a moving point in 2D space is transformed to a point in 4D space). The resulting points are indexed by a kd-tree-based spatial index (a structure more suitable than R-trees, since the distribution of these points is highly skewed). A spatio-temporal range query is transformed into a polygon query in the dual space.
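For the one-dimensional case, the duality transformation can be sketched as follows. The example assumes linear motion x(t) = a + v t and maps each moving point to the dual point (v, a); the query predicates are evaluated directly on the dual points by a scan, whereas an index on the dual space would be used in practice. The function names and the choice of (velocity, intercept) as dual coordinates are assumptions made for the illustration.

```python
def to_dual(x_ref, velocity, t_ref=0.0):
    """Map a point observed at x_ref at time t_ref to the dual point (v, a)."""
    intercept = x_ref - velocity * t_ref  # position extrapolated to time 0
    return (velocity, intercept)

def in_range_at(dual, x_lo, x_hi, t):
    """True if the object lies in [x_lo, x_hi] at time t.

    In the dual plane this condition is a strip bounded by two parallel lines.
    """
    v, a = dual
    return x_lo <= a + v * t <= x_hi

def in_range_during(dual, x_lo, x_hi, t1, t2):
    """True if the object lies in [x_lo, x_hi] at some time in [t1, t2].

    Under linear motion the positions visited during [t1, t2] form the
    interval between the two end positions, so an overlap test suffices.
    """
    v, a = dual
    p1, p2 = a + v * t1, a + v * t2
    return min(p1, p2) <= x_hi and max(p1, p2) >= x_lo
```

All constraints on (v, a) are linear, which is why a spatio-temporal range query becomes a polygon query in the dual space.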


Agarwal, Arge, and Erickson (2000) also use a duality transformation. A moving object in the 2D space is represented by a trajectory in 3D space. This trajectory is projected into the (x, t) and (y, t) planes. The dual 2D points of each of these projections are indexed separately. The answer to a spatio-temporal range query is the union of two spatio-temporal range queries in the two planes. A kinetic data structure (Basch, Guibas & Hershberger, 1997) is used to index each dual space.

FUTURE TRENDS

Apart from the development of new or improved indexing techniques, the following make up a non-exhaustive list of the research trends that are likely to be further addressed in the future.

The evolution of indexing techniques for other sorts of spatio-temporal data (e.g., evolving boundaries, non-point objects that exhibit not only a changing position but rotate, change shape, color, transparency, etc.) apart from moving objects, or evolving regions.

The development of access methods and query processing techniques for answering other kinds of queries (e.g., spatio-temporal joins) apart from range queries that have been mainly addressed so far.

Efficiency and qualitative comparison of the indexing techniques proposed in the literature.

The adoption of several of the update techniques proposed in Jensen and Saltenis (2002), like the utilization of buffering, the use of all available main memory, the consideration of movement constraints, and so forth.

Consideration of distributed access methods for spatio-temporal data, as well as addressing concurrency and recovery issues and incorporation of uncertainty of movement, as proposed in Agarwal and Procopiuc (2002).

CONCLUSION

In this article, we have reviewed the issues and techniques related to access methods for spatio-temporal data. This research area (and especially indexing of moving objects) has attracted many researchers in recent years. Although this is a relatively new research area, numerous techniques have been developed. However, this is still a hot and demanding research area where many challenges need to be addressed.

REFERENCES


Agarwal, P.K., Arge, L., & Erickson, J. (2000). Indexing moving points. Proceedings of the 19th ACM Symposium on Principles of Database Systems (PODS 2000) (pp. 175-186).

Agarwal, P.K., & Procopiuc, C.M. (2002). Advances in indexing for mobile objects. IEEE Data Engineering Bulletin, 25(2), 25-34.

Basch, J., Guibas, L.J., & Hershberger, J. (1997). Data structures for mobile data. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 1997) (pp. 747-756).

Chon, H.D., Agrawal, D., & El Abbadi, A. (2002). Data management for moving objects. IEEE Data Engineering Bulletin, 25(2), 41-47.

Frentzos, E. (2003). Indexing objects moving on fixed networks. Proceedings of the 8th International Symposium on Spatial and Temporal Databases (SSTD 2003) (pp. 289-305).

Gaede, V., & Günther, O. (1998). Multidimensional access methods. ACM Computing Surveys, 30(2), 170-231.

Hadjieleftheriou, M., Kollios, G., Tsotras, V.J., & Gunopoulos, D. (2002). Efficient indexing of spatiotemporal objects. Proceedings of the 8th International Conference on Extending Database Technology (EDBT 2002) (pp. 251-268).

Jensen, C.S., & Saltenis, S. (2002). Towards increasingly update efficient moving-object indexing. IEEE Data Engineering Bulletin, 25(2), 35-40.

Kollios, G., Gunopoulos, D., & Tsotras, V.J. (1999). On indexing mobile objects. Proceedings of the 18th ACM Symposium on Principles of Database Systems (PODS 1999) (pp. 261-272).

Mokbel, M.F., Ghanem, T.M., & Aref, W.G. (2003). Spatiotemporal access method. IEEE Data Engineering Bulletin, 26(2), 40-49.

Nascimento, M.A., Silva, J.R.O., & Theodoridis, Y. (1999). Evaluation of access structures for discretely moving points. Proceedings of the International Workshop on Spatio-Temporal Database Management (STDBM 1999) (pp. 171-188).

Nascimento, M.A., & Silva, J.R.O. (1998). Towards historical R-trees. Proceedings of the ACM Symposium on Applied Computing (SAC 1998) (pp. 235-240).


Papadopoulos, D., Kollios, G., Gunopulos, D., & Tsotras, V.J. (2002). Indexing mobile objects using duality transforms. IEEE Data Engineering Bulletin, 25(2), 18-24.

Pfoser, D. (2002). Indexing the trajectories of moving object. IEEE Data Engineering Bulletin, 25(2), 3-9.

Pfoser, D., Jensen, C., & Theodoridis, Y. (2000). Novel approaches to the indexing of moving object trajectories. Proceedings of the 26th International Conference on Very Large Databases (VLDB 2000) (pp. 189-200).

Raptopoulou, K., Vassilakopoulos, M., & Manolopoulos, Y. (2004). Towards Quadtree-based moving objects databases. Proceedings of the 8th East-European Conference on Advances in Databases and Information Systems (ADBIS 2004), Budapest, Hungary (pp. 230-245).

Saltenis, S., Jensen, C.S., Leutenegger, S.T., & Lopez, M.A. (2000). Indexing the positions of continuously moving objects. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2000) (pp. 331-342).

Salzberg, B., & Tsotras, V.J. (1999). Comparison of access methods for time-evolving data. ACM Computing Surveys, 31(2), 158-221.

Sellis, et al. (Eds.). (2003). Access methods and query processing techniques. Spatiotemporal databases: The ChoroChronos approach (pp. 169-217). Springer-Verlag.

Tao, Y., Faloutsos, C., Papadias, D., & Liu, B. (2004). Prediction and indexing of moving objects with unknown motion patterns. Proceedings of the ACM Conference on the Management of Data (SIGMOD 2004), Paris (pp. 611-622).

Tao, Y., & Papadias, D. (2001a). Efficient historical R-trees. Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM 2001) (pp. 223-232).

Tao, Y., & Papadias, D. (2001b). MV3R-Tree: A spatio-temporal access method for timestamp and interval queries. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001) (pp. 431-440).

Tao, Y., Papadias, D., & Sun, J. (2003). The TPR*-Tree: An optimized spatio-temporal access method for predictive queries. Proceedings of the 29th International Conference on Very Large Data Bases (VLDB 2003) (pp. 790-801).

Tayeb, J., Ulusoy, O., & Wolfson, O. (1998). A Quadtree based dynamic attribute indexing method. The Computer Journal, 41(3), 185-200.


Theodoridis, Y., Vazirgiannis, M., & Sellis, T. (1996). Spatio-temporal indexing for large multimedia applications. Proceedings of the 3rd IEEE International Conference on Multimedia Computing and Systems (ICMCS 1996) (pp. 441-448).

Tzouramanis, T., Vassilakopoulos, M., & Manolopoulos, Y. (2003). Overlapping linear Quadtrees and spatio-temporal query processing. The Computer Journal, 43(4), 325-343.

Tzouramanis, T., Vassilakopoulos, M., & Manolopoulos, Y. (2004). Benchmarking access methods for time-evolving regional data. Data & Knowledge Engineering, 49(3), 243-286.

Xu, X., Han, J., & Lu, W. (1990). RT-Tree: An improved R-tree indexing structure for temporal spatial databases. Proceedings of the International Symposium on Spatial Data Handling (SDH 1990) (pp. 1040-1049).

KEY TERMS

Access Method or Indexing: A technique of organizing data that allows the efficient retrieval of data according to a set of search criteria. R-trees and Quadtrees are two well-known families of such techniques.

Constrained (Unconstrained) Movement: Movement (of a moving object) that is (is not) confined according to a set of spatial restrictions.

Movement in Transportation Networks: Movement (of a moving object) that is confined on a transportation network (such as rails, or roads).

Moving Object or Moving Point: A data element that is characterized by its position in space that varies in the course of time (this is a kind of spatio-temporal datum).

Spatio-Temporal Data: Multidimensional data, like points, line segments, regions, polygons, volumes, or other kinds of geometric entities that vary in the course of time.

Spatio-Temporal Database Management System: A Database Management System that offers spatio-temporal data types and is able to store, index, and query spatio-temporal data.

Spatio-Temporal Query: A set of conditions embedding spatial and temporal relationships that define the set of spatio-temporal data to be retrieved.

Trajectory: The track followed by a moving object in the course of time (due to the change of its position).
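The terms above can be made concrete with a small sketch. It assumes a trajectory is stored as a list of timestamped samples and that a spatio-temporal query is a rectangular window combined with a time interval; the names `Sample` and `in_window` are hypothetical and used only for illustration. The sketch tests sampled positions only and does not interpolate between them.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One observed position of a moving point (illustrative type)."""
    t: float
    x: float
    y: float


def in_window(trajectory, x_range, y_range, t_range):
    """Return the samples inside the rectangle during the time interval."""
    (x0, x1), (y0, y1), (t0, t1) = x_range, y_range, t_range
    return [
        s for s in trajectory
        if t0 <= s.t <= t1 and x0 <= s.x <= x1 and y0 <= s.y <= y1
    ]


# A moving point sampled at four time instants.
trajectory = [Sample(0, 0, 0), Sample(1, 1, 1), Sample(2, 2, 2), Sample(3, 3, 3)]
hits = in_window(trajectory, (0.5, 2.5), (0.5, 2.5), (0, 1.5))
```

A linear scan like this is what an access method such as an R-tree or a Quadtree is meant to avoid on large data sets.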


ENDNOTE


1 Supported by the ARCHIMEDES project 2.2.14, «Management of Moving Objects and the WWW», of the Technological Educational Institute of Thessaloniki (EPEAEK II), co-funded by the Greek Ministry of Education and Religious Affairs and the European Union.


Storing XML Documents in Databases

Albrecht Schmidt

Aalborg University, Denmark

Stefan Manegold

CWI, The Netherlands

Martin Kersten

Center for Mathematics and Computer Science, The Netherlands

INTRODUCTION

Ever since the Extensible Markup Language (XML) (W3C, 1998b) began to be used to exchange data between diverse sources, interest has grown in deploying data management technology to store and query XML documents. A number of approaches propose to adapt relational database technology to store and maintain XML documents (Deutsch, Fernandez & Suciu, 1999; Florescu & Kossmann, 1999; Klettke & Meyer, 2000; Shanmugasundaram et al., 1999; Tatarinov et al., 2002; O’Neil et al., 2004). The advantage is that the XML repository inherits all the power of mature relational technology like indexes and transaction management. For XML-enabled querying, a declarative query language (Chamberlin et al., 2001) is available.

Traditionally, database technology has been offering support for processing large amounts of data. Recent research has provided valuable insights into the nature of semistructured and XML data and has attempted to integrate them into existing paradigms. However, there are still challenges that have to be met to scale XML databases up to production levels as achieved by relational engines and, thus, to gain acceptance among practitioners. Naturally, XML warehouses inherit the power of relational warehouses (Roussopoulos, 1997), but they also face the same challenges; in particular, update and consistency problems of materialized, replicated, and aggregated views over source data need to be solved.

This article discusses techniques related to loading XML documents into a document warehouse. All techniques build on well-understood relational database technology and enable efficient management of large XML repositories. To get the most out of relational database systems, we propose to do away with the pointer-chasing tree-traversal operations, which many applications generate in the form of edit scripts, and to replace them with set-oriented operations. Edit scripts (Chawathe et al., 1996; Chawathe & Garcia-Molina, 1997) have long been known in text databases and are similar in behavior to Document Object Model (DOM) (W3C, 1998a) traversals, which are standard in the XML world; they tend to put relational technology at a disadvantage due to their excessive use of pointer-chasing algorithms. We investigate the use of these scripts and propose alternative strategies for cases when they perform poorly.
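The contrast between the two styles can be illustrated with a minimal sketch. It is not the Monet implementation: it uses Python's standard `xml.etree.ElementTree` and `sqlite3` modules as stand-ins, and the table name `saturation`, the sample documents, and both function names are assumptions made for the example. The first function patches each tree node by node, as an edit script would; the second expresses the same correction as one set-oriented update over a relation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two tiny illustrative documents (hypothetical data).
DOCS = [
    '<image key="134"><colors><saturation> 0.390 </saturation></colors></image>',
    '<image key="135"><colors><saturation> 0.780 </saturation></colors></image>',
]


def normalize_by_traversal(docs, factor):
    """Edit-script style: walk every tree and patch matching nodes one by one."""
    trees = []
    for text in docs:
        root = ET.fromstring(text)
        for node in root.iter("saturation"):
            node.text = f"{float(node.text) * factor:.3f}"
        trees.append(root)
    return trees


def normalize_set_oriented(docs, factor):
    """Set-oriented style: load the values into one relation, then update it once."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE saturation (oid INTEGER PRIMARY KEY, value REAL)")
    rows = []
    for text in docs:
        root = ET.fromstring(text)
        rows.extend((float(n.text),) for n in root.iter("saturation"))
    con.executemany("INSERT INTO saturation (value) VALUES (?)", rows)
    # A single statement replaces the per-node patching loop.
    con.execute("UPDATE saturation SET value = value * ?", (factor,))
    query = "SELECT value FROM saturation ORDER BY oid"
    return [round(v, 3) for (v,) in con.execute(query)]
```

In the second variant the database engine decides how to scan and update the relation, which is the kind of work relational systems are optimized for.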

We implemented our ideas in the XML extension of the Monet Database System (Schmidt, Kersten & Windhouwer, 2001; Schmidt et al., 2000). A more detailed description of our experiments is found in Schmidt and Kersten (2002). When we benchmarked the system's performance, it turned out that the use of edit scripts is sensible only if they update a rather small fraction of the database; once a certain threshold is exceeded, the replacement of a complete database segment is preferable. We discuss this threshold and try to quantify the trade-off for our example document database.

The application scenario that motivates our research consists of a set of XML data sources, which are feature detectors that monitor multimedia data sources and analyze their content. The detectors feed protocols of analyses into a central data warehouse. The warehouse provides the following services: (1) insertion of a document (a data source transmits a single protocol of an analysis to the warehouse), (2) insertion of versioned sets of documents (a set of check-out points transmits the result of a bulk analysis transcript to the warehouse), (3) deletion of documents and sets of documents (a document is deleted from the warehouse because it has become invalid or stale; duplicate analyses and erroneous insertions also happen frequently and need to be corrected), and (4) execution of edit scripts that are transmitted from the sources and systematically correct errors in already inserted documents; for example, a posteriori normalization of feature values is frequently required.

Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

Figure 1. Example document

    <image key="134" source="/cdrom/img1/293.jpeg">
      <date> 999010530 </date>
      <colors>
        <histogram> 0.399 0.277 0.344 </histogram>
        <saturation> 0.390 </saturation>
        <version> 0.8 </version>
      </colors>
    </image>

While we regard (1) as a special case of (2), hence, do not treat it separately, there is an obvious trade-off between a combination of (2) and (3) and the use of edit-scripts (4). More precisely, the question is: When is it cheaper to delete invalid data and reinsert a new consistent version than to use an edit script to “patch” the warehouse? This and other questions will be dealt with in detail later.
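The shape of this trade-off can be sketched with a toy cost model. Everything in it is an assumption made for illustration: the cost constants are invented rather than measured, the function names are hypothetical, and the model simply charges a fixed price per edit operation and a fixed price per tuple for deleting and bulk-reloading a segment.

```python
def edit_script_cost(n_ops, cost_per_op):
    """Cost of patching: each edit operation is paid for individually."""
    return n_ops * cost_per_op


def replace_cost(segment_size, cost_per_tuple):
    """Cost of delete plus bulk reinsert: every tuple is touched twice."""
    return 2 * segment_size * cost_per_tuple


def break_even_fraction(cost_per_op, cost_per_tuple):
    """Fraction of updated tuples above which replacement becomes cheaper."""
    return min(1.0, 2 * cost_per_tuple / cost_per_op)


def cheaper_strategy(segment_size, n_ops, cost_per_op=10.0, cost_per_tuple=1.0):
    """Pick the cheaper option under the assumed (made-up) unit costs."""
    patch = edit_script_cost(n_ops, cost_per_op)
    reload_ = replace_cost(segment_size, cost_per_tuple)
    return "edit script" if patch <= reload_ else "replace segment"
```

With these invented constants the break-even point lies at 20% of the segment; a real threshold would have to come from measurements on the system in question.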

BACKGROUND

XML documents are commonly represented as syntax trees. This section recalls some of the usual terminology we need to work with XML documents. In the sequel, string and int denote sets of character strings and integers, respectively; oid denotes a set of unique object identifiers. Figure 1 shows an XML fragment, which is taken from the area of content-based multimedia retrieval (Schmidt & Kersten, 2002). Figure 2 displays the corresponding schema tree (dotted arrows indicate XML attribute relationships, straight lines XML element relationships).

Figure 2. Schema tree of example document

Before we discuss techniques for storing a tree as a database instance, we introduce the notion of associations. They are used to cluster semantically related information in a single relation and constitute the basis for the Monet XML Model. The aim of the clustering process is to enable efficient scans over semantically related data, that is, data with the same element ancestry; such scans are the physical backbone of declarative associative query languages like SQL. Different types of associations play different roles: associations of type oid × oid represent parent-child relationships. Both kinds of leaves, attribute values and character data, are modeled by associations of type oid × string, while associations of type oid × int are used to keep track of the original topology of a document. Paths describe the context of an element in the graph relative to the root node; we identify with path(o) the type of the association (·, o). The set of all paths in a document is called its Path Summary; it plays an important role in our query engine. The main rationale for the path-centric storage of documents is to evaluate the ubiquitous XML path expressions efficiently; the high degree of semantic clustering distinguishes our approach from other mappings (see Florescu & Kossmann, 1999 for a discussion). Our approach is to store all associations of the same “type” in one binary relation. A relation that contains the tuple (·, o) is named R(path(o)). In Figure 2, the types or paths are the i. Clustering XML elements by their type implies that we do not have to cope with many of the irregularities induced by the semi-structured nature of XML, which are typically taken care of with NULLs or overflow tables (Deutsch et al., 1999). In the sequel, we describe the machinery we need to convert documents to Monet format and bulk load them efficiently. Also note that we are able to reconstruct the original document given this path-centric representation. A detailed discussion of the reconstruction can be found in Schmidt et al. (2001). We remark that we can also access the documents in an object-oriented manner, that is, with an object as a node in the syntax tree, which is often more intuitive to the user and is adopted by standards like the DOM (W3C, 1998a). However, we do not optimize for this, as we will see later.
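The path-centric mapping can be sketched on the document of Figure 1. This is a simplified illustration rather than the Monet format: oids are assigned in document order, character data is attached to its enclosing element instead of receiving its own node, and the relation names (`path`, `path[attr]`, `path/cdata`, `path[rank]`) as well as the function `shred` are assumptions chosen for readability.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

FIGURE_1 = """<image key="134" source="/cdrom/img1/293.jpeg">
  <date> 999010530 </date>
  <colors>
    <histogram> 0.399 0.277 0.344 </histogram>
    <saturation> 0.390 </saturation>
    <version> 0.8 </version>
  </colors>
</image>"""


def shred(xml_text):
    """Map a document to binary relations, one relation per path."""
    relations = defaultdict(list)
    next_oid = count(1)

    def visit(elem, parent_oid, parent_path, rank):
        oid = next(next_oid)
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        if parent_oid is not None:
            relations[path].append((parent_oid, oid))          # oid x oid
            relations[path + "[rank]"].append((oid, rank))     # oid x int
        for name, value in elem.attrib.items():
            relations[f"{path}[{name}]"].append((oid, value))  # oid x string
        text = (elem.text or "").strip()
        if text:
            relations[path + "/cdata"].append((oid, text))     # oid x string
        for position, child in enumerate(elem, start=1):
            visit(child, oid, path, position)

    visit(ET.fromstring(xml_text), None, "", 1)
    return dict(relations)


relations = shred(FIGURE_1)
path_summary = sorted(relations)  # the set of all paths in the document
```

Because every relation holds only associations of one path, a path expression such as image/colors/saturation touches a single small relation instead of the whole document.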

XML WAREHOUSES

Populating the XML Warehouse

There are two basic notions of interest that we are going to discuss in this section, as indicated in the introduction: populating a database from scratch, that is, bulk load, and incremental insertion of new data into an already existing database. However, similar technology underlies both cases. Let us consider an example first. There are two standard ways of accessing XML documents: (1) A low-level, event-based interface called SAX (Megginson, 2001) scans an XML document for tokens like start tag, end tag, character data, and so forth, and invokes user-supplied functions for each token that is encountered in the input. The advantage of SAX parsers is that they require only minimal resources to work;
