Traditional databases store relatively simple data and are snapshots of data at a certain point in time. Large classes of applications (e.g., geographic information systems, location-based services, inventory control systems, real-time traffic information systems, fleet management systems, monitoring, graphic and simulation systems, historical data warehouses, multimedia systems) require the management of data about objects that have more sophisticated characteristics: they have spatial features (location), they evolve over time (temporal), the objects move, and objects may consist of multiple media types (video, audio, text, images). Our research addresses the development of database technology to fulfill these requirements (we discuss work on managing text to the next section since it is a major research activity).
Within the context of spatial databases, our efforts in the past few years have been concentrated on the indexing structures and query evaluation algorithms for these applications. Two classes of queries are investigated: buffer queries [3] and route queries ([4], [5], [6], [7]).
Buffer queries involve two spatial data sets and a distance d. The answer to a buffer query is a set of pairs of objects, one from each input set, that are within distance d of each other. Two objects are within d of each other precisely when their minimum distance is. As buffer queries concerning point objects are trivial, we concentrate on non-point objects. An efficient spatial-join algorithm is derived and is shown to outperform the brute-force over all cache sizes, data sets and buffer distances.
Route queries deal with a large graph. Given such a graph, many useful optimal route queries can be asked. An important subclass of route queries is to find a shortest path (SP) between two points in a graph. Given the size of the network, disk-based SP algorithms are required. Existing approaches have some serious drawbacks and do not scale well with large graphs. We derive a framework for storing and manipulating a large graph. In this framework, an arbitrary large graph is first partitioned, with the help of a partitioning algorithm, into a collection of fragments that are stored on disk. To facilitate query processing, shortest distances of some nodes in each fragment are pre-determined. The materialized data are stored on disk, and are accessed via a disk-based data structure. Based on this framework, we derive a fast disk-based SP query evaluation algorithm. Extensive experiments show that the proposed algorithm is scalable and efficient; its evaluation time is a fraction of well-known main memory Dijkstra algorithm ([4], [5], [7]).
Due to the size of a graph, some materialization of path or distance information is essential if a route query is to be evaluated efficiently. For our SP algorithm, the storage requirement of materialized shortest distance data is relatively small. However, the time to maintain this data could be long if edge weights in a network are changed. The problem faced by a route query evaluation algorithm is that the re-computation of materialized data must be done efficiently, preferably in real-time. To aggravate the problem, the percentage of edge weights changed in a fragment is unknown, and the number of shortest distance computations is relatively large. We study existing dynamic SPT algorithms. We discover that many of them either process a single edge weight update or fail to correctly process the multiple edge weight updates. We correct and extend several existing SPT algorithms to handle multiple edge weight updates. We show that there is no universally superior algorithm. Rather, under different circumstances, different algorithms should be used to solve the dynamic SPT problem when a set of edge weights is updated [6].
In related work we investigate generalized tools that assist implementation of spatial index structures. The large number of these structures makes it almost impossible to implement all of them inside a database engine. Moreover, these structures are not optimized for disk operations and hence are not database-friendly. We proposed and implemented a spatial index generalization framework, SP-GiST ([1], [2]). The framework identifies one core implementation of a wide class of spatial indexes and a simple user-defined interface to realize the individual structures. SP-GiST allows for supporting spatial indexing capabilities inside database engines with one core implementation. Besides developing the framework, the project addresses issues related to optimizing SP-GiST for disk accesses through clustering and concurrency control algorithms.
In moving object databases, we focus on indexing issues as well, considering two types of databases: historical and predictive databases. Historical databases deal with the past movement or trajectory of moving objects, whereas predictive databases concern with the current and near-future locations of moving objects.
In historical databases, indexes must be scalable and be able to accommodating different query classes. Moreover, as update rates are high, the index must be update-efficient as well. Existing indexes (there are several dozen of them) either do not scale well or are specialized for specific query classes. We propose an index, which we call TSL-trees, for trajectory data. We show experimentally that TSL-tree is scalable, update-efficient, and query-efficient for various query classes, including range and trajectory-based queries [17].
In a predictive database, the location of a moving object is constantly updated to reflect its movement. A database system needs to process hundreds or even thousands of updates in each update cycle. Needless to say, scalability under frequent updates is paramount. The scalability problem becomes more difficult in predictive databases, in which queries have to be processed in real-time. It has been shown in existing work that frequent updates will improve the query performance. However, frequent updates, at the same time, deteriorate the query retrieval performance, if queries are concurrently executed. We are looking into novel ways to group moving objects together, and to batch-process the updates in a predictive database.
Another issue that we investigate related to the moving objects is similarity search based on the trajectories recorded by these objects. The trajectories are stored as multi-dimensional data and similarity search over these stored trajectories involve finding and retrieving those trajectories that are "similar" to some sample query trajectory (note that the problem can be cast in terms of time series data which are one-dimensional versions of trajectory data). This problem arises in an important class of applications. For example, using a remote sensing system, and by mining the trajectories of animals in a large farming area, it is possible to determine migration patterns of certain groups of animals. In sports videos, it is quite useful for coaches or sports researchers to know the movement patterns of top players. In a store surveillance video monitoring system, finding the movement patterns of customers may help in the arrangement of merchandise. In a financial data analysis application, finding the sales volume data that have some specific patterns may help marketing people predict sale cycle patterns. All these applications are closely related to finding similar time series or trajectory data from a database. For example, the following query is quite common in a stock analysis system: "Give me all stock data from last month that is similar to IBM's stock data from the same period."
Based on the search criteria, similarity-based queries over time series and trajectory data are classified into two types:
Pattern existence queries : Given a time series or trajectory database D and a query pattern QP, find all the time series or trajectories in D that contain the specified pattern QP. In these queries, as long as the time series or trajectory data have the specified pattern, they will be retrieved, no matter when the pattern appears and how it appears. For example, "Give me all the stock data of last month that have a head and shoulder pattern"; or "Give me all the temperature data of patients in the last 24 hours that have two peaks."
Shape match queries : Given a time series or trajectory database D and a query time series or trajectory Q, find all the time series or trajectories in D that have a movement shape similar to that of Q. Shape match queries look for the time series data that have similar movement shapes. For example, "Give me all stock data of last month that is similar to the IBM's stock data of last month".
The unique aspects of our work in this area are that we consider both types of queries and that we address environments where data may be impure (i.e., contain local time shifting and noise). We focus on two key issues, namely the design of a distance function and the development of techniques to improve the retrieval efficiency. The major contributions we have thus far made are the following:
A novel representation of time series data, called multi-scale time series histograms (MSTSH), was proposed to answer two types of queries over time series data: pattern existence queries and shape match queries ([8], [12]). The presence of multiple scales (levels) enables querying at multiple precisions. Most importantly, the distances of time series histograms at lower scales are lower bounds of the distances at higher scales, which guarantees that no false dismissals will be introduced when a multi-step filtering process is used to answer shape match queries. The evaluation algorithm uses averages of time series histograms to reduce the dimensionality and avoid computing the distances of full time series histograms.
A metric distance function, called Edit distance with Real Penalty (ERP), was designed that can support local time shifting [10]. In many applications, such as speech recognition and music retrieval, sequences are required to do a flexible shifting of the time axis to accommodate sequences that are similar but out of phase. In these applications, Euclidean distance cannot be applied. Even though Dynamic Time Warping (DTW) can be used in these cases, it does not follow triangle inequality, which makes it non-indexable by traditional access methods. ERP is proposed to handle local time shifting issues and it is proved to be a metric distance function, which makes it indexable. Benchmark results show that ERP is natural for time series data.
A framework is proposed to index time series or trajectory data under a metric distance function [10]. Given a metric distance function, besides applying dimensionality reduction techniques and lower bounding the distances, triangle inequality can also be used to prune the search space. The indexing framework we propose applies both lower bounding and triangle inequality, and performs better than applying only lower bounding or triangle inequality.
A second distance function, called Edit Distance on Real sequence (EDR), is defined to measure the similarity between two time series or trajectory data that contain local time shifting and noise [9]. EDR is based on edit distance on strings. EDR is more robust than Euclidean distance, DTW, and ERP, and is more accurate than Longest Common Subsequence (LCSS) when it is used to measure the similarity between time series or trajectories that contain noise and local time shifting. Three pruning techniques are developed - mean value Q-grams, near triangle inequality, and histograms - to improve the retrieval efficiency of EDR. Unlike the pruning methods proposed for other distance measures, these three pruning methods do not require setting constraints on warping length (or matching region) between two trajectories and, therefore, offer users more flexibility.
Our work in temporal databases focuses on the problem of data change over time and becoming out-of-date. In particular, despite advances in storage technology, the accumulation of large amounts of data over time can easily exceed the ability of the application or the application users/owners to store the accumulating data in perpetuity. Also, even if the growth of the actual physical storage systems for the data was able to keep up with the data accumulation, the sheer volume of data that needs to be examined by information requests (queries and updates) of an application can often lead to an unacceptable performance deterioration.
There are two main approaches to combat the space and performance implications of managing a collection of data growing over time ([18], [19], [20]). Approaches based on lossless storage of the data collection (the information that is actually stored by the information system is sufficient to reconstruct the original data set) commonly resort to utilizing specialized data structures, and/or compression techniques to alleviate the time needed to access or modify the data and the space required to store the data set. However, in the worst case, lossless techniques face the same problems that naive storage of the whole data set faces: the space and time required to store and manipulate the data grows proportionally to the length of the time period over which the data is collected.
Alternatives to lossless approaches are strategies that allow for selective forgetting and removal (expiration) of parts of the accumulated data. Such strategies are called lossy. Indeed, in many situations, lossy strategies are mandatory for the information system due to application-specific requirements (e.g., for legal or other reasons the system may have to remove all information beyond a certain time limit). Selective removal of parts of the data set (often called data expiration) naturally leads to several questions:
How does the system determine which parts of the data set to retain and which parts to forget (expire)? What is the computational cost associated with the decision?
How does the expiration process affect future information requests of the application(s) querying/updating the data set (e.g., are answers to queries over the original data set preserved by the new residual data set obtained by the expiration process)?
How much space is needed to store the residual data set? In particular, is the space required significantly smaller than storing the whole set (formally, bounded by O(n) function after n units of time elapse - note that this cannot be the case for any lossless approach. In particular, compression-based approaches space requirements are O(n) even assuming arbitrarily high, but fixed compression ratio for all data we collect.)?
Most of the work on data expiration has focused on specifying which parts of the data set should be removed. In particular, in the policy-based approach to data expiration, the parts of the data set are identified by vacuuming specifications that are temporal relational algebra-like expressions that identify tuples that are to be removed/retained in the residual data set (the data set is in this work is assumed to be represented in a temporal relational data model of TSQL2 or SQL/Temporal). The policy-based approaches place the burden of deciding what parts of the data set to expire solely on the database administrator who is then responsible for ensuring that data still needed by an application is never removed.
The goal of our research is to study logical approaches to data expiration: approaches that, based on an analysis of information requests of applications, automatically detect what parts of the data set can be safely removed. The logical data expiration approach can be summarized as follows: Given a fixed application or an application suite issuing a fixed set of possibly parametric queries and update against a data set collected over time, what parts of the data can be removed without the application noticing the difference?
It is important to remember that, from the application's point of view, there must not be any observable difference between the full and the residual data sets both immediately after the expiration process finishes and as more data is collected and added to the data set in the future. In addition to proposing new techniques for data expiration, the proposed research aims to also provide theoretical guarantees for the proposed operators.
Finally, we have ongoing work on video databases. We investigate issues related to the modeling, representation and efficient retrieval of video objects. Our video data model ([13], [16]) is based on video segmentation and salient objects. Video segmentation-based models consider a video database consisting of three types of video sequences: videos, scenes, and shots. The content of each video sequence is captured through selected key frames, which are specializations of images. We have worked on the detection of scenes using rule-based approach [15], which was enhanced by the use of audio cues [14]. Salient object-based models extract from video objects of interest along with their audio-visual features and spatio-temporal relationships among them. The combination of the two approaches in one video model allows the capture of a rich set of video characteristics. Within the context of this model, we have investigated indexing structures that can capture the rich video content [11].
Our current work in video databases aims to address the "semantic gap" between what the multimedia content represents and the information that is used to index it (annotations or low-level features). In order to bridge this gap, semantic analysis of videos is indispensable. In other research domains such as natural language processing, speech recognition, face detection/recognition, character recognition, etc., semantic information is accessible by recent technologies based on the use of corpora. These corpora reflect the variation and distribution of the actual information, e.g., newspaper corpus for text analysis, spoken passage corpus for speech recognition, face database and hand-written text corpus for face and character recognition. Semantic content analysis based on corpus may achieve robust and practical performance.
Our research will address corpus-based video indexing. The corpus-based video indexing is expected to achieve a number of desirable properties: robust semantic analysis, adaptable to actual variation of videos, free from burdensome manual tuning, etc. This raises a number of research challenges in various domains including computer vision, machine learning, artificial intelligence, and database. Some of these challenges that we will address are the following:
Good Features. What type of features can lead to effective video retrieval: point-based image-based frequency domain or interest point detector? The "good features" for videos should be discriminative and invariant.
Generalization/Learning. The objective is to use machine learning methods and high-dimensional clustering techniques to extract general concepts from the corpus. Main idea of the corpus-based video indexing is to automatically extract concepts from sets of videos in the corpus. To do so, we need to generalize the set of videos, i.e., to find common components, common attributes, common structures, etc., from the set of videos corresponding to the concept. Once discriminative features are extracted from videos, machine learning techniques can be adopted to generalize a set of videos to extract common structures, i.e., concepts.
Scalability issues. For proper learning, a huge dataset should be gathered to compensate for noise, ambiguity, etc. This will necessitate novel indexing techniques to quickly retrieve videos. A further complication is that the descriptors of the videos are high-dimensional vectors that can possibly lead the problem known as curse of dimensionality. We will extend (and in some cases investigate the application of techniques developed in our group) to address multi-dimensional indexing without falling victim to the curse of dimensionality.
Multimodality. Multimedia data is multimodal (i.e., contain different media types such as text, audio, image, video). Each of these are encoded differently and hence handled differently. A typical multimedia system is in fact a set of independent sub-systems specific for each of the media types. As more applications demand that data from different media types be managed in the same system (e.g. news, entertainment, e-learning, e-commerce), there is a natural need to (logically) structure and access these media types together. The usual approach is to bridge the independents sub-systems by adding an application layer that interacts separately with the sub-systems while providing a unique interface to the user. Humans normally reason in term of events in their lives, and during these events pictures, videos, documents etc. are used to record specific moments. An event occurs at a certain place at a certain moment and all the data (image, text, video) are about the event. We will work on event models as an extension of object-oriented models that can provide an integrated framework for storing and querying multimedia data. An event can be viewed as composed of one or more real-life objects in addition to time and space.

Database Research Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208
Contact | Feedback: db-webmaster@cs.uwaterloo.ca | Database Research Group