Research

My research is on data management. Although I have done work in basic database technologies such as query processing, transaction processing, and database integration, the main focus of my research follows two threads: (1) application of database technology to non-traditional data types, and (2) distributed data management.

Data management(X)

Investigating how database technology can be applied to data types that are more complex than business data processing (for which relational systems are the perfect fit) has always been one of my interests. Frank Tompa characterizes this type of work as Data Management(X) where X is the data type of interest -- we also have a graduate course with exactly this focus (CS 741). At different times in my career, X has been equal to one or more of the following: {"object" data (in the sense of object databases), multimedia data, temporal data, spatial data, XML, stream data}. Currently, my focus is on graph data and RDF data.

Graph data management

Graphs have always been important data types for database researchers. With the recent growth of social networks, Wikipedia, Linked Data, RDF, and other networks, the interest in managing very large graphs have again gained momentum. The ongoing work in this area (with PhD student Ning Zhang) focuses on generalized query models over property graphs and their efficient execution.

In many applications, graph data can be modeled as a property graph where each node and each edge may have a label and arbitrary key/value pairs as properties and problems are related to identifying patterns in the graph, i.e., identifying sub-graphs that match a particular topology and/or some features. However, most of the work in the database community focus on simple data graphs. Similarly, the common graph queries that have been the focus of study are of the following types: reachability queries, shortest distance path queries, and subgraph containment/match queries. Each of these query types match a particular application, but individually they are not sufficiently powerful or general to serve as a general purpose graph query language (similar to SQL for relational systems). We are working on generalized subgraph matching over property graphs where edge label matching, property matching, reachability, and shortest distance constraints are all incorporated into the query model. Consequently, our focus is on a generic query model over a property graph, which is more general than the node labeled graph that is commonly considered in traditional subgraph match queries. In this query model, node matching predicates can be property constraints and edge matching predicates can be a combination of label constraints, reachability constraints, and shortest distance constraints. We also study a general framework for query processing and new indexing techniques for generalized subgraph match queries.

RDF data management

Resource Description Framework (RDF) has been proposed for modeling Web objects as part of developing the "semantic web". It has also gained attention as a way to accomplish web data integration. For example, the Linking Open Data (LOD) cloud is a distributed RDF knowledge base created over hundreds of autonomous datasets.
Currently, the LOD cloud contains more than 25 billion triples, and its size is doubling every year. As the volume of RDF data has increased, interesting data management issues have arisen. We study the processing of SPARQL queries over RDF data using a graph-theoretic approach: we represent both the RDF data and the SPARQL queries a graphs and conver the query evaluation problem toone of subgraph matching. My work in this area follows two parallel threads.

The first thread, joint with Prof. Lei Chen (Hong Kong University of Science and Technology) and Prof. Lei Zou (Peking University) has developed a global index (called VS*-tree) over the RDF data following the mapping of both node labels and edge labels to bit signatures. The query processing methodology uses a filter-and-evaluate strategy. Given a query graph, VS*-tree is searched for each node of the query graph to generate a list of candidate matching RDF graph nodes. These lists are then joined using the query graph edge labels as join predicates to produce the set of candidate matches for the entire query. The signatures of these matches are then expanded for further and more precise checking. We investigate pruning techniques that ensure efficient generation of candidate matches.A second problem that we investigate is the processing of aggregation queries over large RDF data sets. We propose a processing approach that partitions aggregate queries into smaller parts (called star queries), processes these efficiently, and joins the results of star queries to obtain more general results. We develop indexes to assist in executing star queries and to facilitate joining their results.

The second research thread (with PhD student Güneş Aluç) focuses on the adaptive and distributed management of RDF data. Although some existing techniques address adaptivity to changes in the RDF data, they do not consider adaptivity to the workload changes. Consequently, the underlying database layouts and the indexes created over the databases are fixed a priori, and they cannot easily handle changing workloads. What we are investigating is adaptive RDF management system (RMS) design that can continuously and seamlessly adapt its physical configuration will more likely maintain a steady, optimal performance.

This research begins with the observation that in a graph-based RMS, query performance depends on three components of physical design, namely,

  • how the RDF graph is partitioned,
  • how it is indexed, and
  • how candidate views are generated from the graph and materialized.
Consequently, the first problem is related to adaptively aligning the partitioning with the most recent query patterns in the workload. This incurs three challenges. First, even in the worst case when the partitioning is random, correct answers should be produced, which cannot be achieved with just an arbitrary decomposition of a query. Second, it is not trivial which cost function would accurately represent the performance trade-offs of a given partitioning with respect to a series of query patterns. Third, computing a partitioning that sufficiently minimizes this cost function in linear-time---which is crucial for performance---requires extra investigation.

The second problem is to design indexes that can be maintained adaptively, both when the partitioning is updated and when the query patterns in the workloads change. In this work, we consider indexing at two levels: the structural-level index one organizes the partitions based on their structural similarity, while the instance-level index stores the distinguishing features of each partition. The major challenge in maintaining the structural-level index is to devise an efficient algorithm that can incrementally group similar partitions, even when the partitions are constantly being updated. The second challenge is that the instance-level index is constructed over multi-dimensional features; hence, novel sorting and indexing algorithms need to be developed to support adaptive multi-dimensional indexing.

The third problem relates to identifying subsets of vertices and edges in the graph that are worth materializing. Here, the main challenge is to balance the space/time trade-off: what is desired is to get the best performance out of the materialized views by consuming as small a storage space as possible. There are two aspects of the problem. First, candidate views need to be evaluated to find the view that, if materialized, produces the best space/time benefits. Second, when the workloads change, the materialized views need to be updated in a seamless fashion.

Distributed data management

For each of the data types mentioned above, I have eventually investigated their management in a distributed and parallel setting. Two most recent topics are stream data, and XML data.

Stream data management

Traditional databases have been used in applications that require persistent data storage and complex querying. However, a growing list of emerging applications receive and process data as a sequence (stream) of items. Examples include sensor networks, financial tickers and other on-line Web information sources, transaction log analysis, and Internet traffic measurement. A data stream is a real-time, continuous, ordered sequence of items. Queries over data streams are expected to run continuously and maintain up-to-date answers as new data arrive. As a result, continuous query processing involves many challenges, such as adaptive re-optimization of query plans in response to changing stream arrival rates, sharing resources among similar queries, and generating approximate answers in limited space. Additionally, mining these data poses additional challenges since many mining algorithms require multiple passes over the data while in data stream systems, one geets to see the data only once (unless the data are spooled to disk for later analysis). Two most recent PhD theses and their abstracts are the following.

Sliding Window Query Processing over Data Streams (Lukasz Golab)

Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes sense for users to ask a query once and receive updated answers over time.

This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (time-evolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs
have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be re-engineered for the sliding window model because queries may need to re-process expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries.

The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing.

Mining Time-Changing Data Streams (Yingying Tao)

Streaming data have gained considerable attention in database and data mining communities because of the emergence of a class of applications, such as financial marketing, sensor networks, internet IP monitoring, and telecommunications that produce these data. Data streams have some unique characteristics that are not exhibited by traditional data: unbounded, fast-arriving, and time-changing. Traditional data mining techniques that make multiple passes over data or that ignore distribution changes are not applicable to dynamic data streams. Mining data streams has been an active research area to address requirements of the streaming applications. This thesis focuses on developing techniques for distribution change detection and mining time-changing data streams. Two techniques are proposed that can detect distribution changes in generic data streams. One approach for tackling one of the most popular stream mining tasks, frequent itemsets mining, is also presented in this thesis. All the proposed techniques are implemented and empirically studied. Experimental results show that the proposed techniques can achieve promising performance for detecting changes and mining dynamic data streams.

Distributed XML data management

As XML data volumes increase, their management on a single server becomes problematic. It is, therefore, useful to investigate the distribution of XML data and processing queries over these distributed XML data collections. The latest PhD thesis on this topic is the following.

Distributed XML Query Processing (Patrick Kling)

While centralized query processing over collections of XML data stored at a single site is a well understood problem, centralized query evaluation techniques are inherently limited in their scalability when presented with large collections (or a single, large document) and heavy query workloads. In the context of relational query processing, similar scalability challenges have been overcome by partitioning data collections, distributing them across the sites of a distributed system, and then evaluating queries in a distributed fashion, usually in a way that ensures locality between (sub-)queries and their relevant data. This thesis presents a suite of query evaluation techniques for XML data that follow a similar approach to address the scalability problems encountered by XML query evaluation.

Due to the significant differences in data and query models between relational and XML query processing, it is not possible to directly apply distributed query evaluation techniques designed for relational data to the XML scenario. Instead, new distributed query evaluation techniques need to be developed. Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query processing is proposed.

Based on a data partitioning model that supports both horizontal and vertical fragmentation steps (or any combination of the two), XML collections are fragmented and distributed across the sites of a distributed system. Then, a suite of distributed query evaluation strategies is proposed. These query evaluation techniques ensure locality between each fragment of the collection and the parts of the query corresponding to the data in this fragment. Special attention is paid to scalability and query performance, which is achieved by ensuring a high degree of parallelism during distributed query evaluation and by avoiding access to irrelevant portions of the data.

For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides a number of alternative approaches for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is necessary to predict and compare the expected performance of each of these alternatives. In this work, this is accomplished through a query optimization technique based on a distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is fragmented to the demands of the query workload evaluated over this collection.

To evaluate the performance impact of the distributed query evaluation techniques proposed in this thesis, the techniques were implemented within a real-life XML database system. Based on this implementation, a thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation techniques introduced here lead to significant improvements in query performance and scalability.

Some previous research projects

 
Better organization coming in the near future...
 
 
[University of Waterloo]
University of Waterloo
[Department of Computer Science]
Computer Science
[M. Tamer Özsu's home page]
M. T. Özsu

Copyright © M. Tamer Özsu. All rights reserved.
Last update: January, 2012