26. ICDE 2010: Long Beach, California, USA

Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. IEEE Computer Society

Paper Num: 158 || Session Num: 38

Keynotes 3

1. Large scale data warehousing: Trends and observations.

【Paper Link】 【Pages】:1

【Authors】: Richard Winter ; Pekka Kostamaa

【Abstract】: Summary form only given. How large are data warehouses? How fast are they growing? How big are they going to get? What is driving their growth? Why is all this data of value in commercial enterprises? What can we say about how these large data warehouses are being used? What are some key challenges ahead? In this talk, Richard Winter will share his views and observations concerning these questions and others, based on more than three decades of involvement with commercial data warehouses and their preferences.

【Keywords】: data warehouses; commercial enterprises; data warehouses; large scale data warehousing; Data warehouses; Large-scale systems; Warehousing

2. DBMS: Lessons from the first 50 years, speculations for the next 50.

【Paper Link】 【Pages】:2

【Authors】: Jeffrey F. Naughton

【Abstract】: Some of the major themes in DBMS research appeared in the computer science literature as early as 50 years ago. The community has had a very productive time over the past 50 years exploring these themes, in the process contributing to a major software industry and creating a large and vibrant research community. I will give a subjective and probably highly biased view of these themes and why they have been so persistent, and speculate on how they might continue to persist in the future. While we will probably continue to be productive over the next 50 years as well, there are reasons for concern going forward. I will close with some speculation on what we might do to deal with this.

【Keywords】: DP industry; database management systems; DBMS; computer science literature; software industry; speculations; Computer industry; Computer science

3. How new is the cloud?

【Paper Link】 【Pages】:3

【Authors】: Donald Kossmann

【Abstract】: Summary form only given. Cloud computing has been identified by Gartner as one of the ten most disruptive technologies for the next decade. It has made many promises, and the first products have appeared in the marketplace and are rapidly gaining adoption. Time to step back a bit. This talk first gives an overview of the promises made by cloud computing. Which promises really matter? Which promises were only made because they could be fulfilled? And which promises were only made because they could not be validated? Second, this talk discusses the fundamental limitations, highlighted by the CAP theorem. How bad is it really? Third, this talk discusses alternative architectures for data management in the cloud. What works? What is new? Fourth, this talk addresses the changes application programmers will face. What is realistic? Finally, this talk reports on three years of running a cloud computing start-up. What do users like, what do investors like? What do I like?

【Keywords】: Internet; CAP theorem; cloud computing; data management architectures; Cloud computing; Collaborative work; Computer architecture; Programming profession

KNN Queries 3

4. K nearest neighbor queries and kNN-Joins in large relational databases (almost) for free.

【Paper Link】 【Pages】:4-15

【Authors】: Bin Yao ; Feifei Li ; Piyush Kumar

【Abstract】: Finding the k nearest neighbors (kNN) of a query point, or of a set of query points (kNN-Join), is a fundamental problem in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their applicability to large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses a user-defined function (UDF), which a query optimizer cannot optimize. We design algorithms that can be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the "best" query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees finding the approximate kNN with only a logarithmic number of page accesses in expectation, with a constant approximation ratio, and it can be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large real and synthetic data sets confirm the efficiency and practicality of our approach.

【Keywords】: SQL; learning (artificial intelligence); query processing; relational databases; SQL operators; database engine; k nearest neighbor queries; kNN-joins; query conditions; query optimization; query point; relational databases; spatial databases; stand-alone systems; user-defined-function; Algorithm design and analysis; Application software; Computer science; Design optimization; Engines; Nearest neighbor searches; Optimization methods; Pattern recognition; Relational databases; Spatial databases
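The random-shift idea described in the abstract can be illustrated outside a DBMS. The sketch below is a plain-Python toy, not the paper's SQL formulation: points are keyed by a Morton (Z-order) code, candidates are gathered around the query's position in a few randomly shifted orderings, and candidates are then ranked by exact distance. The parameters `shifts` and `window` are illustrative choices, not values from the paper.

```python
import bisect
import random

def z_value(x, y, bits=16):
    """Morton/Z-order code: interleave the bits of two integer coordinates."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def approx_knn(points, q, k, shifts=3, window=8, seed=0):
    """Gather candidates near the query in each shifted Z-order,
    then rank the candidates by exact squared Euclidean distance."""
    rng = random.Random(seed)
    cand = set()
    for _ in range(shifts):
        dx, dy = rng.randrange(1024), rng.randrange(1024)
        keyed = sorted((z_value(px + dx, py + dy), i)
                       for i, (px, py) in enumerate(points))
        pos = bisect.bisect_left(keyed, (z_value(q[0] + dx, q[1] + dy), -1))
        cand.update(i for _, i in keyed[max(0, pos - window):pos + window])
    dist = lambda i: (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2
    return sorted(cand, key=dist)[:k]
```

In a relational setting, the sorted Z-order scan corresponds to an index range scan, which is what lets an unmodified optimizer produce a good plan.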

5. Quantile-based KNN over multi-valued objects.

【Paper Link】 【Pages】:16-27

【Authors】: Wenjie Zhang ; Xuemin Lin ; Muhammad Aamir Cheema ; Ying Zhang ; Wei Wang

【Abstract】: K nearest neighbor (KNN) search has many applications, including data mining, multimedia, image processing, and monitoring moving objects. In this paper, we study the problem of KNN over multi-valued objects. We aim to provide effective and efficient techniques to identify KNN results that are sensitive to the relative distributions of objects. We propose to use quantiles to summarize relative-distribution-sensitive K nearest neighbors. Given a query Q and a quantile φ ∈ (0, 1), we first study the problem of efficiently computing the K nearest objects based on a φ-quantile distance (e.g., the median distance) from each object to Q. The second problem is to retrieve the K nearest objects to Q based on overall distances in the "best population", with a given size specified by φ-quantile, for each object. While the first problem can be solved in polynomial time, we show that the second problem is NP-hard. A set of efficient, novel algorithms is proposed to give an exact solution for the first problem and an approximate solution, with an approximation ratio, for the second problem. Extensive experiments demonstrate that our techniques are very efficient and effective.

【Keywords】: computational complexity; data mining; K nearest neighbor; KNN over multivalued objects; NP-hard problem; data mining; image processing; monitoring moving objects; multimedia; object relative distributions; φ-quantile distance; Australia; Content based retrieval; Data mining; Image processing; Image retrieval; Information retrieval; Monitoring; Nearest neighbor searches; Neural networks; Polynomials
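The first problem in the abstract (ranking objects by a φ-quantile of their instance-to-query distances) is easy to state concretely. The sketch below is a minimal 1-D illustration of the measure only, with a simple index-based quantile; it is not the paper's efficient algorithm.

```python
def quantile_distance(obj, q, phi):
    """phi-quantile of the distances from an object's instances to query q (1-D)."""
    ds = sorted(abs(v - q) for v in obj)
    idx = min(len(ds) - 1, int(phi * len(ds)))  # simple quantile index
    return ds[idx]

def quantile_knn(objects, q, k, phi=0.5):
    """Rank multi-valued objects by their phi-quantile distance to q."""
    return sorted(range(len(objects)),
                  key=lambda i: quantile_distance(objects[i], q, phi))[:k]
```

With phi = 0.5 this is median-distance KNN, which is robust to a few outlying instances inside an object.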

6. Efficient rank based KNN query processing over uncertain data.

【Paper Link】 【Pages】:28-39

【Authors】: Ying Zhang ; Xuemin Lin ; Gaoping Zhu ; Wenjie Zhang ; Qianlu Lin

【Abstract】: Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. As an important problem in many applications, the KNN query has been extensively investigated in the literature. In this paper, we study the problem of processing rank-based KNN queries against uncertain data. Besides applying the expected rank semantic to compute KNN, we also introduce the median rank, which is less sensitive to outliers. We show that both ranking methods satisfy nice top-k properties such as exact-k, containment, unique ranking, value invariance, stability, and faithfulness. For a given query q, IO- and CPU-efficient algorithms are proposed in the paper to compute the KNN based on the expected (median) ranks of the uncertain objects. To tackle the correlations of the uncertain objects and the high IO cost caused by the large number of instances of the uncertain objects, randomized algorithms are proposed to approximately compute the KNN with theoretical guarantees. Comprehensive experiments are conducted on both real and synthetic data to demonstrate the efficiency of our techniques.

【Keywords】: data acquisition; query processing; temporal databases; uncertainty handling; KNN query processing; environmental surveillance; quantitative economics; uncertain data; value invariance; Accidents; Databases; Economic forecasting; Global Positioning System; Nearest neighbor searches; Neural networks; Query processing; Stability; Surveillance; Temperature sensors
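The expected-rank semantic mentioned above has a direct brute-force reading: an object's expected rank is the expected number of other objects that land closer to the query. The sketch below assumes independent discrete uncertain objects given as (value, probability) instance lists; it illustrates the semantics only, not the paper's IO/CPU-efficient or randomized algorithms.

```python
def expected_rank(objects, q):
    """objects: list of instance lists [(value, prob), ...], probs summing to 1.
    Expected rank of object i = sum over j != i of P(object j is closer to q)."""
    def closer_prob(a, b):
        # P(dist(a, q) < dist(b, q)) for independent objects a and b
        return sum(pa * pb
                   for va, pa in a for vb, pb in b
                   if abs(va - q) < abs(vb - q))
    n = len(objects)
    return [sum(closer_prob(objects[j], objects[i]) for j in range(n) if j != i)
            for i in range(n)]

def rank_knn(objects, q, k):
    """Return the k objects with the smallest expected rank."""
    r = expected_rank(objects, q)
    return sorted(range(len(objects)), key=lambda i: r[i])[:k]
```

For certain (single-instance) objects this degenerates to ordinary distance ranking, which is a useful sanity check.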

Distributed Data 3

7. Reliable storage and querying for collaborative data sharing systems.

【Paper Link】 【Pages】:40-51

【Authors】: Nicholas E. Taylor ; Zachary G. Ives

【Abstract】: The sciences, business confederations, and medicine urgently need infrastructure for sharing data and updates among collaborators' constantly changing, heterogeneous databases. The ORCHESTRA system addresses these needs by providing data transformation and exchange capabilities across DBMSs, combined with archived storage of all database versions. ORCHESTRA adopts a peer-to-peer architecture in which individual collaborators contribute data and compute resources, but where there may be no dedicated server or compute cluster. We study how to take the combined resources of ORCHESTRA's autonomous nodes, as well as PCs from "cloud" services such as Amazon EC2, and provide reliable, cooperative storage and query processing capabilities. We guarantee reliability and correctness as in distributed or cloud DBMSs, while also supporting cross-domain deployments, replication, and transparent failover, as provided by peer-to-peer systems. Our storage and query subsystem supports dozens to hundreds of nodes across different domains, possibly including nodes on cloud services. Our contributions include (1) a modified data partitioning substrate that combines cluster and peer-to-peer techniques, (2) an efficient implementation of replicated, reliable, versioned storage of relational data, (3) new query processing and indexing techniques over this storage layer, and (4) a mechanism for incrementally recomputing query results that ensures correct, complete, and duplicate-free results in the event of node failure during query execution. We experimentally validate query processing performance, failure detection methods, and the performance benefits of incremental recovery in a prototype implementation.

【Keywords】: distributed databases; groupware; query processing; relational databases; ORCHESTRA system; cloud DBMS; cloud services; collaborative data sharing systems; data exchange; data partitioning substrate; data transformation; database management systems; distributed DBMS; heterogeneous databases; indexing techniques; peer-to-peer architecture; query processing; query subsystem; relational data; storage subsystem; Clouds; Collaboration; Computer architecture; Databases; Indexing; Information science; Peer to peer computing; Personal communication networks; Prototypes; Query processing

8. Strongly consistent replication for a bargain.

【Paper Link】 【Pages】:52-63

【Authors】: Konstantinos Krikellas ; Sameh Elnikety ; Zografoula Vagena ; Orion Hodson

【Abstract】: Strong consistency is an important correctness property for replicated databases. It ensures that each transaction accesses the latest committed database state as provided in centralized databases. Achieving strong consistency in replicated databases is a major performance challenge and is typically not provided, exposing inconsistent data to client applications. We propose two scalable techniques that exploit lazy update propagation and workload information to guarantee strong consistency by delaying transaction start. We implement a prototype replicated database system and incorporate the proposed techniques for providing strong consistency. Extensive experiments using both a micro-benchmark and the TPC-W benchmark demonstrate that our proposals are viable and achieve considerable scalability while maintaining strong consistency.

【Keywords】: replicated databases; TPC-W benchmark; bargain strongly consistent replication; centralized databases; micro benchmark; replicated databases; workload information; Availability; Communication channels; Database systems; Delay effects; Informatics; Propagation delay; Proposals; Prototypes; Scalability; Transaction databases

9. Detecting inconsistencies in distributed data.

【Paper Link】 【Pages】:64-75

【Authors】: Wenfei Fan ; Floris Geerts ; Shuai Ma ; Heiko Müller

【Abstract】: One of the central problems for data quality is inconsistency detection. Given a database D and a set Σ of dependencies serving as data quality rules, we want to identify tuples in D that violate some rules in Σ. When D is a centralized database, there are effective SQL-based techniques for finding violations. It is, however, far more challenging when the data in D is distributed, in which case inconsistency detection often necessarily requires shipping data from one site to another. This paper develops techniques for detecting violations of conditional functional dependencies (CFDs) in relations that are fragmented and distributed across different sites. (1) We formulate the detection problem in various distributed settings as optimization problems, measured by either network traffic or response time. (2) We show that it is beyond reach in practice to find optimal detection methods: the detection problem is NP-complete when the data is partitioned either horizontally or vertically, and when we aim to minimize either data shipment or response time. (3) For data that is horizontally partitioned, we provide several algorithms to find violations of a set of CFDs, leveraging the structure of CFDs to reduce data shipment or increase parallelism. (4) We verify experimentally that our algorithms are scalable on large relations and complex CFDs. (5) For data that is vertically partitioned, we provide a characterization of the CFDs that can be checked locally without requiring data shipment, in terms of dependency preservation. We show that it is intractable to minimally refine a partition to make it dependency preserving.

【Keywords】: SQL; computational complexity; distributed databases; optimisation; CFD; SQL based techniques; conditional functional dependencies; data quality rules; data shipment; distributed data detecting inconsistencies; inconsistency detection data quality; Cities and towns; Databases; Delay; EMP radiation effects; Marine vehicles; OFDM modulation; Optimization methods; Partitioning algorithms; Remuneration; Time measurement
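A CFD and what it means to violate one can be made concrete with a single-site check. The sketch below is a centralized toy (the paper's contribution is doing this when the relation is fragmented across sites); the attribute names and pattern are hypothetical examples, and the pattern format (`'_'` as wildcard) is an assumption for illustration.

```python
def cfd_violations(tuples, lhs, rhs, pattern):
    """Tuples violating a conditional functional dependency (lhs -> rhs, pattern).
    Among tuples matching `pattern` (attribute -> constant, '_' = wildcard),
    equal lhs values must imply equal rhs values; return the offenders."""
    def matches(t):
        return all(v == '_' or t[a] == v for a, v in pattern.items())
    groups = {}
    for t in tuples:
        if matches(t):
            groups.setdefault(tuple(t[a] for a in lhs), []).append(t)
    bad = []
    for g in groups.values():
        if len({tuple(t[a] for a in rhs) for t in g}) > 1:
            bad.extend(g)
    return bad
```

Distributing this check is the hard part: when tuples in the same `lhs` group live on different sites, some data must be shipped, which is exactly the cost the paper's algorithms try to minimize.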

Stream Mining 4

10. Optimal load shedding with aggregates and mining queries.

【Paper Link】 【Pages】:76-88

【Authors】: Barzan Mozafari ; Carlo Zaniolo

【Abstract】: To cope with bursty arrivals of high-volume data, a DSMS has to shed load while minimizing the degradation of Quality of Service (QoS). In this paper, we show that this problem can be formalized as a classical optimization task from operations research, in ways that accommodate different requirements for multiple users, different query sensitivities to load shedding, and different penalty functions. Standard nonlinear programming algorithms are adequate for non-critical situations, but for severe overloads we propose a more efficient algorithm that runs in linear time, without compromising optimality. Our approach is applicable to a large class of queries including traditional SQL aggregates, statistical aggregates (e.g., quantiles), and data mining functions such as k-means, naive Bayesian classifiers, decision trees, and frequent pattern discovery (where we can even specify a different error bound for each pattern). In fact, we show that these aggregate queries are special instances of a broader class of functions, which we call reciprocal-error aggregates, for which the proposed methods apply in full generality. Finally, we propose a novel architecture for supporting load shedding in an extensible system where users can write arbitrary User-Defined Aggregates (UDAs), and we confirm our analytical findings with several experiments executed on an actual DSMS.

【Keywords】: Bayes methods; data mining; load shedding; nonlinear programming; query processing; DSMS; SQL aggregates; aggregate queries; data mining functions; decision trees; extensible system; frequent pattern discovery; high volume data arrivals; k-means; load shedding; mining queries; naive Bayesian classifiers; nonlinear programming algorithms; optimization; penalty functions; quality of service degradation; reciprocal error aggregates; statistical aggregates; user defined aggregates; Aggregates; Computer science; Dairy products; Data mining; Degradation; Intrusion detection; Linear programming; Monitoring; Operations research; Quality of service
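To see why load shedding becomes a classical optimization task, consider one natural (hypothetical) formalization in the reciprocal-error spirit: each query i has error weight w_i that shrinks as its sampling budget x_i grows, and the total budget is fixed. This toy objective and its closed form are an illustration, not the paper's exact program.

```python
import math

def shed_load(weights, budget):
    """Closed-form optimum of: minimize sum_i w_i / x_i  s.t.  sum_i x_i = budget.
    Setting the Lagrangian derivatives equal gives x_i proportional to sqrt(w_i),
    so the whole allocation runs in linear time in the number of queries."""
    s = sum(math.sqrt(w) for w in weights)
    return [budget * math.sqrt(w) / s for w in weights]
```

Queries whose error is more sensitive to shedding (larger w_i) automatically receive a larger share of the budget, but only in proportion to the square root, not linearly.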

11. Scheduling for fast response multi-pattern matching over streaming events.

【Paper Link】 【Pages】:89-100

【Authors】: Ying Yan ; Jin Zhang ; Ming-Chien Shan

【Abstract】: Real-time pattern matching over event streams has gained much attention recently due to the analytical capability demanded in many operation-critical applications such as credit card fraud detection, algorithmic stock trading, and RFID tracking. One of the common but important requirements in the above-mentioned applications is fast response. Usually, a large number of pattern queries are subscribed in the system, running continuously and concurrently. However, not much research has been done on scheduling algorithms and management to improve the overall response time of these queries. To address this challenge, we focus on the study of how to improve the average response time of multiple pattern queries. We first propose two static scheduling algorithms, Event-Based Scheduling (EBS) and Run-Based Scheduling (RBS), and discuss which is the better choice under different system configurations. We then develop a hybrid method called Fast Response Time Scheduling (FRTS) that dynamically manages the scheduling to further reduce the average response time. Experimental results show that the FRTS method can improve the average response time by up to 5 times compared with the basic methods in some cases.

【Keywords】: pattern matching; scheduling; RFID tracking; algorithmic stock trading; average response time; credit card fraud detection; event-based scheduling; fast response multipattern matching; fast response time scheduling; multiple pattern queries; pattern queries; real-time pattern matching; run-based scheduling; scheduling management; static scheduling algorithms; streaming events; Algorithm design and analysis; Credit cards; Delay; Dynamic scheduling; Event detection; Pattern analysis; Pattern matching; Radiofrequency identification; Scheduling algorithm; Time factors

12. Discovery of cross-similarity in data streams.

【Paper Link】 【Pages】:101-104

【Authors】: Machiko Toyoda ; Yasushi Sakurai

【Abstract】: In this paper, we focus on the problem of finding partial similarity between data streams. Our solution relies on dynamic time warping (DTW) as a similarity measure, which computes the distance between sequences whose lengths and/or sampling rates are different. Instead of straightforwardly using DTW that requires a high computation cost, we propose a streaming method that efficiently detects partial similarity between sequences. Our experiments demonstrate that our method detects pairs of optimal subsequences correctly and that it significantly reduces resources in terms of time and space.

【Keywords】: data analysis; sequences; cross-similarity discovery; data streams; dynamic time warping; sequence distance; sequence similarity; similarity measurement; streaming method; Computational efficiency; Data analysis; Information science; Laboratories; Length measurement; Monitoring; Object detection; Robustness; Sampling methods; Time measurement
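The similarity measure the abstract builds on, dynamic time warping, is the part that is easy to show directly. The sketch below is the textbook O(nm) dynamic program for the DTW distance between two numeric sequences; the paper's contribution (a streaming method that finds partially similar subsequences without this full recomputation) is not reproduced here.

```python
def dtw(x, y):
    """Dynamic time warping distance between sequences x and y.
    d[i][j] = cost of aligning x[:i] with y[:j]; each step may
    advance either sequence or both, so differing lengths and
    sampling rates are tolerated."""
    inf = float('inf')
    n, m = len(x), len(y)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Note that DTW can be zero for sequences of different lengths, e.g. when one merely repeats a sample of the other, which is exactly why it suits streams with differing sampling rates.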

13. Mining distribution change in stock order streams.

【Paper Link】 【Pages】:105-108

【Authors】: Xiaoyan Liu ; Xindong Wu ; Huaiqing Wang ; Rui Zhang ; James Bailey ; Kotagiri Ramamohanarao

【Abstract】: Detecting changes in stock prices is a well-known problem in finance with important implications for monitoring and business intelligence. Forewarning of changes in stock prices can be provided by the early detection of changes in the distributions of stock order numbers. In this paper, we address the change detection problem for streams of stock order numbers and propose a novel incremental detection algorithm. Our algorithm achieves high accuracy and low delay by employing a natural Poisson distribution assumption about the nature of stock order streams. We establish that our algorithm is highly scalable and has linear complexity. We also experimentally demonstrate its effectiveness for detecting change points, using both synthetic and real-world datasets.

【Keywords】: Poisson distribution; data mining; stock markets; Poisson distribution assumption; business intelligence; mining distribution change; monitoring intelligence; stock order numbers; stock order streams; stock prices detecting changes; Change detection algorithms; Computer science; Delay; Detection algorithms; Finance; Forward contracts; Kernel; Maximum likelihood detection; Monitoring; Petroleum
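The Poisson assumption makes change detection a likelihood question: does modeling the count stream with separate rates before and after some point fit better than a single rate? The sketch below is an offline maximum-likelihood split search, shown to illustrate the principle; the paper's algorithm is incremental and linear-time, which this brute-force version is not.

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of i.i.d. Poisson(lam) counts."""
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)

def detect_change(counts):
    """Return (t, gain): the split maximizing the log-likelihood gain of a
    two-rate model (rate before t, rate after t) over a single-rate model."""
    n = len(counts)
    base = poisson_loglik(counts, sum(counts) / n)
    best, best_gain = None, 0.0
    for t in range(1, n):
        left, right = counts[:t], counts[t:]
        gain = (poisson_loglik(left, max(sum(left) / t, 1e-9))
                + poisson_loglik(right, max(sum(right) / (n - t), 1e-9))
                - base)
        if gain > best_gain:
            best, best_gain = t, gain
    return best, best_gain
```

A large gain signals a change point; thresholding the gain trades detection delay against false alarms.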

Location Based Services 3

14. TrajStore: An adaptive storage system for very large trajectory data sets.

【Paper Link】 【Pages】:109-120

【Authors】: Philippe Cudré-Mauroux ; Eugene Wu ; Samuel Madden

【Abstract】: The rise of GPS and broadband-speed wireless devices has led to tremendous excitement about a range of applications broadly characterized as "location based services". Current database storage systems, however, are inadequate for manipulating the very large and dynamic spatio-temporal data sets required to support such services. Proposals in the literature either present new indices without discussing how to cluster data, potentially resulting in many disk seeks for lookups of densely packed objects, or use static quadtrees or other partitioning structures, which become rapidly suboptimal as the data or queries evolve. As a result of these performance limitations, we built TrajStore, a dynamic storage system optimized for efficiently retrieving all data in a particular spatiotemporal region. TrajStore maintains an optimal index on the data and dynamically co-locates and compresses spatially and temporally adjacent segments on disk. By letting the storage layer evolve with the index, the system adapts to incoming queries and data and is able to answer most queries via a very limited number of I/Os, even when the queries target regions containing hundreds or thousands of different trajectories.

【Keywords】: pattern clustering; query processing; temporal databases; tree data structures; GPS; TrajStore; adaptive storage system; broadband speed wireless devices; data clustering; database storage systems; location based services; spatio temporal data sets; static quadtrees; Adaptive systems; Aggregates; Databases; Drives; Global Positioning System; Information retrieval; Manipulator dynamics; Phase change materials; Proposals; Spatiotemporal phenomena

15. C3: Concurrency control on continuous queries over moving objects.

【Paper Link】 【Pages】:121-132

【Authors】: Jing Dai ; Chang-Tien Lu

【Abstract】: Moving object management approaches, especially continuous query processing techniques, have attracted significant research effort due to the broad usage of location-aware devices. However, little attention has been given to designing concurrency control protocols for continuous query processing. Existing concurrency control protocols for spatial indices are based on a single indexing tree, while popular continuous query processing approaches require multiple indices. In addition, continuous monitoring combined with frequent location updates challenges the development of serializable isolation for concurrent index operations. This paper proposes an efficient concurrent continuous query processing approach, C3, which fuses scalable continuous query processing methods with lazy update techniques on R-trees. The proposed concurrency control protocol, equipped with intra- and inter-index protection, assures serializable isolation, consistency, and deadlock-freedom. The correctness of the proposed protocol is theoretically proven, and the experimental results demonstrate its scalability and efficiency.

【Keywords】: concurrency control; mobile computing; query processing; C3; R-trees; concurrency control protocol; concurrent continuous query processing approach; concurrent index operations; deadlock freedom; interindex protection; intraindex protection; location aware devices; moving object management approach; scalable continuous query processing methods; serializable isolation; serializable isolation development; single indexing tree; Concurrency control; Databases; Fuses; Helicopters; Indexing; Monitoring; Protection; Protocols; Query processing; Vehicles

16. Policy-aware sender anonymity in location based services.

【Paper Link】 【Pages】:133-144

【Authors】: Alin Deutsch ; Richard Hull ; Avinash Vyas ; Kevin Keliang Zhao

【Abstract】: Sender anonymity in location-based services (LBS) attempts to hide the identity of a mobile device user who sends requests to the LBS provider for services in her proximity (e.g., "find the nearest gas station"). The goal is to keep the requester's interests private even from attackers who (via hacking or subpoenas) gain access to the request and to the locations of the mobile user and other nearby users at the time of the request. In an LBS context, the best-studied privacy guarantee is known as sender k-anonymity. We show that state-of-the-art solutions for sender k-anonymity defend only against naive attackers who have no knowledge of the anonymization policy that is in use. We strengthen the privacy guarantee to defend against more realistic "policy-aware" attackers. We describe a polynomial algorithm to obtain an optimum anonymization policy. Our implementation and experiments show that policy-aware sender k-anonymity has potential for practical impact, being efficiently enforceable, with limited reduction in utility when compared to policy-unaware guarantees.

【Keywords】: computer crime; mobile computing; polynomials; LBS; LBS context; hacking gain access; k-anonymity; location based services; mobile user locations; policy aware attackers; policy aware sender anonymity; polynomial algorithm; state-of-the art solutions; subpoenas gain access; Art; Computer crime; Computer science; Databases; Hospitals; Partitioning algorithms; Polynomials; Power system protection; Privacy; Wireless networks

Probabilistic Databases 4

17. Approximate confidence computation in probabilistic databases.

【Paper Link】 【Pages】:145-156

【Authors】: Dan Olteanu ; Jiewen Huang ; Christoph Koch

【Abstract】: This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization. With each decomposition step, lower and upper bounds on the probability of the partially compiled formula can be quickly computed and checked against the allowed error. This algorithm can be effectively used to compute approximate confidence values of answer tuples to positive relational algebra queries on general probabilistic databases (c-tables with discrete probability distributions). We further tune our algorithm so as to capture all known tractable conjunctive queries without self-joins on tuple-independent probabilistic databases: In this case, the algorithm requires time polynomial in the input size even for exact computation. We implemented the algorithm as an extension of the SPROUT query engine. An extensive experimental effort shows that it consistently outperforms state-of-art approximation techniques by several orders of magnitude.

【Keywords】: approximation theory; probability; relational algebra; SPROUT query engine; Shannon expansion; approximate confidence computation; deterministic approximation algorithm; discrete probability distributions; discrete random variables; probabilistic databases; relational algebra queries; Algebra; Approximation algorithms; Distributed computing; Engines; Partitioning algorithms; Polynomials; Probability distribution; Random variables; Relational databases; Upper bound
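The decomposition at the heart of the abstract, Shannon expansion, is simple to state: condition on one variable at a time and combine the two branch probabilities. The sketch below performs a naive full expansion over independent Boolean variables; it omits the paper's other decompositions (independence partitioning, product factorization) and its incremental lower/upper bounds, so it is exponential rather than an approximation algorithm.

```python
def prob(formula, p, assignment=None):
    """Exact probability that `formula` (a predicate over a dict var -> bool)
    holds, given independent variables with marginals `p`, via Shannon
    expansion: P(f) = p_x * P(f | x=1) + (1 - p_x) * P(f | x=0)."""
    assignment = assignment or {}
    pending = [v for v in p if v not in assignment]
    if not pending:
        return 1.0 if formula(assignment) else 0.0
    x = pending[0]
    return (p[x] * prob(formula, p, {**assignment, x: True})
            + (1 - p[x]) * prob(formula, p, {**assignment, x: False}))
```

In confidence computation, `formula` would be the lineage of an answer tuple and `p` the marginals of the tuple-level random variables.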

18. PIP: A database system for great and small expectations.

【Paper Link】 【Pages】:157-168

【Authors】: Oliver Kennedy ; Christoph Koch

【Abstract】: Estimation via sampling out of highly selective join queries is well known to be problematic, most notably in online aggregation. Without goal-directed sampling strategies, samples falling outside of the selection constraints lower estimation efficiency at best, and cause inaccurate estimates at worst. This problem appears in general probabilistic database systems, where query processing is tightly coupled with sampling. By committing to a set of samples before evaluating the query, the engine wastes effort on samples that will be discarded, query processing that may need to be repeated, or unnecessarily large numbers of samples. We describe PIP, a general probabilistic database system that uses symbolic representations of probabilistic data to defer computation of expectations, moments, and other statistical measures until the expression to be measured is fully known. This approach is sufficiently general to admit both continuous and discrete distributions. Moreover, deferring sampling enables a broad range of goal-oriented sampling-based (as well as exact) integration techniques for computing expectations, allows the selection of the integration strategy most appropriate to the expression being measured, and can reduce the amount of sampling work required. We demonstrate the effectiveness of this approach by showing that even straightforward algorithms can make use of the added information. These algorithms have a profoundly positive impact on the efficiency and accuracy of expectation computations, particularly in the case of highly selective join queries.

【Keywords】: database management systems; estimation theory; probability; query processing; sampling methods; subroutines; PIP; computing expectations; continuous distributions; discrete distributions; estimation efficiency; goal directed sampling strategies; goal-oriented sampling-based integration techniques; great expectations; online aggregation; probabilistic database systems; query processing; small expectations; Computer applications; Data mining; Database systems; Encoding; Engines; Predictive models; Probability distribution; Query processing; Sampling methods
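The payoff of deferring sampling can be shown with a minimal stand-in (this is not PIP's symbolic machinery, just the sampling-side idea): because the selection predicate is known before any world is drawn, rejected samples cost only the draw itself, not downstream query processing. The function and its arguments are hypothetical names for illustration.

```python
import random

def deferred_expectation(make_world, expr, cond, n=20000, seed=1):
    """Monte Carlo estimate of E[expr(w) | cond(w)].
    Worlds w are sampled only now, after both the measured expression and
    the selection are known; worlds failing the selection are discarded
    before any further computation is spent on them."""
    rng = random.Random(seed)
    tot = cnt = 0.0
    for _ in range(n):
        w = make_world(rng)
        if cond(w):
            tot += expr(w)
            cnt += 1
    return tot / cnt
```

For example, with a world that is a single Uniform(0, 1) draw, E[w | w > 0.5] should come out near 0.75.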

19. Generator-Recognizer Networks: A unified approach to probabilistic databases.

【Paper Link】 【Pages】:169-172

【Authors】: Ruiwen Chen ; Yongyi Mao ; Iluju Kiringa

【Abstract】: Under the tuple-level uncertainty paradigm, we introduce a novel graphical model, Generator-Recognizer Network (GRN), as a model for probabilistic databases. The GRN modeling framework extends existing graphical models of probabilistic databases and is capable of representing a much wider range of dependence structures.

【Keywords】: statistical databases; uncertainty handling; dependence structures; generator recognizer networks; probabilistic databases; tuple level uncertainty paradigm; Databases; Graphical models; Power measurement; Random variables; Uncertainty

20. Probabilistic declarative information extraction.

Paper Link】 【Pages】:173-176

【Authors】: Daisy Zhe Wang ; Eirinaios Michelakis ; Michael J. Franklin ; Minos N. Garofalakis ; Joseph M. Hellerstein

【Abstract】: Unstructured text represents a large fraction of the world's data. It often contains snippets of structured information (e.g., people's names and zip codes). Information Extraction (IE) techniques identify such structured information in text. In recent years, database research has pursued IE on two fronts: declarative languages and systems for managing IE tasks, and probabilistic databases for querying the output of IE. In this paper, we take a first step toward merging these two directions, without loss of statistical robustness, by implementing a state-of-the-art statistical IE model - Conditional Random Fields (CRF) - in the setting of a Probabilistic Database that treats statistical models as first-class data objects. We show that the Viterbi algorithm for CRF inference can be specified declaratively in recursive SQL. We also show the performance benefits relative to a standalone open-source Viterbi implementation. This work opens up optimization opportunities for queries involving both inference and relational operators over IE models.

【Keywords】: SQL; inference mechanisms; maximum likelihood estimation; probability; relational databases; CRF inference; Viterbi algorithm; conditional random fields; database research; declarative information extraction; declarative languages; inference operators; open-source Viterbi implementation; probabilistic databases; recursive SQL; relational operators; unstructured text; Data mining; Database languages; Database systems; Learning systems; Merging; Open source software; Relational databases; Robustness; Uncertainty; Viterbi algorithm
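
The declarative formulation above rests on the standard Viterbi recurrence. As a hedged illustration (the toy labels, scores, and observations below are our own invention, not the paper's), the same dynamic program looks like this in Python; the recursive-SQL version in the paper computes the same recurrence one sequence position at a time:

```python
# Minimal Viterbi dynamic program for a linear-chain model (hypothetical
# toy scores; the paper expresses this same recurrence in recursive SQL).
def viterbi(obs, labels, start, trans, emit):
    # V[t][y] = best score of any label sequence ending in label y at position t
    V = [{y: start[y] + emit[y][obs[0]] for y in labels}]
    back = []
    for t in range(1, len(obs)):
        row, ptr = {}, {}
        for y in labels:
            # choose the best previous label for each current label y
            best_prev = max(labels, key=lambda p: V[-1][p] + trans[p][y])
            row[y] = V[-1][best_prev] + trans[best_prev][y] + emit[y][obs[t]]
            ptr[y] = best_prev
        V.append(row)
        back.append(ptr)
    # backtrack from the best final label
    last = max(labels, key=lambda y: V[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = ["PER", "OTH"]
start = {"PER": 0.0, "OTH": 0.0}
trans = {"PER": {"PER": 1.0, "OTH": 0.0}, "OTH": {"PER": 0.0, "OTH": 1.0}}
emit = {"PER": {"Ann": 2.0, "runs": 0.0}, "OTH": {"Ann": 0.0, "runs": 2.0}}
best = viterbi(["Ann", "runs"], labels, start, trans, emit)
```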

Spatial Indexing 3

21. PARINET: A tunable access method for in-network trajectories.

Paper Link】 【Pages】:177-188

【Authors】: Iulian Sandu Popa ; Karine Zeitouni ; Vincent Oria ; Dominique Barth ; Sandrine Vial

【Abstract】: In this paper we propose PARINET, a new access method to efficiently retrieve the trajectories of objects moving in networks. The structure of PARINET is based on a combination of graph partitioning and a set of composite B+-tree local indexes. PARINET is designed for historical data and relies on the distribution of the data over the network, which, for historical data, is known in advance. Because the network can be modeled using graphs, the partitioning of the trajectory data is based on graph partitioning theory and can be tuned for a given query load. The data in each partition is indexed on the time component using B+-trees. We study different types of queries, and provide an optimal configuration for several scenarios. PARINET can easily be integrated into any RDBMS, which is an essential asset particularly for industrial or commercial applications. The experimental evaluation under an off-the-shelf DBMS shows that PARINET is robust. It also significantly outperforms both MON-tree and another R-tree-based access method, which are the reference indexing techniques for in-network trajectory databases.

【Keywords】: information retrieval; optimisation; relational databases; trees (mathematics); PARINET access method; composite B+-tree local indexes; data distribution; graph partitioning theory; in-network trajectories; in-network trajectory databases; object trajectory retrieval; optimal configuration; reference indexing techniques; relational database management systems; Costs; Indexing; Information retrieval; Laboratories; Roads; Robustness; Spatial databases; Spatiotemporal phenomena; Traffic control; Transportation

22. Multi-guarded safe zone: An effective technique to monitor moving circular range queries.

Paper Link】 【Pages】:189-200

【Authors】: Muhammad Aamir Cheema ; Ljiljana Brankovic ; Xuemin Lin ; Wenjie Zhang ; Wei Wang

【Abstract】: Given a positive value r, a circular range query returns the objects that lie within the distance r of the query location. In this paper, we study the circular range queries that continuously change their locations. We present an efficient and effective technique to monitor such moving range queries by utilising the concept of a safe zone. The safe zone of a query is the area with a property that while the query remains inside it, the results of the query remain unchanged. Hence, the query does not need to be re-evaluated unless it leaves the safe zone. The shape of the safe zone is defined by the so-called guard objects. Checking whether a query lies in the safe zone takes k distance computations, where k is the number of the guard objects. Our contributions are as follows. 1) We propose a technique based on powerful pruning rules and a unique access order which efficiently computes the safe zone and minimizes the I/O cost. 2) To show the effectiveness of the safe zone, we theoretically evaluate the probability that a query leaves the safe zone within one time unit and the expected distance a query moves before it leaves the safe zone. Additionally, for the queries that have a safe-zone diameter less than its expected value multiplied by a constant, we also give an upper bound on the expected number of guard objects. This upper bound turns out to be a constant, that is, it depends neither on the radius r of the query nor on the density of the objects. The theoretical analysis is verified by extensive experiments. 3) Our thorough experimental study demonstrates that our proposed approach is close to optimal and is an order of magnitude faster than a naïve algorithm.

【Keywords】: client-server systems; query processing; visual databases; access order; guard objects; moving circular range queries; multiguarded safe zone; pruning rules; query monitoring; Algorithm design and analysis; Availability; Bandwidth; Costs; Euclidean distance; Mobile computing; Monitoring; Navigation; Shape; Upper bound
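
One way to read the safe-zone membership test described above (k distance computations against the guard objects) is the following sketch; the guard coordinates and the split into current answers versus near misses are illustrative assumptions, not the paper's exact construction:

```python
import math

def in_safe_zone(query, guards_in, guards_out, r):
    # A moving circular range query of radius r keeps its result set as long
    # as every guard drawn from the current answers ("guards_in") stays within
    # distance r, and every guard drawn from the non-answers ("guards_out")
    # stays outside r. That is exactly k = len(guards_in) + len(guards_out)
    # distance computations per check.
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    return (all(dist(query, g) <= r for g in guards_in) and
            all(dist(query, g) > r for g in guards_out))

# Hypothetical guards: one answer at the origin, one near-miss at (5, 0).
inside = in_safe_zone((1.0, 0.0), [(0.0, 0.0)], [(5.0, 0.0)], r=2.0)
outside = in_safe_zone((3.0, 0.0), [(0.0, 0.0)], [(5.0, 0.0)], r=2.0)
```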

23. Geotagging with local lexicons to build indexes for textually-specified spatial data.

Paper Link】 【Pages】:201-212

【Authors】: Michael D. Lieberman ; Hanan Samet ; Jagan Sankaranarayanan

【Abstract】: The successful execution of location-based and feature-based queries on spatial databases requires the construction of spatial indexes on the spatial attributes. This is not simple when the data is unstructured, as is the case when the data is a collection of documents such as news articles (the domain of discourse here), where the spatial attribute consists of text that can be (but is not required to be) interpreted as the names of locations. In other words, spatial data is specified using text (known as a toponym) instead of geometry, which means that there is some ambiguity involved. The process of identifying and disambiguating references to geographic locations is known as geotagging and involves using a combination of internal document structure and external knowledge, including a document-independent model of the audience's vocabulary of geographic locations, termed its spatial lexicon. In contrast to previous work, a new spatial lexicon model is presented that distinguishes between a global lexicon of locations known to all audiences, and an audience-specific local lexicon. Generic methods for inferring audiences' local lexicons are described. Evaluations of this inference method and the overall geotagging procedure indicate that establishing local lexicons cannot be overlooked, especially given the increasing prevalence of highly local data sources on the Internet, and will enable the construction of more accurate spatial indexes.

【Keywords】: Internet; data mining; geographic information systems; visual databases; Internet; document-independent model; external knowledge; feature-based queries; generic methods; geographic locations; geotagging; inference method; internal document structure; local lexicons; location-based queries; spatial databases; spatial indexes; spatial lexicon model; textually-specified spatial data; Automation; Computer science; Educational institutions; Frequency; Geometry; Influenza; Internet; Spatial databases; Spatial indexes; Vocabulary

Privacy Techniques 3

24. On optimal anonymization for l+-diversity.

Paper Link】 【Pages】:213-224

【Authors】: Junqiang Liu ; Ke Wang

【Abstract】: Publishing person specific data while protecting privacy is an important problem. Existing algorithms that enforce the privacy principle called l-diversity are heuristic based due to the NP-hardness. Several questions remain open: can we get a significant gain in the data utility from an optimal solution compared to heuristic ones; can we improve the utility by setting a distinct privacy threshold per sensitive value; is it practical to find an optimal solution efficiently for real world datasets. This paper addresses these questions. Specifically, we present a pruning based algorithm for finding an optimal solution to an extended form of the l-diversity problem. The novelty lies in several strong techniques: a novel structure for enumerating all solutions, methods for estimating cost lower bounds, strategies for dynamically arranging the enumeration order and updating lower bounds. This approach can be instantiated with any reasonable cost metric. Experiments on real world datasets show that our algorithm is efficient and improves the data utility.

【Keywords】: computational complexity; data privacy; optimisation; NP-hardness; cost lower bounds; l+-diversity; optimal anonymization; optimal solution; protecting privacy; publishing person specific data; Cost function; Data privacy; Indexing; Joining processes; Protection; Publishing
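
For reference, the base privacy principle being optimized is easy to state. The minimal check below covers only distinct l-diversity; the paper's l+-diversity extension additionally allows a distinct privacy threshold per sensitive value, which this sketch does not model:

```python
def satisfies_l_diversity(groups, l):
    # Distinct l-diversity: each anonymized group (an equivalence class over
    # the quasi-identifier attributes) must contain at least l distinct
    # sensitive values, limiting what an attacker can infer about any member.
    return all(len(set(sensitive_values)) >= l for sensitive_values in groups)

# Hypothetical anonymized groups, listing only the sensitive attribute.
groups = [["flu", "cold", "hiv"], ["flu", "flu", "cold"]]
ok_at_2 = satisfies_l_diversity(groups, 2)
ok_at_3 = satisfies_l_diversity(groups, 3)
```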

25. Differential privacy via wavelet transforms.

Paper Link】 【Pages】:225-236

【Authors】: Xiaokui Xiao ; Guozhang Wang ; Johannes Gehrke

【Abstract】: Privacy preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ε-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve ε-differential privacy, however, offer little data utility. In particular, if the output dataset is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures ε-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.

【Keywords】: data privacy; database management systems; wavelet transforms; data publishing technique; data utility; differential privacy; range-count queries; wavelet transforms; Aggregates; Data privacy; Diabetes; Frequency; Hospitals; Publishing; Sea measurements; Wavelet transforms
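
The core idea, adding noise to wavelet coefficients rather than to raw counts so that a range-count answer depends on few noisy values, can be sketched as follows. The Haar transform here is standard, but the noise calibration is a deliberate simplification of the paper's weighted scheme:

```python
import random

def haar(xs):
    # One-dimensional Haar wavelet transform (length must be a power of two):
    # repeatedly replace adjacent pairs by their average and half-difference.
    out = list(xs)
    n = len(out)
    while n > 1:
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(n // 2)]
        dif = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(n // 2)]
        out[:n] = avg + dif
        n //= 2
    return out

def inv_haar(xs):
    # Exact inverse of haar(): rebuild pairs from averages and differences.
    out = list(xs)
    n = 1
    while n < len(out):
        avg, dif = out[:n], out[n:2 * n]
        out[:2 * n] = [v for a, d in zip(avg, dif) for v in (a + d, a - d)]
        n *= 2
    return out

def publish(counts, eps):
    # Perturb wavelet coefficients instead of raw counts. Laplace noise is
    # drawn as the difference of two exponentials (scale 1/eps); the paper's
    # scheme additionally weights the noise per wavelet level, omitted here.
    coeffs = haar(counts)
    noisy = [c + random.expovariate(eps) - random.expovariate(eps)
             for c in coeffs]
    return inv_haar(noisy)

noisy_counts = publish([5, 3, 0, 2], eps=1.0)
```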

26. Efficient verification of shortest path search via authenticated hints.

Paper Link】 【Pages】:237-248

【Authors】: Man Lung Yiu ; Yimin Lin ; Kyriakos Mouratidis

【Abstract】: Shortest path search in transportation networks is unarguably one of the most important online search services nowadays (e.g., Google Maps, MapQuest, etc.), with applications spanning logistics, spatial optimization, and everyday driving decisions. Often, the owner of the road network data (e.g., a transport authority) provides its database to third-party query services, which are responsible for answering shortest path queries posed by their clients. The issue arising here is that a query service might be returning sub-optimal paths either purposely (in order to serve its own purposes, like computational savings or commercial reasons) or because it has been compromised by Internet attackers who falsify the results. Therefore, for the above applications to succeed, it is essential that each reported path is accompanied by a proof, which allows clients to verify the path's correctness. This is the first study on shortest path verification in outsourced network databases. We propose the concept of authenticated hints, which is used to reduce the size of the proofs. We develop several authentication techniques and quantify their tradeoffs with respect to offline construction cost and proof size. Experiments on real road networks demonstrate that our solutions are indeed efficient and lead to compact query proofs.

【Keywords】: authorisation; graph theory; information retrieval systems; information services; search problems; online search services; query proofs; query service; shortest path search; shortest path verification; spanning logistics; spatial optimization; transportation networks; Computer network management; Computer networks; Costs; Information management; Logistics; Lungs; Management information systems; Road transportation; Spatial databases; Web server

Skyline Queries 3

27. Evaluating skylines in the presence of equijoins.

Paper Link】 【Pages】:249-260

【Authors】: Wen Jin ; Michael D. Morse ; Jignesh M. Patel ; Martin Ester ; Zengjian Hu

【Abstract】: When a database system is extended with the skyline operator, it is important to determine the most efficient way to execute a skyline query across tables with join operations. This paper describes a framework for evaluating skylines in the presence of equijoins, including: (1) the development of algorithms to answer such queries over large input tables in a non-blocking, pipeline fashion, which significantly speeds up the entire query evaluation time. These algorithms are built on top of the traditional relational Nested-Loop and the Sort-Merge join algorithms, which allows easy implementation of these methods in existing relational systems; (2) a novel method for estimating the skyline selectivity of the joined table; (3) evaluation of skyline computation based on the estimation method and the proposed evaluation techniques; and (4) a systematic experimental evaluation to validate our skyline evaluation framework.

【Keywords】: query processing; relational databases; database system; equijoins presence; estimation method; nested-loop join algorithms; relational systems; skyline computation; skyline evaluation; skyline query; skyline selectivity estimation; sort-merge join algorithms; Buildings; Costs; Database systems; Pipelines; Query processing; Recruitment; Remuneration; Statistical distributions; Visual databases; Visualization
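
The dominance test underlying any skyline operator is compact; the paper's contribution is interleaving it with Nested-Loop and Sort-Merge equijoins in a pipelined fashion, which this standalone sketch does not attempt:

```python
def dominates(a, b):
    # a dominates b if a is no worse in every dimension and strictly better
    # in at least one (smaller values are better in this sketch).
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def skyline(points):
    # Straightforward quadratic skyline: keep every point that no other
    # point dominates. Real systems (including the paper's algorithms)
    # avoid this all-pairs comparison.
    return [p for p in points if not any(dominates(q, p) for q in points)]

result = skyline([(1, 3), (2, 2), (3, 1), (3, 3)])
```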

28. Route skyline queries: A multi-preference path planning approach.

Paper Link】 【Pages】:261-272

【Authors】: Hans-Peter Kriegel ; Matthias Renz ; Matthias Schubert

【Abstract】: In recent years, the research community introduced various methods for processing skyline queries in multidimensional databases. The skyline operator retrieves all objects being optimal w.r.t. an arbitrary linear weighting of the underlying criteria. The most prominent example query is to find a reasonable set of hotels which are cheap but close to the beach. In this paper, we propose a new approach for computing skylines on routes (paths) in a road network considering multiple preferences like distance, driving time, the number of traffic lights, gas consumption, etc. Since the consideration of different preferences usually involves different routes, a skyline-fashioned answer with relevant route candidates is highly useful. In our work, we employ graph embedding techniques to enable a best-first based graph exploration considering route preferences based on arbitrary road attributes. The core of our skyline query processor is a route iterator which iteratively computes the top routes according to (at least one) preference in an efficient way, avoiding the need to re-issue route computations from scratch in each iteration. Furthermore, we propose pruning techniques in order to reduce the search space. Our pruning strategies aim at pruning as many route candidates as possible during the graph exploration. Therefore, we are able to prune candidates which are only partially explored. Finally, our experimental evaluation shows that our approach reduces the search space significantly and computes the skyline efficiently.

【Keywords】: database management systems; graph theory; iterative methods; path planning; query processing; road traffic; arbitrary linear weighting; best-first based graph exploration; graph embedding techniques; multidimensional databases; multipreference path planning approach; pruning techniques; road network; route iterator; route skyline queries; Cities and towns; Computer networks; Costs; Euclidean distance; Informatics; Multidimensional systems; Path planning; Roads; Spatial databases; Telecommunication traffic

29. Probabilistic contextual skylines.

Paper Link】 【Pages】:273-284

【Authors】: Dimitris Sacharidis ; Anastasios Arvanitis ; Timos K. Sellis

【Abstract】: The skyline query returns the most interesting tuples according to a set of explicitly defined preferences among attribute values. This work relaxes this requirement, and allows users to pose meaningful skyline queries without stating their choices. To compensate for missing knowledge, we first determine a set of uncertain preferences based on user profiles, i.e., information collected for previous contexts. Then, we define a probabilistic contextual skyline query (p-CSQ) that returns the tuples which are interesting with high probability. We emphasize that, unlike past work, uncertainty lies within the query and not the data, i.e., it is in the relationships among tuples rather than in their attribute values. Furthermore, due to the nature of this uncertainty, popular skyline methods, which rely on a particular tuple visit order, do not apply for p-CSQs. Therefore, we present novel non-indexed and index-based algorithms for answering p-CSQs. Our experimental evaluation concludes that the proposed techniques are significantly more efficient compared to a standard block nested loops approach.

【Keywords】: query processing; index based algorithms; information collection; p-CSQ; popular skyline methods; probabilistic contextual skyline query; skyline query; Airports; Business communication; Cities and towns; Databases; Hydrogen; Information management; Internet; Management information systems; Optimized production technology; Uncertainty

Information Integration 3

30. Schema covering: a step towards enabling reuse in information integration.

Paper Link】 【Pages】:285-296

【Authors】: Barna Saha ; Ioana Stanoi ; Kenneth L. Clarkson

【Abstract】: We introduce schema covering, the problem of identifying easily understandable common objects for describing large and complex schemas. Defining transformations between schemas is a key objective in information integration. However, this process often becomes cumbersome when the schemas are large and structurally complex. If such complex schemas can be broken into smaller and simpler objects, then simple transformations defined over these smaller objects can be reused to define suitable transformations among the complex schemas. Schema covering performs this vital task by identifying a collection of common concepts from a repository and creating a cover of the complex schema by these concepts. In this paper, we formulate the problem of schema covering, show that it is NP-Complete, and give efficient approximation algorithms for it. A performance evaluation with real business schemas confirms the effectiveness of our approach.

【Keywords】: XML; approximation theory; computational complexity; optimisation; relational databases; NP-complete problem; approximation algorithms; complex schema; information integration; repository; schema covering; Approximation algorithms; Business communication; Cities and towns; Computer science; Data mining; Data models; Data warehouses; Educational institutions; Visualization; XML
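
Since schema covering is shown to be NP-complete, a natural baseline (our set-cover framing, not necessarily the paper's approximation algorithm) is the greedy heuristic over a concept repository:

```python
def greedy_cover(schema_elems, concepts):
    # Greedy set-cover heuristic: repeatedly pick the repository concept that
    # covers the most still-uncovered schema elements. Framing schema covering
    # as plain set cover is our simplification of the problem in the paper.
    uncovered = set(schema_elems)
    chosen = []
    while uncovered:
        name, elems = max(concepts.items(),
                          key=lambda kv: len(uncovered & set(kv[1])))
        gained = uncovered & set(elems)
        if not gained:
            break  # remaining elements are covered by no concept
        chosen.append(name)
        uncovered -= gained
    return chosen, uncovered

# Hypothetical repository of common concepts and a flat schema to cover.
concepts = {"Address": ["street", "city", "zip"],
            "Person": ["name", "ssn"],
            "City": ["city"]}
chosen, left = greedy_cover(["street", "city", "zip", "name", "ssn"], concepts)
```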

31. Managing uncertainty of XML schema matching.

Paper Link】 【Pages】:297-308

【Authors】: Reynold Cheng ; Jian Gong ; David W. Cheung

【Abstract】: Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of "possible mappings" between the schemas may be derived from the matching result. In this paper, we study the problem of managing possible mappings between two heterogeneous XML schemas. We observe that for XML schemas, their possible mappings have a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner, and can be generated efficiently. Moreover, it supports the evaluation of the probabilistic twig query (PTQ), which returns the probability of portions of an XML document that match the query pattern. For users who are interested only in answers with the k highest probabilities, we also propose the top-k PTQ, and present an efficient solution for it. The second challenge we have tackled is to efficiently generate possible mappings for a given schema matching. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. An extensive evaluation on realistic datasets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.

【Keywords】: XML; data integrity; learning (artificial intelligence); probability; COMA++; PTQ; XML schema matching; block tree; k-highest probabilities; machine learning technologies; probabilistic twig query; querying possible mappings; Catalogs; Companies; Computer science; Databases; Machine learning; Pattern matching; Technology management; Tree data structures; Uncertainty; XML

32. Propagating updates through XML views using lineage tracing.

Paper Link】 【Pages】:309-320

【Authors】: Leonidas Fegaras

【Abstract】: We address the problem of updating XML views over relational data by translating view updates expressed in the XQuery update facility to embedded SQL updates. Although our XML views may be defined using the full extent of the XQuery syntax, they can only connect relational tables through restricted one-to-many relationships that do not cause view side effects for a wide range of XQuery updates. Our approach is to use lineage tracing to propagate the necessary information about the origins of updatable data pieces through the query and the view code, to be used when these pieces are to be updated. Our system performs a compile-time analysis, based on polymorphic type inference and type usage, to detect the exclusive data sources, which are the table columns from the database that can be updated without causing side-effects to the view. The rest of the updates are associated with an update context in the form of a chain of tuples, which reflects the navigation path that was used to reach the update destination. At commit time, our system collectively considers all the compatible chains of all updates in the transaction and tries to relink them to new chains from the existing database whose tuples contain the updated data, so that the updates are reflected correctly without causing side effects to the other components of the view.

【Keywords】: SQL; XML; relational databases; SQL; XML; XQuery syntax; compile time analysis; lineage tracing; relational data; Data models; Internet; Navigation; Performance analysis; Query processing; Relational databases; Runtime; Tin; Transaction databases; XML

Query Interfaces 4

33. USHER: Improving data quality with dynamic forms.

Paper Link】 【Pages】:321-332

【Authors】: Kuang Chen ; Harr Chen ; Neil Conway ; Joseph M. Hellerstein ; Tapan S. Parikh

【Abstract】: Data quality is a critical problem in modern databases. Data entry forms present the first and arguably best opportunity for detecting and mitigating errors, but there has been little research into automatic methods for improving data quality at entry time. In this paper, we propose USHER, an end-to-end system for form design, entry, and data quality assurance. Using previous form submissions, USHER learns a probabilistic model over the questions of the form. USHER then applies this model at every step of the data entry process to improve data quality. Before entry, it induces a form layout that captures the most important data values of a form instance as quickly as possible. During entry, it dynamically adapts the form to the values being entered, and enables real-time feedback to guide the data enterer toward their intended values. After entry, it re-asks questions that it deems likely to have been entered incorrectly. We evaluate all three components of USHER using two real-world data sets. Our results demonstrate that each component has the potential to improve data quality considerably, at a reduced cost when compared to current practice.

【Keywords】: data mining; USHER; automatic methods; data quality improvement; dynamic forms; end-to-end system; modern databases; probabilistic model; real-world data sets; Artificial intelligence; Cleaning; Computer science; Costs; Databases; Decision making; Error correction; Feedback; Laboratories; Quality assurance

34. Explaining structured queries in natural language.

Paper Link】 【Pages】:333-344

【Authors】: Georgia Koutrika ; Alkis Simitsis ; Yannis E. Ioannidis

【Abstract】: Many applications offer a form-based environment for naïve users for accessing databases without being familiar with the database schema or a structured query language. User interactions are translated to structured queries and executed. However, as a user is unlikely to know the underlying semantic connections among the fields presented in a form, it is often useful to provide her with a textual explanation of the query. In this paper, we take a graph-based approach to the query translation problem. We represent various forms of structured queries as directed graphs and we annotate the graph edges with template labels using an extensible template mechanism. We present different graph traversal strategies for efficiently exploring these graphs and composing textual query descriptions. Finally, we present experimental results for the efficiency and effectiveness of the proposed methods.

【Keywords】: SQL; database management systems; directed graphs; natural languages; query processing; database accessing; directed graphs; graph edges; graph traversal strategies; naive users; natural language; query translation problem; structured query language; template mechanism; textual query descriptions; user interactions; Automatic speech recognition; Computer science; Database languages; Informatics; Natural languages; Qualifications; Resource description framework; Speech analysis; Telecommunications; Visual databases

35. ScoreFinder: A method for collaborative quality inference on user-generated content.

Paper Link】 【Pages】:345-348

【Authors】: Yang Liao ; Aaron Harwood ; Kotagiri Ramamohanarao

【Abstract】: User-generated content is quickly becoming the greatest source of information on the World Wide Web. Shared content items are initially considered unconfirmed in the sense that their credibility has not yet been established. Conventional, centralized confirmation of credibility is infeasible at the Internet scale, so making use of the annotators themselves to evaluate each item is essential. However, users usually differ in their opinions of the same item, and the existence of bias, variance and maliciousness makes the problem of aggregating opinions more difficult. Addressing this problem, we propose the use of an Author-Annotator model with an iterative algorithm, called ScoreFinder, for inferring credibility by ranking shared items. In order to reduce the influence from a variety of error sources, we identify reliable users on each topic, and adaptively aggregate scores from them. Moreover, we transform the users' input to remove errors/anomalies, by identifying patterns of misbehaviour learned from a real data set. We show how our algorithm performs on both real and synthetic data sets, achieving significant improvements in our experiments.

【Keywords】: Internet; groupware; inference mechanisms; iterative methods; Internet scale; ScoreFinder algorithm; author-annotator model; collaborative quality inference; credibility confirmation; iterative algorithm; user-generated content; Bipartite graph; Collaboration; Collaborative software; Computer science; Information resources; Internet; Iterative algorithms; Software engineering; User-generated content; Web sites
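
A simplified reading of the Author-Annotator iteration, alternating between consensus scores and per-user reliabilities, might look like the sketch below; the reliability update rule is a hypothetical choice of ours, not the one in the paper:

```python
def scorefinder(ratings, rounds=10):
    # ratings: {(user, item): score}. Iteratively estimate each item's score
    # as a reliability-weighted mean of its ratings, then re-estimate each
    # user's reliability from how far their ratings sit from the consensus.
    # (A simplified reading of the Author-Annotator iteration in the paper.)
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    rel = {u: 1.0 for u in users}
    score = {}
    for _ in range(rounds):
        for i in items:
            rs = [(rel[u], s) for (u, it), s in ratings.items() if it == i]
            score[i] = sum(w * s for w, s in rs) / sum(w for w, _ in rs)
        for u in users:
            errs = [abs(s - score[i])
                    for (uu, i), s in ratings.items() if uu == u]
            # Hypothetical update: larger average deviation, lower reliability.
            rel[u] = 1.0 / (1.0 + sum(errs) / len(errs))
    return score, rel

# Two agreeing users and one contrarian rater on the same two items.
ratings = {("u1", "a"): 1.0, ("u1", "b"): 0.0,
           ("u2", "a"): 1.0, ("u2", "b"): 0.0,
           ("troll", "a"): 0.0, ("troll", "b"): 1.0}
score, rel = scorefinder(ratings)
```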

36. IQP: Incremental query construction, a probabilistic approach.

Paper Link】 【Pages】:349-352

【Authors】: Elena Demidova ; Xuan Zhou ; Wolfgang Nejdl

【Abstract】: This paper presents IQP - a novel approach to bridge the gap between usability of keyword search and expressiveness of database queries. IQP enables a user to start with an arbitrary keyword query and incrementally refine it into a structured query through an interactive interface. The enabling techniques of IQP include: (1) a conceptual framework for incremental query construction; (2) a probabilistic model to assess the possible informational needs represented by a keyword query; (3) an algorithm to perform an optimal query construction.

【Keywords】: information services; probability; query formulation; IQP; arbitrary keyword query; database query expressiveness; incremental query construction; interactive interface; keyword search; probabilistic approach; probabilistic model; structured query; usability; Australia; Books; Bridges; Cities and towns; Database languages; Keyword search; Lifting equipment; Motion pictures; Usability; User interfaces

Top-K Queries 4

37. TASM: Top-k Approximate Subtree Matching.

Paper Link】 【Pages】:353-364

【Authors】: Nikolaus Augsten ; Denilson Barbosa ; Michael H. Böhlen ; Themis Palpanas

【Abstract】: We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper-bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.

【Keywords】: XML; computational complexity; tree searching; XML trees; cubic runtime; prefix ring buffer; quadratic space complexity; query tree; top-k approximate subtree matching; tree edit distance; Cleaning; Computer science; Extraterrestrial measurements; Performance evaluation; Runtime; Upper bound; XML
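
The pruning idea above hinges on subtree sizes computed in a single postorder pass; this sketch shows only the size computation and the threshold filter, leaving out the prefix ring buffer and the tree edit distance itself:

```python
def subtree_sizes(children, root):
    # Postorder pass computing |subtree(v)| for every node v. TASM only needs
    # to run the expensive tree edit distance on document subtrees whose size
    # falls below a query-dependent threshold tau.
    size = {}
    def visit(v):
        size[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return size[v]
    visit(root)
    return size

def candidates(children, root, tau):
    # Keep only subtrees small enough to be a top-k match for the query.
    return [v for v, s in subtree_sizes(children, root).items() if s <= tau]

# Hypothetical document tree: r -> {a, b}, a -> {c, d}.
children = {"r": ["a", "b"], "a": ["c", "d"]}
sizes = subtree_sizes(children, "r")
```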

38. Reverse top-k queries.

Paper Link】 【Pages】:365-376

【Authors】: Akrivi Vlachou ; Christos Doulkeridis ; Yannis Kotidis ; Kjetil Nørvåg

【Abstract】: Rank-aware query processing has become essential for many applications that return to the user only the top-k objects based on the individual user's preferences. Top-k queries have been mainly studied from the perspective of the user, focusing primarily on efficient query processing. In this work, for the first time, we study top-k queries from the perspective of the product manufacturer. Given a potential product, which are the user preferences for which this product is in the top-k query result set? We identify a novel query type, namely the reverse top-k query, that is essential for manufacturers to assess the potential market and impact of their products based on the competition. We formally define reverse top-k queries and introduce two versions of the query, namely monochromatic and bichromatic. We first provide a geometric interpretation of the monochromatic reverse top-k query in the solution space that helps to understand the reverse top-k query conceptually. Then, we study in more detail the case of the bichromatic reverse top-k query, which is more interesting for practical applications. Such a query, if computed in a straightforward manner, requires evaluating a top-k query for each user preference in the database, which is prohibitively expensive even for moderate datasets. In this paper, we present an efficient threshold-based algorithm that eliminates candidate user preferences without processing the respective top-k queries. Furthermore, we introduce an indexing structure based on materialized reverse top-k views in order to speed up the computation of reverse top-k queries. Materialized reverse top-k views trade preprocessing cost for query speedup in a controllable manner. Our experimental evaluation demonstrates the efficiency of our techniques, which reduce the required number of top-k computations by 1 to 3 orders of magnitude.

【Keywords】: query processing; bichromatic query; geometric interpretation; monochromatic query; product manufacturer; rank aware query processing; reverse top-k queries; Computer science; Costs; Databases; Indexing; Informatics; Pulp manufacturing; Query processing
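The straightforward baseline the authors improve upon is easy to state in code: run one top-k query per preference vector and keep the vectors that rank the candidate product. A sketch assuming linear scoring functions and made-up data; the threshold-based pruning and materialized views are not shown.

```python
def topk(products, w, k):
    """Top-k products under a linear preference vector w (higher score wins)."""
    ranked = sorted(products,
                    key=lambda p: sum(wi * pi for wi, pi in zip(w, p)),
                    reverse=True)
    return ranked[:k]

def reverse_topk(products, preferences, candidate, k):
    """Bichromatic reverse top-k, naively: every preference vector for which
    the candidate product makes it into the top-k result."""
    return [w for w in preferences
            if candidate in topk(products + [candidate], w, k)]

prefs = [(0.8, 0.2), (0.5, 0.5), (0.1, 0.9)]
prods = [(9, 1), (4, 6), (1, 9)]
hits = reverse_topk(prods, prefs, candidate=(6, 5), k=2)
```

This evaluates one top-k query per preference in the database, which is exactly the cost the paper's threshold-based algorithm avoids.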

39. Top-K aggregation queries over large networks.

Paper Link】 【Pages】:377-380

【Authors】: Xifeng Yan ; Bin He ; Feida Zhu ; Jiawei Han

【Abstract】: Searching and mining large graphs today is critical to a variety of application domains, ranging from personalized recommendation in social networks to searches for functional associations in biological pathways. In these domains, there is a need to perform aggregation operations on large-scale networks. Unfortunately, the existing implementation of aggregation operations on relational databases does not guarantee superior performance in network space, especially when it involves edge traversals and joins of gigantic tables. In this paper, we investigate neighborhood aggregation queries: find nodes that have the top-k highest aggregate values over their h-hop neighbors. While these basic queries are common in a wide range of search and recommendation tasks, surprisingly they have not been studied systematically. We developed a Local Neighborhood Aggregation framework, called LONA, to answer them efficiently. LONA exploits two properties unique to network space: first, the aggregate values of neighboring nodes should be similar in most cases; second, given the distribution of attribute values, it is possible to estimate the upper-bound value of aggregates. These two properties inspire the development of novel pruning techniques: forward pruning using a differential index and backward pruning using partial distributions. Empirical results show that LONA can outperform the baseline algorithm by up to 10 times on real-life large networks.

【Keywords】: data mining; graph theory; query processing; relational databases; social networking (online); biological pathways; functional associations; graphs mining; large scale networks; local neighborhood aggregation framework; neighborhood aggregation queries; pruning techniques; relational databases; social networks; top-k aggregation queries; Couplings; Data privacy; Databases; Diseases; Hospitals; Influenza; Protection; Publishing; Stomach; Sun
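The neighborhood aggregation query itself (without LONA's pruning) fits in a few lines: BFS out to h hops from every node, sum an attribute, and keep the k largest sums. The graph encoding and data below are illustrative.

```python
from collections import deque
import heapq

def h_hop_sum(adj, values, src, h):
    """Sum of attribute values over src's h-hop neighborhood (BFS)."""
    seen, frontier, total = {src}, deque([(src, 0)]), values[src]
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                total += values[nb]
                frontier.append((nb, dist + 1))
    return total

def topk_aggregates(adj, values, h, k):
    """Nodes with the top-k highest h-hop aggregate values (the naive baseline)."""
    return heapq.nlargest(k, ((n, h_hop_sum(adj, values, n, h)) for n in adj),
                          key=lambda kv: kv[1])

adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
vals = {'a': 1, 'b': 5, 'c': 2, 'd': 7}
```

LONA's two pruning techniques exist to avoid running this full BFS-and-sum for every node.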

40. TopCells: Keyword-based search of top-k aggregated documents in text cube.

Paper Link】 【Pages】:381-384

【Authors】: Bolin Ding ; Bo Zhao ; Cindy Xide Lin ; Jiawei Han ; Chengxiang Zhai

【Abstract】: Previous studies on supporting keyword queries in RDBMSs provide users with a ranked list of relevant linked structures (e.g. joined tuples) or individual tuples. In this paper, we aim to support keyword search in a data cube with text-rich dimension(s) (so-called text cube). Each document is associated with structural dimensions. A cell in the text cube aggregates a set of documents with matching dimension values on a subset of dimensions. Given a keyword query, our goal is to find the top-k most relevant cells in the text cube. We propose a relevance scoring model and efficient ranking algorithms. Experiments are conducted to verify their efficiency.

【Keywords】: data structures; relational databases; text analysis; RDBMSs; TopCells; individual tuples; keyword based search; structural dimensions; text cube; top-k aggregated documents; Aggregates; Computer science; Databases; Internet; Keyword search; Linux; Multidimensional systems; Navigation; Portable computers; Power system modeling

Workflow and Workload Management 4

41. Optimizing ETL workflows for fault-tolerance.

Paper Link】 【Pages】:385-396

【Authors】: Alkis Simitsis ; Kevin Wilkinson ; Umeshwar Dayal ; Malú Castellanos

【Abstract】: Extract-Transform-Load (ETL) processes play an important role in data warehousing. Typically, design work on ETL has focused on performance as the sole metric to make sure that the ETL process finishes within an allocated time window. However, other quality metrics are also important and need to be considered during ETL design. In this paper, we address ETL design for performance plus fault-tolerance and freshness. There are many reasons why an ETL process can fail and a good design needs to guarantee that it can be recovered within the ETL time window. How to make ETL robust to failures is not trivial. There are different strategies that can be used and they each have different costs and benefits. In addition, other metrics can affect the choice of a strategy; e.g., higher freshness reduces the time window for recovery. The design space is too large for informal, ad-hoc approaches. In this paper, we describe our QoX optimizer that considers multiple design strategies and finds an ETL design that satisfies multiple objectives. In particular, we define the optimizer search space, cost functions, and search algorithms. Also, we illustrate its use through several experiments and we show that it produces designs that are very near optimal.

【Keywords】: data warehouses; fault tolerant computing; matrix algebra; optimisation; ETL design; ETL workflows optimization; QoX optimizer; cost functions; data warehousing; extract-transform-load processes; fault tolerance; optimizer search space; quality metrics; search algorithms; sole metric; Availability; Cost function; Data mining; Data warehouses; Design optimization; Fault tolerance; Maintenance; Robustness; Scalability; Warehousing

42. Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads.

Paper Link】 【Pages】:397-408

【Authors】: Sean Tozer ; Tim Brecht ; Ashraf Aboulnaga

【Abstract】: In three-tiered web applications, some form of admission control is required to ensure that throughput and response times are not significantly harmed during periods of heavy load. We propose Q-Cop, a prototype system for improving admission control decisions that considers a combination of the load on the system, the number of simultaneous queries being executed, the actual mix of queries being executed, and the expected time a user may wait for a reply before they or their browser give up (i.e., time out). Using TPC-W queries, we show that the response times of different types of queries can vary significantly depending not just on the number of queries being processed but on the mix of other queries that are running simultaneously. We develop a model of expected query execution times that accounts for the mix of queries being executed and integrate this model into a three-tiered system to make admission control decisions. Our results show that this approach makes more informed decisions about which queries to reject and as a result significantly reduces the number of requests that time out. Across the range of workloads examined an average of 47% fewer requests are unsuccessful than the next best approach.

【Keywords】: Internet; software prototyping; Q-cop; TPC-W queries; admission control; avoiding bad query mixes; heavy loads; minimize client timeouts; prototype system; query execution times; three tiered Web applications; Admission control; Application software; Computer science; Database systems; Delay; Humans; Prototypes; Resource management; Throughput; Web server
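The admission decision described above reduces to comparing a mix-aware response-time estimate against the client timeout. A minimal sketch assuming a pre-measured linear interaction model; the per-type base times and interaction coefficients below are invented, not from the paper.

```python
def predicted_time(base, interaction, mix, qtype):
    """Estimated response time for a new query of `qtype`, given the counts of
    currently running queries per type in `mix`. interaction[a][b] is the
    (assumed, pre-measured) slowdown one running query of type b adds to a
    query of type a."""
    return base[qtype] + sum(interaction[qtype][other] * count
                             for other, count in mix.items())

def admit(base, interaction, mix, qtype, timeout):
    """Q-Cop-style rule: admit only when the query is predicted to finish
    before the client (or browser) would give up."""
    return predicted_time(base, interaction, mix, qtype) <= timeout

base = {'browse': 0.5, 'search': 1.0}
inter = {'browse': {'browse': 0.125, 'search': 0.5},
         'search': {'browse': 0.25, 'search': 0.75}}
mix = {'browse': 4, 'search': 2}   # queries currently executing
```

Rejecting early, rather than letting a doomed query run and time out, is what reduces the number of unsuccessful requests.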

43. Admission control mechanisms for continuous queries in the cloud.

Paper Link】 【Pages】:409-412

【Authors】: Lory Al Moakar ; Panos K. Chrysanthis ; Christine Chung ; Shenoda Guirguis ; Alexandros Labrinidis ; Panayiotis Neophytou ; Kirk Pruhs

【Abstract】: Amazon, Google, and IBM now sell cloud computing services. We consider the setting of a for-profit business selling data stream monitoring/management services, and we investigate auction-based mechanisms for admission control of continuous queries. When submitting a query, each user also submits a bid of how much she is willing to pay for that query to run. The admission control auction mechanism then determines which queries to admit and how much to charge each user in a way that maximizes system revenue while being strategyproof and sybil-immune, incentivizing users to use the system honestly. Specifically, we require that each user maximizes her payoff by bidding her true value of having her query run. We design several payment mechanisms and experimentally evaluate them. We describe the provable game-theoretic characteristics of each mechanism alongside its performance with respect to maximizing profit and total user payoff.

【Keywords】: Web services; electronic commerce; game theory; query processing; admission control mechanisms; auction-based mechanisms; cloud computing services; continuous queries; data stream management services; data stream monitoring services; for-profit business; profit maximization; provable game theoretic characteristics; total user payoff; Admission control; Business; Cloud computing; Computer science; Computerized monitoring; Cost accounting; Educational institutions; Game theory; Kirk field collapse effect; Streaming media

44. Interaction-aware prediction of business intelligence workload completion times.

Paper Link】 【Pages】:413-416

【Authors】: Mumtaz Ahmad ; Songyun Duan ; Ashraf Aboulnaga ; Shivnath Babu

【Abstract】: While planning the execution of report-generation workloads, database administrators often need to know how long different query workloads will take to run. Database systems run mixes of multiple queries of different types concurrently. Hence, estimating the completion time of a query workload requires reasoning about query mixes and inter-query interactions in the mixes, rather than considering queries or query types in isolation. This paper presents a novel approach for estimating workload completion time based on experiment-driven modeling and simulation of the impact of inter-query interactions. A preliminary evaluation of this approach with TPC-H queries on IBM DB2 shows that our approach can consistently predict workload completion times with good accuracy.

【Keywords】: competitive intelligence; database management systems; query processing; report generators; IBM DB2; TPC-H queries; business intelligence workload completion times; database administrators; database systems; inter query interactions; interaction aware prediction; query mixes; query workloads; report generation workloads; Bismuth; Data warehouses; Database systems; Deductive databases; Hardware; Spatial databases

Indexing and Hashing 4

45. Fast in-memory XPath search using compressed trees.

Paper Link】 【Pages】:417-428

【Authors】: Diego Arroyuelo ; Francisco Claude ; Sebastian Maneth ; Veli Mäkinen ; Gonzalo Navarro ; Kim Nguyen ; Jouni Sirén ; Niko Välimäki

【Abstract】: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.

【Keywords】: XML; document handling; query languages; search engines; text analysis; tree data structures; SXSI system; SXSI system perform; XML document; XPath query engine; XPath query language; XPathMark queries; compressed indexes; global compressed self index; in-memory XPath search; text indexes; text search; tree automata; tree indexes; Australia; Automata; Computer science; Data structures; Database languages; Engines; Random access memory; Read-write memory; Tree data structures; XML

46. Hashing tree-structured data: Methods and applications.

Paper Link】 【Pages】:429-440

【Authors】: Shirish Tatikonda ; Srinivasan Parthasarathy

【Abstract】: In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of min-wise hashing, we realize a fixed-size signature sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We discuss several potential pivot structures, study some of the theoretical properties of such structures, and discuss their implications for tree edit distance and properties related to perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.

【Keywords】: tree data structures; minwise hashing; pivot multisets; signature sketch; tree structured data hashing; unordered tree; wedge shaped structures; Application software; Collaboration; Computer science; Data analysis; Data engineering; Data mining; Phylogeny; Semantic Web; Tree graphs; XML
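Min-wise hashing, which the paper applies to pivot multisets, works the same way for any set: under a random hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity. A generic sketch, with Python's built-in `hash` plus per-function salts standing in for a proper hash family:

```python
import random

def minhash_signature(items, num_hashes=64, seed=7):
    """One minimum per salted hash function; the list is the set's signature."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def resemblance(sig_a, sig_b):
    """Fraction of agreeing slots: an unbiased estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two identical pivot sets resemble at 1.0, disjoint ones at roughly 0; the signature's fixed size is what makes it usable as a hash of the whole tree.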

47. Estimating the compression fraction of an index using sampling.

Paper Link】 【Pages】:441-444

【Authors】: Stratos Idreos ; Raghav Kaushik ; Vivek R. Narasayya ; Ravishankar Ramamurthy

【Abstract】: Data compression techniques such as null suppression and dictionary compression are commonly used in today's database systems. In order to effectively leverage compression, it is necessary to have the ability to efficiently and accurately estimate the size of an index if it were to be compressed. Such an analysis is critical if automated physical design tools are to be extended to handle compression. Several database systems today provide estimators for this problem based on random sampling. While this approach is efficient, there is no previous work that analyses its accuracy. In this paper, we analyse the problem of estimating the compressed size of an index from the point of view of worst-case guarantees. We show that the simple estimator implemented by several database systems has several "good" cases even though the estimator itself is agnostic to the internals of the specific compression algorithm.

【Keywords】: data analysis; data compression; database management systems; dictionaries; estimation theory; sampling methods; compression algorithm; data compression techniques; database systems; dictionary compression; estimators; index compression fraction estimation; null suppression; random sampling; Algorithm design and analysis; Capacity planning; Costs; Data analysis; Data compression; Database systems; Indexes; Performance analysis; Sampling methods; Yield estimation
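The sampling estimator the paper analyses can be sketched in a few lines: compress a random sample of index pages, take the average compressed size, and scale up to the full index. Here `zlib` stands in for the engine's null-suppression or dictionary scheme.

```python
import random
import zlib

def estimate_compressed_size(pages, sample_frac=0.1, seed=0):
    """Scale the average compressed size of a random page sample
    up to the whole index."""
    rng = random.Random(seed)
    n = max(1, int(len(pages) * sample_frac))
    sample = rng.sample(pages, n)
    avg = sum(len(zlib.compress(page)) for page in sample) / n
    return avg * len(pages)
```

The worst cases studied in the paper arise when compressibility varies wildly across pages, so the sample average misrepresents the population.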

48. The Hybrid-Layer Index: A synergic approach to answering top-k queries in arbitrary subspaces.

Paper Link】 【Pages】:445-448

【Authors】: Jun-Seok Heo ; Junghoo Cho ; Kyu-Young Whang

【Abstract】: In this paper, we propose the Hybrid-Layer Index (simply, the HL-index) that is designed to answer top-k queries efficiently when the queries are expressed on any arbitrary subset of attributes in the database. Compared to existing approaches, the HL-index significantly reduces the number of tuples accessed during query processing by pruning unnecessary tuples based on two criteria, i.e., it filters out tuples both (1) globally based on the combination of all attribute values of the tuples like in the layer-based approach (simply, layer-level filtering) and (2) based on individual attribute values used for ranking the tuples like in the list-based approach (simply, list-level filtering). Specifically, the HL-index exploits the synergic effect of integrating the layer-level filtering method and the list-level filtering method. Details and extensive experiments are available in the full paper.

【Keywords】: database management systems; query processing; set theory; arbitrary subset; arbitrary subspaces; databases; hybrid layer index; layer level filtering method; layer-level filtering; list based approach; query answering; query processing; tuples pruning; Computer interfaces; Computer science; Databases; Digital cameras; Filtering; Filters; Indexes; Partitioning algorithms; Query processing; Sensor phenomena and characterization

Scientific Data Mining 3

49. The model-summary problem and a solution for trees.

Paper Link】 【Pages】:449-460

【Authors】: Biswanath Panda ; Mirek Riedewald ; Daniel Fink

【Abstract】: Modern science is collecting massive amounts of data from sensors, instruments, and through computer simulation. It is widely believed that analysis of this data will hold the key for future scientific breakthroughs. Unfortunately, deriving knowledge from large high-dimensional scientific datasets is difficult. One emerging answer is exploratory analysis using data mining; but data mining models that accurately capture natural processes tend to be very complex and are usually not intelligible. Scientists therefore generate model summaries to find the most important patterns learned by the model. We formalize the model-summary problem and introduce it as a novel problem to the database community. Generating model summaries creates serious data management challenges: Scientists usually want to analyze patterns in different "slices" and "dices" of the data space, comparing the effects of various input variables on the output. We propose novel techniques for efficiently generating such summaries for the popular class of tree-based models. Our techniques leverage workload structure on multiple levels. We also propose a scalable implementation of our techniques in MapReduce. For both sequential and parallel implementation, we achieve speedups of one or more orders of magnitude over the naive algorithm, while guaranteeing the exact same results.

【Keywords】: algorithm theory; data analysis; data mining; trees (mathematics); MapReduce; data analysis; data management; data mining; naive algorithm; tree model; Birds; Computer simulation; Data analysis; Data mining; Databases; Educational institutions; Environmental factors; Information science; Instruments; Pattern analysis

50. Efficient and accurate discovery of patterns in sequence datasets.

Paper Link】 【Pages】:461-472

【Authors】: Avrilia Floratou ; Sandeep Tata ; Jignesh M. Patel

【Abstract】: Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of "approximate" patterns that are contiguous. The few existing algorithms that can be applied to such contiguous approximate pattern mining have drawbacks like poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLAME (FLexible and Accurate Motif DEtector). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic datasets, we demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. Using FLAME, it is now possible to mine datasets that would have been prohibitively difficult with existing tools.

【Keywords】: data mining; FLAME; contiguous approximate pattern mining; flexible and accurate motif detector; flexible suffix tree based algorithm; patterns discovery; sequence dataset; sequence mining algorithm; Application software; Biology computing; Computational biology; DNA; Data mining; Detectors; Fires; Proteins; Scalability; Sequences
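The notion of a contiguous "approximate" pattern can be made concrete with a naive occurrence counter under a Hamming-distance motif model; FLAME's suffix-tree search supports this and richer models far more efficiently than this sketch.

```python
def approx_occurrences(seq, motif, max_mismatch):
    """Count contiguous windows of seq within max_mismatch substitutions of
    motif (Hamming distance), the simplest approximate-motif model."""
    m = len(motif)
    return sum(
        sum(a != b for a, b in zip(seq[i:i + m], motif)) <= max_mismatch
        for i in range(len(seq) - m + 1)
    )
```

A motif is then "frequent" when its approximate occurrence count across the dataset clears a support threshold.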

51. Mining mutation chains in biological sequences.

Paper Link】 【Pages】:473-484

【Authors】: Chang Sheng ; Wynne Hsu ; Mong-Li Lee ; Joo Chuan Tong ; See-Kiong Ng

【Abstract】: The increasing number of infectious disease outbreaks has led to a need for new research to better understand these diseases' origins, epidemiological features, and the pathogenicity caused by fast-mutating, fast-spreading viruses. Traditional sequence analysis methods do not take into account the spatio-temporal dynamics of rapidly evolving and spreading viral species. They are also focused on identifying single-point mutations. In this paper, we propose a novel approach that incorporates space-time relationships for studying changes in protein sequences from fast-mutating viruses. We aim to detect both single-point mutations as well as k-mutations in the viral sequences. We define the problem of mutation chain pattern mining and design algorithms to discover valid mutation chains. Compact data structures to facilitate the mining process as well as pruning strategies to increase the scalability of the algorithms are devised. Experiments on both synthetic datasets and a real-world influenza A virus dataset show that our algorithms are scalable and effective in discovering mutations that occur geographically over time.

【Keywords】: biology computing; data mining; data structures; biological sequences; chain pattern mining; data structures; epidemiological features; mining mutation chains; single point mutations; space time relationships; spatio temporal dynamics; Amino acids; Diseases; Genetic mutations; Humans; Immune system; Influenza; Pathogens; Proteins; Vaccines; Viruses (medical)

Database Performance and Reliability 3

52. Exploring power-performance tradeoffs in database systems.

Paper Link】 【Pages】:485-496

【Authors】: Zichen Xu ; Yi-Cheng Tu ; Xiaorui Wang

【Abstract】: With the total energy consumption of computing systems increasing at a steep rate, much attention has been paid to the design of energy-efficient computing systems and applications. So far, database system design has focused on improving performance of query processing. The objective of this study is to experimentally explore the potential of power conservation in relational database management systems. We hypothesize that, by modifying the query optimizer in a DBMS to take the power cost of query plans into consideration, we will be able to reduce the power usage of database servers and control the tradeoffs between power consumption and system performance. We also identify the sources of such savings by investigating the resource consumption features during query processing in DBMSs. To that end, we provide an in-depth anatomy and qualitatively analyze the power profile of typical queries in the TPC benchmarks. We perform extensive experiments on a physical testbed based on the PostgreSQL system using workloads generated from the TPC benchmarks. Our hypothesis is supported by such experimental results: power savings in the range of 11% - 22% can be achieved by equipping the DBMS with a query optimizer that selects query plans based on both estimated processing time and power requirements.

【Keywords】: database management systems; performance evaluation; power consumption; query processing; relational databases; DBMS; PostgreSQL system; TPC benchmarks; database servers; database system design; energy consumption; energy efficient computing systems; physical testbed; power conservation; power performance tradeoffs exploration; query optimizer; query processing; relational database management systems; steep rate; system performance; tradeoffs control; Benchmark testing; Computer applications; Cost function; Database systems; Energy consumption; Energy efficiency; Energy management; Power system management; Query processing; Relational databases
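The hypothesis boils down to a plan-choice rule: score each candidate plan by a blend of estimated time and estimated power, with a tunable weight controlling the tradeoff. The plan estimates below are invented numbers, not PostgreSQL output.

```python
def pick_plan(plans, power_weight):
    """Choose the plan minimizing time + power_weight * power; a weight of 0
    recovers a conventional time-only optimizer."""
    return min(plans, key=lambda p: p['time'] + power_weight * p['power'])

plans = [
    {'name': 'hash-join',  'time': 10.0, 'power': 9.0},
    {'name': 'merge-join', 'time': 12.0, 'power': 5.0},
]
```

Sweeping `power_weight` traces out the power-performance tradeoff curve the paper explores experimentally.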

53. Workload driven index defragmentation.

Paper Link】 【Pages】:497-508

【Authors】: Vivek R. Narasayya ; Manoj Syamala

【Abstract】: Decision support queries that scan large indexes can suffer significant degradation in I/O performance due to index fragmentation. DBAs rely on rules of thumb that use index size and fragmentation information to accomplish the task of deciding which indexes to defragment. However, there are two fundamental limitations that make this task challenging. First, database engines offer little support to help estimate the impact of defragmenting an index on the I/O performance of a query. Second, defragmentation is supported only at the granularity of an entire B+-Tree, which can be too restrictive since defragmentation is an expensive operation. This paper describes techniques for addressing the above limitations. We also study the problem of selecting the appropriate indexes to defragment for a given workload. We have implemented our techniques in Microsoft SQL Server and developed a tool that can provide appropriate index defragmentation recommendations to DBAs. We evaluate the effectiveness of the proposed techniques on several real and synthetic databases.

【Keywords】: SQL; decision support systems; input-output programs; B+-tree; I/O performance; Microsoft SQL server; database engines; decision support queries; expensive operation defragmentation; fragmentation information; scan large indexes; workload driven index defragmentation; Costs; Degradation; Engines; Indexes; Information analysis; Information management; Performance analysis; Relational databases; Thumb

54. Impact of disk corruption on open-source DBMS.

Paper Link】 【Pages】:509-520

【Authors】: Sriram Subramanian ; Yupu Zhang ; Rajiv Vaidyanathan ; Haryadi S. Gunawi ; Andrea C. Arpaci-Dusseau ; Remzi H. Arpaci-Dusseau ; Jeffrey F. Naughton

【Abstract】: Despite the best intentions of disk and RAID manufacturers, on-disk data can still become corrupted. In this paper, we examine the effects of corruption on database management systems. Through injecting faults into the MySQL DBMS, we find that in certain cases, corruption can greatly harm the system, leading to untimely crashes, data loss, or even incorrect results. Overall, of 145 injected faults, 110 lead to serious problems. More detailed observations point us to three deficiencies: MySQL does not have the capability to detect some corruptions due to lack of redundant information, does not isolate corrupted data from valid data, and has inconsistent reactions to similar corruption scenarios. To detect and repair corruption, a DBMS is typically equipped with an offline checker. Unfortunately, the MySQL offline checker is not comprehensive in the checks it performs, misdiagnosing many corruption scenarios and missing others. Sometimes the checker itself crashes; more ominously, its incorrect checking can lead to incorrect repairs. Overall, we find that the checker does not behave correctly in 18 of 145 injected corruptions, and thus can leave the DBMS vulnerable to the problems described above.

【Keywords】: data structures; database management systems; disc storage; public domain software; RAID manufacturers; database management systems; disk corruption impact; injected corruptions; open source DBMS; redundant information; Computer aided manufacturing; Computer crashes; Computer science; Concurrency control; Database systems; Drives; File systems; Open source software; Power capacitors; Protection

Spatial Databases 3

55. Locating mapped resources in Web 2.0.

Paper Link】 【Pages】:521-532

【Authors】: Dongxiang Zhang ; Beng Chin Ooi ; Anthony K. H. Tung

【Abstract】: Mapping mashups are emerging Web 2.0 applications in which data objects such as blogs, photos and videos from different sources are combined and marked on a map using APIs that are released by online mapping solutions such as Google and Yahoo Maps. These objects are typically associated with a set of tags capturing the embedded semantics and a set of coordinates indicating their geographical locations. Traditional web resource searching strategies are not effective in such an environment due to the lack of gazetteer context in the tags. Instead, a better alternative approach is to locate an object by tag matching. However, the number of tags associated with each object is typically small, making it difficult for an object to capture the complete semantics of the query objects. In this paper, we focus on the fundamental application of locating geographical resources and propose an efficient tag-centric query processing strategy. In particular, we aim to find a set of nearest co-located objects which together match the query tags. Given that there could be a large number of data objects and tags, we develop an efficient search algorithm that can scale up in terms of the number of objects and tags. Further, to ensure that the results are relevant, we also propose a geographical-context-sensitive geo-tf-idf ranking mechanism. Our experiments on synthetic data sets demonstrate its scalability, while the experiments using a real-life data set confirm its practicality.

【Keywords】: Internet; query processing; search problems; APIs; Web 2.0; embedded semantic; geographical context sensitive geo-tf-idf ranking mechanism; geographical resources; mapped resources location; mapping mashups; online mapping solutions; search algorithm; synthetic data sets; tag centric query processing strategy; Blogs; Context modeling; Data models; Mashups; Query processing; Scalability; Search engines; Spatial databases; Tagging; Videos

56. Preference queries in large multi-cost transportation networks.

Paper Link】 【Pages】:533-544

【Authors】: Kyriakos Mouratidis ; Yimin Lin ; Man Lung Yiu

【Abstract】: Research on spatial network databases has so far assumed that there is a single cost value associated with each road segment of the network. In most real-world situations, however, there may exist multiple cost types involved in transportation decision making. For example, the different costs of a road segment could be its Euclidean length, the driving time, the walking time, a possible toll fee, etc. The relative significance of these cost types may vary from user to user. In this paper we consider such multi-cost transportation networks (MCN), where each edge (road segment) is associated with multiple cost values. We formulate skyline and top-k queries in MCNs and design algorithms for their efficient processing. Our solutions have two important properties in preference-based querying: the skyline methods are progressive and the top-k ones are incremental. The performance of our techniques is evaluated with experiments on a real road network.

【Keywords】: query processing; road traffic; traffic engineering computing; visual databases; Euclidean length; driving time; multicost transportation networks; multiple cost values; preference queries; road segment; skyline queries; spatial network databases; toll fee; top-k queries; transportation decision making; walking time; Computer network management; Computer networks; Costs; Information management; Logistics; Lungs; Management information systems; Road transportation; Solids; Spatial databases
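The skyline semantics above reduces to Pareto dominance over multi-cost vectors. The following minimal sketch illustrates the dominance test and a naive skyline filter over invented (distance_km, minutes, toll) cost triples; the paper's progressive and incremental algorithms are of course far more elaborate:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every cost and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(candidates):
    """Naive O(n^2) Pareto filter: keep the cost vectors no other candidate dominates."""
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates if q != p)]

# Hypothetical routes as (distance_km, minutes, toll) cost vectors.
routes = [(10, 15, 2.0), (12, 12, 0.0), (11, 20, 2.5)]
# (11, 20, 2.5) is dominated by (10, 15, 2.0) and drops out of the skyline.
```

A top-k query under user-specific preferences would instead rank candidates by a weighted combination of the same cost vectors.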

57. Approximate string search in spatial databases.

Paper Link】 【Pages】:545-556

【Authors】: Bin Yao ; Feifei Li ; Marios Hadjieleftheriou ; Kun Hou

【Abstract】: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.

【Keywords】: tree data structures; visual databases; MHR-tree; index nodes subtrees; index structure; linear hashing technique; spatial databases string search; string match queries; Adaptive algorithm; Computer science; Costs; Dynamic programming; Indexes; Keyword search; Nearest neighbor searches; Spatial databases; Uncertainty
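The min-wise signature idea can be sketched in a few lines: hash each q-gram under k independent hash functions and keep the per-function minima; the fraction of matching minima estimates the Jaccard set resemblance used for pruning. In this toy version, md5-based hashing is an illustrative stand-in, not the paper's hash family:

```python
import hashlib

def qgrams(s, q=2):
    """The set of overlapping q-grams of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def _h(x, seed):
    """Deterministic stand-in for a random hash function (one per seed)."""
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)

def minwise_signature(grams, k=32):
    """Keep the minimum hash value of the set under each of k hash functions."""
    return [min(_h(g, seed) for g in grams) for seed in range(k)]

def resemblance(sig_a, sig_b):
    """Fraction of matching minima: an estimator of the Jaccard set resemblance."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

An index node's signature summarizes the union of q-grams below it, so a low estimated resemblance with the query string prunes the whole subtree.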

Sensor Networks 4

58. Global iceberg detection over distributed data streams.

Paper Link】 【Pages】:557-568

【Authors】: Haiquan (Chuck) Zhao ; Ashwin Lall ; Mitsunori Ogihara ; Jun (Jim) Xu

【Abstract】: In today's Internet applications and sensor networks we often encounter large amounts of data spread over many physically distributed nodes. The sheer volume of the data and bandwidth constraints make it impractical to send all the data to one central node for query processing. Finding distributed icebergs, i.e., elements that may have low frequency at individual nodes but high aggregate frequency, is a problem that arises commonly in practice. In this paper we present a novel algorithm with two notable properties. First, its accuracy guarantee and communication cost are independent of how element counts (for both icebergs and non-icebergs) are split among the nodes. Second, it works even when each distributed data set is a stream (i.e., one-pass data access only). Our algorithm builds upon sketches constructed for estimating the second frequency moment (F2) of data streams. The intuition is that when there are global icebergs in the union of these data streams, the F2 of the union becomes very large; this quantity can be estimated because F2 sketches are summable. Our key innovation is to establish tight theoretical guarantees for our algorithm, under certain reasonable assumptions, using an interesting combination of convex ordering theory and large-deviation techniques.

【Keywords】: Internet; query processing; wireless sensor networks; Internet applications; aggregate frequency; convex ordering theory; deviation techniques; distributed data streams; global iceberg detection; query processing; sensor networks; Aggregates; Computer networks; Costs; Data security; Distributed computing; Educational institutions; Frequency estimation; Internet; Monitoring; Query processing
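The summability the abstract relies on can be illustrated with a basic AMS-style F2 sketch: each node accumulates counters updated by a ±1 hash of every item, the coordinator adds the counter vectors, and the mean of the squared counters estimates F2 of the union. This is a simplified sketch, with md5 signs standing in for the four-wise independent hash family the analysis assumes:

```python
import hashlib

def _sign(item, seed):
    """Deterministic ±1 hash of an item (one hash function per counter)."""
    return 1 if hashlib.md5(f"{seed}:{item}".encode()).digest()[0] % 2 == 0 else -1

def ams_sketch(stream, num_counters=64):
    """One pass over a local stream: counter j accumulates the ±1 signs."""
    sk = [0] * num_counters
    for item in stream:
        for j in range(num_counters):
            sk[j] += _sign(item, j)
    return sk

def combine(sketches):
    """Sketches are linear, so the sketch of the union is the coordinate-wise sum."""
    return [sum(col) for col in zip(*sketches)]

def estimate_f2(sk):
    """Each squared counter is an unbiased F2 estimate; average them."""
    return sum(c * c for c in sk) / len(sk)
```

A global iceberg inflates F2 of the union quadratically in its count, which is why the combined estimate separates iceberg-bearing workloads from flat ones.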

59. Non-dyadic Haar wavelets for streaming and sensor data.

Paper Link】 【Pages】:569-580

【Authors】: Chetan Gupta ; Choudur Lakshminarayan ; Song Wang ; Abhay Mehta

【Abstract】: In streaming and sensor data applications, the problems of synopsis construction and outlier detection are important. Due to their low complexity, desirable properties and relative ease of understanding, wavelet-based techniques are often used for both synopsis construction and anomaly detection. In the streaming-data literature, Mallat's algorithm is often used to compute a Haar wavelet decomposition in O(n) time. However, this popular technique has one limitation: it leads to a dyadic decomposition of the data. We demonstrate that non-dyadicity is of considerable use in synopsis construction and anomaly detection, and present several application results, including a synopsis data structure for streaming data that is an order of magnitude superior to the popular Haar-based wavelet technique and a method for finding anomalies in sensor data over non-dyadic hierarchies. We enable non-dyadicity by proposing a Mallat-like construction for a wavelet system that admits a non-dyadic basis. Our algorithm builds a non-dyadic hierarchical structure and is more efficient than the state-of-the-art construction. We prove the correctness of our construction by showing that our basis functions exhibit the properties of a wavelet system.

【Keywords】: Haar transforms; computational complexity; data handling; probability; security of data; wavelet transforms; O(n) time; anomaly detection; nondyadic Haar wavelets; outlier detection; sensor data; streaming data; synopsis construction; synopsis data structure; Data analysis; Data mining; Data structures; Monitoring; Signal analysis; Signal resolution; Temperature measurement; Temperature sensors; Wavelet analysis; Wavelet coefficients
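For context, the dyadic baseline the paper generalizes: Mallat-style Haar decomposition repeatedly halves the signal into pairwise averages and half-differences in O(n) total time. A small sketch using the unnormalized (average / half-difference) Haar variant:

```python
def haar_decompose(data):
    """Full dyadic Haar decomposition: overall average first, then detail
    coefficients from coarsest to finest. Requires a power-of-two length."""
    assert data and len(data) & (len(data) - 1) == 0, "dyadic: length must be a power of two"
    coeffs, level = [], list(data)
    while len(level) > 1:
        averages = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
        details = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
        coeffs = details + coeffs   # prepend so coarser levels come first
        level = averages
    return level + coeffs

def haar_reconstruct(coeffs):
    """Invert the decomposition level by level: (a + d, a - d) per pair."""
    level, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(level)]
        pos += len(details)
        level = [v for a, d in zip(level, details) for v in (a + d, a - d)]
    return level
```

The power-of-two assertion is precisely the dyadic restriction that the paper's non-dyadic construction removes.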

60. Ratio threshold queries over distributed data sources.

Paper Link】 【Pages】:581-584

【Authors】: Rajeev Gupta ; Krithi Ramamritham ; Mukesh K. Mohania

【Abstract】: In this paper we consider triggers over distributed data from various sources, such as: "Notify when sales of luxury goods constitute more than 20% of the overall sales." In such queries, the client desires to be notified whenever the ratio of two aggregates over distributed data crosses a specified threshold. The challenge lies in executing such queries with the minimal amount of communication necessary for update propagation. We address this challenge by proposing schemes for converting the client threshold condition into conditions on individual distributed data sources such that (1) violation of the client threshold occurs only if one or more source conditions are violated (zero false negatives), and (2) the number of source violations when the client threshold is not violated is small (minimal false positives). Using performance evaluation we show that our algorithms result in up to an order of magnitude fewer false positives compared to approaches in the literature.

【Keywords】: competitive intelligence; decision making; distributed databases; query processing; client threshold condition; distributed data sources; performance evaluation; ratio threshold queries; Aggregates; Consumer electronics; Databases; Decision making; Frequency measurement; Geography; Marketing and sales; Medical services
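The zero-false-negative decomposition can be sketched as follows: the global condition A/B > θ is rewritten as Σ(a_i − θ·b_i) > 0, and each source i checks its own term against a slack δ_i with Σδ_i ≤ 0; if no source fires, the global threshold cannot have been crossed. A toy illustration with invented sales figures (the paper's schemes additionally tune the slacks to minimize false positives):

```python
def global_violated(pairs, theta):
    """pairs: per-source (a_i, b_i) aggregates; check sum(a) / sum(b) > theta."""
    return sum(a for a, _ in pairs) > theta * sum(b for _, b in pairs)

def local_violations(pairs, theta, deltas):
    """Source i fires when a_i - theta * b_i exceeds its slack delta_i.
    With sum(deltas) <= 0, some source fires whenever the global ratio is
    violated, so there are zero false negatives."""
    return [a - theta * b > d for (a, b), d in zip(pairs, deltas)]
```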

61. Probabilistic Top-k query processing in distributed sensor networks.

Paper Link】 【Pages】:585-588

【Authors】: Mao Ye ; Xingjie Liu ; Wang-Chien Lee ; Dik Lun Lee

【Abstract】: In this paper, we propose the notion of sufficient set for distributed processing of probabilistic Top-k queries in cluster-based wireless sensor networks. Through the derivation of sufficient boundary, we show that data items ranked lower than sufficient boundary are not required for answering the probabilistic top-k queries, thus are subject to local pruning. Accordingly, we develop the sufficient set-based (SSB) algorithm for inter-cluster query processing. Experimental results show that the proposed algorithm reduces data transmissions significantly.

【Keywords】: distributed algorithms; distributed processing; probability; query processing; wireless sensor networks; cluster based wireless sensor networks; distributed processing; distributed sensor networks; intercluster query processing; probabilistic top-k query processing; pruning; sufficient set based algorithm; Amplitude modulation; Bandwidth; Clustering algorithms; Computer science; Data communication; Delay; Distributed processing; Query processing; Temperature sensors; Wireless sensor networks

Query Optimization 3

62. Polynomial heuristics for query optimization.

Paper Link】 【Pages】:589-600

【Authors】: Nicolas Bruno ; César A. Galindo-Legaria ; Milind Joshi

【Abstract】: Research on query optimization has traditionally focused on exhaustive enumeration of an exponential number of candidate plans. Alternatively, heuristics for query optimization are restricted in several ways, such as by either focusing on join predicates only, ignoring the availability of indexes, or in general having high-degree polynomial complexity. In this paper we propose a heuristic approach to very efficiently obtain execution plans for complex queries, which takes into account the presence of indexes and goes beyond simple join reordering. We also introduce a realistic workload generator and validate our approach using both synthetic and real data.

【Keywords】: computational complexity; polynomial approximation; query processing; heuristic approach; join predicates; join reordering; polynomial complexity; polynomial heuristics; query optimization; realistic workload generator; Availability; Computer languages; Costs; Dynamic programming; Heuristic algorithms; Performance evaluation; Polynomials; Proposals; Query processing; Volcanoes
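One classic polynomial-time heuristic in this spirit (a generic sketch, not the authors' algorithm) is greedy join ordering: repeatedly merge the two subplans with the cheapest estimated intermediate result. The cardinalities and selectivities below are invented:

```python
def greedy_join_order(cards, selectivities):
    """cards: {relation: cardinality}; selectivities: {(r, s): join selectivity}.
    Greedily merges the pair of subplans with the smallest estimated join size,
    polynomial time versus exhaustive exponential plan enumeration."""
    plans = {frozenset([r]): (c, r) for r, c in cards.items()}

    def sel(left, right):
        s = 1.0
        for r in left:
            for t in right:
                s *= selectivities.get((r, t), selectivities.get((t, r), 1.0))
        return s

    while len(plans) > 1:
        a, b = min(((a, b) for a in plans for b in plans if a != b),
                   key=lambda ab: plans[ab[0]][0] * plans[ab[1]][0] * sel(*ab))
        (ca, ta), (cb, tb) = plans.pop(a), plans.pop(b)
        plans[a | b] = (ca * cb * sel(a, b), (ta, tb))
    ((card, tree),) = plans.values()
    return card, tree
```

With R(1000), S(100), T(10) and selectivities 0.01 for R-S and 0.1 for S-T, the heuristic joins S and T first (estimated 100 tuples) and then joins in R.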

63. Optimized query evaluation using cooperative sorts.

Paper Link】 【Pages】:601-612

【Authors】: Yu Cao ; Ramadhana Bramandia ; Chee-Yong Chan ; Kian-Lee Tan

【Abstract】: Many applications require sorting a table over multiple sort orders: generation of multiple reports from a table, evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries. In this paper, we study how multiple sortings of a table can be efficiently performed. We introduce a new evaluation technique, called cooperative sort, that exploits the relationships among the input set of sort orders to minimize I/O operations for the collection of sort operations. To demonstrate the efficiency of the proposed scheme, we implemented it in PostgreSQL and evaluated its performance using both TPC-DS benchmark and synthetic data. Our experimental results show significant performance improvement over the traditional non-cooperative sorting scheme.

【Keywords】: batch processing (computers); optimisation; query processing; sorting; PostgreSQL; TPC-DS benchmark; batch processing; cooperative sorts; query evaluation optimization; synthetic data; table sorting; Application software; Clustering algorithms; Computer science; Costs; Database systems; Decision making; Merging; Query processing; Scheduling; Sorting

64. Generating code for holistic query evaluation.

Paper Link】 【Pages】:613-624

【Authors】: Konstantinos Krikellas ; Stratis Viglas ; Marcelo Cintra

【Abstract】: We present the application of customized code generation to database query evaluation. The idea is to use a collection of highly efficient code templates and dynamically instantiate them to create query- and hardware-specific source code. The source code is compiled and dynamically linked to the database server for processing. Code generation diminishes the bloat of higher-level programming abstractions necessary for implementing generic, interpreted, SQL query engines. At the same time, the generated code is customized for the hardware it will run on. We term this approach holistic query evaluation. We present the design and development of a prototype system called HIQUE, the Holistic Integrated Query Engine, which incorporates our proposals. We undertake a detailed experimental study of the system's performance. The results show that HIQUE satisfies its design objectives, while its efficiency surpasses that of both well-established and currently-emerging query processing techniques.

【Keywords】: SQL; high level languages; program compilers; query processing; HIQUE; SQL query engines; database query evaluation; generating code; higher level programming; highly efficient code; holistic integrated query engine; holistic query evaluation; query processing techniques; query- and hardware specific source code; Database systems; Engines; Hardware; Heuristic algorithms; Informatics; Optimizing compilers; Proposals; Prototypes; Query processing; System performance
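The template-instantiation idea can be shown in miniature: fill a source-code template with query-specific details, compile it, and run the resulting specialized function instead of a generic interpreter loop. A Python toy (HIQUE itself generates hardware-tuned, C-style code):

```python
FILTER_TEMPLATE = """
def generated_filter(rows):
    out = []
    for row in rows:
        if {predicate}:          # predicate is pasted in at generation time
            out.append(row)
    return out
"""

def generate_filter(predicate):
    """Instantiate the template for one query and compile it to a function,
    avoiding per-row interpretation of a generic expression tree."""
    src = FILTER_TEMPLATE.format(predicate=predicate)
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["generated_filter"]

# Hypothetical query: SELECT * FROM emp WHERE age > 30 AND dept = 'sales'
f = generate_filter("row['age'] > 30 and row['dept'] == 'sales'")
```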

Graph Mining 4

65. Finding Clusters in subspaces of very large, multi-dimensional datasets.

Paper Link】 【Pages】:625-636

【Authors】: Robson Leonardo Ferreira Cordeiro ; Agma J. M. Traina ; Christos Faloutsos ; Caetano Traina Jr.

【Abstract】: We propose Multi-resolution Correlation Cluster detection (MrCC), a novel, scalable method to detect correlation clusters in data with roughly 5 to 30 axes. Existing methods typically exhibit super-linear behavior in space or execution time. MrCC employs a novel multi-resolution data structure and improves on previous approaches in that: (a) it finds clusters that stand out in the data in a statistical sense; (b) it is linear in running time and memory usage with respect to the number of data points and the dimensionality of the subspaces where clusters exist; (c) it is linear in memory usage and quasi-linear in running time with respect to space dimensionality; and (d) it is accurate, deterministic, robust to noise, does not require the number of clusters as an input parameter, performs no distance calculations, and is able to detect clusters in subspaces generated by the original axes or by linear combinations of the original axes, including space rotation. In experiments on synthetic data ranging from 5 to 30 axes and from 12k to 250k points, MrCC outperformed five recent related methods in running time, being on average 10 times faster than the competitors that also achieved high accuracy on every tested dataset. On real data, MrCC found clusters at least 9 times faster than the competitors, improving accuracy by up to 34 percent.

【Keywords】: data structures; pattern clustering; correlation clusters; data structure; multidimensional datasets; multiresolution correlation cluster detection; space dimensionality; Brazil Council; Clustering methods; Computer science; Data analysis; Data structures; Noise generators; Noise robustness; Performance evaluation; Performance gain; Testing

66. Similarity search on supergraph containment.

Paper Link】 【Pages】:637-648

【Authors】: Haichuan Shang ; Ke Zhu ; Xuemin Lin ; Ying Zhang ; Ryutaro Ichise

【Abstract】: A supergraph containment search is to retrieve the data graphs contained by a query graph. In this paper, we study the problem of efficiently retrieving all data graphs approximately contained by a query graph, namely similarity search on supergraph containment. We propose a novel and efficient index to boost the efficiency of query processing. We have studied the query processing cost and propose two index construction strategies aimed at optimizing the performance of different types of data graphs: top-down strategy and bottom-up strategy. Moreover, a novel indexing technique is proposed by effectively merging the indexes of individual data graphs; this not only reduces the index size but also further reduces the query processing time. We conduct extensive experiments on real data sets to demonstrate the efficiency and the effectiveness of our techniques.

【Keywords】: graph theory; query processing; bottom-up strategy; data graphs retrieval; index construction strategies; query graph; query processing; similarity search; supergraph containment; supergraph containment search; top-down strategy; Australia; Bioinformatics; Cost function; Indexing; Informatics; Information retrieval; Merging; Pattern recognition; Query processing; Spatial databases

67. Finding top-k maximal cliques in an uncertain graph.

Paper Link】 【Pages】:649-652

【Authors】: Zhaonian Zou ; Jianzhong Li ; Hong Gao ; Shuo Zhang

【Abstract】: Existing studies on graph mining focus on exact graphs that are precise and complete. However, graph data tends to be uncertain in practice due to noise, incompleteness and inaccuracy. This paper investigates the problem of finding top-k maximal cliques in an uncertain graph. A new model of uncertain graphs is presented, and an intuitive measure is introduced to evaluate the significance of vertex sets. An optimized branch-and-bound algorithm is developed to find top-k maximal cliques, which adopts efficient pruning rules, a new searching strategy and effective preprocessing methods. The extensive experimental results show that the proposed algorithm is very efficient on real uncertain graphs, and the top-k maximal cliques are very useful for real applications, e.g. protein complex prediction.

【Keywords】: data mining; graph theory; optimisation; tree searching; graph mining; optimized branch-and-bound algorithm; pruning rules; searching strategy; top-k maximal cliques; uncertain graph; Bioinformatics; Cells (biology); Computer science; Databases; Optimization methods; Performance evaluation; Polynomials; Probability distribution; Proteins; Uncertainty

68. Progressive clustering of networks using Structure-Connected Order of Traversal.

Paper Link】 【Pages】:653-656

【Authors】: Dustin Bortner ; Jiawei Han

【Abstract】: Network clustering enables us to view a complex network at the macro level, by grouping its nodes into units whose characteristics and interrelationships are easier to analyze and understand. State-of-the-art network partitioning methods are unable to identify hubs and outliers. A recently proposed algorithm, SCAN, overcomes this difficulty. However, it requires a minimum similarity parameter ε but provides no automated way to find it. Thus, it must be rerun for each ε value and does not capture the variety or hierarchy of clusters. We propose a new algorithm, SCOT (Structure-Connected Order of Traversal), that produces a length-n sequence containing all possible ε-clusterings. We propose a second algorithm, HintClus (Hierarchy-Induced Network Clustering), that hierarchically clusters the network by finding only the best cluster boundaries (it is not agglomerative). Results on model-based synthetic network data and real data show that SCOT's execution time is comparable to SCAN's, that HintClus runs in negligible time, and that HintClus produces sensible clusters in the presence of noise.

【Keywords】: artificial intelligence; database management systems; pattern clustering; HintClus; SCOT; hierarchy induced network clustering; macro level complex network; network progressive clustering; state-of-the-art network partitioning methods; structure connected order of traversal; traversal structure connected order; Biological system modeling; Clustering algorithms; Complex networks; Computer science; Iterative algorithms; Partitioning algorithms; Simulated annealing; Size measurement; Social network services
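The structure similarity underlying SCAN (and hence the traversal order above) compares closed neighborhoods: sigma(u, v) = |G(u) & G(v)| / sqrt(|G(u)| * |G(v)|), where G(v) is v's neighbor set plus v itself. A minimal sketch over an adjacency-set graph:

```python
import math

def structural_similarity(adj, u, v):
    """SCAN-style structure similarity over closed neighborhoods."""
    gu, gv = adj[u] | {u}, adj[v] | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))
```

SCAN keeps an edge (u, v) inside a cluster only when this similarity reaches the threshold ε; the point of an ordered traversal is that every ε cutoff can then be read from one pass instead of one SCAN run per ε value.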

Parallel Processing 4

69. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database.

Paper Link】 【Pages】:657-668

【Authors】: Christopher Yang ; Christine Yen ; Ceryen Tan ; Samuel Madden

【Abstract】: In this paper, we describe a scheme for tolerating and recovering from mid-query faults in a distributed shared-nothing database. Rather than aborting and restarting queries, our system, Osprey, divides running queries into subqueries, and replicates data such that each subquery can be rerun on a different node if the node initially responsible fails or returns too slowly. Our approach is inspired by the fault tolerance properties of MapReduce, in which map or reduce jobs are greedily assigned to workers, and failed jobs are rerun on other workers. Osprey is implemented using a middleware approach, with only a small amount of custom code to handle cluster coordination. Each node in the system is a discrete database system running on a separate machine. Data, in the form of tables, is partitioned amongst database nodes and each partition is replicated on several nodes, using a technique called chained declustering [1]. A coordinator machine acts as a standard SQL interface to users; it transforms an input SQL query into a set of subqueries that are then executed on the nodes. Each subquery represents only a small fraction of the total execution of the query; worker nodes are assigned a new subquery as they finish their current one. In this greedy approach, the amount of work lost due to node failure is small (at most one subquery's work), and the system is automatically load balanced, because slow nodes will be assigned fewer subqueries. We demonstrate Osprey's viability as a distributed system for a small data warehouse data set and workload. Our experiments show that the overhead introduced by the middleware is small compared to the workload, and that the system shows promising load balancing and fault tolerance properties.

【Keywords】: SQL; distributed databases; fault tolerant computing; middleware; query processing; resource allocation; user interfaces; MapReduce; Osprey; chained declustering; data replication; data warehouse; discrete database system; distributed shared nothing database; fault tolerance; load balancing; middleware; midquery faults; standard SQL interface; Computer crashes; Data warehouses; Database systems; Discrete transforms; Distributed databases; Failure analysis; Fault tolerance; Fault tolerant systems; Load management; Middleware

70. FPGA acceleration for the frequent item problem.

Paper Link】 【Pages】:669-680

【Authors】: Jens Teubner ; René Müller ; Gustavo Alonso

【Abstract】: Field-programmable gate arrays (FPGAs) can provide performance advantages with a lower resource consumption (e.g., energy) than conventional CPUs. In this paper, we show how to employ FPGAs to provide an efficient and high-performance solution for the frequent item problem. We discuss three design alternatives, each exploiting different FPGA features, and provide an exhaustive evaluation of their performance characteristics. The first design is a one-to-one mapping of the Space-Saving algorithm (shown to be the best approach in software [1]), built on special features of FPGAs: content-addressable memory and dual-ported BRAM. The two other implementations exploit the flexibility of digital circuits to implement parallel lookups and pipelining strategies, resulting in significant improvements in performance. On low-cost FPGA hardware, the fastest of our designs can process 80 million items per second, three times as much as the best known result. Moreover, and unlike in software approaches where performance is directly related to the skew factor of the Zipf distribution, the high throughput is independent of the skew of the input distribution. We also discuss several design trade-offs that are relevant when implementing database functionality on FPGAs; in particular, we look at resource consumption and the levels of data and task parallelism of the three designs.

【Keywords】: content-addressable storage; field programmable gate arrays; logic design; parallel architectures; statistical distributions; FPGA acceleration; Zipf distribution; content-addressable memory; data level; database functionality; dual-ported BRAM; field programmable gate arrays; frequent item problem; one-to-one mapping design; resource consumption; space-saving algorithm; task parallelism level; Acceleration; Algorithm design and analysis; Digital circuits; Field programmable gate arrays; Hardware; Pipeline processing; Process design; Software algorithms; Software performance; Throughput
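The software baseline the first design maps onto, Space-Saving, fits in a few lines: keep at most k counters; an unmonitored item replaces the current minimum counter and inherits its count, so every count is an overestimate bounded by the evicted value. A minimal sketch:

```python
def space_saving(stream, k):
    """Track approximate counts of frequent items using at most k counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count + 1.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```

In the hardware design, the lookup and minimum search that dominate this loop are what content-addressable memory and BRAM accelerate.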

71. Estimating the progress of MapReduce pipelines.

Paper Link】 【Pages】:681-684

【Authors】: Kristi Morton ; Abram L. Friesen ; Magdalena Balazinska ; Dan Grossman

【Abstract】: In parallel query-processing environments, accurate, time-oriented progress indicators could provide much utility given that inter- and intra-query execution times can have high variance. However, none of the techniques used by existing tools or available in the literature provide non-trivial progress estimation for parallel queries. In this paper, we introduce Parallax, the first such indicator. While several parallel data processing systems exist, the work in this paper targets environments where queries consist of a series of MapReduce jobs. Parallax builds on recently-developed techniques for estimating the progress of single-site SQL queries, but focuses on the challenges related to parallelism and variable execution speeds. We have implemented our estimator in the Pig system and demonstrate its performance through experiments with the PigMix benchmark and other queries running in a real, small-scale cluster.

【Keywords】: SQL; parallel programming; query processing; MapReduce pipelines; Parallax indicator; Pig system; inter-query execution times; intra-query execution times; nontrivial progress estimation; parallel query-processing environments; single-site SQL queries; Computer science; Data analysis; Data processing; Database systems; Feedback; Open source software; Parallel processing; Pipelines; Query processing; Turning

72. Scalable distributed-memory external sorting.

Paper Link】 【Pages】:685-688

【Authors】: Mirko Rahn ; Peter Sanders ; Johannes Singler

【Abstract】: We engineer algorithms for sorting huge data sets on massively parallel machines. The algorithms are based on the multiway merging paradigm. We first outline an algorithm whose I/O requirement is close to a lower bound. Thus, in contrast to naive implementations of multiway merging and all other approaches known to us, the algorithm works with just two passes over the data even for the largest conceivable inputs. A second algorithm reduces communication overhead and uses more conventional specifications of the result, at the cost of slightly increased I/O requirements. An implementation wins the well-known sorting benchmark in several categories, by a large margin over its competitors.

【Keywords】: parallel algorithms; parallel machines; sorting; distributed-memory external sorting; multiway merging paradigm; parallel machines; Bandwidth; Costs; Data engineering; Data structures; Filling; Hard disks; Merging; Parallel machines; Sorting; Writing
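The core of the multiway merging paradigm: after one pass forms sorted runs, a second pass merges all runs at once through a min-heap, which is how two passes suffice even for huge inputs. An in-memory sketch of the merge step:

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in a single pass using a min-heap."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[run_id]):  # refill from the run we just consumed
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return out
```

In the external-memory setting each run is a disk-resident file read block by block, and the two-pass bound holds as long as the number of runs does not exceed the merge fan-in.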

73. Supporting top-K keyword search in XML databases.

Paper Link】 【Pages】:689-700

【Authors】: Liang Jeff Chen ; Yannis Papakonstantinou

【Abstract】: Keyword search is considered to be an effective information discovery method for both structured and semi-structured data. In XML keyword search, query semantics is based on the concept of the Lowest Common Ancestor (LCA). However, naive LCA-based semantics leads to exponential computation and result size. In the literature, LCA-based semantic variants (e.g., ELCA and SLCA) were proposed, which define a subset of all the LCAs as the results. While most existing work focuses on algorithmic efficiency, top-K processing for XML keyword search is an important issue that has received very little attention. Existing algorithms focusing on efficiency are designed to optimize the semantic pruning and are incapable of supporting top-K processing. On the other hand, straightforward applications of top-K techniques from other areas (e.g., relational databases) generate LCAs that may not be results and waste effort on semantic pruning. In this paper, we propose a series of join-based algorithms that combine semantic pruning and top-K processing to support top-K keyword search in XML databases. The algorithms essentially reduce keyword query evaluation to relational joins, and incorporate the idea of the top-K join from relational databases. Extensive experimental evaluations show the performance advantages of our algorithms.

【Keywords】: XML; data mining; query formulation; relational databases; LCA based semantic variant; Lowest Common Ancestor; Top-K keyword search; XML databases; information discovery method; join based algorithm; keyword query evaluation; query semantics; relational databases; semantic pruning; Algorithm design and analysis; Computer science; Data engineering; Database languages; Design optimization; Information retrieval; Keyword search; Query processing; Relational databases; XML

74. Personalized Web search with location preferences.

Paper Link】 【Pages】:701-712

【Authors】: Kenneth Wai-Ting Leung ; Dik Lun Lee ; Wang-Chien Lee

【Abstract】: As the amount of Web information grows rapidly, search engines must be able to retrieve information according to the user's preference. In this paper, we propose a new web search personalization approach that captures the user's interests and preferences in the form of concepts by mining search results and their clickthroughs. Because location information plays an important role in mobile search, we separate concepts into content concepts and location concepts, and organize them into ontologies to create an ontology-based, multi-facet (OMF) profile that precisely captures the user's content and location interests and hence improves the search accuracy. Moreover, recognizing that different users and queries may place different emphases on content and location information, we introduce the notions of content and location entropies to measure the amount of content and location information associated with a query, and click content and location entropies to measure how much the user is interested in the content and location information in the results. Accordingly, we define personalization effectiveness based on these entropies and use it to balance the weights between the content and location facets. Finally, based on the derived ontologies and personalization effectiveness, we train an SVM to adapt a personalized ranking function for re-ranking of future search results. We conduct extensive experiments to compare the precision produced by our OMF profiles with that of a baseline method. Experimental results show that OMF improves precision significantly compared to the baseline.

【Keywords】: entropy; information retrieval; ontologies (artificial intelligence); search engines; support vector machines; user interfaces; SVM; Web search; content information; entropy; information retrieval; location information; mobile search; ontologies; ontology based multifacet; search engines; user preference; Data models; Data privacy; Educational institutions; Protection; Publishing; Sufficient conditions; Testing; Web search

75. Fuzzy matching of Web queries to structured data.

Paper Link】 【Pages】:713-716

【Authors】: Tao Cheng ; Hady Wirawan Lauw ; Stelios Paparizos

【Abstract】: Recognizing the alternative ways people use to reference an entity, is important for many Web applications that query structured data. In such applications, there is often a mismatch between how content creators describe entities and how different users try to retrieve them. In this paper, we consider the problem of determining whether a candidate query approximately matches with an entity. We propose an off-line, data-driven, bottom-up approach that mines query logs for instances where Web content creators and Web users apply a variety of strings to refer to the same Web pages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings for each entity. The proposed method is verified with experiments on real-life data sets showing that we can dramatically increase the queries that can be matched.

【Keywords】: Internet; data mining; fuzzy set theory; query processing; Web pages; Web queries; candidate query; fuzzy matching; query log mining; structured data; Africa; Content based retrieval; Databases; Earth Observing System; Motion pictures; Search engines; Skull; Uniform resource locators; Web pages; Web search
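The bottom-up mining step can be sketched as follows: group query strings by the URL they lead to, and treat strings that reach the same page as candidate aliases for the same entity. A toy version over an invented (query, clicked URL) log; the paper's pipeline adds scoring and noise filtering on top of this co-click signal:

```python
from collections import defaultdict

def equivalent_strings(query_log):
    """query_log: iterable of (query_string, clicked_url) pairs.
    Returns, for each string, the set of strings co-clicked with it on some URL."""
    by_url = defaultdict(set)
    for query, url in query_log:
        by_url[url].add(query)
    aliases = defaultdict(set)
    for strings in by_url.values():
        for s in strings:
            aliases[s] |= strings
    return aliases
```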

Paper Link】 【Pages】:717-720

【Authors】: Akanksha Baid ; Ian Rae ; AnHai Doan ; Jeffrey F. Naughton

【Abstract】: Keyword search (KWS) over relational data, where the answers are multiple tuples connected via joins, has received significant attention in the past decade. Numerous solutions have been proposed and many prototypes have been developed. Building on this rapid progress and on growing user needs, recently several RDBMS and Web companies as well as academic research groups have started to examine how to build industrial-strength keyword search systems. This task clearly requires addressing many issues, including robustness, accuracy, reliability, and privacy, among others. A major emerging issue, however, appears to be performance related: current KWS systems have unpredictable run time. In particular, for certain queries it takes too long to produce answers, and for others the system may even fail to return (e.g., after exhausting memory). In this paper we begin by examining the above problem and arguing that it is a fundamental problem unlikely to be solved in the near future by software and hardware advances. Next, we argue that in an industrial-strength setting, to ensure real-time interaction and facilitate user adoption, KWS systems should produce answers under an absolute time limit and then provide users with a description of what could be done next, should they choose to continue. Next, we show how to realize these requirements for DISCOVER, an exemplar of a recent KWS solution approach. Our basic idea is to produce answers as in today's KWS systems up to the time limit, then show users these answers as well as query forms that characterize the unexplored portion of the answer space. Finally, we present some preliminary experiments over real-world data to demonstrate the feasibility of the proposed solution approach.

【Keywords】: relational databases; search engines; DISCOVER solution approach; accuracy issue; industrial-strength setting; keyword search systems; privacy issue; relational data; reliability issue; robustness issue; Computer industry; Computer science; Hardware; Industrial relations; Keyword search; Privacy; Prototypes; Real time systems; Relational databases; Robustness

Query Processing 3

77. Efficient processing of substring match queries with inverted q-gram indexes.

Paper Link】 【Pages】:721-732

【Authors】: Younghoon Kim ; Kyoung-Gu Woo ; Hyoungmin Park ; Kyuseok Shim

【Abstract】: With the widespread use of the Internet, text-based data sources have become ubiquitous and the demand for effective support of string matching queries is ever increasing. The relational query language SQL also supports the LIKE clause over string data to handle substring matching queries. Due to the popularity of such substring matching queries, there has been much study of designing efficient indexes to support the LIKE clause in SQL. Among them, q-gram based indexes have been studied extensively. However, how to process substring matching queries efficiently with such indexes has received very little attention until recently. In this paper, we show that the optimal execution of intersecting posting lists of q-grams for substring matching queries should be decided judiciously. Then we present the optimal and approximate algorithms based on cost estimation for substring matching queries. Our performance study confirms that our techniques improve query execution time with q-gram indexes significantly compared to the traditional algorithms.

【Keywords】: SQL; query processing; string matching; text analysis; SQL; approximate algorithms; cost estimation; internet; inverted q-gram indexes; optimal algorithms; query execution time; relational query language; string data; substring match queries efficient processing; text-based data sources; Algorithm design and analysis; Cost function; Data structures; Database languages; Internet; Intrusion detection; Query processing
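
A minimal version of the machinery the abstract builds on, an inverted q-gram index with posting-list intersection followed by verification, might look like the sketch below. The paper's contribution is a cost-based choice of intersection order; this sketch only approximates that with a shortest-posting-list-first heuristic.

```python
from collections import defaultdict

Q = 3  # gram length

def build_index(strings):
    """Inverted index mapping each q-gram to the ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - Q + 1):
            index[s[i:i + Q]].add(sid)
    return index

def substring_match(pattern, strings, index):
    """Candidate generation by intersecting the posting lists of the
    pattern's q-grams, then verification against the actual strings."""
    grams = [pattern[i:i + Q] for i in range(len(pattern) - Q + 1)]
    if not grams:  # pattern shorter than q: fall back to a scan
        return [sid for sid, s in enumerate(strings) if pattern in s]
    postings = sorted((index.get(g, set()) for g in grams), key=len)
    candidates = set(postings[0])
    for p in postings[1:]:  # intersect cheapest lists first
        candidates &= p
    return [sid for sid in candidates if pattern in strings[sid]]
```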

78. Progressive result generation for multi-criteria decision support queries.

Paper Link】 【Pages】:733-744

【Authors】: Venkatesh Raghavan ; Elke A. Rundensteiner

【Abstract】: Multi-criteria decision support (MCDS) is crucial in many business and web applications such as web searches, B2B portals and on-line commerce. Such MCDS applications need to report results early, as soon as they are generated, so that users can react and formulate competitive decisions in near real time. The ease of expressing user preferences in web-based applications has made Pareto-optimal (skyline) queries a popular class of MCDS queries. However, state-of-the-art techniques either focus on handling skylines on single input sets (i.e., no joins) or do not tackle the challenge of producing progressive early output results. In this work, we propose a progressive query evaluation framework ProgXe that transforms the execution of queries involving skyline over joins to be non-blocking, i.e., to progressively generate results early and often. In ProgXe the query processing (join, mapping and skyline) is conducted at multiple levels of abstraction, thereby exploiting the knowledge gained from both input as well as mapped output spaces. This knowledge enables us to identify and reason about abstract-level relationships to guarantee correctness of early output. It also provides optimization opportunities previously missed by current techniques. To further optimize ProgXe, we incorporate an ordering technique that optimizes the rate at which results are reported by translating the optimization of tuple-level processing into a job-sequencing problem. Our experimental study over a wide variety of data sets demonstrates the superiority of our approach over state-of-the-art techniques.

【Keywords】: Internet; Pareto optimisation; decision support systems; electronic commerce; query processing; B2B portals; Pareto optimal queries; ProgXe; Web searches; job sequencing problem; multicriteria decision support queries; online commerce; optimization; progressive query evaluation framework; progressive result generation; query processing; skylines; Application software; Business; Computer science; Costs; Databases; Delay; Portals; Query processing; Web and internet services; Web search
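
The early-output guarantee ProgXe aims for can be illustrated with a standard sort-based trick (not the paper's actual mechanism): if tuples arrive ordered by a monotone score such as the attribute sum, any tuple undominated by everything seen so far can be reported immediately, because no later tuple can dominate it.

```python
def progressive_skyline(tuples):
    """Generator that emits skyline tuples early (smaller values preferred).
    Sorting by a monotone score (here: sum of attributes) guarantees that
    a dominating tuple always precedes the tuples it dominates, so an
    undominated tuple is safe to report as soon as it is seen."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    seen = []
    for t in sorted(tuples, key=sum):
        if not any(dominates(s, t) for s in seen):
            seen.append(t)
            yield t  # non-blocking: result produced before input is exhausted
```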

79. Nb-GCLOCK: A non-blocking buffer management based on the generalized CLOCK.

Paper Link】 【Pages】:745-756

【Authors】: Makoto Yui ; Jun Miyazaki ; Shunsuke Uemura ; Hayato Yamana

【Abstract】: In this paper, we propose a non-blocking buffer management scheme based on a lock-free variant of the GCLOCK page replacement algorithm. Concurrent access to the buffer management module is a major factor that prevents databases from scaling with the number of processors. Therefore, we propose a non-blocking scheme for bufferfix operations that fix buffer frames for requested pages without locks by combining Nb-GCLOCK and a non-blocking hash table. Our experimental results revealed that our scheme can obtain nearly linear scalability up to 64 processors, although the existing locking-based schemes do not scale beyond 16 processors.

【Keywords】: buffer storage; database management systems; microprocessor chips; GCLOCK; Nb-GCLOCK; buffer management module; database scalability; generalized CLOCK; non blocking buffer management; Biology; Clocks; Concurrent computing; Databases; Engineering management; Information science; Knowledge management; Scalability; Synchronization; Technology management
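
For reference, a single-threaded sketch of the generalized CLOCK policy underlying the paper is below. Each frame carries a reference counter; a hit increments it, and the clock hand decrements counters until it finds a zero, whose frame is evicted. The paper's actual contribution, making the fix operation lock-free with atomic counters and a non-blocking hash table, is deliberately omitted here.

```python
class GClock:
    """Toy, non-concurrent GCLOCK buffer; class and method names are
    illustrative, not the paper's API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []         # list of [page_id, ref_count]
        self.page_to_frame = {}  # hash table: page id -> frame index
        self.hand = 0

    def fix(self, page_id):
        """Return True on a buffer hit, False if the page was faulted in."""
        if page_id in self.page_to_frame:
            self.frames[self.page_to_frame[page_id]][1] += 1
            return True
        if len(self.frames) < self.capacity:  # free frame available
            self.page_to_frame[page_id] = len(self.frames)
            self.frames.append([page_id, 1])
            return False
        while self.frames[self.hand][1] > 0:  # sweep, decrementing counters
            self.frames[self.hand][1] -= 1
            self.hand = (self.hand + 1) % self.capacity
        victim, _ = self.frames[self.hand]    # counter hit zero: evict
        del self.page_to_frame[victim]
        self.frames[self.hand] = [page_id, 1]
        self.page_to_frame[page_id] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```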

Web and Collaborative Applications 5

80. Effective automated Object Matching.

Paper Link】 【Pages】:757-768

【Authors】: Diego Zardetto ; Monica Scannapieco ; Tiziana Catarci

【Abstract】: Object Matching (OM) is the problem of identifying pairs of data-objects coming from different sources and representing the same real world object. Several methods have been proposed to solve OM problems, but none of them seems to be both fully automated and highly effective. In this paper we present a fundamentally new suite of methods that possesses both of these abilities. We adopt a statistical approach based on mixture models, which structures an OM process into two consecutive tasks. First, mixture parameters are estimated by fitting the model to observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model with component densities belonging to the Beta parametric family and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched by a purposefully designed evolutionary algorithm. Notably, our suite of methods is distance-independent in the sense that it does not rely on any restrictive assumption on the function to be used when comparing data-objects. Even more interestingly, our approach is not confined to record linkage applications but can also be applied to match other kinds of data-objects. We present several experiments on real data that validate the proposed methods and demonstrate their effectiveness.

【Keywords】: evolutionary computation; maximum likelihood estimation; pattern clustering; pattern matching; perturbation techniques; probability; Beta parametric family; automated object matching; clustering problem; component densities; evolutionary algorithm; maximum likelihood; minimum cost objectives; mixture models; one-to-one matching constraints; optimal decision rules; perturbation like technique; probabilistic clustering; Algorithm design and analysis; Automation; Constraint optimization; Costs; Couplings; Design optimization; Evolutionary computation; Maximum likelihood estimation; Parameter estimation; Probability distribution

81. Efficient identification of coupled entities in document collections.

Paper Link】 【Pages】:769-772

【Authors】: Nikos Sarkas ; Albert Angel ; Nick Koudas ; Divesh Srivastava

【Abstract】: The relentless pace at which textual data are generated on-line necessitates novel paradigms for their understanding and exploration. To this end, we introduce a methodology for discovering strong entity associations in all the slices (meta-data value restrictions) of a document collection. Since related documents mention approximately the same group of core entities (people, locations, etc.), the groups of coupled entities discovered can be used to expose themes in the document collection. We devise and evaluate algorithms capable of addressing two flavors of our core problem: algorithm THR-ENT for computing all sufficiently strong entity associations and algorithm TOP-ENT for computing the top-k strongest entity associations, for each slice of the document collection.

【Keywords】: text analysis; THR-ENT algorithm; TOP-ENT algorithm; coupled entity identification; document collections; metadata value restrictions; strong entity associations; textual data; Association rules; Data mining; Demography; Diseases; Information services; Internet; TV; Testing; User-generated content; Web sites

82. On supporting effective web extraction.

Paper Link】 【Pages】:773-775

【Authors】: Wook-Shin Han ; Wooseong Kwak ; Hwanjo Yu

【Abstract】: Commercial tuple extraction systems have enjoyed some success in extracting tuples by regarding HTML pages as tree structures and exploiting XPath queries to find attributes of tuples in the HTML pages. However, such systems are vulnerable to small changes in the web pages. In this paper, we propose a robust tuple extraction system which utilizes spatial relationships among elements rather than the XPath queries of the elements. Our system regards elements in the rendered page as spatial objects in the 2-D space and executes spatial joins to extract target elements. Since humans also identify an element in a web page by its relative spatial location, our system, which extracts elements by their spatial relationships, could be as robust as manual extraction and is far more robust than existing tuple extraction systems.

【Keywords】: Internet; query processing; tree data structures; 2D space; HTML pages; XPath queries; effective Web extraction; robust tuple extraction system; spatial objects; spatial relationship; tree structures; tuple extraction systems; Cities and towns; Computer science; Data mining; HTML; Humans; Mashups; Robustness; Spatial resolution; Tree data structures; Web pages

83. A partial persistent data structure to support consistency in real-time collaborative editing.

Paper Link】 【Pages】:776-779

【Authors】: Qinyi Wu ; Calton Pu ; João Eduardo Ferreira

【Abstract】: Co-authored documents are becoming increasingly important for knowledge representation and sharing. Tools for supporting document co-authoring are expected to satisfy two requirements: 1) querying changes over editing histories; 2) maintaining data consistency among users. Current tools either support only limited queries or are not suitable for loosely controlled collaborative editing scenarios. We address both problems by proposing a new persistent data structure, the partial persistent sequence. The new data structure enables us to create unique character identifiers that can be used for associating meta-information and tracking its changes, and also to design simple view synchronization algorithms that guarantee data consistency in the presence of concurrent updates. Experiments based on real-world collaborative editing traces show that our data structure uses disk space economically and provides efficient performance for document update and retrieval.

【Keywords】: data structures; document handling; groupware; knowledge representation; synchronisation; co-authored documents; data consistency; knowledge representation; knowledge sharing; partial persistent data structure; partial persistent sequence; querying changes; realtime collaborative editing; synchronization algorithms; Algorithm design and analysis; Buildings; Collaboration; Collaborative tools; Collaborative work; Data models; Data structures; Educational institutions; History; Information retrieval

84. Detecting bursty events in collaborative tagging systems.

Paper Link】 【Pages】:780-783

【Authors】: Junjie Yao ; Bin Cui ; Yuxin Huang ; Yanhong Zhou

【Abstract】: Collaborative tagging systems have emerged as a ubiquitous way to annotate and organize online resources. The users' tagging actions over time reflect how their interests change. In this paper, we propose to detect bursty tagging events, which capture the relations among a group of correlated tags where the tags are either bursty or associated with bursty tag co-occurrence. We exploit sliding time intervals to extract bursty features from large tag corpora as the first step, and then adopt graph clustering techniques to group bursty features into meaningful bursty events. An experimental study demonstrates the superiority of our approach.

【Keywords】: feature extraction; graph theory; groupware; identification technology; pattern clustering; ubiquitous computing; bursty tagging event detection; collaborative tagging system; feature extraction; graph clustering technique; online resources; sliding time intervals; Collaborative software; Educational technology; Event detection; Feature extraction; Information services; Internet; Monitoring; Online Communities/Technical Collaboration; Tagging; Web sites
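
The first step the abstract describes, flagging bursty tags over sliding time intervals, can be sketched with a simple baseline (the threshold rule is an illustrative assumption; the paper's subsequent graph clustering of co-occurring bursty tags into events is omitted):

```python
def bursty_tags(windows, threshold=3.0):
    """`windows` is a list of {tag: count} dicts, one per time interval.
    A tag is flagged as bursty in the last window if its count there
    exceeds `threshold` times its mean count over the earlier windows."""
    *history, current = windows
    bursts = []
    for tag, count in current.items():
        mean = sum(w.get(tag, 0) for w in history) / max(len(history), 1)
        if count > threshold * max(mean, 1.0):  # floor avoids noise on rare tags
            bursts.append(tag)
    return sorted(bursts)
```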

Scientific Databases 4

85. Credibility-enhanced curated database: Improving the value of curated databases.

Paper Link】 【Pages】:784-795

【Authors】: Qun Ni ; Elisa Bertino

【Abstract】: In curated databases, annotations may contain opinions different from those in sources. Moreover, annotations may contradict each other and have uncertainty. Such situations result in a natural question: "Which opinion is most likely to be correct?" In this paper, we define a credibility-enhanced curated database and propose an efficient method to accurately evaluate the correctness of sources and annotations in curated databases.

【Keywords】: database management systems; uncertainty handling; annotations; credibility enhanced curated database; uncertainty; Computer science; Dictionaries; Encyclopedias; Genetics; Humans; Libraries; Ontologies; Relational databases; Spatial databases; Uncertainty

86. UV-diagram: A Voronoi diagram for uncertain data.

Paper Link】 【Pages】:796-807

【Authors】: Reynold Cheng ; Xike Xie ; Man Lung Yiu ; Jinchuan Chen ; Liwen Sun

【Abstract】: The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. In this paper, we study how the Voronoi diagram can be used on uncertain data, which are inherent in scientific and business applications. In particular, we propose the Uncertain-Voronoi Diagram (or UV-diagram for short). Conceptually, the data space is divided into distinct "UV-partitions", where each UV-partition P is associated with a set S of objects; any point q located in P has the set S as its nearest neighbor with non-zero probabilities. The UV-diagram facilitates queries that inquire objects for having non-zero chances of being the nearest neighbor of a given query point. It also allows analysis of nearest neighbor information, e.g., finding out how many objects are the nearest neighbors in a given area. However, a UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation for UV-partitions, and develop an adaptive index for the UV-diagram. This index can be constructed in polynomial time. We examine how it can be extended to support other related queries. We also perform extensive experiments to validate the effectiveness of our approach.

【Keywords】: computational complexity; computational geometry; learning (artificial intelligence); probability; visual databases; UV-diagram; UV-partition; Voronoi diagram; nearest-neighbor queries; nonzero probabilities; polynomial time; spatial databases; uncertain data; uncertain-Voronoi diagram; Computer science; Costs; Data engineering; Information analysis; Knowledge engineering; Lungs; Nearest neighbor searches; Satellite broadcasting; Spatial databases; Sun
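
The semantics of a UV-partition, the set of objects with non-zero probability of being a query point's nearest neighbor, can be illustrated with a Monte Carlo estimate over a toy uncertainty model (uniform disks; the paper's index-based construction is not reproduced here):

```python
import random

def nn_probabilities(query, objects, trials=2000, seed=7):
    """Monte Carlo estimate of each uncertain object's probability of
    being the nearest neighbor of `query`. Each object is a hypothetical
    (cx, cy, r) disk from which its true location is drawn uniformly via
    rejection sampling. Objects with non-zero estimated probability are
    the ones a UV-partition would associate with the query's region."""
    rng = random.Random(seed)
    wins = [0] * len(objects)
    for _ in range(trials):
        best, best_d = None, float("inf")
        for i, (cx, cy, r) in enumerate(objects):
            while True:  # rejection-sample a point inside the disk
                dx, dy = rng.uniform(-r, r), rng.uniform(-r, r)
                if dx * dx + dy * dy <= r * r:
                    break
            d = (cx + dx - query[0]) ** 2 + (cy + dy - query[1]) ** 2
            if d < best_d:
                best, best_d = i, d
        wins[best] += 1
    return [w / trials for w in wins]
```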

87. Supporting real-world activities in database management systems.

Paper Link】 【Pages】:808-811

【Authors】: Mohamed Y. Eltabakh ; Walid G. Aref ; Ahmed K. Elmagarmid ; Yasin N. Silva ; Mourad Ouzzani

【Abstract】: The cycle of processing the data in many application domains is complex and may involve real-world activities that are external to the database, e.g., wet-lab experiments, instrument readings, and manual measurements. These real-world activities may take a long time to prepare for and to perform, and hence introduce inherently long delays between updates in the database. The presence of these long delays between updates, along with the need for the intermediate results to be instantly available, makes supporting real-world activities in the database engine a challenging task. In this paper, we address these challenges through a system that enables users to reflect their updates immediately into the database while keeping track of the dependent and potentially invalid data items until they are re-validated. The proposed system includes: (1) semantics and syntax for interfaces through which users can express the dependencies among data items, (2) new operators to alert users when the returned query results contain potentially invalid or out-of-date data, and to enable evaluating queries on either valid data only, or both valid and potentially invalid data, and (3) mechanisms for data invalidation and revalidation. The proposed system is being realized via extensions to PostgreSQL.

【Keywords】: database management systems; query processing; PostgreSQL; data invalidation; data revalidation; database engine; database management systems; instrument readings; supporting real world activities; wet lab experiments; Application software; Computer science; Database systems; Delay effects; Engines; Information retrieval; Instruments; Propagation delay

88. XML-based computation for scientific workflows.

Paper Link】 【Pages】:812-815

【Authors】: Daniel Zinn ; Shawn Bowers ; Bertram Ludäscher

【Abstract】: Scientific workflows are increasingly used to rapidly integrate existing algorithms to create larger and more complex programs. However, designing workflows using purely dataflow-oriented computation models introduces a number of challenges, including the need to use low-level components to mediate and transform data (so-called shims) and large numbers of additional "wires" for routing data to components within a workflow. To address these problems, we employ Virtual Data Assembly Lines (VDAL), a modeling paradigm that can eliminate most shims and reduce wiring complexity. We show how a VDAL design can be implemented using existing XML technologies and how static analysis can provide significant help to scientists during workflow design and evolution, e.g., by displaying actor dependencies or by detecting so-called unproductive actors.

【Keywords】: XML; data flow computing; VDAL; XML based computation; data transformation; dataflow oriented computation models; low level components; scientific workflows; virtual data assembly lines; Assembly; Computational modeling; Computer displays; Computer science; Data analysis; Face detection; Routing; Wires; Wiring; XML

Tree Queries and Semi-Structured Databases 4

89. ViewJoin: Efficient view-based evaluation of tree pattern queries.

Paper Link】 【Pages】:816-827

【Authors】: Ding Chen ; Chee-Yong Chan

【Abstract】: There is a lot of recent interest in applying views to optimize the processing of tree pattern queries (TPQs). However, existing work in this area has focused predominantly on logical optimization issues, namely, view selection and query rewriting. With the exception of the recent work on InterJoin (which is primarily focused on path queries and views), there is very little work that has examined the important physical optimization issue of how to efficiently evaluate TPQs using materialized views. In this paper, we present a new storage scheme for materialized TPQ views and a novel evaluation algorithm for processing general TPQ queries using materialized TPQ views. Our experimental results demonstrate that our proposed method outperforms the state-of-the-art approaches.

【Keywords】: XML; query processing; tree data structures; TPQ; efficient view based evaluation; logical optimization; query rewriting; state-of-the-art approaches; tree pattern queries; Couplings; Data privacy; Databases; Diseases; Hospitals; Influenza; Protection; Publishing; Stomach; Sun

90. FlexPref: A framework for extensible preference evaluation in database systems.

Paper Link】 【Pages】:828-839

【Authors】: Justin J. Levandoski ; Mohamed F. Mokbel ; Mohamed E. Khalefa

【Abstract】: Personalized database systems give users answers tailored to their personal preferences. While numerous preference evaluation methods for databases have been proposed (e.g., skyline, top-k, k-dominance, k-frequency), the implementation of these methods at the core of a database system is a double-edged sword. Core implementation provides efficient query processing for arbitrary database queries, however this approach is not practical as each existing (and future) preference method requires a custom query processor implementation. To solve this problem, this paper introduces FlexPref, a framework for extensible preference evaluation in database systems. FlexPref, implemented in the query processor, aims to support a wide array of preference evaluation methods in a single extensible code base. Integration with FlexPref is simple, involving the registration of only three functions that capture the essence of the preference method. Once integrated, the preference method "lives" at the core of the database, enabling the efficient execution of preference queries involving common database operations. To demonstrate the extensibility of FlexPref, we provide case studies showing the implementation of three database operations (single table access, join, and sorted list access) and five state-of-the-art preference evaluation methods (top-k, skyline, k-dominance, top-k dominance, and k-frequency). We also experimentally study the strengths and weaknesses of an implementation of FlexPref in PostgreSQL over a range of single-table and multi-table preference queries.

【Keywords】: query processing; FlexPref; arbitrary database queries; database systems; double edged sword; extensible preference evaluation framework; multitable preference queries; query processing; query processor implementation; single table access; singletable preference queries; sorted list access; Computer science; Data engineering; Database systems; Decision making; Engines; Indexes; Information retrieval; Optimization methods; Query processing
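
The register-a-few-functions design can be sketched as a generic evaluator parameterized by a plugged-in comparison (the function names and the single-function interface are illustrative, not FlexPref's actual three-function API):

```python
def make_preference_evaluator(pairwise_compare):
    """Generic single-table preference evaluator. `pairwise_compare(a, b)`
    returns 1 if a is preferred over b, -1 if b is preferred over a,
    and 0 if the two are incomparable."""
    def evaluate(tuples):
        preferred = []
        for t in tuples:
            if any(pairwise_compare(p, t) == 1 for p in preferred):
                continue  # t is beaten by an already-preferred tuple
            preferred = [p for p in preferred if pairwise_compare(t, p) != 1]
            preferred.append(t)
        return preferred
    return evaluate

def skyline_compare(a, b):
    """Plugging in skyline (Pareto) semantics, minimizing all attributes."""
    if all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b)):
        return 1
    if all(y <= x for x, y in zip(a, b)) and any(y < x for x, y in zip(a, b)):
        return -1
    return 0

skyline = make_preference_evaluator(skyline_compare)
```

Swapping in a different `pairwise_compare` (e.g., k-dominance) changes the preference semantics without touching the evaluator, which is the extensibility argument the abstract makes.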

Paper Link】 【Pages】:840-843

【Authors】: Atsuyuki Morishima ; Keishi Tajima ; Masateru Tadaishi

【Abstract】: There are many applications in which users interactively access huge tree data by repeating set-based navigations. In this paper, we focus on label-specific/wildcard children/descendant navigations. For efficient processing of these operations in huge data stored on a disk, we need a node ordering scheme that clusters nodes that are accessed together by these operations. In this paper, (1) we show there is no node order that is optimal for all these operations, (2) we propose two schemes, each of which is optimal only for some subset of them, and (3) we show that one of the proposed schemes can process all these operations with access to a constant-bounded number of regions on the disk without accessing irrelevant nodes.

【Keywords】: tree data structures; child-descendant navigations; constant-bounded number; node ordering scheme; optimal tree node ordering; Costs; Counting circuits; Degradation; Informatics; Information retrieval; Libraries; Navigation; Prefetching; Tree data structures; XML

92. XMorph: A shape-polymorphic, domain-specific XML data transformation language.

Paper Link】 【Pages】:844-847

【Authors】: Curtis E. Dyreson ; Sourav S. Bhowmick ; Aswani Rao Jannu ; Kirankanth Mallampalli ; Shuohao Zhang

【Abstract】: By imposing a single hierarchy on data, XML makes queries brittle in the sense that a query might fail to produce the desired result if it is executed on the same data organized in a different hierarchy, or if the hierarchy evolves during the lifetime of an application. This paper presents a new transformation language, called XMorph, which supports more flexible querying. XMorph is a shape polymorphic language, that is, a single XMorph query can extract and transform data from differently-shaped hierarchies. The XMorph data shredder distills XML data into a graph of closest relationships, which are exploited by the query evaluation engine to produce a result in the shape specified by an XMorph query.

【Keywords】: XML; graph theory; query processing; XML data; XMorph; data transformation language; query evaluation engine; shape polymorphic language; Computer science; Data mining; Database languages; Engines; Object oriented modeling; Query processing; Relational databases; Tree graphs; Vocabulary; XML

Query Ranking and Database Testing 4

93. Surrogate ranking for very expensive similarity queries.

Paper Link】 【Pages】:848-859

【Authors】: Fei Xu ; Ravi Jampani ; Mingxi Wu ; Chris Jermaine ; Tamer Kahveci

【Abstract】: We consider the problem of similarity search in applications where the cost of computing the similarity between two records is very expensive, and the similarity measure is not a metric. In such applications, comparing even a tiny fraction of the database records to a single query record can be orders of magnitude slower than reading the entire database from disk, and indexing is often not possible. We develop a general-purpose, statistical framework for answering top-k queries in such databases, when the database administrator is able to supply an inexpensive surrogate ranking function that substitutes for the actual similarity measure. We develop a robust method that learns the relationship between the surrogate function and the similarity measure. Given a query, we use Bayesian statistics to update the model by taking into account the observed partial results. Using the updated model, we construct bounds on the accuracy of the result set obtained via the surrogate ranking. Our experiments show that our models can produce useful bounds for several real-life applications.

【Keywords】: Bayes methods; data mining; query formulation; query processing; Bayesian statistics; computing cost; database administrator; database records; expensive similarity queries; query record; similarity search; surrogate ranking; top-k queries; Application software; Atomic measurements; Biochemistry; Computer science; Costs; Databases; Drugs; Indexing; Proteins; Robustness
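
The surrogate-ranking setting can be illustrated with a naive two-stage sketch: rank everything by the cheap surrogate, then apply the expensive similarity only to a shortlist. The paper's contribution, a Bayesian model relating the two functions that yields accuracy bounds on the result set, is replaced here by a fixed oversampling factor (an assumption of this sketch).

```python
def surrogate_topk(records, query, surrogate, expensive, k, oversample=4):
    """Return the k records most similar to `query` under `expensive`,
    but only evaluate `expensive` on the `oversample * k` records that
    score best under the cheap `surrogate` function (lower = more similar)."""
    shortlist = sorted(records, key=lambda r: surrogate(query, r))[: oversample * k]
    return sorted(shortlist, key=lambda r: expensive(query, r))[:k]
```

With a surrogate that orders records roughly like the expensive measure, only `oversample * k` expensive comparisons are needed instead of one per database record.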

94. Semantic ranking and result visualization for life sciences publications.

Paper Link】 【Pages】:860-871

【Authors】: Julia Stoyanovich ; William Mee ; Kenneth A. Ross

【Abstract】: An ever-increasing amount of data and semantic knowledge in the domain of life sciences is bringing about new data management challenges. In this paper we focus on adding the semantic dimension to literature search, a central task in scientific research. We focus our attention on PubMed, the most significant bibliographic source in life sciences, and explore ways to use high-quality semantic annotations from the MeSH vocabulary to rank search results. We start by developing several families of ranking functions that relate a search query to a document's annotations. We then propose an efficient adaptive ranking mechanism for each of the families. We also describe a two-dimensional Skyline-based visualization that can be used in conjunction with the ranking to further improve the user's interaction with the system, and demonstrate how such Skylines can be computed adaptively and efficiently. Finally, we evaluate the effectiveness of our ranking with a user study.

【Keywords】: data visualisation; natural sciences computing; query processing; user interfaces; vocabulary; MeSH vocabulary; PubMed; adaptive ranking mechanism; bibliographic source; data management; document annotations; high quality semantic annotation; life science publication visualization; search query; semantic knowledge; semantic ranking; two-dimensional Skyline-based visualization; user interaction; Bioinformatics; Blood; Cells (biology); Data visualization; Diseases; Genomics; Humans; Knowledge management; Vocabulary; Web search

95. Ranked queries over sources with Boolean query interfaces without ranking support.

Paper Link】 【Pages】:872-875

【Authors】: Vagelis Hristidis ; Yuheng Hu ; Panagiotis G. Ipeirotis

【Abstract】: Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an Information Retrieval (IR) ranking function. The naive approach would be to submit a disjunctive query with all query keywords, retrieve the returned documents, and then re-rank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and TREC dataset shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.

【Keywords】: query processing; relevance feedback; user interfaces; Boolean query interfaces; PubMed algorithm; conjunctive queries; disjunctive query; information retrieval ranking function; query keywords; query ranking; querying mechanisms; ranking support; relevance metric; relevance ranking; Biomedical computing; Biomedical measurements; Computer interfaces; Databases; Immune system; Information management; Information retrieval; Optical computing; Trademarks; Web services
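The abstract's core idea, covering a monotonic relevance score with conjunctive queries issued in decreasing order of their best possible score, can be sketched as follows. This is an illustrative toy, not the authors' algorithm: the simulated Boolean source, the per-term weights, and the sum-of-weights score are all assumptions made for the example.

```python
from itertools import combinations

def boolean_and(corpus, terms):
    """Simulated Boolean-only source: ids of docs containing ALL terms (no ranking)."""
    return [d for d, words in corpus.items() if set(terms) <= words]

def topk_over_boolean(corpus, weights, k):
    """Exact top-k under a monotonic score (sum of matched term weights),
    using only conjunctive queries and stopping early once no unseen
    document can beat the current k-th best."""
    terms = list(weights)
    subsets = [c for r in range(1, len(terms) + 1) for c in combinations(terms, r)]
    subsets.sort(key=lambda s: -sum(weights[t] for t in s))  # best-first
    scores = {}
    for s in subsets:
        kth = sorted(scores.values(), reverse=True)[k - 1] if len(scores) >= k else -1
        if sum(weights[t] for t in s) <= kth:
            break  # any unseen doc matches only a lighter subset: prune the rest
        for d in boolean_and(corpus, s):
            scores[d] = sum(w for t, w in weights.items() if t in corpus[d])
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = {1: {"cancer", "p53"}, 2: {"p53"}, 3: {"cancer"},
          4: {"cancer", "p53", "therapy"}}
weights = {"cancer": 1.0, "p53": 2.0, "therapy": 0.5}
print(topk_over_boolean(corpus, weights, 2))  # → [4, 1]
```

Here only two conjunctive queries are issued before the prune fires, instead of the expensive all-keywords disjunction.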

96. X-data: Generating test data for killing SQL mutants.

Paper Link】 【Pages】:876-879

【Authors】: Bhanu Pratap Gupta ; Devang Vira ; S. Sudarshan

【Abstract】: Checking if an SQL query has been written correctly is not an easy task. Formal verification is not applicable, since it is based on comparing a specification with an implementation, whereas SQL queries are essentially a specification without any implementation. Thus, the standard approach for testing queries is to manually check query results on test datasets. Intuitively, a mutant is a query variant that could have been the correct query if the query was in error; a mutant is killed by a dataset if the original query and the mutant return different results on the dataset. In this paper, we address the problem of generation of test data for an SQL query, to kill mutants. Our work focuses in particular on a class of join/outer-join mutants, which are a common cause of error. To minimize human effort in testing, our techniques generate a test suite containing small and intuitive test datasets, combining them into a single dataset where possible. In the absence of foreign-key constraints, and under certain assumptions, the test suite is complete, i.e. it kills all nonequivalent mutations, in the class of join-type mutations that we consider. We also consider some common types of where-clause predicate mutants. Our techniques have been implemented in a prototype data generation tool.

【Keywords】: SQL; formal verification; query processing; SQL mutants; SQL query; X-data query; foreign-key constraints; formal verification; join mutants; nonequivalent mutations; outer-join mutants; query variant; test data generation; where-clause predicate mutants; Couplings; Data privacy; Databases; Diseases; Hospitals; Influenza; Protection; Publishing; Sun; Testing
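The notion of a dataset "killing" a join-type mutant can be made concrete with a toy schema (hypothetical tables, not output of the X-data tool): the original query uses an inner join, the mutant an outer join, and a small-table row with no join partner makes their results differ.

```python
import sqlite3

# A dataset kills a mutant when the original query and the mutant
# return different results on it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer(id INTEGER, name TEXT);
    CREATE TABLE orders(cid INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders   VALUES (1, 10);  -- bob has no orders
""")
original = "SELECT c.name FROM customer c JOIN orders o ON c.id = o.cid"
mutant   = "SELECT c.name FROM customer c LEFT JOIN orders o ON c.id = o.cid"
r1 = sorted(conn.execute(original).fetchall())
r2 = sorted(conn.execute(mutant).fetchall())
print(r1, r2)  # the unmatched 'bob' row appears only in the mutant's result
```

A test-data generator in the paper's spirit must produce such an unmatched row; on a dataset where every customer has an order, the two queries agree and the mutant survives.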

Social Networks and Similarity Queries 4

97. Discovery-driven graph summarization.

Paper Link】 【Pages】:880-891

【Authors】: Ning Zhang ; Yuanyuan Tian ; Jignesh M. Patel

【Abstract】: Large graph datasets are ubiquitous in many domains, including social networking and biology. Graph summarization techniques are crucial in such domains as they can assist in uncovering useful insights about the patterns hidden in the underlying data. One important type of graph summarization is to produce small and informative summaries based on user-selected node attributes and relationships, allowing users to interactively drill down or roll up to navigate through summaries with different resolutions. However, two key components are missing from the previous work in this area that limit the use of this method in practice. First, the previous work only deals with categorical node attributes. Consequently, users have to manually bucketize numerical attributes based on domain knowledge, which is not always possible. Moreover, users often have to manually iterate through many resolutions of summaries to identify the most interesting ones. This paper addresses both of these key issues to make the interactive graph summarization approach more useful in practice. We first present a method to automatically categorize numerical attribute values by exploiting the domain knowledge hidden inside the node attribute values and graph link structures. Furthermore, we propose an interestingness measure for graph summaries to point users to the potentially most insightful summaries. Using two real datasets, we demonstrate the effectiveness and efficiency of our techniques.

【Keywords】: data handling; data mining; graph theory; pattern classification; categorical node attribute; graph link structures; graph summarization; Biology computing; Computer networks; Data mining; Databases; Facebook; Humans; Navigation; Pervasive computing; Social network services; Usability
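The attribute-based roll-up that the abstract builds on can be sketched in a few lines: group nodes by a categorical attribute and count cross-group edges. This is only the baseline operation; the paper's contributions (automatic bucketing of numerical attributes, the interestingness measure) are not shown, and the attribute name here is a made-up example.

```python
from collections import Counter

def summarize(nodes, edges, attr):
    """Roll up a graph by one categorical node attribute: one summary node
    per attribute value, with edge multiplicities between groups."""
    group = {n: nodes[n][attr] for n in nodes}
    members = Counter(group.values())          # size of each summary node
    links = Counter()
    for u, v in edges:                         # undirected: canonical order
        links[tuple(sorted((group[u], group[v])))] += 1
    return dict(members), dict(links)

nodes = {1: {"dept": "cs"}, 2: {"dept": "cs"}, 3: {"dept": "bio"}}
m, l = summarize(nodes, [(1, 2), (1, 3), (2, 3)], "dept")
print(m, l)  # → {'cs': 2, 'bio': 1} {('cs', 'cs'): 1, ('bio', 'cs'): 2}
```

Drill-down then amounts to re-grouping by a finer attribute, which is what makes automatic bucketing of numerical values so useful in practice.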

98. The similarity join database operator.

Paper Link】 【Pages】:892-903

【Authors】: Yasin N. Silva ; Walid G. Aref ; Mohamed H. Ali

【Abstract】: Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, ε-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of that of the equivalent query that uses only regular operators, while ε-Join's execution time is 20% to 90% of that of its equivalent regular-operator query for the useful case of small ε (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of those of the initial query plans.

【Keywords】: SQL; data mining; database management systems; DBMS operators; data cleaning; external memory data; multimedia applications; multiple application domains; query optimization; record linkage; sensor networks phenomena detection; similarity join database operator; video applications; Application software; Cleaning; Computer science; Couplings; Engines; Multimedia databases; Pipeline processing; Query processing; Scalability; Sensor phenomena and characterization
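The semantics of the ε-Join operator on one-dimensional values can be sketched with a plane-sweep over sorted inputs. This only illustrates what the operator computes; the paper implements it inside the DBMS executor, not as application code.

```python
def eps_join(R, S, eps):
    """ε-Join sketch: all pairs (r, s) with |r - s| <= eps, via a sweep
    over both inputs sorted ascending (O(sort + output))."""
    R, S = sorted(R), sorted(S)
    out, start = [], 0
    for r in R:
        while start < len(S) and S[start] < r - eps:
            start += 1                    # values before r - eps never match again
        i = start
        while i < len(S) and S[i] <= r + eps:
            out.append((r, S[i]))
            i += 1
    return out

print(eps_join([1, 5], [2, 7], 2))  # → [(1, 2), (5, 7)]
```

Small ε keeps the output (and hence the cost) close to a regular equi-join, which matches the abstract's observation that the operator pays off most for ε up to a few percent of the domain range.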

99. Anonymizing weighted social network graphs.

Paper Link】 【Pages】:904-907

【Authors】: Sudipto Das ; Ömer Egecioglu ; Amr El Abbadi

【Abstract】: The increasing popularity of social networks has initiated a fertile research area in information extraction and data mining. Although such analysis can facilitate better understanding of sociological, behavioral, and other interesting phenomena, there is a growing concern about personal privacy being breached, thereby requiring effective anonymization techniques. In this paper, we consider edge weight anonymization in social graphs. Our approach builds a linear programming (LP) model which preserves properties of the graph that are expressible as linear functions of the edge weights. Such properties form the foundations of many important graph-theoretic algorithms such as shortest paths, k-nearest neighbors, minimum spanning tree, etc. Off-the-shelf LP solvers can then be used to find solutions to the resulting model where the computed solution constitutes the weights in the anonymized graph. As a proof of concept, we choose the shortest paths problem, and experimentally evaluate the proposed techniques using real social network data sets.

【Keywords】: data privacy; linear programming; social networking (online); trees (mathematics); anonymization techniques; data mining; edge weight anonymization; graph-theoretic algorithms; information extraction; k-nearest neighbors; linear programming; minimum spanning tree; off-the-shelf LP solvers; personal privacy; shortest paths; weighted social network graphs; Advertising; Computer science; Data mining; Facebook; Information analysis; Linear programming; Privacy; Shortest path problem; Social network services; Tree graphs

100. Efficient similarity matching of Time Series Cliques with natural relations.

Paper Link】 【Pages】:908-911

【Authors】: Zhe Zhao ; Bin Cui ; Wee Hyong Tok ; Jiakui Zhao

【Abstract】: A Time Series Clique (TSC) consists of multiple time series. In each TSC, the time series hold some natural relations with each other. In conventional time series retrieval methods, such natural relations are often ignored. In this paper, we formalize the problem of similarity search over TSC databases and develop a novel framework for similarity search on TSC data, which considers both time series patterns and relations. We conduct an extensive performance study, and the results show the effectiveness and efficiency of the proposed method.

【Keywords】: data handling; information retrieval; pattern matching; time series; natural relation; similarity matching; time series data; time series retrieval method; Application software; Computer science; Computer science education; Educational technology; Electrocardiography; Information retrieval; Laboratories; Layout; Multimedia databases; Music

Stream Processing 4

101. Continuous query evaluation over distributed sensor networks.

Paper Link】 【Pages】:912-923

【Authors】: Oana Jurca ; Sebastian Michel ; Alexandre Herrmann ; Karl Aberer

【Abstract】: In this paper we address the problem of processing continuous multi-join queries, over distributed data streams. Our approach makes use of existing work in the field of publish/subscribe systems. We show how these principles can be ported to our envisioned architectural model by enriching the common query model with location dependent attributes. We allow users to subscribe to a set of sensor attributes, a service that requires processing multi-join correlation queries. The goal is to decrease the overall network traffic consumption by removing redundant subscriptions and eliminating unrequested events close to the publishing sensors. This is non-trivial, especially in the presence of multi-join queries without any central control mechanism. Our approach is based on the concept of filter-split-forward phases for efficient subscription filtering and placement inside the network. We report on a performance evaluation using a real-world dataset, showing the improvements over the state-of-the-art, as we reduce the overall data traffic by half.

【Keywords】: middleware; query processing; central control mechanism; common query model; continuous query evaluation; distributed data streams; distributed sensor networks; filter-split-forward phases; location dependent attributes; multijoin queries; publish-subscribe systems; redundant subscriptions removal; unrequested events elimination; Communication system traffic control; Humidity measurement; Hydrologic measurements; Large-scale systems; Meteorology; Query processing; Radiation monitoring; Sensor phenomena and characterization; Snow; Subscriptions

102. Space-efficient online approximation of time series data: Streams, amnesia, and out-of-order.

Paper Link】 【Pages】:924-935

【Authors】: Sorabh Gandhi ; Luca Foschini ; Subhash Suri

【Abstract】: In this paper, we present an abstract framework for online approximation of time-series data that yields a unified set of algorithms for several popular models: data streams, amnesic approximation, and out-of-order stream approximation. Our framework essentially develops a popular greedy method of bucket-merging into a more generic form, for which we can prove space-quality approximation bounds. When specialized to piecewise linear bucket approximations and commonly used error metrics, such as L2 or L∞, our framework leads to provable error bounds where none were known before, offers new results, or yields simpler and unified algorithms. The conceptual simplicity of our scheme translates into highly practical implementations, as borne out in our simulation studies: the algorithms produce near-optimal approximations, require very small memory footprints, and run extremely fast.

【Keywords】: approximation theory; error analysis; greedy algorithms; time series; amnesic approximation; bucket-merging method; data streams; error metrics; greedy method; online approximation; out-of-order stream approximation; piecewise linear bucket approximations; space-quality approximation bounds; time series data; Approximation algorithms; Computer science; Database systems; Monitoring; Out of order; Piecewise linear approximation; Piecewise linear techniques; Real time systems; Time measurement; Wavelet coefficients
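The greedy bucket-merging idea underlying the framework can be sketched for the simplest case, piecewise-constant buckets under L2 error: repeatedly merge the adjacent pair whose merged bucket has the smallest error until the space budget is met. This offline toy omits the paper's streaming, amnesic, and out-of-order machinery.

```python
def bucket_merge(values, budget):
    """Greedy L2 bucket-merging sketch: compress a series to at most
    `budget` piecewise-constant buckets (start, end, mean)."""
    # each bucket carries (start, end, sum, sum_of_squares)
    buckets = [(i, i + 1, v, v * v) for i, v in enumerate(values)]

    def sse(b):                          # L2 error of the bucket's mean
        n = b[1] - b[0]
        return b[3] - b[2] * b[2] / n

    def merged(a, b):
        return (a[0], b[1], a[2] + b[2], a[3] + b[3])

    while len(buckets) > budget:
        j = min(range(len(buckets) - 1),
                key=lambda i: sse(merged(buckets[i], buckets[i + 1])))
        buckets[j:j + 2] = [merged(buckets[j], buckets[j + 1])]
    return [(b[0], b[1], b[2] / (b[1] - b[0])) for b in buckets]

print(bucket_merge([1, 1, 10, 10], 2))  # → [(0, 2, 1.0), (2, 4, 10.0)]
```

Keeping only (sum, sum-of-squares) per bucket is what makes the memory footprint tiny: merge error is computable without revisiting the raw points.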

103. Approximation trade-offs in Markovian stream processing: An empirical study.

Paper Link】 【Pages】:936-939

【Authors】: Julie Letchner ; Christopher Re ; Magdalena Balazinska ; Matthai Philipose

【Abstract】: A large amount of the world's data is both sequential and imprecise. Such data is commonly modeled as Markovian streams; examples include words/sentences inferred from raw audio signals, or discrete location sequences inferred from RFID or GPS data. The rich semantics and large volumes of these streams make them difficult to query efficiently. In this paper, we study the effects-on both efficiency and accuracy-of two common stream approximations. Through experiments on a realworld RFID data set, we identify conditions under which these approximations can improve performance by several orders of magnitude, with only minimal effects on query results. We also identify cases when the full rich semantics are necessary.

【Keywords】: Markov processes; approximation theory; data handling; query processing; GPS data; Markovian stream processing; RFID data set; stream approximation; Application software; Computer science; Computerized monitoring; Costs; Data engineering; Data models; Global Positioning System; Hospitals; Radiofrequency identification; Streaming media

104. FENCE: Continuous access control enforcement in dynamic data stream environments.

Paper Link】 【Pages】:940-943

【Authors】: Rimma V. Nehme ; Hyo-Sang Lim ; Elisa Bertino

【Abstract】: In this paper, we present the FENCE framework, which addresses the problem of continuous access control enforcement in dynamic data stream environments. The distinguishing characteristics of FENCE include: (1) the stream-centric approach to security, (2) the symmetric modeling of security for both continuous queries and streaming data, and (3) security-aware query processing that considers both regular and security-related selectivities. In FENCE, both data and query security restrictions are modeled in the form of streaming security metadata, called "security punctuations", embedded inside data streams. We have implemented FENCE in a prototype DSMS and briefly summarize our performance observations.

【Keywords】: authorisation; meta data; query processing; FENCE; continuous access control enforcement; dynamic data stream environments; security aware query processing; security metadata; security punctuations; security related selectivities; security symmetric modeling; streaming data queries; Access control; Adaptive filters; Data privacy; Data security; Digital signal processing; Patient monitoring; Prototypes; Query processing; Runtime; Ubiquitous computing

Publishing Privacy 4

105. A privacy-preserving approach to policy-based content dissemination.

Paper Link】 【Pages】:944-955

【Authors】: Ning Shang ; Mohamed Nabeel ; Federica Paci ; Elisa Bertino

【Abstract】: We propose a novel scheme for selective distribution of content, encoded as documents, that preserves the privacy of the users to whom the documents are delivered and is based on an efficient and novel group key management scheme. Our document broadcasting approach is based on access control policies specifying which users can access which documents, or subdocuments. Based on such policies, a broadcast document is segmented into multiple subdocuments, each encrypted with a different key. In line with modern attribute-based access control, policies are specified against identity attributes of users. However our broadcasting approach is privacy-preserving in that users are granted access to a specific document, or subdocument, according to the policies without the need of providing in clear information about their identity attributes to the document publisher. Under our approach, not only does the document publisher not learn the values of the identity attributes of users, but it also does not learn which policy conditions are verified by which users, thus inferences about the values of identity attributes are prevented. Moreover, our key management scheme on which the proposed broadcasting approach is based is efficient in that it does not require sending the decryption keys to the users along with the encrypted document. Users are able to reconstruct the keys to decrypt the authorized portions of a document based on subscription information they have received from the document publisher. The scheme also efficiently handles new user subscriptions and revocation of existing subscriptions.

【Keywords】: authorisation; data privacy; document handling; public key cryptography; access control policies; decryption keys; document broadcasting; group key management scheme; policy based content dissemination; privacy preserving; Access control; Broadcasting; Content management; Cryptography; Identity management systems; Internet; Law; Privacy; Protection; Subscriptions

106. Global privacy guarantee in serial data publishing.

Paper Link】 【Pages】:956-959

【Authors】: Raymond Chi-Wing Wong ; Ada Wai-Chee Fu ; Jia Liu ; Ke Wang ; Yabo Xu

【Abstract】: While previous works on privacy-preserving serial data publishing consider the scenario where sensitive values may persist over multiple data releases, we find that no previous work provides sufficient protection for sensitive values that can change over time, which should be the more common case. In this work, we propose to study the privacy guarantee for such transient sensitive values, which we call the global guarantee. We formally define the problem of achieving this guarantee. We show that data satisfying the global guarantee also satisfies a privacy guarantee commonly adopted in the privacy literature, called the local guarantee.

【Keywords】: data privacy; global privacy guarantee; privacy literature; privacy preserving serial data publishing; serial data publishing; transient sensitive values; Couplings; Data privacy; Databases; Diseases; Hospitals; Influenza; Protection; Publishing; Stomach; Sun

107. XColor: Protecting general proximity privacy.

Paper Link】 【Pages】:960-963

【Authors】: Ting Wang ; Ling Liu

【Abstract】: As a severe threat in anonymized data publication, proximity breach is gaining increasing attention. Such breach occurs when an attacker learns with high confidence that the sensitive information of a victim associates with a set of semantically proximate values, even though not sure about the exact one. Recently (¿, ¿)-dissimilarity [14] has been proposed as an effective countermeasure against general proximity attack. In this paper, we present a detailed analytical study on the fulfillment of this principle, derive criteria to efficiently test its satisfiability for given microdata, and point to a novel anonymization model, XCOLOR, with theoretical guarantees on both operation efficiency and utility preservation.

【Keywords】: data privacy; graph colouring; security of data; XCOLOR; anonymized data publication; privacy preservation; proximity breach; proximity privacy; utility preservation; Data models; Data privacy; Educational institutions; Protection; Publishing; Sufficient conditions; Testing

108. Correlation hiding by independence masking.

Paper Link】 【Pages】:964-967

【Authors】: Yufei Tao ; Jian Pei ; Jiexing Li ; Xiaokui Xiao ; Ke Yi ; Zhengzheng Xing

【Abstract】: Extracting useful correlation from a dataset has been extensively studied. In this paper, we deal with the opposite, namely, a problem we call correlation hiding (CH), which is fundamental in numerous applications that need to disseminate data containing sensitive information. In this problem, we are given a relational table T whose attributes can be classified into three disjoint sets A, B, and C. The objective is to distort some values in T so that A becomes independent from B, and yet, their correlation with C is preserved as much as possible. CH is different from all the problems studied previously in the area of data privacy, in that CH demands complete elimination of the correlation between two sets of attributes, whereas the previous research focuses on partial elimination up to a certain level. A new operator called independence masking is proposed to solve the CH problem. Implementations of the operator with good worst case guarantees are described in the full version of this short note.

【Keywords】: approximation theory; computational complexity; data encapsulation; data mining; minimisation; correlation hiding; data dissemination; data privacy; independence masking; relational table; Approximation algorithms; Ash; Association rules; Data mining; Data privacy; Databases; Government

Data Clouds 5

109. Monitoring continuous state violation in datacenters: Exploring the time dimension.

Paper Link】 【Pages】:968-979

【Authors】: Shicong Meng ; Ting Wang ; Ling Liu

【Abstract】: Monitoring global states of an application deployed over distributed nodes becomes prevalent in today's datacenters. State monitoring requires not only correct monitoring results but also minimum communication cost for efficiency and scalability. Most existing work adopts an instantaneous state monitoring approach, which triggers state alerts whenever a constraint is violated. Such an approach, however, may cause frequent and unnecessary state alerts due to unpredictable monitored value bursts and momentary outliers that are common in large-scale Internet applications. These false alerts may further lead to expensive and problematic counter-measures. To address this issue, we introduce window-based state monitoring in this paper. Window-based state monitoring evaluates whether state violation is continuous within a time window, and thus, gains immunity to short-term value bursts and outliers. Furthermore, we find that exploring the monitoring time window at distributed nodes achieves significant communication savings over instantaneous monitoring. Based on this finding, we develop WISE, a system that efficiently performs WIndow-based StatE monitoring at datacenter-scale. WISE is highlighted with three sets of techniques. First, WISE uses distributed filtering time windows and intelligently avoids global information collecting to achieve communication efficiency, while guaranteeing monitoring correctness at the same time. Second, WISE provides a suite of performance tuning techniques to minimize communication cost based on a sophisticated cost model. Third, WISE also employs a set of novel performance optimization techniques. Extensive experiments over both real world and synthetic traces show that WISE achieves a 50% - 90% reduction in communication cost compared with existing instantaneous monitoring approaches and simple alternative schemes.

【Keywords】: computer centres; computerised monitoring; optimisation; statistics; continuous state violation; data centers; filtering time windows; instantaneous state monitoring approach; momentary outliers; performance optimization techniques; performance tuning techniques; state monitoring; time dimension; unpredictable monitored value bursts; violation monitoring; window-based state monitoring; Application software; Computerized monitoring; Costs; Distributed computing; Educational institutions; Information filtering; Information filters; Internet; Large-scale systems; Scalability
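The core distinction the abstract draws, instantaneous versus window-based violation detection, can be sketched with a monitor that alerts only after a constraint has been violated for a full window of consecutive readings. The threshold and window length are example parameters; the real WISE system additionally distributes the window across nodes to save communication.

```python
class WindowMonitor:
    """Window-based state monitoring sketch: fire an alert only when the
    monitored value exceeds the threshold for w consecutive readings,
    so momentary bursts and outliers are filtered out."""
    def __init__(self, threshold, w):
        self.threshold, self.w, self.run = threshold, w, 0

    def observe(self, value):
        # extend or reset the current violation run
        self.run = self.run + 1 if value > self.threshold else 0
        return self.run >= self.w      # True => sustained violation

m = WindowMonitor(threshold=100, w=3)
print([m.observe(v) for v in [120, 90, 130, 140, 150]])
# instantaneous monitoring would alert on the very first 120
```

The single burst at 120 and the short run 130, 140 produce no alert; only the third consecutive violation does.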

110. Cost-efficient and differentiated data availability guarantees in data clouds.

Paper Link】 【Pages】:980-983

【Authors】: Nicolas Bonvin ; Thanasis G. Papaioannou ; Karl Aberer

【Abstract】: Failures of any type are common in current datacenters. As data scales up, its availability becomes more complex, while different availability levels per application or per data item may be required. In this paper, we propose a self-managed key-value store that dynamically allocates the resources of a data cloud to several applications in a cost-efficient and fair way. Our approach offers and dynamically maintains multiple differentiated availability guarantees to each different application despite failures. We employ a virtual economy, where each data partition acts as an individual optimizer and chooses whether to migrate, replicate or remove itself based on net benefit maximization regarding the utility offered by the partition and its storage and maintenance cost. Comprehensive experimental evaluations suggest that our solution is highly scalable and adaptive to query rate variations and to resource upgrades/failures.

【Keywords】: Internet; data handling; cost differentiated data availability; cost efficient data availability; data clouds; data scales up; query rate variations; virtual economy; Application software; Availability; Bandwidth; Cloud computing; Cost function; Delay; Hardware; Internet; Resource management; Scattering

111. Intensional associations in dataspaces.

Paper Link】 【Pages】:984-987

【Authors】: Marcos Antonio Vaz Salles ; Jens Dittrich ; Lukas Blunschi

【Abstract】: Dataspace applications necessitate the creation of associations among data items over time. For example, once information about people is extracted from sources on the Web, associations among them may emerge as a consequence of different criteria, such as their city of origin or their elected hobbies. In this paper, we advocate a declarative approach to specifying these associations. We propose that each set of associations be defined by an association trail. An association trail is a query-based definition of how items are connected by intensional (i.e., virtual) association edges to other items in the dataspace. We study the problem of processing neighborhood queries over such intensional association graphs. The naive approach to neighborhood query processing over intensional graphs is to materialize the whole graph and then apply previous work on dataspace graph indexing to answer queries. We present in this paper a novel indexing technique, the grouping-compressed index (GCI), that has better worst-case indexing cost than the naive approach. In our experiments, GCI is shown to provide an order of magnitude gain in indexing cost over the naive approach, while remaining competitive in query processing time.

【Keywords】: data mining; database indexing; database management systems; query processing; Web; association trail; dataspace graph indexing; dataspaces applications; grouping compressed index; indexing technique; information extraction; intensional association edges; intensional association graphs; intensional associations; magnitude gain; naive approach; neighborhood query processing; query processing time; worst case indexing cost; Cities and towns; Content management; Costs; Data mining; Indexing; Information management; Query processing; Universal Serial Bus

112. A tuple space for social networking on mobile phones.

Paper Link】 【Pages】:988-991

【Authors】: Emre Sarigöl ; Oriana Riva ; Gustavo Alonso

【Abstract】: Social networking is increasingly becoming a popular means of communication for online users. The trend is also true for offline scenarios where people use their mobile phones to network with nearby buddies. In this paper, we propose a distributed tuple space for social networking on ad hoc networks. We describe the tuple space model and its operations, and give evidence of its advantages for ad hoc social networking through several applications.

【Keywords】: ad hoc networks; mobile computing; mobile handsets; social networking (online); ad hoc network; distributed tuple space; mobile phone; social networking; Ad hoc networks; Communication system control; Computer science; Global Positioning System; Mobile ad hoc networks; Mobile communication; Mobile handsets; Network topology; Portals; Social network services
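The tuple space model the abstract builds on follows the classic Linda operations: publish a tuple, read one matching a template, or read-and-remove it. A minimal local sketch is below; the paper's contribution is distributing this abstraction over an ad hoc phone network, which this toy does not attempt, and the example tuples are invented.

```python
class TupleSpace:
    """Minimal Linda-style tuple space: out() publishes, rd() reads a
    matching tuple, inp() reads and removes it. None in a template is
    a wildcard."""
    def __init__(self):
        self.tuples = []

    def out(self, t):
        self.tuples.append(t)

    def _match(self, tpl, t):
        return len(tpl) == len(t) and all(p is None or p == x
                                          for p, x in zip(tpl, t))

    def rd(self, template):
        return next((t for t in self.tuples if self._match(template, t)), None)

    def inp(self, template):
        t = self.rd(template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out(("status", "alice", "cafe"))
print(ts.rd(("status", "alice", None)))  # → ('status', 'alice', 'cafe')
```

Template matching with wildcards is what makes the model a natural fit for "nearby buddy" queries: a phone can ask for any tuple of a given shape without knowing who published it.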

113. Overlapping community search for social networks.

Paper Link】 【Pages】:992-995

【Authors】: Arnau Padrol-Sureda ; Guillem Perarnau-Llobet ; Julian Pfeifle ; Victor Muntés-Mulero

【Abstract】: Finding decompositions of a graph into a family of clusters is crucial to understanding its underlying structure. While most existing approaches focus on partitioning the nodes, real-world datasets suggest the presence of overlapping communities. We present OCA, a novel algorithm to detect overlapped communities in large data graphs. It outperforms previous proposals in terms of execution time, and efficiently handles large graphs containing more than 10^8 nodes and edges.

【Keywords】: graph theory; OCA; community search overlapping; data graphs; graph decompositions; real world datasets; social networks

Industry Session: Data Warehousing 3

114. Hive - a petabyte scale data warehouse using Hadoop.

Paper Link】 【Pages】:996-1005

【Authors】: Ashish Thusoo ; Joydeep Sen Sarma ; Namit Jain ; Zheng Shao ; Prasad Chakka ; Ning Zhang ; Suresh Anthony ; Hao Liu ; Raghotham Murthy

【Abstract】: The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation which is used by companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug in custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore - that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. In Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.

【Keywords】: SQL; competitive intelligence; data warehouses; public domain software; query processing; Hadoop software; HiveQL language; Metastore system catalog; SQL-like declarative language; arrays; business intelligence; data exploration; map-reduce jobs; maps; nested compositions; open-source map-reduce implementation; petabyte scale data warehouse; primitive types; query compilation; query optimization; Companies; Data warehouses; Facebook; Hardware; Libraries; Open source software; Plugs; Query processing; Statistics; Warehousing

115. Tuning servers, storage and database for energy efficient data warehouses.

Paper Link】 【Pages】:1006-1017

【Authors】: Meikel Poess ; Raghunath Othayoth Nambiar

【Abstract】: Undoubtedly, reducing power consumption is at the top of the priority list for system vendors and data center managers, who are challenged by customers, analysts, and government agencies to implement green initiatives. Hardware and software vendors have developed an array of power preserving techniques. On-demand-driven clock speeds for processors, energy efficient power supplies, and operating-system-controlled dynamic power modes are just a few hardware examples. Software vendors have contributed to energy efficiency by implementing power efficient coding methods, such as advanced compression and enabling applications to take advantage of large memory caches. However, adoption of these power-preserving technologies in data centers is not straightforward, especially for large, complex applications such as data warehouses. Data warehouse workloads typically have oscillating resource utilizations, which makes identifying the largest power consumers difficult. Most importantly, while preserving power remains a critical consideration, performance and availability goals must still be met with systems using power-preserving technologies. This paper evaluates the tradeoffs between existing power-saving techniques and their performance impact on data warehouse applications. Our analysis will guide system developers and data center managers in making informed decisions regarding adopting power-preserving techniques.

【Keywords】: computer centres; data warehouses; energy conservation; tuning; advanced compression; data center managers; database tuning; energy efficient data warehouses; energy efficient power supplies; green initiatives; memory caches; on-demand-driven clock speeds; operating system controlled dynamic power modes; servers tuning; storage tuning; tradeoffs; Application software; Data warehouses; Databases; Energy consumption; Energy efficiency; Energy management; Energy storage; Hardware; Power system management; Tuning

116. A new algorithm for small-large table outer joins in parallel DBMS.

Paper Link】 【Pages】:1018-1024

【Authors】: Yu Xu ; Pekka Kostamaa

【Abstract】: Large enterprises have been relying on parallel database management systems (PDBMS) to process their ever-increasing data volume and complex queries. Business intelligence tools used by enterprises frequently generate a large number of outer joins and require high performance from the underlying database systems. A common type of outer join in business applications is the small-large table outer join studied in this paper, where one table is relatively small and the other is large. We present an efficient and easy-to-implement algorithm called DER (Duplication and Efficient Redistribution) for small-large table outer joins. Our experimental results show that the DER algorithm significantly speeds up query elapsed time and scales linearly.

【Keywords】: competitive intelligence; data warehouses; parallel algorithms; business intelligence tools; duplication-and-efficient redistribution algorithm; parallel database management systems; small-large table outer joins; Bismuth; Communication industry; Computer architecture; Data warehouses; Database systems; Deductive databases; Density estimation robust algorithm; History; Law enforcement; Pattern analysis
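The duplication-based strategy described in the abstract can be illustrated in miniature. Below is a hedged, single-process sketch of the general idea (broadcast the small table to every partition, join locally, then pad unmatched small-table rows with NULLs); the function name, the data layout, and the two-column rows are illustrative assumptions, not the paper's actual DER implementation.

```python
# Simplified, single-process illustration of a small-large table left
# outer join via duplication of the small table (illustrative only).

def outer_join_small_large(small, large_partitions, key=0):
    """Left outer join of `small` against a partitioned `large` table.

    `small` is broadcast (duplicated) to every partition, an inner join
    runs locally, and the keys of matched small-table rows are collected
    so unmatched rows can be padded with NULLs afterwards.
    Assumes two-column rows for the NULL padding.
    """
    results, matched = [], set()
    for part in large_partitions:           # each partition joins locally
        index = {}
        for row in part:
            index.setdefault(row[key], []).append(row)
        for srow in small:                  # duplicated small table
            for lrow in index.get(srow[key], []):
                results.append(srow + lrow)
                matched.add(srow[key])
    for srow in small:                      # pad unmatched rows with NULLs
        if srow[key] not in matched:
            results.append(srow + (None, None))
    return results
```

The real algorithm avoids materializing the full duplicate on every node; this sketch only shows why duplication sidesteps an expensive redistribution of the large table.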

Industry Session: Data, Data, and More Data 3

117. Data cleansing as a transient service.

Paper Link】 【Pages】:1025-1036

【Authors】: Tanveer A. Faruquie ; K. Hima Prasad ; L. Venkata Subramaniam ; Mukesh K. Mohania ; Girish Venkatachaliah ; Shrinivas Kulkarni ; Pramit Basu

【Abstract】: There is often a transient need within enterprises for data cleansing, which can be satisfied by offering data cleansing as a transient service. Every time a data cleansing need arises it should be possible to provision hardware, software and staff for accomplishing the task and then dismantling the set up. In this paper we present such a system that uses virtualized hardware and software for data cleansing. We share actual experiences gained from building such a system. We use a cloud infrastructure to offer virtualized data cleansing instances that can be accessed as a service. We build a system that is scalable, elastic and configurable. Each enterprise has unique needs, which makes it necessary to customize both the infrastructure and the cleansing algorithms to address these needs. In this paper we will present a system that is easily configurable to suit the data cleansing needs of an enterprise.

【Keywords】: Internet; data mining; cleansing algorithms; cloud infrastructure; data cleansing; transient service; virtualized data cleansing; virtualized hardware; virtualized software; Clouds; Costs; Customer service; Databases; Decision making; Delay; Error analysis; Hardware; Investments; Software maintenance

118. XBRL repository - an industrial approach of management of XBRL documents.

Paper Link】 【Pages】:1037-1047

【Authors】: Zhen Hua Liu ; Thomas Baby ; Sriram Krishnamurthy ; Ying Lu ; Qin Yu ; Anguel Novoselsky ; Vikas Arora

【Abstract】: XBRL (extensible Business Reporting Language) is an XML based standard for electronic data exchange and communication of business financial documents and reports among government bodies, regulators, financial institutions and reporting entities. In this paper, we analyze the XBRL model from a data management perspective to show why XBRL has created a new emerging vertical domain of database applications that offers opportunities and technical challenges for the database community. We outline the concept of an XBRL repository as a data hub to store and process collections of XBRL documents while maintaining document integrity based on XBRL semantics, providing document manageability, operational completeness, and XBRL dictionary, query and business intelligence services. We then share our experiences of building the main components of the XBRL repository using the Oracle XML enabled RDBMS, which leverages both XML and relational technologies as a backbone to host industrial strength XBRL applications. We also highlight technical challenges and potential solutions for each of the components. Finally, we propose the potential of leveraging the XBRL data model concept for providing XML data normalization and supporting XML with multiple hierarchies.

【Keywords】: XML; document handling; query processing; relational databases; XBRL dictionary; XBRL documents management; XBRL repository; XBRL semantics; XML based standard; XML data normalization; business financial documents; business intelligence service; data management perspective; document integrity; document manageability; electronic data exchange; extensible business reporting language; operational completeness; query service; relational technologies; Business communication; Communication standards; Data models; Databases; Dictionaries; Government; Industrial relations; Regulators; Spine; XML

119. Visualizing large-scale RDF data using Subsets, Summaries, and Sampling in Oracle.

Paper Link】 【Pages】:1048-1059

【Authors】: Seema Sundara ; Medha Atre ; Vladimir Kolovski ; Souripriya Das ; Zhe Wu ; Eugene Inseok Chong ; Jagannathan Srinivasan

【Abstract】: The paper addresses the problem of visualizing large scale RDF data via a 3-S approach, namely, by using, (1) Subsets: to present only relevant data for visualization; both static and dynamic subsets can be specified, (2) Summaries: to capture the essence of RDF data being viewed; summarized data can be expanded on demand thereby allowing users to create hybrid (summary-detail) fisheye views of RDF data, and (3) Sampling: to further optimize visualization of large-scale data where a representative sample suffices. The visualization scheme works with both asserted and inferred triples (generated using RDF(S) and OWL semantics). This scheme is implemented in Oracle by developing a plug-in for the Cytoscape graph visualization tool, which uses functions defined in an Oracle PL/SQL package, to provide fast and optimized access to Oracle Semantic Store containing RDF data. Interactive visualization of a synthesized RDF data set (LUBM 1 million triples), two native RDF datasets (Wikipedia 47 million triples and UniProt 700 million triples), and an OWL ontology (eClassOwl with a large class hierarchy including over 25,000 OWL classes, 5,000 properties, and 400,000 class-properties) demonstrates the effectiveness of our visualization scheme.

【Keywords】: data visualisation; knowledge representation languages; meta data; ontologies (artificial intelligence); optimisation; 3S approach; Cytoscape graph visualization; OWL ontology; Oracle; PL package; RDF data; SQL package; data summarization; data visualization; resource description framework; Data visualization; Delay; Displays; Large-scale systems; OWL; Ontologies; Packaging; Resource description framework; Sampling methods; Wikipedia
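For the Sampling part of the 3-S approach above, one standard way to draw a fixed-size representative sample from a very large triple set is reservoir sampling. The paper does not specify its sampling algorithm, so the sketch below is only an illustrative assumption:

```python
import random

# Reservoir sampling: draw a fixed-size uniform sample from a stream of
# unknown length in one pass (illustrative stand-in for the paper's
# unspecified sampling step).

def reservoir_sample(triples, size, seed=1):
    rng = random.Random(seed)
    sample = []
    for i, t in enumerate(triples):
        if i < size:
            sample.append(t)           # fill the reservoir first
        else:
            j = rng.randint(0, i)      # keep item with probability size/(i+1)
            if j < size:
                sample[j] = t
    return sample
```

A single pass suffices, which matters when the triple store (e.g., 700 million UniProt triples) is far too large to materialize for visualization.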

Industry Session: Query Optimization 3

120. Incorporating partitioning and parallel plans into the SCOPE optimizer.

Paper Link】 【Pages】:1060-1071

【Authors】: Jingren Zhou ; Per-Åke Larson ; Ronnie Chaiken

【Abstract】: Massive data analysis on large clusters presents new opportunities and challenges for query optimization. Data partitioning is crucial to performance in this environment. However, data repartitioning is a very expensive operation so minimizing the number of such operations can yield very significant performance improvements. A query optimizer for this environment must therefore be able to reason about data partitioning including its interaction with sorting and grouping. SCOPE is a SQL-like scripting language used at Microsoft for massive data analysis. A transformation-based optimizer is responsible for converting scripts into efficient execution plans for the Cosmos distributed computing platform. In this paper, we describe how reasoning about data partitioning is incorporated into the SCOPE optimizer. We show how relational operators affect partitioning, sorting and grouping properties and describe how the optimizer reasons about and exploits such properties to avoid unnecessary operations. In most optimizers, consideration of parallel plans is an afterthought done in a postprocessing step. Reasoning about partitioning enables the SCOPE optimizer to fully integrate consideration of parallel, serial and mixed plans into the cost-based optimization. The benefits are illustrated by showing the variety of plans enabled by our approach.

【Keywords】: SQL; data analysis; query processing; relational databases; sorting; Cosmos distributed computing platform; SCOPE optimizer; SQL-like scripting language; data partitioning; grouping; massive data analysis; parallel plans; partitioning plans; query optimization; relational operators; sorting; transformation-based optimizer; Data analysis; Distributed computing; Memory; Network servers; Parallel processing; Performance analysis; Query processing; Sorting; Web and internet services; Web services
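The property reasoning sketched in the abstract can be illustrated with a toy check: a plan step requires rows partitioned on certain columns, and the optimizer inserts a repartition operator only when the delivered partitioning does not already satisfy the requirement. The function names and the subset rule below are simplified assumptions, not the actual SCOPE optimizer code.

```python
# Toy illustration of partitioning-property reasoning in a query
# optimizer (simplified; not the SCOPE implementation).

def satisfies(delivered, required):
    """A delivered hash-partitioning on columns C satisfies a requirement
    on columns R if C is a non-empty subset of R: rows that agree on all
    of R also agree on C, so they already land in the same partition."""
    return bool(delivered) and set(delivered) <= set(required)

def plan_exchange(delivered, required):
    """Return the (possibly empty) list of exchange operators to insert."""
    if satisfies(delivered, required):
        return []                      # reuse the existing partitioning
    return [("repartition", tuple(required))]
```

Avoiding the repartition when `satisfies` holds is exactly the kind of saving the abstract describes: data repartitioning is the expensive operation, so recognizing a compatible existing partitioning yields large gains.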

121. Rule profiling for query optimizers and their implications.

Paper Link】 【Pages】:1072-1080

【Authors】: Surajit Chaudhuri ; Leo Giakoumakis ; Vivek R. Narasayya ; Ravishankar Ramamurthy

【Abstract】: Many modern optimizers use a transformation rule based framework. While there has been a lot of work on identifying new transformation rules, there has been little work focused on empirically evaluating the effectiveness of these transformation rules. In this paper we present the results of an empirical study of "profiling" transformation rules in Microsoft SQL Server using a diverse set of real world and benchmark query workloads. We also discuss the implications of these results for designing and testing query optimizers.

【Keywords】: SQL; optimisation; query processing; microsoft SQL server; query optimizers rule profiling; transformation rule; Benchmark testing; Constraint optimization; Design optimization; Prototypes; Software tools; Volcanoes

122. Data desensitization of customer data for use in optimizer performance experiments.

Paper Link】 【Pages】:1081-1092

【Authors】: Malú Castellanos ; Bin Zhang ; Ivo Jimenez ; Perla Ruiz ; Miguel Durazo ; Umeshwar Dayal ; Lily Jow

【Abstract】: Improving the performance and functionality of database system optimizers requires experimentation on real customer data. Often these data are of a sensitive nature, and the only way to retain them is to apply a non-reversible transformation that obfuscates them. However, for the database optimizer to generate exactly the same query plans as for the sensitive data, the transformation has to preserve the order and some important properties of the data distribution. Unfortunately, existing data obfuscation techniques do not preserve all of these properties and therefore are not applicable in this context. In this paper we present a Desensitizer tool that we have developed for optimizer performance experiments of HP's Neoview high availability data warehousing product. The tool is based on novel numeric and string desensitization algorithms which are agnostic to the database system. We explain the core concepts behind the algorithms, how they preserve the required data properties, and important implementation considerations that were made. We present the architecture of the Desensitizer tool and results of the extensive validation that we conducted.

【Keywords】: data handling; data warehouses; customer data; data desensitization; data obfuscation techniques; data warehousing product; database system optimizers; desensitizer tool; nonreversible transformation; optimizer performance; query plans; Availability; Cryptography; Database systems; Laboratories; Research and development; Robustness; Secure storage; Simultaneous localization and mapping; Sorting; Warehousing
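A minimal sketch of the order-preserving idea: if the only property to keep were the sort order of distinct values, assigning strictly increasing random outputs to sorted inputs would suffice. The paper's algorithms preserve additional distribution properties; this toy version is an assumption-laden illustration only.

```python
import random

# Toy order-preserving numeric desensitization: equal inputs map to equal
# outputs and the sort order of distinct values is preserved, while the
# original magnitudes are hidden (illustrative, not the paper's method).

def desensitize(values, seed=42):
    rng = random.Random(seed)
    mapping, acc = {}, 0
    for v in sorted(set(values)):
        acc += rng.randint(1, 1000)    # strictly positive random gap
        mapping[v] = acc
    return [mapping[v] for v in values]
```

Because the optimizer's decisions hinge on orderings and distribution shape rather than raw values, a transformation like this (suitably extended) can yield identical query plans on obfuscated data.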

123. A demonstration of the MaxStream federated stream processing system.

Paper Link】 【Pages】:1093-1096

【Authors】: Irina Botan ; Younggoo Cho ; Roozbeh Derakhshan ; Nihal Dindar ; Ankush Gupta ; Laura M. Haas ; Kihong Kim ; Chulwon Lee ; Girish Mundada ; Ming-Chien Shan ; Nesime Tatbul ; Ying Yan ; Beomjin Yun ; Jin Zhang

【Abstract】: MaxStream is a federated stream processing system that seamlessly integrates multiple autonomous and heterogeneous Stream Processing Engines (SPEs) and databases. In this paper, we propose to demonstrate the key features of MaxStream using two application scenarios, namely the Sales Map & Spikes business monitoring scenario and the Linear Road Benchmark, each with a different set of requirements. More specifically, we will show how the MaxStream Federator can translate and forward the application queries to two different commercial SPEs (Coral8 and StreamBase), as well as how it does so under various persistency requirements.

【Keywords】: competitive intelligence; distributed databases; query processing; relational databases; Linear Road Benchmark; MaxStream; Sales Map & Spikes business monitoring scenario; autonomous stream processing engines; databases; federated stream processing system; heterogeneous stream processing engines; Application software; Bismuth; Business; Computerized monitoring; Delay; Engines; Forward contracts; Intelligent sensors; Marketing and sales; Spatial databases

124. E-Cube: Multi-dimensional event sequence processing using concept and pattern hierarchies.

Paper Link】 【Pages】:1097-1100

【Authors】: Mo Liu ; Elke A. Rundensteiner ; Kara Greenfield ; Chetan Gupta ; Song Wang ; Ismail Ari ; Abhay Mehta

【Abstract】: Many modern applications including tag based mass transit systems, RFID-based supply chain management systems and online financial feeds require special purpose event stream processing technology to analyze vast amounts of sequential multi-dimensional data available in real-time data feeds. Traditional online analytical processing (OLAP) systems are not designed for real-time pattern-based operations, while Complex Event Processing (CEP) systems are designed for sequence detection and do not support OLAP operations. We will demonstrate a novel E-Cube model that combines CEP and OLAP techniques for multi-dimensional event pattern analysis at different abstraction levels. A London transit scenario will be given to demonstrate the utility and performance of this proposed technology.

【Keywords】: data mining; E-Cube; London transit scenario; OLAP; RFID based supply chain management systems; complex event processing; concept hierarchies; event stream processing technology; multidimensional event pattern analysis; online analytical processing; online financial feeds; pattern hierarchies; real time pattern based operation; sequential multidimensional data; tag based mass transit system; Diseases; Event detection; Feeds; Hospitals; Monitoring; Navigation; Pattern analysis; Real time systems; Supply chain management; Urban areas

Paper Link】 【Pages】:1101-1104

【Authors】: Ziyang Liu ; Yichuan Cai ; Yi Chen

【Abstract】: This demo illustrates TargetSearch, an XML search engine that addresses an open problem in XML keyword search: given relevant matches to keywords, how to compose query results properly so that they can be effectively ranked and easily digested by users. The approaches adopted in the literature generate either overwhelmingly large results or fragmentary results, both of which may cause ranking schemes to be ineffective. Intuitively, each query has a search target, and each result should contain exactly one instance of the search target along with its evidence. We developed TargetSearch, which composes atomic and intact query results driven by users' search targets.

【Keywords】: XML; search engines; TargetSearch search engine; XML keyword search engine; fragmentary results; large results; query results; ranking schemes; search target; Databases; Informatics; Joining processes; Keyword search; Pattern matching; Query processing; Search engines; Systems engineering and theory; XML

Paper Link】 【Pages】:1105-1108

【Authors】: Guoliang Li ; Shengyue Ji ; Chen Li ; Jiannan Wang ; Jianhua Feng

【Abstract】: TASTIER is a research project on the new information-access paradigm called type-ahead search, in which systems find answers to a keyword query on-the-fly as users type in the query. In this paper we study how to support fuzzy type-ahead search in TASTIER. Supporting fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which have been used by many people on a daily basis. The systems received overwhelmingly positive feedback from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of these systems and demonstrate several of them. We show that our efficient techniques can indeed allow this search paradigm to scale on large amounts of data.

【Keywords】: Internet; fuzzy set theory; human computer interaction; query formulation; TASTIER; friendly interfaces; fuzzy search feature; fuzzy type ahead search; information access paradigm; keyword query on-the-fly; Computer science; Feedback; Fuzzy systems; Information science; Information systems; Keyword search; Laboratories; Search engines; Web search; Web services
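The fuzzy type-ahead idea can be sketched naively: a dictionary word matches a partial query if some prefix of the word is within edit distance k of the query. TASTIER uses efficient incremental index structures rather than this brute-force recomputation; the sketch below is purely illustrative.

```python
# Naive fuzzy type-ahead matching: a word matches the partial query if
# some prefix of the word is within edit distance k of the query.
# (Illustrative only; TASTIER answers this incrementally per keystroke.)

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_prefix_matches(query, words, k=1):
    hits = []
    for w in words:
        # best edit distance between the query and any prefix of w
        if min(edit_distance(query, w[:i]) for i in range(len(w) + 1)) <= k:
            hits.append(w)
    return hits
```

For example, the partial query "smth" matches both "smith" and "smyth" within one edit, which is the kind of tolerance that helps when users do not know an entity's exact spelling.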

127. MASS: a multi-facet domain-specific influential blogger mining system.

Paper Link】 【Pages】:1109-1112

【Authors】: Yichuan Cai ; Yi Chen

【Abstract】: With the rapid development of Web 2.0 technology and e-business, bloggers play significant roles in the blogosphere as well as the external world. In particular, influential bloggers can bring great business value to modern enterprises. Although several systems for mining influential bloggers are available, they measure the influence of bloggers in general rather than in specific domains, which does not match real application requirements such as business advertisement and personalized recommendation. In this paper, we propose an effective model to mine the top-k influential bloggers according to their interest domains, the impact and attitude of the comments on their posts, as well as their authority in the network of page links. In this demonstration, we present MASS, an effective system for mining influential bloggers. We will present the techniques in MASS, its experimental evaluation, as well as its applications.

【Keywords】: Internet; data mining; MASS; Web 2.0 technology; blogosphere; e-business; interest domains; multifacet domain specific influential blogger mining system; top-k influential blogger
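A hedged sketch of the multi-facet scoring the abstract describes: combine a blogger's fit to the query domain, the impact and attitude of comments on their posts, and their link authority into one score, then take the top k. The weights and record fields below are invented for illustration and are not MASS's actual model.

```python
import heapq

# Toy multi-facet influence scoring (field names and weights are
# invented assumptions, not the MASS model).

def top_k_influential(bloggers, domain, k=2,
                      w_interest=0.4, w_comments=0.3, w_authority=0.3):
    def score(b):
        interest = b["interests"].get(domain, 0.0)   # fit to the domain
        comments = b["comment_impact"]               # impact x attitude
        authority = b["authority"]                   # e.g., link-based rank
        return (w_interest * interest + w_comments * comments
                + w_authority * authority)
    return heapq.nlargest(k, bloggers, key=score)
```

Note how a blogger strong in a different domain scores low for this query even with decent authority, which is exactly the domain-specific behavior the abstract argues general influence miners lack.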

Paper Link】 【Pages】:1113-1116

【Authors】: Jongwuk Lee ; Seung-won Hwang ; Zaiqing Nie ; Ji-Rong Wen

【Abstract】: We demonstrate Product EntityCube, a product recommendation and navigation system. While the unprecedented scale of a product search portal enables it to satisfy users with diverse needs, this scale also complicates product recommendation. Specifically, our target application poses the unique challenge of overcoming insufficient user profiles and feedback. To address this problem, we organize query results into clusters representing different user perceptions of similarity, and provide a navigational UI to handle personal interests. Specifically, we first discuss hybrid object clustering, which captures diverse user interests from millions of Web pages and disambiguates different perceptions using feature-based similarity. We then discuss skyline object ranking to highlight interesting items in each cluster. Our demonstration illustrates how Product EntityCube can enrich users' product shopping experiences.

【Keywords】: pattern clustering; portals; recommender systems; retail data processing; user interfaces; hybrid object clustering; navigational user interface; product entity cube; product search portal; query results; recommendation system; skyline object ranking; user product shopping experiences; Asia; Collaboration; Feedback; Filtering; Motion pictures; Navigation; Portals; Recommender systems; Taxonomy; Web pages

Paper Link】 【Pages】:1117-1120

【Authors】: Daniel Deutch ; Ohad Greenshpan ; Tova Milo

【Abstract】: Mashups integrate a set of complementary Web-services and data sources, often referred to as mashlets. We consider here a common scenario where the integrated mashlets are part of larger Web applications, and their integration yields a set of inter-connected applications. We refer to them as Mashed-up Applications (abbr. MashAPP). The inter-connections between the mashlets enrich the individual Web applications, but at the same time make user navigation within them more intricate, as actions in one application may affect others. To address this difficulty, we present COMPASS, a system that assists users in their navigation through MashAPPs. The system employs a novel top-k algorithm to propose to users the most effective navigation paths for their specified goals. The suggestions are continually adapted to the choices users make while navigating.

【Keywords】: Web services; application program interfaces; COMPASS; MashAPPs; Mashed-up applications; Web applications; Web services; data sources; mashlets; mashups integrate; navigation paths; top-k algorithm; Current measurement; Data security; Drugs; Feeds; Information retrieval; Mashups; Navigation; Pharmaceuticals; Portals; Web services

130. GenerIE: Information extraction using database queries.

Paper Link】 【Pages】:1121-1124

【Authors】: Luis Tari ; Phan Huy Tu ; Jörg Hakenberg ; Yi Chen ; Tran Cao Son ; Graciela Gonzalez ; Chitta Baral

【Abstract】: Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be re-applied from scratch to the entire text corpus even though only a small part of the corpus might be affected. In this demonstration proposal, we describe a novel paradigm for information extraction: we store the parse trees output by text processing in a database, and then express extraction needs using queries, which can be evaluated and optimized by databases. Compared with the existing approaches, database queries for information extraction enable generic extraction and minimize reprocessing. However, such an approach also poses a lot of technical challenges, such as language design, optimization and automatic query generation. We will present the opportunities and challenges that we met when building GenerIE, a system that implements this paradigm.

【Keywords】: query processing; text analysis; tree data structures; GenerlE; automatic query generation; database queries; generic extraction; information extraction; information extraction systems; text processing; Biomedical engineering; Biomedical informatics; Computer science; Data engineering; Data mining; Database languages; Pipelines; Proposals; Tagging; Text processing

131. Power-aware data analysis in sensor networks.

Paper Link】 【Pages】:1125-1128

【Authors】: Daniel Klan ; Katja Hose ; Marcel Karnstedt ; Kai-Uwe Sattler

【Abstract】: Sensor networks have evolved into a powerful infrastructure component for event monitoring in many application scenarios. In addition to simple filter and aggregation operations, an important task in processing sensor data is data mining - the identification of relevant information and patterns. Limited capabilities of sensor nodes in terms of storage and processing capacity, battery lifetime, and communication demand power-efficient, preferably sensor-local, processing. In this paper, we present AnduIN, a system for developing, deploying, and running in-network data mining tasks. The system consists of a data stream processing engine, a library of operators for sensor-local processing, a box-and-arrow editor for specifying data mining tasks and deployment, a GUI providing the user with current information about the network and running queries, and an alerter notifying the user if a better query execution plan is available. At the demonstration site, we plan to show our system in action using burst detection as an example application.

【Keywords】: data analysis; data mining; graphical user interfaces; power aware computing; AnduIN system; alerter; box-and-arrow editor; burst detection; data stream processing engine; graphical user interface; in-network data mining tasks; operators library; pattern identification; power-aware data analysis; relevant information identification; sensor networks; Batteries; Capacitive sensors; Data analysis; Data mining; Engines; Graphical user interfaces; Information filtering; Information filters; Libraries; Monitoring
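As a toy stand-in for the burst-detection application mentioned above, a sensor-local operator might flag sliding windows whose sum far exceeds the long-run average. The window size, threshold factor, and function name are illustrative assumptions, not AnduIN's actual operator.

```python
from collections import deque

# Toy sensor-local burst detector: flag a window whose sum exceeds a
# multiple of the running mean (illustrative assumption, not AnduIN code).

def detect_bursts(stream, window=3, factor=2.0):
    win, bursts = deque(maxlen=window), []
    total, count = 0.0, 0
    for i, x in enumerate(stream):
        total += x
        count += 1
        win.append(x)
        mean = total / count           # long-run average so far
        if len(win) == window and sum(win) > factor * window * mean:
            bursts.append(i)           # index where the burst window ends
    return bursts
```

Running such a detector on the node itself, rather than shipping raw readings to a base station, is the power-saving point of sensor-local processing: only burst notifications need to be transmitted.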

132. A view-based Monitoring for Privacy-aware Web services.

Paper Link】 【Pages】:1129-1132

【Authors】: Hassina Meziane ; Salima Benbernou ; Aouda K. Zerdali ; Mohand-Said Hacid ; Mike P. Papazoglou

【Abstract】: The demonstration addresses the problem of monitoring compliance with a privacy agreement that spells out a consumer's privacy rights and how their private information must be handled by the service provider. We present a Privacy Agreement Monitoring system, an easy-to-use and efficient tool for dynamically and tightly controlling the private data usage flow in the area of Web services. Reasoning over the observations can enhance compliance with the privacy agreement and enrich the knowledge on misuses.

【Keywords】: Web services; business data processing; data privacy; Web service provider; Web services; consumer privacy right; privacy agreement monitoring system; private data usage flow; Authentication; Authorization; Control systems; Data privacy; Gain control; Law; Legal factors; Monitoring; Protection; Web services

133. Viewing a world of annotations through AnnoVIP.

Paper Link】 【Pages】:1133-1136

【Authors】: Konstantinos Karanasos ; Spyros Zoupanos

【Abstract】: The proliferation of electronic content has notably led to the emergence of large corpora of interrelated structured documents (such as HTML and XML Web pages) and semantic annotations (typically expressed in RDF), which further complement these documents. Documents and annotations may be authored independently by different users or programs. We present AnnoVIP, a peer-to-peer platform capable of efficiently exploiting a multitude of annotated documents, based on innovative materialized views.

【Keywords】: Internet; hypermedia markup languages; peer-to-peer computing; HTML Web pages; XML Web pages; anno VIP; annotated documents; electronic content proliferation; interrelated structured documents; large corpora; peer-to-peer platform; semantic annotations; HTML; Peer to peer computing; Publishing; Query processing; Resource description framework; Social network services; Software tools; Tagging; Web pages; XML

134. MashRank: Towards uncertainty-aware and rank-aware mashups.

Paper Link】 【Pages】:1137-1140

【Authors】: Mohamed A. Soliman ; Mina Saleeb ; Ihab F. Ilyas

【Abstract】: Mashups are situational applications that build data flows to link the contents of multiple Web sources. Often, ranking the results of a mashup is handled in a materialize-then-sort fashion, since combining multiple data sources usually destroys their original rankings. Moreover, although uncertainty is ubiquitous on the Web, most mashup tools do not reason about or reflect such uncertainty. We introduce MashRank, a mashup tool that treats ranking as a first-class citizen in mashup construction, and allows for rank-joining Web sources with uncertain information. To the best of our knowledge, no current tools allow for similar functionalities. MashRank encapsulates a new probabilistic model reflecting uncertainty in ranking, a set of techniques implemented as pipelined operators in mashup plans, and a probabilistic ranking infrastructure based on Monte-Carlo sampling.

【Keywords】: Internet; Monte Carlo methods; sampling methods; uncertainty handling; MashRank; Monte-Carlo sampling; mashup tool; multiple Web sources; multiple data sources combination; probabilistic model reflecting uncertainty; probabilistic ranking infrastructure; rank aware mashups; rank joining Web sources; uncertainty aware mashup; Mashups
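The Monte-Carlo ranking infrastructure can be sketched as follows: when item scores are uncertain, repeatedly sample concrete scores and count how often each item lands in the top k. Representing uncertainty as uniform (low, high) intervals is an assumption made here for illustration; MashRank's actual probabilistic model may differ.

```python
import random

# Monte-Carlo estimate of each item's probability of ranking in the
# top-k under score uncertainty (uniform intervals are an illustrative
# assumption, not MashRank's model).

def topk_probabilities(intervals, k=1, trials=20000, seed=7):
    rng = random.Random(seed)
    counts = [0] * len(intervals)
    for _ in range(trials):
        draws = [rng.uniform(lo, hi) for lo, hi in intervals]
        for idx in sorted(range(len(draws)),
                          key=lambda i: draws[i], reverse=True)[:k]:
            counts[idx] += 1
    return [c / trials for c in counts]
```

When intervals overlap, the estimates converge to the true top-k probabilities as the number of trials grows; non-overlapping intervals produce certainties, as one would expect.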

135. T-Warehouse: Visual OLAP analysis on trajectory data.

Paper Link】 【Pages】:1141-1144

【Authors】: Luca Leonardi ; Gerasimos Marketos ; Elias Frentzos ; Nikos Giatrakos ; Salvatore Orlando ; Nikos Pelekis ; Alessandra Raffaetà ; Alessandro Roncato ; Claudio Silvestri ; Yannis Theodoridis

【Abstract】: Technological advances in sensing technologies and wireless telecommunication devices enable novel research fields related to the management of trajectory data. As usually happens in the data management world, the challenge after storing the data is the implementation of appropriate analytics for extracting useful knowledge. However, traditional data warehousing systems and techniques were not designed for analyzing trajectory data. Thus, in this work, we demonstrate a framework that transforms the traditional data cube model into a trajectory warehouse. As a proof-of-concept, we implemented T-WAREHOUSE, a system that incorporates all the required steps for Visual Trajectory Data Warehousing, from trajectory reconstruction and ETL processing to Visual OLAP analysis on mobility data.

【Keywords】: data mining; data warehouses; ETL processing; T-Warehouse; traditional data cube model; trajectory data management; visual OLAP analysis; visual trajectory data warehousing; Cities and towns; Data analysis; Data mining; Energy management; Informatics; Information analysis; Knowledge management; Paper technology; Roads; Warehousing

136. WikiAnalytics: Ad-hoc querying of highly heterogeneous structured data.

Paper Link】 【Pages】:1145-1148

【Authors】: Andrey Balmin ; Emiran Curtmola

【Abstract】: Searching and extracting meaningful information out of highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions rely either on rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge or an expert user and which require up-front data integration, or, at the other extreme, on keyword search queries over relational databases and semistructured data, which are too imprecise to specify the user's intent exactly. To address these limitations, we propose an alternative search paradigm in order to derive tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigation along conceptual dimensions that describe the records. Therefore, we cluster documents based on fields and values that contain the query keywords. We build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit. We describe WikiAnalytics, a system that facilitates data extraction from the Wikipedia infobox collection. WikiAnalytics provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.

【Keywords】: query languages; relational databases; user interfaces; WikiAnalytics system; Wikipedia infobox collection; ad-hoc querying; data extraction; data integration; expert user; highly heterogeneous structured data; keyword search queries; query languages; relational databases; schema knowledge; search results disambiguation; universal navigational lattice; user interface; Catalogs; Data mining; Database languages; HTML; Keyword search; Lattices; Navigation; Query processing; Relational databases; Wikipedia
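The clustering step behind the universal navigational lattice can be sketched simply: group records by the set of fields in which the query keywords occur. This is a simplified illustration with invented field names, not the WikiAnalytics implementation.

```python
# Toy version of keyword-hit clustering: records are grouped by the set
# of fields whose values contain the query keywords (the grouping idea
# behind the universal navigational lattice, greatly simplified).

def cluster_by_hit_fields(docs, keywords):
    clusters = {}
    for doc in docs:
        hits = frozenset(
            field for field, value in doc.items()
            if any(kw.lower() in str(value).lower() for kw in keywords))
        clusters.setdefault(hits, []).append(doc)
    return clusters
```

Each cluster key (a set of fields) is one conceptual dimension a user could navigate along; the UNL would additionally organize these clusters by set containment.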

137. SMARTINT: A system for answering queries over web databases using attribute dependencies.

Paper Link】 【Pages】:1149-1152

【Authors】: Ravi Gummadi ; Anupam Khulbe ; Aravind Kalavagattu ; Sanil Salvi ; Subbarao Kambhampati

【Abstract】: Many web databases can be seen as providing partial and overlapping information about entities in the world. To answer queries effectively, we need to integrate the information about individual entities that is fragmented over multiple sources. At first blush, this is just the inverse of the traditional database normalization problem: rather than go from a universal relation to normalized tables, we want to reconstruct the universal relation given the tables (sources). The standard way of reconstructing the entities involves joining the tables. Unfortunately, because of the autonomous and decentralized way in which the sources are populated, they often do not have primary key - foreign key relations. While tables do share attributes, naive joins over these shared attributes can reconstruct many spurious entities, seriously compromising precision. Our system, SMARTINT, is aimed at addressing the problem of data integration in such scenarios. Given a query, our system uses Approximate Functional Dependencies (AFDs) to piece together a tree of relevant tables and schemas for joining them. The result tuples produced by our system strike a favorable balance between precision and recall.

【Keywords】: Internet; database management systems; query processing; SMARTINT system; Web database; approximate functional dependencies; attribute dependencies; data integration; database normalization problem; normalized tables; primary key-foreign key relations; query answering system; universal relation; Computer science; Databases; Engine cylinders; Intrusion detection; Vehicles
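The abstract above builds on approximate functional dependencies, i.e., dependencies X → Y that hold for most but not all tuples. A minimal sketch of the standard confidence measure — the car-listing data and the 0.75 figure are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Confidence of lhs -> rhs: the fraction of rows kept if, for each
    lhs value, we retain only its most frequent rhs value."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return kept / len(rows)

# Toy car listings: "model -> make" holds for 3 of the 4 rows.
rows = [
    {"model": "Civic",  "make": "Honda"},
    {"model": "Civic",  "make": "Honda"},
    {"model": "Accord", "make": "Honda"},
    {"model": "Civic",  "make": "Toyota"},  # noisy source
]
conf = afd_confidence(rows, "model", "make")  # 0.75
```

An AFD miner would accept "model -> make" above a chosen confidence threshold and could then use the dependency to link tables that share the attribute, instead of relying on absent key - foreign key constraints.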

Demo Session 2: Scalability, Design, Optimization and Miscellaneous 14

138. Mini-Me: A min-repro system for database software.

Paper Link】 【Pages】:1153-1156

【Authors】: Nicolas Bruno ; Rimma V. Nehme

【Abstract】: Testing and debugging database software is often challenging and time consuming. A very arduous task for DB testers is finding a min-repro - the "simplest possible setup" that reproduces the original problem. Currently, a great deal of searching for min-repros is carried out manually using non-database-specific tools, which is both slow and error-prone. We propose to demonstrate a system, called Mini-Me, designed to ease and speed up the task of finding min-repros in database-related products. Mini-Me employs several effective tools, including novel simplification transformations, a high-level language for creating search scripts and automation, "record-and-replay" functionality, and visualization of the search space and results. In addition to the standard application mode, the system can be used in a game mode. The latter can provide an intrinsically motivating environment in which DB testers develop successful search strategies, which can be data-mined, recorded as patterns, and used as recommendations for DB testers in the future. A system like Mini-Me can potentially save hours of time (for both customers and testers isolating a problem), which could result in faster fixes and large cost savings for organizations.

【Keywords】: database management systems; program debugging; program testing; Mini-Me; application mode; database-specific tools; debugging database software; game mode; high-level language; min-repro system; record-and-replay functionality; simplification transformations; testing database software; Automation; Costs; High level languages; Software debugging; Software systems; Software testing; Spatial databases; System testing; Visual databases; Visualization
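The min-repro search described above can be illustrated with a greedy reduction loop in the spirit of delta debugging. This is an assumed stand-in for illustration only; the paper's own simplification transformations are richer and database-specific.

```python
def minimize(setup, reproduces):
    """Greedily drop elements of `setup` (e.g., SQL statements) as long as
    the failure still reproduces. The result is 1-minimal: removing any
    single remaining element makes the problem disappear."""
    changed = True
    while changed:
        changed = False
        for i in range(len(setup)):
            candidate = setup[:i] + setup[i + 1:]
            if reproduces(candidate):
                setup = candidate
                changed = True
                break
    return setup

# Hypothetical bug: reproduces whenever statements "A" and "C" co-occur.
bug = lambda stmts: "A" in stmts and "C" in stmts
min_repro = minimize(["A", "B", "C", "D"], bug)  # ["A", "C"]
```

Each call to `reproduces` replays the candidate setup against the database, which is why record-and-replay support matters for automating this loop.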

139. I/O-efficient statistical computing with RIOT.

Paper Link】 【Pages】:1157-1160

【Authors】: Yi Zhang ; Weiping Zhang ; Jun Yang

【Abstract】: Statistical analysis of massive data is becoming indispensable to science, commerce, and society today. Such analysis requires efficient, flexible storage support and special optimization techniques. In this demo, we present RIOT (R with I/O Transparency), a system that extends R, a popular computing environment for statistical data analysis. RIOT makes R programs I/O-efficient in a way transparent to users. It features a flexible array storage manager and an optimization engine suitable for statistical and numerical operations. RIOT also seamlessly integrates with external database systems, offering additional opportunities for processing data that reside in databases by blurring the boundary between database and host-language processing. This demo will show how statistical computation can be effectively and efficiently handled by RIOT.

【Keywords】: input-output programs; optimisation; statistical analysis; I/O-efficient statistical computing; RIOT; external database systems; flexible array storage manager; host language processing; optimization techniques; statistical analysis; Adaptive arrays; Business; Computer science; Data analysis; Database systems; Libraries; MATLAB; Multidimensional systems; Spatial databases; Switches

140. Interactive physical design tuning.

Paper Link】 【Pages】:1161-1164

【Authors】: Nicolas Bruno ; Surajit Chaudhuri

【Abstract】: In the last decade, automated physical design tuning has become a relevant area of research. The process of tuning a workload has become more flexible but also more complex, and getting the best design upfront has become difficult. We propose a paradigm shift for physical design tuning, in which sessions are highly interactive, allowing DBAs to quickly try different options, identify problems, and obtain physical designs in an agile manner.

【Keywords】: database management systems; tuning; DBA; interactive physical design tuning; workload tuning; Concrete; Costs; Data structures; Data visualization; Database systems; Design optimization; Engines; Feedback loop; File systems; Prototypes

141. Visualizing cost-based XQuery optimization.

Paper Link】 【Pages】:1165-1168

【Authors】: Andreas M. Weiner ; Theo Härder ; Renato Oliveira da Silva

【Abstract】: Developing a full-fledged cost-based XQuery optimizer is a fairly complex task. Nowadays, there is little knowledge concerning suitable cost formulae and optimization strategies for exploring and constraining the tremendously large search space. To allow for a fair assessment of different optimization strategies, physical algebra operators, and indexing approaches, we developed an extensible optimization framework. The framework is accompanied by a supportive visual explain tool that enables user interactions to refine the inspection and the comprehension of the query plans proposed. Using this tool, the optimizer can be dynamically reconfigured and the impact of different optimization strategies on the final query execution plan can be immediately visualized.

【Keywords】: XML; data visualisation; optimisation; query processing; transaction processing; cost-based XQuery optimization; extensible optimization framework; indexing approaches; optimization strategies; physical algebra operators; query plans; supportive visual explain tool; user interactions; Algebra; Computer science; Cost function; Database systems; Indexing; Information systems; Query processing; Visual databases; Visualization; XML

142. XML reasoning made practical.

Paper Link】 【Pages】:1169-1172

【Authors】: Pierre Genevès ; Nabil Layaïda

【Abstract】: We present a tool for the static analysis of XPath queries and XML Schemas. The tool introduces techniques used in the field of verification (such as binary decision diagrams) in order to efficiently solve XPath query satisfiability, containment, and equivalence, in the presence of real-world XML Schemas. The tool can be used in query optimizers, in order to prove soundness of query rewriting. It can also be used in type-checkers and optimizing compilers that need to perform all kinds of compile-time analyses involving XPath queries and XML tree constraints.

【Keywords】: XML; binary decision diagrams; formal verification; optimising compilers; query processing; XML reasoning; XML schemas; XML tree constraints; XPath queries; binary decision diagrams; compile-time analyses; optimizing compilers; query containment; query equivalence; query optimizers; query rewriting; query satisfiability; type-checkers; verification field; Boolean functions; Costs; Data structures; Independent component analysis; Java; Logic; Navigation; Optimizing compilers; Performance analysis; XML

143. TransScale: Scalability transformations for declarative applications.

Paper Link】 【Pages】:1173-1176

【Authors】: Alexander Böhm ; Erich Marth ; Carl-Christian Kanne

【Abstract】: The goal of the Demaq/TransScale system is to automate the distribution of complex application processes to large numbers of hosts. We implement distribution as a source-level transformation that turns the distribution-unaware application specification for a single host into a set of programs that can be executed on the various machines of a cluster.

【Keywords】: Internet; Demaq system; TransScale system; declarative application; distribution unaware application specification; scalability transformation; source level transformation; Application software; Cloud computing; Computer science; Data models; Data processing; Hardware; Logic; Mathematics; Scalability; XML

144. Reverse engineering models from databases to bootstrap application development.

Paper Link】 【Pages】:1177-1180

【Authors】: Ankit Malpani ; Philip A. Bernstein ; Sergey Melnik ; James F. Terwilliger

【Abstract】: Object-relational mapping systems have become often-used tools for providing application access to relational databases. In a database-first development scenario, the onus is on the developer to construct a meaningful object layer for the application, because shipping ORM tools only provide database reverse-engineering facilities that generate objects with a trivial one-to-one mapping. We built a tool, EdmGen++, that combines pattern-finding rules from the conceptual-modelling literature with configurable conditions that increase the likelihood that found patterns are semantically relevant. EdmGen++ produces a conceptual model with inheritance in Microsoft's Entity Data Model, which Microsoft's Entity Framework uses to support an executable object-to-relational mapping. The execution time of EdmGen++ on customer databases is reasonable for design-time use.

【Keywords】: C++ language; object-oriented databases; relational databases; reverse engineering; EdmGen++ tool; Microsoft entity data model; bootstrap application development; conceptual model; customer databases; database-first development scenario; databases; object-relational mapping systems; one-to-one mapping; relational databases; reverse engineering models; Accidents; Data models; Marine vehicles; Object oriented databases; Object oriented modeling; Pattern matching; Rails; Relational databases; Reverse engineering; Robustness

145. HECATAEUS: Regulating schema evolution.

Paper Link】 【Pages】:1181-1184

【Authors】: George Papastefanatos ; Panos Vassiliadis ; Alkis Simitsis ; Yannis Vassiliou

【Abstract】: HECATAEUS is an open-source software tool for enabling impact prediction, what-if analysis, and regulation of relational database schema evolution. We follow a graph theoretic approach and represent database schemas and database constructs, like queries and views, as graphs. Our tool enables the user to create hypothetical evolution events and examine their impact over the overall graph before these are actually enforced on it. It also allows definition of rules for regulating the impact of evolution via (a) default values for all the nodes of the graph and (b) simple annotations for nodes deviating from the default behavior. Finally, HECATAEUS includes a metric suite for evaluating the impact of evolution events and detecting crucial and vulnerable parts of the system.

【Keywords】: graph theory; public domain software; relational databases; HECATAEUS; database construction; graph theoretic approach; open source software; regulating schema evolution; relational database; Application software; Computer crashes; Control systems; Embedded software; Event detection; Floods; Open source software; Relational databases; Remuneration; Software tools

146. ROX: The robustness of a run-time XQuery optimizer against correlated data.

Paper Link】 【Pages】:1185-1188

【Authors】: Riham Abdel Kader ; Peter A. Boncz ; Stefan Manegold ; Maurice van Keulen

【Abstract】: We demonstrate ROX, a run-time optimizer of XQueries that focuses on finding the best execution order of XPath steps and relational joins in an XQuery. The problem of join ordering has been extensively researched, but the proposed techniques are still unsatisfactory: they either rely on a cost model, which might result in inaccurate estimates, or explore only a restricted number of plans from the search space. ROX was developed to tackle these problems. ROX needs no cost model and defers query optimization to run-time, intertwining optimization and execution steps. In every optimization step, sampling techniques are used to estimate the cardinality of unexecuted steps and joins, in order to decide which sequence of operators to process next. Consequently, each execution step provides updated and accurate knowledge about intermediate results, which is used during the next optimization round. This demonstration will focus on: (i) illustrating the steps that ROX follows and the decisions it makes to choose a good join order; (ii) showing ROX's robustness in the face of data with different degrees of correlation; (iii) comparing the performance of the plan chosen by ROX to different plans picked from the search space; and (iv) showing that ROX's run-time overhead is restricted to a small fraction of the execution time.

【Keywords】: optimisation; query processing; ROX; XPath steps; correlated data; run time XQuery Optimizer; run-time intertwining optimization; Context modeling; Cost function; Degradation; Query processing; Relational databases; Robustness; Runtime; Sampling methods; Statistics; XML
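The sampling idea above can be sketched as follows: before each step, estimate the output cardinality of every candidate join on a sample of the current intermediate result, and execute the cheapest first. This is a simplified equi-join illustration with made-up tuples; ROX's actual estimators over XPath steps and joins are more sophisticated.

```python
import random

def estimate_cardinality(intermediate, relation, sample_size=100):
    """Measure the join hit-rate (tuples matching on their first attribute)
    on a sample of `intermediate` and scale it up to an output-size estimate."""
    keys = {t[0] for t in relation}
    sample = random.sample(intermediate, min(sample_size, len(intermediate)))
    hits = sum(1 for t in sample if t[0] in keys)
    return hits / len(sample) * len(intermediate)

def pick_next_join(intermediate, candidates):
    """Run-time policy: execute next the join with the smallest estimate."""
    return min(candidates, key=lambda r: estimate_cardinality(intermediate, r))

inter = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
selective = [(1, "x")]                          # joins with 1 of 4 tuples
unselective = [(1, "x"), (2, "y"), (3, "z")]    # joins with 3 of 4 tuples
```

After the chosen join is actually executed, its exact result replaces the estimate, which is the "updated and accurate knowledge" the abstract refers to.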

147. Symphony: A platform for search-driven applications.

Paper Link】 【Pages】:1189-1192

【Authors】: John C. Shafer ; Rakesh Agrawal ; Hady Wirawan Lauw

【Abstract】: We present the design of Symphony, a platform that enables non-developers to build and deploy a new class of search-driven applications that combine their data and domain expertise with content from search engines and other web services. The Symphony prototype has been built on top of Microsoft's Bing infrastructure. While Symphony naturally makes use of the customization capabilities exposed by Bing, its distinguishing feature is the capability it provides to the application creator to combine their proprietary data and domain expertise with content obtained from Bing. They can also integrate specialized data obtained from web services to enhance the richness of their applications. Finally, Symphony is targeted at non-developers and provides cloud services for the creation and deployment of applications.

【Keywords】: Web services; search engines; Microsoft Bing infrastructure; Symphony platform; Web services; cloud services; search engines; search-driven applications; Advertising; Books; Clouds; Indexing; Motion pictures; Prototypes; Search engines; Uniform resource locators; Web search; Web services

148. ProbClean: A probabilistic duplicate detection system.

Paper Link】 【Pages】:1193-1196

【Authors】: George Beskales ; Mohamed A. Soliman ; Ihab F. Ilyas ; Shai Ben-David ; Yubin Kim

【Abstract】: One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.

【Keywords】: data integrity; ProbClean; data cleaning systems; data quality problems; probabilistic duplicate detection system; Business; Cleaning; Computer science; Data mining; Data processing; Data warehouses; Detection algorithms; Query processing; Relational databases; Uncertainty
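The core observation above — that different parameter settings of a duplicate detector induce different repairs — can be illustrated with a toy threshold-based matcher. The similarity function and records are hypothetical; ProbClean's uncertainty model encodes this space of repairs far more compactly than materializing one clustering per setting.

```python
def similarity(a, b):
    """Toy matcher: Jaccard similarity over word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def repairs_over_thresholds(records, thresholds):
    """One clustering (repair) per threshold: each record joins the first
    cluster containing a sufficiently similar member, else starts its own."""
    repairs = {}
    for t in thresholds:
        clusters = []
        for rec in records:
            for cluster in clusters:
                if any(similarity(rec, member) >= t for member in cluster):
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])
        repairs[t] = clusters
    return repairs

records = ["john smith", "j smith", "mary jones"]
# A lenient threshold merges the two smiths; a strict one keeps all apart.
repairs = repairs_over_thresholds(records, [0.3, 0.9])
```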

149. TransDec: A spatiotemporal query processing framework for transportation systems.

Paper Link】 【Pages】:1197-1200

【Authors】: Ugur Demiryurek ; Farnoush Banaei Kashani ; Cyrus Shahabi

【Abstract】: In this paper, we present TransDec, an end-to-end, data-driven system which enables spatiotemporal queries in transportation systems with dynamic, real-time and historical data. TransDec fuses a variety of real-world spatiotemporal datasets including massive traffic sensor data, trajectory data, transportation network data, and point-of-interest data to create an immersive and realistic virtual model of a transportation system. With TransDec, we address the challenges in visualization, monitoring, querying and analysis of dynamic and large-scale transportation data in both time and space.

【Keywords】: query processing; temporal databases; traffic information systems; TransDec; end-to-end-data driven system; spatiotemporal query processing; traffic sensor data; transportation network data; transportation systems; Data visualization; Fuses; Query processing; Real time systems; Sensor fusion; Sensor systems; Spatiotemporal phenomena; Telecommunication traffic; Traffic control; Transportation

150. Provenance browser: Displaying and querying scientific workflow provenance graphs.

Paper Link】 【Pages】:1201-1204

【Authors】: Manish Kumar Anand ; Shawn Bowers ; Bertram Ludäscher

【Abstract】: This demonstration presents an interactive provenance browser for visualizing and querying data dependency (lineage) graphs produced by scientific workflow runs. The browser allows users to explore different views of provenance as well as to express complex and recursive graph queries through a high-level query language (QLP). Answers to QLP queries are lineage preserving in that queries return sets of lineage dependencies (denoting provenance graphs), which can be further queried and visually displayed (as graphs) in the browser. By combining provenance visualization, navigation, and query, the provenance browser can enable scientists to more easily access and explore scientific workflow provenance information.

【Keywords】: online front-ends; query languages; workflow management software; data dependency graphs; high-level query language; provenance browser; provenance navigation; provenance query; provenance visualization; scientific workflow provenance graphs; Anatomy; Bioinformatics; Computer science; Data visualization; Database languages; Genomics; Navigation; Query processing; Resource description framework; XML

151. Inconsistency resolution in online databases.

Paper Link】 【Pages】:1205-1208

【Authors】: Yannis Katsis ; Alin Deutsch ; Yannis Papakonstantinou ; Vasilis Vassalos

【Abstract】: Shared online databases allow community members to collaboratively maintain knowledge. Collaborative editing, though, inevitably leads to inconsistencies as different members enter erroneous data or conflicting opinions. Ideally, community members should be able to see and resolve these inconsistencies in a collaborative fashion. However, most current online databases do not support inconsistency resolution. Instead, they try to bypass the problem either by ignoring inconsistencies and treating data as if they were not conflicting, or by requiring inconsistencies to be resolved outside the system. To address this limitation, we propose Ricolla, an online database system that, by treating inconsistencies as first-class citizens, supports a natural workflow for the management of conflicting data. The system captures inconsistencies (so that community members can easily inspect them) and remains fully functional in their presence, thus enabling inconsistency resolution in an "as-you-go" fashion. Moreover, it supports several schemes for the resolution of inconsistencies, allowing, among other things, users to collaboratively resolve certain conflicts while continuing to disagree on others.

【Keywords】: database management systems; groupware; information retrieval systems; information services; Ricolla database system; collaborative editing; inconsistency resolution; online databases; resolve inconsistencies in collaborative environments; Collaborative work; Database systems; Environmental management; Motion pictures; Online Communities/Technical Collaboration

Panels 2

152. Cloudy skies for data management.

Paper Link】 【Pages】:1209

【Authors】: David Campbell ; Brian Cooper ; Dean Jacobs ; Ashok Joshi ; Volker Markl ; Srinivas Narayanan

【Abstract】: This panel discusses what cloud computing means for data management from an industrial perspective, illustrating challenges, opportunities, technologies, and impact. We will also ask where cloud computing stands on the hype curve and whether we are reinventing the wheel with some technologies, and outline important research questions the data management community should address to move cloud computing to the next level.

【Keywords】: Internet; data handling; cloud computing; data management; hype curve; industrial perspective; Application software; Cloud computing; Concurrent computing; Database languages; Facebook; Jacobian matrices; Large-scale systems; Mashups; Relational databases; Technology management

153. Database architecture (R)evolution: New hardware vs. new software.

Paper Link】 【Pages】:1210

【Authors】: Stavros Harizopoulos ; Tassos Argyros ; Peter A. Boncz ; Dan Dietterich ; Samuel Madden ; Florian M. Waas

【Abstract】: The last few years have been exciting for data management system designers. The explosion in user and enterprise data, coupled with the availability of newer, cheaper, and more capable hardware, has led system designers and researchers to rethink and, in some cases, reinvent the traditional DBMS architecture. In the space of data warehousing and analytics alone, more than a dozen new database product offerings have recently appeared, and dozens of research system papers are routinely published each year. In this panel, we will ask our panelists, a mix of industry and academic experts, which of those trends will have lasting effects on database system design, and which directions hold the biggest potential for future research. We are particularly interested in the differences in views and approaches between academic and industrial research.

【Keywords】: database management systems; DBMS; data management system; data warehousing; database architecture; database system design; Computer architecture; Data analysis; Data processing; Database systems; Educational institutions; Explosions; Field programmable gate arrays; Hardware; Software algorithms; Warehousing

Seminars 5

154. Anonymized Data: Generation, models, usage.

Paper Link】 【Pages】:1211-1212

【Authors】: Graham Cormode ; Divesh Srivastava

【Abstract】: Data anonymization techniques enable publication of detailed information, which permits ad hoc queries and analyses, while guaranteeing the privacy of sensitive information in the data against a variety of attacks. In this tutorial, we aim to present a unified framework of data anonymization techniques, viewed through the lens of data uncertainty. Essentially, anonymized data describes a set of possible worlds that include the original data. We show that anonymization approaches generate different working models of uncertain data, and that the privacy guarantees offered by k-anonymization and l-diversity can be naturally understood in terms of the sets of possible worlds that correspond to the anonymized data. Work in query evaluation over uncertain databases can hence be used for answering ad hoc queries over anonymized data. We identify new research problems for both the Data Anonymization and the Uncertain Data communities.

【Keywords】: data privacy; data anonymization techniques; data uncertainty; k-anonymization; l-diversity; sensitive information privacy; uncertain databases; Data engineering; Data privacy; Database systems; Diseases; Information analysis; Lenses; Performance analysis; Query processing; Remuneration; Uncertainty
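For readers unfamiliar with the k-anonymity guarantee the tutorial builds on, a minimal check can be sketched as follows; the generalized zip codes and age ranges below are illustrative.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values (e.g., a
    generalized zip code and age range) is shared by at least k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Each (zip, age) group covers two individuals, so an attacker who knows
# someone's zip and age cannot pin down a single row.
table = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cold"},
    {"zip": "148**", "age": "30+", "disease": "flu"},
    {"zip": "148**", "age": "30+", "disease": "ulcer"},
]
```

Each group of indistinguishable rows corresponds to a set of possible worlds containing the original data, which is exactly the uncertain-data view the tutorial develops.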

155. Privacy in data publishing.

Paper Link】 【Pages】:1213

【Authors】: Johannes Gehrke ; Daniel Kifer ; Ashwin Machanavajjhala

【Abstract】: This tutorial gives an overview of techniques for releasing data about individuals while preserving privacy.

【Keywords】: data privacy; publishing; data publishing privacy; tutorial; Application software; Biomedical engineering; Computer science; Data mining; Data privacy; Database systems; History; Machine learning; Protection; Publishing

156. Representation, composition and application of preferences in databases.

Paper Link】 【Pages】:1214-1215

【Authors】: Georgia Koutrika ; Evaggelia Pitoura ; Kostas Stefanidis

【Abstract】: This tutorial provides an overview of the key research results in the area of user preferences from a database perspective. The objective is to survey in a systematic and holistic way a number of approaches for preference representation and composition, querying with preferences and preference learning. Open research problems are also presented.

【Keywords】: database management systems; database perspective; open research problems; preference composition; preference learning; preference querying; preference representation; user preferences; Application software; Computer science; Database systems; Decision making; Decision theory; Humans; Ice; Investments; Psychology; Query processing

157. Database as a service (DBaaS).

Paper Link】 【Pages】:1216-1217

【Authors】: Wolfgang Lehner ; Kai-Uwe Sattler

【Abstract】: Modern Web or "Eternal-Beta" applications necessitate a flexible and easy-to-use data management platform that allows the evolutionary development of databases and applications. The classical approach of relational database systems, strictly following the ACID properties, has to be extended with an extensible and easy-to-use persistency layer offering specialized DB features. Using the underlying concept of Software as a Service (SaaS) also enables an economic advantage based on the "economy of scale", where application and system environments need to be provided only once but can be used by thousands of users. Within this tutorial, we look at the current state of the art from different perspectives. We outline foundations and techniques for building database services based on the SaaS paradigm. We discuss requirements from a programming perspective, show different dimensions in the context of consistency and reliability, and describe different non-functional properties under the umbrella of Service-Level Agreements (SLAs).

【Keywords】: Web services; relational databases; Eternal-Beta applications; Web applications; data management platform; database services; database-as-a-service; persistency layer; programming perspective; relational database; service-level agreement; software-as-a-service; Application software; Context-aware services; Database systems; Environmental economics; Object oriented modeling; Relational databases; Research and development management; Satellite broadcasting; Spatial databases; Technology management

158. Techniques for efficiently searching in spatial, temporal, spatio-temporal, and multimedia databases.

Paper Link】 【Pages】:1218-1219

【Authors】: Hans-Peter Kriegel ; Peer Kröger ; Matthias Renz

【Abstract】: This tutorial provides a comprehensive and comparative overview of general techniques to efficiently support similarity queries in spatial, temporal, spatio-temporal, and multimedia databases. In particular, it identifies the most generic query types and discusses general algorithmic methods to answer such queries efficiently. In addition, the tutorial sketches important applications of the introduced methods, and presents sample implementations of the general approaches within each of the aforementioned database types. The intended audience of this tutorial ranges from novice researchers to advanced experts as well as practitioners from any application domain dealing with spatial, temporal, spatio-temporal, and/or multimedia data.

【Keywords】: multimedia databases; query processing; visual databases; multimedia database; similarity queries; spatial database; spatio-temporal database; temporal database; Bridges; Image databases; Indexing; Informatics; Multimedia databases; Query processing; Spatial databases; Spatiotemporal phenomena; Telecommunication traffic; Vocabulary