ACM SIGMOD Conference 2014: Snowbird, UT, USA

International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014. ACM 【DBLP Link】

Paper Num: 166 || Session Num: 44

Keynote 1 1

1. How I learned to stop worrying and love compilers.

【Paper Link】 【Pages】:1-2

【Authors】: Eric Sedlar

【Abstract】: The modern platforms that we want to use to manage our data are far more complex to program efficiently than the machines we used in the past. Every computer we run on is a massively parallel machine with many architectural "surprises" for programmers who are unaware of the way the underlying hardware architecture works. A simple way to think about this is that optimal programs must specify how & where the data should move, not just what computations should be performed and in what order. Architecture-oblivious software that leaves the decisions about data movement to a low-level coherence protocol is becoming much less efficient, relatively speaking. After an extended flirtation with using imperative programming frameworks such as Map-Reduce and NoSQL, many people are returning back to declarative languages like SQL, where the language compiler & runtime are free to make most of the data movement decisions for the programmer. Another way to think about a SQL compiler is that it includes an "algorithm picker" and the runtime includes libraries of useful algorithm implementations (which contain the data movement specifications). This talk will discuss the needs and opportunities for expanding the domain of algorithm-picking languages like SQL. Doing so will require integration with managed-language runtime compilers (e.g. Java or Javascript compilers) that are integrated with the SQL compiler not just to provide efficiency gains during query execution, but also to use managed language runtime profiling to help in algorithm selection as well as assembly-level compilation decisions.

【Keywords】: hardware conscious software; sql

Research session 1: transaction processing 4

2. PLANET: making progress with commit processing in unpredictable environments.

【Paper Link】 【Pages】:3-14

【Authors】: Gene Pang ; Tim Kraska ; Michael J. Franklin ; Alan Fekete

【Abstract】: Latency unpredictability in a database system can come from many factors, such as load spikes in the workload, inter-query interactions from consolidation, or communication costs in cloud computing or geo-replication. High variance and high latency environments make developing interactive applications difficult, because transactions may take too long to complete, or fail unexpectedly. We propose Predictive Latency-Aware NEtworked Transactions (PLANET), a new transaction programming model and underlying system support to address this issue. The model exposes the internal progress of the transaction, provides opportunities for application callbacks, and incorporates commit likelihood prediction to enable good user experience even in the presence of significant transaction delays. The mechanisms underlying PLANET can be used for admission control, thus improving overall performance in high contention situations. In this paper, we present this new transaction programming model, demonstrate its expressiveness via several use cases, and evaluate its performance using a strongly consistent geo-replicated database across five data centers.

【Keywords】: commit likelihood; geo-replication; speculative commits; transaction programming model
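PLANET is presented above as a programming model, not a concrete API. As a purely hypothetical sketch of what exposing transaction progress and commit likelihood through callbacks could look like, the Python below uses invented names (Txn, on_speculative_commit, estimate_commit_likelihood) and a toy likelihood formula; none of it is taken from the paper.

```python
# Hypothetical sketch of a PLANET-style transaction interface. All names and
# the toy likelihood estimate are invented; the paper defines the real model.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Txn:
    ops: List[str]
    on_accepted: Callable[[], None] = lambda: None                    # admitted by the system
    on_speculative_commit: Callable[[float], None] = lambda p: None   # likely outcome known early
    on_final_commit: Callable[[], None] = lambda: None                # durably committed

def estimate_commit_likelihood(latency_ms: float, contention: float) -> float:
    """Toy predictor: likelihood drops as latency and contention grow."""
    return round(max(0.0, 1.0 - latency_ms / 1000.0) * (1.0 - contention), 2)

def run(txn: Txn) -> None:
    txn.on_accepted()
    p = estimate_commit_likelihood(latency_ms=180, contention=0.1)
    if p > 0.7:
        txn.on_speculative_commit(p)    # the app may answer the user before the final commit
    txn.on_final_commit()               # fires once geo-replication finishes

run(Txn(ops=["UPDATE account SET ..."],
        on_accepted=lambda: print("accepted"),
        on_speculative_commit=lambda p: print(f"likely to commit (p={p})"),
        on_final_commit=lambda: print("committed")))
```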

3. Lazy evaluation of transactions in database systems.

【Paper Link】 【Pages】:15-26

【Authors】: Jose M. Faleiro ; Alexander Thomson ; Daniel J. Abadi

【Abstract】: Existing database systems employ an eager transaction processing scheme---that is, upon receiving a transaction request, the system executes all the operations entailed in running the transaction (which typically includes reading database records, executing user-specified transaction logic, and logging updates and writes) before reporting to the client that the transaction has completed. We introduce a lazy transaction execution engine, in which a transaction may be considered durably completed after only partial execution, while the bulk of its operations (notably all reads from the database and all execution of transaction logic) may be deferred until an arbitrary future time, such as when a user attempts to read some element of the transaction's write-set---all without modifying the semantics of the transaction or sacrificing ACID guarantees. Lazy transactions are processed deterministically, so that the final state of the database is guaranteed to be equivalent to what the state would have been had all transactions been executed eagerly. Our prototype of a lazy transaction execution engine improves temporal locality when executing related transactions, reduces peak provisioning requirements by deferring more non-urgent work until off-peak load times, and reduces contention footprint of concurrent transactions. However, we find that certain queries suffer increased latency, and therefore lazy database systems may not be appropriate for read-latency sensitive applications.

【Keywords】: acid transactions; deterministic database systems; load balancing
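A rough way to picture the lazy scheme described above: log a transaction immediately, defer its logic, and run it only when a read touches a key in its declared write-set. The sketch below is a deliberately simplified illustration (it ignores read dependencies between deferred transactions, durability, and concurrency, all of which the paper's deterministic engine handles); the names are invented.

```python
# Minimal sketch of the lazy-execution idea (not the paper's engine): a
# transaction is reported as completed once logged, but its logic only runs
# when a read touches a key in its declared write-set.
class LazyDB:
    def __init__(self):
        self.data = {}        # key -> value
        self.pending = []     # deferred (write_set, logic) pairs, in log order

    def submit(self, write_set, logic):
        """Log the transaction and report it as durably completed."""
        self.pending.append((set(write_set), logic))

    def read(self, key):
        # Execute, in log order, every deferred transaction whose write-set
        # could affect this key; keep the rest deferred.
        still_pending = []
        for write_set, logic in self.pending:
            if key in write_set:
                logic(self.data)
            else:
                still_pending.append((write_set, logic))
        self.pending = still_pending
        return self.data.get(key)

db = LazyDB()
db.submit({"x"}, lambda d: d.update(x=d.get("x", 0) + 10))
db.submit({"y"}, lambda d: d.update(y=42))
print(db.read("x"))      # forces the first transaction to run now -> 10
print(len(db.pending))   # the second one is still deferred -> 1
```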

4. Scalable atomic visibility with RAMP transactions.

【Paper Link】 【Pages】:27-38

【Authors】: Peter Bailis ; Alan Fekete ; Joseph M. Hellerstein ; Ali Ghodsi ; Ion Stoica

【Abstract】: Databases can provide scalability by partitioning data across several servers. However, multi-partition, multi-operation transactional access is often expensive, employing coordination-intensive locking, validation, or scheduling mechanisms. Accordingly, many real-world systems avoid mechanisms that provide useful semantics for multi-partition operations. This leads to incorrect behavior for a large class of applications including secondary indexing, foreign key enforcement, and materialized view maintenance. In this work, we identify a new isolation model---Read Atomic (RA) isolation---that matches the requirements of these use cases by ensuring atomic visibility: either all or none of each transaction's updates are observed by other transactions. We present algorithms for Read Atomic Multi-Partition (RAMP) transactions that enforce atomic visibility while offering excellent scalability, guaranteed commit despite partial failures (via synchronization independence), and minimized communication between servers (via partition independence). These RAMP transactions correctly mediate atomic visibility of updates and provide readers with snapshot access to database state by using limited multi-versioning and by allowing clients to independently resolve non-atomic reads. We demonstrate that, in contrast with existing algorithms, RAMP transactions incur limited overhead---even under high contention---and scale linearly to 100 servers.

【Keywords】: atomic visibility; atomicity; non-blocking; performance; read atomic; scalability; transactions
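The "limited multi-versioning plus client-side repair" idea can be sketched as follows. This is a single-process, heavily simplified illustration of a RAMP-style two-round read (each write carries the set of sibling keys written in the same transaction, and a reader uses that metadata to fetch matching versions in a second round); the actual protocol in the paper is distributed and differs in its details.

```python
# Simplified single-process sketch of a RAMP-style two-round read. Writes are
# installed as versions tagged with a timestamp plus the sibling keys written
# in the same transaction; readers use that metadata to repair fractured reads.
class Store:
    def __init__(self):
        self.versions = {}   # key -> {ts: value}
        self.latest = {}     # key -> ts of the latest version marked committed
        self.meta = {}       # (key, ts) -> sibling keys written at that ts

    def prepare(self, ts, writes):
        siblings = set(writes)
        for k, v in writes.items():
            self.versions.setdefault(k, {})[ts] = v
            self.meta[(k, ts)] = siblings - {k}

    def commit(self, ts, keys):
        for k in keys:
            self.latest[k] = max(self.latest.get(k, 0), ts)

    def read_atomic(self, keys):
        # Round 1: latest known version per key.
        got = {k: self.latest.get(k, 0) for k in keys}
        # Per key, the highest timestamp any sibling's metadata says wrote it.
        required = dict(got)
        for k in keys:
            for sib in self.meta.get((k, got[k]), set()):
                if sib in required:
                    required[sib] = max(required[sib], got[k])
        # Round 2: fetch the exact required version where round 1 was stale.
        return {k: self.versions.get(k, {}).get(required[k]) for k in keys}

s = Store()
s.prepare(1, {"x": "x0", "y": "y0"}); s.commit(1, ["x", "y"])
s.prepare(2, {"x": "x1", "y": "y1"})
s.commit(2, ["x"])                     # y's commit not yet visible (partial state)
print(s.read_atomic(["x", "y"]))       # {'x': 'x1', 'y': 'y1'}, not a fractured mix
```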

5. JECB: a join-extension, code-based approach to OLTP data partitioning.

【Paper Link】 【Pages】:39-50

【Authors】: Khai Q. Tran ; Jeffrey F. Naughton ; Bruhathi Sundarmurthy ; Dimitris Tsirogiannis

【Abstract】: Scaling complex transactional workloads in parallel and distributed systems is a challenging problem. When transactions span data partitions that reside in different nodes, significant overheads emerge that limit the throughput of these systems. In this paper, we present a low-overhead data partitioning approach, termed JECB, that can reduce the number of distributed transactions in complex database workloads such as TPC-E. The proposed approach analyzes the transaction source code of the given workload and the database schema to find a good partitioning solution. JECB leverages partitioning by key-foreign key relationships to automatically identify the best way to partition tables using attributes from tables. We experimentally compare our approach with the state of the art data-partitioning techniques and show that over the benchmarks considered, JECB provides better partitioning solutions with significantly less overhead.

【Keywords】: OLTP; distributed; parallel; partitioning

Research session 2: social networks 1 4

6. HYDRA: large-scale social identity linkage via heterogeneous behavior modeling.

【Paper Link】 【Pages】:51-62

【Authors】: Siyuan Liu ; Shuhui Wang ; Feida Zhu ; Jinbo Zhang ; Ramayya Krishnan

【Abstract】: We study the problem of large-scale social identity linkage across different social media platforms, which is of critical importance to business intelligence, as it allows a deeper understanding and more accurate profiling of users to be gained from social data. This paper proposes HYDRA, a solution framework which consists of three key steps: (I) modeling heterogeneous behavior by long-term behavior distribution analysis and multi-resolution temporal information matching; (II) constructing a structural consistency graph to measure the high-order structure consistency of users' core social structures across different platforms; and (III) learning the mapping function by multi-objective optimization composed of both supervised learning on pair-wise ID linkage information and cross-platform structure consistency maximization. Extensive experiments on 10 million users across seven popular social network platforms demonstrate that HYDRA correctly identifies real user linkage across different platforms, outperforming existing state-of-the-art algorithms by at least 20% under different settings, and by a factor of 4 in most settings.

【Keywords】: heterogeneous behavior model; multiple information; social networks; user linkage

7. In search of influential event organizers in online social networks.

【Paper Link】 【Pages】:63-74

【Authors】: Kaiyu Feng ; Gao Cong ; Sourav S. Bhowmick ; Shuai Ma

【Abstract】: Recently, with the emergence of event-based online social services (e.g. Meetup), there have been increasing online activities to create, distribute, and organize social events. In this paper, we take the first systematic step to discover influential event organizers from online social networks who are essential to the overall success of social events. Informally, such event organizers comprise a small group of people who not only have the relevant skills or expertise that are required for an event (e.g. conference) but are also able to influence the largest number of people to actively contribute to it. We formulate this as the problem of mining an influential cover set (ICS), where we wish to find k users in a social network G that together have the required skills or expertise (modeled as attributes of nodes in G) to organize an event such that they can influence the greatest number of individuals to participate in the event. The problem is, however, NP-hard. Hence, we propose three algorithms to find approximate solutions to the problem. The first two algorithms are greedy; they run faster, but have no guarantees. The third algorithm is 2-approximate and is guaranteed to find a feasible solution if one exists. Our empirical study over several real-world networks demonstrates the superiority of our proposed solutions.

【Keywords】: event organization; event-based social network; influence maximization

8. Influence maximization: near-optimal time complexity meets practical efficiency.

【Paper Link】 【Pages】:75-86

【Authors】: Youze Tang ; Xiaokui Xiao ; Yanchen Shi

【Abstract】: Given a social network G and a constant k, the influence maximization problem asks for k nodes in G that (directly and indirectly) influence the largest number of nodes under a pre-defined diffusion model. This problem finds important applications in viral marketing, and has been extensively studied in the literature. Existing algorithms for influence maximization, however, either trade approximation guarantees for practical efficiency, or vice versa. In particular, among the algorithms that achieve constant factor approximations under the prominent independent cascade (IC) model or linear threshold (LT) model, none can handle a million-node graph without incurring prohibitive overheads. This paper presents TIM, an algorithm that aims to bridge the theory and practice in influence maximization. On the theory side, we show that TIM runs in O((k + l)(n + m) log n / ε^2) expected time and returns a (1 - 1/e - ε)-approximate solution with at least 1 - n^{-l} probability. The time complexity of TIM is near-optimal under the IC model, as it is only a log n factor larger than the Ω(m + n) lower bound established in previous work (for fixed k, l, and ε). Moreover, TIM supports the triggering model, which is a general diffusion model that includes both IC and LT as special cases. On the practice side, TIM incorporates novel heuristics that significantly improve its empirical efficiency without compromising its asymptotic performance. We experimentally evaluate TIM with the largest datasets ever tested in the literature, and show that it outperforms the state-of-the-art solutions (with approximation guarantees) by up to four orders of magnitude in terms of running time. In particular, when k = 50, ε = 0.2, and l = 1, TIM requires less than one hour on a commodity machine to process a network with 41.6 million nodes and 1.4 billion edges. This demonstrates that influence maximization algorithms can be made practical while still offering strong theoretical guarantees.

【Keywords】: algorithm; influence maximization
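TIM's machinery rests on reverse-reachable (RR) set sampling plus greedy maximum coverage; its contribution is choosing how many RR sets to sample so that the (1 - 1/e - ε) guarantee holds, and doing so efficiently. The toy sketch below shows only the underlying RR-set idea on a made-up graph with arbitrary parameters, not the paper's algorithm.

```python
# Sketch of the reverse-reachable (RR) set idea that TIM builds on: sample RR
# sets under the independent cascade model, then greedily pick the k nodes
# covering the most sets. Graph, edge probability, and sample count are toy values.
import random

def sample_rr_set(nodes, in_edges, p):
    """Random reverse-reachable set: start at a random node and walk incoming
    edges, keeping each edge with probability p (IC model)."""
    root = random.choice(nodes)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in in_edges.get(v, []):
            if u not in rr and random.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def greedy_seeds(rr_sets, k):
    seeds = set()
    covered = [False] * len(rr_sets)
    for _ in range(k):
        gain = {}
        for i, rr in enumerate(rr_sets):
            if not covered[i]:
                for u in rr:
                    gain[u] = gain.get(u, 0) + 1
        if not gain:
            break
        best = max(gain, key=gain.get)
        seeds.add(best)
        covered = [c or (best in rr) for c, rr in zip(covered, rr_sets)]
    return seeds

# Toy directed graph given by incoming edges: in_edges[v] = predecessors of v.
in_edges = {2: [1], 3: [1, 2], 4: [3], 5: [3]}
nodes = [1, 2, 3, 4, 5]
rr_sets = [sample_rr_set(nodes, in_edges, p=0.5) for _ in range(2000)]
print(greedy_seeds(rr_sets, k=2))
```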

9. Efficient location-aware influence maximization.

【Paper Link】 【Pages】:87-98

【Authors】: Guoliang Li ; Shuo Chen ; Jianhua Feng ; Kian-Lee Tan ; Wen-Syan Li

【Abstract】: Although the influence maximization problem, which selects a set of users in a social network to maximize the expected number of users influenced by the selected users (called influence spread), has been extensively studied, existing works neglected the fact that the location information can play an important role in influence maximization. Many real-world applications such as location-aware word-of-mouth marketing have location-aware requirements. In this paper we study the location-aware influence maximization problem. One big challenge in location-aware influence maximization is to develop an efficient scheme that offers wide influence spread. To address this challenge, we propose two greedy algorithms with 1-1/e approximation ratio. To meet the instant-speed requirement, we propose two efficient algorithms with ε·(1-1/e) approximation ratio for any ε ∈ (0,1]. Experimental results on real datasets show that our method achieves high performance while keeping large influence spread and significantly outperforms state-of-the-art algorithms.

【Keywords】: influence maximization; location aware; social network

Research session 3: spatial data 4

10. Density-based place clustering in geo-social networks.

【Paper Link】 【Pages】:99-110

【Authors】: Jieming Shi ; Nikos Mamoulis ; Dingming Wu ; David W. Cheung

【Abstract】: Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urban planning and marketing. Current spatial clustering models disregard information about the people who are related to the clustered places. In this paper, we show how the density-based clustering paradigm can be extended to apply on places which are visited by users of a geo-social network. Our model considers both spatial information and the social relationships between users who visit the clustered places. After formally defining the model and the distance measure it relies on, we present efficient algorithms for its implementation, based on spatial indexing. We evaluate the effectiveness of our model via a case study on real data; in addition, we design two quantitative measures, called social entropy and community score to evaluate the quality of the discovered clusters. The results show that geo-social clusters have special properties and cannot be found by applying simple spatial clustering approaches. The efficiency of our index-based implementation is also evaluated experimentally.

【Keywords】: density-based clustering; geo-social network; spatial indexing
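The paper defines its own geo-social distance and clustering model; purely as a hypothetical illustration of mixing the two signals, the snippet below blends Euclidean distance between places with the Jaccard similarity of their visitor sets (the weighting, function name, and example data are all made up, not taken from the paper).

```python
# Hypothetical geo-social distance (not the paper's measure): places that are
# spatially close AND share many visitors get a small distance.
import math

def geo_social_distance(p, q, visitors_p, visitors_q, alpha=0.5):
    spatial = math.dist(p, q)                       # Euclidean distance
    inter = len(visitors_p & visitors_q)
    union = len(visitors_p | visitors_q) or 1
    social_sim = inter / union                      # Jaccard similarity in [0, 1]
    return alpha * spatial + (1 - alpha) * (1 - social_sim)

cafe, bar = (0.0, 0.0), (0.3, 0.4)                  # 0.5 apart spatially
print(geo_social_distance(cafe, bar, {"ann", "bob"}, {"bob", "carol"}))
```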

11. Hypersphere dominance: an optimal approach.

【Paper Link】 【Pages】:111-122

【Authors】: Cheng Long ; Raymond Chi-Wing Wong ; Bin Zhang ; Min Xie

【Abstract】: Hyperspheres are commonly used for representing uncertain objects (in uncertain databases) and for indexing spatial objects (in spatial databases). An interesting operator on hyperspheres called dominance is to decide for two given hyperspheres whether one dominates (or is closer than) the other wrt a given query hypersphere. In this paper, we propose an approach called Hyperbola which is optimal in the sense that it gives neither false positives nor false negatives and runs in linear time wrt the dimensionality. To the best of our knowledge, Hyperbola is the first optimal approach for the dominance problem on hyperspheres with any dimensionality. We also study an application of the dominance problem which relies on the dominance operator as the core component. We conducted extensive experiments on both real and synthetic datasets which verified our approaches.

【Keywords】: hyperbola; hypersphere dominance; pruning

12. Efficient algorithms for optimal location queries in road networks.

【Paper Link】 【Pages】:123-134

【Authors】: Zhitong Chen ; Yubao Liu ; Raymond Chi-Wing Wong ; Jiamin Xiong ; Ganglin Mai ; Cheng Long

【Abstract】: In this paper, we study the optimal location query problem based on road networks. Specifically, we have a road network on which some clients and servers are located. Each client finds the server that is closest to her for service and her cost of getting served is equal to the (network) distance between the client and the server serving her multiplied by her weight or importance. The optimal location query problem is to find a location for setting up a new server such that the maximum cost of clients being served by the servers (including the new server) is minimized. This problem has been studied before, but the state-of-the-art is still not efficient enough. In this paper, we propose an efficient algorithm for the optimal location query problem, which is based on a novel idea of nearest location component. We also discuss three extensions of the optimal location query problem, namely the optimal multiple-location query problem, the optimal location query problem on 3D road networks, and the optimal location query problem with another objective. Extensive experiments were conducted which showed that our algorithms are faster than the state-of-the-art by at least an order of magnitude on large real benchmark datasets. For example, on our largest real datasets, the state-of-the-art ran for more than 10 hours but our algorithm ran within 3 minutes only (i.e., >200 times faster).

【Keywords】: nearest location component; optimal location query; road network

13. Robust set reconciliation.

【Paper Link】 【Pages】:135-146

【Authors】: Di Chen ; Christian Konrad ; Ke Yi ; Wei Yu ; Qin Zhang

【Abstract】: Set reconciliation is a fundamental problem in distributed databases, where two parties each holding a set of elements wish to find their difference, so as to establish data consistency. Efficient algorithms exist for this problem with communication cost proportional only to the difference of the two sets, as opposed to the cardinality of the sets themselves. However, all existing work on set reconciliation considers two elements to be the same only if they are exactly equal. We observe that, in many applications, the elements correspond to objects on which a distance function can be defined, e.g., points in the Euclidean space, and close points often actually represent the same object. During the reconciliation, the algorithm should only find the truly different elements in the two sets while tolerating small perturbations. In this paper, we propose the robust set reconciliation problem, and take a principled approach to address this issue via the earth mover's distance. We have developed a communication and time-efficient algorithm with provable guarantees on the quality of the reconciliation. This is then complemented with an essentially matching lower bound showing the optimality of the algorithm. Our experimental results on both synthetic and real data sets have demonstrated that our algorithm also performs very well in practice.

【Keywords】: communication cost; error correction under noise; near linear-time algorithm; randomization; set reconciliation

Industry session 1: real-time/complex data analytics 4

14. Storm@twitter.

【Paper Link】 【Pages】:147-156

【Authors】: Ankit Toshniwal ; Siddarth Taneja ; Amit Shukla ; Karthikeyan Ramasamy ; Jignesh M. Patel ; Sanjeev Kulkarni ; Jason Jackson ; Krishna Gade ; Maosong Fu ; Jake Donham ; Nikunj Bhagat ; Sailesh Mittal ; Dmitriy V. Ryaboy

【Abstract】: This paper describes the use of Storm at Twitter. Storm is a real-time fault-tolerant and distributed stream data processing system. Storm is currently being used to run various critical computations in Twitter at scale, and in real-time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This paper also describes how queries (aka. topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter and we also present some potential directions for future work.

【Keywords】: real-time query processing; stream data management

15. Druid: a real-time analytical data store.

【Paper Link】 【Pages】:157-168

【Authors】: Fangjin Yang ; Eric Tschetter ; Xavier Léauté ; Nelson Ray ; Gian Merlino ; Deep Ganguli

【Abstract】: Druid is an open source data store designed for real-time exploratory analytics on large data sets. The system combines a column-oriented storage layout, a distributed, shared-nothing architecture, and an advanced indexing structure to allow for the arbitrary exploration of billion-row tables with sub-second latencies. In this paper, we describe Druid's architecture, and detail how it supports fast aggregations, flexible filters, and low latency data ingestion.

【Keywords】: OLAP; analytics; column-oriented; distributed; fault-tolerant; highly available; open source; real-time

16. The next generation operational data historian for IoT based on Informix.

【Paper Link】 【Pages】:169-176

【Authors】: Sheng Huang ; Yaoliang Chen ; Xiaoyan Chen ; Kai Liu ; Xiaomin Xu ; Chen Wang ; Kevin Brown ; Inge Halilovic

【Abstract】: In the era of the Internet of Things (IoT), increasing numbers of applications face the challenge of using current data management systems to manage the massive volume of operational data generated by sensors and devices. Databases based on a time series data model, like PI Server, are developed to handle such data with operational technology (OT) characteristics (high volume, long lifecycle, and simple format). However, while achieving excellent write performance, these database systems provide limited query capabilities. In this paper, we present the next-generation Operational Data Historian (ODH) system that is based on the IBM Informix system architecture. The system combines the write advantages of existing time series databases with the ability to run complex queries in SQL. We demonstrate the high efficiency of our system for both writing and querying data with a variety of case studies in the energy and utilities and connected vehicles industries. In addition, we present the first benchmark, IoT-X, to evaluate technologies on operational data management for IoT.

【Keywords】: relational DB; time series DB

17. GenBase: a complex analytics genomics benchmark.

【Paper Link】 【Pages】:177-188

【Authors】: Rebecca Taft ; Manasi Vartak ; Nadathur Rajagopalan Satish ; Narayanan Sundaram ; Samuel Madden ; Michael Stonebraker

【Abstract】: This paper introduces a new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks (joins, filters, etc.) and complex analytics (regression, singular value decomposition, etc.). Such mixed workloads are prevalent in a number of application areas, including most science workloads and web analytics. As a specific use case, we have chosen genomics data for our benchmark and have constructed a collection of typical tasks in this domain. In addition to being representative of a mixed data management and analytics workload, this benchmark is also meant to scale to large dataset sizes and multiple nodes across a cluster. Besides presenting this benchmark, we have run it on a variety of storage systems including traditional row stores, newer column stores, Hadoop, and an array DBMS. We present performance numbers on all systems on single and multiple nodes, and show that performance differs by orders of magnitude between the various solutions. In addition, we demonstrate that most platforms have scalability issues. We also test offloading the analytics onto a coprocessor. The intent of this benchmark is to focus research interest in this area; to this end, all of our data, data generators, and scripts are available on our web site.

【Keywords】: array dbms; benchmarking; scientific dbms
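The benchmark's data generators and task definitions are on the authors' site; the toy snippet below is not GenBase itself, just an illustration of the kind of mixed workload it targets: a relational-style filter on genomics metadata followed by a linear-algebra step (least-squares regression via numpy). All data here is synthetic and made up on the spot.

```python
# Toy mixed workload in the spirit of GenBase (illustrative only, not the
# benchmark): filter "metadata" against an expression matrix, then regress.
import numpy as np

rng = np.random.default_rng(0)
genes = np.array([f"g{i}" for i in range(100)])
expr = rng.normal(size=(100, 20))                 # genes x patients matrix
disease_score = expr[:3].sum(axis=0) + rng.normal(scale=0.1, size=20)

# "Data management" step: keep only a subset of genes (a filter/join stand-in).
keep = np.isin(genes, [f"g{i}" for i in range(10)])
X = expr[keep].T                                  # patients x selected genes

# "Complex analytics" step: least-squares regression of score on expression.
coef, *_ = np.linalg.lstsq(X, disease_score, rcond=None)
print(np.round(coef[:3], 2))                      # weights near 1 for g0..g2
```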

Tutorial 1 1

18. How to stop under-utilization and love multicores.

【Paper Link】 【Pages】:189-192

【Authors】: Anastasia Ailamaki ; Erietta Liarou ; Pinar Tözün ; Danica Porobic ; Iraklis Psaroudakis

【Abstract】: Designing scalable database management systems on modern hardware has been a challenge for almost a decade. Hardware trends oblige software to overcome three major challenges against systems scalability: (1) Exploiting the abundant thread-level parallelism provided by multicores, (2) Achieving predictably efficient execution despite the variability in communication latencies among cores on multisocket multicores, and (3) Taking advantage of the aggressive micro-architectural features. In this tutorial, we shed light on the above three challenges and survey recent proposals to alleviate them. First, we present a systematic way of eliminating scalability bottlenecks based on minimizing unbounded communication and show several techniques that minimize bottlenecks in major components of database management systems. In addition, we demonstrate methods to parallelize major database operations. Then, we analyze the problems that arise from the non-uniform nature of communication latencies on modern multisockets and ways to address them for systems that already scale well on multicores. Finally, we examine the sources of under-utilization within a modern processor and present insights and techniques to better exploit the micro-architectural resources of a processor by improving cache locality at the right level of the memory hierarchy.

【Keywords】: micro-architectural utilization; multicores; numa; olap; oltp

Research session 4: streams and complex event processing 4

19. AutoPlait: automatic mining of co-evolving time sequences.

【Paper Link】 【Pages】:193-204

【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; Christos Faloutsos

【Abstract】: Given a large collection of co-evolving multiple time-series, which contains an unknown number of patterns of different durations, how can we efficiently and effectively find typical patterns and the points of variation? How can we statistically summarize all the sequences, and achieve a meaningful segmentation? In this paper we present AutoPlait, a fully automatic mining algorithm for co-evolving time sequences. Our method has the following properties: (a) effectiveness: it operates on large collections of time-series, and finds similar segment groups that agree with human intuition; (b) scalability: it is linear with the input size, and thus scales up very well; and (c) AutoPlait is parameter-free, and requires no user intervention, no prior training, and no parameter tuning. Extensive experiments on 67GB of real datasets demonstrate that AutoPlait does indeed detect meaningful patterns correctly, and it outperforms state-of-the-art competitors as regards accuracy and speed: AutoPlait achieves near-perfect, over 95% precision and recall, and it is up to 472 times faster than its competitors.

【Keywords】: automatic mining; time-series data

20. Resource-oriented approximation for frequent itemset mining from bursty data streams.

【Paper Link】 【Pages】:205-216

【Authors】: Yoshitaka Yamamoto ; Koji Iwanuma ; Shoshi Fukuda

【Abstract】: This study considers approximation techniques for frequent itemset mining from data streams (FIM-DS) under resource constraints. In FIM-DS, a challenging problem is handling a huge combinatorial number of entries (i.e., itemsets) to be generated from each streaming transaction and stored in memory. Various types of approximation methods have been proposed for FIM-DS. However, these methods require almost O(2^L) space for the maximal length L of transactions. If some transaction contains sudden and intensive bursty events for a short span, they cannot work, since memory consumption increases exponentially as L becomes larger. Thus, we present resource-oriented approximation algorithms that fix an upper bound on memory consumption to tolerate bursty transactions. The proposed algorithm requires only O(k) space for a resource-specified constant k and processes every transaction in O(kL) time. Consequently, the proposed algorithm can treat any transaction without memory overflow or fatal response delay, while the output can be guaranteed to contain no false negatives under some conditions. Moreover, any output (even a false negative one) is bounded within the approximation error, which is dynamically determined in a resource-oriented manner. From an empirical viewpoint, it is necessary to keep this error as low as possible. We tackle this problem by dynamically reducing the original stream. Through experimental results, we show that the resource-oriented approach can break the space limitation of previously proposed FIM-DS methods.

【Keywords】: approximation; bursty data stream; itemset mining
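The paper's algorithms operate on itemsets; as background for the fixed O(k)-space idea, here is the classic Space-Saving counter for single items, which never holds more than k counters regardless of how bursty the stream is. This is standard streaming machinery shown for intuition only, not the proposed method.

```python
# Space-Saving sketch for single items: at most k counters, so memory stays
# O(k) no matter how bursty the stream is. The paper extends resource-bounded
# counting of this flavor to itemsets; this is only the classic building block.
def space_saving(stream, k):
    counters = {}                        # item -> (count, overestimation error)
    for x in stream:
        if x in counters:
            c, e = counters[x]
            counters[x] = (c + 1, e)
        elif len(counters) < k:
            counters[x] = (1, 0)
        else:                            # evict the current minimum, inherit its count
            victim = min(counters, key=lambda i: counters[i][0])
            c, _ = counters.pop(victim)
            counters[x] = (c + 1, c)     # true count lies in [count - error, count]
    return counters

print(space_saving("aababcabadaaeb", k=3))   # heavy hitters 'a' and 'b' survive
```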

21. On complexity and optimization of expensive queries in complex event processing.

【Paper Link】 【Pages】:217-228

【Authors】: Haopeng Zhang ; Yanlei Diao ; Neil Immerman

【Abstract】: Pattern queries are widely used in complex event processing (CEP) systems. Existing pattern matching techniques, however, can provide only limited performance for expensive queries in real-world applications, which may involve Kleene closure patterns, flexible event selection strategies, and events with imprecise timestamps. To support these expensive queries with high performance, we begin our study by analyzing the complexity of pattern queries, with a focus on the fundamental understanding of which features make pattern queries more expressive and at the same time more computationally expensive. This analysis allows us to identify performance bottlenecks in processing those expensive queries, and provides key insights for us to develop a series of optimizations to mitigate those bottlenecks. Microbenchmark results show superior performance of our system for expensive pattern queries while most state-of-the-art systems suffer from poor performance. A thorough case study on Hadoop cluster monitoring further demonstrates the efficiency and effectiveness of our proposed techniques.

【Keywords】: complex event processing; complexity analysis; query optimization

22. Complex event analytics: online aggregation of stream sequence patterns.

【Paper Link】 【Pages】:229-240

【Authors】: Yingmei Qi ; Lei Cao ; Medhabi Ray ; Elke A. Rundensteiner

【Abstract】: Complex Event Processing (CEP) is a technology of choice for high performance analytics in time-critical decision-making applications. Yet while effective technologies for complex pattern detection on continuous event streams have been developed, the problem of scalable online aggregation of such patterns has been overlooked. Instead, aggregation is typically applied as a post processing step after CEP pattern detection, leading to an extremely ineffective solution. In this paper, we demonstrate that CEP aggregation can be pushed into the sequence construction process. Based on this insight our A-Seq strategy successfully aggregates sequence pattern online without ever constructing sequence matches. This drives down the complexity of the CEP aggregation problem from polynomial to linear. We further extend our A-Seq strategy to support the shared processing of concurrent CEP aggregation queries. The A-Seq solution is shown to achieve over four orders of magnitude performance improvement for a wide range of tested scenarios compared to the state-of-the-art solution.

【Keywords】: aggregation; complex event processing; sequence pattern
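The abstract's key point, pushing aggregation into sequence construction so matches are never materialized, is easiest to see on the simplest case: COUNT over the pattern SEQ(A, B). Every incoming B pairs with each earlier A, so a single running counter suffices. The sketch below is this textbook observation only; A-Seq itself handles general patterns, windows, and shared processing of concurrent queries.

```python
# Counting matches of SEQ(A, B) online without constructing them: each B event
# contributes as many new matches as there are A's seen so far. Linear time,
# constant state, and no materialized sequence matches.
def count_seq_ab(events):
    a_seen = 0
    matches = 0
    for e in events:
        if e == "A":
            a_seen += 1
        elif e == "B":
            matches += a_seen
    return matches

print(count_seq_ab(["A", "A", "B", "A", "B"]))   # 2 + 3 = 5 matches
```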

Research session 5: data analytics 4

23. Towards indexing functions: answering scalar product queries.

【Paper Link】 【Pages】:241-252

【Authors】: Arijit Khan ; Pouya Yanki ; Bojana Dimcheva ; Donald Kossmann

【Abstract】: We consider a broad category of analytic queries, denoted scalar product queries, which can be expressed as a scalar product between a known function over multiple database attributes and an unknown set of parameters. More specifically, given a set of d-dimensional data points, we retrieve all points x which satisfy an inequality defined by a scalar product between a known function f(x) and the query parameters. Efficiently answering such scalar product queries is essential in a wide range of applications, including evaluation of complex SQL functions, time series prediction, scientific simulation, and active learning. Although some specific subclasses of the aforementioned scalar product queries and their applications have been studied in computational geometry, machine learning, and in moving object queries, surprisingly no generalized indexing scheme has been proposed for efficiently computing scalar product queries. We present a lightweight, yet scalable, dynamic, and generalized indexing scheme, called the planar index, for answering scalar product queries in an accurate manner, which is based on the idea of indexing the function f(x) for each data point x using multiple sets of parallel hyperplanes. The planar index has loglinear indexing time and linear space complexity, and its query time ranges from logarithmic to linear in the number of data points. Based on an extensive set of experiments on several real-world and synthetic datasets, we show that the planar index is not only scalable and dynamic, but also effective in various real-world applications including intersection finding between moving objects and active learning.

【Keywords】: function indexing; moving object indexing; planar index; scalar product query

24. LINVIEW: incremental view maintenance for complex analytical queries.

【Paper Link】 【Pages】:253-264

【Authors】: Milos Nikolic ; Mohammed Elseidy ; Christoph Koch

【Abstract】: Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of LINVIEW in generating parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.

【Keywords】: compilation; incremental view maintenance; linear algebra; machine learning; spark
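To see why "very local changes spread out", consider the standard delta rule for a matrix product under updates A -> A + ΔA and B -> B + ΔB. This identity is not taken from the paper, but it is the kind of rewrite such a system propagates, and it shows that even a rank-one ΔA touches the entire product unless the delta is kept in factored form.

```latex
% Delta of a matrix product under updates A -> A + \Delta A, B -> B + \Delta B:
\Delta(AB) \;=\; (\Delta A)\,B \;+\; A\,(\Delta B) \;+\; (\Delta A)(\Delta B)
% If \Delta A = u v^{\top} is rank one, the first term becomes u\,(v^{\top}B),
% a cheap vector-times-matrix computation; this is why factored deltas help.
```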

25. Materialization optimizations for feature selection workloads.

【Paper Link】 【Pages】:265-276

【Authors】: Ce Zhang ; Arun Kumar ; Christopher Ré

【Abstract】: There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial, R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.

【Keywords】: feature selection; materialization; statistical analytics

26. The analytical bootstrap: a new method for fast error estimation in approximate query processing.

【Paper Link】 【Pages】:277-288

【Authors】: Kai Zeng ; Shi Gao ; Barzan Mozafari ; Carlo Zaniolo

【Abstract】: Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.

【Keywords】: approximate query processing; bootstrap; error estimation

Research session 6: graph and RDF data processing 4

27. TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing.

【Paper Link】 【Pages】:289-300

【Authors】: Sairam Gurajada ; Stephan Seufert ; Iris Miliaraki ; Martin Theobald

【Abstract】: We investigate a new approach to the design of distributed, shared-nothing RDF engines. Our engine, coined "TriAD", combines join-ahead pruning via a novel form of RDF graph summarization with a locality-based, horizontal partitioning of RDF triples into a grid-like, distributed index structure. The multi-threaded and distributed execution of joins in TriAD is facilitated by an asynchronous Message Passing protocol which allows us to run multiple join operators along a query plan in a fully parallel, asynchronous fashion. We believe that our architecture provides a so far unique approach to join-ahead pruning in a distributed environment, as the more classical form of sideways information passing would not permit for executing distributed joins in an asynchronous way. Our experiments over the LUBM, BTC and WSDTS benchmarks demonstrate that TriAD consistently outperforms centralized RDF engines by up to two orders of magnitude, while gaining a factor of more than three compared to the currently fastest, distributed engines. To our knowledge, we are thus able to report the so far fastest query response times for the above benchmarks using a mid-range server and regular Ethernet setup.

【Keywords】: asynchronous message passing; distributed RDF indexing & SparQL processing; join-ahead pruning; parallel join evaluation

28. Querying big graphs within bounded resources.

【Paper Link】 【Pages】:301-312

【Authors】: Wenfei Fan ; Xin Wang ; Yinghui Wu

【Abstract】: This paper studies the problem of querying graphs within bounded resources. Given a query Q, a graph G and a small ratio α, it aims to answer Q in G by accessing only a fraction G_Q of G of size |G_Q| ≤ α|G|. The need for this is evident when G is big while our available resources are limited, as indicated by α. We propose resource-bounded query answering via a dynamic scheme that reduces big G to G_Q. We investigate when we can find the exact answers Q(G) from G_Q, and if G_Q cannot accommodate enough information, how accurate the approximate answers Q(G_Q) are. To verify the effectiveness of the approach, we study two types of queries. One consists of pattern queries that have data locality, such as subgraph isomorphism and strong simulation. The other is the class of reachability queries, without data locality. We show that it is hard to get resource-bounded algorithms with 100% accuracy: NP-hard for pattern queries, and non-existent for reachability when α ≠ 1. Despite these negative results, we develop resource-bounded algorithms for answering these queries. Using real-life and synthetic data, we experimentally evaluate the performance of the algorithms. We find that they scale well for both types of queries, and our approximate answers are accurate, even 100% for small α.

【Keywords】: bounded resource; graph querying; pattern matching

29. Natural language question answering over RDF: a graph data driven approach.

【Paper Link】 【Pages】:313-324

【Authors】: Lei Zou ; Ruizhe Huang ; Haixun Wang ; Jeffrey Xu Yu ; Wenqiang He ; Dongyan Zhao

【Abstract】: RDF question/answering (Q/A) allows users to ask questions in natural languages over a knowledge base represented by RDF. To answer a natural language question, the existing work takes a two-stage approach: question understanding and query evaluation. Its focus is on question understanding to deal with the disambiguation of natural language phrases. The most common technique is joint disambiguation, which has an exponential search space. In this paper, we propose a systematic framework to answer natural language questions over an RDF repository (RDF Q/A) from a graph data-driven perspective. We propose a semantic query graph to model the query intention in the natural language question in a structural way, based on which RDF Q/A is reduced to a subgraph matching problem. More importantly, we resolve the ambiguity of natural language questions at the time when matches of the query are found. The cost of disambiguation is saved if no matches are found. We compare our method with some state-of-the-art RDF Q/A systems on the benchmark dataset. Extensive experiments confirm that our method not only improves precision but also greatly speeds up query performance.

【Keywords】: RDF; graph database; question answering

30. Scalable similarity search for SimRank.

【Paper Link】 【Pages】:325-336

【Authors】: Mitsuru Kusumoto ; Takanori Maehara ; Ken-ichi Kawarabayashi

【Abstract】: SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many applications. Although many algorithms have been proposed to compute SimRank, unfortunately none of them scales to graphs with billions of edges. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the top-k vertices v with the k highest SimRank scores s(u,v) with respect to u. We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients: (1) We first introduce a "linear" recursive formula for SimRank. This allows us to formulate the problem in a form for which we can propose a very fast algorithm. (2) We establish a Monte-Carlo based algorithm to compute a single-pair SimRank score s(u,v), which is based on the random-walk interpretation of our linear recursive formula. (3) We empirically show that the SimRank score s(u,v) decreases rapidly as the distance d(u,v) increases. Therefore, in order to compute SimRank scores for a query vertex u in our similarity search problem, we only need to look at a very "local" area. (4) We combine two upper bounds for the SimRank score s(u,v) (which can be obtained by Monte-Carlo simulation in our preprocessing), together with an adaptive sampling technique, to prune the similarity search procedure. This results in a much faster algorithm. Once our preprocessing is done (which takes only O(n) time), our algorithm finds, given a query vertex u, the top-20 similar vertices v with the 20 highest SimRank scores s(u,v) in less than a few seconds, even for graphs with billions of edges. To the best of our knowledge, this is the first algorithm that scales to graphs with billions of edges (for the single-source case).

【Keywords】: SimRank; scalable
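For intuition about "Monte-Carlo single-pair SimRank", the sketch below uses the classic random-walk view of SimRank (two reverse walks stepping in lockstep over in-links; each trial contributes c raised to the first meeting step). The paper's estimator is built on its own linear recursion instead, so treat this only as background; the graph and parameters are toy values.

```python
# Monte-Carlo single-pair SimRank via the classic random-walk view: two reverse
# random walks from u and v step in lockstep over in-links, and each trial
# contributes c^(first meeting step). Illustrative only, not the paper's method.
import random

def simrank_mc(u, v, in_links, c=0.6, trials=20000, max_steps=20):
    total = 0.0
    for _ in range(trials):
        x, y = u, v
        for step in range(1, max_steps + 1):
            if not in_links.get(x) or not in_links.get(y):
                break                          # a walk has nowhere to go
            x = random.choice(in_links[x])
            y = random.choice(in_links[y])
            if x == y:
                total += c ** step
                break
    return total / trials

in_links = {"a": ["c"], "b": ["c"], "c": []}     # both a and b are linked from c
print(round(simrank_mc("a", "b", in_links), 2))  # close to c = 0.6
```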

Industry session 2: query optimization 4

31. Orca: a modular query optimizer architecture for big data.

【Paper Link】 【Pages】:337-348

【Authors】: Mohamed A. Soliman ; Lyublena Antova ; Venkatesh Raghavan ; Amr El-Helw ; Zhongxian Gu ; Entong Shen ; George C. Caragea ; Carlos Garcia-Alvarado ; Foyzur Rahman ; Michalis Petropoulos ; Florian Waas ; Sivaramakrishnan Narayanan ; Konstantinos Krikellas ; Rhonda Baldwin

【Abstract】: The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer. In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture. In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.

【Keywords】: cost model; mpp; parallel processing; query optimization

32. Parallel I/O aware query optimization.

【Paper Link】 【Pages】:349-360

【Authors】: Pedram Ghodsnia ; Ivan T. Bowman ; Anisoara Nica

【Abstract】: New trends in storage industry suggest that in the near future a majority of the hard disk drive-based storage subsystems will be replaced by solid state drives (SSDs). Database management systems can substantially benefit from the superior I/O performance of SSDs. Although the impact of using SSD in query processing has been studied in the past, exploiting the I/O parallelism of SSDs in query processing and optimization has not received enough attention. In this paper, at first, we show why the query optimizer needs to be aware of the benefit of the I/O parallelism in solid state drives. We characterize the benefit of exploiting I/O parallelism in database scan operators in SAP SQL Anywhere and propose a novel general I/O cost model that considers the impact of device I/O queue depth in I/O cost estimation. We show that using this model, the best plans found by the optimizer would be much closer to optimal. The proposed model is implemented in SAP SQL Anywhere. This model, dynamically defined by a calibration process, summarizes the behavior of the I/O subsystem, without having any prior knowledge about the type and the number of devices which are used in the storage subsystem.

【Keywords】: I/O cost model; SSD; access path; full table scan; index scan; parallel I/O; prefetching; query optimization

33. Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA.

【Paper Link】 【Pages】:361-372

【Authors】: Guido Moerkotte ; David DeHaan ; Norman May ; Anisoara Nica ; Alexander Böhm

【Abstract】: Histograms that guarantee a maximum multiplicative error (q-error) for estimates may significantly improve the plan quality of query optimizers. However, the construction time for histograms with a maximum q-error was too high for practical use cases. In this paper we extend this concept with a threshold, i.e., an estimate or true cardinality θ, below which we do not care about the q-error because we still expect optimal plans. This allows us to develop far more efficient construction algorithms for histograms with bounded error. The test for θ,q-acceptability that we develop also exploits the order-preserving dictionary encoding of SAP HANA. We have integrated this family of histograms into SAP HANA, and we report on the construction time, histogram sizes, and estimation errors on real-world data sets. In virtually all cases the histograms can be constructed in far less than one second, requiring less than 5% of the space of the original compressed data.

【Keywords】: cardinality estimation; histogram
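For readers unfamiliar with the metric, the q-error of a cardinality estimate is the usual multiplicative error shown below; the θ threshold from the abstract simply exempts small cardinalities from the guarantee. This restates the abstract in symbols; the precise θ,q-acceptability definition should be taken from the paper itself.

```latex
% q-error of an estimate \hat{f} for a true cardinality f (both positive):
\operatorname{q-error}(\hat{f}, f) \;=\; \max\!\left(\frac{\hat{f}}{f},\; \frac{f}{\hat{f}}\right) \;\ge\; 1
% Intuition for the threshold: if both f and \hat{f} are below \theta, the
% estimate is treated as acceptable regardless of its q-error, since such
% small cardinalities are not expected to change the chosen plan.
```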

34. Optimizing queries over partitioned tables in MPP systems.

【Paper Link】 【Pages】:373-384

【Authors】: Lyublena Antova ; Amr El-Helw ; Mohamed A. Soliman ; Zhongxian Gu ; Michalis Petropoulos ; Florian Waas

【Abstract】: Partitioning of tables based on value ranges provides a powerful mechanism to organize tables in database systems. In the context of data warehousing and large-scale data analysis, partitioned tables are of particular interest as the nature of queries favors scanning large swaths of data. In this scenario, eliminating from a query plan those partitions that contain data not relevant to answering a given query can represent substantial performance improvements. Dealing with partitioned tables in query optimization has attracted significant attention recently; yet a number of challenges unique to Massively Parallel Processing (MPP) databases and their distributed nature remain unresolved. In this paper, we present optimization techniques for queries over partitioned tables as implemented in Pivotal Greenplum Database. We present a concise and unified representation for partitioned tables and devise optimization techniques to generate query plans that can defer decisions on accessing certain partitions to query run-time. We demonstrate that the resulting query plans distinctly outperform conventional query plans in a variety of scenarios.

【Keywords】: mpp systems; partitioning; query optimization

Research session 7: multidimensional data 4

35. Parallel data analysis directly on scientific file formats.

【Paper Link】 【Pages】:385-396

【Authors】: Spyros Blanas ; Kesheng Wu ; Surendra Byna ; Bin Dong ; Arie Shoshani

【Abstract】: Scientific experiments and large-scale simulations produce massive amounts of data. Many of these scientific datasets are arrays, and are stored in file formats such as HDF5 and NetCDF. Although scientific data management systems, such as SciDB, are designed to manipulate arrays, there are challenges in integrating these systems into existing analysis workflows. Major barriers include the expensive task of preparing and loading data before querying, and converting the final results to a format that is understood by the existing post-processing and visualization tools. As a consequence, integrating a data management system into an existing scientific data analysis workflow is time-consuming and requires extensive user involvement. In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. This design choice eliminates the tedious and error-prone data loading process, and makes the query results readily available to the next processing steps of the analysis workflow. Our design leverages the increasing main memory capacities found in supercomputers through bitmap indexing and in-memory query execution. In addition, query processing over the HDF5 data format can be effortlessly parallelized to utilize the ample concurrency available in large-scale supercomputers and modern parallel file systems. We evaluate the performance of our system on a large supercomputing system and experiment with both a synthetic dataset and a real cosmology observation dataset. Our system frequently outperforms the relational database system that the cosmology team currently uses, and is more than 10X faster than Hive when processing data in parallel. Overall, by eliminating the data loading step, our query processing system is more effective in supporting in situ scientific analysis workflows.

【Keywords】: HDF5; in situ; parallel processing
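The system described above adds a query layer, bitmap indexing, and parallel execution on top of HDF5; the snippet below only shows the baseline it builds on: reading and filtering an HDF5 array in place with h5py and numpy instead of loading it into a DBMS first. The file name, dataset name, and threshold are made up, and the demo file is generated so the example is self-contained.

```python
# Filtering an HDF5 array directly, with no separate load step (h5py + numpy).
import h5py
import numpy as np

# Create a small demo file so the example is self-contained.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("particles", data=np.random.normal(scale=500.0, size=(10000, 3)))

# "Query" directly over the HDF5 data: read a slice, filter with a predicate.
with h5py.File("demo.h5", "r") as f:
    chunk = f["particles"][:]          # for big data, read a sub-slice instead
    fast = chunk[np.linalg.norm(chunk, axis=1) > 1000.0]
    print(len(fast), "rows pass the filter without any DBMS load step")
```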

36. The PH-tree: a space-efficient storage structure and multi-dimensional index.

【Paper Link】 【Pages】:397-408

【Authors】: Tilmann Zäschke ; Christoph Zimmerli ; Moira C. Norrie

【Abstract】: We propose the PATRICIA-hypercube-tree, or PH-tree, a multi-dimensional data storage and indexing structure. It is based on binary PATRICIA-tries combined with hypercubes for efficient data access. Space efficiency is achieved by combining prefix sharing with a space-optimised implementation. This leads to storage space requirements that are comparable to, or below, those of storing the same data in non-index structures such as arrays of objects. The storage structure also serves as a multi-dimensional index on all dimensions of the stored data. This enables efficient access to stored data via point and range queries. We explain the concept of the PH-tree and demonstrate the performance of a sample implementation on various datasets and compare it to other spatial indices such as the kD-tree. The experiments show that for larger datasets beyond 10^7 entries, the PH-tree increasingly and consistently outperforms other structures in terms of space efficiency, query performance and update performance.

【Keywords】: hypercube; multi-dimensional index; patricia-trie; quadtree; skewed data; space efficiency; spatial index
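
A rough illustration of the hypercube-addressing idea behind the PH-tree, not the authors' implementation: at each node, one bit is taken from every dimension of the integer key and the bits are concatenated into a quadrant number that selects the child entry. The function name and bit width below are hypothetical; a minimal sketch in Python:

def hc_address(key, depth, width=32):
    # d-bit hypercube address of a d-dimensional integer key at a node of the
    # given depth: take bit (width - 1 - depth) of every dimension and
    # concatenate the bits into a child index in [0, 2^d).
    bit = width - 1 - depth
    addr = 0
    for k in key:
        addr = (addr << 1) | ((k >> bit) & 1)
    return addr

# Example: a 3-dimensional key of 8-bit coordinates at the root (depth 0).
print(hc_address((200, 10, 130), depth=0, width=8))   # -> 0b101 = 5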

37. Incremental elasticity for array databases.

Paper Link】 【Pages】:409-420

【Authors】: Jennie Duggan ; Michael Stonebraker

【Abstract】: Relational databases benefit significantly from elasticity, whereby they execute on a set of changing hardware resources provisioned to match their storage and processing requirements. Such flexibility is especially attractive for scientific databases because their users often have a no-overwrite storage model, in which they delete data only when their available space is exhausted. This results in a database that is regularly growing and expanding its hardware proportionally. Also, scientific databases frequently store their data as multidimensional arrays optimized for spatial querying. This brings about several novel challenges in clustered, skew-aware data placement on an elastic shared-nothing database. In this work, we design and implement elasticity for an array database. We address this challenge on two fronts: determining when to expand a database cluster and how to partition the data within it. In both steps we propose incremental approaches, affecting a minimum set of data and nodes, while maintaining high performance. We introduce an algorithm for gradually augmenting an array database's hardware using a closed-loop control system. After the cluster adds nodes, we optimize data placement for n-dimensional arrays. Many of our elastic partitioners incrementally reorganize an array, redistributing data only to new nodes. By combining these two tools, the scientific database efficiently and seamlessly manages its monotonically increasing hardware resources.

【Keywords】: data placement; elasticity

38. Efficient summarization framework for multi-attribute uncertain data.

Paper Link】 【Pages】:421-432

【Authors】: Jie Xu ; Dmitri V. Kalashnikov ; Sharad Mehrotra

【Abstract】: This paper studies the problem of automatically selecting a small subset of representatives from a set of objects, where the objects (a) are multi-attributed, with each attribute corresponding to a different aspect of the object, and (b) are associated with uncertainty -- a problem that has received little attention in the past. Such an object set leads to new challenges in modeling the information contained in the data, defining appropriate criteria for selecting objects, and devising efficient algorithms for such a selection. We propose a framework that models objects as sets of corresponding information units and reduces the summarization problem to that of optimizing probabilistic coverage. To solve the resulting NP-hard problem, we develop a highly efficient greedy algorithm, which gains its efficiency by leveraging object-level and iteration-level optimizations. A comprehensive empirical evaluation over three real datasets demonstrates that the proposed framework significantly outperforms baseline techniques in terms of quality and also scales very well with the size of the dataset.

【Keywords】: multi-attributes; summarization; uncertain data

Research session 8: data cleaning 4

39. Fusing data with correlations.

Paper Link】 【Pages】:433-444

【Authors】: Ravali Pochampally ; Anish Das Sarma ; Xin Luna Dong ; Alexandra Meliou ; Divesh Srivastava

【Abstract】: Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data from erroneous data to create a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority, or by at least a certain number of sources, may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.

【Keywords】: correlated sources; data fusion; integration

40. Descriptive and prescriptive data cleaning.

Paper Link】 【Pages】:445-456

【Authors】: Anup Chalamalla ; Ihab F. Ilyas ; Mourad Ouzzani ; Paolo Papotti

【Abstract】: Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.

【Keywords】: data cleaning; data integration; data quality; explanations; provenance

41. Towards dependable data repairing with fixing rules.

Paper Link】 【Pages】:457-468

【Authors】: Jiannan Wang ; Nan Tang

【Abstract】:

【Keywords】: data cleaning rules; data repairing; dependable

42. A sample-and-clean framework for fast and accurate query processing on dirty data.

Paper Link】 【Pages】:469-480

【Authors】: Jiannan Wang ; Sanjay Krishnan ; Michael J. Franklin ; Ken Goldberg ; Tim Kraska ; Tova Milo

【Abstract】: In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity. That is, we explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning and speed compared to data cleaning without sampling.

【Keywords】: aggregate query; data cleaning; dirty data; sampling
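
A minimal sketch of the general sample-and-clean idea, not the paper's exact estimators (which also come with confidence intervals): clean only a small random sample, estimate the average error on that sample, and use it to correct an aggregate computed over the dirty data. All names and the toy cleaning function below are hypothetical.

import random

def corrected_mean(dirty_values, clean_fn, sample_size=100, seed=0):
    # Aggregate over the full dirty data, then subtract the bias estimated
    # from a small cleaned sample.
    random.seed(seed)
    dirty_mean = sum(dirty_values) / len(dirty_values)
    idx = random.sample(range(len(dirty_values)), min(sample_size, len(dirty_values)))
    bias = sum(dirty_values[i] - clean_fn(dirty_values[i]) for i in idx) / len(idx)
    return dirty_mean - bias

# Usage: suppose 10% of the readings were doubled by a unit-conversion bug.
readings = [10.0] * 900 + [20.0] * 100
fix = lambda v: v / 2 if v > 15 else v            # the (expensive) cleaning step
print(corrected_mean(readings, fix))              # close to the true mean of 10.0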

Research session 9: data exploration 4

43. Knowing when you're wrong: building fast and reliable approximate query processing systems.

Paper Link】 【Pages】:481-492

【Authors】: Sameer Agarwal ; Henry Milner ; Ariel Kleiner ; Ameet Talwalkar ; Michael I. Jordan ; Samuel Madden ; Barzan Mozafari ; Ion Stoica

【Abstract】: Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.

【Keywords】: approximate query processing; diagnostics; error estimation
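
For context, the error bars discussed above are commonly derived with a bootstrap over the sample; the sketch below shows that standard construction (a percentile interval around the sample aggregate), not the diagnostics this paper proposes.

import random

def bootstrap_error_bar(sample, agg=lambda xs: sum(xs) / len(xs),
                        resamples=1000, alpha=0.05, seed=0):
    # Resample the sample with replacement, recompute the aggregate each time,
    # and report a (1 - alpha) percentile interval as the "error bar".
    random.seed(seed)
    n = len(sample)
    estimates = sorted(agg([random.choice(sample) for _ in range(n)])
                       for _ in range(resamples))
    lo = estimates[int((alpha / 2) * resamples)]
    hi = estimates[int((1 - alpha / 2) * resamples) - 1]
    return agg(sample), (lo, hi)

point, (low, high) = bootstrap_error_bar([random.gauss(100, 15) for _ in range(500)])
print(point, low, high)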

44. Discovering queries based on example tuples.

Paper Link】 【Pages】:493-504

【Authors】: Yanyan Shen ; Kaushik Chakrabarti ; Surajit Chaudhuri ; Bolin Ding ; Lev Novik

【Abstract】: An enterprise information worker is often aware of a few example tuples (but not the entire result) that should be present in the output of the query. We study the problem of discovering the minimal project join query that contains the given example tuples in its output. Efficient discovery of such queries is challenging. We propose novel algorithms to solve this problem. Our experiments on real-life datasets show that the proposed solution is significantly more efficient compared with naïve adaptations of known techniques.

【Keywords】: SQL query discovery; example tuples; filter selection; project join query; pruning

45. Interactive data exploration using semantic windows.

Paper Link】 【Pages】:505-516

【Authors】: Alexander Kalinin ; Ugur Çetintemel ; Stanley B. Zdonik

【Abstract】: We present a new interactive data exploration approach, called Semantic Windows (SW), in which users query for multidimensional "windows" of interest via standard DBMS-style queries enhanced with exploration constructs. Users can specify SWs using (i) shape-based properties, e.g., "identify all 3-by-3 windows", as well as (ii) content-based properties, e.g., "identify all windows in which the average brightness of stars exceeds 0.8". This SW approach enables the interactive processing of a host of useful exploratory queries that are difficult to express and optimize using standard DBMS techniques. SW uses a sampling-guided, data-driven search strategy to explore the underlying data set and quickly identify windows of interest. To facilitate human-in-the-loop style interactive processing, SW is optimized to produce online results during query execution. To control the tension between online performance and query completion time, it uses a tunable, adaptive prefetching technique. To enable exploration of big data, the framework supports distributed computation. We describe the semantics and implementation of SW as a distributed layer on top of PostgreSQL. The experimental results with real astronomical and artificial data reveal that SW can offer online results quickly and continuously with little or no degradation in query completion times.

【Keywords】: data exploration; query processing

46. Explore-by-example: an automatic query steering framework for interactive data exploration.

Paper Link】 【Pages】:517-528

【Authors】: Kyriaki Dimitriadou ; Olga Papaemmanouil ; Yanlei Diao

【Abstract】: Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predicates, aiming to balance the trade-off between collecting all relevant information and reducing the size of returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their navigation in the data to find interesting objects. In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting data areas and predicts a query that retrieves their objects of interest. Our approach leverages relevance feedback on database samples to model user interests and strategically collects more samples to refine the model while minimizing the user effort. AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interests with high accuracy) as well as high interactive performance. It delivers highly accurate query predictions for very common conjunctive queries with very small user effort while, given a reasonable number of samples, it can predict complex conjunctive queries with high accuracy. Furthermore, it provides interactive performance by limiting the user wait time per iteration to less than a few seconds on average. Our user study indicates that AIDE is a practical exploration framework as it significantly reduces the user effort and the total exploration time compared with the current state-of-the-art approach of manual exploration.

【Keywords】: data exploration; database sampling; query formulation

Industry session 3: storage management 4

47. Durable write cache in flash memory SSD for relational and NoSQL databases.

Paper Link】 【Pages】:529-540

【Authors】: Woon-Hak Kang ; Sang-Won Lee ; Bongki Moon ; Yang-Suk Kee ; Moonwook Oh

【Abstract】: In order to meet the stringent requirements of low latency as well as high throughput, web service providers with large data centers have been replacing magnetic disk drives with flash memory solid-state drives (SSDs). They commonly use relational and NoSQL database engines to manage OLTP workloads in the warehouse-scale computing environments. These modern database engines rely heavily on redundant writes and frequent cache flushes to guarantee the atomicity and durability of transactional updates. This has become a serious bottleneck of performance in both relational and NoSQL database engines. This paper presents a new SSD prototype called DuraSSD equipped with tantalum capacitors. The tantalum capacitors make the device cache inside DuraSSD durable, and additional firmware features of DuraSSD take advantage of the durable cache to support the atomicity and durability of page writes. It is the first time that a flash memory SSD with durable cache has been used to achieve an order of magnitude improvement in transaction throughput without compromising the atomicity and durability. Considering that the simple capacitors increase the total cost of an SSD no more than one percent, DuraSSD clearly provides a cost-effective means for transactional support. DuraSSD is also expected to alleviate the problem of high tail latency by minimizing write stalls.

【Keywords】: atomicity; durability; durable cache; ssd

48. Fast database restarts at facebook.

Paper Link】 【Pages】:541-549

【Authors】: Aakash Goel ; Bhuwan Chopra ; Ciprian Gerea ; Dhruv Mátáni ; Josh Metzler ; Fahim Ul Haq ; Janet L. Wiener

【Abstract】: Facebook engineers query multiple databases to monitor and analyze Facebook products and services. The fastest of these databases is Scuba, which achieves subsecond query response time by storing all of its data in memory across hundreds of servers. We are continually improving the code for Scuba and would like to push new software releases at least once a week. However, restarting a Scuba machine clears its memory. Recovering all of its data from disk --- about 120 GB per machine --- takes 2.5-3 hours to read and format the data per machine. Even 10 minutes is a long downtime for the critical applications that rely on Scuba, such as detecting user-facing errors. Restarting only 2% of the servers at a time mitigates the amount of unavailable data, but prolongs the restart duration to about 12 hours, during which users see only partial query results and one engineer needs to monitor the servers carefully. We need a faster, less engineer-intensive solution to enable frequent software upgrades. In this paper, we show that using shared memory provides a simple, effective, and fast solution to upgrading servers. Our key observation is that we can decouple the memory lifetime from the process lifetime. When we shut down a server for a planned upgrade, we know that the memory state is valid (unlike when a server shuts down unexpectedly). We can therefore use shared memory to preserve memory state from the old server process to the new process. Our solution does not increase the server memory footprint and allows recovery at memory speeds, about 2-3 minutes per server. This solution maximizes uptime and availability, which has led to much faster and more frequent rollouts of new features and improvements. Furthermore, this technique can be applied to the in-memory state of any database, even if the memory contains a cache of a much larger disk-resident data set, as in most databases.

【Keywords】: database; recovery; restart; rollover; shared memory
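
A toy, Linux-specific sketch of decoupling memory lifetime from process lifetime: state placed in a tmpfs-backed shared-memory file by the old server process can be mapped again by the upgraded process, avoiding a reload from disk. The path and sizes are made up; Scuba's actual layout and rollover protocol are far more involved.

import mmap

STATE_PATH = "/dev/shm/demo_table_state"   # hypothetical name on Linux tmpfs
SIZE = 1 << 20                             # fixed 1 MiB region for the sketch

def save_state(data: bytes):
    # Old process: copy the in-memory state into shared memory before a
    # planned shutdown, instead of re-reading it from disk after the upgrade.
    with open(STATE_PATH, "wb") as f:
        f.truncate(SIZE)
    with open(STATE_PATH, "r+b") as f, mmap.mmap(f.fileno(), SIZE) as m:
        m[:len(data)] = data

def load_state(length: int) -> bytes:
    # New process: attach to the same region and reuse the bytes directly,
    # recovering at memory speed.
    with open(STATE_PATH, "r+b") as f, mmap.mmap(f.fileno(), SIZE) as m:
        return bytes(m[:length])

save_state(b"rows serialized by the old process")
print(load_state(34))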

49. SpongeFiles: mitigating data skew in mapreduce using distributed memory.

Paper Link】 【Pages】:551-562

【Authors】: Khaled Elmeleegy ; Christopher Olston ; Benjamin Reed

【Abstract】: Data skew is a major problem for data processing platforms like MapReduce. Skew causes worker tasks to spill to disk what they cannot fit in memory, which slows down the task and the overall job. Moreover, the performance of other jobs sharing the same disk degrades. In many cases, this situation occurs even when the cluster has plenty of spare memory; it is just not used evenly. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. A SpongeFile is a logical byte array, composed of large chunks that can be stored in a variety of locations in the cluster. Spilled data goes to SpongeFiles, which route it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort). By enabling memory-sapped nodes to tap into the spare capacity of their neighbors, SpongeFiles minimize expensive disk spilling, thereby improving performance. In our experiments with Hadoop and Pig, SpongeFiles reduce overall job runtimes by up to 55%, and by up to 85% under disk contention.

【Keywords】: handling data skew
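
A toy placement policy in the spirit of the spill routing described above: try local memory first, then remote memory, then local disk, and remote disk only as a last resort. The data structures and names are hypothetical.

def choose_spill_target(chunk_size, locations):
    # Pick the nearest location that still has capacity for this chunk and
    # reserve the space there.
    preference = ["local_memory", "remote_memory", "local_disk", "remote_disk"]
    for kind in preference:
        for loc in locations:
            if loc["kind"] == kind and loc["free"] >= chunk_size:
                loc["free"] -= chunk_size
                return loc["name"]
    raise RuntimeError("no spill capacity left anywhere in the cluster")

cluster = [
    {"name": "self:mem",  "kind": "local_memory",  "free": 0},        # memory-sapped node
    {"name": "node7:mem", "kind": "remote_memory", "free": 4 << 30},
    {"name": "self:disk", "kind": "local_disk",    "free": 100 << 30},
]
print(choose_spill_target(64 << 20, cluster))   # -> "node7:mem"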

50. Leveraging compression in the tableau data engine.

Paper Link】 【Pages】:563-573

【Authors】: Richard Michael Grantham Wesley ; Pawel Terlecki

【Abstract】: Data sets are growing rapidly and there is an attendant need for tools that facilitate human analysis of them in a timely manner. To help meet this need, column-oriented databases (or "column stores") have come into wide use because of their low latency on analytic workloads. Column stores use a number of techniques to produce these dramatic performance gains, including the ability to perform operations directly on compressed data. In this paper, we describe how the Tableau Data Engine (an internally developed column store) leverages a number of compression techniques to improve query performance. The approach is simpler than existing systems for operating on compressed data and more unified, removing the necessity for custom data access mechanisms. The approach also uses some novel metadata extraction techniques to improve the choices made by the system's run-time optimizer.

【Keywords】: column stores; compression; visualization

Keynote 2 1

51. Fun with hardware transactional memory.

Paper Link】 【Pages】:575

【Authors】: Maurice Herlihy

【Abstract】: Leading hardware vendors such as Intel and IBM are releasing a new generation of processor architectures that provide hardware transactional memory (HTM), a synchronization mechanism for fast in-memory transactions. This talk will argue that HTM is not just a faster way of doing the same old latches and monitors. Instead, it could bring about a fundamental positive change in the way we program multicores (and eventually perhaps even databases) by allowing a fundamental rethinking of basic synchronization structures such as locks, memory management, and a range of concurrent data structures.

【Keywords】: transactional memory

Research session 10: crowdsourcing 3

52. CrowdFill: collecting structured data from the crowd.

Paper Link】 【Pages】:577-588

【Authors】: Hyunjung Park ; Jennifer Widom

【Abstract】: We present CrowdFill, a system for collecting structured data from the crowd. While a typical microtask-based approach would pose specific questions to each worker and assemble the answers, CrowdFill shows a partially-filled table to all participating workers. Workers contribute by filling in empty cells, as well as upvoting and downvoting data entered by other workers. The system's synchronization scheme, based on a careful model of primitive operations, enables workers to collaboratively complete the table without latency overhead. CrowdFill allows the specification of constraints on the collected data, and has mechanisms for resolving inconsistencies. Its compensation scheme takes into account each worker's contribution to the final table, and the varying difficulty of data entry tasks. The paper includes some preliminary experimental results.

【Keywords】: crowdsourcing; data collection; distributed tasks

53. OASSIS: query driven crowd mining.

Paper Link】 【Pages】:589-600

【Authors】: Yael Amsterdamer ; Susan B. Davidson ; Tova Milo ; Slava Novgorodov ; Amit Somech

【Abstract】: Crowd data sourcing is increasingly used to gather information from the crowd and to obtain recommendations. In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. Our approach is based on (1) a simple generic model that captures both ontological knowledge as well as the individual history or habits of crowd members from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) an efficient query evaluation algorithm, which enables mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.

【Keywords】: crowd mining; ontologies; query languages

54. Corleone: hands-off crowdsourcing for entity matching.

Paper Link】 【Pages】:601-612

【Authors】: Chaitanya Gokhale ; Sanjib Das ; AnHai Doan ; Jeffrey F. Naughton ; Narasimhan Rampalli ; Jude W. Shavlik ; Xiaojin Zhu

【Abstract】: Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing EM need at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work to executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.

【Keywords】: active learning; crowdsourcing; entity matching

Research session 11: parallel graph processing 3

55. Efficient cohesive subgraphs detection in parallel.

Paper Link】 【Pages】:613-624

【Authors】: Yingxia Shao ; Lei Chen ; Bin Cui

【Abstract】: A cohesive subgraph is a primary vehicle for massive graph analysis, and a newly introduced cohesive subgraph, k-truss, which is motivated by a natural observation of social cohesion, has attracted more and more attention. However, the existing parallel solutions to identify the k-truss are inefficient for very large graphs, as they still suffer from huge communication cost and a large number of iterations during the computation. In this paper, we propose a novel parallel and efficient truss detection algorithm, called PeTa. PeTa produces a triangle complete subgraph (TC-subgraph) for every computing node. Based on the TC-subgraphs, PeTa can detect the local k-truss in parallel within a few iterations. We theoretically prove that, within this new paradigm, the communication cost of PeTa is bounded by three times the number of triangles, the total computation complexity of PeTa is of the same order as the best known serial algorithm, and the number of iterations for a given partition scheme is minimized as well. Furthermore, we present a subgraph-oriented model to efficiently express PeTa in parallel graph computing systems. The results of comprehensive experiments demonstrate that, compared with the existing solutions, PeTa saves 2X to 19X in communication cost, reduces the number of iterations by 80% to 95%, and improves the overall performance by 80% across various real-world graphs.

【Keywords】: cohesive subgraph; graph algorithm; subgraph-oriented; truss
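
For reference, a simple serial k-truss computation, not the parallel PeTa algorithm: repeatedly drop every edge that participates in fewer than k-2 triangles until no such edge remains. A minimal sketch in Python, assuming integer vertex ids:

from collections import defaultdict

def k_truss(edges, k):
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if u < v:
                    # common neighbours = triangles supporting edge (u, v)
                    if len(adj[u] & adj[v]) < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return [(u, v) for u in adj for v in adj[u] if u < v]

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]   # 4-clique plus a pendant edge
print(sorted(k_truss(edges, k=4)))   # the pendant edge (3, 4) is dropped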

56. Parallel subgraph listing in a large-scale graph.

Paper Link】 【Pages】:625-636

【Authors】: Yingxia Shao ; Bin Cui ; Lei Chen ; Lin Ma ; Junjie Yao ; Ning Xu

【Abstract】: Subgraph listing is a fundamental operation for many graph and network analyses. The problem itself is computationally expensive and is well-studied for centralized processing algorithms. However, the centralized solutions cannot scale well to large graphs. Recently, several parallel approaches have been introduced to handle large graphs. Unfortunately, these parallel approaches still rely on expensive join operations, and thus cannot achieve high performance. In this paper, we design a novel parallel subgraph listing framework, named PSgL. PSgL iteratively enumerates subgraph instances and solves the subgraph listing problem in a divide-and-conquer fashion. The framework relies completely on graph traversal and avoids explicit join operations. Moreover, in order to improve its performance, we propose several solutions to balance the workload and reduce the size of intermediate results. Specifically, we prove that the problem of partial subgraph instance distribution for workload balance is NP-hard, and carefully design a set of heuristic strategies. To further reduce the enormous intermediate results, we introduce three independent mechanisms: automorphism breaking of the pattern graph, initial pattern vertex selection based on a cost model, and a pruning method based on a light-weight index. We have implemented a prototype of PSgL and run comprehensive experiments of various graph listing operations on diverse large graphs. The experiments clearly demonstrate that PSgL is robust and can achieve a performance gain over the state-of-the-art solutions of up to 90%.

【Keywords】: graph; graph algorithm; graph enumeration; subgraph listing

57. OPT: a new framework for overlapped and parallel triangulation in large-scale graphs.

Paper Link】 【Pages】:637-648

【Authors】: Jinha Kim ; Wook-Shin Han ; Sangyeon Lee ; Kyungyeol Park ; Hwanjo Yu

【Abstract】: Graph triangulation, which finds all triangles in a graph, has been actively studied due to its wide range of applications in network analysis and data mining. With the rapid growth of graph data sizes, disk-based triangulation methods are in demand but little researched. To handle a large-scale graph which does not fit in memory, we must iteratively load small parts of the graph. In the existing literature, achieving the ideal cost has been considered to be impossible for billion-scale graphs due to the memory size constraint. In this paper, we propose an overlapped and parallel disk-based triangulation framework for billion-scale graphs, OPT, which achieves the ideal cost by (1) full overlap of the CPU and I/O operations and (2) full parallelism of multi-core CPU and FlashSSD I/O. In OPT, triangles whose vertices are all in memory are called internal triangles, while triangles involving both vertices in memory and vertices in external memory are called external triangles. At the macro level, OPT overlaps the internal triangulation and the external triangulation, while it overlaps the CPU and I/O operations at the micro level. Thereby, the cost of OPT is close to the ideal cost. Moreover, OPT instantiates both the vertex-iterator and edge-iterator models and benefits from multi-thread parallelism on both types of triangulation. Extensive experiments conducted on large-scale datasets showed that (1) OPT achieved an elapsed time close to that of the ideal method with less than 7% overhead under the limited memory budget, (2) OPT achieved linear speed-up with an increasing number of CPU cores, (3) OPT outperforms the state-of-the-art parallel method by up to an order of magnitude with 6 CPU cores, and (4) for the first time in the literature, triangulation results are reported for a billion-vertex scale real-world graph.

【Keywords】: big data; parallel processing; triangulation

Research session 12: potpourri 3

58. Knowledge expansion over probabilistic knowledge bases.

Paper Link】 【Pages】:649-660

【Authors】: Yang Chen ; Daisy Zhe Wang

【Abstract】: Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

【Keywords】: databases; knowledge bases; probabilistic reasoning
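
A hedged sketch of the batch, SQL-based rule application described above, using a made-up schema and rule and ignoring probabilities entirely: a single INSERT ... SELECT applies an inference rule to all matching pairs of facts at once, repeated until no new facts appear.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (subj TEXT, pred TEXT, obj TEXT, UNIQUE(subj, pred, obj))")
db.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("snowbird", "located_in", "utah"),
    ("utah", "located_in", "usa"),
])

# Rule: located_in(x, y) AND located_in(y, z) -> located_in(x, z),
# applied as a set-oriented join rather than one tuple at a time.
while True:
    before = db.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    db.execute("""
        INSERT OR IGNORE INTO facts
        SELECT a.subj, 'located_in', b.obj
        FROM facts a JOIN facts b
          ON a.pred = 'located_in' AND b.pred = 'located_in' AND a.obj = b.subj
    """)
    after = db.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    if after == before:        # fixpoint: the last batch derived nothing new
        break

print(db.execute("SELECT * FROM facts ORDER BY subj").fetchall())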

59. InsightNotes: summary-based annotation management in relational databases.

Paper Link】 【Pages】:661-672

【Authors】: Dongqing Xiao ; Mohamed Y. Eltabakh

【Abstract】: In this paper, we address the challenges that arise from the growing scale of annotations in scientific databases. On one hand, end-users and scientists are incapable of analyzing and extracting knowledge from the large number of reported annotations, e.g., one tuple may have hundreds of annotations attached to it over time. On the other hand, current annotation management techniques fall short in providing advanced processing over the annotations beyond just propagating them to end-users. To address this limitation, we propose the InsightNotes system, a summary-based annotation management engine in relational databases. InsightNotes integrates data mining and summarization techniques into annotation management in novel ways with the objective of creating and reporting concise representations (summaries) of the raw annotations. We propose an extended summary-aware query processing engine for efficient manipulation and propagation of the annotation summaries in the query pipeline. We introduce several optimizations for the creation, maintenance, and zoom-in processing over the annotations summaries. InsightNotes is implemented on top of an existing annotation management system within which it is experimentally evaluated using real-world datasets. The results illustrate significant performance gain from the proposed techniques and optimizations (up to 100x in some operations) compared to the naive approaches.

【Keywords】: annotation management; query processing; summarization

60. A pivotal prefix based filtering algorithm for string similarity search.

Paper Link】 【Pages】:673-684

【Authors】: Dong Deng ; Guoliang Li ; Jianhua Feng

【Abstract】: We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the strings in the set that are similar to the query. Existing algorithms use a signature-based framework: they first generate signatures for each string and then prune the dissimilar strings that share no common signatures with the query. However, existing methods involve large numbers of signatures, many of which are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove that the pivotal prefix filter achieves larger pruning power and lower filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.

【Keywords】: edit distance; pivotal prefix filter; similarity search
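
For background, a sketch of the classic (non-pivotal) q-gram prefix filter that this paper improves on: order each string's q-grams by global rarity, keep only the first q*tau+1 of them as its signature, prune any data string whose signature is disjoint from the query's, and verify the survivors with the exact edit distance. A simplified Python version, assuming small inputs:

from collections import Counter

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    # plain dynamic-programming Levenshtein distance, used only for verification
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def prefix_filter_search(data, query, tau, q=2):
    freq = Counter(g for s in data for g in qgrams(s, q))
    def prefix(s):
        # first q*tau + 1 q-grams under a global (rarity, gram) order
        return set(sorted(qgrams(s, q), key=lambda g: (freq[g], g))[:q * tau + 1])
    qp = prefix(query)
    results = []
    for s in data:
        if abs(len(s) - len(query)) <= tau and prefix(s) & qp:
            if edit_distance(s, query) <= tau:
                results.append(s)
    return results

print(prefix_filter_search(["sigmod", "sigkdd", "vldb", "icde"], "sigmmod", tau=1))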

Demo A 10

61. Versatile optimization of UDF-heavy data flows with sofa.

Paper Link】 【Pages】:685-688

【Authors】: Astrid Rheinländer ; Martin Beckmann ; Anja Kunkel ; Arvid Heise ; Thomas Stoltmann ; Ulf Leser

【Abstract】: Currently, we witness an increased interest in large-scale analytical data flows on non-relational data. The predominant building blocks of such data flows are user-defined functions (UDFs), a fact that is not well taken into account for data flow language design and optimization in current systems. In this demonstration, we present Meteor, a declarative data flow language, and Sofa, a logical optimizer for UDF-heavy data flows, which are both part of the Stratosphere system. Meteor queries seamlessly combine self-descriptive, domain-specific operators with standard relational operators. Such queries are optimized by Sofa, building on a concise set of UDF annotations and a small set of rewrite rules to enable semantically equivalent plan rewriting of UDF-heavy data flows. A salient feature of Meteor and Sofa is extensibility: User-defined operators and their properties are arranged into a subsumption hierarchy, which considerably eases integration and optimization of new operators. In this demonstration, we will let users pose arbitrary Meteor queries and graphically showcase versatility and extensibility of Sofa during query optimization.

【Keywords】: non-relational data flows; query optimization; user-defined functions

62. ERIS live: a NUMA-aware in-memory storage engine for tera-scale multiprocessor systems.

Paper Link】 【Pages】:689-692

【Authors】: Tim Kiefer ; Thomas Kissinger ; Benjamin Schlegel ; Dirk Habich ; Daniel Molka ; Wolfgang Lehner

【Abstract】: The ever-growing demand for more computing power forces hardware vendors to put an increasing number of multiprocessors into a single server system, which usually exhibits a non-uniform memory access (NUMA). In-memory database systems running on NUMA platforms face several issues such as the increased latency and the decreased bandwidth when accessing remote main memory. To cope with these NUMA-related issues, a DBMS has to allow flexible data partitioning and data placement at runtime. In this demonstration, we present ERIS, our NUMA-aware in-memory storage engine. ERIS uses an adaptive partitioning approach that exploits the topology of the underlying NUMA platform and significantly reduces NUMA-related issues. We demonstrate throughput numbers and hardware performance counter evaluations of ERIS and a NUMA-unaware index for different workloads and configurations. All experiments are conducted on a standard server system as well as on a system consisting of 64 multiprocessors, 512 cores, and 8 TBs main memory.

【Keywords】: eris; in-memory; multiprocessors; numa; scalability; storage engine

63. Demonstrating efficient query processing in heterogeneous environments.

Paper Link】 【Pages】:693-696

【Authors】: Tomas Karnagel ; Matthias Hille ; Mario Ludwig ; Dirk Habich ; Wolfgang Lehner ; Max Heimel ; Volker Markl

【Abstract】: The increasing heterogeneity in hardware systems gives developers many opportunities to add more functionality and computational power to the system. As a consequence, modern database systems will need to be able to adapt to a wide variety of heterogeneous architectures. While porting single operators to accelerator architectures is well-understood, a more generic approach is needed for the whole database system. In prior work, we presented a generic hardware-oblivious database system, where the operators can be executed on the main processor as well as on a large number of accelerator architectures. However, to achieve fully heterogeneous query processing, placement decisions are needed for the database operators. We enhance the presented system with heterogeneity-aware operator placement (HOP) to take a major step towards designing a database system that can efficiently exploit highly heterogeneous hardware environments. In this demonstration, we are focusing on the placement-integration aspect as well as presenting the resulting database system.

【Keywords】: hardware-oblivious data processing; heterogeneous hardware; modern hardware architecture; operator placement; query optimization

64. One DBMS for all: the brawny few and the wimpy crowd.

Paper Link】 【Pages】:697-700

【Authors】: Tobias Mühlbauer ; Wolf Rödiger ; Robert Seilbeck ; Angelika Reiser ; Alfons Kemper ; Thomas Neumann

【Abstract】: Shipments of smartphones and tablets with wimpy CPUs are outpacing brawny PC and server shipments by an ever-increasing margin. While high performance database systems have traditionally been optimized for brawny systems, wimpy systems have received only little attention, leading to poor performance and energy inefficiency on such systems. This demonstration presents HyPer, a high-performance hybrid OLTP&OLAP main memory database system that we optimized for both brawny and wimpy systems. The compilation of transactions and queries into efficient machine code allows for high performance, independent of the target platform. HyPer has a memory footprint of just a few megabytes, even though it supports the SQL-92 standard, a PL/SQL-like scripting language, and ACID-compliant transactions. It is the goal of this demonstration to showcase the same HyPer codebase running on (a) a wimpy ARM-based smartphone system and (b) a brawny x86-64-based server system. In particular, we run the TPC-C, TPC-H, and a combined CH-benCHmark and report performance and energy numbers. The demonstration further allows the interactive execution of arbitrary SQL queries and the visualization of optimized query plans.

【Keywords】: brawny; energy efficiency; high performance; wimpy

65. VQA: vertica query analyzer.

Paper Link】 【Pages】:701-704

【Authors】: Alkis Simitsis ; Kevin Wilkinson ; Jason Blais ; Joe Walsh

【Abstract】: Database query monitoring tools collect performance metrics, such as memory and cpu usage, while a query is executing and make them available through log files or system tables. The metrics can be used to understand and diagnose query performance issues. However, analytic queries over big data present new challenges for query monitoring tools. First, a long-running query may generate tens of thousands of values, so simply reporting the metrics may overwhelm the user. Second, analytic queries may be written by database novices who have trouble interpreting the metrics. Third, analytic queries may access data or processing outside the database through user-defined functions and connectors. The impact of these on query performance must be understood. Vertica Query Analyzer (VQA) is a query monitoring tool that addresses these challenges. VQA is both a useful tool and a research platform for query analytics. It presents query performance metrics through a variety of views and granularities. In addition, it analyzes the metrics for typical performance problems and suggests corrective actions. We demonstrate VQA using TPC-DS queries, which have a wide range of query duration and complexity.

【Keywords】: SQL; SQL query analysis; SQL query monitoring; parallel database

66. Palette: enabling scalable analytics for big-memory, multicore machines.

Paper Link】 【Pages】:705-708

【Authors】: Fei Chen ; Tere Gonzalez ; Jun Li ; Manish Marwah ; Jim Pruyne ; Krishnamurthy Viswanathan ; Mijung Kim

【Abstract】: Hadoop and its variants have been widely used for processing large scale analytics tasks in a cluster environment. However, use of a commodity cluster for analytics tasks needs to be reconsidered based on two key observations: (1) in recent years, large memory, multicore machines have become more affordable; and (2) recent studies show that most analytics tasks in practice are smaller than 100 GB. Thus, replacing a commodity cluster with a large memory, multicore machine can enable in-memory analytics at an affordable cost. However, programming on a big-memory, multicore machine is a challenge. Multi-threaded programming is notoriously difficult. Further, the memory design of most large memory servers follows a non-uniform memory access (NUMA) architecture. While NUMA-aware programming often leads to high efficiency in analytics tasks, it is usually done in an ad hoc manner. In this demo, we present Palette, an analytics framework that exploits large memory to trade space for time while also addressing the challenges of multi-threaded, NUMA-aware programming. Palette manages multiple, index-like data representations for input datasets. An operator may have multiple implementations, each of which uses a different data representation. Palette uses a cost-based approach to automatically select the fastest one for a given dataset. Palette addresses the challenges of multi-threaded and NUMA-aware programming by adapting Hadoop to a single multicore machine and modifying it to account for the characteristics of modern NUMA hardware. Users can write programs using exactly the same APIs as those used in traditional Hadoop, while transparently benefiting from the multi-threaded and NUMA-aware infrastructure. We have developed a research prototype of Palette. Specifically, at SIGMOD we will demonstrate how to (1) create an operator, such as time series similarity search, on Palette, (2) execute the operator with Palette's automatic implementation selection feature, and (3) monitor and compare different operator implementations.

【Keywords】: hadoop; multi-core; numa-aware

67. NaLIR: an interactive natural language interface for querying relational databases.

Paper Link】 【Pages】:709-712

【Authors】: Fei Li ; Hosagrahar Visvesvaraya Jagadish

【Abstract】: In this demo, we present NaLIR, a generic interactive natural language interface for querying relational databases. NaLIR can accept a logically complex English language sentence as query input. This query is first translated into a SQL query, which may include aggregation, nesting, and various types of joins, among other things, and then evaluated against an RDBMS. In this demonstration, we show that NaLIR, while far from being able to pass the Turing test, is perfectly usable in practice, and able to handle even quite complex queries in a variety of application domains. In addition, we also demonstrate how carefully designed interactive communication can avoid misinterpretation with minimum user burden.

【Keywords】: natural language interface; relational database; usability

68. BabbleFlow: a translator for analytic data flow programs.

Paper Link】 【Pages】:713-716

【Authors】: Petar Jovanovic ; Alkis Simitsis ; Kevin Wilkinson

【Abstract】: A complex analytic data flow may perform multiple, inter-dependent tasks where each task uses a different processing engine. Such a multi-engine flow, termed a hybrid flow, may comprise subflows written in more than one programming language. However, as the number and variety of these engines grow, developing and maintaining hybrid flows at the physical level becomes increasingly challenging. To address this problem, we present BabbleFlow, a system for enabling flow design at a logical level and automatic translation to physical flows. BabbleFlow translates a hybrid flow expressed in a number of languages to a semantically equivalent hybrid flow expressed in the same or a different set of languages. To this end, it composes the multiple physical flows of a hybrid flow into a single logical representation expressed in a unified flow language called xLM. In doing so, it enables a number of graph transformations such as (de-)composition and optimization. Then, it converts the, possibly transformed, xLM data flow graph into an executable form by expressing it in one or more target programming languages.

【Keywords】: SQL; analytics flows; code generation; language; translation

69. Indexing on modern hardware: hekaton and beyond.

Paper Link】 【Pages】:717-720

【Authors】: Justin J. Levandoski ; David B. Lomet ; Sudipta Sengupta ; Adrian Birka ; Cristian Diaconu

【Abstract】: Recent OLTP support exploits new techniques, running on modern hardware, to achieve unprecedented performance compared with prior approaches. In SQL Server, the Hekaton main-memory database engine embodies this new OLTP support. Hekaton uses the Bw-tree to achieve its great indexing performance. The Bw-tree is a latch-free B-tree index that also exploits log-structured storage when used "beyond" Hekaton as a separate key-value store. It is designed from the ground up to address two hardware trends: (1) multi-core and the main memory hierarchy: the Bw-tree is completely latch-free, using an atomic compare-and-swap instruction to install state changes on a "page address" mapping table; it performs updates as "deltas" to avoid update-in-place. These improve performance by eliminating thread blocking while improving cache hit ratios. (2) Flash storage: the Bw-tree organizes secondary storage in a log-structured manner, using large sequential writes to avoid entirely the adverse performance impact of random writes. We demonstrate the architectural versatility and performance of the Bw-tree in two scenarios: (a) running live within Hekaton and (b) running as a standalone key-value store compared to both BerkeleyDB and a state-of-the-art in-memory range index (latch-free skiplists). Using workloads from real-world applications (Microsoft XBox Live Primetime and enterprise deduplication), we show the Bw-tree is 19x faster than BerkeleyDB and 3x faster than skiplists.

【Keywords】: storage
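
A toy model of the delta-update and mapping-table idea described above, not the Bw-tree itself: updates are prepended to a per-page delta chain by swinging a mapping-table pointer with a compare-and-swap. Python has no CAS instruction, so a lock emulates the atomic step here; all class and function names are invented.

import threading

class MappingTable:
    def __init__(self):
        self._slots = {}                       # page id -> head of delta chain
        self._lock = threading.Lock()

    def read(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        with self._lock:                       # stands in for one hardware CAS
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False

def install_delta(table, pid, key, value):
    # Retry loop typical of lock-free designs: re-read the head, build a delta
    # that points to it, and try to swing the mapping-table pointer.
    while True:
        head = table.read(pid)
        delta = {"key": key, "value": value, "next": head}
        if table.cas(pid, head, delta):
            return

def lookup(table, pid, key):
    node = table.read(pid)
    while node is not None:                    # walk the chain, newest delta first
        if node["key"] == key:
            return node["value"]
        node = node["next"]
    return None

table = MappingTable()
install_delta(table, pid=1, key="a", value=10)
install_delta(table, pid=1, key="a", value=11)
print(lookup(table, 1, "a"))                   # -> 11, the newest delta wins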

70. CrowdMatcher: crowd-assisted schema matching.

Paper Link】 【Pages】:721-724

【Authors】: Chen Jason Zhang ; Ziyuan Zhao ; Lei Chen ; H. V. Jagadish ; Caleb Chen Cao

【Abstract】: Schema matching is a central challenge for data integration systems. Due to the inherent uncertainty arising from the inability of a schema to fully capture the semantics of the represented data, automatic tools are often uncertain about the suggested matching results. However, humans are good at understanding data represented in various forms, and crowdsourcing platforms are making the human annotation process more affordable. Thus, in this demo, we show how to utilize the crowd to find the right matching. In order to do that, we need to make the tasks posted on the crowdsourcing platforms extremely simple, so that they can be performed by non-expert people, and to keep the number of tasks as small as possible to save cost. We demonstrate CrowdMatcher, a hybrid machine-crowd system for schema matching. The machine-generated matchings are verified by correspondence correctness queries (CCQs), which ask the crowd to determine whether a given correspondence is correct or not. CrowdMatcher includes several original features: it integrates different matchings generated by classical schema matching tools; in order to minimize the cost of crowdsourcing, it automatically selects the most informative set of CCQs from the possible matchings; it is able to manage inaccurate answers provided by the workers; and the crowdsourced answers are used to improve the matching results.

【Keywords】: crowdsourcing; schema matching

Tutorial 2 1

71. Cloud-based RDF data management.

Paper Link】 【Pages】:725-729

【Authors】: Zoi Kaoudi ; Ioana Manolescu

【Abstract】: The W3C's Resource Description Framework (or RDF, in short) is set to deliver many of the original semi-structured data promises: flexible structure, optional schema, and rich, flexible URIs as a basis for information sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific communities studying databases, knowledge representation, and Web technologies. As a consequence, numerous collections of RDF data are published, going from scientific data to general-purpose ontologies to open government data, in particular published as part of the Linked Data movement. Managing such large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging distributed paradigm massively adopted in many applications for the scalability, fault-tolerance and elasticity features it provides. This tutorial presents the challenges faced in order to efficiently handle massive amounts of RDF data in a cloud environment. We provide the necessary background, analyze and classify existing solutions, and discuss open problems and perspectives.

【Keywords】: RDF; cloud; data management; mapreduce; noSQL

Research session 13: data management over modern hardware 4

72. Patience is a virtue: revisiting merge and sort on modern processors.

Paper Link】 【Pages】:731-742

【Authors】: Badrish Chandramouli ; Jonathan Goldstein

【Abstract】: The vast quantities of log-based data appearing in data centers have generated an interest in sorting almost-sorted datasets. We revisit the problem of sorting and merging data in main memory, and show that a long-forgotten technique called Patience Sort can, with some key modifications, be made competitive with today's best comparison-based sorting techniques for both random and almost sorted data. Patience sort consists of two phases: the creation of sorted runs, and the merging of these runs. Through a combination of algorithmic and architectural innovations, we dramatically improve Patience sort for both random and almost-ordered data. Of particular interest is a new technique called ping-pong merge for merging sorted runs in main memory. Together, these innovations produce an extremely fast sorting technique that we call P3 Sort (for Ping-Pong Patience+ Sort), which is competitive with or better than the popular implementations of the fastest comparison-based sort techniques of today. For example, our implementation of P3 sort is around 20% faster than GNU Quicksort on random data, and 20% to 4x faster than Timsort for almost sorted data. Finally, we investigate replacement selection sort in the context of single-pass sorting of logs with bounded disorder, and leverage P3 sort to improve replacement selection. Experiments show that our proposal, P3 replacement selection, significantly improves performance, with speedups of 3x to 20x over classical replacement selection.

【Keywords】: merging; patience; performance; replacement selection; sorting
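
A simplified sketch of the two phases described above, assuming in-memory Python lists: run creation appends each element to an existing sorted run whenever possible, and the runs are then merged pairwise. The paper's ping-pong merge additionally alternates between two pre-allocated buffers, which this sketch does not model.

import bisect

def patience_runs(data):
    # Phase 1: append each element to the run with the largest tail <= element
    # so every run stays sorted; start a new run only when forced. Almost-sorted
    # input therefore produces very few, very long runs.
    runs, tails = [], []
    for x in data:
        j = bisect.bisect_right(tails, x) - 1
        if j >= 0:
            runs[j].append(x)
            tails[j] = x
        else:
            runs.insert(0, [x])
            tails.insert(0, x)
    return runs

def merge_two(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def patience_sort(data):
    # Phase 2: merge the runs pairwise, level by level, until one run remains.
    runs = patience_runs(data)
    while len(runs) > 1:
        runs = [merge_two(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

print(patience_sort([1, 2, 4, 3, 5, 9, 7, 8, 6, 10]))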

73. Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age.

Paper Link】 【Pages】:743-754

【Authors】: Viktor Leis ; Peter A. Boncz ; Alfons Kemper ; Thomas Neumann

【Abstract】: With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can change elastically during query execution, so the dispatcher can not only react to the execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of the data locality of NUMA-local morsels and operator state, such that the great majority of executions take place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.

【Keywords】: NUMA-awareness; morsel-driven parallelism
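
A toy illustration of morsel-wise scheduling, ignoring NUMA placement and pipeline breakers: the input is cut into small fragments that idle worker threads pull from a shared queue, so load balancing happens at run time instead of being baked into the plan. All names below are hypothetical.

import threading, queue

def run_pipeline_on_morsels(data, pipeline, morsel_size=4096, workers=4):
    tasks = queue.Queue()
    for i in range(0, len(data), morsel_size):
        tasks.put(data[i:i + morsel_size])     # one morsel per queue entry
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                morsel = tasks.get_nowait()
            except queue.Empty:
                return
            r = pipeline(morsel)               # run the whole pipeline on the morsel
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

partial_sums = run_pipeline_on_morsels(list(range(1_000_000)),
                                       pipeline=lambda m: sum(v * v for v in m))
print(sum(partial_sums))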

74. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort.

Paper Link】 【Pages】:755-766

【Authors】: Orestis Polychroniou ; Kenneth A. Ross

【Abstract】: Analytical database systems can achieve high throughput main-memory query execution by being aware of the dynamics of highly-parallel modern hardware. Such systems rely on partitioning to cluster or divide data into smaller pieces and thus achieve better parallelism and memory locality. This paper considers a comprehensive collection of variants of main-memory partitioning tuned for various layers of the memory hierarchy. We revisit the pitfalls of in-cache partitioning and, utilizing the crucial performance factors, we introduce new variants for partitioning out-of-cache. Besides non-in-place variants where linear extra space is used, we introduce large-scale in-place variants, and propose NUMA-aware partitioning that guarantees locality on multiple processors. Also, we make range partitioning comparably fast to hash or radix partitioning by designing a novel cache-resident index to compute ranges. All variants are combined to build three NUMA-aware sorting algorithms: a stable LSB radix-sort; an in-place MSB radix-sort using different variants across memory layers; and a comparison-sort utilizing wide-fanout range partitioning and SIMD-optimal in-cache sorting. To the best of our knowledge, all three are the fastest to date on billion-scale inputs for both dense and sparse key domains. As shown for sorting, our work can serve as a tool for building other operations (e.g., join, aggregation) by combining the most suitable variants that best meet the design goals.

【Keywords】: NUMA; SIMD; multi-core; parallelism; partitioning; sorting
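
A minimal sketch of one out-of-place radix partitioning pass (histogram, prefix sum, scatter), the basic building block the paper tunes per memory level; the fanout and the toy key set are assumptions, and none of the paper's in-place, NUMA-aware, or range variants are shown.

```python
def radix_partition(keys, bits=4, shift=0):
    """One out-of-place partitioning pass: histogram, prefix sum, then scatter."""
    fanout = 1 << bits
    mask = fanout - 1
    # Pass 1: histogram of partition sizes.
    hist = [0] * fanout
    for k in keys:
        hist[(k >> shift) & mask] += 1
    # Exclusive prefix sum gives each partition's output offset.
    offsets = [0] * fanout
    total = 0
    for p in range(fanout):
        offsets[p] = total
        total += hist[p]
    # Pass 2: scatter each key to its partition's next free slot.
    out = [None] * len(keys)
    cursor = offsets[:]
    for k in keys:
        p = (k >> shift) & mask
        out[cursor[p]] = k
        cursor[p] += 1
    return out, offsets

data = [29, 3, 18, 7, 44, 12, 61, 5]
partitioned, offsets = radix_partition(data, bits=2)   # 4 partitions on the low 2 bits
print(partitioned, offsets)                            # [44, 12, 29, 61, 5, 18, 3, 7] [0, 2, 5, 6]
```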

75. An application-specific instruction set for accelerating set-oriented database primitives.

Paper Link】 【Pages】:767-778

【Authors】: Oliver Arnold ; Sebastian Haas ; Gerhard Fettweis ; Benjamin Schlegel ; Thomas Kissinger ; Wolfgang Lehner

【Abstract】: The key task of database systems is to efficiently manage large amounts of data. A high query throughput and a low query latency are essential for the success of a database system. Lately, research has focused on exploiting hardware features like superscalar execution units, SIMD, or multiple cores to speed up processing. Apart from these software optimizations for given hardware, even tailor-made processing circuits running on FPGAs are built to run mostly stateless query plans with incredibly high throughput. A similar idea, which was already considered three decades ago, is to build tailor-made hardware like a database processor. Despite their superior performance, such application-specific processors were not considered to be beneficial because general-purpose processors eventually always caught up so that the high development costs did not pay off. In this paper, we show that the development of a database processor is much more feasible nowadays through the availability of customizable processors. As an example, we illustrate how to create an instruction set extension for set-oriented database primitives. The resulting application-specific processor not only provides high performance but also enables very energy-efficient processing. In various configurations, our processor requires more than 960x less energy than a high-end x86 processor while providing the same performance.

【Keywords】: customizable processors; hardware/software-codesign; instruction set extensions

Research session 14: non-traditional data 4

76. Which concepts are worth extracting?

Paper Link】 【Pages】:779-790

【Authors】: Arash Termehchy ; Ali Vakilian ; Yodsawalai Chodpathumwan ; Marianne Winslett

【Abstract】: It is well established that extracting and annotating occurrences of entities in a collection of unstructured text documents with their concepts improves the effectiveness of answering queries over the collection. However, it is very resource intensive to create and maintain large annotated collections. Since the available resources of an enterprise are limited and/or its users may have urgent information needs, it may have to select only a subset of relevant concepts for extraction and annotation. We call this subset a conceptual design for the annotated collection. In this paper, we introduce the problem of cost effective conceptual design, where given a collection, a set of relevant concepts, and a fixed budget, one would like to find a conceptual design that improves the effectiveness of answering queries over the collection the most. We prove that the problem is generally NP-hard in the number of relevant concepts and propose two efficient approximation algorithms to solve the problem: Approximate Popularity Maximization (APM for short) and Approximate Annotation-benefit Maximization (AAM for short). We show that if there are no constraints regarding the overlap of concepts, APM is a fully polynomial time approximation scheme. We also prove that if the relevant concepts are mutually exclusive, APM has a constant approximation ratio and AAM is a fully polynomial time approximation scheme. Our empirical results using the Wikipedia collection and a search engine query log validate the proposed formalization of the problem and show that APM and AAM efficiently compute conceptual designs. They also indicate that in general APM delivers the optimal conceptual designs if the relevant concepts are not mutually exclusive. Also, if the relevant concepts are mutually exclusive, the conceptual designs delivered by AAM improve the effectiveness of answering queries over the collection more than the solutions provided by APM.

【Keywords】: concept extraction; conceptual design; effective query answering

77. Querying virtual hierarchies using virtual prefix-based numbers.

Paper Link】 【Pages】:791-802

【Authors】: Curtis E. Dyreson ; Sourav S. Bhowmick ; Ryan Grapp

【Abstract】: Prefix-based numbering is a popular method for numbering nodes in a hierarchy. But prefix-based numbering breaks down when a node's location within a hierarchy changes, such as when XML data is queried after being transformed by an XSLT program or when data is reformatted in the return clause of an inner FLWR expression in a nested XQuery program. A query on transformed data cannot be evaluated as efficiently since the extant prefix-based node numbers cannot be used (unless the data is materialized and then renumbered, which can be expensive). In this paper we present a novel strategy to virtually transform the data without instantiating and renumbering. Our method, which we call virtual prefix-based numbering, couples each prefix-based node number with a level array that locates the node in the numbering space of the virtual hierarchy. The virtual numbering space preserves the property that location-based relationships between nodes can be determined by comparing (virtual) numbers.

【Keywords】: prefix-based numbering; view query; virtual hierarchy; xml
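
For background, a minimal sketch of plain prefix-based (Dewey-style) numbering and the ancestor test and document order it supports; the paper's level arrays for locating nodes in a virtual, transformed hierarchy are not reproduced here.

```python
def is_ancestor(a, d):
    """a and d are prefix-based (Dewey) numbers, e.g. (1, 3, 2).
    a is an ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

def document_order(nodes):
    """Prefix numbers also give document order by plain tuple comparison."""
    return sorted(nodes)

book    = (1,)
chapter = (1, 2)
section = (1, 2, 4)
print(is_ancestor(book, section))      # True
print(is_ancestor(chapter, book))      # False
print(document_order([section, book, chapter]))  # [(1,), (1, 2), (1, 2, 4)]
```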

78. NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation.

Paper Link】 【Pages】:803-814

【Authors】: Sumit Gulwani ; Mark Marron

【Abstract】: Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes the design and implementation of a robust natural language based interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language (DSL) that supports an expressive algebra of map, filter, reduce, join, and formatting capabilities at a level of abstraction appropriate for non-expert users. The key algorithmic component of our methodology is a translation algorithm for converting a natural language specification in the context of a given spreadsheet to a ranked set of likely programs in the DSL. The translation algorithm leverages the spreadsheet spatial and temporal context to assign interpretations to specifications with implicit references, and is thus robust to a variety of ways in which end users can express the same task. The translation algorithm builds over ideas from keyword programming and semantic parsing to achieve both high precision and high recall. We implemented the system as an Excel add-in called NLyze that supports a rich user interaction model including annotating the user's natural language specification and explaining the synthesized DSL programs by paraphrasing them into structured English. We collected a total of 3570 English descriptions for 40 spreadsheet tasks and our system was able to generate the intended interpretation as the top candidate for 94% (97% for the top 3) of those instances.

【Keywords】: end-user programming; program synthesis; spreadsheet programming; user intent

79. Sinew: a SQL system for multi-structured data.

Paper Link】 【Pages】:815-826

【Authors】: Daniel Tahara ; Thaddeus Diamond ; Daniel J. Abadi

【Abstract】: As applications are becoming increasingly dynamic, the notion that a schema can be created in advance for an application and remain relatively stable is becoming increasingly unrealistic. This has pushed application developers away from traditional relational database systems and away from the SQL interface, despite their many well-established benefits. Instead, developers often prefer self-describing data models such as JSON, and NoSQL systems designed specifically for their relaxed semantics. In this paper, we discuss the design of a system that enables developers to continue to represent their data using self-describing formats without moving away from SQL and traditional relational database systems. Our system stores arbitrary documents of key-value pairs inside physical and virtual columns of a traditional relational database system, and adds a layer above the database system that automatically provides a dynamic relational view to the user against which fully standard SQL queries can be issued. We demonstrate that our design can achieve an order of magnitude improvement in performance over alternative solutions, including existing relational database JSON extensions, MongoDB, and shredding systems that store flattened key-value data inside a relational database.

【Keywords】: JSON; MongoDB; NoSQL; RDBMS; SQL; dynamic schema
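
A minimal sketch of the general idea of exposing self-describing documents through a dynamic relational view, assuming an SQLite build with the JSON1 functions; the table and view names are hypothetical, and Sinew's physical/virtual column split and column promotion are not modeled.

```python
import json
import sqlite3

docs = [
    {"id": 1, "name": "ada", "tags": ["db", "ml"]},
    {"id": 2, "name": "bob", "city": "Oslo"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, raw TEXT)")
# Hypothetical catalog: which top-level keys exist, so a view can expose them as columns.
keys = sorted({k for d in docs for k in d if k != "id"})

for d in docs:
    conn.execute("INSERT INTO docs VALUES (?, ?)", (d["id"], json.dumps(d)))

# Build a dynamic relational view with one virtual column per observed key.
cols = ", ".join(f"json_extract(raw, '$.{k}') AS {k}" for k in keys)
conn.execute(f"CREATE VIEW docs_view AS SELECT id, {cols} FROM docs")

for row in conn.execute("SELECT id, name, city FROM docs_view"):
    print(row)   # (1, 'ada', None) and (2, 'bob', 'Oslo')
```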

Research session 15: mapreduce processing 4

80. Scalable big graph processing in MapReduce.

Paper Link】 【Pages】:827-838

【Authors】: Lu Qin ; Jeffrey Xu Yu ; Lijun Chang ; Hong Cheng ; Chengqi Zhang ; Xuemin Lin

【Abstract】: MapReduce has become one of the most popular parallel computing paradigms in the cloud, due to its high scalability, reliability, and fault-tolerance achieved for a large variety of applications in big data processing. In the literature, there are MapReduce Class MRC and Minimal MapReduce Class MMC to define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce, since the constraints in MMC can hardly be achieved simultaneously on graphs and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class SGC by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely, EN join and NE join, using which a wide range of graph algorithms can be designed, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems CC and MSF computation, this is the first work that can achieve O(log(n)) MapReduce rounds with O(n+m) total communication cost in each round and constant memory consumption on each machine, where n and m are the number of nodes and edges in the graph respectively. We conducted extensive performance studies using two web-scale graphs Twitter and Friendster with different graph characteristics. The experimental results demonstrate that our algorithms can achieve high scalability in big graph processing.

【Keywords】: MapReduce; big data; cloud computing; graph
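
A toy illustration of round-based connected-component computation in a simulated map/reduce loop; it shows only the general style of iterative MapReduce graph processing, not the paper's EN/NE join operators or its O(log(n))-round, O(n+m)-communication guarantees.

```python
from collections import defaultdict

edges = [(1, 2), (2, 3), (4, 5)]          # two components: {1, 2, 3} and {4, 5}
label = {v: v for e in edges for v in e}  # start: every node labels itself

def one_round(edges, label):
    # Map: each edge emits the current label of each endpoint to the other endpoint.
    emitted = defaultdict(list)
    for u, v in edges:
        emitted[u].append(label[v])
        emitted[v].append(label[u])
    # Reduce: each node keeps the minimum label it has seen.
    changed = False
    for v, incoming in emitted.items():
        new = min([label[v]] + incoming)
        if new != label[v]:
            label[v] = new
            changed = True
    return changed

rounds = 0
while one_round(edges, label):
    rounds += 1
print(label, "after", rounds, "rounds")   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```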

81. Anti-combining for MapReduce.

Paper Link】 【Pages】:839-850

【Authors】: Alper Okcan ; Mirek Riedewald

【Abstract】: We propose Anti-Combining, a novel optimization for MapReduce programs to decrease the amount of data transferred from mappers to reducers. In contrast to Combiners, which decrease data transfer by performing reduce work on the mappers, Anti-Combining shifts mapper work to the reducers. It is also conceptually different from traditional compression techniques. While the latter are applied outside the MapReduce framework by compressing map output and then decompressing it before the data is fed into the reducer, Anti-Combining is integrated into mapping and reducing functionality itself. This enables lightweight algorithms and data reduction even for cases where the Map output data shows no redundancy that could be exploited by traditional compression techniques. Anti-Combining can be enabled automatically for any given MapReduce program through purely syntactic transformations. In some cases, in particular for certain non-deterministic Map and Partition functions, only a weaker version can be applied. At runtime the Anti-Combining enabled MapReduce program will dynamically and adaptively decrease data transfer by making fine-grained local decisions. Our experiments show that Anti-Combining can achieve data transfer reduction similar to or better than traditional compression techniques, while also reducing CPU and local I/O cost. It can even be applied in combination with them to greater effect.

【Keywords】: MapReduce; anti-combining; throughput optimization

82. Opportunistic physical design for big data analytics.

Paper Link】 【Pages】:851-862

【Authors】: Jeff LeFevre ; Jagan Sankaranarayanan ; Hakan Hacigümüs ; Jun'ichi Tatemura ; Neoklis Polyzotis ; Michael J. Carey

【Abstract】: Big data analytical systems, such as MapReduce, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that we propose to treat as an opportunistic physical design. We present a semantic model for UDFs that enables effective reuse of views containing UDFs along with a rewrite algorithm that provably finds the minimum-cost rewrite under certain assumptions. An experimental study on real-world datasets using our prototype based on Hive shows that our approach can result in dramatic performance improvements.

【Keywords】: UDFs; big data; exploratory analysis; opportunistic physical design; opportunistic views; query processing; query rewriting

83. Stratified-sampling over social networks using mapreduce.

Paper Link】 【Pages】:863-874

【Authors】: Roy Levin ; Yaron Kanza

【Abstract】: Sampling is being used in statistical surveys to select a subset of individuals from some population, to estimate properties of the population. In stratified sampling, the surveyed population is partitioned into homogeneous subgroups and individuals are selected within the subgroups, to reduce the sample size. In this paper we consider sampling of large-scale, distributed online social networks, and we show how to deal with cases where several surveys are conducted in parallel---in some surveys it may be desired to share individuals to reduce costs, while in other surveys, sharing should be minimized, e.g., to prevent survey fatigue. A multi-survey stratified sampling is the task of choosing the individuals for several surveys, in parallel, according to sharing constraints, without a bias. In this paper, we present a scalable distributed algorithm, designed for the MapReduce framework, for answering stratified-sampling queries over a population of a social network. We also present an algorithm to effectively answer multi-survey stratified sampling, and we show how to implement it using MapReduce. An experimental evaluation illustrates the efficiency of our algorithms and their effectiveness for multi-survey stratified sampling.

【Keywords】: mapreduce; social networks; stratified sampling
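
A minimal sketch of single-survey proportional stratified sampling; the strata and sample size are made up, and the paper's multi-survey sharing constraints and MapReduce formulation are not shown.

```python
import random
from collections import defaultdict

random.seed(7)
# Toy population: (user_id, stratum), e.g. strata = age groups.
population = [(uid, random.choice(["18-29", "30-49", "50+"])) for uid in range(1000)]

def stratified_sample(population, sample_size):
    by_stratum = defaultdict(list)
    for uid, stratum in population:
        by_stratum[stratum].append(uid)
    total = len(population)
    sample = {}
    for stratum, members in by_stratum.items():
        # Allocate sample slots proportionally to the stratum's share of the population.
        k = max(1, round(sample_size * len(members) / total))
        sample[stratum] = random.sample(members, k)
    return sample

sample = stratified_sample(population, sample_size=60)
for stratum, members in sorted(sample.items()):
    print(stratum, len(members))
```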

Demo B 9

84. Demonstration of the Myria big data management service.

Paper Link】 【Pages】:881-884

【Authors】: Daniel Halperin ; Victor Teixeira de Almeida ; Lee Lee Choo ; Shumo Chu ; Paraschos Koutris ; Dominik Moritz ; Jennifer Ortiz ; Vaspol Ruamviboonsuk ; Jingjing Wang ; Andrew Whitaker ; Shengliang Xu ; Magdalena Balazinska ; Bill Howe ; Dan Suciu

【Abstract】: In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data -- self-serve analytics. Using a web browser, Myria users can upload data, author efficient queries to process and explore the data, and debug correctness and performance issues. Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. Our interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.

【Keywords】: Myria; big data management; data management service

85. DataSift: a crowd-powered search toolkit.

Paper Link】 【Pages】:885-888

【Authors】: Aditya G. Parameswaran ; Ming Han Teh ; Hector Garcia-Molina ; Jennifer Widom

【Abstract】: Traditional search engines are unable to support a large number of potential queries issued by users, for instance, queries containing non-textual fragments such as images or videos, queries that are very long, ambiguous, or those that require subjective judgment, or semantically-rich queries over non-textual corpora. We demonstrate DataSift, a crowd-powered search toolkit that can be instrumented over any corpus supporting a keyword search API, and supports efficient and accurate querying for a rich general class of queries, including those described previously. Our demonstration will allow conference attendees to issue live queries for image, video, and product search, as well as "play back" the results of a wide variety of prior queries issued on DataSift. Attendees will also be able to perform a side-by-side comparison between DataSift and traditional retrieval schemes.

【Keywords】: crowdsourcing; information retrieval; search

86. Reactive and proactive sharing across concurrent analytical queries.

Paper Link】 【Pages】:889-892

【Authors】: Iraklis Psaroudakis ; Manos Athanassoulis ; Matthaios Olma ; Anastasia Ailamaki

【Abstract】: Today, an ever-increasing amount of data is collected and analyzed by researchers, businesses, and scientists in data warehouses (DW). In addition to the data size, the number of users and applications querying data grows exponentially. The increasing concurrency is itself a challenge in query execution, but also introduces an opportunity favoring synergy between concurrent queries. Traditional DW execution engines follow a query-centric approach, where each query is optimized and executed independently. On the other hand, workloads with increased concurrency have several queries with common parts of data and work, creating the opportunity for sharing among concurrent queries. Sharing can be reactive to the inherently existing sharing opportunities, or proactive by redesigning query operators to maximize the sharing opportunities. This demonstration showcases the impact of proactive and reactive sharing by comparing and integrating representative state-of-the-art techniques: Simultaneous Pipelining (SP), for reactive sharing, which shares intermediate results of common sub-plans, and Global Query Plans (GQP) for proactive sharing, which build and evaluate a single query plan with shared operators. We visually demonstrate, in an interactive interface, the behavior of both sharing approaches on top of a state-of-the-art storage engine using the original prototypes. We show that pull-based sharing for SP eliminates the serialization point imposed by the original push-based approach. Then, we compare, through a sensitivity analysis, the performance of SP and GQP. Finally, we show that SP can improve the performance of GQP for a query mix with common sub-plans.

【Keywords】: CJOIN; QPipe; data sharing; data warehouses; global query plans; proactive sharing; query processing; reactive sharing; shared pages list; simultaneous pipelining; work sharing

87. SLQ: a user-friendly graph querying system.

Paper Link】 【Pages】:893-896

【Authors】: Shengqi Yang ; Yanan Xie ; Yinghui Wu ; Tianyi Wu ; Huan Sun ; Jian Wu ; Xifeng Yan

【Abstract】: Querying complex graph databases such as knowledge graphs is a challenging task for non-professional users. In this demo, we present SLQ, a user-friendly graph querying system enabling schemaless and structureless graph querying, where a user need not describe queries precisely as required by most databases. The SLQ system combines searching and ranking: it leverages a set of transformation functions, including abbreviation, ontology, synonym, etc., that map keywords and linkages from a query to their matches in a data graph, based on an automatically learned ranking model. To help users better understand search results at different levels of granularity, it supports effective result summarization with "drill-down" and "roll-up" operations. Better still, the architecture of SLQ is elastic for new transformation functions, query logs and user feedback, to iteratively refine the ranking model. SLQ significantly improves the usability of graph querying. This demonstration highlights (1) SLQ can automatically learn an effective ranking model, without assuming manually labeled training examples, (2) it can efficiently return top ranked matches over noisy, large data graphs, (3) it can summarize the query matches to help users easily access, explore and understand query results, and (4) its GUI can interact with users to help them construct queries, explore data graphs and inspect matches in a user-friendly manner.

【Keywords】: graph databases; keyword query; schemaless graph querying

88. TAREEG: a MapReduce-based web service for extracting spatial data from OpenStreetMap.

Paper Link】 【Pages】:897-900

【Authors】: Louai Alarabi ; Ahmed Eldawy ; Rami Alghamdi ; Mohamed F. Mokbel

【Abstract】: Real spatial data, e.g., detailed road networks, rivers, buildings, and parks, is not readily available in most of the world. This hinders the practicality of many research ideas that need real spatial data for testing and experimentation. Such data is often available for governmental use, or at major software companies, but it is prohibitively expensive to build or buy for academia or individual researchers. This demo presents TAREEG, a web service that makes real spatial data, from anywhere in the world, available at the fingertips of every researcher or individual. TAREEG gets all its data by leveraging the richness of the OpenStreetMap dataset, the most comprehensive available spatial dataset of the world. Yet, it is still challenging to obtain OpenStreetMap data due to the size limitations, special data format, and the noisy nature of spatial data. TAREEG employs MapReduce-based techniques to make it efficient and easy to extract OpenStreetMap data in a standard form with minimal effort. TAREEG is accessible via http://www.tareeg.org/.

【Keywords】: mapreduce; openstreetmap; spatial data; spatial extraction; spatial indexing; spatialhadoop

89. Searching with XQ: the exemplar query search engine.

Paper Link】 【Pages】:901-904

【Authors】: Davide Mottin ; Matteo Lissandrini ; Yannis Velegrakis ; Themis Palpanas

【Abstract】: We demonstrate XQ, a query engine that implements a novel technique for searching relevant information on the web and in various data sources, called Exemplar Queries. While the traditional query model expects the user to provide a set of specifications that the elements of interest need to satisfy, XQ expects the user to provide only an element of interest and we infer the desired answer set based on that element. Through the various examples we demonstrate the functionality of the system and its applicability in various cases. At the same time, we highlight the technical challenges for this type of query answering and illustrate the implementation approach we have materialized. The demo is intended for both researchers and practitioners and aims at illustrating the benefits of the adoption of this new form of query answering in practical applications and the further study and advancement of its technical solutions.

【Keywords】: exemplar queries; labeled graphs; query paradigms

90. MeanKS: meaningful keyword search in relational databases with complex schema.

Paper Link】 【Pages】:905-908

【Authors】: Mehdi Kargar ; Aijun An ; Nick Cercone ; Parke Godfrey ; Jaroslaw Szlichta ; Xiaohui Yu

【Abstract】: Keyword search in relational databases was introduced in the last decade to assist users who are not familiar with a query language, the schema of the database, or the content of the data. An answer is a join tree of tuples that contains the query keywords. When searching a database with a complex schema, there are potentially many answers to the query. Therefore, ranking answers based on their relevance is crucial in this context. Prior work has addressed relevance based on the size of the answer or the IR scores of the tuples. However, this is not sufficient when searching a complex schema. We demonstrate MeanKS, a new system for meaningful keyword search over relational databases. The system first captures the user's interest by determining the roles of the keywords. Then, it uses schema-based ranking to rank join trees that cover the keyword roles. This uses the relevance of relations and foreign-key relationships in the schema over the information content of the database. In the demonstration, attendees can execute queries against the TPC-E warehouse and compare the proposed measures against a gold standard derived from a real workload over TPC-E to test the effectiveness of our methods.

【Keywords】: keyword search; meaningful search; relational databases

91. H2RDF+: an efficient data management system for big RDF graphs.

Paper Link】 【Pages】:909-912

【Authors】: Nikolaos Papailiou ; Dimitrios Tsoumakos ; Ioannis Konstantinou ; Panagiotis Karras ; Nectarios Koziris

【Abstract】: The proliferation of data in RDF format has resulted in the emergence of a plethora of specialized management systems. While the ability to adapt to the complexity of a SPARQL query -- given their inherent diversity -- is crucial, current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this demonstration we present H2RDF+, an RDF store that efficiently performs distributed Merge and Sort-Merge joins using a multiple-index scheme over HBase indexes. Through a greedy planner that incorporates our cost-model, it adaptively chooses either single- or multi-machine query execution based on join complexity. In this paper, we present its key scientific contributions and allow participants to interact with an H2RDF+ deployment over a Cloud infrastructure. Using a web-based GUI we allow users to load different datasets (both real and synthetic), apply any query (custom or predefined) and monitor its execution. By allowing real-time inspection of cluster status, response times and committed resources, the audience can evaluate the validity of H2RDF+'s claims and perform direct comparisons to two other state-of-the-art RDF stores.

【Keywords】: hadoop; hbase; joins; mapreduce; nosql; rdf; sparql

92. DoomDB: kill the query.

Paper Link】 【Pages】:913-916

【Authors】: Carsten Binnig ; Abdallah Salama ; Erfan Zamanian

【Abstract】: Typically, fault-tolerance in parallel database systems is handled by restarting a query completely when a node failure happens. However, when deploying a parallel database on a cluster of commodity machines or on IaaS offerings such as Amazon's Spot Instances, node failures are a common case. This requires a more fine-grained fault-tolerance scheme. Therefore, most recent parallel data management platforms such as Hadoop or Shark use a fine-grained fault-tolerance scheme, which materializes all intermediate results in order to be able to recover from mid-query faults. While such a fine-grained fault-tolerance scheme is able to efficiently handle node failures for complex and long-running queries, it is not optimal for short-running latency-sensitive queries since the additional costs for materialization often outweigh the costs for actually executing the query. In this demo, we showcase our novel cost-based fault-tolerance scheme in XDB. It selects which intermediate results to materialize such that the overall query runtime is minimized in the presence of node failures. For the demonstration, we present a computer game called DoomDB. DoomDB is designed as an ego-shooter game with the goal of killing nodes in an XDB database cluster and thus preventing a given query from producing its final result in a given time frame. One interesting use-case of DoomDB is to use it for crowdsourcing the testing activities of XDB.

【Keywords】:

Panel 1

93. Should we all be teaching "intro to data science" instead of "intro to databases"?

Paper Link】 【Pages】:917-918

【Authors】: Bill Howe ; Michael J. Franklin ; Juliana Freire ; James Frew ; Tim Kraska ; Raghu Ramakrishnan

【Abstract】: The Database Community has a unique perspective on the challenges and solutions of long-term management of data and the value of data as a resource. In current computer science curricula, however, these insights are typically locked up in the context of the traditional Intro to Databases class that was developed years (or in some cases, decades) before the modern concept of Data Science arose and embedded in the discussion of legacy data management systems. We consider how to bring these concepts front and center into the emerging wave of Data Science courses, degree programs and even departments.

【Keywords】: data science; databases

Research session 16: distributed and parallel data management 4

94. Characterizing and selecting fresh data sources.

Paper Link】 【Pages】:919-930

【Authors】: Theodoros Rekatsinas ; Xin Luna Dong ; Divesh Srivastava

【Abstract】: Data integration is a challenging task due to the large numbers of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. The problem was studied for static sources and used the accuracy of data fusion to quantify the integration profit. In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that for a large class of practical cases, near-optimal solutions can be found, propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.

【Keywords】: data integration; dynamic data sources; source selection

95. Sloth: being lazy is a virtue (when issuing database queries).

Paper Link】 【Pages】:931-942

【Authors】: Alvin Cheung ; Samuel Madden ; Armando Solar-Lezama

【Abstract】: Many web applications store persistent data in databases. During execution, such applications spend a significant amount of time communicating with the database for retrieval and storing of persistent data over the network. These network round trips represent a significant fraction of the overall execution time for many applications and as a result increase their latency. While there has been prior work that aims to eliminate round trips by batching queries, such approaches are limited by 1) the requirement that developers manually identify batching opportunities, or 2) the fact that they employ static program analysis techniques that cannot exploit many opportunities for batching. In this paper, we present Sloth, a new system that extends traditional lazy evaluation to expose query batching opportunities during application execution, even across loops, branches, and method boundaries. We evaluated Sloth using over 100 benchmarks from two large-scale open-source applications, and achieved up to a 3x reduction in page load time by delaying computation.

【Keywords】: lazy evaluation; optimization; query batching
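
A toy sketch of the underlying idea of lazily deferring queries so that independent ones can be shipped in a single round trip; the LazyQuery class and fake_db stand-in are hypothetical and do not model Sloth's bytecode-level transformation or JDBC batching.

```python
class LazyQuery:
    """Defers execution; .value() forces the whole pending batch at once."""
    pending = []                      # queries accumulated since the last flush

    def __init__(self, sql, run_batch):
        self.sql, self.run_batch, self.result = sql, run_batch, None
        LazyQuery.pending.append(self)

    def value(self):
        if self.result is None:       # force: ship every pending query together
            batch, LazyQuery.pending = LazyQuery.pending, []
            for q, res in zip(batch, self.run_batch([q.sql for q in batch])):
                q.result = res
        return self.result

def fake_db(batch_of_sql):            # stand-in for one network round trip
    print("round trip with", len(batch_of_sql), "queries")
    return [f"rows({sql})" for sql in batch_of_sql]

user  = LazyQuery("SELECT * FROM users WHERE id = 1", fake_db)
posts = LazyQuery("SELECT * FROM posts WHERE author = 1", fake_db)
# Nothing has hit the database yet; the first force sends both queries together.
print(user.value(), posts.value())
```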

96. Dynamically optimizing queries over large scale data platforms.

Paper Link】 【Pages】:943-954

【Authors】: Konstantinos Karanasos ; Andrey Balmin ; Marcel Kutsch ; Fatma Ozcan ; Vuk Ercegovac ; Chunyang Xia ; Jesse Jackson

【Abstract】: Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.

【Keywords】: adaptive query processing; large-scale data platforms; pilot runs; query optimization

97. A software-defined networking based approach for performance management of analytical queries on distributed data stores.

Paper Link】 【Pages】:955-966

【Authors】: PengCheng Xiong ; Hakan Hacigümüs ; Jeffrey F. Naughton

【Abstract】: Nowadays data analytics applications are accessing more and more data from distributed data stores, creating a large amount of data traffic on the network. Therefore, distributed analytic queries are prone to suffer from poor performance when they encounter network contention, which can be quite common in a shared network. Typical distributed query optimizers do not have a way to solve this problem because they treat the network as a black-box: they are unable to monitor it, let alone control it. With the new era of software-defined networking (SDN), we show how SDN can be effectively exploited for performance management for analytical queries in distributed data store environments. More specifically, we present a group of methods to leverage SDN's visibility into and control of the network's state that enable distributed query processors to achieve performance improvements and differentiation for analytical queries. We demonstrate the effectiveness of the methods through detailed experimental studies on a system running on a software-defined network with commercial switches. To the best of our knowledge, this is the first work to analyze and show the opportunities of SDN for distributed query optimization. It is our hope that this will open up a rich area of research and technology development in distributed data intensive computing.

【Keywords】: analytical queries; distributed data stores; software-defined networking

Research session 17: graph analytics 4

98. The pursuit of a good possible world: extracting representative instances of uncertain graphs.

Paper Link】 【Pages】:967-978

【Authors】: Panos Parchas ; Francesco Gullo ; Dimitris Papadias ; Francesco Bonchi

【Abstract】: Data in several applications can be represented as an uncertain graph, whose edges are labeled with a probability of existence. Exact query processing on uncertain graphs is prohibitive for most applications, as it involves evaluation over an exponential number of instantiations. Even approximate processing based on sampling is usually extremely expensive since it requires a vast number of samples to achieve reasonable quality guarantees. To overcome these problems, we propose algorithms for creating deterministic representative instances of uncertain graphs that maintain the underlying graph properties. Specifically, our algorithms aim at preserving the expected vertex degrees because they capture well the graph topology. Conventional processing techniques can then be applied on these instances to closely approximate the result on the uncertain graph. We experimentally demonstrate, with real and synthetic uncertain graphs, that indeed the representative instances can be used to answer, efficiently and accurately, queries based on several properties such as shortest path distance, clustering coefficient and betweenness centrality.

【Keywords】: possible world; representative; uncertain graph
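
A minimal greedy sketch of extracting a deterministic instance whose vertex degrees track the expected degrees of an uncertain graph; the edge probabilities are made up and the greedy rule is only a stand-in for the paper's algorithms.

```python
from collections import defaultdict

# Uncertain graph: (u, v, probability of existence).
uncertain = [("a", "b", 0.9), ("a", "c", 0.5), ("b", "c", 0.4), ("c", "d", 0.8)]

# Expected degree of each vertex = sum of incident edge probabilities.
expected = defaultdict(float)
for u, v, p in uncertain:
    expected[u] += p
    expected[v] += p

# Greedy: consider edges by decreasing probability and keep an edge only if
# both endpoints are still below their expected degree.
kept, degree = [], defaultdict(int)
for u, v, p in sorted(uncertain, key=lambda e: -e[2]):
    if degree[u] < expected[u] and degree[v] < expected[v]:
        kept.append((u, v))
        degree[u] += 1
        degree[v] += 1

print("expected degrees:", dict(expected))
print("representative instance:", kept, "degrees:", dict(degree))
```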

99. Navigating the maze of graph analytics frameworks using massive graph datasets.

Paper Link】 【Pages】:979-990

【Authors】: Nadathur Satish ; Narayanan Sundaram ; Md. Mostofa Ali Patwary ; Jiwon Seo ; Jongsoo Park ; M. Amber Hassaan ; Shubho Sengupta ; Zhaoming Yin ; Pradeep Dubey

【Abstract】: Graph algorithms are becoming increasingly important for analyzing large datasets in many fields. Real-world graph data follows a pattern of sparsity, that is not uniform but highly skewed towards a few items. Implementing graph traversal, statistics and machine learning algorithms on such data in a scalable manner is quite challenging. As a result, several graph analytics frameworks (GraphLab, CombBLAS, Giraph, SociaLite and Galois among others) have been developed, each offering a solution with different programming models and targeted at different users. Unfortunately, the "Ninja performance gap" between optimized code and most of these frameworks is very large (2-30X for most frameworks and up to 560X for Giraph) for common graph algorithms, and moreover varies widely with algorithms. This makes the end-users' choice of graph framework dependent not only on ease of use but also on performance. In this work, we offer a quantitative roadmap for improving the performance of all these frameworks and bridging the "ninja gap". We first present hand-optimized baselines that get performance close to hardware limits and higher than any published performance figure for these graph algorithms. We characterize the performance of both this native implementation as well as popular graph frameworks on a variety of algorithms. This study helps end-users delineate bottlenecks arising from the algorithms themselves vs. programming model abstractions vs. the framework implementations. Further, by analyzing the system-level behavior of these frameworks, we obtain bottlenecks that are agnostic to specific algorithms. We recommend changes to alleviate these bottlenecks (and implement some of them) and reduce the performance gap with respect to native code. These changes will enable end-users to choose frameworks based mostly on ease of use.

【Keywords】: Galois; analysis; cluster; combblas; frameworks; giraph; graph analytics; graphlab; graphs; performance; socialite

100. Local search of communities in large graphs.

Paper Link】 【Pages】:991-1002

【Authors】: Wanyun Cui ; Yanghua Xiao ; Haixun Wang ; Wei Wang

【Abstract】: Community search is important in social network analysis. For a given vertex in a graph, the goal is to find the best community the vertex belongs to. Intuitively, the best community for a given vertex should be in the vicinity of the vertex. However, existing solutions use global search to find the best community. These algorithms, although straight-forward, are very costly, as all vertices in the graph may need to be visited. In this paper, we propose a local search strategy, which searches in the neighborhood of a vertex to find the best community for the vertex. We show that, because the minimum degree measure used to evaluate the goodness of a community is not monotonic, designing efficient local search solutions is a very challenging task. We present theories and algorithms of local search to address this challenge. The efficiency of our local search strategy is verified by extensive experiments on both synthetic networks and a variety of real networks with millions of nodes.

【Keywords】: community search; graph mining; social networks
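
For context, a sketch of the classic global baseline the paper improves on: iteratively peel minimum-degree vertices and keep the best minimum degree seen while the query vertex survives. The example graph is made up, and the sketch omits both the paper's local search strategy and the restriction of the answer to the query vertex's connected component.

```python
def best_community_global(adj, query):
    """Peel min-degree vertices; return the vertex set containing `query`
    with the largest minimum degree observed along the way."""
    alive = set(adj)
    best, best_mindeg = set(alive), 0
    while query in alive and len(alive) > 1:
        deg = {v: len(adj[v] & alive) for v in alive}
        mindeg = min(deg.values())
        if mindeg > best_mindeg:
            best, best_mindeg = set(alive), mindeg
        # Remove one minimum-degree vertex (never the query vertex if avoidable).
        victims = [v for v in alive if deg[v] == mindeg and v != query]
        if not victims:
            break
        alive.remove(victims[0])
    return best, best_mindeg

adj = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5},
    5: {4, 6}, 6: {5},
}
print(best_community_global(adj, query=1))   # the 4-clique {1, 2, 3, 4}, min degree 3
```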

101. Mining statistically significant connected subgraphs in vertex labeled graphs.

Paper Link】 【Pages】:1003-1014

【Authors】: Akhil Arora ; Mayank Sachan ; Arnab Bhattacharya

【Abstract】: The steady growth of graph data in various applications has resulted in wide-spread research in finding significant sub-structures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. The labels may be either discrete where they assume values from a pre-defined set, or continuous where they assume values from a real domain and can be multi-dimensional. We motivate the problem citing applications in spatial co-location rule mining and outlier detection. We use the chi-square statistic as a measure for quantifying the statistical significance. Since the number of connected subgraphs in a general graph is exponential, the naive algorithm is impractical. We introduce the notion of contracting edges that merge vertices together to form a super-graph. We show that if the graph is dense enough to start with, the number of super-vertices is quite low, and therefore, running the naive algorithm on the super-graph is feasible. If the graph is not dense, we provide an algorithm to reduce the number of super-vertices further, thereby providing a trade-off between accuracy and time. Empirically, the chi-square value obtained by this reduction is always within 96% of the optimal value, while the time spent is only a fraction of that for the optimal. In addition, we also show that our algorithm is scalable and it significantly enhances the ability to analyze real datasets.

【Keywords】: chi-square; connected subgraphs; graph mining; significant subgraphs; statistical significance; vertex labels

Research session 18: query processing and optimization 1 4

102. Complete yet practical search for minimal query reformulations under constraints.

Paper Link】 【Pages】:1015-1026

【Authors】: Ioana Ileana ; Bogdan Cautis ; Alin Deutsch ; Yannis Katsis

【Abstract】: We revisit the Chase&Backchase (C&B) algorithm for query reformulation under constraints, which provides a uniform solution to such particular-case problems as view-based rewriting under constraints, semantic query optimization, and physical access path selection in query optimization. For an important class of queries and constraints, C&B has been shown to be complete, i.e. guaranteed to find all (join-)minimal reformulations under constraints. C&B is based on constructing a canonical rewriting candidate called a universal plan, then inspecting its exponentially many sub-queries in search for minimal reformulations, essentially removing redundant joins in all possible ways. This inspection involves chasing the subquery. Because of the resulting exponentially many chases, the conventional wisdom has held that completeness is a concept of mainly theoretical interest. We show that completeness can be preserved at practically relevant cost by introducing Prov-C&B, a novel reformulation algorithm that instruments the chase to maintain provenance information connecting the joins added during the chase to the universal plan subqueries responsible for adding these joins. This allows it to directly "read off" the minimal reformulations from the result of a single chase of the universal plan, saving exponentially many chases of its subqueries. We exhibit natural scenarios yielding speedups of over two orders of magnitude between the execution of the best view-based rewriting found by a commercial query optimizer and that of the best rewriting found by Prov-C&B (which the optimizer misses because of limited reasoning about constraints).

【Keywords】: chase; database views; integrity constraints; query optimization

103. Query shredding: efficient relational evaluation of queries over nested multisets.

Paper Link】 【Pages】:1027-1038

【Authors】: James Cheney ; Sam Lindley ; Philip Wadler

【Abstract】: Nested relational query languages have been explored extensively, and underlie industrial language-integrated query systems such as Microsoft's LINQ. However, relational databases do not natively support nested collections in query results. This can lead to major performance problems: if programmers write queries that yield nested results, then such systems typically either fail or generate a large number of queries. We present a new approach to query shredding, which converts a query returning nested data to a fixed number of SQL queries. Our approach, in contrast to prior work, handles multiset semantics, and generates an idiomatic SQL:1999 query directly from a normal form for nested queries. We provide a detailed description of our translation and present experiments showing that it offers comparable or better performance than a recent alternative approach on a range of examples.

【Keywords】: language-integrated query; querying nested collections

104. Plan bouquets: query processing without selectivity estimation.

Paper Link】 【Pages】:1039-1050

【Authors】: Anshuman Dutt ; Jayant R. Haritsa

【Abstract】: Selectivity estimates for optimizing OLAP queries often differ significantly from those actually encountered during query execution, leading to poor plan choices and inflated response times. We propose here a conceptually new approach to address this problem, wherein the compile-time estimation process is completely eschewed for error-prone selectivities. Instead, a small "bouquet" of plans is identified from the set of optimal plans in the query's selectivity error space, such that at least one among this subset is near-optimal at each location in the space. Then, at run time, the actual selectivities of the query are incrementally "discovered" through a sequence of partial executions of bouquet plans, eventually identifying the appropriate bouquet plan to execute. The duration and switching of the partial executions is controlled by a graded progression of isocost surfaces projected onto the optimal performance profile. We prove that this construction results in bounded overheads for the selectivity discovery process and consequently, guaranteed worst-case performance. In addition, it provides repeatable execution strategies across different invocations of a query. The plan bouquet approach has been empirically evaluated on both PostgreSQL and a commercial DBMS, over the TPC-H and TPC-DS benchmark environments. Our experimental results indicate that, even with conservative assumptions, it delivers substantial improvements in the worst-case behavior, without impairing the average-case performance, as compared to the native optimizers of these systems. Moreover, the bouquet technique can be largely implemented using existing optimizer infrastructure, making it relatively easy to incorporate in current database engines. Overall, the bouquet approach provides novel guarantees that open up new possibilities for robust query processing.

【Keywords】: plan bouquets; robust query processing; selectivity estimation
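
A toy sketch of the bouquet execution discipline: run candidate plans under geometrically increasing cost budgets until one completes within its budget. The two plans, their cost functions, and the budget schedule are invented for illustration and are not the paper's isocost-surface construction.

```python
def run_with_budget(plan, true_selectivity, budget):
    """Pretend to execute `plan`; it succeeds only if its actual cost fits the budget."""
    actual_cost = plan["cost_at"](true_selectivity)
    return (True, actual_cost) if actual_cost <= budget else (False, budget)

def execute_bouquet(bouquet, true_selectivity, start_budget=1.0):
    """Try plans in cost order under doubling budgets until one completes."""
    spent, budget = 0.0, start_budget
    while True:
        for plan in bouquet:                       # cheapest-first within a budget
            done, cost = run_with_budget(plan, true_selectivity, budget)
            spent += cost
            if done:
                return plan["name"], spent
        budget *= 2                                # no plan finished: raise the budget

# Two hypothetical plans whose cost depends on the (unknown) selectivity.
bouquet = [
    {"name": "index-nested-loops", "cost_at": lambda s: 10 + 1000 * s},
    {"name": "hash-join",          "cost_at": lambda s: 120 + 40 * s},
]
print(execute_bouquet(bouquet, true_selectivity=0.4, start_budget=16))
```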

105. Schema-free SQL.

Paper Link】 【Pages】:1051-1062

【Authors】: Fei Li ; Tianyin Pan ; Hosagrahar Visvesvaraya Jagadish

【Abstract】: Querying data in relational databases is often challenging since SQL requires its users to know the exact schema of the database, the roles of various entities in a query, and the precise join paths to be followed. On the other hand, keyword search is unable to express much desired query semantics. In this paper, we propose a query language, Schema-free SQL, which enables its users to query a relational database using whatever partial schema they know. If they know the full schema, they can write full SQL. But, to the extent they do not know the schema, Schema-free SQL is tolerant of unknown or inaccurately specified relation names and attribute names, and it also does not require information regarding which relations are involved and how they are joined. We present techniques to evaluate Schema-free SQL by first converting it to full SQL. We show experimentally that a small amount of schema information, which one can reasonably expect most users to have, is enough to get queries evaluated as if they had been completely and correctly specified.

【Keywords】: query language; relational databases; usability

Demo C 10

106. iCheck: computationally combating "lies, d-ned lies, and statistics".

Paper Link】 【Pages】:1063-1066

【Authors】: You Wu ; Brett Walenz ; Peggy Li ; Andrew Shim ; Emre Sonmez ; Pankaj K. Agarwal ; Chengkai Li ; Jun Yang ; Cong Yu

【Abstract】: Are you fed up with "lies, d---ned lies, and statistics" made up from data in our media? For claims based on structured data, we present a system to automatically assess the quality of claims (beyond their correctness) and counter misleading claims that cherry-pick data to advance their conclusions. The key insight is to model such claims as parameterized queries and consider how parameter perturbations affect their results. We demonstrate our system on claims drawn from U.S. congressional voting records, sports statistics, and publication records of database researchers.

【Keywords】: computational journalism; data-driven fact-checking; system demonstration

107. ABS: a system for scalable approximate queries with accuracy guarantees.

Paper Link】 【Pages】:1067-1070

【Authors】: Kai Zeng ; Shi Gao ; Jiaqi Gu ; Barzan Mozafari ; Carlo Zaniolo

【Abstract】: Approximate Query Processing (AQP) based on sampling is critical for supporting timely and cost-effective analytics over big data. To be applied successfully, AQP must be accompanied by reliable estimates on the quality of sample-produced approximate answers; the two main techniques used in the past for this purpose are (i) closed-form analytic error estimation, and (ii) the bootstrap method. Approach (i) is extremely efficient but lacks generality, whereas (ii) is general but suffers from high computational overhead. Our recently introduced Analytical Bootstrap method combines the strengths of both approaches and provides the basis for our ABS system, which will be demonstrated at the conference. The ABS system models bootstrap by a probabilistic relational model, and extends relational algebra with operations on probabilistic relations to predict the distributions of the AQP results. Thus, ABS entails a very fast computation of bootstrap-based quality measures for a general class of SQL queries, which is several orders of magnitude faster than the standard simulation-based bootstrap. In this demo, we will demonstrate the generality, automaticity, and ease of use of the ABS system, and its superior performance over the traditional approaches described above.

【Keywords】: approximate query processing; bootstrap; error estimation
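
For contrast with ABS's analytical approach, a minimal simulation-based bootstrap for the error of a sampled AVG, i.e., the expensive baseline the system replaces; the data and resampling count are toy values.

```python
import random
import statistics

random.seed(1)
population = [random.gauss(100, 20) for _ in range(100_000)]
sample = random.sample(population, 500)            # the AQP sample

estimate = statistics.mean(sample)                 # approximate answer to AVG(x)

# Simulation-based bootstrap: resample the sample with replacement many times
# and look at the spread of the re-computed aggregate.
B = 1000
boot_means = [statistics.mean(random.choices(sample, k=len(sample))) for _ in range(B)]
boot_means.sort()
lo, hi = boot_means[int(0.025 * B)], boot_means[int(0.975 * B)]

print(f"estimate {estimate:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```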

108. NADEEF/ER: generic and interactive entity resolution.

Paper Link】 【Pages】:1071-1074

【Authors】: Ahmed K. Elmagarmid ; Ihab F. Ilyas ; Mourad Ouzzani ; Jorge-Arnulfo Quiané-Ruiz ; Nan Tang ; Si Yin

【Abstract】:

【Keywords】: NADEEF; entity resolution; generic; interactive

109. SerpentTI: flexible analytics of users, boards and domains for pinterest.

Paper Link】 【Pages】:1075-1078

【Authors】: Alex Cheng ; Mary Malit ; Chuanxi Zhang ; Nick Koudas

【Abstract】: Pinterest is a pinboard style photo sharing web service that allows its users to manage, share and express their interests via a collection of theme based photos. A few design choices of Pinterest make it highly desirable to social media practitioners and marketers as a new, high quality data source for deep analysis, or as a complementary data stream to existing social data such as Twitter and Facebook. The analysis capabilities at the current Pinterest site are minimal, however, as the focus is currently on user experience. We provide a description of SerpentTI, a system that currently crawls, indexes and aggregates more than 31 million users, 96 million boards and 3.1 billion pins from Pinterest to enable flexible and deep analytics.

【Keywords】: analytics; pinterest; social media

110. Interactive redescription mining.

Paper Link】 【Pages】:1079-1082

【Authors】: Esther Galbrun ; Pauli Miettinen

【Abstract】: Exploratory data analysis consists of multiple iterated steps: a data mining method is run on the data, the results are interpreted, new insights are formed, and the resulting knowledge is utilized when executing the method in a next round, and so on until satisfactory results are obtained. We focus on redescription mining, a powerful data analysis method that aims at finding alternative descriptions of the same entities, for example, ways to characterize geographical regions in terms of both the fauna that inhabits them and their bioclimatic conditions, so-called bioclimatic niches. We present Siren, a tool for interactive redescription mining. It is designed to facilitate the exploratory analysis of data by providing a seamless environment for mining, visualizing and editing redescriptions in an interactive fashion, supporting the analysis process in all its stages. We demonstrate its use for exploratory data mining. Simultaneously, Siren exemplifies the power of the various visualizations and means of interaction integrated into it, techniques that reach beyond the task of redescription mining considered here to other analysis methods.

【Keywords】: brush-and-link; interactive data mining; parallel coordinates; redescription mining; visualization

111. ONTOCUBO: cube-based ontology construction and exploration.

Paper Link】 【Pages】:1083-1086

【Authors】: Carlos Garcia-Alvarado ; Carlos Ordonez

【Abstract】: One of the major challenges of big data analytics is the diverse information content, which has no pre-defined structure or classification. This is in contrast to the well-designed structure of a database specified on an ER model. A standard mechanism for understanding interrelationships and the structure of documents is using ontologies. With such motivation in mind, we present a system that enables data management and querying of documents based on ontologies by leveraging the functionality of the DBMS. In this paper, we present ONTOCUBO, a novel system based on our research for text summarization using ontologies and automatic extraction of concepts for building ontologies using Online Analytical Processing (OLAP) cubes. ONTOCUBO is a database-centric approach that excels in its performance, due to an SQL-based single pass summarization phase through the original data set that computes values such as keyword frequency, standard deviation, and lift. This approach is complemented with a set of User-Defined-Function-based algorithms that analyze the summarization results for concepts and their interrelationships. Finally, we show in detail our application, which not only extracts and builds an ontology, but also supports concept summarization and allows domain experts to explore and modify the resulting ontology.

【Keywords】: DBMS; OLAP; ontologies

112. An extendable framework for managing uncertain spatio-temporal data.

Paper Link】 【Pages】:1087-1090

【Authors】: Tobias Emrich ; Maximilian Franzke ; Hans-Peter Kriegel ; Johannes Niedermayer ; Matthias Renz ; Andreas Züfle

【Abstract】: This demonstration presents our Uncertain-Spatio-Temporal (UST) framework that we have developed in recent years. The framework not only allows users to visualize and explore spatio-temporal data consisting of (location, time, object)-triples but also provides an extensive codebase easily extensible and customizable by developers and researchers. The main research focus of this UST-framework is the explicit consideration of uncertainty, an aspect that is inherent in spatio-temporal data, due to infrequent position updates, due to physical limitations and due to power constraints. The UST-framework can be used to obtain a deeper intuition of the quality of spatio-temporal data models. Such models aim at estimating the position of a spatio-temporal object at a time where the object's position is not explicitly known, for example by using both historic (traffic-) pattern information, and by using explicit observations of objects. The UST-framework illustrates the resulting distributions by allowing a user to move forward and backward in time. Additionally, the framework allows users to specify simple spatio-temporal queries, such as spatio-temporal window queries and spatio-temporal nearest neighbor (NN) queries. Based on recently published theoretic concepts, the UST-framework allows users to visually explore the impact of different models and parameters on spatio-temporal data. The main result showcased by the UST-framework is a minimization of uncertainty by employing stochastic processes, leading to small expected distances between ground truth trajectories and modelled positions.

【Keywords】: framework; probabilistic; spatio-temporal; trajectories; uncertain

113. NewsNetExplorer: automatic construction and exploration of news information networks.

Paper Link】 【Pages】:1091-1094

【Authors】: Fangbo Tao ; George Brova ; Jiawei Han ; Heng Ji ; Chi Wang ; Brandon Norick ; Ahmed El-Kishky ; Jialu Liu ; Xiang Ren ; Yizhou Sun

【Abstract】: News data is one of the most abundant and familiar data sources. News data can be systematically utilized and explored by database, data mining, NLP and information retrieval researchers to demonstrate to the general public the power of advanced information technology. In our view, news data contains rich, inter-related and multi-typed data objects, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived and explored with such an information network if we systematically develop effective and scalable data-intensive information network analysis technologies. By further developing a set of information extraction, information network construction, and information network mining methods, we extract types, topical hierarchies and other semantic structures from news data, and construct a semi-structured news information network NewsNet. Further, we develop a set of news information network exploration and mining mechanisms that explore news in multi-dimensional space, which include (i) OLAP-based operations on the hierarchical dimensional and topical structures and rich text, such as cell summary, single dimension analysis, and promotion analysis, (ii) a set of network-based operations, such as similarity search and ranking-based clustering, and (iii) a set of hybrid operations or network-OLAP operations, such as entity ranking at different granularity levels. These form the basis of our proposed NewsNetExplorer system. Although some of these functions have been studied in recent research, effective and scalable realization of such functions in large networks still poses multiple challenging research problems. Moreover, some functions are our on-going research tasks. By integrating these functions, NewsNetExplorer not only provides us with insightful recommendations in the NewsNet exploration system but also helps us gain insight on how to perform effective information extraction, integration and mining in large unstructured datasets.

【Keywords】: information network construction; network-olap

114. IQR: an interactive query relaxation system for the empty-answer problem.

Paper Link】 【Pages】:1095-1098

【Authors】: Davide Mottin ; Alice Marascu ; Senjuti Basu Roy ; Gautam Das ; Themis Palpanas ; Yannis Velegrakis

【Abstract】: We present IQR, a system that demonstrates optimization-based interactive relaxations for queries that return an empty answer. Given an empty answer, IQR dynamically suggests one relaxation of the original query conditions at a time to the user, based on certain optimization objectives, and the user responds by either accepting or declining the relaxation, until the user arrives at a non-empty answer, or a non-empty answer is impossible to achieve with any further relaxations. The relaxation suggestions hinge on a probabilistic framework that takes into account the probability of the user accepting a suggested relaxation, as well as how much that relaxation serves towards the optimization objective. IQR accepts a wide variety of optimization objectives: user-centric objectives, such as minimizing the number of user interactions (i.e., effort) or returning relevant results, as well as seller-centric objectives, such as maximizing profit. IQR offers principled exact and approximate solutions for generating relaxations that are demonstrated using multiple, large real datasets.

【Keywords】: empty-answer problem; optimization framework; query relaxation

115. OceanRT: real-time analytics over large temporal data.

Paper Link】 【Pages】:1099-1102

【Authors】: Shiming Zhang ; Yin Yang ; Wei Fan ; Liang Lan ; Mingxuan Yuan

【Abstract】: We demonstrate OceanRT, a novel cloud-based infrastructure that performs online analytics in real time, over large-scale temporal data such as call logs from a telecommunication company. Apart from proprietary systems for which few details have been revealed, most existing big-data analytics systems are built on top of an offline, MapReduce-style infrastructure, which inherently limits their efficiency. In contrast, OceanRT employs a novel computing architecture consisting of interconnected Access Query Engines (AQEs), as well as a new storage scheme that ensures data locality and fast access for temporal data. Our preliminary evaluation shows that OceanRT can be up to 10x faster than Impala [10], 12x faster than Shark [5], and 200x faster than Hive [13]. The demo will show how OceanRT manages a real call log dataset (around 5TB per day) from a large mobile network operator in China. Besides presenting the processing of a few preset queries, we also allow the audience to issue ad hoc HiveQL [13] queries, watch how OceanRT answers them, and compare the speed of OceanRT with its competitors.

【Keywords】: design; management; performance

Research session 19: storage and indexing 3

116. H2O: a hands-free adaptive store.

Paper Link】 【Pages】:1103-1114

【Authors】: Ioannis Alagiannis ; Stratos Idreos ; Anastasia Ailamaki

【Abstract】: Modern state-of-the-art database systems are designed around a single data storage layout. This is a fixed decision that drives the whole architectural design of a database system, i.e., whether it is a row-store or a column-store. However, none of those choices is a universally good solution; different workloads require different storage layouts and data access methods in order to achieve good performance. In this paper, we present the H2O system which introduces two novel concepts. First, it is flexible to support multiple storage layouts and data access patterns in a single engine. Second, and most importantly, it decides on-the-fly, i.e., during query processing, which design is best for classes of queries and the respective data parts. At any given point in time, parts of the data might be materialized in various patterns purely depending on the query workload; as the workload changes and with every single query, the storage and access patterns continuously adapt. In this way, H2O makes no a priori and fixed decisions on how data should be stored, allowing each single query to enjoy a storage and access pattern which is tailored to its specific properties. We present a detailed analysis of H2O using both synthetic benchmarks and realistic scientific workloads. We demonstrate that while existing systems cannot achieve maximum performance across all workloads, H2O can always match the best case performance without requiring any tuning or workload knowledge.

【Keywords】: adaptive hybrids; adaptive storage; dynamic operators

117. Fine-grained partitioning for aggressive data skipping.

Paper Link】 【Pages】:1115-1126

【Authors】: Liwen Sun ; Michael J. Franklin ; Sanjay Krishnan ; Reynold S. Xin

【Abstract】: Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters. In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as an optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.
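
To make the data-skipping mechanism described above concrete, the hypothetical Python sketch below groups tuples into fixed-size blocks, stores the union of workload features each block satisfies as block metadata, and lets a query skip every block whose metadata rules out a match. The feature extraction via frequent itemset mining and the Balanced MaxSkip partitioner are the paper's actual contributions and are not reproduced here; the data, feature names, and block size are illustrative assumptions.

```python
# A minimal, hypothetical sketch of feature-based data skipping (not the
# paper's Balanced MaxSkip partitioner): tuples are grouped into blocks,
# each block stores the union of the features its tuples satisfy, and a
# query touching feature f skips every block whose metadata lacks f.

from itertools import islice

def make_blocks(tuples, features, block_size=4):
    """Group tuples into fixed-size blocks and record per-block feature unions."""
    blocks = []
    it = iter(tuples)
    while True:
        chunk = list(islice(it, block_size))
        if not chunk:
            break
        meta = {name for name, pred in features.items()
                if any(pred(t) for t in chunk)}
        blocks.append((meta, chunk))
    return blocks

def query(blocks, feature_name, pred):
    """Scan only blocks whose metadata says the feature may be present."""
    scanned, result = 0, []
    for meta, chunk in blocks:
        if feature_name not in meta:
            continue                      # skip the whole block
        scanned += len(chunk)
        result.extend(t for t in chunk if pred(t))
    return result, scanned

if __name__ == "__main__":
    rows = [{"region": r, "sales": s} for r, s in zip("AABBCCDD" * 3, range(24))]
    feats = {"region=A": lambda t: t["region"] == "A",
             "high_sales": lambda t: t["sales"] > 20}
    blocks = make_blocks(sorted(rows, key=lambda t: t["region"]), feats)
    hits, scanned = query(blocks, "region=A", feats["region=A"])
    print(len(hits), "matches, scanned", scanned, "of", len(rows), "tuples")
```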

【Keywords】: algorithms; data warehouse; partitioning; query processing

118. DSH: data sensitive hashing for high-dimensional k-nn search.

Paper Link】 【Pages】:1127-1138

【Authors】: Jinyang Gao ; Hosagrahar Visvesvaraya Jagadish ; Wei Lu ; Beng Chin Ooi

【Abstract】: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
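
As a point of reference for the hashing setup the abstract describes, here is a minimal sketch of generic random-hyperplane LSH for k-NN candidate retrieval, i.e., the data-oblivious baseline whose unbalanced buckets DSH is designed to avoid by learning data-sensitive hash functions. It is not the DSH construction itself; dimensions, table counts, and data are arbitrary assumptions.

```python
# A generic random-hyperplane LSH sketch for k-NN candidate retrieval.
# This is the data-oblivious baseline that DSH improves on by *learning*
# the hash family from the data distribution; it is not DSH itself.

import math
import random

def make_tables(dim, num_tables=4, bits=8, seed=0):
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(num_tables)]

def signature(point, hyperplanes):
    return tuple(1 if sum(w * x for w, x in zip(h, point)) >= 0 else 0
                 for h in hyperplanes)

def build_index(points, tables):
    index = [dict() for _ in tables]
    for pid, point in enumerate(points):
        for t, planes in enumerate(tables):
            index[t].setdefault(signature(point, planes), []).append(pid)
    return index

def knn(query, points, tables, index, k=3):
    # Union the candidate buckets from every table, then rank by true distance.
    cand = set()
    for t, planes in enumerate(tables):
        cand.update(index[t].get(signature(query, planes), []))
    return sorted(cand, key=lambda pid: math.dist(query, points[pid]))[:k]

if __name__ == "__main__":
    rng = random.Random(1)
    pts = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(500)]
    tabs = make_tables(dim=16)
    idx = build_index(pts, tabs)
    print("approximate 3-NN ids:", knn(pts[0], pts, tabs, idx))
```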

【Keywords】: hashing; lsh; high dimensions; similarity search

Research session 20: top-k query processing 3

119. Fast and unified local search for random walk based k-nearest-neighbor query in large graphs.

Paper Link】 【Pages】:1139-1150

【Authors】: Yubao Wu ; Ruoming Jin ; Xiang Zhang

【Abstract】: Given a large graph and a query node, finding its k-nearest-neighbor (kNN) is a fundamental problem. Various random walk based measures have been developed to measure the proximity (similarity) between nodes. Existing algorithms for the random walk based top-k proximity search can be categorized as global and local methods based on their search strategies. Global methods usually require an expensive precomputing step. By only searching the nodes near the query node, local methods have the potential to support more efficient queries. However, most existing local search methods cannot guarantee the exactness of the solution. Moreover, they are usually designed for specific proximity measures. Can we devise an efficient local search method that applies to different measures and also guarantees result exactness? In this paper, we present FLoS (Fast Local Search), a unified local search method for efficient and exact top-k proximity query in large graphs. FLoS is based on the no-local-optimum property of proximity measures. We show that many measures have no local optimum. Utilizing this property, we introduce several simple operations on transition probabilities, which allow developing lower and upper bounds on the proximity. The bounds monotonically converge to the exact proximity when more nodes are visited. We further show that FLoS can also be applied to measures having a local optimum by utilizing relationships among different measures. We perform comprehensive experiments to evaluate the efficiency and applicability of the proposed method.
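
To illustrate the kind of lower and upper bounds the abstract refers to, the sketch below runs a standard forward-push computation of personalized PageRank, one random-walk proximity measure: after any number of pushes, p[v] is a lower bound on the proximity of v to the source and p[v] plus the total remaining residual is an upper bound, and the two converge as more local nodes are visited. This is a generic textbook-style routine under those assumptions, not the authors' FLoS algorithm or its treatment of other measures.

```python
# Forward-push personalized PageRank (PPR) sketch. After any number of
# pushes, p[v] is a lower bound on PPR(source, v) and p[v] + sum(residuals)
# is an upper bound, so the bounds tighten as more local nodes are visited.
# Illustrates the bound-based local search idea only; it is not FLoS.

from collections import defaultdict

def forward_push(graph, source, alpha=0.15, eps=1e-4):
    """graph: dict node -> list of neighbours (undirected, unweighted)."""
    p = defaultdict(float)              # settled probability mass (lower bounds)
    r = defaultdict(float)              # residual mass still to be pushed
    r[source] = 1.0
    frontier = [source]
    while frontier:
        u = frontier.pop()
        deg = len(graph[u])
        if deg == 0 or r[u] / deg <= eps:
            continue
        p[u] += alpha * r[u]
        share = (1.0 - alpha) * r[u] / deg
        r[u] = 0.0
        for v in graph[u]:
            r[v] += share
            if r[v] / max(len(graph[v]), 1) > eps:
                frontier.append(v)
    slack = sum(r.values())             # total residual mass bounds the error
    return p, slack

if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
    p, slack = forward_push(g, source=0)
    for v in sorted(g):
        print(f"node {v}: {p[v]:.4f} <= PPR <= {p[v] + slack:.4f}")
```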

【Keywords】: local search; nearest neighbors; proximity search; random walk; top-k search

120. Global immutable region computation.

Paper Link】 【Pages】:1151-1162

【Authors】: Jilian Zhang ; Kyriakos Mouratidis ; HweeHwa Pang

【Abstract】: A top-k query shortlists the k records in a dataset that best match the user's preferences. To indicate her preferences, the user typically determines a numeric weight for each data dimension (i.e., attribute). We refer to these weights collectively as the query vector. Based on this vector, each data record is implicitly mapped to a score value (via a weighted sum function). The records with the k largest scores are reported as the result. In this paper we propose an auxiliary feature to standard top-k query processing. Specifically, we compute the maximal locus within which the query vector incurs no change in the current top-k result. In other words, we compute all possible query weight settings that produce exactly the same top-k result as the user's original query. We call this locus the global immutable region (GIR). The GIR can be used as a guide to query vector readjustments, as a sensitivity measure for the top-k result, as well as to enable effective result caching. We develop efficient algorithms for GIR computation, and verify their robustness using a variety of real and synthetic datasets.
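
A brute-force way to get intuition for the immutable-region notion is to compute the top-k under the original weight vector and then probe randomly perturbed weight vectors to see whether they preserve the same result, as in the sketch below. The paper computes the maximal such locus (the GIR) exactly; this sampling illustration, with made-up data and parameters, only demonstrates the definition.

```python
# A brute-force illustration of the (global) immutable region idea: the set
# of weight vectors that leave the top-k result unchanged. The paper computes
# this locus exactly; here we merely probe sampled weight vectors to see
# which of them preserve the original top-k set. Data and names are made up.

import random

def topk(records, weights, k):
    score = lambda r: sum(w * x for w, x in zip(weights, r))
    return set(sorted(range(len(records)), key=lambda i: -score(records[i]))[:k])

def probe_immutable_region(records, weights, k, radius=0.2, samples=2000, seed=0):
    rng = random.Random(seed)
    original = topk(records, weights, k)
    kept = 0
    for _ in range(samples):
        w = [max(0.0, wi + rng.uniform(-radius, radius)) for wi in weights]
        if topk(records, w, k) == original:
            kept += 1
    return original, kept / samples

if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.random() for _ in range(3)] for _ in range(50)]
    result, frac = probe_immutable_region(data, weights=[0.5, 0.3, 0.2], k=5)
    print("top-5:", sorted(result))
    print(f"{frac:.0%} of sampled nearby weight vectors keep the same top-5")
```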

【Keywords】: sensitivity analysis; top-k search

121. Answering top-k representative queries on graph databases.

Paper Link】 【Pages】:1163-1174

【Authors】: Sayan Ranu ; Minh X. Hoang ; Ambuj K. Singh

【Abstract】: Given a function that classifies a data object as relevant or irrelevant, we consider the task of selecting k objects that best represent all relevant objects in the underlying database. This problem occurs naturally when analysts want to familiarize themselves with the relevant objects in a database using a small set of k exemplars. In this paper, we solve the problem of top-k representative queries on graph databases. While graph databases model a wide range of scientific data, solving the problem in the context of graphs presents us with unique challenges due to the inherent complexity of matching structures. Furthermore, top-k representative queries map to the classic Set Cover problem, making it NP-hard. To overcome these challenges, we develop a greedy approximation with theoretical guarantees on the quality of the answer set, noting that a better approximation is not feasible in polynomial time. To further optimize the quadratic computational cost of the greedy algorithm, we propose an index structure called NB-Index to index the θ-neighborhoods of the database graphs by employing a novel combination of Lipschitz embedding and agglomerative clustering. Extensive experiments on real graph datasets validate the efficiency and effectiveness of the proposed techniques that achieve up to two orders of magnitude speed-up over state-of-the-art algorithms.

【Keywords】: graph search; representative power; top-k

Research session 21: entity matching 4

122. Modeling entity evolution for temporal record matching.

Paper Link】 【Pages】:1175-1186

【Authors】: Yueh-Hsuan Chiang ; AnHai Doan ; Jeffrey F. Naughton

【Abstract】: Temporal record matching recognizes that if the entities represented by the records change over time, approaches that use temporal information may do better than approaches that do not. Any such temporal matching method relies at its heart on a temporal model that captures information about how entities evolve. In their pioneering work, Li et al. used an efficiently computable model that simply tries to predict if an attribute is expected to change over a given time interval. In our work, we propose and evaluate a more detailed model that focuses on the probability that a given attribute value reappears over time. The intuition here is that an entity might change its attribute value in a way that depends on its past values. In addition, our model considers sets of records (rather than simply pairs of records) to improve robustness and accuracy. Experimental results show that the resulting approach improves both accuracy and resistance to noise while incurring a minimal overhead.

【Keywords】: data integration; data matching; entity resolution

123. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation.

Paper Link】 【Pages】:1187-1198

【Authors】: Qi Li ; Yaliang Li ; Jing Gao ; Bo Zhao ; Wei Fan ; Jiawei Han

【Abstract】: In many applications, one can obtain descriptions about the same objects or events from a variety of sources. As a result, this will inevitably lead to data or information conflicts. One important problem is to identify the true information (i.e., the truths) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown which one is more reliable a priori. Moreover, each source possesses a variety of properties with different data types. An accurate estimation of source reliability has to be made by modeling multiple properties in a unified model. Existing conflict resolution work either does not conduct source reliability estimation, or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework where truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to recognize the characteristics of various data types, and efficient computation approaches are developed. Experiments on real-world weather, stock and flight data as well as simulated multi-source data demonstrate the necessity of jointly modeling different data types in the proposed framework.
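
The joint estimation the abstract describes can be sketched as an alternating optimization: fix the source weights and compute the truths (weighted means for numeric attributes, weighted votes for categorical ones), then fix the truths and recompute each source's weight from its deviation. The simplified Python below uses an inverse-deviation weight update rather than the paper's specific loss functions and weight formula, and the claims data is invented.

```python
# A simplified alternating-optimization sketch of truth discovery over
# heterogeneous data: numeric attributes use a weighted mean, categorical
# attributes a weighted vote, and each source's weight shrinks with its
# deviation from the current truths. The exact loss functions and weight
# update in the paper differ; this only illustrates the joint iteration.

def discover_truths(claims, numeric_attrs, iters=10):
    """claims: dict source -> dict object -> dict attribute -> claimed value."""
    sources = list(claims)
    weights = {s: 1.0 for s in sources}
    objects = {o for s in claims for o in claims[s]}
    truths = {}
    for _ in range(iters):
        # Step 1: truths given the current source weights.
        for o in objects:
            truths[o] = {}
            attrs = {a for s in sources for a in claims[s].get(o, {})}
            for a in attrs:
                obs = [(weights[s], claims[s][o][a])
                       for s in sources if a in claims[s].get(o, {})]
                if a in numeric_attrs:
                    total = sum(w for w, _ in obs)
                    truths[o][a] = sum(w * v for w, v in obs) / total
                else:
                    votes = {}
                    for w, v in obs:
                        votes[v] = votes.get(v, 0.0) + w
                    truths[o][a] = max(votes, key=votes.get)
        # Step 2: weights given the current truths (inverse of deviation).
        for s in sources:
            dev = 0.0
            for o, rec in claims[s].items():
                for a, v in rec.items():
                    t = truths[o][a]
                    dev += (v - t) ** 2 if a in numeric_attrs else (v != t)
            weights[s] = 1.0 / (dev + 1e-6)
    return truths, weights

if __name__ == "__main__":
    claims = {
        "s1": {"flight42": {"delay": 10.0, "gate": "B3"}},
        "s2": {"flight42": {"delay": 12.0, "gate": "B3"}},
        "s3": {"flight42": {"delay": 55.0, "gate": "C1"}},
    }
    truths, weights = discover_truths(claims, numeric_attrs={"delay"})
    print(truths["flight42"], {s: round(w, 4) for s, w in weights.items()})
```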

【Keywords】: data fusion; heterogeneous data; truth discovery

124. A probabilistic model for linking named entities in web text with heterogeneous information networks.

Paper Link】 【Pages】:1199-1210

【Authors】: Wei Shen ; Jiawei Han ; Jianyong Wang

【Abstract】: Heterogeneous information networks that consist of multi-type, interconnected objects are becoming ubiquitous and increasingly popular, such as social media networks and bibliographic networks. The task to link named entity mentions detected from the unstructured Web text with their corresponding entities existing in a heterogeneous information network is of practical importance for the problem of information network population and enrichment. This task is challenging due to name ambiguity and limited knowledge existing in the information network. Most existing entity linking methods focus on linking entities with Wikipedia or Wikipedia-derived knowledge bases (e.g., YAGO), and are largely dependent on the special features associated with Wikipedia (e.g., Wikipedia articles or Wikipedia-based relatedness measures). Since heterogeneous information networks do not have such features, these previous methods cannot be applied to our task. In this paper, we propose SHINE, which, to the best of our knowledge, is the first probabilistic model to link named entities in Web text with a heterogeneous information network. Our model consists of two components: the entity popularity model that captures the popularity of an entity, and the entity object model that captures the distribution of multi-type objects appearing in the textual context of an entity, which is generated using meta-path constrained random walks over networks. As different meta-paths express diverse semantic meanings and lead to various distributions over objects, different paths have different weights in entity linking. We propose an effective iterative approach to automatically learning the weights for each meta-path based on the expectation-maximization (EM) algorithm without requiring any training data. Experimental results on a real world data set demonstrate the effectiveness and efficiency of our proposed model in comparison with the baselines.

【Keywords】: domain-specific entity linking; entity linking; heterogeneous information networks

125. Matching heterogeneous event data.

Paper Link】 【Pages】:1211-1222

【Authors】: Xiaochen Zhu ; Shaoxu Song ; Xiang Lian ; Jianmin Wang ; Lei Zou

【Abstract】: Identifying duplicate events is essential to various business process applications such as provenance querying or process mining. Distinct features of heterogeneous events, including opaque names, dislocated traces and composite events, prevent existing data integration techniques from performing well. To address these issues, in this paper, we propose an event similarity function by iteratively evaluating similar neighbors. We prove the convergence of iterative similarity computation, and propose several pruning and estimation methods. To efficiently support matching composite events, we devise upper bounds of event similarities. Experiments on real and synthetic datasets demonstrate that the proposed event matching approaches can achieve significantly higher accuracy than the state-of-the-art matching methods.

【Keywords】: event matching; schema matching

Industry session 4: big data systems 3

126. HAWQ: a massively parallel processing SQL engine in hadoop.

Paper Link】 【Pages】:1223-1234

【Authors】: Lei Chang ; Zhanwei Wang ; Tao Ma ; Lirong Jian ; Lili Ma ; Alon Goldshuv ; Luke Lonergan ; Jeffrey Cohen ; Caleb Welton ; Gavin Sherry ; Milind Bhandarkar

【Abstract】: HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits the merits of both. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on the UDP protocol, transaction management, fault tolerance, read-optimized storage, the extensible framework for supporting various popular Hadoop-based data stores and formats, and various optimization choices we considered to enhance the query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported to be 35x-45x faster than the original Hive.

【Keywords】: Hadoop; database; query processing

127. Major technical advancements in apache hive.

Paper Link】 【Pages】:1235-1246

【Authors】: Yin Huai ; Ashutosh Chauhan ; Alan Gates ; Günther Hagleitner ; Eric N. Hanson ; Owen O'Malley ; Jitendra Pandey ; Yuan Yuan ; Rubao Lee ; Xiaodong Zhang

【Abstract】: Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Closely working with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. In order to make Hive continuously satisfy the requests and requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts on advancing Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements on storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

【Keywords】: data warehouse; databases; hadoop; hive; mapreduce

128. JSON data management: supporting schema-less development in RDBMS.

Paper Link】 【Pages】:1247-1258

【Authors】: Zhen Hua Liu ; Beda Christoph Hammerschmidt ; Doug McMahon

【Abstract】: Relational Database Management Systems (RDBMS) have been very successful at managing structured data with well-defined schemas. Despite this, relational systems are generally not the first choice for management of data where schemas are not pre-defined or must be flexible in the face of variations and changes. Instead, No-SQL database systems supporting JSON are often selected to provide persistence to such applications. JSON is a light-weight and flexible semi-structured data format supporting constructs common in most programming languages. In this paper, we analyze the way in which requirements differ between management of relational data and management of JSON data. We present three architectural principles that facilitate a schema-less development style within an RDBMS so that RDBMS users can store, query, and index JSON data without requiring schemas. We show how these three principles can be applied to industry-leading RDBMS platforms, such as the Oracle RDBMS Server, with relatively little effort. Consequently, an RDBMS can unify the management of both relational data and JSON data in one platform and use SQL with an embedded JSON path language as a single declarative language to query both relational data and JSON data. This SQL/JSON approach offers significant benefits to application developers as they can use one product to manage both relational data and semi-structured flexible schema data.

【Keywords】: JSON; SQL/JSON; SQL/XML; XML; no-SQL; schema-less

Tutorial 3 1

129. Querying encrypted data.

Paper Link】 【Pages】:1259-1261

【Authors】: Arvind Arasu ; Ken Eguro ; Raghav Kaushik ; Ravishankar Ramamurthy

【Abstract】: Data security is a serious concern when we migrate data to a cloud DBMS. Database encryption, where sensitive columns are encrypted before they are stored in the cloud, has been proposed as a mechanism to address such data security concerns. The intuitive expectation is that an adversary cannot "learn" anything about the encrypted columns, since she does not have access to the encryption key. However, query processing becomes a challenge since it needs to "look inside" the data. This tutorial explores the space of designs studied in prior work on processing queries over encrypted data. We cover both approaches based on the classic client-server architecture and approaches that use a trusted hardware module where data can be securely decrypted. We discuss the privacy challenges that arise in both approaches and how they may be addressed. In brief, supporting the full complexity of a modern DBMS, including complex queries, transactions and stored procedures, leads to significant challenges, which we survey, and open problems, which we highlight.

【Keywords】: data encryption; query processing

Research session 22: query processing and optimization 2 4

130. Towards unified ad-hoc data processing.

Paper Link】 【Pages】:1263-1274

【Authors】: Xiaogang Shi ; Bin Cui ; Gillian Dobbie ; Beng Chin Ooi

【Abstract】: It is important to provide efficient execution for ad-hoc data processing programs. In contrast to constructing complex declarative queries, many users prefer to write their programs using procedural code with simple queries. As many users are not expert programmers, their programs usually exhibit poor performance in practice, and it is a challenge to automatically optimize these programs and execute them efficiently. In this paper, we present UniAD, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We propose a novel intermediate representation named UniQL which utilizes HOQs to describe the operations performed in programs. By combining both procedural and declarative logics, we can perform various optimizations across the boundary between procedural and declarative code. We describe optimizations and conduct extensive empirical studies using UniAD. The experimental results on four benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.

【Keywords】: ad-hoc data processing; program analysis; unified optimization

131. Partial results in database systems.

Paper Link】 【Pages】:1275-1286

【Authors】: Willis Lang ; Rimma V. Nehme ; Eric Robinson ; Jeffrey F. Naughton

【Abstract】: As the size and complexity of analytic data processing systems continue to grow, the effort required to mitigate faults and performance skew has also risen. However, in some environments we have encountered, users prefer to continue query execution even in the presence of failures (e.g., the unavailability of certain data sources), and receive a "partial" answer to their query. We explore ways to characterize and classify these partial results, and describe an analytical framework that allows the system to perform coarse to fine-grained analysis to determine the semantics of a partial result. We propose that if the system is equipped with such a framework, in some cases it is better to return and explain partial results than to attempt to avoid them.

【Keywords】: failures; partial results; result semantics

132. Parallel in-situ data processing with speculative loading.

Paper Link】 【Pages】:1287-1298

【Authors】: Yu Cheng ; Florin Rusu

【Abstract】: Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data, e.g., genomics, databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload is limited though by the speed of repeated full scans, tokenizing, and parsing of the entire file. In this paper, we propose SCANRAW, a novel database physical operator for in-situ processing over raw files that integrates data loading and external tables seamlessly while preserving their advantages: optimal performance across a query workload and zero time-to-query. Our major contribution is a parallel super-scalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multi-core processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database such that subsequent queries execute faster. As a result, SCANRAW makes optimal use of the available system resources -- CPU cycles and I/O bandwidth -- by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves optimal performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.

【Keywords】: data loading; external tables; raw file processing

133. Approximation schemes for many-objective query optimization.

Paper Link】 【Pages】:1299-1310

【Authors】: Immanuel Trummer ; Christoph Koch

【Abstract】: The goal of multi-objective query optimization (MOQO) is to find query plans that realize a good compromise between conflicting objectives such as minimizing execution time and minimizing monetary fees in a Cloud scenario. A previously proposed exhaustive MOQO algorithm needs hours to optimize even simple TPC-H queries. This is why we propose several approximation schemes for MOQO that generate guaranteed near-optimal plans in seconds where exhaustive optimization takes hours. We integrated all MOQO algorithms into the Postgres optimizer and present experimental results for TPC-H queries; we extended the Postgres cost model and optimize for up to nine conflicting objectives in our experiments. The proposed algorithms are based on a formal analysis of typical cost functions that occur in the context of MOQO. We identify properties that hold for a broad range of objectives and can be exploited for the design of future MOQO algorithms.
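
One way to picture an approximation scheme for MOQO is pruning with approximate Pareto dominance: a candidate plan is discarded whenever some retained plan is within a factor (1 + eps) of it on every objective, so the kept set is a near-optimal frontier. The sketch below illustrates only this pruning idea on hypothetical two-objective cost vectors; it is not the paper's algorithms or their Postgres integration.

```python
# A sketch of approximate Pareto pruning for multi-objective plan cost
# vectors: a candidate is discarded if some kept plan is within a factor
# (1 + eps) of it on *every* objective (i.e., approximately dominates it).
# This illustrates the near-optimal-frontier idea behind approximation
# schemes for MOQO, not the paper's concrete algorithms.

def approx_dominates(a, b, eps):
    """True if cost vector a (1+eps)-dominates cost vector b (lower is better)."""
    return all(ai <= (1.0 + eps) * bi for ai, bi in zip(a, b))

def prune(plans, eps=0.1):
    """plans: list of (name, cost_vector). Keep an approximate Pareto frontier."""
    frontier = []
    for name, cost in sorted(plans, key=lambda p: sum(p[1])):
        if not any(approx_dominates(kept_cost, cost, eps)
                   for _, kept_cost in frontier):
            frontier.append((name, cost))
    return frontier

if __name__ == "__main__":
    # (execution time in s, monetary cost in cents) for hypothetical plans.
    candidates = [("hash-join", (12.0, 8.0)), ("sort-merge", (12.5, 8.2)),
                  ("index-nl", (30.0, 2.0)), ("broadcast", (11.0, 40.0))]
    for name, cost in prune(candidates, eps=0.1):
        print(name, cost)
```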

【Keywords】: multi-objective optimization; query optimization

Research session 23: processing dynamic graphs 4

134. Querying k-truss community in large and dynamic graphs.

Paper Link】 【Pages】:1311-1322

【Authors】: Xin Huang ; Hong Cheng ; Lu Qin ; Wentao Tian ; Jeffrey Xu Yu

【Abstract】: Community detection, which discovers densely connected structures in a network, has been studied extensively. In this paper, we study online community search, which is practically useful but less studied in the literature. Given a query vertex in a graph, the problem is to find meaningful communities that the vertex belongs to in an online manner. We propose a novel community model based on the k-truss concept, which brings nice structural and computational properties. We design a compact and elegant index structure which supports the efficient search of k-truss communities with a linear cost with respect to the community size. In addition, we investigate the k-truss community search problem in a dynamic graph setting with frequent insertions and deletions of graph vertices and edges. Extensive experiments on large real-world networks demonstrate the effectiveness and efficiency of our community model and search algorithms.
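
For readers unfamiliar with the k-truss concept the community model builds on, the sketch below peels an undirected graph down to its k-truss by repeatedly deleting edges contained in fewer than k - 2 triangles. The paper's contributions (the community definition, the index, and dynamic maintenance) sit on top of this basic definition and are not shown; the example graph is made up.

```python
# A basic k-truss peeling sketch: repeatedly delete edges contained in fewer
# than (k - 2) triangles until every remaining edge meets the threshold.
# This shows only the underlying k-truss definition, not the paper's
# community model, index structure, or dynamic maintenance.

def k_truss(edges, k):
    """edges: iterable of undirected (u, v) pairs; returns the surviving edge set."""
    adj = {}
    edge_set = set()
    for u, v in edges:
        if u == v:
            continue
        e = (min(u, v), max(u, v))
        if e in edge_set:
            continue
        edge_set.add(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        for u, v in list(edge_set):
            support = len(adj[u] & adj[v])      # common neighbours = triangles
            if support < k - 2:
                edge_set.discard((u, v))
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return edge_set

if __name__ == "__main__":
    # A 4-clique {0,1,2,3} with a pendant edge (3,4): the 4-truss is the clique.
    g = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(sorted(k_truss(g, k=4)))
```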

【Keywords】: community search; dynamic graph; k-truss

135. Reachability queries on large dynamic graphs: a total order approach.

Paper Link】 【Pages】:1323-1334

【Authors】: Andy Diwen Zhu ; Wenqing Lin ; Sibo Wang ; Xiaokui Xiao

【Abstract】: Reachability queries are a fundamental type of queries on graphs that find important applications in numerous domains. Although a plethora of techniques have been proposed for reachability queries, most of them require that the input graph is static, i.e., they are inapplicable to the {\em dynamic} graphs (e.g., social networks and the Semantic Web) commonly encountered in practice. There exist a few techniques that can handle dynamic graphs, but none of them can scale to sizable graphs without significant loss of efficiency. To address this deficiency, this paper presents a novel study on reachability indices for large dynamic graphs. We first introduce a general indexing framework that summarizes a family of reachability indices with the best performance among the existing techniques for static graphs. Then, we propose general and efficient algorithms for handling vertex insertions and deletions under the proposed framework. In addition, we show that our update algorithms can be used to improve the existing reachability techniques on static graphs, and we also propose a new approach for constructing a reachability index from scratch under our framework. We experimentally evaluate our solution on a large set of benchmark datasets, and we demonstrate that our solution not only supports efficient updates on dynamic graphs, but also provides even better query performance than the state-of-the-art techniques for static graphs.

【Keywords】: graphs; reachability; updates

136. EAGr: supporting continuous ego-centric aggregate queries over large dynamic graphs.

Paper Link】 【Pages】:1335-1346

【Authors】: Jayanta Mondal ; Amol Deshpande

【Abstract】: In this paper, we present EAGr, a system for supporting large numbers of continuous neighborhood-based ("ego-centric") aggregate queries over large, highly dynamic, rapidly evolving graphs. Examples of such queries include computation of personalized, tailored trends in social networks, anomaly or event detection in communication or financial transaction networks, and local search and alerts in spatio-temporal networks, to name a few. Key challenges in supporting such continuous queries include the very high update rates typically seen in these situations, the large numbers of queries that need to be executed simultaneously, and stringent low-latency requirements. In this paper, we propose a flexible, general, extensible in-memory framework for executing different types of ego-centric aggregate queries over large dynamic graphs with low latencies. Our framework is built around the notion of an aggregation overlay graph, a pre-compiled data structure that encodes the computations to be performed when an update or a query is received. The overlay graph enables sharing of partial aggregates across different ego-centric queries (corresponding to different nodes in the graph), and also allows partial pre-computation of the aggregates to minimize the query latencies. We present several highly scalable techniques for constructing an overlay graph given an aggregation function, and design incremental algorithms for handling changes to the structure of the underlying graph itself, which may result in significant changes to the neighborhoods on which queries are posed. We also present an optimal, polynomial-time algorithm for making the pre-computation decisions given an overlay graph. Although our approach is naturally parallelizable, we focus on a single-machine deployment in this paper and show that our techniques can easily handle graphs of size up to 320 million nodes and edges, achieving update and query throughputs of over 500,000/s using a single, powerful machine.

【Keywords】: aggregates; continuous queries; data streams; ego-centric analysis; graph compression; graph databases; social networks

137. Localizing anomalous changes in time-evolving graphs.

Paper Link】 【Pages】:1347-1358

【Authors】: Kumar Sricharan ; Kamalika Das

【Abstract】: Given a time-evolving sequence of undirected, weighted graphs, we address the problem of localizing anomalous changes in graph structure over time. In this paper, we use the term 'localization' to refer to the problem of identifying abnormal changes in node relationships (edges) that cause anomalous changes in graph structure. While there already exist several methods that can detect whether a graph transition is anomalous or not, these methods are not well suited for localizing the edges which are responsible for a transition being marked as an anomaly. This is a limitation in applications such as insider threat detection, where identifying the anomalous graph transitions is not sufficient, but rather, identifying the anomalous node relationships and associated nodes is key. To this end, we propose a novel, fast method based on commute time distance called CAD (Commute-time based Anomaly detection in Dynamic graphs) that detects node relationships responsible for abnormal changes in graph structure. In particular, CAD localizes anomalous edges by tracking a measure that combines information regarding changes in graph structure (in terms of commute time distance) as well as changes in edge weights. For large, sparse graphs, CAD returns a list of these anomalous edges and associated nodes in O(n log n) time per graph instance in the sequence, where n is the number of nodes. We analyze the performance of CAD on several synthetic and real-world data sets such as the Enron email network, the DBLP co-authorship network and worldwide precipitation network data. Based on experiments conducted, we conclude that CAD consistently and efficiently identifies anomalous changes in relationships between nodes over time.
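
The commute-time ingredient of CAD can be illustrated on a small graph: with the Laplacian pseudoinverse L+, the commute time between nodes i and j is vol(G) * (L+[i,i] + L+[j,j] - 2 L+[i,j]). The sketch below computes this dense quantity directly with numpy (an assumed dependency), which is only feasible for tiny graphs; CAD's contribution is tracking changes in such distances, together with edge-weight changes, efficiently, which is not reproduced here.

```python
# A small-graph sketch of commute-time distance via the Laplacian pseudo-
# inverse: C(i, j) = vol(G) * (L+[i,i] + L+[j,j] - 2 L+[i,j]). CAD tracks
# changes in this quantity (together with edge-weight changes) to localize
# anomalous edges; it uses fast approximations rather than the dense
# pseudoinverse shown here. Assumes numpy is available.

import numpy as np

def commute_times(weights):
    """weights: symmetric (n x n) array of non-negative edge weights."""
    W = np.asarray(weights, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W                 # graph Laplacian
    L_pinv = np.linalg.pinv(L)               # Moore-Penrose pseudoinverse
    volume = degrees.sum()
    d = np.diag(L_pinv)
    return volume * (d[:, None] + d[None, :] - 2.0 * L_pinv)

if __name__ == "__main__":
    # Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2, 3).
    n = 6
    W = np.zeros((n, n))
    for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        W[u, v] = W[v, u] = 1.0
    C = commute_times(W)
    print("within a triangle  C(0,1) =", round(C[0, 1], 2))
    print("across the bridge  C(0,5) =", round(C[0, 5], 2))
```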

【Keywords】: anomaly detection; anomaly localization; commute time distance; dynamic graph analysis; random walks; temporal graphs

Research session 24: cloud computing 4

138. Online optimization and fair costing for dynamic data sharing in a cloud data market.

Paper Link】 【Pages】:1359-1370

【Authors】: Ziyang Liu ; Hakan Hacigümüs

【Abstract】: The data markets are emerging to address many organizations' need to find more useful external data sets for deeper insights. Existing data markets have two main limitations. First, they either sell a whole data set or some fixed views, but do not allow arbitrary ad-hoc queries. Second, they only sell static data, but not data that are frequently updated. While there exist proposed solutions for selling ad-hoc queries, it is an open question what mechanism should be used to sell ad-hoc queries on dynamic data. In this paper, we study a data market framework that enables the sale or sharing of dynamic data, where each sharing is specified by an ad-hoc query. To keep the shared data up-to-date, the service provider should create a view of the shared data and maintain the view for the data buyer. We provide solutions to two important problems in this framework: how to efficiently maintain the views, and how to fairly determine the cost each sharing incurs for its view to be created and maintained by the service provider. Both problems are challenging since in the first problem, different sharings with the same operations in their plans may reuse these operations, and each sharing plan must be generated online. In the second problem, there are several factors that impact the fairness of a costing function and a straightforward mechanism won't work. We propose an intuitive online algorithm for sharing plan selection, as well as a set of fair costing criteria and an algorithm that maximizes the fairness.

【Keywords】: costing; data market; fairness; online; sharing

139. A comparison of platforms for implementing and running very large scale machine learning algorithms.

Paper Link】 【Pages】:1371-1382

【Authors】: Zhuhua Cai ; Zekai J. Gao ; Shangyu Luo ; Luis Leopoldo Perez ; Zografoula Vagena ; Christopher M. Jermaine

【Abstract】: We describe an extensive benchmark of platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set, but cannot find an existing implementation and thus must "roll her own" ML code. We have carefully chosen a set of five ML implementation tasks that involve learning relatively complex, hierarchical models. We completed those tasks on four different computational platforms, and using 70,000 hours of Amazon EC2 compute time, we carefully compared running times, tuning requirements, and ease-of-programming of each.

【Keywords】: distributed machine learning

140. Re-evaluating designs for multi-tenant OLTP workloads on SSD-based I/O subsystems.

Paper Link】 【Pages】:1383-1394

【Authors】: Ning Zhang ; Jun'ichi Tatemura ; Jignesh M. Patel ; Hakan Hacigümüs

【Abstract】: Multi-tenancy is a common practice that is employed to maximize server resources and reduce the total cloud operation costs. The focus of this work is on multi-tenancy for OLTP workloads. Several designs for OLTP multi-tenancy have been proposed that vary the trade-offs made between performance and isolation. However, existing studies have not considered the impact of OLTP multi-tenancy designs when using an SSD-based I/O subsystem. In this paper, we compare three designs using both open-source and proprietary DBMSs on SSD-based I/O subsystems. Our study reveals that in contrast to the case of an HDD-based I/O subsystem, VM-based designs have fairly competitive performance compared to the non-virtualized designs (generally within 1.3-2X of the best performing case) on SSD-based I/O subsystems. Whereas previous studies were based on traditional hard disk-based environments, our results indicate that switching to a pure SSD-based I/O subsystem requires rethinking the trade-offs for multi-tenant OLTP workloads.

【Keywords】: database virtualization; flash-optimized; multi-tenancy; nvram storage

141. Secure query processing with data interoperability in a cloud database environment.

Paper Link】 【Pages】:1395-1406

【Authors】: Wai Kit Wong ; Ben Kao ; David Wai-Lok Cheung ; Rongbin Li ; Siu-Ming Yiu

【Abstract】: We address security issues in a cloud database system which employs the DBaaS model. In such a model, a data owner (DO) exports its data to a cloud database service provider (SP). To provide data security, sensitive data is encrypted by the DO before it is uploaded to the SP. Existing encryption schemes, however, are only partially homomorphic in the sense that each of them was designed to allow one specific type of computation to be done on encrypted data. These existing schemes cannot be integrated to answer real practical queries that involve operations of different kinds. We propose and analyze a secure query processing system (SDB) on relational tables and a set of elementary operators on encrypted data that allow data interoperability, which allows a wide range of SQL queries to be processed by the SP on encrypted information. We prove that our encryption scheme is secure against two types of threats and that it is practically efficient.

【Keywords】: query on encrypted data; secure database

Industrial panel 1

142. Are we experiencing a big data bubble?

Paper Link】 【Pages】:1407-1408

【Authors】: Fatma Özcan ; Nesime Tatbul ; Daniel J. Abadi ; Marcel Kornacker ; C. Mohan ; Karthik Ramasamy ; Janet L. Wiener

【Abstract】:

【Keywords】: NewSQL; NoSQL; SQL-on-HADOOP

Tutorial 4 1

143. Mining latent entity structures from massive unstructured and interconnected data.

Paper Link】 【Pages】:1409-1410

【Authors】: Jiawei Han ; Chi Wang

【Abstract】: The "big data" era is characterized by an explosion of information in the form of digital data collections, ranging from scientific knowledge, to social media, news, and everyone's daily life. Examples of such collections include scientific publications, enterprise logs, news articles, social media and general Web pages. Valuable knowledge about multi-typed entities is often hidden in the unstructured or loosely structured but interconnected data. Mining latent structured information around entities uncovers sematic structures from massive unstructured data and hence enables many high-impact applications. In this tutorial, we summarize the closely related literature in database systems, data mining, Web, information extraction, information retrieval, and natural language processing, overview a spectrum of data-driven methods that extract and infer such latent structures, from an interdisciplinary point of view, and demonstrate how these structures support entity discovery and management, data understanding, and some new database applications. We present three categories of studies: mining conceptual, topical and relational structures. Moreover, we present case studies on real datasets, including research papers, news articles and social networks, and show how interesting and organized knowledge can be discovered by mining latent entity structures from these datasets.

【Keywords】: entity knowledge engineering; latent structure

Research session 25: security and privacy 4

144. Explainable security for relational databases.

Paper Link】 【Pages】:1411-1422

【Authors】: Gabriel Bender ; Lucja Kot ; Johannes Gehrke

【Abstract】: Companies and organizations collect and use vast troves of sensitive user data whose release must be carefully controlled. In practice, the access policies that govern this data are often fine-grained, complex, poorly documented, and difficult to reason about. As a result, principals frequently request and are granted access to data they never use. To encourage developers and administrators to use security mechanisms more effectively, we propose a novel security model in which all security decisions are formally explainable. Whether a query is accepted or denied, the system returns a concise yet formal explanation which can allow the issuer to reformulate a rejected query or adjust his/her security credentials. Our approach has a strong formal foundation based on previously unexplored connections between disclosure lattices and policy algebras. We build on this foundation and implement a disclosure control system that handles a wide variety of real SQL queries and can accommodate complex policy constraints.

【Keywords】: access control; database security; view rewriting

145. PrivBayes: private data release via bayesian networks.

Paper Link】 【Pages】:1423-1434

【Authors】: Jun Zhang ; Graham Cormode ; Cecilia M. Procopiuc ; Divesh Srivastava ; Xiaokui Xiao

【Abstract】: Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art goal for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy, and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data, and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
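
Two of the stages described above are easy to sketch in isolation: injecting Laplace noise into a low-dimensional marginal and sampling synthetic tuples from the noisy distribution. The Python below does exactly that for a single two-way marginal over made-up data; the differentially private construction of the Bayesian network and the allocation of the privacy budget across many marginals, which are the hard parts of PrivBayes, are omitted.

```python
# A sketch of the final PrivBayes stages on a *single* 2-way marginal: perturb
# every cell of the contingency table with Laplace noise, clamp and normalize,
# then sample synthetic tuples from the noisy distribution. PrivBayes itself
# first learns a Bayesian network privately and splits the privacy budget
# across the network's low-dimensional marginals; that part is not shown.

import random
from collections import Counter
from itertools import product

def noisy_marginal(records, domains, epsilon, seed=0):
    """Perturb the full contingency table over the given attribute domains."""
    rng = random.Random(seed)
    attrs = tuple(domains)
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    # Laplace(scale) noise generated as the difference of two exponentials.
    lap = lambda scale: rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    noisy = {cell: max(0.0, counts[cell] + lap(1.0 / epsilon))
             for cell in product(*domains.values())}
    total = sum(noisy.values()) or 1.0
    return attrs, {cell: v / total for cell, v in noisy.items()}

def sample_synthetic(attrs, marginal, n, seed=1):
    rng = random.Random(seed)
    cells, probs = zip(*marginal.items())
    return [dict(zip(attrs, rng.choices(cells, weights=probs)[0])) for _ in range(n)]

if __name__ == "__main__":
    data = ([{"age_group": "18-30", "income": "low"}] * 40
            + [{"age_group": "18-30", "income": "high"}] * 10
            + [{"age_group": "31-60", "income": "low"}] * 20
            + [{"age_group": "31-60", "income": "high"}] * 30)
    doms = {"age_group": ["18-30", "31-60"], "income": ["low", "high"]}
    attrs, marg = noisy_marginal(data, doms, epsilon=0.5)
    print(sample_synthetic(attrs, marg, n=5))
```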

【Keywords】: bayesian network; differential privacy; synthetic data generation

146. PriView: practical differentially private release of marginal contingency tables.

Paper Link】 【Pages】:1435-1446

【Authors】: Wahbeh H. Qardaji ; Weining Yang ; Ninghui Li

【Abstract】: We consider the problem of publishing a differentially private synopsis of a d-dimensional dataset so that one can reconstruct any k-way marginal contingency tables from the synopsis. Marginal tables are the workhorses of categorical data analysis. Thus, the private release of such tables has attracted a lot of attention from the research community. However, for situations where d is moderate to large and k is beyond 3, no accurate and practical method exists. We introduce PriView, which computes marginal tables for a number of strategically chosen sets of attributes that we call views, and then uses these view marginal tables to reconstruct any desired k-way marginal. In PriView, we apply maximum entropy optimization to reconstruct k-way marginals from views. We also develop a novel method for efficiently making all view marginals consistent while correcting negative entries to improve accuracy. For view selection, we borrow the concept of covering design from combinatorics theory. We conduct extensive experiments on real and synthetic datasets, and show that PriView reduces the error over existing approaches by 2 to 3 orders of magnitude.

【Keywords】: contingency table; differential privacy; marginal table

147. Blowfish privacy: tuning privacy-utility trade-offs using policies.

Paper Link】 【Pages】:1447-1458

【Authors】: Xi He ; Ashwin Machanavajjhala ; Bolin Ding

【Abstract】: Privacy definitions provide ways of trading off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. In particular, we allow data publishers to extend differential privacy using a policy, which specifies (a) secrets, or information that must be kept secret, and (b) constraints that may be known about the data. While the secret specification allows increased utility by lessening protection for certain individual properties, the constraint specification provides added protection against an adversary who knows correlations in the data (arising from constraints). We formalize policies and present novel algorithms that can handle general specifications of sensitive information and certain count constraints. We show that there are reasonable policies under which our privacy mechanisms for k-means clustering, histograms and range queries introduce significantly less noise than their differentially private counterparts. We quantify the privacy-utility trade-offs for various policies analytically and empirically on real datasets.

【Keywords】: blowfish privacy; differential privacy; privacy

Research session 26: join processing 4

148. Overlap interval partition join.

Paper Link】 【Pages】:1459-1470

【Authors】: Anton Dignös ; Michael H. Böhlen ; Johann Gamper

【Abstract】: Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJoin outperforms state-of-the-art techniques for the overlap join.
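
For reference, the sketch below computes the overlap-join semantics directly with a sort-and-sweep over interval start points: two tuples join exactly when their valid-time intervals intersect. It reproduces neither OIP's granule-based partitions nor the analytical choice of k; the relations are invented and intervals are treated as closed.

```python
# A plain sort-and-sweep overlap join over valid-time intervals [start, end]:
# it produces exactly the pairs with overlapping intervals, i.e. the result
# that OIP/OIPJoin is designed to compute faster via granule-based
# partitioning. The partitioning scheme itself is not shown.

def overlap_join(r, s):
    """r, s: lists of (tuple_id, start, end) with start <= end (inclusive)."""
    events = ([(start, 0, end, tid) for tid, start, end in r]
              + [(start, 1, end, tid) for tid, start, end in s])
    events.sort()                             # sweep by interval start
    active = [[], []]                         # currently open intervals per input
    result = []
    for start, side, end, tid in events:
        # Drop intervals that ended before this start (they cannot overlap).
        for lst in active:
            lst[:] = [(e, t) for e, t in lst if e >= start]
        # Every still-open interval on the *other* side overlaps [start, end].
        for other_end, other_tid in active[1 - side]:
            pair = (tid, other_tid) if side == 0 else (other_tid, tid)
            result.append(pair)
        active[side].append((end, tid))
    return result

if __name__ == "__main__":
    R = [("r1", 1, 5), ("r2", 10, 20), ("r3", 3, 30)]   # r3 is a long-lived tuple
    S = [("s1", 4, 6), ("s2", 25, 26)]
    print(sorted(overlap_join(R, S)))
```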

【Keywords】: interval partitioning; overlap join; temporal databases

149. Similarity joins for uncertain strings.

Paper Link】 【Pages】:1471-1482

【Authors】: Manish Patil ; Rahul Shah

【Abstract】: A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, many applications have to deal with imprecise strings or strings with fuzzy information in them. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using the edit distance as the measure of similarity. Given two collections of uncertain strings R, S, and input (k,τ), our task is to find string pairs (R,S) between collections such that Pr(ed(R,S) ≤ k) > τ, i.e., the probability of the edit distance between R and S being at most k is more than probability threshold τ. We can address the join problem by obtaining all strings in S that are similar to each string R in R. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R,S) ≤ k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we alleviate the verification cost of a string pair that survives pruning by using a trie structure which allows us to overlap the verification cost of exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach by thorough practical experimentation.
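
The join predicate Pr(ed(R,S) ≤ k) > τ can be pinned down with a Monte Carlo reference implementation under a simple, assumed uncertainty model where each string position is an independent distribution over characters. The point of the paper is precisely to avoid instantiating possible worlds via filtering bounds, so the sketch below is only a semantic baseline with illustrative data.

```python
# A Monte Carlo reference point for the predicate Pr(ed(R, S) <= k) > tau
# under a simple character-level uncertainty model: each string position is a
# distribution over characters and positions are independent. The paper's
# filtering bounds exist precisely to avoid enumerating or sampling possible
# worlds; this sketch only pins down the possible-world semantics.

import random

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n]

def sample_world(uncertain, rng):
    """uncertain: list of {char: probability} dicts, one per position."""
    return "".join(rng.choices(list(d), weights=d.values())[0] for d in uncertain)

def prob_edit_le_k(R, S, k, samples=5000, seed=0):
    rng = random.Random(seed)
    hits = sum(edit_distance(sample_world(R, rng), sample_world(S, rng)) <= k
               for _ in range(samples))
    return hits / samples

if __name__ == "__main__":
    R = [{"c": 1.0}, {"a": 0.8, "o": 0.2}, {"t": 1.0}]          # "cat" or "cot"
    S = [{"c": 1.0}, {"o": 0.6, "a": 0.4}, {"t": 0.7, "p": 0.3}]
    print(round(prob_edit_le_k(R, S, k=1), 3))
```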

【Keywords】: edit distance; string joins; uncertain strings

150. Track join: distributed joins with minimal network traffic.

Paper Link】 【Pages】:1483-1494

【Authors】: Orestis Polychroniou ; Rajkumar Sen ; Kenneth A. Ross

【Abstract】: Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs, but could avoid redundant transfers of tuples across the network. We introduce track join, a novel distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Our evaluation based on real and synthetic data shows that track join adapts to diverse cases and degrees of locality. Considering both network traffic and execution time, even with no locality, track join outperforms hash join on the most expensive queries of real workloads.
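
A toy version of per-key transfer scheduling: for each distinct join key, ship all of its tuples to the node that already holds the most of them, and compare the traffic against a hash-partitioned join that routes every tuple by a hash of the key. The sketch below uses an invented three-node layout and counts tuples rather than bytes; the actual track join protocol, with its tracking phase, migration variants, and cost model, is considerably richer.

```python
# A toy per-key transfer schedule in the spirit of track join: send each key's
# tuples to the node that already holds most of them, and compare the network
# traffic against hash partitioning, which ignores locality. The layout below
# is made up; real track join also weighs CPU cost and several join variants.

def network_cost(placement, destination):
    """placement: key -> {node: tuple_count}; destination: key -> chosen node."""
    return sum(cnt for key, nodes in placement.items()
               for node, cnt in nodes.items() if node != destination[key])

def track_schedule(placement):
    """Send each key's tuples to the node that already holds most of them."""
    return {key: max(nodes, key=nodes.get) for key, nodes in placement.items()}

def hash_schedule(placement, num_nodes):
    """Hash partitioning: the destination depends only on the key, not locality."""
    return {key: sum(ord(c) for c in key) % num_nodes for key in placement}

if __name__ == "__main__":
    # key -> per-node tuple counts over both join inputs on a 3-node cluster.
    layout = {
        "k1": {0: 5, 1: 5, 2: 90},     # strong locality on node 2
        "k2": {0: 60, 1: 2, 2: 1},     # strong locality on node 0
        "k3": {0: 10, 1: 10, 2: 10},   # no locality
    }
    print("track-style traffic:", network_cost(layout, track_schedule(layout)))
    print("hash-style  traffic:", network_cost(layout, hash_schedule(layout, 3)))
```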

【Keywords】: join; network optimization

151. On-the-fly token similarity joins in relational databases.

Paper Link】 【Pages】:1495-1506

【Authors】: Nikolaus Augsten ; Armando Miraglia ; Thomas Neumann ; Alfons Kemper

【Abstract】: Token similarity joins represent data items as sets of tokens, for example, strings are represented as sets of q-grams (substrings of length q). Two items are considered similar and match if their token sets have a large overlap. Previous work on similarity joins in databases mainly focuses on expressing the overlap computation with relational operators. The tokens are assumed to preexist in the database, and the token generation cannot be expressed as part of the query. Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. We define tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases. This allows us to (1) optimize the token generation as part of the query plan, (2) provide the query optimizer with cardinality estimates for tokens, (3) choose efficient algorithms based on the query context. We discuss algebraic properties, cardinality estimates, and an efficient iterator algorithm for tokenize. We implemented our operator in the kernel of PostgreSQL and empirically evaluated its performance for similarity joins.
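
To make the tokenization step concrete, the sketch below generates padded q-grams on the fly and runs an overlap-threshold similarity join with a small inverted index, in plain Python rather than as the in-kernel tokenize operator the paper proposes. The strings, q, and the overlap threshold are illustrative choices.

```python
# On-the-fly q-gram tokenization plus an overlap-threshold similarity join,
# written as ordinary Python rather than as the proposed in-kernel `tokenize`
# relational operator. An inverted index over the tokens of one input avoids
# comparing every pair of strings.

from collections import defaultdict

def qgrams(s, q=3):
    """Tokenize a string into its set of q-grams, padded at both ends."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def similarity_join(left, right, q=3, min_overlap=4):
    """Return (l, r) pairs whose q-gram sets share at least min_overlap tokens."""
    index = defaultdict(set)                        # token -> ids of right strings
    right_tokens = [qgrams(r, q) for r in right]
    for rid, toks in enumerate(right_tokens):
        for t in toks:
            index[t].add(rid)
    result = []
    for l in left:
        ltoks = qgrams(l, q)
        candidates = set()
        for t in ltoks:                             # probe the inverted index
            candidates |= index.get(t, set())
        for rid in candidates:
            if len(ltoks & right_tokens[rid]) >= min_overlap:
                result.append((l, right[rid]))
    return result

if __name__ == "__main__":
    products = ["playstation 4", "play station4", "xbox one"]
    catalog = ["playstation4", "xbox one s", "nintendo switch"]
    print(similarity_join(products, catalog))
```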

【Keywords】: q-grams; set similarity join; string similarity; tokenize operator
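
For reference, q-gram token generation itself is straightforward. The hypothetical sketch below (plain Python, not the paper's PostgreSQL operator) produces the (row id, token) pairs that a tokenize operator would feed into a set-overlap similarity join.

    # Illustrative sketch of q-gram tokenisation; padding with '#' is one common
    # convention and an assumption here, not necessarily the paper's choice.
    def qgrams(s, q):
        padded = "#" * (q - 1) + s + "#" * (q - 1)
        return [padded[i:i + q] for i in range(len(padded) - q + 1)]

    def tokenize(rows, q):
        # rows: iterable of (row_id, string); yields one (row_id, token) pair
        # per q-gram, i.e., the flattened relation a tokenize operator would emit.
        for rid, s in rows:
            for g in qgrams(s, q):
                yield rid, g

    print(list(tokenize([(1, "data")], 2)))
    # [(1, '#d'), (1, 'da'), (1, 'at'), (1, 'ta'), (1, 'a#')]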

Research session 27: social networks 2 4

152. Tracking set correlations at large scale.

Paper Link】 【Pages】:1507-1518

【Authors】: Foteini Alvanaki ; Sebastian Michel

【Abstract】: In this work, we consider the continuous computation of correlations between co-occurring tags that appear in messages published in social media streams. The sheer volume of these messages and the pace at which they are created make it necessary to parallelise the computation of correlations across the nodes of a computing cluster. The main challenge is to ensure that each node computes only a subset of the coefficients while every coefficient is computed by some node. The core task is to continuously create and maintain partitions of the tags and to forward the incoming messages based on them. We propose and evaluate several algorithms that partition the tags across the nodes while minimising the replication of tags and balancing the load on the nodes. The proposed framework is implemented in Java within the Storm stream processing platform. We evaluate the partitioning algorithms and validate the feasibility of our approach through a thorough experimental study performed on real data.

【Keywords】: distributed stream processing; partitioning; set correlations
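
A naive baseline for the partitioning problem can be written in a few lines: hash every unordered tag pair to a node and forward each message to all nodes that own at least one of its pairs. This hypothetical sketch only guarantees that every correlation coefficient is computed by exactly one node; it ignores the replication-minimisation and load-balancing objectives that the paper's algorithms optimise.

    # Naive baseline sketch, not one of the paper's partitioning algorithms.
    from itertools import combinations

    def owner(tag_a, tag_b, num_nodes):
        # Deterministically assign each unordered tag pair to one node.
        # (A real deployment would use a stable hash, e.g. hashlib, since
        # Python's built-in hash() is randomised per process.)
        return hash(frozenset((tag_a, tag_b))) % num_nodes

    def route_message(tags, num_nodes):
        # Nodes this message must be forwarded to so that every co-occurring
        # tag pair is counted somewhere.
        return {owner(a, b, num_nodes)
                for a, b in combinations(sorted(set(tags)), 2)}

    print(route_message(["nba", "finals", "heat"], 4))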

153. Aggregate estimation over a microblog platform.

Paper Link】 【Pages】:1519-1530

【Authors】: Saravanan Thirumuruganathan ; Nan Zhang ; Vagelis Hristidis ; Gautam Das

【Abstract】: Microblogging platforms such as Twitter have experienced a phenomenal growth in popularity in recent years, making them attractive platforms for research in diverse fields from computer science to sociology. However, most microblogging platforms impose strict access restrictions (e.g., API rate limits) that prevent scientists with limited resources - e.g., those who cannot afford the microblog-data-access subscriptions offered by providers such as GNIP - from leveraging the wealth of microblogs for analytics. For example, Twitter allows only 180 queries per 15 minutes, and its search API only returns tweets posted within the last week. In this paper, we consider the novel problem of estimating aggregate queries over microblogs, e.g., "how many users mentioned the word 'privacy' in 2013?". We propose novel solutions exploiting the user-timeline information that is publicly available in most microblogging platforms. Theoretical analysis and extensive real-world experiments over Twitter, Google+ and Tumblr confirm the effectiveness of our proposed techniques.

【Keywords】: aggregate estimation; microblogs; random walk; twitter
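
As background for the "random walk" keyword, the sketch below shows a generic degree-corrected estimator over a simple random walk on a user graph. It only illustrates the general idea of estimating aggregates from a biased crawl under rate limits; it is not the paper's timeline-based technique, and the adjacency-dict graph representation is a hypothetical choice.

    # Generic random-walk estimator sketch (assumes a connected, non-bipartite,
    # undirected graph with no isolated nodes); not the paper's estimator.
    import random

    def random_walk_estimate(graph, f, start, steps=10000, burn_in=1000):
        # graph: dict node -> list of neighbours. A simple random walk samples
        # nodes proportionally to degree, so each sample is reweighted by 1/degree.
        node, num, den = start, 0.0, 0.0
        for step in range(steps):
            node = random.choice(graph[node])
            if step >= burn_in:
                w = 1.0 / len(graph[node])
                num += w * f(node)
                den += w
        return num / den

With f(u) = 1 if user u mentioned "privacy" in 2013 and 0 otherwise, the estimator approximates the fraction of users satisfying the predicate; multiplying by a separate estimate of the user population size would give the count asked for in the example query above.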

154. Tripartite graph clustering for dynamic sentiment analysis on social media.

Paper Link】 【Pages】:1531-1542

【Authors】: Linhong Zhu ; Aram Galstyan ; James Cheng ; Kristina Lerman

【Abstract】: The growing popularity of social media (e.g., Twitter) allows users to easily share information with each other and influence others by expressing their own sentiments on various subjects. In this work, we propose an unsupervised tri-clustering framework, which analyzes both user-level and tweet-level sentiments through co-clustering of a tripartite graph. A compelling feature of the proposed framework is that the quality of sentiment clustering of tweets, users, and features can be mutually improved by joint clustering. We further investigate the evolution of user-level sentiments and latent feature vectors in an online framework and devise an efficient online algorithm to sequentially update the clustering of tweets, users and features with newly arrived data. The online framework not only provides better quality of both dynamic user-level and tweet-level sentiment analysis, but also improves the computational and storage efficiency. We verified the effectiveness and efficiency of the proposed approaches on the November 2012 California ballot Twitter data.

【Keywords】: matrix factorization; sentiment analysis; tripartite graph clustering

155. A temporal context-aware model for user behavior modeling in social media systems.

Paper Link】 【Pages】:1543-1554

【Authors】: Hongzhi Yin ; Bin Cui ; Ling Chen ; Zhiting Hu ; Zi Huang

【Abstract】: Social media provides valuable resources to analyze user behaviors and capture user preferences. This paper focuses on analyzing user behaviors in social media systems and designing a latent class statistical mixture model, named temporal context-aware mixture model (TCAM), to account for the intentions and preferences behind user behaviors. Based on the observation that the behaviors of a user in social media systems are generally influenced by intrinsic interest as well as the temporal context (e.g., the public's attention at that time), TCAM simultaneously models the topics related to users' intrinsic interests and the topics related to temporal context and then combines the influences from the two factors to model user behaviors in a unified way. To further improve the performance of TCAM, an item-weighting scheme is proposed to enable TCAM to favor items that better represent topics related to user interests and topics related to temporal context, respectively. Based on TCAM, we design an efficient query processing technique to support fast online recommendation for large social media data. Extensive experiments have been conducted to evaluate the performance of TCAM on four real-world datasets crawled from different social media sites. The experimental results demonstrate the superiority of the TCAM models, compared with the state-of-the-art competitor methods, by modeling user behaviors more precisely and making more effective and efficient recommendations.

【Keywords】: probabilistic generative model; social media mining; temporal recommender system; user behavior modeling
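
The core modelling idea, combining an intrinsic-interest factor with a temporal-context factor, can be caricatured as the two-component mixture below. The mixing weight lam and the dictionary-based topic distributions are hypothetical placeholders, not TCAM's actual parameterisation or inference procedure.

    # Caricature of a two-factor mixture; not TCAM's actual model.
    def item_probability(item, user_topic_dist, time_topic_dist,
                         topic_item_dist, lam=0.5):
        # user_topic_dist / time_topic_dist: dicts topic -> probability.
        # topic_item_dist: dict topic -> {item: probability}.
        p_interest = sum(p * topic_item_dist.get(z, {}).get(item, 0.0)
                         for z, p in user_topic_dist.items())
        p_temporal = sum(p * topic_item_dist.get(z, {}).get(item, 0.0)
                         for z, p in time_topic_dist.items())
        return lam * p_interest + (1.0 - lam) * p_temporal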

Research session 28: big data 4

156. Indexing for interactive exploration of big data series.

Paper Link】 【Pages】:1555-1566

【Authors】: Kostas Zoumpatianos ; Stratos Idreos ; Themis Palpanas

【Abstract】: Numerous applications continuously produce large amounts of data series, and in several time-critical scenarios analysts need to be able to query these data as soon as they become available, which is not currently possible with state-of-the-art indexing methods for very large data series collections. In this paper, we present the first adaptive indexing mechanism specifically tailored to the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The net effect is that instead of waiting for extended periods of time for the index creation, users can immediately start exploring the data series. We present a detailed design and evaluation of adaptive data series indexing over both synthetic data and real-world workloads. The results show that our approach can gracefully handle large data series collections, while drastically reducing the data-to-query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), adaptive data series indexing has already answered 3×10^5 queries.

【Keywords】: adaptive data-series index; adaptive indexing; data-series; nearest neighbor; similarity search
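
The adaptive-indexing principle of refining the index only where queries land can be illustrated with one-dimensional, cracking-style range indexing. This is a deliberately simplified stand-in (a hypothetical CrackedColumn class over scalar values), not the paper's data-series index or its similarity-search machinery.

    # Simplified adaptive-indexing sketch: pieces are refined only around the
    # bounds of the range queries that actually arrive.
    import bisect

    class CrackedColumn:
        def __init__(self, values):
            self.values = list(values)
            self.cracks = []          # sorted (pivot_value, position) pairs

        def _crack(self, pivot, lo_pos, hi_pos):
            # Partition values[lo_pos:hi_pos] around 'pivot' and remember the split.
            piece = self.values[lo_pos:hi_pos]
            left = [v for v in piece if v < pivot]
            right = [v for v in piece if v >= pivot]
            self.values[lo_pos:hi_pos] = left + right
            pos = lo_pos + len(left)
            bisect.insort(self.cracks, (pivot, pos))
            return pos

        def _piece(self, value):
            # Bounds of the contiguous piece that may contain 'value'.
            i = bisect.bisect_right(self.cracks, (value, float("inf")))
            lo = self.cracks[i - 1][1] if i > 0 else 0
            hi = self.cracks[i][1] if i < len(self.cracks) else len(self.values)
            return lo, hi

        def range_query(self, low, high):
            # Only the touched pieces are reorganised; untouched data stays unindexed.
            lo_pos = self._crack(low, *self._piece(low))
            hi_pos = self._crack(high, *self._piece(high))
            return self.values[lo_pos:hi_pos]

    col = CrackedColumn([7, 1, 9, 4, 6, 3, 8, 2])
    print(sorted(col.range_query(3, 7)))   # [3, 4, 6]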

157. Histograms as a side effect of data movement for big data.

Paper Link】 【Pages】:1567-1578

【Authors】: Zsolt István ; Louis Woods ; Gustavo Alonso

【Abstract】: Histograms are a crucial part of database query planning but their computation is resource-intensive. As a consequence, generating histograms on database tables is typically performed as a batch job, separately from query processing. In this paper, we show how to calculate statistics as a side effect of data movement within a DBMS using a hardware accelerator in the data path. This accelerator analyzes tables as they are transmitted from storage to the processing unit, and provides histograms on the data retrieved for queries at virtually no extra performance cost. To evaluate our approach, we implemented this accelerator on an FPGA. This prototype calculates histograms faster and with similar or better accuracy than commercial databases. Moreover, the FPGA can provide various types of histograms such as Equi-depth, Compressed, or Max-diff on the same input data in parallel, without additional overhead.

【Keywords】: FPGA; histogram; query optimization; statistics
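
For reference, an equi-depth histogram, one of the histogram types the accelerator produces, chooses bucket boundaries so that each bucket holds roughly the same number of values. The plain software sketch below sorts the data to make the definition explicit; the point of the paper is that the FPGA derives such statistics on streaming data, at no extra cost, while the data moves from storage to the CPU.

    # Definition-level sketch of an equi-depth histogram (software, with a sort;
    # not how the streaming FPGA accelerator computes it).
    def equi_depth_histogram(values, num_buckets):
        data = sorted(values)
        n = len(data)
        buckets = []
        for b in range(num_buckets):
            lo = b * n // num_buckets
            hi = (b + 1) * n // num_buckets
            if hi > lo:
                buckets.append((data[lo], data[hi - 1], hi - lo))   # (min, max, count)
        return buckets

    print(equi_depth_histogram([1, 2, 2, 3, 5, 8, 9, 9, 10, 20], 5))
    # [(1, 2, 2), (2, 3, 2), (5, 8, 2), (9, 9, 2), (10, 20, 2)]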

158. A formal approach to finding explanations for database queries.

Paper Link】 【Pages】:1579-1590

【Authors】: Sudeepa Roy ; Dan Suciu

【Abstract】: As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools to assist users in explaining observed query outputs. In this paper we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations which can interact with each other through foreign keys. First we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then we give simple and efficient algorithms based on SQL queries that can compute the top-K explanations by using standard database management systems under certain conditions. We evaluate the quality and performance of our approach by experiments on real datasets.

【Keywords】: causality; data cube; explanations; intervention; recursion
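
The intervention idea can be made concrete with a brute-force sketch: score each candidate explanation (here just a predicate over a single table) by how much removing the tuples it selects changes the answer of an aggregate query. The table, predicates, and scoring below are hypothetical; the paper's contribution is computing such interventions efficiently, via recursive rules and SQL, across multiple relations linked by foreign keys.

    # Brute-force, single-table illustration of intervention-based explanations.
    def top_k_explanations(rows, aggregate, candidate_predicates, k):
        # rows: list of dicts; aggregate: function(list of rows) -> number;
        # candidate_predicates: dict name -> predicate(row) -> bool.
        original = aggregate(rows)
        scored = []
        for name, pred in candidate_predicates.items():
            remaining = [r for r in rows if not pred(r)]
            scored.append((abs(original - aggregate(remaining)), name))
        return sorted(scored, reverse=True)[:k]

    rows = [{"dept": "CS", "salary": 100}, {"dept": "CS", "salary": 120},
            {"dept": "EE", "salary": 90}]
    print(top_k_explanations(
        rows,
        aggregate=lambda rs: sum(r["salary"] for r in rs),
        candidate_predicates={"dept = CS": lambda r: r["dept"] == "CS",
                              "dept = EE": lambda r: r["dept"] == "EE"},
        k=1))
    # [(220, 'dept = CS')]: removing the CS tuples changes the total the most.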

159. MISO: souping up big data query processing with a multistore system.

Paper Link】 【Pages】:1591-1602

【Authors】: Jeff LeFevre ; Jagan Sankaranarayanan ; Hakan Hacigümüs ; Jun'ichi Tatemura ; Neoklis Polyzotis ; Michael J. Carey

【Abstract】: Multistore systems utilize multiple distinct data stores such as Hadoop's HDFS and an RDBMS for query processing by allowing a query to access data and computation in both stores. Current approaches to multistore query processing fail to achieve the full potential benefits of utilizing both systems due to the high cost of data movement and loading between the stores. Tuning the physical design of a multistore, i.e., deciding what data resides in which store, can reduce the amount of data movement during query processing, which is crucial for good multistore performance. In this work, we provide what we believe to be the first method to tune the physical design of a multistore system, by focusing on which store to place data in. Our method, called MISO for MultIStore Online tuning, is adaptive, lightweight, and works in an online fashion utilizing only the by-products of query processing, which we term opportunistic views. We show that MISO significantly improves the performance of ad-hoc big data query processing by leveraging the specific characteristics of the individual stores while incurring little additional overhead on the stores.

【Keywords】: analytics; big data; database; hadoop; multistore; physical design

Undergraduate abstracts 7

160. Efficient top-K SimRank-based similarity join.

Paper Link】 【Pages】:1603-1604

【Authors】: Wenbo Tao ; Guoliang Li

【Abstract】: SimRank is an effective and widely adopted measure to quantify the structural similarity between pairs of nodes in a graph. In this paper we study the problem of top-k SimRank-based similarity join, which finds the k pairs of nodes with the largest SimRank values. To the best of our knowledge, this is the first attempt to address this problem. We propose a random-walk-based method to efficiently identify the top-k pairs. Experimental results on real datasets show that our method significantly outperforms baseline approaches.

【Keywords】: algorithm; database; graph; simrank
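
For context, SimRank is defined by s(a, a) = 1 and s(a, b) = C / (|I(a)| |I(b)|) * sum over i in I(a), j in I(b) of s(i, j), where I(x) is the set of in-neighbours of x and C is a decay factor. The naive fixed-point iteration below, followed by sorting all node pairs, is the quadratic baseline that a random-walk-based top-k method aims to beat; it is not the paper's algorithm.

    # Baseline sketch: naive SimRank iteration plus a full sort over all pairs.
    def simrank(in_neighbors, C=0.8, iterations=10):
        nodes = list(in_neighbors)
        sim = {a: {b: (1.0 if a == b else 0.0) for b in nodes} for a in nodes}
        for _ in range(iterations):
            new = {a: {b: (1.0 if a == b else 0.0) for b in nodes} for a in nodes}
            for a in nodes:
                for b in nodes:
                    if a == b or not in_neighbors[a] or not in_neighbors[b]:
                        continue
                    total = sum(sim[i][j]
                                for i in in_neighbors[a]
                                for j in in_neighbors[b])
                    new[a][b] = C * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
            sim = new
        return sim

    def top_k_pairs(in_neighbors, k, C=0.8):
        sim = simrank(in_neighbors, C)
        nodes = list(in_neighbors)
        pairs = [(sim[a][b], a, b)
                 for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        return sorted(pairs, reverse=True)[:k]

    # Tiny example graph given as in-neighbour lists: v and w share in-neighbour u.
    g = {"u": [], "v": ["u"], "w": ["u"], "x": ["v", "w"]}
    print(top_k_pairs(g, k=1))   # [(0.8, 'v', 'w')]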

161. Multi-dimensional data statistics for columnar in-memory databases.

Paper Link】 【Pages】:1605-1606

【Authors】: Curtis Kroetsch

【Abstract】: The research presented here studies multi-dimensional data statistics in the context of columnar in-memory database systems. Such systems, for example SAP HANA, Server Apollo, or IBM BLU, use an order-preserving dictionary with dense encoding on the read-optimized storage, which encodes the values of a single column in an ordered, dense-domain dictionary. The dictionary maps variable-length domain values to fixed-size dictionary entries. This encoding reduces memory consumption, as dictionary values can be represented with fewer bits than the original values, and allows queries to be evaluated efficiently on the encoded data. The main characteristic of the dictionary encoding is that it results in a dense domain of values, which can be exploited for building efficient data statistics objects.

【Keywords】: histograms; multi-dimensional data statistics; sap hana
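
The order-preserving dense dictionary encoding described above is easy to sketch: sort the distinct column values and replace each value by its rank. Comparisons on the fixed-width codes then agree with comparisons on the original values, so range predicates can be evaluated directly on the encoded column. This is a minimal sketch of the general technique, not any specific vendor's implementation.

    # Minimal sketch of an order-preserving, dense dictionary encoding.
    def build_dictionary(column):
        values = sorted(set(column))                    # dictionary: rank -> value
        code_of = {v: i for i, v in enumerate(values)}  # value -> dense code
        return values, code_of

    def encode(column, code_of):
        return [code_of[v] for v in column]

    column = ["cherry", "apple", "banana", "apple"]
    dictionary, code_of = build_dictionary(column)
    print(dictionary)                 # ['apple', 'banana', 'cherry']
    print(encode(column, code_of))    # [2, 0, 1, 0]
    # Order preservation: the predicate value < "cherry" becomes
    # code < code_of["cherry"] on the encoded column.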

162. A user interaction based community detection algorithm for online social networks.

Paper Link】 【Pages】:1607-1608

【Authors】: Himel Dev

【Abstract】: When identifying communities in online social networks (OSNs), existing community detection techniques either rely on content analysis or consider only the underlying structure of the social network graph. As a result, these approaches fail to identify active communities, i.e., communities based on actual interactions rather than mere friendship. To alleviate the limitations of existing approaches, we propose a novel solution for community detection in OSNs.

【Keywords】: community detection; social networks

163. EDS: a segment-based distance measure for sub-trajectory similarity search.

Paper Link】 【Pages】:1609-1610

【Authors】: Min Xie

【Abstract】: In this paper, we study the sub-trajectory similarity search problem, which, given a query trajectory, returns those trajectories from a trajectory database that contain a sub-trajectory similar to the query. We show the insufficiency of distance measures originally designed for whole-trajectory similarity search, where each trajectory is compared with the query trajectory as a whole, and therefore introduce a new segment-based distance measure called EDS (Edit Distance on Segment) for sub-trajectory similarity search. We conducted experiments on a real data set showing the superiority of our EDS distance measure.

【Keywords】: similarity search; sub-trajectory

164. Spatio-temporal visual analysis for event-specific tweets.

Paper Link】 【Pages】:1611-1612

【Authors】: Mashaal Musleh

【Abstract】: Twitter is one of the most popular social networks, where people tweet about their opinions, feelings, and desires. One of the most important and consistent behaviors of Twitter users is posting a plethora of tweets about events of different types, e.g., Oscars celebrations, soccer games, and natural disasters. For such event-specific tweets, geotagged tweets attract the most attention because all events by nature have a spatial extent. For example, while the Boston Marathon explosions were unfolding in April 2013, users rushed to Twitter seeking tweets from the marathon location. Thus, within a large project called Taghreed for comprehensive real-time and offline analysis and visualization of Twitter data (see www.gistic.org/taghreed), we are working on a module that aims to semi-automate the task of visually analyzing event-specific tweets, for arbitrary events, over the spatial and temporal dimensions. In this poster, we present our ongoing work on this module and discuss three of its use cases. We also discuss our future plans to extend the module to larger data sizes and to make it interactive while events are ongoing.

【Keywords】: analysis; events; geo-tagged; spatio-temporal; twitter

165. PackageBuilder: querying for packages of tuples.

Paper Link】 【Pages】:1613-1614

【Authors】: Kevin Fernandes ; Matteo Brucato ; Rahul Ramakrishna ; Azza Abouzied ; Alexandra Meliou

【Abstract】: PackageBuilder is a system that extends query engines to support package generation. A package is a collection of tuples with certain global properties defined on the collection as a whole. In contrast to traditional query answers where each answer tuple needs to satisfy the query predicates, each answer package needs to satisfy global constraints on the collection of tuples: e.g., a package of recipes that collectively do not exceed 2,200 calories. PackageBuilder introduces simple extensions to the SQL language to support package-level predicates, and includes a simple interface that allows users to load datasets and interactively specify package queries. Our system allows users to interactively navigate through the result packages, and to provide feedback by fixing tuples within a package. PackageBuilder automatically processes this feedback to refine the package queries, and generate new sets of results.

【Keywords】: constraint optimization; package queries; packagebuilder
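
The package semantics described above (per-tuple predicates plus global constraints on the collection) can be illustrated with a brute-force enumeration. The recipe data and the 2,200-calorie constraint mirror the example in the abstract; the function below is a hypothetical sketch and does not reproduce PackageBuilder's SQL extensions or its query processing.

    # Brute-force illustration of package queries; not PackageBuilder's engine.
    from itertools import combinations

    def package_query(tuples, tuple_pred, global_pred, max_size=3):
        # Return all packages (subsets of up to max_size tuples) whose members
        # each satisfy tuple_pred and which collectively satisfy global_pred.
        candidates = [t for t in tuples if tuple_pred(t)]
        packages = []
        for size in range(1, max_size + 1):
            for combo in combinations(candidates, size):
                if global_pred(combo):
                    packages.append(combo)
        return packages

    recipes = [{"name": "salad", "calories": 350},
               {"name": "pasta", "calories": 900},
               {"name": "steak", "calories": 1100}]
    result = package_query(
        recipes,
        tuple_pred=lambda r: r["calories"] > 0,
        global_pred=lambda pkg: sum(r["calories"] for r in pkg) <= 2200)
    print([[r["name"] for r in pkg] for pkg in result])
    # All subsets except the full three-recipe package (2,350 calories) qualify.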

166. Privacy preserving social graphs for high precision community detection.

Paper Link】 【Pages】:1615-1616

【Authors】: Himel Dev

【Abstract】: Discovering communities from a social network requires publishing the social network's data. However, community detection from the raw data of a social network may reveal much sensitive information about the involved parties, e.g., how much a user is involved in which communities. An individual may not want to reveal such sensitive information. To resolve this issue, we address the problem of privacy-preserving community detection in social networks. More specifically, we want to ensure that community detection is possible from the published social graph/data, but the identity of users involved in a community is not disclosed.

【Keywords】: community detection; privacy; social networks